1. Deep Learning Spectral Denoiser
NMRflux.jl includes Flux-based tools for denoising NMR spectra with deep learning. These tools integrate with the SpectData structures of NMRflux and provide a flexible interface for loading datasets, building denoising models, training them, and running inference on new FIDs or spectra.
1.1 Overview
Many NMR experiments, especially at low concentration or short acquisition times, produce spectra with poor signal-to-noise ratio and various artefacts (baseline distortions, phase errors, solvent residuals, etc.). Classical processing (apodisation, Fourier transform, phase and baseline correction, filtering) can improve the data, but often requires hand tuning and may either oversmooth peaks or leave structured noise. The deep learning spectral denoiser in NMRflux.jl addresses this problem by learning a nonlinear mapping from artefact-corrupted spectra to their corresponding clean spectra. It uses a deep 1D convolutional autoencoder with residual connections, trained on pairs of clean/dirty spectra, to suppress noise and artefacts while preserving peak positions and line shapes. Once trained, the model can be applied to new spectra as an automated denoising step in larger NMR processing or analysis pipelines.
1.2 Data representation
The denoiser operates on frequency-domain spectra stored as SpectData objects and saved in .jld2 files. Each SpectData used for training or validation is a 2D complex array where odd-numbered rows contain clean spectra and the corresponding even-numbered rows contain the matching noisy/artefact-corrupted spectra. Internally, the training script normalises each clean/dirty pair and converts them into tensors following Flux's WHCN convention, which for one-dimensional spectra reduces to (W, C, N): the input to the model is the dirty spectrum with two channels (real and imaginary parts), and the target is the corresponding clean spectrum represented by a single channel (real part only). The model therefore learns to map (W, 2, N) dirty spectra to (W, 1, N) cleaned real spectra, where W is the number of frequency points and N is the batch size.
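The conversion from interleaved rows to model tensors can be sketched in plain Julia as follows. This is a minimal illustration, not the training script's actual code: the helper name pairs_to_tensors is hypothetical, it works on a raw complex matrix rather than a SpectData, and it assumes the :max norm is the largest complex magnitude of the dirty spectrum.

```julia
# Hypothetical helper (not part of NMRflux.jl): turn interleaved
# clean/dirty rows into a (W, 2, N) input and a (W, 1, N) target.
function pairs_to_tensors(dat::AbstractMatrix{<:Complex})
    nrows, W = size(dat)
    iseven(nrows) || throw(ArgumentError("expected an even number of rows"))
    N = nrows ÷ 2
    x = zeros(Float32, W, 2, N)   # dirty input: real and imaginary channels
    y = zeros(Float32, W, 1, N)   # clean target: real part only
    for n in 1:N
        clean = dat[2n - 1, :]    # odd row: clean spectrum
        dirty = dat[2n, :]        # even row: matching dirty spectrum
        # Assumption: the :max norm is the largest magnitude of the dirty spectrum
        scale = maximum(abs, dirty)
        scale = scale > 0 ? scale : one(scale)   # guard against all-zero rows
        x[:, 1, n] .= real.(dirty) ./ scale
        x[:, 2, n] .= imag.(dirty) ./ scale
        y[:, 1, n] .= real.(clean) ./ scale
    end
    return x, y
end

# One clean/dirty pair with W = 3 frequency points
dat = ComplexF32[1+0im 2+0im 0+0im; 1+1im 4+0im 0+2im]
x, y = pairs_to_tensors(dat)
println(size(x), " ", size(y))
```

Scaling both members of a pair by the same factor keeps their relative amplitudes intact, so the target stays comparable with the input after normalisation.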
Training is controlled via a simple TOML configuration file (learning rate, batch size, number of epochs, etc.), and the script automatically writes model checkpoints (.bson files) during training so that runs can be resumed or warm started later.
Dataset Format and Loaders
The training script expects a single .jld2 file that contains one training split and one or more validation splits. Internally, the file is read by the helper function
load_multi_snr_loaders(path; batchsize=50, norm_mode=:max, seed=1234, send_to_gpu=true)
Required keys in the .jld2 file
The dataset file must contain:
- A training split stored under the key "train"
- Zero or more validation splits stored under keys of the form "val_snr_10", "val_snr_20", ...
The suffix (e.g. 10, 20) can be any SNR label you like; the loader sorts these by SNR and reports validation losses per split.
Each value at these keys must be a SpectData object.
Shape and type of each SpectData
For each split (train, val_snr_XXX):
- sd is a SpectData
- sd.dat is a 2D array of type ComplexF32 with shape (2 * nbatch, W)
  - 2 * nbatch rows, W frequency points per spectrum
  - The row axis encodes clean/dirty pairs:
    - row 1: clean spectrum 1
    - row 2: dirty spectrum 1
    - row 3: clean spectrum 2
    - row 4: dirty spectrum 2
    - etc.
- length(sd.coord) == 2
  - sd.coord[1] is a row index axis (e.g. 1:2*nbatch)
  - sd.coord[2] is the frequency axis (in Hz, ppm, or arbitrary units)
The loader enforces:
- .jld2 extension only
- SpectData with complex element type ComplexF32
- 2D layout (rows x axis)
- even number of rows (clean/dirty pairing)
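The checks listed above could look roughly like the sketch below. It is only an illustration of the rules, not the loader's code: check_split is a hypothetical name, and it inspects a raw array instead of a SpectData so that the example runs without NMRflux.jl.

```julia
# Hypothetical stand-in for the loader's validation; returns the number of pairs.
function check_split(path::AbstractString, dat::AbstractArray)
    endswith(lowercase(path), ".jld2") ||
        throw(ArgumentError("expected a .jld2 file, got $path"))
    eltype(dat) == ComplexF32 ||
        throw(ArgumentError("expected ComplexF32 data, got $(eltype(dat))"))
    ndims(dat) == 2 ||
        throw(ArgumentError("expected a 2D (rows x axis) array, got $(ndims(dat))D"))
    iseven(size(dat, 1)) ||
        throw(ArgumentError("expected an even number of rows (clean/dirty pairs)"))
    return size(dat, 1) ÷ 2
end

npairs = check_split("spec_dataset.jld2", zeros(ComplexF32, 4, 8))
println("pairs: ", npairs)
```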
1.3 TOML Configuration for Training
Training is configured via a [Trainer] table in a TOML file. A typical example is:
[Trainer]
batchSize = 10
epochs = 100000
timestamps = 10
eta = 2.0e-3
etaDecay = 0.05
weightDecay = 0.0e0
beta1 = 0.9
beta2 = 0.999
fname = "Training"
breakCriterion = -0.1
The script reads this TOML file, constructs a Trainer object, and uses it to control batch size, number of epochs, logging behaviour, and the optimiser (AdamW with eta, beta1, beta2, and weightDecay). The timestamps parameter controls how often detailed metrics are logged and checkpoints are written.
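To check a configuration before launching a long run, the [Trainer] table can be parsed with Julia's TOML standard library. The sketch below only reads the values; the named tuple opt is a hypothetical container for illustration and does not reflect the fields of the real Trainer object.

```julia
using TOML

cfg_text = """
[Trainer]
batchSize = 10
epochs = 100000
timestamps = 10
eta = 2.0e-3
etaDecay = 0.05
weightDecay = 0.0e0
beta1 = 0.9
beta2 = 0.999
fname = "Training"
breakCriterion = -0.1
"""

# TOML.parse returns nested dictionaries keyed by table name
cfg = TOML.parse(cfg_text)["Trainer"]

# Hypothetical grouping of the optimiser-related settings
opt = (eta = cfg["eta"],
       betas = (cfg["beta1"], cfg["beta2"]),
       weight_decay = cfg["weightDecay"])

println("batch size = ", cfg["batchSize"], ", epochs = ", cfg["epochs"])
println("optimiser settings = ", opt)
```

For a file on disk, TOML.parsefile("Trainer.toml") does the same job.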
1.4 Running the training script & checkpoints
The training script is invoked from the command line:
julia spec_cleaner_train.jl <dataset_path.jld2> [config_path] [--restart path | --init path]
- dataset_path.jld2: Path to a .jld2 file containing a SpectData object named "train" and one or more validation splits named "val_snr_XXX"
- config_path (optional): Path to the TOML configuration. If omitted, a default file Trainer_repro.toml is used
- --restart path: Resume a previous run from a full checkpoint (.bson) that stores both the model weights and optimiser / tracking state
- --init path: Warm start from an existing model checkpoint, but reset the optimiser and training history (useful when fine tuning on a new dataset)
During training, the script writes:
- model_last.bson: The most recent model state (updated regularly)
- model_best.bson: The model with the best overall validation loss so far
- model_E-k_...bson: Additional checkpoints saved whenever the validation loss has improved by an order of magnitude relative to the previous baseline (E-1, E-2, etc.)
All training progress (loss curves, per-SNR validation metrics, and messages about checkpoints) is logged to a timestamped log file whose name starts with fname from the TOML configuration (e.g. Training-2025-11-17_10-35-22.log).
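One possible reading of the order-of-magnitude rule for the E-k checkpoints is sketched below. It is an assumption, not the script's logic: the function name decade_checkpoints is hypothetical, the baseline is taken to be the first validation loss, and each trigger lowers the baseline by a factor of ten.

```julia
# Hypothetical sketch: report (epoch, k) whenever the validation loss has
# dropped by another factor of ten relative to the current baseline.
function decade_checkpoints(losses::AbstractVector{<:Real})
    baseline = float(first(losses))   # assumption: first loss is the initial baseline
    k = 0
    events = Tuple{Int,Int}[]
    for (epoch, loss) in enumerate(losses)
        while loss <= baseline / 10
            baseline /= 10            # move the baseline down one decade
            k += 1
            push!(events, (epoch, k)) # a checkpoint tagged E-k would be written here
        end
    end
    return events
end

events = decade_checkpoints([1.0, 0.5, 0.09, 0.05, 0.008])
println(events)
```

With these example losses, the rule fires at epoch 3 (E-1) and again at epoch 5 (E-2).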
Typical example:
julia spec_cleaner_train.jl spec_dataset.jld2 Trainer.toml
To run training in the background on a server:
nohup julia spec_cleaner_train.jl spec_dataset.jld2 Trainer.toml > out.log 2>&1 &
1.5 Inference and Plotting
After training, the spec_cleaner_inference.jl script applies a saved model to new datasets and optionally produces plots. It supports two modes:
- Simulated datasets: clean + dirty rows (paired)
- Experimental datasets: dirty only rows (no ground truth)
1.5.1 Simulated datasets (paired clean/dirty)
Input format (simulated): rows are interleaved
[clean1, dirty1, clean2, dirty2, ...]
Inference and plotting:
# Run inference on a simulated dataset
julia spec_cleaner_inference.jl infer-simulated <model_checkpoint.bson> <dataset.jld2>
# Plot a single example from the result file (clean, cleaned, dirty)
julia spec_cleaner_inference.jl plot-simulated <result_file.jld2> <index> [ppm_config.toml]
# Plot all pairs to a directory
julia spec_cleaner_inference.jl plot-simulated-all <result_file.jld2> [outdir] [ppm_config.toml]
- The inference step writes a new .jld2 file where spectra are stored in triplets as [clean, cleaned, dirty] repeated for each pair
- The plotting commands overlay clean, cleaned, and dirty spectra (in ppm if a ppm_config.toml is provided, otherwise in bin/frequency index)
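Because the result file stores rows in [clean, cleaned, dirty] triplets, a quick numerical check of a run can be written in a few lines. The sketch below is an illustration under assumptions: triplet_mse is a hypothetical helper, it takes a raw complex matrix rather than a SpectData, and it compares real parts only.

```julia
# Hypothetical helper: mean squared error of the dirty and cleaned spectra
# against the clean reference, one entry per triplet.
function triplet_mse(res::AbstractMatrix{<:Complex})
    nrows = size(res, 1)
    nrows % 3 == 0 ||
        throw(ArgumentError("expected rows in [clean, cleaned, dirty] triplets"))
    return map(1:3:nrows) do i
        clean   = real.(res[i, :])
        cleaned = real.(res[i + 1, :])
        dirty   = real.(res[i + 2, :])
        (before = sum(abs2, dirty .- clean) / length(clean),
         after  = sum(abs2, cleaned .- clean) / length(clean))
    end
end

# One triplet with three frequency points
res = ComplexF32[1 0 0; 0.5 0 0; 1 1 1]
m = triplet_mse(res)
println(m[1])
```

An after value well below before indicates that the cleaned spectrum is closer to the ground truth than the dirty input was.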
1.5.2 Optional PPM configuration
The plotting functions (plot-simulated, plot-simulated-all, plot-experimental, etc.) can display spectra on a ppm axis if a small TOML configuration file is provided. This configuration defines the Larmor frequency and spectral width needed to convert FFT bin indices into ppm.
A typical ppm_config.toml file looks like:
# ppm_config_simulated.toml
[Hamiltonian]
baseFreq = 700.0 # MHz (e.g. 1H at 16.4 T)
shiftCtr = 4.76 # ppm (reference peak, e.g. water or TMS)
[FID]
SWH = 10000.0 # Hz spectral width
Fields:
- baseFreq: The transmitter frequency in MHz (e.g. 600, 700, 900 ...). This sets the overall ppm scale
- shiftCtr: Reference position in ppm. Examples:
  - 4.76 for water at high field
  - 0.00 for TMS
  - 2.01 for NAA in MRS, etc.
- SWH: Spectral width (in Hz) used to compute the frequency axis before ppm conversion
If no ppm config is supplied, plots fall back to a bin index axis, which is still useful but not physically calibrated.
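The conversion from bin index to ppm follows from the three fields: a frequency offset in Hz divided by the transmitter frequency in MHz is a shift in ppm. The sketch below shows that arithmetic only. The helper ppm_axis is hypothetical, and it assumes that the spectrum is centred on shiftCtr and that ppm increases with bin index; the plotting code may orient or centre the axis differently.

```julia
# Hypothetical helper: ppm value for each of npoints spectral bins.
function ppm_axis(npoints::Integer; baseFreq::Real, shiftCtr::Real, SWH::Real)
    # Frequency offsets in Hz, centred on zero as after an fftshift
    hz = [(k - npoints ÷ 2) * SWH / npoints for k in 0:npoints-1]
    # With baseFreq in MHz, dividing Hz by baseFreq gives ppm
    return shiftCtr .+ hz ./ baseFreq
end

ppm = ppm_axis(8; baseFreq = 700.0, shiftCtr = 4.76, SWH = 10000.0)
println(ppm)
```

With the example configuration, the full spectral width of 10000 Hz at 700 MHz corresponds to about 14.3 ppm.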
1.6 Key Features
- Configurable synthetic data generation: Users can generate arbitrarily large synthetic datasets using GenerateFIDs, with full control over noise level, SNR, phase errors, baseline distortions, solvent artefacts, and linewidths. This enables "infinite" training data tailored to the characteristics of any experimental setup.
- Learned clean-dirty mapping: The training pipeline automatically extracts paired clean/dirty spectra from .jld2 datasets and normalises each pair using a norm derived from the dirty spectrum (default: :max amplitude). This ensures that training is numerically stable and reproducible.
- Multi-SNR evaluation: The loader automatically detects validation splits named val_snr_XXX, sorts them by SNR, and reports validation losses per SNR level as well as an overall average. This provides detailed insight into model robustness across noise conditions.
- GPU acceleration and reproducibility: All tensors and model weights are moved to the GPU (gpu(...)) for fast training. The script sets a global random seed for both Julia's RNG and CUDA (GLOBAL_SEED) to ensure fully reproducible training runs.
1.7 Building train/validation sets from synthetic FIDs (convenience script)
For user convenience, NMRflux.jl provides a script make_train_val_multi_snr_from_synthetic_dir.jl that scans a directory of synthetic FID batches (from GenerateFIDs) and builds one unified training file with:
- A single mixed SNR training set
- Separate validation sets for each SNR value
It expects .jld2 files whose names look like FIDs_16384_SNR-1000_0001.jld2 and that each file contains a SpectData under the key "batch" with rows ordered as [clean1, dirty1, clean2, dirty2, ...].
Given such a directory, the script:
- Loads all FID batches from input_dir
- Zero fills, apodizes, and Fourier transforms them to Spec64k
- Crops each spectrum into 4k tiles with 50% overlap (preserving clean/dirty pairing)
- Groups all crops by SNR (parsed from the filenames)
- For each SNR:
  - Concatenates all crops for that SNR
  - Splits by clean/dirty pair into train/val (default 80% / 20%, no shuffle)
- Builds one combined dataset with:
  - "train" :: SpectData{ComplexF32,2}: all SNRs mixed together
  - "val_snr_XXX" :: SpectData{ComplexF32,2}: one validation set per SNR
  - "meta" :: Dict{String,Any}: parameters, counts, per SNR information
This format matches the expectations of the deep learning loader load_multi_snr_loaders and is meant to be the main entry point for training.
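The cropping and splitting arithmetic described above can be illustrated with a short sketch. The helper names crop_starts, split_pairs, and pair_rows are hypothetical, and rounding the training fraction down is an assumption; the script itself may handle the remainder differently.

```julia
# Hypothetical helpers illustrating the crop and split bookkeeping.

# Start index of every crop of length cropN taken every hop points
crop_starts(W, cropN, hop) = collect(1:hop:(W - cropN + 1))

# Unshuffled split of pair indices; assumption: the train count is rounded down
function split_pairs(npairs::Integer, frac_train::Real)
    ntrain = floor(Int, frac_train * npairs)
    return 1:ntrain, (ntrain + 1):npairs
end

# Pair p occupies rows 2p-1 (clean) and 2p (dirty), so pairs stay together
pair_rows(pairs) = [r for p in pairs for r in (2p - 1, 2p)]

starts = crop_starts(65536, 4096, 2048)     # defaults: 64k points, 4k crops, 50% overlap
train_pairs, val_pairs = split_pairs(10, 0.8)

println("crops per spectrum: ", length(starts))
println("validation rows: ", pair_rows(val_pairs))
```

With the default settings, each 65536-point spectrum yields 31 overlapping crops, and splitting by pair rather than by row guarantees that a clean spectrum never ends up in a different split from its dirty counterpart.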
Basic usage:
julia make_train_val_multi_snr_from_synthetic_dir.jl ../examples/synthetic/
This will:
- Scan synthetic for FIDs_*SNR*.jld2
- Process all of them with default settings
- Write a single file such as TrainVal_multiSNR_crops4096_h2048.jld2 back into the same directory.
You can then point the training script directly to this file as:
julia spec_cleaner_train.jl TrainVal_multiSNR_crops4096_h2048.jld2
Custom output and parameters. All arguments except input_dir are optional:
julia make_train_val_multi_snr_from_synthetic_dir.jl <input_dir> [output_dir] [zf_pow2] [apod] [cropN] [hop] [frac_train]
- input_dir: directory with FID .jld2 files (each with "batch" :: SpectData)
- output_dir: directory for the combined train/val file (default: input_dir)
- zf_pow2: zero fill length as power of two (default: 16 -> 65536 points)
- apod: exponential apodization constant in time domain (default: 0.5pi)
- cropN: crop length in points (default: 4096)
- hop: hop length between crops (default: cropN/2 -> 50% overlap)
- frac_train: fraction of clean/dirty pairs used for training in each SNR group (default: 0.8)
Example with custom settings:
# No overlap and 75% of pairs used for training
julia make_train_val_multi_snr_from_synthetic_dir.jl ../examples/synthetic/ ../out_multi 16 1.57 4096 4096 0.75
The resulting TrainVal_multiSNR_*.jld2 file is ready to be used with load_multi_snr_loaders, which will create a shuffled "train" loader with mixed SNRs and non-shuffled "val_snr_XXX" loaders for each SNR, for monitoring validation loss by noise level.
1.8 Building test only sets from synthetic FIDs (convenience script)
For user convenience, NMRflux.jl also provides a standalone script make_test_from_synthetic_dir.jl that builds test-only datasets from synthetic FIDs generated by GenerateFIDs. It expects an input directory of .jld2 files whose names start with FIDs_ and encode the SNR (e.g. FIDs_16384_SNR-1000_0001.jld2). Each file must contain a SpectData under the key "batch" with rows ordered as [clean1, dirty1, clean2, dirty2, ...]. Given such a directory, the script:
- Loads all FID batches in the directory
- Zero fills, apodizes, and Fourier transforms them
- Crops each spectrum into 4k tiles with 50% overlap (preserving clean/dirty pairing)
- Groups all crops by SNR (parsed from the filenames)
- Concatenates all crops per SNR without any train/val splitting
- Writes one JLD2 file per SNR with:
  - "test" :: SpectData{ComplexF32,2}
  - "batch" :: SpectData{ComplexF32,2} (alias so inference scripts expecting "batch" work directly)
  - "meta" :: Dict{String,Any} with parameters, counts, and input file list
This is intended for final evaluation or for stress testing a trained model on large synthetic test sets.
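The grouping step relies on the SNR label embedded in each filename. A minimal sketch of that parsing is shown below. The helper names snr_label and group_by_snr are hypothetical, and the code assumes that the hyphen after SNR is a separator rather than a minus sign; the script's own parser may differ.

```julia
# Hypothetical helper: extract the SNR label from names like
# FIDs_16384_SNR-1000_0001.jld2; returns nothing for other files.
function snr_label(fname::AbstractString)
    m = match(r"^FIDs_.*SNR-(\d+)", basename(fname))
    m === nothing && return nothing
    return parse(Int, m.captures[1])
end

# Collect filenames per SNR label, skipping files that do not match
function group_by_snr(fnames)
    groups = Dict{Int,Vector{String}}()
    for f in fnames
        snr = snr_label(f)
        snr === nothing && continue
        push!(get!(groups, snr, String[]), f)
    end
    return groups
end

files = ["FIDs_16384_SNR-1000_0001.jld2",
         "FIDs_16384_SNR-1000_0002.jld2",
         "FIDs_16384_SNR-10_0001.jld2",
         "notes.txt"]
groups = group_by_snr(files)
for snr in sort(collect(keys(groups)))
    println("SNR ", snr, ": ", length(groups[snr]), " file(s)")
end
```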
Basic usage
julia make_test_from_synthetic_dir.jl ../examples/synthetic/
Custom output and parameters. All arguments except input_dir are optional:
julia make_test_from_synthetic_dir.jl <input_dir> [output_dir] [zf_pow2] [apod] [cropN] [hop]
- input_dir: directory with synthetic FID .jld2 files (each with "batch" :: SpectData)
- output_dir: directory for the test files (default: input_dir)
- zf_pow2: zero-fill length as power of two (default: 16 -> 65536 points)
- apod: exponential apodization constant in time domain (default: 0.5pi)
- cropN: crop length in points (default: 4096)
- hop: hop length between crops (default: cropN/2 -> 50% overlap)
Example with custom settings:
# No overlap and custom apodization
julia make_test_from_synthetic_dir.jl ./synthetic_FIDs_test ./out_test 16 1.57 4096 4096
The resulting Test_SNR-XXX_*.jld2 files can be used directly with the inference script for evaluating a trained denoiser across different SNR levels.