Introduction to AlpsNMR

Sergio Oller, Institute for Bioengineering of Catalonia

2021-09-19

The AlpsNMR package was written with two purposes in mind:

Functions from this package written for data analysts and NMR scientists are prefixed with nmr_, while higher level functions written for IT pipeline builders are prefixed with pipe_. The main reason why all exported functions have a prefix is to make it easy for the user to discover the functions from the package. By typing nmr_ RStudio will return the list of exported functions. In the R terminal, nmr_ followed by the tab key (⇥) twice will have the same effect. Other popular packages, follow similar approaches (e.g: forcats: fct_*, stringr: str_*).

This vignette is written for the first group. It assumes some prior basic knowledge of NMR and data analysis, as well as some basic R programming. In case you are interested in building pipelines with this package, you may want to open the file saved in this directory (run it on your computer):

pipeline_example <- system.file("pipeline-rmd", "pipeline_example.R", package = "AlpsNMR")
pipeline_example
library(AlpsNMR)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: future
#> Loading required package: magrittr
library(ggplot2)

Enable parallellization

This package is able to parallellize several functions through the use of the future and furrr packages. Whether to parallelize or not is left to the user that can control the parallellization by setting “plans”.

#plan(sequential()) # disable parallellization (default)
plan(multiprocess(workers = 4)) # enable parallellization with 4 workers
#> Warning: Strategy 'multiprocess' is deprecated in future (>= 1.20.0). Instead,
#> explicitly specify either 'multisession' or 'multicore'. In the current R
#> session, 'multiprocess' equals 'multicore'.

Data: The MeOH_plasma_extraction dataset

To explore the basics of the AlpsNMR package, we have included four NMR samples acquired in a 600 MHz Bruker instrument bundled with the package. The samples are pooled quality control plasma samples, that were extracted with methanol, and therefore only contain small molecules.

If you have installed this package, you can obtain the directory where the four samples are with the command

MeOH_plasma_extraction_dir <- system.file("dataset-demo", package = "AlpsNMR")
MeOH_plasma_extraction_dir
#> [1] "/tmp/RtmpIx1CAK/Rinst1fc6e72bf35be6/AlpsNMR/dataset-demo"

The demo directory includes four samples (zipped) and a dummy Excel metadata file.

fs::dir_ls(MeOH_plasma_extraction_dir)
#> /tmp/RtmpIx1CAK/Rinst1fc6e72bf35be6/AlpsNMR/dataset-demo/10.zip
#> /tmp/RtmpIx1CAK/Rinst1fc6e72bf35be6/AlpsNMR/dataset-demo/20.zip
#> /tmp/RtmpIx1CAK/Rinst1fc6e72bf35be6/AlpsNMR/dataset-demo/30.zip
#> /tmp/RtmpIx1CAK/Rinst1fc6e72bf35be6/AlpsNMR/dataset-demo/README.txt
#> /tmp/RtmpIx1CAK/Rinst1fc6e72bf35be6/AlpsNMR/dataset-demo/dummy_metadata.xlsx

Given the name of the dataset, one may guess that the dataset was used to check the Methanol extraction in serum samples. The dummy metadata consists of dummy information, just for the sake of showing how this package can integrate external metadata. The excel file consists of two tidy tables, in two sheets.

MeOH_plasma_extraction_xlsx <- file.path(MeOH_plasma_extraction_dir, "dummy_metadata.xlsx")
exp_subj_id <- readxl::read_excel(MeOH_plasma_extraction_xlsx, sheet = 1)
subj_id_age <- readxl::read_excel(MeOH_plasma_extraction_xlsx, sheet = 2)
exp_subj_id
#> # A tibble: 3 × 3
#>   NMRExperiment SubjectID TimePoint
#>   <chr>         <chr>     <chr>    
#> 1 10            Ana       baseline 
#> 2 20            Ana       3 months 
#> 3 30            Elia      baseline
subj_id_age
#> # A tibble: 2 × 2
#>   SubjectID   Age
#>   <chr>     <dbl>
#> 1 Ana          29
#> 2 Elia          0

Loading samples

The function to read samples is called nmr_read_samples. It expects a character vector with the samples to load that can be paths to directories of Bruker format samples or paths to JDX files.

Additionally, this function can filter by pulse sequences (e.g. load only NOESY samples) or loading only metadata.

zip_files <- fs::dir_ls(MeOH_plasma_extraction_dir, glob = "*.zip")
zip_files
#> /tmp/RtmpIx1CAK/Rinst1fc6e72bf35be6/AlpsNMR/dataset-demo/10.zip
#> /tmp/RtmpIx1CAK/Rinst1fc6e72bf35be6/AlpsNMR/dataset-demo/20.zip
#> /tmp/RtmpIx1CAK/Rinst1fc6e72bf35be6/AlpsNMR/dataset-demo/30.zip
dataset <- nmr_read_samples(sample_names = zip_files)
dataset
#> An nmr_dataset (3 samples)

As we have not added any metadata to this dataset, the only column we see is the NMRExperiment:

nmr_meta_get(dataset, groups = "external")
#> # A tibble: 3 × 1
#>   NMRExperiment
#>   <chr>        
#> 1 10           
#> 2 20           
#> 3 30

Adding metadata

Initally our dataset only has the NMRExperiment column:

nmr_meta_get(dataset, groups = "external")
#> # A tibble: 3 × 1
#>   NMRExperiment
#>   <chr>        
#> 1 10           
#> 2 20           
#> 3 30

The exp_subj_id table we loaded links the NMRExperiment to the SubjectID.

As we already have the NMRExperiment column, we can use it as the merging column (note that both columns have the same column name to match the metadata such as group class, age, BMI…):

dataset <- nmr_meta_add(dataset, metadata = exp_subj_id, by = "NMRExperiment")
nmr_meta_get(dataset, groups = "external")
#> # A tibble: 3 × 3
#>   NMRExperiment SubjectID TimePoint
#>   <chr>         <chr>     <chr>    
#> 1 10            Ana       baseline 
#> 2 20            Ana       3 months 
#> 3 30            Elia      baseline

If we have info from different files we can match them. For instance, now we have the SubjectID information so we can add the table that adds the SubjectID to the Age.

dataset <- nmr_meta_add(dataset, metadata = subj_id_age, by = "SubjectID")
nmr_meta_get(dataset, groups = "external")
#> # A tibble: 3 × 4
#>   NMRExperiment SubjectID TimePoint   Age
#>   <chr>         <chr>     <chr>     <dbl>
#> 1 10            Ana       baseline     29
#> 2 20            Ana       3 months     29
#> 3 30            Elia      baseline      0

Now we have our metadata integrated in the dataset and we can make use of it in further data analysis steps.

Interpolation

1D NMR samples can be interpolated together, in order to arrange all the spectra into a matrix, with one row per sample. The main parameters we would need is the range of ppm values that we want to interpolate and the resolution.

We can see the ppm resolution by looking at the ppm axis of one sample:

ppm_res <- nmr_ppm_resolution(dataset)[[1]]
message("The ppm resolution is: ", format(ppm_res, digits = 2), " ppm")
#> The ppm resolution is: 0.00023 ppm

We can interpolate the dataset, obtaining an nmr_dataset_1D object:

dataset <- nmr_interpolate_1D(dataset, axis = c(min = -0.5, max = 10, by = 2.3E-4))

This operation changes the class of the object, as now the data is on a matrix. The dataset is now of class nmr_dataset_1D. The axis element is now a numeric vector and the data_1r element is a matrix.

Plotting samples

The AlpsNMR package offers the possibility to plot nmr_dataset_1D objects. Plotting many spectra with so many points is quite expensive so it is possible to include only some regions of the spectra or plot only some samples.

Use ?plot.nmr_dataset_1D to check the parameters, among them:

plot(dataset, NMRExperiment = c("10", "30"), chemshift_range = c(2.2, 2.8))

Creating interactive plots

The option interactive = TRUE described above has some performance limitations. As high performance workaround, you can make many plots interactive with the function plot_interactive.

This function will use WebGL technologies to create a webpage that, once opened, allows you to interact with the plot.

Due to technical limitations, these plots need to be opened manually and can’t be embedded in RMarkdown documents. Therefore, the function saves the plot in the directory for further exploration. Additionally, some old web browsers may not be able to display these interactive plots correctly.

plt <- plot(dataset, NMRExperiment = c("10", "30"), chemshift_range = c(2.2, 2.8))
plot_interactive(plt, "plot_region.html")

Exclude regions

Some regions can easily be excluded from the spectra with nmr_exclude_region. Note that the regions are fully removed and not zeroed, as using zeros complicates a lot the implementation1 and has little advantages.

regions_to_exclude <- list(water = c(4.6, 5), methanol = c(3.33, 3.39))
dataset <- nmr_exclude_region(dataset, exclude = regions_to_exclude)
plot(dataset, chemshift_range = c(4.2, 5.5))

Filter samples

Maybe we just want to analyze a subset of the data, e.g., only a class group or a particular gender. We can filter some samples according to their metadata as follows:

samples_10_20 <- filter(dataset, SubjectID == "Ana")
nmr_meta_get(samples_10_20, groups = "external")
#> # A tibble: 2 × 4
#>   NMRExperiment SubjectID TimePoint   Age
#>   <chr>         <chr>     <chr>     <dbl>
#> 1 10            Ana       baseline     29
#> 2 20            Ana       3 months     29

Robust PCA for outlier detection

The AlpsNMR package includes robust PCA analysis for outlier detection. With such a small demo dataset, it is not practical to use, but check out the documentation of nmr_pca_outliers_* functions.

pca_outliers_rob <- nmr_pca_outliers_robust(dataset, ncomp = 3)
nmr_pca_outliers_plot(dataset, pca_outliers_rob)

Baseline removal

Spectra may display an unstable baseline, specially when processing blood/fecal blood/fecal samples. If so, nmr_baseline_removal subtract the baseline by means of Asymmetric Least Squares method.

See before:

plot(dataset, chemshift_range = c(3.5,3.8))

And after:

dataset = nmr_baseline_removal(dataset, lambda = 6, p = 0.01)
plot(dataset, chemshift_range = c(3.5,3.8))

Peak detection

The peak detection is performed on short spectra segments using a continuous wavelet transform. See ?nmr_detect_peaks for more information.

Our current approach relies on the use of the baseline threshold (baselineThresh) automatic calculated (see ?nmr_baseline_threshold) and the Signal to Noise Threshold (SNR.Th) to discriminate valid peaks from noise.

The combination of the baselineThresh and the SNR.Th optimizes the number of actual peaks from noise.

The advantage of the SNR.Th method is that it estimates the noise level on each spectra region independently, so in practice it can be used as a dynamic baseline threshold level.

peak_table <- nmr_detect_peaks(dataset,
                               nDivRange_ppm = 0.1,
                               scales = seq(1, 16, 2),
                               baselineThresh = NULL, SNR.Th = 3)
#> Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
#> numbers without specifying argument 'seed'. There is a risk that those random
#> numbers are not statistically sound and the overall results might be invalid.
#> To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random
#> numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use
#> 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
#> Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
#> numbers without specifying argument 'seed'. There is a risk that those random
#> numbers are not statistically sound and the overall results might be invalid.
#> To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random
#> numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use
#> 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
#> Warning: UNRELIABLE VALUE: Future ('<none>') unexpectedly generated random
#> numbers without specifying argument 'seed'. There is a risk that those random
#> numbers are not statistically sound and the overall results might be invalid.
#> To fix this, specify 'seed=TRUE'. This ensures that proper, parallel-safe random
#> numbers are produced via the L'Ecuyer-CMRG method. To disable this check, use
#> 'seed=NULL', or set option 'future.rng.onMisuse' to "ignore".
NMRExp_ref <- nmr_align_find_ref(dataset, peak_table)
message("Your reference is NMRExperiment ", NMRExp_ref)
#> Your reference is NMRExperiment 20
nmr_detect_peaks_plot(dataset, peak_table, NMRExperiment = "20", chemshift_range = c(3.5,3.8))

Spectra alignment

To align the sample, we use the nmr_align function, which in turn uses a hierarchical clustering method (see ?nmr_align for further details).

The maxShift_ppm limits the maximum shift allowed for the spectra.

nmr_exp_ref <- nmr_align_find_ref(dataset, peak_table)
dataset_align <- nmr_align(dataset, peak_table, nmr_exp_ref, maxShift_ppm = 0.0015, acceptLostPeak = FALSE)
plot(dataset, chemshift_range = c(3.025, 3.063))

plot(dataset_align, chemshift_range = c(3.025, 3.063))

Normalization

There are multiple normalization techniques available. The most strongly recommended is the pqn normalization, but it may not be fully reliable when the number of samples is small, as it needs a computation of the median spectra. Nevertheless, it is possible to compute it:

dataset_norm <- nmr_normalize(dataset_align, method = "pqn")
#> Warning in norm_pqn(samples[["data_1r"]]): The Probabalistic Quotient
#> Normalization requires several samples to compute the median spectra. Your
#> number of samples is low

The AlpsNMR package offers the possibility to extract additional normalization information with nmr_normalize_extra_info(dataset), to explore the normalization factors applied to each sample:

The plot shows the dispersion with respect to the median of the normalization factors, and can highlight samples with abnormaly large or small normalization factors.

diagnostic <- nmr_normalize_extra_info(dataset_norm)
diagnostic$norm_factor
#>   NMRExperiment norm_factor norm_factor_norm
#> 1            10   520968669        0.8708718
#> 2            20   690154843        1.1536901
#> 3            30   598215122        1.0000000
diagnostic$plot

Peak integration

1. Integration based on peak center and width

If we want to integrate the whole spectra, we need ppm from the peak_table. See Peak detection section. The function nmr_integrate_peak_positions generates a new nmr_dataset_1D object containing the integrals from the peak_table (ppm values corresponding to detected peaks).

We can also integrate with a specific peak position and some arbitrary width:

2. Integration based on peak boundaries

Imagine we only want to integrate the four peaks corresponding to the pyroglutamic acid:

We define the peak regions and integrate them. Note how we can correct the baseline or not. If we correct the baseline, the limits of the integration will be connected with a straight line and that line will be used as the baseline, that will be subtracted.

We may plot the integral values to explore variation based on the baseline subtraction.

Identification

After applying any feature selection or machine learning, Alps allows the identification of features of interest through nmr_identify_regions_blood. The function gives 3 posibilities sorted by the most probable metabolite (see nmr_identify_regions_blood for details).

ppm_to_assign <- c(4.060960203, 3.048970634,2.405935596,0.990616851,0.986520147, 1.044258467)
identification <- nmr_identify_regions_blood (ppm_to_assign)

Free experimentation

Getting the spectra and manipulating it manually

Besides all those techniques, you can easily implement your own. You can extract the raw matrix and manipulate it at will. As long as you don’t permute the rows, you can always replace the raw matrix of the nmr_dataset_1D object through the nmr_data function:

Creating an nmr_dataset_1D object from a matrix

You can also create an nmr_dataset_1D object from scratch with the new_nmr_dataset_1D function:

Final thoughts

This vignette shows many of the features of the package, some features have room for improvement, others are not fully described, and the reader will need to browse the documentation. Hopefully it is a good starting point for using the package.

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.3.5  AlpsNMR_3.2.3  magrittr_2.0.1 future_1.22.1  dplyr_1.0.7   
#> 
#> loaded via a namespace (and not attached):
#>   [1] colorspace_2.0-2            ellipsis_0.3.2             
#>   [3] speaq_2.6.1                 corpcor_1.6.10             
#>   [5] XVector_0.32.0              GenomicRanges_1.44.0       
#>   [7] fs_1.5.0                    listenv_0.8.0              
#>   [9] furrr_0.2.3                 farver_2.1.0               
#>  [11] ggrepel_0.9.1               RSpectra_0.16-0            
#>  [13] fansi_0.5.0                 mvtnorm_1.1-2              
#>  [15] xml2_1.3.2                  codetools_0.2-18           
#>  [17] impute_1.66.0               knitr_1.34                 
#>  [19] mixOmics_6.16.3             itertools_0.1-3            
#>  [21] jsonlite_1.7.2              cluster_2.1.2              
#>  [23] missForest_1.4              compiler_4.1.1             
#>  [25] httr_1.4.2                  assertthat_0.2.1           
#>  [27] RcppZiggurat_0.1.6          Matrix_1.3-4               
#>  [29] fastmap_1.1.0               cli_3.0.1                  
#>  [31] htmltools_0.5.2             tools_4.1.1                
#>  [33] igraph_1.2.6                qtl_1.48-1                 
#>  [35] gtable_0.3.0                glue_1.4.2                 
#>  [37] GenomeInfoDbData_1.2.6      reshape2_1.4.4             
#>  [39] Rcpp_1.0.7                  limSolve_1.5.6             
#>  [41] Biobase_2.52.0              cellranger_1.1.0           
#>  [43] jquerylib_0.1.4             vctrs_0.3.8                
#>  [45] baseline_1.3-1              iterators_1.0.13           
#>  [47] xfun_0.26                   stringr_1.4.0              
#>  [49] globals_0.14.0              rvest_1.0.1                
#>  [51] lpSolve_5.6.15              lifecycle_1.0.0            
#>  [53] zlibbioc_1.38.0             MASS_7.3-54                
#>  [55] scales_1.1.1                doSNOW_1.0.19              
#>  [57] MatrixGenerics_1.4.3        parallel_4.1.1             
#>  [59] SummarizedExperiment_1.22.0 MassSpecWavelet_1.58.0     
#>  [61] SparseM_1.81                RColorBrewer_1.1-2         
#>  [63] yaml_2.2.1                  gridExtra_2.3              
#>  [65] sass_0.4.0                  stringi_1.7.4              
#>  [67] highr_0.9                   S4Vectors_0.30.0           
#>  [69] pcaPP_1.9-74                foreach_1.5.1              
#>  [71] randomForest_4.6-14         BiocGenerics_0.38.0        
#>  [73] BiocParallel_1.26.2         GenomeInfoDb_1.28.4        
#>  [75] rlang_0.4.11                pkgconfig_2.0.3            
#>  [77] matrixStats_0.61.0          bitops_1.0-7               
#>  [79] evaluate_0.14               lattice_0.20-44            
#>  [81] purrr_0.3.4                 labeling_0.4.2             
#>  [83] Rfast_2.0.3                 tidyselect_1.1.1           
#>  [85] parallelly_1.28.1           plyr_1.8.6                 
#>  [87] R6_2.5.1                    IRanges_2.26.0             
#>  [89] snow_0.4-3                  generics_0.1.0             
#>  [91] DelayedArray_0.18.0         DBI_1.1.1                  
#>  [93] pillar_1.6.2                withr_2.4.2                
#>  [95] RCurl_1.98-1.5              mQTL_1.0                   
#>  [97] tibble_3.1.4                crayon_1.4.1               
#>  [99] rARPACK_0.11-0              utf8_1.2.2                 
#> [101] ellipse_0.4.2               rmarkdown_2.11             
#> [103] grid_4.1.1                  readxl_1.3.1               
#> [105] data.table_1.14.0           digest_0.6.27              
#> [107] tidyr_1.1.3                 outliers_0.14              
#> [109] signal_0.7-7                stats4_4.1.1               
#> [111] munsell_0.5.0               bslib_0.3.0                
#> [113] quadprog_1.5-8

  1. e.g. it can inadvertedly distort the PQN normalization results