Clustering Spectra from High Resolution DI-MS/MS Data Using CluMSID

Tobias Depke

May 19, 2021

Introduction

Although originally developed for liquid chromatography-tandem mass spectrometry (LC-MS/MS) data, CluMSID can also be used with direct infusion-tandem mass spectrometry (DI-MS/MS) data.

In general, the missing retention time dimension makes feature annotation in metabolomics harder. If only direct infusion data are at hand, however, CluMSID can still help to get an overview of the chemodiversity of a sample measured by DI-MS/MS.

In this example, we will use a sample similar to the one in the General Tutorial (1 µL Pseudomonas aeruginosa PA14 cell extract), measured on the same instrument, a Bruker maxisHD qTOF operated in ESI-(+) mode with auto-MS/MS but without chromatographic separation.

Data import

We load the file from the CluMSIDdata package:

library(CluMSID)
library(CluMSIDdata)

DIfile <- system.file("extdata",
                      "PA14_maxis_DI.mzXML",
                      package = "CluMSIDdata")

Data preprocessing

The extraction of spectra works the same way as with LC-MS/MS data:

ms2list <- extractMS2spectra(DIfile)
length(ms2list)
#> [1] 373

Merging redundant spectra is less straightforward when retention time is not available. Depending on the MS/MS method, it can be next to impossible to decide whether two spectra with the same precursor m/z and similar fragmentation patterns derive from the same analyte or from two different but structurally similar ones.

In this example, we would like to merge spectra with identical precursor ions only if they were recorded one right after another. We can do so by setting rt_tolerance to 1 second:

featlist <- mergeMS2spectra(ms2list, rt_tolerance = 1)
length(featlist)
#> [1] 349

We see that this has hardly reduced the number of spectra in the list. If we decided instead to merge all spectra with identical precursor m/z from the entire run, we could do so by setting rt_tolerance to the duration of the run, in this case approximately 250 seconds:

testlist <- mergeMS2spectra(ms2list, rt_tolerance = 250)
length(testlist)
#> [1] 75

The resulting number of spectra is drastically lower, but the risk of merging spectra that do not actually derive from the same analyte is also very high.
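The effect of the time window on the number of merged features can be illustrated with a small base R sketch. It is not the algorithm used by mergeMS2spectra: the scan table, the helper count_merged() and the use of exact m/z equality (instead of an m/z tolerance) are assumptions made purely for illustration.

```r
# Toy scan table: precursor m/z and acquisition time in seconds.
# All values are invented for illustration.
scans <- data.frame(
    mz = c(270.19, 270.19, 270.19, 244.17, 244.17, 270.19),
    rt = c(10, 10.8, 11.5, 40, 41, 200)
)

# Hypothetical helper: count the groups obtained when scans sharing a
# precursor m/z are chained together as long as consecutive scans lie
# within rt_tolerance seconds of each other.
count_merged <- function(scans, rt_tolerance) {
    scans <- scans[order(scans$mz, scans$rt), ]
    # A new group starts at the first scan, whenever the precursor m/z
    # changes, or whenever the gap to the previous scan exceeds the tolerance.
    new_group <- c(TRUE,
                   diff(scans$mz) != 0 | diff(scans$rt) > rt_tolerance)
    sum(new_group)
}

n_narrow <- count_merged(scans, rt_tolerance = 1)    # consecutive scans only
n_wide   <- count_merged(scans, rt_tolerance = 250)  # whole toy run
```

With the narrow window, the late scan at 200 s stays separate from the early scans with the same precursor m/z. With the wide window, it is merged into them, which is exactly the situation where spectra from different analytes could be combined by mistake.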

Generation of distance matrix

In this highly exploratory example, we skip the integration of prior knowledge on feature identities and generate a distance matrix right away:

distmat <- distanceMatrix(featlist)
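To give an intuition for what a spectral distance can look like, the following base R sketch computes a cosine similarity between two toy spectra and converts it to a distance as one minus the similarity. This is a simplified illustration under stated assumptions and not the code behind distanceMatrix: the spectra are invented, cosine_sim() is a hypothetical helper, and peaks are aligned by rounding m/z values rather than by a proper m/z tolerance.

```r
# Two toy MS2 spectra as two-column matrices (m/z, intensity).
# All values are invented for illustration.
spec_a <- cbind(mz = c(159.07, 172.08, 184.08), int = c(100, 40, 10))
spec_b <- cbind(mz = c(159.07, 172.08, 198.09), int = c(80, 50, 20))

# Hypothetical helper: cosine similarity after aligning peaks by rounded
# m/z, a crude stand-in for tolerance-based peak matching.
cosine_sim <- function(x, y, digits = 2) {
    kx <- round(x[, "mz"], digits)
    ky <- round(y[, "mz"], digits)
    all_mz <- sort(union(kx, ky))
    # Intensity vectors on the common m/z axis; unmatched peaks stay at 0.
    vx <- numeric(length(all_mz))
    vy <- numeric(length(all_mz))
    vx[match(kx, all_mz)] <- x[, "int"]
    vy[match(ky, all_mz)] <- y[, "int"]
    sum(vx * vy) / sqrt(sum(vx^2) * sum(vy^2))
}

sim_ab  <- cosine_sim(spec_a, spec_b)
dist_ab <- 1 - sim_ab  # 0 for identical spectra, 1 for no shared peaks
```

Applying such a function to every pair of spectra yields a symmetric matrix with zeros on the diagonal, which is the kind of object that clustering methods take as input.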

Data exploration

Starting from this distance matrix, we can use all the data exploration functions that CluMSID offers. In this example workflow, we look at a cluster dendrogram:

HCplot(distmat, cex = 0.5)

Figure 1: Circularised dendrogram as a result of agglomerative hierarchical clustering with average linkage as agglomeration criterion based on MS2 spectra similarities of the DI-MS/MS example data set. Each leaf represents one feature and colours encode cluster affiliation of the features. Leaf labels display feature IDs, along with feature annotations, if existent. Distance from the central point is indicative of the height of the dendrogram.

It is immediately obvious that some spectra are nearly identical and thus most likely derive from the same analyte, e.g. the many spectra with a precursor m/z of 270.19. Nevertheless, we still see good clustering of similar spectra with different precursor m/z, e.g. the large gray cluster that contains many different alkylquinolone-type metabolites (see General Tutorial).
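The clustering step behind such a dendrogram can be sketched in base R with hclust and cutree. The example below uses average linkage, as stated in the figure caption, on an invented distance matrix for five hypothetical features. The cut height of 0.5 is an arbitrary choice for this toy data and is not taken from CluMSID.

```r
# Toy symmetric distance matrix for five hypothetical features F1 to F5.
# All values are invented for illustration.
d <- matrix(c(0.00, 0.05, 0.10, 0.90, 0.95,
              0.05, 0.00, 0.08, 0.92, 0.97,
              0.10, 0.08, 0.00, 0.91, 0.96,
              0.90, 0.92, 0.91, 0.00, 0.20,
              0.95, 0.97, 0.96, 0.20, 0.00),
            nrow = 5,
            dimnames = list(paste0("F", 1:5), paste0("F", 1:5)))

# Agglomerative hierarchical clustering with average linkage.
hc <- hclust(as.dist(d), method = "average")

# Cut the tree at an arbitrary height to obtain cluster memberships.
clusters <- cutree(hc, h = 0.5)
```

Here F1 to F3 end up in one cluster and F4 and F5 in another, mirroring how groups of near-identical or structurally related spectra show up as separate branches in the dendrogram.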

In conclusion, CluMSID is very useful for providing an overview of spectral similarities within DI-MS/MS runs, but wherever annotation is the focus, one should not forgo the additional layer of information provided by chromatographic separation.

Session Info

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] CluMSIDdata_1.7.0 CluMSID_1.8.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-152          ProtGenerics_1.24.0   bitops_1.0-7         
#>  [4] httr_1.4.2            doParallel_1.0.16     RColorBrewer_1.1-2   
#>  [7] MSnbase_2.18.0        tools_4.1.0           bslib_0.2.5.1        
#> [10] utf8_1.2.1            R6_2.5.0              affyio_1.62.0        
#> [13] KernSmooth_2.23-20    lazyeval_0.2.2        DBI_1.1.1            
#> [16] BiocGenerics_0.38.0   colorspace_2.0-1      tidyselect_1.1.1     
#> [19] GGally_2.1.1          compiler_4.1.0        preprocessCore_1.54.0
#> [22] Biobase_2.52.0        network_1.16.1        plotly_4.9.3         
#> [25] sass_0.4.0            caTools_1.18.2        scales_1.1.1         
#> [28] affy_1.70.0           stringr_1.4.0         digest_0.6.27        
#> [31] dbscan_1.1-8          rmarkdown_2.8         pkgconfig_2.0.3      
#> [34] htmltools_0.5.1.1     highr_0.9             limma_3.48.0         
#> [37] htmlwidgets_1.5.3     rlang_0.4.11          impute_1.66.0        
#> [40] jquerylib_0.1.4       generics_0.1.0        jsonlite_1.7.2       
#> [43] mzID_1.30.0           statnet.common_4.4.1  BiocParallel_1.26.0  
#> [46] gtools_3.8.2          dplyr_1.0.6           magrittr_2.0.1       
#> [49] MALDIquant_1.19.3     Rcpp_1.0.6            munsell_0.5.0        
#> [52] S4Vectors_0.30.0      fansi_0.4.2           ape_5.5              
#> [55] MsCoreUtils_1.4.0     lifecycle_1.0.0       vsn_3.60.0           
#> [58] stringi_1.6.2         yaml_2.2.1            MASS_7.3-54          
#> [61] zlibbioc_1.38.0       gplots_3.1.1          plyr_1.8.6           
#> [64] grid_4.1.0            parallel_4.1.0        crayon_1.4.1         
#> [67] lattice_0.20-44       mzR_2.26.0            sna_2.6              
#> [70] knitr_1.33            pillar_1.6.1          codetools_0.2-18     
#> [73] stats4_4.1.0          rle_0.9.2             XML_3.99-0.6         
#> [76] glue_1.4.2            evaluate_0.14         data.table_1.14.0    
#> [79] pcaMethods_1.84.0     BiocManager_1.30.15   vctrs_0.3.8          
#> [82] foreach_1.5.1         tidyr_1.1.3           gtable_0.3.0         
#> [85] purrr_0.3.4           clue_0.3-59           reshape_0.8.8        
#> [88] assertthat_0.2.1      ggplot2_3.3.3         xfun_0.23            
#> [91] coda_0.19-4           viridisLite_0.4.0     ncdf4_1.17           
#> [94] tibble_3.1.2          iterators_1.0.13      IRanges_2.26.0       
#> [97] cluster_2.1.2         ellipsis_0.3.2