Clustering Mass Spectra from Low Resolution GC-EI-MS Data Using CluMSID

Tobias Depke

May 19, 2021

Introduction

Although originally developed for high resolution LC-MS/MS data, CluMSID can also be used to find similarities in GC-EI-MS data, i.e. data from hard ionisation mass spectrometry.

As the peak picking and spectral merging differs considerably from data dependent ESI-MS/MS, we cannot use the standard CluMSID functions extractMS2spectra() and mergeMS2spectra(). In fact, the analysis of mass spectra from hard ionisation mass spectrometry resembles the one of MS1 pseudospectra in ESI-MS. Thus, we can use the CluMSID function extractPseudospectra() in conjunction with pseudspectra generated by the CAMERA package.

Since xcms and CAMERA sometimes have difficulties in handling GC-EI-MS data, we use the metaMS package that enables workflows specialised to the analysis of such data. We also require the metaMSdata package from which we import the FEMSsettings object that contains xcms and CAMERA settings for GC-EI-MS data.

library(CluMSID)
library(CluMSIDdata)
library(metaMS)
library(metaMSdata)
data(FEMsettings)

Data import and preprocessing

As example data, we use GC-EI-MS metabolomics data from pooled cell extracts of Pseudomonas aeruginosa measured on a Thermo Scientific ITQ linear ion trap that has been converted to netCDF using Thermo Xcalibur. A netCDF file is available in the CluMSIDdata package:

pool <- system.file("extdata", 
                    "1800802_TD_pool_total_1.cdf", 
                    package = "CluMSIDdata")

To generate a list of (pseudo)spectra, we first need an xsAnnotate object as generated by CAMERA. In the case of GC-MS data, it is more convenient to use to use the metaMS function runCAMERA() than actual CAMERA functions. metaMS::runCAMERA requires an xcmsSet object which we generate by using xcms::xcmsSet on our netCDF file (we can do that in one go). We used standard GC-MS settings for runCAMERA() as they are proposed in the metaMS vignette.

xA <- metaMS::runCAMERA(xcms::xcmsSet(pool), 
                chrom = "GC", 
                settings = metaMS::metaSetting(TSQXLS.GC, "CAMERA"))

Extraction and annotation of spectra

From the xsAnnotate object, we can now extract the (pseudo)spectra using the CluMSID function extractPseudospectra() function as we would do for MS1 pseudospectra from LC-ESI-MS data.

pslist <- extractPseudospectra(xA, min_peaks = 0)

Adding annotations is not as easy as with LC-(DDA-)MS/MS data, because only the retention time and the spectrum itself describe the feature and no precursor m/z is available. Thus, feature annotations/identifications made in a different programme, in this case MetaboliteDetector, have to be compared to the spectra in the pslist object.

Like with LC-(DDA-)MS/MS data, we can use writeFeaturelist() and addAnnotations() to add external annotations. The table output from writeFeaturelist() will give NA for all precursor m/z.

writeFeaturelist(pslist, "GC_pre.csv")

To facilitate manual annotation, it helps to plot the spectra along with the relevant information for every feature/pseudospectrum. That can be done by CluMSID’s specplot function:

specplot(pslist[[27]])
Figure 1: Barplot for pseudospectrum 27, displaying fragment m/z on the x-axis and intensity normalised to the maximum intensity on the y-axis.

Figure 1: Barplot for pseudospectrum 27, displaying fragment m/z on the x-axis and intensity normalised to the maximum intensity on the y-axis.

In this example, we load the list of feature annotations from CluMSIDdata:

apslist <- addAnnotations(featlist = pslist, 
                            annolist = system.file("extdata", 
                            "GC_post.csv", 
                            package = "CluMSIDdata"))

Generation of distance matrix

This list of spectra in turn serves as an input for distanceMatrix(). As we are dealing with low resolution data, we have to adjust the m/z tolerance. The default value, 10ppm, is suitable for time-of-flight mass spectrometers while linear ion traps or single quadrupoles which are commonly used in GC-EI-MS only have unit mass resolution, equivalent to a relative mass error of 0.02 to 0.001 depending on the m/z of the analyte. We chose 0.02 to be tolerant enough for low molecular weight analytes:

pseudodistmat <- distanceMatrix(apslist, mz_tolerance = 0.02)

Starting from this distance matrix, we can use all the data exploration functions that CluMSID offers. In this example workflow, we look at a cluster dendrogram:

HCplot(pseudodistmat, type = "heatmap",
                cexRow = 0.4, cexCol = 0.4,
                margins = c(7,7))
Figure 2: Symmetric heat map of the distance matrix displaying pseudospectra similarities of the GC-EI-MS example data set along with dendrograms resulting from hierarchical clustering based on the distance matrix. The colour encoding is shown in the top-left insert.

Figure 2: Symmetric heat map of the distance matrix displaying pseudospectra similarities of the GC-EI-MS example data set along with dendrograms resulting from hierarchical clustering based on the distance matrix. The colour encoding is shown in the top-left insert.

It is directly visible that the resulting clusters are not as dense as with the LC-MS/MS example data. In turn, there are more between-cluster similarities. This also shows in the correlation network, resulting in a chaotic plot when used with the default minimal similarity of 0.1:

networkplot(pseudodistmat, highlight_annotated = TRUE, 
                show_labels = TRUE, exclude_singletons = TRUE)
Figure 3: Correlation network plot based on pseudospectra similarities of the GC-EI-MS example data set, generated with the default similarity threshold of 0.1. Grey dots indicate non-identified features, orange dots identified ones. Labels display feature IDs, along with feature annotations, if existent. Edge widths are proportional to spectral similarity of the connected features.

Figure 3: Correlation network plot based on pseudospectra similarities of the GC-EI-MS example data set, generated with the default similarity threshold of 0.1. Grey dots indicate non-identified features, orange dots identified ones. Labels display feature IDs, along with feature annotations, if existent. Edge widths are proportional to spectral similarity of the connected features.

By choosing a higher similarity threshold of e.g. 0.4, it is far easier to identify clusters:

networkplot(pseudodistmat, highlight_annotated = TRUE, 
                show_labels = TRUE, exclude_singletons = TRUE,
                min_similarity = 0.4)
Figure 4: Correlation network plot based on pseudospectra similarities of the GC-EI-MS example data set, generated with the custom similarity threshold of 0.4. Grey dots indicate non-identified features, orange dots identified ones. Labels display feature IDs, along with feature annotations, if existent. Edge widths are proportional to spectral similarity of the connected features.

Figure 4: Correlation network plot based on pseudospectra similarities of the GC-EI-MS example data set, generated with the custom similarity threshold of 0.4. Grey dots indicate non-identified features, orange dots identified ones. Labels display feature IDs, along with feature annotations, if existent. Edge widths are proportional to spectral similarity of the connected features.

Presumably, the high between-cluster similarities are due to the low resolution data and the resulting fact, that fragment with different chemical composition but same unit resolution mass cannot be distinguished.

We can also use hierarchical clustering to identify clusters of similar (pseudo-)spectra. Here, too, we have to adjust h to account for higher between-cluster similarities:

HCplot(pseudodistmat, h = 0.7, cex = 0.5)
Figure 5: Circularised dendrogram as a result of agglomerative hierarchical clustering with average linkage as agglomeration criterion based on pseudospectra similarities of the GC-EI-MS example data set. Each leaf represents one feature and colours encode cluster affiliation of the features. Leaf labels display feature IDs, along with feature annotations, if existent. Distance from the central point is indicative of the height of the dendrogram.

Figure 5: Circularised dendrogram as a result of agglomerative hierarchical clustering with average linkage as agglomeration criterion based on pseudospectra similarities of the GC-EI-MS example data set. Each leaf represents one feature and colours encode cluster affiliation of the features. Leaf labels display feature IDs, along with feature annotations, if existent. Distance from the central point is indicative of the height of the dendrogram.

We see that e.g. octadecanoic acid, hexadecanoic acid and dodecanoic acid form a nice cluster as well as the phosphorate containing metabolites phosphoenolpyruvic acid, glyceric acid-3-phosphate, glycerol-3-phosphate and phosphoric acid itself. It is also apparent that some features have a similarity of 1 and could therefore represent the same compound, like e.g. the features 98, 67 and 72. Those three features cluster together with AMP and UMP, suggesting that they could be related to nucleotides.

To illustrate the use of CluMSID’s accessory function with this type of data, we take another look at nucleotides: A signature fragment for nucleotides in GC-EI-MS is m/z 315 that derives from pentose-5-phosphates. We see this fragment in Figure 1, the spectrum of UMP (derivatised with 5 TMS groups). We can use findFragment to see if there are more spectra outside the cluster that freature this fragment. As we deal with unit masses, we would like to find m/z of 315 +/- 0.5 which we can do by setting tolerance = 0.5/315:

fragmentlist <- findFragment(apslist, mz = 315, tolerance = 0.5/315)
#> 6 spectra were found that contain a fragment of m/z 315 +/- 1587.30158730159 ppm.

vapply(X = fragmentlist, FUN = accessID, FUN.VALUE = integer(1))
#> [1]  2 14 20 21 27 35

We find four more spectra that contain a 315 fragment that could be investigated closer.

In conclusion, every annotation method is extremely limited if only low resolution data is available and so is CluMSID. Still, we see that the tool works independently of chromatography and mass spectrometry method and even has the potential to give some good hints for feature annotation in GC-EI-MS metabolomics.

Session Info

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] metaMSdata_1.27.0   metaMS_1.28.0       CAMERA_1.48.0      
#>  [4] xcms_3.14.0         MSnbase_2.18.0      ProtGenerics_1.24.0
#>  [7] S4Vectors_0.30.0    mzR_2.26.0          Rcpp_1.0.6         
#> [10] BiocParallel_1.26.0 Biobase_2.52.0      BiocGenerics_0.38.0
#> [13] CluMSIDdata_1.7.0   CluMSID_1.8.0      
#> 
#> loaded via a namespace (and not attached):
#>   [1] backports_1.2.1             Hmisc_4.5-0                
#>   [3] plyr_1.8.6                  igraph_1.2.6               
#>   [5] lazyeval_0.2.2              splines_4.1.0              
#>   [7] GenomeInfoDb_1.28.0         ggplot2_3.3.3              
#>   [9] digest_0.6.27               foreach_1.5.1              
#>  [11] htmltools_0.5.1.1           fansi_0.4.2                
#>  [13] magrittr_2.0.1              checkmate_2.0.0            
#>  [15] rle_0.9.2                   cluster_2.1.2              
#>  [17] doParallel_1.0.16           limma_3.48.0               
#>  [19] sna_2.6                     matrixStats_0.58.0         
#>  [21] jpeg_0.1-8.1                colorspace_2.0-1           
#>  [23] xfun_0.23                   dplyr_1.0.6                
#>  [25] crayon_1.4.1                RCurl_1.98-1.3             
#>  [27] jsonlite_1.7.2              graph_1.70.0               
#>  [29] impute_1.66.0               survival_3.2-11            
#>  [31] iterators_1.0.13            ape_5.5                    
#>  [33] glue_1.4.2                  gtable_0.3.0               
#>  [35] zlibbioc_1.38.0             XVector_0.32.0             
#>  [37] DelayedArray_0.18.0         DEoptimR_1.0-8             
#>  [39] scales_1.1.1                vsn_3.60.0                 
#>  [41] DBI_1.1.1                   GGally_2.1.1               
#>  [43] viridisLite_0.4.0           htmlTable_2.2.1            
#>  [45] clue_0.3-59                 foreign_0.8-81             
#>  [47] preprocessCore_1.54.0       Formula_1.2-4              
#>  [49] MsCoreUtils_1.4.0           htmlwidgets_1.5.3          
#>  [51] httr_1.4.2                  gplots_3.1.1               
#>  [53] RColorBrewer_1.1-2          ellipsis_0.3.2             
#>  [55] farver_2.1.0                pkgconfig_2.0.3            
#>  [57] reshape_0.8.8               XML_3.99-0.6               
#>  [59] nnet_7.3-16                 sass_0.4.0                 
#>  [61] utf8_1.2.1                  tidyselect_1.1.1           
#>  [63] rlang_0.4.11                munsell_0.5.0              
#>  [65] tools_4.1.0                 dbscan_1.1-8               
#>  [67] generics_0.1.0              statnet.common_4.4.1       
#>  [69] evaluate_0.14               stringr_1.4.0              
#>  [71] mzID_1.30.0                 yaml_2.2.1                 
#>  [73] knitr_1.33                  robustbase_0.93-7          
#>  [75] caTools_1.18.2              purrr_0.3.4                
#>  [77] RANN_2.6.1                  ncdf4_1.17                 
#>  [79] RBGL_1.68.0                 nlme_3.1-152               
#>  [81] compiler_4.1.0              rstudioapi_0.13            
#>  [83] plotly_4.9.3                png_0.1-7                  
#>  [85] affyio_1.62.0               MassSpecWavelet_1.58.0     
#>  [87] tibble_3.1.2                bslib_0.2.5.1              
#>  [89] stringi_1.6.2               highr_0.9                  
#>  [91] lattice_0.20-44             Matrix_1.3-3               
#>  [93] vctrs_0.3.8                 pillar_1.6.1               
#>  [95] lifecycle_1.0.0             BiocManager_1.30.15        
#>  [97] jquerylib_0.1.4             MALDIquant_1.19.3          
#>  [99] data.table_1.14.0           bitops_1.0-7               
#> [101] GenomicRanges_1.44.0        R6_2.5.0                   
#> [103] latticeExtra_0.6-29         pcaMethods_1.84.0          
#> [105] affy_1.70.0                 network_1.16.1             
#> [107] KernSmooth_2.23-20          gridExtra_2.3              
#> [109] IRanges_2.26.0              codetools_0.2-18           
#> [111] MASS_7.3-54                 gtools_3.8.2               
#> [113] assertthat_0.2.1            SummarizedExperiment_1.22.0
#> [115] GenomeInfoDbData_1.2.6      grid_4.1.0                 
#> [117] rpart_4.1-15                tidyr_1.1.3                
#> [119] coda_0.19-4                 rmarkdown_2.8              
#> [121] MatrixGenerics_1.4.0        base64enc_0.1-3