XCMS Parameter Optimization with IPO

Gunnar Libiseller, Thomas Riebenbauer
JOANNEUM RESEARCH Forschungsgesellschaft m.b.H., Graz, Austria

2016-10-17

Introduction

This document describes how to use the R package ‘IPO’ to optimize ‘xcms’ parameters and provides code examples for doing so. In addition to ‘IPO’, the R packages ‘xcms’ and ‘rsm’ are required. The R packages ‘msdata’ and ‘mtbls2’ are recommended. The optimization process is as follows:


IPO optimization process

Installation

# try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("IPO")

Installing main suggested packages

# for examples of peak picking parameter optimization:
biocLite("msdata")
# for examples of optimization of retention time correction and grouping
# parameters:
biocLite("faahKO")

Raw data

‘xcms’ handles the file processing, so any file that ‘xcms’ can process can be used.

datapath <- system.file("cdf", package = "faahKO")
datafiles <- list.files(datapath, recursive = TRUE, full.names=TRUE)

Optimize peak picking parameters

To optimize parameters, different values (levels) have to be tested for each of them. To test many levels efficiently, design of experiments (DoE) is used. Box-Behnken and central composite designs set three evenly spaced levels for each parameter. The method ‘getDefaultXcmsSetStartingParams’ provides default values for the lower and upper levels, which define a range. Since the levels are evenly spaced, the middle level (center point) is calculated automatically. To edit the starting levels of a parameter, set the lower and upper level as desired. If a parameter should not be optimized, set it to a single value, which is then used for ‘xcms’ processing; do not set the parameter to NULL.
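The relation between the lower level, the upper level and the center point can be illustrated with a few lines of base R. This is only a sketch of the arithmetic; the helper `is_optimized` is a hypothetical name introduced here and is not part of ‘IPO’.

```r
# lower and upper level of a parameter that should be optimized
step_levels <- c(0.2, 0.3)

# the center point lies halfway between the two levels
center_point <- mean(step_levels)

# the three evenly spaced levels tested in the design
all_levels <- c(step_levels[1], center_point, step_levels[2])

# hypothetical helper: two levels mean "optimize", one value means "keep fixed"
is_optimized <- function(levels) length(levels) == 2

is_optimized(step_levels)  # two levels given, so the parameter is optimized
is_optimized(2)            # a single value, so the parameter stays fixed
```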

The method ‘getDefaultXcmsSetStartingParams’ creates a list with default values for the optimization of the peak picking methods ‘centWave’ or ‘matchedFilter’. To choose between these two methods, pass the method name as the argument.

The method ‘optimizeXcmsSet’ has the following parameters:

- files: the raw data used as the basis for optimization. This does not necessarily need to be the whole dataset; the quality control samples alone should suffice.
- params: a list consisting of items named according to the parameters of the ‘xcms’ peak picking methods. A default list is created by ‘getDefaultXcmsSetStartingParams’.
- nSlaves: the number of experiments of a DoE processed in parallel.
- subdir: a directory where the response surface models are stored. Can be NULL if no response surface models should be saved.
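Since the quality control samples can be enough for the optimization, it can help to select them before calling ‘optimizeXcmsSet’. The sketch below assumes that quality control files carry "QC" in their file name; both this naming convention and the file names are hypothetical.

```r
# hypothetical file names; replace them with the paths of your own raw data
all_files <- c("batch1/QC_01.mzML",
               "batch1/sample_01.mzML",
               "batch1/QC_02.mzML",
               "batch1/sample_02.mzML")

# keep only files whose name contains "QC" (assumed naming convention)
qc_files <- all_files[grepl("QC", basename(all_files), fixed = TRUE)]
qc_files
```

The resulting vector could then be passed as the ‘files’ argument.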

The optimization process starts at the specified levels. Once the DoE has been calculated, the result is evaluated and the levels are adjusted automatically. A new DoE is then generated and processed. This continues until an optimum is found, that is, until a new DoE gives no further increase in the response.
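The iterative idea can be sketched for a single parameter in base R. This is a toy illustration under strong assumptions and not the implementation used by ‘IPO’, which works on several parameters at once with designs from ‘rsm’. The response function, the function name `optimize_levels` and the rule for moving the range are all made up for this example.

```r
# toy response with a single maximum at x = 47; a stand-in for the score
# a real peak picking run would return (hypothetical)
toy_response <- function(x) 100 - (x - 47)^2

# hypothetical one-parameter version of the "test, model, move, repeat" loop
optimize_levels <- function(response_fun, lower, upper, max_iter = 50) {
  best_value <- -Inf
  best_x <- NA_real_
  for (i in seq_len(max_iter)) {
    # three evenly spaced levels: lower, center point, upper
    x <- seq(lower, upper, length.out = 3)
    y <- response_fun(x)
    # quadratic model through the tested levels
    fit <- lm(y ~ x + I(x^2))
    # look for the modelled maximum inside the tested range
    grid <- seq(lower, upper, length.out = 101)
    pred <- predict(fit, newdata = data.frame(x = grid))
    candidate_x <- grid[which.max(pred)]
    candidate_value <- response_fun(candidate_x)
    # stop as soon as the response no longer increases
    if (candidate_value <= best_value) break
    best_value <- candidate_value
    best_x <- candidate_x
    # centre the next range on the candidate, keeping its width
    half_width <- (upper - lower) / 2
    lower <- candidate_x - half_width
    upper <- candidate_x + half_width
  }
  list(x = best_x, value = best_value, iterations = i)
}

res <- optimize_levels(toy_response, lower = 20, upper = 30)
res$x
```

Starting from the range 20 to 30, the range is shifted step by step towards the maximum of the toy response, and the loop ends once a further step brings no improvement.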

The result of peak picking optimization is a list consisting of all calculated DoEs, including the levels used, the design, the response, the response surface model and the best setting. The last list item is itself a list (‘$best_settings’) providing the optimized parameters (‘$parameters’), an xcmsSet object (‘$xset’) calculated with these parameters, and the response this object gives (‘$result’).

library(IPO)
peakpickingParameters <- getDefaultXcmsSetStartingParams('matchedFilter')
#setting levels for step to 0.2 and 0.3 (hence 0.25 is the center point)
peakpickingParameters$step <- c(0.2, 0.3)
peakpickingParameters$fwhm <- c(40, 50)
#setting only one value for steps therefore this parameter is not optimized
peakpickingParameters$steps <- 2

time.xcmsSet <- system.time({ # measuring time
resultPeakpicking <- 
  optimizeXcmsSet(files = datafiles[1:2], 
                  params = peakpickingParameters, 
                  nSlaves = 1, 
                  subdir = "rsmDirectory")
})
#> 
#> starting new DoE with:
#> fwhm: c(40, 50)
#> snthresh: c(3, 17)
#> step: c(0.2, 0.3)
#> steps: 2
#> sigma: 0
#> max: 5
#> mzdiff: 0
#> index: FALSE
#> nSlaves: 1
#> 1
#> 300:659 400:1525 500:2362 600:3329
#> 300:583 400:1342 500:2054 600:2997
#> 2
#> 300:561 400:1298 500:1978 600:2871
#> 300:651 400:1492 500:2325 600:3248
#> 3
#> 300:60 400:225 500:352 600:526
#> 300:82 400:246 500:403 600:659
#> 4
#> 300:89 400:264 500:445 600:700
#> 300:64 400:235 500:368 600:548
#> 5
#> 350:1039 500:2213
#> 350:880 500:1885
#> 6
#> 350:1048 500:2199
#> 350:858 500:1839
#> 7
#> 350:177 500:418
#> 350:147 500:354
#> 8
#> 350:152 500:372
#> 350:190 500:461
#> 9
#> 325:178 450:475 575:840
#> 325:237 450:595 575:1085
#> 10
#> 325:229 450:581 575:1064
#> 325:179 450:475 575:845
#> 11
#> 325:242 450:603 575:1103
#> 325:179 450:475 575:834
#> 12
#> 325:677 450:1558 575:2551
#> 325:807 450:1858 575:2971
#> 13
#> 325:137 450:341 575:642
#> 325:111 450:286 575:499
#> 14
#> 300:138 400:439 500:731 600:1115
#> 300:114 400:359 500:572 600:894
#> 15
#> 350:238 500:583
#> 350:302 500:739
#> 16
#> 325:178 450:475 575:840
#> 325:237 450:595 575:1085
#> 322:699 445:1528 567:2469
#> 322:814 445:1806 567:2859
#> 
#> starting new DoE with:
#> fwhm: c(45, 55)
#> snthresh: c(1, 15)
#> step: c(0.205, 0.285)
#> steps: 2
#> sigma: 0
#> max: 5
#> mzdiff: 0
#> index: FALSE
#> nSlaves: 1
#> 1
#> 302:1295 405:2648 507:4028
#> 302:1369 405:2746 507:4137
#> 2
#> 302:1247 405:2510 507:3834
#> 302:1204 405:2466 507:3705
#> 3
#> 302:110 405:318 507:532
#> 302:83 405:259 507:423
#> 4
#> 302:85 405:269 507:435
#> 302:113 405:330 507:545
#> 5
#> 342:1712 484:3390
#> 342:1721 484:3388
#> 6
#> 342:1576 484:3157
#> 342:1620 484:3248
#> 7
#> 342:194 484:475
#> 342:152 484:382
#> 8
#> 342:199 484:490
#> 342:156 484:397
#> 9
#> 322:223 445:578 567:992
#> 322:298 445:734 567:1260
#> 10
#> 322:224 445:575 567:989
#> 322:287 445:718 567:1237
#> 11
#> 322:220 445:575 567:986
#> 322:301 445:752 567:1281
#> 12
#> 322:1471 445:3013 567:4499
#> 322:1456 445:2921 567:4447
#> 13
#> 322:121 445:335 567:561
#> 322:152 445:405 567:709
#> 14
#> 302:212 405:601 507:976
#> 302:162 405:473 507:756
#> 15
#> 342:368 484:878
#> 342:279 484:682
#> 16
#> 322:298 445:734 567:1260
#> 322:223 445:578 567:992
#> 302:1204 405:2466 507:3705
#> 302:1247 405:2510 507:3834
#> no increase, stopping
#> best parameter settings:
#> fwhm: 50
#> snthresh: 3
#> step: 0.245
#> steps: 2
#> sigma: 21.2332257516562
#> max: 5
#> mzdiff: 0.31
#> index: FALSE
#> nSlaves: 1
resultPeakpicking$best_settings$result
#>    ExpId   #peaks   #NonRP      #RP      PPS 
#>    0.000 3260.000 2272.000  573.000  144.511
optimizedXcmsSetObject <- resultPeakpicking$best_settings$xset
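
As a quick check of the reported response, the values printed above are consistent with the peak picking score (PPS) being the squared number of reliable peaks (#RP) divided by the number of non-reliable peaks (#NonRP). This relation is inferred from the numbers in the output and is not stated in this document, so treat it as an assumption:

```r
# values copied from resultPeakpicking$best_settings$result above
n_non_reliable <- 2272  # "#NonRP"
n_reliable     <- 573   # "#RP"

# assumed relation: PPS = RP^2 / NonRP
pps <- n_reliable^2 / n_non_reliable
round(pps, 3)
```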

The response surface models of all optimization steps of the peak picking parameter optimization are shown below:


Response surface models of DoE 1 of peak picking parameter optimization


Response surface models of DoE 2 of peak picking parameter optimization

Currently the ‘xcms’ peak picking methods ‘centWave’ and ‘matchedFilter’ are supported. The parameter ‘peakwidth’ of ‘centWave’ takes two values defining a minimum and a maximum peak width. These two values need separate optimization and are therefore split into ‘min_peakwidth’ and ‘max_peakwidth’ in ‘getDefaultXcmsSetStartingParams’. The ‘centWave’ parameter ‘prefilter’ also takes two values: use ‘prefilter’ to optimize the first value and ‘prefilter_value’ to optimize the second.
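To make the split of ‘peakwidth’ concrete, the following sketch shows how two separately handled values map back to the two-element ‘peakwidth’ vector that ‘centWave’ expects. The list and its values are made up for illustration, and how ‘IPO’ recombines the values internally is not shown here.

```r
# stand-in for optimized centWave settings; the values are hypothetical
best_centwave <- list(min_peakwidth = 12, max_peakwidth = 40)

# recombine the two separately optimized values into c(min, max)
peakwidth <- c(best_centwave$min_peakwidth, best_centwave$max_peakwidth)
peakwidth
```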

Optimize retention time correction and grouping parameters

Retention time correction and grouping parameters are optimized simultaneously. The method ‘getDefaultRetGroupStartingParams’ provides default optimization levels for the ‘xcms’ retention time correction method ‘obiwarp’ and the grouping method ‘density’. These levels are modified in the same way as for the peak picking parameter optimization.

The method ‘getDefaultRetGroupStartingParams’ only supports one retention time correction method (‘obiwarp’) and one grouping method (‘density’) at the moment.

The method ‘optimizeRetGroup’ has the following parameters:

- xset: an ‘xcmsSet’ object used as the basis for retention time correction and grouping.
- params: a list consisting of items named according to the parameters of the ‘xcms’ retention time correction and grouping methods. A default list is created by ‘getDefaultRetGroupStartingParams’.
- nSlaves: the number of experiments of a DoE processed in parallel.
- subdir: a directory where the response surface models are stored. Can be NULL if no response surface models should be saved.

‘optimizeRetGroup’ returns a list similar to the one returned by the peak picking optimization. The last list item (‘$best_settings’) contains the optimized retention time correction and grouping parameters.

retcorGroupParameters <- getDefaultRetGroupStartingParams()
retcorGroupParameters$profStep <- 1
retcorGroupParameters$gapExtend <- 2.7
time.RetGroup <- system.time({ # measuring time
resultRetcorGroup <-
  optimizeRetGroup(xset = optimizedXcmsSetObject, 
                   params = retcorGroupParameters, 
                   nSlaves = 1, 
                   subdir = "rsmDirectory")
})
#> 
#> starting new DoE with:
#> distFunc: cor_opt
#> gapInit: c(0, 0.4)
#> gapExtend: 2.7
#> profStep: 1
#> plottype: none
#> response: 1
#> factorDiag: 2
#> factorGap: 1
#> localAlignment: 0
#> retcorMethod: obiwarp
#> bw: c(22, 38)
#> minfrac: c(0.3, 0.7)
#> mzwid: c(0.015, 0.035)
#> minsamp: 1
#> max: 50
#> center: 2
#> 
#> starting new DoE with:
#> gapInit: c(0.16, 0.64)
#> bw: c(12.4, 31.6)
#> minfrac: c(0.46, 0.94)
#> mzwid: c(0.023, 0.047)
#> distFunc: cor_opt
#> gapExtend: 2.7
#> profStep: 1
#> plottype: none
#> response: 1
#> factorDiag: 2
#> factorGap: 1
#> localAlignment: 0
#> retcorMethod: obiwarp
#> minsamp: 1
#> max: 50
#> center: 2
#> profStep or minfrac greater 1, decreasing to 0.54 and 1
#> 
#> starting new DoE with:
#> gapInit: c(0.304, 0.784)
#> bw: c(0.879999999999999, 23.92)
#> minfrac: c(0.54, 1)
#> mzwid: c(0.0326, 0.0614)
#> distFunc: cor_opt
#> gapExtend: 2.7
#> profStep: 1
#> plottype: none
#> response: 1
#> factorDiag: 2
#> factorGap: 1
#> localAlignment: 0
#> retcorMethod: obiwarp
#> minsamp: 1
#> max: 50
#> center: 2
#> no increase stopping

The response surface models of all optimization steps for the retention time correction and grouping parameters are shown below:


Response surface models of DoE 1 of retention time correction and grouping parameter optimization


Response surface models of DoE 2 of retention time correction and grouping parameter optimization


Response surface models of DoE 3 of retention time correction and grouping parameter optimization

Currently the ‘xcms’ retention time correction method ‘obiwarp’ and grouping method ‘density’ are supported.

Display optimized settings

The function ‘writeRScript’ generates a script that you can use to process your raw data with the optimized settings.

writeRScript(resultPeakpicking$best_settings$parameters, 
             resultRetcorGroup$best_settings, 
             nSlaves=1)
#> library(xcms)
#> library(Rmpi)
#> xset <- xcmsSet(method="matchedFilter", fwhm=50, snthresh=3, step=0.245, steps=2, sigma=21.2332257516562, max=5, mzdiff=0.31, index=FALSE, nSlaves=1)
#> xset <- retcor(xset, method="obiwarp",
#>                   plottype="none", distFunc="cor_opt", profStep=1, center=2, response=1, gapInit=0.544, gapExtend=2.7, factorDiag=2, factorGap=1, localAlignment=0)
#> xset <- group(xset, method="density", 
#>                 bw=12.4, mzwid=0.047, minfrac=0.94, minsamp=1, max=50)
#> xset <- fillPeaks(xset, nSlaves=1)

Running times and session info

The calculations above had the following running times.

time.xcmsSet # time for optimizing peak picking parameters
#>    user  system elapsed 
#>   6.572   0.928 139.100
time.RetGroup # time for optimizing retention time correction and grouping parameters
#>    user  system elapsed 
#> 292.916   1.764 298.392

sessionInfo()
#> R version 3.3.1 (2016-06-21)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.1 LTS
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] IPO_1.0.0           CAMERA_1.30.0       rsm_2.8            
#>  [4] faahKO_1.13.0       xcms_1.50.0         Biobase_2.34.0     
#>  [7] ProtGenerics_1.6.0  BiocGenerics_0.20.0 mzR_2.8.0          
#> [10] Rcpp_0.12.7        
#> 
#> loaded via a namespace (and not attached):
#>  [1] zoo_1.7-13             splines_3.3.1          lattice_0.20-34       
#>  [4] colorspace_1.2-7       htmltools_0.3.5        stats4_3.3.1          
#>  [7] yaml_2.1.13            chron_2.3-47           survival_2.39-5       
#> [10] RBGL_1.50.0            foreign_0.8-67         BiocParallel_1.8.0    
#> [13] RColorBrewer_1.1-2     multcomp_1.4-6         plyr_1.8.4            
#> [16] stringr_1.1.0          lsmeans_2.23-5         munsell_0.4.3         
#> [19] gtable_0.2.0           mvtnorm_1.0-5          codetools_0.2-15      
#> [22] coda_0.18-1            evaluate_0.10          latticeExtra_0.6-28   
#> [25] knitr_1.14             MassSpecWavelet_1.40.0 TH.data_1.0-7         
#> [28] acepack_1.3-3.3        xtable_1.8-2           scales_0.4.0          
#> [31] formatR_1.4            S4Vectors_0.12.0       Hmisc_3.17-4          
#> [34] graph_1.52.0           gridExtra_2.2.1        RANN_2.5              
#> [37] ggplot2_2.1.0          digest_0.6.10          stringi_1.1.2         
#> [40] grid_3.3.1             tools_3.3.1            sandwich_2.3-4        
#> [43] magrittr_1.5           tibble_1.2             Formula_1.2-1         
#> [46] cluster_2.0.5          MASS_7.3-45            Matrix_1.2-7.1        
#> [49] data.table_1.9.6       estimability_1.1-1     assertthat_0.1        
#> [52] rmarkdown_1.1          rpart_4.1-10           igraph_1.0.1          
#> [55] multtest_2.30.0        nnet_7.3-12            nlme_3.1-128