MSstatsQC: longitudinal system suitability monitoring and quality control for proteomic experiments

Eralp DOGU eralp.dogu@gmail.com

Sara TAHERI srtaheri66@gmail.com

Olga VITEK o.vitek@neu.edu

2020-10-27

Introduction

Liquid chromatography coupled with mass spectrometry (LC-MS) is a powerful tool for detection and quantification of peptides in complex matrices. An important objective of targeted LC-MS is to obtain peptide quantifications that are (1) suitable for the purpose of the investigation, and (2) reproducible across laboratories and runs. The first objective is achieved by system suitability tests (SST), which verify that mass spectrometric instrumentation performs as specified. The second objective is achieved by quality control (QC), which provides in-process quality assurance of the sample profile. A common aspect of SST and QC is the longitudinal nature of their data. Although SST and QC receive a lot of attention in the proteomic community, the currently used statistical methods are fairly limited.

MSstatsQC improves upon the existing statistical methodology for SST and QC. It translates the modern methods of longitudinal statistical process control, such as simultaneous and time-weighted control charts and change point analysis, to the context of LC-MS experiments. The methods are implemented in an open-source R-based software package (www.msstats.org/msstatsqc), and are available for use stand-alone or for integration with automated pipelines. Example datasets include an SRM-based system suitability dataset from CPTAC Study 9.1 at Site 54, DDA- and SRM-based QC datasets from the QCloud system, and a DIA dataset of iRT peptides from the QuiC system. Although the first example here focuses on targeted proteomics, the statistical methods apply more generally. Version 2.0 includes data processing extensions for missing value handling. DDA and DIA examples can be found at the end of this document.

This vignette summarizes the functionality of the MSstatsQC package.

Installation

To install this package, start R and enter:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("MSstatsQC")

Input

To analyze QC/SST data in MSstatsQC, the input data must be a .csv file in a “long” format with the columns described below. This is a common data format that can be generated from spectral processing tools such as Skyline and Panorama AutoQC.

The recommended format includes Acquired Time, Peptide name, Annotations and data for any QC metrics such as Retention Time, Total Peak Area and Mass Accuracy. Each input file should include Acquired Time, Peptide name and Annotations. Any metric of interest can be added after the Annotations column under a descriptive column name.

  1. AcquiredTime: This column shows the acquired time of the QC/SST sample in the format of MM/DD/YYYY HH:MM:SS AM/PM.

  2. Precursor: This column shows information about Precursor id. Statistical analysis will be conducted for each unique label in this column.

  3. Annotations: Annotations are free-text information given by the analyst about each run. They can be information related to any special cause or any observations related to a particular run. Annotations are displayed interactively in the plots produced by MSstatsQC.

  4. RetentionTime, TotalPeakArea, FWHM, MassAccuracy, PeakAssymetry and other metrics: Each of these columns holds one feature of a peak for a specific peptide.

The example dataset is shown below. Each row corresponds to a single time point. Additionally, other inputs such as predefined limits or guide sets are discussed in further steps.
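As a rough illustration of the long format described above, a toy table can be built and written to a .csv file in base R. All values are invented for illustration, and only the column names follow the description in this section.

```r
# Toy QC table in the long format: one row per precursor per run.
# All values below are invented for illustration only.
qc <- data.frame(
  AcquiredTime = rep(c("10/27/2020 09:15:00 AM", "10/27/2020 11:40:00 AM"),
                     each = 2),
  Precursor = rep(c("LVNELTEFAK", "TAAYVNAIEK"), times = 2),
  Annotations = c("", "", "column changed", "column changed"),
  RetentionTime = c(28.4, 19.7, 28.9, 20.1),
  TotalPeakArea = c(1.52e7, 8.3e6, 1.47e7, 8.1e6),
  stringsAsFactors = FALSE
)

# Write the table to a temporary .csv file and read it back.
csv_path <- tempfile(fileext = ".csv")
write.csv(qc, csv_path, row.names = FALSE)
qc_in <- read.csv(csv_path, stringsAsFactors = FALSE)

# The metric columns come after the three mandatory columns.
names(qc_in)
```

The metric columns (here RetentionTime and TotalPeakArea) sit to the right of Annotations, which is the layout the section asks for.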

Example

MSnbaseToMSstatsQC functions

The MSnbaseToMSstatsQC function converts MSnbase output to the MSstatsQC format using QCmetrics objects.

Arguments

Example

Data processing

Data are processed with the DataProcess() function, which checks the data for consistency so that the core and summary MSstatsQC functions can be used efficiently. MSstatsQC uses a data validation method in which slight variations in column names are compensated for and converted to the standard MSstatsQC format. For example, the data validation function converts column names like Best.RT, best retention time, retention time, rt and best ret into BestRetentionTime. The conversion also ignores differences in letter case.
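The kind of column name matching described above can be sketched in base R. The helper `normalize_rt_name` is hypothetical and is not the package's code. It only shows one way that the listed variants could be mapped onto BestRetentionTime.

```r
# Hypothetical helper (not part of MSstatsQC): map loosely typed
# retention-time column names onto the standard name.
normalize_rt_name <- function(col_names) {
  # Lower-case and keep letters only, so that "Best.RT", "best rt"
  # and "BEST_RT" all reduce to the same key.
  key <- gsub("[^a-z]", "", tolower(col_names))
  rt_variants <- c("bestrt", "bestretentiontime", "retentiontime",
                   "rt", "bestret")
  col_names[key %in% rt_variants] <- "BestRetentionTime"
  col_names
}

normalize_rt_name(c("AcquiredTime", "Best.RT", "best retention time",
                    "RT", "TotalArea"))
```

Columns that do not match any variant are returned unchanged.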

Arguments

Example

## [1] "Your data is ready to go!"

MSstatsQC core functions: control charts

The function XmRChart() is used to generate individual (X) and moving range (mR) control charts, and the function CUSUMChart() is used to construct cumulative sum for mean (CUSUMm) and cumulative sum for variability (CUSUMv) control charts for each metric. As a follow-up, the change point estimation procedure ChangePointEstimator() can be used.

Metrics (e.g. retention time and peak area) and peptides are chosen within all core functions with the ‘metric’ and ‘peptide’ arguments. MSstatsQC can handle any metric of interest. To import a metric into MSstatsQC successfully, the user needs to place its data column after the Annotations column.

Predefined limits are commonly used in system suitability monitoring and quality control studies. If the mean and variability of a metric are well known, they can be defined using the ‘selectMean’ and ‘selectSD’ arguments in the core plot functions (e.g. the XmRChart() function). For example, if the mean of retention time is 28.5 minutes, the standard deviation is 1 minute and an X chart is used for peptide LVNELTEFAK, we use the XmRChart() function as follows.

The true values of the mean and variability of a metric are typically unknown, and their estimates are obtained from a guide set of high-quality runs. Generally, a data gathering and parameter estimation step is required. Within that phase, control limits are obtained to test the hypothesis of statistical control. These thresholds are selected to ensure a specified type I error probability (e.g. 0.0027). Control charts are constructed and runs are evaluated in real time only after this phase is complete. Guide sets are defined with the ‘L’ and ‘U’ arguments. For example, if the retention time of a peptide is monitored and the first 20 observations of the dataset are used as a guide set, a plot is constructed as follows.
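What predefined limits imply for an X chart can be illustrated in base R. This sketch assumes the conventional three-sigma rule, which may differ from what the package computes internally, and the retention times are invented.

```r
# Three-sigma X chart limits when the mean and SD are treated as known.
known_mean <- 28.5  # minutes, as in the example in the text
known_sd <- 1       # minutes

x_limits <- c(lower = known_mean - 3 * known_sd,
              center = known_mean,
              upper = known_mean + 3 * known_sd)

# Invented retention times for six runs; flag runs outside the limits.
rt <- c(28.1, 28.9, 29.4, 27.8, 32.0, 28.6)
out_of_control <- which(rt < x_limits[["lower"]] | rt > x_limits[["upper"]])
out_of_control
```

With these toy values only the fifth run falls outside the limits of 25.5 and 31.5 minutes.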
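The guide set idea can be sketched in base R with the textbook constants for individuals and moving range charts (2.66 and 3.267 for moving ranges of two observations). The simulated retention times are invented, and the package may standardize the data or compute its limits differently, so this is an illustration rather than the package's method.

```r
set.seed(1)
# Simulated retention times: 20 in-control runs followed by a shift.
rt <- c(rnorm(20, mean = 28.5, sd = 0.3), rnorm(10, mean = 29.6, sd = 0.3))

# Guide set: observations L to U are assumed to be of high quality.
L <- 1
U <- 20
guide <- rt[L:U]

# Estimate the center line and the average moving range from the guide set.
center <- mean(guide)
mr_bar <- mean(abs(diff(guide)))

# Textbook limits for the X chart and the mR chart.
x_lcl <- center - 2.66 * mr_bar
x_ucl <- center + 2.66 * mr_bar
mr_ucl <- 3.267 * mr_bar

# Evaluate all runs against the limits obtained from the guide set.
mr_all <- c(NA, abs(diff(rt)))
x_signal <- which(rt < x_lcl | rt > x_ucl)
mr_signal <- which(mr_all > mr_ucl)

# Three-sigma limits correspond to a two-sided type I error of about 0.0027
# for normally distributed data.
alpha <- 2 * pnorm(-3)
```

The last line shows where the value 0.0027 mentioned above comes from under a normality assumption.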

MSstatsQC core functions: XmRChart()

Arguments

Example

MSstatsQC core functions: CUSUMChart()

Arguments

Example
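The CUSUMm and CUSUMv statistics mentioned earlier can be sketched in base R for standardized observations. The reference value k = 0.5 and the decision interval h = 5 are conventional defaults, and the variability chart uses the Hawkins transform. Whether the package uses exactly these settings is an assumption, and the helper `cusum` is hypothetical.

```r
# Hypothetical helper: two-sided tabular CUSUM on standardized values z.
cusum <- function(z, k = 0.5) {
  upper <- lower <- numeric(length(z))
  prev_up <- prev_lo <- 0
  for (i in seq_along(z)) {
    prev_up <- max(0, z[i] - k + prev_up)   # accumulates upward shifts
    prev_lo <- max(0, -z[i] - k + prev_lo)  # accumulates downward shifts
    upper[i] <- prev_up
    lower[i] <- prev_lo
  }
  data.frame(upper = upper, lower = lower)
}

# Toy standardized values: in control for three runs, then shifted up.
z <- c(0, 0, 0, 2, 2, 2, 2, 2)
h <- 5  # decision interval

# CUSUM for the mean.
cusum_m <- cusum(z)
first_signal_mean <- which(cusum_m$upper > h)[1]

# CUSUM for variability: apply the same recursion to the Hawkins transform.
v <- (sqrt(abs(z)) - 0.822) / 0.349
cusum_v <- cusum(v)
```

In this toy series the upper CUSUMm statistic grows by 1.5 per run after the shift and first exceeds h at the seventh run.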

MSstatsQC core functions: ChangePointEstimator()

Follow-up change point analysis helps to identify the time of a change for each peptide and metric. The ChangePointEstimator() function is used for this analysis. It is one of the core functions and uses the same arguments. We recommend using this function after a control chart signals an out-of-control observation. For example, the retention time of TAAYVNAIEK increases over time, as the CUSUMm statistic increases steadily after the 20th time point. The user can follow up with the ChangePointEstimator() function to find the exact time of the retention time drift.
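One common likelihood-based estimator for the time of a step change in the mean can be sketched in base R. It assumes standardized data with an in-control mean of zero. Whether ChangePointEstimator() uses exactly this form is an assumption, and the helper name is hypothetical.

```r
# Hypothetical helper: estimate the last in-control time point t, assuming
# z has mean 0 up to t and a shifted mean afterwards.
estimate_change_point <- function(z) {
  n <- length(z)
  candidates <- 0:(n - 1)
  stat <- vapply(candidates, function(t) {
    after <- z[(t + 1):n]
    # Likelihood-based statistic for a mean shift starting after time t.
    length(after) * mean(after)^2
  }, numeric(1))
  candidates[which.max(stat)]
}

# Toy standardized series: 20 in-control runs, then a shift of 2 SDs.
z <- c(rep(0, 20), rep(2, 10))
estimate_change_point(z)
```

For this noise-free toy series the statistic is largest at t = 20, so the change is placed right after the 20th run.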

Arguments

Example

We don’t recommend using this function when all the observations are within the control limits. In the case of retention time monitoring of LVNELTEFAK, there is no need for further change point analysis.

The time of a variability change can be analyzed with the same function. For example, YSTDVSVDEVK experiences a drift in the mean of retention time while the variability of retention time increases simultaneously. In this case, ChangePointEstimator() can be used to identify the exact times of both changes.

Example

MSstatsQC summary functions: river and radar plots

RiverPlot() and RadarPlot() are the summary functions of MSstatsQC. They aggregate results over all analytes for X and mR charts or for CUSUMm and CUSUMv charts. The method argument selects the control chart method whose results are aggregated across peptides. For example, if the user would like to aggregate the information gathered from the X charts of retention time for all analytes, the upper panel of RiverPlot() shows the increases and decreases in retention time. Next, RadarPlot() is used to find out which peptides are affected by the problem.

If the means and standard deviations are known, the summary functions use the listMean and listSD arguments. For example, if the user monitors retention time and peak asymmetry and the means and standard deviations of these metrics are known, the arguments require a vector of means and another vector of standard deviations.
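The aggregation idea behind these summaries can be illustrated in base R. The sketch assumes standardized values with three-sigma limits and invented data. It shows the per-run proportion of peptides that signal an increase or a decrease and the peptides that are affected, not the package's actual computation.

```r
# Rows are peptides, columns are runs; standardized values are invented.
z <- rbind(
  pepA = c(0.2, -0.5, 3.4, 3.8),
  pepB = c(-0.1, 0.4, 3.1, 0.3),
  pepC = c(0.6, -3.5, 0.2, 3.2),
  pepD = c(0.0, 0.1, -0.4, 0.5)
)

# River-style summary: proportion of peptides out of control per run,
# split into increases and decreases.
prop_increase <- colMeans(z > 3)
prop_decrease <- colMeans(z < -3)

# Radar-style summary: which peptides signalled at least once.
affected <- rownames(z)[rowSums(abs(z) > 3) > 0]
```

Here half of the peptides signal an increase in the last two runs, and pepD is the only peptide that never signals.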

Example

MSstatsQC summary functions: decision map

DecisionMap() is another summary function in MSstatsQC. It compares results aggregated over all analytes for a given method, such as XmR charts, with user-defined criteria. The user first defines the performance criteria and then runs DecisionMap() to visualize overall performance. This function uses all the arguments of the summary plots listed previously. Additionally, the following arguments are used.

Arguments

Example
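The comparison of aggregated results with user-defined criteria can be sketched in base R. The helper `classify_runs`, its threshold arguments and the three status labels are hypothetical and chosen for illustration. They are not the arguments of DecisionMap().

```r
# Hypothetical helper: label each run by the proportion of peptides that
# are out of control, using two user-defined thresholds.
classify_runs <- function(prop_out, yellow_threshold, red_threshold) {
  stopifnot(yellow_threshold <= red_threshold)
  status <- rep("pass", length(prop_out))
  status[prop_out > yellow_threshold] <- "warning"
  status[prop_out > red_threshold] <- "fail"
  status
}

# Invented proportions of out-of-control peptides for four runs.
prop_out <- c(0.00, 0.05, 0.20, 0.60)
run_status <- classify_runs(prop_out, yellow_threshold = 0.1,
                            red_threshold = 0.5)
run_status
```

A run is labelled by the strictest threshold it exceeds, so the last run is a "fail" rather than a "warning".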

Use case: longitudinal profiling of DDA with missing values

We analyzed the QCloudDDA dataset. The dataset had many missing values and MSstatsQC 2.0 processed these missing values and generated control charts and summary plots for longitudinal performance assessment.
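The idea of tracking missing values as a metric of its own can be illustrated in base R. The sketch assumes that the count of interest is the number of missing metric values per row. The data and the column layout are invented, and the package's own definition may differ.

```r
# Toy data for one peptide across three runs; NA marks a missing value.
qc <- data.frame(
  Precursor = c("EACFAVEGPK", "EACFAVEGPK", "EACFAVEGPK"),
  BestRetentionTime = c(21.3, NA, 21.5),
  TotalArea = c(4.1e6, NA, NA)
)

# Count the missing metric values in each row.
metric_cols <- setdiff(names(qc), "Precursor")
qc$missing <- unname(rowSums(is.na(qc[, metric_cols, drop = FALSE])))

# Keep only the complete rows for the downstream control charts.
complete <- qc[qc$missing == 0, ]
```

The `missing` column can then be monitored like any other metric, while `complete` holds the rows that remain after removal.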

mydata<-DataProcess(MSstatsQC::QCloudDDA)
## [1] "Your data is ready to go!"
#Creating a missing data map
MissingDataMap(mydata)

#Creating an X chart for missing counts
XmRChart(mydata, "EACFAVEGPK", metric = "missing", type="mean", L = 1, U = 15)
#Removing missing values and analyzing the data
mydata<-RemoveMissing(mydata)
RiverPlot(mydata[,-9], L=1, U=15, method="XmR")
RadarPlot(mydata[,-9], L=1, U=15, method="XmR")

Use case: longitudinal profiling of QC with iRT peptides

We analyzed the QuiCDIA dataset which included longitudinal profiles of 11 iRT peptides. The data comprised two DIA experiments acquired in duplicate with 11 days in between the two measurement sequences.

#Checking missing values and analyzing the data
MissingDataMap(MSstatsQC::QuiCDIA)
## [1] "No missing values!"
RiverPlot(data = QuiCDIA, L = 1, U = 20, method = "XmR")
RadarPlot(data = QuiCDIA, L = 1, U = 20, method = "XmR")

Use case: longitudinal profiling of QC from an SRM experiment

Following the implementation in Dogu et al. (2017), we analyzed the QCloudSRM dataset. This dataset was previously evaluated by the experts as well-performing for all peptides and metrics, except for the outlying peptide TCVADESHAGCEK.

#Checking missing values and analyzing the data
MissingDataMap(MSstatsQC::QCloudSRM)
## [1] "No missing values!"
RiverPlot(data = QCloudSRM, L = 1, U = 20, method = "CUSUM")
RadarPlot(data = QCloudSRM, L = 1, U = 20, method = "CUSUM")

Output

Plots created by the core plot functions are generated by plotly, an R package for interactive plot generation. These interactive plots can be saved as an html file using the saveWidget function. If the user wants to save a static png file, the export function can be used. The outputs of the other MSstatsQC functions are generated by the ggplot2 package, and saving them requires the ggsave function.

Example
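A minimal sketch of both saving routes, using toy figures rather than MSstatsQC output, could look as follows. It assumes that the plotly, htmlwidgets and ggplot2 packages are installed. The helper `save_example_outputs` and the file names are hypothetical.

```r
# Hypothetical helper: save one interactive and one static toy figure.
save_example_outputs <- function(out_dir = tempdir()) {
  d <- data.frame(run = 1:5, rt = c(28.4, 28.6, 28.5, 28.9, 28.7))

  # Interactive plotly figure saved as an html file.
  p_interactive <- plotly::plot_ly(d, x = ~run, y = ~rt,
                                   type = "scatter", mode = "lines+markers")
  html_path <- file.path(out_dir, "x_chart.html")
  # selfcontained = FALSE avoids the pandoc dependency.
  htmlwidgets::saveWidget(p_interactive, html_path, selfcontained = FALSE)

  # Static ggplot2 figure saved as a png file.
  p_static <- ggplot2::ggplot(d, ggplot2::aes(x = run, y = rt)) +
    ggplot2::geom_line() +
    ggplot2::geom_point()
  png_path <- file.path(out_dir, "summary.png")
  ggplot2::ggsave(png_path, plot = p_static, width = 6, height = 4)

  c(html = html_path, png = png_path)
}

# Run only when the required packages are available.
needed <- c("plotly", "htmlwidgets", "ggplot2")
have_all <- all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))
if (have_all) {
  paths <- save_example_outputs()
}
```

The same pattern should apply to a plot object returned by a core function (html route) or by a summary function (ggsave route).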

Project website

Please see MSstats.org/MSstatsQC and the GitHub repository for further details about this tool.

Questions and issues

Please use the Google group to file bug reports or feature requests.

Citation

Please cite MSstatsQC:

Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MSstatsQC_2.8.0
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.50.0        httr_1.4.2            tidyr_1.1.2          
##  [4] vsn_3.58.0            splines_4.0.3         viridisLite_0.3.0    
##  [7] jsonlite_1.7.1        foreach_1.5.1         shiny_1.5.0          
## [10] BiocManager_1.30.10   affy_1.68.0           stats4_4.0.3         
## [13] pander_0.6.3          yaml_2.2.1            impute_1.64.0        
## [16] pillar_1.4.6          lattice_0.20-41       glue_1.4.2           
## [19] limma_3.46.0          digest_0.6.27         promises_1.1.1       
## [22] colorspace_1.4-1      Matrix_1.2-18         htmltools_0.5.0      
## [25] httpuv_1.5.4          preprocessCore_1.52.0 plyr_1.8.6           
## [28] MALDIquant_1.19.3     XML_3.99-0.5          pkgconfig_2.0.3      
## [31] zlibbioc_1.36.0       purrr_0.3.4           xtable_1.8-4         
## [34] scales_1.1.1          affyio_1.60.0         later_1.1.0.1        
## [37] BiocParallel_1.24.0   tibble_3.0.4          mgcv_1.8-33          
## [40] farver_2.0.3          generics_0.0.2        IRanges_2.24.0       
## [43] ggplot2_3.3.2         ellipsis_0.3.1        BiocGenerics_0.36.0  
## [46] lazyeval_0.2.2        magrittr_1.5          crayon_1.3.4         
## [49] mime_0.9              evaluate_0.14         ncdf4_1.17           
## [52] nlme_3.1-150          doParallel_1.0.16     MASS_7.3-53          
## [55] mzR_2.24.0            tools_4.0.3           data.table_1.13.2    
## [58] lifecycle_0.2.0       stringr_1.4.0         plotly_4.9.2.1       
## [61] MSnbase_2.16.0        S4Vectors_0.28.0      munsell_0.5.0        
## [64] pcaMethods_1.82.0     compiler_4.0.3        mzID_1.28.0          
## [67] qcmetrics_1.28.0      rlang_0.4.8           grid_4.0.3           
## [70] iterators_1.0.13      htmlwidgets_1.5.2     crosstalk_1.1.0.1    
## [73] miniUI_0.1.1.1        labeling_0.4.2        rmarkdown_2.5        
## [76] gtable_0.3.0          codetools_0.2-16      R6_2.4.1             
## [79] Nozzle.R1_1.1-1       knitr_1.30            dplyr_1.0.2          
## [82] fastmap_1.0.1         ggExtra_0.9           ProtGenerics_1.22.0  
## [85] stringi_1.5.3         parallel_4.0.3        Rcpp_1.0.5           
## [88] vctrs_0.3.4           tidyselect_1.1.0      xfun_0.18