Quality control (QC) is considered an essential step in metabolomics platforms for achieving high reproducibility and accuracy of data. The repeated use of the same QC samples is increasingly accepted for correcting signal drift over the MS run order, and is especially beneficial for improving data quality in multi-block experiments of large-scale metabolomic studies. statTarget is an easy-to-use tool that provides a graphical user interface for quality control-based signal shift correction, integration of metabolomic data from multi-batch experiments, and comprehensive statistical analysis in non-targeted or targeted metabolomics. This document is intended to guide the user through using statTargetGUI to perform metabolomic data analysis. Note that this document does not describe the inner workings of the statTarget algorithm.
Requires R (>= 3.3.0).
Install the package with biocLite():
source("https://bioconductor.org/biocLite.R")
#> Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help
biocLite("statTarget")
#> BioC_mirror: https://bioconductor.org
#> Using Bioconductor 3.6 (BiocInstaller 1.28.0), R 3.4.2 (2017-09-28).
#> Installing package(s) 'statTarget'
#> Old packages: 'ABAEnrichment', 'Cardinal', 'MSstats', 'Organism.dplyr',
#> 'RMassBank', 'RnBeads', 'VariantFiltering'
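On Bioconductor 3.8 and later, biocLite() has been replaced by the BiocManager package. A minimal sketch of the equivalent installation, assuming a recent R release and internet access:

```r
# Install BiocManager from CRAN if it is not already available,
# then use it to install statTarget from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("statTarget")
```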
On macOS, statTargetGUI requires X11 support (XQuartz). Download it from https://www.xquartz.org.
statTarget is an easy-to-use tool that provides a graphical user interface (Figure 1) for quality control-based signal correction, integration of metabolomic data from multiple batches, and comprehensive statistical analysis for non-targeted and targeted approaches. (URL: https://github.com/13479776/statTarget)
The main GUI of statTarget has two basic sections. The first section is Shift Correction. It includes quality control-based robust LOESS signal correction (QC-RLSC), a widely accepted method for QC-based signal correction and integration of metabolomic data from multiple analytical batches (Dunn WB., et al. 2011; Luan H., et al. 2015). The second section is Statistical Analysis. It provides comprehensive computational and statistical methods that are commonly applied to analyze metabolomics data, and offers multiple results for biomarker discovery.
Section 1 - Shift Correction provides the QC-RLSC algorithm, which fits the QC data; each metabolite in the real samples is then normalized to the QC samples. To avoid overfitting the observed data, LOESS-based generalised cross-validation (GCV) is applied automatically when QCspan is set to 0.
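The idea behind QC-based LOESS correction can be sketched for a single peak with base R. This is an illustration on simulated data, not statTarget's internal code: the variable names, the simulated drift, and the choice to rescale to the median QC intensity are all assumptions made for the example.

```r
# Illustrative QC-based LOESS drift correction for ONE peak (base R only).
# All object names are hypothetical; this is not statTarget's implementation.
set.seed(1)
order_idx <- 1:60                          # injection order
is_qc     <- order_idx %% 6 == 1           # assume every 6th injection is a QC
drift     <- 1 + 0.01 * order_idx          # simulated linear signal drift
intensity <- 1000 * drift * exp(rnorm(60, sd = 0.03))

# Fit the drift curve on the QC injections only.
qc  <- data.frame(ord = order_idx[is_qc], y = intensity[is_qc])
fit <- loess(y ~ ord, data = qc, span = 0.75, degree = 2,
             control = loess.control(surface = "direct"))  # allows extrapolation

# Predict the QC trend at every injection and divide it out,
# rescaling to the median QC intensity (an assumption of this sketch).
trend     <- predict(fit, newdata = data.frame(ord = order_idx))
corrected <- intensity / trend * median(qc$y)

# Relative standard deviation (%) of the QC injections before and after.
rsd <- function(x) 100 * sd(x) / mean(x)
rsd_before <- rsd(intensity[is_qc])
rsd_after  <- rsd(corrected[is_qc])
```

Because the curve is fitted to the same QC injections used to evaluate it, the RSD after correction is optimistic; this is the overfitting risk that the GCV-based span selection is meant to limit.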
Section 2 - Statistical Analysis provides features including data preprocessing, data descriptions, multivariate statistical analysis, univariate analysis and biomarker analysis.
Data preprocessing: 80-percent rule, glog transformation, KNN imputation, median imputation and minimum value imputation.
Data descriptions: mean value, median value, sum, quartile, standard deviation, etc.
Multivariate statistical analysis: PCA, PLS-DA, VIP, random forest.
Univariate analysis: Welch's t-test, Shapiro-Wilk normality test and Mann-Whitney test.
Biomarker analysis: ROC, odds ratio.
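The univariate tests listed above are all available in base R. The sketch below uses simulated data and assumes one plausible way of combining them (Welch's t-test when both groups pass the Shapiro-Wilk test, Mann-Whitney otherwise); the selection rule and the 0.05 threshold are assumptions for illustration, not a description of how statTarget integrates the tests.

```r
# Simulated intensities of one metabolite in two groups (hypothetical data).
set.seed(7)
g1 <- rnorm(20, mean = 10, sd = 1)
g2 <- rnorm(20, mean = 12, sd = 1)

# Shapiro-Wilk normality test on each group.
normal <- shapiro.test(g1)$p.value > 0.05 && shapiro.test(g2)$p.value > 0.05

# t.test() performs Welch's test by default (var.equal = FALSE);
# wilcox.test() with two samples is the Mann-Whitney test.
res <- if (normal) t.test(g1, g2) else wilcox.test(g1, g2)
p_value <- res$p.value
```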
Pheno File
Meta information includes the sample name, class, batch and order. Do not change the column names. (a) Class: QC samples should be labeled as NA. (b) Order: the injection sequence. (c) Batch: the analysis blocks or batches, numbered ordinally, e.g., 1, 2, 3, …. (d) Sample name: must be consistent between the Pheno file and the Profile file. (See the example data.)
Profile File
Expression data include the sample name and expression values. (See the example data.)
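Before running the correction it can help to check that the two files agree on sample names. The following is a minimal sketch with a made-up miniature data set; the column names (sample, batch, class, order) and the sample labels are assumptions for illustration, so consult the example data for the exact layout.

```r
# Hypothetical miniature Pheno table; column names are assumptions.
pheno <- data.frame(
  sample = c("QC1", "A1", "A2", "QC2", "B1"),
  batch  = c(1, 1, 1, 1, 1),
  class  = c(NA, 1, 1, NA, 2),      # QC injections carry class NA
  order  = 1:5,
  stringsAsFactors = FALSE
)

# Hypothetical sample names found in the Profile file.
profile_samples <- c("QC1", "A1", "A2", "QC2", "B1", "B2")

# Samples present in one file but not the other.
missing_in_profile <- setdiff(pheno$sample, profile_samples)
missing_in_pheno   <- setdiff(profile_samples, pheno$sample)

# Number of QC injections, identified by class NA.
n_qc <- sum(is.na(pheno$class))
```

Here missing_in_pheno contains "B2", the kind of mismatch that produces a warning about differing sample counts during data file checking.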
NA.Filter
Removes peaks with more than 80 percent missing values (NA or 0) in each group. (Default: 0.8)
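A group-wise missing-value filter of this kind can be sketched as follows. The helper filter_missing() is hypothetical, and the sketch reads "in each group" as "in every group", i.e. a peak is dropped only when its missing fraction exceeds the cutoff in all groups; that reading is an assumption.

```r
# Sketch of a group-wise missing-value filter (hypothetical helper).
# Rows are samples, columns are peaks; zeros count as missing.
filter_missing <- function(x, group, cutoff = 0.8) {
  miss <- is.na(x) | x == 0
  # Fraction of missing values per peak within each group (peaks x groups).
  frac <- vapply(split(seq_len(nrow(x)), group),
                 function(idx) colMeans(miss[idx, , drop = FALSE]),
                 numeric(ncol(x)))
  frac <- matrix(frac, nrow = ncol(x))
  # Drop a peak only if it exceeds the cutoff in every group.
  drop <- rowSums(frac > cutoff) == ncol(frac)
  x[, !drop, drop = FALSE]
}

x <- matrix(1:30 + 0, nrow = 10,
            dimnames = list(NULL, c("p1", "p2", "p3")))
group <- rep(c("A", "B"), each = 5)
x[1:5, "p2"]  <- NA    # missing in group A only: kept
x[1:5, "p3"]  <- NA    # missing in group A ...
x[6:10, "p3"] <- 0     # ... and zero in group B: dropped

kept <- filter_missing(x, group)
```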
QCspan
The smoothing parameter that controls the bias-variance tradeoff. QCspan values commonly range from 0.2 to 0.75. A span that is too small produces a large variance; a span that is too large produces a large bias. When QCspan is set to 0 (the default), generalised cross-validation is performed to choose a suitable value and avoid overfitting the observed data. (Default: 0)
degree
Specifies local constant regression (i.e., the Nadaraya-Watson estimator, degree=0), local linear regression (degree=1), or local quadratic regression (degree=2). (Default: 2)
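How the span and degree interact with generalised cross-validation can be illustrated with base R's loess(). The helper gcv_loess() below is hypothetical and uses the common definition GCV = n * RSS / (n - tr(H))^2; whether statTarget computes its criterion in exactly this form is not stated here, so treat this as a sketch on simulated data.

```r
# Sketch: choose a LOESS span by generalised cross-validation (GCV).
# gcv_loess() is a hypothetical helper, not a statTarget function.
gcv_loess <- function(x, y, span, degree = 2) {
  fit <- loess(y ~ x, span = span, degree = degree)
  n   <- fit$n
  # GCV = n * RSS / (n - tr(H))^2, where tr(H) is the trace of the
  # smoother matrix as reported by loess() in `trace.hat`.
  n * sum(residuals(fit)^2) / (n - fit$trace.hat)^2
}

set.seed(42)
x <- 1:40
y <- sin(x / 6) + rnorm(40, sd = 0.2)

# Evaluate the commonly used span range and keep the minimiser.
spans  <- seq(0.2, 0.75, by = 0.05)
scores <- vapply(spans, function(s) gcv_loess(x, y, s), numeric(1))
best_span <- spans[which.min(scores)]
```

Small spans lower the residual sum of squares but raise tr(H), so the GCV score penalises fits that chase noise.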
Imputation
The imputation method: nearest neighbor averaging ("KNN"), minimum values for imputed variables ("min"), or group-dependent median values for imputed variables ("median"). (Default: KNN)
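The two simpler options can be sketched in a few lines of base R. The helper impute_simple() is hypothetical and only illustrates the "min" and group-dependent "median" ideas; KNN imputation is omitted because it needs an additional package.

```r
# Sketch of minimum-value and group-median imputation (hypothetical helper).
# Rows are samples, columns are peaks.
impute_simple <- function(x, group, method = c("min", "median")) {
  method <- match.arg(method)
  for (j in seq_len(ncol(x))) {
    na <- is.na(x[, j])
    if (!any(na)) next
    if (method == "min") {
      # Replace with the smallest observed value of that peak.
      x[na, j] <- min(x[, j], na.rm = TRUE)
    } else {
      # Replace with the median of the sample's own group.
      med <- tapply(x[, j], group, median, na.rm = TRUE)
      x[na, j] <- as.numeric(med[as.character(group[na])])
    }
  }
  x
}

x     <- cbind(p1 = c(1, NA, 3, 10, NA, 30))
group <- c("A", "A", "A", "B", "B", "B")

by_min    <- impute_simple(x, group, "min")
by_median <- impute_simple(x, group, "median")
```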
Stat File
Expression data include the sample name, group, and expression values.
NA.Filter
Removes peaks with more than 80 percent missing values (NA or 0) in each group. (Default: 0.8)
Imputation
The imputation method: nearest neighbor averaging ("KNN"), minimum values for imputed variables ("min"), or group-dependent median values for imputed variables ("median"). (Default: KNN)
Glog
Generalised logarithm (glog) transformation for variance stabilization. (Default: TRUE)
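One common form of the generalised logarithm is sketched below. The base-2 form and the offset parameter a (set to 1 here) are assumptions for illustration; the exact parameterisation used by statTarget is not given in this document.

```r
# One common glog form: log2((x + sqrt(x^2 + a^2)) / 2).
# The offset `a` is a hypothetical tuning parameter for this sketch.
glog <- function(x, a = 1) log2((x + sqrt(x^2 + a^2)) / 2)

# Unlike log2(), glog() is defined at zero and behaves like
# log2() for large intensities.
glog_zero  <- glog(0)
glog_large <- glog(1e6)
```

This is why glog is preferred over a plain logarithm for peak tables that contain zeros or very small intensities.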
Scaling Method
Scaling method applied before statistical analysis (PCA or PLS). Pareto specifies Pareto scaling, Auto specifies auto scaling (unit variance scaling), Vast specifies vast scaling, and Range specifies range scaling. (Default: Pareto)
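The four scaling options follow their usual textbook definitions, sketched here column by column. The helper scale_peaks() is hypothetical; it shows the standard formulas rather than statTarget's own code.

```r
# Sketch of common metabolomics scaling methods (hypothetical helper).
# Rows are samples, columns are peaks; every method mean-centres first.
scale_peaks <- function(x, method = c("pareto", "auto", "range", "vast")) {
  method <- match.arg(method)
  mu <- colMeans(x)
  s  <- apply(x, 2, sd)
  centered <- sweep(x, 2, mu)
  switch(method,
    pareto = sweep(centered, 2, sqrt(s), "/"),       # divide by sqrt(sd)
    auto   = sweep(centered, 2, s, "/"),             # divide by sd
    range  = sweep(centered, 2,
                   apply(x, 2, max) - apply(x, 2, min), "/"),
    vast   = sweep(sweep(centered, 2, s, "/"), 2, mu / s, "*")
  )
}

x <- cbind(a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 50))
auto_scaled   <- scale_peaks(x, "auto")
pareto_scaled <- scale_peaks(x, "pareto")
range_scaled  <- scale_peaks(x, "range")
```

Pareto scaling sits between no scaling and auto scaling: it reduces the dominance of large peaks while keeping more of the original data structure than unit variance scaling does.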
M.U.Stat
Multivariate statistical analysis and univariate analysis. (Default: TRUE)
Permutation times
The number of random permutations for the PLS-DA model. (Default: 20)
PCs
The principal components shown on the x-axis and y-axis of the PCA or PLS model plots. (Default: 1 and 2)
nvarRF
The number of variables shown in the Gini plot of the random forest model (<= 100). (Default: 20)
Labels
Shows the sample names in the score plot. (Default: TRUE)
Multiple testing
Applies multiple testing correction via false discovery rate (FDR) estimation with the Benjamini-Hochberg method. The FDR conceptualizes the rate of type I errors in null hypothesis testing when conducting multiple comparisons. (Default: TRUE)
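Benjamini-Hochberg adjustment is available in base R through p.adjust(). A small worked example with made-up p-values shows how the adjustment changes which peaks pass a 0.05 threshold:

```r
# Hypothetical raw p-values for five peaks.
p <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# Benjamini-Hochberg adjusted p-values (FDR).
q <- p.adjust(p, method = "BH")

n_raw_significant <- sum(p < 0.05)   # significant before correction
n_fdr_significant <- sum(q < 0.05)   # significant after correction
```

Four peaks pass the raw threshold, but only two remain significant after the adjustment, because the third and fourth are adjusted to 0.05125.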
Volcano FC
The fold-change cut-off values used to identify up- or down-regulated metabolites in the Volcano plot. (Default: > 2 or < 1.5)
Volcano Pvalue
The significance level for metabolites in the Volcano plot. (Default: 0.05)
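The way a fold-change cut-off and a significance level combine in a volcano plot can be sketched as a simple classification. The helper classify_volcano() is hypothetical, and the cut-off values passed in the call are placeholders chosen for the example, not the package defaults.

```r
# Sketch: label peaks for a volcano plot (hypothetical helper).
# fc is the fold change (ratio of group means), p the p-value.
classify_volcano <- function(fc, p, fc_up, fc_down, p_cut = 0.05) {
  status <- rep("not significant", length(fc))
  status[p < p_cut & fc > fc_up]   <- "up"
  status[p < p_cut & fc < fc_down] <- "down"
  status
}

fc <- c(3.0, 0.3, 1.1, 4.0)
p  <- c(0.01, 0.02, 0.001, 0.30)

# Placeholder cut-offs for illustration only.
status <- classify_volcano(fc, p, fc_up = 2, fc_down = 0.5)
```

A peak must pass both criteria: the third peak is highly significant but barely changes, and the fourth changes strongly but is not significant, so neither is flagged.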
Download the statTarget tutorial and example data.
Once the data files have been analysed, it is time to investigate the results. Further information is available on the GitHub page. (URL: https://github.com/13479776/statTarget)
The output file:
statTarget -- shiftCor
-- After_shiftCor # The corrected results including the loplot using statTarget
-- Before_shiftCor # The raw results using statTarget
-- RSDresult # The RSD analysis
The Figures:
Loplot (left): a visualization of the QC-RLSC correction for each peak.
The RSD distribution (right): the relative standard deviation of peaks in the samples and QCs.
#############################
# Shift Correction function #
#############################
Data File Checking Start..., Time: Thu Jan 5 18:58:09 2017
217 Pheno Samples vs 218 Profile samples
The Pheno samples list (*NA, missing data from the Profile File)
[1] "QC1" "QC2" "QC3" "QC4"
[5] "QC5" "A1" "A2" "A3"
[9] "A4" "A5" "A6" "A7"
[13] "A8" "A9" "A10" "QC6"
[17] "A11" "A12" "A13" "A14"
[21] "A15" "B16" "B17" "B18"
[25] "B19" "B20" "QC7" "B21"
[29] "B22" "B23" "B24" "B25"
[33] "B26" "B27" "B28" "B29"
[37] "B30" "QC8" "C31" "C32"
[41] "C33" "C34" "C35" "QC9"
[45] "QC10" "QC11" "QC12" "QC13"
[49] "C36_120918171155" "C37" "C38" "C39"
[53] "C40" "QC14" "C41" "C42"
[57] "C43" "C44" "C45" "D46"
[61] "D47" "D48" "D49" "D50"
[65] "QC15" "D51" "D52" "D53"
[69] "D54" "D55" "D56" "D57"
[73] "D58" "D59" "D60" "QC16"
[77] "E61" "E62" "E63" "E64"
[81] "E65" "E66" "E67" "E68"
[85] "E69" "E70" "QC17" "E71"
[89] "E72" "E73" "E74" "E75"
[93] "F76" "F77" "F78" "F79"
[97] "F80" "QC18" "F81" "F82"
[101] "F83" "F84" "F85" "F86"
[105] "F87" "F88" "F89" "F90"
[109] "QC19" "QC20" "QC21" "QC22"
[113] "QC23" "QC24" "a1" "a2"
[117] "a3" "a4" "a5" "a6"
[121] "a7" "a8" "a9" "a10"
[125] "QC25" "a11" "a12" "a13"
[129] "a14" "a15" "b16" "b18"
[133] "b19" "b20" "QC26" "b21"
[137] "b22" "b23" "b24" "b25"
[141] "b26" "b27" "b28" "b29"
[145] "b30" "QC27" "c31" "c32"
[149] "c33" "c34" "c35" "QC28"
[153] "QC29" "QC31" "QC32" "c36"
[157] "c37" "c38" "c39" "c40"
[161] "QC33" "c41" "c42" "c43"
[165] "c44" "c45" "d46" "d47"
[169] "d48" "d49" "d50" "QC34"
[173] "d51" "d52" "d53" "d54"
[177] "d55" "d56" "d57" "d58"
[181] "d59" "d60" "QC35" "e61"
[185] "e62" "e63" "e64" "e65"
[189] "e66" "e67" "e68" "e69"
[193] "e70" "QC36" "e71" "e72"
[197] "e73" "e74" "e75" "f76"
[201] "f77" "f78" "f79" "f80"
[205] "QC37" "f81" "f82" "f83"
[209] "f84" "f85" "f86" "f87"
[213] "f88" "f89_120921102721" "f90" "QC38"
[217] "QC39"
Warning: The sample size in Profile File is larger than Pheno File!
Pheno information:
Class No.
1 1 30
2 2 29
3 3 30
4 4 30
5 5 30
6 6 30
7 QC 38
Batch No.
1 1 108
2 2 109
Profile information:
No.
QC and samples 218
Metabolites 1312
statTarget: shiftCor start...Time: Thu Jan 5 18:58:11 2017
Step 1: Evaluation of missing value...
The number of NA value in Data Profile before QC-RLSC: 2280
The number of variables including 80 % of missing value : 3
Step 2: Imputation start...
The number of NA value in Data Profile after the initial imputation: 0
Imputation Finished!
Step 3: QC-RLSC Start... Time: Thu Jan 5 18:58:12 2017
Warning: The QCspan was set at '0'.
The GCV was used to avoid overfitting the observed data
|===============================================================================| 100%
High-resolution images output...
Calculation of CV distribution of raw peaks (QC)...
CV<5% CV<10% CV<15% CV<20% CV<25% CV<30% CV<35% CV<40%
Batch_1 0.6875477 7.944996 23.98778 37.58594 46.98243 54.39267 61.19175 67.99083
Batch_2 4.0488923 25.821238 45.76012 57.44843 64.40031 70.51184 76.39419 80.29030
Total 0.3819710 6.722689 21.08480 33.38426 44.38503 51.87166 59.20550 64.55309
CV<45% CV<50% CV<55% CV<60% CV<65% CV<70% CV<75% CV<80% CV<85%
Batch_1 72.80367 77.92208 80.97785 84.11001 87.16578 88.69366 89.45760 90.67991 91.59664
Batch_2 83.34607 86.40183 88.31169 90.52712 92.58976 93.43010 94.42322 95.64553 96.18029
Total 69.36593 74.56073 78.53323 81.51261 82.96409 85.10313 87.39496 89.53400 91.36746
CV<90% CV<95% CV<100%
Batch_1 92.66616 93.35371 94.57601
Batch_2 96.48587 97.17341 97.40260
Total 92.89534 94.27044 94.95798
Calculation of CV distribution of corrected peaks (QC)...
CV<5% CV<10% CV<15% CV<20% CV<25% CV<30% CV<35% CV<40% CV<45%
Batch_1 18.25821 45.98930 64.40031 72.72727 78.45684 83.72804 86.17265 88.54087 89.76318
Batch_2 20.24446 51.48969 68.06723 78.22765 84.56837 88.23529 90.75630 92.36058 93.50649
Total 15.73720 44.46142 64.62949 73.18564 80.36669 84.79756 87.31856 88.69366 89.68678
CV<50% CV<55% CV<60% CV<65% CV<70% CV<75% CV<80% CV<85% CV<90%
Batch_1 91.06188 91.90222 92.58976 93.04813 93.43010 94.04125 94.65241 95.11077 95.56914
Batch_2 94.11765 94.88159 95.49274 96.18029 96.63866 96.86784 97.09702 97.40260 97.70817
Total 90.75630 91.97861 93.20092 93.96486 94.57601 95.33995 95.87471 96.10390 96.63866
CV<95% CV<100%
Batch_1 95.95111 96.02750
Batch_2 98.09015 98.31933
Total 96.71505 97.09702
Correction Finished! Time: Thu Jan 5 19:00:51 2017
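The CV tables in the log above report, for each cut-off, the cumulative percentage of peaks whose coefficient of variation across the QC injections falls below that cut-off. A minimal sketch of that calculation, using a made-up QC matrix and a hypothetical helper cv_distribution():

```r
# Sketch: cumulative CV distribution of peaks across QC injections.
# cv_distribution() is a hypothetical helper; rows are QC injections,
# columns are peaks.
cv_distribution <- function(qc, cutoffs = seq(5, 100, by = 5)) {
  cv  <- 100 * apply(qc, 2, sd) / colMeans(qc)          # CV (%) per peak
  out <- vapply(cutoffs, function(k) 100 * mean(cv < k), numeric(1))
  names(out) <- paste0("CV<", cutoffs, "%")
  out
}

# Three made-up peaks with low, moderate and high variability.
qc <- cbind(p1 = c(100, 101, 99, 100),
            p2 = c(100, 120, 80, 100),
            p3 = c(100, 200, 50, 50))

cv_table <- cv_distribution(qc)
```

Comparing such a table before and after correction, as the log does, shows how many peaks move below a given CV cut-off.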
The output file:
statTarget -- statAnalysis
-- PCA_Data_Pareto # Principal Component Analysis
-- PLS_DA_Pareto # Partial least squares Discriminant Analysis
-- Univariate # Univariate analysis
----- BoxPlot
----- Fold_Changes
----- Mann-Whitney_Tests # For non-normally distributed variables
----- oddratio # Odds ratio
----- Pvalues # Integrated p-values from Welch_test and MWT_test
----- RForest # Random Forest
----- ROC # receiver operating characteristic curve
----- Shapiro_Tests
----- Significant_Variables # The Peaks with P-value < 0.05
----- Volcano_Plots
----- WelchTest # For normally distributed variables
The Figures: