phenomis 1.4.0
reading
: reading the datainspecting
: looking at the datahypotesting
: univariate hypothesis testingannotating
: MS annotationwriting
: Exporting the resultsAnalysis of a molecular phenotyping (e.g. metabolomics) data sets (i.e. samples times variables table of peak or bucket intensities generated by preprocessing tools such as XCMS) is aimed at mining the data (e.g. detecting trends and outliers) and selecting features of predictive value (biomarker discovery). It comprises multiple steps including:
Post-processing (quality control, normalization and/or transformation of intensities)
Statistical analysis (univariate hypothesis testing, multivariate modeling, feature selection)
Annotation (physico-chemical properties, biological pathways)
The phenomis
package focuses on the two first steps, and can be
combined with other packages for multivariate modeling, feature
selection and annotation, such as the
ropls
and
biosigner
and
biodb
Bioconductor
packages. As an example, the tutorial below reproduces the whole
workflow described in the sacurine
study (Thévenot et al. 2015).
The phenomis
package has also been used to post-processed the
preclinical, proteomics and metabolomics data from the ProMetIS study
(Imbert et al. 2021). Proteomics and metabolomics data
from this study are provided as a second dataset, named prometis
.
Examples of statistical analyses of this dataset are provided in the
“Working with multi-omics datasets” section.
SummarizedExperiment
and MultiAssayExperiment
formatsThe standard SummarizedExperiment
and MultiAssayExperiment
Bioconductor formats for single and multi-omics datasets are used by the
phenomis
methods. The methods return updated SummarizedExperiment
and MultiAssayExperiment
objects with either modified assay matrices
(e.g. when using the transforming
method) or enriched rowData
and
colData
with new columns (e.g. fold changes and p-values when using
the hypotesting
method).
Note that the phenomis
package also supports the ExpressionSet
(respectively, MultiDataSet
) formats, which are previous
(respectively, alternative) formats for the management of single
(respectively, multiple) datasets.
Input (i.e. preprocessed) data consists of a ‘variable times sample’ matrix of intensities (dataMatrix numeric matrix), in addition to sample and variable metadata (sampleMetadata and variableMetadata data frames). Importantly, the row names from the dataMatrix must be identical to the row names from the vaiableMetadata (feature names), and the column names from the dataMatrix must be identical to the row names from sampleMetadata (sample names).
Note that 3 sample metadata columns should be specified when working
with the correcting
method, namely:
sampleType (character): the following types are handled by the algorithms:
sample: biological sample of interest
blank: e.g. solvent only in Liquid Chromatography coupled to Mass Spectrometry
pool: quality control sample generated by pooling an equal volume of each of the sample of interest from the whole study (i.e. from all batches)
poolN: (where N is an integer; e.g. pool2, pool4, …): diluted quality control sample: N indicates the dilution factor
injectionOrder (integer): order of the sample injection (e.g. in the LC-MS instrument)
batch (character): name of each batch
Text and graphics can be handled with the phenomis methods by setting the two arguments:
report.c
: if set to ‘interactive’ [default], messages are
displayed interactively; if set to a character corresponding the a
filename (with the ‘.txt’ extension), messages are diverted to this
file by using the sink function internally (the same file can be
appended by successive methods); if set to ‘none’, messages are
suppressed
figure.c
: if set to ‘interactive’ [default], graphics are
displayed interactively; if set to a character corresponding the a
filename (with the ‘.pdf’ extension), a pdf file with the plot is
generated instead; if set to ‘none’, graphics are suppressed
As an example, we will use the phenomis
package to study the
sacurine human cohort. The study is aimed at characterizing the
physiological variations of the human urine metabolome with age, body
mass index (BMI), and gender (Thévenot et al. 2015). Urine samples
from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh
performance liquid chromatography (UPLC) coupled to high-resolution mass
spectrometry (LTQ-Orbitrap). Raw data are publicly available on the
MetaboLights repository
(MTBLS404).
This vignette describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2):
correcting
: Correction of the signal drift by local regression
(loess) modeling of the intensity trend in pool samples
(Dunn et al. 2011); Adjustment of offset differences between the
two analytical batches by using the average of the pool intensities
in each batch (Kloet et al. 2009)
Variable quality control by discarding features with a coefficient of variation above 30% in pool samples
Normalization by the sample osmolality
transforming
: log10 transformation
inspecting
: Computing metrics to filter out outlier samples
according to the Weighted Hotellings’T2 distance
(Tenenhaus 1999), the Z-score of one of the intensity
distribution deciles (Alonso et al. 2011), and the Z-score of the
number of missing values (Alonso et al. 2011). A 0.001 threshold
for all p-values results in the HU_096 sample being discarded
hypotesting
: Univariate hypothesis testing of significant
variations with age, BMI, or between genders (Student’s T test with
Benjamini Hochberg correction)
PCA exploration of the data set;
ropls
Bioconductor
package (Thévenot et al. 2015)
clustering
: Hierarchical clustering
OPLS(-DA) modeling of age, BMI and gender;
ropls
Bioconductor
package (Thévenot et al. 2015)
Selection of the features which significantly contributes to the
discrimination between gender with PLS-DA, Random Forest, or Support
Vector Machines classifiers;
biosigner
Bioconductor package (Rinaudo et al. 2016)
A Galaxy version of this analysis is available W4M00001 ‘Sacurine-statistics’ on the workflow4metabolomics.usegalaxy.fr online infrastructure (Guitton et al. 2017).
reading
: reading the dataThe reading
function reads the data sets and builds the
SummarizedExperiment
object. Additional information on how to build
and handle SummarizedExperiment
objects (as well as
MultiAssayExperiment
, ExpressionSet
, and MultiDataSet
) are
provided in the Appendix.
library(phenomis)
sacurine.se <- reading(system.file("extdata/sacurine", package = "phenomis"))
## class: SummarizedExperiment
## dim: 113 210
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(113): (2-methoxyethoxy)propanoic acid isomer
## (gamma)Glu-Leu/Ile ... Valerylglycine isomer 2 Xanthosine
## rowData names(10): database_identifier chemical_formula ...
## retention_time reliability
## colnames(210): QC1_001 HU_neg_017 ... HU_neg_146_b2 QC1_012_b2
## colData names(10): subset full ... bmi gender
inspecting
: looking at the dataThe inspecting
method provides a numerical and graphical overview of
the data. Furthermore, it computes quality metrics which may
subsequently be used to filter out some samples or variables.
Graphical overview. The data matrix is visualized with a color
gradient (top right) and the score plot of the Principal Component
Analysis is shown for the two first components (bottom right). The
black ellipse corresponds to the area of 95% of the samples, as
computed with the Hotelling test. For each sample the mean of
variable intensities is shown as a function of the injection order
to detect any signal drift and/or batch correction (bottom left).
Note that for this coloring to be displayed, the sampleType,
injectionOrder and batch columns should be provided in the
colData
of (each of) the dataset(s). Finally, some metrics are
computed regarding the proportion of NAs, 0 values, intensity
quantiles, and proportion of features with a coefficient of
variation in pool intensities < 30%.
Quality metrics (as additional column in the metadata):
samples (added in the colData
):
hotel_pval: p-value from the Hotelling’s T2 test in the first plane of the PC components
miss_pval: p-value associated to the z-score of the proportion of missing values (Alonso et al. 2011)
deci_pval: p-value associated to the z-score of intensity deciles (Alonso et al. 2011)
For each test, low p-values highlight samples with extreme behavior.
features (included in the rowData
)
colData
from the
individual datasets), variable metrics are computed: sample,
pool, and blank mean, sd and coefficient of variation (if
the corresponding types are present in the ‘sampleType’
column), as well as ‘blank’ mean / ‘sample’ mean, and ‘pool’
CV / ‘sample’ CV ratiosacurine.se <- inspecting(sacurine.se, report.c = "none")