1 Introduction

1.1 Context

Analysis of a molecular phenotyping (e.g. metabolomics) data sets (i.e. samples times variables table of peak or bucket intensities generated by preprocessing tools such as XCMS) is aimed at mining the data (e.g. detecting trends and outliers) and selecting features of predictive value (biomarker discovery). It comprises multiple steps including:

  • Post-processing (quality control, normalization and/or transformation of intensities)

  • Statistical analysis (univariate hypothesis testing, multivariate modeling, feature selection)

  • Annotation (physico-chemical properties, biological pathways)

The phenomis package focuses on the two first steps, and can be combined with other packages for multivariate modeling, feature selection and annotation, such as the ropls and biosigner and biodb Bioconductor packages. As an example, the tutorial below reproduces the whole workflow described in the sacurine study (Thévenot et al. 2015). The phenomis package has also been used to post-processed the preclinical, proteomics and metabolomics data from the ProMetIS study (Imbert et al. 2021). Proteomics and metabolomics data from this study are provided as a second dataset, named prometis. Examples of statistical analyses of this dataset are provided in the “Working with multi-omics datasets” section.

1.2 Methods

Methods from the phenomis, ropls, and biosigner packages for the analysis of omics datasets; specific parameter values used for the sacurine metabolomics dataset described in the ‘Hands-on’ part below are provided as examples.

1.3 Formats

1.3.1 Managing data and metadata: the SummarizedExperiment and MultiAssayExperiment formats

The standard SummarizedExperiment and MultiAssayExperiment Bioconductor formats for single and multi-omics datasets are used by the phenomis methods. The methods return updated SummarizedExperiment and MultiAssayExperiment objects with either modified assay matrices (e.g. when using the transforming method) or enriched rowData and colData with new columns (e.g. fold changes and p-values when using the hypotesting method).

Note that the phenomis package also supports the ExpressionSet (respectively, MultiDataSet) formats, which are previous (respectively, alternative) formats for the management of single (respectively, multiple) datasets.

1.3.2 Importing from/exporting to tabular files

Input (i.e. preprocessed) data consists of a ‘variable times sample’ matrix of intensities (dataMatrix numeric matrix), in addition to sample and variable metadata (sampleMetadata and variableMetadata data frames). Importantly, the row names from the dataMatrix must be identical to the row names from the vaiableMetadata (feature names), and the column names from the dataMatrix must be identical to the row names from sampleMetadata (sample names).

Note that 3 sample metadata columns should be specified when working with the correcting method, namely:

  • sampleType (character): the following types are handled by the algorithms:

    • sample: biological sample of interest

    • blank: e.g. solvent only in Liquid Chromatography coupled to Mass Spectrometry

    • pool: quality control sample generated by pooling an equal volume of each of the sample of interest from the whole study (i.e. from all batches)

    • poolN: (where N is an integer; e.g. pool2, pool4, …): diluted quality control sample: N indicates the dilution factor

  • injectionOrder (integer): order of the sample injection (e.g. in the LC-MS instrument)

  • batch (character): name of each batch

1.4 Text and graphical outputs

Text and graphics can be handled with the phenomis methods by setting the two arguments:

  • report.c: if set to ‘interactive’ [default], messages are displayed interactively; if set to a character corresponding the a filename (with the ‘.txt’ extension), messages are diverted to this file by using the sink function internally (the same file can be appended by successive methods); if set to ‘none’, messages are suppressed

  • figure.c: if set to ‘interactive’ [default], graphics are displayed interactively; if set to a character corresponding the a filename (with the ‘.pdf’ extension), a pdf file with the plot is generated instead; if set to ‘none’, graphics are suppressed

2 Hands-on

2.1 The sacurine cohort study

As an example, we will use the phenomis package to study the sacurine human cohort. The study is aimed at characterizing the physiological variations of the human urine metabolome with age, body mass index (BMI), and gender (Thévenot et al. 2015). Urine samples from 184 volunteers were analyzed by reversed-phase (C18) ultrahigh performance liquid chromatography (UPLC) coupled to high-resolution mass spectrometry (LTQ-Orbitrap). Raw data are publicly available on the MetaboLights repository (MTBLS404).

This vignette describes the statistical analysis of the data set from the negative ionization mode (113 identified metabolites at MSI levels 1 or 2):

  • correcting: Correction of the signal drift by local regression (loess) modeling of the intensity trend in pool samples (Dunn et al. 2011); Adjustment of offset differences between the two analytical batches by using the average of the pool intensities in each batch (Kloet et al. 2009)

  • Variable quality control by discarding features with a coefficient of variation above 30% in pool samples

  • Normalization by the sample osmolality

  • transforming: log10 transformation

  • inspecting: Computing metrics to filter out outlier samples according to the Weighted Hotellings’T2 distance (Tenenhaus 1999), the Z-score of one of the intensity distribution deciles (Alonso et al. 2011), and the Z-score of the number of missing values (Alonso et al. 2011). A 0.001 threshold for all p-values results in the HU_096 sample being discarded

  • hypotesting: Univariate hypothesis testing of significant variations with age, BMI, or between genders (Student’s T test with Benjamini Hochberg correction)

  • PCA exploration of the data set; ropls Bioconductor package (Thévenot et al. 2015)

  • clustering: Hierarchical clustering

  • OPLS(-DA) modeling of age, BMI and gender; ropls Bioconductor package (Thévenot et al. 2015)

  • Selection of the features which significantly contributes to the discrimination between gender with PLS-DA, Random Forest, or Support Vector Machines classifiers; biosigner Bioconductor package (Rinaudo et al. 2016)

A Galaxy version of this analysis is available W4M00001 ‘Sacurine-statistics’ on the online infrastructure (Guitton et al. 2017).

2.2 reading: reading the data

The reading function reads the data sets and builds the SummarizedExperiment object. Additional information on how to build and handle SummarizedExperiment objects (as well as MultiAssayExperiment, ExpressionSet, and MultiDataSet) are provided in the Appendix.

library(phenomis) <- reading(system.file("extdata/sacurine", package = "phenomis"))
## class: SummarizedExperiment 
## dim: 113 210 
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(113): (2-methoxyethoxy)propanoic acid isomer
##   (gamma)Glu-Leu/Ile ... Valerylglycine isomer 2 Xanthosine
## rowData names(10): database_identifier chemical_formula ...
##   retention_time reliability
## colnames(210): QC1_001 HU_neg_017 ... HU_neg_146_b2 QC1_012_b2
## colData names(10): subset full ... bmi gender

2.3 inspecting: looking at the data

The inspecting method provides a numerical and graphical overview of the data. Furthermore, it computes quality metrics which may subsequently be used to filter out some samples or variables.

  • Graphical overview. The data matrix is visualized with a color gradient (top right) and the score plot of the Principal Component Analysis is shown for the two first components (bottom right). The black ellipse corresponds to the area of 95% of the samples, as computed with the Hotelling test. For each sample the mean of variable intensities is shown as a function of the injection order to detect any signal drift and/or batch correction (bottom left). Note that for this coloring to be displayed, the sampleType, injectionOrder and batch columns should be provided in the colData of (each of) the dataset(s). Finally, some metrics are computed regarding the proportion of NAs, 0 values, intensity quantiles, and proportion of features with a coefficient of variation in pool intensities < 30%.

  • Quality metrics (as additional column in the metadata):

    • samples (added in the colData):

      • hotel_pval: p-value from the Hotelling’s T2 test in the first plane of the PC components

      • miss_pval: p-value associated to the z-score of the proportion of missing values (Alonso et al. 2011)

      • deci_pval: p-value associated to the z-score of intensity deciles (Alonso et al. 2011)

        For each test, low p-values highlight samples with extreme behavior.

    • features (included in the rowData)

      • When the type of samples is available (ie the sampleType column is included in the input colData from the individual datasets), variable metrics are computed: sample, pool, and blank mean, sd and coefficient of variation (if the corresponding types are present in the ‘sampleType’ column), as well as ‘blank’ mean / ‘sample’ mean, and ‘pool’ CV / ‘sample’ CV ratio
      • If pool dilutions have been used and are indicated in the ‘sampleType’ column as poolN where N is an integer indicating the dilution factor (eg pool2 for a two-fold dilution of the pool; note that the non-diluted pool remains indicated as ‘pool’) the Pearson correlation (and corresponding p-value) between the intensity and the dilution factor is computed for each variable <- inspecting(, report.c = "none")