This vignette is also available as a bookdown format on our gitHub page https://mixomicsteam.github.io/Bookdown/.
mixOmics is an R toolkit dedicated to the exploration and integration of biological data sets with a specific focus on variable selection. The package currently includes nineteen multivariate methodologies, mostly developed by the
mixOmics team (see some of our references in 2.2.3). Originally, all methods were designed for omics data, however, their application is not limited to biological data only. Other applications where integration is required can be considered, but mostly for the case where the predictor variables are continuous (see also 2.1).
mixOmics, a strong focus is given to graphical representation to better translate and understand the relationships between the different data types and visualize the correlation structure at both sample and variable levels.
Note the data pre-processing requirements before analysing data with
Types of data. Different types of biological data can be explored and integrated with
mixOmics. Our methods can handle molecular features measured on a continuous scale (e.g. microarray, mass spectrometry-based proteomics and metabolomics) or sequenced-based count data (RNA-seq, 16S, shotgun metagenomics) that become `continuous’ data after pre-processing and normalisation.
Normalisation. The package does not handle normalisation as it is platform specific and we cover a too wide variety of data! Prior to the analysis, we assume the data sets have been normalised using appropriate normalisation methods and pre-processed when applicable.
mixOmics methods can handle large data sets (several tens of thousands of predictors), we recommend pre-filtering the data to less than 10K predictor variables per data set, for example by using Median Absolute Deviation (Teng et al. 2016) for RNA-seq data, by removing consistently low counts in microbiome data sets (Lê Cao et al. 2016) or by removing near zero variance predictors. Such step aims to lessen the computational time during the parameter tuning process.
Data format. Our methods use matrix decomposition techniques. Therefore, the numeric data matrix or data frames have \(n\) observations or samples in rows and \(p\) predictors or variables (e.g. genes, proteins, OTUs) in columns.
Covariates. In the current version of
mixOmics, covariates that may confound the analysis are not included in the methods. We recommend correcting for those covariates beforehand using appropriate univariate or multivariate methods for batch effect removal. Contact us for more details as we are currently working on this aspect.
We list here the main methodological or theoretical concepts you need to know to be able to efficiently apply
Individuals, observations or samples: the experimental units on which information are collected, e.g. patients, cell lines, cells, faecal samples …
Variables, predictors: read-out measured on each sample, e.g. gene (expression), protein or OTU (abundance), weight …
Variance: measures the spread of one variable. In our methods we estimate the variance of components rather that variable read-outs. A high variance indicates that the data points are very spread out from the mean, and from one another (scattered).
Covariance: measures the strength of the relationship between two variables, i.e whether they co-vary. A high covariance value indicates a strong relationship, e.g weight and height in individuals frequently vary roughly in the same way; roughly, the heaviest are the tallest. A covariance value has no lower or upper bound.
Correlation: a standardized version of the covariance that is bounded by -1 and 1.
Linear combination: variables are combined by multiplying each of of them by a coefficient and adding the results. A linear combination of height and weight could be 2 \(*\) weight - 1.5 \(*\) height with the coefficients 2 and -1.5 assigned with weight and height respectively.
Component: an artificial variable built from a linear combination of the observed variables in a given data set. Variable coefficients are optimally defined based on some statistical criterion. For example in Principal Component Analysis, the coefficients in the (principal) component is defined so as to maximise the variance of the component.
Loadings: variable coefficients used to define a component.
Sample plot: representation of the samples projected in a small space spanned (defined) by the components. Samples coordinates are determined by their components values, or scores.
Correlation circle plot: representation of the variables in a space spanned by the components. Each variable coordinate is defined as the correlation between the original variable value and each component. A correlation circle plot enables to visualise the correlation between variables - negative or positive correlation, defined by the cosine angle between the centre of the circle and each variable point) and the contribution of each variable to each component - defined by absolute value of the coordinate on each component. For this interpretation, data need to be centred and scaled (by default in most of our methods except PCA). For more details on this insightful graphic, see Figure 1 in (González et al. 2012).
Unsupervised analysis: the method does not take into account any known sample groups and the analysis is exploratory. Examples of unsupervised methods covered in this vignette are Principal Component Analysis (PCA, Chapter 4), Projection to Latent Structures (PLS, Chapter 6), and also Canonical Correlation Analysis (CCA, not covered here).
Supervised analysis: the method includes a vector indicating the class membership of each sample. The aim is to discriminate sample groups and perform sample class prediction. Examples of supervised methods covered in this vignette are PLS Discriminant Analysis (PLS-DA, Chapter 5), DIABLO (Chapter 7) and also MINT (not covered here (F Rohart et al. 2017)).
Here is an overview of the most widely used methods in
mixOmics that will be further detailed in this vignette, with the exception of rCCA and MINT. We depict them along with the type of data set they can handle.