ClassifyR provides a structured pipeline for cross-validated classification. Classification is viewed in terms of four stages: data transformation, feature selection, classifier training, and prediction. The stages can be run in any order that is sensible.
Each stage can be provided with custom functions that follow some simple rules about their parameters. The driver function runTests implements several varieties of cross-validation.
runTests can use parallel processing capabilities in R to speed up cross-validations when many CPUs are available. The output of runTests is a ClassifyResult object which can be directly used by the performance evaluation functions. The process of classification is summarised by a flowchart.
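For example, once a cross-validation has completed, a performance metric can be calculated directly from the returned object. The call below is a minimal sketch, assuming result is an existing ClassifyResult object (such as the DMresults object created later in this guide).
result <- calcCVperformance(result, performanceType = "balanced error")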
Importantly, ClassifyR implements a number of methods for classification using different kinds of changes in measurements between classes. Most classifiers work with features whose means differ between classes. In addition to changes in means (DM), ClassifyR also allows for classification using differential variability (DV; changes in scale) and differential distribution (DD; changes in location and/or scale). See the appendix section “Common params Specifications for Common Classifications” for some ready-to-use parameter sets for standard use of some classifiers.
In the following sections, some of the most useful functions provided in ClassifyR will be demonstrated. However, a user can provide any feature selection, training, or prediction function to the classification framework, as long as it meets some simple rules about the input and return parameters. See the appendix section of this guide titled “Rules for New Functions” for a description of these.
There are a few other frameworks for classification in R. The table below provides a comparison of which features they offer.
Package | Run User-defined Classifiers | Parallel Execution on any OS | Parameter Tuning | Intel DAAL Performance Metrics | Ranking and Selection Plots | Class Distribution Plot | Sample-wise Error Heatmap | Direct Support for MultiAssayExperiment Input |
---|---|---|---|---|---|---|---|---|
ClassifyR | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
caret | Yes | Yes | Yes | No | No | No | No | No |
MLInterfaces | Yes | No | No | No | No | No | No | No |
MCRestimate | Yes | No | Yes | No | No | No | No | No |
CMA | No | No | Yes | No | No | No | No | No |
Although ClassifyR is primarily a cross-validation framework, the package also provides a number of popular feature selection and classification functions that meet its requirements for such functions (see the last section).
Functions with names ending in “interface” indicate wrappers for existing methods implemented in other packages. Different methods select different types of changes (i.e. location and/or scale) between classes.
Likewise, a variety of classifiers is also provided.
If a desired selection or classification method is not already implemented, rules for writing functions to work with ClassifyR are outlined in the next section.
A number of methods are provided for users to enable classification in a network-centric way. The sub-network and meta-feature creation functions should be used before cross-validation is done.
Pre-validation is an approach that provides a fairer way to compare the benefit of omics data to traditional, freely-available clinical data. For each omics technology considered, training and testing are done on all of the various partitions of the data, and the prediction for each sample is simply added as a column to the clinical data table. Cross-validation is then done as usual on the augmented clinical table. If an omics data set is important, its prediction column will often be included as a selected feature.
Pre-validation is activated by specifying params to be a named list, with one of the elements named “prevalidated”, which specifies the kind of classification to do on the resultant clinical data table. The classification procedure is typically a logistic regression type, such as elastic net regularised regression. The other list elements must each be named after an assay in the measurements object, which must be of type MultiAssayExperiment. For example, if a data object had two assays named RNA and protein, as well as some clinical data about the patients, then a suitable specification of params for the function runTests would be:
library(ClassifyR)
resubParams <- ResubstituteParams(nFeatures = 1:10, performanceType = "balanced error", better = "lower")
paramsList <- list(RNA = list(SelectParams(limmaSelection, resubstituteParams = resubParams),
                              TrainParams(DLDAtrainInterface),
                              PredictParams(DLDApredictInterface)),
                   protein = list(SelectParams(limmaSelection, resubstituteParams = resubParams),
                                  TrainParams(DLDAtrainInterface),
                                  PredictParams(DLDApredictInterface)),
                   prevalidated = list(TrainParams(elasticNetGLMtrainInterface, getFeatures = elasticNetFeatures),
                                       PredictParams(elasticNetGLMpredictInterface))
                   )
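The list could then be passed to runTests through its params argument. The call below is only a sketch; dataObject is a hypothetical MultiAssayExperiment containing RNA and protein assays, and “class” is assumed to be the name of a column of its clinical data holding each patient’s class.
prevalidatedResults <- runTests(dataObject, classes = "class", datasetName = "Example",
                                classificationName = "Pre-validation", params = paramsList,
                                permutations = 20, folds = 5, seed = 2018)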
elasticNetFeatures is a function that enables the selected features to be extracted from the trained models; they are simply the variables with a non-zero beta coefficient. Unlike most classifiers, for an elastic net GLM the feature selection happens during model training, not independently before it.
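The underlying idea can be illustrated with the glmnet package, on which the elastic net GLM interface is based. The snippet below is only a sketch using simulated data (all object names are made up); it fits a cross-validated elastic net logistic regression and reports the variables whose coefficients are non-zero.
library(glmnet)
set.seed(2018)
# Simulated example: 100 samples, 20 hypothetical genes, two classes.
geneMatrix <- matrix(rnorm(100 * 20), nrow = 100,
                     dimnames = list(NULL, paste0("gene", 1:20)))
sampleClasses <- factor(sample(c("No", "Yes"), 100, replace = TRUE))
fit <- cv.glmnet(geneMatrix, sampleClasses, family = "binomial", alpha = 0.5)
betas <- as.matrix(coef(fit, s = "lambda.min"))
# The "selected" features are those with a non-zero coefficient (intercept excluded).
setdiff(rownames(betas)[betas[, 1] != 0], "(Intercept)")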
To demonstrate some key features of ClassifyR, a data set consisting of the 2000 most variably expressed genes and 190 people will be used to quickly obtain results. The journal article corresponding to the data set was published in Scientific Reports in 2018 and is titled A Nasal Brush-based Classifier of Asthma Identified by Machine Learning Analysis of Nasal RNA Sequence Data.
data(asthma)
measurements[1:5, 1:5]
## Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
## HBB 9.72 11.98 12.15 10.60 8.18
## BPIFA1 14.06 13.89 17.44 11.87 15.01
## XIST 12.28 6.35 10.21 6.27 11.21
## FCGR3B 11.42 13.25 7.87 14.75 6.77
## HBA2 7.83 9.42 9.68 8.96 6.43
head(classes)
## [1] No No No No Yes No
## Levels: No Yes
The numeric matrix variable measurements stores the normalised values of the RNA gene abundances for each sample and the factor vector classes identifies which class the samples belong to. The measurements were normalised using DESeq2’s varianceStabilizingTransformation function, which produces \(\log_2\)-like data.
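For readers who wish to apply a similar normalisation to their own RNA-seq counts, the sketch below shows one typical way of doing it with DESeq2 (counts and sampleInfo are hypothetical objects, not part of the asthma data set).
library(DESeq2)
# counts: a matrix of raw gene counts; sampleInfo: a data frame of sample annotations.
dds <- DESeqDataSetFromMatrix(countData = counts, colData = sampleInfo, design = ~ 1)
normalisedMeasurements <- assay(varianceStabilizingTransformation(dds))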
For more complex data sets with multiple kinds of experiments (e.g. DNA methylation, copy number, gene expression on the same set of samples) a MultiAssayExperiment is recommended for data storage and supported by ClassifyR’s methods.
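As a brief illustration of that container, the sketch below combines two small made-up assays and a clinical table into a MultiAssayExperiment (all names and values are hypothetical).
library(MultiAssayExperiment)
# Two assays measured on the same ten patients, plus a clinical data table.
RNAassay <- matrix(rnorm(200), nrow = 20, ncol = 10,
                   dimnames = list(paste0("gene", 1:20), paste0("patient", 1:10)))
proteinAssay <- matrix(rnorm(150), nrow = 15, ncol = 10,
                       dimnames = list(paste0("protein", 1:15), paste0("patient", 1:10)))
clinical <- DataFrame(age = round(runif(10, 30, 70)),
                      class = factor(sample(c("No", "Yes"), 10, replace = TRUE)),
                      row.names = paste0("patient", 1:10))
dataContainer <- MultiAssayExperiment(experiments = ExperimentList(list(RNA = RNAassay, protein = proteinAssay)),
                                      colData = clinical)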
runTests is the main function in ClassifyR which handles the sample splitting and, if used, the parallelisation of cross-validation. To begin with, a simple classifier will be demonstrated. It uses a t-test or ANOVA ranking (depending on the number of classes) for feature selection and DLDA for classification. The differentMeansSelection function also uses DLDA to estimate a resubstitution error rate for a number of top-f ranked features, as a heuristic for picking the f features from the ranking that are used in the training and prediction stages of classification. This classifier relies on differences in means between classes. No parameters need to be specified, because this is the default classification of runTests.
DMresults <- runTests(measurements, classes, datasetName = "Asthma",
classificationName = "Different Means", permutations = 20, folds = 5,
seed = 2018, verbose = 1)
DMresults
## An object of class 'ClassifyResult'.
## Data Set Name: Asthma.
## Classification Name: Different Means.
## Feature Selection Name: Difference in Means.
## Features: List of length 20 of lists of length 5 of feature identifiers.
## Validation: 20 Permutations, 5 Folds.
## Predictions: List of data frames of length 20.
## Performance Measures: None calculated yet.
Here, cross-validation with 20 permutations and 5 folds is specified by the values of permutations and folds. For computers with more than 1 CPU, the number of cores to use can be given to runTests by using the argument parallelParams. The parameter seed is important to set for result reproducibility when doing a cross-validation such as this, because it employs randomisation to partition the samples into folds. For more details about runTests and the parameter classes used by it, consult the help pages of those functions.
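For example, a BiocParallel back-end could be supplied as sketched below (MulticoreParam forks processes and is not available on Windows, where SnowParam can be used instead; the number of workers is an arbitrary choice).
library(BiocParallel)
DMresultsParallel <- runTests(measurements, classes, datasetName = "Asthma",
                              classificationName = "Different Means",
                              permutations = 20, folds = 5, seed = 2018,
                              parallelParams = MulticoreParam(workers = 4))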
The most frequently selected gene can be identified using the distribution function and its relative abundance values for all samples can be displayed visually by plotFeatureClasses.
selectionPercentages <- distribution(DMresults, plot = FALSE)
sortedPercentages <- sort(selectionPercentages, decreasing = TRUE)
head(sortedPercentages)
mostChosen <- names(sortedPercentages)[1]
bestGenePlot <- plotFeatureClasses(measurements, classes, mostChosen, dotBinWidth = 0.1,
xAxisLabel = "Normalised Expression")