ClassifyR provides a structured pipeline for cross-validated classification. Classification is viewed in terms of four stages, data transformation, feature selection, classifier training, and prediction. The stages can be run in any order that is sensible.
Each step can be provided with custom functions that follow some rules about parameters. The driver function runTests implements different varieties of cross-validation. They are:
runTests can use parallel processing capabilities in R to speed up cross-validations when many CPUs are available. The output of runTests is a ClassifyResult object which can be directly used by the performance evaluation functions. The process of classification is summarised by a flowchart.
Importantly, ClassifyR implements a number of methods for classification using different kinds of changes in measurements between classes. Most classifiers work with features where the means are different. In addition to changes in means (DM), ClassifyR also allows for classification using differential varibility (DV; changes in scale) and differential distribution (DD; changes in location and/or scale). See the appendix section “Common params Specifications for Common Classifications” for some ready-to-use parameter sets for standard use of some classifiers.
In the following sections, some of the most useful functions provided in ClassifyR will be demonstrated. However, a user can provide any feature selection, training, or prediction function to the classification framework, as long as it meets some simple rules about the input and return parameters. See the appendix section of this guide titled “Rules for New Functions” for a description of these.
There are a few other frameworks for classification in R. The table below provides a comparison of which features they offer.
Package | Run User-defined Classifiers | Parallel Execution on any OS | Parameter Tuning | Intel DAAL Performance Metrics | Ranking and Selection Plots | Class Distribution Plot | Sample-wise Error Heatmap | Direct Support for MultiAssayExperiment Input |
---|---|---|---|---|---|---|---|---|
ClassifyR | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
caret | Yes | Yes | Yes | No | No | No | No | No |
MLInterfaces | Yes | No | No | No | No | No | No | No |
MCRestimate | Yes | No | Yes | No | No | No | No | No |
CMA | No | No | Yes | No | No | No | No | No |
Although being a cross-validation framework, a number of popular feature selection and classification functions are provided by the package which meet the requirements of functions to be used by it (see the last section).
Functions with names ending in “interface” indicate wrappers for existing methods implemented in other packages. Different methods select different types of changes (i.e. location and/or scale) between classes.
Likewise, a variety of classifiers is also provided.
If a desired selection or classification method is not already implemented, rules for writing functions to work with ClassifyR are outlined in the next section.
To demonstrate some key features of ClassifyR, a data set consisting of the 2000 most variably expressed genes and 190 people will be used to quickly obtain results. The journal article corresponding to the data set was published in Scientific Reports in 2018 and is titled A Nasal Brush-based Classifier of Asthma Identified by Machine Learning Analysis of Nasal RNA Sequence Data.
library(ClassifyR)
data(asthma)
measurements[1:5, 1:5]
## Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
## HBB 9.72 11.98 12.15 10.60 8.18
## BPIFA1 14.06 13.89 17.44 11.87 15.01
## XIST 12.28 6.35 10.21 6.27 11.21
## FCGR3B 11.42 13.25 7.87 14.75 6.77
## HBA2 7.83 9.42 9.68 8.96 6.43
head(classes)
## [1] No No No No Yes No
## Levels: No Yes
The numeric matrix variable measurements stores the normalised values of the RNA gene abundances for each sample and the factor vector classes identifies which class the samples belong to. The measurements were normalised using DESeq2’s varianceStabilizingTransformation function, which produces \(log_2\)-like data.
For more complex data sets with multiple kinds of experiments (e.g. DNA methylation, copy number, gene expression on the same set of samples) a MultiAssayExperiment is recommended for data storage and supported by ClassifyR’s methods.
runTests is the main function in ClassifyR which handles the sample splitting and parallelisation, if used, of cross-validation. To begin with, a simple classifier will be specified. It uses a limma moderated t-test ranking for feature selection and DLDA for classification. The limmaSelection function also uses DLDA for estimating a resubstitution error rate for a number of top-f ranked features, as a heuristic for picking f features from the feature ranking which are used in the training and prediction stages of classification. This classifier relies on differences in means between classes. No parameters need to be specified, because this is the default classification of runTests.
DMresults <- runTests(measurements, classes, datasetName = "Asthma",
classificationName = "Different Means", permutations = 20, folds = 5,
seed = 2018, verbose = 1)
DMresults
## An object of class 'ClassifyResult'.
## Data Set Name: Asthma.
## Classification Name: Different Means.
## Feature Selection Name: Limma moderated t-test.
## Features: List of length 20 of lists of length 5 of feature identifiers.
## Validation: 20 Permutations, 5 Folds.
## Predictions: List of data frames of length 20.
## Performance Measures: None calculated yet.
Here, 20 permutations and 5 folds cross-validation is specified by the values of permutations and folds. For computers with more than 1 CPU, the number of cores to use can be given to runTests by using the argument parallelParams. The parameter seed is important to set for result reproducibility when doing a cross-validation such as this, because it employs randomisation to partition the samples into folds. For more details about runTests and the parameter classes used by it, consult the help pages of such functions.
The most frequently selected gene can be identified using the distribution function and its relative abundance values for all samples can be displayed visually by plotFeatureClasses.
selectionPercentages <- distribution(DMresults, plot = FALSE)
sortedPercentages <- sort(selectionPercentages, decreasing = TRUE)
head(sortedPercentages)
mostChosen <- names(sortedPercentages)[1]
bestGenePlot <- plotFeatureClasses(measurements, classes, mostChosen, dotBinWidth = 0.1,
xAxisLabel = "Normalised Expression")
## allFeatures
## ZDHHC1 SSBP4 C10orf95 CTXN1 CROCC TMEM190
## 100 99 98 97 91 86
The means of the abundance levels of ZDHHC1 are substantially different between the people with and without asthma. plotFeatureClasses can also plot categorical data, such as may be found in a clinical data table, as a bar chart.
Classification error rates, as well as many other prediction performance measures, can be calculated with calcCVperformance. Next, the balanced error rate is calculated considering all samples, each of which was in the test set once. The balanced error rate is defined as the average of the classification errors of each class.
See the documentation of calcCVperformance for a list of performance metrics which may be calculated.
DMresults <- calcCVperformance(DMresults, "balanced error")
DMresults
## An object of class 'ClassifyResult'.
## Data Set Name: Asthma.
## Classification Name: Different Means.
## Feature Selection Name: Limma moderated t-test.
## Features: List of length 20 of lists of length 5 of feature identifiers.
## Validation: 20 Permutations, 5 Folds.
## Predictions: List of data frames of length 20.
## Performance Measures: Balanced Error Rate.
performance(DMresults)
## $`Balanced Error Rate`
## [1] 0.2184751 0.2149316 0.1992913 0.2265396 0.2073558 0.2073558 0.2149316 0.2033236 0.2220186 0.2154203 0.2108993 0.2068671
## [13] 0.2033236 0.2068671 0.2189638 0.2225073 0.2300831 0.1962366 0.2108993 0.2270283
The error rate is about 20%. If only a vector of predictions and a vector of actual classes is available, such as from an old study which did not use ClassifyR for cross-validation, then calcExternalPerformance can be used on a pair of factor vectors which have the same length.
The samplesMetricMap function allows the visual comparison of sample-wise error rate or accuracy measures from different ClassifyResult objects. Firstly, a classifier will be run that uses Kullback-Leibler divergence ranking and resubstitution error as a feature selection heuristic and a naive Bayes classifier for classification. This classification will use features that have either a change in location or in scale between classes.
selectParams <- SelectParams(KullbackLeiblerSelection, resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(naiveBayesKernel)
predictParams <- PredictParams(predictor = NULL, getClasses = function(result) result,
weighted = "weighted", weight = "height difference", returnType = "both")
paramsList <- list(selectParams, trainParams, predictParams)
DDresults <- runTests(measurements, classes, datasetName = "Asthma",
classificationName = "Differential Distribution",
permutations = 20, folds = 5, seed = 2018,
params = paramsList, verbose = 1)
DDresults
## An object of class 'ClassifyResult'.
## Data Set Name: Asthma.
## Classification Name: Differential Distribution.
## Feature Selection Name: Kullback-Leibler Divergence.
## Features: List of length 20 of lists of length 5 of feature identifiers.
## Validation: 20 Permutations, 5 Folds.
## Predictions: List of data frames of length 20.
## Performance Measures: None calculated yet.
The naive Bayes kernel classifier has options specifying how the distances between class densities are used. For more information, consult the documentation of the naiveBayesKernel function.
Now, the classification error for each sample is also calculated for both the differential means and differential distribution classifiers and both ClassifyResult objects generated so far are plotted with samplesMetricMap.
library(grid)
DMresults <- calcCVperformance(DMresults, "sample error")
DDresults <- calcCVperformance(DDresults, "sample error")
resultsList <- list(Abundance = DMresults, Distribution = DDresults)
errorPlot <- samplesMetricMap(resultsList, metric = "error", xAxisLabel = "Sample",
showXtickLabels = FALSE, plot = FALSE)
grid.draw(errorPlot)
The benefit of this plot is that it allows the easy identification of samples which are hard to classify and could be explained by considering additional information about them. Differential distribution class prediction appears to be biased to the majority class (No Asthma).
The features being ranked and selected in the feature selection stage can be compared within and between classifiers by the plotting functions rankingPlot and selectionPlot. Consider the task of visually representing how consistent the feature rankings of the top 50 different features were for the differential distribution classifier for all 5 folds in the 20 cross-validations.
rankOverlaps <- rankingPlot(list(DDresults), topRanked = 1:100,
xLabelPositions = c(1, seq(10, 100, 10)),
lineColourVariable = "None", pointTypeVariable = "None",
columnVariable = "None", plot = FALSE)
rankOverlaps
The top-ranked features are fairly similar between all pairs of the 100 cross-validations.
For a large cross-validation scheme, such as leave-2-out cross-validation, or when results contains many classifications, there are many feature set comparisons to make. Note that rankingPlot and selectionPlot have a parallelParams options which allows for the calculation of feature set overlaps to be done on multiple processors.
Sometimes, cross-validation is unnecessary. This happens when studies have large sample sizes and are well-designed such that a large number of samples is prespecified to form a test set. The classifier is only trained on the training sample set, and makes predictions only on the test sample set. This can be achieved by using the function runTest directly. See its documentation for required inputs.
Once a cross-validated classification is complete, the usefulness of the features selected may be explored in another data set. previousSelection is a function which takes an existing ClassifyResult object and returns the features selected at the equivalent iteration which is currently being processed. This is necessary, because the models trained on one dataset are not directly transferrable to a new dataset; the classifier training (e.g. choosing thresholds, fitting model coefficients) is redone.
Some classifiers can be set to output scores or probabilities representing how likely a sample is to be from one of the classes, rather than class labels. This enables different score thresholds to be tried, to generate pairs of false positive and false negative rates. The naive Bayes classifier used previously had its returnType parameter set to “both”, so class predictions and scores were both stored in the classification result. In this case, a data.frame with two columns (named “class” and “score”) is returned by the classifier to the framework. Setting returnType to “score” is also sufficient to generate a ROC plot. Many existing classifiers in other R packages also have an option that allows a score or probability to be calculated.
ROCcurves <- ROCplot(list(DDresults), fontSizes = c(24, 12, 12, 12, 12))
This ROC plot shows the classifiability of the asthma data set is high. Other included functions which can output scores are fisherDiscriminant, DLDApredictInterface, and SVMpredictInterface.
Some classifiers allow the setting of a tuning parameter, which controls some aspect of their model learning. An example of doing parameter tuning with a linear SVM is presented. This particular SVM has a single tuning parameter, the cost. Higher values of this parameter penalise misclassifications more.
This is achieved in ClassifyR by providing a variable called tuneParams to the TrainParams container constructor. tuneParams is a named list, with the names being the names of the tuning variables, and the contents being vectors of values to try. If tuneParams has more than one element, all combination of values of the tuning variables are tried. The performance criterion specified to resubstituteParams is also used as the criterion for choosing the best tuning parameter(s). Any of the non-sample-specific performance metrics which calcCVperformance calculates can be optimised.
SVMresults <- runTests(measurements, classes, datasetName = "Asthma",
classificationName = "Tuned SVM", permutations = 20, folds = 5, seed = 2018,
params = list(SelectParams(limmaSelection, resubstituteParams = ResubstituteParams()),
TrainParams(SVMtrainInterface, kernel = "linear",
resubstituteParams = ResubstituteParams(),
tuneParams = list(cost = c(0.01, 0.1, 1, 10))),
PredictParams(SVMpredictInterface,
getClasses = function(result) result))
)
The chosen values of the parameters are stored for every validation, and can be accessed with the tunedParameters function.
length(tunedParameters(SVMresults))
## [1] 20
tunedParameters(SVMresults)[[1]]
## [[1]]
## [[1]]$cost
## [1] 10
##
##
## [[2]]
## [[2]]$cost
## [1] 10
##
##
## [[3]]
## [[3]]$cost
## [1] 10
##
##
## [[4]]
## [[4]]$cost
## [1] 10
##
##
## [[5]]
## [[5]]$cost
## [1] 10
The cost value of 10 is chosen all five folds of the first sample permutation.
ClassifyR is a framework for cross-validated classification that provides a variety of unique functions for performance evaluation. It provides wrappers for many popular classifiers but is designed to be extensible if other classifiers are desired.
These sets of parameters can simply be specified as the params parameter to runTest or runTests.
selectParams <- SelectParams(edgeRselection, resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(classifyInterface)
predictParams <- PredictParams(NULL, getClasses = function(result) result[["ytehat"]])
params = list(selectParams, trainParams, predictParams)
transformParams <- TransformParams(subtractFromLocation, intermediate = "training",
location = "median")
selectParams <- SelectParams(bartlettSelection,
resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(fisherDiscriminant)
predictParams <- PredictParams(NULL, getClasses = function(result) result)
params = list(transformParams, selectParams, trainParams, predictParams)
selectParams <- SelectParams(KullbackLeiblerSelection,
resubstituteParams = ResubstituteParams())
trainParams <- TrainParams(naiveBayesKernel)
predictParams <- PredictParams(NULL, getClasses = function(result) result)
params = list(selectParams, trainParams, predictParams)
trainParams <- TrainParams(NSCtrainInterface)
selectParams <- SelectParams(NSCselectionInterface, intermediate = "trained")
predictParams <- PredictParams(NSCpredictInterface, getClasses = function(result) result)
params = list(trainParams, selectParams, predictParams)
trainParams <- TrainParams(randomForestInterface, ntree = 100, getFeatures = forestFeatures)
predictParams <- PredictParams(NULL, getClasses = function(result) result[["test"]][["predicted"]])
params = list(trainParams, predictParams)
The required inputs and type of output that each stage of classifiation has is summarised by the table below. The functions can have any number of other arguments after the set of arguments which are mandatory.
The argument verbose is sent from runTest to these functions so they must handle it, even if not explicitly using it. In the ClassifyR framework, verbose is a number which indicates the amount of progress messages to be printed. If verbose is 0, no progress messages are printed. If it is 1, only one message is printed for every 10 cross-validations completed. If it is 2, in addition to the messages printed when it is 1, a message is printed each time one of the stages of classification (transformation, feature selection, training, prediction) is done. If it is 3, in addition to the messages printed for values 1 and 2, progress messages are printed from within the classification functions themselves.
A version of each included transformation, selection, training and prediction function is typically implemented for (1) a numeric matrix for which the rows are for features and columns are for samples (a data storage convention in bioinformatics) and a factor vector of the same length as the number of columns of the matrix, (2) a DataFrame where the columns are naturally for the features, possibly of different data types (i.e. categorical and numeric), and rows are for samples, and a class specification and (3) a MultiAssayExperiment which stores sample class information in the colData slot’s DataFrame with column name “class”. For the inputs (1 and 3) which are not DataFrame, they are converted to one, because the other data types can be stored as a DataFrame without loss of information and the transformation, selection and classification functions which accept a DataFrame contain the code to do the actual computations. At a minimum, a new function must have a method taking a DataFrame as input with the sample classes either stored in a column named “class” or provided as a factor vector. Although not required, providing a version of a function that accepts a numeric matrix with an accompanying factor vector and another version that accepts a MultiAssayExperiment is desirable to provide flexibility regarding input data. See the code of existing functions in the package for examples of this, if intending to implement novel classification-related functions to be used with ClassifyR.
Strbenac D., Yang, J., Mann, G.J. and Ormerod, J. T. (2015) ClassifyR: an R package for performance assessment of classification with applications to transcriptomics, Bioinformatics, 31(11):1851-1853
Strbenac D., Mann, G.J., Yang, J. and Ormerod, J. T. (2016) Differential distribution improves gene selection stability and has competitive classification performance for patient survival, Nucleic Acids Research, 44(13):e119