Overview

ClassifyR provides two contributions. Firstly, there is a structured pipeline for two-class classification. Classification is viewed in terms of four stages: data transformation, feature selection, classifier training, and prediction. The stages can be run in any sensible order. Each stage can be provided with functions that follow some rules about parameters. Additionally, the driver function implements k-fold cross validation on samples resampled with replacement, as well as leave-k-out cross validation. This function can use the parallel processing capabilities of R to speed up cross validation when many CPUs are available. Some convenience function interfaces are provided for microarray and RNA-seq data, while other functions work directly with the framework without the need for an interface.

Secondly, it implements a number of methods for classification using different feature types. Most classifiers work with features where the means are different. In addition to differential expression, ClassifyR also considers differential deviation and differential distribution.
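
To make the three kinds of change concrete, the following base R sketch simulates one feature of each kind. The simulated values and the summary statistics chosen here (difference of means, ratio of median absolute deviations, Kolmogorov-Smirnov statistic) are illustrative assumptions, not the measures that ClassifyR itself uses.

```r
set.seed(1)
n <- 200  # simulated samples per class

# Differential expression: the class means differ.
deA <- rnorm(n, mean = 0, sd = 1)
deB <- rnorm(n, mean = 2, sd = 1)

# Differential deviation: same mean, different spread.
ddA <- rnorm(n, mean = 0, sd = 1)
ddB <- rnorm(n, mean = 0, sd = 3)

# Differential distribution: one class is unimodal, the other bimodal.
distA <- rnorm(n, mean = 0, sd = 2)
distB <- c(rnorm(n / 2, mean = -2, sd = 0.5), rnorm(n / 2, mean = 2, sd = 0.5))

changeSummary <- c(
  meanDifference = abs(mean(deB) - mean(deA)),
  deviationRatio = mad(ddB) / mad(ddA),
  ksStatistic = unname(ks.test(distA, distB)$statistic)
)
changeSummary
```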

The function that drives the classification is runTests. For cross validation, it repeatedly calls runTest, which runs a classification for a single split of the data.
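
The idea of repeated splits can be sketched in base R. This is only one plausible reading of resampling followed by k-fold splitting. The variable names and the sample count are hypothetical, and the sketch does not try to reproduce how runTests handles samples that are drawn more than once.

```r
set.seed(42)
nSamples <- 20   # hypothetical number of samples
resamples <- 5
folds <- 3

splits <- lapply(seq_len(resamples), function(resample) {
  # Draw sample indices with replacement.
  resampled <- sample(nSamples, replace = TRUE)
  # Assign each drawn index to a fold, as evenly as possible.
  foldID <- rep_len(seq_len(folds), nSamples)
  lapply(seq_len(folds), function(fold) {
    list(train = resampled[foldID != fold],
         test = resampled[foldID == fold])
  })
})

# One train/test split corresponds to one single-split classification.
str(splits[[1]][[1]])
```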

In the following sections, the functions provided in ClassifyR will be demonstrated. However, a user can provide any function to the classification framework, as long as it meets some minimal rules. See the last section “Rules for New Functions” for a description of these.

Comparison to Existing Classification Frameworks

There are a few other frameworks for classification in R. The table below provides a comparison of which features they offer.

Package | Run User-defined Classifiers | Parallel Execution on any OS | Parameter Tuning | Calculate over 20 Performance Metrics | Ranking and Selection Plots | Class Distribution Plot | Error Heatmap
ClassifyR | Yes | Yes | Yes | Yes | Yes | Yes | Yes
caret | Yes | Yes | Yes | No | No | No | No
MLInterfaces | Yes | No | No | No | No | No | No
MCRestimate | Yes | No | Yes | No | No | No | No
CMA | No | No | Yes | No | No | No | No

Case Study: Survival for Ovarian Cancer Patients

A survival study was performed on microarrays for ovarian cancers and is available from curatedOvarianData on Bioconductor. Load the dataset into the current R session. Only 1000 genes are used for illustration.

library(ClassifyR)
library(ggplot2)
library(curatedOvarianData)
data(GSE26712_eset)
GSE26712_eset <- GSE26712_eset[1:1000, ]

Define patients who died within 1 year as poor outcomes, and those who survived more than 5 years as good outcomes.

curatedClinical <- pData(GSE26712_eset)
ovarPoor <- curatedClinical[, "vital_status"] == "deceased" & curatedClinical[, "days_to_death"] < 365 * 1
ovarGood <- curatedClinical[, "vital_status"] == "living" & curatedClinical[, "days_to_death"] > 365 * 5
sum(ovarPoor, na.rm = TRUE)
sum(ovarGood, na.rm = TRUE)
## [1] 27
## [1] 20

There are 27 poor prognosis patients and 20 good prognosis patients. The expression data is subset to only keep patients in the Poor or Good group.

ovarExpression <- exprs(GSE26712_eset)[, c(which(ovarPoor), which(ovarGood))]
ovarGroups <- factor(rep(c("Poor", "Good"), c(length(which(ovarPoor)), length(which(ovarGood)))),
                     levels = c("Poor", "Good"))

Boxplots are drawn to get an idea of the distribution of the data.

plotData <- data.frame(expression = as.numeric(ovarExpression),
                       sample = factor(rep(1:ncol(ovarExpression), each = nrow(ovarExpression))))

ggplot(plotData, aes(x = sample, y = expression)) + geom_boxplot() +
       scale_y_continuous(limits = c(0, 15)) + xlab("Sample") + ylab("Expression Value") +
       ggtitle("Expression for All Arrays") 

All functions provided in ClassifyR work with either a matrix and class vector or an ExpressionSet object. Here, an ExpressionSet object is used.

groupsTable <- data.frame(class = ovarGroups)
rownames(groupsTable) <- colnames(ovarExpression)
ovarSet <- ExpressionSet(ovarExpression, AnnotatedDataFrame(groupsTable))
featureNames(ovarSet) <- rownames(ovarExpression)
dim(ovarSet)
## Features  Samples 
##     1000       47

Differential Expression

Differential expression classifiers look for consistent changes in means between groups. This is the most common form of classification.
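
As a minimal illustration of what a change in means looks like to a ranking method, the base R sketch below simulates a small matrix and ranks features by an ordinary t-test. The simulated data and the use of t.test are assumptions made for illustration. They are a stand-in for, not a reproduction of, the limma and edgeR methods that ClassifyR interfaces with.

```r
set.seed(7)
nFeatures <- 50
perClass <- 10
classes <- factor(rep(c("Poor", "Good"), each = perClass),
                  levels = c("Poor", "Good"))

# Features are rows and samples are columns.
measurements <- matrix(rnorm(nFeatures * 2 * perClass), nrow = nFeatures)

# Shift the first five features upwards in the Good class.
isGood <- classes == "Good"
measurements[1:5, isGood] <- measurements[1:5, isGood] + 4

# Rank features from smallest to largest p-value.
pValues <- apply(measurements, 1,
                 function(feature) t.test(feature ~ classes)$p.value)
ranked <- order(pValues)
head(ranked, 5)
```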

Interfaces to existing feature selection and classification algorithms for this type of change are included. For feature selection, limmaSelection is suited to microarray data and edgeRselection is suited to RNA-seq data where the expression values are raw counts.

Here, feature selection based on a ranked list from limma, followed by a DLDA classifier, will be used to do five resamples and three folds of cross validation. The dlda function from the sparsediscrim package is used directly in the ClassifyR framework, without any interface being necessary.

library(sparsediscrim)
DEresults <- runTests(ovarSet, "Ovarian Cancer", "Differential Expression",
                      validation = "bootstrap", resamples = 5, folds = 3,
                      params = list(SelectionParams(limmaSelection,
                                        resubstituteParams = ResubstituteParams(
                                            nFeatures = c(25, 50, 75, seq(100, 1000, 100)),
                                            performanceType = "balanced", better = "lower")),
                                    TrainParams(dlda, TRUE, doesTests = FALSE),
                                    PredictParams(predict, TRUE,
                                                  getClasses = function(result) result[["class"]])),
                      parallelParams = bpparam(), verbose = 1)
DEresults
## An object of class 'ClassifyResult'.
## Dataset Name: Ovarian Cancer.
## Classification Name: Differential Expression.
## Validation: 3 fold cross-validation of 5 resamples.
## Predictions: List of data frames of length 5.
## Features: List of length 5 of lists of length 3 of row indices.
## Performance Measures: None calculated yet.

For computers with more than 1 CPU, the number of cores to use can be given to runTests by using the argument parallelParams.
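
A small sketch of building such a parameter object follows. It assumes the BiocParallel package is installed (the code is guarded in case it is not), and the choice of two workers and the variable name twoWorkers are arbitrary.

```r
twoWorkers <- NULL
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  # SnowParam launches separate R processes, so it is usable on any OS.
  twoWorkers <- BiocParallel::SnowParam(workers = 2)
  print(BiocParallel::bpnworkers(twoWorkers))
}
```

An object like this could then be supplied as parallelParams = twoWorkers in place of bpparam().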

This example introduces the classes SelectionParams, TrainParams, and PredictParams. They store details about the functions and the parameters they use for selection, training, and prediction. The first argument to their constructors is always a function, followed by other arguments. Any named arguments can be provided, if the function specified to the constructor knows how to use an argument of that name. The order in which they are specified in the list determines the order the stages are run in.

The limmaSelection function specified to SelectionParams ranks probes by p-value. It then uses the classifier specified by TrainParams to calculate the resubstitution error rate for each number of top-ranked features listed in nFeatures, and picks the number with the lowest error rate.
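
The selection-by-resubstitution idea can be sketched in base R. Everything in this sketch is an assumption made for illustration: the data are simulated, a t-test ranking stands in for limma, and a hypothetical nearest-centroid function (centroidPredict) stands in for DLDA.

```r
set.seed(3)
perClass <- 15
classes <- factor(rep(c("Poor", "Good"), each = perClass),
                  levels = c("Poor", "Good"))
measurements <- matrix(rnorm(100 * 2 * perClass), nrow = 100)  # features are rows
isGood <- classes == "Good"
measurements[1:10, isGood] <- measurements[1:10, isGood] + 1.5

# Rank features by t-test p-value (stand-in for the limma ranking).
pValues <- apply(measurements, 1,
                 function(feature) t.test(feature ~ classes)$p.value)
ranking <- order(pValues)

# Hypothetical nearest-centroid classifier (stand-in for DLDA).
centroidPredict <- function(train, trainClasses, test) {
  centroids <- sapply(levels(trainClasses), function(level)
    rowMeans(train[, trainClasses == level, drop = FALSE]))
  centroids <- matrix(centroids, nrow = nrow(train))
  distances <- apply(test, 2, function(sample) colSums((centroids - sample)^2))
  factor(levels(trainClasses)[apply(distances, 2, which.min)],
         levels = levels(trainClasses))
}

# Balanced error: the average of the per-class error rates.
balancedError <- function(actual, predicted)
  mean(sapply(levels(actual),
              function(level) mean(predicted[actual == level] != level)))

# Resubstitution: train and predict on the same samples for each size.
candidates <- c(5, 10, 25, 50)
errors <- sapply(candidates, function(size) {
  topRows <- measurements[ranking[seq_len(size)], , drop = FALSE]
  balancedError(classes, centroidPredict(topRows, classes, topRows))
})
bestSize <- candidates[which.min(errors)]
selected <- ranking[seq_len(bestSize)]
```

Because the same samples are used for training and prediction, the resubstitution error is optimistic. Here it serves only to choose among candidate feature set sizes.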

TrainParams has three mandatory arguments. The first is the function that trains a classifier. The second is a logical value that specifies whether expression should be transposed before being passed to the classifier function. Many classification functions in existing R packages in the CRAN repository need the features to be the columns and samples to be the rows. In ClassifyR, the expression data that is passed to runTests or runTest must have features as rows and samples as columns. This is more common in bioinformatics. In this example, the function dlda expects columns to be features, so transposeExpression is TRUE. Another common difference between classifiers on CRAN is that some of them do training and testing separately, whereas in other packages, one function does training and testing. In the case of dlda, it only does training, so doesTests, the third argument to the constructor, is set to FALSE.
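
The two layouts can be illustrated with a toy matrix in base R. The gene and sample names are made up for the example.

```r
# Features as rows and samples as columns: the layout ClassifyR expects.
measurements <- matrix(1:12, nrow = 4,
                       dimnames = list(paste0("gene", 1:4),
                                       paste0("sample", 1:3)))
dim(measurements)

# Samples as rows and features as columns: the layout many classifiers expect.
samplesAsRows <- t(measurements)
dim(samplesAsRows)
```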

PredictParams has three mandatory arguments. The first is a function which takes a built classifier and makes predictions on unseen data. The second, transposeExpression, specifies whether the numeric measurements need to be transposed, as for TrainParams. The third, getClasses, is a function which extracts a vector of predicted class labels from the object returned by the prediction function. In this case, the predict method returns an object which stores predictions in a list element called class.
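
The role of the extractor function can be shown with a mock prediction object. The list below is a hypothetical stand-in for what a predict method might return. Only the element name class comes from the example above, and the scores element is invented.

```r
# Hypothetical stand-in for the object returned by a predict method.
mockPrediction <- list(
  class = factor(c("Poor", "Good", "Good"), levels = c("Poor", "Good")),
  scores = c(0.9, 0.2, 0.4)
)

# The same extractor that is passed as getClasses in the example.
getClasses <- function(result) result[["class"]]
predictedClasses <- getClasses(mockPrediction)
predictedClasses
```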

The top five probes selected in the feature selection step can be checked visually. DEresults is a ClassifyResult object returned by runTests. features is a function that allows access to the row indices that were chosen for each fold.

DEplots <- plotFeatureClasses(ovarSet, features(DEresults)[[1]][[2]][1:5])
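
The double-bracket indexing used above can be tried on a mock object. This sketch assumes the nested layout reported in the printed summary (a list per resample, each holding a list per fold). The row indices themselves are invented.

```r
# Hypothetical selections: 2 resamples, each with 3 folds of row indices.
mockFeatures <- list(
  list(c(12, 7, 301), c(45, 7, 88, 910, 3, 61), c(7, 45)),
  list(c(3, 12), c(88, 45, 7), c(910, 61, 12))
)

# First resample, second fold, first five selected row indices.
firstFive <- mockFeatures[[1]][[2]][1:5]
firstFive
```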