library(knitr)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The FCBF package is an R implementation of the algorithm developed by Yu and Liu (2003): Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution.

The algorithm uses the idea of “predominant correlation”. It selects features that are highly correlated with the target but have little correlation with each other. Instead of the classical Pearson or Spearman correlations, it uses a metric called Symmetrical Uncertainty (SU).

The SU is based on information theory, drawing on the concepts of Shannon entropy and information gain. A detailed description is outside the scope of this vignette, but it is given comprehensively in the Yu and Liu, 2003 article.
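For reference, Yu and Liu (2003) define the symmetrical uncertainty between two variables X and Y as

SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y))

where H is the Shannon entropy and IG(X | Y) = H(X) - H(X | Y) is the information gain. SU is normalized to [0, 1]: 0 indicates that the variables are independent, and 1 indicates that the value of either variable completely predicts the other.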

The algorithm selects features whose correlation with the target is above a given SU threshold. It then detects predominant correlations of features with the target. The correlation of a feature “X” with the target is predominant when no other feature is more correlated with “X” than “X” is with the target.

The features are ranked by their SU with the target, and the iteration starts with X, the best-ranked feature. Features more correlated with X than with the target are removed; features less correlated with X than with the target are kept. The algorithm then proceeds to the second-best-ranked feature in the trimmed list, and so on. Once more, the detailed description is available in Yu and Liu, 2003.

Usage

Discretizing gene expression

The first step is to install FCBF and load the sample data. Single-cell expression data from macrophages is used here as an example of how to apply the method to expression data.

library("FCBF")
library(SummarizedExperiment)
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
# load expression data
data(scDengue)
exprs <- SummarizedExperiment::assay(scDengue,'logcounts')
#> Loading required package: SingleCellExperiment
head(exprs[,1:4])
#>                 1001700604_A13 1001700604_A16 1001700604_A20 1001700604_C21
#> ENSG00000000003       6.593133       5.817558       6.985514       6.736407
#> ENSG00000000419       4.284856       0.000000       5.252389       5.579961
#> ENSG00000000457       0.000000       4.252808       1.135761       0.000000
#> ENSG00000000460       0.000000       0.000000       1.135761       0.000000
#> ENSG00000000971       0.000000       0.000000       0.000000       0.000000
#> ENSG00000001036       6.071789       7.353848       8.072563       7.361245

This normalized single-cell expression data can be used in machine learning (ML) models to obtain classifiers that separate dengue-infected macrophages from control macrophages.

To avoid overfitting and lengthy running times, we select only the features that capture the relations we need.

For FCBF, the gene expression has to be discretized, as the entropy concept behind symmetrical uncertainty is defined for discrete classes.

There is no consensus on the optimal way of discretizing gene expression. The discretize_exprs function simply takes the range of expression (min to max) for each gene and, based on a threshold, assigns the lower part of the range the label “low” and the upper part the label “high”. In this way, all the zeros and low expression values are assigned the label “low” in a gene-dependent manner.
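As a minimal sketch, the discretization can be applied as below (default settings of discretize_exprs assumed; see ?discretize_exprs for the available options):

# discretize each gene's expression range into expression-level bins
discrete_expression <- as.data.frame(discretize_exprs(exprs))
head(discrete_expression[, 1:4])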

Selecting features

FCBF does not target a specific machine learning model. It is, nevertheless, a supervised approach, and requires a target variable as input. In our example, we have two classes: dengue-infected and non-infected macrophages.
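A minimal sketch of building the target vector, assuming the infection status is stored in the colData of scDengue under a column named infection:

# extract the infection status of each cell as the target variable
infection <- SummarizedExperiment::colData(scDengue)$infection
target <- as.factor(infection)
summary(target)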

Now we have a class variable and a discrete, categorical feature table, which means we can run the function fcbf. Run with its defaults, the call below takes a minute or two and then fails.
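A sketch of that call, using the default SU threshold:

# with the default threshold, no feature passes the SU cutoff
# for this dataset, so the call fails
fcbf(discrete_expression, target, verbose = TRUE)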

As you can see, we get an error.

That is because the built-in SU threshold, 0.25, is too high for this data set. You can use the su_plot function to decide for yourself which threshold is best: it shows the distribution of correlations (as calculated from symmetrical uncertainty) between each feature and the target classes.
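A sketch of that call (su_plot is assumed to take the same discrete feature table and target as fcbf):

# plot the distribution of SU values between each feature and the target
su_plot(discrete_expression, target)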

A threshold of 0.05 seems reasonable for this dataset. The threshold changes, of course, the number of features selected: a lower threshold keeps more features, but the extra features are less related to the target variable. Let’s run fcbf again with the new threshold.
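A sketch of the call with the lowered threshold (the threshold argument is named minimum_su here; the exact name may differ between package versions, see ?fcbf):

# select features with an SU against the target of at least 0.05
fcbf_features <- fcbf(discrete_expression, target,
                      minimum_su = 0.05,
                      verbose = TRUE)
head(fcbf_features)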

In the end, this process selects very few variables compared with the original dataset: from 19541 features, we end up with a lean set of 70. Now we should check whether they are actually useful for classification.

First, we will compare two subsets of the original dataset:

- expression of the FCBF-selected genes
- expression of the 100 most variable genes

In each case we use the original, non-discretized expression values; the discretization was only used for FCBF itself.

The variance filter is a quick-and-dirty unsupervised feature selection, but we can use it as a benchmark. (As a side note, many single-cell workflows use variance as a filter.)
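A sketch of building the two subsets (indexing the expression matrix by the rownames of the fcbf result is an assumption about its output format):

# subset the non-discretized matrix to the FCBF-selected genes
exprs_fcbf <- exprs[rownames(fcbf_features), ]

# benchmark subset: the 100 genes with the highest variance
gene_variances <- apply(exprs, 1, var)
top_100_genes <- names(sort(gene_variances, decreasing = TRUE))[1:100]
exprs_top_var <- exprs[top_100_genes, ]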

Next, we create a plan to test the classifiers by 5-fold cross-validation.
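A sketch of that plan using caret (an assumption, as the original chunk is not shown; classProbs and twoClassSummary enable ROC-based metrics, and they require the target levels to be syntactically valid R names):

library(caret)

# 5-fold cross-validation, keeping class probabilities for ROC metrics
fit_control <- trainControl(method = "cv",
                            number = 5,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary)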

Any classifier can be used, but for illustration we will use the very popular radial-kernel SVM. As the metric for comparison, we will use the area under the curve (AUC) of the receiver operating characteristic (ROC) curve:
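A sketch of training both models with caret (method "svmRadial" requires the kernlab package; train expects samples in rows, hence the transposes):

# SVM trained on the FCBF-selected genes
svm_fcbf <- train(x = t(as.matrix(exprs_fcbf)), y = target,
                  method = "svmRadial", metric = "ROC",
                  trControl = fit_control)

# SVM trained on the 100 most variable genes
svm_top_var <- train(x = t(as.matrix(exprs_top_var)), y = target,
                     method = "svmRadial", metric = "ROC",
                     trControl = fit_control)

svm_fcbf$results
svm_top_var$results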

As we can see, the FCBF-selected variables gave rise to a better SVM model than the 100 most variable genes, as measured by the area under the ROC curve.

Notably, running the radial SVM on the full exprs set gives an error of the sort: Error: protect(): “protection stack overflow”.

The multiplicity of infection (MOI) of the cells was relatively low (1 virus/cell), so some cells may not have been infected. In any case, FCBF seems to do a reasonably good job of selecting variables, and all the metrics are better than the “100 most variable” benchmark.

Final remarks

Even though tools for feature selection are available in packages such as FSelector (https://cran.r-project.org/web/packages/FSelector/FSelector.pdf), to date there was no easy R implementation of FCBF. The article describing FCBF has more than 1800 citations, but almost none are from the biomedical community.

We expect that a carefully implemented and documented R version of FCBF will encourage the use of filter-based feature selection approaches that aim to reduce redundancy among selected features. Other tools similar to FCBF are available in Weka (https://www.cs.waikato.ac.nz/ml/weka/) and Python/scikit-learn (http://featureselection.asu.edu/).

For those interested in going deeper, a good recent review is the following:

Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., & Liu, H. (2017). Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6), 94.

We note that other techniques based on predominant correlation might perform better, depending on the dataset and the objectives. FCBF provides an interpretable and robust option, with generally good results.

The application of filter-based feature selection to big data in the biomedical sciences not only has a direct effect on classification efficiency, but may also lead to interesting biological interpretations and the rapid identification of biomarkers.