GSA-Lightning: Ultra-fast Permutation-based Gene Set Analysis

Billy Heung Wing Chang

2016-05-15

1. Introduction

GSA-Lightning is an ultra-fast implementation of permutation-based gene set analysis. Similar to existing methods, GSA-Lightning takes as inputs a gene expression data set and a set of gene sets. The functionality is similar to the GSA algorithm of (Efron 2007), and performs permutation tests using a combined Student-T test statistics. In particular GSA-Lightning retains the “mean”, “absmean”, and the restandardization procedure of GSA. The speed of GSA-Lightning, however, is much faster, particularly when the number of gene sets and the number of permutation are large.

2. Demonstration

2.1 Quick Start

We begin by first loading the GSA-Lightning package.

library(GSALightning)

We now read in a breast cancer expression data set and the patients’ status data. The data set is obtained from The Cancer Genome Atlas (TCGA) consortium (The Cancer Genome Atlas 2012), processed by the Pan-Cancer Project group (Weinstein 2013), and downloaded using the Bioconductor package ELMER (Yao 2015). The gene names have been converted to gene symbols in this data set.

data(expression)
data(sampleInfo)

We next read in the gene sets of interest for our demonstration. This gene sets contain the target genes of 104,636 distal regulatory elements, obtained from the supplementary materials of (Lu 2015).

data(targetGenes)

All necessary material are now ready for running GSA-Lightning. The main function of GSA-Lightning is GSALight(), which performs the permutation-based gene set analysis. Below we will restrict the analysis to gene sets with more than or equal to 8 genes (by setting “minsize = 8”, and consider the “absmean” gene-set test-statistics (by setting “method = ‘absmean’”. 1000 permutations (by setting “nperm = 1000”) will be performed:

GSALightResults <- GSALight(eset = expression, fac = factor(sampleInfo$TN), gs = targetGenes, 
                            nperm = 1000, method = 'absmean', restandardize = FALSE, minsize = 8, 
                            rmGSGenes = 'gene', verbose = TRUE)
## Some genes within the gene sets are not contained in the expression data set.
##  These genes are removed from the gene sets since rmGSGenes == 'gene'.
## After gene set size filtering, there are 1920 gene sets,
##  containing a total of 909 genes for analysis.
## Obtaining observed gene set statistics.
## Running permutation.
## Permutation done. Evaluating P-values.
head(GSALightResults)
##       p-value q-value statistics # genes
## 22863       0       0  10.633783       8
## 22864       0       0  10.633783       8
## 23404       0       0   7.152923      15
## 26610       0       0  11.493681       9
## 28310       0       0   4.520274       8
## 28424       0       0   6.377989       9

Note that in the above the option “rmGSGenes” was set to “TRUE”, and genes without expression measurements were removed from the gene sets. One must be careful whether to perform this automatically using GSALight(); will removing a gene from a gene set or pathway alter the implications of the results? Alternatively, one may also consider removing the gene sets with non-measured expression by setting “rmGSGenes = ‘gs’”.

2.2 Default Number of Permutations

In the previous function call, the number of permutations was set at 1000. In practice this is not enough for producing accurate p-values. By leaving the number of permutation unspecified, GSALight will automatically set the number of permutations to:

(number of gene sets)/0.05 \(\times\) 2

This number of permutations will suffice for accurate p-values estimation, when the significance level is set at 0.05, even after Bonferroni correction.

We now run GSALight using the default number of permutations. To minimize outputs, we set “verbose” to FALSE to suppress outputs from GSALight.

GSALightResults <- GSALight(eset = expression, fac = factor(sampleInfo$TN), gs = targetGenes, 
                            method = 'absmean', restandardize = FALSE, minsize = 8, 
                            rmGSGenes = 'gene', verbose = FALSE)
## Number of permutations is not specified. Automatically set to 76800.

Note that the number of permutation is automatically set to 76800 here, which is 1920/0.05 \(\times\) 2, where 1920 is the number of gene sets with 8 or more target genes (as specified by “minsize = 8”).

We now investigate the p-value distribution:

hist(GSALightResults[,'p-value'], main=NULL, xlab='p-value')

From the histogram of the p-values, we observe that almost all gene sets have small p-values. (Efron 2007) has discussed this problem, and suggested restandardization as a remedial method.

2.3 Restandardization

GSALightning can perform the restandardization method of (Efron 2007) by setting the option “restandardization” to “TRUE”. We now run GSALight with restandardization, and investigate the p-value distribution:

GSALightResultsReStand <- GSALight(eset = expression, fac = factor(sampleInfo$TN), gs = targetGenes, 
                                   method = 'absmean', restandardize = TRUE, minsize = 8, 
                                   rmGSGenes = 'gene', verbose = FALSE)
## Number of permutations is not specified. Automatically set to 76800.
hist(GSALightResultsReStand[,'p-value'], main=NULL, xlab='p-value')

Notice that the p-values are now more evenly distributed compared to the previous results obtained without restandardization.

2.4 Single Gene Permutation Test

A fast implementation of single gene permutation test is also included in the GSA-Lightning package, taking advantage of the fast permutation test implementation used in GSALight(). To run single gene testing, use the permTestLight() function, as follows:

singleGeneAnalysis <- permTestLight(eset = expression, fac = factor(sampleInfo$TN),
                                    nperm = 1000,  method = 'absmean', verbose = TRUE)
## After gene set size filtering, there are 909 gene sets,
##  containing a total of 909 genes for analysis.
## Obtaining observed gene set statistics.
## Running permutation.
## Permutation done. Evaluating P-values.
head(singleGeneAnalysis)
##        p-value    q-value statistics
## RBFOX1   0.051 0.07289151   3.089081
## ABCA12   0.007 0.01146486   2.814674
## ABCA1    0.000 0.00000000  19.457356
## ABCB11   0.000 0.00000000   5.857852
## ABI3BP   0.000 0.00000000  19.473927
## ACOX2    0.070 0.09670213   1.780874

References

BHW Chang and W Tian (2015). GSA-Lightning: Ultra Fast Permutation-based Gene Set Analysis. (Submitted)

B Efron and RJ Tibshirani (2007). “On testing the significance of sets of genes.” The annals of applied statistics 1(1):107-129

The Cancer Genome Atlas (2012) Comprehensive molecular characterization of human colon and rectal cancer. Nature 487:330-337.

Weinstein, John N., et al. “The cancer genome atlas pan-cancer analysis project.” Nature genetics 45.10 (2013): 1113-1120.

L Yao et. al. (2015) Inferring regulatory element landscapes and transcription factor networks from cancer methylomes. Genome Biology 16:105

Lu, Yulan, Yuanpeng Zhou, and Weidong Tian. “Combining Hi-C data with phylogenetic correlation to predict the target genes of distal regulatory elements in human genome.” Nucleic acids research (2013): gkt785.