Using MAST with RNASeq: MAIT Analysis.

Greg Finak, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex Shalek, Chloe K. Schlicter, Hannah W. Miller, M. Juliana McElrath, Martin Prlic, Peter Linsley, Raphael Gottardo

2016-11-22

Overview

We will learn how to use the package MAST to analyze single cell RNAseq experiments.

Starting from a matrix of counts of transcripts (cells by transcripts), we will discuss the preliminary steps of quality control, filtering, and exploratory data analysis. Once we are satisfied that we have high-quality expression, we will consider tests for differential expression and ways to visualize results. It is often helpful to synthesize from gene-level into module-level statements. Therefore, we will learn how to use MAST to test for gene set enrichment.

We will apply these methods to a data set of Mucosal Associated Invariant T cells (MAITs), measured on the Fluidigm C1. Half of the cells have been cytokine stimulated.

library(devtools)
library(BiocInstaller)
useDevel()
biocLite('SummarizedExperiment')
biocValid()
install_github('RGLab/MAST@summarizedExpt')

If you are running this at BioC, you should not need to run this chunk. Otherwise you will need to install the summarizedExpt branch and the devel (3.4) branch of Bioconductor.

suppressPackageStartupMessages({
    library(ggplot2)
    library(GGally)
    library(GSEABase)
    library(limma)
    library(reshape2)
    library(data.table)
    library(knitr)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(stringr)
    library(NMF)
    library(rsvd)
    library(RColorBrewer)
    library(MAST)
})
#options(mc.cores = detectCores() - 1) #if you have multiple cores to spin
options(mc.cores = 1)
knitr::opts_chunk$set(message = FALSE,error = FALSE,warning = FALSE,cache = TRUE,fig.width=8,fig.height=6, auto.dep=TRUE)

Loading and transforming data

freq_expressed <- 0.2
FCTHRESHOLD <- log2(1.5)

First, let’s set some constants that we can use throughout this analysis.

data(maits, package='MAST')
dim(maits$expressionmat)
head(maits$cdat)
head(maits$fdat)

Next, let’s load the data, which consists of a matrix of log2 + 1 transcripts per million (TPM), as output by RSEM. Internally, we use the packages RNASeqPipelineR and preprocessData to facilitate aligning and quantitating the raw reads. This is an experiment on Mucosal Associated Invariant T-cells from a single healthy donor. A number of cells were subjected to reverse transcription and library prep immediately, while others were first stimulated for 24 hours with a cytokine cocktail.

scaRaw <- FromMatrix(t(maits$expressionmat), maits$cdat, maits$fdat)

FromMatrix constructs an object from a matrix (genes \(\times\) cells) of expression, a data.frame of cell-level covariance and a data.frame of feature-level covariates. In this case, there are 96 single cells measured over 16302 genes. We derive from the SummarizedExperiment class, which makes it easy for feature (row) and cell (column) metadata to come along for the ride. There is also a constructor, FromFlatDF for flattened data where measurements, and cell and feature covariates are intermingled.

Exploratory Data Analysis

Let’s explore these data with a heatmap and some PCA.

aheatmap(assay(scaRaw[1:1000,]), labRow='', annCol=as.data.frame(colData(scaRaw)[,c('condition', 'ourfilter')]), distfun='spearman')

set.seed(123)
plotPCA <- function(sca_obj){
    projection <- rpca(t(assay(sca_obj)), retx=TRUE, k=4)$x
    pca <- data.table(projection,  as.data.frame(colData(sca_obj)))
    print(ggpairs(pca, columns=c('PC1', 'PC2', 'PC3', 'libSize', 'PercentToHuman', 'nGeneOn', 'exonRate'),
            mapping=aes(color=condition), upper=list(continuous='blank')))
    invisible(pca)
}

plotPCA(scaRaw)

filterCrit <- with(colData(scaRaw), pastFastqc=="PASS"& exonRate >0.3 & PercentToHuman>0.6 & nGeneOn> 4000)

From the PCA and heatmap we can tell that we’ve got some data quality issues with low-diversity libraries, and libraries where we capture a lot of non mRNA. We’ll set a filtering threshold using the Potter Stewart method, though Shalek et al and others have come up with more principled ways, such as training SVM on the expression data and quality metrics to predict failed libraries. Though this still presupposes a set of labeled “failures” and “successes”.

Filtering

sca <- subset(scaRaw,filterCrit)
eid <- select(TxDb.Hsapiens.UCSC.hg19.knownGene,keys = mcols(sca)$entrez,keytype ="GENEID",columns = c("GENEID","TXNAME"))
ueid <- unique(na.omit(eid)$GENEID)
sca <- sca[mcols(sca)$entrez %in% ueid,]
## Remove invariant genes
sca <- sca[sample(which(freq(sca)>0), 6000),]

We’ll now consider only cells that pass the filter. subset(scaRaw, filterCrit) is just syntactic sugar for subsetting by columns, but who doesn’t like a little sugar? We’ll also limit ourselves to transcripts that have an EntrezGene id to facilitate interpretation. Lastly, we will take a subsample of about 50% of the total genes to keep this vignette from taking too long or too much memory.

Recalculating the cellular detection rate (ngeneson)

We and others have found that the number of genes detected in a sample, which we deemed the cellular detection rate is often the first principal component. After we removed genes low expression, perhaps we want to recalculate it.

cdr2 <-colSums(assay(sca)>0)
qplot(x=cdr2, y=colData(sca)$nGeneOn) + xlab('New CDR') + ylab('Old CDR')

We can assign this to a new column in the colData in the obvious fashion:

colData(sca)$cngeneson <- scale(cdr2)

PCA on filtered cells

plotPCA(sca)

PC1 now primarily captures stimulation effects. We still observe that PC2 correlates with the exonRate and ngeneson, but the data are more smoothly distributed.

Exercises on loading and transforming data

  1. These samples were reverse-transcribed and amplified on two Fluidigm C1 chips (and stimulation is completely confounded with chip…) The chip location is contained in the column wellKey in colData(sca) as a three-character string like CXX. How could you extract the chip location and add it as column to the cellular data? Hint: str_split_fixed or str_extract may be your friends here.

Adaptive thresholding

Single cell gene expression data are known to be zero-inflated and bimodal, which is feature we observe here as well.

scaSample <- sca[sample(which(freq(sca)>.1), 20),]
flat <- as(scaSample, 'data.table')
ggplot(flat, aes(x=value))+geom_density() +facet_wrap(~symbolid, scale='free_y')

Here we’ve taken subsample of size 20, then flattened it into a data.table to make it easy to plot with ggplot2.

thres <- thresholdSCRNACountMatrix(assay(sca), nbins = 20, min_per_bin = 30)
par(mfrow=c(5,4))
plot(thres)

We suspect that the left-most mode corresponds to non-specific hybridization of mRNA or genomic DNA. Here we apply an adaptive scheme to threshold values below a cut-off that depends on the intensity of the signal cluster from the gene (determined from the median expression value). When we plot the threshold vs genes binned by median expression value, we can see this evolution of thresholding value.

assays(sca) <- list(thresh=thres$counts_threshold, tpm=assay(sca))
expressed_genes <- freq(sca) > freq_expressed
sca <- sca[expressed_genes,]

We’ll limit ourselves to genes that are found in at least 0.2 of the sample (since we won’t have any power to conclude much about less frequent transcripts). We will also downsample to 1000 genes to keep this vignette from running too slowly.

Exercises on thresholding

  1. Many modeling procedures assume (or perform best) when the data are Normally distributed. How could we evaluate how thresholding affects the Normality of the data? How about the Normality of the non-zero data?

Differential Expression using a Hurdle model

We’ll fit a hurdle model, modeling the condition and (centered) ngeneson factor, thus adjusting for the cellular detection rate.

In order to have more interpretable coefficients, we’ll set the reference level of the factor to be the “unstimulated” cells.

cond<-factor(colData(sca)$condition)
cond<-relevel(cond,"Unstim")
colData(sca)$condition<-cond
zlmCond <- zlm.SingleCellAssay(~condition + cngeneson, sca)
# The following are equivalent
## lrt <- lrTest(zlm, "condition")
## lrt <- lrTest(zlm, CoefficientHypothesis('conditionStim'))

# This would test if 2*cngeneson=conditionStim
#  This is sheer nonsense biologically and statistically, but gives an example of the flexibility.
## lrt <- lrTest(zlm, Hypothesis('2*cngeneson-conditionStim'))

We could run a likelihood ratio test here, testing for differences when we drop the condition factor. Note that any arbitrary contrast matrix can be tested here, and specified either using a matrix or syntactically. See Hypothesis for details.

#only test the condition coefficient.
summaryCond <- summary(zlmCond, doLRT='conditionStim') 
#print the top 4 genes by contrast using the logFC
print(summaryCond, n=4)
## Fitted zlm with top 4 genes per contrast:
## ( log fold change Z-score )
##  primerid conditionStim cngeneson
##  1327        0.9           3.8*  
##  255252     -5.1           4.1*  
##  285962    -10.0*          0.4   
##  3727       -7.8*          0.5   
##  392490     -1.1          -4.3*  
##  7295        7.7*          0.4   
##  84314      -7.0*          1.7   
##  92014      -3.2           3.7*
## by discrete Z-score
print(summaryCond, n=4, by='D')
## Fitted zlm with top 4 genes per contrast:
## ( Wald Z-scores on discrete )
##  primerid  conditionStim cngeneson
##  100132356   -4.0*          2.2   
##  100506548   -2.6           4.2*  
##  134147      -4.1*          3.1   
##  163702      -3.4           4.5*  
##  285962      -4.3*          1.4   
##  56996       -3.5           4.5*  
##  6275        -4.0*          0.9   
##  6360        -3.1           4.4*
## by continuous Z-score
print(summaryCond, n=4, by='C')
## Fitted zlm with top 4 genes per contrast:
## ( Wald Z-scores tests on continuous )
##  primerid conditionStim cngeneson
##  152926      6.0*         -2.9   
##  3727       -5.9*          0.4   
##  514         3.1          -4.4*  
##  54964       1.9          -4.0*  
##  6232       -6.1*         -0.7   
##  6482        3.9          -5.8*  
##  7295        5.9*         -1.2   
##  9459        3.2          -4.0*

But often of more general use is this delicious syntactic sugar to make a giant data.table containing coefficients, standard errors, etc, for the various model components. Many Bothan spies died so that we could pretty-print this summary of the top differentially expressed genes.

Strip off the $datatable component to stop this pretty-printing (or call print.default.)

summaryDt <- summaryCond$datatable
fcHurdle <- merge(summaryDt[contrast=='conditionStim' & component=='H',.(primerid, `Pr(>Chisq)`)], #hurdle P values
                      summaryDt[contrast=='conditionStim' & component=='logFC', .(primerid, coef, ci.hi, ci.lo)], by='primerid') #logFC coefficients

fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
##        primerid   Pr(>Chisq)        coef       ci.hi      ci.lo
##    1: 100093630 0.5684025059 -0.48891879  2.60232594 -3.5801635
##    2:     10010 0.2222896558  3.14003456  6.32218131 -0.0421122
##    3: 100128025 0.0002161915 -0.95005072 -0.38236718 -1.5177343
##    4: 100128398 0.0030936353 -1.35056067 -0.59158500 -2.1095363
##    5: 100128675 0.0425128635 -1.06436962  0.03506014 -2.1637994
##   ---                                                          
## 2382:      9976 0.4024856400  0.08534798  2.49549736 -2.3248014
## 2383:       998 0.2586231745  1.98393025  5.15074361 -1.1828831
## 2384:      9984 0.3510295986 -1.45791254  0.83162505 -3.7474501
## 2385:      9988 0.4072016324  0.71706845  2.53021858 -1.0960817
## 2386:      9991 0.7754479104 -0.37921891  1.84353135 -2.6019692
##               fdr
##    1: 0.699076484
##    2: 0.404289303
##    3: 0.007265252
##    4: 0.033617062
##    5: 0.143068677
##   ---            
## 2382: 0.576429014
## 2383: 0.442830273
## 2384: 0.533150290
## 2385: 0.580743033
## 2386: 0.839482175
fcHurdleSig <- merge(fcHurdle[fdr<.05 & abs(coef)>FCTHRESHOLD], as.data.table(mcols(sca)), by='primerid')
setorder(fcHurdleSig, fdr)

We see that there are 264 genes significant at a FDR of 5% and with a log-fold change larger than 0.6.

Visualization of 50 most differentially expressed genes

entrez_to_plot <- fcHurdleSig[1:50,primerid]
symbols_to_plot <- fcHurdleSig[1:50,symbolid]
flat_dat <- as(sca[entrez_to_plot,], 'data.table')
ggbase <- ggplot(flat_dat, aes(x=condition, y=thresh,color=condition)) + geom_jitter()+facet_wrap(~symbolid, scale='free_y')+ggtitle("DE Genes in Activated MAIT Cells")
ggbase+geom_violin() 

Here is a standard violin plot showing how expression differs in each gene between stimulated and unstimulated cells. What could be improved about this visualization to make it more accurately reflect the model that’s being fit?

flat_dat[,lmPred:=lm(thresh~cngeneson + condition)$fitted, key=symbolid]
##       primerid entrez symbolid               wellKey condition nGeneOn
##    1:    83451  83451   ABHD11   1-MAIT-Stim-C05_S91      Stim    8805
##    2:    83451  83451   ABHD11   1-MAIT-Stim-C07_S70      Stim    9184
##    3:    83451  83451   ABHD11   1-MAIT-Stim-C08_S68      Stim    8895
##    4:    83451  83451   ABHD11   1-MAIT-Stim-C10_S95      Stim    8000
##    5:    83451  83451   ABHD11   1-MAIT-Stim-C11_S92      Stim    9453
##   ---                                                                 
## 3646:   285962 285962 WEE2-AS1 1-MAIT-Unstim-C80_S38    Unstim    7360
## 3647:   285962 285962 WEE2-AS1 1-MAIT-Unstim-C83_S63    Unstim    7534
## 3648:   285962 285962 WEE2-AS1 1-MAIT-Unstim-C86_S39    Unstim    7480
## 3649:   285962 285962 WEE2-AS1 1-MAIT-Unstim-C93_S33    Unstim    6326
## 3650:   285962 285962 WEE2-AS1 1-MAIT-Unstim-C96_S60    Unstim    7336
##        libSize PercentToHuman MedianCVCoverage PCRDuplicate exonRate
##    1: 10055037          0.896         0.938330     0.756202    0.651
##    2: 13822410          0.896         0.687973     0.788929    0.764
##    3: 12122964          0.891         0.947196     0.799650    0.706
##    4: 12199441          0.893         0.992205     0.821164    0.758
##    5: 11497351          0.899         0.726385     0.735270    0.667
##   ---                                                               
## 3646: 11561229          0.858         1.843665     0.874829    0.555
## 3647: 11953460          0.860         1.989806     0.878499    0.476
## 3648: 10401089          0.850         2.222974     0.873902    0.470
## 3649:  5433069          0.814         2.555629     0.869408    0.476
## 3650:  6412166          0.832         2.509249     0.858618    0.515
##       pastFastqc ncells  ngeneson  cngeneson     TRAV1     TRBV6 TRBV4
##    1:       PASS      1 0.2764538  0.9510350 10.390061  0.000000     0
##    2:       PASS      1 0.2882285  1.3203917  7.639777  0.000000     0
##    3:       PASS      1 0.2789186  0.9870698  8.657032  0.000000     0
##    4:       PASS      1 0.2507850  0.3767303        NA        NA    NA
##    5:       PASS      1 0.2964864  1.5095745  0.000000  6.478740     0
##   ---                                                                 
## 3646:       PASS      1 0.2308779 -0.4272963  8.432552 10.821170     0
## 3647:       PASS      1 0.2363257 -0.1164961  0.000000  7.203629     0
## 3648:       PASS      1 0.2345202 -0.2043309        NA        NA    NA
## 3649:       PASS      1 0.1985525 -1.1885315        NA        NA    NA
## 3650:       PASS      1 0.2301557 -0.2899136  0.000000  7.575067     0
##       TRBV20 alpha  beta           ac           bc ourfilter   thresh
##    1:      0 TRAV1    NA   TRAV1.Stim           NA      TRUE 3.571677
##    2:      0 TRAV1    NA   TRAV1.Stim           NA      TRUE 5.163096
##    3:      0 other    NA   other.Stim           NA      TRUE 2.784504
##    4:     NA    NA    NA           NA           NA      TRUE 0.000000
##    5:      0    NA TRBV6           NA   TRBV6.Stim      TRUE 2.283922
##   ---                                                                
## 3646:      0 TRAV1 TRBV6 TRAV1.Unstim TRBV6.Unstim      TRUE 8.508547
## 3647:      0    NA other           NA other.Unstim      TRUE 8.147917
## 3648:     NA    NA    NA           NA           NA      TRUE 9.664127
## 3649:     NA    NA    NA           NA           NA      TRUE 6.124742
## 3650:      0    NA TRBV6           NA TRBV6.Unstim      TRUE 3.357552
##             tpm   lmPred
##    1: 3.5716768 2.119838
##    2: 5.1630961 2.567200
##    3: 2.7845040 2.163483
##    4: 0.2016339 1.424244
##    5: 2.2839218 2.796337
##   ---                   
## 3646: 8.5085475 6.615504
## 3647: 8.1479172 6.918809
## 3648: 9.6641272 6.833092
## 3649: 6.1247417 5.872626
## 3650: 3.3575520 6.749574
ggbase +aes(x=cngeneson) + geom_line(aes(y=lmPred), lty=1) + xlab('Standardized Cellular Detection Rate')