The Swish method

The swish method for differential expression analysis of RNA-seq data using inferential replicate counts is described in the following reference: Zhu et al. (2019) - doi: 10.1093/nar/gkz622.

We note that swish extends and builds on another method, SAMseq (Li and Tibshirani 2011), implemented in the samr package, by taking into account inferential uncertainty, and allowing to control for batch effects and matched samples. Additionally, swish has methods for testing changes in effect size across secondary covariates, which we refer to as “interactions”. swish calls functions from the qvalue (Storey and Tibshirani 2003) or samr package for calculation of local FDR and q-value. This vignette gives an example of differential analysis of matched samples, and an interaction test for matched samples, to see if a condition effect changes in magnitude across two groups of samples.

Quick start

The following lines of code will perform a basic transcript-level swish two group analysis. For more details, read on.

The results can be found in mcols(y). For example, one can calculate the number of genes passing a 5% FDR threshold:

One can at any point remove the genes that didn’t pass the expression filter with the following line of code (can be run before or after swish). These genes are ignored by swish, and so will have NA in the results columns in mcols(y).

A gene-level analysis looks identical to a transcript-level analysis, only the input data changes. Examples follow.

Lastly, what is the structure of the output of tximeta (Love et al. 2019), which is used in swish? See the section below, Structure of tximeta output / swish input.

Macrophage stimulation experiment

We begin the fishpond vignette by loading data from a Bioconductor Experiment Data package, macrophage. The package contains RNA-seq quantification from 24 RNA-seq samples, which are a subset of the RNA-seq samples generated and analyzed by Alasoo et al. (2018) - doi: 10.1038/s41588-018-0046-7.

The experiment involved treatment of macrophage cell lines from a number of human donors with IFN gamma, Salmonella infection, or both treatments combined. In the beginning of this vignette, we will focus on comparing the IFN gamma stimulated cell lines with the control cell lines, accounting for the paired nature of the data (cells from the same donor). Later in the vignette we will analyze differences in the Salmonella infection response by IFN gamma treatment status – whether the cells are primed for immune response.

We load the package, and point to the extdata directory. For a typical analysis, the user would just point dir to the location on the machine or cluster where the transcript quantifications are stored (e.g. the quant.sf files).

The data was quantified using Salmon (Patro et al. 2017) 0.12.0 against the Gencode v29 human reference transcripts (Frankish, GENCODE-consoritum, and Flicek 2018). For more details and all code used for quantification, refer to the macrophage package vignette.

Importantly, --numGibbsSamples 20 was used to generate 20 inferential replicates with Salmon’s Gibbs sampling procedure. Inferential replicates, either from Gibbs sampling or bootstrapping of reads, are required for the swish method shown below. We also recommend to use --gcBias when running Salmon to protect against common sample-specific biases present in RNA-seq data.

Data import

Read in the column data from CSV

We start by reading in a CSV with the column data, that is, information about the samples, which are represented as columns of the SummarizedExperiment object we will construct containing the counts of reads per gene or transcript.

##            names sample_id line_id replicate condition_name
## 1 SAMEA103885102    diku_A  diku_1         1          naive
## 2 SAMEA103885347    diku_B  diku_1         1           IFNg
## 3 SAMEA103885043    diku_C  diku_1         1         SL1344
## 4 SAMEA103885392    diku_D  diku_1         1    IFNg_SL1344
## 5 SAMEA103885182    eiwy_A  eiwy_1         1          naive
## 6 SAMEA103885136    eiwy_B  eiwy_1         1           IFNg
##   macrophage_harvest salmonella_date ng_ul_mean rna_extraction rna_submit
## 1          11/6/2015      11/13/2015   293.9625     11/30/2015  12/9/2015
## 2          11/6/2015      11/13/2015   293.9625     11/30/2015  12/9/2015
## 3          11/6/2015      11/13/2015   293.9625     11/30/2015  12/9/2015
## 4          11/6/2015      11/13/2015   293.9625     11/30/2015  12/9/2015
## 5         11/25/2015       12/2/2015   193.5450      12/3/2015  12/9/2015
## 6         11/25/2015       12/2/2015   193.5450      12/3/2015  12/9/2015
##          library_pool chemistry rna_auto
## 1 3226_RNAauto-091215   V4_auto        1
## 2 3226_RNAauto-091215   V4_auto        1
## 3 3226_RNAauto-091215   V4_auto        1
## 4 3226_RNAauto-091215   V4_auto        1
## 5 3226_RNAauto-091215   V4_auto        1
## 6 3226_RNAauto-091215   V4_auto        1

We will subset to certain columns of interest, and re-name them for later.

Add a column pointing to your files

coldata needs to have a column files which specifies the path to the quantification files. In this case, we’ve gzipped the quantification files, so we point to the quant.sf.gz file. We make sure that all the files exist in the location we specified.

## [1] TRUE

Read in quants with tximeta

We will read in quantification data for some of the samples. First we load the SummarizedExperiment package. We will store out data and the output of the statistical method in a SummarizedExperiment object. We use the tximeta (Love et al. 2019) package to read in the data:

We load in the quantification data with tximeta:

We can see that all the assays have been loaded:

##  [1] "counts"    "abundance" "length"    "infRep1"   "infRep2"  
##  [6] "infRep3"   "infRep4"   "infRep5"   "infRep6"   "infRep7"  
## [11] "infRep8"   "infRep9"   "infRep10"  "infRep11"  "infRep12" 
## [16] "infRep13"  "infRep14"  "infRep15"  "infRep16"  "infRep17" 
## [21] "infRep18"  "infRep19"  "infRep20"

tximeta loads transcript-level data, although it can later be summarized to the gene levels:

## [1] "ENST00000456328.2" "ENST00000450305.2" "ENST00000488147.1"
## [4] "ENST00000619216.1" "ENST00000473358.1" "ENST00000469289.1"

We will rename our SummarizedExperiment y for the statistical analysis. For speed of the vignette, we subset to the transcripts on chromosome 1.

Two demonstrate a two group comparison, we subset to the “naive” and “IFNg” groups.

Differential transcript expression

Running Swish at the transcript level

Running swish has three steps: scaling the inferential replicates, labeling the rows with sufficient counts for running differential expression, and then calculating the statistics. As swish makes use of pseudo-random number generation in breaking ties and in calculating permutations (through samr), to obtain identical results, one needs to set a random seed before running swish(), as we do below:

A note about labelKeep: by default we keep features with minN=3 samples with a minimal count of 10. For scRNA-seq data with de-duplicated UMI counts, we recommend to lower the count, e.g. a count of 3, across a higher number of minN cells, depending on the number of cells being compared. You can also set x="condition" when running labelKeep which will use the condition variable to set minN.

The results are stored in mcols(y). We will show below how to pull out the top up- and down-regulated transcripts.

We can see how many transcripts are in a 5% FDR set:

##  5081  1329

Plotting results

We can check the distribution of p-values. This looks as expected for a comparison where we expect many transcripts will be affected by the treatment (IFNg stimulation of macrophage cells). There is a flat component and then an enrichment of transcripts with p-values near 0.

Of the transcripts in this set, which have the most extreme log2 fold change? Note that often many transcripts will share the same q-value, so it’s valuable to look at the log2 fold change as well (see further note below on q-value computation). The log2 fold change computed by swish is the median over inferential replicates, and uses a pseudo-count of 5 on the scaled counts, to stabilize the variance on the fold change from division by small counts. Here we make two vectors that give the significant genes with the lowest (most negative) and highest (most positive) log fold changes.

##        sign.lfc
## sig       -1    0    1
##   FALSE 2827    2 2252
##   TRUE   616    0  713

Here we print a small table with just the calculated statistics for the large positive log fold change transcripts (up-regulation):

##  [1] "tx_id"     "gene_id"   "tx_name"   "log10mean" "keep"     
##  [6] "stat"      "log2FC"    "pvalue"    "locfdr"    "qvalue"
##                   log10mean log2FC  pvalue   qvalue
## ENST00000370459.7      3.85  10.27 5.2e-06 5.14e-05
## ENST00000355754.6      4.46   9.80 5.2e-06 5.14e-05
## ENST00000481145.1      3.41   8.78 5.2e-06 5.14e-05
## ENST00000443807.1      3.35   8.37 5.2e-06 5.14e-05
## ENST00000370473.4      4.87   8.11 5.2e-06 5.14e-05
## ENST00000368042.7      3.44   7.98 5.2e-06 5.14e-05

Likewise for the largest negative log fold change transcripts (down-regulation):

##                   log10mean log2FC   pvalue   qvalue
## ENST00000649724.1      2.83  -6.26 9.75e-03 4.83e-02
## ENST00000305352.6      2.90  -5.54 5.20e-06 5.14e-05
## ENST00000348581.9      2.19  -4.33 5.20e-06 5.14e-05
## ENST00000451341.1      1.73  -4.11 5.20e-06 5.14e-05
## ENST00000393688.7      2.18  -3.86 7.71e-03 4.27e-02
## ENST00000610222.2      2.07  -3.45 4.71e-03 3.48e-02

We can plot the scaled counts for the inferential replicates, and also group the samples by a covariate, in this case the cell line. The analysis was paired, so the statistic assessed if the change within pairs was consistent. Here we plot the 100th top up-regulated transcript:

We can make an MA plot, where the transcripts in our FDR set are colored: