4. RNA-Seq expression example, components of the airway dataset

Let’s look at the airway dataset as an example of a typical small-scale RNA-Seq experiment. In this experiment, four Airway Smooth Muscle (ASM) cell lines are treated with the asthma medication dexamethasone.

The limma voom function will be used to assign precision weights, then the result converted to a weitrix.

library(weitrix)
library(EnsDb.Hsapiens.v86)
library(edgeR)
library(limma)
library(reshape2)
library(tidyverse)
library(airway)

set.seed(1234)

# BiocParallel supports multiple backends. 
# If the default hangs or errors, try others.
BiocParallel::register( BiocParallel::SnowParam() )

# The most reliable backed is to use serial processing
#BiocParallel::register( BiocParallel::SerialParam() )
data("airway")
airway
## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

Initial processing

Initial steps are the same as for a differential expression analysis.

counts <- assay(airway,"counts")

design <- model.matrix(~ dex + cell, data=colData(airway))

good <- filterByExpr(counts, design=design) 
table(good)
## good
## FALSE  TRUE 
## 47242 16860
airway_elist <- 
    DGEList(counts[good,]) %>%
    calcNormFactors() %>%
    voom(design, plot=TRUE)

There are many possible variations on this:

Conversion to weitrix

## class: SummarizedExperiment 
## dim: 16860 8 
## metadata(1): weitrix
## assays(2): x weights
## rownames(16860): ENSG00000000003 ENSG00000000419 ... ENSG00000273487 ENSG00000273488
## rowData names(2): gene_name gene_biotype
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

Exploration

RNA-Seq expression is well trodden ground. The main contribution of the weitrix package here is to aid exploration by discovering components of variation, providing not just column scores but row loadings and respecting precision weights.

Find components of variation

This will find various numbers of components, from 1 to 6. In each case, the components discovered have varimax rotation applied to their gene loadings to aid interpretability. The result is a list of Components objects.

## [[1]]
## Components are: (Intercept), C1 
## $row :  16860 x 2 matrix
## $col :  8 x 2 matrix
## $R2  :  0.4133809 
## 
## [[2]]
## Components are: (Intercept), C1, C2 
## $row :  16860 x 3 matrix
## $col :  8 x 3 matrix
## $R2  :  0.6561205 
## 
## [[3]]
## Components are: (Intercept), C1, C2, C3 
## $row :  16860 x 4 matrix
## $col :  8 x 4 matrix
## $R2  :  0.8174942 
## 
## [[4]]
## Components are: (Intercept), C1, C2, C3, C4 
## $row :  16860 x 5 matrix
## $col :  8 x 5 matrix
## $R2  :  0.9122109 
## 
## [[5]]
## Components are: (Intercept), C1, C2, C3, C4, C5 
## $row :  16860 x 6 matrix
## $col :  8 x 6 matrix
## $R2  :  0.9542232 
## 
## [[6]]
## Components are: (Intercept), C1, C2, C3, C4, C5, C6 
## $row :  16860 x 7 matrix
## $col :  8 x 7 matrix
## $R2  :  0.9800651

We can compare the proportion of variation explained to what would be explained in a completely random weitrix. Random normally distributed values are generated with variances equal to one over the weights.

Examining components

Up to 4 components may be justified.

Without varimax rotation, components may be harder to interpret

If varimax rotation isn’t used, weitrix_components and weitrix_components_seq will produce a Principal Components Analysis, with components ordered from most to least variance explained.

Without varimax rotation the treatment effect is still mostly in the first component, but has also leaked a small amount into the other components.

col can potentially be used as a design matrix with limma

If you’re not sure of the experimental design, for example the exact timing of a time series or how evenly a drug treatment was applied, the extracted component might actually be more accurate.

Note that this ignores uncertainty about the col matrix itself.

This may be useful for hypothesis generation – finding some potentially interesting genes, while discounting noisy or lowly expressed genes – but don’t use it as proof of significance.

## [1] 6.531096
## [1] 1.036364
##                 gene_name   gene_biotype     logFC  AveExpr         t      P.Value    adj.P.Val        B
## ENSG00000101347    SAMHD1 protein_coding  6.427453 8.135257  46.07562 1.599481e-12 2.696724e-08 18.63745
## ENSG00000189221      MAOA protein_coding  5.584493 5.950559  35.38160 1.955539e-11 1.648520e-07 16.61574
## ENSG00000139132      FGD4 protein_coding  3.721531 5.415341  32.67051 4.157001e-11 1.653757e-07 16.00037
## ENSG00000120129     DUSP1 protein_coding  4.986316 6.644455  31.92671 5.167966e-11 1.653757e-07 15.85715
## ENSG00000178695    KCTD12 protein_coding -4.295197 6.451725 -31.49058 5.885257e-11 1.653757e-07 15.72681
## ENSG00000134243     SORT1 protein_coding  3.567966 7.569410  28.91065 1.318965e-10 2.223775e-07 15.02324
## ENSG00000179094      PER1 protein_coding  5.370151 4.420422  29.44179 1.110809e-10 2.080915e-07 14.86859
## ENSG00000165995    CACNB2 protein_coding  5.590915 3.682244  30.23761 8.635872e-11 1.820010e-07 14.75178
## ENSG00000157214    STEAP2 protein_coding  3.413301 6.790009  27.49205 2.119535e-10 2.748874e-07 14.58748
## ENSG00000122035   RASL11A protein_coding  4.051210 4.886461  27.65426 2.005232e-10 2.748874e-07 14.54112

You might also consider using my topconfects package. This will find the largest confident effect sizes, while still correcting for multiple testing.

## $table
##  rank index confect effect    AveExpr    name            gene_name     gene_biotype        
##   1   10918  7.848  13.528918 -1.4798754 ENSG00000179593 ALOX15B       protein_coding      
##   2    3029  6.026  11.838735  1.3807218 ENSG00000109906 ZBTB16        protein_coding      
##   3    8636  5.927   7.810468  3.2401058 ENSG00000163884 KLF15         protein_coding      
##   4    9976  5.927   9.116726  0.4626392 ENSG00000171819 ANGPTL7       protein_coding      
##   5    7468  5.698   7.717228  4.1669039 ENSG00000152583 SPARCL1       protein_coding      
##   6    2066  5.415   6.427453  8.1352566 ENSG00000101347 SAMHD1        protein_coding      
##   7    9458  5.184   8.196551  0.8379382 ENSG00000168309 FAM107A       protein_coding      
##   8    4755  5.083   8.844567  1.6193300 ENSG00000127954 STEAP4        protein_coding      
##   9    8391 -4.869  -6.306789  3.5709602 ENSG00000162692 VCAM1         protein_coding      
##  10   15487  4.556  10.181240 -0.7521822 ENSG00000250978 RP11-357D18.1 processed_transcript
## ...
## 5420 of 16860 non-zero log2 fold change at FDR 0.05
## Prior df 6.5