1 Basics

1.1 Install recount

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. recount is a R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install recount by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("recount")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

1.2 Required knowledge

recount is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. That is, packages like GenomicFeatures and rtracklayer that allow you to import the data. A recount user is not expected to deal with those packages directly but will need to be familiar with SummarizedExperiment to understand the results recount generates. It might also prove to be highly beneficial to check the

  • DESeq2 package for performing differential expression analysis with the RangedSummarizedExperiment objects,
  • DEFormats package for switching the objects to those used by other differential expression packages such as edgeR,
  • derfinder package for performing annotation-agnostic differential expression analysis.

If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.

1.3 Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the recount tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

1.4 Workflow using recount

We have written a workflow on how to use recount that explains more details on how to use this package with other Bioconductor packages as well as the details on the actual counts provided by recount. Check it at f1000research.com/articles/6-1558/v1 or bioconductor.org/help/workflows/recountWorkflow/ (Collado-Torres, Nellore, and Jaffe, 2017).

1.5 Citing recount

We hope that recount will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation('recount')
## 
## Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen
## KD, Jaffe AE, Langmead B, Leek JT (2017). "Reproducible RNA-seq
## analysis using recount2." _Nature Biotechnology_. doi:
## 10.1038/nbt.3838 (URL: https://doi.org/10.1038/nbt.3838), <URL:
## http://www.nature.com/nbt/journal/v35/n4/full/nbt.3838.html>.
## 
## Collado-Torres L, Nellore A, Jaffe AE (2017). "recount workflow:
## Accessing over 70,000 human RNA-seq samples with Bioconductor
## [version 1; referees: 1 approved, 2 approved with reservations]."
## _F1000Research_. doi: 10.12688/f1000research.12223.1 (URL:
## https://doi.org/10.12688/f1000research.12223.1), <URL:
## https://f1000research.com/articles/6-1558/v1>.
## 
## Ellis SE, Collado-Torres L, Jaffe AE, Leek JT (2018). "Improving the
## value of public RNA-seq expression data by phenotype prediction."
## _Nucl. Acids Res._. doi: 10.1093/nar/gky102 (URL:
## https://doi.org/10.1093/nar/gky102), <URL:
## https://doi.org/10.1093/nar/gky102>.
## 
## Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen
## KD, Jaffe AE, Langmead B, Leek JT (2019). _Explore and download data
## from the recount project_. doi: 10.18129/B9.bioc.recount (URL:
## https://doi.org/10.18129/B9.bioc.recount),
## https://github.com/leekgroup/recount - R package version 1.12.1,
## <URL: http://www.bioconductor.org/packages/recount>.
## 
## Frazee AC, Langmead B, Leek JT (2011). "ReCount: A multi-experiment
## resource of analysis-ready RNA-seq gene count datasets." _BMC
## Bioinformatics_. doi: 10.1186/1471-2105-12-449 (URL:
## https://doi.org/10.1186/1471-2105-12-449), <URL:
## https://doi.org/10.1186/1471-2105-12-449>.
## 
## Razmara A, Ellis SE, Sokolowski DJ, Davis S, Wilson MD, Leek JT,
## Jaffe AE, Collado-Torres L (2019). "recount-brain: a curated
## repository of human brain RNA-seq datasets metadata." _bioRxiv_.
## doi: 10.1101/618025 (URL: https://doi.org/10.1101/618025), <URL:
## https://doi.org/10.1101/618025>.
## 
## Imada E, Sanchez DF, Collado-Torres L, Wilks C, Matam T, Dinalankara
## W, Stupnikov A, Lobo-Pereira F, Yip C, Yasuzawa K, Kondo N, Itoh M,
## Suzuki H, Kasukawa T, Hon CC, de Hoon M, Shin JW, Carninci P,
## FANTOM-consortium, Jaffe AE, Leek JT, Favorov A, Franco GR, Langmead
## B, Marchionni L (2019). "Recounting the FANTOM Cage Associated
## Transcriptome." _bioRxiv_. doi: 10.1101/659490 (URL:
## https://doi.org/10.1101/659490), <URL:
## https://doi.org/10.1101/659490>.
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

2 Quick start to using to recount

Main updates:

  • As of January 30, 2017 the annotation used for the exon and gene counts is Gencode v25.
  • As of January 12, 2018 transcripts counts are available via recount2 thanks to the work of Fu et al, bioRxiv, 2018. Disjoint exon counts (version 2) were also released as described in detail in the recount website documentation tab.
  • As of April 29, 2019 FANTOM-CAT/recount2 annotation quantifications area available via recount2 thanks to the work by Imada, Sanchez, et al., bioRxiv, 2019.

Here is a very quick example of how to download a RangedSummarizedExperiment object with the gene counts for a 2 groups project (12 samples) with SRA study id SRP009615 using the recount package (Collado-Torres, Nellore, Kammers, Ellis, et al., 2017). The RangedSummarizedExperiment object is defined in the SummarizedExperiment (Morgan, Obenchain, Hester, and Pagès, 2017) package and can be used for differential expression analysis with different packages. Here we show how to use DESeq2 (Love, Huber, and Anders, 2014) to perform the differential expresion analysis.

This quick analysis is explained in more detail later on in this document. Further information about the recount project can be found in the main publication. Check the recount website for related publications.

## Load library
library('recount')

## Find a project of interest
project_info <- abstract_search('GSE32465')

## Download the gene-level RangedSummarizedExperiment data
download_study(project_info$project)

## Load the data
load(file.path(project_info$project, 'rse_gene.Rdata'))

## Browse the project at SRA
browse_study(project_info$project)

## View GEO ids
colData(rse_gene)$geo_accession

## Extract the sample characteristics
geochar <- lapply(split(colData(rse_gene), seq_len(nrow(colData(rse_gene)))), geo_characteristics)

## Note that the information for this study is a little inconsistent, so we
## have to fix it.
geochar <- do.call(rbind, lapply(geochar, function(x) {
    if('cells' %in% colnames(x)) {
        colnames(x)[colnames(x) == 'cells'] <- 'cell.line'
        return(x)
    } else {
        return(x)
    }
}))

## We can now define some sample information to use
sample_info <- data.frame(
    run = colData(rse_gene)$run,
    group = ifelse(grepl('uninduced', colData(rse_gene)$title), 'uninduced', 'induced'),
    gene_target = sapply(colData(rse_gene)$title, function(x) { strsplit(strsplit(x,
        'targeting ')[[1]][2], ' gene')[[1]][1] }),
    cell.line = geochar$cell.line
)

## Scale counts by taking into account the total coverage per sample
rse <- scale_counts(rse_gene)

## Add sample information for DE analysis
colData(rse)$group <- sample_info$group
colData(rse)$gene_target <- sample_info$gene_target

## Perform differential expression analysis with DESeq2
library('DESeq2')

## Specify design and switch to DESeq2 format
dds <- DESeqDataSet(rse, ~ gene_target + group)

## Perform DE analysis
dds <- DESeq(dds, test = 'LRT', reduced = ~ gene_target, fitType = 'local')
res <- results(dds)

## Explore results
plotMA(res, main="DESeq2 results for SRP009615")

## Make a report with the results
library('regionReport')
DESeq2Report(dds, res = res, project = 'SRP009615',
    intgroup = c('group', 'gene_target'), outdir = '.',
    output = 'SRP009615-results')

The recount project also hosts the necessary data to perform annotation-agnostic differential expression analyses with derfinder (Collado-Torres, Nellore, Frazee, Wilks, et al., 2017). An example analysis would like this:

## Define expressed regions for study SRP009615, only for chromosome Y
regions <- expressed_regions('SRP009615', 'chrY', cutoff = 5L, 
    maxClusterGap = 3000L)

## Compute coverage matrix for study SRP009615, only for chromosome Y
system.time( rse_ER <- coverage_matrix('SRP009615', 'chrY', regions) )

## Round the coverage matrix to integers
covMat <- round(assays(rse_ER)$counts, 0)

## Get phenotype data for study SRP009615
pheno <- colData(rse_ER)

## Complete the phenotype table with the data we got from GEO
m <- match(pheno$run, sample_info$run)
pheno <- cbind(pheno, sample_info[m, 2:3])

## Build a DESeqDataSet
dds_ers <- DESeqDataSetFromMatrix(countData = covMat, colData = pheno,
    design =  ~ gene_target + group)

## Perform differential expression analysis with DESeq2 at the ER-level
dds_ers <- DESeq(dds_ers, test = 'LRT', reduced = ~ gene_target,
    fitType = 'local')
res_ers <- results(dds_ers)

## Explore results
plotMA(res_ers, main="DESeq2 results for SRP009615 (ER-level, chrY)")

## Create a more extensive exploratory report
DESeq2Report(dds_ers, res = res_ers,
    project = 'SRP009615 (ER-level, chrY)',
    intgroup = c('group', 'gene_target'), outdir = '.',
    output = 'SRP009615-results-ER-level-chrY')

3 Introduction

recount is an R package that provides an interface to the recount project website. This package allows you to download the files from the recount project and has helper functions for getting you started with differential expression analyses. This vignette will walk you through an example.

4 Sample DE analysis

This is a brief overview of what you can do with recount. In this particular example we will download data from the SRP009615 study which sequenced 12 samples as described in the previous link.

If you don’t have recount installed, please do so with:

install.packages("BiocManager")
BiocManager::install("recount")

Next we load the required packages. Loading recount will load the required dependencies.

## Load recount R package
library('recount')

Lets say that we don’t know the actual SRA accession number for this study but we do know a particular term which will help us identify it. If that’s the case, we can use the abstract_search() function to identify the study of interest as shown below.

## Find a project of interest
project_info <- abstract_search('GSE32465')

## Explore info
project_info
##     number_samples species
## 340             12   human
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               abstract
## 340 Summary: K562-shX cells are made in an effort to validate TFBS data and ChIP-seq antibodies in Myers lab (GSE32465). K562 cells are transduced with lentiviral vector having Tet-inducible shRNA targeting a transcription factor gene. Cells with stable integration of shRNA constructs are selected using puromycin in growth media. Doxycycline is added to the growth media to induce the expression of shRNA and a red fluorescent protein marker.  A successful shRNA cell line shows at least a 70% reduction in expression of the target transcription factor as measured by qPCR. For identification, we designated these cell lines as K562-shX, where X is the transcription factor targeted by shRNA and K562 denotes the parent cell line.  For example, K562-shATF3 cells are K562 derived cells selected for stable integration of shRNA targeting the transcription factor ATF3 gene and showed at least a 70% reduction in the expression of ATF3 gene when measured by qPCR. Cells growing without doxycycline (uninduced) are used as a control to measure the change in expression of target transcription factor gene after induction of shRNA using doxycycline. For detailed growth and culturing protocols for these cells please refer to http://hudsonalpha.org/myers-lab/protocols . To identify the potential downstream targets of the candidate transcription factor, analyze the mRNA expression profile of the uninduced and induced K562-shX using RNA-seq.    For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf    Overall Design: Make K562-shX cells as described in the http://hudsonalpha.org/myers-lab/protocols . Measure the mRNA expression levels in uninduced K562-shX and induced K562-shX cells in two biological replicates using RNA-seq. Identify the potential downstream targets of the candidate transcription factor.
##       project
## 340 SRP009615

4.1 Download data

Now that we have a study that we are interested in, we can download the RangedSummarizedExperiment object (see SummarizedExperiment) with the data summarized at the gene level. The function download_study() helps us do this. If you are interested on how the annotation was defined, check reproduce_ranges() described in the Annotation section further down.

## Download the gene-level RangedSummarizedExperiment data
download_study(project_info$project)
## 2019-11-06 22:17:32 downloading file rse_gene.Rdata to SRP009615
## Load the data
load(file.path(project_info$project, 'rse_gene.Rdata'))

## Delete it if you don't need it anymore
unlink(project_info$project, recursive = TRUE)

We can explore a bit this RangedSummarizedExperiment as shown below.

rse_gene
## class: RangedSummarizedExperiment 
## dim: 58037 12 
## metadata(0):
## assays(1): counts
## rownames(58037): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000283698.1 ENSG00000283699.1
## rowData names(3): gene_id bp_length symbol
## colnames(12): SRR387777 SRR387778 ... SRR389083 SRR389084
## colData names(21): project sample ... title characteristics
## This is the sample phenotype data provided by the recount project
colData(rse_gene)
## DataFrame with 12 rows and 21 columns
##               project      sample  experiment         run
##           <character> <character> <character> <character>
## SRR387777   SRP009615   SRS281685   SRX110461   SRR387777
## SRR387778   SRP009615   SRS281686   SRX110462   SRR387778
## SRR387779   SRP009615   SRS281687   SRX110463   SRR387779
## SRR387780   SRP009615   SRS281688   SRX110464   SRR387780
## SRR389077   SRP009615   SRS282369   SRX111299   SRR389077
## ...               ...         ...         ...         ...
## SRR389080   SRP009615   SRS282372   SRX111302   SRR389080
## SRR389081   SRP009615   SRS282373   SRX111303   SRR389081
## SRR389082   SRP009615   SRS282374   SRX111304   SRR389082
## SRR389083   SRP009615   SRS282375   SRX111305   SRR389083
## SRR389084   SRP009615   SRS282376   SRX111306   SRR389084
##           read_count_as_reported_by_sra reads_downloaded
##                               <integer>        <integer>
## SRR387777                      30631853         30631853
## SRR387778                      37001306         37001306
## SRR387779                      40552001         40552001
## SRR387780                      32466352         32466352
## SRR389077                      27819603         27819603
## ...                                 ...              ...
## SRR389080                      34856203         34856203
## SRR389081                      23351679         23351679
## SRR389082                      18144828         18144828
## SRR389083                      24417368         24417368
## SRR389084                      23060084         23060084
##           proportion_of_reads_reported_by_sra_downloaded paired_end
##                                                <numeric>  <logical>
## SRR387777                                              1      FALSE
## SRR387778                                              1      FALSE
## SRR387779                                              1      FALSE
## SRR387780                                              1      FALSE
## SRR389077                                              1      FALSE
## ...                                                  ...        ...
## SRR389080                                              1      FALSE
## SRR389081                                              1      FALSE
## SRR389082                                              1      FALSE
## SRR389083                                              1      FALSE
## SRR389084                                              1      FALSE
##           sra_misreported_paired_end mapped_read_count        auc
##                            <logical>         <integer>  <numeric>
## SRR387777                      FALSE          28798572 1029494445
## SRR387778                      FALSE          33170281 1184877985
## SRR387779                      FALSE          37322762 1336528969
## SRR387780                      FALSE          29970735 1073178116
## SRR389077                      FALSE          24966859  893978355
## ...                              ...               ...        ...
## SRR389080                      FALSE          32469994 1163527939
## SRR389081                      FALSE          21904197  781685955
## SRR389082                      FALSE          17199795  616048853
## SRR389083                      FALSE          22499386  806323346
## SRR389084                      FALSE          21957003  787795710
##           sharq_beta_tissue sharq_beta_cell_type biosample_submission_date
##                 <character>          <character>               <character>
## SRR387777             blood                 k562   2011-12-05T15:40:03.870
## SRR387778             blood                 k562   2011-12-05T15:40:03.897
## SRR387779             blood                 k562   2011-12-05T15:40:03.910
## SRR387780             blood                 k562   2011-12-05T15:40:03.923
## SRR389077             blood                 k562   2011-12-13T11:26:05.720
## ...                     ...                  ...                       ...
## SRR389080             blood                 k562   2011-12-13T11:26:05.787
## SRR389081             blood                 k562   2011-12-13T11:26:05.800
## SRR389082             blood                 k562   2011-12-13T11:26:05.817
## SRR389083             blood                 k562   2011-12-13T11:26:05.830
## SRR389084             blood                 k562   2011-12-13T11:26:05.847
##           biosample_publication_date   biosample_update_date avg_read_length
##                          <character>             <character>       <integer>
## SRR387777    2011-12-07T09:29:59.890 2014-08-27T04:18:20.530              36
## SRR387778    2011-12-07T09:29:59.890 2014-08-27T04:18:21.053              36
## SRR387779    2011-12-07T09:29:59.890 2014-08-27T04:18:21.800              36
## SRR387780    2011-12-07T09:29:59.890 2014-08-27T04:18:22.320              36
## SRR389077    2011-12-13T11:26:06.663 2014-08-27T04:22:14.857              36
## ...                              ...                     ...             ...
## SRR389080    2011-12-13T11:26:06.663 2014-08-27T04:22:15.933              36
## SRR389081    2011-12-13T11:26:06.663 2014-08-27T04:22:16.290              36
## SRR389082    2011-12-13T11:26:06.663 2014-08-27T04:22:16.647              36
## SRR389083    2011-12-13T11:26:06.663 2014-08-27T04:22:17.037              36
## SRR389084    2011-12-13T11:26:06.663 2014-08-27T04:22:17.473              36
##           geo_accession  bigwig_file
##             <character>  <character>
## SRR387777     GSM836270 SRR387777.bw
## SRR387778     GSM836271 SRR387778.bw
## SRR387779     GSM836272 SRR387779.bw
## SRR387780     GSM836273 SRR387780.bw
## SRR389077     GSM847561 SRR389077.bw
## ...                 ...          ...
## SRR389080     GSM847564 SRR389080.bw
## SRR389081     GSM847565 SRR389081.bw
## SRR389082     GSM847566 SRR389082.bw
## SRR389083     GSM847567 SRR389083.bw
## SRR389084     GSM847568 SRR389084.bw
##                                                                                                     title
##                                                                                               <character>
## SRR387777   K562 cells with shRNA targeting SRF gene cultured with no doxycycline (uninduced - UI), rep1.
## SRR387778  K562 cells with shRNA targeting SRF gene cultured with doxycycline for 48 hours (48 hr), rep1.
## SRR387779   K562 cells with shRNA targeting SRF gene cultured with no doxycycline (uninduced - UI), rep2.
## SRR387780  K562 cells with shRNA targeting SRF gene cultured with doxycycline for 48 hours (48 hr), rep2.
## SRR389077  K562 cells with shRNA targeting EGR1 gene cultured with no doxycycline (uninduced - UI), rep1.
## ...                                                                                                   ...
## SRR389080 K562 cells with shRNA targeting EGR1 gene cultured with doxycycline for 72 hours (72 hr), rep2.
## SRR389081  K562 cells with shRNA targeting ATF3 gene cultured with no doxycycline (uninduced - UI), rep1.
## SRR389082 K562 cells with shRNA targeting ATF3 gene cultured with doxycycline for 24 hours (24 hr), rep1.
## SRR389083  K562 cells with shRNA targeting ATF3 gene cultured with no doxycycline (uninduced - UI), rep2.
## SRR389084 K562 cells with shRNA targeting ATF3 gene cultured with doxycycline for 24 hours (24 hr), rep2.
##                                                                                               characteristics
##                                                                                               <CharacterList>
## SRR387777                                               cells: K562,shRNA expression: no,treatment: Puromycin
## SRR387778                  cells: K562,shRNA expression: yes, targeting SRF,treatment: Puromycin, doxycycline
## SRR387779                                               cells: K562,shRNA expression: no,treatment: Puromycin
## SRR387780                   cells: K562,shRNA expression: yes targeting SRF,treatment: Puromycin, doxycycline
## SRR389077                          cell line: K562,shRNA expression: no shRNA expression,treatment: Puromycin
## ...                                                                                                       ...
## SRR389080 cell line: K562,shRNA expression: expressing shRNA targeting EGR1,treatment: Puromycin, doxycycline
## SRR389081                          cell line: K562,shRNA expression: no shRNA expression,treatment: Puromycin
## SRR389082 cell line: K562,shRNA expression: expressing shRNA targeting ATF3,treatment: Puromycin, doxycycline
## SRR389083                          cell line: K562,shRNA expression: no shRNA expression,treatment: Puromycin
## SRR389084 cell line: K562,shRNA expression: expressing shRNA targeting ATF3,treatment: Puromycin, doxycycline
## At the gene level, the row data includes the gene Gencode ids, the gene
## symbols and the sum of the disjoint exons widths, which can be used for 
## taking into account the gene length.
rowData(rse_gene)
## DataFrame with 58037 rows and 3 columns
##                               gene_id bp_length          symbol
##                           <character> <integer> <CharacterList>
## ENSG00000000003.14 ENSG00000000003.14      4535          TSPAN6
## ENSG00000000005.5   ENSG00000000005.5      1610            TNMD
## ENSG00000000419.12 ENSG00000000419.12      1207            DPM1
## ENSG00000000457.13 ENSG00000000457.13      6883           SCYL3
## ENSG00000000460.16 ENSG00000000460.16      5967        C1orf112
## ...                               ...       ...             ...
## ENSG00000283695.1   ENSG00000283695.1        61              NA
## ENSG00000283696.1   ENSG00000283696.1       997              NA
## ENSG00000283697.1   ENSG00000283697.1      1184           HSFX3
## ENSG00000283698.1   ENSG00000283698.1       940              NA
## ENSG00000283699.1   ENSG00000283699.1        60         MIR4481
## At the exon level, you can get the gene Gencode ids from the names of:
# rowRanges(rse_exon)

4.2 Finding phenotype information

Once we have identified the study of interest, we can use the browse_study() function to browse the study at the SRA website.

## Browse the project at SRA
browse_study(project_info$project)

The SRA website includes an Experiments link which further describes each of the samples. From the information available for SRP009615 at NCBI we have some further sample information that we can save for use in our differential expression analysis. We can get some of this information from GEO. The function find_geo() finds the GEO accession id for a given SRA run accession id, which we can then use with geo_info() and geo_characteristics() to parse this information. The rse_gene object already has some of this information.

## View GEO ids
colData(rse_gene)$geo_accession
##  [1] "GSM836270" "GSM836271" "GSM836272" "GSM836273" "GSM847561" "GSM847562"
##  [7] "GSM847563" "GSM847564" "GSM847565" "GSM847566" "GSM847567" "GSM847568"
## Extract the sample characteristics
geochar <- lapply(split(colData(rse_gene), seq_len(nrow(colData(rse_gene)))), geo_characteristics)

## Note that the information for this study is a little inconsistent, so we
## have to fix it.
geochar <- do.call(rbind, lapply(geochar, function(x) {
    if('cells' %in% colnames(x)) {
        colnames(x)[colnames(x) == 'cells'] <- 'cell.line'
        return(x)
    } else {
        return(x)
    }
}))

## We can now define some sample information to use
sample_info <- data.frame(
    run = colData(rse_gene)$run,
    group = ifelse(grepl('uninduced', colData(rse_gene)$title), 'uninduced', 'induced'),
    gene_target = sapply(colData(rse_gene)$title, function(x) { strsplit(strsplit(x,
        'targeting ')[[1]][2], ' gene')[[1]][1] }),
    cell.line = geochar$cell.line
)

4.3 Predicted phenotype information

Shannon Ellis et at have predicted phenotypic information that can be used with any data from the recount project thanks to add_predictions(). Check that function for more details (Ellis, Collado-Torres, Jaffe, and Leek, 2018).

4.4 DE analysis

The recount project records the sum of the base level coverage for each gene (or exon). These raw counts have to be scaled and there are several ways in which you can choose to do so. The function scale_counts() helps you scale them in a way that is tailored to Rail-RNA output. If you prefer read counts without scaling, check the function read_counts(). Below we show some of the differences.

## Scale counts by taking into account the total coverage per sample
rse <- scale_counts(rse_gene)

##### Details about counts #####

## Scale counts to 40 million mapped reads. Not all mapped reads are in exonic
## sequence, so the total is not necessarily 40 million.
colSums(assays(rse)$counts) / 1e6
## SRR387777 SRR387778 SRR387779 SRR387780 SRR389077 SRR389078 SRR389079 
##  30.26702  29.07199  35.38355  34.65798  23.36050  22.62014  34.74629 
## SRR389080 SRR389081 SRR389082 SRR389083 SRR389084 
##  35.20971  33.41459  36.78133  34.05013  33.95469
## Compute read counts
rse_read_counts <- read_counts(rse_gene)
## Difference between read counts and number of reads downloaded by Rail-RNA
colSums(assays(rse_read_counts)$counts) / 1e6 -
    colData(rse_read_counts)$reads_downloaded / 1e6
##  SRR387777  SRR387778  SRR387779  SRR387780  SRR389077  SRR389078  SRR389079 
##  -8.839994 -12.892541  -7.536284  -6.497571 -13.239233 -15.816168 -14.775568 
##  SRR389080  SRR389081  SRR389082  SRR389083  SRR389084 
##  -6.274001  -5.054220  -2.328694  -5.265160  -4.421728
## Check the help page for read_counts() for more details

We are almost ready to perform our differential expression analysis. Lets just add the information we recovered GEO about these samples.

## Add sample information for DE analysis
colData(rse)$group <- sample_info$group
colData(rse)$gene_target <- sample_info$gene_target

Now that the RangedSummarizedExperiment is complete, we can use DESeq2 or another package to perform the differential expression test. Note that you can use DEFormats for switching between formats if you want to use another package, like edgeR.

In this particular analysis, we’ll test whether there is a group difference adjusting for the gene target.

## Perform differential expression analysis with DESeq2
library('DESeq2')

## Specify design and switch to DESeq2 format
dds <- DESeqDataSet(rse, ~ gene_target + group)
## converting counts to integer mode
## Perform DE analysis
dds <- DESeq(dds, test = 'LRT', reduced = ~ gene_target, fitType = 'local')
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)

We can now use functions from DESeq2 to explore the results. For more details check the DESeq2 vignette. For example, we can make a MA plot as shown in Figure 1.

## Explore results
plotMA(res, main="DESeq2 results for SRP009615")