1 Overview

The recount3 R/Bioconductor package is an interface to the recount3 project. recount3 provides uniformly processed RNA-seq data for hundreds of thousands of samples. The R package makes it possible to easily retrieve this data in standard Bioconductor containers, including RangedSummarizedExperiment. The sections on terminology and available data contain more detail on those subjects.

The main documentation website for all the recount3-related projects is available at recount.bio. Please check that website for more information about how this R/Bioconductor package and other tools are related to each other.

2 Basics

2.1 Installing recount3

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. recount3 is an R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN, after which you can install recount3 by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("recount3")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

You can install the development version from GitHub with:

BiocManager::install("LieberInstitute/recount3")

2.2 Required knowledge

recount3 is based on many other packages, in particular on those that have implemented the infrastructure needed for dealing with RNA-seq data. A recount3 user will benefit from being familiar with SummarizedExperiment to understand the objects recount3 generates. It might also prove to be highly beneficial to check the

If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.

2.3 Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve, so it is critical to learn where to ask for help. The blog post mentioned above lists some options, but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the recount3 tag and check the older posts. Other alternatives are available, such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
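As a minimal sketch of the session information mentioned above, the base R calls below print what is typically pasted at the end of a support request. The use of "stats" is only a stand-in so that the snippet runs without recount3 installed.

```r
## Print the R version, platform and attached package versions;
## paste this output at the end of your support site post
sessionInfo()

## Version of a single package. "stats" is a stand-in here;
## replace it with "recount3" once the package is installed
packageVersion("stats")
```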

2.4 Citing recount3

We hope that recount3 will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("recount3")
#> To cite package 'recount3' in publications use:
#> 
#>   Collado-Torres L (2023). _Explore and download data from the recount3
#>   project_. doi:10.18129/B9.bioc.recount3
#>   <https://doi.org/10.18129/B9.bioc.recount3>,
#>   https://github.com/LieberInstitute/recount3 - R package version
#>   1.12.0, <http://www.bioconductor.org/packages/recount3>.
#> 
#>   Wilks C, Zheng SC, Chen FY, Charles R, Solomon B, Ling JP, Imada EL,
#>   Zhang D, Joseph L, Leek JT, Jaffe AE, Nellore A, Collado-Torres L,
#>   Hansen KD, Langmead B (2021). "recount3: summaries and queries for
#>   large-scale RNA-seq expression and splicing." _Genome Biol_.
#>   doi:10.1186/s13059-021-02533-6
#>   <https://doi.org/10.1186/s13059-021-02533-6>,
#>   <https://doi.org/10.1186/s13059-021-02533-6>.
#> 
#> To see these entries in BibTeX format, use 'print(<citation>,
#> bibtex=TRUE)', 'toBibtex(.)', or set
#> 'options(citation.bibtex.max=999)'.

3 Quick start

After installing recount3 (Wilks, Zheng, Chen, Charles et al., 2021), we need to load the package, which will automatically load the required dependencies.

## Load recount3 R package
library("recount3")

If you have identified a study of interest and want to access the gene level expression data, use create_rse() as shown below. create_rse() has arguments that will allow you to specify the annotation of interest for the given organism, and whether you want to download gene, exon or exon-exon junction expression data.

## Find all available human projects
human_projects <- available_projects()
#> 2023-10-24 18:40:14.887718 caching file sra.recount_project.MD.gz.
#> 2023-10-24 18:40:15.151981 caching file gtex.recount_project.MD.gz.
#> 2023-10-24 18:40:15.381705 caching file tcga.recount_project.MD.gz.

## Find the project you are interested in,
## here we use SRP009615 as an example
proj_info <- subset(
    human_projects,
    project == "SRP009615" & project_type == "data_sources"
)

## Create a RangedSummarizedExperiment (RSE) object at the gene level
rse_gene_SRP009615 <- create_rse(proj_info)
#> 2023-10-24 18:40:21.259138 downloading and reading the metadata.
#> 2023-10-24 18:40:21.700581 caching file sra.sra.SRP009615.MD.gz.
#> 2023-10-24 18:40:21.928274 caching file sra.recount_project.SRP009615.MD.gz.
#> 2023-10-24 18:40:22.265292 caching file sra.recount_qc.SRP009615.MD.gz.
#> 2023-10-24 18:40:22.513124 caching file sra.recount_seq_qc.SRP009615.MD.gz.
#> 2023-10-24 18:40:22.764667 caching file sra.recount_pred.SRP009615.MD.gz.
#> 2023-10-24 18:40:22.861618 downloading and reading the feature information.
#> 2023-10-24 18:40:23.025411 caching file human.gene_sums.G026.gtf.gz.
#> 2023-10-24 18:40:24.776322 downloading and reading the counts: 12 samples across 63856 features.
#> 2023-10-24 18:40:24.936855 caching file sra.gene_sums.SRP009615.G026.gz.
#> 2023-10-24 18:40:25.138526 constructing the RangedSummarizedExperiment (rse) object.

## Explore that RSE object
rse_gene_SRP009615
#> class: RangedSummarizedExperiment 
#> dim: 63856 12 
#> metadata(8): time_created recount3_version ... annotation recount3_url
#> assays(1): raw_counts
#> rownames(63856): ENSG00000278704.1 ENSG00000277400.1 ...
#>   ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
#> rowData names(10): source type ... havana_gene tag
#> colnames(12): SRR387777 SRR387778 ... SRR389077 SRR389078
#> colData names(175): rail_id external_id ...
#>   recount_pred.curated.cell_line BigWigURL

You can also interactively choose your study of interest:

## Note that you can interactively explore the available projects
proj_info_interactive <- interactiveDisplayBase::display(human_projects)

## Select a single row, then hit "send". The following code checks this.
stopifnot(nrow(proj_info_interactive) == 1)

## Then create the RSE object
rse_gene_interactive <- create_rse(proj_info_interactive)

Once you have an RSE object, you can use transform_counts() to transform the raw coverage counts.

## Once you have your RSE object, you can transform the raw
## base-pair coverage counts using transform_counts().
## For RPKM, TPM or read outputs, check the details in transform_counts().
assay(rse_gene_SRP009615, "counts") <- transform_counts(rse_gene_SRP009615)
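To give an intuition for why base-pair coverage counts are transformed before downstream analysis, here is a toy sketch in base R. It assumes one simple scheme: rescaling each sample so that its total coverage matches a common target. All numbers and the names raw, total_coverage and target_size are invented for illustration, and the code is not taken from the recount3 source; see the help page of transform_counts() for what the function actually computes.

```r
## Toy base-pair coverage counts: 3 features x 2 samples (invented numbers)
raw <- matrix(
    c(1000, 3000, 6000, 400, 1200, 2400),
    nrow = 3,
    dimnames = list(c("f1", "f2", "f3"), c("s1", "s2"))
)

## Assumed per-sample total coverage (area under the coverage curve)
total_coverage <- c(s1 = 20000, s2 = 8000)

## Assumed common target size that every sample is scaled to
target_size <- 1000

## Multiply each column by target_size / total_coverage for that sample
scaled <- sweep(raw, 2, target_size / total_coverage, `*`)
scaled
```

In this toy case the two samples differ only in sequencing depth, so their scaled values become identical.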

Now you are ready to continue with downstream analysis software.

recount3 also supports accessing the BigWig raw coverage files as well as specific study or collection sample metadata. Please continue to the users guide for more detailed information.

4 Users guide

recount3 (Wilks, Zheng, Chen, Charles et al., 2021) provides an interface for downloading the recount3 raw files and building Bioconductor-friendly R objects (Huber, Carey, Gentleman, Anders et al., 2015; Morgan, Obenchain, Hester, and Pagès, 2019) that can be used with many downstream packages. To achieve this, the raw data is organized by study from a specific data source. That same study can be part of one or more collections, each of which is a manually curated set of studies with collection-specific sample metadata (see the Data source vs collection section for details). To get started with recount3, you will need to identify the ID of the study of interest, from either human or mouse, and the annotation of interest. Once you have identified the study, the data source or collection, and the annotation, recount3 can be used to build a RangedSummarizedExperiment object (Morgan, Obenchain, Hester, and Pagès, 2019) for gene, exon or exon-exon junction expression feature data. Furthermore, recount3 provides access to the coverage BigWig files, which can be quantified for a custom set of genomic regions using megadepth. In addition, snapcount allows fast queries for custom exon-exon junctions and other custom input.
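The selection step described above (study ID plus data source or collection) can be sketched as follows. pick_project() is a hypothetical helper, not part of recount3, and toy_projects is an invented table that only mimics the project and project_type columns used in the quick start. The guarded block at the end calls available_projects() and create_rse() from recount3 and needs network access; the type = "exon" value reflects the gene, exon or exon-exon junction choice described above, so check the help page of create_rse() for the exact values accepted by your version.

```r
## Hypothetical helper (not part of recount3): return the single row of a
## projects table that matches a study ID and a project type
pick_project <- function(projects, id, type = "data_sources") {
    keep <- which(projects$project == id & projects$project_type == type)
    if (length(keep) != 1) {
        stop("Expected exactly one match for ", id, ", found ", length(keep))
    }
    projects[keep, , drop = FALSE]
}

## Invented toy table; real tables come from recount3::available_projects()
toy_projects <- data.frame(
    project = c("SRP009615", "SRP009615", "TOY_STUDY"),
    project_type = c("data_sources", "collections", "data_sources"),
    stringsAsFactors = FALSE
)
pick_project(toy_projects, "SRP009615")

## With recount3 installed and network access, the selected row can be
## passed to create_rse(); skipped in non-interactive sessions
if (requireNamespace("recount3", quietly = TRUE) && interactive()) {
    projects <- recount3::available_projects(organism = "human")
    rse_exon <- recount3::create_rse(
        pick_project(projects, "SRP009615"),
        type = "exon"
    )
}
```

Failing early when zero or several rows match mirrors the stopifnot(nrow(...) == 1) check used in the interactive example of the quick start.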

4.1 Available data

recount3 provides access to most of the recount3 raw files in a form that is R/Bioconductor-friendly. As a summary of the data provided by the recount3 project (Figure 1), the main data files provided are:

  • metadata: information about the samples in recount3, which can come from sources such as the Sequence Read Archive as well as recount3 quality metrics.
  • gene: RNA expression data quantified at the gene annotation level. This information is provided for multiple annotations that are organism-specific. Similar to the recount2 project, the recount3 project provides counts at the base-pair coverage level (Collado-Torres, Nellore, and Jaffe, 2017).
  • exon: RNA expression data quantified at the exon annotation level. The data is also annotation-specific and the counts are also at the base-pair coverage level.
  • exon-exon junctions: RNA expression data quantified at the exon-exon junction level. This data is annotation-agnostic (it does not depend on the annotation) and is variable across each study because different sets of exon-exon junctions are measured in each study.
  • bigWig: RNA expression data in raw format at the base-pair coverage resolution. This raw data when coupled with a given annotation can be used to generate gene and exon level counts using software such as megadepth. It enables exploring the RNA expression landscape in an annotation-agnostic way.

Figure 1: Overview of the data available in recount2 and recount3. Reads (pink boxes) aligned to the reference genome can be used to compute a base-pair coverage curve and identify exon-exon junctions (split reads). Gene and exon count matrices are generated using annotation information providing the gene (green boxes) and exon (blue boxes) coordinates together with the base-level coverage curve. The reads spanning exon-exon junctions (jx) are used to compute a third count matrix that might include unannotated junctions (jx 3 and 4). Without using annotation information, expressed regions (orange box) can be determined from the base-level coverage curve to then construct data-driven count matrices. DOI: <https://doi.org/10.12688/f1000research.12223.1>.
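To make "counts at the base-pair coverage level" concrete, the toy example below sums a per-base coverage vector over annotated exons, then over the gene. The coverage values and exon coordinates are invented, and the code only illustrates the idea; it is not how the recount3 files were generated.

```r
## Toy per-base coverage for a 20 bp region (invented numbers)
coverage <- c(0, 0, 2, 2, 3, 3, 3, 1, 0, 0, 0, 4, 4, 5, 5, 2, 0, 0, 0, 0)

## Hypothetical annotation: two exons of one gene,
## with 1-based inclusive coordinates
exons <- data.frame(
    exon = c("exon1", "exon2"),
    start = c(3, 12),
    end = c(8, 16)
)

## Exon-level coverage counts: sum of per-base coverage inside each exon
exon_counts <- mapply(
    function(s, e) sum(coverage[s:e]),
    exons$start, exons$end
)
names(exon_counts) <- exons$exon
exon_counts

## Gene-level coverage count: sum over the gene's non-overlapping exons
gene_count <- sum(exon_counts)
gene_count
```

Because each aligned base contributes to the sum, these values are much larger than read counts, which is why they are usually transformed before downstream analysis.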

4.2 Terminology

Here we describe some of the common terminology and acronyms used throughout the rest of the documentation. recount3 enables creating RangedSummarizedExperiment objects that contain expression quantitative data (Figure 2). As a quick overview, some of the main terms are:

  • rse: a RangedSummarizedExperiment object from SummarizedExperiment (Morgan, Obenchain, Hester, and Pagès, 2019) that contains:
    • counts: a matrix with the expression feature data (either: gene, exon, or exon-exon junctions) that can be accessed using assay(rse, "counts").
    • metadata: a table with information about the samples and quality metrics that can be accessed using colData(rse).
    • annotation: a table-like object with information about the expression features which can be annotation-specific (gene and exons) or annotation-agnostic (exon-exon junctions). This information can be accessed using rowRanges(rse).
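The three components listed above can be explored on a small hand-made object before working with real data. This sketch builds a toy RangedSummarizedExperiment with the SummarizedExperiment package; every value in it (gene names, coordinates, sample labels) is invented, and objects created by recount3 carry far more rows and metadata columns.

```r
library("SummarizedExperiment") # also attaches GenomicRanges and IRanges

## Toy counts: 2 genes x 3 samples (all values invented)
counts <- matrix(
    c(10L, 0L, 5L, 7L, 3L, 1L),
    nrow = 2,
    dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)

## Toy feature annotation as genomic ranges
row_ranges <- GRanges(
    seqnames = c("chr1", "chr2"),
    ranges = IRanges(start = c(100, 500), end = c(200, 900)),
    strand = c("+", "-")
)
names(row_ranges) <- rownames(counts)

## Toy sample metadata
sample_info <- DataFrame(
    condition = c("ctrl", "ctrl", "treated"),
    row.names = colnames(counts)
)

rse <- SummarizedExperiment(
    assays = list(counts = counts),
    rowRanges = row_ranges,
    colData = sample_info
)

assay(rse, "counts") # expression feature data
colData(rse)         # sample metadata
rowRanges(rse)       # feature annotation
```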