recount3 1.12.0
The recount3 R/Bioconductor package is an interface to the recount3 project. recount3 provides uniformly processed RNA-seq data for hundreds of thousands of samples. The R package makes it possible to easily retrieve this data in standard Bioconductor containers, including RangedSummarizedExperiment. The sections on terminology and available data contain more detail on those subjects.
The main documentation website for all the recount3-related projects is available at recount.bio. Please check that website for more information about how this R/Bioconductor package and other tools are related to each other.
recount3
R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. recount3 is an R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN, after which you can install recount3 by using the following commands in your R session:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("recount3")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
You can install the development version from GitHub with:
BiocManager::install("LieberInstitute/recount3")
recount3 is based on many other packages, in particular on those that have implemented the infrastructure needed for dealing with RNA-seq data. A recount3 user will benefit from being familiar with SummarizedExperiment to understand the objects recount3 generates. It might also prove highly beneficial to check the recount2 project (Collado-Torres, Nellore, Kammers, Ellis et al., 2017; Collado-Torres, Nellore, and Jaffe, 2017).
If you are asking yourself the question “Where do I start using Bioconductor?” you might be interested in this blog post.
As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve, so it is critical to learn where to ask for help. The blog post quoted above mentions some options, but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the recount3 tag and check the older posts. Other alternatives are available, such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
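The session information mentioned above can be captured with base R before posting a question. This is a minimal sketch; the recount3 version check is guarded so the snippet also runs on a machine where the package is not installed yet.

```r
## Capture the R version, platform, and attached package versions
info <- sessionInfo()
print(info)

## Report the installed recount3 version, if the package is available
if (requireNamespace("recount3", quietly = TRUE)) {
    print(packageVersion("recount3"))
}
```

Pasting this output together with a small reproducible example makes it much easier for others to reproduce the problem.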
We hope that recount3 will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
## Citation info
citation("recount3")
#> To cite package 'recount3' in publications use:
#>
#> Collado-Torres L (2023). _Explore and download data from the recount3
#> project_. doi:10.18129/B9.bioc.recount3
#> <https://doi.org/10.18129/B9.bioc.recount3>,
#> https://github.com/LieberInstitute/recount3 - R package version
#> 1.12.0, <http://www.bioconductor.org/packages/recount3>.
#>
#> Wilks C, Zheng SC, Chen FY, Charles R, Solomon B, Ling JP, Imada EL,
#> Zhang D, Joseph L, Leek JT, Jaffe AE, Nellore A, Collado-Torres L,
#> Hansen KD, Langmead B (2021). "recount3: summaries and queries for
#> large-scale RNA-seq expression and splicing." _Genome Biol_.
#> doi:10.1186/s13059-021-02533-6
#> <https://doi.org/10.1186/s13059-021-02533-6>,
#> <https://doi.org/10.1186/s13059-021-02533-6>.
#>
#> To see these entries in BibTeX format, use 'print(<citation>,
#> bibtex=TRUE)', 'toBibtex(.)', or set
#> 'options(citation.bibtex.max=999)'.
After installing recount3 (Wilks, Zheng, Chen, Charles et al., 2021), we need to load the package, which will automatically load the required dependencies.
## Load recount3 R package
library("recount3")
If you have identified a study of interest and want to access the gene level expression data, use create_rse() as shown below. create_rse() has arguments that will allow you to specify the annotation of interest for the given organism, and whether you want to download gene, exon or exon-exon junction expression data.
## Find all available human projects
human_projects <- available_projects()
#> 2023-10-24 18:40:14.887718 caching file sra.recount_project.MD.gz.
#> 2023-10-24 18:40:15.151981 caching file gtex.recount_project.MD.gz.
#> 2023-10-24 18:40:15.381705 caching file tcga.recount_project.MD.gz.
## Find the project you are interested in,
## here we use SRP009615 as an example
proj_info <- subset(
    human_projects,
    project == "SRP009615" & project_type == "data_sources"
)
## Create a RangedSummarizedExperiment (RSE) object at the gene level
rse_gene_SRP009615 <- create_rse(proj_info)
#> 2023-10-24 18:40:21.259138 downloading and reading the metadata.
#> 2023-10-24 18:40:21.700581 caching file sra.sra.SRP009615.MD.gz.
#> 2023-10-24 18:40:21.928274 caching file sra.recount_project.SRP009615.MD.gz.
#> 2023-10-24 18:40:22.265292 caching file sra.recount_qc.SRP009615.MD.gz.
#> 2023-10-24 18:40:22.513124 caching file sra.recount_seq_qc.SRP009615.MD.gz.
#> 2023-10-24 18:40:22.764667 caching file sra.recount_pred.SRP009615.MD.gz.
#> 2023-10-24 18:40:22.861618 downloading and reading the feature information.
#> 2023-10-24 18:40:23.025411 caching file human.gene_sums.G026.gtf.gz.
#> 2023-10-24 18:40:24.776322 downloading and reading the counts: 12 samples across 63856 features.
#> 2023-10-24 18:40:24.936855 caching file sra.gene_sums.SRP009615.G026.gz.
#> 2023-10-24 18:40:25.138526 constructing the RangedSummarizedExperiment (rse) object.
## Explore that RSE object
rse_gene_SRP009615
#> class: RangedSummarizedExperiment
#> dim: 63856 12
#> metadata(8): time_created recount3_version ... annotation recount3_url
#> assays(1): raw_counts
#> rownames(63856): ENSG00000278704.1 ENSG00000277400.1 ...
#> ENSG00000182484.15_PAR_Y ENSG00000227159.8_PAR_Y
#> rowData names(10): source type ... havana_gene tag
#> colnames(12): SRR387777 SRR387778 ... SRR389077 SRR389078
#> colData names(175): rail_id external_id ...
#> recount_pred.curated.cell_line BigWigURL
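The annotation and feature-type arguments of create_rse() mentioned earlier can also be set explicitly. The following is a sketch under a few assumptions: it reuses the SRP009615 example project, it needs an internet connection, and it spells out "gencode_v26" (the annotation that the G026 files in the log above appear to correspond to) rather than relying on the default. The object name rse_exon_SRP009615 is only illustrative.

```r
library("recount3")

## List the annotations that recount3 offers for each organism
annotation_options("human")
annotation_options("mouse")

## Locate the example project again so this block runs on its own
human_projects <- available_projects()
proj_info <- subset(
    human_projects,
    project == "SRP009615" & project_type == "data_sources"
)

## Request exon-level data with the annotation written out explicitly;
## use type = "jxn" instead for exon-exon junction data
rse_exon_SRP009615 <- create_rse(
    proj_info,
    type = "exon",
    annotation = "gencode_v26"
)

## Rows are exons here, columns are the same samples as at the gene level
dim(rse_exon_SRP009615)
```

Exon and junction files are larger than the gene-level ones, so the download may take noticeably longer.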
You can also interactively choose your study of interest.
## Note that you can interactively explore the available projects
proj_info_interactive <- interactiveDisplayBase::display(human_projects)
## Select a single row, then hit "send". The following code checks this.
stopifnot(nrow(proj_info_interactive) == 1)
## Then create the RSE object
rse_gene_interactive <- create_rse(proj_info_interactive)
Once you have an RSE object, you can use transform_counts() to transform the raw coverage counts.
## Once you have your RSE object, you can transform the raw
## base-pair coverage counts using transform_counts().
## For RPKM, TPM or read outputs, check the details in transform_counts().
assay(rse_gene_SRP009615, "counts") <- transform_counts(rse_gene_SRP009615)
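For readers who want to see the scaling choices spelled out, here is a hedged sketch. It assumes that by = "auc" and targetSize = 4e7 match the documented defaults of transform_counts(), and that compute_read_counts() is the helper for approximate read counts described in the transform_counts() help page; please confirm both against the help pages of your installed version. The object name rse_gene is illustrative, and the block requires an internet connection.

```r
library("recount3")

## Re-create the gene-level RSE so this block runs on its own
human_projects <- available_projects()
proj_info <- subset(
    human_projects,
    project == "SRP009615" & project_type == "data_sources"
)
rse_gene <- create_rse(proj_info)

## Scale the base-pair coverage counts by each sample's total coverage (AUC)
## to a common target library size (assumed defaults, written out)
assay(rse_gene, "counts") <- transform_counts(
    rse_gene,
    by = "auc",
    targetSize = 4e7
)

## Alternative: approximate read counts derived from the coverage counts
assay(rse_gene, "read_counts") <- compute_read_counts(rse_gene)

## The original raw_counts assay is kept alongside the new ones
assayNames(rse_gene)
```

Storing each transformation under its own assay name keeps the raw coverage counts available for other normalizations later.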
Now you are ready to continue with downstream analysis software.
recount3 also supports accessing the BigWig raw coverage files as well as specific study or collection sample metadata. Please continue to the users guide for more detailed information.
recount3 (Wilks, Zheng, Chen, Charles et al., 2021) provides an interface for downloading the recount3 raw files and building Bioconductor-friendly R objects (Huber, Carey, Gentleman, Anders et al., 2015; Morgan, Obenchain, Hester, and Pagès, 2019) that can be used with many downstream packages. To achieve this, the raw data is organized by study from a specific data source. That same study can be a part of one or more collections, which are manually curated sets of studies with collection-specific sample metadata (see the Data source vs collection section for details). To get started with recount3, you will need to identify the ID for the study of interest from either human or mouse for a particular annotation of interest. Once you have identified the study, data source or collection, and annotation, recount3 can be used to build a RangedSummarizedExperiment object (Morgan, Obenchain, Hester, and Pagès, 2019) for either gene, exon or exon-exon junction expression feature data. Furthermore, recount3 provides access to the coverage BigWig files, which can be quantified for a custom set of genomic regions using megadepth. In addition, snapcount allows fast queries for custom exon-exon junctions and other custom input.
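As a starting point for identifying a study, the project tables for both organisms can be inspected directly. This is a minimal sketch that needs an internet connection; it relies only on the project and project_type columns used earlier in this document and prints the remaining column names instead of assuming them.

```r
library("recount3")

## Retrieve the project tables for each supported organism
human_projects <- available_projects(organism = "human")
mouse_projects <- available_projects(organism = "mouse")

## Inspect which columns describe each project
colnames(human_projects)

## Count how many entries are data sources versus collections
table(human_projects$project_type)

## Look up a study by its ID, here the example used in this document
subset(human_projects, project == "SRP009615")
```

A single row from one of these tables is what create_rse() expects as its first argument.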
recount3 provides access to most of the recount3 raw files in a form that is R/Bioconductor-friendly. As a summary of the data provided by the recount3 project (Figure 1), the main data files provided are:
recount3, which can come from sources such as the Sequence Read Archive as well as recount3 quality metrics.
recount2 project, the recount3 project provides counts at the base-pair coverage level (Collado-Torres, Nellore, and Jaffe, 2017).
Here we describe some of the common terminology and acronyms used throughout the rest of the documentation. recount3 enables creating RangedSummarizedExperiment objects that contain expression quantitative data (Figure 2). As a quick overview, some of the main terms are:
RangedSummarizedExperiment object from SummarizedExperiment (Morgan, Obenchain, Hester, and Pagès, 2019) that contains:
assays(counts).
colData(rse).
rowRanges(rse).
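To make these accessors concrete, the sketch below builds a tiny RangedSummarizedExperiment by hand and queries each component. All values (gene names, coordinates, sample labels) are invented for illustration and are not recount3 data; only the assay name raw_counts mirrors what recount3 objects use.

```r
library("SummarizedExperiment") # also attaches GenomicRanges and IRanges

## Toy count matrix: 3 features (rows) by 2 samples (columns)
counts <- matrix(
    c(10L, 0L, 25L, 3L, 7L, 12L),
    nrow = 3,
    dimnames = list(
        c("gene1", "gene2", "gene3"),
        c("sampleA", "sampleB")
    )
)

## Genomic coordinates for each feature (hypothetical positions)
row_ranges <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(100, 500, 200), width = c(50, 80, 120)),
    strand = c("+", "-", "+")
)
names(row_ranges) <- rownames(counts)

## Sample metadata, one row per column of the count matrix
sample_info <- DataFrame(
    condition = c("control", "treated"),
    row.names = colnames(counts)
)

## Supplying rowRanges yields a RangedSummarizedExperiment
rse <- SummarizedExperiment(
    assays = list(raw_counts = counts),
    rowRanges = row_ranges,
    colData = sample_info
)

assay(rse, "raw_counts") # expression matrix
colData(rse)             # sample metadata
rowRanges(rse)           # feature annotation as genomic ranges
```

The same three accessors apply unchanged to the objects returned by create_rse(), which simply hold many more features, samples, and metadata columns.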