Introduction

The tximeta package (Love et al. 2020) extends the tximport package (Soneson, Love, and Robinson 2015) for import of transcript-level quantification data into R/Bioconductor. It automatically adds annotation metadata when the RNA-seq data has been quantified with Salmon (Patro et al. 2017) or for scRNA-seq data quantified with alevin (Srivastava et al. 2019). To our knowledge, tximeta is the only package for RNA-seq data import that can automatically identify and attach transcriptome metadata based on the unique sequence of the reference transcripts. For more details on these packages – including the motivation for tximeta and description of similar work – consult the References below.

Note: tximeta requires that the entire output directory of Salmon / alevin is present and unmodified in order to identify the provenance of the reference transcripts. In general, it’s a good idea to not modify or re-arrange the output directory of bioinformatic software as other downstream software rely on and assume a consistent directory structure. For sharing multiple samples, one can use, for example, tar -czf to bundle up a set of Salmon output directories, or to bundle one alevin output directory. For tips on using tximeta with other quantifiers see the other quantifiers section below.

Analysis starts with sample table

The first step using tximeta is to read in the sample table, which will become the column data, colData, of the final object, a SummarizedExperiment. The sample table should contain all the information we need to identify the Salmon quantification directories. For alevin quantification, one should point to the quants_mat.gz file that contains the counts for all of the cells.

Here we will use a Salmon quantification file in the tximportData package to demonstrate the usage of tximeta. We do not have a sample table, so we construct one in R. It is recommended to keep a sample table as a CSV or TSV file while working on an RNA-seq project with multiple samples.

dir <- system.file("extdata/salmon_dm", package="tximportData")
files <- file.path(dir, "SRR1197474", "quant.sf") 
file.exists(files)

## [1] TRUE

coldata <- data.frame(files, names="SRR1197474", condition="A", stringsAsFactors=FALSE)
coldata

##                                                                                        files
## 1 /home/biocbuild/bbs-3.11-bioc/R/library/tximportData/extdata/salmon_dm/SRR1197474/quant.sf
##        names condition
## 1 SRR1197474         A

tximeta expects at least two columns in coldata:

files - a pointer to the quant.sf files
names - the unique names that should be used to identify samples

Running tximeta

Normally, we would just run tximeta like so:

library(tximeta)
se <- tximeta(coldata)

However, to avoid downloading remote GTF files during this vignette, we will point to a GTF file saved locally (in the tximportData package). We link the transcriptome of the Salmon index to its locally saved GTF. The standard recommended usage of tximeta would be the code chunk above, or to specify a remote GTF source, not a local one. This following code is therefore not recommended for a typically workflow, but is particular to the vignette code.

indexDir <- file.path(dir, "Dm.BDGP6.22.98_salmon-0.14.1")
fastaFTP <- c("ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.22.cdna.all.fa.gz",
              "ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.22.ncrna.fa.gz")
gtfPath <- file.path(dir,"Drosophila_melanogaster.BDGP6.22.98.gtf.gz")
suppressPackageStartupMessages(library(tximeta))
makeLinkedTxome(indexDir=indexDir,
                source="Ensembl",
                organism="Drosophila melanogaster",
                release="98",
                genome="BDGP6.22",
                fasta=fastaFTP,
                gtf=gtfPath,
                write=FALSE)

## saving linkedTxome in bfc (first time)

library(tximeta)
se <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 
## found matching linked transcriptome:
## [ Ensembl - Drosophila melanogaster - release 98 ]
## useHub=TRUE: checking for EnsDb via 'AnnotationHub'
## snapshotDate(): 2020-04-27
## found matching EnsDb via 'AnnotationHub'
## loading from cache
## require("ensembldb")
## generating transcript ranges

What happened?

tximeta recognized the hashed checksum of the transcriptome that the files were quantified against, it accessed the GTF file of the transcriptome source, found and attached the transcript ranges, and added the appropriate transcriptome and genome metadata. A remote GTF is only downloaded once, and a local or remote GTF is only parsed to build a TxDb or EnsDb once: if tximeta recognizes that it has seen this Salmon index before, it will use a cached version of the metadata and transcript ranges.

Note the warning above that 5 of the transcripts are missing from the GTF file and so are dropped from the final output. This is a problem coming from the annotation source, and not easily avoided by tximeta.

TxDb, EnsDb, and AnnotationHub

tximeta makes use of Bioconductor packages for storing transcript databases as TxDb or EnsDb objects, which both are connected by default to sqlite backends. For GENCODE and RefSeq GTF files, tximeta uses the GenomicFeatures package (Lawrence 2013) to parse the GTF and build a TxDb. For Ensembl GTF files, tximeta will first attempt to obtain the correct EnsDb object using AnnotationHub. The ensembldb package (Rainer, Gatto, and Weichenberger 2019) contains classes and methods for extracting relevant data from Ensembl files. If the EnsDb has already been made available on AnnotationHub, tximeta will download the database directly, which saves the user time parsing the GTF into a database (to avoid this, set useHub=FALSE). If the relevant EnsDb is not available on AnnotationHub, tximeta will build an EnsDb using ensembldb after downloading the GTF file. Again, the download/construction of a transcript database occurs only once, and upon subsequent usage of tximeta functions, the cached version will be used.

Pre-computed checksums

We plan to support a wide variety of sources and organisms for transcriptomes with pre-computed checksums, though for now the software focuses on predominantly human and mouse transcriptomes (see Next steps below for details). The following checksums are supported in this version of tximeta:

source	organism	releases
GENCODE	Homo sapiens	23-35
GENCODE	Mus musculus	M6-M25
Ensembl	Homo sapiens	76-101
Ensembl	Mus musculus	76-101
Ensembl	Drosophila melanogaster	79-101
RefSeq	Homo sapiens	p1-p12
RefSeq	Mus musculus	p2-p5

For Ensembl transcriptomes, we support the combined protein coding (cDNA) and non-coding (ncRNA) sequences, as well as the protein coding alone (although the former approach combining coding and non-coding transcripts is recommended for more accurate quantification).

tximeta also has functions to support linked transcriptomes, where one or more sources for transcript sequences have been combined or filtered. See the Linked transcriptome section below for a demonstration. (The makeLinkedTxome function was used above to avoid downloading the GTF during the vignette building process.)

SummarizedExperiment output

We have our coldata from before. Note that we’ve removed files.

suppressPackageStartupMessages(library(SummarizedExperiment))
colData(se)

## DataFrame with 1 row and 2 columns
##                  names   condition
##            <character> <character>
## SRR1197474  SRR1197474           A

Here we show the three matrices that were imported.

assayNames(se)

## [1] "counts"    "abundance" "length"

If there were inferential replicates (Gibbs samples or bootstrap samples), these would be imported as additional assays named "infRep1", "infRep2", …

tximeta has imported the correct ranges for the transcripts:

rowRanges(se)

## GRanges object with 33706 ranges and 9 metadata columns:
##               seqnames            ranges strand |       tx_id     tx_biotype
##                  <Rle>         <IRanges>  <Rle> | <character>    <character>
##   FBtr0070129        X     656673-657899      + | FBtr0070129 protein_coding
##   FBtr0070126        X     656356-657899      + | FBtr0070126 protein_coding
##   FBtr0070128        X     656673-657899      + | FBtr0070128 protein_coding
##   FBtr0070124        X     656114-657899      + | FBtr0070124 protein_coding
##   FBtr0070127        X     656356-657899      + | FBtr0070127 protein_coding
##           ...      ...               ...    ... .         ...            ...
##   FBtr0114299       2R 21325218-21325323      + | FBtr0114299         snoRNA
##   FBtr0113582       3R   5598638-5598777      - | FBtr0113582         snoRNA
##   FBtr0091635       3L   1488906-1489045      + | FBtr0091635         snoRNA
##   FBtr0113599       3L     261803-261953      - | FBtr0113599         snoRNA
##   FBtr0113600       3L     831870-832008      - | FBtr0113600         snoRNA
##               tx_cds_seq_start tx_cds_seq_end     gene_id tx_support_level
##                      <integer>      <integer> <character>        <integer>
##   FBtr0070129           657110         657595 FBgn0025637             <NA>
##   FBtr0070126           657110         657595 FBgn0025637             <NA>
##   FBtr0070128           657110         657595 FBgn0025637             <NA>
##   FBtr0070124           657110         657595 FBgn0025637             <NA>
##   FBtr0070127           657110         657595 FBgn0025637             <NA>
##           ...              ...            ...         ...              ...
##   FBtr0114299             <NA>           <NA> FBgn0086023             <NA>
##   FBtr0113582             <NA>           <NA> FBgn0082989             <NA>
##   FBtr0091635             <NA>           <NA> FBgn0086670             <NA>
##   FBtr0113599             <NA>           <NA> FBgn0083014             <NA>
##   FBtr0113600             <NA>           <NA> FBgn0083057             <NA>
##               tx_id_version gc_content     tx_name
##                 <character>  <numeric> <character>
##   FBtr0070129   FBtr0070129    44.7641 FBtr0070129
##   FBtr0070126   FBtr0070126    44.8128 FBtr0070126
##   FBtr0070128   FBtr0070128    44.7974 FBtr0070128
##   FBtr0070124   FBtr0070124    43.8859 FBtr0070124
##   FBtr0070127   FBtr0070127    44.8571 FBtr0070127
##           ...           ...        ...         ...
##   FBtr0114299   FBtr0114299    35.8491 FBtr0114299
##   FBtr0113582   FBtr0113582    32.8571 FBtr0113582
##   FBtr0091635   FBtr0091635    45.0000 FBtr0091635
##   FBtr0113599   FBtr0113599    48.3444 FBtr0113599
##   FBtr0113600   FBtr0113600    44.6043 FBtr0113600
##   -------
##   seqinfo: 25 sequences (1 circular) from BDGP6.22 genome

We have appropriate genome information, which prevents us from making bioinformatic mistakes:

seqinfo(se)

## Seqinfo object with 25 sequences (1 circular) from BDGP6.22 genome:
##   seqnames                        seqlengths isCircular   genome
##   211000022278279                      12714      FALSE BDGP6.22
##   211000022278436                       2815      FALSE BDGP6.22
##   211000022278449                       1947      FALSE BDGP6.22
##   211000022278760                       1144      FALSE BDGP6.22
##   211000022279165                       1118      FALSE BDGP6.22
##   ...                                    ...        ...      ...
##   Unmapped_Scaffold_8_D1580_D1567      88768      FALSE BDGP6.22
##   X                                 23542271      FALSE BDGP6.22
##   Y                                  3667352      FALSE BDGP6.22
##   mitochondrion_genome                 19524       TRUE BDGP6.22
##   rDNA                                 76973      FALSE BDGP6.22

Retrieve the transcript database

The se object has associated metadata that allows tximeta to link to locally stored cached databases and other Bioconductor objects. In further sections, we will show examples functions that leverage this databases for adding exon information, summarize transcript-level data to the gene level, or add identifiers. However, first we mention that the user can easily access the cached database with the following helper function. In this case, tximeta has an associated EnsDb object that we can retrieve and use in our R session:

edb <- retrieveDb(se)

## loading existing EnsDb created: 2020-08-21 01:21:08

class(edb)

## [1] "EnsDb"
## attr(,"package")
## [1] "ensembldb"

The database returned by retrieveDb is either a TxDb in the case of GENCODE or RefSeq GTF annotation file, or an EnsDb in the case of an Ensembl GTF annotation file. For further use of these two database objects, consult the GenomicFeatures vignettes and the ensembldb vignettes, respectively (both Bioconductor packages).

Add exons per transcript

Because the SummarizedExperiment maintains all the metadata of its creation, it also keeps a pointer to the necessary database for pulling out additional information, as demonstrated in the following sections.

If necessary, the tximeta package can pull down the remote source to build a TxDb, but given that we’ve already built a TxDb once, it simply loads the cached version. In order to remove the cached TxDb and regenerate, one can remove the relevant entry from the tximeta file cache that resides at the location given by getTximetaBFC().

The se object created by tximeta, has the start, end, and strand information for each transcript. Here, we swap out the transcript GRanges for exons-by-transcript GRangesList (it is a list of GRanges, where each element of the list gives the exons for a particular transcript).

se.exons <- addExons(se)

## loading existing EnsDb created: 2020-08-21 01:21:08

## generating exon ranges

rowRanges(se.exons)[[1]]

## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames        ranges strand |        exon_id exon_rank
##          <Rle>     <IRanges>  <Rle> |    <character> <integer>
##   [1]        X 656673-656740      + | FBtr0070129-E1         1
##   [2]        X 657099-657899      + | FBtr0070129-E2         2
##   -------
##   seqinfo: 25 sequences (1 circular) from BDGP6.22 genome

As with the transcript ranges, the exon ranges will be generated once and cached locally. As it takes a non-negligible amount of time to generate the exon-by-transcript GRangesList, this local caching offers substantial time savings for repeated usage of addExons with the same transcriptome.

We have implemented addExons to work only on the transcript-level SummarizedExperiment object. We provide some motivation for this choice in ?addExons. Briefly, if it is desired to know the exons associated with a particular gene, we feel that it makes more sense to pull out the relevant set of exons-by-transcript for the transcripts for this gene, rather than losing the hierarchical structure (exons to transcripts to genes) that would occur with a GRangesList of exons grouped per gene.

Easy summarization to gene-level

Likewise, the tximeta package can make use of the cached TxDb database for the purpose of summarizing transcript-level quantifications and bias corrections to the gene-level. After summarization, the rowRanges reflect the start and end position of the gene, which in Bioconductor are defined by the left-most and right-most genomic coordinates of all the transcripts. As with the transcript and exons, the gene ranges are cached locally for repeated usage.

gse <- summarizeToGene(se)

## loading existing EnsDb created: 2020-08-21 01:21:08

## obtaining transcript-to-gene mapping from database

## generating gene ranges

## summarizing abundance

## summarizing counts

## summarizing length

rowRanges(gse)

## GRanges object with 17208 ranges and 8 metadata columns:
##               seqnames            ranges strand |     gene_id      gene_name
##                  <Rle>         <IRanges>  <Rle> | <character>    <character>
##   FBgn0000003       3R   6822498-6822796      + | FBgn0000003 7SLRNA:CR32864
##   FBgn0000008       2R 22136968-22172834      + | FBgn0000008              a
##   FBgn0000014       3R 16807214-16830049      - | FBgn0000014          abd-A
##   FBgn0000015       3R 16927212-16972236      - | FBgn0000015          Abd-B
##   FBgn0000017       3L 16615866-16647882      - | FBgn0000017            Abl
##           ...      ...               ...    ... .         ...            ...
##   FBgn0286199       3R 24279572-24281576      + | FBgn0286199           shps
##   FBgn0286203       2R   5413744-5456095      + | FBgn0286203            stw
##   FBgn0286204       3R   8950246-8963037      - | FBgn0286204            ich
##   FBgn0286213       3L 13023352-13024762      + | FBgn0286213          RpS12
##   FBgn0286222        X   6678424-6681845      + | FBgn0286222           Fum1
##                 gene_biotype seq_coord_system description gene_id_version
##                  <character>      <character> <character>     <character>
##   FBgn0000003          ncRNA       chromosome        NULL     FBgn0000003
##   FBgn0000008 protein_coding       chromosome        NULL     FBgn0000008
##   FBgn0000014 protein_coding       chromosome        NULL     FBgn0000014
##   FBgn0000015 protein_coding       chromosome        NULL     FBgn0000015
##   FBgn0000017 protein_coding       chromosome        NULL     FBgn0000017
##           ...            ...              ...         ...             ...
##   FBgn0286199 protein_coding       chromosome        NULL     FBgn0286199
##   FBgn0286203 protein_coding       chromosome        NULL     FBgn0286203
##   FBgn0286204 protein_coding       chromosome        NULL     FBgn0286204
##   FBgn0286213 protein_coding       chromosome        NULL     FBgn0286213
##   FBgn0286222 protein_coding       chromosome        NULL     FBgn0286222
##                       symbol entrezid
##                  <character>   <list>
##   FBgn0000003 7SLRNA:CR32864       NA
##   FBgn0000008              a    43852
##   FBgn0000014          abd-A    42037
##   FBgn0000015          Abd-B    47763
##   FBgn0000017            Abl    45821
##           ...            ...      ...
##   FBgn0286199           shps    42892
##   FBgn0286203            stw    35494
##   FBgn0286204            ich    41069
##   FBgn0286213          RpS12    39480
##   FBgn0286222           Fum1    31605
##   -------
##   seqinfo: 25 sequences (1 circular) from BDGP6.22 genome

Add different identifiers

We would like to add support to easily map transcript or gene identifiers from one annotation to another. This is just a prototype function, but we show how we can easily add alternate IDs given that we know the organism and the source of the transcriptome. (This function currently only works for GENCODE and Ensembl gene or transcript IDs but could be extended to work for arbitrary sources.)

library(org.Dm.eg.db)

##

gse <- addIds(gse, "REFSEQ", gene=TRUE)

## mapping to new IDs using 'org.Dm.eg.db' data package
## if all matching IDs are desired, and '1:many mappings' are reported,
## set multiVals='list' to obtain all the matching IDs

## 'select()' returned 1:many mapping between keys and columns

mcols(gse)

## DataFrame with 17208 rows and 9 columns
##                 gene_id      gene_name   gene_biotype seq_coord_system
##             <character>    <character>    <character>      <character>
## FBgn0000003 FBgn0000003 7SLRNA:CR32864          ncRNA       chromosome
## FBgn0000008 FBgn0000008              a protein_coding       chromosome
## FBgn0000014 FBgn0000014          abd-A protein_coding       chromosome
## FBgn0000015 FBgn0000015          Abd-B protein_coding       chromosome
## FBgn0000017 FBgn0000017            Abl protein_coding       chromosome
## ...                 ...            ...            ...              ...
## FBgn0286199 FBgn0286199           shps protein_coding       chromosome
## FBgn0286203 FBgn0286203            stw protein_coding       chromosome
## FBgn0286204 FBgn0286204            ich protein_coding       chromosome
## FBgn0286213 FBgn0286213          RpS12 protein_coding       chromosome
## FBgn0286222 FBgn0286222           Fum1 protein_coding       chromosome
##             description gene_id_version         symbol entrezid       REFSEQ
##             <character>     <character>    <character>   <list>  <character>
## FBgn0000003        NULL     FBgn0000003 7SLRNA:CR32864       NA           NA
## FBgn0000008        NULL     FBgn0000008              a    43852 NM_001014543
## FBgn0000014        NULL     FBgn0000014          abd-A    42037 NM_001170161
## FBgn0000015        NULL     FBgn0000015          Abd-B    47763 NM_001275719
## FBgn0000017        NULL     FBgn0000017            Abl    45821 NM_001104153
## ...                 ...             ...            ...      ...          ...
## FBgn0286199        NULL     FBgn0286199           shps    42892    NM_142982
## FBgn0286203        NULL     FBgn0286203            stw    35494 NM_001144134
## FBgn0286204        NULL     FBgn0286204            ich    41069 NM_001275464
## FBgn0286213        NULL     FBgn0286213          RpS12    39480    NM_168534
## FBgn0286222        NULL     FBgn0286222           Fum1    31605    NM_132111

Differential expression analysis

The following code chunk demonstrates how to build a DESeqDataSet and begin a differential expression analysis.

suppressPackageStartupMessages(library(DESeq2))
# here there is a single sample so we use ~1.
# expect a warning that there is only a single sample...
suppressWarnings({dds <- DESeqDataSet(gse, ~1)})

## using counts and average transcript lengths from tximeta

# ... see DESeq2 vignette

We have a convenient wrapper function that will build a DGEList object for use with edgeR.

suppressPackageStartupMessages(library(edgeR))
y <- makeDGEList(gse)
# ... see edgeR User's Guide for further steps

The following code chunk demonstrates the code inside of the above wrapper function, and produces the same output.

cts <- assays(gse)[["counts"]]
normMat <- assays(gse)[["length"]]
normMat <- normMat / exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
y <- DGEList(cts)
y <- scaleOffset(y, t(t(log(normMat)) + o))
# ... see edgeR User's Guide for further steps

The following code chunk demonstrates how one could use the Swish method in the fishpond Bioconductor package. Here we use the transcript-level object se. This dataset only has a single sample and no inferential replicates, but the analysis would begin with such code. See the Swish vignette in the fishpond package for a complete example:

y <- se # rename the object to 'y'
library(fishpond)
# if inferential replicates existed in the data,
# analysis would begin with:
#
# y <- scaleInfReps(y)
# ... see Swish vignette in the fishpond package

For limma with voom transformation we recommend, as in the tximport vignette to generate counts-from-abundance instead of providing an offset for average transcript length.

gse <- summarizeToGene(se, countsFromAbundance="lengthScaledTPM")

## loading existing EnsDb created: 2020-08-21 01:21:08

## obtaining transcript-to-gene mapping from database

## loading existing gene ranges created: 2020-08-21 01:21:16

## summarizing abundance

## summarizing counts

## summarizing length

library(limma)
y <- DGEList(assays(gse)[["counts"]])
# see limma User's Guide for further steps

Above we generated counts-from-abundance when calling summarizeToGene. The counts-from-abundance status is then stored in the metadata:

metadata(gse)$countsFromAbundance

## [1] "lengthScaledTPM"

Additional metadata

The following information is attached to the SummarizedExperiment by tximeta:

names(metadata(se))

## [1] "tximetaInfo"         "quantInfo"           "countsFromAbundance"
## [4] "level"               "txomeInfo"           "txdbInfo"

str(metadata(se)[["quantInfo"]])

## List of 31
##  $ salmon_version                                        : chr "0.14.1"
##  $ samp_type                                             : chr "none"
##  $ opt_type                                              : chr "vb"
##  $ quant_errors                                          :List of 1
##   ..$ : list()
##  $ num_libraries                                         : int 1
##  $ library_types                                         : chr "ISR"
##  $ frag_dist_length                                      : int 1001
##  $ seq_bias_correct                                      : logi FALSE
##  $ gc_bias_correct                                       : logi TRUE
##  $ num_bias_bins                                         : int 4096
##  $ mapping_type                                          : chr "mapping"
##  $ num_valid_targets                                     : int 33706
##  $ num_decoy_targets                                     : int 0
##  $ num_eq_classes                                        : int 70718
##  $ serialized_eq_classes                                 : logi FALSE
##  $ length_classes                                        : int [1:5, 1] 867 1533 2379 3854 71382
##  $ index_seq_hash                                        : chr "7ba5e9597796ea86cf11ccf6635ca88fbc37c2848d38083c23986aa2c6a21eae"
##  $ index_name_hash                                       : chr "b6426061057bba9b7afb4dc76fa68238414cf35b4190c95ca6fc44280d4ca87c"
##  $ index_seq_hash512                                     : chr "05f111abcda1efd2e489ace6324128cdaaa311712a28ed716d957fdfd8706ec41ca9177ebf12f54e99c2a89582d06f31c5e09dc1dce2d13"| __truncated__
##  $ index_name_hash512                                    : chr "ccdf58f23e48c8c53cd122b5f5990b5adce9fec87ddf8bd88153afbe93296d87b818fba89d12dbc20c882f7d98353840394c5040fea7432"| __truncated__
##  $ num_bootstraps                                        : int 0
##  $ num_processed                                         : int 42422337
##  $ num_mapped                                            : int 34098209
##  $ num_decoy_fragments                                   : int 0
##  $ num_dovetail_fragments                                : int 2048810
##  $ num_fragments_filtered_vm                             : int 989383
##  $ num_alignments_below_threshold_for_mapped_fragments_vm: int 267540
##  $ percent_mapped                                        : num 80.4
##  $ call                                                  : chr "quant"
##  $ start_time                                            : chr "Sat Oct 12 13:55:01 2019"
##  $ end_time                                              : chr "Sat Oct 12 14:08:11 2019"

str(metadata(se)[["txomeInfo"]])

## List of 9
##  $ index      : chr "Dm.BDGP6.22.98_salmon-0.14.1"
##  $ source     : chr "Ensembl"
##  $ organism   : chr "Drosophila melanogaster"
##  $ release    : chr "98"
##  $ genome     : chr "BDGP6.22"
##  $ fasta      :List of 1
##   ..$ : chr [1:2] "ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.22.cdna.all.fa.gz" "ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.22.ncrna.fa.gz"
##  $ gtf        : chr "/home/biocbuild/bbs-3.11-bioc/R/library/tximportData/extdata/salmon_dm/Drosophila_melanogaster.BDGP6.22.98.gtf.gz"
##  $ sha256     : chr "7ba5e9597796ea86cf11ccf6635ca88fbc37c2848d38083c23986aa2c6a21eae"
##  $ linkedTxome: logi TRUE

metadata(se)[["tximetaInfo"]]

## $version
## [1] '1.6.3'
## 
## $importTime
## [1] "2020-08-20 21:20:59 EDT"

metadata(se)[["txdbInfo"]]

##                               Db type                       Type of Gene ID 
##                               "EnsDb"                     "Ensembl Gene ID" 
##                    Supporting package                         Db created by 
##                           "ensembldb" "ensembldb package from Bioconductor" 
##                        script_version                         Creation time 
##                               "0.3.5"            "Tue Nov 19 08:33:44 2019" 
##                       ensembl_version                          ensembl_host 
##                                  "98"                           "localhost" 
##                              Organism                           taxonomy_id 
##             "Drosophila melanogaster"                                "7227" 
##                          genome_build                       DBSCHEMAVERSION 
##                            "BDGP6.22"                                 "2.1"

Errors connecting to a database

tximeta makes use of BiocFileCache to store transcript and other databases, so saving the relevant databases in a centralized location used by other Bioconductor packages as well. It is possible that an error can occur in connecting to these databases, either if the files were accidentally removed from the file system, or if there was an error generating or writing the database to the cache location. In each of these cases, it is easy to remove the entry in the BiocFileCache so that tximeta will know to regenerate the transcript database or any other missing database.

If you have used the default cache location, then you can obtain access to your BiocFileCache with:

library(BiocFileCache)

## Loading required package: dbplyr

bfc <- BiocFileCache()

Otherwise, you can recall your particular tximeta cache location with getTximetaBFC().

You can then inspect the entries in your BiocFileCache using bfcinfo and remove the entry associated with the missing database with bfcremove. See the BiocFileCache vignette for more details on finding and removing entries from a BiocFileCache.

Note that there may be many entries in the BiocFileCache location, including .sqlite database files and serialized .rds files. You should only remove the entry associated with the missing database, e.g. if R gave an error when trying to connect to the TxDb associated with GENCODE v99 human transcripts, you should look for the rid of the entry associated with the human v99 GTF from GENCODE.

What if checksum isn’t known?

tximeta automatically imports relevant metadata when the transcriptome matches a known source – known in the sense that it is in the set of pre-computed hashed checksums in tximeta (GENCODE, Ensembl, and RefSeq for human and mouse). tximeta also facilitates the linking of transcriptomes used in building the Salmon index with relevant public sources, in the case that these are not part of this pre-computed set known to tximeta. The linking of the transcriptome source with the quantification files is important in the case that the transcript sequence no longer matches a known source (uniquely combined or filtered FASTA files), or if the source is not known to tximeta. Combinations of coding and non-coding human, mouse, and fruit fly Ensembl transcripts should be automatically recognized by tximeta and does not require making a linkedTxome. As the package is further developed, we plan to roll out support for all common transcriptomes, from all sources. Below we demonstrate how to make a linkedTxome and how to share and load a linkedTxome.

We point to a Salmon quantification file which was quantified against a transcriptome that included the coding and non-coding Drosophila melanogaster transcripts, as well as an artificial transcript of 960 bp (for demonstration purposes only).

file <- file.path(dir, "SRR1197474.plus", "quant.sf")
file.exists(file)

## [1] TRUE

coldata <- data.frame(files=file, names="SRR1197474", sample="1",
                      stringsAsFactors=FALSE)

Trying to import the files gives a message that tximeta couldn’t find a matching transcriptome, so it returns an non-ranged SummarizedExperiment.

se <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 
## couldn't find matching transcriptome, returning non-ranged SummarizedExperiment

Linked transcriptomes

If the transcriptome used to generate the Salmon index does not match any transcriptomes from known sources (e.g. from combining or filtering known transcriptome files), there is not much that can be done to automatically populate the metadata during quantification import. However, we can facilitate the following two cases:

the transcriptome was created locally and has been linked to its public source(s)
the transcriptome was produced by another group, and they have produced and shared a file that links the transcriptome to public source(s)

tximeta offers functionality to assist reproducible analysis in both of these cases.

To make this quantification reproducible, we make a linkedTxome which records key information about the sources of the transcript FASTA files, and the location of the relevant GTF file. It also records the checksum of the transcriptome that was computed by Salmon during the index step.

Multiple GTF/GFF files: linkedTxome and tximeta do not currently support multiple GTF/GFF files, which is a more complicated case than multiple FASTA, which is supported. Currently, we recommend that users should add or combine GTF/GFF files themselves to create a single GTF/GFF file that contains all features used in quantification, and then upload such a file to Zenodo, which can then be linked as shown below. Feel free to contact the developers on the Bioconductor support site or GitHub Issue page for further details or feature requests.

By default, linkedTxome will write out a JSON file which can be shared with others, linking the checksum of the index with the other metadata, including FASTA and GTF sources. By default, it will write out to a file with the same name as the indexDir, but with a .json extension added. This can be prevented with write=FALSE, and the file location can be changed with jsonFile.

First we specify the path where the Salmon index is located.

Typically you would not use system.file and file.path to locate this directory, but simply define indexDir to be the path of the Salmon directory on your machine. Here we use system.file and file.path because we have included parts of a Salmon index directory in the tximeta package itself for demonstration of functionality in this vignette.

indexDir <- file.path(dir, "Dm.BDGP6.22.98.plus_salmon-0.14.1")

Now we provide the location of the FASTA files and the GTF file for this transcriptome.

Note: the basename for the GTF file is used as a unique identifier for the cached versions of the TxDb and the transcript ranges, which are stored on the user’s behalf via BiocFileCache. This is not an issue, as GENCODE, Ensembl, and RefSeq all provide GTF files which are uniquely identified by their filename, e.g. Drosophila_melanogaster.BDGP6.22.98.gtf.gz.

The recommended usage of tximeta would be to specify a remote GTF source, as seen in the commented-out line below:

fastaFTP <- c("ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP6.22.cdna.all.fa.gz",
              "ftp://ftp.ensembl.org/pub/release-98/fasta/drosophila_melanogaster/ncrna/Drosophila_melanogaster.BDGP6.22.ncrna.fa.gz",
              "extra_transcript.fa.gz")
#gtfFTP <- "ftp://path/to/custom/Drosophila_melanogaster.BDGP6.22.98.plus.gtf.gz"

Instead of the above commented-out FTP location for the GTF file, we specify a location within an R package. This step is just to avoid downloading from a remote FTP during vignette building. This use of file.path to point to a file in an R package is specific to this vignette and should not be used in a typical workflow. The following GTF file is a modified version of the release 98 from Ensembl, which includes description of a one transcript, one exon artificial gene which was inserted into the transcriptome (for demonstration purposes only).

gtfPath <- file.path(dir,"Drosophila_melanogaster.BDGP6.22.98.plus.gtf.gz")

Finally, we create a linkedTxome. In this vignette, we point to a temporary directory for the JSON file, but a more typical workflow would write the JSON file to the same location as the Salmon index by not specifying jsonFile.

makeLinkedTxome performs two operation: (1) it creates a new entry in an internal table that links the transcriptome used in the Salmon index to its sources, and (2) it creates a JSON file such that this linkedTxome can be shared.

tmp <- tempdir()
jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json"))
makeLinkedTxome(indexDir=indexDir,
                source="Ensembl", organism="Drosophila melanogaster",
                release="98", genome="BDGP6.22",
                fasta=fastaFTP, gtf=gtfPath,
                jsonFile=jsonFile)

## writing linkedTxome to /tmp/RtmpkRsB4U/Dm.BDGP6.22.98.plus_salmon-0.14.1.json

## saving linkedTxome in bfc

After running makeLinkedTxome, the connection between this Salmon index (and its checksum) with the sources is saved for persistent usage. Note that because we added a single transcript of 960bp to the FASTA file used for quantification, tximeta could tell that this was not quantified against release 98 of the Ensembl transcripts for Drosophila melanogaster. Only when the correct set of transcripts were specified does tximeta recognize and import the correct metadata.

With use of tximeta and a linkedTxome, the software figures out if the remote GTF has been accessed and compiled into a TxDb before, and on future calls, it will simply load the pre-computed metadata and transcript ranges.

Note the warning that 5 of the transcripts are missing from the GTF file and so are dropped from the final output. This is a problem coming from the annotation source, and not easily avoided by tximeta.

se <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 
## found matching linked transcriptome:
## [ Ensembl - Drosophila melanogaster - release 98 ]
## useHub=TRUE: checking for EnsDb via 'AnnotationHub'
## snapshotDate(): 2020-04-27
## found matching EnsDb via 'AnnotationHub'
## loading from cache
## generating transcript ranges

## Warning in checkAssays2Txps(assays, txps): 
## 
## Warning: the annotation is missing some transcripts that were quantified.
## 1 out of 33707 txps were missing from GTF/GFF but were in the indexed FASTA.
## (This occurs sometimes with Ensembl txps on haplotype chromosomes.)
## In order to build a ranged SummarizedExperiment, these txps were removed.
## To keep these txps, and to skip adding ranges, use skipMeta=TRUE
## 
## Missing txps: [Newgene]

We can see that the appropriate metadata and transcript ranges are attached.

rowRanges(se)

## GRanges object with 33706 ranges and 9 metadata columns:
##               seqnames            ranges strand |       tx_id     tx_biotype
##                  <Rle>         <IRanges>  <Rle> | <character>    <character>
##   FBtr0070129        X     656673-657899      + | FBtr0070129 protein_coding
##   FBtr0070126        X     656356-657899      + | FBtr0070126 protein_coding
##   FBtr0070128        X     656673-657899      + | FBtr0070128 protein_coding
##   FBtr0070124        X     656114-657899      + | FBtr0070124 protein_coding
##   FBtr0070127        X     656356-657899      + | FBtr0070127 protein_coding
##           ...      ...               ...    ... .         ...            ...
##   FBtr0114299       2R 21325218-21325323      + | FBtr0114299         snoRNA
##   FBtr0113582       3R   5598638-5598777      - | FBtr0113582         snoRNA
##   FBtr0091635       3L   1488906-1489045      + | FBtr0091635         snoRNA
##   FBtr0113599       3L     261803-261953      - | FBtr0113599         snoRNA
##   FBtr0113600       3L     831870-832008      - | FBtr0113600         snoRNA
##               tx_cds_seq_start tx_cds_seq_end     gene_id tx_support_level
##                      <integer>      <integer> <character>        <integer>
##   FBtr0070129           657110         657595 FBgn0025637             <NA>
##   FBtr0070126           657110         657595 FBgn0025637             <NA>
##   FBtr0070128           657110         657595 FBgn0025637             <NA>
##   FBtr0070124           657110         657595 FBgn0025637             <NA>
##   FBtr0070127           657110         657595 FBgn0025637             <NA>
##           ...              ...            ...         ...              ...
##   FBtr0114299             <NA>           <NA> FBgn0086023             <NA>
##   FBtr0113582             <NA>           <NA> FBgn0082989             <NA>
##   FBtr0091635             <NA>           <NA> FBgn0086670             <NA>
##   FBtr0113599             <NA>           <NA> FBgn0083014             <NA>
##   FBtr0113600             <NA>           <NA> FBgn0083057             <NA>
##               tx_id_version gc_content     tx_name
##                 <character>  <numeric> <character>
##   FBtr0070129   FBtr0070129    44.7641 FBtr0070129
##   FBtr0070126   FBtr0070126    44.8128 FBtr0070126
##   FBtr0070128   FBtr0070128    44.7974 FBtr0070128
##   FBtr0070124   FBtr0070124    43.8859 FBtr0070124
##   FBtr0070127   FBtr0070127    44.8571 FBtr0070127
##           ...           ...        ...         ...
##   FBtr0114299   FBtr0114299    35.8491 FBtr0114299
##   FBtr0113582   FBtr0113582    32.8571 FBtr0113582
##   FBtr0091635   FBtr0091635    45.0000 FBtr0091635
##   FBtr0113599   FBtr0113599    48.3444 FBtr0113599
##   FBtr0113600   FBtr0113600    44.6043 FBtr0113600
##   -------
##   seqinfo: 25 sequences (1 circular) from BDGP6.22 genome

seqinfo(se)

## Seqinfo object with 25 sequences (1 circular) from BDGP6.22 genome:
##   seqnames                        seqlengths isCircular   genome
##   211000022278279                      12714      FALSE BDGP6.22
##   211000022278436                       2815      FALSE BDGP6.22
##   211000022278449                       1947      FALSE BDGP6.22
##   211000022278760                       1144      FALSE BDGP6.22
##   211000022279165                       1118      FALSE BDGP6.22
##   ...                                    ...        ...      ...
##   Unmapped_Scaffold_8_D1580_D1567      88768      FALSE BDGP6.22
##   X                                 23542271      FALSE BDGP6.22
##   Y                                  3667352      FALSE BDGP6.22
##   mitochondrion_genome                 19524       TRUE BDGP6.22
##   rDNA                                 76973      FALSE BDGP6.22

Clear linkedTxomes

The following code removes the entire table with information about the linkedTxomes. This is just for demonstration, so that we can show how to load a JSON file below.

Note: Running this code will clear any information about linkedTxomes. Don’t run this unless you really want to clear this table!

library(BiocFileCache)
if (interactive()) {
  bfcloc <- getTximetaBFC()
} else {
  bfcloc <- tempdir()
}
bfc <- BiocFileCache(bfcloc)
bfcinfo(bfc)

## # A tibble: 7 x 10
##   rid   rname create_time access_time rpath rtype fpath last_modified_t… etag 
##   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl> <chr>
## 1 BFC1  link… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 2 BFC2  Dros… 2020-08-21… 2020-08-21… /tmp… rela… /hom…               NA <NA> 
## 3 BFC3  txpR… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 4 BFC4  exon… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 5 BFC5  gene… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 6 BFC6  Dros… 2020-08-21… 2020-08-21… /tmp… rela… /hom…               NA <NA> 
## 7 BFC7  txpR… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## # … with 1 more variable: expires <dbl>

bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid)
bfcinfo(bfc)

## # A tibble: 6 x 10
##   rid   rname create_time access_time rpath rtype fpath last_modified_t… etag 
##   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl> <chr>
## 1 BFC2  Dros… 2020-08-21… 2020-08-21… /tmp… rela… /hom…               NA <NA> 
## 2 BFC3  txpR… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 3 BFC4  exon… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 4 BFC5  gene… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 5 BFC6  Dros… 2020-08-21… 2020-08-21… /tmp… rela… /hom…               NA <NA> 
## 6 BFC7  txpR… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## # … with 1 more variable: expires <dbl>

Loading linkedTxome JSON files

If a collaborator or the Suppmentary Files for a publication shares a linkedTxome JSON file, we can likewise use tximeta to automatically assemble the relevant metadata and transcript ranges. This implies that the other person has used tximeta with the function makeLinkedTxome demonstrated above, pointing to their Salmon index and to the FASTA and GTF source(s).

We point to the JSON file and use loadLinkedTxome and then the relevant metadata is saved for persistent usage. In this case, we saved the JSON file in a temporary directory.

jsonFile <- file.path(tmp, paste0(basename(indexDir), ".json"))
loadLinkedTxome(jsonFile)

## saving linkedTxome in bfc (first time)

Again, using tximeta figures out whether it needs to access the remote GTF or not, and assembles the appropriate object on the user’s behalf.

se <- tximeta(coldata)

## importing quantifications

## reading in files with read_tsv

## 1 
## found matching linked transcriptome:
## [ Ensembl - Drosophila melanogaster - release 98 ]
## loading existing EnsDb created: 2020-08-21 01:21:28
## loading existing transcript ranges created: 2020-08-21 01:21:29

## Warning in checkAssays2Txps(assays, txps): 
## 
## Warning: the annotation is missing some transcripts that were quantified.
## 1 out of 33707 txps were missing from GTF/GFF but were in the indexed FASTA.
## (This occurs sometimes with Ensembl txps on haplotype chromosomes.)
## In order to build a ranged SummarizedExperiment, these txps were removed.
## To keep these txps, and to skip adding ranges, use skipMeta=TRUE
## 
## Missing txps: [Newgene]

Clear linkedTxomes again

Finally, we clear the linkedTxomes table again so that the above examples will work. This is just for the vignette code and not part of a typical workflow.

Note: Running this code will clear any information about linkedTxomes. Don’t run this unless you really want to clear this table!

if (interactive()) {
  bfcloc <- getTximetaBFC()
} else {
  bfcloc <- tempdir()
}
bfc <- BiocFileCache(bfcloc)
bfcinfo(bfc)

## # A tibble: 7 x 10
##   rid   rname create_time access_time rpath rtype fpath last_modified_t… etag 
##   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl> <chr>
## 1 BFC2  Dros… 2020-08-21… 2020-08-21… /tmp… rela… /hom…               NA <NA> 
## 2 BFC3  txpR… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 3 BFC4  exon… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 4 BFC5  gene… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 5 BFC6  Dros… 2020-08-21… 2020-08-21… /tmp… rela… /hom…               NA <NA> 
## 6 BFC7  txpR… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 7 BFC8  link… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## # … with 1 more variable: expires <dbl>

bfcremove(bfc, bfcquery(bfc, "linkedTxomeTbl")$rid)
bfcinfo(bfc)

## # A tibble: 6 x 10
##   rid   rname create_time access_time rpath rtype fpath last_modified_t… etag 
##   <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl> <chr>
## 1 BFC2  Dros… 2020-08-21… 2020-08-21… /tmp… rela… /hom…               NA <NA> 
## 2 BFC3  txpR… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 3 BFC4  exon… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 4 BFC5  gene… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## 5 BFC6  Dros… 2020-08-21… 2020-08-21… /tmp… rela… /hom…               NA <NA> 
## 6 BFC7  txpR… 2020-08-21… 2020-08-21… /tmp… rela… 2571…               NA <NA> 
## # … with 1 more variable: expires <dbl>

Other quantifiers

tximeta can import the output from any quantifiers that are supported by tximport, and if these are not Salmon, alevin, or Sailfish output, it will simply return a non-ranged SummarizedExperiment by default.

An alternative solution is to wrap other quantifiers in workflows that include metadata information JSON files along with each quantification file. One can place these files in aux_info/meta_info.json or any relative location specified by customMetaInfo, for example customMetaInfo="meta_info.json". This JSON file is located relative to the quantification file and should contain a tag index_seq_hash with an associated value of the SHA-256 hash of the reference transcripts. For computing the hash value of the reference transcripts, see the FastaDigest python package. The hash value used by Salmon is the SHA-256 hash value of the reference sequences stripped of the header lines, and concatenated together with the empty string (so only cDNA sequences combined without any new line characters). FastaDigest can be installed with pip install fasta_digest.

Automated analysis with ARMOR

This vignette described the use of tximeta to import quantification data into R/Bioconductor with automatic detection and addition of metadata. The SummarizedExperiment produced by tximeta can then be provided to downstream statistical analysis packages as described above. The tximeta package does not contain any functionality for automated differential analysis.

The ARMOR workflow does automate a variety of differential analyses, and make use of tximeta for creation of a SummarizedExperiment with attached annotation metadata. ARMOR stands for ``An Automated Reproducible MOdular Workflow for Preprocessing and Differential Analysis of RNA-seq Data’’ and is described in more detail in the article by Orjuela et al. (2019).

Acknowledgments

The development of tximeta has benefited from suggestions from these and other individuals in the community:

Vincent Carey
Lori Shepherd
Martin Morgan
Koen Van den Berge
Johannes Rainer
James Ashmore

Next steps

Integration with GA4GH / refget API

We are collaborating and in communication with GA4GH working groups to build out functionality to perform lookup on collections of transcripts, i.e. transcriptomes, provided by Ensembl. This will greatly expand the applicability of tximeta. The current version of tximeta relies on hashing of common transcriptomes, which are stored within the package’s extdata directory in a CSV file, or on the use of linkedTxome for any additional transcriptomes. However, with the GA4GH / refget API in place, tximeta will be able to identify and access transcriptome metadata for vastly more reference transcriptomes (collections of transcripts).

Facilitate plots and summaries

Basic plots across samples: abundances, mapping rates, rich bias model parameters
Time summaries: when quantified? when imported?

Extended functionality

Facilitate functional annotation, either with vignettes/workflow or with additional functionality. E.g.: housekeeping genes, arbitrary gene sets, genes expressed in GTEx tissues
liftOver is clunky and doesn’t integrate with GenomeInfoDb. It requires user input and there’s a chance to mis-annotate. Ideally this should all be automated.

Session info

library(devtools)

## Loading required package: usethis

session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.2 (2020-06-22)
##  os       Ubuntu 18.04.4 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  C                           
##  ctype    en_US.UTF-8                 
##  tz       America/New_York            
##  date     2020-08-20                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package                * version  date       lib source        
##  annotate                 1.66.0   2020-08-20 [2] Bioconductor  
##  AnnotationDbi          * 1.50.3   2020-08-20 [2] Bioconductor  
##  AnnotationFilter       * 1.12.0   2020-08-20 [2] Bioconductor  
##  AnnotationHub            2.20.2   2020-08-20 [2] Bioconductor  
##  askpass                  1.1      2019-01-13 [2] CRAN (R 4.0.2)
##  assertthat               0.2.1    2019-03-21 [2] CRAN (R 4.0.2)
##  backports                1.1.8    2020-06-17 [2] CRAN (R 4.0.2)
##  Biobase                * 2.48.0   2020-08-20 [2] Bioconductor  
##  BiocFileCache          * 1.12.1   2020-08-20 [2] Bioconductor  
##  BiocGenerics           * 0.34.0   2020-08-20 [2] Bioconductor  
##  BiocManager              1.30.10  2019-11-16 [2] CRAN (R 4.0.2)
##  BiocParallel             1.22.0   2020-08-20 [2] Bioconductor  
##  BiocVersion              3.11.1   2020-08-20 [2] Bioconductor  
##  biomaRt                  2.44.1   2020-08-20 [2] Bioconductor  
##  Biostrings               2.56.0   2020-08-20 [2] Bioconductor  
##  bit                      4.0.4    2020-08-04 [2] CRAN (R 4.0.2)
##  bit64                    4.0.2    2020-07-30 [2] CRAN (R 4.0.2)
##  bitops                   1.0-6    2013-08-17 [2] CRAN (R 4.0.2)
##  blob                     1.2.1    2020-01-20 [2] CRAN (R 4.0.2)
##  callr                    3.4.3    2020-03-28 [2] CRAN (R 4.0.2)
##  cli                      2.0.2    2020-02-28 [2] CRAN (R 4.0.2)
##  colorspace               1.4-1    2019-03-18 [2] CRAN (R 4.0.2)
##  crayon                   1.3.4    2017-09-16 [2] CRAN (R 4.0.2)
##  curl                     4.3      2019-12-02 [2] CRAN (R 4.0.2)
##  DBI                      1.1.0    2019-12-15 [2] CRAN (R 4.0.2)
##  dbplyr                 * 1.4.4    2020-05-27 [2] CRAN (R 4.0.2)
##  DelayedArray           * 0.14.1   2020-08-20 [2] Bioconductor  
##  desc                     1.2.0    2018-05-01 [2] CRAN (R 4.0.2)
##  DESeq2                 * 1.28.1   2020-08-20 [2] Bioconductor  
##  devtools               * 2.3.1    2020-07-21 [2] CRAN (R 4.0.2)
##  digest                   0.6.25   2020-02-23 [2] CRAN (R 4.0.2)
##  dplyr                    1.0.2    2020-08-18 [2] CRAN (R 4.0.2)
##  edgeR                  * 3.30.3   2020-08-20 [2] Bioconductor  
##  ellipsis                 0.3.1    2020-05-15 [2] CRAN (R 4.0.2)
##  ensembldb              * 2.12.1   2020-08-20 [2] Bioconductor  
##  evaluate                 0.14     2019-05-28 [2] CRAN (R 4.0.2)
##  fansi                    0.4.1    2020-01-08 [2] CRAN (R 4.0.2)
##  fastmap                  1.0.1    2019-10-08 [2] CRAN (R 4.0.2)
##  fs                       1.5.0    2020-07-31 [2] CRAN (R 4.0.2)
##  genefilter               1.70.0   2020-08-20 [2] Bioconductor  
##  geneplotter              1.66.0   2020-08-20 [2] Bioconductor  
##  generics                 0.0.2    2018-11-29 [2] CRAN (R 4.0.2)
##  GenomeInfoDb           * 1.24.2   2020-08-20 [2] Bioconductor  
##  GenomeInfoDbData         1.2.3    2020-07-07 [2] Bioconductor  
##  GenomicAlignments        1.24.0   2020-08-20 [2] Bioconductor  
##  GenomicFeatures        * 1.40.1   2020-08-20 [2] Bioconductor  
##  GenomicRanges          * 1.40.0   2020-08-20 [2] Bioconductor  
##  ggplot2                  3.3.2    2020-06-19 [2] CRAN (R 4.0.2)
##  glue                     1.4.1    2020-05-13 [2] CRAN (R 4.0.2)
##  gtable                   0.3.0    2019-03-25 [2] CRAN (R 4.0.2)
##  highr                    0.8      2019-03-20 [2] CRAN (R 4.0.2)
##  hms                      0.5.3    2020-01-08 [2] CRAN (R 4.0.2)
##  htmltools                0.5.0    2020-06-16 [2] CRAN (R 4.0.2)
##  httpuv                   1.5.4    2020-06-06 [2] CRAN (R 4.0.2)
##  httr                     1.4.2    2020-07-20 [2] CRAN (R 4.0.2)
##  interactiveDisplayBase   1.26.3   2020-08-20 [2] Bioconductor  
##  IRanges                * 2.22.2   2020-08-20 [2] Bioconductor  
##  jsonlite                 1.7.0    2020-06-25 [2] CRAN (R 4.0.2)
##  knitr                    1.29     2020-06-23 [2] CRAN (R 4.0.2)
##  later                    1.1.0.1  2020-06-05 [2] CRAN (R 4.0.2)
##  lattice                  0.20-41  2020-04-02 [2] CRAN (R 4.0.2)
##  lazyeval                 0.2.2    2019-03-15 [2] CRAN (R 4.0.2)
##  lifecycle                0.2.0    2020-03-06 [2] CRAN (R 4.0.2)
##  limma                  * 3.44.3   2020-08-20 [2] Bioconductor  
##  locfit                   1.5-9.4  2020-03-25 [2] CRAN (R 4.0.2)
##  magrittr                 1.5      2014-11-22 [2] CRAN (R 4.0.2)
##  Matrix                   1.2-18   2019-11-27 [2] CRAN (R 4.0.2)
##  matrixStats            * 0.56.0   2020-03-13 [2] CRAN (R 4.0.2)
##  memoise                  1.1.0    2017-04-21 [2] CRAN (R 4.0.2)
##  mime                     0.9      2020-02-04 [2] CRAN (R 4.0.2)
##  munsell                  0.5.0    2018-06-12 [2] CRAN (R 4.0.2)
##  openssl                  1.4.2    2020-06-27 [2] CRAN (R 4.0.2)
##  org.Dm.eg.db           * 3.11.4   2020-07-07 [2] Bioconductor  
##  pillar                   1.4.6    2020-07-10 [2] CRAN (R 4.0.2)
##  pkgbuild                 1.1.0    2020-07-13 [2] CRAN (R 4.0.2)
##  pkgconfig                2.0.3    2019-09-22 [2] CRAN (R 4.0.2)
##  pkgload                  1.1.0    2020-05-29 [2] CRAN (R 4.0.2)
##  prettyunits              1.1.1    2020-01-24 [2] CRAN (R 4.0.2)
##  processx                 3.4.3    2020-07-05 [2] CRAN (R 4.0.2)
##  progress                 1.2.2    2019-05-16 [2] CRAN (R 4.0.2)
##  promises                 1.1.1    2020-06-09 [2] CRAN (R 4.0.2)
##  ProtGenerics             1.20.0   2020-08-20 [2] Bioconductor  
##  ps                       1.3.4    2020-08-11 [2] CRAN (R 4.0.2)
##  purrr                    0.3.4    2020-04-17 [2] CRAN (R 4.0.2)
##  R6                       2.4.1    2019-11-12 [2] CRAN (R 4.0.2)
##  rappdirs                 0.3.1    2016-03-28 [2] CRAN (R 4.0.2)
##  RColorBrewer             1.1-2    2014-12-07 [2] CRAN (R 4.0.2)
##  Rcpp                     1.0.5    2020-07-06 [2] CRAN (R 4.0.2)
##  RCurl                    1.98-1.2 2020-04-18 [2] CRAN (R 4.0.2)
##  readr                    1.3.1    2018-12-21 [2] CRAN (R 4.0.2)
##  remotes                  2.2.0    2020-07-21 [2] CRAN (R 4.0.2)
##  rlang                    0.4.7    2020-07-09 [2] CRAN (R 4.0.2)
##  rmarkdown                2.3      2020-06-18 [2] CRAN (R 4.0.2)
##  rprojroot                1.3-2    2018-01-03 [2] CRAN (R 4.0.2)
##  Rsamtools                2.4.0    2020-08-20 [2] Bioconductor  
##  RSQLite                  2.2.0    2020-01-07 [2] CRAN (R 4.0.2)
##  rtracklayer              1.48.0   2020-08-20 [2] Bioconductor  
##  S4Vectors              * 0.26.1   2020-08-20 [2] Bioconductor  
##  scales                   1.1.1    2020-05-11 [2] CRAN (R 4.0.2)
##  sessioninfo              1.1.1    2018-11-05 [2] CRAN (R 4.0.2)
##  shiny                    1.5.0    2020-06-23 [2] CRAN (R 4.0.2)
##  stringi                  1.4.6    2020-02-17 [2] CRAN (R 4.0.2)
##  stringr                  1.4.0    2019-02-10 [2] CRAN (R 4.0.2)
##  SummarizedExperiment   * 1.18.2   2020-08-20 [2] Bioconductor  
##  survival                 3.2-3    2020-06-13 [2] CRAN (R 4.0.2)
##  testthat                 2.3.2    2020-03-02 [2] CRAN (R 4.0.2)
##  tibble                   3.0.3    2020-07-10 [2] CRAN (R 4.0.2)
##  tidyselect               1.1.0    2020-05-11 [2] CRAN (R 4.0.2)
##  tximeta                * 1.6.3    2020-08-20 [1] Bioconductor  
##  tximport                 1.16.1   2020-08-20 [2] Bioconductor  
##  usethis                * 1.6.1    2020-04-29 [2] CRAN (R 4.0.2)
##  utf8                     1.1.4    2018-05-24 [2] CRAN (R 4.0.2)
##  vctrs                    0.3.2    2020-07-15 [2] CRAN (R 4.0.2)
##  withr                    2.2.0    2020-04-20 [2] CRAN (R 4.0.2)
##  xfun                     0.16     2020-07-24 [2] CRAN (R 4.0.2)
##  XML                      3.99-0.5 2020-07-23 [2] CRAN (R 4.0.2)
##  xtable                   1.8-4    2019-04-21 [2] CRAN (R 4.0.2)
##  XVector                  0.28.0   2020-08-20 [2] Bioconductor  
##  yaml                     2.2.1    2020-02-01 [2] CRAN (R 4.0.2)
##  zlibbioc                 1.34.0   2020-08-20 [2] Bioconductor  
## 
## [1] /tmp/RtmpZGeWJR/Rinst1e2f6f0c7490
## [2] /home/biocbuild/bbs-3.11-bioc/R/library

References

Lawrence, Wolfgang AND Pagès, Michael AND Huber. 2013. “Software for Computing and Annotating Genomic Ranges.” PLOS Computational Biology 9 (8). Public Library of Science:1–10.

Love, Michael I., Charlotte Soneson, Peter F. Hickey, Lisa K. Johnson, N. Tessa Pierce, Lori Shepherd, Martin Morgan, and Rob Patro. 2020. “Tximeta: Reference sequence checksums for provenance identification in RNA-seq.” PLOS Computational Biology. https://doi.org/10.1371/journal.pcbi.1007664.

Orjuela, Stephany, Ruizhu Huang, Katharina M. Hembach, Mark D. Robinson, and Charlotte Soneson. 2019. “ARMOR: An Automated Reproducible MOdular Workflow for Preprocessing and Differential Analysis of RNA-seq Data.” G3: Genes, Genomes, Genetics. G3: Genes, Genomes, Genetics.

Patro, Rob, Geet Duggal, Michael I. Love, Rafael A. Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods. https://doi.org/10.1038/nmeth.4197.

Rainer, Johannes, Laurent Gatto, and Christian X Weichenberger. 2019. “ensembldb: an R package to create and use Ensembl-based annotation resources.” Bioinformatics, January.

Soneson, Charlotte, Michael I. Love, and Mark Robinson. 2015. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4 (1521). https://doi.org/10.12688/f1000research.7563.1.

Srivastava, Avi, Laraib Malik, Tom Sean Smith, Ian Sudbery, and Rob Patro. 2019. “Alevin efficiently estimates accurate gene abundances from dscRNA-seq data.” Genome Biology 20 (65). https://doi.org/10.1186/s13059-019-1670-y.

Tximeta: transcript quantification import with automatic metadata

Michael I. Love, Charlotte Soneson, Peter F. Hickey, Rob Patro

08/20/2020