txdbmaker 0.99.8
The txdbmaker package provides functions to make TxDb objects based
on data downloaded from the UCSC Genome Browser
(https://genome.ucsc.edu/), Ensembl (https://ensembl.org/), or
BioMart (http://www.biomart.org/), or directly from a GFF or GTF file.
See the vignette in the GenomicFeatures package for an introduction to
TxDb objects.
This document demonstrates the use of these functions.
There is also support for creating TxDb objects from custom data
sources using makeTxDb; see the help page for this function for
details.
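As a minimal sketch of the custom-data route, the block below builds a small TxDb object with makeTxDb from two hand-written data frames. It assumes txdbmaker is already installed and loaded. The coordinates and the name toy_txdb are invented purely for illustration, and the column names follow the layout described on the makeTxDb help page.

```r
library(txdbmaker)

# Toy transcript table: one row per transcript (coordinates are made up).
transcripts <- data.frame(
    tx_id=1:3,
    tx_chrom="chr1",
    tx_strand=c("-", "+", "+"),
    tx_start=c(1L, 2001L, 2001L),
    tx_end=c(999L, 2199L, 2199L)
)

# Toy splicing table: one row per exon, linked to its transcript by tx_id.
splicings <- data.frame(
    tx_id=c(1L, 2L, 2L, 2L, 3L, 3L),
    exon_rank=c(1L, 1L, 2L, 3L, 1L, 2L),
    exon_start=c(1L, 2001L, 2101L, 2151L, 2001L, 2131L),
    exon_end=c(999L, 2085L, 2144L, 2199L, 2085L, 2199L)
)

# Assemble the TxDb object from the two tables and display its summary.
toy_txdb <- makeTxDb(transcripts, splicings)
print(toy_txdb)
```

Real custom sources would normally also supply gene assignments and chromosome information; those optional inputs are left out here to keep the sketch short.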
Installing the txdbmaker package
Install the package with:
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("txdbmaker")
Then load it with:
suppressPackageStartupMessages(library(txdbmaker))
makeTxDbFromUCSC
The function makeTxDbFromUCSC downloads UCSC Genome Bioinformatics
transcript tables (e.g. knownGene, refGene, ensGene) for a genome
build (e.g. mm9, hg19). Use the supportedUCSCtables utility function
to get the list of tables known to work with makeTxDbFromUCSC.
supportedUCSCtables(genome="mm9")
##         tablename              track composite_track
## 1         acembly      AceView Genes            <NA>
## 2    augustusGene           AUGUSTUS            <NA>
## 3        ccdsGene               CCDS            <NA>
## 4         ensGene      Ensembl Genes            <NA>
## 5        exoniphy           Exoniphy            <NA>
## 6          geneid       Geneid Genes            <NA>
## 7         genscan      Genscan Genes            <NA>
## 8       knownGene         UCSC Genes            <NA>
## 9   knownGeneOld4     Old UCSC Genes            <NA>
## 10      nscanGene             N-SCAN            <NA>
## 11   pseudoYale60      Yale Pseudo60            <NA>
## 12        refGene        UCSC RefSeq     NCBI RefSeq
## 13        sgpGene          SGP Genes            <NA>
## 14  transcriptome      Transcriptome            <NA>
## 15 vegaPseudoGene   Vega Pseudogenes      Vega Genes
## 16       vegaGene Vega Protein Genes      Vega Genes
## 17    xenoRefGene       Other RefSeq            <NA>
mm9KG_txdb <- makeTxDbFromUCSC(genome="mm9", tablename="knownGene")
## Download the knownGene table ... OK
## Download the knownToLocusLink table ... OK
## Extract the 'transcripts' data frame ... OK
## Extract the 'splicings' data frame ... OK
## Download and preprocess the 'chrominfo' data frame ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
mm9KG_txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: mm9
## # Organism: Mus musculus
## # Taxonomy ID: 10090
## # UCSC Table: knownGene
## # UCSC Track: UCSC Genes
## # Resource URL: https://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: NA
## # Nb of transcripts: 55419
## # Db created by: txdbmaker package from Bioconductor
## # Creation time: 2024-03-26 19:47:45 -0400 (Tue, 26 Mar 2024)
## # txdbmaker version at creation time: 0.99.8
## # RSQLite version at creation time: 2.3.5
## # DBSCHEMAVERSION: 1.2
makeTxDbFromBiomart
Retrieve data from BioMart by passing the mart and the data set to
the makeTxDbFromBiomart function (not all BioMart data sets are
currently supported):
mmusculusEnsembl <- makeTxDbFromBiomart(dataset="mmusculus_gene_ensembl")
Like makeTxDbFromUCSC, the makeTxDbFromBiomart function has a
circ_seqs argument that defaults to the contents of the
DEFAULT_CIRC_SEQS vector. As with the UCSC sources, there is also a
helper function, getChromInfoFromBiomart, that shows what the
chromosomes are called for a given source.
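The helper can be tried on the same mouse data set used above. This is a sketch only: it assumes a working network connection and that the Ensembl BioMart service is reachable, so the call is wrapped in tryCatch and reports a failure instead of stopping. The variable name chrominfo is illustrative.

```r
library(txdbmaker)

# Ask Ensembl BioMart which chromosome names the mouse gene data set uses.
# The query needs network access, so errors are caught and reported.
chrominfo <- tryCatch(
    getChromInfoFromBiomart(dataset="mmusculus_gene_ensembl"),
    error=function(e) {
        message("BioMart query failed: ", conditionMessage(e))
        NULL
    }
)

# Show the first few chromosome names when the query succeeded.
if (!is.null(chrominfo)) {
    print(head(chrominfo))
}
```

Names returned this way are the ones to use if you want to pass your own circ_seqs value rather than rely on the default.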
Using the makeTxDbFromBiomart and makeTxDbFromUCSC functions can take
a while and may also require some bandwidth, as these functions have
to download and then assemble a database from their respective
sources. It is not expected that most users will want to do this step
every time. Instead, we suggest that you save your annotation objects
and label them with an appropriate time stamp so as to facilitate
reproducible research.
makeTxDbFromEnsembl
The makeTxDbFromEnsembl function creates a TxDb object for a given
organism by importing the genomic locations of its transcripts,
exons, CDS, and genes from an Ensembl database.
See ?makeTxDbFromEnsembl for more information.
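A call might look like the sketch below. It assumes network access to the public Ensembl database server and that the RMariaDB package is installed. The organism (yeast, chosen only because its annotation is small) and the variable name yeast_txdb are illustrative choices, not part of the original text.

```r
library(txdbmaker)

# Import the yeast annotation from the public Ensembl database.
# This needs network access, so errors are caught and reported.
yeast_txdb <- tryCatch(
    makeTxDbFromEnsembl(organism="Saccharomyces cerevisiae"),
    error=function(e) {
        message("Ensembl import failed: ", conditionMessage(e))
        NULL
    }
)

# Display the TxDb summary when the import succeeded.
if (!is.null(yeast_txdb)) {
    print(yeast_txdb)
}
```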
makeTxDbFromGFF
You can also extract transcript information from either GFF3 or GTF
files by using the makeTxDbFromGFF function.
Usage is similar to makeTxDbFromBiomart and makeTxDbFromUCSC.
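The following sketch shows the GTF route without any download: it writes a two-exon, hand-made GTF record to a temporary file and imports it. The gene and transcript identifiers (geneA, txA1), the source label demo, and the coordinates are all invented for illustration. A real annotation file would simply be passed by path in the same way.

```r
library(txdbmaker)

# Two exon records for a single made-up transcript, in GTF layout:
# seqname, source, feature, start, end, score, strand, frame, attributes.
gtf_lines <- c(
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "txA1";',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "geneA"; transcript_id "txA1";'
)

# Write the records to a temporary file so the example is self-contained.
gtf_file <- tempfile(fileext=".gtf")
writeLines(gtf_lines, gtf_file)

# Import the file; the format is stated explicitly rather than guessed.
gtf_txdb <- makeTxDbFromGFF(gtf_file, format="gtf")
print(gtf_txdb)
```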
Saving and Loading a TxDb Object
Once a TxDb object has been created, it can be saved to avoid the
time and bandwidth costs of recreating it and to make it possible to
reproduce results with identical genomic feature data at a later
date. Since TxDb objects are backed by a SQLite database, the save
format is a SQLite database file (which could be accessed from
programs other than R if desired). Note that it is not possible to
serialize a TxDb object using R’s save function.
saveDb(mm9KG_txdb, file="mm9KG_txdb.sqlite")
A saved TxDb object can be initialized from a .sqlite file by simply
using loadDb.
mm9KG_txdb <- loadDb("mm9KG_txdb.sqlite")
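The earlier suggestion to label saved objects with a time stamp can be combined with saveDb and loadDb as sketched below. To stay runnable without a download, the example builds a one-transcript toy TxDb with makeTxDb. The object name, the file naming scheme, and the use of a temporary directory are assumptions made for illustration.

```r
library(txdbmaker)
library(AnnotationDbi)  # provides saveDb() and loadDb()

# Build a one-transcript toy TxDb so that no download is needed.
toy_txdb <- makeTxDb(
    transcripts=data.frame(tx_id=1L, tx_chrom="chr1", tx_strand="+",
                           tx_start=1L, tx_end=100L),
    splicings=data.frame(tx_id=1L, exon_rank=1L,
                         exon_start=1L, exon_end=100L)
)

# Put the creation date (YYYYMMDD) in the file name.
stamp <- format(Sys.Date(), "%Y%m%d")
db_file <- file.path(tempdir(), paste0("toy_txdb_", stamp, ".sqlite"))

# Remove any file left over from an earlier run, then save and reload.
unlink(db_file)
saveDb(toy_txdb, file=db_file)
reloaded <- loadDb(db_file)
print(reloaded)
```

For real work you would write to a project directory rather than tempdir(), and keep the dated file alongside the analysis that used it.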
makeTxDbPackageFromUCSC and makeTxDbPackageFromBiomart
It is often more convenient to make an annotation package out of your
annotations. If that is the case, consider the convenience functions
makeTxDbPackageFromUCSC and makeTxDbPackageFromBiomart. These
functions are similar to makeTxDbFromUCSC and makeTxDbFromBiomart
except that they take the extra step of wrapping the database up into
an annotation package for you. This package can then be installed and
used like any of the standard TxDb packages found in the Bioconductor
repository.
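A call to the UCSC variant might look like the sketch below. It assumes network access to UCSC. The version string, the placeholder maintainer and author, the choice of the sacCer2 genome with the ensGene table, and writing to a temporary directory are all illustrative. Check ?makeTxDbPackageFromUCSC for the full argument list before relying on it.

```r
library(txdbmaker)

# Directory that will receive the generated package source tree.
pkg_dir <- tempdir()

# Download the table and wrap the resulting database into a package.
# This needs network access, so errors are caught and reported.
pkg_result <- tryCatch(
    makeTxDbPackageFromUCSC(
        version="0.01",
        maintainer="Some One <so@someplace.org>",  # placeholder contact
        author="Some One <so@someplace.org>",      # placeholder contact
        destDir=pkg_dir,
        genome="sacCer2",
        tablename="ensGene"
    ),
    error=function(e) {
        message("Package creation failed: ", conditionMessage(e))
        NULL
    }
)

# List what was written so the new package directory can be located.
print(list.files(pkg_dir))
```

The generated directory is an ordinary source package, so it can be installed with install.packages(path, repos=NULL, type="source"), where path points at that directory.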
Session information
sessionInfo()
## R Under development (unstable) (2024-03-18 r86148)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] txdbmaker_0.99.8 GenomicFeatures_1.55.4 AnnotationDbi_1.65.2
## [4] Biobase_2.63.0 GenomicRanges_1.55.4 GenomeInfoDb_1.39.9
## [7] IRanges_2.37.1 S4Vectors_0.41.5 BiocGenerics_0.49.1
## [10] BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 dplyr_1.1.4
## [3] blob_1.2.4 filelock_1.0.3
## [5] Biostrings_2.71.5 bitops_1.0-7
## [7] fastmap_1.1.1 RCurl_1.98-1.14
## [9] BiocFileCache_2.11.1 GenomicAlignments_1.39.5
## [11] XML_3.99-0.16.1 digest_0.6.35
## [13] timechange_0.3.0 lifecycle_1.0.4
## [15] KEGGREST_1.43.0 RSQLite_2.3.5
## [17] magrittr_2.0.3 compiler_4.4.0
## [19] rlang_1.1.3 sass_0.4.9
## [21] progress_1.2.3 tools_4.4.0
## [23] utf8_1.2.4 yaml_2.3.8
## [25] rtracklayer_1.63.1 knitr_1.45
## [27] prettyunits_1.2.0 S4Arrays_1.3.6
## [29] bit_4.0.5 curl_5.2.1
## [31] DelayedArray_0.29.9 xml2_1.3.6
## [33] abind_1.4-5 BiocParallel_1.37.1
## [35] grid_4.4.0 fansi_1.0.6
## [37] biomaRt_2.59.1 SummarizedExperiment_1.33.3
## [39] cli_3.6.2 rmarkdown_2.26
## [41] crayon_1.5.2 generics_0.1.3
## [43] httr_1.4.7 rjson_0.2.21
## [45] DBI_1.2.2 cachem_1.0.8
## [47] stringr_1.5.1 zlibbioc_1.49.3
## [49] parallel_4.4.0 BiocManager_1.30.22
## [51] XVector_0.43.1 restfulr_0.0.15
## [53] matrixStats_1.2.0 vctrs_0.6.5
## [55] Matrix_1.7-0 jsonlite_1.8.8
## [57] bookdown_0.38 hms_1.1.3
## [59] bit64_4.0.5 jquerylib_0.1.4
## [61] glue_1.7.0 codetools_0.2-19
## [63] lubridate_1.9.3 stringi_1.8.3
## [65] BiocIO_1.13.0 tibble_3.2.1
## [67] pillar_1.9.0 rappdirs_0.3.3
## [69] htmltools_0.5.8 GenomeInfoDbData_1.2.11
## [71] R6_2.5.1 dbplyr_2.5.0
## [73] httr2_1.0.0 evaluate_0.23
## [75] lattice_0.22-6 RMariaDB_1.3.1
## [77] png_0.1-8 Rsamtools_2.19.4
## [79] memoise_2.0.1 bslib_0.6.2
## [81] SparseArray_1.3.4 xfun_0.43
## [83] MatrixGenerics_1.15.0 pkgconfig_2.0.3