ensembldb 2.0.4
The ensembldb package provides functions to create and use transcript-centric annotation databases/packages. The annotations for the databases are fetched directly from Ensembl using their Perl API. The functionality and data are similar to those of the TxDb packages from the GenomicFeatures package, but, in addition to retrieving all gene/transcript models and annotations from the database, the ensembldb package also provides a filter framework that allows retrieving annotations for specific entries, such as genes encoded on a chromosome region or transcript models of lincRNA genes. From version 1.7 on, EnsDb databases created by the ensembldb package also contain protein annotation data (see Section 11 for the database layout and an overview of available attributes/columns). For more information on the use of the protein annotations refer to the proteins vignette.
Another main goal of this package is to generate versioned annotation packages, i.e. annotation packages that are built for a specific Ensembl release and are named accordingly (e.g. EnsDb.Hsapiens.v75 for human gene definitions of the Ensembl core database version 75). This ensures reproducibility, as annotations from a specific Ensembl release can still be loaded even if newer versions of annotation packages/releases are available. It also makes it possible to load multiple annotation packages at the same time, e.g. to compare gene models between Ensembl releases.
In the example below we load an Ensembl-based annotation package for Homo sapiens, Ensembl version 75. The EnsDb object providing access to the underlying SQLite database is bound to the variable name EnsDb.Hsapiens.v75.
library(EnsDb.Hsapiens.v75)
## Making a "short cut"
edb <- EnsDb.Hsapiens.v75
## Print some information for this package
edb
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.2.3
## |Creation time: Tue Nov 15 23:35:19 2016
## |ensembl_version: 75
## |ensembl_host: localhost
## |Organism: homo_sapiens
## |genome_build: GRCh37
## |DBSCHEMAVERSION: 1.0
## | No. of genes: 64102.
## | No. of transcripts: 215647.
## |Protein data available.
## For what organism was the database generated?
organism(edb)
## [1] "Homo sapiens"
Using ensembldb annotation packages to retrieve specific annotations
One of the strengths of the ensembldb package and the related EnsDb databases is the implementation of a filter framework that makes it possible to efficiently extract data subsets from the databases. The ensembldb package supports most of the filters defined in the AnnotationFilter Bioconductor package and defines some additional filters specific to the data stored in EnsDb databases. The supportedFilters method can be used to get an overview of all supported filter classes, each of which (except the GRangesFilter) works on a single column/field in the database.
supportedFilters(edb)
## [1] "EntrezFilter" "ExonEndFilter"
## [3] "ExonIdFilter" "ExonRankFilter"
## [5] "ExonStartFilter" "GRangesFilter"
## [7] "GeneBiotypeFilter" "GeneEndFilter"
## [9] "GeneIdFilter" "GeneStartFilter"
## [11] "GenenameFilter" "ProtDomIdFilter"
## [13] "ProteinIdFilter" "SeqNameFilter"
## [15] "SeqStrandFilter" "SymbolFilter"
## [17] "TxBiotypeFilter" "TxEndFilter"
## [19] "TxIdFilter" "TxNameFilter"
## [21] "TxStartFilter" "UniprotDbFilter"
## [23] "UniprotFilter" "UniprotMappingTypeFilter"
These filters can be divided into 3 main filter types:
IntegerFilter: filter classes extending this basic object can take a single numeric value as input and support the conditions =, !=, >, <, >= and <=. All filters that work on chromosomal coordinates, such as the GeneEndFilter, extend IntegerFilter.
CharacterFilter: filter classes extending this object can take a single or multiple character values as input and allow the conditions =, !=, “startsWith” and “endsWith”. All filters working on IDs extend this class.
GRangesFilter: takes a GRanges object as input and supports all conditions that findOverlaps from the IRanges package supports (“any”, “start”, “end”, “within”, “equal”). Note that these have to be passed to the constructor function using the parameter type.
The supported filters are:
EntrezFilter: filter results based on the NCBI Entrezgene identifiers of the genes.
ExonEndFilter: filter using the chromosomal end coordinate of exons.
ExonIdFilter: filter based on the (Ensembl) exon identifiers.
ExonRankFilter: filter based on the rank (index) of an exon within the transcript model. Exons are always numbered from the 5’ to the 3’ end of the transcript; thus, also on the reverse strand, exon 1 is the most 5’ exon of the transcript.
ExonStartFilter: filter using the chromosomal start coordinate of exons.
GeneBiotypeFilter: filter using the gene biotypes defined in the Ensembl database; use the listGenebiotypes method to list all available biotypes.
GeneEndFilter: filter using the chromosomal end coordinate of genes.
GeneIdFilter: filter based on the Ensembl gene IDs.
GenenameFilter: filter based on the names (symbols) of the genes.
GeneStartFilter: filter using the chromosomal start coordinate of genes.
GRangesFilter: retrieve all features (genes, transcripts or exons) that are either within (setting parameter type to “within”) or partially overlapping (setting type to “any”) the defined genomic region/range. Note that, depending on the called method (genes, transcripts or exons), the start and end coordinates of either the genes, transcripts or exons are used for the filter. For the methods exonsBy, cdsBy and txBy the coordinates of by are used.
SeqNameFilter: filter by the name of the chromosomes the genes are encoded on.
SeqStrandFilter: filter for the chromosome strand on which the genes are encoded.
SymbolFilter: filter on gene symbols; note that no database column symbol is available in an EnsDb database and hence the gene name is used for filtering.
TxBiotypeFilter: filter on the transcript biotype defined in Ensembl; use the listTxbiotypes method to list all available biotypes.
TxEndFilter: filter using the chromosomal end coordinate of transcripts.
TxIdFilter: filter on the Ensembl transcript identifiers.
TxNameFilter: filter on the Ensembl transcript names (currently identical to the transcript IDs).
TxStartFilter: filter using the chromosomal start coordinate of transcripts.
In addition to the above listed DNA-RNA-based filters, protein-specific filters are also available:
ProtDomIdFilter: filter by the protein domain ID.
ProteinIdFilter: filter by Ensembl protein IDs.
UniprotDbFilter: filter by the name of the Uniprot database.
UniprotFilter: filter by the Uniprot ID.
UniprotMappingTypeFilter: filter by the mapping type of Ensembl protein IDs to Uniprot IDs.
These can however only be used on EnsDb databases that provide protein annotations, i.e. for which a call to hasProteinData returns TRUE.
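The three filter types described above can be illustrated with a short sketch. It assumes that the EnsDb.Hsapiens.v75 package is installed and that the constructors accept the arguments shown; the coordinate values and the "BCL2" prefix are arbitrary illustrative choices, not values taken from the package documentation.

```r
library(ensembldb)
library(GenomicRanges)
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## IntegerFilter type: a single numeric value plus a comparison condition.
## (Illustrative value: genes ending before position 1,000,000.)
gef <- GeneEndFilter(1000000L, condition = "<")

## CharacterFilter type: one or more character values; here a prefix match.
gnf <- GenenameFilter("BCL2", condition = "startsWith")

## GRangesFilter: the overlap condition is passed with the `type` argument.
## "any" keeps partially overlapping features, "within" only contained ones.
grf <- GRangesFilter(GRanges("2", IRanges(111876000, 111923000)),
                     type = "any")

## Several filters are passed as a list; a feature has to pass all of them.
gns <- genes(edb, filter = list(gnf, grf))

## Protein-specific filters should only be used if this returns TRUE.
has_prot <- hasProteinData(edb)
```

Constructing a filter does not query the database, so filter objects can be created once and reused across calls to genes, transcripts or exons.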
A simple use case for the filter framework would be to get all transcripts for the gene BCL2L11. To this end we specify a GenenameFilter with the value BCL2L11. As a result we get a GRanges object with start, end, seqname and strand being the start coordinate, end coordinate, chromosome name and strand for the respective transcripts. All additional annotations are available as metadata columns. Alternatively, by setting return.type to “DataFrame” or “data.frame”, the method would return a DataFrame or data.frame object instead of the default GRanges.
Tx <- transcripts(edb, filter = list(GenenameFilter("BCL2L11")))
Tx
## GRanges object with 17 ranges and 7 metadata columns:
## seqnames ranges strand | tx_id
## <Rle> <IRanges> <Rle> | <character>
## ENST00000432179 2 [111876955, 111881689] + | ENST00000432179
## ENST00000308659 2 [111878491, 111922625] + | ENST00000308659
## ENST00000357757 2 [111878491, 111919016] + | ENST00000357757
## ENST00000393253 2 [111878491, 111909428] + | ENST00000393253
## ENST00000337565 2 [111878491, 111886423] + | ENST00000337565
## ... ... ... ... . ...
## ENST00000452231 2 [111881323, 111921808] + | ENST00000452231
## ENST00000361493 2 [111881323, 111921808] + | ENST00000361493
## ENST00000431217 2 [111881323, 111921929] + | ENST00000431217
## ENST00000439718 2 [111881323, 111922220] + | ENST00000439718
## ENST00000438054 2 [111881329, 111903861] + | ENST00000438054
## tx_biotype tx_cds_seq_start tx_cds_seq_end
## <character> <integer> <integer>
## ENST00000432179 protein_coding 111881323 111881689
## ENST00000308659 protein_coding 111881323 111921808
## ENST00000357757 protein_coding 111881323 111919016
## ENST00000393253 protein_coding 111881323 111909428
## ENST00000337565 protein_coding 111881323 111886328
## ... ... ... ...
## ENST00000452231 nonsense_mediated_decay 111881323 111919016
## ENST00000361493 nonsense_mediated_decay 111881323 111887812
## ENST00000431217 nonsense_mediated_decay 111881323 111902078
## ENST00000439718 nonsense_mediated_decay 111881323 111909428
## ENST00000438054 protein_coding 111881329 111902068
## gene_id tx_name gene_name
## <character> <character> <character>
## ENST00000432179 ENSG00000153094 ENST00000432179 BCL2L11
## ENST00000308659 ENSG00000153094 ENST00000308659 BCL2L11
## ENST00000357757 ENSG00000153094 ENST00000357757 BCL2L11
## ENST00000393253 ENSG00000153094 ENST00000393253 BCL2L11
## ENST00000337565 ENSG00000153094 ENST00000337565 BCL2L11
## ... ... ... ...
## ENST00000452231 ENSG00000153094 ENST00000452231 BCL2L11
## ENST00000361493 ENSG00000153094 ENST00000361493 BCL2L11
## ENST00000431217 ENSG00000153094 ENST00000431217 BCL2L11
## ENST00000439718 ENSG00000153094 ENST00000439718 BCL2L11
## ENST00000438054 ENSG00000153094 ENST00000438054 BCL2L11
## -------
## seqinfo: 1 sequence from GRCh37 genome
## as this is a GRanges object we can access e.g. the start coordinates with
head(start(Tx))
## [1] 111876955 111878491 111878491 111878491 111878491 111878506
## or extract the biotype with
head(Tx$tx_biotype)
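The return.type parameter mentioned earlier can be sketched as follows. This is a minimal example assuming the same EnsDb.Hsapiens.v75 package; the variable name tx_df is chosen only for illustration, and the exact set of returned columns is best checked with colnames rather than assumed.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## Same query as before, but returning a data.frame instead of a GRanges.
tx_df <- transcripts(edb, filter = list(GenenameFilter("BCL2L11")),
                     return.type = "data.frame")

## Annotations are now ordinary columns; list which ones were returned.
colnames(tx_df)

## Count the transcripts of the gene per transcript biotype.
table(tx_df$tx_biotype)
```

A data.frame is convenient for tabulating or merging annotations, while the default GRanges is the better choice when the result is used in overlap or other range-based operations.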