1 Introduction

The ensembldb package provides functions to create and use transcript-centric annotation databases/packages. The annotations for the databases are fetched directly from Ensembl using their Perl API. The functionality and data are similar to those of the TxDb packages from the GenomicFeatures package, but, in addition to retrieving all gene/transcript models and annotations from the database, the ensembldb package also provides a filter framework that allows retrieving annotations for specific entries, such as genes encoded on a chromosome region or transcript models of lincRNA genes. From version 1.7 on, EnsDb databases created by the ensembldb package also contain protein annotation data (see Section 11 for the database layout and an overview of available attributes/columns). For more information on the use of the protein annotations, refer to the proteins vignette.

Another main goal of this package is to generate versioned annotation packages, i.e. annotation packages that are built for a specific Ensembl release and are named accordingly (e.g. EnsDb.Hsapiens.v75 for human gene definitions of the Ensembl core database version 75). This ensures reproducibility, because annotations from a specific Ensembl release can still be loaded when newer versions of annotation packages/releases are available. It also makes it possible to load multiple annotation packages at the same time, e.g. to compare gene models between Ensembl releases.

In the example below we load an Ensembl based annotation package for Homo sapiens, Ensembl version 75. The EnsDb object providing access to the underlying SQLite database is bound to the variable name EnsDb.Hsapiens.v75.

library(EnsDb.Hsapiens.v75)

## Making a "short cut"
edb <- EnsDb.Hsapiens.v75
## Print some information about this package
edb
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.2.3
## |Creation time: Tue Nov 15 23:35:19 2016
## |ensembl_version: 75
## |ensembl_host: localhost
## |Organism: homo_sapiens
## |genome_build: GRCh37
## |DBSCHEMAVERSION: 1.0
## | No. of genes: 64102.
## | No. of transcripts: 215647.
## |Protein data available.
## For what organism was the database generated?
organism(edb)
## [1] "Homo sapiens"

2 Using ensembldb annotation packages to retrieve specific annotations

One of the strengths of the ensembldb package and the related EnsDb databases is the filter framework, which makes it possible to efficiently extract data subsets from the databases. The ensembldb package supports most of the filters defined in the AnnotationFilter Bioconductor package and defines some additional filters specific to the data stored in EnsDb databases. The supportedFilters method returns an overview of all supported filter classes; each of them (except the GRangesFilter) works on a single column/field in the database.

supportedFilters(edb)
##  [1] "EntrezFilter"             "ExonEndFilter"           
##  [3] "ExonIdFilter"             "ExonRankFilter"          
##  [5] "ExonStartFilter"          "GRangesFilter"           
##  [7] "GeneBiotypeFilter"        "GeneEndFilter"           
##  [9] "GeneIdFilter"             "GeneStartFilter"         
## [11] "GenenameFilter"           "ProtDomIdFilter"         
## [13] "ProteinIdFilter"          "SeqNameFilter"           
## [15] "SeqStrandFilter"          "SymbolFilter"            
## [17] "TxBiotypeFilter"          "TxEndFilter"             
## [19] "TxIdFilter"               "TxNameFilter"            
## [21] "TxStartFilter"            "UniprotDbFilter"         
## [23] "UniprotFilter"            "UniprotMappingTypeFilter"

These filters can be divided into 3 main filter types:

  • IntegerFilter: filter classes extending this basic object take a single numeric value as input and support the conditions ==, !=, >, <, >= and <=. All filters that work on chromosomal coordinates, such as the GeneEndFilter, extend IntegerFilter.
  • CharacterFilter: filter classes extending this object take a single or multiple character values as input and support the conditions ==, !=, “startsWith” and “endsWith”. All filters working on IDs extend this class.
  • GRangesFilter: takes a GRanges object as input and supports all conditions that findOverlaps from the IRanges package supports (“any”, “start”, “end”, “within”, “equal”). Note that these have to be passed to the constructor function using the parameter type.
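The three filter types above can be sketched as follows. This is a minimal illustration, assuming the EnsDb.Hsapiens.v75 package is installed and that conditions are passed via the condition argument of the constructors; the coordinates are taken from the BCL2L11 region on chromosome 2 and the variable names are only illustrative.

```r
library(EnsDb.Hsapiens.v75)
library(GenomicRanges)
edb <- EnsDb.Hsapiens.v75

## IntegerFilter: a single numeric value plus a comparison condition.
end_flt <- GeneEndFilter(111922625L, condition = "<=")

## CharacterFilter: one or more character values; here a prefix match.
name_flt <- GenenameFilter("BCL2", condition = "startsWith")

## GRangesFilter: the overlap condition is passed with 'type',
## not with 'condition'.
region <- GRanges("2", IRanges(start = 111876955, end = 111922625))
rng_flt <- GRangesFilter(region, type = "within")

## Any of these filters can be passed to a retrieval method, e.g.
## all genes lying completely within the region.
genes_in <- genes(edb, filter = rng_flt)
```

With type = "any" instead of "within", genes only partially overlapping the region would presumably be returned as well.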

The supported filters are:

  • EntrezFilter: filter results based on the NCBI Entrezgene identifiers of the genes.
  • ExonEndFilter: filter using the chromosomal end coordinate of exons.
  • ExonIdFilter: filter based on the (Ensembl) exon identifiers.
  • ExonRankFilter: filter based on the rank (index) of an exon within the transcript model. Exons are always numbered from the 5’ to the 3’ end of the transcript; thus, also on the reverse strand, exon 1 is the most 5’ exon of the transcript.
  • ExonStartFilter: filter using the chromosomal start coordinate of exons.
  • GeneBiotypeFilter: filter using the gene biotypes defined in the Ensembl database; use the listGenebiotypes method to list all available biotypes.
  • GeneEndFilter: filter using the chromosomal end coordinate of genes.
  • GeneIdFilter: filter based on the Ensembl gene IDs.
  • GenenameFilter: filter based on the names (symbols) of the genes.
  • GeneStartFilter: filter using the chromosomal start coordinate of genes.
  • GRangesFilter: retrieve all features (genes, transcripts or exons) that are either within (setting parameter type to “within”) or partially overlapping (setting type to “any”) the defined genomic region/range. Note that, depending on the called method (genes, transcripts or exons), the start and end coordinates of either the genes, transcripts or exons are used for the filter. For methods exonsBy, cdsBy and txBy the coordinates of by are used.
  • SeqNameFilter: filter by the name of the chromosomes the genes are encoded on.
  • SeqStrandFilter: filter for the chromosome strand on which the genes are encoded.
  • SymbolFilter: filter on gene symbols; note that no database column symbol is available in an EnsDb database and hence the gene name is used for filtering.
  • TxBiotypeFilter: filter on the transcript biotype defined in Ensembl; use the listTxbiotypes method to list all available biotypes.
  • TxEndFilter: filter using the chromosomal end coordinate of transcripts.
  • TxIdFilter: filter on the Ensembl transcript identifiers.
  • TxNameFilter: filter on the Ensembl transcript names (currently identical to the transcript IDs).
  • TxStartFilter: filter using the chromosomal start coordinate of transcripts.
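As a brief sketch of how the filters listed above can be used together, the example below lists the available gene biotypes and then retrieves lincRNA genes on chromosome Y. It assumes the EnsDb.Hsapiens.v75 package is installed and that filters supplied together in a list are all required to match (i.e. are combined with a logical AND); the variable name linc_y is only illustrative.

```r
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## Biotypes that can be used as values for a GeneBiotypeFilter.
head(listGenebiotypes(edb))

## Genes of biotype lincRNA that are encoded on chromosome Y.
## Assumption: both filters in the list have to match.
linc_y <- genes(edb,
                filter = list(SeqNameFilter("Y"),
                              GeneBiotypeFilter("lincRNA")))
linc_y
```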

In addition to the DNA/RNA-based filters listed above, protein-specific filters are also available:

  • ProtDomIdFilter: filter by the protein domain ID.
  • ProteinIdFilter: filter by Ensembl protein IDs.
  • UniprotDbFilter: filter by the name of the Uniprot database.
  • UniprotFilter: filter by the Uniprot ID.
  • UniprotMappingTypeFilter: filter by the mapping type of Ensembl protein IDs to Uniprot IDs.

These can however only be used on EnsDb databases that provide protein annotations, i.e. for which a call to hasProteinData returns TRUE.
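A defensive way to use the protein-related functionality is to check for protein data first. The sketch below assumes the EnsDb.Hsapiens.v75 package is installed and uses the proteins method of ensembldb (described in the proteins vignette) to fetch protein annotations for one gene; the variable name prts is only illustrative.

```r
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## Only query protein annotations if the database provides them.
if (hasProteinData(edb)) {
    prts <- proteins(edb, filter = GenenameFilter("BCL2L11"))
    print(prts)
} else {
    message("This EnsDb database does not contain protein annotations.")
}
```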

A simple use case for the filter framework would be to get all transcripts for the gene BCL2L11. To this end we specify a GenenameFilter with the value BCL2L11. As a result we get a GRanges object with start, end, seqname and strand being the start coordinate, end coordinate, chromosome name and strand of the respective transcripts. All additional annotations are available as metadata columns. Alternatively, by setting return.type to “DataFrame” or “data.frame”, the method returns a DataFrame or data.frame object instead of the default GRanges.

Tx <- transcripts(edb, filter = list(GenenameFilter("BCL2L11")))

Tx
## GRanges object with 17 ranges and 7 metadata columns:
##                   seqnames                 ranges strand |           tx_id
##                      <Rle>              <IRanges>  <Rle> |     <character>
##   ENST00000432179        2 [111876955, 111881689]      + | ENST00000432179
##   ENST00000308659        2 [111878491, 111922625]      + | ENST00000308659
##   ENST00000357757        2 [111878491, 111919016]      + | ENST00000357757
##   ENST00000393253        2 [111878491, 111909428]      + | ENST00000393253
##   ENST00000337565        2 [111878491, 111886423]      + | ENST00000337565
##               ...      ...                    ...    ... .             ...
##   ENST00000452231        2 [111881323, 111921808]      + | ENST00000452231
##   ENST00000361493        2 [111881323, 111921808]      + | ENST00000361493
##   ENST00000431217        2 [111881323, 111921929]      + | ENST00000431217
##   ENST00000439718        2 [111881323, 111922220]      + | ENST00000439718
##   ENST00000438054        2 [111881329, 111903861]      + | ENST00000438054
##                                tx_biotype tx_cds_seq_start tx_cds_seq_end
##                               <character>        <integer>      <integer>
##   ENST00000432179          protein_coding        111881323      111881689
##   ENST00000308659          protein_coding        111881323      111921808
##   ENST00000357757          protein_coding        111881323      111919016
##   ENST00000393253          protein_coding        111881323      111909428
##   ENST00000337565          protein_coding        111881323      111886328
##               ...                     ...              ...            ...
##   ENST00000452231 nonsense_mediated_decay        111881323      111919016
##   ENST00000361493 nonsense_mediated_decay        111881323      111887812
##   ENST00000431217 nonsense_mediated_decay        111881323      111902078
##   ENST00000439718 nonsense_mediated_decay        111881323      111909428
##   ENST00000438054          protein_coding        111881329      111902068
##                           gene_id         tx_name   gene_name
##                       <character>     <character> <character>
##   ENST00000432179 ENSG00000153094 ENST00000432179     BCL2L11
##   ENST00000308659 ENSG00000153094 ENST00000308659     BCL2L11
##   ENST00000357757 ENSG00000153094 ENST00000357757     BCL2L11
##   ENST00000393253 ENSG00000153094 ENST00000393253     BCL2L11
##   ENST00000337565 ENSG00000153094 ENST00000337565     BCL2L11
##               ...             ...             ...         ...
##   ENST00000452231 ENSG00000153094 ENST00000452231     BCL2L11
##   ENST00000361493 ENSG00000153094 ENST00000361493     BCL2L11
##   ENST00000431217 ENSG00000153094 ENST00000431217     BCL2L11
##   ENST00000439718 ENSG00000153094 ENST00000439718     BCL2L11
##   ENST00000438054 ENSG00000153094 ENST00000438054     BCL2L11
##   -------
##   seqinfo: 1 sequence from GRCh37 genome
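The alternative return type mentioned before the example can be requested as sketched below. The snippet assumes the EnsDb.Hsapiens.v75 package is installed; TxDF is an illustrative name chosen so that the GRanges object Tx from above stays unchanged.

```r
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## Same query as above, but returned as a DataFrame instead of a GRanges.
TxDF <- transcripts(edb, filter = list(GenenameFilter("BCL2L11")),
                    return.type = "DataFrame")
class(TxDF)
colnames(TxDF)
```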
## as this is a GRanges object we can access e.g. the start coordinates with
head(start(Tx))
## [1] 111876955 111878491 111878491 111878491 111878491 111878506
## or extract the biotype with
head(Tx$tx_biotype)