1 Introduction

The ensembldb package provides functions to create and use transcript centric annotation databases/packages. The annotation for the databases are directly fetched from Ensembl 1 using their Perl API. The functionality and data is similar to that of the TxDb packages from the GenomicFeatures package, but, in addition to retrieve all gene/transcript models and annotations from the database, the ensembldb package provides also a filter framework allowing to retrieve annotations for specific entries like genes encoded on a chromosome region or transcript models of lincRNA genes. From version 1.7 on, EnsDb databases created by the ensembldb package contain also protein annotation data (see Section 11 for the database layout and an overview of available attributes/columns). For more information on the use of the protein annotations refer to the proteins vignette.

Another main goal of this package is to generate versioned annotation packages, i.e. annotation packages that are build for a specific Ensembl release, and are also named according to that (e.g. EnsDb.Hsapiens.v75 for human gene definitions of the Ensembl code database version 75). This ensures reproducibility, as it allows to load annotations from a specific Ensembl release also if newer versions of annotation packages/releases are available. It also allows to load multiple annotation packages at the same time in order to e.g. compare gene models between Ensembl releases.

In the example below we load an Ensembl based annotation package for Homo sapiens, Ensembl version 75. The EnsDb object providing access to the underlying SQLite database is bound to the variable name EnsDb.Hsapiens.v75.

library(EnsDb.Hsapiens.v75)

## Making a "short cut"
edb <- EnsDb.Hsapiens.v75
## print some informations for this package
edb
## EnsDb for Ensembl:
## |Backend: SQLite
## |Db type: EnsDb
## |Type of Gene ID: Ensembl Gene ID
## |Supporting package: ensembldb
## |Db created by: ensembldb package from Bioconductor
## |script_version: 0.3.0
## |Creation time: Thu May 18 09:15:45 2017
## |ensembl_version: 75
## |ensembl_host: localhost
## |Organism: homo_sapiens
## |taxonomy_id: 9606
## |genome_build: GRCh37
## |DBSCHEMAVERSION: 2.0
## | No. of genes: 64102.
## | No. of transcripts: 215647.
## |Protein data available.
## For what organism was the database generated?
organism(edb)
## [1] "Homo sapiens"

2 Using ensembldb annotation packages to retrieve specific annotations

One of the strengths of the ensembldb package and the related EnsDb databases is its implementation of a filter framework that enables to efficiently extract data sub-sets from the databases. The ensembldb package supports most of the filters defined in the AnnotationFilter Bioconductor package and defines some additional filters specific to the data stored in EnsDb databases. Filters can be passed directly to all methods extracting data from an EnsDb (such as genes, transcripts or exons). Alternatively it is possible with the addFilter or filter functions to add a filter directly to an EnsDb which will then be used in all queries on that object.

The supportedFilters method can be used to get an overview over all supported filter classes, each of them (except the GRangesFilter) working on a single column/field in the database.

supportedFilters(edb)
##                      filter                field
## 1              EntrezFilter               entrez
## 2             ExonEndFilter             exon_end
## 3              ExonIdFilter              exon_id
## 4            ExonRankFilter            exon_rank
## 5           ExonStartFilter           exon_start
## 6             GRangesFilter                 <NA>
## 7         GeneBiotypeFilter         gene_biotype
## 8             GeneEndFilter             gene_end
## 9              GeneIdFilter              gene_id
## 10          GeneStartFilter           gene_start
## 11           GenenameFilter             genename
## 12          ProtDomIdFilter          prot_dom_id
## 13          ProteinIdFilter           protein_id
## 14            SeqNameFilter             seq_name
## 15          SeqStrandFilter           seq_strand
## 16             SymbolFilter               symbol
## 17          TxBiotypeFilter           tx_biotype
## 18              TxEndFilter               tx_end
## 19               TxIdFilter                tx_id
## 20             TxNameFilter              tx_name
## 21            TxStartFilter             tx_start
## 22          UniprotDbFilter           uniprot_db
## 23            UniprotFilter              uniprot
## 24 UniprotMappingTypeFilter uniprot_mapping_type

These filters can be divided into 3 main filter types:

  • IntegerFilter: filter classes extending this basic object can take a single numeric value as input and support the conditions =, !, >, <, >= and <=. All filters that work on chromosomal coordinates, such as the GeneEndFilter extend IntegerFilter.
  • CharacterFilter: filter classes extending this object can take a single or multiple character values as input and allow conditions: =, !, “startsWith” and “endsWith”. All filters working on IDs extend this class.
  • GRangesFilter: takes a GRanges object as input and supports all conditions that findOverlaps from the IRanges package supports (“any”, “start”, “end”, “within”, “equal”). Note that these have to be passed using the parameter type to the constructor function.

The supported filters are:

  • EntrezFilter: allows to filter results based on NCBI Entrezgene identifiers of the genes.
  • ExonEndFilter: filter using the chromosomal end coordinate of exons.
  • ExonIdFilter: filter based on the (Ensembl) exon identifiers.
  • ExonRankFilter: filter based on the rank (index) of an exon within the transcript model. Exons are always numbered from 5’ to 3’ end of the transcript, thus, also on the reverse strand, the exon 1 is the most 5’ exon of the transcript.
  • ExonStartFilter: filter using the chromosomal start coordinate of exons.
  • GeneBiotypeFilter: filter using the gene biotypes defined in the Ensembl database; use the listGenebiotypes method to list all available biotypes.
  • GeneEndFilter: filter using the chromosomal end coordinate of gene.
  • GeneIdFilter: filter based on the Ensembl gene IDs.
  • GenenameFilter: filter based on the names (symbols) of the genes.
  • GeneStartFilter: filter using the chromosomal start coordinate of gene.
  • GRangesFilter: allows to retrieve all features (genes, transcripts or exons) that are either within (setting parameter type to “within”) or partially overlapping (setting type to “any”) the defined genomic region/range. Note that, depending on the called method (genes, transcripts or exons) the start and end coordinates of either the genes, transcripts or exons are used for the filter. For methods exonsBy, cdsBy and txBy the coordinates of by are used.
  • SeqNameFilter: filter by the name of the chromosomes the genes are encoded on.
  • SeqStrandFilter: filter for the chromosome strand on which the genes are encoded.
  • SymbolFilter: filter on gene symbols; note that no database columns symbol is available in an EnsDb database and hence the gene name is used for filtering.
  • TxBiotypeFilter: filter on the transcript biotype defined in Ensembl; use the listTxbiotypes method to list all available biotypes.
  • TxEndFilter: filter using the chromosomal end coordinate of transcripts.
  • TxIdFilter: filter on the Ensembl transcript identifiers.
  • TxNameFilter: filter on the Ensembl transcript names (currently identical to the transcript IDs).
  • TxStartFilter: filter using the chromosomal start coordinate of transcripts.

In addition to the above listed DNA-RNA-based filters, protein-specific filters are also available:

  • ProtDomIdFilter: filter by the protein domain ID.
  • ProteinIdFilter: filter by Ensembl protein ID filters.
  • UniprotDbFilter: filter by the name of the Uniprot database.
  • UniprotFilter: filter by the Uniprot ID.
  • UniprotMappingTypeFilter: filter by the mapping type of Ensembl protein IDs to Uniprot IDs.

These can however only be used on EnsDb databases that provide protein annotations, i.e. for which a call to hasProteinData returns TRUE.

EnsDb databases for more recent Ensembl versions (starting from Ensembl 87) provide also evidence levels for individual transcripts in the tx_support_level database column. Such databases support also a TxSupportLevelFilter filter to use this columns for filtering.

A simple use case for the filter framework would be to get all transcripts for the gene BCL2L11. To this end we specify a GenenameFilter with the value BCL2L11. As a result we get a GRanges object with start, end, strand and seqname being the start coordinate, end coordinate, chromosome name and strand for the respective transcripts. All additional annotations are available as metadata columns. Alternatively, by setting return.type to “DataFrame”, or “data.frame” the method would return a DataFrame or data.frame object instead of the default GRanges.

Tx <- transcripts(edb, filter = list(GenenameFilter("BCL2L11")))

Tx
## GRanges object with 17 ranges and 7 metadata columns:
##                   seqnames                 ranges strand |           tx_id
##                      <Rle>              <IRanges>  <Rle> |     <character>
##   ENST00000432179        2 [111876955, 111881689]      + | ENST00000432179
##   ENST00000308659        2 [111878491, 111922625]      + | ENST00000308659
##   ENST00000357757        2 [111878491, 111919016]      + | ENST00000357757
##   ENST00000393253        2 [111878491, 111909428]      + | ENST00000393253
##   ENST00000337565        2 [111878491, 111886423]      + | ENST00000337565
##               ...      ...                    ...    ... .             ...
##   ENST00000452231        2 [111881323, 111921808]      + | ENST00000452231
##   ENST00000361493        2 [111881323, 111921808]      + | ENST00000361493
##   ENST00000431217        2 [111881323, 111921929]      + | ENST00000431217
##   ENST00000439718        2 [111881323, 111922220]      + | ENST00000439718
##   ENST00000438054        2 [111881329, 111903861]      + | ENST00000438054
##                                tx_biotype tx_cds_seq_start tx_cds_seq_end
##                               <character>        <integer>      <integer>
##   ENST00000432179          protein_coding        111881323      111881689
##   ENST00000308659          protein_coding        111881323      111921808
##   ENST00000357757          protein_coding        111881323      111919016
##   ENST00000393253          protein_coding        111881323      111909428
##   ENST00000337565          protein_coding        111881323      111886328
##               ...                     ...              ...            ...
##   ENST00000452231 nonsense_mediated_decay        111881323      111919016
##   ENST00000361493 nonsense_mediated_decay        111881323      111887812
##   ENST00000431217 nonsense_mediated_decay        111881323      111902078
##   ENST00000439718 nonsense_mediated_decay        111881323      111909428
##   ENST00000438054          protein_coding        111881329      111902068
##                           gene_id         tx_name   gene_name
##                       <character>     <character> <character>
##   ENST00000432179 ENSG00000153094 ENST00000432179     BCL2L11
##   ENST00000308659 ENSG00000153094 ENST00000308659     BCL2L11
##   ENST00000357757 ENSG00000153094 ENST00000357757     BCL2L11
##   ENST00000393253 ENSG00000153094 ENST00000393253     BCL2L11
##   ENST00000337565 ENSG00000153094 ENST00000337565     BCL2L11
##               ...             ...             ...         ...
##   ENST00000452231 ENSG00000153094 ENST00000452231     BCL2L11
##   ENST00000361493 ENSG00000153094 ENST00000361493     BCL2L11
##   ENST00000431217 ENSG00000153094 ENST00000431217     BCL2L11
##   ENST00000439718 ENSG00000153094 ENST00000439718     BCL2L11
##   ENST00000438054 ENSG00000153094 ENST00000438054     BCL2L11
##   -------
##   seqinfo: 1 sequence from GRCh37 genome
## as this is a GRanges object we can access e.g. the start coordinates with
head(start(Tx))
## [1] 111876955 111878491 111878491 111878491 111878491 111878506
## or extract the biotype with
head(Tx$tx_biotype)
## [1] "protein_coding" "protein_coding" "protein_coding" "protein_coding"
## [5] "protein_coding" "protein_coding"

The parameter columns of the extractor methods (such as exons, genes or transcripts) allows to specify which database attributes (columns) should be retrieved. The exons method returns by default all exon-related columns, the transcripts all columns from the transcript database table and the genes all from the gene table. Note however that in the example above we got also a column gene_name although this column is not present in the transcript database table. By default the methods return also all columns that are used by any of the filters submitted with the filter argument (thus, because a GenenameFilter was used, the column gene_name is also returned). Setting returnFilterColumns(edb) <- FALSE disables this option and only the columns specified by the columns parameter are retrieved.

Instead of passing a filter object to the method it is also possible to provide a filter expression written as a formula. The formula has to be written in the form ~ <field> <condition> <value> with <field> being the field (database column) in the database, <condition> the condition for the filter object and <value> its value. Use the supportedFilter method to get the field names corresponding to each filter class.

## Use a filter expression to perform the filtering.
transcripts(edb, filter = ~ genename == "ZBTB16")
## GRanges object with 9 ranges and 7 metadata columns:
##                   seqnames                 ranges strand |           tx_id
##                      <Rle>              <IRanges>  <Rle> |     <character>
##   ENST00000335953       11 [113930315, 114121398]      + | ENST00000335953
##   ENST00000541602       11 [113930447, 114060486]      + | ENST00000541602
##   ENST00000544220       11 [113930459, 113934368]      + | ENST00000544220
##   ENST00000535700       11 [113930979, 113934466]      + | ENST00000535700
##   ENST00000392996       11 [113931229, 114121374]      + | ENST00000392996
##   ENST00000539918       11 [113935134, 114118066]      + | ENST00000539918
##   ENST00000545851       11 [114051488, 114118018]      + | ENST00000545851
##   ENST00000535379       11 [114107929, 114121279]      + | ENST00000535379
##   ENST00000535509       11 [114117512, 114121198]      + | ENST00000535509
##                                tx_biotype tx_cds_seq_start tx_cds_seq_end
##                               <character>        <integer>      <integer>
##   ENST00000335953          protein_coding        113934023      114121277
##   ENST00000541602         retained_intron             <NA>           <NA>
##   ENST00000544220          protein_coding        113934023      113934368
##   ENST00000535700          protein_coding        113934023      113934466
##   ENST00000392996          protein_coding        113934023      114121277
##   ENST00000539918 nonsense_mediated_decay        113935134      113992549
##   ENST00000545851    processed_transcript             <NA>           <NA>
##   ENST00000535379    processed_transcript             <NA>           <NA>
##   ENST00000535509         retained_intron             <NA>           <NA>
##                           gene_id         tx_name   gene_name
##                       <character>     <character> <character>
##   ENST00000335953 ENSG00000109906 ENST00000335953      ZBTB16
##   ENST00000541602 ENSG00000109906 ENST00000541602      ZBTB16
##   ENST00000544220 ENSG00000109906 ENST00000544220      ZBTB16
##   ENST00000535700 ENSG00000109906 ENST00000535700      ZBTB16
##   ENST00000392996 ENSG00000109906 ENST00000392996      ZBTB16
##   ENST00000539918 ENSG00000109906 ENST00000539918      ZBTB16
##   ENST00000545851 ENSG00000109906 ENST00000545851      ZBTB16
##   ENST00000535379 ENSG00000109906 ENST00000535379      ZBTB16
##   ENST00000535509 ENSG00000109906 ENST00000535509      ZBTB16
##   -------
##   seqinfo: 1 sequence from GRCh37 genome

Filter expression have to be written as a formula (i.e. starting with a ~) in the form column name followed by the logical condition.

Alternatively, EnsDb objects can be filtered directly using the filter function. In the example below we use the filter function to filter the EnsDb object and pass that filtered database to the transcripts method using the %>% from the magrittr package.

library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:AnnotationFilter':
## 
##     not
filter(edb, ~ symbol == "BCL2" & tx_biotype != "protein_coding") %>% transcripts
## GRanges object with 1 range and 6 metadata columns:
##                   seqnames               ranges strand |           tx_id
##                      <Rle>            <IRanges>  <Rle> |     <character>
##   ENST00000590515       18 [60795445, 60829102]      - | ENST00000590515
##                             tx_biotype tx_cds_seq_start tx_cds_seq_end
##                            <character>        <integer>      <integer>
##   ENST00000590515 processed_transcript             <NA>           <NA>
##                           gene_id         tx_name
##                       <character>     <character>
##   ENST00000590515 ENSG00000171791 ENST00000590515
##   -------
##   seqinfo: 1 sequence from GRCh37 genome

Adding a filter to an EnsDb enables this filter (globally) on all subsequent queries on that object. We could thus filter an EnsDb to (virtually) contain only features encoded on chromosome Y.

edb_y <- addFilter(edb, SeqNameFilter("Y"))

## All subsequent filters on that EnsDb will only work on features encoded on
## chromosome Y
genes(edb_y)
## GRanges object with 495 ranges and 6 metadata columns:
##                   seqnames               ranges strand |         gene_id
##                      <Rle>            <IRanges>  <Rle> |     <character>
##   ENSG00000251841        Y   [2652790, 2652894]      + | ENSG00000251841
##   ENSG00000184895        Y   [2654896, 2655740]      - | ENSG00000184895
##   ENSG00000237659        Y   [2657868, 2658369]      + | ENSG00000237659
##   ENSG00000232195        Y   [2696023, 2696259]      + | ENSG00000232195
##   ENSG00000129824        Y   [2709527, 2800041]      + | ENSG00000129824
##               ...      ...                  ...    ... .             ...
##   ENSG00000224240        Y [28695572, 28695890]      + | ENSG00000224240
##   ENSG00000227629        Y [28732789, 28737748]      - | ENSG00000227629
##   ENSG00000237917        Y [28740998, 28780799]      - | ENSG00000237917
##   ENSG00000231514        Y [28772667, 28773306]      - | ENSG00000231514
##   ENSG00000235857        Y [59001391, 59001635]      + | ENSG00000235857
##                     gene_name   gene_biotype seq_coord_system      symbol
##                   <character>    <character>      <character> <character>
##   ENSG00000251841  RNU6-1334P          snRNA       chromosome  RNU6-1334P
##   ENSG00000184895         SRY protein_coding       chromosome         SRY
##   ENSG00000237659  RNASEH2CP1     pseudogene       chromosome  RNASEH2CP1
##   ENSG00000232195    TOMM22P2     pseudogene       chromosome    TOMM22P2
##   ENSG00000129824      RPS4Y1 protein_coding       chromosome      RPS4Y1
##               ...         ...            ...              ...         ...
##   ENSG00000224240     CYCSP49     pseudogene       chromosome     CYCSP49
##   ENSG00000227629  SLC25A15P1     pseudogene       chromosome  SLC25A15P1
##   ENSG00000237917     PARP4P1     pseudogene       chromosome     PARP4P1
##   ENSG00000231514     FAM58CP     pseudogene       chromosome     FAM58CP
##   ENSG00000235857     CTBP2P1     pseudogene       chromosome     CTBP2P1
##                   entrezid
##                     <list>
##   ENSG00000251841       NA
##   ENSG00000184895     6736
##   ENSG00000237659       NA
##   ENSG00000232195       NA
##   ENSG00000129824     6192
##               ...      ...
##   ENSG00000224240       NA
##   ENSG00000227629       NA
##   ENSG00000237917       NA
##   ENSG00000231514       NA
##   ENSG00000235857       NA
##   -------
##   seqinfo: 1 sequence from GRCh37 genome
## Get all lincRNAs on chromosome Y
genes(edb_y, filter = ~ gene_biotype == "lincRNA")
## GRanges object with 48 ranges and 6 metadata columns:
##                   seqnames               ranges strand |         gene_id
##                      <Rle>            <IRanges>  <Rle> |     <character>
##   ENSG00000231535        Y   [2870953, 2970313]      + | ENSG00000231535
##   ENSG00000229308        Y   [3904538, 3968361]      + | ENSG00000229308
##   ENSG00000237069        Y   [6110487, 6111651]      - | ENSG00000237069
##   ENSG00000229643        Y   [6225260, 6229454]      - | ENSG00000229643
##   ENSG00000129816        Y   [6258472, 6279605]      + | ENSG00000129816
##               ...      ...                  ...    ... .             ...
##   ENSG00000228296        Y [27209230, 27246039]      - | ENSG00000228296
##   ENSG00000223641        Y [27329790, 27330920]      - | ENSG00000223641
##   ENSG00000228786        Y [27524447, 27540866]      - | ENSG00000228786
##   ENSG00000240450        Y [27629055, 27632852]      + | ENSG00000240450
##   ENSG00000231141        Y [27874637, 27879535]      + | ENSG00000231141
##                      gene_name gene_biotype seq_coord_system       symbol
##                    <character>  <character>      <character>  <character>
##   ENSG00000231535    LINC00278      lincRNA       chromosome    LINC00278
##   ENSG00000229308   AC010084.1      lincRNA       chromosome   AC010084.1
##   ENSG00000237069      TTTY23B      lincRNA       chromosome      TTTY23B
##   ENSG00000229643    LINC00280      lincRNA       chromosome    LINC00280
##   ENSG00000129816       TTTY1B      lincRNA       chromosome       TTTY1B
##               ...          ...          ...              ...          ...
##   ENSG00000228296       TTTY4C      lincRNA       chromosome       TTTY4C
##   ENSG00000223641      TTTY17C      lincRNA       chromosome      TTTY17C
##   ENSG00000228786 LINC00266-4P      lincRNA       chromosome LINC00266-4P
##   ENSG00000240450     CSPG4P1Y      lincRNA       chromosome     CSPG4P1Y
##   ENSG00000231141        TTTY3      lincRNA       chromosome        TTTY3
##                               entrezid
##                                 <list>
##   ENSG00000231535            100873962
##   ENSG00000229308                   NA
##   ENSG00000237069     100101121,252955
##   ENSG00000229643                   NA
##   ENSG00000129816      100101116,50858
##               ...                  ...
##   ENSG00000228296 474150,474149,114761
##   ENSG00000223641 474152,474151,252949
##   ENSG00000228786                   NA
##   ENSG00000240450               114758
##   ENSG00000231141        474148,114760
##   -------
##   seqinfo: 1 sequence from GRCh37 genome

To get an overview of database tables and available columns the function listTables can be used. The method listColumns on the other hand lists columns for the specified database table.

## list all database tables along with their columns
listTables(edb)
## $gene
## [1] "gene_id"          "gene_name"        "gene_biotype"    
## [4] "gene_seq_start"   "gene_seq_end"     "seq_name"        
## [7] "seq_strand"       "seq_coord_system" "symbol"          
## 
## $tx
## [1] "tx_id"            "tx_biotype"       "tx_seq_start"    
## [4] "tx_seq_end"       "tx_cds_seq_start" "tx_cds_seq_end"  
## [7] "gene_id"          "tx_name"         
## 
## $tx2exon
## [1] "tx_id"    "exon_id"  "exon_idx"
## 
## $exon
## [1] "exon_id"        "exon_seq_start" "exon_seq_end"  
## 
## $chromosome
## [1] "seq_name"    "seq_length"  "is_circular"
## 
## $protein
## [1] "tx_id"            "protein_id"       "protein_sequence"
## 
## $uniprot
## [1] "protein_id"           "uniprot_id"           "uniprot_db"          
## [4] "uniprot_mapping_type"
## 
## $protein_domain
## [1] "protein_id"            "protein_domain_id"     "protein_domain_source"
## [4] "interpro_accession"    "prot_dom_start"        "prot_dom_end"         
## 
## $entrezgene
## [1] "gene_id"  "entrezid"
## 
## $metadata
## [1] "name"  "value"
## list columns from a specific table
listColumns(edb, "tx")
## [1] "tx_id"            "tx_biotype"       "tx_seq_start"    
## [4] "tx_seq_end"       "tx_cds_seq_start" "tx_cds_seq_end"  
## [7] "gene_id"          "tx_name"

Thus, we could retrieve all transcripts of the biotype nonsense_mediated_decay (which, according to the definitions by Ensembl are transcribed, but most likely not translated in a protein, but rather degraded after transcription) along with the name of the gene for each transcript. Note that we are changing here the return.type to DataFrame, so the method will return a DataFrame with the results instead of the default GRanges.

Tx <- transcripts(edb,
                  columns = c(listColumns(edb , "tx"), "gene_name"),
                  filter = TxBiotypeFilter("nonsense_mediated_decay"),
                  return.type = "DataFrame")
nrow(Tx)
## [1] 13812
Tx
## DataFrame with 13812 rows and 9 columns
##                 tx_id              tx_biotype tx_seq_start tx_seq_end
##           <character>             <character>    <integer>  <integer>
## 1     ENST00000495251 nonsense_mediated_decay        64085      69409
## 2     ENST00000462860 nonsense_mediated_decay        64085      69452
## 3     ENST00000483390 nonsense_mediated_decay        65739      68764
## 4     ENST00000538848 nonsense_mediated_decay        66411      68843
## 5     ENST00000567466 nonsense_mediated_decay        97578      99521
## ...               ...                     ...          ...        ...
## 13808 ENST00000496411 nonsense_mediated_decay    249149927  249153217
## 13809 ENST00000483223 nonsense_mediated_decay    249150714  249152728
## 13810 ENST00000533647 nonsense_mediated_decay    249151472  249152523
## 13811 ENST00000528141 nonsense_mediated_decay    249151590  249153284
## 13812 ENST00000530986 nonsense_mediated_decay    249151668  249153284
##       tx_cds_seq_start tx_cds_seq_end         gene_id         tx_name
##              <integer>      <integer>     <character>     <character>
## 1                68052          68789 ENSG00000234769 ENST00000495251
## 2                68052          68789 ENSG00000234769 ENST00000462860
## 3                66428          68764 ENSG00000234769 ENST00000483390
## 4                67418          68789 ENSG00000234769 ENST00000538848
## 5                98546          98893 ENSG00000261456 ENST00000567466
## ...                ...            ...             ...             ...
## 13808        249152153      249152508 ENSG00000171163 ENST00000496411
## 13809        249152153      249152508 ENSG00000171163 ENST00000483223
## 13810        249152153      249152508 ENSG00000171163 ENST00000533647
## 13811        249152203      249152508 ENSG00000171163 ENST00000528141
## 13812        249152203      249152508 ENSG00000171163 ENST00000530986
##         gene_name
##       <character>
## 1          WASH4P
## 2          WASH4P
## 3          WASH4P
## 4          WASH4P
## 5           TUBB8
## ...           ...
## 13808      ZNF692
## 13809      ZNF692
## 13810      ZNF692
## 13811      ZNF692
## 13812      ZNF692

For protein coding transcripts, we can also specifically extract their coding region. In the example below we extract the CDS for all transcripts encoded on chromosome Y.

yCds <- cdsBy(edb, filter = SeqNameFilter("Y"))
yCds
## GRangesList object of length 160:
## $ENST00000155093 
## GRanges object with 7 ranges and 3 metadata columns:
##       seqnames             ranges strand |    seq_name         exon_id
##          <Rle>          <IRanges>  <Rle> | <character>     <character>
##   [1]        Y [2821978, 2822038]      + |           Y ENSE00002223884
##   [2]        Y [2829115, 2829687]      + |           Y ENSE00003645989
##   [3]        Y [2843136, 2843285]      + |           Y ENSE00003548678
##   [4]        Y [2843552, 2843695]      + |           Y ENSE00003611496
##   [5]        Y [2844711, 2844863]      + |           Y ENSE00001649504
##   [6]        Y [2845981, 2846121]      + |           Y ENSE00001777381
##   [7]        Y [2846851, 2848034]      + |           Y ENSE00001368923
##       exon_rank
##       <integer>
##   [1]         2
##   [2]         3
##   [3]         4
##   [4]         5
##   [5]         6
##   [6]         7
##   [7]         8
## 
## $ENST00000215473 
## GRanges object with 6 ranges and 3 metadata columns:
##       seqnames             ranges strand | seq_name         exon_id
##   [1]        Y [4924865, 4925500]      + |        Y ENSE00001436852
##   [2]        Y [4966256, 4968748]      + |        Y ENSE00001640924
##   [3]        Y [5369098, 5369296]      + |        Y ENSE00001803775
##   [4]        Y [5483308, 5483316]      + |        Y ENSE00001731866
##   [5]        Y [5491131, 5491145]      + |        Y ENSE00001711324
##   [6]        Y [5605313, 5605983]      + |        Y ENSE00001779807
##       exon_rank
##   [1]         1
##   [2]         2
##   [3]         3
##   [4]         4
##   [5]         5
##   [6]         6
## 
## $ENST00000215479 
## GRanges object with 5 ranges and 3 metadata columns:
##       seqnames             ranges strand | seq_name         exon_id
##   [1]        Y [6740596, 6740649]      - |        Y ENSE00001671586
##   [2]        Y [6738047, 6738094]      - |        Y ENSE00001645681
##   [3]        Y [6736773, 6736817]      - |        Y ENSE00000652250
##   [4]        Y [6736078, 6736503]      - |        Y ENSE00001667251
##   [5]        Y [6734114, 6734119]      - |        Y ENSE00001494454
##       exon_rank
##   [1]         2
##   [2]         3
##   [3]         4
##   [4]         5
##   [5]         6
## 
## ...
## <157 more elements>
## -------
## seqinfo: 1 sequence from GRCh37 genome

Using a GRangesFilter we can retrieve all features from the database that are either within or overlapping the specified genomic region. In the example below we query all genes that are partially overlapping with a small region on chromosome 11. The filter restricts to all genes for which either an exon or an intron is partially overlapping with the region.

## Define the filter
grf <- GRangesFilter(GRanges("11", ranges = IRanges(114000000, 114000050),
                             strand = "+"), type = "any")

## Query genes:
gn <- genes(edb, filter = grf)
gn
## GRanges object with 1 range and 6 metadata columns:
##                   seqnames                 ranges strand |         gene_id
##                      <Rle>              <IRanges>  <Rle> |     <character>
##   ENSG00000109906       11 [113930315, 114121398]      + | ENSG00000109906
##                     gene_name   gene_biotype seq_coord_system      symbol
##                   <character>    <character>      <character> <character>
##   ENSG00000109906      ZBTB16 protein_coding       chromosome      ZBTB16
##                   entrezid
##                     <list>
##   ENSG00000109906     7704
##   -------
##   seqinfo: 1 sequence from GRCh37 genome
## Next we retrieve all transcripts for that gene so that we can plot them.
txs <- transcripts(edb, filter = GenenameFilter(gn$gene_name))
plot(3, 3, pch = NA, xlim = c(start(gn), end(gn)), ylim = c(0, length(txs)),
     yaxt = "n", ylab = "")
## Highlight the GRangesFilter region
rect(xleft = start(grf), xright = end(grf), ybottom = 0, ytop = length(txs),
     col = "red", border = "red")
for(i in 1:length(txs)) {
    current <- txs[i]
    rect(xleft = start(current), xright = end(current), ybottom = i-0.975,
         ytop = i-0.125, border = "grey")
    text(start(current), y = i-0.5, pos = 4, cex = 0.75, labels = current$tx_id)
}

As we can see, 4 transcripts of the gene ZBTB16 are also overlapping the region. Below we fetch these 4 transcripts. Note, that a call to exons will not return any features from the database, as no exon is overlapping with the region.

transcripts(edb, filter = grf)
## GRanges object with 4 ranges and 6 metadata columns:
##                   seqnames                 ranges strand |           tx_id
##                      <Rle>              <IRanges>  <Rle> |     <character>
##   ENST00000335953       11 [113930315, 114121398]      + | ENST00000335953
##   ENST00000541602       11 [113930447, 114060486]      + | ENST00000541602
##   ENST00000392996       11 [113931229, 114121374]      + | ENST00000392996
##   ENST00000539918       11 [113935134, 114118066]      + | ENST00000539918
##                                tx_biotype tx_cds_seq_start tx_cds_seq_end
##                               <character>        <integer>      <integer>
##   ENST00000335953          protein_coding        113934023      114121277
##   ENST00000541602         retained_intron             <NA>           <NA>
##   ENST00000392996          protein_coding        113934023      114121277
##   ENST00000539918 nonsense_mediated_decay        113935134      113992549
##                           gene_id         tx_name
##                       <character>     <character>
##   ENST00000335953 ENSG00000109906 ENST00000335953
##   ENST00000541602 ENSG00000109906 ENST00000541602
##   ENST00000392996 ENSG00000109906 ENST00000392996
##   ENST00000539918 ENSG00000109906 ENST00000539918
##   -------
##   seqinfo: 1 sequence from GRCh37 genome

The GRangesFilter supports also GRanges defining multiple regions and a query will return all features overlapping any of these regions. Besides using the GRangesFilter it is also possible to search for transcripts or exons overlapping genomic regions using the exonsByOverlaps or transcriptsByOverlaps known from the GenomicFeatures package. Note that the implementation of these methods for EnsDb objects supports also to use filters to further fine-tune the query.

The functions listGenebiotypes and listTxbiotypes can be used to get an overview of allowed/available gene and transcript biotype

## Get all gene biotypes from the database. The GeneBiotypeFilter
## allows to filter on these values.
listGenebiotypes(edb)
##  [1] "protein_coding"           "pseudogene"              
##  [3] "processed_transcript"     "antisense"               
##  [5] "lincRNA"                  "polymorphic_pseudogene"  
##  [7] "IG_V_pseudogene"          "IG_V_gene"               
##  [9] "sense_overlapping"        "sense_intronic"          
## [11] "TR_V_gene"                "misc_RNA"                
## [13] "snRNA"                    "miRNA"                   
## [15] "snoRNA"                   "rRNA"                    
## [17] "Mt_tRNA"                  "Mt_rRNA"                 
## [19] "IG_C_gene"                "IG_J_gene"               
## [21] "TR_J_gene"                "TR_C_gene"               
## [23] "TR_V_pseudogene"          "TR_J_pseudogene"         
## [25] "IG_D_gene"                "IG_C_pseudogene"         
## [27] "TR_D_gene"                "IG_J_pseudogene"         
## [29] "3prime_overlapping_ncrna" "processed_pseudogene"    
## [31] "LRG_gene"
## Get all transcript biotypes from the database.
listTxbiotypes(edb)
##  [1] "protein_coding"                    
##  [2] "processed_transcript"              
##  [3] "retained_intron"                   
##  [4] "nonsense_mediated_decay"           
##  [5] "unitary_pseudogene"                
##  [6] "non_stop_decay"                    
##  [7] "unprocessed_pseudogene"            
##  [8] "processed_pseudogene"              
##  [9] "transcribed_unprocessed_pseudogene"
## [10] "antisense"                         
## [11] "lincRNA"                           
## [12] "polymorphic_pseudogene"            
## [13] "transcribed_processed_pseudogene"  
## [14] "miRNA"                             
## [15] "pseudogene"                        
## [16] "IG_V_pseudogene"                   
## [17] "snoRNA"                            
## [18] "IG_V_gene"                         
## [19] "sense_overlapping"                 
## [20] "sense_intronic"                    
## [21] "TR_V_gene"                         
## [22] "snRNA"                             
## [23] "misc_RNA"                          
## [24] "rRNA"                              
## [25] "Mt_tRNA"                           
## [26] "Mt_rRNA"                           
## [27] "IG_C_gene"                         
## [28] "IG_J_gene"                         
## [29] "TR_J_gene"                         
## [30] "TR_C_gene"                         
## [31] "TR_V_pseudogene"                   
## [32] "TR_J_pseudogene"                   
## [33] "IG_D_gene"                         
## [34] "IG_C_pseudogene"                   
## [35] "TR_D_gene"                         
## [36] "IG_J_pseudogene"                   
## [37] "3prime_overlapping_ncrna"          
## [38] "translated_processed_pseudogene"   
## [39] "LRG_gene"

Data can be fetched in an analogous way using the exons and genes methods. In the example below we retrieve gene_name, entrezid and the gene_biotype of all genes in the database which names start with “BCL2”.

## We're going to fetch all genes which names start with BCL. To this end
## we define a GenenameFilter with partial matching, i.e. condition "like"
## and a % for any character/string.
BCLs <- genes(edb,
          columns = c("gene_name", "entrezid", "gene_biotype"),
          filter = GenenameFilter("BCL", condition = "startsWith"),
          return.type = "DataFrame")
nrow(BCLs)
## [1] 25
BCLs
## DataFrame with 25 rows and 4 columns
##       gene_name entrezid   gene_biotype         gene_id
##     <character>   <list>    <character>     <character>
## 1         BCL10     8915 protein_coding ENSG00000142867
## 2        BCL11A    53335 protein_coding ENSG00000119866
## 3        BCL11B    64919 protein_coding ENSG00000127152
## 4          BCL2      596 protein_coding ENSG00000171791
## 5        BCL2A1      597 protein_coding ENSG00000140379
## ...         ...      ...            ...             ...
## 21        BCL7C     9274 protein_coding ENSG00000099385
## 22         BCL9      607 protein_coding ENSG00000116128
## 23         BCL9      607 protein_coding ENSG00000266095
## 24        BCL9L   283149 protein_coding ENSG00000186174
## 25       BCLAF1     9774 protein_coding ENSG00000029363

Sometimes it might be useful to know the length of genes or transcripts (i.e. the total sum of nucleotides covered by their exons). Below we calculate the mean length of transcripts from protein coding genes on chromosomes X and Y as well as the average length of snoRNA, snRNA and rRNA transcripts encoded on these chromosomes. For the first query we combine two AnnotationFilter objects using an AnnotationFilterList object, in the second we define the query using a filter expression.

## determine the average length of snRNA, snoRNA and rRNA genes encoded on
## chromosomes X and Y.
mean(lengthOf(edb, of = "tx", filter = AnnotationFilterList(
                                  GeneBiotypeFilter(c("snRNA", "snoRNA", "rRNA")),
                                  SeqNameFilter(c("X", "Y")))))
## [1] 116.3046
## determine the average length of protein coding genes encoded on the same
## chromosomes.
mean(lengthOf(edb, of = "tx", filter = ~ gene_biotype == "protein_coding" &
                                  seq_name %in% c("X", "Y")))
## [1] 1920

Not unexpectedly, transcripts of protein coding genes are longer than those of snRNA, snoRNA or rRNA genes.

At last we extract the first two exons of each transcript model from the database.

## Extract all exons 1 and (if present) 2 for all genes encoded on the
## Y chromosome
exons(edb, columns = c("tx_id", "exon_idx"),
      filter = list(SeqNameFilter("Y"),
                    ExonRankFilter(3, condition = "<")))
## GRanges object with 1287 ranges and 3 metadata columns:
##                   seqnames               ranges strand |           tx_id
##                      <Rle>            <IRanges>  <Rle> |     <character>
##   ENSE00002088309        Y   [2652790, 2652894]      + | ENST00000516032
##   ENSE00001494622        Y   [2654896, 2655740]      - | ENST00000383070
##   ENSE00002323146        Y   [2655049, 2655069]      - | ENST00000525526
##   ENSE00002201849        Y   [2655075, 2655644]      - | ENST00000525526
##   ENSE00002214525        Y   [2655145, 2655168]      - | ENST00000534739
##               ...      ...                  ...    ... .             ...
##   ENSE00001632993        Y [28737695, 28737748]      - | ENST00000456738
##   ENSE00001616687        Y [28772667, 28773306]      - | ENST00000435741
##   ENSE00001638296        Y [28779492, 28779578]      - | ENST00000435945
##   ENSE00001797328        Y [28780670, 28780799]      - | ENST00000435945
##   ENSE00001794473        Y [59001391, 59001635]      + | ENST00000431853
##                    exon_idx         exon_id
##                   <integer>     <character>
##   ENSE00002088309         1 ENSE00002088309
##   ENSE00001494622         1 ENSE00001494622
##   ENSE00002323146         2 ENSE00002323146
##   ENSE00002201849         1 ENSE00002201849
##   ENSE00002214525         2 ENSE00002214525
##               ...       ...             ...
##   ENSE00001632993         1 ENSE00001632993
##   ENSE00001616687         1 ENSE00001616687
##   ENSE00001638296         2 ENSE00001638296
##   ENSE00001797328         1 ENSE00001797328
##   ENSE00001794473         1 ENSE00001794473
##   -------
##   seqinfo: 1 sequence from GRCh37 genome

3 Extracting gene/transcript/exon models for RNASeq feature counting

For the feature counting step of an RNAseq experiment, the gene or transcript models (defined by the chromosomal start and end positions of their exons) have to be known. To extract these from an Ensembl based annotation package, the exonsBy, genesBy and transcriptsBy methods can be used in an analogous way as in TxDb packages generated by the GenomicFeatures package. However, the transcriptsBy method does not, in contrast to the method in the GenomicFeatures package, allow to return transcripts by “cds”. While the annotation packages built by the ensembldb contain the chromosomal start and end coordinates of the coding region (for protein coding genes) they do not assign an ID to each CDS.

A simple use case is to retrieve all genes encoded on chromosomes X and Y from the database.

TxByGns <- transcriptsBy(edb, by = "gene", filter = SeqNameFilter(c("X", "Y")))
TxByGns
## GRangesList object of length 2908:
## $ENSG00000000003 
## GRanges object with 3 ranges and 6 metadata columns:
##       seqnames               ranges strand |           tx_id
##          <Rle>            <IRanges>  <Rle> |     <character>
##   [1]        X [99888439, 99894988]      - | ENST00000494424
##   [2]        X [99883667, 99891803]      - | ENST00000373020
##   [3]        X [99887538, 99891686]      - | ENST00000496771
##                 tx_biotype tx_cds_seq_start tx_cds_seq_end         gene_id
##                <character>        <integer>      <integer>     <character>
##   [1] processed_transcript             <NA>           <NA> ENSG00000000003
##   [2]       protein_coding         99885795       99891691 ENSG00000000003
##   [3] processed_transcript             <NA>           <NA> ENSG00000000003
##               tx_name
##           <character>
##   [1] ENST00000494424
##   [2] ENST00000373020
##   [3] ENST00000496771
## 
## $ENSG00000000005 
## GRanges object with 2 ranges and 6 metadata columns:
##       seqnames               ranges strand |           tx_id
##   [1]        X [99839799, 99854882]      + | ENST00000373031
##   [2]        X [99848621, 99852528]      + | ENST00000485971
##                 tx_biotype tx_cds_seq_start tx_cds_seq_end         gene_id
##   [1]       protein_coding         99840016       99854714 ENSG00000000005
##   [2] processed_transcript             <NA>           <NA> ENSG00000000005
##               tx_name
##   [1] ENST00000373031
##   [2] ENST00000485971
## 
## $ENSG00000001497 
## GRanges object with 6 ranges and 6 metadata columns:
##       seqnames               ranges strand |           tx_id
##   [1]        X [64732463, 64754655]      - | ENST00000484069
##   [2]        X [64732462, 64754636]      - | ENST00000374811
##   [3]        X [64732463, 64754636]      - | ENST00000374804
##   [4]        X [64732463, 64754636]      - | ENST00000312391
##   [5]        X [64732462, 64754634]      - | ENST00000374807
##   [6]        X [64740309, 64743497]      - | ENST00000469091
##                    tx_biotype tx_cds_seq_start tx_cds_seq_end
##   [1] nonsense_mediated_decay         64744901       64754595
##   [2]          protein_coding         64732655       64754595
##   [3]          protein_coding         64732655       64754595
##   [4]          protein_coding         64744901       64754595
##   [5]          protein_coding         64732655       64754595
##   [6]          protein_coding         64740535       64743497
##               gene_id         tx_name
##   [1] ENSG00000001497 ENST00000484069
##   [2] ENSG00000001497 ENST00000374811
##   [3] ENSG00000001497 ENST00000374804
##   [4] ENSG00000001497 ENST00000312391
##   [5] ENSG00000001497 ENST00000374807
##   [6] ENSG00000001497 ENST00000469091
## 
## ...
## <2905 more elements>
## -------
## seqinfo: 2 sequences from GRCh37 genome

Since Ensembl contains also definitions of genes that are on chromosome variants (supercontigs), it is advisable to specify the chromosome names for which the gene models should be returned.

In a real use case, we might thus want to retrieve all genes encoded on the standard chromosomes. In addition it is advisable to use a GeneIdFilter to restrict to Ensembl genes only, as also LRG (Locus Reference Genomic) genes2 are defined in the database, which are partially redundant with Ensembl genes.

## will just get exons for all genes on chromosomes 1 to 22, X and Y.
## Note: want to get rid of the "LRG" genes!!!
EnsGenes <- exonsBy(edb, by = "gene", filter = AnnotationFilterList(
                                          SeqNameFilter(c(1:22, "X", "Y")),
                                          GeneIdFilter("ENSG", "startsWith")))

The code above returns a GRangesList that can be used directly as an input for the summarizeOverlaps function from the GenomicAlignments package 3.

Alternatively, the above GRangesList can be transformed to a data.frame in SAF format that can be used as an input to the featureCounts function of the Rsubread package 4.

## Transforming the GRangesList into a data.frame in SAF format
EnsGenes.SAF <- toSAF(EnsGenes)

Note that the ID by which the GRangesList is split is used in the SAF formatted data.frame as the GeneID. In the example below this would be the Ensembl gene IDs, while the start, end coordinates (along with the strand and chromosomes) are those of the the exons.

In addition, the disjointExons function (similar to the one defined in GenomicFeatures) can be used to generate a GRanges of non-overlapping exon parts which can be used in the DEXSeq package.

## Create a GRanges of non-overlapping exon parts.
DJE <- disjointExons(edb, filter = AnnotationFilterList(
                  SeqNameFilter(c(1:22, "X", "Y")),
                  GeneIdFilter("ENSG%", "startsWith")))

4 Retrieving sequences for gene/transcript/exon models

The methods to retrieve exons, transcripts and genes (i.e. exons, transcripts and genes) return by default GRanges objects that can be used to retrieve sequences using the getSeq method e.g. from BSgenome packages. The basic workflow is thus identical to the one for TxDb packages, however, it is not straight forward to identify the BSgenome package with the matching genomic sequence. Most BSgenome packages are named according to the genome build identifier used in UCSC which does not (always) match the genome build name used by Ensembl. Using the Ensembl version provided by the EnsDb, the correct genomic sequence can however be retrieved easily from the AnnotationHub using the getGenomeFaFile. If no Fasta file matching the Ensembl version is available, the function tries to identify a Fasta file with the correct genome build from the closest Ensembl release and returns that instead.

In the code block below we retrieve first the FaFile with the genomic DNA sequence, extract the genomic start and end coordinates for all genes defined in the package, subset to genes encoded on sequences available in the FaFile and extract all of their sequences. Note: these sequences represent the sequence between the chromosomal start and end coordinates of the gene.

library(EnsDb.Hsapiens.v75)
library(Rsamtools)
edb <- EnsDb.Hsapiens.v75

## Get the FaFile with the genomic sequence matching the Ensembl version
## using the AnnotationHub package.
Dna <- getGenomeFaFile(edb)

## Get start/end coordinates of all genes.
genes <- genes(edb)
## Subset to all genes that are encoded on chromosomes for which
## we do have DNA sequence available.
genes <- genes[seqnames(genes) %in% seqnames(seqinfo(Dna))]

## Get the gene sequences, i.e. the sequence including the sequence of
## all of the gene's exons and introns.
geneSeqs <- getSeq(Dna, genes)

To retrieve the (exonic) sequence of transcripts (i.e. without introns) we can use directly the extractTranscriptSeqs method defined in the GenomicFeatures on the EnsDb object, eventually using a filter to restrict the query.

## get all exons of all transcripts encoded on chromosome Y
yTx <- exonsBy(edb, filter = SeqNameFilter("Y"))

## Retrieve the sequences for these transcripts from the FaFile.
library(GenomicFeatures)
yTxSeqs <- extractTranscriptSeqs(Dna, yTx)
yTxSeqs

## Extract the sequences of all transcripts encoded on chromosome Y.
yTx <- extractTranscriptSeqs(Dna, edb, filter = SeqNameFilter("Y"))

## Along these lines, we could use the method also to retrieve the coding sequence
## of all transcripts on the Y chromosome.
cdsY <- cdsBy(edb, filter = SeqNameFilter("Y"))
extractTranscriptSeqs(Dna, cdsY)

Note: in the next section we describe how transcript sequences can be retrieved from a BSgenome package that is based on UCSC, not Ensembl.

5 Integrating annotations from Ensembl based EnsDb packages with UCSC based annotations

Sometimes it might be useful to combine (Ensembl based) annotations from EnsDb packages/objects with annotations from other Bioconductor packages, that might base on UCSC annotations. To support such an integration of annotations, the ensembldb packages implements the seqlevelsStyle and seqlevelsStyle<- from the GenomeInfoDb package that allow to change the style of chromosome naming. Thus, sequence/chromosome names other than those used by Ensembl can be used in, and are returned by, the queries to EnsDb objects as long as a mapping for them is provided by the GenomeInfoDb package (which provides a mapping mostly between UCSC, NCBI and Ensembl chromosome names for the main chromosomes).

In the example below we change the seqnames style to UCSC.

## Change the seqlevels style form Ensembl (default) to UCSC:
seqlevelsStyle(edb) <- "UCSC"

## Now we can use UCSC style seqnames in SeqNameFilters or GRangesFilter:
genesY <- genes(edb, filter = ~ seq_name == "chrY")
## The seqlevels of the returned GRanges are also in UCSC style
seqlevels(genesY)
## [1] "chrY"

Note that in most instances no mapping is available for sequences not corresponding to the main chromosomes (i.e. contigs, patched chromosomes etc). What is returned in cases in which no mapping is available can be specified with the global ensembldb.seqnameNotFound option. By default (with ensembldb.seqnameNotFound set to “ORIGINAL”), the original seqnames (i.e. the ones from Ensembl) are returned. With ensembldb.seqnameNotFound “MISSING” each time a seqname can not be found an error is thrown. For all other cases (e.g. ensembldb.seqnameNotFound = NA) the value of the option is returned.

seqlevelsStyle(edb) <- "UCSC"

## Getting the default option:
getOption("ensembldb.seqnameNotFound")
## [1] "ORIGINAL"
## Listing all seqlevels in the database.
seqlevels(edb)[1:30]
## Warning in .formatSeqnameByStyleFromQuery(x, sn, ifNotFound): More than 5
## seqnames with seqlevels style of the database (Ensembl) could not be mapped
## to the seqlevels style: UCSC!) Returning the orginal seqnames for these.
##  [1] "chr1"       "chr10"      "chr11"      "chr12"      "chr13"     
##  [6] "chr14"      "chr15"      "chr16"      "chr17"      "chr18"     
## [11] "chr19"      "chr2"       "chr20"      "chr21"      "chr22"     
## [16] "chr3"       "chr4"       "chr5"       "chr6"       "chr7"      
## [21] "chr8"       "chr9"       "GL000191.1" "GL000192.1" "GL000193.1"
## [26] "GL000194.1" "GL000195.1" "GL000196.1" "GL000199.1" "GL000201.1"
## Setting the option to NA, thus, for each seqname for which no mapping is available,
## NA is returned.
options(ensembldb.seqnameNotFound=NA)
seqlevels(edb)[1:30]
## Warning in .formatSeqnameByStyleFromQuery(x, sn, ifNotFound): More than 5
## seqnames with seqlevels style of the database (Ensembl) could not be mapped
## to the seqlevels style: UCSC!) Returning NA for these.
##  [1] "chr1"  "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17"
## [10] "chr18" "chr19" "chr2"  "chr20" "chr21" "chr22" "chr3"  "chr4"  "chr5" 
## [19] "chr6"  "chr7"  "chr8"  "chr9"  NA      NA      NA      NA      NA     
## [28] NA      NA      NA
## Resetting the option.
options(ensembldb.seqnameNotFound = "ORIGINAL")

Next we retrieve transcript sequences from genes encoded on chromosome Y using the BSGenome package for the human genome from UCSC. The specified version hg19 matches the genome build of Ensembl version 75, i.e. GRCh37. Note that while we changed the style of the seqnames to UCSC we did not change the naming of the genome release.

library(BSgenome.Hsapiens.UCSC.hg19)
bsg <- BSgenome.Hsapiens.UCSC.hg19

## Get the genome version
unique(genome(bsg))
## [1] "hg19"
unique(genome(edb))
## [1] "GRCh37"
## Although differently named, both represent genome build GRCh37.

## Extract the full transcript sequences.
yTxSeqs <- extractTranscriptSeqs(bsg, exonsBy(edb, "tx",
                          filter = SeqNameFilter("chrY")))

yTxSeqs
##   A DNAStringSet instance of length 731
##       width seq                                          names               
##   [1]  5239 GCCTAGTGCGCGCGCAGTAAC...TAAATGTTTACTTGTATATG ENST00000155093
##   [2]  4023 ATGTTTAGGGTTGGCTTCTTA...TGGAAACACATCCCTTGTAA ENST00000215473
##   [3]   802 AGAGGACCAAGCCTCCCTGTG...ATAAAATGTTTTAAAAATCA ENST00000215479
##   [4]   910 TGTCTGTCAGAGCTGTCAGCC...AACACTGGTATATTTCTGTT ENST00000250776
##   [5]  1305 TTCCAGGATATGAACTCTACA...AATCCTGTGGCTGTAGGAAA ENST00000250784
##   ...   ... ...
## [727]   333 ATGGATGAAGAAGAGAAAACC...GTGAACTTTCTAGATTGCAT ENST00000604924
## [728]  1247 CATGGCGGGGTTCCTGCCTTC...CTTTGGAGTAATGTCTTAGT ENST00000605584
## [729]   199 CAGTTCTCGCTCCTGTGCAGC...TGGTCTGGGTGGCTTCTGGA ENST00000605663
## [730]   276 GCCCCAGGAGGAAAGGGGGAC...AAATAAAGAACAGCGCATTC ENST00000606439
## [731]   444 ATGGGAGCCACTGGGCTTGGC...ACGTTCATGAAGAAGACTAA ENST00000607210
## Extract just the CDS
Test <- cdsBy(edb, "tx", filter = SeqNameFilter("chrY"))
yTxCds <- extractTranscriptSeqs(bsg, cdsBy(edb, "tx",
                                           filter = SeqNameFilter("chrY")))
yTxCds
##   A DNAStringSet instance of length 160
##       width seq                                          names               
##   [1]  2406 ATGGATGAAGATGAATTTGAA...AAGAAGTTGGTCTGCCCTAA ENST00000155093
##   [2]  4023 ATGTTTAGGGTTGGCTTCTTA...TGGAAACACATCCCTTGTAA ENST00000215473
##   [3]   579 ATGGGGACCTGGATTTTGTTT...AGCAGGAGGAAGTGGATTAA ENST00000215479
##   [4]   792 ATGGCCCGGGGCCCCAAGAAG...CCAAACAGAGCAGTGGCTAA ENST00000250784
##   [5]   378 ATGAGTCCAAAGCCGAGAGCC...CTACTCCCCTATCTCCCTGA ENST00000250823
##   ...   ... ...
## [156]    63 CGCAAGGATTTAAAAGAGATG...CACCCTGTTGGCCAGGCTAG ENST00000601700
## [157]    42 CTTGATACAAAGAATCAATTTAATTTTAAGATTGTCTATCTT   ENST00000601705
## [158]    33 ATGATGACGCTTGTCCCCAGAGCCAGGACACGT            ENST00000602680
## [159]    33 ATGATGACGCTTGTCCCCAGAGCCAGGACACGT            ENST00000602732
## [160]    33 ATGATGACGCTTGTCCCCAGAGCCAGGACACGT            ENST00000602770

At last changing the seqname style to the default value "Ensembl".

seqlevelsStyle(edb) <- "Ensembl"

6 Interactive annotation lookup using the shiny web app

In addition to the genes, transcripts and exons methods it is possibly to search interactively for gene/transcript/exon annotations using the internal, shiny based, web application. The application can be started with the runEnsDbApp() function. The search results from this app can also be returned to the R workspace either as a data.frame or GRanges object.

7 Plotting gene/transcript features using ensembldb and Gviz and ggbio

The Gviz package provides functions to plot genes and transcripts along with other data on a genomic scale. Gene models can be provided either as a data.frame, GRanges, TxDB database, can be fetched from biomart and can also be retrieved from ensembldb.

Below we generate a GeneRegionTrack fetching all transcripts from a certain region on chromosome Y.

Note that if we want in addition to work also with BAM files that were aligned against DNA sequences retrieved from Ensembl or FASTA files representing genomic DNA sequences from Ensembl we should change the ucscChromosomeNames option from Gviz to FALSE (i.e. by calling options(ucscChromosomeNames = FALSE)). This is not necessary if we just want to retrieve gene models from an EnsDb object, as the ensembldb package internally checks the ucscChromosomeNames option and, depending on that, maps Ensembl chromosome names to UCSC chromosome names.

## Loading the Gviz library
library(Gviz)
library(EnsDb.Hsapiens.v75)
edb <- EnsDb.Hsapiens.v75

## Retrieving a Gviz compatible GRanges object with all genes
## encoded on chromosome Y.
gr <- getGeneRegionTrackForGviz(edb, chromosome = "Y",
                                start = 20400000, end = 21400000)
## Define a genome axis track
gat <- GenomeAxisTrack()

## We have to change the ucscChromosomeNames option to FALSE to enable Gviz usage
## with non-UCSC chromosome names.
options(ucscChromosomeNames = FALSE)

plotTracks(list(gat, GeneRegionTrack(gr)))

options(ucscChromosomeNames = TRUE)

Above we had to change the option ucscChromosomeNames to FALSE in order to use it with non-UCSC chromosome names. Alternatively, we could however also change the seqnamesStyle of the EnsDb object to UCSC. Note that we have to use now also chromosome names in the UCSC style in the SeqNameFilter (i.e. “chrY” instead of Y).

seqlevelsStyle(edb) <- "UCSC"
## Retrieving the GRanges objects with seqnames corresponding to UCSC chromosome names.
gr <- getGeneRegionTrackForGviz(edb, chromosome = "chrY",
                                start = 20400000, end = 21400000)
## Warning in .formatSeqnameByStyleForQuery(x, sn, ifNotFound): Seqnames:
## Y could not be mapped to the seqlevels style of the database (Ensembl)!
## Returning the orginal seqnames for these.
seqnames(gr)
## factor-Rle of length 218 with 1 run
##   Lengths:  218
##   Values : chrY
## Levels(1): chrY
## Define a genome axis track
gat <- GenomeAxisTrack()
plotTracks(list(gat, GeneRegionTrack(gr)))

We can also use the filters from the ensembldb package to further refine what transcripts are fetched, like in the example below, in which we create two different gene region tracks, one for protein coding genes and one for lincRNAs.

protCod <- getGeneRegionTrackForGviz(edb, chromosome = "chrY",
                                     start = 20400000, end = 21400000,
                                     filter = GeneBiotypeFilter("protein_coding"))
lincs <- getGeneRegionTrackForGviz(edb, chromosome = "chrY",
                                   start = 20400000, end = 21400000,
                                   filter = GeneBiotypeFilter("lincRNA"))

plotTracks(list(gat, GeneRegionTrack(protCod, name = "protein coding"),
                GeneRegionTrack(lincs, name = "lincRNAs")), transcriptAnnotation = "symbol")