1 Version Info

R version: R version 3.5.1 Patched (2018-07-12 r74967)
Bioconductor version: 3.8
Package version: 1.4.0

2 Introduction

Annotation resources make up a significant proportion of the Bioconductor project[1]. And there are also a diverse set of online resources available which are accessed using specific packages. This walkthrough will describe the most popular of these resources and give some high level examples on how to use them.

Bioconductor annotation resources have traditionally been used near the end of an analysis. After the bulk of the data analysis, annotations would be used interpretatively to learn about the most significant results. But increasingly, they are also used as a starting point or even as an intermediate step to help guide a study that is still in progress. In addition to this, what it means for something to be an annotation is also becoming less clear than it once was. It used to be clear that annotations were only those things that had been established after multiple different studies had been performed (such as the primary role of a gene product). But today many large data sets are treated by communities in much the same way that classic annotations once were: as a reference for additional comparisons.

Another change that is underway with annotations in Bioconductor is in the way that they are obtained. In the past annotations existed almost exclusively as separate annotation packages[2,3,4]. Today packages are still an enormous source of annotations. The current release repository contains over eight hundred annotation packages. This table summarizes some of the more important classes of annotation objects that are often accessed using packages:

Object Type	Example Package Name	Contents
TxDb	TxDb.Hsapiens.UCSC.hg19.knownGene	Transcriptome ranges for the known gene track of Homo sapiens, e.g., introns, exons, UTR regions.
OrgDb	org.Hs.eg.db	Gene-based information for Homo sapiens; useful for mapping between gene IDs, Names, Symbols, GO and KEGG identifiers, etc.
BSgenome	BSgenome.Hsapiens.UCSC.hg19	Full genome sequence for Homo sapiens.
Organism.dplyr	src_organism	Collection of multiple annotations for a common organism and genome build.
AnnotationHub	AnnotationHub	Provides a convenient interface to annotations from many different sources; objects are returned as fully parsed Bioconductor data objects or as the name of a file on disk.

But in spite of the popularity of annotation packages, annotations are increasingly also being pulled down from web services like biomaRt[5,6,7] or from the AnnotationHub[8]. And both of these represent enormous resources for annotation data.

In part because of the rapidly evolving landscape, it is currently impossible in a single document to cover every possible annotation or even every kind of annotation present in Bioconductor. So here we will instead go over the most popular annotation resources and describe them in a way intended to expose common patterns used for accessing them. The hope is that a user with this information will be able to make educated guesses about how to find and use additional resources that will inevitably be added later. Topics that will be covered will include the following:

3 Set Up

In this chapter we make use of several Bioconductor packages. You can install them with BiocManager::install():

if (!"BiocManager" %in% rownames(installed.packages()))
     install.packages("BiocManager")
BiocManager::install(c("AnnotationHub", "Homo.sapiens",
           "Organism.dplyr",
           "TxDb.Hsapiens.UCSC.hg19.knownGene",
           "TxDb.Hsapiens.UCSC.hg38.knownGene",
           "BSgenome.Hsapiens.UCSC.hg19", "biomaRt",
           "TxDb.Athaliana.BioMart.plantsmart22"))

The usage of the installed packages will be described in detail within the Usage section.

4 Using AnnotationHub

The top of the list for learning about annotation resources is the relatively new AnnotationHub package[8]. The AnnotationHub was created to provide a convenient access point for end users to find a large range of different annotation objects for use with Bioconductor. Resources found in the AnnotationHub are easy to discover and are presented to the user as familiar Bioconductor data objects. Because it is a recent addition, the AnnotationHub allows access to a broad range of annotation like objects, some of which may not have been considered annotations even a few years ago. To get started with the AnnotationHub users only need to load the package and then create a local AnnotationHub object like this:

## snapshotDate(): 2018-10-30

ah <- AnnotationHub()

The very 1st time that you call the AnnotationHub, it will create a cache directory on your system and download the latest metadata for the hubs current contents. From that time forward, whenever you download one of the hubs data objects, it will also cache those files in the local directory so that if you request the information again, you will be able to access it quickly.

The show method of an AnnotationHub object will tell you how many resources are currently accessible using that object as well as give a high level overview of the most common kinds of data present.

ah

## AnnotationHub with 47163 records
## # snapshotDate(): 2018-10-30 
## # $dataprovider: BroadInstitute, Ensembl, UCSC, ftp://ftp.ncbi.nlm.nih.go...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taur...
## # $rdataclass: GRanges, BigWigFile, FaFile, TwoBitFile, Rle, OrgDb, Chain...
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH2"]]' 
## 
##             title                                               
##   AH2     | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa   
##   AH3     | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa
##   AH4     | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa
##   AH5     | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa          
##   AH6     | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa        
##   ...       ...                                                 
##   AH66177 | TxDb.Ptroglodytes.UCSC.panTro5.refGene.sqlite       
##   AH66178 | TxDb.Rnorvegicus.UCSC.rn5.refGene.sqlite            
##   AH66179 | TxDb.Rnorvegicus.UCSC.rn6.refGene.sqlite            
##   AH66180 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite           
##   AH66181 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite

As you can see from the object above, there are a LOT of different resources available. So normally when you get an AnnotationHub object the 1st thing you want to do is to filter it to remove unwanted resources.

Fortunately, the AnnotationHub has several different kinds of metadata that you can use for searching and subsetting. To see the different categories all you need to do is to type the name of your AnnotationHub object and then tab complete from the ‘$’ operator. And to see all possible contents of one of these categories you can pass that value in to unique like this:

unique(ah$dataprovider)

##  [1] "Ensembl"                              
##  [2] "UCSC"                                 
##  [3] "RefNet"                               
##  [4] "Inparanoid8"                          
##  [5] "NHLBI"                                
##  [6] "ChEA"                                 
##  [7] "Pazar"                                
##  [8] "NIH Pathway Interaction Database"     
##  [9] "Haemcode"                             
## [10] "BroadInstitute"                       
## [11] "PRIDE"                                
## [12] "Gencode"                              
## [13] "CRIBI"                                
## [14] "Genoscope"                            
## [15] "MISO, VAST-TOOLS, UCSC"               
## [16] "UWashington"                          
## [17] "Stanford"                             
## [18] "dbSNP"                                
## [19] "BioMart"                              
## [20] "GeneOntology"                         
## [21] "KEGG"                                 
## [22] "URGI"                                 
## [23] "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"
## [24] "EMBL-EBI"                             
## [25] "MicrosporidiaDB"                      
## [26] "FungiDB"                              
## [27] "TriTrypDB"                            
## [28] "ToxoDB"                               
## [29] "AmoebaDB"                             
## [30] "PlasmoDB"                             
## [31] "PiroplasmaDB"                         
## [32] "CryptoDB"                             
## [33] "TrichDB"                              
## [34] "GiardiaDB"

One of the most valuable ways in which the data is labeled is according to the kind of R object that will be returned to you.

unique(ah$rdataclass)

##  [1] "FaFile"                            "GRanges"                          
##  [3] "data.frame"                        "Inparanoid8Db"                    
##  [5] "TwoBitFile"                        "ChainFile"                        
##  [7] "SQLiteConnection"                  "biopax"                           
##  [9] "BigWigFile"                        "AAStringSet"                      
## [11] "MSnSet"                            "mzRpwiz"                          
## [13] "mzRident"                          "list"                             
## [15] "TxDb"                              "Rle"                              
## [17] "EnsDb"                             "VcfFile"                          
## [19] "igraph"                            "OrgDb"                            
## [21] "data.frame, DNAStringSet, GRanges"

Once you have identified which sorts of metadata you would like to use to find your data of interest, you can then use the subset or query methods to reduce the size of the hub object to something more manageable. For example you could select only those records where the string ‘GRanges’ was in the metadata. As you can see GRanges are one of the more popular formats for data that comes from the AnnotationHub.

grs <- query(ah, "GRanges")
grs

## AnnotationHub with 20478 records
## # snapshotDate(): 2018-10-30 
## # $dataprovider: BroadInstitute, UCSC, Ensembl, Haemcode, FungiDB, Pazar,...
## # $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Cani...
## # $rdataclass: GRanges, data.frame, DNAStringSet, GRanges
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5012"]]' 
## 
##             title                                                           
##   AH5012  | Chromosome Band                                                 
##   AH5013  | STS Markers                                                     
##   AH5014  | FISH Clones                                                     
##   AH5015  | Recomb Rate                                                     
##   AH5016  | ENCODE Pilot                                                    
##   ...       ...                                                             
##   AH65527 | Trypanosoma cruzi marinkellei strain B7 transcript information  
##   AH65528 | Trypanosoma evansi strain STIB 805 transcript information       
##   AH65529 | Trypanosoma grayi ANR4 transcript information                   
##   AH65530 | Trachipleistophora hominis Unknown strain transcript information
##   AH66146 | Allele data from the IPD IMGT/HLA database version 3.33.0

Or you can use subsetting to only select for matches on a specific field

grs <- ah[ah$rdataclass == "GRanges",]

The subset function is also provided.

orgs <- subset(ah, ah$rdataclass == "OrgDb")
orgs

## AnnotationHub with 1691 records
## # snapshotDate(): 2018-10-30 
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Escherichia coli, Acinetobacter genomospecies 2, Leclercia ad...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH61768"]]' 
## 
##             title                                      
##   AH61768 | org.Ag.eg.db.sqlite                        
##   AH61769 | org.At.tair.db.sqlite                      
##   AH61770 | org.Bt.eg.db.sqlite                        
##   AH61771 | org.Cf.eg.db.sqlite                        
##   AH61772 | org.Gg.eg.db.sqlite                        
##   ...       ...                                        
##   AH63468 | org.Salmonella_typhimurium_LT2.eg.sqlite   
##   AH63469 | org.Acinetobacter_baumannii.eg.sqlite      
##   AH63470 | org.Acinetobacter_genomosp._2.eg.sqlite    
##   AH63471 | org.Acinetobacter_genomospecies_2.eg.sqlite
##   AH63472 | org.Bacterium_anitratum.eg.sqlite

And if you really need access to all the metadata you can extract it as a DataFrame using mcols() like so:

meta <- mcols(ah)

Also if you are a fan of GUI’s you can use the display method to look at your data in a browser and return selected rows back as a smaller AnnotationHub object like this:

sah <- display(ah)

Calling this method will produce a web based interface like the one pictured here:

Once you have the AnnotationHub object pared down to a reasonable size, and are sure about which records you want to retrieve, then you only need to use the ‘[[’ operator to extract them. Using the ‘[[’ operator, you can extract by numeric index (1,2,3) or by AnnotationHub ID. If you choose to use the former, you simply extract the element that you are interested in. So for our chain example, you might just want to 1st one like this:

res <- grs[[1]]

## downloading 0 resources

## loading from cache 
##     '/home/biocbuild//.AnnotationHub/5012'

head(res, n=3)

## UCSC track 'cytoBand'
## UCSCData object with 3 ranges and 1 metadata column:
##       seqnames          ranges strand |        name
##          <Rle>       <IRanges>  <Rle> | <character>
##   [1]     chr1       1-2300000      * |      p36.33
##   [2]     chr1 2300001-5400000      * |      p36.32
##   [3]     chr1 5400001-7200000      * |      p36.31
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

4.1 AnnotationHub exercises

Exercise 1: Use the AnnotationHub to extract UCSC data that is from Homo sapiens and also specifically from the hg19 genome. What happens to the hub object as you filter data at each step?

Exercise 2 Now that you have basically narrowed things down to the hg19 annotations from UCSC genome browser, lets get one of these annotations. Find the oreganno track and save it into a local variable.

[ Back to top ]

5 OrgDb objects

At this point you might be wondering: What is this OrgDb object about? OrgDb objects are one member of a family of annotation objects that all represent hidden data through a shared set of methods. So if you look closely at the dog object created below you can see it contains data for Canis familiaris (taxonomy ID = 9615). You can learn a little more about it by learning about the columns method.

dog <- query(orgs, "Canis familiaris")[[1]]

## downloading 0 resources

## loading from cache 
##     '/home/biocbuild//.AnnotationHub/68517'

dog

## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: CANINE_DB
## | ORGANISM: Canis familiaris
## | SPECIES: Canine
## | EGSOURCEDATE: 2018-Apr4
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | CENTRALID: EG
## | TAXID: 9615
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
## | GOSOURCEDATE: 2018-Mar28
## | GOEGSOURCEDATE: 2018-Apr4
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GPSOURCENAME: UCSC Genome Bioinformatics (Canis familiaris)
## | GPSOURCEURL: 
## | GPSOURCEDATE: 2017-Apr6
## | UNIPROTGOSOURCENAME: UNIPROTGO
## | UNIPROTGOSOURCEDATE: Mon Apr  9 19:58:07 2018
## | UNIPROTGOSOURCEURL: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz
## | ENSOURCEDATE: 2017-Dec04
## | ENSOURCENAME: Ensembl
## | ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta

## 
## Please see: help('select') for usage information

columns(dog)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PMID"        
## [17] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"

The columns method gives you a vector of data types that can be retrieved from the object that you call it on. So the above call indicates that there are several different data types that can be retrieved from the tetra object.

A very similar method is the keytypes method, which will list all the data types that can also be used as keys.

keytypes(dog)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PMID"        
## [17] "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"

In many cases most of the things that are listed as columns will also come back from a keytypes call, but since these two things are not guaranteed to be identical, we maintain two separate methods.

Now that you can see what kinds of things can be used as keys, you can call the keys method to extract out all the keys of a given key type.

head(keys(dog, keytype="ENTREZID"))

## [1] "399518" "399530" "399544" "399545" "399653" "403152"

This is useful if you need to get all the IDs of a particular kind but the keys method has a few extra arguments that can make it even more flexible. For example, using the keys method you could also extract the gene SYMBOLS that contain “COX” like this:

keys(dog, keytype="SYMBOL", pattern="COX")

##  [1] "COX5B"   "COX7A2L" "COX8A"   "COX15"   "COX5A"   "COX4I1"  "COX6A2" 
##  [8] "COX20"   "COX18"   "ACOX1"   "COX4I2"  "ACOX3"   "COX10"   "COX17"  
## [15] "COX11"   "ACOXL"   "COX7A1"  "COX1"    "COX2"    "COX3"    "COX19"  
## [22] "COX7B2"  "COX14"   "COX16"   "ACOX2"

Or if you really needed an other keytype, you can use the column argument to extract the ENTREZ GENE IDs for those gene SYMBOLS that contain the string “COX”:

keys(dog, keytype="ENTREZID", pattern="COX", column="SYMBOL")

## 'select()' returned 1:1 mapping between keys and columns

##  [1] "474567"    "475739"    "476040"    "477792"    "478370"    "479623"   
##  [7] "479780"    "480099"    "482193"    "483322"    "485825"    "488790"   
## [13] "489515"    "503668"    "609555"    "611729"    "612614"    "804478"   
## [19] "804479"    "804480"    "100685945" "100687434" "100688544" "100846976"
## [25] "100855488"

But often, you will really want to extract other data that matches a particular key or set of keys. For that there are two methods which you can use. The more powerful of these is probably select. Here is how you would look up the gene SYMBOL, and REFSEQ id for specific entrez gene ID.

select(dog, keys="804478", columns=c("SYMBOL","REFSEQ"), keytype="ENTREZID")

## 'select()' returned 1:1 mapping between keys and columns

##   ENTREZID SYMBOL    REFSEQ
## 1   804478   COX1 NP_008473

When you call it, select will return a data.frame that attempts to fill in matching values for all the columns you requested. However, if you ask select for things that have a many to one relationship to your keys it can result in an expansion of the data object that is returned. For example, watch what happens when we ask for the GO terms for the same entrez gene ID:

select(dog, keys="804478", columns="GO", keytype="ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

##    ENTREZID         GO EVIDENCE ONTOLOGY
## 1    804478 GO:0004129      IEA       MF
## 2    804478 GO:0005506      IEA       MF
## 3    804478 GO:0005750      IEA       CC
## 4    804478 GO:0005751      IEA       CC
## 5    804478 GO:0006119      IEA       BP
## 6    804478 GO:0006123      IEA       BP
## 7    804478 GO:0009060      IEA       BP
## 8    804478 GO:0015990      IEA       BP
## 9    804478 GO:0016021      IEA       CC
## 10   804478 GO:0020037      IEA       MF
## 11   804478 GO:0045277      IEA       CC

Because there are several GO terms associated with the gene “804478”, you end up with many rows in the data.frame. This can become problematic if you then ask for several columns that have a many to one relationship to the original key. If you were to do that, not only would the result multiply in size, it would also become really hard to use. A better strategy is to be selective when using select.

Sometimes you might want to look up matching results in a way that is simpler than the data.frame object that select returns. This is especially true when you only want to look up one kind of value per key. For these cases, we recommend that you look at the mapIds method. Lets look at what happens if request the same basic information as in our recent select call, but instead using the mapIds method:

mapIds(dog, keys="804478", column="GO", keytype="ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

##       804478 
## "GO:0004129"

As you can see, the mapIds method allows you to simplify the result that is returned. And by default, mapIds only returns the 1st matching element for each key. But what if you really need all those GO terms returned when you call mapIds? Well then you can make use of the mapIds multiVals argument. There are several options for this argument, we have already seen how by default you can return only the ‘first’ element. But you can also return a ‘list’ or ‘CharacterList’ object, or you can ‘filter’ out or return ‘asNA’ any keys that have multiple matches. You can even define your own rule (as a function) and pass that in as an argument to multiVals. Lets look at what happens when you return a list:

mapIds(dog, keys="804478", column="GO", keytype="ENTREZID", multiVals="list")

## 'select()' returned 1:many mapping between keys and columns

## $`804478`
##  [1] "GO:0004129" "GO:0005506" "GO:0005750" "GO:0005751" "GO:0006119"
##  [6] "GO:0006123" "GO:0009060" "GO:0015990" "GO:0016021" "GO:0020037"
## [11] "GO:0045277"

Now you know how to extract information from an OrgDb object, you might find it helpful to know that there is a whole family of other AnnotationDb derived objects that you can also use with these same five methods (keytypes(), columns(), keys(), select(), and mapIds()). For example there are ChipDb objects, InparanoidDb objects and TxDb objects which contain data about microarray probes, inparanoid homology partners or transcript range information respectively. And there are also more specialized objects like GODb or ReactomeDb objects which offer access to data from GO and reactome. In the next section, we will be looking at one of the more popular classes of these objects: the TxDb object.

5.1 OrgDb exercises

Exercise 3: Look at the help page for the different columns and keytypes values with: help(“SYMBOL”). Now use this information and what we just described to look up the entrez gene and chromosome for the gene symbol “MSX2”.

Exercise 4: In the previous exercise we had to use gene symbols as keys. But in the past this kind of behavior has sometimes been inadvisable because some gene symbols are used as the official symbol for more than one gene. To learn if this is still happening take advantage of the fact that entrez gene ids are uniquely assigned, and extract all of the gene symbols and their associated entrez gene ids from the org.Hs.eg.db package. Then check the symbols for redundancy.

[ Back to top ]

6 TxDb Objects

As mentioned before, TxDb objects can be accessed using the standard set of methods: keytypes(), columns(), keys(), select(), and mapIds(). But because these objects contain information about a transcriptome, they are often used to compare range based information to these important features of the genome[3,4]. As a result they also have specialized accessors for extracting out ranges that correspond to important transcriptome characteristics.

Lets start by loading a TxDb object from an annotation package based on the UCSC ensembl genes track for Drosophila. A common practice when loading these is to shorten the long name to ‘txdb’ (just as a convenience).

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

Just by looking at the TxDb object, we can learn a lot about what data it contains including where the data came from, which build of the UCSC genome it was based on and the last time that the object was updated. One of the most common uses for a TxDb object is to extract various kinds of transcript data out of it. So for example you can extract all the transcripts out of the TxDb as a GRanges object like this:

txs <- transcripts(txdb)
txs

## GRanges object with 5506 ranges and 2 metadata columns:
##          seqnames            ranges strand |     tx_id     tx_name
##             <Rle>         <IRanges>  <Rle> | <integer> <character>
##      [1]     chr3     238279-451097      + |     13060  uc003bot.3
##      [2]     chr3     238279-451097      + |     13061  uc003bou.3
##      [3]     chr3     239326-290282      + |     13062  uc003bov.2
##      [4]     chr3     239326-440831      + |     13063  uc003bow.2
##      [5]     chr3     361366-451097      + |     13064  uc011asi.2
##      ...      ...               ...    ... .       ...         ...
##   [5502]    chr18 77732867-77748532      - |     65761  uc002lnr.3
##   [5503]    chr18 77732867-77748532      - |     65762  uc010drf.3
##   [5504]    chr18 77732867-77793915      - |     65763  uc010drg.3
##   [5505]    chr18 77915117-78005397      - |     65764  uc002lny.3
##   [5506]    chr18 77941005-78005397      - |     65765  uc010xfp.2
##   -------
##   seqinfo: 2 sequences from hg19 genome

Similarly, there are also extractors for exons(), cds(), genes() and promoters(). Which kind of feature you choose to extract just depends on what information you are after. These basic extractors are fine if you only want a flat representation of these data, but many of these features are inherently nested. So instead of extracting a flat GRanges object, you might choose instead to extract a GRangesList object that groups the transcripts by the genes that they are associated with like this:

txby <- transcriptsBy(txdb, by="gene")
txby

## GRangesList object of length 1612:
## $1000 
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]    chr18 25530930-25616539      - |     65378  uc010xbn.1
##   [2]    chr18 25530930-25757445      - |     65379  uc002kwg.2
## 
## $100009676 
## GRanges object with 1 range and 2 metadata columns:
##       seqnames              ranges strand | tx_id    tx_name
##   [1]     chr3 101395274-101398057      + | 14200 uc003dvg.3
## 
## $100101467 
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames            ranges strand | tx_id    tx_name
##   [1]    chr18 32831023-32870196      - | 65418 uc002kyl.3
##   [2]    chr18 32831023-32870196      - | 65419 uc002kym.3
##   [3]    chr18 32843361-32870165      - | 65420 uc002kyn.1
## 
## ...
## <1609 more elements>
## -------
## seqinfo: 2 sequences from hg19 genome

Just as with the flat extractors, there is a whole family of extractors available depending on what you want to extract and how you want it grouped. They include transcriptsBy(), exonsBy(), cdsBy(), intronsByTranscript(), fiveUTRsByTranscript() and threeUTRsByTranscript().

When dealing with genomic data it is almost inevitable that you will run into problems with the way that different groups have adopted alternate ways of naming chromosomes. This is because almost every major repository has cooked up their own slightly different way of labeling these important features.

To cope with this, the Seqinfo object was invented and is attached to TxDb objects as well as the GenomicRanges extracted from these objects. You can extract it using the seqinfo() method like this:

si <- seqinfo(txdb)
si

## Seqinfo object with 2 sequences from hg19 genome:
##   seqnames seqlengths isCircular genome
##   chr3      198022430         NA   hg19
##   chr18      78077248         NA   hg19

And since the seqinfo information is also attached to the GRanges objects produced by the TxDb extractors, you can also call seqinfo on the results of those methods like this:

txby <- transcriptsBy(txdb, by="gene")
si <- seqinfo(txby)

The Seqinfo object contains a lot of valuable data about which chromosome features are present, whether they are circular or linear, and how long each one is. It is also something that will be checked against if you try to do an operation like ‘findOverlaps’ to compute overlapping ranges etc. So it’s a valuable way to make sure that the chromosomes and genome are the same for your annotations as the range that you are comparing them to. But sometimes you may have a situation where your annotation object contains data that is comparable to your data object, but where it is simply named with a different naming style. For those cases, there are helpers that you can use to discover what the current name style is for an object. And there is also a setter method to allow you to change the value to something more appropriate. So in the following example, we are going to change the seqlevelStyle from ‘UCSC’ to ‘ensembl’ based naming convention (and then back again).

head(seqlevels(txdb))

## [1] "chr3"  "chr18"

seqlevelsStyle(txdb)

## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string

## [1] "UCSC"

seqlevelsStyle(txdb) <- "NCBI"

## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string

head(seqlevels(txdb))

## [1] "3"  "18"

## then change it back
seqlevelsStyle(txdb) <- "UCSC"

## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : EOF within quoted string

head(seqlevels(txdb))

## [1] "chr3"  "chr18"

In addition to being able to change the naming style used for an object with seqinfo data, you can also toggle which of the chromosomes are ‘active’ so that the software will ignore certain chromosomes. By default, all of the chromosomes are set to be ‘active’.

head(isActiveSeq(txdb), n=30)

##  chr3 chr18 
##  TRUE  TRUE

But sometimes you might wish to ignore some of them. For example, lets suppose that you wanted to ignore the Y chromosome from our txdb. You could do that like so:

isActiveSeq(txdb)["chrY"] <- FALSE
head(isActiveSeq(txdb), n=26)

6.1 TxDb exercises

Exercise 5: Use the accessors for the TxDb.Hsapiens.UCSC.hg19.knownGene package to retrieve the gene id, transcript name and transcript chromosome for all the transcripts. Do this using both the select() method and also using the transcripts() method. What is the difference in the output?

Exercise 6: Load the TxDb.Athaliana.BioMart.plantsmart22 package. This package is not from UCSC and it is based on plantsmart. Now use select or one of the range based accessors to look at the gene ids from this TxDb object. How do they compare to what you saw in the TxDb.Hsapiens.UCSC.hg19.knownGene package?

[ Back to top ]

7 Organism.dplyr src_organism Objects

So what happens if you have data from multiple different Annotation objects. For example, what if you had gene SYMBOLS (found in an OrgDb object) and you wanted to easily match those up with known gene transcript names from a UCSC based TxDb object? There is an ideal tool that can help with this kind of problem and it’s called an src_organism object from the Organism.dplyr package. src_organism objects and their related methods are able to query each of OrgDb and TxDb resources for you and then merge the results back together in way that lets you pretend that you only have one source for all your annotations.

library(Organism.dplyr)

src_organism objects can be created for organisms that have both an OrgDb and a TxDb. To see organisms that can have src_organism objects made, use the function supportOrganisms():

supported <- supportedOrganisms()
print(supported, n=Inf)

## # A tibble: 21 x 3
##    organism                OrgDb        TxDb                                
##    <chr>                   <chr>        <chr>                               
##  1 Bos taurus              org.Bt.eg.db TxDb.Btaurus.UCSC.bosTau8.refGene   
##  2 Caenorhabditis elegans  org.Ce.eg.db TxDb.Celegans.UCSC.ce11.refGene     
##  3 Caenorhabditis elegans  org.Ce.eg.db TxDb.Celegans.UCSC.ce6.ensGene      
##  4 Canis familiaris        org.Cf.eg.db TxDb.Cfamiliaris.UCSC.canFam3.refGe…
##  5 Drosophila melanogaster org.Dm.eg.db TxDb.Dmelanogaster.UCSC.dm3.ensGene 
##  6 Drosophila melanogaster org.Dm.eg.db TxDb.Dmelanogaster.UCSC.dm6.ensGene 
##  7 Danio rerio             org.Dr.eg.db TxDb.Drerio.UCSC.danRer10.refGene   
##  8 Gallus gallus           org.Gg.eg.db TxDb.Ggallus.UCSC.galGal4.refGene   
##  9 Homo sapiens            org.Hs.eg.db TxDb.Hsapiens.UCSC.hg18.knownGene   
## 10 Homo sapiens            org.Hs.eg.db TxDb.Hsapiens.UCSC.hg19.knownGene   
## 11 Homo sapiens            org.Hs.eg.db TxDb.Hsapiens.UCSC.hg38.knownGene   
## 12 Mus musculus            org.Mm.eg.db TxDb.Mmusculus.UCSC.mm10.ensGene    
## 13 Mus musculus            org.Mm.eg.db TxDb.Mmusculus.UCSC.mm10.knownGene  
## 14 Mus musculus            org.Mm.eg.db TxDb.Mmusculus.UCSC.mm9.knownGene   
## 15 Macaca mulatta          org.Mmu.eg.… TxDb.Mmulatta.UCSC.rheMac3.refGene  
## 16 Macaca mulatta          org.Mmu.eg.… TxDb.Mmulatta.UCSC.rheMac8.refGene  
## 17 Pan troglodytes         org.Pt.eg.db TxDb.Ptroglodytes.UCSC.panTro4.refG…
## 18 Rattus norvegicus       org.Rn.eg.db TxDb.Rnorvegicus.UCSC.rn4.ensGene   
## 19 Rattus norvegicus       org.Rn.eg.db TxDb.Rnorvegicus.UCSC.rn5.refGene   
## 20 Rattus norvegicus       org.Rn.eg.db TxDb.Rnorvegicus.UCSC.rn6.refGene   
## 21 Sus scrofa              org.Ss.eg.db TxDb.Sscrofa.UCSC.susScr3.refGene

Notice how there are multiple entries for a single organism (e.g. three for Homo sapiens). There is only one OrgDb per organism, but different TxDbs can be used. To specify a certain version of a TxDb to use, we can use the src_organism() function to create an src_organism object.

library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")

## creating 'src_organism' database...

src

## src:  sqlite 3.22.0 [/tmp/Rtmp5CSHZw/file1d4f2f548019]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
##   id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx

We can also create one using the src_ucsc() function. This will create an src_organism object using the most recent TxDb version available:

src <- src_ucsc("Homo sapiens")

src

## src:  sqlite 3.22.0 [/tmp/Rtmp5CSHZw/file1d4f2f548019]
## tbls: id, id_accession, id_go, id_go_all, id_omim_pm, id_protein,
##   id_transcript, ranges_cds, ranges_exon, ranges_gene, ranges_tx

The five methods that worked for all of the other Db objects that we have discussed (keytypes(), columns(), keys(), select(), and mapIds()) all work for src_organism objects. Here, we use keytypes() to show which keytypes can be passed to the keytype argument of select().

keytypes(src)

##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"     
##  [5] "cds_id"       "cds_name"     "cds_start"    "cds_strand"  
##  [9] "ensembl"      "ensemblprot"  "ensembltrans" "entrez"      
## [13] "enzyme"       "evidence"     "evidenceall"  "exon_chrom"  
## [17] "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"    
## [25] "gene_start"   "gene_strand"  "genename"     "go"          
## [29] "goall"        "ipi"          "map"          "omim"        
## [33] "ontology"     "ontologyall"  "pfam"         "pmid"        
## [37] "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"    
## [45] "tx_strand"    "tx_type"      "unigene"      "uniprot"

Use columns() to show which keytypes can be passed to the keytype argument of select().

columns(src)

##  [1] "accnum"       "alias"        "cds_chrom"    "cds_end"     
##  [5] "cds_id"       "cds_name"     "cds_start"    "cds_strand"  
##  [9] "ensembl"      "ensemblprot"  "ensembltrans" "entrez"      
## [13] "enzyme"       "evidence"     "evidenceall"  "exon_chrom"  
## [17] "exon_end"     "exon_id"      "exon_name"    "exon_rank"   
## [21] "exon_start"   "exon_strand"  "gene_chrom"   "gene_end"    
## [25] "gene_start"   "gene_strand"  "genename"     "go"          
## [29] "goall"        "ipi"          "map"          "omim"        
## [33] "ontology"     "ontologyall"  "pfam"         "pmid"        
## [37] "prosite"      "refseq"       "symbol"       "tx_chrom"    
## [41] "tx_end"       "tx_id"        "tx_name"      "tx_start"    
## [45] "tx_strand"    "tx_type"      "unigene"      "uniprot"

And that’s it. You can now use these objects in the same way that you use OrgDb or TxDb objects. It works the same as the base objects that it contains:

select(src, keys="4488", columns=c("symbol", "tx_name"), keytype="entrez")

## Joining, by = "entrez"

##   entrez symbol    tx_name
## 1   4488   MSX2 uc003mcy.4
## 2   4488   MSX2 uc063juz.1

Organism.dplyr also supports numerous Genomic Extractor functions allowing users to filter based on information contained in the OrgDb and TxDb objects. To see the filters supported by a src_organism() object, use supportedFIlters():

head(supportedFilters(src))

##            filter     field
## 1    AccnumFilter    accnum
## 2     AliasFilter     alias
## 3  CdsChromFilter cds_chrom
## 45   CdsEndFilter   cds_end
## 43    CdsIdFilter    cds_id
## 4   CdsNameFilter  cds_name

The ranged based accessors such as those in GenomicFeatures will also work. There are also “_tbl" functions (e.g. transcripts_tbl()) that return tbl objects instead of GRanges objects. Complex filter statements can be given as input. Here we declare a GRangesFilter and use two different type-returning accessors to query transcripts that either start with “SNORD” and are within our given GRangesFilter, or have symbol with symbol “ADA”:

gr <- GRangesFilter(GenomicRanges::GRanges("chr1:44000000-55000000"))
transcripts(src, filter=~(symbol %startsWith% "SNORD" & gr) | symbol == "ADA")

## <SQL>
## SELECT *
## FROM `ranges_tx`

## <SQL>
## SELECT *
## FROM `id`
## WHERE (`symbol` LIKE 'SNORD%')

## <SQL>
## SELECT *
## FROM `ranges_tx`

## <SQL>
## SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM `ranges_tx`) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_chrom` AS `tx_chrom`, `TBL_RIGHT`.`tx_strand` AS `tx_strand`, `TBL_RIGHT`.`tx_start` AS `tx_start`, `TBL_RIGHT`.`tx_end` AS `tx_end`, `TBL_RIGHT`.`tx_id` AS `tx_id`, `TBL_RIGHT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT *
## FROM `id`
## WHERE (`symbol` LIKE 'SNORD%')) AS `TBL_LEFT`
##   LEFT JOIN `ranges_tx` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)

## <SQL>
## SELECT *
## FROM `ranges_tx`

## <SQL>
## SELECT *
## FROM `id`
## WHERE (`symbol` = 'ADA')

## <SQL>
## SELECT *
## FROM `ranges_tx`

## <SQL>
## SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM `ranges_tx`) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`, `map`, `ensembl`, `symbol`, `genename`, `alias`
## FROM (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM `ranges_tx`) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_chrom` AS `tx_chrom`, `TBL_RIGHT`.`tx_strand` AS `tx_strand`, `TBL_RIGHT`.`tx_start` AS `tx_start`, `TBL_RIGHT`.`tx_end` AS `tx_end`, `TBL_RIGHT`.`tx_id` AS `tx_id`, `TBL_RIGHT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT *
## FROM `id`
## WHERE (`symbol` LIKE 'SNORD%')) AS `TBL_LEFT`
##   LEFT JOIN `ranges_tx` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## )) AS `TBL_LEFT`
##   INNER JOIN `table1_14873` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)

## <SQL>
## SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`, `map`, `ensembl`, `symbol`, `genename`, `alias`, `tx_type`
## FROM (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM (SELECT *
## FROM `ranges_tx`)) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`, `map`, `ensembl`, `symbol`, `genename`, `alias`
## FROM (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM (SELECT *
## FROM `ranges_tx`)) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_chrom` AS `tx_chrom`, `TBL_RIGHT`.`tx_strand` AS `tx_strand`, `TBL_RIGHT`.`tx_start` AS `tx_start`, `TBL_RIGHT`.`tx_end` AS `tx_end`, `TBL_RIGHT`.`tx_id` AS `tx_id`, `TBL_RIGHT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT *
## FROM (SELECT *
## FROM `id`)
## WHERE (`symbol` LIKE 'SNORD%')) AS `TBL_LEFT`
##   LEFT JOIN (SELECT *
## FROM `ranges_tx`) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## )) AS `TBL_LEFT`
##   INNER JOIN (SELECT *
## FROM `table1_14873`) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## )
## UNION
## SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`, `map`, `ensembl`, `symbol`, `genename`, `alias`, `tx_type`
## FROM (SELECT `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_chrom` AS `tx_chrom`, `TBL_RIGHT`.`tx_strand` AS `tx_strand`, `TBL_RIGHT`.`tx_start` AS `tx_start`, `TBL_RIGHT`.`tx_end` AS `tx_end`, `TBL_RIGHT`.`tx_id` AS `tx_id`, `TBL_RIGHT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT *
## FROM (SELECT *
## FROM `id`)
## WHERE (`symbol` = 'ADA')) AS `TBL_LEFT`
##   LEFT JOIN (SELECT *
## FROM `ranges_tx`) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez`)
## )

## GRanges object with 8 ranges and 3 metadata columns:
##       seqnames            ranges strand |     tx_id     tx_name      symbol
##          <Rle>         <IRanges>  <Rle> | <integer> <character> <character>
##   [1]     chr1 44775864-44775943      + |      2560  uc057fwc.1     SNORD55
##   [2]     chr1 44776490-44776593      + |      2561  uc001cmk.2     SNORD46
##   [3]     chr1 44777843-44777912      + |      2564  uc009vxi.4    SNORD38A
##   [4]     chr1 44778390-44778456      + |      2566  uc057fwg.1    SNORD38B
##   [5]     chr1 44778390-44778458      + |      2567  uc001cml.3    SNORD38B
##   [6]    chr20 44619522-44626491      - |    169786  uc061xfj.1         ADA
##   [7]    chr20 44619522-44651742      - |    169787  uc002xmj.4         ADA
##   [8]    chr20 44619810-44651691      - |    169789  uc061xfl.1         ADA
##   -------
##   seqinfo: 455 sequences (1 circular) from hg38 genome

transcripts_tbl(src, filter=~(symbol %startsWith% "SNORD" & gr) | symbol == "ADA")

## <SQL>
## SELECT *
## FROM `ranges_tx`

## <SQL>
## SELECT *
## FROM `id`
## WHERE (`symbol` LIKE 'SNORD%')

## <SQL>
## SELECT *
## FROM `ranges_tx`

## <SQL>
## SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM `ranges_tx`) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_chrom` AS `tx_chrom`, `TBL_RIGHT`.`tx_strand` AS `tx_strand`, `TBL_RIGHT`.`tx_start` AS `tx_start`, `TBL_RIGHT`.`tx_end` AS `tx_end`, `TBL_RIGHT`.`tx_id` AS `tx_id`, `TBL_RIGHT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT *
## FROM `id`
## WHERE (`symbol` LIKE 'SNORD%')) AS `TBL_LEFT`
##   LEFT JOIN `ranges_tx` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)

## <SQL>
## SELECT *
## FROM `ranges_tx`

## <SQL>
## SELECT *
## FROM `id`
## WHERE (`symbol` = 'ADA')

## <SQL>
## SELECT *
## FROM `ranges_tx`

## <SQL>
## SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM `ranges_tx`) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`, `map`, `ensembl`, `symbol`, `genename`, `alias`
## FROM (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM `ranges_tx`) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_chrom` AS `tx_chrom`, `TBL_RIGHT`.`tx_strand` AS `tx_strand`, `TBL_RIGHT`.`tx_start` AS `tx_start`, `TBL_RIGHT`.`tx_end` AS `tx_end`, `TBL_RIGHT`.`tx_id` AS `tx_id`, `TBL_RIGHT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT *
## FROM `id`
## WHERE (`symbol` LIKE 'SNORD%')) AS `TBL_LEFT`
##   LEFT JOIN `ranges_tx` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## )) AS `TBL_LEFT`
##   INNER JOIN `table2_14873` AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)

## <SQL>
## SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`, `map`, `ensembl`, `symbol`, `genename`, `alias`, `tx_type`
## FROM (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM (SELECT *
## FROM `ranges_tx`)) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`, `map`, `ensembl`, `symbol`, `genename`, `alias`
## FROM (SELECT `TBL_LEFT`.`tx_chrom` AS `tx_chrom`, `TBL_LEFT`.`tx_strand` AS `tx_strand`, `TBL_LEFT`.`tx_start` AS `tx_start`, `TBL_LEFT`.`tx_end` AS `tx_end`, `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`tx_id` AS `tx_id`, `TBL_LEFT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`map` AS `map`, `TBL_RIGHT`.`ensembl` AS `ensembl`, `TBL_RIGHT`.`symbol` AS `symbol`, `TBL_RIGHT`.`genename` AS `genename`, `TBL_RIGHT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`
## FROM (SELECT *
## FROM `ranges_tx`)) AS `TBL_LEFT`
##   INNER JOIN (SELECT `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_chrom` AS `tx_chrom`, `TBL_RIGHT`.`tx_strand` AS `tx_strand`, `TBL_RIGHT`.`tx_start` AS `tx_start`, `TBL_RIGHT`.`tx_end` AS `tx_end`, `TBL_RIGHT`.`tx_id` AS `tx_id`, `TBL_RIGHT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT *
## FROM (SELECT *
## FROM `id`)
## WHERE (`symbol` LIKE 'SNORD%')) AS `TBL_LEFT`
##   LEFT JOIN (SELECT *
## FROM `ranges_tx`) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## )) AS `TBL_LEFT`
##   INNER JOIN (SELECT *
## FROM `table2_14873`) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## ) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`tx_chrom` = `TBL_RIGHT`.`tx_chrom` AND `TBL_LEFT`.`tx_strand` = `TBL_RIGHT`.`tx_strand` AND `TBL_LEFT`.`tx_start` = `TBL_RIGHT`.`tx_start` AND `TBL_LEFT`.`tx_end` = `TBL_RIGHT`.`tx_end` AND `TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez` AND `TBL_LEFT`.`tx_id` = `TBL_RIGHT`.`tx_id` AND `TBL_LEFT`.`tx_name` = `TBL_RIGHT`.`tx_name`)
## )
## UNION
## SELECT `tx_chrom`, `tx_strand`, `tx_start`, `tx_end`, `entrez`, `tx_id`, `tx_name`, `map`, `ensembl`, `symbol`, `genename`, `alias`, `tx_type`
## FROM (SELECT `TBL_LEFT`.`entrez` AS `entrez`, `TBL_LEFT`.`map` AS `map`, `TBL_LEFT`.`ensembl` AS `ensembl`, `TBL_LEFT`.`symbol` AS `symbol`, `TBL_LEFT`.`genename` AS `genename`, `TBL_LEFT`.`alias` AS `alias`, `TBL_RIGHT`.`tx_chrom` AS `tx_chrom`, `TBL_RIGHT`.`tx_strand` AS `tx_strand`, `TBL_RIGHT`.`tx_start` AS `tx_start`, `TBL_RIGHT`.`tx_end` AS `tx_end`, `TBL_RIGHT`.`tx_id` AS `tx_id`, `TBL_RIGHT`.`tx_name` AS `tx_name`, `TBL_RIGHT`.`tx_type` AS `tx_type`
##   FROM (SELECT *
## FROM (SELECT *
## FROM `id`)
## WHERE (`symbol` = 'ADA')) AS `TBL_LEFT`
##   LEFT JOIN (SELECT *
## FROM `ranges_tx`) AS `TBL_RIGHT`
##   ON (`TBL_LEFT`.`entrez` = `TBL_RIGHT`.`entrez`)
## )

## # Source:     lazy query [?? x 7]
## # Database:   sqlite 3.22.0 []
## # Ordered by: tx_id
##   tx_chrom tx_start   tx_end tx_strand  tx_id tx_name    symbol  
##   <chr>       <int>    <int> <chr>      <int> <chr>      <chr>   
## 1 chr1     44775864 44775943 +           2560 uc057fwc.1 SNORD55 
## 2 chr1     44776490 44776593 +           2561 uc001cmk.2 SNORD46 
## 3 chr1     44777843 44777912 +           2564 uc009vxi.4 SNORD38A
## 4 chr1     44778390 44778456 +           2566 uc057fwg.1 SNORD38B
## 5 chr1     44778390 44778458 +           2567 uc001cml.3 SNORD38B
## 6 chr20    44619522 44626491 -         169786 uc061xfj.1 ADA     
## 7 chr20    44619522 44651742 -         169787 uc002xmj.4 ADA     
## 8 chr20    44619810 44651691 -         169789 uc061xfl.1 ADA

7.1 Organism.dplyr exercises

Exercise 7: Use the src_organism object to look up the gene symbol, transcript start and chromosome using select(). Then do the same thing using transcripts. You might expect that this call to transcripts will look the same as it did for the TxDb object, but (temporarily) it will not.

Exercise 8: Look at the results from call the columns method on the src_organism object and compare that to what happens when you call columns on the org.Hs.eg.db object and then look at a call to columns on the TxDb.Hsapiens.UCSC.hg19.knownGene object.

Exercise 9: Use the src_organism object with the transcripts method to look up the entrez gene IDs for all gene symbols that contain the letter ‘X’.

[ Back to top ]

8 BSgenome Objects

Another important annotation resource type is a BSgenome package[10]. There are many BSgenome packages in the repository for you to choose from. And you can learn which organisms are already supported by using the available.genomes() function.

head(available.genomes())

## [1] "BSgenome.Alyrata.JGI.v1"                
## [2] "BSgenome.Amellifera.BeeBase.assembly4"  
## [3] "BSgenome.Amellifera.UCSC.apiMel2"       
## [4] "BSgenome.Amellifera.UCSC.apiMel2.masked"
## [5] "BSgenome.Aofficinalis.NCBI.V1"          
## [6] "BSgenome.Athaliana.TAIR.04232008"

Unlike the other resources that we have discussed here, these packages are meant to contain sequence data for a specific genome build of an organism. You can load one of these packages in the usual way. And each of them normally has an alias for the primary object that is shorter than the full package name (as a convenience):

ls(2)

## [1] "create_env"   "populate_env"

Hsapiens

## Human genome:
## # organism: Homo sapiens (Human)
## # provider: UCSC
## # provider version: hg19
## # release date: Feb. 2009
## # release name: Genome Reference Consortium GRCh37
## # 93 sequences:
## #   chr1                  chr2                  chr3                 
## #   chr4                  chr5                  chr6                 
## #   chr7                  chr8                  chr9                 
## #   chr10                 chr11                 chr12                
## #   chr13                 chr14                 chr15                
## #   ...                   ...                   ...                  
## #   chrUn_gl000235        chrUn_gl000236        chrUn_gl000237       
## #   chrUn_gl000238        chrUn_gl000239        chrUn_gl000240       
## #   chrUn_gl000241        chrUn_gl000242        chrUn_gl000243       
## #   chrUn_gl000244        chrUn_gl000245        chrUn_gl000246       
## #   chrUn_gl000247        chrUn_gl000248        chrUn_gl000249       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[['
## # operator to access a given sequence)

The getSeq method is a useful way of extracting data from these packages. This method takes several arguments but the important ones are the 1st two. The 1st argument specifies the BSgenome object to use and the second argument (names) specifies what data you want back out. So for example, if you call it and give a character vector that names the seqnames for the object then you will get the sequences from those chromosomes as a DNAStringSet object.

seqNms <- seqnames(Hsapiens)
head(seqNms)

## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"

getSeq(Hsapiens, seqNms[1:2])

##   A DNAStringSet instance of length 2
##         width seq                                        names               
## [1] 249250621 NNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNN chr1
## [2] 243199373 NNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNN chr2

Whereas if you give the a GRanges object for the 2nd argument, you can instead get a DNAStringSet that corresponds to those ranges. This can be a powerful way to learn what sequence was present from a particular range. For example, here we can extract the range of a specific gene of interest like this.

txby <- transcriptsBy(txdb, by="gene")
geneOfInterest <- txby[["4488"]]
res <- getSeq(Hsapiens, geneOfInterest)
res

Additionally, the Biostrings[11] package has many useful functions for finding a pattern in a string set etc. You may not have noticed when it happened, but the Biostrings package was loaded when you loaded the BSgenome object, so these functions will already be available for you to explore.

8.1 BSgenome exercises

Exercise 10: Use what you have just learned to extract the sequence for the PTEN gene.

[ Back to top ]

9 biomaRt

Another great annotation resource is the biomaRt package[5,6,7]. The biomaRt package exposes a huge family of different online annotation resources called marts. Each mart is another of a set of online web resources that are following a convention that allows them to work with this package. So the first step in using biomaRt is always to load the package and then decide which “mart” you want to use. Once you have made your decision, you will then use the useMart() method to create a mart object in your R session. Here we are looking at the marts available and then choosing to use one of the most popular marts: the “ensembl”" mart.

head(listMarts())

##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 94
## 2   ENSEMBL_MART_MOUSE      Mouse strains 94
## 3     ENSEMBL_MART_SNP  Ensembl Variation 94
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 94

ensembl <- useMart("ensembl")
ensembl

## Object of class 'Mart':
##   Using the ENSEMBL_MART_ENSEMBL BioMart database
##   No dataset selected.

Each ‘mart’ can contain datasets for multiple different things. So the next step is that you need to decide on a dataset. Once you have chosen one, you will need to specify that dataset using the dataset argument when you call the useMart() constructor method. Here we will point to the dataset for humans.

head(listDatasets(ensembl))

##                      dataset                              description
## 1   acalliptera_gene_ensembl         Eastern happy genes (fAstCal1.2)
## 2 acarolinensis_gene_ensembl           Anole lizard genes (AnoCar2.0)
## 3  acitrinellus_gene_ensembl           Midas cichlid genes (Midas_v5)
## 4  amelanoleuca_gene_ensembl                    Panda genes (ailMel1)
## 5    amexicanus_gene_ensembl Cave fish genes (Astyanax_mexicanus-2.0)
## 6    anancymaae_gene_ensembl       Ma's night monkey genes (Anan_2.0)
##                  version
## 1             fAstCal1.2
## 2              AnoCar2.0
## 3               Midas_v5
## 4                ailMel1
## 5 Astyanax_mexicanus-2.0
## 6               Anan_2.0

ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
ensembl

## Object of class 'Mart':
##   Using the ENSEMBL_MART_ENSEMBL BioMart database
##   Using the hsapiens_gene_ensembl dataset

Next we need to think about attributes, values and filters. Lets start with attributes. You can get a listing of the different kinds of attributes from biomaRt buy using the listAttributes method:

head(listAttributes(ensembl))

##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

And you can see what the values for a particular attribute are by using the getBM method:

head(getBM(attributes="chromosome_name", mart=ensembl))

##   chromosome_name
## 1               1
## 2              10
## 3              11
## 4              12
## 5              13
## 6              14

Attributes are the things that you can have returned from biomaRt. They are analogous to what you get when you use the columns method with other objects.

In the biomaRt package, filters are things that can be used with values to restrict or choose what comes back. The ‘values’ here are treated as keys that you are passing in and which you would like to know more information about. In contrast, the filter represents the kind of key that you are searching for. So for example, you might choose a filter name of “chromosome_name” to go with specific value of “1”. Together these two argument values would request whatever attributes matched things on the 1st chromosome. And just as there here is an accessor for attributes, there is also an accessor for filters:

head(listFilters(ensembl))

##              name              description
## 1 chromosome_name Chromosome/scaffold name
## 2           start                    Start
## 3             end                      End
## 4      band_start               Band Start
## 5        band_end                 Band End
## 6    marker_start             Marker Start

So now you know about attributes, values and filters, you can call the getBM method to put it all together and request specific data from the mart. So for example, the following requests gene symbols and entrez gene IDs that are found on chromosome 1 of flies:

res <- getBM(attributes=c("hgnc_symbol", "entrezgene"),
                    filters = "chromosome_name",
                    values = "1", mart = ensembl)
head(res)

##   hgnc_symbol entrezgene
## 1     DDX11L1  100287102
## 2      WASH7P     653635
## 3   MIR6859-1  102466751
## 4 MIR1302-2HG         NA
## 5   MIR1302-2  100302278
## 6     FAM138A     645520

Of course you may have noticed that a lot of the arguments for getBM are very similar to what you do when you call shsapienselect. So if it’s your preference you can now also use the standard select methods with mart objects.

head(columns(ensembl))

## [1] "3_utr_end"   "3_utr_end"   "3_utr_start" "3_utr_start" "3utr"       
## [6] "5_utr_end"

9.1 biomaRt exercises

Exercise 11: Pull down GO terms for entrez gene id “1” from human by using the ensembl “hsapiens_gene_ensembl” dataset.

Exercise 12: Now compare the GO terms you just pulled down to the same GO terms from the org.Hs.eg.db package (which you can now retrieve using select()). What differences do you notice? Why do you suspect that is?

[ Back to top ]

10 Creating annotation objects

By now you are aware that Bioconductor has a lot of annotation resources. But it is still completely impossible to have every annotation resource pre-packaged for every conceivable use. Because of this, almost all annotation objects have special functions that can be called to create those objects (or the packages that load them) from generalized data resources or specific file types. Below is a table with a few of the more popular options.

If you want this	And you have this	Then you could call this to help
TxDb	tracks from UCSC	GenomicFeatures::makeTxDbPackageFromUCSC
TxDb	data from biomaRt	GenomicFeatures::makeTxDbPackageFromBiomaRt
TxDb	gff or gtf file	GenomicFeatures::makeTxDbFromGFF
OrgDb	custom data.frames	AnnotationForge::makeOrgPackage
OrgDb	valid Taxonomy ID	AnnotationForge::makeOrgPackageFromNCBI
ChipDb	org package & data.frame	AnnotationForge::makeChipPackage
BSgenome	fasta or twobit sequence files	BSgenome::forgeBSgenomeDataPkg

In most cases the output for resource creation functions will be an annotation package that you can install.

And there is unfortunately not enough space to demonstrate how to call each of these functions here. But to do so is actually pretty straightforward and most such functions will be well documented with their associated manual pages and vignettes[3,4,10,12]. As usual, you can see the help page for any function right inside of R.

help("makeTxDbPackageFromUCSC")

If you plan to make use of these kinds of functions then you should expect to consult the associated documentation first. These kinds of functions tend to have a lot of arguments and most of them also require that their input data meet some fairly specific criteria. Finally, you should know that even after you have succeeded at creating an annotation package, you will also have to make use of the install.packages() function (with the repos argument=NULL) to install whatever package source directory has just been created.

11 Important considerations

The bioconductor project represents a very large and active codebase from an active and engaged community. Because of this, you should expect that the software described in this walkthrough will change over time and often in dramatic ways. As an example, the getSeq function that is described in this chapter is expected to a big overhaul in the coming months. When this happens the older function will be deprecated for a full release cycle (6 months) and then labeled as defunct for another release cycle before it is removed. This cycle is in place so that active users can be warned about what is happening and where they should look for the appropriate replacement functionality. But obviously, this system cannot warn end users if they have not been vigilant about updating their software to the latest version. So please take the time to always update your software to the latest version.

To stay abreast of new developments users are encouraged to explore the bioconductor website which contains many current walkthroughs and vignettes. Also visit the support site where you can ask questions and engage in discussions.

12 sessionInfo()

Package versions used in this tutorial:

sessionInfo()

## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2                           
##  [2] annotation_1.4.0                         
##  [3] TxDb.Athaliana.BioMart.plantsmart22_3.0.1
##  [4] biomaRt_2.38.0                           
##  [5] BSgenome.Hsapiens.UCSC.hg19_1.4.0        
##  [6] BSgenome_1.50.0                          
##  [7] rtracklayer_1.42.0                       
##  [8] Homo.sapiens_1.3.1                       
##  [9] GO.db_3.7.0                              
## [10] OrganismDbi_1.24.0                       
## [11] org.Mm.eg.db_3.7.0                       
## [12] org.Hs.eg.db_3.7.0                       
## [13] TxDb.Mmusculus.UCSC.mm10.ensGene_3.4.0   
## [14] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0  
## [15] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  
## [16] GenomicFeatures_1.34.0                   
## [17] AnnotationDbi_1.44.0                     
## [18] Organism.dplyr_1.10.0                    
## [19] AnnotationFilter_1.6.0                   
## [20] dplyr_0.7.7                              
## [21] AnnotationHub_2.14.0                     
## [22] VariantAnnotation_1.28.0                 
## [23] Rsamtools_1.34.0                         
## [24] Biostrings_2.50.0                        
## [25] XVector_0.22.0                           
## [26] SummarizedExperiment_1.12.0              
## [27] DelayedArray_0.8.0                       
## [28] BiocParallel_1.16.0                      
## [29] matrixStats_0.54.0                       
## [30] Biobase_2.42.0                           
## [31] GenomicRanges_1.34.0                     
## [32] GenomeInfoDb_1.18.0                      
## [33] IRanges_2.16.0                           
## [34] S4Vectors_0.20.0                         
## [35] BiocGenerics_0.28.0                      
## [36] BiocStyle_2.10.0                         
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] progress_1.2.0                httr_1.3.1                   
##  [5] rprojroot_1.3-2               tools_3.5.1                  
##  [7] backports_1.1.2               utf8_1.1.4                   
##  [9] R6_2.3.0                      DBI_1.0.0                    
## [11] lazyeval_0.2.1                tidyselect_0.2.5             
## [13] prettyunits_1.0.2             bit_1.1-14                   
## [15] curl_3.2                      compiler_3.5.1               
## [17] cli_1.0.1                     graph_1.60.0                 
## [19] bookdown_0.7                  RBGL_1.58.0                  
## [21] rappdirs_0.3.1                stringr_1.3.1                
## [23] digest_0.6.18                 rmarkdown_1.10               
## [25] pkgconfig_2.0.2               htmltools_0.3.6              
## [27] dbplyr_1.2.2                  rlang_0.3.0.1                
## [29] RSQLite_2.1.1                 shiny_1.1.0                  
## [31] bindr_0.1.1                   RCurl_1.95-4.11              
## [33] magrittr_1.5                  GenomeInfoDbData_1.2.0       
## [35] Matrix_1.2-14                 fansi_0.4.0                  
## [37] Rcpp_0.12.19                  stringi_1.2.4                
## [39] yaml_2.2.0                    zlibbioc_1.28.0              
## [41] BiocFileCache_1.6.0           grid_3.5.1                   
## [43] blob_1.1.1                    promises_1.0.1               
## [45] crayon_1.3.4                  lattice_0.20-35              
## [47] hms_0.4.2                     knitr_1.20                   
## [49] pillar_1.3.0                  XML_3.98-1.16                
## [51] glue_1.3.0                    evaluate_0.12                
## [53] BiocManager_1.30.3            httpuv_1.4.5                 
## [55] purrr_0.2.5                   assertthat_0.2.0             
## [57] xfun_0.4                      mime_0.6                     
## [59] xtable_1.8-3                  later_0.7.5                  
## [61] tibble_1.4.2                  GenomicAlignments_1.18.0     
## [63] memoise_1.1.0                 interactiveDisplayBase_1.20.0

13 Acknowledgments

Research reported in this chapter was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number U41HG004059 and by the National Cancer Institute of the National Institutes of Health under Award Number U24CA180996. We also want to thank the numerous institutions who produced and maintained the data that is used for generating and updating the annotation resources described here.

14 References

Wolfgang Huber, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, Sean Davis, Laurent Gatto, Thomas Girke, Raphael Gottardo, Florian Hahne, Kasper D Hansen, Rafael A Irizarry, Michael Lawrence, Michael I Love, James MacDonald, Valerie Obenchain, Andrzej K Oleś, Hervé Pagès, Alejandro Reyes, Paul Shannon, Gordon K Smyth, Dan Tenenbaum, Levi Waldron & Martin Morgan (2015) Orchestrating high-throughput genomic analysis with Bioconductor Nature Methods 12:115-121
Pages H, Carlson M, Falcon S and Li N. AnnotationDbi: Annotation Database Interface. R package version 1.30.0.
M. Carlson, H. Pages, P. Aboyoun, S. Falcon, M. Morgan, D. Sarkar, M. Lawrence GenomicFeatures: Tools for making and manipulating transcript centric annotations version 1.19.38.
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, Morgan M and Carey V (2013). Software for Computing and Annotating Genomic Ranges. PLoS Computational Biology, 9. http://dx.doi.org/10.1371/journal.pcbi.1003118, http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118
Steffen Durinck, Wolfgang Huber biomaRt: Interface to BioMart databases (e.g. Ensembl, COSMIC ,Wormbase and Gramene) version 2.23.5.
Durinck S, Spellman P, Birney E and Huber W (2009). Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols, 4, pp. 1184-1191.
Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A and Huber W (2005). BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics, 21, pp. 3439-3440.
Morgan M, Carlson M, Tenenbaum D and Arora S. AnnotationHub: Client to access AnnotationHub resources. R package version 2.0.1.
Carlson M, Pages H, Morgan M and Obenchain V. OrganismDbi: Software to enable the smooth interfacing of different database packages. R package version 1.10.0.
Pages H. BSgenome: Infrastructure for Biostrings-based genome data packages. R package version 1.36.0.
Pages H, Aboyoun P, Gentleman R and DebRoy S. Biostrings: String objects representing biological sequences, and matching algorithms. R package version 2.36.0.
Carlson M, and Pages H. AnnotationForge: Code for Building Annotation Database Packages. R package version 1.10.0.

15 Answers for exercises

15.1 Exercise 1:

The 1st thing you need to do is look for thing from UCSC

ahs <- query(ah, "UCSC")

Then you can look for Genome values that match ‘hg19’ and a species that matches ‘Homo sapiens’.

ahs <- subset(ahs, ahs$genome=='hg19')
length(ahs)

## [1] 5901

ahs <- subset(ahs, ahs$species=='Homo sapiens')
length(ahs)

## [1] 5901

You might notice that the last two filtering steps are redundant (IOW doing the 1st of them is the same as doing both of them.) If this were not the case, we might suspect that there was a problem with the metadata.

15.2 Exercise 2:

This pulls down the oreganno annotations. Which are described on the UCSC site thusly: “This track displays literature-curated regulatory regions, transcription factor binding sites, and regulatory polymorphisms from ORegAnno (Open Regulatory Annotation). For more detailed information on a particular regulatory element, follow the link to ORegAnno from the details page.”

ahs <- query(ah, 'oreganno')
ahs

## AnnotationHub with 9 records
## # snapshotDate(): 2018-10-30 
## # $dataprovider: Pazar, UCSC
## # $species: Homo sapiens, Saccharomyces cerevisiae, NA
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## #   tags, rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH5087"]]' 
## 
##             title                                 
##   AH5087  | ORegAnno                              
##   AH5213  | ORegAnno                              
##   AH7053  | ORegAnno                              
##   AH7061  | ORegAnno                              
##   AH22286 | pazar_ORegAnno_20120522.csv           
##   AH22287 | pazar_ORegAnno_ENCODEprom_20120522.csv
##   AH22288 | pazar_ORegAnno_Erythroid_20120522.csv 
##   AH22289 | pazar_ORegAnno_STAT1_ChIP_20120522.csv
##   AH22290 | pazar_ORegAnno_STAT1_lit_20120522.csv

ahs[1]

## AnnotationHub with 1 record
## # snapshotDate(): 2018-10-30 
## # names(): AH5087
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # $rdatadateadded: 2013-03-26
## # $title: ORegAnno
## # $description: GRanges object from UCSC track 'ORegAnno'
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: UCSC track
## # $sourceurl: rtracklayer://hgdownload.cse.ucsc.edu/goldenpath/hg19/datab...
## # $sourcesize: NA
## # $tags: c("oreganno", "UCSC", "track", "Gene", "Transcript",
## #   "Annotation") 
## # retrieve record with 'object[["AH5087"]]'

oreg <- ahs[['AH5087']]

## downloading 0 resources

## loading from cache 
##     '/home/biocbuild//.AnnotationHub/5087'

oreg

## GRanges object with 23118 ranges and 2 metadata columns:
##                        seqnames        ranges strand |        name     score
##                           <Rle>     <IRanges>  <Rle> | <character> <numeric>
##       [1]                  chr1 873499-873849      + | OREG0012989         0
##       [2]                  chr1 886764-887214      + | OREG0012990         0
##       [3]                  chr1 886938-886958      + | OREG0007909         0
##       [4]                  chr1 919400-919950      + | OREG0012991         0
##       [5]                  chr1 919695-919715      + | OREG0007910         0
##       ...                   ...           ...    ... .         ...       ...
##   [23114]  chr7_gl000195_random         1-851      + | OREG0026736         0
##   [23115]  chr7_gl000195_random 103427-103447      + | OREG0012963         0
##   [23116]  chr7_gl000195_random 121139-121159      + | OREG0012964         0
##   [23117] chr17_gl000204_random   58370-58955      + | OREG0026769         0
##   [23118] chr17_gl000205_random 117492-118442      + | OREG0026772         0
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

15.3 Exercise 3:

keys <- "MSX2"
columns <- c("ENTREZID", "CHR")
select(org.Hs.eg.db, keys, columns, keytype="SYMBOL")

## Warning in .deprecatedColsMessage(): Accessing gene location information via 'CHR','CHRLOC','CHRLOCEND'
##   is deprecated. Please use a range based accessor like genes(), or
##   select() with columns values like TXCHROM and TXSTART on a TxDb or
##   OrganismDb object instead.

## 'select()' returned 1:1 mapping between keys and columns

##   SYMBOL ENTREZID CHR
## 1   MSX2     4488   5

15.4 Exercise 4:

## 1st get all the gene symbols
orgSymbols <- keys(org.Hs.eg.db, keytype="SYMBOL")
## and then use that to get all gene symbols matched to all entrez gene IDs
egr <- select(org.Hs.eg.db, keys=orgSymbols, "ENTREZID", "SYMBOL")

## 'select()' returned 1:many mapping between keys and columns

length(egr$ENTREZID)

## [1] 61119

length(unique(egr$ENTREZID))

## [1] 61119

## VS:
length(egr$SYMBOL)

## [1] 61119

length(unique(egr$SYMBOL))

## [1] 61050

## So lets trap these symbols that are redundant and look more closely...
redund <- egr$SYMBOL
badSymbols <- redund[duplicated(redund)]
select(org.Hs.eg.db, badSymbols, "ENTREZID", "SYMBOL")

## 'select()' returned many:many mapping between keys and columns

##        SYMBOL  ENTREZID
## 1         HBD      3045
## 2         HBD 100187828
## 3        RNR1      4549
## 4        RNR1      6052
## 5        RNR2      4550
## 6        RNR2      6053
## 7         TEC      7006
## 8         TEC 100124696
## 9       MEMO1      7795
## 10      MEMO1     51072
## 11       MMD2    221938
## 12       MMD2 100505381
## 13   DEL11P13 100528024
## 14   DEL11P13 107648861
## 15  TRNAK-CUU 107984908
## 16  TRNAK-CUU 107985605
## 17  TRNAK-CUU 107985760
## 18  TRNAK-CUU 107984908
## 19  TRNAK-CUU 107985605
## 20  TRNAK-CUU 107985760
## 21  TRNAG-CCC 107985600
## 22  TRNAG-CCC 107985603
## 23  TRNAG-CCC 107985604
## 24  TRNAG-CCC 107985607
## 25  TRNAG-CCC 107985748
## 26  TRNAG-CCC 107985759
## 27  TRNAG-CCC 107985600
## 28  TRNAG-CCC 107985603
## 29  TRNAG-CCC 107985604
## 30  TRNAG-CCC 107985607
## 31  TRNAG-CCC 107985748
## 32  TRNAG-CCC 107985759
## 33  TRNAG-CCC 107985600
## 34  TRNAG-CCC 107985603
## 35  TRNAG-CCC 107985604
## 36  TRNAG-CCC 107985607
## 37  TRNAG-CCC 107985748
## 38  TRNAG-CCC 107985759
## 39  TRNAG-CCC 107985600
## 40  TRNAG-CCC 107985603
## 41  TRNAG-CCC 107985604
## 42  TRNAG-CCC 107985607
## 43  TRNAG-CCC 107985748
## 44  TRNAG-CCC 107985759
## 45  TRNAG-CCC 107985600
## 46  TRNAG-CCC 107985603
## 47  TRNAG-CCC 107985604
## 48  TRNAG-CCC 107985607
## 49  TRNAG-CCC 107985748
## 50  TRNAG-CCC 107985759
## 51  TRNAN-GUU 107985601
## 52  TRNAN-GUU 107985609
## 53  TRNAN-GUU 107985610
## 54  TRNAN-GUU 107985611
## 55  TRNAN-GUU 107985613
## 56  TRNAN-GUU 107985617
## 57  TRNAN-GUU 107985621
## 58  TRNAN-GUU 107985622
## 59  TRNAN-GUU 107985624
## 60  TRNAN-GUU 107985625
## 61  TRNAN-GUU 107985750
## 62  TRNAN-GUU 107985751
## 63  TRNAN-GUU 107985752
## 64  TRNAN-GUU 107985754
## 65  TRNAN-GUU 107985756
## 66  TRNAN-GUU 107985757
## 67  TRNAN-GUU 107985762
## 68  TRNAN-GUU 107985763
## 69  TRNAN-GUU 107987367
## 70  TRNAN-GUU 107985601
## 71  TRNAN-GUU 107985609
## 72  TRNAN-GUU 107985610
## 73  TRNAN-GUU 107985611
## 74  TRNAN-GUU 107985613
## 75  TRNAN-GUU 107985617
## 76  TRNAN-GUU 107985621
## 77  TRNAN-GUU 107985622
## 78  TRNAN-GUU 107985624
## 79  TRNAN-GUU 107985625
## 80  TRNAN-GUU 107985750
## 81  TRNAN-GUU 107985751
## 82  TRNAN-GUU 107985752
## 83  TRNAN-GUU 107985754
## 84  TRNAN-GUU 107985756
## 85  TRNAN-GUU 107985757
## 86  TRNAN-GUU 107985762
## 87  TRNAN-GUU 107985763
## 88  TRNAN-GUU 107987367
## 89  TRNAN-GUU 107985601
## 90  TRNAN-GUU 107985609
## 91  TRNAN-GUU 107985610
## 92  TRNAN-GUU 107985611
## 93  TRNAN-GUU 107985613
## 94  TRNAN-GUU 107985617
## 95  TRNAN-GUU 107985621
## 96  TRNAN-GUU 107985622
## 97  TRNAN-GUU 107985624
## 98  TRNAN-GUU 107985625
## 99  TRNAN-GUU 107985750
## 100 TRNAN-GUU 107985751
## 101 TRNAN-GUU 107985752
## 102 TRNAN-GUU 107985754
## 103 TRNAN-GUU 107985756
## 104 TRNAN-GUU 107985757
## 105 TRNAN-GUU 107985762
## 106 TRNAN-GUU 107985763
## 107 TRNAN-GUU 107987367
## 108 TRNAN-GUU 107985601
## 109 TRNAN-GUU 107985609
## 110 TRNAN-GUU 107985610
## 111 TRNAN-GUU 107985611
## 112 TRNAN-GUU 107985613
## 113 TRNAN-GUU 107985617
## 114 TRNAN-GUU 107985621
## 115 TRNAN-GUU 107985622
## 116 TRNAN-GUU 107985624
## 117 TRNAN-GUU 107985625
## 118 TRNAN-GUU 107985750
## 119 TRNAN-GUU 107985751
## 120 TRNAN-GUU 107985752
## 121 TRNAN-GUU 107985754
## 122 TRNAN-GUU 107985756
## 123 TRNAN-GUU 107985757
## 124 TRNAN-GUU 107985762
## 125 TRNAN-GUU 107985763
## 126 TRNAN-GUU 107987367
## 127 TRNAN-GUU 107985601
## 128 TRNAN-GUU 107985609
## 129 TRNAN-GUU 107985610
## 130 TRNAN-GUU 107985611
## 131 TRNAN-GUU 107985613
## 132 TRNAN-GUU 107985617
## 133 TRNAN-GUU 107985621
## 134 TRNAN-GUU 107985622
## 135 TRNAN-GUU 107985624
## 136 TRNAN-GUU 107985625
## 137 TRNAN-GUU 107985750
## 138 TRNAN-GUU 107985751
## 139 TRNAN-GUU 107985752
## 140 TRNAN-GUU 107985754
## 141 TRNAN-GUU 107985756
## 142 TRNAN-GUU 107985757
## 143 TRNAN-GUU 107985762
## 144 TRNAN-GUU 107985763
## 145 TRNAN-GUU 107987367
## 146 TRNAN-GUU 107985601
## 147 TRNAN-GUU 107985609
## 148 TRNAN-GUU 107985610
## 149 TRNAN-GUU 107985611
## 150 TRNAN-GUU 107985613
## 151 TRNAN-GUU 107985617
## 152 TRNAN-GUU 107985621
## 153 TRNAN-GUU 107985622
## 154 TRNAN-GUU 107985624
## 155 TRNAN-GUU 107985625
## 156 TRNAN-GUU 107985750
## 157 TRNAN-GUU 107985751
## 158 TRNAN-GUU 107985752
## 159 TRNAN-GUU 107985754
## 160 TRNAN-GUU 107985756
## 161 TRNAN-GUU 107985757
## 162 TRNAN-GUU 107985762
## 163 TRNAN-GUU 107985763
## 164 TRNAN-GUU 107987367
## 165 TRNAN-GUU 107985601
## 166 TRNAN-GUU 107985609
## 167 TRNAN-GUU 107985610
## 168 TRNAN-GUU 107985611
## 169 TRNAN-GUU 107985613
## 170 TRNAN-GUU 107985617
## 171 TRNAN-GUU 107985621
## 172 TRNAN-GUU 107985622
## 173 TRNAN-GUU 107985624
## 174 TRNAN-GUU 107985625
## 175 TRNAN-GUU 107985750
## 176 TRNAN-GUU 107985751
## 177 TRNAN-GUU 107985752
## 178 TRNAN-GUU 107985754
## 179 TRNAN-GUU 107985756
## 180 TRNAN-GUU 107985757
## 181 TRNAN-GUU 107985762
## 182 TRNAN-GUU 107985763
## 183 TRNAN-GUU 107987367
## 184 TRNAN-GUU 107985601
## 185 TRNAN-GUU 107985609
## 186 TRNAN-GUU 107985610
## 187 TRNAN-GUU 107985611
## 188 TRNAN-GUU 107985613
## 189 TRNAN-GUU 107985617
## 190 TRNAN-GUU 107985621
## 191 TRNAN-GUU 107985622
## 192 TRNAN-GUU 107985624
## 193 TRNAN-GUU 107985625
## 194 TRNAN-GUU 107985750
## 195 TRNAN-GUU 107985751
## 196 TRNAN-GUU 107985752
## 197 TRNAN-GUU 107985754
## 198 TRNAN-GUU 107985756
## 199 TRNAN-GUU 107985757
## 200 TRNAN-GUU 107985762
## 201 TRNAN-GUU 107985763
## 202 TRNAN-GUU 107987367
## 203 TRNAN-GUU 107985601
## 204 TRNAN-GUU 107985609
## 205 TRNAN-GUU 107985610
## 206 TRNAN-GUU 107985611
## 207 TRNAN-GUU 107985613
## 208 TRNAN-GUU 107985617
## 209 TRNAN-GUU 107985621
## 210 TRNAN-GUU 107985622
## 211 TRNAN-GUU 107985624
## 212 TRNAN-GUU 107985625
## 213 TRNAN-GUU 107985750
## 214 TRNAN-GUU 107985751
## 215 TRNAN-GUU 107985752
## 216 TRNAN-GUU 107985754
## 217 TRNAN-GUU 107985756
## 218 TRNAN-GUU 107985757
## 219 TRNAN-GUU 107985762
## 220 TRNAN-GUU 107985763
## 221 TRNAN-GUU 107987367
## 222 TRNAN-GUU 107985601
## 223 TRNAN-GUU 107985609
## 224 TRNAN-GUU 107985610
## 225 TRNAN-GUU 107985611
## 226 TRNAN-GUU 107985613
## 227 TRNAN-GUU 107985617
## 228 TRNAN-GUU 107985621
## 229 TRNAN-GUU 107985622
## 230 TRNAN-GUU 107985624
## 231 TRNAN-GUU 107985625
## 232 TRNAN-GUU 107985750
## 233 TRNAN-GUU 107985751
## 234 TRNAN-GUU 107985752
## 235 TRNAN-GUU 107985754
## 236 TRNAN-GUU 107985756
## 237 TRNAN-GUU 107985757
## 238 TRNAN-GUU 107985762
## 239 TRNAN-GUU 107985763
## 240 TRNAN-GUU 107987367
## 241 TRNAN-GUU 107985601
## 242 TRNAN-GUU 107985609
## 243 TRNAN-GUU 107985610
## 244 TRNAN-GUU 107985611
## 245 TRNAN-GUU 107985613
## 246 TRNAN-GUU 107985617
## 247 TRNAN-GUU 107985621
## 248 TRNAN-GUU 107985622
## 249 TRNAN-GUU 107985624
## 250 TRNAN-GUU 107985625
## 251 TRNAN-GUU 107985750
## 252 TRNAN-GUU 107985751
## 253 TRNAN-GUU 107985752
## 254 TRNAN-GUU 107985754
## 255 TRNAN-GUU 107985756
## 256 TRNAN-GUU 107985757
## 257 TRNAN-GUU 107985762
## 258 TRNAN-GUU 107985763
## 259 TRNAN-GUU 107987367
## 260 TRNAN-GUU 107985601
## 261 TRNAN-GUU 107985609
## 262 TRNAN-GUU 107985610
## 263 TRNAN-GUU 107985611
## 264 TRNAN-GUU 107985613
## 265 TRNAN-GUU 107985617
## 266 TRNAN-GUU 107985621
## 267 TRNAN-GUU 107985622
## 268 TRNAN-GUU 107985624
## 269 TRNAN-GUU 107985625
## 270 TRNAN-GUU 107985750
## 271 TRNAN-GUU 107985751
## 272 TRNAN-GUU 107985752
## 273 TRNAN-GUU 107985754
## 274 TRNAN-GUU 107985756
## 275 TRNAN-GUU 107985757
## 276 TRNAN-GUU 107985762
## 277 TRNAN-GUU 107985763
## 278 TRNAN-GUU 107987367
## 279 TRNAN-GUU 107985601
## 280 TRNAN-GUU 107985609
## 281 TRNAN-GUU 107985610
## 282 TRNAN-GUU 107985611
## 283 TRNAN-GUU 107985613
## 284 TRNAN-GUU 107985617
## 285 TRNAN-GUU 107985621
## 286 TRNAN-GUU 107985622
## 287 TRNAN-GUU 107985624
## 288 TRNAN-GUU 107985625
## 289 TRNAN-GUU 107985750
## 290 TRNAN-GUU 107985751
## 291 TRNAN-GUU 107985752
## 292 TRNAN-GUU 107985754
## 293 TRNAN-GUU 107985756
## 294 TRNAN-GUU 107985757
## 295 TRNAN-GUU 107985762
## 296 TRNAN-GUU 107985763
## 297 TRNAN-GUU 107987367
## 298 TRNAN-GUU 107985601
## 299 TRNAN-GUU 107985609
## 300 TRNAN-GUU 107985610
## 301 TRNAN-GUU 107985611
## 302 TRNAN-GUU 107985613
## 303 TRNAN-GUU 107985617
## 304 TRNAN-GUU 107985621
## 305 TRNAN-GUU 107985622
## 306 TRNAN-GUU 107985624
## 307 TRNAN-GUU 107985625
## 308 TRNAN-GUU 107985750
## 309 TRNAN-GUU 107985751
## 310 TRNAN-GUU 107985752
## 311 TRNAN-GUU 107985754
## 312 TRNAN-GUU 107985756
## 313 TRNAN-GUU 107985757
## 314 TRNAN-GUU 107985762
## 315 TRNAN-GUU 107985763
## 316 TRNAN-GUU 107987367
## 317 TRNAN-GUU 107985601
## 318 TRNAN-GUU 107985609
## 319 TRNAN-GUU 107985610
## 320 TRNAN-GUU 107985611
## 321 TRNAN-GUU 107985613
## 322 TRNAN-GUU 107985617
## 323 TRNAN-GUU 107985621
## 324 TRNAN-GUU 107985622
## 325 TRNAN-GUU 107985624
## 326 TRNAN-GUU 107985625
## 327 TRNAN-GUU 107985750
## 328 TRNAN-GUU 107985751
## 329 TRNAN-GUU 107985752
## 330 TRNAN-GUU 107985754
## 331 TRNAN-GUU 107985756
## 332 TRNAN-GUU 107985757
## 333 TRNAN-GUU 107985762
## 334 TRNAN-GUU 107985763
## 335 TRNAN-GUU 107987367
## 336 TRNAN-GUU 107985601
## 337 TRNAN-GUU 107985609
## 338 TRNAN-GUU 107985610
## 339 TRNAN-GUU 107985611
## 340 TRNAN-GUU 107985613
## 341 TRNAN-GUU 107985617
## 342 TRNAN-GUU 107985621
## 343 TRNAN-GUU 107985622
## 344 TRNAN-GUU 107985624
## 345 TRNAN-GUU 107985625
## 346 TRNAN-GUU 107985750
## 347 TRNAN-GUU 107985751
## 348 TRNAN-GUU 107985752
## 349 TRNAN-GUU 107985754
## 350 TRNAN-GUU 107985756
## 351 TRNAN-GUU 107985757
## 352 TRNAN-GUU 107985762
## 353 TRNAN-GUU 107985763
## 354 TRNAN-GUU 107987367
## 355 TRNAN-GUU 107985601
## 356 TRNAN-GUU 107985609
## 357 TRNAN-GUU 107985610
## 358 TRNAN-GUU 107985611
## 359 TRNAN-GUU 107985613
## 360 TRNAN-GUU 107985617
## 361 TRNAN-GUU 107985621
## 362 TRNAN-GUU 107985622
## 363 TRNAN-GUU 107985624
## 364 TRNAN-GUU 107985625
## 365 TRNAN-GUU 107985750
## 366 TRNAN-GUU 107985751
## 367 TRNAN-GUU 107985752
## 368 TRNAN-GUU 107985754
## 369 TRNAN-GUU 107985756
## 370 TRNAN-GUU 107985757
## 371 TRNAN-GUU 107985762
## 372 TRNAN-GUU 107985763
## 373 TRNAN-GUU 107987367
## 374 TRNAN-GUU 107985601
## 375 TRNAN-GUU 107985609
## 376 TRNAN-GUU 107985610
## 377 TRNAN-GUU 107985611
## 378 TRNAN-GUU 107985613
## 379 TRNAN-GUU 107985617
## 380 TRNAN-GUU 107985621
## 381 TRNAN-GUU 107985622
## 382 TRNAN-GUU 107985624
## 383 TRNAN-GUU 107985625
## 384 TRNAN-GUU 107985750
## 385 TRNAN-GUU 107985751
## 386 TRNAN-GUU 107985752
## 387 TRNAN-GUU 107985754
## 388 TRNAN-GUU 107985756
## 389 TRNAN-GUU 107985757
## 390 TRNAN-GUU 107985762
## 391 TRNAN-GUU 107985763
## 392 TRNAN-GUU 107987367
## 393 TRNAE-UUC 107985602
## 394 TRNAE-UUC 107985623
## 395 TRNAE-UUC 107985626
## 396 TRNAE-UUC 107985758
## 397 TRNAE-UUC 107987368
## 398 TRNAE-UUC 107985602
## 399 TRNAE-UUC 107985623
## 400 TRNAE-UUC 107985626
## 401 TRNAE-UUC 107985758
## 402 TRNAE-UUC 107987368
## 403 TRNAE-UUC 107985602
## 404 TRNAE-UUC 107985623
## 405 TRNAE-UUC 107985626
## 406 TRNAE-UUC 107985758
## 407 TRNAE-UUC 107987368
## 408 TRNAE-UUC 107985602
## 409 TRNAE-UUC 107985623
## 410 TRNAE-UUC 107985626
## 411 TRNAE-UUC 107985758
## 412 TRNAE-UUC 107987368
## 413 TRNAC-GCA 107985606
## 414 TRNAC-GCA 107985761
## 415 TRNAV-CAC 107985608
## 416 TRNAV-CAC 107985612
## 417 TRNAV-CAC 107985614
## 418 TRNAV-CAC 107985615
## 419 TRNAV-CAC 107985749
## 420 TRNAV-CAC 107985753
## 421 TRNAV-CAC 107985608
## 422 TRNAV-CAC 107985612
## 423 TRNAV-CAC 107985614
## 424 TRNAV-CAC 107985615
## 425 TRNAV-CAC 107985749
## 426 TRNAV-CAC 107985753
## 427 TRNAV-CAC 107985608
## 428 TRNAV-CAC 107985612
## 429 TRNAV-CAC 107985614
## 430 TRNAV-CAC 107985615
## 431 TRNAV-CAC 107985749
## 432 TRNAV-CAC 107985753
## 433 TRNAV-CAC 107985608
## 434 TRNAV-CAC 107985612
## 435 TRNAV-CAC 107985614
## 436 TRNAV-CAC 107985615
## 437 TRNAV-CAC 107985749
## 438 TRNAV-CAC 107985753
## 439 TRNAV-CAC 107985608
## 440 TRNAV-CAC 107985612
## 441 TRNAV-CAC 107985614
## 442 TRNAV-CAC 107985615
## 443 TRNAV-CAC 107985749
## 444 TRNAV-CAC 107985753
## 445 TRNAQ-CUG 107985616
## 446 TRNAQ-CUG 107985755
## 447 TRNAG-GCC 107985618
## 448 TRNAG-GCC 107985619
## 449 TRNAG-GCC 107985620
## 450 TRNAG-GCC 107985618
## 451 TRNAG-GCC 107985619
## 452 TRNAG-GCC 107985620
## 453 TRNAI-AAU 107986679
## 454 TRNAI-AAU 107986683
## 455 TRNAM-CAU 107986680
## 456 TRNAM-CAU 107986682
## 457 TRNAL-CAA 107986681
## 458 TRNAL-CAA 107987402
## 459 TRNAL-CAA 107987430
## 460 TRNAL-CAA 107987450
## 461 TRNAL-CAA 107987455
## 462 TRNAL-CAA 107987460
## 463 TRNAL-CAA 107987523
## 464 TRNAL-CAA 107986681
## 465 TRNAL-CAA 107987402
## 466 TRNAL-CAA 107987430
## 467 TRNAL-CAA 107987450
## 468 TRNAL-CAA 107987455
## 469 TRNAL-CAA 107987460
## 470 TRNAL-CAA 107987523
## 471 TRNAL-CAA 107986681
## 472 TRNAL-CAA 107987402
## 473 TRNAL-CAA 107987430
## 474 TRNAL-CAA 107987450
## 475 TRNAL-CAA 107987455
## 476 TRNAL-CAA 107987460
## 477 TRNAL-CAA 107987523
## 478 TRNAL-CAA 107986681
## 479 TRNAL-CAA 107987402
## 480 TRNAL-CAA 107987430
## 481 TRNAL-CAA 107987450
## 482 TRNAL-CAA 107987455
## 483 TRNAL-CAA 107987460
## 484 TRNAL-CAA 107987523
## 485 TRNAL-CAA 107986681
## 486 TRNAL-CAA 107987402
## 487 TRNAL-CAA 107987430
## 488 TRNAL-CAA 107987450
## 489 TRNAL-CAA 107987455
## 490 TRNAL-CAA 107987460
## 491 TRNAL-CAA 107987523
## 492 TRNAL-CAA 107986681
## 493 TRNAL-CAA 107987402
## 494 TRNAL-CAA 107987430
## 495 TRNAL-CAA 107987450
## 496 TRNAL-CAA 107987455
## 497 TRNAL-CAA 107987460
## 498 TRNAL-CAA 107987523
## 499  RNA5-8SP 110467518
## 500  RNA5-8SP 110467520
## 501  RNA5-8SP 110467521
## 502  RNA5-8SP 110467522
## 503  RNA5-8SP 110467523
## 504  RNA5-8SP 110467524
## 505  RNA5-8SP 110467518
## 506  RNA5-8SP 110467520
## 507  RNA5-8SP 110467521
## 508  RNA5-8SP 110467522
## 509  RNA5-8SP 110467523
## 510  RNA5-8SP 110467524
## 511  RNA5-8SP 110467518
## 512  RNA5-8SP 110467520
## 513  RNA5-8SP 110467521
## 514  RNA5-8SP 110467522
## 515  RNA5-8SP 110467523
## 516  RNA5-8SP 110467524
## 517  RNA5-8SP 110467518
## 518  RNA5-8SP 110467520
## 519  RNA5-8SP 110467521
## 520  RNA5-8SP 110467522
## 521  RNA5-8SP 110467523
## 522  RNA5-8SP 110467524
## 523  RNA5-8SP 110467518
## 524  RNA5-8SP 110467520
## 525  RNA5-8SP 110467521
## 526  RNA5-8SP 110467522
## 527  RNA5-8SP 110467523
## 528  RNA5-8SP 110467524
## 529   RNA28SP 110467525
## 530   RNA28SP 110467526
## 531   RNA28SP 110467527
## 532   RNA28SP 110467528
## 533   RNA28SP 110467529
## 534   RNA28SP 110467531
## 535   RNA28SP 110467525
## 536   RNA28SP 110467526
## 537   RNA28SP 110467527
## 538   RNA28SP 110467528
## 539   RNA28SP 110467529
## 540   RNA28SP 110467531
## 541   RNA28SP 110467525
## 542   RNA28SP 110467526
## 543   RNA28SP 110467527
## 544   RNA28SP 110467528
## 545   RNA28SP 110467529
## 546   RNA28SP 110467531
## 547   RNA28SP 110467525
## 548   RNA28SP 110467526
## 549   RNA28SP 110467527
## 550   RNA28SP 110467528
## 551   RNA28SP 110467529
## 552   RNA28SP 110467531
## 553   RNA28SP 110467525
## 554   RNA28SP 110467526
## 555   RNA28SP 110467527
## 556   RNA28SP 110467528
## 557   RNA28SP 110467529
## 558   RNA28SP 110467531
## 559   RNA18SP 110467532
## 560   RNA18SP 110467533
## 561   RNA18SP 110467534
## 562   RNA18SP 110467535
## 563   RNA18SP 110467536
## 564   RNA18SP 110467537
## 565   RNA18SP 110467538
## 566   RNA18SP 110467532
## 567   RNA18SP 110467533
## 568   RNA18SP 110467534
## 569   RNA18SP 110467535
## 570   RNA18SP 110467536
## 571   RNA18SP 110467537
## 572   RNA18SP 110467538
## 573   RNA18SP 110467532
## 574   RNA18SP 110467533
## 575   RNA18SP 110467534
## 576   RNA18SP 110467535
## 577   RNA18SP 110467536
## 578   RNA18SP 110467537
## 579   RNA18SP 110467538
## 580   RNA18SP 110467532
## 581   RNA18SP 110467533
## 582   RNA18SP 110467534
## 583   RNA18SP 110467535
## 584   RNA18SP 110467536
## 585   RNA18SP 110467537
## 586   RNA18SP 110467538
## 587   RNA18SP 110467532
## 588   RNA18SP 110467533
## 589   RNA18SP 110467534
## 590   RNA18SP 110467535
## 591   RNA18SP 110467536
## 592   RNA18SP 110467537
## 593   RNA18SP 110467538
## 594   RNA18SP 110467532
## 595   RNA18SP 110467533
## 596   RNA18SP 110467534
## 597   RNA18SP 110467535
## 598   RNA18SP 110467536
## 599   RNA18SP 110467537
## 600   RNA18SP 110467538

15.5 Exercise 5:

So to retrieve this information using select you need to do it like this:

res1 <- select(TxDb.Hsapiens.UCSC.hg19.knownGene,
               keys(TxDb.Hsapiens.UCSC.hg19.knownGene, keytype="TXID"),
               columns=c("GENEID","TXNAME","TXCHROM"), keytype="TXID")

## 'select()' returned 1:1 mapping between keys and columns

head(res1)

##   TXID    GENEID     TXNAME TXCHROM
## 1    1 100287102 uc001aaa.3    chr1
## 2    2 100287102 uc010nxq.1    chr1
## 3    3 100287102 uc010nxr.1    chr1
## 4    4     79501 uc001aal.1    chr1
## 5    5      <NA> uc001aaq.2    chr1
## 6    6      <NA> uc001aar.2    chr1

And to do it using transcripts you do it like this:

res2 <- transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene,
                    columns = c("gene_id","tx_name"))
head(res2)

## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames        ranges strand |         gene_id     tx_name
##          <Rle>     <IRanges>  <Rle> | <CharacterList> <character>
##   [1]     chr3 238279-451097      + |           10752  uc003bot.3
##   [2]     chr3 238279-451097      + |           10752  uc003bou.3
##   [3]     chr3 239326-290282      + |           10752  uc003bov.2
##   [4]     chr3 239326-440831      + |           10752  uc003bow.2
##   [5]     chr3 361366-451097      + |           10752  uc011asi.2
##   [6]     chr3 577914-887698      + |            <NA>  uc003boy.1
##   -------
##   seqinfo: 2 sequences from hg19 genome

Notice that in the 2nd case we don’t have to ask for the chromosome, as transcripts() returns a GRanges object, so the chromosome will automatically be returned as part of the object.

15.6 Exercise 6:

res <- transcripts(TxDb.Athaliana.BioMart.plantsmart22, columns = c("gene_id"))

You will notice that the gene ids for this package are TAIR locus IDs and are NOT entrez gene IDs like what you saw in the TxDb.Hsapiens.UCSC.hg19.knownGene package. It’s important to always pay attention to the kind of gene id is being used by the TxDb you are looking at.

15.7 Exercise 7:

keys <- keys(Homo.sapiens, keytype="TXID")
res1 <- select(Homo.sapiens,
               keys= keys,
               columns=c("SYMBOL","TXSTART","TXCHROM"), keytype="TXID")

head(res1)

And to do it using transcripts you do it like this:

res2 <- transcripts(Homo.sapiens, columns="SYMBOL")

## 'select()' returned 1:1 mapping between keys and columns

head(res2)

## GRanges object with 6 ranges and 1 metadata column:
##       seqnames        ranges strand |          SYMBOL
##          <Rle>     <IRanges>  <Rle> | <CharacterList>
##   [1]     chr3 238279-451097      + |            CHL1
##   [2]     chr3 238279-451097      + |            CHL1
##   [3]     chr3 239326-290282      + |            CHL1
##   [4]     chr3 239326-440831      + |            CHL1
##   [5]     chr3 361366-451097      + |            CHL1
##   [6]     chr3 577914-887698      + |            <NA>
##   -------
##   seqinfo: 2 sequences from hg19 genome

15.8 Exercise 8:

columns(Homo.sapiens)

##  [1] "ACCNUM"       "ALIAS"        "CDSCHROM"     "CDSEND"      
##  [5] "CDSID"        "CDSNAME"      "CDSSTART"     "CDSSTRAND"   
##  [9] "DEFINITION"   "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
## [13] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL" 
## [17] "EXONCHROM"    "EXONEND"      "EXONID"       "EXONNAME"    
## [21] "EXONRANK"     "EXONSTART"    "EXONSTRAND"   "GENEID"      
## [25] "GENENAME"     "GO"           "GOALL"        "GOID"        
## [29] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [33] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [37] "PROSITE"      "REFSEQ"       "SYMBOL"       "TERM"        
## [41] "TXCHROM"      "TXEND"        "TXID"         "TXNAME"      
## [45] "TXSTART"      "TXSTRAND"     "TXTYPE"       "UCSCKG"      
## [49] "UNIGENE"      "UNIPROT"

columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

columns(TxDb.Hsapiens.UCSC.hg19.knownGene)

##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSSTART"  
##  [6] "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"     "EXONNAME"  
## [11] "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"     "TXCHROM"   
## [16] "TXEND"      "TXID"       "TXNAME"     "TXSTART"    "TXSTRAND"  
## [21] "TXTYPE"

## You might also want to look at this:
transcripts(Homo.sapiens, columns=c("SYMBOL","CHRLOC"))

## 'select()' returned 1:1 mapping between keys and columns

## GRanges object with 5506 ranges and 1 metadata column:
##          seqnames            ranges strand |          SYMBOL
##             <Rle>         <IRanges>  <Rle> | <CharacterList>
##      [1]     chr3     238279-451097      + |            CHL1
##      [2]     chr3     238279-451097      + |            CHL1
##      [3]     chr3     239326-290282      + |            CHL1
##      [4]     chr3     239326-440831      + |            CHL1
##      [5]     chr3     361366-451097      + |            CHL1
##      ...      ...               ...    ... .             ...
##   [5502]    chr18 77732867-77748532      - |          TXNL4A
##   [5503]    chr18 77732867-77748532      - |          TXNL4A
##   [5504]    chr18 77732867-77793915      - |          TXNL4A
##   [5505]    chr18 77915117-78005397      - |          PARD6G
##   [5506]    chr18 77941005-78005397      - |          PARD6G
##   -------
##   seqinfo: 2 sequences from hg19 genome

The key difference is that the TXSTART refers to the start of a transcript and originates in the TxDb object from the TxDb.Hsapiens.UCSC.hg19.knownGene package, while the CHRLOC refers to the same thing but originates in the OrgDb object from the org.Hs.eg.db package. The point of origin is significant because the TxDb object represents a transcriptome from UCSC and the OrgDb is primarily gene centric data that originates at NCBI. The upshot is that CHRLOC will not have as many regions represented as TXSTART, since there has to be an official gene for there to even be a record. The CHRLOC data is also locked in for org.Hs.eg.db as data for hg19, whereas you can swap in a different TxDb object to match the genome you are using to make it hg18 etc. For these reasons, we strongly recommend using TXSTART instead of CHRLOC. Howeverm CHRLOC still remains in the org packages for historical reasons.

15.9 Exercise 9:

To find the keys that match, make use of the pattern and column arguments.

xk = head(keys(Homo.sapiens, keytype="ENTREZID", pattern="X", column="SYMBOL"))

## 'select()' returned 1:1 mapping between keys and columns

xk

## [1] "51"  "179" "189" "239" "240" "241"

select verifies the results

select(Homo.sapiens, xk, "SYMBOL", "ENTREZID")

## 'select()' returned 1:1 mapping between keys and columns

##   ENTREZID  SYMBOL
## 1       51   ACOX1
## 2      179   AGMX2
## 3      189    AGXT
## 4      239  ALOX12
## 5      240   ALOX5
## 6      241 ALOX5AP

15.10 Exercise 10:

## Get the transcript ranges grouped by gene
txby <- transcriptsBy(Homo.sapiens, by="gene")
## look up the entrez ID for the gene symbol 'PTEN'
select(Homo.sapiens, keys='PTEN', columns='ENTREZID', keytype='SYMBOL')
## subset that genes transcripts
geneOfInterest <- txby[["5728"]]
## extract the sequence
res <- getSeq(Hsapiens, geneOfInterest)
res

15.11 Exercise 11:

ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
ids=c("1")
getBM(attributes=c('go_id', 'entrezgene'),
            filters = 'entrezgene',
                    values = ids, mart = ensembl)

##         go_id entrezgene
## 1                      1
## 2  GO:0005576          1
## 3  GO:0005615          1
## 4  GO:0070062          1
## 5  GO:0003674          1
## 6  GO:0008150          1
## 7  GO:0072562          1
## 8  GO:0062023          1
## 9  GO:0043312          1
## 10 GO:0002576          1
## 11 GO:0034774          1
## 12 GO:1904813          1
## 13 GO:0031093          1

15.12 Exercise 12:

ids=c("1")
select(org.Hs.eg.db, keys=ids, columns="GO", keytype="ENTREZID")

## 'select()' returned 1:many mapping between keys and columns

##    ENTREZID         GO EVIDENCE ONTOLOGY
## 1         1 GO:0002576      TAS       BP
## 2         1 GO:0003674       ND       MF
## 3         1 GO:0005576      HDA       CC
## 4         1 GO:0005576      IDA       CC
## 5         1 GO:0005576      TAS       CC
## 6         1 GO:0005615      HDA       CC
## 7         1 GO:0008150       ND       BP
## 8         1 GO:0031093      TAS       CC
## 9         1 GO:0034774      TAS       CC
## 10        1 GO:0043312      TAS       BP
## 11        1 GO:0062023      HDA       CC
## 12        1 GO:0070062      HDA       CC
## 13        1 GO:0072562      HDA       CC
## 14        1 GO:1904813      TAS       CC

When this exercise was written, there was a different number of GO terms returned from biomaRt than from org.Hs.eg.db. This may not always be true in the future though as both of these resources are updated. It is expected however that this web service, (which is updated continuously) will fall in and out of sync with the org.Hs.eg.db package (which is updated twice a year). This is an important difference as each approach has different advantages and disadvantages. The advantage to updating continuously is that you always have the very latest annotations which are frequently different for something like GO terms. The advantage to using a package is that the results are frozen to a release of Bioconductor. And this can help you to get the same answers that you get today (reproducibility), a few years from now.

[ Back to top ]

Genomic Annotation Resources

24 April 2018

Contents

1 Version Info

2 Introduction

3 Set Up

4 Using AnnotationHub

4.1 AnnotationHub exercises

5 OrgDb objects

5.1 OrgDb exercises

6 TxDb Objects

6.1 TxDb exercises

7 Organism.dplyr src_organism Objects

7.1 Organism.dplyr exercises

8 BSgenome Objects

8.1 BSgenome exercises

9 biomaRt

9.1 biomaRt exercises

10 Creating annotation objects

11 Important considerations

12 sessionInfo()

13 Acknowledgments

14 References

15 Answers for exercises

15.1 Exercise 1:

15.2 Exercise 2:

15.3 Exercise 3:

15.4 Exercise 4:

15.5 Exercise 5:

15.6 Exercise 6:

15.7 Exercise 7:

15.8 Exercise 8:

15.9 Exercise 9:

15.10 Exercise 10:

15.11 Exercise 11:

15.12 Exercise 12: