Mapping between different Genome references

Package: Pbase
Author: Laurent Gatto
Last compiled: Thu Apr 16 21:59:59 2015
Last modified: 2015-04-16 13:11:05

This vignette described how to convert coordinates between different genome references. We will use transcript ENST00000373316 and GRCh38 and GRCh37 as working example.

GRCh38

Here is use Gviz to query the latest Ensembl biomart and extract the transcript of interest.

suppressMessages(library("Gviz"))
suppressMessages(library("biomaRt"))

h38 <- useMart("ensembl", "hsapiens_gene_ensembl")
tr38 <- BiomartGeneRegionTrack(biomart = h38,
                               transcript = tr)
tr38 <- split(tr38, transcript(tr38))
tr38 <- ranges(tr38[[tr]])
tr38

## GRanges object with 13 ranges and 7 metadata columns:
##        seqnames               ranges strand   |        feature
##           <Rle>            <IRanges>  <Rle>   |    <character>
##    [1]     chrX [78104174, 78104340]      +   |           utr5
##    [2]     chrX [78104341, 78104405]      +   | protein_coding
##    [3]     chrX [78109867, 78109917]      +   | protein_coding
##    [4]     chrX [78113744, 78113899]      +   | protein_coding
##    [5]     chrX [78114016, 78114160]      +   | protein_coding
##    ...      ...                  ...    ... ...            ...
##    [9]     chrX [78123195, 78123374]      +   | protein_coding
##   [10]     chrX [78124874, 78125051]      +   | protein_coding
##   [11]     chrX [78125327, 78125425]      +   | protein_coding
##   [12]     chrX [78125790, 78125830]      +   | protein_coding
##   [13]     chrX [78125831, 78129296]      +   |           utr3
##                   gene            exon      transcript      symbol
##            <character>     <character>     <character> <character>
##    [1] ENSG00000102144 ENSE00001600900 ENST00000373316        PGK1
##    [2] ENSG00000102144 ENSE00001600900 ENST00000373316        PGK1
##    [3] ENSG00000102144 ENSE00003506377 ENST00000373316        PGK1
##    [4] ENSG00000102144 ENSE00003502842 ENST00000373316        PGK1
##    [5] ENSG00000102144 ENSE00003512377 ENST00000373316        PGK1
##    ...             ...             ...             ...         ...
##    [9] ENSG00000102144 ENSE00000672996 ENST00000373316        PGK1
##   [10] ENSG00000102144 ENSE00000672995 ENST00000373316        PGK1
##   [11] ENSG00000102144 ENSE00000672994 ENST00000373316        PGK1
##   [12] ENSG00000102144 ENSE00001948816 ENST00000373316        PGK1
##   [13] ENSG00000102144 ENSE00001948816 ENST00000373316        PGK1
##             rank     phase
##        <numeric> <integer>
##    [1]         1        -1
##    [2]         1        -1
##    [3]         2         2
##    [4]         3         2
##    [5]         4         2
##    ...       ...       ...
##    [9]         8         0
##   [10]         9         0
##   [11]        10         1
##   [12]        11         0
##   [13]        11        -1
##   -------
##   seqinfo: 1 sequence from hsapiens_gene_ensembl genome; no seqlengths

Note the starting position of the transcript is 78104174.

GRCh37

Below, I repeat the same operation without using my own ens Mart instance. As far as I understand, Gviz queries the UCSC genome reference by default.

h37 <- useMart(host = "feb2014.archive.ensembl.org",
                biomart = "ENSEMBL_MART_ENSEMBL",
                dataset = "hsapiens_gene_ensembl")
tr37 <- BiomartGeneRegionTrack(biomart = h37,
                               transcript = tr)
tr37 <- split(tr37, transcript(tr37))
tr37 <- ranges(tr37[[tr]])
tr37

## GRanges object with 13 ranges and 7 metadata columns:
##        seqnames               ranges strand   |        feature
##           <Rle>            <IRanges>  <Rle>   |    <character>
##    [1]     chrX [77359671, 77359837]      +   |           utr5
##    [2]     chrX [77359838, 77359902]      +   | protein_coding
##    [3]     chrX [77365364, 77365414]      +   | protein_coding
##    [4]     chrX [77369241, 77369396]      +   | protein_coding
##    [5]     chrX [77369513, 77369657]      +   | protein_coding
##    ...      ...                  ...    ... ...            ...
##    [9]     chrX [77378692, 77378871]      +   | protein_coding
##   [10]     chrX [77380371, 77380548]      +   | protein_coding
##   [11]     chrX [77380824, 77380922]      +   | protein_coding
##   [12]     chrX [77381287, 77381327]      +   | protein_coding
##   [13]     chrX [77381328, 77384793]      +   |           utr3
##                   gene            exon      transcript      symbol
##            <character>     <character>     <character> <character>
##    [1] ENSG00000102144 ENSE00001600900 ENST00000373316        PGK1
##    [2] ENSG00000102144 ENSE00001600900 ENST00000373316        PGK1
##    [3] ENSG00000102144 ENSE00003506377 ENST00000373316        PGK1
##    [4] ENSG00000102144 ENSE00003502842 ENST00000373316        PGK1
##    [5] ENSG00000102144 ENSE00003512377 ENST00000373316        PGK1
##    ...             ...             ...             ...         ...
##    [9] ENSG00000102144 ENSE00000672996 ENST00000373316        PGK1
##   [10] ENSG00000102144 ENSE00000672995 ENST00000373316        PGK1
##   [11] ENSG00000102144 ENSE00000672994 ENST00000373316        PGK1
##   [12] ENSG00000102144 ENSE00001948816 ENST00000373316        PGK1
##   [13] ENSG00000102144 ENSE00001948816 ENST00000373316        PGK1
##             rank     phase
##        <numeric> <integer>
##    [1]         1        -1
##    [2]         1        -1
##    [3]         2         2
##    [4]         3         2
##    [5]         4         2
##    ...       ...       ...
##    [9]         8         0
##   [10]         9         0
##   [11]        10         1
##   [12]        11         0
##   [13]        11        -1
##   -------
##   seqinfo: 1 sequence from hsapiens_gene_ensembl genome; no seqlengths

Note the starting position of the transcript is 77359671.

These differences seem to stem from different genome builds. Ensembl release 78 uses GRCh38, while UCSC uses GRCh37. Indeed, Gviz sets the Ensembl biomart server to Feb.2014 GRCh37.p13.

Coordinates conversion

We will use the coordinate mapping infrastructure described in the January 2015 Bioconductor Newletter and the Changing genomic coordinate systems with rtracklayer::liftOver workflow.

First, we query AnnotationHub for a chain file to perform the operation we want.

library("AnnotationHub")

## Note: the specification for S3 class "AsIs" in package 'RJSONIO' seems equivalent to one from package 'BiocGenerics': not turning on duplicate class definitions for this class.

hub <- AnnotationHub()
query(hub, 'hg19ToHg38')

## AnnotationHub with 1 record
## # snapshotDate(): 2015-03-26 
## # names(): AH14150
## # $dataprovider: UCSC
## # $species: Homo sapiens
## # $rdataclass: ChainFile
## # $title: hg19ToHg38.over.chain.gz
## # $description: UCSC liftOver chain file from hg19 to hg38
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: Chain
## # $sourceurl: http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/h...
## # $sourcelastmodifieddate: NA
## # $sourcesize: NA
## # $tags: liftOver, chain, UCSC, genome, homology 
## # retrieve record with 'object[["AH14150"]]'

chain <- query(hub, 'hg19ToHg38')[[1]]

The liftOver function from the rtracklayer package will use the chain and translate the coordinates of a GRanges object into a new GRangesList object.

library("rtracklayer")
res <- liftOver(tr37, chain)
res <- unlist(res)

## set annotation
names(res) <- NULL
genome(res) <- "hsapiens_gene_ensembl"

all.equal(res, tr38)

## [1] TRUE

Session information

sessionInfo()

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] rtracklayer_1.28.0   AnnotationHub_2.0.0  biomaRt_2.24.0      
##  [4] mzID_1.6.0           Biostrings_2.36.0    XVector_0.8.0       
##  [7] Pbase_0.8.0          Gviz_1.12.0          GenomicRanges_1.20.0
## [10] GenomeInfoDb_1.4.0   IRanges_2.2.0        S4Vectors_0.6.0     
## [13] Rcpp_0.11.5          BiocGenerics_0.14.0  BiocStyle_1.6.0     
## 
## loaded via a namespace (and not attached):
##  [1] httr_0.6.1                   Biobase_2.28.0              
##  [3] vsn_3.36.0                   splines_3.2.0               
##  [5] foreach_1.4.2                shiny_0.11.1                
##  [7] Formula_1.2-1                interactiveDisplayBase_1.6.0
##  [9] affy_1.46.0                  latticeExtra_0.6-26         
## [11] BSgenome_1.36.0              Rsamtools_1.20.0            
## [13] yaml_2.1.13                  impute_1.42.0               
## [15] RSQLite_1.0.0                lattice_0.20-31             
## [17] biovizBase_1.16.0            limma_3.24.0                
## [19] chron_2.3-45                 digest_0.6.8                
## [21] RColorBrewer_1.1-2           colorspace_1.2-6            
## [23] httpuv_1.3.2                 htmltools_0.2.6             
## [25] preprocessCore_1.30.0        plyr_1.8.1                  
## [27] MALDIquant_1.11              XML_3.98-1.1                
## [29] zlibbioc_1.14.0              xtable_1.7-4                
## [31] scales_0.2.4                 affyio_1.36.0               
## [33] cleaver_1.6.0                BiocParallel_1.2.0          
## [35] ggplot2_1.0.1                GenomicFeatures_1.20.0      
## [37] nnet_7.3-9                   proto_0.3-10                
## [39] RJSONIO_1.3-0                mime_0.3                    
## [41] survival_2.38-1              evaluate_0.6                
## [43] doParallel_1.0.8             MASS_7.3-40                 
## [45] foreign_0.8-63               mzR_2.2.0                   
## [47] Pviz_1.2.0                   BiocInstaller_1.18.1        
## [49] tools_3.2.0                  data.table_1.9.4            
## [51] formatR_1.1                  matrixStats_0.14.0          
## [53] stringr_0.6.2                MSnbase_1.16.0              
## [55] munsell_0.4.2                cluster_2.0.1               
## [57] AnnotationDbi_1.30.0         lambda.r_1.1.7              
## [59] pcaMethods_1.58.0            futile.logger_1.4           
## [61] RCurl_1.95-4.5               dichromat_2.0-0             
## [63] iterators_1.0.7              VariantAnnotation_1.14.0    
## [65] bitops_1.0-6                 rmarkdown_0.5.1             
## [67] gtable_0.1.2                 codetools_0.2-11            
## [69] DBI_0.3.1                    R6_2.0.1                    
## [71] reshape2_1.4.1               GenomicAlignments_1.4.0     
## [73] knitr_1.9                    Hmisc_3.15-0                
## [75] ProtGenerics_1.0.0           futile.options_1.0.0        
## [77] rpart_4.1-9                  acepack_1.3-3.3