R version: R version 4.0.0 (2020-04-24)
Bioconductor version: 3.11
Package version: 1.12.0
library(gwascat)
cur = makeCurrentGwascat() # result varies by day
data(cur)
cur
## gwasloc instance with 65795 records and 37 attributes per record.
## Extracted: 2018-04-24
## Genome: GRCh38 NA
## Excerpt:
## GRanges object with 5 ranges and 3 metadata columns:
## seqnames ranges strand | DISEASE/TRAIT SNPS
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] 17 78284479 * | Inflammatory skin disease rs9302874
## [2] 10 129683009 * | Inflammatory skin disease rs80312298
## [3] 1 247490110 * | Inflammatory skin disease rs10802516
## [4] 2 60845432 * | Inflammatory skin disease rs35741374
## [5] 1 152619805 * | Inflammatory skin disease rs1581803
## P-VALUE
## <numeric>
## [1] 2e-07
## [2] 6e-07
## [3] 6e-06
## [4] 4e-12
## [5] 2e-12
## -------
## seqinfo: 24 sequences from 2 genomes (GRCh38, NA)
The transformation to hg19 coordinates is defined by a chain file provided by UCSC. rtracklayer::import.chain will bring the data into R.
library(rtracklayer)
path = system.file(package="liftOver", "extdata", "hg38ToHg19.over.chain")
ch = import.chain(path)
ch
## Chain of length 25
## names(25): chr22 chr21 chr19 chr20 chrY chr18 ... chr6 chr5 chr4 chr3 chr2 chr1
str(ch[[1]])
## Formal class 'ChainBlock' [package "rtracklayer"] with 6 slots
## ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots
## .. .. ..@ start : int [1:6842] 16367189 16386933 16386970 16387001 16387128 16395491 16395528 16395841 16395860 16395956 ...
## .. .. ..@ width : int [1:6842] 19744 36 31 112 8362 36 312 18 95 33 ...
## .. .. ..@ NAMES : NULL
## .. .. ..@ elementType : chr "ANY"
## .. .. ..@ elementMetadata: NULL
## .. .. ..@ metadata : list()
## ..@ offset : int [1:6842] -480662 -480702 -480702 -480726 -480726 -480726 -480726 -480726 -480726 -480726 ...
## ..@ score : int [1:1168] -1063867308 68830488 21156147 20814926 7358950 3927744 2928210 991419 880681 802146 ...
## ..@ space : chr [1:1168] "chr22" "chr14" "chr22" "chr21" ...
## ..@ reversed: logi [1:1168] FALSE FALSE FALSE FALSE FALSE FALSE ...
## ..@ length : int [1:1168] 1124 1280 173 465 398 110 43 173 342 84 ...
Some more details about the chain data structure are available in the import.chain man page
A chain file essentially details many local alignments, so it is possible for the "from" ranges to map to overlapping regions in the other sequence. The "from" ranges are guaranteed to be disjoint (but do not necessarily cover the entire "from" sequence).
The liftOver function will create a GRangesList.
seqlevelsStyle(cur) = "UCSC" # necessary
cur19 = liftOver(cur, ch)
class(cur19)
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
We unlist and coerce to the gwaswloc class, a convenient form for the GWAS catalog with its many mcols fields.
cur19 = unlist(cur19)
genome(cur19) = "hg19"
cur19 = new("gwaswloc", cur19)
cur19
## gwasloc instance with 65757 records and 37 attributes per record.
## Extracted:
## Genome: hg19
## Excerpt:
## GRanges object with 5 ranges and 3 metadata columns:
## seqnames ranges strand | DISEASE/TRAIT SNPS
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr17 76280560 * | Inflammatory skin disease rs9302874
## [2] chr10 131481273 * | Inflammatory skin disease rs80312298
## [3] chr1 247653412 * | Inflammatory skin disease rs10802516
## [4] chr2 61072567 * | Inflammatory skin disease rs35741374
## [5] chr1 152592281 * | Inflammatory skin disease rs1581803
## P-VALUE
## <numeric>
## [1] 2e-07
## [2] 6e-07
## [3] 6e-06
## [4] 4e-12
## [5] 2e-12
## -------
## seqinfo: 24 sequences from hg19 genome; no seqlengths
We see that the translation leads to a loss of some loci.
length(cur)-length(cur19)
## [1] 38
setdiff(mcols(cur)$SNPS, mcols(cur19)$SNPS)
## [1] "rs757210" "rs8064454" "rs386000" "rs644148" "rs9876781"
## [6] "rs1167796" "rs649129" "rs11672691" "rs4911642" "rs718433"
## [11] "rs138700403" "rs144184641" "rs147767607" "rs148916504" "rs149506335"
## [16] "rs192514996" "rs2734221" "rs2855983" "rs4457242" "rs7777677"
## [21] "rs201386833" "rs3757378" "rs400942" "rs11263761" "rs8176645"
## [26] "rs11785400" "rs451000" "rs450937" "rs12601991" "rs7256693"
## [31] "rs3020736" "rs453755"
It may be interesting to follow up some of the losses.