NHGRI maintains and routinely updates a database of selected genome-wide association studies. This document describes R/Bioconductor facilities for working with contents of this database.
Once the package has been installed, use to obtain interactive access to all the facilities. After executing this command, use to obtain an overview. The current version of this vignette can always be accessed at www.bioconductor.org, or by suitably navigating the web pages generated with .
## # A tibble: 402,121 × 38
## DATE ADDED T…¹ PUBME…² FIRST…³ DATE JOURNAL LINK STUDY DISEA…⁴ INITI…⁵
## <date> <dbl> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 2008-06-16 1.58e7 Klein … 2005-03-10 Science www.… Comp… Age-re… 96 Eur…
## 2 2008-06-16 1.63e7 Maraga… 2005-09-09 Am J H… www.… High… Parkin… 381 Eu…
## 3 2008-06-16 1.66e7 Arking… 2006-04-30 Nat Ge… www.… A co… QT int… 100 Eu…
## 4 2008-06-16 1.71e7 Fung HC 2006-09-28 Lancet… www.… Geno… Parkin… 267 Eu…
## 5 2008-06-16 1.71e7 Fung HC 2006-09-28 Lancet… www.… Geno… Parkin… 267 Eu…
## 6 2008-06-16 1.71e7 Fung HC 2006-09-28 Lancet… www.… Geno… Parkin… 267 Eu…
## 7 2008-06-16 1.71e7 Dewan A 2006-10-19 Science www.… HTRA… Age-re… 96 Sou…
## 8 2008-06-16 1.71e7 Duerr … 2006-10-26 Science www.… A ge… Inflam… 547 Eu…
## 9 2008-06-16 1.72e7 Bierut… 2006-12-07 Hum Mo… www.… Nove… Nicoti… 482 Eu…
## 10 2008-06-16 1.72e7 Bierut… 2006-12-07 Hum Mo… www.… Nove… Nicoti… 482 Eu…
## # … with 402,111 more rows, 29 more variables: `REPLICATION SAMPLE SIZE` <chr>,
## # REGION <chr>, CHR_ID <chr>, CHR_POS <chr>, `REPORTED GENE(S)` <chr>,
## # MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>, DOWNSTREAM_GENE_ID <chr>,
## # SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
## # DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>,
## # SNPS <chr>, MERGED <dbl>, SNP_ID_CURRENT <dbl>, CONTEXT <chr>,
## # INTERGENIC <dbl>, `RISK ALLELE FREQUENCY` <chr>, `P-VALUE` <dbl>, …
We can produce a GRanges in two forms. By default
we get an mcols that has a small set of columns. Note
that records that lack a CHR_POS
value are omitted.
Records that have complicated CHR_POS
values, including
semicolons or " x " notation are kept, but only the first
position is retained. The CHR_ID
field may have
complicated character values, these are not normalized,
and are simply used as seqnames
“as is”.
## dropping 18239 records that have NA for CHR_POS
## 261 records have semicolon in CHR_POS; splitting and using first entry.
## 251 records have ' x ' in CHR_POS indicating multiple SNP effects, using first.
## GRanges object with 383882 ranges and 4 metadata columns:
## seqnames ranges strand | PUBMEDID DATE
## <Rle> <IRanges> <Rle> | <numeric> <Date>
## [1] 1 161509955 * | 19915573 2009-11-15
## [2] 6 31143579 * | 19915573 2009-11-15
## [3] 13 26957130 * | 19915573 2009-11-15
## [4] 9 5213687 * | 19915573 2009-11-15
## [5] 6 32465390 * | 19915573 2009-11-15
## ... ... ... ... . ... ...
## [383878] 10 68192278 * | 34077760 2021-06-01
## [383879] 13 110066208 * | 34077760 2021-06-01
## [383880] 21 28133942 * | 34077760 2021-06-01
## [383881] 1 91730044 * | 34077760 2021-06-01
## [383882] 11 58646437 * | 34077760 2021-06-01
## DISEASE/TRAIT SNPS
## <character> <character>
## [1] Ulcerative colitis rs1801274
## [2] Ulcerative colitis rs9263739
## [3] Ulcerative colitis rs17085007
## [4] Ulcerative colitis rs10975003
## [5] Ulcerative colitis rs2395185
## ... ... ...
## [383878] Vertical cup-disc ra.. rs10998007
## [383879] Vertical cup-disc ra.. rs12875868
## [383880] Vertical cup-disc ra.. rs6516818
## [383881] Vertical cup-disc ra.. rs10783002
## [383882] Vertical cup-disc ra.. rs1938598
## -------
## seqinfo: 209 sequences from GRCh38 genome; no seqlengths
We can set the seqinfo as follows, retaining only records that employ standard chromosomes.
library(GenomeInfoDb)
data(si.hs.38)
gg = keepStandardChromosomes(gg, pruning="coarse")
seqlevels(gg) = seqlevels(si.hs.38)
seqinfo(gg) = si.hs.38
gg
## GRanges object with 383370 ranges and 4 metadata columns:
## seqnames ranges strand | PUBMEDID DATE
## <Rle> <IRanges> <Rle> | <numeric> <Date>
## [1] 1 161509955 * | 19915573 2009-11-15
## [2] 6 31143579 * | 19915573 2009-11-15
## [3] 13 26957130 * | 19915573 2009-11-15
## [4] 9 5213687 * | 19915573 2009-11-15
## [5] 6 32465390 * | 19915573 2009-11-15
## ... ... ... ... . ... ...
## [383366] 10 68192278 * | 34077760 2021-06-01
## [383367] 13 110066208 * | 34077760 2021-06-01
## [383368] 21 28133942 * | 34077760 2021-06-01
## [383369] 1 91730044 * | 34077760 2021-06-01
## [383370] 11 58646437 * | 34077760 2021-06-01
## DISEASE/TRAIT SNPS
## <character> <character>
## [1] Ulcerative colitis rs1801274
## [2] Ulcerative colitis rs9263739
## [3] Ulcerative colitis rs17085007
## [4] Ulcerative colitis rs10975003
## [5] Ulcerative colitis rs2395185
## ... ... ...
## [383366] Vertical cup-disc ra.. rs10998007
## [383367] Vertical cup-disc ra.. rs12875868
## [383368] Vertical cup-disc ra.. rs6516818
## [383369] Vertical cup-disc ra.. rs10783002
## [383370] Vertical cup-disc ra.. rs1938598
## -------
## seqinfo: 24 sequences from GRCh38 genome
We use BiocFileCache to manage downloaded TSV from EBI’s download site. The file is provided without compression, so prepare for 200+MB download if you are not working from a cache. There is no etag set, so you have to check for updates on your own.
## function (url = "http://www.ebi.ac.uk/gwas/api/search/downloads/alternative",
## cache = BiocFileCache::BiocFileCache(), refresh = FALSE,
## ...)
## NULL
This is converted to a manageable extension of GRanges using
process_gwas_dataframe
.
## function (df)
## NULL
Available functions are:
## gwascat loaded. Use makeCurrentGwascat() to extract current image.
## from EBI. The data folder of this package has some legacy extracts.
## [1] "as_GRanges" "bindcadd_snv" "getRsids"
## [4] "getTraits" "get_cached_gwascat" "gwascat_from_AHub"
## [7] "gwcat_snapshot" "gwcex2gviz" "ldtagr"
## [10] "locs4trait" "makeCurrentGwascat" "obo2graphNEL"
## [13] "process_gwas_dataframe" "riskyAlleleCount" "subsetByChromosome"
## [16] "subsetByTraits" "topTraits" "traitsManh"
An extended GRanges instance with a sample of 50000 SNP-disease associations reported
on 30 April 2020 is
obtained as follows, with addresses based on the GRCh38 genome build.
We use gwtrunc
to refer to it in the sequel.
To determine the most frequently occurring traits in this sample:
##
## Blood protein levels
## 1941
## Heel bone mineral density
## 1309
## Body mass index
## 1283
## Height
## 1238
## Metabolite levels
## 691
## Systolic blood pressure
## 654
## Schizophrenia
## 643
## Educational attainment (years of education)
## 642
## Post bronchodilator FEV1/FVC ratio
## 479
## Type 2 diabetes
## 466
For a given trait, obtain a GRanges with all recorded associations; here only three associations are shown:
## gwasloc instance with 3 records and 38 attributes per record.
## Extracted: 2020-04-30 23:24:51
## metadata()$badpos includes records for which no unique locus was given.
## Genome: GRCh38
## Excerpt:
## GRanges object with 3 ranges and 3 metadata columns:
## seqnames ranges strand | DISEASE/TRAIT SNPS P-VALUE
## <Rle> <IRanges> <Rle> | <character> <character> <numeric>
## [1] 7 21567734 * | LDL cholesterol rs12670798 6e-09
## [2] 5 75355259 * | LDL cholesterol rs3846662 2e-11
## [3] 2 43837951 * | LDL cholesterol rs6756629 3e-10
## -------
## seqinfo: 24 sequences from GRCh38 genome
A basic Manhattan plot is easily constructed with the ggbio package facilities. Here we confine attention to chromosomes 4:6. First, we create a version of the catalog with \(-log_{10} p\) truncated at a maximum value of 25.
requireNamespace("S4Vectors")
mcols = S4Vectors::mcols
mlpv = mcols(gwtrunc)$PVALUE_MLOG
mlpv = ifelse(mlpv > 25, 25, mlpv)
S4Vectors::mcols(gwtrunc)$PVALUE_MLOG = mlpv
library(GenomeInfoDb)
# seqlevelsStyle(gwtrunc) = "UCSC" # no more!
seqlevels(gwtrunc) = paste0("chr", seqlevels(gwtrunc))
gwlit = gwtrunc[ which(as.character(seqnames(gwtrunc)) %in% c("chr4", "chr5", "chr6")) ]
library(ggbio)
mlpv = mcols(gwlit)$PVALUE_MLOG
mlpv = ifelse(mlpv > 25, 25, mlpv)
S4Vectors::mcols(gwlit)$PVALUE_MLOG = mlpv
A simple call permits visualization of GWAS results for a small number of traits. Note the defaults in this call.