.
.
A phenome-wide association study (PheWAS) is known to be a powerful tool in discovery and replication of genetic association studies. The recent development in the UK Biobank resource with deep genomic and phenotyping data has provided unparalleled research opportunities. To reduce the computational complexity and cost of PheWAS, the SAIGE (scalable and accurate implementation of generalized mixed model [1]) method was proposed recently, controlling for case-control imbalance and sample structure in single variant association studies. However, it is still computationally challenging to analyze the associations of thousands of phenotypes with whole-genome variant data, especially for disease diagnoses using the ICD-10 codes of deep phenotyping.
Here we develop a new high-performance statistical package (SAIGEgds) for large-scale PheWAS using mixed models [2]. In this package, the SAIGE method is implemented with optimized C++ codes taking advantage of sparse structure of genotype dosages. SAIGEgds supports efficient genomic data structure (GDS) files [3] including both integer genotypes and numeric imputed dosages. Benchmarks using the UK Biobank White British genotype data (N=430K) with coronary heart disease and simulated cases, show that SAIGEgds is 5 to 6 times faster than the SAIGE R package in the steps of fitting null models and p-value calculations. When used in conjunction with high-performance computing (HPC) clusters and/or cloud resources, SAIGEgds provides an efficient analysis pipeline for biobank-scale PheWAS.
.
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("SAIGEgds")
The BiocManager::install()
approach may require that you build from source, i.e. make
and compilers must be installed on your system – see the R FAQ for your operating system; you may also need to install dependencies manually.
.
.
library(SeqArray)
library(SAIGEgds)
# the genotype file in SeqArray GDS format (1000 Genomes Phase1, chromosome 22)
(geno_fn <- seqExampleFileName("KG_Phase1"))
## [1] "/home/biocbuild/bbs-3.12-bioc/R/library/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds"
# open a SeqArray file in the package (1000 Genomes Phase1, chromosome 22)
gds <- seqOpen(geno_fn)
The LD pruning is provided by snpgdsLDpruning() in the pacakge SNPRelate:
library(SNPRelate)
## SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)
set.seed(1000)
snpset <- snpgdsLDpruning(gds)
## SNV pruning based on LD:
## Calculating allele counts/frequencies ...
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
## # of selected variants: 19,651
## Excluding 122 SNVs (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
## # of samples: 1,092
## # of SNVs: 19,651
## using 1 thread
## sliding window: 500,000 basepairs, Inf SNPs
## |LD| threshold: 0.2
## method: composite
## Chromosome 22: 37.70%, 7,455/19,773
## 7,455 markers are selected in total.
str(snpset)
## List of 1
## $ chr22: int [1:7455] 1 3 4 6 7 8 9 10 12 13 ...
snpset.id <- unlist(snpset, use.names=FALSE) # get the variant IDs of a LD-pruned set
head(snpset.id)
## [1] 1 3 4 6 7 8
Create a genotype file for genetic relationship matrix (GRM) using the LD-pruned SNP set:
grm_fn <- "grm_geno.gds"
seqSetFilter(gds, variant.id=snpset.id)
## # of selected variants: 7,455
# export to a GDS genotype file without annotation data
seqExport(gds, grm_fn, info.var=character(), fmt.var=character(), samp.var=character())
## Export to 'grm_geno.gds'
## sample.id (1,092) [md5: bd2a93b49750ae227793ed23c575b620]
## variant.id (7,455) [md5: 67e7665f2ea9a4132b49a9f5bd84fbae]
## position [md5: a62a3752b030b2742f45dc3a5e751e51]
## chromosome [md5: 6cc8b780b298da2fb8237da650af7fe8]
## allele [md5: 3ff1146981d1807658f5b0243c2d47b0]
## genotype [md5: 5bd348af9af0c1f14bbb75c2b9a8b7e6]
## phase [md5: 922eca75d5098e1b684dcf0bae70e86b]
## annotation/id [md5: 7cfc098439330617251554e95fbb2f79]
## annotation/qual [md5: ce8ebcb77b11f0c53309778b928a7393]
## annotation/filter [md5: 11301697d2af5964436ec4eb2294906f]
## Done.
## Optimize the access efficiency ...
## Clean up the fragments of GDS file:
## open the file 'grm_geno.gds' (412.9K)
## # of fragments: 91
## save to 'grm_geno.gds.tmp'
## rename 'grm_geno.gds.tmp' (412.5K, reduced: 444B)
## # of fragments: 54
If genotypes are split by chromosomes, seqMerge()
in the SeqArray package can be used to combine the LD-pruned SNP sets.
# close the file
seqClose(gds)
.
A simulated phenotype data is used to demonstrate the model fitting:
set.seed(1000)
sampid <- seqGetData(grm_fn, "sample.id") # sample IDs in the genotype file
pheno <- data.frame(
sample.id = sampid,
y = sample(c(0, 1), length(sampid), replace=TRUE, prob=c(0.95, 0.05)),
x1 = rnorm(length(sampid)),
x2 = rnorm(length(sampid)),
stringsAsFactors = FALSE)
head(pheno)
## sample.id y x1 x2
## 1 HG00096 0 0.02329127 -0.1700062
## 2 HG00097 0 -1.38629108 -1.0058530
## 3 HG00099 0 0.56339700 -0.6989284
## 4 HG00100 0 -0.25703246 -1.5884806
## 5 HG00101 0 0.02926429 0.7645377
## 6 HG00102 0 -0.44480750 1.9964501
# null model fitting using GRM from grm_fn
grm_fn
## [1] "grm_geno.gds"
glmm <- seqFitNullGLMM_SPA(y ~ x1 + x2, pheno, grm_fn, trait.type="binary", sample.col="sample.id")
## SAIGE association analysis:
## Sat May 1 21:43:00 2021
## Open 'grm_geno.gds'
## Filtering variants:
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
## # of selected variants: 2,079
## Fit the null model: y ~ x1 + x2 + var(GRM)
## # of samples: 1,092
## # of variants: 2,079
## using 1 thread
## Transform on the design matrix with QR decomposition:
## new formula: y ~ x_0 + x_1 + x_2 - 1
## Start loading SNP genotypes:
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
## using 1.5M (sparse matrix)
## Binary outcome: y
## y Number Proportion
## 0 1034 0.94688645
## 1 58 0.05311355
## Initial fixed-effect coefficients:
## x_0 x_1 x_2
## 2.884823 -0.09445083 0.01380474
## Initial variance component estimates, tau:
## Sigma_E: 1, Sigma_G: 0.5
## Iteration 1:
## tau: (1, 0.4988798)
## fixed coeff: (2.884823, -0.09445083, 0.01380474)
## Final tau: (1, 0)
## fixed coeff: (2.884823, -0.09445083, 0.01380474)
## Calculate the average ratio of variances:
## Sat May 1 21:43:01 2021
## 1, maf: 0.0101, mac: 22, ratio: 1.0000 (var1: 0.0488, var2: 0.0488)
## 2, maf: 0.0261, mac: 57, ratio: 1.0000 (var1: 0.0544, var2: 0.0544)
## 3, maf: 0.0797, mac: 174, ratio: 1.0000 (var1: 0.0479, var2: 0.0479)
## 4, maf: 0.2074, mac: 453, ratio: 1.0000 (var1: 0.0402, var2: 0.0402)
## 5, maf: 0.2967, mac: 648, ratio: 1.0000 (var1: 0.0371, var2: 0.0371)
## 6, maf: 0.1992, mac: 435, ratio: 1.0000 (var1: 0.0454, var2: 0.0454)
## 7, maf: 0.0101, mac: 22, ratio: 1.0000 (var1: 0.049, var2: 0.049)
## 8, maf: 0.4446, mac: 971, ratio: 1.0000 (var1: 0.0298, var2: 0.0298)
## 9, maf: 0.1268, mac: 277, ratio: 1.0000 (var1: 0.0471, var2: 0.0471)
## 10, maf: 0.2427, mac: 530, ratio: 1.0000 (var1: 0.0305, var2: 0.0305)
## 11, maf: 0.0206, mac: 45, ratio: 1.0000 (var1: 0.0598, var2: 0.0598)
## 12, maf: 0.0385, mac: 84, ratio: 1.0000 (var1: 0.0532, var2: 0.0532)
## 13, maf: 0.0183, mac: 40, ratio: 1.0000 (var1: 0.0525, var2: 0.0525)
## 14, maf: 0.2056, mac: 449, ratio: 1.0000 (var1: 0.0455, var2: 0.0455)
## 15, maf: 0.0096, mac: 21, ratio: 1.0000 (var1: 0.0535, var2: 0.0535)
## 16, maf: 0.0114, mac: 25, ratio: 1.0000 (var1: 0.0487, var2: 0.0487)
## 17, maf: 0.0408, mac: 89, ratio: 1.0000 (var1: 0.049, var2: 0.049)
## 18, maf: 0.1983, mac: 433, ratio: 1.0000 (var1: 0.04, var2: 0.04)
## 19, maf: 0.0490, mac: 107, ratio: 1.0000 (var1: 0.0479, var2: 0.0479)
## 20, maf: 0.0114, mac: 25, ratio: 1.0000 (var1: 0.0515, var2: 0.0515)
## 21, maf: 0.0188, mac: 41, ratio: 1.0000 (var1: 0.0592, var2: 0.0592)
## 22, maf: 0.1937, mac: 423, ratio: 1.0000 (var1: 0.043, var2: 0.043)
## 23, maf: 0.0128, mac: 28, ratio: 1.0000 (var1: 0.0527, var2: 0.0527)
## 24, maf: 0.4020, mac: 878, ratio: 1.0000 (var1: 0.0348, var2: 0.0348)
## 25, maf: 0.0316, mac: 69, ratio: 1.0000 (var1: 0.0463, var2: 0.0463)
## 26, maf: 0.0357, mac: 78, ratio: 1.0000 (var1: 0.0517, var2: 0.0517)
## 27, maf: 0.0989, mac: 216, ratio: 1.0000 (var1: 0.0481, var2: 0.0481)
## 28, maf: 0.0096, mac: 21, ratio: 1.0000 (var1: 0.0585, var2: 0.0585)
## 29, maf: 0.0275, mac: 60, ratio: 1.0000 (var1: 0.0525, var2: 0.0525)
## 30, maf: 0.1255, mac: 274, ratio: 1.0000 (var1: 0.0455, var2: 0.0455)
## ratio avg. is 1, sd: 8.525292e-16
## Sat May 1 21:43:01 2021
## Done.
.
# genetic variants stored in the file geno_fn
geno_fn
## [1] "/home/biocbuild/bbs-3.12-bioc/R/library/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds"
# calculate, using 2 processes
assoc <- seqAssocGLMM_SPA(geno_fn, glmm, mac=10, parallel=2)
## SAIGE association analysis:
## open '/home/biocbuild/bbs-3.12-bioc/R/library/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds'
## # of samples: 1,092
## # of variants: 19,773
## MAF threshold: NaN
## MAC threshold: 10
## missing threshold for variants: 0.1
## p-value threshold for SPA adjustment: 0.05
## variance ratio for approximation: 1
## # of processes: 2
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
## # of variants after filtering by MAF, MAC and missing thresholds: 9,371
## Done.
head(assoc)
## id chr pos rs.id ref alt AF.alt mac num beta SE pval
## 1 1 22 16051497 rs141578542 A G 0.30494505 666 1092 0.2214607 0.2194430 0.3128813
## 2 2 22 16059752 rs139717388 G A 0.05677656 124 1092 0.4227084 0.4240575 0.3188526
## 3 5 22 16060995 rs2843244 G A 0.06135531 134 1092 0.3044670 0.4100805 0.4578107
## 4 8 22 16166919 ATATTTTCTGCACATATT A 0.01190476 26 1092 -1.0851502 0.8804521 0.2177653
## 5 9 22 16173887 GT G 0.03067766 67 1092 -0.8195449 0.5587294 0.1424302
## 6 10 22 16205515 rs144309057 G A 0.01098901 24 1092 -1.0826898 0.9340256 0.2463890
## p.norm converged
## 1 0.3128813 TRUE
## 2 0.3188526 TRUE
## 3 0.4578107 TRUE
## 4 0.2177653 TRUE
## 5 0.1424302 TRUE
## 6 0.2463890 TRUE
# filtering based on p-value
assoc[assoc$pval < 5e-4, ]
## id chr pos rs.id ref alt AF.alt mac num beta SE pval p.norm
## 8422 17856 22 48507315 rs57751251 T C 0.1666667 364 1092 0.971123 0.2555021 1.442051e-04 5.999175e-05
## 8423 17857 22 48509092 rs7292083 C T 0.1625458 355 1092 1.008351 0.2585365 9.610252e-05 3.358045e-05
## 8630 18282 22 49060987 rs4925399 A G 0.5929487 889 1092 0.708757 0.1956659 2.920160e-04 2.363433e-04
## converged
## 8422 TRUE
## 8423 TRUE
## 8630 TRUE
The output could be directly saved to an R object file or a GDS file:
# save to 'assoc.gds'
seqAssocGLMM_SPA(geno_fn, glmm, mac=10, parallel=2, res.savefn="assoc.gds")
## SAIGE association analysis:
## open '/home/biocbuild/bbs-3.12-bioc/R/library/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds'
## # of samples: 1,092
## # of variants: 19,773
## MAF threshold: NaN
## MAC threshold: 10
## missing threshold for variants: 0.1
## p-value threshold for SPA adjustment: 0.05
## variance ratio for approximation: 1
## # of processes: 2
##
[..................................................] 0%, ETC: ---
[==================================================] 100%, completed, 0s
## # of variants after filtering by MAF, MAC and missing thresholds: 9,371
## Save to 'assoc.gds' ...
## Done.
Open the output GDS file using the functions in the gdsfmt package:
# open the GDS file
(f <- openfn.gds("assoc.gds"))
## File: /tmp/Rtmpxe8agy/Rbuild73891e584dee/SAIGEgds/vignettes/assoc.gds (448.1K)
## + [ ] *
## |--+ sample.id { Str8 1092 LZMA(10.2%), 888B }
## |--+ id { Int32 9371 LZMA(13.4%), 4.9K }
## |--+ chr { Str8 9371 LZMA(0.46%), 128B }
## |--+ pos { Int32 9371 LZMA(53.9%), 19.7K }
## |--+ rs.id { Str8 9371 LZMA(35.7%), 34.0K }
## |--+ ref { Str8 9371 LZMA(23.7%), 89.0K }
## |--+ alt { Str8 9371 LZMA(21.0%), 4.0K }
## |--+ AF.alt { Float64 9371 LZMA(20.1%), 14.7K }
## |--+ mac { Float64 9371 LZMA(17.3%), 12.6K }
## |--+ num { Int32 9371 LZMA(0.35%), 132B }
## |--+ beta { Float64 9371 LZMA(92.9%), 68.0K }
## |--+ SE { Float64 9371 LZMA(89.6%), 65.6K }
## |--+ pval { Float64 9371 LZMA(89.8%), 65.7K }
## |--+ p.norm { Float64 9371 LZMA(89.8%), 65.7K }
## \--+ converged { Int32,logical 9371 LZMA(0.35%), 132B } *
# get p-values
pval <- read.gdsn(index.gdsn(f, "pval"))
summary(pval)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000961 0.2862278 0.4639086 0.4980417 0.7256674 0.9999449
closefn.gds(f)
Load association results using the function seqSAIGE_LoadPval()
in SAIGEgds:
res <- seqSAIGE_LoadPval("assoc.gds")
## Loading 'assoc.gds' ...
head(res)
## id chr pos rs.id ref alt AF.alt mac num beta SE pval
## 1 1 22 16051497 rs141578542 A G 0.30494505 666 1092 0.2214607 0.2194430 0.3128813
## 2 2 22 16059752 rs139717388 G A 0.05677656 124 1092 0.4227084 0.4240575 0.3188526
## 3 5 22 16060995 rs2843244 G A 0.06135531 134 1092 0.3044670 0.4100805 0.4578107
## 4 8 22 16166919 ATATTTTCTGCACATATT A 0.01190476 26 1092 -1.0851502 0.8804521 0.2177653
## 5 9 22 16173887 GT G 0.03067766 67 1092 -0.8195449 0.5587294 0.1424302
## 6 10 22 16205515 rs144309057 G A 0.01098901 24 1092 -1.0826898 0.9340256 0.2463890
## p.norm converged
## 1 0.3128813 TRUE
## 2 0.3188526 TRUE
## 3 0.4578107 TRUE
## 4 0.2177653 TRUE
## 5 0.1424302 TRUE
## 6 0.2463890 TRUE
.
n <- nrow(assoc)
# expected p-values
exp.pval <- (1:n) / n
# observed p-values
obs.pval <- sort(assoc$pval)
plot(-log10(exp.pval), -log10(obs.pval), xlab="-log10(expected P)", ylab="-log10(observed P)", cex=2/3)
abline(0, 1, col="red", lty=2)
.
.
sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] SNPRelate_1.24.0 SAIGEgds_1.4.2 Rcpp_1.0.6 SeqArray_1.30.0 gdsfmt_1.26.1
##
## loaded via a namespace (and not attached):
## [1] XVector_0.30.0 knitr_1.33 magrittr_2.0.1 zlibbioc_1.36.0
## [5] GenomicRanges_1.42.0 BiocGenerics_0.36.1 IRanges_2.24.1 R6_2.5.0
## [9] rlang_0.4.11 highr_0.9 stringr_1.4.0 GenomeInfoDb_1.26.7
## [13] tools_4.0.5 parallel_4.0.5 xfun_0.22 jquerylib_0.1.4
## [17] htmltools_0.5.1.1 RcppParallel_5.1.2 yaml_2.2.1 digest_0.6.27
## [21] crayon_1.4.1 GenomeInfoDbData_1.2.4 bitops_1.0-7 sass_0.3.1
## [25] S4Vectors_0.28.1 RCurl_1.98-1.3 evaluate_0.14 rmarkdown_2.7
## [29] stringi_1.5.3 compiler_4.0.5 bslib_0.2.4 Biostrings_2.58.0
## [33] SPAtest_3.1.2 stats4_4.0.5 jsonlite_1.7.2
.
.