Michigan Imputation Server
Michigan Imputation Server pre-phases typed genotypes using HAPI-UR, SHAPEIT, or EAGLE (default is EAGLE2), imputes using Minimac3 imputation engine and outputs Blocked GNU Zip Format VCF files (.vcf.gz
). These .vcf.gz
files are used as input for gwasurvivr
. Minimac uses slightly different metrics to assess imputation quality (\(R^2\) versus info score) and complete details as to minimac output are available on the Minimac3 Wikipage.
The function, michiganCoxSurv
uses a modification of Cox proportional hazard regression from the R library survival:::coxph.fit
. Built specifically for genetic data, michiganCoxSurv
allows the user to filter on info (\(R^2\)) score (imputation quality metric) and minor allele frequency from the reference panel used for imputation using RefPanelAF
as the input arguement for maf.filter
. Users are also provided with the sample minor allele frequency (MAF) in the sangerCoxSurv
output, which can be used for filtering post analysis.
Samples can be selected for analyses by providing a vector of sample.ids
. The output from Sanger imputation server returns the samples as SAMP1, ..., SAMPN
, where N
is the total number of samples. The sample order corresponds to the sample order in the vcf.file
used for imputation. Note, sample order can also be found in the .fam
file if genotyping data were initially in .bed
, .bim
and .fam
(PLINK) format prior to conversion to VCF. If no sample list is specified all samples are included in the analyses.
vcf.file <- system.file(package="gwasurvivr",
"extdata",
"michigan.chr14.dose.vcf.gz")
pheno.fl <- system.file(package="gwasurvivr",
"extdata",
"simulated_pheno.txt")
pheno.file <- read.table(pheno.fl,
sep=" ",
header=TRUE,
stringsAsFactors = FALSE)
head(pheno.file)
## ID_1 ID_2 event time age DrugTxYes sex group
## 1 1 SAMP1 0 12.00 33.93 0 male control
## 2 2 SAMP2 1 7.61 58.71 1 male experimental
## 3 3 SAMP3 0 12.00 39.38 0 female control
## 4 4 SAMP4 0 4.30 38.85 0 male control
## 5 5 SAMP5 0 12.00 43.58 0 male experimental
## 6 6 SAMP6 1 2.60 57.74 0 male control
# recode sex column and remove first column
pheno.file$SexFemale <- ifelse(pheno.file$sex=="female", 1L, 0L)
# select only experimental group sample.ids
sample.ids <- pheno.file[pheno.file$group=="experimental",]$ID_2
head(sample.ids)
## [1] "SAMP2" "SAMP5" "SAMP7" "SAMP9" "SAMP11" "SAMP12"
In this example, we will select samples from the experimental
group and will test survival only on these patients. The first column in the pheno.file
are sample IDs, which link the phenotype file to the imputation file. We include age
, DrugTxYes
, and sex
in the survival model as covariates.
We perform the analysis using the experimental
group to demonstrate how one may want to prepare their data if interested in testing only on a subset of samples (i.e. a case-control study and survival of cases is of interest). Note that how the IDs (sample.ids
) need to be a vector of class character
. The chunk.size
refers to size of each data chunk read in and is defaulted to 10,000 rows. Users can customize that to their needs. The larger the chunk.size
the more memory (RAM) required to run the analysis. The recommended chunk.size=10000
and probably should not exceed chunk.size=100000
. This does not mean that gwasurvivr
is limited to only 100,000 SNPs, it just is how many SNPs are analyzed for each iteration.
By default survival estimates and pvalues for the SNP adjusted for other covariates are outputted (print.covs='only'
), however users can select print.covs=all
to get the coefficient estimates for covariates included in the model. Depending on the number of covariates included this can add substantially to output file size.
Single SNP analysis
Next we run michiganCoxSurv
with the default, print.covs="only"
, load the results into R and provide descriptions of output by column. We will then run the analysis again using print.covs="all"
. verbose=TRUE
is used for these examples so the function display messages while running.
Use ?michiganCoxSurv
for argument specific documentation.
print.covs="only"
michiganCoxSurv(vcf.file=vcf.file,
covariate.file=pheno.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term=NULL,
print.covs="only",
out.file="michigan_only",
r2.filter=0.3,
maf.filter=0.005,
chunk.size=100,
verbose=TRUE,
clusterObj=NULL)
## Analysis started on 2022-11-01 at 17:45:01
## Covariates included in the models are: age, DrugTxYes, SexFemale
## 52 samples are included in the analysis
## Analyzing chunk 0-100
## Analyzing chunk 100-200
## Analysis completed on 2022-11-01 at 17:45:15
## 0 SNPs were removed from the analysis for not meeting the threshold criteria.
## List of removed SNPs can be found in /tmp/RtmptRLyXZ/michigan_only1c2f7b4b3cb6a.snps_removed
## 3 SNPs were analyzed in total
## The survival output can be found at /tmp/RtmptRLyXZ/michigan_only1c2f7b4b3cb6a.coxph
Here we load the data and glimpse the first few values in each column outputted from the SNP*interaction survival analyses using print.covs="only"
.
michigan_only <- read.table("michigan_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(michigan_only))
## 'data.frame': 3 obs. of 21 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : logi FALSE FALSE FALSE
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ REF : chr "C" "G" "A"
## $ ALT : chr "T" "T" "G"
## $ AF : num 0.301 0.515 0.52
## $ MAF : num 0.301 0.485 0.48
## $ SAMP_FREQ_ALT: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ R2 : num 0.552 0.479 0.481
## $ ER2 : logi NA NA NA
## $ PVALUE : num 0.713 0.805 0.798
## $ HR : num 1.224 0.885 0.881
## $ HR_lowerCI : num 0.417 0.333 0.332
## $ HR_upperCI : num 3.6 2.35 2.34
## $ Z : num 0.368 -0.246 -0.255
## $ COEF : num 0.203 -0.123 -0.127
## $ SE.COEF : num 0.55 0.498 0.498
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21
SNP with covariate interaction
A SNP*covariate interaction can be implemented using the inter.term
argument. In this example, we will use DrugTxYes
from the covariate file as the covariate we want to test for interaction with the SNP.
print.covs="only"
michiganCoxSurv(vcf.file=vcf.file,
covariate.file=pheno.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term="DrugTxYes",
print.covs="only",
out.file="michigan_intx_only",
r2.filter=0.3,
maf.filter=0.005,
chunk.size=100,
verbose=FALSE,
clusterObj=NULL)
Here we load the data and glimpse the first few values in each column outputted from the SNP*interaction survival analyses using print.covs="only"
.
michigan_intx_only <- read.table("michigan_intx_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(michigan_intx_only))
## 'data.frame': 3 obs. of 21 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : logi FALSE FALSE FALSE
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ REF : chr "C" "G" "A"
## $ ALT : chr "T" "T" "G"
## $ AF : num 0.301 0.515 0.52
## $ MAF : num 0.301 0.485 0.48
## $ SAMP_FREQ_ALT: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ R2 : num 0.552 0.479 0.481
## $ ER2 : logi NA NA NA
## $ PVALUE : num 0.9017 0.0806 0.0786
## $ HR : num 1.15 7.03 7.11
## $ HR_lowerCI : num 0.127 0.789 0.799
## $ HR_upperCI : num 10.4 62.7 63.2
## $ Z : num 0.124 1.747 1.759
## $ COEF : num 0.139 1.951 1.961
## $ SE.COEF : num 1.12 1.12 1.11
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21
Sanger Imputation Server
Sanger Imputation Server pre-phases typed genotypes using either SHAPEIT or EAGLE, imputes genotypes using PBWT algorithm and outputs a .vcf.gz
file for each chromosome. These .vcf.gz
files are used as input for gwasurvivr
. The function, sangerCoxSurv
uses a modification of Cox proportional hazard regression from survival::coxph.fit
. Built specifically for genetic data, sangerCoxSurv
allows the user to filter on info score (imputation quality metric) and minor allele frequency from the reference panel used for imputation using RefPanelAF
as the input arguement for maf.filter
. Users are also provided with the sample minor allele frequency in the sangerCoxSurv
output.
Samples can be selected for analyses by providing a vector of sample.ids
. The output from Sanger imputation server returns the samples as SAMP1, ..., SAMPN
, where N
is the total number of samples. The sample order corresponds to the sample order in the VCF file used for imputation. Note, sample order can also be found in the .fam
file if genotyping data were initially in .bed
, .bim
and .fam
(PLINK) format prior to conversion to VCF. If no sample list is specified, all samples are included in the analyses.
In this example, we will select samples from the experimental
group and will test survival only on these patients. The first column in the pheno.file
are sample IDs (we will match on these). We include age
, DrugTxYes
, and sex
in the survival model as covariates.
vcf.file <- system.file(package="gwasurvivr",
"extdata",
"sanger.pbwt_reference_impute.vcf.gz")
pheno.fl <- system.file(package="gwasurvivr",
"extdata",
"simulated_pheno.txt")
pheno.file <- read.table(pheno.fl,
sep=" ",
header=TRUE,
stringsAsFactors = FALSE)
head(pheno.file)
## ID_1 ID_2 event time age DrugTxYes sex group
## 1 1 SAMP1 0 12.00 33.93 0 male control
## 2 2 SAMP2 1 7.61 58.71 1 male experimental
## 3 3 SAMP3 0 12.00 39.38 0 female control
## 4 4 SAMP4 0 4.30 38.85 0 male control
## 5 5 SAMP5 0 12.00 43.58 0 male experimental
## 6 6 SAMP6 1 2.60 57.74 0 male control
# recode sex column and remove first column
pheno.file$SexFemale <- ifelse(pheno.file$sex=="female", 1L, 0L)
# select only experimental group sample.ids
sample.ids <- pheno.file[pheno.file$group=="experimental",]$ID_2
head(sample.ids)
## [1] "SAMP2" "SAMP5" "SAMP7" "SAMP9" "SAMP11" "SAMP12"
We perform the analysis using the experimental
group to demonstrate how one may want to prepare their data if not initially all samples are patients or cases (i.e. a case-control study and survival of cases is of interest). We also are showing how the IDs (sample.ids
) need to be a vector of class character
. The chunk.size
refers to size of each data chunk read in and is defaulted to 10,000 rows, users can customize that to their needs. The larger the chunk.size
the more memory (RAM) required to run the analysis. The recommended chunk.size=10000
and probably should not exceed chunk.size=100000
. This does not mean that gwasurvivr
is limited to only 100,000 SNPs, it just is how many SNPs are analyzed for each iteration.
By default survival estimates and pvalues for the SNP adjusted for other covariates are outputted (print.covs='only'
), however users can select print.covs=all
to get the coefficient estimates for covariates included in the model. Depending on the number of covariates included this can add substantially to output file size.
Use ?sangerCoxSurv
for argument specific documentation.
Single SNP analysis
Next we run sangerCoxSurv
with the default, print.covs="only"
, load the results into R and provide descriptions of output by column. verbose=TRUE
is used for these examples so the function display messages while running.
print.covs="only"
sangerCoxSurv(vcf.file=vcf.file,
covariate.file=pheno.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term=NULL,
print.covs="only",
out.file="sanger_only",
info.filter=0.3,
maf.filter=0.005,
chunk.size=100,
verbose=TRUE,
clusterObj=NULL)
## Analysis started on 2022-11-01 at 17:45:31
## Covariates included in the models are: age, DrugTxYes, SexFemale
## 52 samples are included in the analysis
## Analyzing chunk 0-100
## Analyzing chunk 100-200
## Analysis completed on 2022-11-01 at 17:45:53
## 0 SNPs were removed from the analysis for not meeting the threshold criteria.
## List of removed SNPs can be found in /tmp/RtmptRLyXZ/sanger_only1c2f7b5a1c6469.snps_removed
## 3 SNPs were analyzed in total
## The survival output can be found at /tmp/RtmptRLyXZ/sanger_only1c2f7b5a1c6469.coxph
Here we load the data and glimpse the first few values in each column from the survival analyses.
str(head(sanger_only))
## 'data.frame': 3 obs. of 19 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : logi FALSE FALSE FALSE
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ REF : chr "C" "G" "A"
## $ ALT : chr "T" "T" "G"
## $ RefPanelAF : num 0.301 0.515 0.52
## $ SAMP_FREQ_ALT: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ INFO : num 0.552 0.479 0.481
## $ PVALUE : num 0.713 0.805 0.798
## $ HR : num 1.224 0.885 0.881
## $ HR_lowerCI : num 0.417 0.333 0.332
## $ HR_upperCI : num 3.6 2.35 2.34
## $ Z : num 0.368 -0.246 -0.255
## $ COEF : num 0.203 -0.123 -0.127
## $ SE.COEF : num 0.55 0.498 0.498
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21
Column names with descriptions from the survival analyses with covariates, specifying the default print.covs="only"
SNP with covariate interaction
A SNP*covariate interaction can be implemented using the inter.term
argument. In this example, we will use DrugTxYes
from the covariate file as the covariate we want to test for interaction with the SNP.
print.covs="only"
sangerCoxSurv(vcf.file=vcf.file,
covariate.file=pheno.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term="DrugTxYes",
print.covs="only",
out.file="sanger_intx_only",
info.filter=0.3,
maf.filter=0.005,
chunk.size=100,
verbose=TRUE,
clusterObj=NULL)
IMPUTE2 Imputation
IMPUTE2 is a genotype imputation and haplotype phasing program (Howie et al 2009). IMPUTE2 outputs 6 files for each chromosome chunk imputed (usually 5 MB in size). Only 2 of these files are required for analyses using gwasurvivr
:
- Genotype file (
.impute
)
- Sample file (
.sample
)
More information can be read about these formats
We are going to load in and pre-process the genetic data and the phenotypic data (covariate.file
).
impute.file <- system.file(package="gwasurvivr",
"extdata",
"impute_example.impute2.gz")
sample.file <- system.file(package="gwasurvivr",
"extdata",
"impute_example.impute2_sample")
covariate.file <- system.file(package="gwasurvivr",
"extdata",
"simulated_pheno.txt")
covariate.file <- read.table(covariate.file,
sep=" ",
header=TRUE,
stringsAsFactors = FALSE)
covariate.file$SexFemale <- ifelse(covariate.file$sex=="female", 1L, 0L)
sample.ids <- covariate.file[covariate.file$group=="experimental",]$ID_2
To perform survival analysis using IMPUTE2 the function arguments are very similarto michiganCoxSurv
and sangerCoxSurv
, however the function now takes a chromosome arguement. This is needed to properly annotate the file output with the chromosome that these SNPs are in. This is purely an artifact of IMPUTE2 and how we leverage GWASTools
in this function.
Single SNP analysis
First we will do the analysis with no interaction term, followed by doing the analysis with the interaction term. The recommended output setting for single SNP analysis is print.cov="only"
.
impute2CoxSurv(impute.file=impute.file,
sample.file=sample.file,
chr=14,
covariate.file=covariate.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term=NULL,
print.covs="only",
out.file="impute_example_only",
chunk.size=100,
maf.filter=0.005,
exclude.snps=NULL,
flip.dosage=TRUE,
verbose=TRUE,
clusterObj=NULL)
## Covariates included in the models are: age, DrugTxYes, SexFemale
## 52 samples are included in the analysis
## Analysis started on 2022-11-01 at 17:46:07
## Determining number of SNPs and samples...
## Including all SNPs.
## scan.df not given. Assigning scanIDs automatically.
## Reading sample file...
## Reading genotype file...
## Block 1 of 1
## Writing annotation...
## Compressing...
## Clean up the fragments of GDS file:
## open the file '/tmp/RtmptRLyXZ/1c2f7b19630744.gds' (3.3K)
## # of fragments: 30
## save to '/tmp/RtmptRLyXZ/1c2f7b19630744.gds.tmp'
## rename '/tmp/RtmptRLyXZ/1c2f7b19630744.gds.tmp' (2.4K, reduced: 987B)
## # of fragments: 14
## ***** Compression time ******
## User:0.078
## System: 0.008
## Elapsed: 0.086
## *****************************
## Analyzing part 1/1...
## 0 SNPs were removed from the analysis for not meeting
## the given threshold criteria or for having MAF = 0
## List of removed SNPs are saved to /tmp/RtmptRLyXZ/impute_example_only1c2f7b203447a7.snps_removed
## In total 3 SNPs were included in the analysis
## The Cox model results output was saved to /tmp/RtmptRLyXZ/impute_example_only1c2f7b203447a7.coxph
## Analysis completed on 2022-11-01 at 17:46:08
Here we load the data and glimpse the first few values in each column output.
impute2_res_only <- read.table("impute_example_only.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_only))
## 'data.frame': 3 obs. of 17 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : chr "---" "---" "---"
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ A0 : chr "C" "G" "A"
## $ A1 : chr "T" "T" "G"
## $ exp_freq_A1: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ PVALUE : num 0.713 0.805 0.798
## $ HR : num 1.224 0.885 0.881
## $ HR_lowerCI : num 0.417 0.333 0.332
## $ HR_upperCI : num 3.6 2.35 2.34
## $ Z : num 0.368 -0.246 -0.255
## $ COEF : num 0.203 -0.123 -0.127
## $ SE.COEF : num 0.55 0.498 0.498
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21
SNP covariate interaction
Now we will perform a SNP*covariate interaction survival analysis using impute2CoxSurv
.
impute2CoxSurv(impute.file=impute.file,
sample.file=sample.file,
chr=14,
covariate.file=covariate.file,
id.column="ID_2",
sample.ids=sample.ids,
time.to.event="time",
event="event",
covariates=c("age", "SexFemale", "DrugTxYes"),
inter.term="DrugTxYes",
print.covs="only",
out.file="impute_example_intx",
chunk.size=100,
maf.filter=0.005,
flip.dosage=TRUE,
verbose=FALSE,
clusterObj=NULL,
keepGDS=FALSE)
## Determining number of SNPs and samples...
## Including all SNPs.
## scan.df not given. Assigning scanIDs automatically.
## Reading sample file...
## Reading genotype file...
## Block 1 of 1
## Writing annotation...
## Compressing...
## Clean up the fragments of GDS file:
## open the file '/tmp/RtmptRLyXZ/1c2f7b5b62a40.gds' (3.3K)
## # of fragments: 30
## save to '/tmp/RtmptRLyXZ/1c2f7b5b62a40.gds.tmp'
## rename '/tmp/RtmptRLyXZ/1c2f7b5b62a40.gds.tmp' (2.4K, reduced: 987B)
## # of fragments: 14
## ***** Compression time ******
## User:0.069
## System: 0.012
## Elapsed: 0.08
## *****************************
Here we load the data and glimpse the first few values in each column outputted from the SNP*interaction survival analyses using print.covs="only"
.
impute2_res_intx <- read.table("impute_example_intx.coxph", sep="\t", header=TRUE, stringsAsFactors = FALSE)
str(head(impute2_res_intx))
## 'data.frame': 3 obs. of 17 variables:
## $ RSID : chr "rs34919020" "rs8005305" "rs757545375"
## $ TYPED : chr "---" "---" "---"
## $ CHR : int 14 14 14
## $ POS : int 19459185 20095842 20097287
## $ A0 : chr "C" "G" "A"
## $ A1 : chr "T" "T" "G"
## $ exp_freq_A1: num 0.355 0.517 0.52
## $ SAMP_MAF : num 0.355 0.483 0.48
## $ PVALUE : num 0.9017 0.0806 0.0786
## $ HR : num 1.15 7.03 7.11
## $ HR_lowerCI : num 0.127 0.789 0.799
## $ HR_upperCI : num 10.4 62.7 63.2
## $ Z : num 0.124 1.747 1.759
## $ COEF : num 0.139 1.951 1.961
## $ SE.COEF : num 1.12 1.12 1.11
## $ N : int 52 52 52
## $ N.EVENT : int 21 21 21