Population Structure and Relatedness Inference using the GENESIS Package

Matthew P. Conomos

2018-08-02

Overview

GENESIS provides statistical methodology for analyzing genetic data from samples with population structure and/or familial relatedness. This vignette provides a description of how to use GENESIS for inferring population structure, as well as estimating relatedness measures such as kinship coefficients, identity by descent (IBD) sharing probabilities, and inbreeding coefficients. GENESIS uses PC-AiR for population structure inference that is robust to known or cryptic relatedness, and it uses PC-Relate for accurate relatedness estimation in the presence of population structure, admixutre, and departures from Hardy-Weinberg equilibrium.

Relatedness Estimation Adjusted for Principal Components (PC-Relate)

Many estimators exist that use genome-wide SNP genotype data from samples in genetic studies to estimate measures of recent genetic relatedness such as pairwise kinship coefficients, pairwise IBD sharing probabilities, and individual inbreeding coefficients. However, many of these estimators either make simplifying assumptions that do not hold in the presence of population structure and/or ancestry admixture, or they require reference population panels of known ancestry from pre-specified populations. When assumptions are violated, these estimators can provide highly biased estimates.

The PC-Relate method is used to accurately estimate measures of recent genetic relatedness in samples with unknown or unspecified population structure without requiring reference population panels. PC-Relate uses ancestry representative principal components to account for sample ancestry differences and provide estimates that are robust to population structure, ancestry admixture, and departures from Hardy-Weinberg equilibirum.

Data

Reading in Genotype Data

The functions in the GENESIS package read genotype data from a GenotypeData class object as created by the GWASTools package. Through the use of GWASTools, a GenotypeData class object can easily be created from:

Example R code for creating a GenotypeData object is presented below. Much more detail can be found in the GWASTools package reference manual.

R Matrix

  • genotype is a matrix of genotype values coded as 0 / 1 / 2, where rows index SNPs and columns index samples
  • snpID is an integer vector of unique SNP IDs
  • chromosome is an integer vector specifying the chromosome of each SNP
  • position is an integer vector specifying the position of each SNP
  • scanID is a vector of unique individual IDs

GDS files

  • filename is the file path to the GDS object

HapMap Data

To demonstrate PC-AiR and PC-Relate analyses with the GENESIS package, we analyze SNP data from the Mexican Americans in Los Angeles, California (MXL) and African American individuals in the southwestern USA (ASW) population samples of HapMap 3. Mexican Americans and African Americans have a diverse ancestral background, and familial relatives are present in these data. Genotype data at a subset of 20K autosomal SNPs for 173 individuals are provided as a GDS file.

## An object of class GenotypeData 
##  | data:
## File: /tmp/RtmpSJrQVL/Rinst2522ed3f97b/GENESIS/extdata/HapMap_ASW_MXL_geno.gds (901.8K)
## +    [  ] *
## |--+ sample.id   { Int32,factor 173 ZIP(40.9%), 283B } *
## |--+ snp.id   { Int32 20000 ZIP(34.6%), 27.1K }
## |--+ snp.position   { Int32 20000 ZIP(34.6%), 27.1K }
## |--+ snp.chromosome   { Int32 20000 ZIP(0.13%), 103B }
## \--+ genotype   { Bit2 20000x173, 844.7K } *
##  | SNP Annotation:
## NULL
##  | Scan Annotation:
## NULL

Relatedness Estimation Adjusted for Principal Components (PC-Relate)

Running PC-Relate

PC-Relate uses the ancestry representative principal components (PCs) calculated from PC-AiR to adjust for the population structure and ancestry of individuals in the sample and provide accurate estimates of recent genetic relatedness due to family structure. The pcrelate function performs the PC-Relate analysis.

The training.set input of the pcrelate function allows for the specification of which samples are used to estimate the ancestry adjustment at each SNP. The adjustment tends to perform best when close relatives are excluded from training.set, so the individuals in the “unrelated subset” from the PC-AiR analysis are typically a good choice. However, when an “unrelated subset” is not available, the adjustment still works well when estimated using all samples (training.set = NULL).

## Running Analysis with 20000 SNPs - in 2 Block(s)
## Running Analysis with 173 Samples - in 1 Block(s)
## Using 2 PC(s) in pcMat to Calculate Adjusted Estimates
## Using 97 Samples in training.set to Estimate PC effects on Allele Frequencies
## Computing Relatedness Estimates...
## ...SNP Block 1 of 2 Completed - 5.583 secs
## ...SNP Block 2 of 2 Completed - 5.397 secs
## Performing Small Sample Correction...

If estimates of IBD sharing probabilities are not desired, then setting the input ibd.probs = FALSE will speed up the computation.

Output from PC-Relate

The pcrelate function will either return an object of class pcrelate (when the input write.to.gds = FALSE) or it will save the output to a GDS file (when the input write.to.gds = TRUE). Saving the output to a GDS file is useful for large samples, as it allows for efficient storage and access to the estimates (see the package gdsfmt for more details). The following command can be used to read in the results from a previous PC-Relate analysis saved to the GDS file “tmp_pcrelate.gds”

Functions are provided for easily reading the output from pcrelate (either a class pcrelate object or a GDS file) and making a table of pairwise relatedness estimates, a table of individual inbreeding coeficients, and a genetic relationship matrix (GRM).

##         ID1     ID2  nsnp        kin           k0        k1           k2
## 7   NA19919 NA19908 19437 0.24668482 0.0015676062 1.0157801 -0.017347662
## 17  NA19919 NA19909 19626 0.24792942 0.0000000000 0.9911198  0.008880185
## 138 NA20282 NA20301 19765 0.32326286 0.0573556808 0.5780613  0.364583049
## 186 NA19902 NA19901 19640 0.25122134 0.0000000000 0.9883502  0.011649796
## 212 NA19902 NA19900 19771 0.23177090 0.0007574672 1.0228805 -0.023637964
## 365 NA20335 NA20337 19906 0.06867109 0.7346726524 0.2559703  0.009357025
## 493 NA20340 NA20349 19679 0.08084154 0.6820468648 0.3125401  0.005413026
## 501 NA20340 NA20344 19657 0.05710043 0.7614258883 0.2487465 -0.010172410
## 513 NA20297 NA20281 19669 0.06444652 0.7620783007 0.2180573  0.019864365
## 623 NA20290 NA20289 19666 0.25641191 0.0022904721 1.0001161 -0.002406525
## 625 NA20290 NA20333 19731 0.04482144 0.8393821454 0.1419500  0.018667900
## 634 NA20295 NA20294 19769 0.25275105 0.0023046668 1.0044206 -0.006725302
## 649 NA20346 NA20349 19765 0.05540436 0.7728600478 0.2326625 -0.005522529
## 657 NA20346 NA20344 19743 0.05144352 0.7917347083 0.2107565 -0.002491201
## 722 NA20349 NA20344 19878 0.27963012 0.1882682417 0.5316348  0.280096947
##         ID  nsnp          f
## 11 NA20335 19949 0.03210545
## 13 NA19904 19814 0.02276732
## 20 NA20317 19933 0.02302650
## 30 NA20294 19827 0.02705174
## 37 NA20344 19905 0.03607983
## 38 NA20333 19902 0.03498920
##              NA19919     NA19916      NA19835      NA20282     NA19703
## NA19919  0.971883727 0.011129327 -0.029404169  0.009562848 0.032625513
## NA19916  0.011129327 1.004740236  0.007354497  0.001717844 0.008775792
## NA19835 -0.029404169 0.007354497  0.970464272 -0.010680929 0.002279565
## NA20282  0.009562848 0.001717844 -0.010680929  0.989649727 0.016076041
## NA19703  0.032625513 0.008775792  0.002279565  0.016076041 1.000048656

References