Data Processing
- Functions for Data Analysis
- Parallel Implementation
Integration with SNPRelate
Integration with SeqVarTools
Resources
Session Information
References

Data Processing

Functions for Data Analysis

Table 1: The key functions in the SeqArray package.

Function	Description
seqSetFilter	Sets a filter to sample or variant (i.e., define a subset of data).
seqGetData	Gets data from a sequence GDS file (from a subset of data).
seqApply	Applies a user-defined function over array margins.
seqParallel	Applies functions in parallel.
seqParallelSetup	Setups a cluster environment for parallel computing.
—
seqNumAllele	Numbers of alleles per site.
seqMissing	Missing genotype percentages.
seqAlleleFreq	Allele frequencies.
seqAlleleCount	Allele counts.
…

Parallel Implementation

The default setting for the analysis functions in the SeqArray package is serial implementation, but users can setup a cluster computing environment manually via seqParallelSetup() and distribute the calculations to multiple cores even 100 cluster nodes.

library(SeqArray)

## Loading required package: gdsfmt

# 1000 Genomes, Phase 1, chromosome 22
(gds.fn <- seqExampleFileName("KG_Phase1"))

## [1] "/tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds"

# open a GDS file
genofile <- seqOpen(gds.fn)

seqSummary(genofile)

## File: /tmp/RtmpkGyHWJ/Rinst6c674a987c26/SeqArray/extdata/1KG_phase1_release_v3_chr22.gds
## Reference: GRCh37
## Ploidy: 2
## Number of samples: 1092
## Number of variants: 19773
## Chromosomes:
##    22  <NA> 
## 19773     0 
## Number of alleles per site:
##     2 
## 19773 
## Annotation, INFO variable(s):
## Annotation, FORMAT variable(s):
## Annotation, sample variable(s):
##  Sample
##  Family.ID
##  Population
##  Population.Description
##  Gender
##  Relationship
##  Unexpected.Parent.Child
##  Non.Paternity
##  Siblings
##  Grandparents
##  Avuncular
##  Half.Siblings
##  Unknown.Second.Order
##  Third.Order
##  Other.Comments

# use 2 cores for the demonstration
seqParallelSetup(2)

## Enable the computing cluster with 2 forked R processes.

# numbers of alleles per site
table(seqNumAllele(genofile))

## 
##     2 
## 19773

# reference allele frequencies
summary(seqAlleleFreq(genofile, ref.allele=0))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.9725  0.9963  0.9286  0.9991  1.0000

# close the cluster environment
seqParallelSetup(FALSE)

## Stop the computing cluster.

Integration with SNPRelate

The SNPRelate package is developed to accelerate two key computations in genome-wide association studies: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures. The kernels of SNPRelate are written in C/C++ and have been highly optimized for multi-core symmetric multiprocessing computer architectures. The genotypes in SeqArray format are converted to categorical dosages of reference alleles (0,1,2,NA), which are the data format used in the SNPRelate pacakge.

library(SNPRelate)

## SNPRelate -- supported by Streaming SIMD Extensions 2 (SSE2)

LD-based SNP pruning

It is suggested to use a pruned set of SNPs which are in approximate linkage equilibrium with each other to avoid the strong influence of SNP clusters in principal component analysis and relatedness analysis.

set.seed(1000)

# may try different LD thresholds for sensitivity analysis
snpset <- snpgdsLDpruning(genofile, ld.threshold=0.2)

## SNP pruning based on LD:
## Excluding 0 SNP on non-autosomes
## Excluding 122 SNPs (monomorphic: TRUE, < MAF: NaN, or > missing rate: NaN)
## Working space: 1092 samples, 19651 SNPs
##  Using 1 (CPU) core
##  Sliding window: 500000 basepairs, Inf SNPs
##  |LD| threshold: 0.2
## Chromosome 22: 37.66%, 7447/19773
## 7447 SNPs are selected in total.

names(snpset)

## [1] "chr22"

head(snpset$chr22)  # snp.id

## [1] 1 3 4 6 7 8

# get all selected snp id
snpset.id <- unlist(snpset)

Principal Component Analysis

# Run PCA
pca <- snpgdsPCA(genofile, snp.id=snpset.id, num.thread=2)

## Principal Component Analysis (PCA) on SNP genotypes:
## Excluding 12326 SNPs on non-autosomes
## Excluding 0 SNP (monomorphic: TRUE, < MAF: NaN, or > missing rate: NaN)
## Working space: 1092 samples, 7447 SNPs
##  Using 2 (CPU) cores
## PCA: the sum of all working genotypes (0, 1 and 2) = 15598846
## PCA: Sun Nov 22 19:42:21 2015    0%
## PCA: Sun Nov 22 19:42:23 2015    100%
## PCA: Sun Nov 22 19:42:23 2015    Begin (eigenvalues and eigenvectors)
## PCA: Sun Nov 22 19:42:23 2015    End (eigenvalues and eigenvectors)

# variance proportion (%)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))

## [1] 1.22 0.84 0.29 0.28 0.27 0.26

Population information are available:

pop.code <- factor(seqGetData(genofile, "sample.annotation/Population"))
head(pop.code)

## [1] GBR GBR GBR GBR GBR GBR
## Levels: ASW CEU CHB CHS CLM FIN GBR IBS JPT LWK MXL PUR TSI YRI

popgroup <- list(
    EastAsia = c("CHB", "JPT", "CHS", "CDX", "KHV", "CHD"),
    European = c("CEU", "TSI", "GBR", "FIN", "IBS"),
    African  = c("ASW", "ACB", "YRI", "LWK", "GWD", "MSL", "ESN"),
    SouthAmerica = c("MXL", "PUR", "CLM", "PEL"),
    India = c("GIH", "PJL", "BEB", "STU", "ITU"))

colors <- sapply(levels(pop.code), function(x) {
    for (i in 1:length(popgroup)) {
        if (x %in% popgroup[[i]])
            return(names(popgroup)[i])
    }
    NA
    })
colors <- as.factor(colors)
legend.text <- sapply(levels(colors), function(x) paste(levels(pop.code)[colors==x], collapse=","))
legend.text

##               African              EastAsia              European 
##         "ASW,LWK,YRI"         "CHB,CHS,JPT" "CEU,FIN,GBR,IBS,TSI" 
##          SouthAmerica 
##         "CLM,MXL,PUR"

# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    Population = pop.code,
    stringsAsFactors = FALSE)
head(tab)

##   sample.id         EV1        EV2 Population
## 1   HG00096 -0.01514785 0.03691694        GBR
## 2   HG00097 -0.01236666 0.02596987        GBR
## 3   HG00099 -0.01042858 0.03828143        GBR
## 4   HG00100 -0.01514046 0.02695951        GBR
## 5   HG00101 -0.01188685 0.02858924        GBR
## 6   HG00102 -0.01509719 0.03023233        GBR

# draw
plot(tab$EV2, tab$EV1, pch=20, cex=0.75, main="1KG Phase 1, chromosome 22",
    xlab="eigenvector 2", ylab="eigenvector 1", col=colors[tab$Population])
legend("topleft", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75)

Relatedness Analysis

For relatedness analysis, Identity-By-Descent (IBD) estimation in SNPRelate can be done by the method of moments (MoM) (Purcell et al., 2007).

# YRI samples
sample.id <- seqGetData(genofile, "sample.id")
YRI.id <- sample.id[pop.code == "YRI"]

# Estimate IBD coefficients
ibd <- snpgdsIBDMoM(genofile, sample.id=YRI.id, maf=0.05, missing.rate=0.05, num.thread=2)

## IBD analysis (PLINK method of moment) on SNP genotypes:
## Excluding 0 SNP on non-autosomes
## Excluding 14681 SNPs (monomorphic: TRUE, < MAF: 0.05, or > missing rate: 0.05)
## Working space: 88 samples, 5092 SNPs
##  Using 2 (CPU) cores
## PLINK IBD:   the sum of all working genotypes (0, 1 and 2) = 650888
## PLINK IBD:   Sun Nov 22 19:42:24 2015    0%
## PLINK IBD:   Sun Nov 22 19:42:24 2015    100%

# Make a data.frame
ibd.coeff <- snpgdsIBDSelection(ibd)
head(ibd.coeff)

##       ID1     ID2        k0         k1    kinship
## 1 NA18486 NA18487 0.8143602 0.18563975 0.04640994
## 2 NA18486 NA18489 0.9485724 0.03564293 0.01680307
## 3 NA18486 NA18498 0.8465192 0.15348080 0.03837020
## 4 NA18486 NA18499 0.6819247 0.31807533 0.07951883
## 5 NA18486 NA18501 0.9460962 0.00000000 0.02695192
## 6 NA18486 NA18502 0.9525580 0.03802225 0.01421544

plot(ibd.coeff$k0, ibd.coeff$k1, xlim=c(0,1), ylim=c(0,1), xlab="k0", ylab="k1", main="YRI samples (MoM)")
lines(c(0,1), c(1,0), col="red", lty=2)

Identity-By-State Analysis

For \(n\) study individuals, snpgdsIBS() can be used to create a \(n \times n\) matrix of genome-wide average IBS pairwise identities. To perform cluster analysis on the \(n \times n\) matrix of genome-wide IBS pairwise distances, and determine the groups by a permutation score:

set.seed(1000)
ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2))

## Identity-By-State (IBS) analysis on SNP genotypes:
## Excluding 0 SNP on non-autosomes
## Excluding 122 SNPs (monomorphic: TRUE, < MAF: NaN, or > missing rate: NaN)
## Working space: 1092 samples, 19651 SNPs
##  Using 2 (CPU) cores
## IBS: the sum of all working genotypes (0, 1 and 2) = 39869601
## IBS: Sun Nov 22 19:42:25 2015    0%
## IBS: Sun Nov 22 19:42:28 2015    100%

Here is the population information we have known:

# Determine groups of individuals by population information
rv <- snpgdsCutTree(ibs.hc, samp.group=as.factor(colors[pop.code]))

## Create 4 groups.

plot(rv$dendrogram, leaflab="none", main="1KG Phase 1, chromosome 22",
    edgePar=list(col=rgb(0.5,0.5,0.5,0.75), t.col="black"))
legend("bottomleft", legend=legend.text, col=1:length(legend.text), pch=19, cex=0.75, ncol=4)

# close the GDS file
seqClose(genofile)

Integration with SeqVarTools

An R/Bioconductor package SeqVarTools is available on Bioconductor, which defines S4 classes and methods for other common operations and analyses on SeqArray datasets.

Resources

CoreArray C++ project: http://corearray.sourceforge.net/
gdsfmt R package: http://github.com/zhengxwen/gdsfmt, http://www.bioconductor.org/packages/release/bioc/html/gdsfmt.html
SeqArray R package: http://github.com/zhengxwen/SeqArray, http://www.bioconductor.org/packages/release/bioc/html/SeqArray.html
SNPRelate R package: http://github.com/zhengxwen/SNPRelate, http://www.bioconductor.org/packages/release/bioc/html/SNPRelate.html

Session Information

sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] SNPRelate_1.4.0 SeqArray_1.10.6 gdsfmt_1.6.2   
## 
## loaded via a namespace (and not attached):
##  [1] AnnotationDbi_1.32.1       knitr_1.11                
##  [3] XVector_0.10.0             magrittr_1.5              
##  [5] GenomicAlignments_1.6.1    GenomicRanges_1.22.1      
##  [7] BiocGenerics_0.16.1        zlibbioc_1.16.0           
##  [9] IRanges_2.4.4              BiocParallel_1.4.0        
## [11] BSgenome_1.38.0            stringr_1.0.0             
## [13] GenomeInfoDb_1.6.1         tools_3.2.2               
## [15] SummarizedExperiment_1.0.1 parallel_3.2.2            
## [17] Biobase_2.30.0             DBI_0.3.1                 
## [19] lambda.r_1.1.7             futile.logger_1.4.1       
## [21] htmltools_0.2.6            yaml_2.1.13               
## [23] digest_0.6.8               rtracklayer_1.30.1        
## [25] formatR_1.2.1              futile.options_1.0.0      
## [27] S4Vectors_0.8.3            bitops_1.0-6              
## [29] biomaRt_2.26.1             RCurl_1.95-4.7            
## [31] RSQLite_1.0.0              evaluate_0.8              
## [33] rmarkdown_0.8.1            stringi_1.0-1             
## [35] GenomicFeatures_1.22.5     Biostrings_2.38.2         
## [37] Rsamtools_1.22.0           XML_3.98-1.3              
## [39] stats4_3.2.2               VariantAnnotation_1.16.3

References

Purcell S, Neale B, Todd-Brown K, et al., 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81(3):559-75.

A Brief Introduction to Data Analytics on SeqArray GDS Files

Xiuwen Zheng (Department of Biostatistics, University of Washington, Seattle)

Sep 13, 2015