PCACheck {SeqSQC} | R Documentation |
Function to perform principal component analysis for all samples and to infer sample ancestry.
PCACheck(seqfile, remove.samples = NULL, LDprune = TRUE, missing.rate = 0.1, ss.cutoff = 300, maf = 0.01, hwe = 1e-06, ...)
seqfile |
SeqSQC object, which includes the merged gds file for study cohort and benchmark. |
remove.samples |
a vector of sample names to remove from the PCA calculation. These can be problematic samples identified in previous QC steps, or user-defined samples. |
LDprune |
whether to use the LD-pruned SNP set; the default is TRUE. |
missing.rate |
to use only the SNPs with missing rate <= missing.rate (0.1 by default). |
ss.cutoff |
the minimum sample size (300 by default) to apply the MAF filter. This sample size is the sum of study samples and the benchmark samples of the same population as the study cohort. |
maf |
to use only the SNPs with minor allele frequency >= maf (0.01 by default). |
hwe |
to use only the SNPs with Hardy-Weinberg equilibrium p >= hwe (1e-06 by default). |
... |
Arguments to be passed to other methods. |
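The way the filtering arguments interact can be sketched in base R. This is only an illustration under stated assumptions: the data frame snp_stats, its columns, and the helper filter_snps are hypothetical and are not part of SeqSQC; only the argument names and default values are taken from the usage line above.

```r
# Hypothetical per-SNP summary statistics (made up for illustration).
snp_stats <- data.frame(
  snp     = paste0("rs", 1:5),
  missing = c(0.00, 0.05, 0.20, 0.02, 0.10),
  maf     = c(0.30, 0.005, 0.25, 0.10, 0.02),
  hwe_p   = c(0.50, 0.80, 0.40, 1e-08, 0.01),
  stringsAsFactors = FALSE
)

# Illustrative filter: argument names mirror PCACheck, the logic is an assumption.
filter_snps <- function(stats, n.samples, missing.rate = 0.1,
                        ss.cutoff = 300, maf = 0.01, hwe = 1e-06) {
  # Missing-rate and Hardy-Weinberg filters are always applied.
  keep <- stats$missing <= missing.rate & stats$hwe_p >= hwe
  # The MAF filter is applied only when the sample size reaches ss.cutoff.
  if (n.samples >= ss.cutoff) {
    keep <- keep & stats$maf >= maf
  }
  stats$snp[keep]
}

kept_large <- filter_snps(snp_stats, n.samples = 500)  # MAF filter active
kept_small <- filter_snps(snp_stats, n.samples = 100)  # MAF filter skipped
```

In this sketch the rare variant is dropped only in the larger cohort, which reflects the role of ss.cutoff: with few samples, a MAF estimate is too noisy to filter on.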
Using LD-pruned autosomal variants (by default), we calculate the eigenvectors and eigenvalues for principal component analysis (PCA). We use the benchmark samples as the training dataset and predict the population group of each sample in the study cohort based on the top four eigenvectors. Samples with discordant predicted and self-reported population groups are considered problematic. The function PCACheck performs the PCA and identifies population outliers in the study cohort.
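The train-on-benchmark, predict-on-study idea can be sketched with base R alone. Everything here is an assumption made for illustration: the genotypes are simulated, the population labels are hypothetical, and a simple nearest-centroid rule stands in for whatever classifier PCACheck uses internally, which this page does not specify.

```r
set.seed(1)

# Simulated genotypes (0/1/2 allele counts): 3 hypothetical populations, 200 SNPs.
n_per_pop <- 30
n_snp <- 200
pops <- c("popA", "popB", "popC")
# Independent allele frequencies per population, so the groups separate clearly.
freqs <- sapply(pops, function(p) runif(n_snp, 0.05, 0.95))
geno <- do.call(rbind, lapply(pops, function(p) {
  matrix(rbinom(n_per_pop * n_snp, 2, rep(freqs[, p], each = n_per_pop)),
         nrow = n_per_pop)
}))
true_pop <- rep(pops, each = n_per_pop)

# First 20 samples of each population act as benchmark, the rest as study cohort.
is_benchmark <- rep(rep(c(TRUE, FALSE), c(20, 10)), times = length(pops))
study <- !is_benchmark

# PCA on all samples; keep the sample scores on the top four components.
pca <- prcomp(geno, center = TRUE, scale. = FALSE)
ev <- pca$x[, 1:4]

# "Training": one centroid per benchmark population in the 4-dimensional space.
centroids <- t(sapply(pops, function(p) {
  colMeans(ev[is_benchmark & true_pop == p, ])
}))

# Prediction: assign each study sample to the closest benchmark centroid.
predict_pop <- function(scores) {
  d <- apply(centroids, 1, function(ctr) sum((scores - ctr)^2))
  names(which.min(d))
}
predicted <- apply(ev[study, ], 1, predict_pop)

res <- data.frame(
  sample    = paste0("S", which(study)),
  reported  = true_pop[study],
  predicted = predicted,
  stringsAsFactors = FALSE
)
# Mislabel one self-reported population to show how a discordant sample is flagged.
res$reported[1] <- "popB"
res$discordant <- res$reported != res$predicted
```

Rows of res with discordant equal to TRUE correspond to the samples that would be considered problematic under this rule.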
a data frame with sample name, reported population, data resource (benchmark vs study cohort), the first four eigenvectors and the predicted population.
Qian Liu qliu7@buffalo.edu
load(system.file("extdata", "example.seqfile.Rdata", package="SeqSQC"))
gfile <- system.file("extdata", "example.gds", package="SeqSQC")
seqfile <- SeqSQC(gdsfile = gfile, QCresult = QCresult(seqfile))
seqfile <- PCACheck(seqfile, remove.samples=NULL, LDprune=TRUE, missing.rate=0.1)
res.pca <- QCresult(seqfile)$PCA
tail(res.pca)