1 Introduction

The GBScleanR package has been developed to conduct error correction on genotype data obtained via NGS-based genotyping methods such as RAD-seq and GBS (Miller et al. 2007; Elshire et al. 2011). It is designed to estimate true genotypes along chromosomes from given allele read counts in the VCF file generated by SNP callers like GATK and TASSEL-GBS (McKenna et al. 2010; Glaubitz et al. 2014). The current implementation supports genotype data of a mapping population derived from two or more diploid founders followed by selfings, sibling crosses, or random crosses. e.g. F\(_2\) and 8-way RILs. Our method supposes markers to be biallelic and ordered along chromosomes by mapping reads on a reference genome sequence. To support smooth access to large size genotype data, every input VCF file is first converted to a genomic data structure (GDS) file (Zheng et al. 2012). The current implementation cannot convert a VCF file containing non-biallelic markers to a GDS file correctly. Thus, any input VCF file should be subjected to filtering for retaining only biallelic SNPs using, for example, bcftools (Li and Barrett 2011). GBScleanR provides functions for data visualization, filtering, and loading/writing a VCF file. Furthermore, the data structure of the GDS file created via this package is compatible with those used in the SNPRelate, GWASTools and GENESIS packages those are designed to handle large variant data and conduct regression analysis (Zheng et al. 2012; Gogarten et al. 2012, 2019). In this document, we first walk through the utility functions implemented in GBScleanR to introduce a basic usage. Then, the core function of GBScleanR which estimates true genotypes for error correction will be introduced.

2 Prerequisites

This package internally uses the following packages.
- ggplot2
- dplyr
- tidyr
- expm
- gdsfmt
- SeqArray
 

3 Installation

You can install GBScleanR from the Bioconductor repository with the following code.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GBScleanR")

 

The code below let you install the package from the github repository. The package released in the github usually get frequent update more than that in Bioconductor due to the release schedule.

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("tomoyukif/GBScleanR", build_vignettes = TRUE)

4 Loading data

The main class of the GBScleanR package is GbsrGenotypData which inherits the GenotypeData class in the SeqArray package. The gbsrGenotypeData class object has three slots: sample, marker, and scheme. The data slot holds genotype data as a gds.class object which is defined in the gdsfmt package while snpAnnot and scanAnnot contain objects storing annotation information of SNPs and samples, which are the SnpAnnotationDataFrame and ScanAnnotationDataFrame objects defined in the GWASTools package. See the vignette of GWASTools for more detail. GBScleanR follows the way of GWASTools in which a unique genotyping instance (genotyped sample) is called “sample”.
 

Load the package.

library("GBScleanR")

 

GBScleanR only supports a VCF file as input. As an example data, we use sample genotype data for a biparental F2 population derived from inbred founders.

vcf_fn <- system.file("extdata", "sample.vcf", package = "GBScleanR")
gds_fn <- tempfile("sample", fileext = ".gds")

 

As mentioned above, the GbsrGenotypeData class requires genotype data in the gds.class object which enable us quick access to the genotype data without loading the whole data on RAM. At the beginning of the processing, we need to convert data format of our genotype data from VCF to GDS. This conversion can be achieved using gbsrVCF2GDS() as shown below. A compressed VCF file (.vcf.gz) also can be the input.

# `force = TRUE` allow the function to over write the GDS file,
# even if a GDS file exists at `out_fn`.
gbsrVCF2GDS(vcf_fn = vcf_fn, out_fn = gds_fn, force = TRUE, verbose = FALSE)
## [1] "/tmp/Rtmp2LGE7i/sample27687b16763629.gds"

 

Once we converted the VCF to the GDS, we can create a GbsrGenotypeData instance for our data.

gds <- loadGDS(gds_fn, verbose = FALSE)

The first time to load a newly produced GDS file will take long time due to data reformatting for quick access.

5 Utility methods

5.1 Getters

Getter functions allow you to retrieve basic information of genotype data, e.g. the number of SNPs and samples, chromosome names, physical position of SNPs and alleles.

# Number of samples
nsam(gds)
## [1] 102

 

# Number of SNPs
nmar(gds) 
## [1] 242

 

# Indices of chromosome ID of all markers
head(getChromosome(gds)) 
## [1] "1" "1" "1" "1" "1" "1"

 

# Chromosome names of all markers
head(getChromosome(gds)) 
## [1] "1" "1" "1" "1" "1" "1"

 

# Position (bp) of all markers
head(getPosition(gds)) 
## [1] 522289 571177 577905 720086 735019 841286

 

# Reference allele of all markers
head(getAllele(gds)) 
## [1] "A,G" "A,G" "C,T" "G,C" "C,T" "G,T"

 

# SNP IDs
head(getMarID(gds)) 
## [1] 1 2 3 4 5 6

 

# sample IDs
head(getSamID(gds)) 
## [1] "F2_1054" "F2_1055" "F2_1059" "F2_1170" "F2_1075" "F2_1070"

 

The function getGenotype() returns genotype call data in which integer numbers 0, 1, and 2 indicate the number of reference allele.

geno <- getGenotype(gds)

 

The function getRead() returns read count data as a list with two elements ref and alt containing reference read counts and alternative read counts, respectively.

geno <- getRead(gds)

5.2 Data summarization

5.2.1 Collect basic summary statistics

countGenotype() and countRead() are class methods of GbsrGenotypeData and they summarize genotype counts and read counts per marker and per sample.

gds <- countGenotype(gds)
gds <- countRead(gds)

 

5.2.2 Visualize basic summary statistics

These summary statistics can be visualized via plotting functions. With the values obtained via countGenotype(), we can plot histograms of missing rate , heterozygosity, reference allele frequency as shown below.

# Histgrams of missing rate
histGBSR(gds, stats = "missing") 

Missing rate per marker and per sample.  

# Histgrams of heterozygosity
histGBSR(gds, stats = "het") 

Heterozygosity per marker and per sample.  

# Histgrams of reference allele frequency
histGBSR(gds, stats = "raf")