This vignette introduces the SNPhood package, its main features, and the necessary background. In addition, we generated a workflow vignette that shows an example analysis using real data from the companion package SNPhoodData (see section Workflow).
SNPhood is under active development, and we highly recommend using the newest available version. In particular, we recommend using the devel branch of Bioconductor / SNPhood to make sure you have the latest features and bugfixes. If you are not sure how to switch to the devel branch, contact us; we are happy to help!
To date, thousands of single nucleotide polymorphisms (SNPs) have been found to be associated with complex traits and diseases. However, the vast majority of these disease-associated SNPs lie in the non-coding part of the genome and are likely to affect regulatory elements, such as enhancers and promoters, rather than the function of a protein. Thus, to understand the molecular mechanisms underlying genetic traits and diseases, it becomes increasingly important to study the effect of a SNP on nearby molecular traits such as chromatin environment or transcription factor (TF) binding. Towards this aim, we developed SNPhood, a user-friendly Bioconductor [9] R package to investigate, quantify and visualize the local epigenetic neighborhood of a set of SNPs in terms of chromatin marks or TF binding sites, using data from NGS experiments. SNPhood comprises a set of easy-to-use functions to extract and quantify reads for a genomic region, perform data quality checks, normalize read counts using input files, and investigate the binding pattern using unsupervised clustering. In addition, SNPhood can be employed for identifying and visualizing allele-specific binding patterns around SNPs using a robust permutation-based FDR procedure. The regions around each SNP can be binned in a user-defined fashion to allow for analyses ranging from very broad regions to highly detailed investigation of specific binding shapes. Importantly, SNPhood supports integration with genotype information to investigate and visualize genotype-specific binding patterns.
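
To make the idea of user-defined binning concrete, the following base R sketch splits the window around a single SNP into consecutive bins. It is only an illustration: the helper `bin_region` and its arguments `snp_pos`, `flank` and `bin_size` are hypothetical names chosen for this example and are not part of the SNPhood API, and the package's own binning may differ in detail.

```r
# Hypothetical helper, NOT part of SNPhood: split the window
# [snp_pos - flank, snp_pos + flank] into consecutive bins of bin_size bp.
# The last bin is truncated at the region end and may therefore be shorter.
bin_region <- function(snp_pos, flank, bin_size) {
  region_start <- snp_pos - flank
  region_end   <- snp_pos + flank
  starts <- seq(region_start, region_end, by = bin_size)
  ends   <- pmin(starts + bin_size - 1, region_end)
  data.frame(bin = seq_along(starts), start = starts, end = ends)
}

# Same region, two resolutions: a few broad bins versus many narrow bins.
bins_broad <- bin_region(snp_pos = 10000, flank = 500, bin_size = 250)
bins_fine  <- bin_region(snp_pos = 10000, flank = 500, bin_size = 50)
```

Large bins give a coarse summary of the signal across the region, whereas small bins resolve the shape of the binding pattern at the cost of fewer reads per bin.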
In this section, we want to explicitly mention the designated scope of the SNPhood package, its limitations, and additional / companion packages that may be used subsequently or beforehand.
First, let’s be clear what SNPhood is NOT:
Instead, SNPhood aims to fill an existing gap for an increasingly common task: Current workflows for analyzing ChIP-Seq data typically involve peak calling, which summarizes the signal of each binding event into two numbers, enrichment and peak size, and usually neglects additional factors like binding shape. However, when a set of regions of interest (ROI) is already at hand - e.g. GWAS SNPs, quantitative trait loci (QTLs), etc. - a comprehensive and unbiased analysis of the molecular neighborhood of these regions, potentially in combination with allele-specific (AS) binding analyses, will be better suited to investigate the underlying (epi-)genomic regulatory mechanisms than simply comparing peak sizes. Currently, such analyses are often carried out “by hand” using basic NGS tools and genome-browser-like interfaces to visualize molecular phenotype data independently for each ROI. A tool for systematic analysis of the local molecular neighborhood of regions of interest is currently lacking. SNPhood fills this gap to investigate, quantify, and visualize the local epigenetic neighborhood of regions of interest using chromatin or TF binding data from NGS experiments. It provides a set of tools that are largely complementary to currently existing software for analyzing ChIP-Seq data.
When running the main function analyzeSNPhood, a series of steps and calculations is performed. The basic mode of action is summarized in the following schematic:
More specifically, the mode of action and basic workflow are as follows (see also Figure 4):
The following figure is an extension of Figure 2 and depicts the workflow of the package when running the function analyzeSNPhood in more detail. In addition, some parameters and their mode of action are highlighted (depicted in orange) at the step in which they come into play.
The SNPhood package requires the following input data, which are also visualized in Figure 6:
The following data frames are all valid as input for SNPhood and its main function analyzeSNPhood:
signal | input |
---|---|
file1.bam | NA |
file2.bam | NA |
file3.bam | NA |
signal | input | individual | genotype |
---|---|---|---|
file1.bam | input1.bam | S1 | file1.vcf:colName1 |
file2.bam | input1.bam | S1 | file1.vcf:colName1 |
file3.bam | input2.bam, input3.bam | S2 | file1.vcf:colName2 |
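
As a minimal sketch, the two layouts shown in the tables above could be built in base R as follows. The object names `files_minimal.df` and `files_full.df` are chosen for this example only, the file names are the placeholders from the tables, and whether the columns should be stored as character or factor is not specified here and is an assumption of this sketch.

```r
# Minimal layout: signal files only; input is NA for every row.
files_minimal.df <- data.frame(
  signal = c("file1.bam", "file2.bam", "file3.bam"),
  input  = NA_character_,
  stringsAsFactors = FALSE
)

# Full layout: input files, individual IDs and genotype columns.
# The genotype entries follow the "file:column" pattern from the table.
files_full.df <- data.frame(
  signal     = c("file1.bam", "file2.bam", "file3.bam"),
  input      = c("input1.bam", "input1.bam", "input2.bam, input3.bam"),
  individual = c("S1", "S1", "S2"),
  genotype   = c("file1.vcf:colName1", "file1.vcf:colName1", "file1.vcf:colName2"),
  stringsAsFactors = FALSE
)

# Illustration only: split the comma-separated input entries to see
# which input files belong to each signal file.
inputs_per_signal <- lapply(strsplit(files_full.df$input, ","), trimws)
```

Note that the third row lists two input files in a single cell, separated by a comma, and that the first two signal files share both the same input file and the same individual.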
In summary, the following data can be integrated into SNPhood: