Package version: SNPhood 1.1.9

1 Important note regarding the SNPhood version

SNPhood is under active development, and we highly recommend using the newest available version. In particular, we recommend using the devel branch of Bioconductor / SNPhood to make sure you have the latest features and bugfixes. If you are not sure how to switch to the devel branch, contact us; we are happy to help!

2 Motivation, Necessity, Package Scope and Limitations

2.1 Motivation and Necessity

Figure 1 - SNPhood logo.



To date, thousands of single nucleotide polymorphisms (SNPs) have been found to be associated with complex traits and diseases. However, the vast majority of these disease-associated SNPs lie in the non-coding part of the genome and are likely to affect regulatory elements, such as enhancers and promoters, rather than the function of a protein. Thus, to understand the molecular mechanisms underlying genetic traits and diseases, it becomes increasingly important to study the effect of a SNP on nearby molecular traits such as chromatin environment or transcription factor (TF) binding. Towards this aim, we developed SNPhood, a user-friendly Bioconductor [9] R package to investigate, quantify and visualize the local epigenetic neighborhood of a set of SNPs in terms of chromatin marks or TF binding sites, using data from NGS experiments. SNPhood comprises a set of easy-to-use functions to extract, normalize and quantify reads for a genomic region, perform data quality checks, normalize read counts using input files, and investigate binding patterns using unsupervised clustering. In addition, SNPhood can be employed for identifying and visualizing allele-specific binding patterns around SNPs using a robust permutation-based FDR procedure. The regions around each SNP can be binned in a user-defined fashion to allow for analyses ranging from very broad regions to highly detailed investigations of specific binding shapes. Importantly, SNPhood supports integration with genotype information to investigate and visualize genotype-specific binding patterns.

2.2 Package scope and limitations

In this section, we want to explicitly mention the designated scope of the SNPhood package, its limitations, and additional / companion packages that may be used beforehand or subsequently.

First, let’s be clear what SNPhood is NOT:

Instead, SNPhood aims to fill an existing gap for an increasingly common task: Current workflows for analyzing ChIP-Seq data typically involve peak calling, which summarizes the signal of each binding event into two numbers, enrichment and peak size, and usually neglects additional factors like binding shape. However, when a set of regions of interest (ROI) is already at hand (e.g. GWAS SNPs, quantitative trait loci (QTLs), etc.), a comprehensive and unbiased analysis of the molecular neighborhood of these regions, potentially in combination with allele-specific (AS) binding analyses, is better suited to investigating the underlying (epi-)genomic regulatory mechanisms than simply comparing peak sizes. Currently, such analyses are often carried out “by hand” using basic NGS tools and genome-browser-like interfaces to visualize molecular phenotype data independently for each ROI. A tool for the systematic analysis of the local molecular neighborhood of regions of interest is currently lacking. SNPhood fills this gap: it investigates, quantifies, and visualizes the local epigenetic neighborhood of regions of interest using chromatin or TF binding data from NGS experiments. It provides a set of tools that are largely complementary to currently existing software for analyzing ChIP-Seq data.

Figure 2 - SNPhood feature summary and scope. Comparison and distinction of SNPhood with regard to commonly used tools for ChIP-Seq/RNA-Seq data. Green, yellow and red: Feature fully, partially or not supported, respectively.

3 Basic Mode of Action

When running the main function analyzeSNPhood, a series of steps and calculations is performed. The basic mode of action is summarized in the following schematic:

Figure 3 - Basic mode of action of SNPhood. See also Figure 4 for a more detailed schematic.



More specifically, the mode of action and basic workflow are as follows (see also Figure 4):

  1. Initiate the analysis and set all parameters accordingly
  2. Parse and validate SNPs (or other user defined genomic regions)
  3. Split the SNP regions into bins
  4. ITERATE OVER ALL UNIQUE COMBINATIONS OF INPUT FILES OR FILE SETS
    4.1. Parse all files that belong to the input set
       4.1.1. Extract reads overlapping with the user regions
       4.1.2. Compute the genome-wide average for each
       4.1.3. Filter reads by strand (if applicable)
       4.1.4. Determine overlaps per region and bin
    4.2. If applicable (multiple files have been integrated), normalize the counts among each other and adjust genome-wide averages to account for different library sizes
    4.3. ITERATE OVER ALL UNIQUE INDIVIDUALS WITH THE GIVEN SET OF INPUT FILES
       4.3.1. Parse all files that belong to the individual
            4.3.1.1. Extract reads overlapping with the user regions
            4.3.1.2. Determine the genotype distribution at each SNP based on all overlapping reads
            4.3.1.3. Filter reads by strand (if applicable)
            4.3.1.4. If applicable, select reads specifically for each read group and determine the number of overlaps per region and bin
       4.3.2. If applicable (multiple files have been integrated), normalize the counts among each other to adjust for different library sizes
       4.3.3. If applicable, normalize the counts using the previously processed input to calculate read enrichment
  5. If applicable and requested, normalize read counts among individuals only
  6. If requested, integrate matching SNP genotypes
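
Steps 3 and 4.1.4 (splitting a SNP region into bins and counting read overlaps per bin) can be illustrated with a minimal Python sketch. This is not SNPhood's implementation, which is written in R; the function names make_bins and count_reads_per_bin, the region and bin sizes, and the read coordinates are all hypothetical. Reads are simplified to half-open (start, end) intervals on a single chromosome, and a read that spans several bins is counted once in each of them.

```python
def make_bins(snp_pos, region_size, bin_size):
    """Split a window of region_size bp centered on snp_pos into consecutive bins.

    Returns half-open (start, end) intervals; the last bin may be shorter
    if region_size is not a multiple of bin_size.
    """
    start = snp_pos - region_size // 2
    end = start + region_size
    return [(s, min(s + bin_size, end)) for s in range(start, end, bin_size)]


def count_reads_per_bin(bins, reads):
    """Count, for each bin, the reads that overlap it by at least one base."""
    counts = []
    for bin_start, bin_end in bins:
        n = sum(
            1
            for read_start, read_end in reads
            if read_start < bin_end and read_end > bin_start
        )
        counts.append(n)
    return counts


# Hypothetical example: a 100 bp region around a SNP at position 1000,
# split into four bins of 25 bp each.
bins = make_bins(snp_pos=1000, region_size=100, bin_size=25)
reads = [(940, 960), (970, 1010), (1040, 1060), (2000, 2050)]
counts = count_reads_per_bin(bins, reads)
print(bins)
print(counts)
```

Smaller bins resolve the binding shape in more detail, while a single bin per region reduces the analysis to one count per SNP.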

The following figure is an extension of Figure 3 and depicts in more detail the workflow of the package when running the function analyzeSNPhood. In addition, some parameters and their mode of action are highlighted at the step in which they come into play (depicted in orange).

Figure 4 - Basic mode of action of SNPhood (extended).
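
The normalization steps of the workflow (4.2, 4.3.2 and 4.3.3) can be pictured with the following Python sketch. It shows one simple way to adjust for library size and to compute enrichment over input, and is an assumption for illustration only: the scaling rule, the pseudocount, and the function names are hypothetical and do not describe the method SNPhood actually uses.

```python
def scale_to_common_library_size(counts_by_file, library_sizes):
    """Scale per-bin counts so that all files correspond to the smallest library."""
    target = min(library_sizes.values())
    return {
        name: [c * target / library_sizes[name] for c in counts]
        for name, counts in counts_by_file.items()
    }


def enrichment(signal_counts, input_counts, pseudocount=1.0):
    """Per-bin ratio of signal over input; the pseudocount avoids division by zero."""
    return [
        (s + pseudocount) / (i + pseudocount)
        for s, i in zip(signal_counts, input_counts)
    ]


# Hypothetical per-bin counts and total read numbers for one signal/input pair.
counts = {"signal": [20, 40, 10], "input": [10, 10, 10]}
sizes = {"signal": 2_000_000, "input": 1_000_000}

scaled = scale_to_common_library_size(counts, sizes)
ratios = enrichment(scaled["signal"], scaled["input"])
print(scaled)
print(ratios)
```

Without the library-size adjustment, the signal file in this example would appear twice as enriched simply because it was sequenced twice as deeply.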

4 Input

The following input data are required for the SNPhood package (see also Figure 6):

Figure 5 - Supported formats of the user regions file. See section Parameters for details.

The following data frames are all valid as input for SNPhood and its main function analyzeSNPhood:

signal      input
file1.bam   NA
file2.bam   NA
file3.bam   NA

signal      input                    individual   genotype
file1.bam   input1.bam               S1           file1.vcf:colName1
file2.bam   input1.bam               S1           file1.vcf:colName1
file3.bam   input2.bam, input3.bam   S2           file1.vcf:colName2
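
The column conventions visible in the second example (several input files separated by commas, and a genotype given as a VCF file and a column name separated by a colon) can be made explicit with a short Python sketch. The parsing rules are read off the example rows above and are an assumption, not a specification of SNPhood's parser; parse_row and the dictionary layout are hypothetical.

```python
from collections import defaultdict


def parse_row(row):
    """Split the 'input' and 'genotype' fields of one table row into their parts."""
    raw_input = row.get("input")
    if raw_input in (None, "NA"):
        inputs = []  # no input (control) file for this signal file
    else:
        inputs = [f.strip() for f in raw_input.split(",") if f.strip()]

    vcf_file, vcf_column = None, None
    genotype = row.get("genotype")
    if genotype not in (None, "NA"):
        # Split at the last colon: file name on the left, column name on the right.
        vcf_file, _, vcf_column = genotype.rpartition(":")

    return {
        "signal": row["signal"],
        "inputs": inputs,
        "individual": row.get("individual"),
        "vcf_file": vcf_file,
        "vcf_column": vcf_column,
    }


# The rows of the second example table.
rows = [
    {"signal": "file1.bam", "input": "input1.bam", "individual": "S1",
     "genotype": "file1.vcf:colName1"},
    {"signal": "file2.bam", "input": "input1.bam", "individual": "S1",
     "genotype": "file1.vcf:colName1"},
    {"signal": "file3.bam", "input": "input2.bam, input3.bam", "individual": "S2",
     "genotype": "file1.vcf:colName2"},
]
parsed = [parse_row(r) for r in rows]

# Group signal files by individual, mirroring the per-individual iteration.
signal_by_individual = defaultdict(list)
for p in parsed:
    signal_by_individual[p["individual"]].append(p["signal"])
print(dict(signal_by_individual))
```

In this reading, file1.bam and file2.bam belong to the same individual S1 and share one input file, whereas file3.bam has two input files.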

In summary, the following data can be integrated into SNPhood: