This document provides an introduction to the affinity test for large sets of SNP-motif interactions using the atSNP package(affinity test for regulatory SNP detection) (Zuo, Shin, and Keles 2015). atSNP implements in-silico methods for identifying SNPs that potentially may affect binding affinity of transcription factors. Given a set of SNPs and a library of motif position weight matrices (PWMs), atSNP provides two main functions for analyzing SNP effects:
atSNP implements the importance sampling algorithm in Chan, Zhang, and Chen (2010) to compute the p-values. Compared to other bioinformatics tools, such as FIMO (Grant, Bailey, and Nobel 2011) and is-rSNP (Macintyre et al. 2010) that provide similar functionalities, atSNP avoids computing the p-values analytically. In one of our research projects, we have used atSNP to evaluate interactions between 26K SNPs and 2K motifs within 5 hours. We found no other existing tools can finish the analysis of such a scale.
atSNP depends on the following packages:
In addition, users need to install the annotation package BSgenome from http://www.bioconductor.org/packages/3.0/data/annotation/ that corresponds to the species type and genome version. Our example SNP data set in the subsequent sections corresponds to the hg38 version of human genome. If users wish to annotate the SNP location and allele information given their rs ids, they also need to install the corresponding SNPlocs package. Notice that the annotation packages are usually large and this installation step may take a substantial amount of time.
atSNP includes two motif libraries in the package: the ENCODE derived motif library, and the JASPAR database motif library. In addition, atSNP can load user defined motif libraries in a variety of formats.
atSNP provides a default motif library downloaded from http://compbio.mit.edu/encode-motifs/motifs.txt. This library contains 2065 known and discovered motifs from ENCODE TF ChIP-seq data sets. The following commands allow to load this motif library:
library(atSNP)
data(encode_library)
length(encode_motif)
## [1] 2065
encode_motif[1]
## $SIX5_disc1
## [,1] [,2] [,3] [,4]
## [1,] 8.51100e-03 4.2550e-03 0.987234 1.00000e-10
## [2,] 9.02127e-01 1.2766e-02 0.038298 4.68090e-02
## [3,] 4.55319e-01 7.2340e-02 0.344681 1.27660e-01
## [4,] 2.51064e-01 8.5106e-02 0.085106 5.78724e-01
## [5,] 1.00000e-10 4.6809e-02 0.012766 9.40425e-01
## [6,] 1.00000e-10 1.0000e-10 1.000000 1.00000e-10
## [7,] 3.82980e-02 2.1277e-02 0.029787 9.10638e-01
## [8,] 9.44681e-01 4.2550e-03 0.051064 1.00000e-10
## [9,] 1.00000e-10 1.0000e-10 1.000000 1.00000e-10
## [10,] 1.00000e-10 1.0000e-10 0.012766 9.87234e-01
Here, the motif library is represented by encode_motif
,
which is a list of position weight matrices. The codes below show the content of one matrix as well as its IUPAC letters:
encode_motif[[1]]
## [,1] [,2] [,3] [,4]
## [1,] 8.51100e-03 4.2550e-03 0.987234 1.00000e-10
## [2,] 9.02127e-01 1.2766e-02 0.038298 4.68090e-02
## [3,] 4.55319e-01 7.2340e-02 0.344681 1.27660e-01
## [4,] 2.51064e-01 8.5106e-02 0.085106 5.78724e-01
## [5,] 1.00000e-10 4.6809e-02 0.012766 9.40425e-01
## [6,] 1.00000e-10 1.0000e-10 1.000000 1.00000e-10
## [7,] 3.82980e-02 2.1277e-02 0.029787 9.10638e-01
## [8,] 9.44681e-01 4.2550e-03 0.051064 1.00000e-10
## [9,] 1.00000e-10 1.0000e-10 1.000000 1.00000e-10
## [10,] 1.00000e-10 1.0000e-10 0.012766 9.87234e-01
GetIUPACSequence(encode_motif[[1]])
## [1] "GARWTGTAGT"
The data object encode_library
also contains a character vector encode_motifinfo
that contains detailed information for each motif.
length(encode_motifinfo)
## [1] 2065
head(encode_motifinfo)
## SIX5_disc1
## "SIX5_GM12878_encode-Myers_seq_hsa_r1:MEME#1#Intergenic"
## MYC_disc1
## "USF2_K562_encode-Snyder_seq_hsa_r1:MDscan#1#Intergenic"
## SRF_disc1
## "SRF_H1-hESC_encode-Myers_seq_hsa_r1:MDscan#2#Intergenic"
## AP1_disc1
## "JUND_K562_encode-Snyder_seq_hsa_r1:MEME#1#Intergenic"
## SIX5_disc2
## "SIX5_H1-hESC_encode-Myers_seq_hsa_r1:MDscan#1#Intergenic"
## NFY_disc1
## "NFYA_K562_encode-Snyder_seq_hsa_r1:MEME#2#Intergenic"
Here, the entry names of this vector are the same as the names of the motif library. encode_motifinfo
allows easy looking up of the motif information for a specific PWM. For example, to look up the motif information for the first PWM in encode_motifinfo
, use the following chunk of code:
encode_motifinfo[names(encode_motif[1])]
## SIX5_disc1
## "SIX5_GM12878_encode-Myers_seq_hsa_r1:MEME#1#Intergenic"
Our package also includes the JASPAR library downloaded from http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/pfm/nonredundant/pfm_all.txt. The data object jaspar_library
contains a list of 593 PWMs jaspar_motif
and a character vector jaspar_motifinfo
.
data(jaspar_library)
jaspar_motif[[1]]
## [,1] [,2] [,3] [,4]
## [1,] 0.20833333 0.70833333 0.04166667 0.04166667
## [2,] 0.83333333 0.04166667 0.08333333 0.04166667
## [3,] 0.04166667 0.87500000 0.04166667 0.04166667
## [4,] 0.04166667 0.04166667 0.87500000 0.04166667
## [5,] 0.04166667 0.04166667 0.04166667 0.87500000
## [6,] 0.04166667 0.04166667 0.87500000 0.04166667
jaspar_motifinfo[names(jaspar_motif[1])]
## MA0004.1
## "Arnt"
Users can also provide a list of PWMs as the motif library via the LoadMotifLibrary
function. In this function, ‘tag’ specifies the string that marks the start of each block of PWM; ‘skiprows’ is the number of description lines before the PWM; ‘skipcols’ is the number of columns to be skipped in the PWM matrix; ‘transpose’ is TRUE if the PWM has 4 rows representing A, C, G, T or FALSE if otherwise; ‘field’ is the position of the motif name within the description line; ‘sep’ is a vector of separators in the PWM; ‘pseudocount’ is the number added to the raw matrices, recommended to be 1 if the matrices are in fact position frequency matrices. These arguments provide the flexibility of loading a number of varying formatted files. The PWMs are returned as a list object. This function flexibly adapts to a variety of different formats. Some examples using online accessible files from other research groups are shown below.
## Source: http://meme.nbcr.net/meme/doc/examples/sample-dna-motif.meme-io
pwms <- LoadMotifLibrary(
urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/sample-dna-motif.meme-io.txt")
## Source: http://compbio.mit.edu/encode-motifs/motifs.txt
pwms <- LoadMotifLibrary(
urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/motifs.txt",
tag = ">", transpose = FALSE, field = 1,
sep = c("\t", " ", ">"), skipcols = 1,
skiprows = 1, pseudocount = 0)
## Source: http://johnsonlab.ucsf.edu/mochi_files/JASPAR_motifs_H_sapiens.txt
pwms <- LoadMotifLibrary(
urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/JASPAR_motifs_H_sapiens.txt",
tag = "/NAME",skiprows = 1, skipcols = 0, transpose = FALSE,
field = 2)
## Source: http://jaspar.genereg.net/html/DOWNLOAD/ARCHIVE/JASPAR2010/all_data/matrix_only/matrix.txt
pwms <- LoadMotifLibrary(
urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/matrix.txt",
tag = ">", skiprows = 1, skipcols = 1, transpose = TRUE,
field = 1, sep = c("\t", " ", "\\[", "\\]", ">"),
pseudocount = 1)
## Source: http://jaspar.genereg.net/html/DOWNLOAD/JASPAR_CORE/pfm/nonredundant/pfm_vertebrates.txt
pwms <- LoadMotifLibrary(
urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/pfm_vertebrates.txt",
tag = ">", skiprows = 1, skipcols = 0, transpose = TRUE, field = 1,
sep = c(">", "\t", " "), pseudocount = 1)
atSNP can load the SNP data in three formats: a table including full SNP information, a list of dbSNP’s rsids, and a pair of fasta files.
In this case, the table that provides the SNP information must include five columns:
This data set can be loaded using the LoadSNPData
function. The ‘genome.lib’ argument specifies the annotation package name corresponding to the SNP data set, such as ‘BSgenome.Hsapiens.UCSC.hg38’. Each side of the SNP is extended by a number of base pairs specified by the ‘half.window.size’ argument. LoadSNPData
extracts the genome sequence within such windows around each SNP using the ‘genome.lib’ package. An example is the following:
The following codes generate a synthetic SNP data and loads it back in :
data(example)
write.table(snp_tbl, file = "test_snp_file.txt",
row.names = FALSE, quote = FALSE)
snp_info <- LoadSNPData("test_snp_file.txt", genome.lib = "BSgenome.Hsapiens.UCSC.hg38", half.window.size = 30, default.par = TRUE, mutation = FALSE)
ncol(snp_info$sequence) == nrow(snp_tbl)
snp_info$rsid.rm
There are two important arguments in function LoadSNPData
. First, the ‘mutation’ argument specifies whether the data set is related to SNP or general single nucleotide mutation. By default, ‘mutation=FALSE’. In this case, LoadSNPData
get the nucleotides on the reference genome based on the genome coordinates specified by ‘chr’ and ‘snp’ and match them to ‘a1’ and ‘a2’ alleles from the BSgenome package. ‘a1’ and ‘a2’ nucleotides are assigned to the refrence or the SNP allele based on which one matches to the reference nucleotide. If neither allele matches to the reference nucleotide, the corresponding row in the SNP information file is discarded. These discarded SNPs are captured by the ‘rsid.rm’ field in the output. Alternatively, if ‘mutation=TRUE’, no row is discarded. LoadSNPData
takes the reference sequences around the SNP locations, replaces the reference nucleotides at the SNP locations by ‘a1’ nucleotides to construct the ‘reference’ sequences, and by ‘a2’ nucleotides to construct the ‘SNP’ sequences. Notice that in this case, in the subsequent analysis, whenever we refer to the ‘reference’ or the ‘SNP’ allele, it actually means the ‘a1’ or the ‘a2’ allele.
mutation_info <- LoadSNPData("test_snp_file.txt", genome.lib = "BSgenome.Hsapiens.UCSC.hg38", half.window.size = 30, default.par = TRUE, mutation = TRUE)
ncol(mutation_info$sequence) == nrow(snp_tbl)
file.remove("test_snp_file.txt")
Second, the ‘default.par’ argument specifies how to estimate the first order Markov model parameters. If ‘default.par = FALSE’, LoadSNPData
simultaneously estimates the parameters for the first order Markov model in the reference genome using the nucleotides within the SNP windows. Otherwise, it loads a set of parameter values pre-fitted from sequences around all the SNPs in the NHGRI GWAS catalog (Hindorff et al. 2009). We recommend setting ‘default.par = TRUE’ when we have fewer than 1000 SNPs. LoadSNPData
returns a list object with five fields:
LoadSNPData
also allows users to load a list of rsids for the SNPs. In this case, the function looks up the SNP location and the allele information using the annotation package specified by ‘snp.lib’, such as ‘SNPlocs.Hsapiens.dbSNP144.GRCh38’.
snp_info1 <- LoadSNPData(snpids = c("rs5050", "rs616488", "rs11249433", "rs182799", "rs12565013", "rs11208590"), genome.lib ="BSgenome.Hsapiens.UCSC.hg38", snp.lib = "SNPlocs.Hsapiens.dbSNP144.GRCh38", half.window.size = 30, default.par = TRUE, mutation = FALSE)
LoadSNPData
may warn about the SNPs with inconsistent information and returns them in the output. The ‘rsid.missing’ output field captures SNPs that are not included in the SNPlocs package. The ‘rsid.duplicate’ output field captures SNPs with more than 2 alleles based on SNPlocs package. The ‘rsid.rm’ output field captures SNPs whose nucleotides in the reference genome do not match to either allele provided by the data source. SNPs in the ‘rsid.missing’ and ‘rsid.rm’ fields are discarded. For SNPs in ‘rsid.duplicate’, we extract all pairs of alleles as reference and SNP pairs. If ‘mutation=TRUE’, we include all of them in the output. If ‘mutation=FALSE’, these pairs are further filtered based on whether one allele matches to the reference genome nucleotide. The remaining alleles are contained in the output.
Users can also provide SNP data through a pair of fasta files, one for the sequences around the SNP location for each allele. An example of such files is at http://pages.stat.wisc.edu/~keles/atSNP-Data/sample_1.fasta and http://pages.stat.wisc.edu/~keles/atSNP-Data/sample_2.fasta. We require that such a pair of fasta files must satisfy the following conditions:
Such a pair of files can be loaded by the function LoadFastaData
:
snp_info2 <- LoadFastaData(ref.urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/sample_1.fasta", snp.urlname="http://pages.stat.wisc.edu/~keles/atSNP-Data/sample_2.fasta", default.par = TRUE)
We use a toy example data set included in the package to introduce the usage of functions for affinity score tests.
data(example)
names(motif_library)
## [1] "SIX5_disc1" "MYC_disc1"
str(snpInfo)
## List of 10
## $ sequence_matrix: int [1:61, 1:2] 1 1 1 3 3 1 1 1 3 3 ...
## $ ref_base : int [1:2] 1 2
## $ snp_base : int [1:2] 3 4
## $ snpids : chr [1:2] "rs53576" "rs7412"
## $ transition : num [1:4, 1:4] 0.323 0.345 0.281 0.215 0.174 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "A" "C" "G" "T"
## .. ..$ : chr [1:4] "A" "C" "G" "T"
## $ prior : Named num [1:4] 0.287 0.211 0.213 0.289
## ..- attr(*, "names")= chr [1:4] "A" "C" "G" "T"
## $ rsid.na : NULL
## $ rsid.rm : NULL
## $ rsid.duplicate : NULL
## $ rsid.missing : NULL
## to look at the motif information
data(encode_library)
encode_motifinfo[names(motif_library)]
## SIX5_disc1
## "SIX5_GM12878_encode-Myers_seq_hsa_r1:MEME#1#Intergenic"
## MYC_disc1
## "USF2_K562_encode-Snyder_seq_hsa_r1:MDscan#1#Intergenic"
The binding affinity scores for all pairs of SNP and PWM can be computed by the ComputeMotifScore
function. It returns a list of two fields: ‘snp.tbl’ is a data.frame containing the nucleotide sequences for each SNP; ‘motif.scores’ is a data.frame containing the binding affinity scores for each SNP-motif pair.
atsnp.scores <- ComputeMotifScore(motif_library, snpInfo, ncores = 1)
atsnp.scores$snp.tbl
## snpid ref_seq
## 1 rs53576 AAAGGAAAGGTGTACGGGACATGCCCGAGGATCCTCAGTCCCACAGAAACAGGGAGGGGCT
## 2 rs7412 CTCCTCCGCGATGCCGATGACCTGCAGAAGCGCCTGGCAGTGTACCAGGCCGGGGCCCGCG
## snp_seq
## 1 AAAGGAAAGGTGTACGGGACATGCCCGAGGGTCCTCAGTCCCACAGAAACAGGGAGGGGCT
## 2 CTCCTCCGCGATGCCGATGACCTGCAGAAGTGCCTGGCAGTGTACCAGGCCGGGGCCCGCG
## ref_seq_rev
## 1 AGCCCCTCCCTGTTTCTGTGGGACTGAGGATCCTCGGGCATGTCCCGTACACCTTTCCTTT
## 2 CGCGGGCCCCGGCCTGGTACACTGCCAGGCGCTTCTGCAGGTCATCGGCATCGCGGAGGAG
## snp_seq_rev snpbase
## 1 AGCCCCTCCCTGTTTCTGTGGGACTGAGGACCCTCGGGCATGTCCCGTACACCTTTCCTTT G
## 2 CGCGGGCCCCGGCCTGGTACACTGCCAGGCACTTCTGCAGGTCATCGGCATCGCGGAGGAG T
atsnp.scores$motif.scores
## motif motif_len snpid log_lik_ref log_lik_snp log_lik_ratio
## 1 MYC_disc1 10 rs53576 -97.08772 -95.57417 -1.5135592
## 2 MYC_disc1 10 rs7412 -94.21702 -94.64204 0.4250229
## 3 SIX5_disc1 10 rs53576 -34.64551 -37.80486 3.1593576
## 4 SIX5_disc1 10 rs7412 -42.60672 -38.40830 -4.1984180
## log_enhance_odds log_reduce_odds ref_start snp_start ref_end snp_end
## 1 1.513559 -1.513559 31 31 40 40
## 2 23.025851 23.025851 29 25 38 34
## 3 -3.159358 3.159358 30 30 39 39
## 4 23.013003 -2.917768 29 22 38 31
## ref_strand snp_strand snpbase
## 1 - - G
## 2 + - T
## 3 + + G
## 4 - + T
The affinity scores for the reference and the SNP alleles are represented by the log_lik_ref
and log_lik_snp
columns in motif.scores
. The affinity score change is included in the log_lik_ratio
column. These three affinity scores are tested in the subsequent steps. motif.scores
also includes other columns for the position of the best matching subsequence on each allele. For a complete description on all these columns, users can look up the help documentation.
After we have computed the binding affinity scores, they can be tested using the ComputePValues
function. The result is a data.frame extending the affinity score table by six columns:
pval_ref
: p-value for the reference allele affinity score.pval_snp
: p-value for the SNP allele affinity score.pval_cond_ref
and pval_cond_snp
: conditional p-values for the affinity scores of the reference and SNP alleles.pval_diff
: p-value for the affinity score change between the two alleles.pval_rank
: p-value for the rank test between the two alleles.We recommend using pval_ref
and pval_snp
for assessing the significance of allele specific affinity; and using pval_rank
for assessing the significance of the SNP effect on the affinity change.
atsnp.result <- ComputePValues(motif.lib = motif_library, snp.info = snpInfo,
motif.scores = atsnp.scores$motif.scores, ncores = 1, testing.mc=TRUE)
## Finished testing motif No. 1
## Finished testing motif No. 2
atsnp.result
## motif motif_len snpid log_lik_ref log_lik_snp log_lik_ratio
## 1 MYC_disc1 10 rs53576 -97.08772 -95.57417 -1.5135592
## 2 MYC_disc1 10 rs7412 -94.21702 -94.64204 0.4250229
## 3 SIX5_disc1 10 rs53576 -34.64551 -37.80486 3.1593576
## 4 SIX5_disc1 10 rs7412 -42.60672 -38.40830 -4.1984180
## log_enhance_odds log_reduce_odds ref_start snp_start ref_end snp_end
## 1 1.513559 -1.513559 31 31 40 40
## 2 23.025851 23.025851 29 25 38 34
## 3 -3.159358 3.159358 30 30 39 39
## 4 23.013003 -2.917768 29 22 38 31
## ref_strand snp_strand snpbase pval_ref pval_snp pval_cond_ref
## 1 - - G 0.6807172 0.4666075 0.39744208
## 2 + - T 0.3679010 0.3953891 0.26291915
## 3 + + G 0.2350785 0.4109446 0.06520269
## 4 - + T 0.6652392 0.4280174 0.66523917
## pval_cond_snp pval_diff pval_rank
## 1 0.3624414 0.7020537 0.6007976
## 2 0.3148531 0.7647636 0.7348781
## 3 0.1480605 0.5153455 0.4672491
## 4 0.2499807 0.4284411 0.5485406
First, we can sort this output table according to the pval_rank
column:
head(atsnp.result[order(atsnp.result$pval_rank), c("snpid", "motif", "pval_ref", "pval_snp", "pval_rank")])
## snpid motif pval_ref pval_snp pval_rank
## 3 rs53576 SIX5_disc1 0.2350785 0.4109446 0.4672491
## 4 rs7412 SIX5_disc1 0.6652392 0.4280174 0.5485406
## 1 rs53576 MYC_disc1 0.6807172 0.4666075 0.6007976
## 2 rs7412 MYC_disc1 0.3679010 0.3953891 0.7348781
We can apply multiple testing adjustment to the p-values. atSNP does not implement any multiple testing adjustment internally. Users have the flexibility of choosing an adjustment method based on their specific application. For example, if we want to adjust pval_rank
from all pairs of SNP-PWM pairs using the Benjamini-Hochberg’s procedure, we may compute:
## snpid motif pval_rank pval_rank_bh
## 1 rs53576 MYC_disc1 0.6007976 0.7348781
## 2 rs7412 MYC_disc1 0.7348781 0.7348781
## 3 rs53576 SIX5_disc1 0.4672491 0.7348781
## 4 rs7412 SIX5_disc1 0.5485406 0.7348781
Alternatively, if we want to compute Storey’s q-values, we may utilize the qvalue package from :
library(qvalue)
qval_rank = qvalue(atsnp.result$pval_rank, pi0=0.1)$qvalues
atsnp.result = cbind(atsnp.result, qval_rank)
atSNP provides additional functions to extract the matched nucleotide subsequences that match to the motifs. The function MatchSubsequence
adds the subsequence matches to the affinity score table by using the motif library and the SNP set. The subsequences matching to the motif in the two alleles are returned in the ref_match_seq
and snp_match_seq
columns. The ‘IUPAC’ column returns the IUPAC letters of the motifs. Notice that if you have a large number of SNPs and motifs, the returned table can be very large.
match.subseq_result <- MatchSubsequence(snp.tbl = atsnp.scores$snp.tbl, motif.scores = atsnp.result, motif.lib = motif_library, snpids = c("rs53576", "rs7412"), motifs = names(motif_library)[1], ncores = 1)
match.subseq_result[c("snpid", "motif", "IUPAC", "ref_match_seq", "snp_match_seq")]
## snpid motif IUPAC ref_match_seq snp_match_seq
## 1 rs53576 SIX5_disc1 GARWTGTAGT GATCCTCAGT GGTCCTCAGT
## 2 rs7412 SIX5_disc1 GARWTGTAGT GCCAGGCGCT CTGCAGAAGT
To visualize how each motif is matched to each allele using the plotMotifMatch
function:
match.seq<-dtMotifMatch(atsnp.scores$snp.tbl, atsnp.scores$motif.scores, snpids="rs53576", motifs="SIX5_disc1", motif.lib = motif_library)
plotMotifMatch(match.seq, motif.lib = motif_library)