1 Introduction

While methods for characterizing marginal relationships between individual SNPs and disease status have been well developed in high-throughput genetic association studies of complex diseases, identifying joint associations between collections of genetic variants and disease remains challenging. To date, studies have overwhelmingly focused on detecting variant-disease associations on a SNP-by-SNP basis. Doing so allows researchers to scan millions of SNPs for evidence of association with disease, but it can identify only SNPs with large marginal disease associations, missing collections of SNPs whose joint disease associations are large despite small marginal associations. For many diseases, it is hypothesized that increased penetrance may result from the joint effect of variants at multiple susceptibility loci, suggesting that methods focused on identifying multi-SNP associations may offer greater insight into the genetic architecture of complex diseases.

The epistasisGA package presents one such approach. In this package, we implement the GADGETS method (Nodzenski* et al. 2021) for identifying multi-SNP disease associations in case-parent triad or affected/unaffected sibling-pair studies. Briefly, GADGETS uses a “genetic algorithm” (GA) (Holland 1975) to identify collections of risk-relevant SNPs. Genetic algorithms are a class of general-purpose optimization algorithms particularly well suited to combinatorial optimization in high-dimensional problems. While we use a GA in the context of population genetics, genetic algorithms were not originally developed for genetic studies and can be used to solve a broad variety of problems. In a typical genetic algorithm, optimal solutions are identified by mimicking the process of Darwinian natural selection. In the first iteration of the algorithm, or ‘generation’, a set of potential solutions, collectively known as a ‘population’, with each population component referred to as a ‘chromosome’, is sampled. Individual chromosomes are made up of finite sets of discrete elements, just as biological chromosomes are composed of SNPs. Unlike biological chromosomes, however, GA ‘chromosomes’ are unordered sets. A user-defined function then assigns a ‘fitness score’ to each chromosome, with the fitness score constructed such that better solutions have higher fitness scores. Then, the chromosomes are subjected to ‘crossing over’, where component elements of different chromosomes are exchanged, and ‘mutation’, where component elements are arbitrarily replaced, to generate a new population for the next generation. Chromosomes with higher fitness scores are preferentially selected to produce ‘offspring’, analogous to how organisms with higher fitness reproduce more effectively under natural selection. The scoring and mutation/crossing over process continues iteratively until stopping criteria have been met, hopefully resulting in one or more acceptable solutions to the optimization problem. To maintain population diversity and help avoid premature convergence, GADGETS implements what is known as an ‘island model’ GA (Andre and Koza 1996), where multiple ‘island’ populations are evolved in parallel with limited ‘migration’ of chromosomes among islands permitted at pre-specified intervals of generations.

In GADGETS, the term ‘chromosome’ refers to a collection or subset of SNPs we wish to examine for evidence of a strong multi-SNP association with disease. To be clear, these SNPs need not be located on the same biological chromosome. The details of the fitness score function and crossover/mutation operations are given in (Nodzenski* et al. 2021), but in short, the intuitive aim is to assign high scores to collections of SNPs that are jointly carried by the disease affected children (cases) much more frequently than not. For affected/unaffected sibling-pairs, we operationalize this by comparing the paired sibling genotypes. For case-parent triads, we imagine a possible sibling for each case based on the alleles that were not transmitted, which we call the ‘complement’, and we compare the frequency with which subsets of SNPs are jointly transmitted to cases versus complements. Importantly, we do not assume a single subset of risk-relevant SNPs, so GADGETS does not directly output a single optimal subset but rather a number of provisionally interesting collections of SNPs. Some final processing is therefore required to carry out overall inference and to tease out the actual risk-relevant subset(s) of SNPs.

As a final note, users should be aware that this method does not currently scale to genome-wide data, but it does accommodate larger numbers of input SNPs than comparable methods. In our simulations, we have used up to 10,000 input SNPs, but users are encouraged to experiment with larger numbers if desired.

2 Basic Usage

2.1 Load Data

We begin our example usage of GADGETS by loading a simulated example of case-parent triad data.

library(epistasisGA)
data(case)
data(dad)
data(mom)

These data were simulated based on a case-parent triad study of children with orofacial clefts using the method described by Shi et al. (2018) for triads consisting of mothers, fathers, and affected children. For each dataset, the rows correspond to families, and the columns correspond to SNPs. Because this is a small example just for illustration, each of these datasets includes only 100 SNPs. The SNPs in columns 51, 52, 76, and 77 constitute a true risk pathway (rs1731422, rs4237892, rs7985535, rs1487251), where the joint combination of variants substantially increases the penetrance compared to other genotypes.

At present, GADGETS does not support the use of genotypes imputed with uncertainty (genotype ‘dosages’), so the most likely genotype should be used for any imputed SNPs. GADGETS does, however, support missing genotypes. If a genotype is missing for a particular SNP in any family member, that genotype will be set to missing (-9) for all members of the family, and the family will be considered uninformative for that SNP.

2.2 Pre-process Data

The second step in the analysis pipeline is to pre-process the data. We begin pre-processing by constructing a block diagonal matrix to indicate whether a given pair of SNPs should be considered to be in linkage, where TRUE indicates a pair of SNPs are in linkage and FALSE otherwise. Below, we default to the assumption that SNPs located on the same biological chromosome are in linkage, but users need not make this assumption and are encouraged to more carefully tailor this matrix based on individual circumstances. An important note is that the matrix must be block diagonal (or uniformly set to TRUE), and the ordering of the columns in the input triad data must be consistent with the specified LD structure. For the example data, the SNPs are drawn from chromosomes 10-13, with the columns sorted by chromosome and 25 SNPs per chromosome. We therefore construct the matrix as follows:

library(Matrix)
block.ld.mat <- as.matrix(bdiag(list(matrix(rep(TRUE, 25^2), nrow = 25),
                               matrix(rep(TRUE, 25^2), nrow = 25),
                               matrix(rep(TRUE, 25^2), nrow = 25),
                               matrix(rep(TRUE, 25^2), nrow = 25))))

Now, we can execute pre-processing:

pp.list <- preprocess.genetic.data(case, father.genetic.data = dad,
                                   mother.genetic.data = mom,
                                   block.ld.mat = block.ld.mat)

For an affected/unaffected sibling study, the affected sibling genotypes should be supplied as the first argument (case.genetic.data) and the unaffected sibling genotypes as the complement.genetic.data argument:

### FOR ILLUSTRATION ONLY, DO NOT TRY TO RUN THIS CHUNK ###
pp.list <- preprocess.genetic.data(affected.sibling.genotypes, 
                                   complement.genetic.data = unaffected.sibling.genotypes,
                                   block.ld.mat = block.ld.mat)

GADGETS can also accommodate studies with a mix of triads and unaffected/affected siblings. To input this kind of data, analysts would first need to generate the ‘complement’ data for triads, as follows:

complement <- mom + dad - case

The unaffected sibling genotypes should then be concatenated with the complement data and, similarly, the affected sibling genotypes with the triad case genotypes. These combined genotype matrices are then entered as the complement.genetic.data and case.genetic.data arguments of preprocess.genetic.data:

### STILL JUST FOR ILLUSTRATION ###
combined.case <- rbind(case, affected.sibling.genotypes)
combined.complement <- rbind(complement, unaffected.sibling.genotypes)
pp.list <- preprocess.genetic.data(case.genetic.data = combined.case, 
                                   complement.genetic.data = combined.complement,
                                   block.ld.mat = block.ld.mat)

Importantly, X chromosome SNPs can be used as candidates for GADGETS, but some extra attention is required. Specifically, for affected/unaffected sibling data, both siblings in each pair should be of the same biological sex. For case-parent triad data, the ‘complement’ should be assumed to be of the same sex as the case. As a consequence, analysts must input triad data containing X chromosome SNPs as case.genetic.data and complement.genetic.data, as shown above, after manually creating the complement data for these triads. To do so, autosomal SNP complements should still be created as above (mother genotype + father genotype - case genotype), but the X chromosome SNP complement genotypes should be the allele transmitted by the father to the case and the allele not transmitted by the mother to the case. Note this will result in male X chromosome SNP case and complement genotypes being coded as 0 or 1, compared to 0, 1, or 2 for females. If males and females are both contained in the analysis dataset, analysts may wish to double male X chromosome SNP allele counts (male genotype 1 re-coded as 2) because half of the X copies are typically silenced in females. This, however, is not required and analysts can decide on a SNP-by-SNP basis. Alternatively, GADGETS could be run separately for males and females, in which case doubling male X chromosome allele counts would not be necessary.
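
To make this concrete, the sketch below (not run) shows one possible way to operationalize the verbal rules above for triad data. The objects sex and x.cols are hypothetical (a vector of case sexes and the column indices of the X chromosome SNPs), the construction is our own translation of the description above, and analysts should verify it against their own coding conventions.

### FOR ILLUSTRATION ONLY, DO NOT TRY TO RUN THIS CHUNK ###
# hypothetical objects: 'sex' holds case sexes ("M"/"F"), 'x.cols' indexes
# the X chromosome SNP columns
female <- sex == "F"
male <- sex == "M"

# autosomal complements, as above
complement <- mom + dad - case

# X chromosome complements, following the rules described above:
# father's transmitted allele plus the mother's untransmitted allele
# (for male cases, the father transmits no X allele)
complement[female, x.cols] <- 2 * dad[female, x.cols] + mom[female, x.cols] - case[female, x.cols]
complement[male, x.cols] <- mom[male, x.cols] - case[male, x.cols]

# optionally double male X allele counts (genotype 1 re-coded as 2)
male.case.x <- as.matrix(case[male, x.cols])
male.case.x[male.case.x == 1] <- 2
case[male, x.cols] <- male.case.x
male.comp.x <- as.matrix(complement[male, x.cols])
male.comp.x[male.comp.x == 1] <- 2
complement[male, x.cols] <- male.comp.x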

The preprocess.genetic.data function performs a few disparate tasks that users should note. First, it identifies the minor allele for each input SNP based on the observed frequency in the mothers and fathers (or both siblings in sibling studies). The coding is then flipped for any SNPs where the allele count corresponds to the major allele. The identities of these re-coded SNPs can be found in the output from preprocess.genetic.data. Afterwards, any SNPs with minor allele frequency below a given value, 0% by default (no filtering), are removed. Following SNP filtering, \(\chi^2\) statistics for univariable SNP-disease associations are computed for each SNP assuming a log-additive model using the method of conditioning on the set of transmitted and untransmitted genotypes, regardless of phase, described by Cordell and Clayton (2002). These statistics are used in GADGETS when SNPs are sampled for mutation, with the sampling probability for a given SNP proportional to its \(\sqrt{\chi^2}\) value. Alternatively, users can manually specify a vector proportional to SNP sampling probabilities using argument snp.sampling.probs. This may be of interest to users who wish to incorporate prior biological or expert knowledge into the algorithm to prioritize sampling a subset of particularly interesting SNPs.
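
As a minimal sketch of that option, the chunk below passes a weight vector via snp.sampling.probs, as described above. The weight values and the up-weighted SNP indices are arbitrary and purely illustrative.

### FOR ILLUSTRATION ONLY ###
# arbitrary weights: up-weight a few hypothetical SNPs of prior interest
prior.weights <- rep(1, ncol(case))
prior.weights[c(10, 25, 60, 90)] <- 5
pp.list.weighted <- preprocess.genetic.data(case, father.genetic.data = dad,
                                            mother.genetic.data = mom,
                                            block.ld.mat = block.ld.mat,
                                            snp.sampling.probs = prior.weights)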

2.3 Run GADGETS

We now use GADGETS to identify interesting collections of SNPs that appear to be jointly risk-related using the run.gadgets function. The method requires a number of tuning parameters, but the function defaults are a good starting point. More information regarding these parameters can be found in the package documentation and the paper describing the method (Nodzenski* et al. 2021). Briefly, GADGETS requires the user to pre-specify the number of SNPs that may be jointly associated with disease status. This is controlled by the chromosome.size argument. We recommend running the algorithm for a range of sizes, typically 2-6. For this simple example, however, we will only examine sizes 3 and 4.

Note that run.gadgets does not output results directly to the R session; instead, it writes results to the directory specified in results.dir. Furthermore, as mentioned previously, GADGETS uses a technique known as an ‘island model’ to identify potentially interesting collections of SNPs. Each ‘island’ corresponds to a separately initialized population of n.chromosomes chromosomes, with each chromosome comprised of chromosome.size SNPs and each island randomly assigned to a cluster of island.cluster.size islands. When GADGETS is run, each island’s population evolves independently for intervals of a pre-specified number of generations (controlled by argument migration.generations), after which the top n.migrations chromosomes from each island ‘migrate’ to other islands in the cluster. This continues until each island in the cluster converges or the maximum number of generations (argument generations) is reached. Island clusters evolve completely independently from one another and, where computational resources permit, in parallel. Results for each island (argument n.islands) are written separately to results.dir.

run.gadgets(pp.list, n.chromosomes = 5, chromosome.size = 3, 
       results.dir = "size3_res", cluster.type = "interactive",
       registryargs = list(file.dir = "size3_reg", seed = 1300),
       n.islands = 8, island.cluster.size = 4, 
       n.migrations = 2)

run.gadgets(pp.list, n.chromosomes = 5, chromosome.size = 4, 
       results.dir = "size4_res", cluster.type = "interactive", 
       registryargs = list(file.dir = "size4_reg", seed = 1400),
       n.islands = 8, island.cluster.size = 4, 
       n.migrations = 2)

The population size argument, n.chromosomes, does not have a default, so users need to carefully consider how to specify this parameter. In general, run-times are faster with smaller populations, but decreasing the population size may increase the chance of failing to identify true risk pathways. However, this behavior also depends on the parameters n.islands and island.cluster.size. Broadly, our goal is to maximize the number of distinct subsets of SNPs we examine for evidence of association with disease while maintaining acceptable run-times. Because the package allows island clusters to evolve in parallel, we generally pay less of a run-time price for large island numbers compared to large population sizes. We therefore typically choose to run many islands in small clusters with relatively small population sizes. More concretely, in simulations involving 10,000 input SNPs, we have observed good performance using 1,000 islands in clusters of 4 with populations of 200 chromosomes per island.

The cluster.type and registryargs parameters are also important. For the above example, the “interactive” cluster type indicates that all islands run sequentially in the R session. However, this is not how we anticipate run.gadgets will be used in most cases. Rather, we strongly recommend this function be used with high performance computing clusters to avoid prohibitively long run-times. An example (not run) of this type of command is given below:

### FOR ILLUSTRATION ONLY, DO NOT TRY TO RUN THIS CHUNK ###

library(BiocParallel)
fname <- batchtoolsTemplate("slurm")
run.gadgets(pp.list, n.chromosomes = 20, chromosome.size = 3, 
       results.dir = "size3_res", cluster.type = "slurm", 
       registryargs = list(file.dir = "size3_reg", seed = 1300),
       cluster.template = fname, 
       resources = list(chunks.as.arrayjobs = TRUE),
       n.islands = 12, 
       island.cluster.size = 4)

The cluster.template must be properly calibrated to the user’s HPC cluster. Packages batchtools and BiocParallel both contain good documentation on how to construct these files. It is also possible to run run.gadgets on a single machine with multiple cores (see run.gadgets documentation).

Regardless of the chosen cluster.type, run.gadgets uses functionality from package batchtools to run jobs. In the case of HPC cluster use, with a properly configured cluster.template, users simply need to execute run.gadgets from an interactive R session and the jobs will be submitted to and executed on the cluster. This approach is well described in section 4.3.2 of the vignette “Introduction to BiocParallel” from package BiocParallel. Depending on cluster.type, jobs may take minutes to hours to complete. The status of jobs can be queried using the functions in package batchtools, most commonly getStatus. For users of HPC clusters, commands such as ‘squeue’ can also be used. For larger numbers of submissions, users may also construct their own automated scripts for checking that jobs have successfully completed. Perhaps the most obvious indication of a job failure is that the results.dir will not contain n.islands distinct sets of results. This is particularly important when running jobs on an HPC cluster, where jobs may fail relatively cryptically. In the case of job failure due to problems with cluster schedulers (jobs fail to launch, node failure, etc.), the failed jobs can be re-run using the exact same run.gadgets command. The function will automatically identify the island jobs that still need to be run, and submit only those. This also applies to users running jobs on personal machines who need to stop computations before all island results are available.
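
For instance, a simple status check might look like the following (not run). It uses the registry directory created via the registryargs argument above and assumes results.dir holds one set of results per completed island, as described.

### FOR ILLUSTRATION ONLY, DO NOT TRY TO RUN THIS CHUNK ###
library(batchtools)

# query the batchtools registry created by run.gadgets
reg <- loadRegistry("size3_reg", writeable = FALSE)
getStatus(reg = reg)

# crude completeness check: this count should equal n.islands once
# all island jobs have finished
length(list.files("size3_res"))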

Once users have confirmed run.gadgets has completed and run properly, the sets of results across islands should be condensed using the function combine.islands. Note that in addition to the results directory path, the function requires as input a data.frame indicating the RSIDs (or a placeholder name), reference, and alternate alleles for each SNP in the data passed to preprocess.genetic.data as well as the list output by preprocess.genetic.data.

data(snp.annotations)
size3.combined.res <- combine.islands("size3_res", snp.annotations,
                                      pp.list, 2)
size4.combined.res <- combine.islands("size4_res", snp.annotations,
                                      pp.list, 2)

Importantly, the function will write these results to the specified directory. The function condenses island results such that the rows are sorted in decreasing order according to fitness score, or, more plainly, with the most interesting SNPs appearing at the top of the dataset:

library(magrittr)
library(knitr)
library(kableExtra)
kable(head(size4.combined.res)) %>%
  kable_styling() %>%
  scroll_box(width = "750px")
snp1 snp2 snp3 snp4 snp1.rsid snp2.rsid snp3.rsid snp4.rsid snp1.risk.allele snp2.risk.allele snp3.risk.allele snp4.risk.allele snp1.diff.vec snp2.diff.vec snp3.diff.vec snp4.diff.vec snp1.allele.copies snp2.allele.copies snp3.allele.copies snp4.allele.copies fitness.score n.cases.risk.geno n.comps.risk.geno chromosome n.islands.found
51 52 76 77 rs1731422 rs4237892 rs7985535 rs1487251 A C T T 0.7066807 0.8052581 0.6216295 0.7066309 1+ 1+ 1+ 1+ 18.189961 37 0 51.52.76.77 8
46 51 52 77 rs1219396 rs1731422 rs4237892 rs1487251 T A C T -0.0764504 0.3861258 0.4904747 0.3173681 1+ 1+ 1+ 1+ 6.894078 38 4 46.51.52.77 1
51 52 74 77 rs1731422 rs4237892 rs12425528 rs1487251 A C G T 0.4340800 0.5198946 0.0056948 0.3910094 1+ 1+ 1+ 1+ 5.571168 6 0 51.52.74.77 2
51 52 77 81 rs1731422 rs4237892 rs1487251 rs2002363 A C T T 0.2942917 0.3920740 0.3131793 0.0270168 1+ 1+ 1+ 1+ 5.300035 20 3 51.52.77.81 1
51 52 63 77 rs1731422 rs4237892 rs17093450 rs1487251 A C A T 0.4449684 0.5056253 0.1890210 0.3211486 1+ 1+ 1+ 1+ 5.256887 12 1 51.52.63.77 1
38 51 52 60 rs4378419 rs1731422 rs4237892 rs12818749 G A C A -0.0104833 0.1513001 0.1838847 -0.0306870 1+ 1+ 1+ 1+ 1.107860 40 12 38.51.52.60 1

In this case, we see that the SNPs corresponding to rs1731422, rs4237892, rs7985535, and rs1487251 are the top collection, or, more simply, the group of 4 SNPs identified as most likely to exhibit a joint association with disease status. Note that these are the actual SNPs simulated to exhibit a joint effect, so we have identified the true risk pathway.

Perhaps the most important elements of the output are the n.cases.risk.geno and n.comps.risk.geno columns. These report the number of cases and complements/unaffected siblings carrying the provisional risk genotypes, as determined by GADGETS, at each locus, among families where only one of the case and complement carries the risk genotype. For true epistatic SNP-sets, many more cases than complements should carry the risk genotypes at each locus.

A few important but subtle components of the results users should note are the risk.allele, allele.copies, diff.vec, fitness.score, and n.islands.found columns. For a given SNP within a chromosome, the corresponding risk.allele column reports the proposed risk-related nucleobase and the allele.copies column indicates the required number of copies for the proposed risk pathway. A ‘1+’ indicates at least one copy of the risk allele is needed, while ‘2’ indicates two copies are required. The n.islands.found column reports the number of distinct islands on which a specific chromosome was found. The diff.vec columns also may be descriptively interesting to some users. A positive value for a given SNP indicates the minor allele is positively associated with disease status, while a negative value implies the reference (‘wild type’) allele is positively associated with the disease. More generally, these values should be large in magnitude when a SNP is transmitted to cases much more often than complements. For true multi-SNP risk pathways, we expect pathway SNPs to be jointly transmitted to cases more often than complements, so the magnitude of the diff.vec values should be large across the pathway. As such, high scoring chromosomes containing a subset of SNPs with small magnitude diff.vec values offer descriptive evidence that some of the chromosome’s SNPs are not jointly transmitted to the cases and suggest a smaller chromosome size may be warranted. In the case of the true risk pathway SNPs, all diff.vec values are positive and reasonably large, and all allele.copies columns are ‘1+’, indicating the risk pathway requires at least one copy of the minor allele at all 4 SNPs.

2.4 Permute Datasets

Of course, in real studies we do not know the identity of true risk pathways. We would therefore like a way to determine whether the results from the observed data are consistent with what we expect under the global null hypothesis of no association between any input SNPs and disease status. We do so via permutation testing. Maintaining the case/complement (or case/unaffected sibling) nomenclature from above, in permuting the data, we randomly flip or do not flip the disease status labels. We create these permuted datasets using the permute.dataset command. Here we generate 4 different permuted datasets. In real applications, users are advised to generate at least 100, more if computationally feasible.

set.seed(1400)
perm.data.list <- permute.dataset(case, father.genetic.data = dad,
                                  mother.genetic.data = mom, 
                                  n.permutations = 4)

2.5 Re-Run GADGETS

Once we have created the permuted datasets, for each permutation, we perform exactly the same sequence of analyses shown above. Users should note this step is by far the most time-consuming part of the workflow and almost certainly requires access to a computing cluster. However, since this vignette provides only a very small example, we are able to run the analyses locally. We begin by pre-processing the permuted datasets:

preprocess.lists <- lapply(perm.data.list, function(permutation){
  
  preprocess.genetic.data(permutation$case,
                          complement.genetic.data = permutation$comp,
                          block.ld.mat = block.ld.mat)
})

Then, we run GADGETS on each permuted dataset for each chromosome size and condense the results:

#specify chromosome sizes
chrom.sizes <- 3:4

#specify a different seed for each permutation
seeds <- 4:7

#run GADGETS for each permutation and size 
lapply(chrom.sizes, function(chrom.size){
  
  lapply(seq_along(preprocess.lists), function(permutation){
  
    perm.data.list <- preprocess.lists[[permutation]]
    seed.val <- chrom.size*seeds[permutation]
    res.dir <- paste0("perm", permutation, "_size", chrom.size, "_res")
    reg.dir <- paste0("perm", permutation, "_size", chrom.size, "_reg")
    run.gadgets(perm.data.list, n.chromosomes = 5, 
           chromosome.size = chrom.size,
           results.dir = res.dir, cluster.type = "interactive", 
           registryargs = list(file.dir = reg.dir, seed = seed.val),
           n.islands = 8, island.cluster.size = 4,
           n.migrations = 2)
    
  })
  
})

#condense the results 
perm.res.list <- lapply(chrom.sizes, function(chrom.size){
  
  lapply(seq_along(preprocess.lists), function(permutation){
  
    perm.data.list <- preprocess.lists[[permutation]]
    res.dir <- paste0("perm", permutation, "_size", chrom.size, "_res")
    combine.islands(res.dir, snp.annotations, perm.data.list, 2)
    
  })
  
})

2.6 Run Global Test

Now we are ready to more formally examine whether our results are consistent with what would be expected under the null hypothesis of no association between the input variants and disease status. Our first step is to run a global test across all chromosome sizes examining whether the fitness scores from the observed data are drawn from the distribution expected under the null. Briefly, for each chromosome size separately, we sum the fitness scores of the top \(k\) chromosomes for the observed data and each permuted dataset. We then rank the observed sum relative to the corresponding chromosome-size-specific sums for the permuted datasets. Letting \(R_d\) denote the rank for chromosome size \(d\), with 1 denoting the rank of the smallest sum of fitness scores, and with \(N\) denoting the number of permutes plus 1, we compute \(T_d = -2\ln((N - R_d + 0.5)/N)\). The test is based on summing \(T_d\) over chromosome sizes, \(T = \sum_d{T_d}\). We also compute this statistic for each permuted dataset and compute a permutation-based p-value using the \(N\) values of \(T\). Intuitively, this test assesses whether the top-scoring chromosomes score consistently higher than those from null data across chromosome sizes.
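
To make the formula concrete, here is a minimal sketch for a single chromosome size; obs.sum and perm.sums are hypothetical stand-ins for the observed and permuted sums of the top \(k\) fitness scores.

# hypothetical observed and permutation-based sums of the top k fitness scores
obs.sum <- 250
perm.sums <- c(180, 140, 210, 175)
N <- length(perm.sums) + 1                 # number of permutes plus 1
R.d <- rank(c(obs.sum, perm.sums))[1]      # rank of the observed sum (1 = smallest)
T.d <- -2 * log((N - R.d + 0.5) / N)       # larger when the observed sum ranks near the top
T.d
## [1] 4.60517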

A key feature of this approach is the test does not require adjustment for multiple comparisons despite the very large number of combinations being considered by the GA. The test is implemented in function global.test, with \(k\) specified by argument n.top.scores. We generally specify n.top.scores = 10, but for this illustrative example, we only use 5. When GADGETS returns fewer than n.top.scores chromosomes for a given chromosome size for either the observed data or any permute, say \(l^*\) chromosomes, the global test will be automatically computed using the top \(l^*\) for that chromosome size. Note that the number of chromosomes actually used to compute the global test for each chromosome size will be stored in the chrom.size.k element of the global.test output.

# chromosome size 3 results

# function requires a list of vectors of
# permutation based fitness scores
chrom3.perm.fs <- lapply(perm.res.list[[1]], 
                         function(x) x$fitness.score)
chrom3.list <- list(observed.data = size3.combined.res$fitness.score,
                     permutation.list = chrom3.perm.fs)

# chromosome size 4 results
chrom4.perm.fs <- lapply(perm.res.list[[2]], 
                         function(x) x$fitness.score)
chrom4.list <- list(observed.data = size4.combined.res$fitness.score,
                     permutation.list = chrom4.perm.fs)

# list of results across chromosome sizes, with each list 
# element corresponding to a chromosome size
final.results <- list(chrom3.list, chrom4.list)

# run global test
global.test.res <- global.test(final.results, n.top.scores = 5)

# examine how many chromosomes were used for each  chromosome size 
global.test.res$chrom.size.k
## [1] 5 5
# look at the global test stat and p-value 
global.test.res$obs.test.stat
## [1] 9.21034
global.test.res$pval
## [1] 0.2

We see the observed \(T\) for our simple example is 9.2103404, with an associated permutation-based p-value of 0.2. Note the permutation p-values for both the global.test and epistasis.test functions are computed as one plus the number of permutation test statistics at least as large as the observed test statistic, divided by the number of permutations plus one (Phipson and Smyth 2010), underscoring the need to run as many permutations as computationally possible. Overlaying the observed \(T\) on a boxplot of the permutation-based statistics shows that the observed value is a clear outlier, even on the log scale.
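
A minimal sketch of that p-value convention, using hypothetical values for the observed and permutation-based statistics:

# hypothetical observed statistic and permutation statistics
obs.stat <- 9.21
perm.stats <- c(4.1, 6.3, 2.8, 5.5)

# one plus the number of permutation statistics at least as large as the
# observed statistic, divided by the number of permutations plus one
pval <- (1 + sum(perm.stats >= obs.stat)) / (length(perm.stats) + 1)
pval
## [1] 0.2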

2.7 Post-hoc Analyses

Regardless of result, it is crucial that users are clear about what the test implies. The null hypothesis is that the observed test statistic is drawn from the null distribution, i.e., that the global patterns of transmissions to cases are not systematically different from those to the corresponding complements or unaffected siblings. In situations where we reject the null hypothesis, the conclusion is of course that the observed test statistic is very different from the null distribution. However, to best characterize results, it is incumbent on users to closely examine results beyond the global test. In particular, we would like to be able to identify the specific subsets of SNPs that are collectively related to risk.

To that end, global.test provides additional, chromosome size specific information that users are encouraged to examine. First, users may look at the obs.marginal.test.stats and marginal.pvals objects from the global.test output:

global.test.res$obs.marginal.test.stats
##                  
## 1 4.60517 4.60517
global.test.res$marginal.pvals
## [1] 0.2 0.2

The obs.marginal.test.stats object reports the fitness score sums for each chromosome size for the observed data. The marginal.pvals object reports a permutation p-value for each distinct chromosome size.

Second, users may also wish to examine the max.obs.fitness and max.order.pvals elements of the output:

global.test.res$max.obs.fitness
## [1]  2.492284 18.189961
global.test.res$max.order.pvals
## [1] 0.2 0.2

The max.obs.fitness element reports the maximum fitness score in the observed data for each chromosome size, and the max.order.pvals element reports the corresponding permutation p-values. Given sufficient permutations to estimate the null distribution, these p-values correspond to the test of the null hypothesis that the maximum observed fitness score for a given chromosome size is not greater than what would be expected given no SNP-disease associations. Rejecting the null implies the observed maximum fitness score exceeds what would be expected by random chance.

Finally, the function provides boxplots of the observed vs. permuted fitness scores for each element of the input results list:

library(grid)
grid.draw(global.test.res$boxplot.grob)

Note the numbers at the top of the plots correspond to the element of the input results list. In this example, the ‘1’ corresponds to the chromosome size 3 results (the first element of the results list) and the ‘2’ corresponds to the chromosome size 4 results (the second element of the results list). Above, we see the observed fitness scores tend to be higher than the null permutes, especially for chromosome size 4.

Users should also be clear that rejecting the null hypothesis does not necessarily indicate that a collection of SNPs exhibit epistasis. Indeed, we may reject the null simply because we have identified a single SNP with a non-zero marginal disease association. To that end, we provide a permutation based procedure specifically examining evidence for epistasis among a collection of SNPs conditional on the marginal SNP-disease associations. To illustrate, in our example, we might be interested in whether the high score of the top 4 SNP chromosome was attributable to epistasis or at least one large marginal association. We can execute this test as follows:

top.snps <- as.vector(t(size4.combined.res[1, 1:4]))
set.seed(10)
epi.test.res <- epistasis.test(top.snps, pp.list)
epi.test.res$pval
## [1] 9.999e-05

The test indicates the observed fitness score is unusual under the assumption of no epistasis among loci and given the observed marginal transmissions to the disease affected children. However, caution is warranted in interpreting this ‘p-value’. The GADGETS stochastic search method identifies SNP-sets that appear to be interacting even under an independent-effects null, so the epistasis test p-value should be interpreted in an exploratory, hypothesis-generating context. We make use of these ‘p-values’ primarily in constructing network plots of potentially interesting SNP-sets.

Users may also receive a warning when executing epistasis.test that says “All chromosome SNPs in linkage, returning NA for p-value”. This warning is generated because the procedure can run only if at least two of the chromosome’s SNPs are not in linkage, as specified in matrix block.ld.mat; otherwise, other methods (not provided in this package) are required.

2.8 Visualize Results

As a final step in the analytic pipeline, we recommend users examine network plots of the results using function network.plot. This may be particularly useful when trying to determine the true number of SNPs involved in multi-SNP risk pathways and to identify those SNPs. For instance, in the example above, the true risk pathway involves 4 SNPs, but we ran GADGETS for chromosome sizes of 3 and 4. For chromosome size 3, we saw that many of the top identified collections of SNPs were subsets of the true 4-SNP pathway. If we didn’t know the true pathway size was 4, a network plot might help make this clearer.

Briefly, we use our epistasis test ‘p-values’ to assign graphical scores to pairs of SNPs identified in top-scoring chromosomes, with higher scores corresponding to lower epistasis p-values (more substantial evidence for epistasis). We then aggregate those pair-scores across chromosome sizes to generate a final collection of SNP-pairs, which we display in a single network plot.

We start by computing the SNP-pair scores using compute.graphical.scores. This function takes as required arguments a list of data.tables containing the results from GADGETS run for different chromosome sizes and the list of pre-processed data from function preprocess.genetic.data. For analysts who have run the permutation-based global test, we recommend restricting attention to chromosomes with fitness scores higher than what we would expect for null data. Specifically, the global.test function output contains a vector, max.perm.95th.pctl, that reports the \(95^{th}\) percentile of the maximum observed fitness score across the null permutes for each chromosome size. We restrict our network plots to the chromosomes with fitness scores exceeding the corresponding null threshold for each chromosome size:

# vector of 95th percentile of null fitness scores 
chrom.size.thresholds <- global.test.res$max.perm.95th.pctl

# chromosome size 3 threshold
d3.t <- chrom.size.thresholds[1]

# chromosome size 4 threshold
d4.t <- chrom.size.thresholds[2]

# create results list 
obs.res.list <- list(size3.combined.res[size3.combined.res$fitness.score >= d3.t, ], 
                     size4.combined.res[size4.combined.res$fitness.score >= d4.t, ])

Alternatively, for analysts who have not run the global-test permutes, we recommend restricting attention to a subset of the top scoring chromosomes for each chromosome size. We’ve observed good results using the top 10, but we use the top 5 in the illustrative command below:

obs.res.list.no.permutes <- list(size3.combined.res[1:5, ], size4.combined.res[1:5, ])

Once the results list has been prepared, we generate graphical scores for each SNP-pair. Since we’ve run the global test, we use the obs.res.list results below, but the steps would be exactly the same if we instead used the obs.res.list.no.permutes list. Note that for large numbers of top scoring chromosomes, this function may take at least 10-20 minutes to complete.

set.seed(10)
graphical.scores <- compute.graphical.scores(obs.res.list, pp.list)

Next, we input these pair scores into function network.plot to actually plot the data.

network.plot(graphical.scores, pp.list, graph.area = 200,
             node.size = 40, vertex.label.cex = 2)

We see the true pathway SNPs appear prominently in this plot. Key features of the plotting approach are (1) the thickness of the SNP-to-SNP line segments is proportional to the log of the graphical score for that pair and (2) the area of each node circle is proportional to the log of the sum of the graphical scores for that SNP. Although not seen in the plot above, dashed connections indicate SNPs in high LD (pairwise \(r^2 \geq 0.1\) by default) in the unaffected siblings/complements.

This function uses the Fruchterman-Reingold force directed graph drawing algorithm, as implemented in the qgraph R package. If other layout algorithms are desired, users can specify plot = FALSE to network.plot, which will return an igraph object that can be used for more customized or interactive plotting. For instance, the igraph package function tkplot allows for interactive plotting of igraph objects.
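
For instance (not run), using the plot = FALSE behavior described above:

### FOR ILLUSTRATION ONLY, DO NOT TRY TO RUN THIS CHUNK ###
library(igraph)

# with plot = FALSE, network.plot returns an igraph object rather than
# drawing the network, which can then be explored interactively
net <- network.plot(graphical.scores, pp.list, plot = FALSE)
tkplot(net)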

In real applications, there may be too many SNPs for a network plot to fit on a single page. In those instances, we plot a subset of the top scoring pairs. For example, we could plot the top 6 scoring pairs as follows:

network.plot(graphical.scores, pp.list, 
             n.top.scoring.pairs = 6, graph.area = 10,
             node.size = 40, vertex.label.cex = 2)

As a more minor point, the SNP labels default to the RSIDs provided, as seen in the third and fourth columns of the pair.scores object, and the second column of snp.scores:

head(graphical.scores[["pair.scores"]])
##    SNP1 SNP2 SNP1.rsid SNP2.rsid pair.score
## 1:   52   77 rs4237892 rs1487251   4.834115
## 2:   51   52 rs1731422 rs4237892   4.675688
## 3:   51   77 rs1731422 rs1487251   4.675688
## 4:   52   76 rs4237892 rs7985535   3.606632
## 5:   76   77 rs7985535 rs1487251   3.606632
## 6:   51   76 rs1731422 rs7985535   2.913485
head(graphical.scores[["snp.scores"]])
##    SNP       rsid snp.score
## 1:  52  rs4237892  4.834115
## 2:  77  rs1487251  4.834115
## 3:  51  rs1731422  4.675688
## 4:  76  rs7985535  3.606632
## 5:  46  rs1219396  2.913485
## 6:  74 rs12425528  2.913485

However, users who desire custom labels can change the corresponding values. For instance, here we label the true risk SNPs with a star and leave all others blank:

risk.numbers <- c(51, 52, 76, 77)
graphical.scores[[1]]$SNP1.rsid <- ifelse(graphical.scores[[1]]$SNP1 %in% risk.numbers, 
                                          "*", "")
graphical.scores[[1]]$SNP2.rsid <- ifelse(graphical.scores[[1]]$SNP2 %in% risk.numbers,
                                          "*", "")
graphical.scores[[2]]$rsid <- ifelse(graphical.scores[[2]]$SNP %in% risk.numbers, "*", "")
network.plot(graphical.scores, pp.list, graph.area = 200,
             node.size = 40, vertex.label.cex = 10)

3 Cleanup and sessionInfo()

#remove all example directories 
perm.reg.dirs <- as.vector(outer(paste0("perm", 1:4), 
                                 paste0("_size", chrom.sizes, "_reg"), 
                                 paste0))
perm.res.dirs <- as.vector(outer(paste0("perm", 1:4), 
                                 paste0("_size", chrom.sizes, "_res"), 
                                 paste0))
lapply(c("size3_res", "size3_reg", "size4_res", "size4_reg",
         perm.reg.dirs, perm.res.dirs), unlink, recursive = TRUE)
#session information 
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggplot2_3.4.0     kableExtra_1.3.4  knitr_1.41        magrittr_2.0.3   
## [5] Matrix_1.5-3      epistasisGA_1.0.2 BiocStyle_2.26.0 
## 
## loaded via a namespace (and not attached):
##   [1] nlme_3.1-161        fs_1.5.2            matrixStats_0.63.0 
##   [4] webshot_0.5.4       httr_1.4.4          RColorBrewer_1.1-3 
##   [7] progress_1.2.2      tools_4.2.2         backports_1.4.1    
##  [10] bslib_0.4.2         utf8_1.2.2          R6_2.5.1           
##  [13] rpart_4.1.19        Hmisc_4.7-2         DBI_1.1.3          
##  [16] colorspace_2.0-3    nnet_7.3-18         withr_2.5.0        
##  [19] tidyselect_1.2.0    gridExtra_2.3       prettyunits_1.1.1  
##  [22] mnormt_2.1.1        compiler_4.2.2      qgraph_1.9.3       
##  [25] fdrtool_1.2.17      rvest_1.0.3         cli_3.6.0          
##  [28] htmlTable_2.4.1     xml2_1.3.3          labeling_0.4.2     
##  [31] bookdown_0.32       sass_0.4.4          scales_1.2.1       
##  [34] checkmate_2.1.0     psych_2.2.9         pbapply_1.7-0      
##  [37] rappdirs_0.3.3      systemfonts_1.0.4   stringr_1.5.0      
##  [40] digest_0.6.31       pbivnorm_0.6.0      foreign_0.8-84     
##  [43] svglite_2.1.1       rmarkdown_2.19      base64enc_0.1-3    
##  [46] jpeg_0.1-10         pkgconfig_2.0.3     htmltools_0.5.4    
##  [49] highr_0.10          fastmap_1.1.0       htmlwidgets_1.6.1  
##  [52] rlang_1.0.6         rstudioapi_0.14     farver_2.1.1       
##  [55] jquerylib_0.1.4     generics_0.1.3      jsonlite_1.8.4     
##  [58] gtools_3.9.4        BiocParallel_1.32.5 dplyr_1.0.10       
##  [61] Formula_1.2-4       interp_1.1-3        Rcpp_1.0.9         
##  [64] munsell_0.5.0       fansi_1.0.3         abind_1.4-5        
##  [67] lifecycle_1.0.3     stringi_1.7.12      yaml_2.3.6         
##  [70] debugme_1.1.0       plyr_1.8.8          lavaan_0.6-13      
##  [73] parallel_4.2.2      crayon_1.5.2        deldir_1.0-6       
##  [76] lattice_0.20-45     splines_4.2.2       hms_1.1.2          
##  [79] magick_2.7.3        batchtools_0.9.15   pillar_1.8.1       
##  [82] igraph_1.3.5        base64url_1.4       corpcor_1.6.10     
##  [85] reshape2_1.4.4      codetools_0.2-18    stats4_4.2.2       
##  [88] glue_1.6.2          evaluate_0.20       latticeExtra_0.6-30
##  [91] data.table_1.14.6   BiocManager_1.30.19 png_0.1-8          
##  [94] vctrs_0.5.1         gtable_0.3.1        assertthat_0.2.1   
##  [97] cachem_1.0.6        xfun_0.36           viridisLite_0.4.1  
## [100] survival_3.5-0      glasso_1.11         tibble_3.1.8       
## [103] cluster_2.1.4       ellipsis_0.3.2      brew_1.0-8

References

Andre, J.H., and J.R. Koza. 1996. “Parallel Genetic Programming: A Scalable Implementation Using the Transputer Network Architecture.” In Advances in Genetic Programming, Volume 2, edited by P.J. Angeline and K.E. Kinnear. Cambridge: MIT Press.

Cordell, H.J., and D.G. Clayton. 2002. “A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes.” Am J Hum Genet 70 (1): 124–41. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC384883/.

Holland, JH. 1975. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press.

Nodzenski*, M., M. Shi*, J. Krahn, A.S. Wise, Y. Li, L. Li, D.M. Umbach, and C.R. Weinberg. 2021. “GADGETS: A genetic algorithm for detecting epistasis using nuclear families.” Bioinformatics 38 (4): 1052–8.

Phipson, B., and G.K Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat Appl Genet Mol Biol 9 (1): Article 39. https://pubmed.ncbi.nlm.nih.gov/21044043/.

Shi, M., D.M. Umbach, A.S. Wise, and C.R. Weinberg. 2018. “Simulating autosomal genotypes with realistic linkage disequilibrium and a spiked-in genetic effect.” BMC Bioinformatics 19 (1): 2. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5749028/.