Introduction


Genomics studies often employ multiple independent lines of investigation to address a phenotype or complex genetic trait. These include studying various forms of genomic variation (SNPs, CNVs, InDels) and gene expression (in multiple tissues) for a single phenotype. In addition, such studies may be carried out in a single species or in multiple species of interest (e.g., humans and other relevant model organisms). A common characteristic of modern high-throughput experiments across -omics fields is that they produce long lists of genes. Integrating gene-level data from multiple evidence layers has been shown to be an effective approach to identify and prioritize candidate genes for complex genetic traits. Here, we have implemented three methods to integrate gene-level data generated from multiple independent lines of investigation (Figure 1).

Figure 1: Overarching goal of the package



Background


Evidence layers

The introduction referred to integrating gene-level data from multiple evidence layers. Here, we briefly explain what 'evidence layer' means throughout this package. An evidence layer is one of the multiple independent lines of investigation. These lines of investigation may use the same method (e.g., GWAS) to study the phenotype in independent sample groups (e.g., GWAS studies carried out by different labs on the same phenotype). Alternatively, they may use different methods (e.g., SNP, CNV, RNA, miRNA) to study the phenotype in the same or in independent sample groups. They may also employ multiple methods to study the same phenotype in different tissues, or in different species altogether. In all cases, a clear definition of the phenotype and phenotypic homogeneity (low variability in phenotypic characterization) are crucial for this kind of integrative study. Examples of evidence layers are shown below in Figure 2.

Figure 2: Evidence layers


Handling of duplicate genes

Some genes may be detected several times within a single evidence layer. Consider a case where gene-level data is being integrated from evidence layers such as SNP, CNV, RNA expression and miRNA expression, and gene ABCD is detected several times within one of them (say, the SNP data). Even if gene ABCD is not detected in any other evidence layer, it would likely receive an inflated rank because of its increased frequency within the SNP data. To avoid this bias, all three methods implemented in this package count a duplicated gene only once (as a single vote) within each evidence layer. When duplicates are collapsed, the entry with the most significant statistic (lowest p-value or highest effect size) is retained.
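The collapsing step described above can be sketched as follows. This is a minimal illustration in Python, not the package's own code: the function name `collapse_duplicates`, the `smaller_is_better` flag, the gene EFGH and all numeric values are hypothetical and chosen only for the example.

```python
def collapse_duplicates(records, smaller_is_better=True):
    """Keep one record per (gene, layer) pair.

    records: iterable of (gene, layer, statistic) tuples.
    smaller_is_better: True for p-values, False for effect sizes
    (hypothetical switch, assumed for this sketch).
    """
    best = {}
    for gene, layer, stat in records:
        key = (gene, layer)
        if key not in best:
            best[key] = stat
        elif smaller_is_better:
            best[key] = min(best[key], stat)
        else:
            best[key] = max(best[key], stat)
    return [(gene, layer, stat) for (gene, layer), stat in best.items()]


# Gene ABCD appears three times in the SNP layer (made-up p-values).
records = [
    ("ABCD", "SNP", 0.04),
    ("ABCD", "SNP", 0.001),
    ("ABCD", "SNP", 0.20),
    ("EFGH", "SNP", 0.03),
    ("EFGH", "CNV", 0.01),
]
collapsed = collapse_duplicates(records)
```

After collapsing, ABCD contributes a single SNP entry carrying its lowest p-value, while EFGH keeps one entry in each of the two layers in which it was detected.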

File format

The required input file format is straightforward: a tab-delimited text file with no header (no column names). The file should contain at least three columns. The first column contains gene symbols (or names), the second column indicates the type of evidence layer (see the Evidence layers section above), and the third column contains a significance statistic (e.g., p-value or effect size), which must be a non-negative numeric value. For example, when read in, the file should look like this:

##      V1   V2    V3
## 1 FKUZZ GWAS 0.587
## 2 HZEAY GWAS 0.402
## 3 HMMJI GWAS 0.903
## 4 ROTUC GWAS 0.317
## 5 BHXYZ GWAS 0.678
## 6 ECXSC GWAS 0.964

For the Convergent Evidence (CE) method, however, the first two columns described above are sufficient, because the CE method does not use the significance statistic when ranking genes.
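As a rough illustration of the format rules above, the sketch below parses and validates such a file with Python's standard `csv` module. It is not the package's reader: the function `read_evidence` and its `require_statistic` parameter are hypothetical names introduced for this example, and the input is built in memory from the first rows of the example shown earlier.

```python
import csv
import io


def read_evidence(handle, require_statistic=True):
    """Parse a headerless, tab-delimited evidence file.

    Returns a list of (gene, layer, statistic) tuples. With
    require_statistic=False (the two-column case sufficient for CE),
    the statistic is returned as None even if a third column exists.
    """
    min_cols = 3 if require_statistic else 2
    rows = []
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) < min_cols:
            raise ValueError(f"line {line_no}: expected at least {min_cols} columns")
        stat = None
        if require_statistic:
            stat = float(fields[2])
            if stat < 0:
                raise ValueError(f"line {line_no}: statistic must be non-negative")
        rows.append((fields[0], fields[1], stat))
    return rows


# In-memory stand-in for a file on disk.
example = "FKUZZ\tGWAS\t0.587\nHZEAY\tGWAS\t0.402\nHMMJI\tGWAS\t0.903\n"
rows = read_evidence(io.StringIO(example))
```

For a real file, the same function could be given an open file handle, for instance one obtained with `open(path, newline="")`.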


Convergent Evidence (CE) method


The Convergent Evidence (CE) method aggregates ranks of genes using weighted vote counting. A conceptually similar gene-level integration has been successfully used to prioritize candidate genes in neuropsychiatric diseases (e.g., Ayalew M, 2012, Mol Psychiatry).

Here, to rank genes, we compute convergent evidence scores. The convergent evidence score of gene \(G\) is given by \[CE(G)=\frac{CE(G_{L_1})}{n(L_1)}+\cdots+\frac{CE(G_{L_n})}{n(L_n)}\] where \(CE(G_{L_i})\) refers to the self-importance of evidence layer \(i\), and \(n(L_i)\) refers to the number of genes within evidence layer \(i\). In several instances, however, determining the importance of an evidence layer by the number of genes within it may not be biologically meaningful. To address this, we propose two other ways to compute convergent evidence scores.

The first is to ignore the number of genes within each layer, so that \[CE(G)=CE(G_{L_1})+\cdots+CE(G_{L_n})\] In this case, the convergent evidence score is equivalent to simple, unweighted vote counting.

The second lets researchers set the importance of each layer themselves, by assigning custom weights to the evidence layers based on their expert knowledge of the field. For example, when a researcher knows that findings from a specific technology tend to be less reproducible, that evidence layer can be given a lower weight than the other layers. A more objective way of assigning custom weights is to base them on the sample size of each evidence layer. In this case, the convergent evidence score is \[CE(G)=CE(G_{L_1})\cdot w(L_1)+\cdots+CE(G_{L_n})\cdot w(L_n)\] where \(w(L_i)\) refers to the custom weight assigned to evidence layer \(i\).
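The three scoring variants can be compared side by side in a short Python sketch. It rests on one loudly stated assumption that the text does not spell out: \(CE(G_{L_i})\) is taken to be 1 when gene \(G\) is detected in layer \(i\) and 0 otherwise, so that each layer casts at most one vote per gene. The function `ce_scores`, its `scheme` names, the genes and the weights are all hypothetical and serve only as illustration.

```python
from collections import defaultdict


def ce_scores(pairs, scheme="layer_size", weights=None):
    """Score genes from (gene, layer) pairs.

    ASSUMPTION: a gene contributes 1 to a layer in which it is detected
    and 0 otherwise. Schemes (hypothetical names):
      "layer_size": each vote is divided by n(L_i)
      "equal":      plain vote counting
      "custom":     each vote is multiplied by weights[layer]
    """
    layers = defaultdict(set)
    for gene, layer in pairs:
        layers[layer].add(gene)  # a set counts duplicates only once

    scores = defaultdict(float)
    for layer, genes in layers.items():
        if scheme == "layer_size":
            vote = 1.0 / len(genes)
        elif scheme == "equal":
            vote = 1.0
        elif scheme == "custom":
            vote = weights[layer]
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        for gene in genes:
            scores[gene] += vote
    return dict(scores)


# Made-up input: 2 distinct SNP genes, 4 distinct CNV genes.
pairs = [
    ("ABCD", "SNP"), ("ABCD", "SNP"), ("EFGH", "SNP"),
    ("ABCD", "CNV"), ("IJKL", "CNV"), ("MNOP", "CNV"), ("QRST", "CNV"),
]
by_size = ce_scores(pairs, scheme="layer_size")
by_vote = ce_scores(pairs, scheme="equal")
by_custom = ce_scores(pairs, scheme="custom", weights={"SNP": 1.0, "CNV": 0.5})

# Rank genes from highest to lowest layer-size-normalized score.
ranking = sorted(by_size, key=by_size.get, reverse=True)
```

Under these assumptions, ABCD ranks first in every scheme because it is the only gene detected in both layers. The schemes differ in how much a detection in the larger CNV layer is worth compared with one in the smaller SNP layer.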