| Title: | Genomic Annotation in Livestock for Positional Candidate LOci |
| Version: | 2.0 |
| Description: | The accurate annotation of genes and Quantitative Trait Loci (QTLs) located within candidate markers and/or regions (haplotypes, windows, CNVs, etc.) is a crucial step in the most common genomic analyses performed in livestock, such as Genome-Wide Association Studies or transcriptomics. The Genomic Annotation in Livestock for positional candidate LOci (GALLO) is an R package designed to provide an intuitive and straightforward environment to annotate positional candidate genes and QTLs from high-throughput genetic studies in livestock. Moreover, GALLO allows the graphical visualization of gene and QTL annotation results, data comparison among different grouping factors (e.g., methods, breeds, tissues, statistical models, studies, etc.), and QTL enrichment in different livestock species including cattle, pigs, sheep, and chicken, among others. |
| URL: | <https://github.com/pablobio/GALLO> |
| Depends: | R (≥ 4.0.0) |
| biocViews: | Software |
| Imports: | circlize, data.table, doParallel, dplyr, ggplot2, graphics, grDevices, foreach, lattice, parallel, RColorBrewer, rtracklayer, stats, stringr, unbalhaar, utils, DT, webshot, igraph, visNetwork, CompQuadForm, Matrix, reticulate |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| LazyData: | true |
| RoxygenNote: | 7.3.3 |
| Suggests: | Hmisc, knitr, rmarkdown, testthat |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2026-02-06 20:59:25 UTC; pablofonseca |
| Author: | Pablo Fonseca [aut, cre], Aroa Suarez-Vega [aut], Gabriele Marras [aut], Angela Cánovas [aut] |
| Maintainer: | Pablo Fonseca <p.fonseca@csic.es> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-09 23:50:20 UTC |
Run a short EM on p-values to fit a 2-component Beta mixture
Description
Fits a two-component mixture model to gene-level p-values where the null component is typically Uniform(0,1) (i.e., Beta(1,1)) and the alternative component is Beta(alpha1, 1) with alpha1 < 1 enriching small p-values. Returns updated mixture parameters and posterior probabilities of belonging to the signal (alternative) component.
Usage
EMinfer(pvalues, pi0, pi1, alpha0, alpha1, max.it)
Arguments
pvalues |
Numeric vector of gene-level p-values in (0, 1]. |
pi0 |
Mixing proportion for the null/background component (Z = 0). |
pi1 |
Mixing proportion for the alternative/signal component (Z = 1); typically 1 - pi0. |
alpha0 |
Shape parameter of the null p-value distribution, modeled as Beta(alpha0, 1); alpha0 = 1 gives the Uniform(0,1) null. |
alpha1 |
Shape parameter of the alternative/signal p-value distribution, modeled as Beta(alpha1, 1). |
max.it |
Integer. Maximum number of EM iterations. |
Value
A list with elements:
- alpha0
Estimated null shape parameter.
- alpha1
Estimated alternative shape parameter.
- pi
Estimated mixing proportion for the signal component (pi1).
- post
Posterior probabilities P(Z = 1 | p) for each p-value.
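As a rough illustration of the mixture described above, the following base R sketch runs a short EM with a Uniform(0,1) null and a Beta(alpha1, 1) alternative. It is not GALLO's implementation of EMinfer: the function name em_beta_sketch, the choice to hold alpha0 fixed at 1, and the simulated p-values are assumptions made for this example.

```r
# Hypothetical helper, not part of GALLO: short EM for a two-component
# p-value mixture with a Uniform(0,1) null and a Beta(alpha1, 1) alternative.
em_beta_sketch <- function(pvalues, pi1 = 0.4, alpha1 = 0.2, max.it = 10) {
  p <- pmax(pvalues, .Machine$double.xmin)  # guard against log(0)
  post <- rep(pi1, length(p))
  for (it in seq_len(max.it)) {
    # E-step: posterior probability of the signal component.
    f1 <- alpha1 * p^(alpha1 - 1)  # Beta(alpha1, 1) density
    f0 <- 1                        # Uniform(0,1) density (alpha0 fixed at 1)
    post <- pi1 * f1 / (pi1 * f1 + (1 - pi1) * f0)
    # M-step: weighted maximum-likelihood updates.
    pi1 <- mean(post)
    alpha1 <- -sum(post) / sum(post * log(p))
  }
  list(alpha1 = alpha1, pi = pi1, post = post)
}

set.seed(1)
# Simulated data (assumption): 700 null p-values and 300 from Beta(0.2, 1).
# If U is Uniform(0,1), then U^(1/a) follows Beta(a, 1).
pv <- c(runif(700), runif(300)^(1 / 0.2))
fit <- em_beta_sketch(pv)
```

The closed-form update for alpha1 follows from the Beta(a, 1) log-likelihood, whose weighted maximum is reached at a = -sum(w) / sum(w * log(p)).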
Function to estimate the posterior probability of association (PPA) based on the integration of -OMICs and functional data via representation learning of candidate genes.
Description
Function to estimate the posterior probability of association (PPA) based on the integration of -OMICs and functional data via representation learning of candidate genes.
Usage
Gene_PPA(
gene.pval,
Gene_ID,
pval,
embeddings,
lambda = 0,
alpha0 = 1,
alpha1 = 0.2,
pi0 = 0.6,
pi1 = 0.4,
max.it = 10,
model,
verbose = F
)
Arguments
gene.pval |
A table containing the gene IDs and p-values. |
Gene_ID |
Name of the column containing the gene IDs (without NAs or duplicated IDs). |
pval |
Name of the column containing the gene-level p-values. |
embeddings |
The output of the |
lambda |
Ridge (L2) regularization penalty used by the logistic regression in the LR model (see Details). |
alpha0 |
Shape parameter of the null/background p-value distribution. |
alpha1 |
Shape parameter of the alternative/signal p-value distribution, modeled as Beta(alpha1, 1). |
pi0 |
Mixing proportion (prior probability) that an observation belongs to the null/background component |
pi1 |
Mixing proportion (prior probability) that an observation belongs to the alternative/signal component |
max.it |
Maximum number of iterations for the EM algorithm. |
model |
One of the following options defining the model used during the EM algorithm: "proj", "M", "LR", "NB", or "MVN" (see Details). |
verbose |
A logical value defining if the iteration information should be printed or not. |
Details
The function Gene_PPA allows the user to select different options that define how the EM algorithm models the distribution of v, the auxiliary information (feature/annotation) matrix attached to each p-value. The EM algorithms employed here are adapted from Wu et al. (2018):
- proj: Fits a two-component mixture model that combines p-values with an LDA-based one-dimensional projection of the feature matrix v. The projection is obtained by soft LDA using current posterior weights, and class-conditional densities of the projected score are estimated nonparametrically via weighted KDE.
- M: Fits a two-component mixture model that combines p-values with the feature matrix v by estimating a separate weighted KDE for each feature and each component. The joint density is approximated by the product of marginal densities (naive independence across features).
- LR: Fits a two-component mixture model where p-values follow a Beta mixture and the prior probability of belonging to the signal component is modeled as a logistic function of v. Logistic regression parameters are estimated within the EM iterations, with optional ridge regularization controlled by lambda.
- NB: Fits a two-component mixture model that combines p-values with v assuming a Gaussian naive Bayes model for the features. Component-specific feature means are estimated, with feature-wise variances shared across components, and features are treated as conditionally independent given component membership.
- MVN: Fits a two-component mixture model that combines p-values with v using a multivariate normal model with shared covariance (parametric, correlation-aware).
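To make the idea behind the NB option more concrete, here is a minimal base R sketch of an EM that combines a Beta(alpha1, 1) p-value model with Gaussian naive Bayes features (component-specific means, shared feature-wise variances). It is only an illustration under stated assumptions, not the code used by Gene_PPA: the function name em_nb_sketch, the fixed Uniform(0,1) null, the initialization, and the simulated data are all hypothetical.

```r
# Hypothetical helper, not part of GALLO: EM for p-values plus Gaussian
# naive Bayes features with variances shared across the two components.
em_nb_sketch <- function(p, v, pi1 = 0.4, alpha1 = 0.2, max.it = 10) {
  v <- as.matrix(v)
  # Initial posterior from the p-values alone (Uniform(0,1) null).
  f1 <- pi1 * alpha1 * p^(alpha1 - 1)
  post <- f1 / (f1 + (1 - pi1))
  for (it in seq_len(max.it)) {
    # M-step: mixing proportion, Beta shape, feature means and variances.
    pi1 <- mean(post)
    alpha1 <- -sum(post) / sum(post * log(p))
    mu1 <- colSums(post * v) / sum(post)
    mu0 <- colSums((1 - post) * v) / sum(1 - post)
    r1 <- sweep(v, 2, mu1)  # residuals from the signal means
    r0 <- sweep(v, 2, mu0)  # residuals from the null means
    s2 <- colSums(post * r1^2 + (1 - post) * r0^2) / nrow(v)
    # E-step in log space; Gaussian constants cancel (shared variances).
    l1 <- log(pi1) + log(alpha1) + (alpha1 - 1) * log(p) -
      0.5 * rowSums(sweep(r1^2, 2, s2, "/"))
    l0 <- log(1 - pi1) - 0.5 * rowSums(sweep(r0^2, 2, s2, "/"))
    post <- 1 / (1 + exp(l0 - l1))
  }
  post
}

set.seed(2)
n <- 400
z <- rbinom(n, 1, 0.3)                       # simulated true labels
pv <- ifelse(z == 1, runif(n)^5, runif(n))   # Beta(0.2, 1) under signal
feat <- matrix(rnorm(n * 2), n, 2) + 1.5 * z # features shifted under signal
ppa <- em_nb_sketch(pv, feat)
```

In this toy setting, genes simulated from the signal component receive higher posterior probabilities on average, because both the small p-values and the shifted features point the same way.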
Value
Returns the posterior probability of association for each gene.
References
Wu et al. (2018) Methods, 145, doi:10.1016/j.ymeth.2018.06.002.
Global variables
Description
Global variables
Function to estimate gene-level p-value using Davies algorithm
Description
Function to estimate gene-level p-value using Davies algorithm
Usage
Liu_ld(matrix.ld, marker_pvalues)
Arguments
matrix.ld |
A symmetric square matrix containing the pairwise linkage disequilibrium between markers in a chromosome |
marker_pvalues |
A vector with the p-values for the SNPs annotated within each gene |
Value
A vector of p-values for each gene annotated within the defined coordinates
Compute the centrality metrics for the nodes composing the network generated by the NetVis function
Description
Compute the centrality metrics for the nodes composing the network generated by the NetVis function
Usage
NetCen(data, g1, g2)
Arguments
data |
A data frame containing the relationship between the two groups to be represented in the network |
g1 |
Name of the column containing the labels of the first group that will be used to create the network |
g2 |
Name of the column containing the labels of the second group that will be used to create the network |
Details
This function returns the following centrality metrics for each node that composes the network: Degree (the number of edges incident to the node), Betweenness (the fraction of shortest paths between pairs of nodes that pass through the node), Closeness (the inverse of the sum of the shortest path distances from the node to all other nodes), and Eigenvector Centrality (the centrality measure based on the eigenvector of the adjacency matrix).
Value
A data frame with the centrality metrics for each node in the network.
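The simplest of these metrics, degree, can be reproduced by hand from an edge list. The sketch below uses base R only and a toy data frame whose column names (marker, qtl) and values are assumptions for illustration; it does not call NetCen and covers only the degree metric.

```r
# Toy edge list (assumption): markers linked to the QTL classes they overlap.
edges <- data.frame(
  marker = c("m1", "m1", "m2", "m3"),
  qtl = c("Milk", "Fertility", "Milk", "Milk")
)

# Degree = number of distinct edges incident to each node.
edges <- unique(edges)
degree <- table(c(edges$marker, edges$qtl))

# Nodes sorted from most to least connected.
degree[order(-degree)]
```

Betweenness, closeness, and eigenvector centrality require shortest paths or an eigen decomposition of the adjacency matrix, which is where a graph library such as igraph (listed in Imports) is the practical choice.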
Create a dynamic network representing the relationship between two groups of variables
Description
Create a dynamic network representing the relationship between two groups of variables
Usage
NetVis(
data,
g1,
g2,
col1 = "aquamarine",
col2 = "red",
edge_col = "gray",
remove_label = NULL,
node_size = c(15, 40),
font_size = 45,
edge_width = 1
)
Arguments
data |
A data frame containing the relationship between the two groups to be represented in the network |
g1 |
Name of the column containing the labels of the first group that will be used to create the network |
g2 |
Name of the column containing the labels of the second group that will be used to create the network |
col1 |
Color of the nodes that will represent the first group represented in the network. The default value is aquamarine |
col2 |
Color of the nodes that will represent the second group represented in the network. The default value is red |
edge_col |
Color of the edges that will connect the nodes in the network. The default value is gray |
remove_label |
If the labels of one of the groups must be omitted, this argument receives the column name passed to the g1 or g2 argument. The default value is NULL |
node_size |
A vector with the node sizes to represent g1 and g2. The default values are 15 and 40, respectively |
font_size |
The size of the font of the labels of each node (The default value is 45) |
edge_width |
The width of the edges connecting the nodes in the network |
Details
This function returns a dynamic network, using visNetwork, representing the connection between two groups. For example, the output from the find_genes_qtls_around_markers() function can be used here to represent the connections between markers and QTLs. Another option is to combine the data frames with both gene and QTL annotation around markers to represent the connections between genes and QTLs.
Value
A dynamic network representing the connection between two groups.
Network Embedding using biased random walk and Word2Vec.
Description
Network Embedding using biased random walk and Word2Vec.
Usage
Net_embedding(
net_list,
p = 0.25,
q = 1,
num = 10,
l = 80,
vector_size = 32,
window = 10,
min_count = 1,
sg_model = 1,
workers = 4,
epochs = 20
)
Arguments
net_list |
A list containing data frames that can be coerced into pandas data frames. Each data frame contains 3 columns, source, target and weight, representing the connections between nodes in the networks. |
p |
A parameter for the BiasedRandomWalk. Defines the probability, 1/p, of returning to the source node. |
q |
A parameter for the BiasedRandomWalk. Defines the probability, 1/q, of moving to a node away from the source node. |
num |
A parameter for the BiasedRandomWalk. Defines the number of walks per node. |
l |
A parameter for the BiasedRandomWalk. Defines the length of each walk. |
vector_size |
The size of the Word2Vec vectors. |
window |
The window size for the Word2Vec model. |
min_count |
Minimum count for the Word2Vec model. |
sg_model |
Training algorithm for Word2Vec (0 for CBOW, 1 for skip-gram). |
workers |
Number of worker threads to train the Word2Vec model. |
epochs |
Number of epochs to train the Word2Vec model. |
Details
This function performs a network embedding using the python libraries stellargraph and gensim through the BiasedRandomWalk and Word2Vec functions, respectively.
Value
A list of data frames with node embeddings. Each data frame has n x m dimensions, where n is the number of unique nodes in the network and m is the number of reduced dimensions defined in the function.
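The expected shape of net_list can be sketched as follows. The network names (ppi, coexpr), gene labels, and weights are invented for illustration; only the three column names come from the argument description above. The call to Net_embedding is left commented out because it needs a Python environment with stellargraph and gensim available.

```r
# Toy input (assumption): two small networks in source/target/weight format.
net_list <- list(
  ppi = data.frame(
    source = c("GENE_A", "GENE_A", "GENE_B"),
    target = c("GENE_B", "GENE_C", "GENE_C"),
    weight = c(0.9, 0.4, 0.7)
  ),
  coexpr = data.frame(
    source = c("GENE_A", "GENE_C"),
    target = c("GENE_C", "GENE_D"),
    weight = c(0.6, 0.8)
  )
)

# Check that every network has exactly the three documented columns.
cols_ok <- vapply(
  net_list,
  function(net) identical(names(net), c("source", "target", "weight")),
  logical(1)
)

# emb <- Net_embedding(net_list)  # not run: requires Python dependencies
```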
Estimate the number of effective markers in a chromosome based on an adapted version of the simpleM methodology
Description
Estimate the number of effective markers in a chromosome based on an adapted version of the simpleM methodology
Usage
Nmarkers_SimpleM(ld.file, PCA_cutoff = 0.995)
Arguments
ld.file |
A data frame with the pairwise linkage disequilibrium (LD) values for a chromosome. The column names SNP_A, SNP_B, and R are mandatory, where SNP_A and SNP_B contain the marker names and R contains the LD value between the two markers. |
PCA_cutoff |
A cutoff for the total of the variance explained by the markers. |
Details
This function estimates the effective number of markers in a chromosome using an adapted version of the simpleM methodology described in Gao et al. (2008). The function takes as input a data frame composed of three mandatory columns (SNP_A, SNP_B, and R). This data frame can be obtained using PLINK or any other software that computes LD between markers. Additionally, a threshold for the percentage of the sum of the variances explained by the markers must be provided. The number of effective markers identified by this approach can be used in multiple testing corrections, such as Bonferroni.
Value
The effective number of markers identified by the SimpleM approach
References
Gao et al. (2008) Genet Epidemiol, Volume 32, Issue 4, Pages 361-369. (doi:10.1002/gepi.20310)
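The core idea of the simpleM approach can be sketched in a few lines of base R: build the LD matrix from the pairwise table, take its eigenvalues, and count how many are needed to reach the variance cutoff. This is a simplified illustration, not the adapted version used by Nmarkers_SimpleM; the function name simpleM_sketch and the toy LD table are assumptions.

```r
# Hypothetical helper, not part of GALLO: effective number of markers as the
# number of eigenvalues needed to explain `cutoff` of the total variance.
simpleM_sketch <- function(ld, cutoff = 0.995) {
  a <- as.character(ld$SNP_A)
  b <- as.character(ld$SNP_B)
  snps <- unique(c(a, b))
  # Symmetric LD matrix with 1 on the diagonal; missing pairs stay at 0.
  m <- diag(1, length(snps))
  dimnames(m) <- list(snps, snps)
  m[cbind(a, b)] <- ld$R
  m[cbind(b, a)] <- ld$R
  ev <- eigen(m, symmetric = TRUE, only.values = TRUE)$values
  ev <- pmax(ev, 0)  # clamp tiny negative values from rounding
  which(cumsum(ev) / sum(ev) >= cutoff)[1]
}

# Toy table (assumption): s1 and s2 in complete LD, s3 independent of both.
ld_tab <- data.frame(
  SNP_A = c("s1", "s1", "s2"),
  SNP_B = c("s2", "s3", "s3"),
  R = c(1, 0, 0)
)
meff <- simpleM_sketch(ld_tab)
```

Here the three markers carry only two independent signals, so a Bonferroni threshold based on this estimate would divide the significance level by 2 rather than 3.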
Estimate the number of independent segments in a chromosome based on the effective population size
Description
Estimate the number of independent segments in a chromosome based on the effective population size
Usage
Nseg_chr(chr.table, chr_length, Ne)
Arguments
chr.table |
A table containing the chromosomes and the chromosomal length (in centiMorgans). |
chr_length |
The name of the column containing the chromosome lengths. |
Ne |
The effective population size. |
Details
This function uses an adapted version of the formula proposed by Goddard et al. (2011) to estimate the number of independent segments in a chromosome based on the effective population size.
Value
A data frame with the effective number of segments in each chromosome.
References
Goddard et al. (2011) Journal of animal breeding and genetics, Volume 128, Issue 6, Pages 409-421. (doi:10.1111/j.1439-0388.2011.00964.x)
Compute Meff statistic based on PCA to determine the number of effective markers
Description
Compute Meff statistic based on PCA to determine the number of effective markers
Usage
PCA_Meff(eigenV, cut.off)
Arguments
eigenV |
The eigenvalues obtained from the linkage disequilibrium matrix |
cut.off |
The threshold for percentage of the sum of the variances explained by the markers |
Value
The effective number of markers identified by the SimpleM approach
Compute a multi-trait test statistic for pleiotropic effects using summary statistics from association tests
Description
Compute a multi-trait test statistic for pleiotropic effects using summary statistics from association tests
Usage
PleioChiTest(data)
Arguments
data |
A data frame with the first column containing the SNP name and the remaining columns the signed t-values obtained for each marker in the association studies individually performed for each trait. |
Details
This function tests the null hypothesis that each SNP does not affect any of the traits included in the input file. The method applied here is an implementation of the statistic proposed by Bolormaa et al. (2014), which is approximately distributed as a chi-squared with n degrees of freedom, where n is equal to the number of traits included in the input file.
Value
A data frame with the multi-trait chi-squared statistics and the corresponding p-value obtained for each SNP.
References
Bolormaa et al. (2014) Plos Genetics, Volume 10, Issue 3, e1004198. (doi:10.1371/journal.pgen.1004198)
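For orientation, the statistic described in Bolormaa et al. (2014) is the quadratic form t' V^-1 t, where t holds the signed t-values of one SNP across traits and V is the correlation matrix of the t-values across SNPs. The base R sketch below follows that description; it is not the code of PleioChiTest, and the function name pleio_chi_sketch, the output column names, and the simulated t-values are assumptions.

```r
# Hypothetical helper, not part of GALLO: multi-trait chi-squared statistic
# computed from signed t-values (first column = SNP name, others = traits).
pleio_chi_sketch <- function(data) {
  t_mat <- as.matrix(data[, -1])
  V_inv <- solve(cor(t_mat))  # inverse of the trait-by-trait correlation
  # Row-wise quadratic form t_i' V^-1 t_i.
  stat <- rowSums((t_mat %*% V_inv) * t_mat)
  data.frame(
    SNP = data[[1]],
    chi2 = stat,
    pvalue = pchisq(stat, df = ncol(t_mat), lower.tail = FALSE)
  )
}

set.seed(3)
# Simulated input (assumption): 200 SNPs and three traits under the null.
tvals <- data.frame(
  SNP = paste0("snp", 1:200),
  trait1 = rnorm(200),
  trait2 = rnorm(200),
  trait3 = rnorm(200)
)
res <- pleio_chi_sketch(tvals)
```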
Plot enrichment results for QTL enrichment analysis
Description
Takes the output from qtl_enrich function and creates a bubble plot with enrichment results
Usage
QTLenrich_plot(qtl_enrich, x, pval)
Arguments
qtl_enrich |
The output from qtl_enrich function |
x |
Id column to be used from the qtl_enrich output |
pval |
P-value to be used in the plot. The name passed to this argument must match the p-value column name in the enrichment table |
Value
A plot with the QTL enrichment results
Candidate markers identified by GWAS associated with fertility traits in cattle
Description
Data from a systematic review which evaluated 18 articles regarding genome-wide association studies for male fertility traits in beef and dairy cattle
Usage
data(QTLmarkers)
Format
A data frame with 141 rows and 7 variables:
Associated.marker: Significantly associated marker
SNP.reference: The rs ID when available
Trait: Trait associated
CHR: Chromosome
BP: Chromosomal position in base pairs (bovine reference assembly UMD3.1)
Breed: Breed used in the study
Reference: Study from which the markers were retrieved
References
Fonseca et al. (2018) Journal of Animal Science, Volume 96, Issue 12, December 2018, Pages 4978-4999. (doi:10.1093/jas/sky382)
Examples
data(QTLmarkers)
Candidate windows identified by GWAS associated with fertility traits in cattle
Description
Data from a systematic review which evaluated 18 articles regarding genome-wide association studies for male fertility traits in beef and dairy cattle
Usage
data(QTLwindows)
Format
A data frame with 50 rows and 8 variables:
First.marker.in.the.window: First marker mapped in the candidate window
Last.marker.in.the.window: Last marker mapped in the candidate window
Trait: Trait associated
CHR: Chromosome
BP1: Chromosomal position in base pairs for the first marker mapped in the candidate window (bovine reference assembly UMD3.1)
BP2: Chromosomal position in base pairs for the last marker mapped in the candidate window (bovine reference assembly UMD3.1)
Breed: Breed used in the study
Reference: Study from which the markers were retrieved
References
Fonseca et al. (2018) Journal of Animal Science, Volume 96, Issue 12, December 2018, Pages 4978-4999. (doi:10.1093/jas/sky382)
Examples
data(QTLwindows)
Function to perform Weighted Z-score Approach with LD Information
Description
Function to perform Weighted Z-score Approach with LD Information
Usage
WZ_ld(marker_ld, marker_pvalues)
Arguments
marker_ld |
A symmetric square matrix containing the pairwise linkage disequilibrium between markers in a chromosome |
marker_pvalues |
A vector with the p-values for the SNPs annotated within each gene |
Value
A vector of p-values for each gene annotated within the defined coordinates
Sub-function to auto-stop the clusters created for parallel processes
Description
Sub-function to auto-stop the clusters created for parallel processes
Usage
autoStopCluster(cl)
Arguments
cl |
The cluster created by the makePSOCKcluster function |
Search genes and QTLs around candidate regions
Description
Takes a list of candidate markers and/or regions (haplotypes, CNVs, windows, etc.) and searches for genes or QTLs within a given interval
Usage
find_genes_qtls_around_markers(
db_file,
marker_file,
method = c("gene", "qtl"),
marker = c("snp", "haplotype"),
interval = 0,
nThreads = NULL,
verbose = TRUE
)
Arguments
db_file |
The data frame obtained using the import_gff_gtf() function |
marker_file |
The file with the SNP or haplotype positions. Detail: For SNP files, the columns “CHR” and “BP” with the chromosome and base pair position, respectively, are mandatory. For the haplotype, the following columns are mandatory: “CHR”, “BP1” and “BP2” |
method |
“gene” or “qtl” |
marker |
"snp" or "haplotype" |
interval |
The interval, in base pairs, to be included upstream and downstream of the marker or haplotype coordinates. |
nThreads |
Number of threads to be used |
verbose |
Logical value defining whether messages should be printed during the analysis (default=TRUE) |
Value
A dataframe with the genes or QTLs mapped within the specified intervals
Examples
data(QTLmarkers)
data(gffQTLs)
out.qtls<-find_genes_qtls_around_markers(db_file=gffQTLs, marker_file=QTLmarkers,
method = "qtl", marker = "snp",
interval = 500000, nThreads = 1)
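The interval logic can be pictured with a small base R sketch: a feature is reported for a marker when it lies on the same chromosome and overlaps the window from BP - interval to BP + interval. The code below is a simplified illustration of that overlap rule, not the internals of find_genes_qtls_around_markers(); the toy data frames and their values are assumptions, while the column names follow the ones documented in this manual.

```r
# Toy inputs (assumption): three SNPs and three genes.
markers <- data.frame(CHR = c(1, 1, 2), BP = c(1000, 5000, 1000))
genes <- data.frame(
  chr = c(1, 1, 2),
  start_pos = c(1200, 9000, 1400),
  end_pos = c(1800, 9500, 2000),
  gene_id = c("g1", "g2", "g3")
)
interval <- 500

# Pair every marker with every gene on the same chromosome.
hits <- merge(markers, genes, by.x = "CHR", by.y = "chr")

# Keep pairs where the gene overlaps [BP - interval, BP + interval].
hits <- hits[hits$start_pos <= hits$BP + interval &
               hits$end_pos >= hits$BP - interval, ]
```

With these toy values, g1 and g3 start within 500 bp of a marker and are kept, while g2 lies several kilobases away from both chromosome 1 markers and is dropped.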
Function to annotate markers and the respective p-values within genes
Description
Function to annotate markers and the respective p-values within genes
Usage
find_markers_genes(db_file, marker_file, int = 0)
Arguments
db_file |
A data frame obtained from the import_gff_gtf containing the gtf information |
marker_file |
A data frame with the results of the association test performed for each marker |
int |
The interval (in base pairs) used to annotate markers downstream and upstream from the gene coordinates |
Value
A data frame containing the markers mapped within the selected interval for each gene in the annotation file
Estimate a gene-level p-value using Weighted Z-score approach, Meta-analysis with LD correlation coefficients approach, and Davies algorithm
Description
Estimate a gene-level p-value using Weighted Z-score approach, Meta-analysis with LD correlation coefficients approach, and Davies algorithm
Usage
gene_pval(data, db_file, marker_ld, interval, p)
Arguments
data |
A data frame with the results of the association test performed for each marker |
db_file |
A data frame obtained from the import_gff_gtf containing the gtf information |
marker_ld |
A data frame containing the pairwise linkage disequilibrium between markers in a chromosome |
interval |
The interval (in base pairs) used to annotate markers downstream and upstream from the gene coordinates |
p |
The name of the column containing the P-values for each marker |
Details
Requires a table with p-values from an association test, a gtf file with the gene coordinates in the same assembly used to map the variants in the association study, and a data frame with pairwise linkage disequilibrium (LD) values between markers. This analysis must be performed for each chromosome individually. The data frame with the results of the association study must have three mandatory columns named CHR, BP, and SNP, containing the chromosome, base pair position, and marker name, respectively. The gtf file must be imported by the import_gff_gtf() function from GALLO, or can be customized by the user as long as it has the same column names. The LD table must contain three mandatory columns, SNP_A, SNP_B, and R, where the first two columns contain the marker names and the third column contains the LD value between these markers. This data frame can be obtained using PLINK or any other software that computes pairwise LD between markers on the same chromosome. When no LD value is available for a pair of SNPs in the data frame, an LD of zero is assumed.
Value
A data frame with the gene level p-values obtained using the Weighted Z-score approach (P_WZ_ld), Meta-analysis with LD correlation coefficients approach (P_meta_LD), and Liu algorithm (P_Liu, Liu et al. (2009))
References
Liu et al. (2009) Computational Statistics & Data Analysis, Volume 53, (doi:10.1016/j.csda.2008.11.025)
A gff example for QTL annotation
Description
Data from the Animal QTLdb comprising the bovine QTL annotation
Usage
data(gffQTLs)
Format
A data frame with 111742 rows and 6 variables:
chr: Chromosome
database: The database from which the QTL information was retrieved
QTL_type: The class of each QTL annotated in the database
start_pos: Start position in the genome for each QTL
end_pos: End position in the genome for each QTL
extra_info: Additional information about the QTLs, such as QTL ID, Name, PUBMED ID, mapping type, among others
Examples
data(gffQTLs)
A gtf example for gene annotation
Description
Data from Ensembl comprising the gene annotation for the bovine genome
Usage
data(gtfGenes)
Format
A data frame with 24616 rows and 8 variables:
chr: Chromosome
start_pos: Start position in the genome for each gene
end_pos: End position in the genome for each gene
width: Gene length
strand: Strand on which the gene is mapped (+ or -)
gene_id: Ensembl gene ID
gene_name: Gene symbol
gene_biotype: Gene biotype
Examples
data(gtfGenes)
Import .gtf and .gff files to be used during gene and QTL annotation, respectively
Description
Takes a .gtf or .gff file and imports it into a dataframe
Usage
import_gff_gtf(db_file, file_type)
Arguments
db_file |
File with the gene mapping or QTL information. For gene mapping, a .gtf file from the Ensembl database must be used. For the QTL search, a .gff file from Animal QTLdb must be used. Both files must use the same reference annotation used in the original study |
file_type |
"gtf" or "gff" |
Value
A dataframe with the gtf or gff content
Examples
gffpath <- system.file("extdata", "example.gff", package="GALLO")
qtl.inp <- import_gff_gtf(db_file=gffpath,file_type="gff")
Compute the eigenvalues for the linkage disequilibrium (LD) matrix and use as input for PCA_Meff function to compute the effective number of markers
Description
Compute the eigenvalues for the linkage disequilibrium (LD) matrix and use as input for PCA_Meff function to compute the effective number of markers
Usage
inferCut(mat.r, cut.off)
Arguments
mat.r |
A matrix composed by the LD between markers |
cut.off |
The threshold for percentage of the sum of the variances explained by the markers |
Value
The effective number of markers identified by the SimpleM approach
Function to compute the weighted 1-dimensional kernel density estimator
Description
Function to compute the weighted 1-dimensional kernel density estimator
Usage
kde(x, w = NA)
Arguments
x |
output of the sLDA function |
w |
An optional vector of weights (how much each observation contributes) |
Value
Returns an estimate of the probability density at each x[i]
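A weighted one-dimensional Gaussian kernel density estimate evaluated at the observed points can be written directly in base R. The sketch below is an assumption-laden illustration rather than the package's kde function: the name weighted_kde_sketch, the Gaussian kernel, and the bw.nrd0 bandwidth rule are choices made for this example.

```r
# Hypothetical helper, not part of GALLO: weighted Gaussian KDE evaluated at
# each observation x[i].
weighted_kde_sketch <- function(x, w = NA) {
  if (all(is.na(w))) w <- rep(1, length(x))  # default: equal weights
  w <- w / sum(w)                            # normalize weights to sum to 1
  h <- bw.nrd0(x)                            # rule-of-thumb bandwidth
  # Density at xi = weighted sum of Gaussian kernels centered at each x[j].
  vapply(x, function(xi) sum(w * dnorm(xi, mean = x, sd = h)), numeric(1))
}

x <- c(0, 1, 2)
d_equal <- weighted_kde_sketch(x)
d_weighted <- weighted_kde_sketch(x, w = c(10, 1, 1))
```

With equal weights the estimate is symmetric around the middle point, whereas giving most of the weight to the first observation pulls the density toward it. In an EM setting the weights would be the current posterior probabilities of one component.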
Function to perform Meta-Analysis with LD Correlation Coefficients
Description
Function to perform Meta-Analysis with LD Correlation Coefficients
Usage
meta_LD(marker_ld, marker_pvalues)
Arguments
marker_ld |
A symmetric square matrix containing the pairwise linkage disequilibrium between markers in a chromosome |
marker_pvalues |
A vector with the p-values for the SNPs annotated within each gene |
Value
A vector of p-values for each gene annotated within the defined coordinates
Function to perform an EM algorithm (logistic prior P(Z = 1 \mid v) + beta p-values (discriminative prior))
for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Description
Function to perform an EM algorithm (logistic prior P(Z = 1 \mid v) + beta p-values (discriminative prior))
for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Usage
model.LR(
pvalues,
v,
lambda,
alpha0,
alpha1,
pi0,
pi1,
max.it,
verbose = verbose
)
Arguments
pvalues |
Vector of gene-level p-values. |
v |
An auxiliary feature vector. |
alpha0 |
Shape parameter of the null/background p-value distribution. |
alpha1 |
Shape parameter of the alternative/signal p-value distribution, modeled as Beta(alpha1, 1). |
pi0 |
Mixing proportion (prior probability) that an observation belongs to the null/background component |
pi1 |
Mixing proportion (prior probability) that an observation belongs to the alternative/signal component |
max.it |
Maximum number of iterations for the EM algorithm. |
verbose |
A logical value defining if the iteration information should be printed or not. |
Value
Returns the posterior probability of association for each gene.
Function to perform an EM algorithm (“Multiple” univariate KDEs (nonparametric naive Bayes)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Description
Function to perform an EM algorithm (“Multiple” univariate KDEs (nonparametric naive Bayes)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Usage
model.M(pvalues, v, alpha0, alpha1, pi0, pi1, max.it, verbose = verbose)
Arguments
pvalues |
Vector of gene-level p-values. |
v |
An auxiliary feature vector. |
alpha0 |
Shape parameter of the null/background p-value distribution. |
alpha1 |
Shape parameter of the alternative/signal p-value distribution, modeled as Beta(alpha1, 1). |
pi0 |
Mixing proportion (prior probability) that an observation belongs to the null/background component |
pi1 |
Mixing proportion (prior probability) that an observation belongs to the alternative/signal component |
max.it |
Maximum number of iterations for the EM algorithm. |
verbose |
A logical value defining if the iteration information should be printed or not. |
Value
Returns the posterior probability of association for each gene.
Function to perform an EM algorithm (multivariate normal with shared covariance (parametric, correlation-aware))
for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Description
Function to perform an EM algorithm (multivariate normal with shared covariance (parametric, correlation-aware))
for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Usage
model.MVN(pvalues, v, alpha0, alpha1, pi0, pi1, max.it, verbose = verbose)
Arguments
pvalues |
Vector of gene-level p-values. |
v |
An auxiliary feature vector. |
alpha0 |
Shape parameter of the null/background p-value distribution. |
alpha1 |
Shape parameter of the alternative/signal p-value distribution, modeled as Beta(alpha1, 1). |
pi0 |
Mixing proportion (prior probability) that an observation belongs to the null/background component |
pi1 |
Mixing proportion (prior probability) that an observation belongs to the alternative/signal component |
max.it |
Maximum number of iterations for the EM algorithm. |
verbose |
A logical value defining if the iteration information should be printed or not. |
Value
Returns the posterior probability of association for each gene.
Function to perform an EM algorithm (Gaussian naive Bayes (parametric, independent features))
for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Description
Function to perform an EM algorithm (Gaussian naive Bayes (parametric, independent features))
for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Usage
model.NB(pvalues, v, alpha0, alpha1, pi0, pi1, max.it, verbose = verbose)
Arguments
pvalues |
Vector of gene-level p-values. |
v |
An auxiliary feature vector. |
alpha0 |
Shape parameter of the null/background p-value distribution. |
alpha1 |
Shape parameter of the alternative/signal p-value distribution, modeled as Beta(alpha1, 1). |
pi0 |
Mixing proportion (prior probability) that an observation belongs to the null/background component |
pi1 |
Mixing proportion (prior probability) that an observation belongs to the alternative/signal component |
max.it |
Maximum number of iterations for the EM algorithm. |
verbose |
A logical value defining if the iteration information should be printed or not. |
Value
Returns the posterior probability of association for each gene.
Function to perform an EM algorithm (LDA projection + 1D KDE (nonparametric, correlation-aware))
for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Description
Function to perform an EM algorithm (LDA projection + 1D KDE (nonparametric, correlation-aware))
for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.
Usage
model.proj(pvalues, v, pi0, pi1, alpha0, alpha1, max.it, verbose = verbose)
Arguments
pvalues |
Vector of gene-level p-values. |
v |
An auxiliary feature vector. |
pi0 |
Mixing proportion (prior probability) that an observation belongs to the null/background component |
pi1 |
Mixing proportion (prior probability) that an observation belongs to the alternative/signal component |
alpha0 |
Shape parameter of the null/background p-value distribution. |
alpha1 |
Shape parameter of the alternative/signal p-value distribution, modeled as Beta(alpha1, 1). |
max.it |
Maximum number of iterations for the EM algorithm. |
verbose |
A logical value defining if the iteration information should be printed or not. |
Value
Returns the posterior probability of association for each gene.
Overlapping between grouping factors
Description
Takes a dataframe with a column of genes, QTLs (or other data) and a grouping column, and creates matrices with the overlapping information
Usage
overlapping_among_groups(file, x, y)
Arguments
file |
A dataframe with the data and grouping factor |
x |
The grouping factor to be compared |
y |
The data to be compared among the levels of the grouping factor |
Value
A list with three matrices: 1) A matrix with the number of overlapping data; 2) A matrix with the percentage of overlapping; 3) A matrix with the combination of the two previous ones
Examples
data(QTLmarkers)
data(gtfGenes)
genes.out <- find_genes_qtls_around_markers(db_file=gtfGenes,
marker_file=QTLmarkers,method="gene",
marker="snp",interval=100000, nThreads=1)
overlapping.out<-overlapping_among_groups(
file=genes.out,x="Reference",y="gene_id")
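To make the three matrices concrete, the base R sketch below builds them for a small toy data set. It is only an illustration: the toy data are hypothetical, and the choice of denominator for the percentage (the number of items in the row group) is an assumption that may not match the package.

```r
# Toy data: hypothetical gene IDs annotated in three hypothetical studies
toy <- data.frame(
  Reference = c("A", "A", "A", "B", "B", "C", "C", "C"),
  gene_id   = c("g1", "g2", "g3", "g2", "g3", "g3", "g4", "g5"),
  stringsAsFactors = FALSE
)

# One set of unique items per level of the grouping factor
sets <- lapply(split(toy$gene_id, toy$Reference), unique)

# N[i, j]: number of items shared by groups i and j
N <- sapply(sets, function(a) sapply(sets, function(b) length(intersect(a, b))))

# P[i, j]: shared items as a percentage of the items in group i (assumed denominator)
P <- round(100 * N / diag(N), 1)

# Combined text matrix, e.g. "2 (66.7%)"
NP <- matrix(paste0(N, " (", P, "%)"), nrow = nrow(N), dimnames = dimnames(N))
```

The diagonal of N holds the size of each group, which is why every diagonal percentage equals 100.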
Plot overlapping between data and grouping factors
Description
Takes the output from overlapping_among_groups function and creates a heatmap with the overlapping between groups
Usage
plot_overlapping(overlapping_matrix, nmatrix, ntext, group, labelcex = 1)
Arguments
overlapping_matrix |
The object obtained from the overlapping_among_groups function |
nmatrix |
An integer from 1 to 3 indicating which matrix will be used to plot the overlapping, where: 1) A matrix with the number of overlapping data; 2) A matrix with the percentage of overlapping; 3) A matrix with the combination of the two previous ones |
ntext |
An integer from 1 to 3 indicating which matrix will be used as the text matrix for the heatmap, where: 1) A matrix with the number of overlapping data; 2) A matrix with the percentage of overlapping; 3) A matrix with the combination of the two previous ones |
group |
A vector with the size of groups. This vector will be plotted as row and column names in the heatmap |
labelcex |
A numeric value indicating the size of the row and column labels |
Value
A heatmap with the overlapping between groups
Examples
data(QTLmarkers)
data(gtfGenes)
genes.out <- find_genes_qtls_around_markers(
db_file=gtfGenes, marker_file=QTLmarkers,
method="gene", marker="snp",interval=100000,
nThreads=1)
overlapping.out<-overlapping_among_groups(
file=genes.out,x="Reference",y="gene_id")
plot_overlapping(overlapping.out,
nmatrix=2,ntext=2,
group=unique(genes.out$Reference))
Plot QTLs information from the find_genes_qtls_around_markers output
Description
Takes the output from find_genes_qtls_around_markers and creates plots for the frequency of each QTL type and trait
Usage
plot_qtl_info(
qtl_file,
qtl_plot = c("qtl_type", "qtl_name"),
n = "all",
qtl_class = NULL,
horiz = FALSE,
...
)
Arguments
qtl_file |
The output from find_genes_qtls_around_markers function |
qtl_plot |
"qtl_type" or "qtl_name" |
n |
Number of QTLs to be plotted when the qtl_name option is selected |
qtl_class |
Class of QTLs to be plotted when the qtl_name option is selected |
horiz |
Logical value indicating whether the legend of the pie plot for qtl_type should be plotted horizontally. The default is FALSE, so the legend is plotted vertically. |
... |
Arguments to be passed to/from other methods. For the default method these can include further arguments (such as axes, asp and main) and graphical parameters (see par) which are passed to plot.window(), title() and axis(). |
Value
A plot with the requested information
Examples
data(QTLmarkers)
data(gffQTLs)
out.qtls<-find_genes_qtls_around_markers(db_file=gffQTLs,
marker_file=QTLmarkers, method = "qtl",
marker = "snp", interval = 500000,
nThreads = 1)
plot_qtl_info(out.qtls, qtl_plot = "qtl_type", cex=2)
Performs a QTL enrichment analysis based on a hypergeometric test for each QTL class
Description
Takes the output from find_genes_qtls_around_markers and runs a QTL enrichment analysis
Usage
qtl_enrich(
qtl_db,
qtl_file,
qtl_type = c("QTL_type", "Name"),
enrich_type = c("genome", "chromosome"),
chr.subset = NULL,
nThreads = NULL,
padj = c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"),
verbose = TRUE
)
Arguments
qtl_db |
The object obtained using the import_gff_gtf() function |
qtl_file |
The output from find_genes_qtls_around_markers function |
qtl_type |
A character indicating which type of enrichment will be performed. QTL_type indicates that the enrichment processes will be performed for the QTL classes, while Name indicates that the enrichment analysis will be performed for each trait individually |
enrich_type |
A character indicating if the enrichment analysis will be performed for all the chromosomes ("genome") or for a subset of chromosomes ("chromosome"). If the "genome" option is selected, the reported results merge all chromosomes |
chr.subset |
If enrich_type equals "chromosome", it is possible to define a subset of chromosomes to be analyzed. The default is NULL, in which case all the chromosomes are analyzed |
nThreads |
The number of threads to be used. |
padj |
The algorithm for multiple testing correction to be adopted ("holm", "hochberg", "hommel", "bonferroni", "BH", "BY","fdr", "none") |
verbose |
Logical value defining whether messages should be printed during the analysis (default=TRUE) |
Details
Some traits are investigated far more often than others (for example, milk production related traits in the QTL database for cattle), which results in a larger proportion of records for those traits in the database. Consequently, simply inspecting the proportion of each QTL type may be misleading. To reduce the impact of this bias, a QTL enrichment analysis can be performed. The QTL enrichment analysis performed by the GALLO package is based on a hypergeometric test using the number of annotated QTLs within the candidate regions and the total number of the same QTL in the QTL database.
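The hypergeometric test described above can be illustrated with base R. The sketch below is an assumption-laden example rather than the package's code: all counts are invented for illustration, and the additional p-values for classB and classC are hypothetical placeholders used only to show the multiple testing adjustment.

```r
# Hypothetical counts for one QTL class (illustrative numbers, not real data)
N_db   <- 10000  # total QTL records in the database
K_db   <- 400    # database records belonging to the QTL class of interest
n_cand <- 150    # QTL records annotated within the candidate regions
k_cand <- 15     # of those, records belonging to the class of interest

# P(X >= k_cand) when n_cand records are drawn without replacement
pval <- phyper(k_cand - 1, K_db, N_db - K_db, n_cand, lower.tail = FALSE)

# Several classes are usually tested, so the p-values are adjusted together
# (classB and classC are made-up p-values for illustration)
pvals <- c(classA = pval, classB = 0.20, classC = 0.04)
adj   <- p.adjust(pvals, method = "fdr")
```

Under the null hypothesis about 150 * 400 / 10000 = 6 records of this class would be expected in the candidate regions, so observing 15 yields a small p-value.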
Value
A data frame with the p-value for the enrichment result
Examples
data(QTLmarkers)
data(gffQTLs)
out.qtls<-find_genes_qtls_around_markers(
db_file=gffQTLs,marker_file=QTLmarkers,
method = "qtl",marker = "snp",
interval = 500000, nThreads = 1)
out.enrich<-qtl_enrich(qtl_db=gffQTLs,
qtl_file=out.qtls, qtl_type = "Name",
enrich_type = "chromosome",chr.subset = NULL,
padj = "fdr",nThreads = 1)
Plot relationship between data and grouping factors
Description
Takes the output from find_genes_qtls_around_markers function and creates a chord plot with the relationship between groups
Usage
relationship_plot(
qtl_file,
x,
y,
grid.col = "gray60",
degree = 90,
canvas.xlim = c(-2, 2),
canvas.ylim = c(-2, 2),
cex,
gap
)
Arguments
qtl_file |
The output from find_genes_qtls_around_markers function |
x |
The first grouping factor, to be plotted on the left-hand side of the chord plot |
y |
The second grouping factor, to be plotted on the right-hand side of the chord plot |
grid.col |
A character with the grid color for the chord plot or a vector with different colors to be used in the grid colors. Note that when a color vector is provided, the length of this vector must equal the number of sectors in the chord plot |
degree |
A numeric value corresponding to the starting degree from which the circle begins to draw. Note this degree is always reverse-clockwise |
canvas.xlim |
The coordinate for the canvas in the x-axis. The default is c(-2, 2) |
canvas.ylim |
The coordinate for the canvas in the y-axis. The default is c(-2, 2) |
cex |
The size of the labels to be printed in the plot |
gap |
A numeric value corresponding to the gap between the chord sectors |
Value
A chord plot relating x and y
Examples
data(QTLmarkers)
data(gffQTLs)
out.qtls<-find_genes_qtls_around_markers(
db_file=gffQTLs, marker_file=QTLmarkers,
method = "qtl", marker = "snp",
interval = 500000, nThreads = 1)
out.enrich<-qtl_enrich(qtl_db=gffQTLs,
qtl_file=out.qtls, qtl_type = "Name",
enrich_type = "chromosome",
chr.subset = NULL, padj = "fdr",nThreads = 1)
out.enrich$ID<-paste(out.enrich$QTL," - ",
"CHR",out.enrich$CHR,sep="")
out.enrich.filtered<-out.enrich[which(out.enrich$adj.pval<0.05),]
out.qtls$ID<-paste(out.qtls$Name," - ",
"CHR",out.qtls$CHR,sep="")
out.enrich.filtered<-out.enrich.filtered[order(out.enrich.filtered$adj.pval),]
out.qtls.filtered<-out.qtls[which(out.qtls$ID%in%out.enrich.filtered$ID[1:10]),]
out.qtls.filtered[which(out.qtls.filtered$Reference==
"Feugang et al. (2010)"), "color_ref"]<-"purple"
out.qtls.filtered[which(out.qtls.filtered$Reference==
"Buzanskas et al. (2017)"),"color_ref"]<-"pink"
color.grid<-c(rep("black",length(unique(out.qtls.filtered$Abbrev))),
unique(out.qtls.filtered$color_ref))
names(color.grid)<-c(unique(out.qtls.filtered$Abbrev),
unique(out.qtls.filtered$Reference))
relationship_plot(qtl_file=out.qtls.filtered,
x="Abbrev", y="Reference",cex=1,gap=5,
degree = 90, canvas.xlim = c(-5, 5),
canvas.ylim = c(-3, 3), grid.col = color.grid)
Function to perform a linear discriminant analysis (LDA) for a multivariate set of features
Description
Function to perform a linear discriminant analysis (LDA) for a multivariate set of features
Usage
sLDA(z, v)
Arguments
z |
numeric vector of length n, with values in [0,1] |
v |
numeric matrix nxd. Each row is an observation, each column a feature |
Value
Returns a one-dimensional discriminant score y for each observation, computed by projecting the feature matrix v onto the first LDA direction estimated using the soft class-membership weights z.
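A projection of this kind can be sketched in base R as follows. This is a minimal illustration, assuming a two-class Fisher discriminant with soft weights; the function name soft_lda_sketch, the small ridge term, and the simulated data are hypothetical and not taken from the package.

```r
# Sketch of a soft-weighted, two-class LDA projection (hypothetical helper)
soft_lda_sketch <- function(z, v) {
  v <- as.matrix(v)
  w1 <- z / sum(z)              # normalized weights for the signal class
  w0 <- (1 - z) / sum(1 - z)    # normalized weights for the null class
  mu1 <- colSums(v * w1)        # weighted class means
  mu0 <- colSums(v * w0)
  c1 <- sweep(v, 2, mu1)        # rows centered on each class mean
  c0 <- sweep(v, 2, mu0)
  # Pooled within-class scatter, weighted by the soft memberships
  Sw <- (crossprod(c1 * sqrt(z)) + crossprod(c0 * sqrt(1 - z))) / nrow(v)
  Sw <- Sw + diag(1e-8, ncol(v))   # small ridge for numerical stability (assumption)
  a <- solve(Sw, mu1 - mu0)        # discriminant direction
  drop(v %*% a)                    # one score per observation
}

# Simulated example: the first feature separates the two classes
set.seed(3)
n <- 300
cls <- rbinom(n, 1, 0.3)
v <- cbind(rnorm(n, mean = 1.5 * cls), rnorm(n))
z <- ifelse(cls == 1, 0.9, 0.1)   # soft memberships in [0,1]
y <- soft_lda_sketch(z, v)
```

Observations with a high membership weight tend to receive larger scores, so y can be used as a one-dimensional summary of v.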
Sub-function to split comment column from QTL output
Description
Takes the output from the QTL annotation and splits the content of the extra_info column into several additional columns
Usage
splitQTL_comment(output.final)
Arguments
output.final |
Output from QTL annotation |
Value
A data frame with the extra_info column content, from the gff file, broken in several additional columns
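The sketch below shows one way such a split could be done in base R, assuming the column holds GFF-style "key=value" pairs separated by semicolons. The helper split_attributes_sketch and the two example strings are hypothetical; the real column content and the package's parsing rules may differ.

```r
# Sketch: break "key=value;key=value" strings into one column per key
# (split_attributes_sketch is a hypothetical helper, not a GALLO function)
split_attributes_sketch <- function(extra_info) {
  parsed <- lapply(strsplit(extra_info, ";", fixed = TRUE), function(fields) {
    fields <- fields[nzchar(fields)]
    keys   <- trimws(sub("=.*$", "", fields))     # text before the first "="
    values <- sub("^[^=]*=", "", fields)          # text after the first "="
    setNames(values, keys)
  })
  keys <- unique(unlist(lapply(parsed, names)))
  # Align every record on the full set of keys; missing keys become NA
  mat <- do.call(rbind, lapply(parsed, function(x) unname(x[keys])))
  out <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(out) <- keys
  out
}

# Hypothetical attribute strings
extra_info <- c("QTL_ID=101;Name=Milk yield;Abbrev=MY",
                "QTL_ID=202;Name=Body weight;Abbrev=BW;trait_ID=7")
out <- split_attributes_sketch(extra_info)
```

Records that lack a given key receive NA in the corresponding column, which keeps the result rectangular.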
Sub-function to search genes around candidate markers
Description
Takes a list of candidate markers and searches for genes within a determined interval
Usage
sub_genes_markers(chr_list, db_file, marker_file, nThreads = NULL, int = 0)
Arguments
chr_list |
Object with the chromosomes to be analyzed |
db_file |
Data frame with the information from .gtf file |
marker_file |
Data frame with the information from the candidate regions file |
nThreads |
The number of threads to be used |
int |
The interval in base pair |
Value
A dataframe with the genes or QTLs mapped within the specified intervals
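The interval search around single markers can be illustrated with the base R sketch below. It is an assumption-based example: the helper genes_near_markers_sketch, the column names (chr, start_pos, end_pos, gene_id, CHR, BP), and the toy coordinates are hypothetical and chosen only for illustration.

```r
# Toy annotation and markers (hypothetical coordinates and column names)
genes <- data.frame(chr = c("1", "1", "2"),
                    start_pos = c(100000, 900000, 2000),
                    end_pos   = c(105000, 910000, 8000),
                    gene_id   = c("gA", "gB", "gC"),
                    stringsAsFactors = FALSE)
markers <- data.frame(CHR = c("1", "2"), BP = c(150000, 5000),
                      stringsAsFactors = FALSE)

# Keep genes on the same chromosome that overlap [BP - int, BP + int]
genes_near_markers_sketch <- function(genes, markers, int = 0) {
  hits <- lapply(seq_len(nrow(markers)), function(i) {
    lo <- markers$BP[i] - int
    hi <- markers$BP[i] + int
    keep <- genes$chr == markers$CHR[i] &
      genes$start_pos <= hi & genes$end_pos >= lo
    if (!any(keep)) return(NULL)
    cbind(markers[rep(i, sum(keep)), , drop = FALSE],
          genes[keep, , drop = FALSE])
  })
  out <- do.call(rbind, hits)
  if (is.null(out)) return(data.frame())
  rownames(out) <- NULL
  out
}

near   <- genes_near_markers_sketch(genes, markers, int = 100000)
inside <- genes_near_markers_sketch(genes, markers, int = 0)
```

With int = 0 only genes that contain the marker position are returned; widening the interval adds neighbouring genes.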
Sub-function to search genes around candidate markers
Description
Takes a list of candidate markers and searches for genes within a determined interval
Usage
sub_genes_windows(chr_list, db_file, marker_file, nThreads = NULL, int = 0)
Arguments
chr_list |
Object with the chromosomes to be analyzed |
db_file |
Data frame with the information from .gtf file |
marker_file |
Data frame with the information from the candidate regions file |
nThreads |
The number of threads to be used |
int |
The interval in base pair |
Value
A dataframe with the genes or QTLs mapped within the specified intervals
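When the candidate regions are windows rather than single positions, the search becomes an interval-overlap test. The sketch below shows that test in base R under stated assumptions: the helper windows_overlap_sketch and the column names (CHR, BP1, BP2, chr, start_pos, end_pos) are hypothetical, and pairing all windows and features per chromosome with merge() is a simplification suited only to small inputs.

```r
# Toy annotation and one candidate window (hypothetical values and column names)
db <- data.frame(chr = c("1", "1", "2"),
                 start_pos = c(100000, 900000, 2000),
                 end_pos   = c(105000, 910000, 8000),
                 gene_id   = c("gA", "gB", "gC"),
                 stringsAsFactors = FALSE)
windows <- data.frame(CHR = "1", BP1 = 95000, BP2 = 99000,
                      stringsAsFactors = FALSE)

# Two intervals overlap when each one starts before the other ends
windows_overlap_sketch <- function(db, windows, int = 0) {
  pairs <- merge(windows, db, by.x = "CHR", by.y = "chr")  # all pairs per chromosome
  keep <- pairs$start_pos <= pairs$BP2 + int &
    pairs$end_pos >= pairs$BP1 - int
  pairs[keep, , drop = FALSE]
}

hits0 <- windows_overlap_sketch(db, windows, int = 0)
hits5 <- windows_overlap_sketch(db, windows, int = 5000)
```

Here the window ends 1 kb before gA starts, so gA is reported only once the window is extended by int.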
Performs a QTL enrichment analysis based on a Bootstrap simulation for each QTL class using the QTL information per chromosome
Description
Takes the output from find_genes_qtls_around_markers and runs a QTL enrichment analysis
Usage
sub_qtlEnrich_chom(
qtl_file,
qtl_type,
qtl.file.types,
table.qtl.class,
padj,
qtl_db,
search_qtl,
nThreads
)
Arguments
qtl_file |
The output from find_genes_qtls_around_markers function |
qtl_type |
A string indicating which QTL enrichment will be performed: "QTL_type" or "Name" |
qtl.file.types |
A vector with the observed QTL classes |
table.qtl.class |
A frequency table for the number of each QTL in each chromosome |
qtl_db |
The QTL annotation database |
search_qtl |
The column of the QTL annotation database used to search and count the QTLs |
nThreads |
Number of threads for parallel processing |
Value
A data frame with the p-value for the enrichment result
Performs a QTL enrichment analysis based on a Bootstrap simulation for each QTL class using the QTL information across the whole genome
Description
Takes the output from find_genes_qtls_around_markers and runs a QTL enrichment analysis
Usage
sub_qtlEnrich_geno(
qtl_file,
qtl_type,
qtl.file.types,
table.qtl.class,
padj,
qtl_db,
search_qtl,
nThreads
)
Arguments
qtl_file |
The output from find_genes_qtls_around_markers function |
qtl_type |
A string indicating which QTL enrichment will be performed: "QTL_type" or "Name" |
qtl.file.types |
A vector with the observed QTL classes |
table.qtl.class |
A frequency table for the number of each QTL in each chromosome |
qtl_db |
The QTL annotation database |
search_qtl |
The column of the QTL annotation database used to search and count the QTLs |
nThreads |
Number of threads for parallel processing |
Value
A data frame with the p-value for the enrichment result
Sub-function to search QTLs around candidate markers
Description
Takes a list of candidate markers and searches for QTLs within a determined interval
Usage
sub_qtl_markers(chr_list, db_file, marker_file, nThreads = NULL, int = 0)
Arguments
chr_list |
Object with the chromosomes to be analyzed |
db_file |
Data frame with the information from .gff file |
marker_file |
Data frame with the information from the candidate regions file |
nThreads |
The number of threads to be used |
int |
The interval in base pair |
Value
A dataframe with the QTLs mapped within the specified intervals
Sub-function to search QTLs around candidate markers
Description
Takes a list of candidate markers and searches for QTLs within a determined interval
Usage
sub_qtl_windows(chr_list, db_file, marker_file, nThreads = NULL, int = 0)
Arguments
chr_list |
Object with the chromosomes to be analyzed |
db_file |
Data frame with the information from .gff file |
marker_file |
Data frame with the information from the candidate regions file |
nThreads |
The number of threads to be used |
int |
The interval in base pair |
Value
A dataframe with the QTLs mapped within the specified intervals