Help for package GALLO

Title:

Genomic Annotation in Livestock for Positional Candidate LOci

Version:

2.0

Description:

The accurate annotation of genes and Quantitative Trait Loci (QTLs) located within candidate markers and/or regions (haplotypes, windows, CNVs, etc.) is a crucial step in the most common genomic analyses performed in livestock, such as Genome-Wide Association Studies or transcriptomics. The Genomic Annotation in Livestock for positional candidate LOci (GALLO) is an R package designed to provide an intuitive and straightforward environment to annotate positional candidate genes and QTLs from high-throughput genetic studies in livestock. Moreover, GALLO allows the graphical visualization of gene and QTL annotation results, data comparison among different grouping factors (e.g., methods, breeds, tissues, statistical models, studies, etc.), and QTL enrichment in different livestock species including cattle, pigs, sheep, and chicken, among others.

URL:

<https://github.com/pablobio/GALLO>

Depends:

R (≥ 4.0.0)

biocViews:

Software

Imports:

circlize, data.table, doParallel, dplyr, ggplot2, graphics, grDevices, foreach, lattice, parallel, RColorBrewer, rtracklayer, stats, stringr, unbalhaar, utils, DT, webshot, igraph, visNetwork, CompQuadForm, Matrix, reticulate

License:

GPL-3

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.3.3

Suggests:

Hmisc, knitr, rmarkdown, testthat

VignetteBuilder:

knitr

NeedsCompilation:

Packaged:

2026-02-06 20:59:25 UTC; pablofonseca

Author:

Pablo Fonseca [aut, cre], Aroa Suarez-Vega [aut], Gabriele Marras [aut], Angela Cánovas [aut]

Maintainer:

Pablo Fonseca <p.fonseca@csic.es>

Repository:

CRAN

Date/Publication:

2026-02-09 23:50:20 UTC

Run a short EM on p-values to fit a 2-component Beta mixture

Description

Fits a two-component mixture model to gene-level p-values where the null component is typically Uniform(0,1) (i.e., Beta(1,1)) and the alternative component is Beta(alpha1, 1) with alpha1 < 1 enriching small p-values. Returns updated mixture parameters and posterior probabilities of belonging to the signal (alternative) component.

Usage

EMinfer(pvalues, pi0, pi1, alpha0, alpha1, max.it)

Arguments

pvalues

Numeric vector of gene-level p-values in (0, 1].

pi0

Mixing proportion for the null/background component (Z = 0).

pi1

Mixing proportion for the alternative/signal component (Z = 1); typically pi1 = 1 - pi0.

alpha0

Shape parameter of the null p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_0, 1) when Z = 0. The common choice alpha0 = 1 corresponds to a Uniform(0,1) null.

alpha1

Shape parameter of the alternative/signal p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_1, 1) when Z = 1. Values < 1 concentrate mass near 0 (enrichment of small p-values).

max.it

Integer. Maximum number of EM iterations.

Value

A list with elements:

alpha0: Estimated null shape parameter.
alpha1: Estimated alternative shape parameter.
pi: Estimated mixing proportion for the signal component (pi1).
post: Posterior probabilities P(Z=1 \mid p) for each p-value.
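
As a rough illustration of the model described above, the following base-R sketch runs an EM loop for a Uniform(0,1) / Beta(alpha1, 1) mixture. It is not the package's EMinfer code: the function name em_beta_sketch is hypothetical, the null shape is assumed to stay fixed at alpha0 = 1, and the alpha1 update is the weighted maximum-likelihood estimate for a Beta(alpha, 1) density.

```r
# Hypothetical helper, not part of GALLO: EM for a two-component Beta mixture.
# Assumption: the null component is held fixed at Beta(alpha0, 1).
em_beta_sketch <- function(pvalues, pi1 = 0.4, alpha0 = 1, alpha1 = 0.2,
                           max.it = 10) {
  stopifnot(all(pvalues > 0 & pvalues <= 1))
  for (it in seq_len(max.it)) {
    # E-step: posterior probability of the signal component (Z = 1)
    f0 <- dbeta(pvalues, alpha0, 1)
    f1 <- dbeta(pvalues, alpha1, 1)
    post <- pi1 * f1 / ((1 - pi1) * f0 + pi1 * f1)
    # M-step: update the mixing proportion and the signal shape parameter
    pi1 <- mean(post)
    alpha1 <- -sum(post) / sum(post * log(pvalues))
  }
  list(alpha0 = alpha0, alpha1 = alpha1, pi = pi1, post = post)
}

# Simulated p-values: 800 null (uniform) and 200 enriched near zero
set.seed(1)
p <- c(runif(800), rbeta(200, 0.2, 1))
fit <- em_beta_sketch(p)
str(fit[c("alpha0", "alpha1", "pi")])
```

Because the signal density is decreasing in p whenever alpha1 < 1, smaller p-values always receive larger posterior probabilities in this sketch.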

Function to estimate the posterior probability of association (PPA) based on the integration of -OMICs and functional data via representation learning of candidate genes.

Description

Function to estimate the posterior probability of association (PPA) based on the integration of -OMICs and functional data via representation learning of candidate genes.

Usage

Gene_PPA(
  gene.pval,
  Gene_ID,
  pval,
  embeddings,
  lambda = 0,
  alpha0 = 1,
  alpha1 = 0.2,
  pi0 = 0.6,
  pi1 = 0.4,
  max.it = 10,
  model,
  verbose = F
)

Arguments

gene.pval

A table containing the gene IDs and p-values.

Gene_ID

Name of the column containing the gene IDs (without NAs or duplicated IDs).

pval

Name of the column containing the gene-level p-values.

embeddings

The output of the Net_embedding function.

lambda

Ridge (L_2) regularization strength used only when model = "LR" for the logistic regression part that models the prior probability of being in the signal component.

alpha0

Shape parameter of the null p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_0, 1) when Z = 0; the common choice alpha0 = 1 corresponds to a Uniform(0,1) null.

alpha1

Shape parameter of the alternative/signal p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_1, 1) when Z = 1; values < 1 concentrate mass near 0 (enrichment of small p-values).

pi0

Mixing proportion (prior probability) that an observation belongs to the null/background component Z = 0 in the two-component mixture.

pi1

Mixing proportion (prior probability) that an observation belongs to the alternative/signal component Z = 1 in the two-component mixture; typically \pi_1 = 1 - \pi_0.

max.it

Maximum number of iterations for the EM algorithm.

model

One of the following options defining the model used during the EM algorithm: proj, M, LR, NB, and MVN. See Details for more information.

verbose

A logical value defining if the iteration information should be printed or not.

Details

The function Gene_PPA allows the user to select different options that define how the EM algorithm models the distribution of v, the auxiliary information (feature/annotation) matrix attached to the p-values. The EM algorithms employed here are adapted from Wu et al. (2018):

proj: Fits a two-component mixture model that combines p-values with an LDA-based one-dimensional projection of the feature matrix v. The projection is obtained by soft LDA using current posterior weights, and class-conditional densities of the projected score are estimated nonparametrically via weighted KDE.
M: Fits a two-component mixture model that combines p-values with the feature matrix v by estimating a separate weighted KDE for each feature and each component. The joint density is approximated by the product of marginal densities (naive independence across features).
LR: Fits a two-component mixture model where p-values follow a Beta-mixture and the prior probability of belonging to the signal component is modeled as a logistic function of v. Logistic regression parameters are estimated within the EM iterations, with optional ridge regularization controlled by lambda.
NB: Fits a two-component mixture model that combines p-values with v assuming a Gaussian naive Bayes model for the features. Component-specific feature means are estimated, with feature-wise variances shared across components, and features are treated as conditionally independent given component membership.
MVN: Fits a two-component mixture model that combines p-values with v using a multivariate normal model with shared covariance (parametric, correlation-aware).
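
To make the role of v concrete, the sketch below shows a single E-step in the spirit of the LR option: a gene-specific prior from a logistic function of the features is combined with the Beta(alpha, 1) densities of the p-values. All inputs, including the coefficient vector beta, are made-up illustrative values, and this is not the code used inside Gene_PPA.

```r
# Hypothetical inputs: 4 genes, 2 embedding features, made-up coefficients
pvalues <- c(0.001, 0.001, 0.50, 0.50)
v <- cbind(emb1 = c(2, -2, 2, -2), emb2 = c(1, -1, 1, -1))
beta <- c(intercept = -0.5, emb1 = 0.8, emb2 = 0.4)  # illustrative only

alpha0 <- 1    # null shape (Uniform(0,1))
alpha1 <- 0.2  # signal shape (mass concentrated near 0)

# Gene-specific prior P(Z = 1 | v) from a logistic function of the features
prior1 <- as.vector(plogis(cbind(1, v) %*% beta))

# Beta(alpha, 1) densities of the p-values under each component
dens0 <- dbeta(pvalues, alpha0, 1)
dens1 <- dbeta(pvalues, alpha1, 1)

# E-step: posterior probability of association for each gene
post <- prior1 * dens1 / (prior1 * dens1 + (1 - prior1) * dens0)
round(post, 3)
```

Genes 1 and 2 share the same p-value, so any difference in their posterior comes only from the feature-based prior.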

Value

Returns the posterior probability of association for each gene.

References

Wu et al. (2018) Methods, 145, doi:10.1016/j.ymeth.2018.06.002.

Global variables

Description

Global variables

Function to estimate gene-level p-value using Davies algorithm

Description

Function to estimate gene-level p-value using Davies algorithm

Usage

Liu_ld(matrix.ld, marker_pvalues)

Arguments

matrix.ld

A symmetric square matrix containing the pairwise linkage disequilibrium between markers in a chromosome

marker_pvalues

A vector with the p-values for the SNPs annotated within each gene

Value

A vector of p-values for each gene annotated within the defined coordinates

Compute the centrality metrics for the nodes composing the network generated by the NetVis function

Description

Compute the centrality metrics for the nodes composing the network generated by the NetVis function

Usage

NetCen(data, g1, g2)

Arguments

data

A data frame containing the relationship between the two groups to be represented in the network

g1

Name of the column containing the labels of the first group that will be used to create the network

g2

Name of the column containing the labels of the second group that will be used to create the network

Details

This function returns the following centrality metrics for each node in the network: Degree (the number of edges incident to the node), Betweenness (the fraction of shortest paths between pairs of nodes that pass through the node), Closeness (the inverse of the sum of the shortest path distances from the node to all other nodes), and Eigenvector Centrality (the centrality measure based on the eigenvector of the adjacency matrix).

Value

A data frame with the centrality metrics for each node in the network.

Create a dynamic network representing the relationship between two groups of variables

Description

Create a dynamic network representing the relationship between two groups of variables

Usage

NetVis(
  data,
  g1,
  g2,
  col1 = "aquamarine",
  col2 = "red",
  edge_col = "gray",
  remove_label = NULL,
  node_size = c(15, 40),
  font_size = 45,
  edge_width = 1
)

Arguments

data

A data frame containing the relationship between the two groups to be represented in the network

g1

Name of the column containing the labels of the first group that will be used to create the network

g2

Name of the column containing the labels of the second group that will be used to create the network

col1

Color of the nodes that will represent the first group represented in the network. The default value is aquamarine

col2

Color of the nodes that will represent the second group represented in the network. The default value is red

edge_col

Color of the edges that will connect the nodes in the network. The default value is gray

remove_label

If the labels must be omitted for one of the groups, this argument receives the column name passed to the g1 or g2 argument. The default value is NULL

node_size

A vector with the node sizes to represent g1 and g2. The default values are 15 and 40, respectively

font_size

The size of the font of the labels of each node (The default value is 45)

edge_width

The width of the edges connecting the nodes in the network

Details

This function returns a dynamic network, using visNetwork, representing the connection between two groups. For example, the output from the find_genes_qtls_around_markers() function can be used here to represent the connections between markers and QTLs. Another option is to combine the data frames with both gene and QTL annotation around markers to represent the connections between genes and QTLs.

Value

A dynamic network representing the connection between two groups.

Network Embedding using biased random walk and Word2Vec.

Description

Network Embedding using biased random walk and Word2Vec.

Usage

Net_embedding(
  net_list,
  p = 0.25,
  q = 1,
  num = 10,
  l = 80,
  vector_size = 32,
  window = 10,
  min_count = 1,
  sg_model = 1,
  workers = 4,
  epochs = 20
)

Arguments

net_list

A list containing data frames that can be coerced into pandas data frames. Each data frame contains 3 columns, source, target and weight, representing the connections between nodes in the networks.

p

A parameter for the BiasedRandomWalk. Defines probability, 1/p, of returning to source node.

q

A parameter for the BiasedRandomWalk. Defines probability, 1/q, for moving to a node away from the source node.

num

A parameter for the BiasedRandomWalk. Defines the number of walks per node.

l

A parameter for the BiasedRandomWalk. Defines the length of each walk.

vector_size

The size of the Word2Vec vectors.

window

The window size for the Word2Vec model.

min_count

Minimum count for the Word2Vec model.

sg_model

Training algorithm for Word2Vec (0 for CBOW, 1 for skip-gram).

workers

Number of worker threads to train the Word2Vec model.

epochs

Number of epochs to train the Word2Vec model.

Details

This function performs a network embedding using the python libraries stellargraph and gensim through the BiasedRandomWalk and Word2Vec functions, respectively.

Value

A list of data frames with node embeddings. Each data frame contains nxm dimensions, where n is the number of unique nodes in the network and m is the number of reduced dimensions defined in the function.
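
The effect of p and q can be illustrated with the standard node2vec-style second-order bias, in which a walker that has just moved from a previous node weighs each candidate next node by 1/p (step back to the previous node), 1 (candidate is also a neighbour of the previous node), or 1/q (candidate lies farther away). The function below is a hypothetical base-R sketch of that rule for an unweighted graph; it does not call stellargraph and is not part of GALLO.

```r
# Hypothetical helper: transition probabilities for one step of a biased walk.
# prev: node visited before the current one
# candidates: neighbours of the current node
# neighbors_of_prev: neighbours of the previous node
walk_bias <- function(prev, candidates, neighbors_of_prev, p = 0.25, q = 1) {
  w <- ifelse(candidates == prev, 1 / p,                 # return to previous node
         ifelse(candidates %in% neighbors_of_prev, 1,    # stay close to it
                1 / q))                                  # move farther away
  w / sum(w)                                             # normalise to probabilities
}

# The walker came from "a"; "b" is also adjacent to "a", "c" is not
probs <- walk_bias(prev = "a", candidates = c("a", "b", "c"),
                   neighbors_of_prev = "b", p = 0.25, q = 2)
probs
```

With p = 0.25 the walk is strongly pulled back toward the node it just left, which keeps walks local; raising p or lowering q pushes walks outward.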

Estimate the number of effective markers in a chromosome based on an adapted version of the simpleM methodology

Description

Estimate the number of effective markers in a chromosome based on an adapted version of the simpleM methodology

Usage

Nmarkers_SimpleM(ld.file, PCA_cutoff = 0.995)

Arguments

ld.file

A data frame with the pairwise linkage disequilibrium (LD) values for a chromosome. The column names SNP_A, SNP_B, and R are mandatory, where SNP_A and SNP_B contain the marker names and R contains the LD value between the two markers.

PCA_cutoff

A cutoff for the total of the variance explained by the markers.

Details

This function estimates the effective number of markers in a chromosome using an adapted version of the simpleM methodology described in Gao et al. (2008). The function uses as input a data frame composed of three mandatory columns (SNP_A, SNP_B, and R). This data frame can be obtained using PLINK or any other software that computes LD between markers. Additionally, a threshold for the percentage of the sum of the variances explained by the markers must be provided. The number of effective markers identified by this approach can be used in multiple testing corrections, such as Bonferroni.

Value

The effective number of markers identified by the SimpleM approach
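
The core idea can be sketched in a few lines of base R: take the eigenvalues of the LD matrix and count how many are needed to reach the variance cutoff. The helper meff_sketch is hypothetical and assumes the LD values have already been arranged into a symmetric matrix; the adapted procedure in GALLO may differ in its details.

```r
# Hypothetical helper: number of eigenvalues needed to explain `cut.off`
# of the total variance of an LD (correlation) matrix.
meff_sketch <- function(ld_mat, cut.off = 0.995) {
  ev <- eigen(ld_mat, symmetric = TRUE, only.values = TRUE)$values
  ev <- sort(abs(ev), decreasing = TRUE)
  which(cumsum(ev) / sum(ev) >= cut.off)[1]
}

meff_independent <- meff_sketch(diag(4))           # four unlinked markers
meff_redundant   <- meff_sketch(matrix(1, 4, 4))   # four markers in perfect LD
c(independent = meff_independent, redundant = meff_redundant)

# Bonferroni threshold based on the effective number of markers
0.05 / meff_independent
```

Four unlinked markers count as four tests, whereas four markers in perfect LD collapse to a single effective test.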

References

Gao et al. (2008) Genet Epidemiol, Volume 32, Issue 4, Pages 361-369. (doi:10.1002/gepi.20310)

Estimate the number of independent segments in a chromosome based on the effective population size

Description

Estimate the number of independent segments in a chromosome based on the effective population size

Usage

Nseg_chr(chr.table, chr_length, Ne)

Arguments

chr.table

A table containing the chromosomes and the chromosomal length (in centiMorgans).

chr_length

The name of the column containing the chromosome lengths.

Ne

The effective population size.

Details

This function uses an adapted version of the formula proposed by Goddard et al. (2011) to estimate the number of independent segments in a chromosome based on the effective population size.

Value

A data frame with the effective number of segments in each chromosome.

References

Goddard et al. (2011) Journal of animal breeding and genetics, Volume 128, Issue 6, Pages 409-421. (doi:10.1111/j.1439-0388.2011.00964.x)

Compute Meff statistic based on PCA to determine the number of effective markers

Description

Compute Meff statistic based on PCA to determine the number of effective markers

Usage

PCA_Meff(eigenV, cut.off)

Arguments

eigenV

The eigenvalues obtained from the linkage disequilibrium matrix

cut.off

The threshold for percentage of the sum of the variances explained by the markers

Value

The effective number of markers identified by the SimpleM approach

Compute a multi-trait test statistic for pleiotropic effects using summary statistics from association tests

Description

Compute a multi-trait test statistic for pleiotropic effects using summary statistics from association tests

Usage

PleioChiTest(data)

Arguments

data

A data frame with the first column containing the SNP name and the remaining columns the signed t-values obtained for each marker in the association studies individually performed for each trait.

Details

This function tests a null hypothesis stating that each SNP does not affect any of the traits included in the input file. The method applied here is an implementation of the statistic proposed by Bolormaa et al. (2014), which is approximately distributed as a chi-squared with n degrees of freedom, where n is the number of traits included in the input file.

Value

A data frame with the multi-trait chi-squared statistics and the corresponding p-value obtained for each SNP.
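
A minimal base-R sketch of this kind of statistic is given below, assuming the form t' V^{-1} t, where t holds the signed t-values of one SNP across traits and V is the trait-by-trait correlation matrix of the t-values computed over all SNPs. The function name pleio_sketch and the simulated input are illustrative only and are not the package implementation.

```r
# Hypothetical helper: multi-trait chi-squared from signed t-values.
# First column of `data`: SNP name; remaining columns: one trait each.
pleio_sketch <- function(data) {
  tmat <- as.matrix(data[, -1, drop = FALSE])     # SNPs x traits
  V <- cor(tmat)                                  # correlation among traits
  stat <- rowSums((tmat %*% solve(V)) * tmat)     # t' V^{-1} t for each SNP
  data.frame(SNP = data[[1]],
             chi2 = stat,
             pval = pchisq(stat, df = ncol(tmat), lower.tail = FALSE))
}

# Simulated null data: 200 SNPs, 3 traits
set.seed(42)
toy <- data.frame(SNP = paste0("snp", 1:200),
                  trait1 = rnorm(200), trait2 = rnorm(200), trait3 = rnorm(200))
res <- pleio_sketch(toy)
head(res)
```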

References

Bolormaa et al. (2014) Plos Genetics, Volume 10, Issue 3, e1004198. (doi:10.1371/journal.pgen.1004198)

Plot enrichment results for QTL enrichment analysis

Description

Takes the output from qtl_enrich function and creates a bubble plot with enrichment results

Usage

QTLenrich_plot(qtl_enrich, x, pval)

Arguments

qtl_enrich

The output from qtl_enrich function

x

Id column to be used from the qtl_enrich output

pval

P-value to be used in the plot. The name passed to this argument must match the p-value column name in the enrichment table

Value

A plot with the QTL enrichment results

Candidate markers identified by GWAS associated with fertility traits in cattle

Description

Data from a systematic review which evaluated 18 articles regarding genome-wide association studies for male fertility traits in beef and dairy cattle

Usage

data(QTLmarkers)

Format

A data frame with 141 rows and 7 variables:

Associated.marker: Significantly associated marker
SNP.reference: The rs ID when available
Trait: Trait associated
CHR: Chromosome
BP: Chromosomal position in base pairs (bovine reference assembly UMD3.1)
Breed: Breed used in the study
Reference: Study from which the markers were retrieved

References

Fonseca et al. (2018) Journal of Animal Science, Volume 96, Issue 12, December 2018, Pages 4978-4999. (doi:10.1093/jas/sky382)

Examples

data(QTLmarkers)

Candidate windows identified by GWAS associated with fertility traits in cattle

Description

Data from a systematic review which evaluated 18 articles regarding genome-wide association studies for male fertility traits in beef and dairy cattle

Usage

data(QTLwindows)

Format

A data frame with 50 rows and 8 variables:

First.marker.in.the.window: First marker mapped in the candidate window
Last.marker.in.the.window: Last marker mapped in the candidate window
Trait: Trait associated
CHR: Chromosome
BP1: Chromosomal position in base pairs for the first marker mapped in the candidate window (bovine reference assembly UMD3.1)
BP2: Chromosomal position in base pairs for the last marker mapped in the candidate window (bovine reference assembly UMD3.1)
Breed: Breed used in the study
Reference: Study from which the markers were retrieved

References

Fonseca et al. (2018) Journal of Animal Science, Volume 96, Issue 12, December 2018, Pages 4978-4999. (doi:10.1093/jas/sky382)

Examples

data(QTLwindows)

Function to perform Weighted Z-score Approach with LD Information

Description

Function to perform Weighted Z-score Approach with LD Information

Usage

WZ_ld(marker_ld, marker_pvalues)

Arguments

marker_ld

A symmetric square matrix containing the pairwise linkage disequilibrium between markers in a chromosome

marker_pvalues

A vector with the p-values for the SNPs annotated within each gene

Value

A vector of p-values for each gene annotated within the defined coordinates
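
One common way to combine marker p-values while accounting for LD is a Stouffer-type weighted Z-score whose variance is corrected by the correlation between markers. The sketch below uses that textbook formulation with equal weights by default; the helper name wz_sketch, the weights, and the one-sided conversion are assumptions, and the exact procedure in WZ_ld may differ.

```r
# Hypothetical helper: weighted Z-score combination with an LD correction.
# marker_ld: symmetric square matrix of pairwise LD (used as correlation)
# marker_pvalues: p-values of the markers annotated within one gene
wz_sketch <- function(marker_ld, marker_pvalues,
                      w = rep(1, length(marker_pvalues))) {
  z <- qnorm(marker_pvalues, lower.tail = FALSE)       # one-sided z-scores
  num <- sum(w * z)                                    # weighted sum of z-scores
  den <- sqrt(as.numeric(t(w) %*% marker_ld %*% w))    # its SD under the given LD
  pnorm(num / den, lower.tail = FALSE)
}

p <- c(0.05, 0.05)
p_indep <- wz_sketch(diag(2), p)          # independent markers
p_ld    <- wz_sketch(matrix(1, 2, 2), p)  # markers in perfect LD
c(independent = p_indep, perfect_LD = p_ld)
```

Two markers in perfect LD carry the same information, so the combined p-value stays at 0.05, while two independent markers with the same p-value yield a smaller gene-level p-value.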

Sub-function to auto-stop the clusters created for parallel processes

Description

Sub-function to auto-stop the clusters created for parallel processes

Usage

autoStopCluster(cl)

Arguments

cl

The cluster created by the makePSOCKcluster function

Search genes and QTLs around candidate regions

Description

Takes a list of candidate markers and/or regions (haplotypes, CNVs, windows, etc.) and searches for genes or QTLs in a determined interval

Usage

find_genes_qtls_around_markers(
  db_file,
  marker_file,
  method = c("gene", "qtl"),
  marker = c("snp", "haplotype"),
  interval = 0,
  nThreads = NULL,
  verbose = TRUE
)

Arguments

db_file

The data frame obtained using the import_gff_gtf() function

marker_file

The file with the SNP or haplotype positions. Detail: For SNP files, the columns “CHR” and “BP” with the chromosome and base pair position, respectively, are mandatory. For the haplotype, the following columns are mandatory: “CHR”, “BP1” and “BP2”

method

“gene” or “qtl”

marker

"snp" or "haplotype"

interval

The interval in base pairs which can be included upstream and downstream from the markers or haplotype coordinates.

nThreads

Number of threads to be used

verbose

Logical value defining whether messages should be printed during the analysis (default = TRUE)

Value

A dataframe with the genes or QTLs mapped within the specified intervals

Examples

data(QTLmarkers)
data(gffQTLs)
out.qtls<-find_genes_qtls_around_markers(db_file=gffQTLs, marker_file=QTLmarkers,
method = "qtl", marker = "snp",
interval = 500000, nThreads = 1)
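
The meaning of the interval argument can be illustrated without the package by a plain overlap test: a gene is reported for a marker when the gene coordinates overlap the window from BP - interval to BP + interval on the same chromosome. The toy tables and the overlap rule below are assumptions made for illustration; they reuse the column names documented for gtfGenes and for SNP marker files.

```r
# Toy annotation and marker tables (made-up coordinates)
genes <- data.frame(gene_id = c("g1", "g2", "g3"),
                    chr = c(1, 1, 2),
                    start_pos = c(1000, 950000, 1500),
                    end_pos = c(5000, 990000, 2500))
markers <- data.frame(SNP = c("m1", "m2"), CHR = c(1, 2), BP = c(400000, 2000))
interval <- 500000

# Pair every marker with every gene on the same chromosome
hits <- merge(markers, genes, by.x = "CHR", by.y = "chr")

# Keep the pairs where the gene overlaps [BP - interval, BP + interval]
hits <- hits[hits$start_pos <= hits$BP + interval &
             hits$end_pos >= hits$BP - interval, ]
hits[, c("SNP", "gene_id")]
```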

Function to annotate markers and the respective p-values within genes

Description

Function to annotate markers and the respective p-values within genes

Usage

find_markers_genes(db_file, marker_file, int = 0)

Arguments

db_file

A data frame obtained from the import_gff_gtf containing the gtf information

marker_file

A data frame with the results of the association test performed for each marker

int

The interval (in base pairs) used to annotate markers downstream and upstream from the gene coordinates

Value

A data frame containing the markers mapped within the selected interval for each gene in the annotation file

Estimate a gene-level p-value using Weighted Z-score approach, Meta-analysis with LD correlation coefficients approach, and Davies algorithm

Description

Estimate a gene-level p-value using Weighted Z-score approach, Meta-analysis with LD correlation coefficients approach, and Davies algorithm

Usage

gene_pval(data, db_file, marker_ld, interval, p)

Arguments

data

A data frame with the results of the association test performed for each marker

db_file

A data frame obtained from the import_gff_gtf containing the gtf information

marker_ld

A data frame containing the pairwise linkage disequilibrium between markers in a chromosome

interval

The interval (in base pairs) used to annotate markers downstream and upstream from the gene coordinates

p

The name of the column containing the P-values for each marker

Details

Requires a table with p-values from an association test, a gtf file with the gene coordinates in the same assembly used to map the variants used in the association study, and a data frame with pairwise linkage disequilibrium (LD) values between markers. This analysis must be performed for each chromosome individually. The data frame with the results of the association study must have three mandatory columns named CHR, BP and SNP, containing the chromosome, base pair position and marker name, respectively. The gtf file must be imported by the import_gff_gtf() function from GALLO or can be customized by the user, provided it has the same column names. The LD table must contain three mandatory columns, SNP_A, SNP_B and R, where the first two columns contain the marker names and the third column contains the LD value between these markers. This data frame can be obtained using PLINK or any other software which computes pairwise LD between markers in the same chromosome. In the absence of an LD value between any two SNPs in the data frame, an LD equal to zero is assumed.

Value

A data frame with the gene level p-values obtained using the Weighted Z-score approach (P_WZ_ld), Meta-analysis with LD correlation coefficients approach (P_meta_LD), and Liu algorithm (P_Liu, Liu et al. (2009))
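
The LD table layout described in Details (SNP_A, SNP_B, R) can be turned into a symmetric matrix with a few lines of base R, with pairs that are absent from the table left at zero. This is only a sketch of the data layout with made-up marker names; it is not the code that gene_pval runs internally.

```r
# Toy LD table in long format; the pair s2-s3 is deliberately missing
ld_long <- data.frame(SNP_A = c("s1", "s1"),
                      SNP_B = c("s2", "s3"),
                      R = c(0.8, 0.1))
snps <- c("s1", "s2", "s3")

# Start from an identity matrix: LD of 1 with itself, 0 for unknown pairs
ld_mat <- diag(1, length(snps))
dimnames(ld_mat) <- list(snps, snps)

# Fill the observed pairs and mirror them to keep the matrix symmetric
idx <- cbind(match(ld_long$SNP_A, snps), match(ld_long$SNP_B, snps))
ld_mat[idx] <- ld_long$R
ld_mat[idx[, 2:1, drop = FALSE]] <- ld_long$R
ld_mat
```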

References

Liu et al. (2009) Computational Statistics & Data Analysis, Volume 53, (doi:10.1016/j.csda.2008.11.025)

A gff example for QTL annotation

Description

Data from the Animal QTLdb comprising the bovine QTL annotation

Usage

data(gffQTLs)

Format

A data frame with 111742 rows and 6 variables:

chr: Chromosome
database: The database from which the QTL information was retrieved
QTL_type: The class of each QTL annotated in the database
start_pos: Start position in the genome for each QTL
end_pos: End position in the genome for each QTL
extra_info: Additional information about the QTLs, such as QTL ID, Name, PUBMED ID, mapping type, among others

Examples

data(gffQTLs)

A gtf example for gene annotation

Description

Data from Ensembl comprising the gene annotation for the bovine genome

Usage

data(gtfGenes)

Format

A data frame with 24616 rows and 8 variables:

chr: Chromosome
start_pos: Start position in the genome for each gene
end_pos: End position in the genome for each gene
width: Gene length
strand: Strand on which the gene is mapped (+ or -)
gene_id: Ensembl gene ID
gene_name: Gene symbol
gene_biotype: Gene biotype

Examples

data(gtfGenes)

Import .gtf and .gff files to be used during gene and QTL annotation, respectively

Description

Takes a .gtf or .gff file and imports it into a dataframe

Usage

import_gff_gtf(db_file, file_type)

Arguments

db_file

File with the gene mapping or QTL information. For gene mapping, a .gtf file from the Ensembl database must be used. For the QTL search, a .gff file from Animal QTLdb must be used. Both files must use the same reference annotation used in the original study

file_type

"gtf" or "gff"

Value

A dataframe with the gtf or gff content

Examples

gffpath <- system.file("extdata", "example.gff", package="GALLO")

qtl.inp <- import_gff_gtf(db_file=gffpath,file_type="gff")

Compute the eigenvalues for the linkage disequilibrium (LD) matrix and use as input for PCA_Meff function to compute the effective number of markers

Description

Compute the eigenvalues for the linkage disequilibrium (LD) matrix and use as input for PCA_Meff function to compute the effective number of markers

Usage

inferCut(mat.r, cut.off)

Arguments

mat.r

A matrix composed of the LD values between markers

cut.off

The threshold for percentage of the sum of the variances explained by the markers

Value

The effective number of markers identified by the SimpleM approach

Function to compute the weighted 1-dimensional kernel density estimator

Description

Function to compute the weighted 1-dimensional kernel density estimator

Usage

kde(x, w = NA)

Arguments

x

output of the sLDA function

w

An optional vector of weights (how much each observation contributes)

Value

Returns an estimate of the probability density at each x[i]
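
A weighted kernel density estimate evaluated at the observations themselves can be sketched as follows. The Gaussian kernel, the bw.nrd0 bandwidth, and the helper name wkde_sketch are assumptions chosen for illustration; the kde function in GALLO may use a different kernel or bandwidth rule.

```r
# Hypothetical helper: weighted Gaussian KDE evaluated at each x[i]
wkde_sketch <- function(x, w = NA) {
  if (all(is.na(w))) w <- rep(1, length(x))   # default: equal weights
  w <- w / sum(w)                             # weights must sum to one
  h <- bw.nrd0(x)                             # rule-of-thumb bandwidth
  vapply(x, function(xi) sum(w * dnorm((xi - x) / h)) / h, numeric(1))
}

set.seed(7)
x <- rnorm(50)
d_equal <- wkde_sketch(x)                          # every point counts equally
d_wtd   <- wkde_sketch(x, w = as.numeric(x > 0))   # only positive points count
round(head(cbind(x, d_equal, d_wtd)), 3)
```

In an EM setting the weights would typically be the current posterior probabilities, so each component gets its own density estimate from the same observations.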

Function to perform Meta-Analysis with LD Correlation Coefficients

Description

Function to perform Meta-Analysis with LD Correlation Coefficients

Usage

meta_LD(marker_ld, marker_pvalues)

Arguments

marker_ld

A symmetric square matrix containing the pairwise linkage disequilibrium between markers in a chromosome

marker_pvalues

A vector with the p-values for the SNPs annotated within each gene

Value

A vector of p-values for each gene annotated within the defined coordinates

Function to perform an EM algorithm (logistic prior P(Z = 1 \mid v) + beta p-values (discriminative prior)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.

Description

Function to perform an EM algorithm (logistic prior P(Z = 1 \mid v) + beta p-values (discriminative prior)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.

Usage

model.LR(
  pvalues,
  v,
  lambda,
  alpha0,
  alpha1,
  pi0,
  pi1,
  max.it,
  verbose = verbose
)

Arguments

pvalues

Vector of gene-level p-values.

v

An auxiliary feature vector.

alpha0

Shape parameter of the null p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_0, 1) when Z = 0; the common choice alpha0 = 1 corresponds to a Uniform(0,1) null.

alpha1

Shape parameter of the alternative/signal p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_1, 1) when Z = 1; values < 1 concentrate mass near 0 (enrichment of small p-values).

pi0

Mixing proportion (prior probability) that an observation belongs to the null/background component Z = 0 in the two-component mixture.

pi1

Mixing proportion (prior probability) that an observation belongs to the alternative/signal component Z = 1 in the two-component mixture; typically \pi_1 = 1 - \pi_0.

max.it

Maximum number of iterations for the EM algorithm.

verbose

A logical value defining if the iteration information should be printed or not.

Value

Returns the posterior probability of association for each gene.

Function to perform an EM algorithm (“Multiple” univariate KDEs (nonparametric naive Bayes)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.

Description

Function to perform an EM algorithm (“Multiple” univariate KDEs (nonparametric naive Bayes)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.

Usage

model.M(pvalues, v, alpha0, alpha1, pi0, pi1, max.it, verbose = verbose)

Arguments

pvalues

Vector of gene-level p-values

v

an auxiliary feature vector

alpha0

Shape parameter of the null p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_0, 1) when Z = 0; the common choice alpha0 = 1 corresponds to a Uniform(0,1) null.

alpha1

Shape parameter of the alternative/signal p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_1, 1) when Z = 1; values < 1 concentrate mass near 0 (enrichment of small p-values).

pi0

Mixing proportion (prior probability) that an observation belongs to the null/background component Z = 0 in the two-component mixture.

pi1

Mixing proportion (prior probability) that an observation belongs to the alternative/signal component Z = 1 in the two-component mixture; typically \pi_1 = 1 - \pi_0.

max.it

Maximum number of iterations for the EM algorithm.

verbose

A logical value defining if the iteration information should be printed or not.

Value

Returns the posterior probability of association for each gene.

Function to perform an EM algorithm (multivariate normal with shared covariance (parametric, correlation-aware)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.

Description

Function to perform an EM algorithm (multivariate normal with shared covariance (parametric, correlation-aware)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.

Usage

model.MVN(pvalues, v, alpha0, alpha1, pi0, pi1, max.it, verbose = verbose)

Arguments

pvalues

Vector of gene-level p-values.

v

An auxiliary feature vector.

alpha0

Shape parameter of the null p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_0, 1) when Z = 0; the common choice alpha0 = 1 corresponds to a Uniform(0,1) null.

alpha1

Shape parameter of the alternative/signal p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_1, 1) when Z = 1; values < 1 concentrate mass near 0 (enrichment of small p-values).

pi0

Mixing proportion (prior probability) that an observation belongs to the null/background component Z = 0 in the two-component mixture.

pi1

Mixing proportion (prior probability) that an observation belongs to the alternative/signal component Z = 1 in the two-component mixture; typically \pi_1 = 1 - \pi_0.

max.it

Maximum number of iterations for the EM algorithm.

verbose

A logical value defining if the iteration information should be printed or not.

Value

Returns the posterior probability of association for each gene.

Function to perform an EM algorithm (Gaussian naive Bayes (parametric, independent features)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.

Description

Function to perform an EM algorithm (Gaussian naive Bayes (parametric, independent features)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.

Usage

model.NB(pvalues, v, alpha0, alpha1, pi0, pi1, max.it, verbose = verbose)

Arguments

pvalues

Vector of gene-level p-values.

v

An auxiliary feature vector.

alpha0

Shape parameter of the null p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_0, 1) when Z = 0; the common choice alpha0 = 1 corresponds to a Uniform(0,1) null.

alpha1

Shape parameter of the alternative/signal p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_1, 1) when Z = 1; values < 1 concentrate mass near 0 (enrichment of small p-values).

pi0

Mixing proportion (prior probability) that an observation belongs to the null/background component Z = 0 in the two-component mixture.

pi1

Mixing proportion (prior probability) that an observation belongs to the alternative/signal component Z = 1 in the two-component mixture; typically \pi_1 = 1 - \pi_0.

max.it

Maximum number of iterations for the EM algorithm.

verbose

A logical value defining if the iteration information should be printed or not.

Value

Returns the posterior probability of association for each gene.
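
To illustrate how a Gaussian naive Bayes term can enter the posterior, the following base-R sketch performs a single update that combines the p-value densities with per-feature Gaussian densities. It is a sketch under stated assumptions, not the code of model.NB: the helper nb_loglik, the simulated data, and the fixed values of pi1 and alpha1 are all hypothetical.

```r
# One naive Bayes update combining p-values with Gaussian feature densities.
# Assumptions: features are conditionally independent and Gaussian within
# each component; all names below are illustrative.
nb_loglik <- function(v, w) {
  w <- w / sum(w)
  mu <- colSums(v * w)                      # weighted mean per feature
  s2 <- colSums(sweep(v, 2, mu)^2 * w)      # weighted variance per feature
  ll <- sapply(seq_len(ncol(v)), function(j)
    dnorm(v[, j], mu[j], sqrt(s2[j]), log = TRUE))
  rowSums(ll)                               # independence: sum of log densities
}

set.seed(2)
truth <- rep(c(0, 1), c(150, 50))           # simulated component labels
v <- matrix(rnorm(400), ncol = 2) + 2 * truth
p <- ifelse(truth == 1, rbeta(200, 0.3, 1), runif(200))
pi1 <- 0.25; alpha1 <- 0.3

# Initial soft weights from the p-values alone (uniform null)
z <- pi1 * dbeta(p, alpha1, 1) / ((1 - pi1) + pi1 * dbeta(p, alpha1, 1))

# Posterior log-odds: prior + p-value term + feature term
log_odds <- log(pi1) + dbeta(p, alpha1, 1, log = TRUE) + nb_loglik(v, z) -
  (log(1 - pi1) + nb_loglik(v, 1 - z))
post <- plogis(log_odds)
```

Working on the log-odds scale avoids numerical underflow when many features are multiplied together.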

Function to perform an EM algorithm (LDA projection + 1D KDE (nonparametric, correlation-aware)) for a 2-component mixture where each observation `i` has a p-value `p_i`, and an auxiliary feature vector `v_i`.

Description

Function to perform an EM algorithm (LDA projection + 1D KDE (nonparametric, correlation-aware)) for a 2-component mixture where each observation i has a p-value p_i, and an auxiliary feature vector v_i.

Usage

model.proj(pvalues, v, pi0, pi1, alpha0, alpha1, max.it, verbose = verbose)

Arguments

pvalues

Vector of gene-level p-values.

v

An auxiliary feature vector.

pi0

Mixing proportion (prior probability) that an observation belongs to the null/background component Z = 0 in the two-component mixture.

pi1

Mixing proportion (prior probability) that an observation belongs to the alternative/signal component Z = 1 in the two-component mixture; typically \pi_1 = 1 - \pi_0.

alpha0

Shape parameter of the null/background p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_0, 1) when Z = 0; \alpha_0 = 1 gives the Uniform(0,1) null.

alpha1

Shape parameter of the alternative/signal p-value distribution, modeled as p \sim \mathrm{Beta}(\alpha_1, 1) when Z = 1; values < 1 concentrate mass near 0 (enrichment of small p-values).

max.it

Maximum number of iterations for the EM algorithm.

verbose

A logical value defining if the iteration information should be printed or not.

Value

Returns the posterior probability of association for each gene.
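
The nonparametric part of this approach can be sketched with a weighted one-dimensional kernel density estimate in base R. The example assumes the features have already been projected to a single score y; the helper wkde, the simulated scores, and the soft weights z are hypothetical and do not reproduce the internals of model.proj.

```r
# Weighted 1D kernel density estimate evaluated at the observed scores.
# Assumption: y is a one-dimensional projection of the feature matrix.
wkde <- function(y, w) {
  w <- w / sum(w)                                  # weights must sum to 1
  d <- density(y, bw = bw.nrd0(y), weights = w)    # Gaussian kernel by default
  approx(d$x, d$y, xout = y)$y                     # interpolate back to the data
}

set.seed(3)
y <- c(rnorm(150), rnorm(50, mean = 3))  # simulated projected scores
z <- rep(c(0.1, 0.9), c(150, 50))        # soft membership in the signal component

f1 <- wkde(y, z)      # score density under the signal component
f0 <- wkde(y, 1 - z)  # score density under the null component
```

The two densities can then replace the Gaussian feature terms when computing the posterior probability for each observation.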

Overlapping between grouping factors

Description

Takes a data frame with a column of genes, QTLs (or other data) and a grouping column, and creates matrices with the overlapping information

Usage

overlapping_among_groups(file, x, y)

Arguments

file

A dataframe with the data and grouping factor

x

The grouping factor to be compared

y

The data to be compared among the levels of the grouping factor

Value

A list with three matrices: 1) A matrix with the number of overlapping data; 2) A matrix with the percentage of overlapping; 3) A matrix with the combination of the two previous ones

Examples

data(QTLmarkers)
data(gtfGenes)
genes.out <- find_genes_qtls_around_markers(db_file=gtfGenes,
marker_file=QTLmarkers,method="gene",
marker="snp",interval=100000, nThreads=1)
overlapping.out<-overlapping_among_groups(
file=genes.out,x="Reference",y="gene_id")
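
For intuition about the three matrices returned, a base-R sketch on toy data is shown below. The data frame toy, its column names, and the choice of denominator for the percentages (the size of the column group) are assumptions for illustration; the package may compute the percentages differently.

```r
# Toy data: hypothetical gene IDs annotated in three hypothetical groups.
toy <- data.frame(
  group = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  gene  = c("g1", "g2", "g3", "g2", "g3", "g3", "g4", "g5", "g1"),
  stringsAsFactors = FALSE
)

# Unique items per group
sets <- lapply(split(toy$gene, toy$group), unique)

# 1) Number of shared items for every pair of groups
n_overlap <- sapply(sets, function(a)
  sapply(sets, function(b) length(intersect(a, b))))

# 2) Percentage of overlap, relative to the size of the column group
pct_overlap <- round(100 * sweep(n_overlap, 2, diag(n_overlap), "/"), 1)

# 3) Combination of counts and percentages as text
combined <- matrix(paste0(n_overlap, " (", pct_overlap, "%)"),
                   nrow = nrow(n_overlap), dimnames = dimnames(n_overlap))
```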

Plot overlapping between data and grouping factors

Description

Takes the output from overlapping_among_groups function and creates a heatmap with the overlapping between groups

Usage

plot_overlapping(overlapping_matrix, nmatrix, ntext, group, labelcex = 1)

Arguments

overlapping_matrix

The object obtained from the overlapping_among_groups function

nmatrix

An integer from 1 to 3 indicating which matrix will be used to plot the overlapping, where: 1) A matrix with the number of overlapping data; 2) A matrix with the percentage of overlapping; 3) A matrix with the combination of the two previous ones

ntext

An integer from 1 to 3 indicating which matrix will be used as the text matrix for the heatmap, where: 1) A matrix with the number of overlapping data; 2) A matrix with the percentage of overlapping; 3) A matrix with the combination of the two previous ones

group

A vector with the group labels (for example, the unique levels of the grouping factor). This vector will be plotted as row and column names in the heatmap

labelcex

A numeric value indicating the size of the row and column labels

Value

A heatmap with the overlapping between groups

Examples

data(QTLmarkers)
data(gtfGenes)
genes.out <- find_genes_qtls_around_markers(
db_file=gtfGenes, marker_file=QTLmarkers,
method="gene", marker="snp",interval=100000,
nThreads=1)

overlapping.out<-overlapping_among_groups(
file=genes.out,x="Reference",y="gene_id")
plot_overlapping(overlapping.out,
nmatrix=2,ntext=2,
group=unique(genes.out$Reference))

Plot QTLs information from the find_genes_qtls_around_markers output

Description

Takes the output from find_genes_qtls_around_markers and creates plots for the frequency of each QTL type and trait

Usage

plot_qtl_info(
  qtl_file,
  qtl_plot = c("qtl_type", "qtl_name"),
  n = "all",
  qtl_class = NULL,
  horiz = FALSE,
  ...
)

Arguments

qtl_file

The output from find_genes_qtls_around_markers function

qtl_plot

The type of plot to be created: "qtl_type" or "qtl_name"

n

Number of QTLs to be plotted when the qtl_name option is selected

qtl_class

Class of QTLs to be plotted when the qtl_name option is selected

horiz

A logical value indicating whether the legend of the pie plot for qtl_type should be plotted horizontally. The default is FALSE, so the legend is plotted vertically.

...

Arguments to be passed to/from other methods. For the default method these can include further arguments (such as axes, asp and main) and graphical parameters (see par) which are passed to plot.window(), title() and axis().

Value

A plot with the requested information

Examples

data(QTLmarkers)
data(gffQTLs)

out.qtls<-find_genes_qtls_around_markers(db_file=gffQTLs,
marker_file=QTLmarkers, method = "qtl",
marker = "snp", interval = 500000,
nThreads = 1)

plot_qtl_info(out.qtls, qtl_plot = "qtl_type", cex=2)

Performs a QTL enrichment analysis based on a hypergeometric test for each QTL class

Description

Takes the output from find_genes_qtls_around_markers and runs a QTL enrichment analysis

Usage

qtl_enrich(
  qtl_db,
  qtl_file,
  qtl_type = c("QTL_type", "Name"),
  enrich_type = c("genome", "chromosome"),
  chr.subset = NULL,
  nThreads = NULL,
  padj = c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"),
  verbose = TRUE
)

Arguments

qtl_db

The object obtained using the import_gff_gtf() function

qtl_file

The output from find_genes_qtls_around_markers function

qtl_type

A character indicating which type of enrichment will be performed. QTL_type indicates that the enrichment processes will be performed for the QTL classes, while Name indicates that the enrichment analysis will be performed for each trait individually

enrich_type

A character indicating if the enrichment analysis will be performed for all the chromosomes ("genome") or for a subset of chromosomes ("chromosome"). If the "genome" option is selected, the results reported are the merge of all chromosomes

chr.subset

If enrich_type is "chromosome", it is possible to define a subset of chromosomes to be analyzed. The default is NULL, in which case all the chromosomes will be analyzed

nThreads

The number of threads to be used.

padj

The algorithm for multiple testing correction to be adopted ("holm", "hochberg", "hommel", "bonferroni", "BH", "BY","fdr", "none")

verbose

Logical value defining whether messages should be printed during the analysis (default=TRUE)

Details

Investigation bias toward some traits (such as milk production related traits in the QTL database for cattle) may result in a larger proportion of records for those traits in the database. Consequently, simply inspecting the proportion of each QTL type might not be informative. To reduce the impact of this bias, a QTL enrichment analysis can be performed. The QTL enrichment analysis performed by the GALLO package is based on a hypergeometric test using the number of annotated QTLs within the candidate regions and the total number of the same QTL in the QTL database.

Value

A data frame with the p-value for the enrichment result

Examples

data(QTLmarkers)
data(gffQTLs)
out.qtls<-find_genes_qtls_around_markers(
db_file=gffQTLs,marker_file=QTLmarkers,
method = "qtl",marker = "snp",
interval = 500000, nThreads = 1)

out.enrich<-qtl_enrich(qtl_db=gffQTLs,
qtl_file=out.qtls, qtl_type = "Name",
enrich_type = "chromosome",chr.subset = NULL,
padj = "fdr",nThreads = 1)
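
The hypergeometric test described in Details can be sketched for a single QTL class with base R. All counts below are hypothetical, and the variable names are illustrative; the sketch shows the general form of an over-representation test, not the exact bookkeeping done inside qtl_enrich.

```r
# Hypothetical counts for one QTL class
N_db  <- 10000  # total QTL records in the database
K_db  <- 500    # records of this QTL class in the database
n_obs <- 120    # QTLs annotated within the candidate regions
k_obs <- 15     # annotated QTLs of this class within the candidate regions

# P(X >= k_obs) when drawing n_obs records without replacement
pval <- phyper(k_obs - 1, K_db, N_db - K_db, n_obs, lower.tail = FALSE)

# Multiple testing correction across classes (other p-values are made up)
pvals <- c(class_of_interest = pval, other_1 = 0.20, other_2 = 0.60)
adj <- p.adjust(pvals, method = "fdr")
```

Subtracting 1 from k_obs is needed because phyper with lower.tail = FALSE returns P(X > q), while enrichment asks for P(X >= k_obs).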

Plot relationship between data and grouping factors

Description

Takes the output from find_genes_qtls_around_markers function and creates a chord plot with the relationship between groups

Usage

relationship_plot(
  qtl_file,
  x,
  y,
  grid.col = "gray60",
  degree = 90,
  canvas.xlim = c(-2, 2),
  canvas.ylim = c(-2, 2),
  cex,
  gap
)

Arguments

qtl_file

The output from find_genes_qtls_around_markers function

x

The first grouping factor, to be plotted in the left hand side of the chord plot

y

The second grouping factor, to be plotted in the right hand side of the chord plot

grid.col

A character with the grid color for the chord plot or a vector with different colors to be used in the grid colors. Note that when a color vector is provided, the length of this vector must equal the number of sectors in the chord plot

degree

A numeric value corresponding to the starting degree from which the circle begins to be drawn. Note that this degree is always measured counterclockwise

canvas.xlim

The limits of the canvas on the x-axis. The default is c(-2, 2)

canvas.ylim

The limits of the canvas on the y-axis. The default is c(-2, 2)

cex

The size of the labels to be printed in the plot

gap

A numeric value corresponding to the gap between the chord sectors

Value

A chord plot relating x and y

Examples

data(QTLmarkers)
data(gffQTLs)
out.qtls<-find_genes_qtls_around_markers(
db_file=gffQTLs, marker_file=QTLmarkers,
method = "qtl", marker = "snp",
interval = 500000, nThreads = 1)

out.enrich<-qtl_enrich(qtl_db=gffQTLs,
qtl_file=out.qtls, qtl_type = "Name",
enrich_type = "chromosome",
chr.subset = NULL, padj = "fdr",nThreads = 1)

out.enrich$ID<-paste(out.enrich$QTL," - ",
"CHR",out.enrich$CHR,sep="")

out.enrich.filtered<-out.enrich[which(out.enrich$adj.pval<0.05),]

out.qtls$ID<-paste(out.qtls$Name," - ",
"CHR",out.qtls$CHR,sep="")

out.enrich.filtered<-out.enrich.filtered[order(out.enrich.filtered$adj.pval),]

out.qtls.filtered<-out.qtls[which(out.qtls$ID%in%out.enrich.filtered$ID[1:10]),]

out.qtls.filtered[which(out.qtls.filtered$Reference==
"Feugang et al. (2010)"), "color_ref"]<-"purple"

out.qtls.filtered[which(out.qtls.filtered$Reference==
"Buzanskas et al. (2017)"),"color_ref"]<-"pink"

color.grid<-c(rep("black",length(unique(out.qtls.filtered$Abbrev))),
unique(out.qtls.filtered$color_ref))

names(color.grid)<-c(unique(out.qtls.filtered$Abbrev),
unique(out.qtls.filtered$Reference))

relationship_plot(qtl_file=out.qtls.filtered,
x="Abbrev", y="Reference",cex=1,gap=5,
degree = 90, canvas.xlim = c(-5, 5),
canvas.ylim = c(-3, 3), grid.col = color.grid)

Function to perform a linear discriminant analysis (LDA) for a multivariate set of features

Description

Function to perform a linear discriminant analysis (LDA) for a multivariate set of features

Usage

sLDA(z, v)

Arguments

z

numeric vector of length n, with values in [0,1]

v

numeric matrix of dimension n x d. Each row is an observation, each column a feature

Value

Returns a one-dimensional discriminant score y for each observation, computed by projecting the feature matrix v onto the first LDA direction estimated using the soft class-membership weights z.
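
A weighted LDA projection of this kind can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code of sLDA: the function soft_lda_score, the pooled within-class scatter estimate, and the simulated data are all hypothetical.

```r
# Soft-weighted LDA: project features onto the first discriminant direction.
# Assumption: z holds soft class-membership weights in [0, 1].
soft_lda_score <- function(z, v) {
  v <- as.matrix(v)
  mu1 <- colSums(v * z) / sum(z)            # weighted mean, signal class
  mu0 <- colSums(v * (1 - z)) / sum(1 - z)  # weighted mean, null class
  c1 <- sweep(v, 2, mu1)
  c0 <- sweep(v, 2, mu0)
  # Pooled within-class scatter, weighted by class membership
  Sw <- (crossprod(c1 * z, c1) + crossprod(c0 * (1 - z), c0)) / length(z)
  direction <- solve(Sw, mu1 - mu0)         # LDA direction: Sw^{-1} (mu1 - mu0)
  drop(v %*% direction)                     # one score per observation
}

set.seed(4)
truth <- rep(c(0, 1), c(150, 50))           # simulated class labels
v <- matrix(rnorm(400), ncol = 2) + 2 * truth
z <- ifelse(truth == 1, 0.9, 0.1)           # soft weights
y <- soft_lda_score(z, v)
```

The resulting score y is one-dimensional, which is what makes a univariate density estimate on the projected data feasible.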

Sub-function to split comment column from QTL output

Description

Takes the output from the QTL annotation and splits the content of the extra_info (comment) column into several additional columns

Usage

splitQTL_comment(output.final)

Arguments

output.final

Output from QTL annotation

Value

A data frame with the extra_info column content, from the gff file, broken in several additional columns

Sub-function to search genes around candidate markers

Description

Takes a list of candidate markers and searches for genes within a determined interval

Usage

sub_genes_markers(chr_list, db_file, marker_file, nThreads = NULL, int = 0)

Arguments

chr_list

Object with the chromosomes to be analyzed

db_file

Data frame with the information from .gtf file

marker_file

Data frame with the information from the candidate regions file

nThreads

The number of threads to be used

int

The interval in base pairs

Value

A dataframe with the genes or QTLs mapped within the specified intervals
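
The interval search can be illustrated with a small base-R sketch. The toy data frames and their column names (CHR, BP, chr, start, end, gene_id) are hypothetical and do not necessarily match the columns expected by the package; the sketch only shows the overlap rule for a marker position extended by int base pairs on each side.

```r
# Hypothetical markers and gene annotation
markers <- data.frame(CHR = c(1, 1, 2), BP = c(1000, 50000, 7000))
genes <- data.frame(chr = c(1, 1, 2),
                    start = c(1500, 90000, 1000),
                    end = c(3000, 95000, 6500),
                    gene_id = c("gA", "gB", "gC"),
                    stringsAsFactors = FALSE)
int <- 1000  # interval in base pairs, upstream and downstream of each marker

hits <- do.call(rbind, lapply(seq_len(nrow(markers)), function(i) {
  m <- markers[i, ]
  # A gene is a hit when it overlaps the window [BP - int, BP + int]
  g <- genes[genes$chr == m$CHR &
             genes$start <= m$BP + int &
             genes$end >= m$BP - int, ]
  if (nrow(g) == 0) return(NULL)
  cbind(m[rep(1, nrow(g)), ], g, row.names = NULL)
}))
hits
```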

Sub-function to search genes around candidate markers

Description

Takes a list of candidate markers and searches for genes within a determined interval

Usage

sub_genes_windows(chr_list, db_file, marker_file, nThreads = NULL, int = 0)

Arguments

chr_list

Object with the chromosomes to be analyzed

db_file

Data frame with the information from .gtf file

marker_file

Data frame with the information from the candidate regions file

nThreads

The number of threads to be used

int

The interval in base pairs

Value

A dataframe with the genes or QTLs mapped within the specified intervals

Performs a QTL enrichment analysis based on a bootstrap simulation for each QTL class using the QTL information per chromosome

Description

Takes the output from find_genes_qtls_around_markers and runs a QTL enrichment analysis

Usage

sub_qtlEnrich_chom(
  qtl_file,
  qtl_type,
  qtl.file.types,
  table.qtl.class,
  padj,
  qtl_db,
  search_qtl,
  nThreads
)

Arguments

qtl_file

The output from find_genes_qtls_around_markers function

qtl_type

A string indicating which QTL enrichment will be performed: "QTL_type" or "Name"

qtl.file.types

A vector with the observed QTL classes

table.qtl.class

A frequency table for the number of each QTL in each chromosome

qtl_db

The QTL annotation database

search_qtl

The column of the QTL annotation database used for searching and counting the QTLs

nThreads

Number of threads for parallel processing

Value

A data frame with the p-value for the enrichment result

Performs a QTL enrichment analysis based on a bootstrap simulation for each QTL class using the QTL information across the whole genome

Description

Takes the output from find_genes_qtls_around_markers and runs a QTL enrichment analysis

Usage

sub_qtlEnrich_geno(
  qtl_file,
  qtl_type,
  qtl.file.types,
  table.qtl.class,
  padj,
  qtl_db,
  search_qtl,
  nThreads
)

Arguments

qtl_file

The output from find_genes_qtls_around_markers function

qtl_type

A string indicating which QTL enrichment will be performed: "QTL_type" or "Name"

qtl.file.types

A vector with the observed QTL classes

table.qtl.class

A frequency table for the number of each QTL in each chromosome

qtl_db

The QTL annotation database

search_qtl

The column of the QTL annotation database used for searching and counting the QTLs

nThreads

Number of threads for parallel processing

Value

A data frame with the p-value for the enrichment result

Sub-function to search QTLs around candidate markers

Description

Takes a list of candidate markers and searches for QTLs within a determined interval

Usage

sub_qtl_markers(chr_list, db_file, marker_file, nThreads = NULL, int = 0)

Arguments

chr_list

Object with the chromosomes to be analyzed

db_file

Data frame with the information from .gff file

marker_file

Data frame with the information from the candidate regions file

nThreads

The number of threads to be used

int

The interval in base pairs

Value

A dataframe with the QTLs mapped within the specified intervals

Sub-function to search QTLs around candidate markers

Description

Takes a list of candidate markers and searches for QTLs within a determined interval

Usage

sub_qtl_windows(chr_list, db_file, marker_file, nThreads = NULL, int = 0)

Arguments

chr_list

Object with the chromosomes to be analyzed

db_file

Data frame with the information from .gff file

marker_file

Data frame with the information from the candidate regions file

nThreads

The number of threads to be used

int

The interval in base pairs

Value

A dataframe with the QTLs mapped within the specified intervals