1 Overview

ABAEnrichment is designed to test user-defined genes for expression enrichment in different human brain regions. The package integrates the expression of the input gene set with the structural information of the brain using an ontology, both of which are provided by the Allen Brain Atlas project [1-4]. The statistical analysis is performed by the core function aba_enrich, which interfaces with the ontology enrichment software FUNC [5]. The additional functions get_expression, plot_expression, get_name, get_id, get_sampled_substructures, get_superstructures and get_annotated_genes support the exploration and visualization of the expression data.

1.1 Expression data

The package incorporates three different brain expression datasets:

  1. microarray data from six adult individuals
  2. RNA-seq data from 42 individuals of five different developmental stages (prenatal, infant, child, adolescent, adult)
  3. developmental effect scores measuring the age effect on expression per gene

All three datasets are filtered for protein-coding genes, and gene expression is averaged across donors. Although the third dataset contains a derived score rather than expression data, this documentation refers only to ‘expression’ for simplicity. For details on the datasets see the ABAData vignette.

1.2 Annotation of genes to brain regions

Using the ontology that describes the hierarchical organization of the brain, each brain region is annotated with all genes that are expressed in the region itself or in any of its substructures. The boundary between ‘expressed’ and ‘not expressed’ is defined by expression quantiles (e.g. with a quantile of 0.4, the lowest 40% of gene expression in the brain is considered ‘not expressed’ and the upper 60% is considered ‘expressed’). These cutoffs are set with the parameter cutoff_quantiles, and a separate analysis is run for every cutoff. The default cutoffs are 10% to 90% in steps of 10%.
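The annotation step can be illustrated with a small, self-contained Python sketch. It is not the package's implementation: the toy expression values, the region names, the `parent` map and the helper functions `quantile` and `annotate` are all hypothetical. The linear-interpolation quantile and the strict `>` comparison at the cutoff are assumptions made for this example only.

```python
# Hypothetical toy data: average expression per sampled region and gene.
expression = {
    "region_a1": {"gene1": 1.0, "gene2": 5.0, "gene3": 9.0},
    "region_a2": {"gene1": 2.0, "gene2": 6.0, "gene3": 3.0},
    "region_b1": {"gene1": 8.0, "gene2": 4.0, "gene3": 7.0},
}

# Hypothetical ontology: each structure points to its direct superstructure.
parent = {
    "region_a1": "region_a",
    "region_a2": "region_a",
    "region_b1": "region_b",
    "region_a": "brain",
    "region_b": "brain",
}


def quantile(values, q):
    """Quantile with linear interpolation (an assumed convention)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def annotate(expression, parent, q):
    """Annotate every structure with the genes expressed in it or below it."""
    # The cutoff is taken over all expression values in the (toy) brain.
    all_values = [v for genes in expression.values() for v in genes.values()]
    cutoff = quantile(all_values, q)

    annotation = {}
    for region, genes in expression.items():
        expressed = {g for g, v in genes.items() if v > cutoff}
        # Propagate the expressed genes to the region and all superstructures.
        node = region
        while node is not None:
            annotation.setdefault(node, set()).update(expressed)
            node = parent.get(node)
    return annotation


# One annotation per cutoff, mirroring the idea of a separate analysis
# for every quantile.
for q in (0.1, 0.4, 0.9):
    print(q, annotate(expression, parent, q))
```

Raising the quantile shrinks the set of genes annotated to each structure, which is why a separate analysis is run for every cutoff.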

1.3 Enrichment analysis

The enrichment analysis uses one of four tests implemented in the ontology enrichment software FUNC [5]: the hypergeometric test, the Wilcoxon rank-sum test, the binomial test or the 2x2 contingency table test. The hypergeometric test evaluates the enrichment of annotated (expressed) candidate genes compared to annotated background genes for each brain region (see Schematic 1 below). The background genes can be defined explicitly, like the candidate genes, or consist by default of all protein-coding genes from the dataset that are not contained in the set of candidate genes. In contrast to this binary distinction between candidate and background genes, the Wilcoxon rank-sum test uses user-defined scores that are assigned to the input genes. It then tests every brain region for an enrichment of genes with high scores in the set of expressed input genes. When genes are associated with two counts (A and B), e.g. amino-acid changes since a common ancestor in two species, a binomial test can be used to identify brain regions with an enrichment of expressed genes that have a high fraction of A compared to the fraction of A in the brain in general. When genes are associated with four counts (A-D), e.g. non-synonymous or synonymous variants that are fixed between or variable within species, as in a McDonald-Kreitman test [6], the 2x2 contingency table test can be used. It identifies brain regions whose expressed genes have a high ratio of A/B compared to C/D.
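To make the hypergeometric case concrete, the following Python sketch computes a one-sided enrichment p-value for a single brain region from four counts. It only illustrates the textbook hypergeometric tail probability and is not FUNC's implementation. The function name `hypergeom_enrichment_p`, its parameters and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(cand_annotated, cand_total, bg_annotated, bg_total):
    """One-sided hypergeometric p-value for a single brain region.

    Returns the probability of seeing at least `cand_annotated` annotated
    genes when `cand_total` genes are drawn without replacement from the
    pool of candidate plus background genes.
    """
    n_genes = cand_total + bg_total              # all genes in the test
    n_annotated = cand_annotated + bg_annotated  # genes expressed in the region
    max_k = min(cand_total, n_annotated)

    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, cand_total - k)
        for k in range(cand_annotated, max_k + 1)
    )
    return tail / comb(n_genes, cand_total)


# Made-up counts: 8 of 10 candidate genes and 30 of 90 background genes
# are annotated to the region.
p_value = hypergeom_enrichment_p(8, 10, 30, 90)
print(f"raw enrichment p-value: {p_value:.4g}")
```

The result is a raw p-value for one region and one cutoff. It does not account for testing many brain regions at once.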