3 Annotations

We rely on the AnnotationHub package for CpG annotations, the GenomicFeatures, TxDb.*, and org.*.eg.db packages for genic annotations, and the rtracklayer package for enhancer annotations from FANTOM5.

Supported annotations are given by supported_annotations(). Additionally, custom annotations can be loaded with read_annotations().

3.1 CpG Annotations

The CpG islands are the basis for all CpG annotations, and are given by the AnnotationHub package for the given organism. CpG shores are defined as 2Kb upstream/downstream from the ends of the CpG islands, less the CpG islands. CpG shelves are defined as another 2Kb upstream/downstream of the farthest upstream/downstream limits of the CpG shores, less the CpG islands and CpG shores. The remaining genomic regions make up the inter-CGI annotation.

CpG annotations are available for hg19, hg38, mm9, mm10, rn4, rn5, rn6.

Schematic of CpG annotations.

3.2 UCSC knownGenes

The genic annotations are determined by functions from GenomicFeatures and data from the TxDb.* and org.*.eg.db packages. Genic annotations include 1-5Kb upstream of the TSS, the promoter (< 1Kb upstream of the TSS), 5’UTR, first exons, exons, introns, CDS, 3’UTR, and intergenic regions (the intergenic regions exclude the previous list of annotations). The schematic below illustrates the relationship between the different annotations as extracted from the TxDb.* packages via GenomicFeatures functions.

Schematic of knownGene annotations.

Also included in genic annotations are intronexon and exonintron boundaries. These annotations are 200bp up/down stream of any boundary between an exon and intron. Important to note, is that the boundaries are with respect to the strand of the gene.

Non-intergenic gene annotations include Entrez ID and gene symbol information where it exists. The org.*.eg.db packages for the appropriate organisms are used to provide gene IDs and gene symbols.

The genic annotations have populated tx_id, gene_id, and symbol columns. Respectively they are, the knownGene transcript name, Entrez Gene ID, and gene symbol.

Genic annotations are available for all hg19, hg38, mm9, mm10, rn4, rn5, rn6, dm3, and dm6.

3.3 FANTOM5 Permissive Enhancers

FANTOM5 permissive enhancers were determined from bi-directional CAGE transcription as in Andersson et al. (2014), and are downloaded and processed for hg19 and mm9 from the FANTOM5 resource.

3.4 GENCODE lncRNA transcripts

The long non-coding RNA (lncRNA) annotations are from GENCODE for hg19, hg38, and mm10. The lncRNA transcripts are used, and we eventually plan to include the lncRNA introns/exons at a later date. The lncRNA annotations have populated tx_id, gene_id, and symbol columns. Respectively they are, the Ensembl transcript name, Entrez Gene ID, and gene symbol. As per the transcript_type field in the GENCODE anntotations, the biotypes are given in the id column.

3.5 Chromatin states from ChromHMM

Chromatin states determined by chromHMM (Ernst and Kellis (2012)) in hg19 are available for nine cell lines (Gm12878, H1hesc, Hepg2, Hmec, Hsmm, Huvec, K562, Nhek, and Nhlf) via the UCSC Genome Browser tracks. Annotations for all states can be built using a shortcut like hg19_Gm12878-chromatin, or specific chromatin states can be accessed via codes like hg19_chromatin_Gm12878-StrongEnhancer or hg19_chromatin_Gm12878-Repressed.

3.6 Custom Annotations

Users may load their own annotations from BED files using the read_annotations() function, which uses the rtracklayer::import() function. The output is a GRanges with mcols() for id, tx_id, gene_id, symbol, and type. If a user wants to include tx_id, gene_id, and/or symbol in their custom annotations they can be included as extra columns on a BED6 input file.

## Use ENCODE ChIP-seq peaks for EZH2 in GM12878 and CJUN in K562
## These files contain chr, start, and end columns
ezh2_file = system.file('extdata', 'Gm12878_Ezh2_peak_annotations.txt.gz', package = 'annotatr')
cjun_file = system.file('extdata', 'K562_Cjun_peak_annotations.txt.gz', package = 'annotatr')

## Custom annotation objects are given names of the form genome_custom_name
read_annotations(con = ezh2_file, genome = 'hg19', name = 'ezh2', format = 'bed')
read_annotations(con = cjun_file, genome = 'hg19', name = 'cjun', format = 'bed')

print(annotatr_cache$get('hg19_custom_ezh2'))

## GRanges object with 2472 ranges and 5 metadata columns:
##          seqnames                 ranges strand |          id     tx_id
##             <Rle>              <IRanges>  <Rle> | <character> <logical>
##      [1]     chr1     [ 860063,  860382]      * |      ezh2:1      <NA>
##      [2]     chr1     [ 934911,  935230]      * |      ezh2:2      <NA>
##      [3]     chr1     [3573321, 3573640]      * |      ezh2:3      <NA>
##      [4]     chr1     [6301401, 6301720]      * |      ezh2:4      <NA>
##      [5]     chr1     [6301996, 6302315]      * |      ezh2:5      <NA>
##      ...      ...                    ...    ... .         ...       ...
##   [2468]     chrX [ 99880950,  99881269]      * |   ezh2:2468      <NA>
##   [2469]     chrX [108514101, 108514420]      * |   ezh2:2469      <NA>
##   [2470]     chrX [111981673, 111981992]      * |   ezh2:2470      <NA>
##   [2471]     chrX [118109216, 118109535]      * |   ezh2:2471      <NA>
##   [2472]     chrX [136114771, 136115090]      * |   ezh2:2472      <NA>
##            gene_id    symbol             type
##          <logical> <logical>      <character>
##      [1]      <NA>      <NA> hg19_custom_ezh2
##      [2]      <NA>      <NA> hg19_custom_ezh2
##      [3]      <NA>      <NA> hg19_custom_ezh2
##      [4]      <NA>      <NA> hg19_custom_ezh2
##      [5]      <NA>      <NA> hg19_custom_ezh2
##      ...       ...       ...              ...
##   [2468]      <NA>      <NA> hg19_custom_ezh2
##   [2469]      <NA>      <NA> hg19_custom_ezh2
##   [2470]      <NA>      <NA> hg19_custom_ezh2
##   [2471]      <NA>      <NA> hg19_custom_ezh2
##   [2472]      <NA>      <NA> hg19_custom_ezh2
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

print(annotatr_cache$get('hg19_custom_cjun'))

## GRanges object with 9848 ranges and 5 metadata columns:
##          seqnames                 ranges strand |          id     tx_id
##             <Rle>              <IRanges>  <Rle> | <character> <logical>
##      [1]     chr1     [ 713944,  714173]      * |      cjun:1      <NA>
##      [2]     chr1     [ 936184,  936369]      * |      cjun:2      <NA>
##      [3]     chr1     [ 968369,  968598]      * |      cjun:3      <NA>
##      [4]     chr1     [ 999654,  999860]      * |      cjun:4      <NA>
##      [5]     chr1     [1136816, 1137045]      * |      cjun:5      <NA>
##      ...      ...                    ...    ... .         ...       ...
##   [9844]     chrX [154027383, 154027612]      * |   cjun:9844      <NA>
##   [9845]     chrX [154254779, 154255008]      * |   cjun:9845      <NA>
##   [9846]     chrX [154563895, 154564124]      * |   cjun:9846      <NA>
##   [9847]     chrX [155117938, 155118167]      * |   cjun:9847      <NA>
##   [9848]     chrX [155123304, 155123533]      * |   cjun:9848      <NA>
##            gene_id    symbol             type
##          <logical> <logical>      <character>
##      [1]      <NA>      <NA> hg19_custom_cjun
##      [2]      <NA>      <NA> hg19_custom_cjun
##      [3]      <NA>      <NA> hg19_custom_cjun
##      [4]      <NA>      <NA> hg19_custom_cjun
##      [5]      <NA>      <NA> hg19_custom_cjun
##      ...       ...       ...              ...
##   [9844]      <NA>      <NA> hg19_custom_cjun
##   [9845]      <NA>      <NA> hg19_custom_cjun
##   [9846]      <NA>      <NA> hg19_custom_cjun
##   [9847]      <NA>      <NA> hg19_custom_cjun
##   [9848]      <NA>      <NA> hg19_custom_cjun
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

4 Usage

The following example is based on the results of testing for differential methylation of genomic regions between two conditions using methylSig. The file (inst/extdata/IDH2mut_v_NBM_multi_data_chr9.txt.gz) contains chromosome locations, as well as categorical and numerical data columns, and provides a good example of the flexibility of annotatr.

4.1 Reading Genomic Regions

read_regions() uses the rtracklayer::import() function to read in BED files and convert them to GRanges objects. The name and score columns in a normal BED file can be used for categorical and numeric data, respectively. Additionally, an arbitrary number of categorical and numeric data columns can be appended to a BED6 file. The extraCols parameter is used for this purpose, and the rename_name and rename_score columns allow users to give more descriptive names to these columns.

# This file in inst/extdata represents regions tested for differential
# methylation between two conditions. Additionally, there are columns
# reporting the p-value on the test for differential meth., the
# meth. difference between the two groups, and the group meth. rates.
dm_file = system.file('extdata', 'IDH2mut_v_NBM_multi_data_chr9.txt.gz', package = 'annotatr')
extraCols = c(diff_meth = 'numeric', mu0 = 'numeric', mu1 = 'numeric')
dm_regions = read_regions(con = dm_file, genome = 'hg19', extraCols = extraCols, format = 'bed',
    rename_name = 'DM_status', rename_score = 'pval')
print(dm_regions)

## GRanges object with 17683 ranges and 5 metadata columns:
##           seqnames                 ranges strand |   DM_status       pval
##              <Rle>              <IRanges>  <Rle> | <character>  <numeric>
##       [1]     chr9         [10850, 10948]      * |        none 0.50455021
##       [2]     chr9         [10950, 11048]      * |        none 0.22271260
##       [3]     chr9         [28950, 29048]      * |        none 0.55309581
##       [4]     chr9         [72850, 72948]      * |       hyper 0.01162935
##       [5]     chr9         [72950, 73048]      * |        none 0.17528718
##       ...      ...                    ...    ... .         ...        ...
##   [17679]     chr9 [141094950, 141095048]      * |        none 0.88036114
##   [17680]     chr9 [141095050, 141095148]      * |        none 0.08860275
##   [17681]     chr9 [141098750, 141098848]      * |        none 0.21376652
##   [17682]     chr9 [141103350, 141103448]      * |        hypo 0.04861602
##   [17683]     chr9 [141108950, 141109048]      * |        none 0.26167927
##              diff_meth       mu0        mu1
##              <numeric> <numeric>  <numeric>
##       [1] -10.73290471 79.981920 90.7148252
##       [2]   8.71952705 86.704015 77.9844878
##       [3]   0.07008468  0.124081  0.0539963
##       [4]  44.87532440 72.455413 27.5800883
##       [5]  17.76066258 28.440368 10.6797057
##       ...          ...       ...        ...
##   [17679]     1.714566  92.62087   90.90631
##   [17680]    -1.517393  98.48261  100.00000
##   [17681]   -12.726055  83.18445   95.91050
##   [17682]    -4.035105  95.41078   99.44589
##   [17683]   -10.493345  84.89418   95.38753
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

4.2 Annotating Regions

Users may select annotations a la carte via the accessors listed with supported_annotations(), shortcuts, or use custom annotations as described above. The hg19_cpgs shortcut annotates regions to CpG islands, CpG shores, CpG shelves, and inter-CGI. The hg19_basicgenes shortcut annotates regions to 1-5Kb, promoters, 5’UTRs, exons, introns, and 3’UTRs. Shortcuts for other supported_genomes() are accessed in a similar way.

annotate_regions() requires a GRanges object (either the result of read_regions() or an existing object), a GRanges object of the annotations, and a logical value indicating whether to ignore.strand when calling GenomicRanges::findOverlaps(). The positive integer minoverlap is also passed to GenomicRanges::findOverlaps() and specifies the minimum overlap required for a region to be assigned to an annotation.

Before annotating regions, they must be built with build_annotations() which requires a character vector of desired annotation codes.

# Select annotations for intersection with regions
# Note inclusion of custom annotation, and use of shortcuts
annots = c('hg19_cpgs', 'hg19_basicgenes', 'hg19_genes_intergenic',
    'hg19_enhancers_fantom', 'hg19_genes_firstexons', 'hg19_genes_intronexonboundaries',
    'hg19_lncrna_gencode', 'hg19_custom_ezh2', 'hg19_custom_cjun')

# Build the annotations (a single GRanges object)
annotations = build_annotations(genome = 'hg19', annotations = annots)

# Intersect the regions we read in with the annotations
dm_annotated = annotate_regions(
    regions = dm_regions,
    annotations = annotations,
    ignore.strand = TRUE,
    quiet = FALSE)
# A GRanges object is returned
print(dm_annotated)

## GRanges object with 98260 ranges and 6 metadata columns:
##           seqnames                 ranges strand |   DM_status       pval
##              <Rle>              <IRanges>  <Rle> | <character>  <numeric>
##       [1]     chr9         [10850, 10948]      * |        none  0.5045502
##       [2]     chr9         [10850, 10948]      * |        none  0.5045502
##       [3]     chr9         [10950, 11048]      * |        none  0.2227126
##       [4]     chr9         [10950, 11048]      * |        none  0.2227126
##       [5]     chr9         [10950, 11048]      * |        none  0.2227126
##       ...      ...                    ...    ... .         ...        ...
##   [98256]     chr9 [141103350, 141103448]      * |        hypo 0.04861602
##   [98257]     chr9 [141108950, 141109048]      * |        none 0.26167927
##   [98258]     chr9 [141108950, 141109048]      * |        none 0.26167927
##   [98259]     chr9 [141108950, 141109048]      * |        none 0.26167927
##   [98260]     chr9 [141108950, 141109048]      * |        none 0.26167927
##            diff_meth       mu0       mu1                      annot
##            <numeric> <numeric> <numeric>                  <GRanges>
##       [1] -10.732905  79.98192  90.71483          chr9:6987-10986:+
##       [2] -10.732905  79.98192  90.71483             chr9:1-24849:*
##       [3]   8.719527  86.70401  77.98449         chr9:10987-11986:+
##       [4]   8.719527  86.70401  77.98449          chr9:6987-10986:+
##       [5]   8.719527  86.70401  77.98449             chr9:1-24849:*
##       ...        ...       ...       ...                        ...
##   [98256]  -4.035105  95.41078  99.44589 chr9:141074192-141107369:*
##   [98257] -10.493345  84.89418  95.38753 chr9:141107681-141109733:+
##   [98258] -10.493345  84.89418  95.38753 chr9:141107370-141109369:*
##   [98259] -10.493345  84.89418  95.38753 chr9:141106637-141134172:+
##   [98260] -10.493345  84.89418  95.38753 chr9:141107518-141143444:+
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

4.3 Randomizing Regions

Given a set of annotated regions, it is important to know how the annotations compare to those of a randomized set of regions. The randomize_regions() function is a wrapper of regioneR::randomizeRegions() from the regioneR package that creates a set of random regions given a GRanges object. After creating the random set, they must be annotated with annotate_regions() for later use. Only supported_genomes() can be used in our wrapper function. Downstream functions that support using random region annotations are summarize_annotations(), plot_annotation(), and plot_categorical().

It is important to note that if the regions to be randomized have a particular property, for example they are CpGs, the randomize_regions() wrapper will not preserve that property! Instead, we recommend using regioneR::resampleRegions() with universe being the superset of the data regions you want to sample from.

# Randomize the input regions
dm_random_regions = randomize_regions(
    regions = dm_regions,
    allow.overlaps = TRUE,
    per.chromosome = TRUE)

# Annotate the random regions using the same annotations as above
# These will be used in later functions
dm_random_annotated = annotate_regions(
    regions = dm_random_regions,
    annotations = annotations,
    ignore.strand = TRUE,
    quiet = TRUE)

4.4 Summarizing Over Annotations

The three summarization functions below take the GRanges objects output from annotate_regions() as their input, and output either a dplyr::tbl_df (for summarize_annotations()) or dplyr::grouped_df (for the others).

# Usage of summarize functions with defaults
summarize_annotations(annotated_regions, annotated_random, quiet = FALSE)

summarize_numerical(
    annotated_regions,
    by = c("annot.type", "annot.id"),
    over,
    quiet = FALSE)

summarize_categorical(
    annotated_regions,
    by = c("annot.type", "annot.id"),
    quiet = FALSE)

When there is no categorical (name column) or numerical (score column) information associated with the regions, summarize_annotations() is the only possible function to use. It gives the counts of regions in each annotation type (see example below). If there is categorical and/or numerical information, then summarize_numerical() and/or summarize_categorical() may be used. Using random region annotations is only available for summarize_annotations().

# Find the number of regions per annotation type
dm_annsum = summarize_annotations(
    annotated_regions = dm_annotated,
    quiet = TRUE)
print(dm_annsum)

## # A tibble: 17 × 2
##                         annot.type     n
##                              <chr> <int>
## 1                   hg19_cpg_inter  8171
## 2                 hg19_cpg_islands  5647
## 3                 hg19_cpg_shelves  1247
## 4                  hg19_cpg_shores  3591
## 5                 hg19_custom_cjun   128
## 6                 hg19_custom_ezh2    78
## 7            hg19_enhancers_fantom   608
## 8                hg19_genes_1to5kb  2747
## 9                 hg19_genes_3UTRs   556
## 10                hg19_genes_5UTRs  1661
## 11                hg19_genes_exons  4446
## 12           hg19_genes_firstexons  2401
## 13           hg19_genes_intergenic  4481
## 14 hg19_genes_intronexonboundaries  3595
## 15              hg19_genes_introns  9161
## 16            hg19_genes_promoters  2873
## 17             hg19_lncrna_gencode  1309

# Find the number of regions per annotation type
# and the number of random regions per annotation type
dm_annsum_rnd = summarize_annotations(
    annotated_regions = dm_annotated,
    annotated_random = dm_random_annotated,
    quiet = TRUE)
print(dm_annsum_rnd)

## Source: local data frame [34 x 3]
## Groups: data_type [?]
## 
##    data_type            annot.type     n
##        <chr>                 <chr> <int>
## 1       Data        hg19_cpg_inter  8171
## 2       Data      hg19_cpg_islands  5647
## 3       Data      hg19_cpg_shelves  1247
## 4       Data       hg19_cpg_shores  3591
## 5       Data      hg19_custom_cjun   128
## 6       Data      hg19_custom_ezh2    78
## 7       Data hg19_enhancers_fantom   608
## 8       Data     hg19_genes_1to5kb  2747
## 9       Data      hg19_genes_3UTRs   556
## 10      Data      hg19_genes_5UTRs  1661
## # ... with 24 more rows

# Take the mean of the diff_meth column across all regions
# occurring in an annotation.
dm_numsum = summarize_numerical(
    annotated_regions = dm_annotated,
    by = c('annot.type', 'annot.id'),
    over = c('diff_meth'),
    quiet = TRUE)
print(dm_numsum)

## Source: local data frame [29,664 x 5]
## Groups: annot.type [?]
## 
##        annot.type   annot.id     n       mean        sd
##             <chr>      <chr> <int>      <dbl>     <dbl>
## 1  hg19_cpg_inter inter:8599     2 -1.0066888 13.754946
## 2  hg19_cpg_inter inter:8601     3  0.5069885  1.806155
## 3  hg19_cpg_inter inter:8603     2  1.0049352  1.421193
## 4  hg19_cpg_inter inter:8604    17  0.4744959  3.114624
## 5  hg19_cpg_inter inter:8605     1 10.8701221        NA
## 6  hg19_cpg_inter inter:8606     8  1.4979540 17.636208
## 7  hg19_cpg_inter inter:8607     8 -3.4921314 16.864613
## 8  hg19_cpg_inter inter:8608     6 10.8238401 18.879649
## 9  hg19_cpg_inter inter:8610     4  1.2291620 11.498373
## 10 hg19_cpg_inter inter:8611    14 -4.2964809  9.374676
## # ... with 29,654 more rows

# Count the occurrences of classifications in the DM_status
# column across the annotation types.
dm_catsum = summarize_categorical(
    annotated_regions = dm_annotated,
    by = c('annot.type', 'DM_status'),
    quiet = TRUE)
print(dm_catsum)

## Source: local data frame [51 x 3]
## Groups: annot.type [?]
## 
##          annot.type DM_status     n
##               <chr>     <chr> <int>
## 1    hg19_cpg_inter     hyper   523
## 2    hg19_cpg_inter      hypo   596
## 3    hg19_cpg_inter      none  7052
## 4  hg19_cpg_islands     hyper   976
## 5  hg19_cpg_islands      hypo    50
## 6  hg19_cpg_islands      none  4621
## 7  hg19_cpg_shelves     hyper    63
## 8  hg19_cpg_shelves      hypo    70
## 9  hg19_cpg_shelves      none  1114
## 10  hg19_cpg_shores     hyper   477
## # ... with 41 more rows

4.5 Plotting

The 5 plot functions described below are to be used on the object returned by annotate_regions(). The plot functions return an object of type ggplot that can be viewed (print), saved (ggsave), or modified with additional ggplot2 code.

# Usage of plot functions with defaults

plot_annotation(annotated_regions, annotated_random, annotation_order = NULL,
  plot_title, x_label, y_label, quiet = FALSE)

plot_coannotations(annotated_regions, annotation_order = NULL, plot_title,
  axes_label, quiet = FALSE)

plot_numerical(annotated_regions, x, y, facet = "annot.type",
  facet_order = NULL, bin_width = 10, plot_title, x_label, y_label,
  quiet = FALSE)

plot_numerical_coannotations(annotated_regions, x, y, annot1, annot2,
  bin_width = 10, plot_title, x_label, y_label, quiet = FALSE)

plot_categorical(annotated_regions, annotated_random, x, fill = NULL,
  x_order = NULL, fill_order = NULL, position = "stack", plot_title,
  legend_title, x_label, y_label, quiet = FALSE)

4.5.1 Plotting Regions per Annotation

# View the number of regions per annotation. This function
# is useful when there is no classification or data
# associated with the regions.
annots_order = c(
    'hg19_custom_ezh2',
    'hg19_custom_cjun',
    'hg19_enhancers_fantom',
    'hg19_genes_1to5kb',
    'hg19_genes_promoters',
    'hg19_genes_5UTRs',
    'hg19_genes_firstexons',
    'hg19_genes_exons',
    'hg19_genes_intronexonboundaries',
    'hg19_genes_introns',
    'hg19_genes_3UTRs',
    'hg19_genes_intergenic')
dm_vs_kg_annotations = plot_annotation(
    annotated_regions = dm_annotated,
    annotation_order = annots_order,
    plot_title = '# of Sites Tested for DM annotated on chr9',
    x_label = 'knownGene Annotations',
    y_label = 'Count')
print(dm_vs_kg_annotations)

Number of DM regions per annotation.

The plot_annotation() can also use the annotated random regions in the annotated_random argument to plot the number of random regions per annotation type next to the number of input data regions.

# View the number of regions per annotation and include the annotation
# of randomized regions
annots_order = c(
    'hg19_custom_ezh2',
    'hg19_custom_cjun',
    'hg19_enhancers_fantom',
    'hg19_genes_1to5kb',
    'hg19_genes_promoters',
    'hg19_genes_5UTRs',
    'hg19_genes_firstexons',
    'hg19_genes_exons',
    'hg19_genes_intronexonboundaries',
    'hg19_genes_introns',
    'hg19_genes_3UTRs',
    'hg19_genes_intergenic',
    'hg19_lncrna_gencode')
dm_vs_kg_annotations_wrandom = plot_annotation(
    annotated_regions = dm_annotated,
    annotated_random = dm_random_annotated,
    annotation_order = annots_order,
    plot_title = 'Dist. of Sites Tested for DM (with rndm.)',
    x_label = 'Annotations',
    y_label = 'Count')
print(dm_vs_kg_annotations_wrandom)

Number of DM regions per annotation with randomized regions.

4.5.2 Plotting Regions Occurring in Pairs of Annotations

# View a heatmap of regions occurring in pairs of annotations
annots_order = c(
    'hg19_custom_ezh2',
    'hg19_custom_cjun',
    'hg19_enhancers_fantom',
    'hg19_genes_1to5kb',
    'hg19_genes_promoters',
    'hg19_genes_5UTRs',
    'hg19_genes_firstexons',
    'hg19_genes_exons',
    'hg19_genes_intronexonboundaries',
    'hg19_genes_introns',
    'hg19_genes_3UTRs',
    'hg19_genes_intergenic',
    'hg19_lncrna_gencode')
dm_vs_coannotations = plot_coannotations(
    annotated_regions = dm_annotated,
    annotation_order = annots_order,
    axes_label = 'Annotations',
    plot_title = 'Regions in Pairs of Annotations')
print(dm_vs_coannotations)

Number of DM regions per pair of annotations.

4.5.3 Plotting Numerical Data Over Regions

With numerical data, the plot_numerical() function plots a single variable (histogram) or two variables (scatterplot) at the region level, faceting over the annotations. Note, when the plot is a histogram, the distribution over all regions is plotted within each facet.

dm_vs_regions_annot = plot_numerical(
    annotated_regions = dm_annotated,
    x = 'mu0',
    facet = 'annot.type',
    facet_order = c('hg19_genes_1to5kb','hg19_genes_promoters',
        'hg19_genes_5UTRs','hg19_genes_3UTRs', 'hg19_custom_ezh2',
        'hg19_custom_cjun', 'hg19_genes_intergenic', 'hg19_cpg_islands',
        'hg19_lncrna_gencode'),
    bin_width = 5,
    plot_title = 'Group 0 Region Methylation In Genes',
    x_label = 'Group 0')
print(dm_vs_regions_annot)

Methylation Rates in Group 0 for Regions Over DM Status.

dm_vs_regions_name = plot_numerical(
    annotated_regions = dm_annotated,
    x = 'mu0',
    y = 'mu1',
    facet = 'annot.type',
    facet_order = c('hg19_genes_1to5kb','hg19_genes_promoters',
        'hg19_genes_5UTRs','hg19_genes_3UTRs', 'hg19_custom_ezh2',
        'hg19_custom_cjun', 'hg19_genes_intergenic', 'hg19_cpg_islands',
        'hg19_cpg_shores'),
    plot_title = 'Region Methylation: Group 0 vs Group 1',
    x_label = 'Group 0',
    y_label = 'Group 1')
print(dm_vs_regions_name)

Methylation Rates in Regions Over DM Status in Group 0 vs Group 1.

The plot_numerical_coannotations() shows the distribution of numerical data for regions occurring in any two annotations, as well as in one or the other annotation. For example, the following example shows CpG methylation rates for CpGs occurring in just promoters, just CpG islands, and both promoters and CpG islands.

dm_vs_num_co = plot_numerical_coannotations(
    annotated_regions = dm_annotated,
    x = 'mu0',
    annot1 = 'hg19_cpg_islands',
    annot2 = 'hg19_genes_promoters',
    bin_width = 5,
    plot_title = 'Group 0 Perc. Meth. in CpG Islands and Promoters',
    x_label = 'Percent Methylation')
print(dm_vs_num_co)

Group 0 methylation Rates in Regions in promoters, CpG islands, and both.

4.5.4 Plotting Categorical Data

# View the counts of CpG annotations in data classes

# The orders for the x-axis labels. This is also a subset
# of the labels (hyper, hypo, none).
x_order = c(
    'hyper',
    'hypo')
# The orders for the fill labels. Can also use this
# parameter to subset annotation types to fill.
fill_order = c(
    'hg19_cpg_islands',
    'hg19_cpg_shores',
    'hg19_cpg_shelves',
    'hg19_cpg_inter')
# Make a barplot of the data class where each bar
# is composed of the counts of CpG annotations.
dm_vs_cpg_cat1 = plot_categorical(
    annotated_regions = dm_annotated, x='DM_status', fill='annot.type',
    x_order = x_order, fill_order = fill_order, position='stack',
    plot_title = 'DM Status by CpG Annotation Counts',
    legend_title = 'Annotations',
    x_label = 'DM status',
    y_label = 'Count')
print(dm_vs_cpg_cat1)

Differential methylation classification with counts of CpG annotations.

# Use the same order vectors as the previous code block,
# but use proportional fill instead of counts.

# Make a barplot of the data class where each bar
# is composed of the *proportion* of CpG annotations.
dm_vs_cpg_cat2 = plot_categorical(
    annotated_regions = dm_annotated, x='DM_status', fill='annot.type',
    x_order = x_order, fill_order = fill_order, position='fill',
    plot_title = 'DM Status by CpG Annotation Proportions',
    legend_title = 'Annotations',
    x_label = 'DM status',
    y_label = 'Proportion')
print(dm_vs_cpg_cat2)

Differential methylation classification with proportion of CpG annotations.

As with plot_annotation() one may add annotations for random regions to the annotated_random parameter of plot_categorical(). The result is a Random Regions bar representing the distribution of random regions for the categorical variable used for fill. NOTE: Random regions can only be added when fill = 'annot.type'.

# Add in the randomized annotations for "Random Regions" bar

# Make a barplot of the data class where each bar
# is composed of the *proportion* of CpG annotations, and
# includes "All" regions tested for DM and "Random Regions"
# regions consisting of randomized regions.
dm_vs_cpg_cat_random = plot_categorical(
    annotated_regions = dm_annotated, annotated_random = dm_random_annotated,
    x='DM_status', fill='annot.type',
    x_order = x_order, fill_order = fill_order, position='fill',
    plot_title = 'DM Status by CpG Annotation Proportions',
    legend_title = 'Annotations',
    x_label = 'DM status',
    y_label = 'Proportion')
print(dm_vs_cpg_cat_random)

Differential methylation classification with proportion of CpG annotations and random regions.

# View the proportions of data classes in knownGene annotations

# The orders for the x-axis labels.
x_order = c(
    'hg19_custom_ezh2',
    'hg19_custom_cjun',
    'hg19_enhancers_fantom',
    'hg19_genes_1to5kb',
    'hg19_genes_promoters',
    'hg19_genes_5UTRs',
    'hg19_genes_exons',
    'hg19_genes_introns',
    'hg19_genes_3UTRs',
    'hg19_genes_intergenic',
    'hg19_lncrna_gencode')
# The orders for the fill labels.
fill_order = c(
    'hyper',
    'hypo',
    'none')
dm_vs_kg_cat = plot_categorical(
    annotated_regions = dm_annotated, x='annot.type', fill='DM_status',
    x_order = x_order, fill_order = fill_order, position='fill',
    legend_title = 'DM Status',
    x_label = 'knownGene Annotations',
    y_label = 'Proportion')
print(dm_vs_kg_cat)

Basic gene annotations with proportions of DM classification.

`annotatr`: Making sense of genomic regions

Raymond G. Cavalcante

2017-02-06

Contents

1 Introduction

2 Installation

3 Annotations

3.1 CpG Annotations

3.2 UCSC knownGenes

3.3 FANTOM5 Permissive Enhancers

3.4 GENCODE lncRNA transcripts

3.5 Chromatin states from ChromHMM

3.6 Custom Annotations

4 Usage

4.1 Reading Genomic Regions

4.2 Annotating Regions

4.3 Randomizing Regions

4.4 Summarizing Over Annotations

4.5 Plotting

4.5.1 Plotting Regions per Annotation

4.5.2 Plotting Regions Occurring in Pairs of Annotations

4.5.3 Plotting Numerical Data Over Regions

4.5.4 Plotting Categorical Data

annotatr: Making sense of genomic regions

Raymond G. Cavalcante

2017-02-06

Contents

1 Introduction

2 Installation

3 Annotations

3.1 CpG Annotations

3.2 UCSC knownGenes

3.3 FANTOM5 Permissive Enhancers

3.4 GENCODE lncRNA transcripts

3.5 Chromatin states from ChromHMM

3.6 Custom Annotations

4 Usage

4.1 Reading Genomic Regions

4.2 Annotating Regions

4.3 Randomizing Regions

4.4 Summarizing Over Annotations

4.5 Plotting

4.5.1 Plotting Regions per Annotation

4.5.2 Plotting Regions Occurring in Pairs of Annotations

4.5.3 Plotting Numerical Data Over Regions

4.5.4 Plotting Categorical Data

`annotatr`: Making sense of genomic regions