# Contents

Package: metagene
Authors: Charles Joly Beauparlant charles.joly-beauparlant@crchul.ulaval.ca, Fabien Claude Lamaze fabien.lamaze.1@ulaval.ca, Rawane Samb rawane.samb.1@ulaval.ca, Astrid Louise Deschenes Astrid-Louise.Deschenes@crchudequebec.ulaval.ca, Arnaud Droit arnaud.droit@crchuq.ulaval.ca
Modified: 18 september, 2015
Compiled: Thu Nov 10 19:37:27 2016

# 1 Introduction

This package produces metagene-like plots to compare the behavior of DNA-interacting proteins at selected groups of features. A typical analysis can be done in viscinity of transcription start sites (TSS) of genes or at any regions of interest (such as enhancers). Multiple combinations of group of features and/or group of bam files can be compared in a single analysis. Bootstraping analysis is used to compare the groups and locate regions with statistically different enrichment profiles. In order to increase the sensitivity of the analysis, alignment data is used instead of peaks produced with peak callers (i.e.: MACS2 or PICS). The metagene package uses bootstrap to obtain a better estimation of the mean enrichment and the confidence interval for every group of samples.

This vignette will introduce all the main features of the metagene package.

library(metagene)

# 3 Inputs

## 3.1 Alignment files (BAM files)

There is no hard limit in the number of BAM files that can be included in an analysis (but with too many BAM files, memory may become an issue). BAM files must be indexed. For instance, if you use a file names file.bam, a file named file.bam.bai must be present in the same directory.

The path (relative or absolute) to the BAM files must be in a vector:

bam_files <- get_demo_bam_files()
bam_files
## [1] "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep1.bam"
## [2] "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep2.bam"
## [3] "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep1.bam"
## [4] "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep2.bam"
## [5] "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/ctrl.bam"

For this demo, we have 2 samples (each with 2 replicates). It is also possible to use a named vector to add your own names to each BAM files:

named_bam_files <- bam_files
names(named_bam_files) <- letters[seq_along(bam_files)]
named_bam_files
##                                                                    a
## "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep1.bam"
##                                                                    b
## "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep2.bam"
##                                                                    c
## "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep1.bam"
##                                                                    d
## "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep2.bam"
##                                                                    e
##        "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/ctrl.bam"

Using named BAM files can simplify the use of the metagene helper functions and the creation of the design.

## 3.2 Genomic regions

### 3.2.1 BED files

To compare custom regions of interest, it is possible to use a list of one or more BED files.

regions <- get_demo_regions()
regions
## [1] "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/list1.bed"
## [2] "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/list2.bed"

The name of the files (without the extension) will be used to name each groups.

metagene also support the narrowPeak and the broadPeak.

### 3.2.2 GRanges or GRangesList objects - Regions

As an alternative to a list of BED files, GRanges or GRangesList objects can be used.

### 3.2.3 Available datasets

Some common datasets are already available with the metagene package:

• promoters_hg19
• promoters_hg18
• promoters_mm10
• promoters_mm9
data(promoters_hg19)
promoters_hg19
## GRanges object with 23056 ranges and 1 metadata column:
##         seqnames                 ranges strand |     gene_id
##            <Rle>              <IRanges>  <Rle> | <character>
##       1    chr19 [ 58873215,  58875214]      - |           1
##      10     chr8 [ 18247755,  18249754]      + |          10
##     100    chr20 [ 43279377,  43281376]      - |         100
##    1000    chr18 [ 25756446,  25758445]      - |        1000
##   10000     chr1 [244005887, 244007886]      - |       10000
##     ...      ...                    ...    ... .         ...
##    9991     chr9 [115094945, 115096944]      - |        9991
##    9992    chr21 [ 35735323,  35737322]      + |        9992
##    9993    chr22 [ 19108968,  19110967]      - |        9993
##    9994     chr6 [ 90538619,  90540618]      + |        9994
##    9997    chr22 [ 50963906,  50965905]      - |        9997
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

For more details about each datasets, please refer to their documentation (i.e.: ?promoters_hg19).

## 3.3 Design groups

A design group contains a set of BAM files that, when put together, represent a logical analysis. Furthermore, a design group contains the relationship between every BAM files present. Samples (with or without replicates) and controls can be assigned to a same design group. There can be as many groups as necessary. A BAM file can be assigned to more than one group.

To represent the relationship between every BAM files, design groups must have the following columns:

• The list of paths to every BAM files related to an analysis.
• One column per group of files (replicates and/or controls).

There is two possible way to create design groups, by reading a file or by directly creating a design object in R.

### 3.3.1 Design groups from a file

Design groups can be loaded into the metagene package by using a text file. As the relationship between BAM files as to be specified, the following columns are mandatory:

• First column: The list of paths (absolute or relative) to every BAM files for all the design groups, the BAM filenames or the BAM names (if a named BAM. file was used).
• Following columns: One column per design group (replicates and/or controls). The column can take only three values:
• 0: ignore file
• 1: input
• 2: control

The file must also contain a header. It is recommanded to use Samples for the name of the first column, but the value is not checked. The other columns in the design file will be used for naming design groups, and must be unique.

fileDesign <- system.file("extdata/design.txt", package="metagene")
design$Samples <- paste(system.file("extdata", package="metagene"), design$Samples, sep="/")
kable(design)
Samples align1 align2
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep1.bam 1 0
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep2.bam 1 0
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep1.bam 0 1
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep2.bam 0 1
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/ctrl.bam 2 2

### 3.3.2 Design groups from R

It is not obligatory to use a design file, you can create the design data.frame using your prefered method (as long as the restrictions on the values mentioned previously are respected).

For instance, the previous design data.frame could have been create directly in R:

design <- data.frame(Samples = c("align1_rep1.bam", "align1_rep2.bam",
"align2_rep1.bam", "align2_rep2.bam", "ctrl.bam"),
align1 = c(1,1,0,0,2), align2 = c(0,0,1,1,2))
design$Samples <- paste0(system.file("extdata", package="metagene"), "/", design$Samples)
kable(design)
Samples align1 align2
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep1.bam 1 0
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep2.bam 1 0
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep1.bam 0 1
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep2.bam 0 1
/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/ctrl.bam 2 2

# 4 Analysis steps

A typical metagene analysis will consist steps:

• Extraction the read count of every BAM files in selected regions.
• Conversion in coverage.
• Noise removal
• Normalization of the coverage values.
• Matrices production.
• Data frame production.
• Generation of the metagene plot.

## 4.1 Minimal analysis

A minimal metagene analysis can be performed in 2 steps:

1. Initialization (the new function).
2. plot
regions <- get_demo_regions()
bam_files <- get_demo_bam_files()
# Initialization
mg <- metagene$new(regions = regions, bam_files = bam_files) # Plotting mg$plot(title = "Demo metagene plot")
## align1_rep1_list1
## align1_rep1_list2
## align1_rep2_list1
## align1_rep2_list2
## align2_rep1_list1
## align2_rep1_list2
## align2_rep2_list1
## align2_rep2_list2
## ctrl_list1
## ctrl_list2

As you can see, it is not mandatory to explicitly call each step of the metagene analysis. For instance, in the previous example, the plot function call the other steps automatically with default values (the next section will describe the steps in more details).

In this specific case, the plot is messy since by default metagene will produce a curve for each possible combinations of BAM file and regions. Since we have 5 BAM files and 2 regions, this gives us 10 curves.

If we want more control on how every step of the analysis are performed, we have to call each functions directly.

## 4.2 Complete analysis

In order to fully control every step of a metagene analysis, it is important to understand how a complete analysis is performed. If we are satisfied with the default values, it is not mandatory to explicitly call every step (as was shown in the previous section).

### 4.2.1 Initialization

During this step, the coverages for every regions specified are extracted from every BAM files. More specifically, a new GRanges is created by combining all the regions specified with the regions param of the new function.

regions <- get_demo_regions()
bam_files <- get_demo_bam_files()
mg <- metagene$new(regions = regions, bam_files = bam_files) ### 4.2.2 Producing the matrices To produce the metagene plot, the coverages must be converted in a matrix where the columns represent the positions and the rows the regions. Furthermore, to reduce the computation time during the following steps, the positions are also binned. We can control the size of the bins with the bin_count argument. By default, a bin_count of 100 will be used during this step. mg$produce_matrices()

We can also use the design we produced earlier to remove background signal and combine replicates:

mg$produce_matrices(design = design) ### 4.2.3 Producing the data.frame The metagene plot are produced using the ggplot2 package, which require a data.frame as input. During this step, the values of the ribbon are calculated. The algorithm used for the estimation of the confidence interval can be specified with the stat parameter. By default, metagene uses “bootstrap” to obtain a better estimation of the mean of enrichment for every positions in each group of regions/BAM files. Another approach, called “basic” will simply use the ribbon to represent 95% of the values for each positions in each groups. mg$produce_data_frame(stat = "basic")
## align1_list1
## align1_list2
## align2_list1
## align2_list2

### 4.2.4 Plotting

During this step, metagene will use the data.frame to plot the calculated values using ggplot2. We show a subset of the regions by using the region_names and exp_names parameter. The region_names correspond to the names of the regions used during the initialization. The exp_name will vary depending if a design was added. If no design was added, this param correspond to the BAM name or BAM filenames. Otherwise, we have to use the names of the columns from the design.

mg$plot(region_names = "list1", title = "Demo plot subset") # 5 Manipulating the metagene objects ## 5.1 Getters Multiple getters functions are available to access the data that is stored in a metagene object. ### 5.1.1get_params The various parameters used during the initialization of the metagene object, the production of the matrices and the production of the plot are saved and can be accessed with the get_params function: mg <- get_demo_metagene() mg$get_params()
## $padding_size ## [1] 0 ## ##$verbose
## [1] FALSE
##
## $bin_count ## [1] 100 ## ##$bam_files
##                                                          align1_rep1
## "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep1.bam"
##                                                          align1_rep2
## "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align1_rep2.bam"
##                                                          align2_rep1
## "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep1.bam"
##                                                          align2_rep2
## "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/align2_rep2.bam"
##                                                                 ctrl
##        "/tmp/Rtmpn6ozhW/Rinst77db76ee300c/metagene/extdata/ctrl.bam"
##
## $force_seqlevels ## [1] FALSE ## ##$flip_regions
## [1] FALSE

### 5.1.2get_design

To get the design that was used to produce the last version of the matrices, you can use the get_design function:

mg$produce_matrices(design = get_demo_design()) ## Alternatively, it is also possible to add a design without producing the ## matrices: #mg$add_design(get_demo_design())
mgget_design() ## Samples align1 align2 ## 1 align1_rep1.bam 1 0 ## 2 align1_rep2.bam 1 0 ## 3 align2_rep1.bam 0 1 ## 4 align2_rep2.bam 0 1 ## 5 ctrl.bam 2 2 ### 5.1.3get_bam_count To get the number of aligned read in a BAM file, you can use the get_bam_count function: mgget_bam_count(bam_files[1])
## [1] 4635

### 5.1.4get_regions

To get all the regions, you can use the get_regions function:

mg$get_regions() ## GRangesList object of length 2: ##$list1
## GRanges object with 50 ranges and 0 metadata columns:
##        seqnames                 ranges strand
##           <Rle>              <IRanges>  <Rle>
##    [1]     chr1   [16103663, 16105662]      *
##    [2]     chr1   [23921318, 23923317]      *
##    [3]     chr1   [34848977, 34850976]      *
##    [4]     chr1   [36368182, 36370181]      *
##    [5]     chr1   [36690488, 36692487]      *
##    ...      ...                    ...    ...
##   [46]     chr1 [172081530, 172083529]      *
##   [47]     chr1 [172081796, 172083795]      *
##   [48]     chr1 [172147016, 172149015]      *
##   [49]     chr1 [172205805, 172207804]      *
##   [50]     chr1 [172260642, 172262641]      *
##
## ...
## <1 more element>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths

It is also possible to extract a subset of the regions with the get_regions function:

mg$get_regions(region_names = c(regions[1])) ## GRangesList object of length 1: ##$list1
## GRanges object with 50 ranges and 0 metadata columns:
##        seqnames                 ranges strand
##           <Rle>              <IRanges>  <Rle>
##    [1]     chr1   [16103663, 16105662]      *
##    [2]     chr1   [23921318, 23923317]      *
##    [3]     chr1   [34848977, 34850976]      *
##    [4]     chr1   [36368182, 36370181]      *
##    [5]     chr1   [36690488, 36692487]      *
##    ...      ...                    ...    ...
##   [46]     chr1 [172081530, 172083529]      *
##   [47]     chr1 [172081796, 172083795]      *
##   [48]     chr1 [172147016, 172149015]      *
##   [49]     chr1 [172205805, 172207804]      *
##   [50]     chr1 [172260642, 172262641]      *
##
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths

### 5.1.5get_raw_coverages

To get the coverages produced during the initialization of the metagene object, you can use the get_raw_coverages function. Please note that to save space, metagene will only extract the coverages in the regions provided.

coverages <- mg$get_raw_coverages() coverages[[1]] ## RleList of length 22 ##$chr10
## integer-Rle of length 130694993 with 1 run
##   Lengths: 130694993
##   Values :         0
##
## $chr11 ## integer-Rle of length 122082543 with 1 run ## Lengths: 122082543 ## Values : 0 ## ##$chr12
## integer-Rle of length 120129022 with 1 run
##   Lengths: 120129022
##   Values :         0
##
## $chr13 ## integer-Rle of length 120421639 with 1 run ## Lengths: 120421639 ## Values : 0 ## ##$chr14
## integer-Rle of length 124902244 with 1 run
##   Lengths: 124902244
##   Values :         0
##
## ...
## <17 more elements>
length(coverages)
## [1] 5

It is also possible to extract a subset of all the coverages by providing the filenames:

coverages <- mgget_raw_coverages(filenames = bam_files[1:2]) length(coverages) ## [1] 2 ### 5.1.6get_normalized_coverages The get_normalized_coverages function works exactly like the get_raw_coverages function except that it returns the coverages in read per million aligned (RPM). ## 5.2 Chaining functions Every function of metagene (except for the getters) invisibly return a pointer to itself. This means that the functions can be chained: rg <- get_demo_regions() bam <- get_demo_bam_files() d <- get_demo_design() title <- "Show chain" mg <- metagenenew(rg, bam)$produce_matrices(design = d)$plot(title = title)
## align1_list1
## align1_list2
## align2_list1
## align2_list2

## 5.3 Copying a metagene object

To copy a metagene object, you have to use the clone function:

mg_copy <- mg$clone() # 6 Managing large datasets While metagene try to reduce it’s memory usage, it’s possible to run into memory limits when working with multiple large datasets (especially when there is a lot of regions with a large width). One way to avoid this is to analyse each dataset seperately and then merge just before producing the metagene plot: mg1 <- metagene$new(bam_files = bam_files, regions = regions[1])
mg1produce_data_frame() ## align1_rep1_list1 ## align1_rep2_list1 ## align2_rep1_list1 ## align2_rep2_list1 ## ctrl_list1 mg2 <- metagenenew(bam_files = bam_files, regions = regions[2])
mg2produce_data_frame() ## align1_rep1_list2 ## align1_rep2_list2 ## align2_rep1_list2 ## align2_rep2_list2 ## ctrl_list2 Then you can extract the data.frames and combine them with rbind: df1 <- mg1get_data_frame()
df2 <- mg2$get_data_frame() df <- rbind(df1, df2) Finally, you can use the plot_metagene function to produce the metagene plot: p <- plot_metagene(df) p + ggplot2::ggtitle("Managing large datasets") # 7 Comparing profiles with permutations It is possible to compare two metagene profiles using the permutation_test function provided with the metagene package. Please note that the permutation tests functionality is still in development and is expected to change in future releases. The first step is to decide which profiles we want to compare and extract the corresponding matrices: matrices <- mg$get_matrices()
m1 <- matrices[["list1"]][["align1"]][["input"]]
m2 <- matrices[["list1"]][["align2"]][["input"]]

Then we defined to function to use to compare the two profiles. For this, a companion package of metagene named similaRpeak provides multiple metrics.

For this example, we will prepare a function to calculate the RATIO_NORMALIZED_INTERSECT between two profiles:

library(similaRpeak)
perm_fun <- function(profile1, profile2) {
similarity(profile1, profile2)[["metrics"]][["RATIO_NORMALIZED_INTERSECT"]]
}

We then compare our two profiles using this metric:

ratio_normalized_intersect <- perm_fun(colMeans(m1), colMeans(m2))
ratio_normalized_intersect
## [1] 0.7336965

To check if this value is significant, we can permute the two matrices that were used to produce the profile and calculate their RATIO_NORMALIZED_INTERSECT:

permutation_results <-  permutation_test(m1, m2, sample_size = nrow(m1),
sample_count = 1000, FUN = perm_fun)

Finally, we check how often the calculated value is greater than the results of the permutations:

sum(ratio_normalized_intersect >= permutation_results) / length(permutation_results)
## [1] 0.005