Using LOLA Core

Nathan Sheffield

2017-10-30

(This vignette is unevaluated because it relies on loading huge database files)

Most likely, your first goal will be to test your regions of interest for enrichment against as many public datasets as you can get your hands on. To make this easier on you, we’ve assembled a core database (LOLA core) which we curate from many sources. Since LOLA core is rather large, it’s not possible to distribute with the package, and you’ll need to download it on your own. For this vignette you need to download and unpack either the complete or the cached LOLA core database; a download link is kept in the README file at the LOLA github page. Download and unpack the LOLA core archive into your current working directory.

Loading full-scale example data

You’ll also need a few region sets of interest. Typically, this will be data you’ve generated or downloaded and you’ll know something about it already; for this vignette, you can download some example data from the same page as LOLA Core. Just for fun, in this vignette I’ll call these mystery sets, and the point will be to see what LOLA tells us about them if we have no previous knowledge. Then I’ll reveal their true identity at the end of the vignette.

To load the mystery region sets: LOLA requires the data in GRanges objects. LOLA now comes with a readBed function, which is fast and convenient. You also need to come up with a “region universe” for the calculation. This can be quite tricky, so there’s another vignette dedicated to choosing an appropriate universe. For this vignette, I’ve provided a reasonable universe (it will likely be applicable elsewhere, too). This is just a combined set of all active DNase hypersensitive sites in the genome. This would be a reasonable universe to use for a set of some specific active regulatory elements, like in a ChIP-seq experiment. Since the mystery data in this vignette is all active ChIP-seq data, it’s a reasonable universe here.

Load up the database and query regions like this (this should take under a minute or so on a modern machine).

library("LOLA")

regionDB = loadRegionDB("LOLACore/hg19")

regionSetA = readBed("lola_vignette_data/setA_100.bed")
regionSetB = readBed("lola_vignette_data/setB_100.bed")
regionSetC = readBed("lola_vignette_data/setC_100.bed")
activeDHS = readBed("lola_vignette_data/activeDHS_universe.bed")

Running the enrichment

We then need to combine these GRanges objects into a single GRangesList. This lets us calculate enrichments for all your region sets simultaneously. It’s done independently, but it’s faster to merge them and get a single result than to calculate for each set individually. Run the enrichment calculation:

userSets = GRangesList(regionSetA, regionSetB, regionSetC)
locResults = runLOLA(userSets, activeDHS, regionDB, cores=1)

Exploring results

locResults[1:10,]

A first look at the top results looks like userSet1 (setA) is highly enriched for active ChIP experiments (Pol2, TAF1) in a variety of cell-types, with SK-N-MC at the top. By default, the results will be ranked by p-value. The meanRnk and maxRnk columns rank the results by a combination of p-value, log odds, and support (overlap), and often yield the most interpretable results. Explore these rankings by simply reordering the data.table. If results are dominated by just a single collection (such as encode_tfbs), it may also be interesting to just look at enrichments for a particular collection. Look around at the results in various ways to get an idea of what trends in the results. Since one userSet may also dominating the top results, we can limit the table to other userSets to see how they fare.

locResults[order(meanRnk, decreasing=FALSE),][1:20,]
locResults[order(maxRnk, decreasing=FALSE),][1:20,]
locResults[collection=="sheffield_dnase", ][order(maxRnk, decreasing=FALSE),]

locResults[userSet==2,][order(maxRnk, decreasing=FALSE),][1:10,]
locResults[userSet==3,][order(maxRnk, decreasing=FALSE),][1:10,]

locResults[userSet==1,][collection=="sheffield_dnase", ][1:15,]
locResults[userSet==2,][collection=="sheffield_dnase", ][1:15,]
locResults[userSet==3,][collection=="sheffield_dnase", ][1:15,]

From this we can see that set B and set C have very different enrichments, with setC (userSet3) being enriched in AP-1 factors (cJun and cFos) in a variety of cell types, and multiple databases. UserSet 2 looks more similar to userSet1, dominated by polymerase binding. In the DNase collection, we see that both userSets 1 and 2 are enriched for both ubiquitous CGI promoters, and Ewing-sarcoma specific elements. Set 3 is enriched in a bunch of astrocyte-specific dnase hypersensitive sites, and other weaker sites, likely to be enhancer elements rather than promoters.

To make it easier to look at results for individual region sets or databases, you can write the results out and view them with spreadsheet software.

writeCombinedEnrichment(locResults, outFolder= "lolaResults", includeSplits=TRUE)

By default, this function will write the entire table to a tsv file. I recommend using the includeSplits parameter, which tells the function to also print out additional tables that are subsetted by userSet, so that each region set you test has its own result table. It just makes it a little easier to explore the results.

Mystery set identity

The results we found fit perfectly with the identity of the region sets: