Introduction

The DeepBlue Epigenomic Data Server is an online application that allows researchers to access data from various epigenomic mapping consortia such as DEEP, BLUEPRINT, ENCODE, or ROADMAP. DeepBlue can be accessed through a web interface or programmatically via its API. The usage of the API is documented with examples, use cases, and a user manual. While the description of the API is language agnostic, the examples and use cases shown online are focused on the python language. However, the R package presented here also enables access to the DeepBlue API directly within the R statistical environment and provides convenient functionality for triggering operations on the DeepBlue server as well as for data retrievel using R functions. In the following, we give a brief introduction to the package and subsequently show how python examples from the online documentation can be reproduced with it.

What is DeepBlue ?

A wealth of epigenomic data has been collected over the past decade by large epigenomic mapping consortia. Event though most of these data are publicly available, the task of identifiying, downloading and processing data from various experiments is challenging. Recognizing that these tedious steps need to be tackled programmatically, we developed the DeepBlue epigenomic data server. Epigenome data from the different epigenome mapping consortia are accessible with standardized metadata. An experiment is the most important entity in DeepBlue and typically encompasses a single file (usually a bed or wig file) with a set of mandatory metadata: name, genome assembly, epigenetic mark, biosource, sample, technique, and project. For the sake of organization, all metadata fields are part of controlled vocabularies, some of which are imported from ontologies (CL, EFO, and UBERON, to name a few). DeepBlue also contains annotations, i.e. auxiliary data that is helpful in epigenomic analysis, such as, for example, CpG Islands, promoter regions, and genes. DeepBlue provides different types of commands, such as listing and searching commands as well as commands for data retrieval. A typical work-flow for the latter is to select, filter, transform, and finally download the selected data. For a more thorough description of DeepBlue we refer to the DeepBlue publication in the 2016 NAR webserver issue. If you find DeepBlue useful and use it in your project consider citing this paper.

Important note: With the exception of data aggregation tasks, DeepBlue does not alter the imported data, i.e. it remains exactly as provided by the epigenome mapping consortia.

Getting started

Installation

Installation of DeepBlueR and its companion packages can be performed using the Bioconductor install method in the BiocManager package:

install.packages("BiocManager") 
BiocManager::install("DeepBlueR")

The package name is DeepBlueR and it can be loaded via:

library(DeepBlueR)

You can test your installation and connectivity by saying hello to the DeepBlue server:

deepblue_info("me")

Overview of DeepBlue commands

DeepBlue provides a comprehensive programmatic interface for finding, selecting, filtering, summarizing and downloading annotated genomic region sets. Downloaded region sets are stored using the GenomicRanges R package, which allows for downloaded region sets to be further processed, visualized and analyzed with existing R packages such as LOLA or GViz.

A list of all commands available by DeepBlue is provided in its API page. The vast majority of these commands is also available through this R package and can be listed as follows:

help(package="DeepBlueR")

In the following we listed the most frequently used DeepBlue commands. The full list of commands is available here. Note that each command in the following two tables has the prefix ’deepblue_*’, e.g. deepblue_select_genes.

Category Command Description
Information info Information about an entity
List and search list_genomes List registered genomes
list_biosources List registered biosources
list_samples List registered samples
list_epigenetic marks List registered epigenetic marks
list_experiments List available experiments
list_annotations List available annotations
search Perform a full-text search
Selection select_regions Select regions from experiments
select_experiments Select regions from experiments
select_annotations Select regions from annotations
select_genes Select genes as regions
select_expressions Select expression data
tiling_regions Generate tiling regions
input_regions Upload and use a small region-set
Operation aggregate Aggregate and summarize regions
filter_regions Filter regions by theirs attributes
flank Generate flanking regions
intersection Filter for intersecting regions
overlap Filter for regions overlapping by at least a specific size
merge_queries Merge two regions set
Result count_regions Count selected regions
score_matrix Request a score matrix
get_regions Request the selected regions
binning Bin results according to counts
Request get_request data Obtain the requested data

In addition, this package provides a set of convenience functions not part of the DeepBlue API, such as:

Category Command Description
Request batch_export_results Download the results for a list of requests
download_request_data Download and convert the requested data (blocking)
export_meta_data Export metadata to a tab delimited file
export_tab Export any result as tab delimited file
export_bed Export GenomicRanges results as BED file

DeepBlue usage examples

In the following we give a number of increasingly complex examples illustrating what DeepBlue can achieve in your epigenomic data analysis work-flow. We go beyond the online description of these examples by showing how the retrieved information can be further used in R.

One of the first tasks in DeepBlue is finding the data of interest. This can be achieved in three ways:

Listing experiments

We use the deepblue_list_experiments command to list all experiments with the corresponding values in their metadata.

experiments = deepblue_list_experiments(type="peaks", epigenetic_mark="H3K4me3",
    biosource=c("inflammatory macrophage", "macrophage"),
    project="BLUEPRINT Epigenome")

Accessing the extra-metadata

The extra-metadata is important because it contains information that is not stored in the mandatory metadata fields. We use the deepblue_info command to access an experiment’s metadata- and extra-metadata fields. The following example prints the file_url attribute that is contained in the data imported from the ENCODE project.

info = deepblue_info("e30000")
print(info$extra_metadata$file_url)
## [1] "https://www.encodeproject.org/files/ENCFF001YBB/"

Select epigenomic data

We use the deepblue_select_experiments command to select all genomic regions from the two informed experiments. We use the deepblue_count_regions command with the query_id value returned by the deepblue_select_experiments command.

The deepblue_count_regions command is executed asynchronously. This means that the user receives a request_id and should check the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

query_id = deepblue_select_experiments(
    experiment_name=c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"))
# Count how many regions where selected
request_id = deepblue_count_regions(query_id=query_id)
# Download the request data as soon as processing is finished
requested_data = deepblue_download_request_data(request_id=request_id)
print(paste("The selected experiments have", requested_data, "regions."))
## [1] "The selected experiments have 115347 regions."

Output with selected columns

We use the deepblue_select_experiments command to select genomic regions from the experiments that are in chromosome 1, position 0 to 50,000,000.

We then use the deepblue_get_regions command with the query_id value returned by the deepblue_select_experiments command to request the regions with the selected columns. Selecting the columns @NAME and @BIOSOURCE represent the experiment name and the experiment biosource.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

query_id = deepblue_select_experiments (
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
        chromosome="chr1", start=0, end=50000000)

# Retrieve the experiments data (The @NAME meta-column is used to include the
# experiment name and @BIOSOURCE for experiment's biosource
request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 3783 ranges and 4 metadata columns:
##          seqnames            ranges strand | SIGNAL_VALUE      PEAK
##             <Rle>         <IRanges>  <Rle> |  <character> <integer>
##      [1]     chr1     270668-270987      * |       6.5758        39
##      [2]     chr1     271277-271468      * |       6.2148       136
##      [3]     chr1     273768-274209      * |      14.1567       164
##      [4]     chr1     778377-778676      * |       8.0198       154
##      [5]     chr1     778409-778678      * |       4.5767       123
##      ...      ...               ...    ... .          ...       ...
##   [3779]     chr1 47437420-47437621      * |       3.7686       147
##   [3780]     chr1 47437751-47438038      * |       9.6553       149
##   [3781]     chr1 48245368-48245867      * |       4.7708       346
##   [3782]     chr1 48542755-48543280      * |       7.3002       152
##   [3783]     chr1 48793649-48793986      * |       5.1974       108
##                           @NAME   @BIOSOURCE
##                     <character>  <character>
##      [1] S008SGH1.ERX406923.H.. myeloid cell
##      [2] S008SGH1.ERX406923.H.. myeloid cell
##      [3] S008SGH1.ERX406923.H.. myeloid cell
##      [4] S008SGH1.ERX406923.H.. myeloid cell
##      [5] BL-2_c01.ERX297416.H..         BL-2
##      ...                    ...          ...
##   [3779] BL-2_c01.ERX297416.H..         BL-2
##   [3780] S008SGH1.ERX406923.H.. myeloid cell
##   [3781] S008SGH1.ERX406923.H.. myeloid cell
##   [3782] S008SGH1.ERX406923.H.. myeloid cell
##   [3783] S008SGH1.ERX406923.H.. myeloid cell
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Filter epigenomic data by metadata

We use the deepblue_list_samples command to obtain all samples with the biosource ‘myeloid cell’ from the BLUEPRINT project. The deepblue_list_samples returns a list of samples with their IDs and content. We extract the sample IDs from this list and use it in the deepblue_select_regions command to selects genomic regions that are in chromosome 1, position 0 to 50,000.

Then, we use the deepblue_get_regions command with the parameter query_id returned by the deepblue_select_regions command and the columns @NAME, SAMPLE_ID, and @BIOSOURCE representing the experiment name, the sample ID, and the experiment biosource.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

samples = deepblue_list_samples(
    biosource="myeloid cell",
    extra_metadata = list("source" = "BLUEPRINT Epigenome"))
samples_ids = deepblue_extract_ids(samples)
query_id = deepblue_select_regions(genome="GRCh38", sample=samples_ids,
    chromosome="chr1", start=0, end=50000)
request_id = deepblue_get_regions(query_id=query_id,
    output_format="CHROMOSOME,START,END,@NAME,@SAMPLE_ID,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions,1)
## GRanges object with 1 range and 3 metadata columns:
##       seqnames    ranges strand |                  @NAME  @SAMPLE_ID
##          <Rle> <IRanges>  <Rle> |            <character> <character>
##   [1]     chr1   0-10000      * | S00Q7NH1_12_12_Bluep..       s8797
##         @BIOSOURCE
##        <character>
##   [1] myeloid cell
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Filter epigenomic data by region attributes

We use the deepblue_select_experiments command for selecting genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with SIGNAL_VALUE > 10 and PEAK > 1000.

Then, we use the deepblue_get_regions command with the parameter query_id returned by the deepblue_select_regions command and the columns @NAME and @BIOSOURCE representing the experiment name and the experiment biosource.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(
    query_id=query_id, field="SIGNAL_VALUE", operation=">",
    value="10", type="number")
query_id_filters = deepblue_filter_regions(
    query_id=query_id_filter_signal, field="PEAK", operation=">",
    value="1000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filters,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 161 ranges and 4 metadata columns:
##         seqnames            ranges strand | SIGNAL_VALUE      PEAK
##            <Rle>         <IRanges>  <Rle> |  <character> <integer>
##     [1]     chr1   1142428-1144001      * |      10.9313      1275
##     [2]     chr1   1573400-1575582      * |      17.8805      1094
##     [3]     chr1   1612814-1616174      * |      32.2064      2802
##     [4]     chr1   1668761-1670450      * |      20.2936      1017
##     [5]     chr1   1778583-1783797      * |      35.4277      1293
##     ...      ...               ...    ... .          ...       ...
##   [157]     chr1 44774644-44776655      * |      16.3227      1160
##   [158]     chr1 44806139-44811000      * |      22.8156      1381
##   [159]     chr1 46301112-46304262      * |      19.8041      2397
##   [160]     chr1 46579227-46582046      * |      15.9613      1824
##   [161]     chr1 46593677-46595181      * |      11.8798      1304
##                          @NAME   @BIOSOURCE
##                    <character>  <character>
##     [1] BL-2_c01.ERX297416.H..         BL-2
##     [2] S008SGH1.ERX406923.H.. myeloid cell
##     [3] S008SGH1.ERX406923.H.. myeloid cell
##     [4] S008SGH1.ERX406923.H.. myeloid cell
##     [5] S008SGH1.ERX406923.H.. myeloid cell
##     ...                    ...          ...
##   [157] S008SGH1.ERX406923.H.. myeloid cell
##   [158] S008SGH1.ERX406923.H.. myeloid cell
##   [159] S008SGH1.ERX406923.H.. myeloid cell
##   [160] S008SGH1.ERX406923.H.. myeloid cell
##   [161] S008SGH1.ERX406923.H.. myeloid cell
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Find intersecting regions

We use the deepblue_select_experiments command for selecting genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with SIGNAL_VALUE > 10 and PEAK > 1000.

The command deepblue_intersection filters for all regions of the query_id that intersect with at least one region in promoters_id.

Then, we use the deepblue_get_regions command with the parameter query_id returned by the deepblue_select_regions command and the columns @NAME and @BIOSOURCE representing the experiment name and the experiment biosource.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
promoters_id = deepblue_select_annotations(annotation_name="promoters",
    genome="GRCh38", chromosome="chr1")
intersect_id = deepblue_intersection(
    query_data_id=query_id, query_filter_id=promoters_id)
request_id = deepblue_get_regions(
    query_id=intersect_id,
    output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 608 ranges and 4 metadata columns:
##         seqnames            ranges strand | SIGNAL_VALUE      PEAK
##            <Rle>         <IRanges>  <Rle> |  <character> <integer>
##     [1]     chr1     903997-904177      * |       4.7708        89
##     [2]     chr1     904302-905111      * |       5.4928       560
##     [3]     chr1     910269-910975      * |       4.7201       136
##     [4]     chr1     911973-913915      * |      17.0446       624
##     [5]     chr1     923976-924329      * |       4.7201       109
##     ...      ...               ...    ... .          ...       ...
##   [604]     chr1 46718435-46719027      * |       6.2499       230
##   [605]     chr1 47313340-47313980      * |       6.8711       174
##   [606]     chr1 47313412-47313588      * |       4.8558       120
##   [607]     chr1 47313632-47314141      * |      10.2801       371
##   [608]     chr1 47333183-47335172      * |      18.9772       857
##                          @NAME   @BIOSOURCE
##                    <character>  <character>
##     [1] S008SGH1.ERX406923.H.. myeloid cell
##     [2] S008SGH1.ERX406923.H.. myeloid cell
##     [3] S008SGH1.ERX406923.H.. myeloid cell
##     [4] S008SGH1.ERX406923.H.. myeloid cell
##     [5] S008SGH1.ERX406923.H.. myeloid cell
##     ...                    ...          ...
##   [604] BL-2_c01.ERX297416.H..         BL-2
##   [605] BL-2_c01.ERX297416.H..         BL-2
##   [606] S008SGH1.ERX406923.H.. myeloid cell
##   [607] S008SGH1.ERX406923.H.. myeloid cell
##   [608] S008SGH1.ERX406923.H.. myeloid cell
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(Gviz)
atrack <- AnnotationTrack(regions, name = "Intersecting regions",
    group = regions$`@BIOSOURCE`, genome="hg38")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = "hg38", chromosome = "chr1")
plotTracks(list(itrack, atrack, gtrack), groupAnnotation="group", fontsize=18,
           background.panel = "#FFFEDB", background.title = "darkblue")

Retrieve DNA sequences

The meta-column @LENGTH contains the genomic region length, and we filter the genomic regions where this value is smaller than 2,000.

The meta-column @SEQUENCE includes the DNA Sequence in the genomic region output.

The deepblue_get_regions command is executed asynchronously. This means that the user receives a request_id to be able to check for the status of this request. In contrast to the command deepblue_get_request_data, the DeepBlueR package-specific command deepblue_download_request_data will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.

query_id = deepblue_select_experiments(
    experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(query_id=query_id,
    field="SIGNAL_VALUE", operation=">", value="10", type="number")
query_id_filters = deepblue_filter_regions(query_id=query_id_filter_signal,
    field="PEAK", operation=">", value="1000", type="number")
query_id_filter_length = deepblue_filter_regions (query_id=query_id_filters,
    field="@LENGTH", operation="<", value="2000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filter_length,
    output_format="CHROMOSOME,START,END,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 1)
## GRanges object with 1 range and 4 metadata columns:
##       seqnames          ranges strand |                  @NAME  @BIOSOURCE
##          <Rle>       <IRanges>  <Rle> |            <character> <character>
##   [1]     chr1 1142428-1144001      * | BL-2_c01.ERX297416.H..        BL-2
##         @LENGTH              @SEQUENCE
##       <integer>            <character>
##   [1]      1573 CCAGGCTGGTCTCAAACTCC..
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

DNA pattern matching operations

We use the deepblue_find_motif command to find all position of a given pattern in the genome. An example is finding all locations of ‘TATAA’ in genome assembly GRCh38.

We use the deepblue_select_experiments command to select genomic regions that are in chromosome 1, position 0 to 50,000,000 from the selected experiments.

The command deepblue_intersect selects all regions of the query_id that intersect with at least one tataa_regions region.

tataa_regions = deepblue_find_motif(motif="TATAAA", genome="GRCh38", chromosomes="chr1")
query_id = deepblue_select_experiments(
    experiment_name= c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
        "S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
    chromosome="chr1", start=0, end=50000000)
overlapped = deepblue_intersection(query_data_id=query_id,
                                   query_filter_id=tataa_regions)
request_id=deepblue_get_regions(overlapped,
    "CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE,@PROJECT")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
## GRanges object with 3 ranges and 7 metadata columns:
##       seqnames        ranges strand | SIGNAL_VALUE      PEAK
##          <Rle>     <IRanges>  <Rle> |  <character> <integer>
##   [1]     chr1 273768-274209      * |      14.1567       164
##   [2]     chr1 911203-911477      * |       4.8558       105
##   [3]     chr1 916813-917682      * |      10.5467       183
##                        @NAME   @BIOSOURCE   @LENGTH              @SEQUENCE
##                  <character>  <character> <integer>            <character>
##   [1] S008SGH1.ERX406923.H.. myeloid cell       441 AGCCAAAGGTGATATTTTCA..
##   [2] S008SGH1.ERX406923.H.. myeloid cell       274 ATAAACAGGCAGGAGTCTTC..
##   [3] S008SGH1.ERX406923.H.. myeloid cell       869 TTTATTGCACAAATTATTAA..
##                  @PROJECT
##               <character>
##   [1] BLUEPRINT Epigenome
##   [2] BLUEPRINT Epigenome
##   [3] BLUEPRINT Epigenome
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Counting motifs in a region

The meta column @COUNT.MOTIF() allows for counting how many times a motif appears in the selected genomic region. For example, the following code return the experiment regions with the DNA sequence length, the counts of G, CG, GC, and the DNA sequence itself.

experiment_data = deepblue_select_experiments(
  "DG-75_c01.ERX297417.H3K27ac.bwa.GRCh38.20150527.bed")
fmt = "CHROMOSOME,START,END,@LENGTH,@COUNT.MOTIF(C),@COUNT.MOTIF(G),@COUNT.MOTIF(CG),@COUNT.MOTIF(GC),@SEQUENCE"
request_id=deepblue_get_regions(experiment_data, fmt)
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
## GRanges object with 3 ranges and 6 metadata columns:
##       seqnames        ranges strand |   @LENGTH @COUNT.MOTIF(C) @COUNT.MOTIF(G)
##          <Rle>     <IRanges>  <Rle> | <integer>     <character>     <character>
##   [1]     chr1 779094-779379      * |       285              83              85
##   [2]     chr1 826755-827064      * |       309             109              78
##   [3]     chr1 958700-959105      * |       405             121             124
##       @COUNT.MOTIF(CG) @COUNT.MOTIF(GC)              @SEQUENCE
##            <character>      <character>            <character>
##   [1]               15               24 ACGTGCGCTCCACAACGCCT..
##   [2]               15               18 ACCCATCCACTTCCCATCTA..
##   [3]               29               32 GAGATTTTTGCACAACTCAC..
##   -------
##   seqinfo: 36 sequences from an unspecified genome; no seqlengths

Genes

We use the deepblue_select_genes command to select the gene RP11-34P13 from GENCODE v23.

The selected genes behave like a regular genomic region, which, for example, can be filtered by their attributes. We use the @GENE_ATTRIBUTE meta-column to select the genomic regions that are annotated as lincRNAs.

q_genes = deepblue_select_genes(genes="RP11-34P13", gene_model="gencode v23")
q_filter = deepblue_filter_regions(query_id=q_genes,
    field="@GENE_ATTRIBUTE(gene_type)", operation="==",
    value="lincRNA", type="string")
request_id=deepblue_get_regions(q_filter, "CHROMOSOME,START,END,GTF_ATTRIBUTES")
regions = deepblue_download_request_data(request_id=request_id)
regions

Aggregate and summarize regions

The command deepblue_aggregate summarizes the query_id regions using the cpg_islands regions defined by the corresponding annotation as boundaries.

The aggregated values can be accessed through the @AGG.* meta-columns.

query_id = deepblue_select_experiments (
    experiment=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
    chromosome="chr1", start=0, end=50000000)
cpg_islands = deepblue_select_annotations(annotation_name="CpG Islands",
    genome="GRCh38", chromosome="chr1", start=0, end=50000000)
# Aggregate
overlapped = deepblue_aggregate (data_id=query_id, ranges_id=cpg_islands,
    column="VALUE" )

# Retrieve the experiments data (The @NAME meta-column is used to include
# the experiment name and @BIOSOURCE for experiment's biosource
request_id = deepblue_get_regions(query_id=overlapped,
    output_format=
      "CHROMOSOME,START,END,@AGG.MIN,@AGG.MAX,@AGG.MEAN,@AGG.VAR")
regions = deepblue_download_request_data(request_id=request_id)

Gene expression

In the following example we obtain the gene expression levels of three genes, i.e., NOX3, NOXA1, and NOX4 from all biosources related to the hematopoietic stem cell biosource from the BLUEPRINT project. With related we refer to children of this biosource term in the ontologies used by DeepBlue.

hsc_children = deepblue_get_biosource_children("hematopoietic stem cell")

hsc_children_name = deepblue_extract_names(hsc_children)

hsc_children_samples = deepblue_list_samples(
    biosource = hsc_children_name,
    extra_metadata = list(source="BLUEPRINT Epigenome"))

hsc_samples_ids = deepblue_extract_ids(hsc_children_samples)

# Note that BLUEPRINT uses Ensembl Gene IDs
gene_exprs_query = deepblue_select_expressions(
    expression_type = "gene",
    sample_ids = hsc_samples_ids,
    identifiers = c("ENSG00000074771.3", "ENSG00000188747.7", "ENSG00000086991.11"),
    gene_model = "gencode v22")

request_id = deepblue_get_regions(
    query_id = gene_exprs_query,
    output_format ="@GENE_NAME(gencode v22),CHROMOSOME,START,END,FPKM,@BIOSOURCE")

regions = deepblue_download_request_data(request_id = request_id)
regions
## GRanges object with 679 ranges and 3 metadata columns:
##         seqnames              ranges strand | @GENE_NAME(gencode v22)
##            <Rle>           <IRanges>  <Rle> |             <character>
##     [1]     chr9 137423350-137434406      * |                   NOXA1
##     [2]     chr9 137423350-137434406      * |                   NOXA1
##     [3]     chr9 137423350-137434406      * |                   NOXA1
##     [4]     chr9 137423350-137434406      * |                   NOXA1
##     [5]     chr9 137423350-137434406      * |                   NOXA1
##     ...      ...                 ...    ... .                     ...
##   [675]     chr6 155395370-155455903      * |                    NOX3
##   [676]     chr6 155395370-155455903      * |                    NOX3
##   [677]     chr6 155395370-155455903      * |                    NOX3
##   [678]     chr6 155395370-155455903      * |                    NOX3
##   [679]     chr6 155395370-155455903      * |                    NOX3
##                FPKM             @BIOSOURCE
##         <character>            <character>
##     [1]      0.0100 hematopoietic multip..
##     [2]      0.3900 CD34-negative, CD41-..
##     [3]      3.1400           myeloid cell
##     [4]      2.1800 CD14-positive, CD16-..
##     [5]      0.9800           myeloid cell
##     ...         ...                    ...
##   [675]      0.0000           myeloid cell
##   [676]      0.0000      mature neutrophil
##   [677]      0.0000           naive B cell
##   [678]      0.0000 central memory CD4-p..
##   [679]      0.0000           myeloid cell
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

Tiling regions

We use the deepblue_tiling_regions command to generate a set of consecutive genomic regions of size 100,000 from chromosome 1 of the genome assembly GRCh38.

The command deepblue_aggregate summarizes the query_id regions using the column VALUE and the cpg_islands regions as boundaries.

# Selecting the data from 2 experiments:
#    GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig
# As we already know the experiments names, we keep all others fields empty.
# We are selecting all regions of chromosome 1
query_id = deepblue_select_experiments(
    experiment=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
    chromosome="chr1")

#  Tiling regions of 100.000 base pairs
tiling_id = deepblue_tiling_regions(size=100000,
    genome="GRCh38", chromosome="chr1")

# Aggregate
overlapped = deepblue_aggregate (data_id=query_id,
    ranges_id=tiling_id, column="VALUE")

# Retrieve the experiments data (The @NAME meta-column is used to include the
# experiment name and @BIOSOURCE for experiment's biosource
request_id = deepblue_get_regions(query_id=overlapped,
    output_format="CHROMOSOME,START,END,@AGG.MEAN,@AGG.SD")

regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 2489 ranges and 2 metadata columns:
##          seqnames              ranges strand | @AGG.MEAN   @AGG.SD
##             <Rle>           <IRanges>  <Rle> | <numeric> <numeric>
##      [1]     chr1            0-100000      * |    0.6677    0.3639
##      [2]     chr1       100000-200000      * |    0.8358    0.2414
##      [3]     chr1       200000-300000      * |    0.7714    0.2512
##      [4]     chr1       300000-400000      * |    0.7595    0.2477
##      [5]     chr1       400000-500000      * |    0.8512    0.1877
##      ...      ...                 ...    ... .       ...       ...
##   [2485]     chr1 248400000-248500000      * |    0.8348    0.1880
##   [2486]     chr1 248500000-248600000      * |    0.8576    0.1561
##   [2487]     chr1 248600000-248700000      * |    0.8664    0.1786
##   [2488]     chr1 248700000-248800000      * |    0.8425    0.1846
##   [2489]     chr1 248800000-248900000      * |    0.6572    0.4079
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Such data can now be plotted using any of the common R plotting mechanisms and packages. An example is shown here:

library(ggplot2)
plot_data <- as.data.frame(regions)
plot_data[,grepl("X.", colnames(plot_data))] <-
    apply(plot_data[,grepl("X.", colnames(plot_data))], 2, as.numeric)
AGG.plot <- ggplot(plot_data, aes(start)) +
    geom_ribbon(aes(ymin = X.AGG.MEAN - (X.AGG.SD / 2),
        ymax = X.AGG.MEAN + (X.AGG.SD / 2)), fill = "grey70") +
    geom_line(aes(y = X.AGG.MEAN))
print(AGG.plot)