The DeepBlue Epigenomic Data Server is an online application that allows researchers to access data from various epigenomic mapping consortia such as DEEP, BLUEPRINT, ENCODE, or ROADMAP. DeepBlue can be accessed through a web interface or programmatically via its API. The usage of the API is documented with examples, use cases, and a user manual. While the description of the API is language agnostic, the examples and use cases shown online are focused on the python language. However, the R package presented here also enables access to the DeepBlue API directly within the R statistical environment and provides convenient functionality for triggering operations on the DeepBlue server as well as for data retrievel using R functions. In the following, we give a brief introduction to the package and subsequently show how python examples from the online documentation can be reproduced with it.
A wealth of epigenomic data has been collected over the past decade by large epigenomic mapping consortia. Event though most of these data are publicly available, the task of identifiying, downloading and processing data from various experiments is challenging. Recognizing that these tedious steps need to be tackled programmatically, we developed the DeepBlue epigenomic data server. Epigenome data from the different epigenome mapping consortia are accessible with standardized metadata. An experiment is the most important entity in DeepBlue and typically encompasses a single file (usually a bed or wig file) with a set of mandatory metadata: name, genome assembly, epigenetic mark, biosource, sample, technique, and project. For the sake of organization, all metadata fields are part of controlled vocabularies, some of which are imported from ontologies (CL, EFO, and UBERON, to name a few). DeepBlue also contains annotations, i.e. auxiliary data that is helpful in epigenomic analysis, such as, for example, CpG Islands, promoter regions, and genes. DeepBlue provides different types of commands, such as listing and searching commands as well as commands for data retrieval. A typical work-flow for the latter is to select, filter, transform, and finally download the selected data. For a more thorough description of DeepBlue we refer to the DeepBlue publication in the 2016 NAR webserver issue. If you find DeepBlue useful and use it in your project consider citing this paper.
Important note: With the exception of data aggregation tasks, DeepBlue does not alter the imported data, i.e. it remains exactly as provided by the epigenome mapping consortia.
Installation of DeepBlueR and its companion packages can be performed using the Bioconductor install method in the BiocManager package:
install.packages("BiocManager")
BiocManager::install("DeepBlueR")
The package name is DeepBlueR
and it can be loaded via:
library(DeepBlueR)
You can test your installation and connectivity by saying hello to the DeepBlue server:
deepblue_info("me")
DeepBlue provides a comprehensive programmatic interface for finding, selecting, filtering, summarizing and downloading annotated genomic region sets. Downloaded region sets are stored using the GenomicRanges R package, which allows for downloaded region sets to be further processed, visualized and analyzed with existing R packages such as LOLA or GViz.
A list of all commands available by DeepBlue is provided in its API page. The vast majority of these commands is also available through this R package and can be listed as follows:
help(package="DeepBlueR")
In the following we listed the most frequently used DeepBlue commands. The full list of commands is available here. Note that each command in the following two tables has the prefix ’deepblue_*’, e.g. deepblue_select_genes.
Category | Command | Description |
---|---|---|
Information | info | Information about an entity |
List and search | list_genomes | List registered genomes |
list_biosources | List registered biosources | |
list_samples | List registered samples | |
list_epigenetic marks | List registered epigenetic marks | |
list_experiments | List available experiments | |
list_annotations | List available annotations | |
search | Perform a full-text search | |
Selection | select_regions | Select regions from experiments |
select_experiments | Select regions from experiments | |
select_annotations | Select regions from annotations | |
select_genes | Select genes as regions | |
select_expressions | Select expression data | |
tiling_regions | Generate tiling regions | |
input_regions | Upload and use a small region-set | |
Operation | aggregate | Aggregate and summarize regions |
filter_regions | Filter regions by theirs attributes | |
flank | Generate flanking regions | |
intersection | Filter for intersecting regions | |
overlap | Filter for regions overlapping by at least a specific size | |
merge_queries | Merge two regions set | |
Result | count_regions | Count selected regions |
score_matrix | Request a score matrix | |
get_regions | Request the selected regions | |
binning | Bin results according to counts | |
Request | get_request data | Obtain the requested data |
In addition, this package provides a set of convenience functions not part of the DeepBlue API, such as:
Category | Command | Description |
---|---|---|
Request | batch_export_results | Download the results for a list of requests |
download_request_data | Download and convert the requested data (blocking) | |
export_meta_data | Export metadata to a tab delimited file | |
export_tab | Export any result as tab delimited file | |
export_bed | Export GenomicRanges results as BED file |
In the following we give a number of increasingly complex examples illustrating what DeepBlue can achieve in your epigenomic data analysis work-flow. We go beyond the online description of these examples by showing how the retrieved information can be further used in R.
One of the first tasks in DeepBlue is finding the data of interest. This can be achieved in three ways:
deepblue_search
commanddeepblue_list_{experiments, annotations, ...}
commandsIn this example, we use the command deepblue_search
to find experiments that contain the keywords ‘H3k27AC’, ‘blood’, and ‘peaks’ in their metadata. We put the names in single quotes to show that these names must be in the metadata.
# We are selecting the experiments with terms 'H3k27AC', 'blood', and
# 'peak' in the metadata.
experiments_found = deepblue_search(
keyword="'H3k27AC' 'blood' 'peak'", type="experiments")
custom_table = do.call("rbind", apply(experiments_found, 1, function(experiment){
experiment_id = experiment[1]
# Obtain the information about the experiment_id
info = deepblue_info(experiment_id)
# Print the experiment name, project, biosource, and epigenetic mark.
with(info, { data.frame(name = name, project = project,
biosource = sample_info$biosource_name, epigenetic_mark = epigenetic_mark)
})
}))
head(custom_table)
## name project biosource epigenetic_mark
## 1 E038-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD H3K27ac
## 2 E047-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD H3K27ac
## 3 E048-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD H3K27ac
## 4 E037-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD H3K27ac
## 5 E045-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD H3K27ac
## 6 E040-H3K27ac.narrowPeak.bed Roadmap Epigenomics BLOOD H3K27ac
We use the deepblue_list_experiments
command to list all experiments with the corresponding values in their metadata.
experiments = deepblue_list_experiments(type="peaks", epigenetic_mark="H3K4me3",
biosource=c("inflammatory macrophage", "macrophage"),
project="BLUEPRINT Epigenome")
The extra-metadata is important because it contains information that is not stored in the mandatory metadata fields. We use the deepblue_info
command to access an experiment’s metadata- and extra-metadata fields. The following example prints the file_url
attribute that is contained in the data imported from the ENCODE project.
info = deepblue_info("e30000")
print(info$extra_metadata$file_url)
## [1] "https://www.encodeproject.org/files/ENCFF001YBB/"
We use the deepblue_select_experiments
command to select all genomic regions from the two informed experiments. We use the deepblue_count_regions
command with the query_id
value returned by the deepblue_select_experiments
command.
The deepblue_count_regions
command is executed asynchronously. This means that the user receives a request_id
and should check the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
query_id = deepblue_select_experiments(
experiment_name=c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"))
# Count how many regions where selected
request_id = deepblue_count_regions(query_id=query_id)
# Download the request data as soon as processing is finished
requested_data = deepblue_download_request_data(request_id=request_id)
print(paste("The selected experiments have", requested_data, "regions."))
## [1] "The selected experiments have 115347 regions."
We use the deepblue_select_experiments
command to select genomic regions from the experiments that are in chromosome 1, position 0 to 50,000,000.
We then use the deepblue_get_regions
command with the query_id
value returned by the deepblue_select_experiments
command to request the regions with the selected columns. Selecting the columns @NAME
and @BIOSOURCE
represent the experiment name and the experiment biosource.
The deepblue_get_regions
command is executed asynchronously. This means that the user receives a request_id
to be able to check for the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
query_id = deepblue_select_experiments (
experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
chromosome="chr1", start=0, end=50000000)
# Retrieve the experiments data (The @NAME meta-column is used to include the
# experiment name and @BIOSOURCE for experiment's biosource
request_id = deepblue_get_regions(query_id=query_id,
output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 3783 ranges and 4 metadata columns:
## seqnames ranges strand | SIGNAL_VALUE PEAK
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 270668-270987 * | 6.5758 39
## [2] chr1 271277-271468 * | 6.2148 136
## [3] chr1 273768-274209 * | 14.1567 164
## [4] chr1 778377-778676 * | 8.0198 154
## [5] chr1 778409-778678 * | 4.5767 123
## ... ... ... ... . ... ...
## [3779] chr1 47437420-47437621 * | 3.7686 147
## [3780] chr1 47437751-47438038 * | 9.6553 149
## [3781] chr1 48245368-48245867 * | 4.7708 346
## [3782] chr1 48542755-48543280 * | 7.3002 152
## [3783] chr1 48793649-48793986 * | 5.1974 108
## @NAME @BIOSOURCE
## <character> <character>
## [1] S008SGH1.ERX406923.H.. myeloid cell
## [2] S008SGH1.ERX406923.H.. myeloid cell
## [3] S008SGH1.ERX406923.H.. myeloid cell
## [4] S008SGH1.ERX406923.H.. myeloid cell
## [5] BL-2_c01.ERX297416.H.. BL-2
## ... ... ...
## [3779] BL-2_c01.ERX297416.H.. BL-2
## [3780] S008SGH1.ERX406923.H.. myeloid cell
## [3781] S008SGH1.ERX406923.H.. myeloid cell
## [3782] S008SGH1.ERX406923.H.. myeloid cell
## [3783] S008SGH1.ERX406923.H.. myeloid cell
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We use the deepblue_list_samples
command to obtain all samples with the biosource ‘myeloid cell’ from the BLUEPRINT project. The deepblue_list_samples
returns a list of samples with their IDs and content. We extract the sample IDs from this list and use it in the deepblue_select_regions
command to selects genomic regions that are in chromosome 1, position 0 to 50,000.
Then, we use the deepblue_get_regions
command with the parameter query_id
returned by the deepblue_select_regions
command and the columns @NAME
, SAMPLE_ID
, and @BIOSOURCE
representing the experiment name, the sample ID, and the experiment biosource.
The deepblue_get_regions
command is executed asynchronously. This means that the user receives a request_id
to be able to check for the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
samples = deepblue_list_samples(
biosource="myeloid cell",
extra_metadata = list("source" = "BLUEPRINT Epigenome"))
samples_ids = deepblue_extract_ids(samples)
query_id = deepblue_select_regions(genome="GRCh38", sample=samples_ids,
chromosome="chr1", start=0, end=50000)
request_id = deepblue_get_regions(query_id=query_id,
output_format="CHROMOSOME,START,END,@NAME,@SAMPLE_ID,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions,1)
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | @NAME @SAMPLE_ID
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 0-10000 * | S00Q7NH1_12_12_Bluep.. s8797
## @BIOSOURCE
## <character>
## [1] myeloid cell
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We use the deepblue_select_experiments
command for selecting genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with SIGNAL_VALUE
> 10 and PEAK
> 1000.
Then, we use the deepblue_get_regions
command with the parameter query_id
returned by the deepblue_select_regions
command and the columns @NAME
and @BIOSOURCE
representing the experiment name and the experiment biosource.
The deepblue_get_regions
command is executed asynchronously. This means that the user receives a request_id
to be able to check for the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
query_id = deepblue_select_experiments(
experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(
query_id=query_id, field="SIGNAL_VALUE", operation=">",
value="10", type="number")
query_id_filters = deepblue_filter_regions(
query_id=query_id_filter_signal, field="PEAK", operation=">",
value="1000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filters,
output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 161 ranges and 4 metadata columns:
## seqnames ranges strand | SIGNAL_VALUE PEAK
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 1142428-1144001 * | 10.9313 1275
## [2] chr1 1573400-1575582 * | 17.8805 1094
## [3] chr1 1612814-1616174 * | 32.2064 2802
## [4] chr1 1668761-1670450 * | 20.2936 1017
## [5] chr1 1778583-1783797 * | 35.4277 1293
## ... ... ... ... . ... ...
## [157] chr1 44774644-44776655 * | 16.3227 1160
## [158] chr1 44806139-44811000 * | 22.8156 1381
## [159] chr1 46301112-46304262 * | 19.8041 2397
## [160] chr1 46579227-46582046 * | 15.9613 1824
## [161] chr1 46593677-46595181 * | 11.8798 1304
## @NAME @BIOSOURCE
## <character> <character>
## [1] BL-2_c01.ERX297416.H.. BL-2
## [2] S008SGH1.ERX406923.H.. myeloid cell
## [3] S008SGH1.ERX406923.H.. myeloid cell
## [4] S008SGH1.ERX406923.H.. myeloid cell
## [5] S008SGH1.ERX406923.H.. myeloid cell
## ... ... ...
## [157] S008SGH1.ERX406923.H.. myeloid cell
## [158] S008SGH1.ERX406923.H.. myeloid cell
## [159] S008SGH1.ERX406923.H.. myeloid cell
## [160] S008SGH1.ERX406923.H.. myeloid cell
## [161] S008SGH1.ERX406923.H.. myeloid cell
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We use the deepblue_select_experiments
command for selecting genomic regions from two specific experiments that are in chromosome 1, position 0 to 50,000,000. Then, we filter these for regions with SIGNAL_VALUE
> 10 and PEAK
> 1000.
The command deepblue_intersection
filters for all regions of the query_id
that intersect with at least one region in promoters_id
.
Then, we use the deepblue_get_regions
command with the parameter query_id
returned by the deepblue_select_regions
command and the columns @NAME
and @BIOSOURCE
representing the experiment name and the experiment biosource.
The deepblue_get_regions
command is executed asynchronously. This means that the user receives a request_id
to be able to check for the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
query_id = deepblue_select_experiments(
experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
chromosome="chr1", start=0, end=50000000)
promoters_id = deepblue_select_annotations(annotation_name="promoters",
genome="GRCh38", chromosome="chr1")
intersect_id = deepblue_intersection(
query_data_id=query_id, query_filter_id=promoters_id)
request_id = deepblue_get_regions(
query_id=intersect_id,
output_format="CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 608 ranges and 4 metadata columns:
## seqnames ranges strand | SIGNAL_VALUE PEAK
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 903997-904177 * | 4.7708 89
## [2] chr1 904302-905111 * | 5.4928 560
## [3] chr1 910269-910975 * | 4.7201 136
## [4] chr1 911973-913915 * | 17.0446 624
## [5] chr1 923976-924329 * | 4.7201 109
## ... ... ... ... . ... ...
## [604] chr1 46718435-46719027 * | 6.2499 230
## [605] chr1 47313340-47313980 * | 6.8711 174
## [606] chr1 47313412-47313588 * | 4.8558 120
## [607] chr1 47313632-47314141 * | 10.2801 371
## [608] chr1 47333183-47335172 * | 18.9772 857
## @NAME @BIOSOURCE
## <character> <character>
## [1] S008SGH1.ERX406923.H.. myeloid cell
## [2] S008SGH1.ERX406923.H.. myeloid cell
## [3] S008SGH1.ERX406923.H.. myeloid cell
## [4] S008SGH1.ERX406923.H.. myeloid cell
## [5] S008SGH1.ERX406923.H.. myeloid cell
## ... ... ...
## [604] BL-2_c01.ERX297416.H.. BL-2
## [605] BL-2_c01.ERX297416.H.. BL-2
## [606] S008SGH1.ERX406923.H.. myeloid cell
## [607] S008SGH1.ERX406923.H.. myeloid cell
## [608] S008SGH1.ERX406923.H.. myeloid cell
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
library(Gviz)
atrack <- AnnotationTrack(regions, name = "Intersecting regions",
group = regions$`@BIOSOURCE`, genome="hg38")
gtrack <- GenomeAxisTrack()
itrack <- IdeogramTrack(genome = "hg38", chromosome = "chr1")
plotTracks(list(itrack, atrack, gtrack), groupAnnotation="group", fontsize=18,
background.panel = "#FFFEDB", background.title = "darkblue")
The meta-column @LENGTH
contains the genomic region length, and we filter the genomic regions where this value is smaller than 2,000.
The meta-column @SEQUENCE
includes the DNA Sequence in the genomic region output.
The deepblue_get_regions
command is executed asynchronously. This means that the user receives a request_id
to be able to check for the status of this request. In contrast to the command deepblue_get_request_data
, the DeepBlueR package-specific command deepblue_download_request_data
will wait for the processing to finish, before downloading the data. Moreover, this command will convert any regions to a GRanges object.
query_id = deepblue_select_experiments(
experiment_name = c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
chromosome="chr1", start=0, end=50000000)
query_id_filter_signal = deepblue_filter_regions(query_id=query_id,
field="SIGNAL_VALUE", operation=">", value="10", type="number")
query_id_filters = deepblue_filter_regions(query_id=query_id_filter_signal,
field="PEAK", operation=">", value="1000", type="number")
query_id_filter_length = deepblue_filter_regions (query_id=query_id_filters,
field="@LENGTH", operation="<", value="2000", type="number")
request_id = deepblue_get_regions(query_id=query_id_filter_length,
output_format="CHROMOSOME,START,END,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 1)
## GRanges object with 1 range and 4 metadata columns:
## seqnames ranges strand | @NAME @BIOSOURCE
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 1142428-1144001 * | BL-2_c01.ERX297416.H.. BL-2
## @LENGTH @SEQUENCE
## <integer> <character>
## [1] 1573 CCAGGCTGGTCTCAAACTCC..
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
We use the deepblue_find_motif
command to find all position of a given pattern in the genome. An example is finding all locations of ‘TATAA’ in genome assembly GRCh38.
We use the deepblue_select_experiments
command to select genomic regions that are in chromosome 1, position 0 to 50,000,000 from the selected experiments.
The command deepblue_intersect
selects all regions of the query_id
that intersect with at least one tataa_regions
region.
tataa_regions = deepblue_find_motif(motif="TATAAA", genome="GRCh38", chromosomes="chr1")
query_id = deepblue_select_experiments(
experiment_name= c("BL-2_c01.ERX297416.H3K27ac.bwa.GRCh38.20150527.bed",
"S008SGH1.ERX406923.H3K27ac.bwa.GRCh38.20150728.bed"),
chromosome="chr1", start=0, end=50000000)
overlapped = deepblue_intersection(query_data_id=query_id,
query_filter_id=tataa_regions)
request_id=deepblue_get_regions(overlapped,
"CHROMOSOME,START,END,SIGNAL_VALUE,PEAK,@NAME,@BIOSOURCE,@LENGTH,@SEQUENCE,@PROJECT")
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
## GRanges object with 3 ranges and 7 metadata columns:
## seqnames ranges strand | SIGNAL_VALUE PEAK
## <Rle> <IRanges> <Rle> | <character> <integer>
## [1] chr1 273768-274209 * | 14.1567 164
## [2] chr1 911203-911477 * | 4.8558 105
## [3] chr1 916813-917682 * | 10.5467 183
## @NAME @BIOSOURCE @LENGTH @SEQUENCE
## <character> <character> <integer> <character>
## [1] S008SGH1.ERX406923.H.. myeloid cell 441 AGCCAAAGGTGATATTTTCA..
## [2] S008SGH1.ERX406923.H.. myeloid cell 274 ATAAACAGGCAGGAGTCTTC..
## [3] S008SGH1.ERX406923.H.. myeloid cell 869 TTTATTGCACAAATTATTAA..
## @PROJECT
## <character>
## [1] BLUEPRINT Epigenome
## [2] BLUEPRINT Epigenome
## [3] BLUEPRINT Epigenome
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The meta column @COUNT.MOTIF()
allows for counting how many times a motif appears in the selected genomic region. For example, the following code return the experiment regions with the DNA sequence length, the counts of G
, CG
, GC
, and the DNA sequence itself.
experiment_data = deepblue_select_experiments(
"DG-75_c01.ERX297417.H3K27ac.bwa.GRCh38.20150527.bed")
fmt = "CHROMOSOME,START,END,@LENGTH,@COUNT.MOTIF(C),@COUNT.MOTIF(G),@COUNT.MOTIF(CG),@COUNT.MOTIF(GC),@SEQUENCE"
request_id=deepblue_get_regions(experiment_data, fmt)
regions = deepblue_download_request_data(request_id=request_id)
head(regions, 3)
## GRanges object with 3 ranges and 6 metadata columns:
## seqnames ranges strand | @LENGTH @COUNT.MOTIF(C) @COUNT.MOTIF(G)
## <Rle> <IRanges> <Rle> | <integer> <character> <character>
## [1] chr1 779094-779379 * | 285 83 85
## [2] chr1 826755-827064 * | 309 109 78
## [3] chr1 958700-959105 * | 405 121 124
## @COUNT.MOTIF(CG) @COUNT.MOTIF(GC) @SEQUENCE
## <character> <character> <character>
## [1] 15 24 ACGTGCGCTCCACAACGCCT..
## [2] 15 18 ACCCATCCACTTCCCATCTA..
## [3] 29 32 GAGATTTTTGCACAACTCAC..
## -------
## seqinfo: 36 sequences from an unspecified genome; no seqlengths
We use the deepblue_select_genes
command to select the gene RP11-34P13
from GENCODE v23.
The selected genes behave like a regular genomic region, which, for example, can be filtered by their attributes. We use the @GENE_ATTRIBUTE
meta-column to select the genomic regions that are annotated as lincRNAs.
q_genes = deepblue_select_genes(genes="RP11-34P13", gene_model="gencode v23")
q_filter = deepblue_filter_regions(query_id=q_genes,
field="@GENE_ATTRIBUTE(gene_type)", operation="==",
value="lincRNA", type="string")
request_id=deepblue_get_regions(q_filter, "CHROMOSOME,START,END,GTF_ATTRIBUTES")
regions = deepblue_download_request_data(request_id=request_id)
regions
The command deepblue_aggregate
summarizes the query_id
regions using the cpg_islands
regions defined by the corresponding annotation as boundaries.
The aggregated values can be accessed through the @AGG.*
meta-columns.
query_id = deepblue_select_experiments (
experiment=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
chromosome="chr1", start=0, end=50000000)
cpg_islands = deepblue_select_annotations(annotation_name="CpG Islands",
genome="GRCh38", chromosome="chr1", start=0, end=50000000)
# Aggregate
overlapped = deepblue_aggregate (data_id=query_id, ranges_id=cpg_islands,
column="VALUE" )
# Retrieve the experiments data (The @NAME meta-column is used to include
# the experiment name and @BIOSOURCE for experiment's biosource
request_id = deepblue_get_regions(query_id=overlapped,
output_format=
"CHROMOSOME,START,END,@AGG.MIN,@AGG.MAX,@AGG.MEAN,@AGG.VAR")
regions = deepblue_download_request_data(request_id=request_id)
In the following example we obtain the gene expression levels of three genes, i.e., NOX3, NOXA1, and NOX4 from all biosources related to the hematopoietic stem cell
biosource from the BLUEPRINT project. With related we refer to children of this biosource term in the ontologies used by DeepBlue.
hsc_children = deepblue_get_biosource_children("hematopoietic stem cell")
hsc_children_name = deepblue_extract_names(hsc_children)
hsc_children_samples = deepblue_list_samples(
biosource = hsc_children_name,
extra_metadata = list(source="BLUEPRINT Epigenome"))
hsc_samples_ids = deepblue_extract_ids(hsc_children_samples)
# Note that BLUEPRINT uses Ensembl Gene IDs
gene_exprs_query = deepblue_select_expressions(
expression_type = "gene",
sample_ids = hsc_samples_ids,
identifiers = c("ENSG00000074771.3", "ENSG00000188747.7", "ENSG00000086991.11"),
gene_model = "gencode v22")
request_id = deepblue_get_regions(
query_id = gene_exprs_query,
output_format ="@GENE_NAME(gencode v22),CHROMOSOME,START,END,FPKM,@BIOSOURCE")
regions = deepblue_download_request_data(request_id = request_id)
regions
## GRanges object with 679 ranges and 3 metadata columns:
## seqnames ranges strand | @GENE_NAME(gencode v22)
## <Rle> <IRanges> <Rle> | <character>
## [1] chr9 137423350-137434406 * | NOXA1
## [2] chr9 137423350-137434406 * | NOXA1
## [3] chr9 137423350-137434406 * | NOXA1
## [4] chr9 137423350-137434406 * | NOXA1
## [5] chr9 137423350-137434406 * | NOXA1
## ... ... ... ... . ...
## [675] chr6 155395370-155455903 * | NOX3
## [676] chr6 155395370-155455903 * | NOX3
## [677] chr6 155395370-155455903 * | NOX3
## [678] chr6 155395370-155455903 * | NOX3
## [679] chr6 155395370-155455903 * | NOX3
## FPKM @BIOSOURCE
## <character> <character>
## [1] 0.0100 hematopoietic multip..
## [2] 0.3900 CD34-negative, CD41-..
## [3] 3.1400 myeloid cell
## [4] 2.1800 CD14-positive, CD16-..
## [5] 0.9800 myeloid cell
## ... ... ...
## [675] 0.0000 myeloid cell
## [676] 0.0000 mature neutrophil
## [677] 0.0000 naive B cell
## [678] 0.0000 central memory CD4-p..
## [679] 0.0000 myeloid cell
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
We use the deepblue_tiling_regions
command to generate a set of consecutive genomic regions of size 100,000 from chromosome 1 of the genome assembly GRCh38.
The command deepblue_aggregate
summarizes the query_id
regions using the column VALUE
and the cpg_islands
regions as boundaries.
# Selecting the data from 2 experiments:
# GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig
# As we already know the experiments names, we keep all others fields empty.
# We are selecting all regions of chromosome 1
query_id = deepblue_select_experiments(
experiment=c("GC_T14_10.CPG_methylation_calls.bs_call.GRCh38.20160531.wig"),
chromosome="chr1")
# Tiling regions of 100.000 base pairs
tiling_id = deepblue_tiling_regions(size=100000,
genome="GRCh38", chromosome="chr1")
# Aggregate
overlapped = deepblue_aggregate (data_id=query_id,
ranges_id=tiling_id, column="VALUE")
# Retrieve the experiments data (The @NAME meta-column is used to include the
# experiment name and @BIOSOURCE for experiment's biosource
request_id = deepblue_get_regions(query_id=overlapped,
output_format="CHROMOSOME,START,END,@AGG.MEAN,@AGG.SD")
regions = deepblue_download_request_data(request_id=request_id)
regions
## GRanges object with 2489 ranges and 2 metadata columns:
## seqnames ranges strand | @AGG.MEAN @AGG.SD
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 0-100000 * | 0.6677 0.3639
## [2] chr1 100000-200000 * | 0.8358 0.2414
## [3] chr1 200000-300000 * | 0.7714 0.2512
## [4] chr1 300000-400000 * | 0.7595 0.2477
## [5] chr1 400000-500000 * | 0.8512 0.1877
## ... ... ... ... . ... ...
## [2485] chr1 248400000-248500000 * | 0.8348 0.1880
## [2486] chr1 248500000-248600000 * | 0.8576 0.1561
## [2487] chr1 248600000-248700000 * | 0.8664 0.1786
## [2488] chr1 248700000-248800000 * | 0.8425 0.1846
## [2489] chr1 248800000-248900000 * | 0.6572 0.4079
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Such data can now be plotted using any of the common R plotting mechanisms and packages. An example is shown here:
library(ggplot2)
plot_data <- as.data.frame(regions)
plot_data[,grepl("X.", colnames(plot_data))] <-
apply(plot_data[,grepl("X.", colnames(plot_data))], 2, as.numeric)
AGG.plot <- ggplot(plot_data, aes(start)) +
geom_ribbon(aes(ymin = X.AGG.MEAN - (X.AGG.SD / 2),
ymax = X.AGG.MEAN + (X.AGG.SD / 2)), fill = "grey70") +
geom_line(aes(y = X.AGG.MEAN))
print(AGG.plot)