Onassis uses Conceptmapper, an Apache UIMA (Unstructured Information Management Architecture) dictionary lookup tool to retrieve dictionary terms in a given text. https://uima.apache.org/downloads/sandbox/ConceptMapperAnnotatorUserGuide/ConceptMapperAnnotatorUserGuide.html
In particular, the ccp-nlp Conceptmapper wrapper, specific for the biomedical domain, implements a pipeline through which it is possible to retrieve concepts from OBO ontologies in any given text with different adjustable options (Verspoor et al. 2009).
Onassis can handle any type of text as input, but is particularly well suited for the analysis of the metadata from Gene Expression Omnibus (GEO) or Sequence Read Archive (SRA). This represents a fundamental first step in the integrative analysis of the data from those repositories (Galeota and Pelizzola 2016). Indeed it provides the possibility to associate concepts from any OBO ontology to GEO or SRA metadata retrieved using GEOmetadb or SRAdb. In addition to the ontological concepts, the recognition of gene/protein symbols or epigenetic modifications can be highly relevant, especially for experiments directed to those specific factor or mark (such as ChIP-seq experiments).
The semantic similarity module uses different semantic similarity measures to determine the semantic similarity of concepts in a given ontology. This module has been developed on the basis of the Java slib http://www.semantic-measures-library.org/sml. To run Onassis Java (>= 1.8) is needed.
One of the most straightforward ways to retrieve GEO metadata is through GEOmetadb or SRAdb packages. In order to use them through Onassis the user should download the corresponding SQLite databases following the instructions provided in those packages vignettes. While the SQLite packages are required, Onassis allows to access to those data without any knowledge on SQL programming, thus providing functions to help the metadat retrieval without the need of explicitly making queries to the database.
First, it is necessary to obtain and get a connection to the SQLite database. connectToGEODB
returns a connection to the database given the path of the SQLite file. If the latter is missing, it is automatically downloaded into the current working directory. Because of the size of these files (0.5-4GB), the results of the queries illustrated below are available into Onassis for subsequent analysis illustrated in this document. Then, the getGEOmetadata
function can be used to retrieve the metadata of specific GEO samples, taking as minimal parameters the connection to the database and one of the experiment types available. Optionally it is possible to specify the organism and the platform.
## Running this function might take long time if the database has to be downloaded.
geo_con <- connectToGEODB(system.file(getwd(), 'GEOmetadb.sqlite'))
#Showing the experiment types available in GEO
experiments <- experiment_types(geo_con)
#Showing the organism types available in GEO
species <- organism_types(geo_con)
#Retrieving the metadata associated to experiment type "Methylation profiling by high througput sequencing"
meth_metadata <- getGEOMetadata(geo_con, experiment_type='Methylation profiling by high throughput sequencing', organism = 'Homo sapiens')
#Retrieving Human gene expression metadata, knowing the GEO platform identifier, e.g. the Affymetrix Human Genome U133 Plus 2.0 Array
expression <- getGEOMetadata(geo_con, experiment_type='Expression profiling by array', gpl='GPL570')
Some of the experiment types available are the following:
Table: (#tab:experimentTypesshow)x |
---|
Expression profiling by MPSS |
Expression profiling by RT-PCR |
Expression profiling by SAGE |
Expression profiling by SNP array |
Expression profiling by array |
Expression profiling by genome tiling array |
Expression profiling by high throughput sequencing |
Genome binding/occupancy profiling by SNP array |
Genome binding/occupancy profiling by array |
Genome binding/occupancy profiling by genome tiling array |
Some of the organisms available are the following:
Table: (#tab:speciesShow)x |
---|
Homo sapiens |
Drosophila melanogaster |
Mus musculus |
Zea mays |
Arabidopsis thaliana |
Caenorhabditis elegans |
Helicobacter pylori |
Escherichia coli |
Rattus norvegicus |
Saccharomyces cerevisiae |
To avoid installing GEOmetadb meth_metadata was previously saved and can be loaded from Onassis:
meth_metadata <- readRDS(system.file('extdata', 'vignette_data', 'GEOmethylation.rds', package='Onassis'))
Table: (#tab:printmeta)
FALSE | series_id | gsm | title | gpl | source_name_ch1 | organism_ch1 | characteristics_ch1 | description | experiment_title | experiment_summary |
---|---|---|---|---|---|---|---|---|---|---|
1251 | GSE42590 | GSM1045538 | 2316_DLPFC_Control | GPL10999 | Brain (dorsolateral prefrontal cortex) | Homo sapiens | tissue: Heterogeneous brain tissue | NA | Genome-wide DNA methylation profiling of human dorsolateral prefrontal cortex | Reduced representation bisulfite sequencing (RRBS) |
511 | GSE27432 | GSM678217 | hEB16d_H9_p65_RRBS | GPL9115 | embryoid body from hES H9 p65 | Homo sapiens | cell type: hEB16d_H9_p65 | reduced representation bisulfite sequencing | Genomic distribution and inter-sample variation of non-CG methylation across human cell types | DNA methylation plays an important role in develop |
2731 | GSE58889 | GSM1421876 | Normal_CD19_11 | GPL11154 | Normal CD19+ cells | Homo sapiens | cell type: Normal CD19+ cells; disease status: healthy | NA | Methylation disorder in CLL | We performed RRBS and WGBS on primary human chroni |
1984 | GSE50761 | GSM1228607 | Time Course Off-target Day 7 1 HBB133 | GPL15520 | K562 cells | Homo sapiens | cell line: K562 cells; target loci: Time Course Off-target Day 7 1 | 2013.03.16._MM364_analysis.csv | Targeted DNA demethylation using TALE-TET1 fusion proteins | Recent large-scale studies have defined genomewide |
851 | GSE36173 | GSM882245 | H1 human ES cells | GPL10999 | H1 human ES cells | Homo sapiens | cell line: H1 | 5-hmC whole genome bisulfite sequencing | Base Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome | The study of 5-hydroxylmethylcytosines (5hmC), the |
1966 | GSE50761 | GSM1228589 | Time Course HB-6 Day 4 1 HBB115 | GPL15520 | K562 cells | Homo sapiens | cell line: K562 cells; target loci: Time Course HB-6 Day 4 1 | 2013.03.16._MM364_analysis.csv | Targeted DNA demethylation using TALE-TET1 fusion proteins | Recent large-scale studies have defined genomewide |
1827 | GSE50761 | GSM1228450 | Off target -650 to -850 3 RHOX117 | GPL15520 | 293 cells | Homo sapiens | cell line: 293 cells; target loci: Off target -650 to -850 3 | 2013-07-23-MM195-288-394_analysis.csv | Targeted DNA demethylation using TALE-TET1 fusion proteins | Recent large-scale studies have defined genomewide |
378 | GSE26592 | GSM655200 | Endometrial Recurrent 5 | GPL9052 | Human endometrial specimen | Homo sapiens | tissue: Human endometrial specimen; cell type: primary tissues; disease status: Recurrent; chromatin selection: MBD protein | MBDCap using MethylMiner Methylated DNA Enrichment Kit (Invitrogen, ME 10025); library strategy: Endometrial samples: MBDCao-seq. Breast cells: MBDCap-seq.; library selection: Endometrial samples: MBDCap. Breast cells: MBDCap-se | q. Neighboring genomic regions influence differential methylation patterns of CpG islands in endometrial and breast cance | rs We report the global methylation patterns by MBDCa |
1754 | GSE50761 | GSM1228377 | Initial Screen RH-3 -250-+1 2 RHOX44 | GPL15520 | HeLa cells | Homo sapiens | cell line: HeLa cells; target loci: Initial Screen RH-3 -250-+1 2 | 2013-07-12-MM564_analysis.csv | Targeted DNA demethylation using TALE-TET1 fusion proteins | Recent large-scale studies have defined genomewide |
2371 | GSE54961 | GSM1327281 | Healthy Control | GPL9052 | Healthy Control | Homo sapiens | etiology: Healthy Control; tissue: Peripheral venous blood; molecule subtype: serum cell-free DNA | Sample 1 | Epigenome analysis of serum cell-free circulating DNA in progression of HBV-related Hepatocellular carcinoma | Purpose: Aberrantly methylated DNA are hallmarks |
As for GEO, the getSRAMetadata
function allows the retrieval of metadata of high througput sequencing data stored in SRA through the BiocStyle::Biocpkg("SRAdb")
package. To facilitate the retrieval of experiment types in SRA, the Onassis function library_strategies
can be used. Filters for the sample’s material (GENOMIC, TRANSCRIPTOMIC, METAGENOMIC…), the species and the center hosting the data are allowed.
For example, to obtain SRA metadata of ChIP-Seq human samples and Bisulfite sequencing samples the following code can be used.
# Connection to the SRAmetadb and potential download of the sqlite file
sra_con <- connectToSRADB()
# Query for the ChIP-Seq experiments contained in GEO for human samples
sra_chip_seq <- getSRAMetadata(sra_con, library_strategy='ChIP-Seq', library_source='GENOMIC', taxon_id=9606, center_name='GEO')
# The following example allows to retrieve Bisulfite sequencing samples' metadata.
bisulfite_seq <- getSRAMetadata(sra_con, library_strategy='Bisulfite-Seq', library_source='GENOMIC', taxon_id=9606, center_name='GEO' )
To avoid installing GEOmetadb sra_chip_seq was previously saved and can be loaded from Onassis:
sra_chip_seq <- readRDS(system.file('extdata', 'vignette_data', 'GEO_human_chip.rds', package='Onassis'))
Table: (#tab:printchromatinIP)
FALSE | sample_accession | experiment_accession | center_name | title | library_strategy | library_layout | experiment_url_link | experiment_attribute | description | sample_attribute | sample_url_link |
---|---|---|---|---|---|---|---|---|---|---|---|
5904 | SRS421364 | SRX278504 | GEO | GSM1142700: p53 ChIP LCL nutlin-3 treated; Homo sapiens; ChIP-Seq | ChIP-Seq | SINGLE - | GEO Sample: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1142700 | GEO Accession: GSM1142700 | NA | source_name: lymphoblastoid cells || cell type: nutlin-3 treated lymphoblastoid cells || coriell id: GM12878 || chip antibody: mouse monoclonal anti-human p53 (BD Pharmingen, cat# 554294) || BioSampleModel: Generic | GEO Sample GSM1142700: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1142700 |
4981 | SRS371783 | SRX199902 | GEO | GSM1022674: UW_ChipSeq_A549_InputRep1 | ChIP-Seq | SINGLE - | GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1022674 | GEO Accession: GSM1022674 | NA | source_name: A549 || biomaterial_provider: ATCC || lab: UW || lab description: Stamatoyannopoulous - University of Washington || datatype: ChipSeq || datatype description: Chromatin IP Sequencing || cell: A549 || cell organism: human || cell description: epithelial cell line derived from a lung carcinoma tissue. (PMID: 175022), “This line was initiated in 1972 by D.J. Giard, et al. through explant culture of lung carcinomatous tissue from a 58-year-old caucasian male.” - ATCC, newly promoted to tier 2: not in 2011 analysis || cell karyotype: cancer || cell lineage: endoderm || cell sex: M || antibody: Input || antibody description: Control signal which may be subtracted from experimental raw signal before peaks are called. || treatment: None || treatment description: No special treatment or protocol applies || control: std || control description: Standard input signal for most experiments. || controlid: wgEncodeEH001904 || labexpid: DS18301 || labversion: WindowDensity-bin20-win+/-75 || replicate: 1 || BioSampleModel: Generic | GEO Sample GSM1022674: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1022674 |
4619 | SRS365824 | SRX190055 | GEO | GSM945272: UW_ChipSeq_HRPEpiC_Input | ChIP-Seq | SINGLE - | GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM945272 | GEO Accession: GSM945272 | NA | source_name: HRPEpiC || biomaterial_provider: ScienCell || lab: UW || lab description: Stamatoyannopoulous - University of Washington || datatype: ChipSeq || datatype description: Chromatin IP Sequencing || cell: HRPEpiC || cell organism: human || cell description: retinal pigment epithelial cells || cell karyotype: normal || cell lineage: ectoderm || cell sex: U || antibody: Input || antibody description: Control signal which may be subtracted from experimental raw signal before peaks are called. || treatment: None || treatment description: No special treatment or protocol applies || control: std || control description: Standard input signal for most experiments. || controlid: wgEncodeEH000962 || labexpid: DS16014 || labversion: Bowtie 0.12.7 || replicate: 1 || BioSampleModel: Generic | GEO Sample GSM945272: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM945272 |
911 | SRS117344 | SRX028649 | GEO | GSM608166: H3K27me3_K562_ChIP-seq_rep1 | ChIP-Seq | SINGLE - | GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM608166 | GEO Accession: GSM608166 | NA | source_name: chronic myeloid leukemia cell line || cell line: K562 || harvest date: 2008-06-12 || chip antibody: CST monoclonal rabbit rabbit anti-H3K27me3 || BioSampleModel: Generic | GEO Sample GSM608166: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM608166 |
4244 | SRS362733 | SRX186665 | GEO | GSM1003469: Broad_ChipSeq_Dnd41_H3K79me2 | ChIP-Seq | SINGLE - | GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1003469 | GEO Accession: GSM1003469 | NA | source_name: Dnd41 || biomaterial_provider: DSMZ || datatype: ChipSeq || datatype description: Chromatin IP Sequencing || antibody antibodydescription: Rabbit polyclonal antibody raised against a peptide containing K79 di-methylation. Antibody Target: H3K79me2 || antibody targetdescription: H3K79me2 is a mark of the transcriptional transition region - the region between the initiation marks (K4me3, etc) and the elongation marks (K36me3). || antibody vendorname: Active Motif || antibody vendorid: 39143 || controlid: wgEncodeEH002434 || replicate: 1,2 || softwareversion: ScriptureVPaperR3 || cell sex: M || antibody: H3K79me2 || antibody antibodydescription: Rabbit polyclonal antibody raised against a peptide containing K79 di-methylation. Antibody Target: H3K79me2 || antibody targetdescription: H3K79me2 is a mark of the transcriptional transition region - the region between the initiation marks (K4me3, etc) and the elongation marks (K36me3). || antibody vendorname: Active Motif || antibody vendorid: 39143 || treatment: None || treatment description: No special treatment or protocol applies || control: std || control description: Standard input signal for most experiments. || controlid: Dnd41/Input/std || softwareversion: ScriptureVPaperR3 || BioSampleModel: Generic | GEO Sample GSM1003469: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1003469 |
7502 | SRS494656 | SRX369112 | GEO | GSM1252315: CHG092; Homo sapiens; ChIP-Seq | ChIP-Seq | SINGLE - | GEO Sample GSM1252315: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1252315 | GEO Accession: GSM1252315 | NA | source_name: Gastric Primary Sample || tissuetype: Tumor || chip antibody: H3K4me1 || reads length: 101 || BioSampleModel: Generic | GEO Sample GSM1252315: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1252315 |
2127 | SRS266173 | SRX099863 | GEO | GSM808752: MCF7_CTCF_REP1 | ChIP-Seq | SINGLE - | GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM808752 | GEO Accession: GSM808752: | NA | source_name: breast adenocarcinoma cells || cell type: breast adenocarcinoma cells || cell line: MCF7 || antibody: CTCF || BioSampleModel: Generic | GEO Sample GSM808752: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM808752 |
6299 | SRS468164 | SRX332680 | GEO | GSM1204476: Input DNA for ChIP; Homo sapiens; ChIP-Seq | ChIP-Seq | SINGLE - | GEO Sample GSM1204476: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1204476 | GEO Accession: GSM1204476 | NA | source_name: MDAMB231 || cell line: MDAMB231 || chip antibody: input || BioSampleModel: Generic | GEO Sample GSM1204476: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1204476 |
832 | SRS115184 | SRX027300 | GEO | GSM593367: H3K4me3_H3 | ChIP-Seq | SINGLE - | GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM593367 | GEO Accession: GSM593367 | NA | source_name: LCL || chip antibody: H3K4me3 || cell type: lymphoblastoid cell line || BioSampleModel: Generic | GEO Sample GSM593367: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM593367 |
8638 | SRS598154 | SRX528309 | GEO | GSM1375207: H3_ChIPSeq_Human; Homo sapiens; ChIP-Seq | ChIP-Seq | SINGLE - | GEO Sample GSM1375207: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1375207 | GEO Accession: GSM1375207 | NA | source_name: H3_ChIPSeq_Human || donor age: adult || cell type: sperm || chip antibody: H3F3B || chip antibody vendor: Abnova || BioSampleModel: Generic | GEO Sample GSM1375207: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1375207 |
The Onassis EntityFinder
class has methods for annotating any text with dictionary terms. More specifically, Onassis can take advantage of the OBO dictionaries (http://www.obofoundry.org/), as shown in the next section where we will define relationships between different samples annotated with different ontology concepts thanks to the structure of the ontology.
Input text can be provided as:
Conceptmapper dictionaries are XML files with a set of entries specified by the xml tag <token>
with a canonical name (the name of the entry) and one or more variants (synonyms). Additional properties are allowed. The following code represents a sample of the Conceptmapper dictionary obtained from the Brenda tissue ontology.
<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
<token id="http://purl.obolibrary.org/obo/BTO_0005205" canonical="cerebral artery">
<variant base="cerebral artery"/>
</token>
<token id="http://purl.obolibrary.org/obo/BTO_0002179" canonical="184A1N4 cell">
<variant base="184A1N4 cell"/>
<variant base="A1N4 cell"/>
</token>
<token id="http://purl.obolibrary.org/obo/BTO_0003871" canonical="uterine endometrial cancer cell">
<variant base="uterine endometrial cancer cell"/>
<variant base="endometrial cancer cell"/>
<variant base="uterine endometrial carcinoma cell"/>
<variant base="endometrial carcinoma cell"/>
</token>
</synonym>
The function dict dictionary
creates an instance of the class CMdictionary
(by internally calling the method buildDictionary
).
dictType
option to “CMDICT”.# If a Conceptmapper dictionary is already available the dictType CMDICT can be specified and the corresponding file loaded
sample_dict <- dictionary(inputFileOrDb=system.file('extdata', 'cmDict-sample.cs.xml', package = 'Onassis'), dictType = 'CMDICT')
#Creation of a dictionary from the file sample.cs.obo available in OnassisJavaLibs
obo <- system.file('extdata', 'sample.cs.obo', package='OnassisJavaLibs')
sample_dict <- dictionary(inputFileOrDb=obo, outputdir=getwd(), synonymType='ALL')
# Creation of a dictionary for human genes/proteins
require(org.Hs.eg.db)
targets <- dictionary(dictType='TARGET', inputFileOrDb = 'org.Hs.eg.db')
Conceptmapper includes 7 different options controlling the annotation step. These are documented in detail in the documentation of the CMoptions function. They can be listed through the listCombinations
method after creating a ‘CMoptions’ object, and they can be set through the CMoptions
function. The set options are thus stored in an object of class CMoptions that will be required for the subsequent step of annotation.
#Showing configuration permutations
opts <- CMoptions()
combinations <- listCombinations(opts)
#Setting the combination of parameters through the paramValueIndex value
myopts <- CMoptions(c("CONTIGUOUS_MATCH", "CASE_INSENSITIVE", 'BIOLEMMATIZER', 'PUBMED', 'ON', 'YES', 'ALL'))
The class EntityFinder
has methods to find concepts of any OBO ontology in a given text. The findEntities
and annotateDF
methods accept text within files or data.frame, respetively, as described in Section 3.1. Alternatively, the function annotate
automatically adapts to the provided input type.
For example, to annotate the metadata derived from ChIP-seq experiments obtained from SRA with tissue and cell type concepts belonging to BRENDA ontology the following code can be used:
chipseq_dict_annot <- annotate(sra_chip_seq[1:20,c('sample_accession', 'title', 'experiment_attribute', 'sample_attribute', 'description')], dictionary=sample_dict, options=myopts)
## [1] "/tmp/Rtmpzmmn37"
## [1] "/tmp/Rtmpzmmn37/SRS115184.a1"
## [1] "/tmp/Rtmpzmmn37/SRS117344.a1"
## [1] "/tmp/Rtmpzmmn37/SRS213443.a1"
## [1] "/tmp/Rtmpzmmn37/SRS241934.a1"
## [1] "/tmp/Rtmpzmmn37/SRS266173.a1"
## [1] "/tmp/Rtmpzmmn37/SRS285318.a1"
## [1] "/tmp/Rtmpzmmn37/SRS336079.a1"
## [1] "/tmp/Rtmpzmmn37/SRS346539.a1"
## [1] "/tmp/Rtmpzmmn37/SRS362733.a1"
## [1] "/tmp/Rtmpzmmn37/SRS362801.a1"
## [1] "/tmp/Rtmpzmmn37/SRS365824.a1"
## [1] "/tmp/Rtmpzmmn37/SRS371783.a1"
## [1] "/tmp/Rtmpzmmn37/SRS376074.a1"
## [1] "/tmp/Rtmpzmmn37/SRS410226.a1"
## [1] "/tmp/Rtmpzmmn37/SRS421364.a1"
## [1] "/tmp/Rtmpzmmn37/SRS468164.a1"
## [1] "/tmp/Rtmpzmmn37/SRS494656.a1"
## [1] "/tmp/Rtmpzmmn37/SRS598154.a1"
## [1] "/tmp/Rtmpzmmn37/SRS636394.a1"
## [1] "/tmp/Rtmpzmmn37/SRS724993.a1"
## Conceptmapper annotations created in directory: /tmp/Rtmpzmmn37
The resulting data.frame contains for each row a match to the provided dictionary for a specific document/sample (indicated in the first column). The annotation is reported with the id of the concept (term_id), its canonical name (term name), its URL in the obo format, and the matching sentence of the document.
Table: (#tab:showchipresults)The function annotate
can also be used to identify the targeted entity of each ChIP-seq experiment, by retrieving gene names and histone types or modifications in the ChIP-seq metadata.
#Finding the TARGET entities
target_entities <- annotate(inputFileorDf=sra_chip_seq[1:20,c('sample_accession', 'title', 'experiment_attribute', 'sample_attribute', 'description')], options = myopts, dictionary=targets)
## [1] "/tmp/Rtmpzmmn37"
## [1] "/tmp/Rtmpzmmn37/SRS115184.a1"
## [1] "/tmp/Rtmpzmmn37/SRS117344.a1"
## [1] "/tmp/Rtmpzmmn37/SRS213443.a1"
## [1] "/tmp/Rtmpzmmn37/SRS241934.a1"
## [1] "/tmp/Rtmpzmmn37/SRS266173.a1"
## [1] "/tmp/Rtmpzmmn37/SRS285318.a1"
## [1] "/tmp/Rtmpzmmn37/SRS336079.a1"
## [1] "/tmp/Rtmpzmmn37/SRS346539.a1"
## [1] "/tmp/Rtmpzmmn37/SRS362733.a1"
## [1] "/tmp/Rtmpzmmn37/SRS362801.a1"
## [1] "/tmp/Rtmpzmmn37/SRS365824.a1"
## [1] "/tmp/Rtmpzmmn37/SRS371783.a1"
## [1] "/tmp/Rtmpzmmn37/SRS376074.a1"
## [1] "/tmp/Rtmpzmmn37/SRS410226.a1"
## [1] "/tmp/Rtmpzmmn37/SRS421364.a1"
## [1] "/tmp/Rtmpzmmn37/SRS468164.a1"
## [1] "/tmp/Rtmpzmmn37/SRS494656.a1"
## [1] "/tmp/Rtmpzmmn37/SRS598154.a1"
## [1] "/tmp/Rtmpzmmn37/SRS636394.a1"
## [1] "/tmp/Rtmpzmmn37/SRS724993.a1"
Table: (#tab:printKable)
FALSE | sample_id | term_id | term_name | term_url | matched_sentence |
---|---|---|---|---|---|
1 | SRS115184 | Reference T1 H3K4me3 | H3K4me3 | NA | H3K4me3 |
2 | SRS117344 | Reference T1 9514 | GAL3ST1 | NA | CST |
3 | SRS117344 | Reference T2 10559 | SLC35A1 | NA | CST |
4 | SRS117344 | Reference T3 54897 | CASZ1 | NA | CST |
5 | SRS117344 | Reference T4 H3K27me3 | H3K27me3 | NA | H3K27me3 |
6 | SRS213443 | Reference T1 10664 | CTCF | NA | CTCF |
7 | SRS213443 | Reference T2 10664 | CTCF | NA | CTCF |
8 | SRS241934 | Reference T1 604 | BCL6 | NA | BCL6 |
9 | SRS266173 | Reference T1 10664 | CTCF | NA | CTCF |
19 | SRS285318 | Reference T10 23133 | PHF8 | NA | PHF8 |
25 | SRS336079 | Reference T6 100616102 | ERVK-9 | NA | polymerase |
26 | SRS336079 | Reference T7 100862685 | ERVK-19 | NA | polymerase |
27 | SRS336079 | Reference T8 100862688 | ERVK-11 | NA | polymerase |
33 | SRS336079 | Reference T14 1283 | CTD | NA | CTD |
36 | SRS336079 | Reference T17 100616102 | ERVK-9 | NA | polymerase |
37 | SRS336079 | Reference T18 100862685 | ERVK-19 | NA | polymerase |
38 | SRS336079 | Reference T19 100862688 | ERVK-11 | NA | polymerase |
49 | SRS362733 | Reference T5 H3K79me2 | H3K79me2 | NA | H3K79me2 |
50 | SRS362733 | Reference T6 H3K79me2 | H3K79me2 | NA | H3K79me2 |
51 | SRS362733 | Reference T7 H3K79me2 | H3K79me2 | NA | H3K79me2 |
52 | SRS362733 | Reference T8 H3K79me2 | H3K79me2 | NA | H3K79me2 |
53 | SRS362733 | Reference T9 H3K79me2 | H3K79me2 | NA | H3K79me2 |
54 | SRS362801 | Reference T1 929 | CD14 | NA | CD14 |
55 | SRS362801 | Reference T2 4695 | NDUFA2 | NA | CD14 |
56 | SRS362801 | Reference T3 929 | CD14 | NA | CD14 |
57 | SRS362801 | Reference T4 4695 | NDUFA2 | NA | CD14 |
63 | SRS362801 | Reference T10 H3K79me2 | H3K79me2 | NA | H3K79me2 |
64 | SRS362801 | Reference T11 H3K79me2 | H3K79me2 | NA | H3K79me2 |
65 | SRS362801 | Reference T12 H3K79me2 | H3K79me2 | NA | H3K79me2 |
66 | SRS362801 | Reference T13 H3K79me2 | H3K79me2 | NA | H3K79me2 |
67 | SRS362801 | Reference T14 H3K79me2 | H3K79me2 | NA | H3K79me2 |
68 | SRS362801 | Reference T15 929 | CD14 | NA | CD14 |
69 | SRS362801 | Reference T16 4695 | NDUFA2 | NA | CD14 |
78 | SRS371783 | Reference T5 10195 | ALG3 | NA | not |
79 | SRS376074 | Reference T1 6886 | TAL1 | NA | TAL1 |
80 | SRS376074 | Reference T2 6886 | TAL1 | NA | TAL1 |
82 | SRS376074 | Reference T4 3855 | KRT7 | NA | SCL |
83 | SRS376074 | Reference T5 6886 | TAL1 | NA | SCL |
84 | SRS376074 | Reference T6 51540 | SCLY | NA | SCL |
86 | SRS410226 | Reference T1 H3K27ac | H3K27ac | NA | H3K27ac |
87 | SRS494656 | Reference T1 H3K4me1 | H3K4me1 | NA | H3K4me1 |
88 | SRS598154 | Reference T1 3021 | H3F3B | NA | H3F3B |
89 | SRS636394 | Reference T1 3169 | FOXA1 | NA | FOXA1 |
With Onassis it is possible to quantify the semantic similarity between the concepts of a given ontology using different semantic similarity measures. Similarity
is an Onassis class applying methods of the Java library slib (Harispe et al. 2014), which builds a semantic graph starting from OBO ontology concepts and their hierarchical relationships.
The following methods are available and are automatically chosen depending on the settings of the Similarity
function. The sim
and groupsim
methods allow the computation of semantic similarity between single terms (pairwise measures) and between group of terms (groupwise measures), respectively. Pairwise measures can be edge based, if they rely only on the structure of the ontology, or information-content based if they also consider the information that each term in the ontology carries. Rather, groupwise measures can be indirect, if they compute the pairwise similarity between each couple of terms, or direct if they consider each set of concepts as a whole.
The samplesim
method allows to determine the semantic similarity between two documents, each possibly associated to multiple concepts, using groupwise measures. Finally, the multisim
method allows to determine the semantic similarity between documents annotated with two or more ontologies: first samplesim
is run for each ontology, then a user defined function can be used to aggregate the resulting semantic similarities for each pair of documents.
The function showSimilarities
shows all the measures supported by Onassis. For details about the measures run {?Similarity}
.
#Instantiating the Similarity
similarities <- showSimilarities()
The following example shows the pairwise similarities between sample cell concepts obtained annotating the ChIP-seq metadata. The resnik similarity measure is used by default, which is based on the information content of the most informative common ancestor of the considered concepts. In particular, the seco information content is used by default, which determines the specificity of each concept based on the number of concepts it subsumes.
found_terms <- unique(chipseq_dict_annot$term_url)
n <- length(found_terms)
ontologyfile <- obo
pairwise_results <- data.frame(term1 = character(0), term2= character(0), value = double(0L))
for(i in 1:(n-1)){
term1 <- as.character(found_terms[i])
j = i + 1
for(k in j:n){
term2 <- as.character(found_terms[k])
two_term_similarity <- similarity(ontologyfile, term1, term2 )
new_row <- cbind(term1, term2, two_term_similarity)
pairwise_results <- rbind(pairwise_results, new_row )
}
}
pairwise_results <- unique(pairwise_results)
pairwise_results <- merge(pairwise_results, chipseq_dict_annot[, c('term_url', 'term_name')], by.x='term2', by.y='term_url', all.x=TRUE)
colnames(pairwise_results)[length(colnames(pairwise_results))] <- 'term2_name'
pairwise_results <- merge(pairwise_results, chipseq_dict_annot[, c('term_url', 'term_name')], by.x='term1', by.y='term_url', all.x=TRUE)
colnames(pairwise_results)[length(colnames(pairwise_results))] <- 'term1_name'
pairwise_results <- unique(pairwise_results)
Table: (#tab:showing_similarity1)
FALSE | term1 | term2 | two_term_similarity | term2_name | term1_name |
---|---|---|---|---|---|
1 | http://purl.obolibrary.org/obo/CL_0000000 | http://purl.obolibrary.org/obo/CL_0000066 | 0.226149129142891 | epithelial cell | cell |
67 | http://purl.obolibrary.org/obo/CL_0000000 | http://purl.obolibrary.org/obo/CL_0000236 | 0.226149129142891 | B cell | cell |
89 | http://purl.obolibrary.org/obo/CL_0000066 | http://purl.obolibrary.org/obo/CL_0000236 | 0.268130656074674 | B cell | epithelial cell |
In the following code the semantic similarity between two groups of terms is computed using the ui measure, a groupwise direct measure combining the intersection and the union of the set of ancestors of the two groups of concepts.
similarity(obo, found_terms[1:2], found_terms[3])
## [1] 0.1875
Finally, the pariwise semantic similarity between ChIP-seq samples is illustrated.
annotated_samples <- as.character(as.vector(unique(chipseq_dict_annot$sample_id)))
n <- length(annotated_samples)
samples_results <- data.frame(sample1 = character(0), sample2= character(0), value = double(0L))
samples_results <- matrix(0, nrow=n, ncol=n)
rownames(samples_results) <- colnames(samples_results) <- annotated_samples
for(i in 1:(n-1)){
sample1 <- as.character(annotated_samples[i])
j = i + 1
for(k in j:n){
sample2 <- as.character(annotated_samples[k])
two_samples_similarity <- similarity(ontologyfile, sample1, sample2, chipseq_dict_annot)
samples_results[i, k] <- samples_results[k, i] <- two_samples_similarity
}
}
diag(samples_results) <- 1
heatmap.2(samples_results, density.info = "none", trace="none", main='Semantic similarity of annotated samples', margins=c(5,5))
Galeota, Eugenia, and Mattia Pelizzola. 2016. “Ontology-Based Annotations and Semantic Relations in Large-Scale (Epi) Genomics Data.” Briefings in Bioinformatics. Oxford Univ Press, bbw036.
Harispe, Sébastien, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. 2014. “The Semantic Measures Library and Toolkit: Fast Computation of Semantic Similarity and Relatedness Using Biomedical Ontologies.” Bioinformatics 30 (5). Oxford Univ Press:740–42.
Verspoor, K., W. Baumgartner Jr, C. Roeder, and L. Hunter. 2009. “Abstracting the Types away from a UIMA Type System.” From Form to Meaning: Processing Texts Automatically. Tübingen:Narr, 249–56.