1 Introduction to OnASSis
2 Retrieving public repositories metadata
- 2.1 Handling GEO (Gene Expression Omnibus) Metadata
- 2.2 Handling SRA (Sequence Read Archive) Metadata
3 Annotating text with Ontology Concepts
4 Semantic similarity
References

1 Introduction to OnASSis

Public repositories contain thousands of experiments and samples that are difficult to mine. Annotating the description of this data with controlled vocabularies or ontology terms could improve the retrieval of data of interest both programmatically or manually (Galeota and Pelizzola 2016). OnASSiS (Ontology Annotations and Semantic Similarity software) is a package aimed at matching metadata associated with biological experiments with concepts from ontologies, thus aiming at obtaining semantically coherent omics datasets, possibly representing various data types ad derived from independent studies. The recognition of domain specific entities not only allows users to retrieve samples related to a given cell type or experimental condition, but also to discover different and not immediately obvious relationships between experiments. Onassis applies Natural Language Processing techniques to analyze sample’s and experiments’ descriptions, recognize concepts from a multitude of biomedical ontologies and to quantify the similarities/divergences between pairs or groups of query studies. In particular the software includes modules to assist on:

the retrieval of samples’ metadata from repositories of large scale biologial data
the annotation of these data with concepts belonging to OBO biomedical ontologies
the organization of available samples in comparable and coherent groups based on semantic similarity measures

Onassis uses Conceptmapper, an Apache UIMA (Unstructured Information Management Architecture) dictionary lookup tool to retrieve dictionary terms in a given text. https://uima.apache.org/downloads/sandbox/ConceptMapperAnnotatorUserGuide/ConceptMapperAnnotatorUserGuide.html
In particular, the ccp-nlp Conceptmapper wrapper, specific for the biomedical domain, implements a pipeline through which it is possible to retrieve concepts from OBO ontologies in any given text with different adjustable options (Verspoor et al. 2009).

Onassis can handle any type of text as input, but is particularly well suited for the analysis of the metadata from Gene Expression Omnibus (GEO) or Sequence Read Archive (SRA). This represents a fundamental first step in the integrative analysis of the data from those repositories (Galeota and Pelizzola 2016). Indeed it provides the possibility to associate concepts from any OBO ontology to GEO or SRA metadata retrieved using GEOmetadb or SRAdb. In addition to the ontological concepts, the recognition of gene/protein symbols or epigenetic modifications can be highly relevant, especially for experiments directed to those specific factor or mark (such as ChIP-seq experiments).

The semantic similarity module uses different semantic similarity measures to determine the semantic similarity of concepts in a given ontology. This module has been developed on the basis of the Java slib http://www.semantic-measures-library.org/sml. To run Onassis Java (>= 1.8) is needed.

2 Retrieving public repositories metadata

One of the most straightforward ways to retrieve GEO metadata is through GEOmetadb or SRAdb packages. In order to use them through Onassis the user should download the corresponding SQLite databases following the instructions provided in those packages vignettes. While the SQLite packages are required, Onassis allows to access to those data without any knowledge on SQL programming, thus providing functions to help the metadat retrieval without the need of explicitly making queries to the database.

2.1 Handling GEO (Gene Expression Omnibus) Metadata

First, it is necessary to obtain and get a connection to the SQLite database. connectToGEODB returns a connection to the database given the path of the SQLite file. If the latter is missing, it is automatically downloaded into the current working directory. Because of the size of these files (0.5-4GB), the results of the queries illustrated below are available into Onassis for subsequent analysis illustrated in this document. Then, the getGEOmetadata function can be used to retrieve the metadata of specific GEO samples, taking as minimal parameters the connection to the database and one of the experiment types available. Optionally it is possible to specify the organism and the platform.

## Running this function might take long time if the database has to be downloaded.
geo_con <- connectToGEODB(system.file(getwd(), 'GEOmetadb.sqlite'))

#Showing the experiment types available in GEO
experiments <- experiment_types(geo_con)

#Showing the organism types available in GEO
species <- organism_types(geo_con)

#Retrieving the metadata associated to experiment type "Methylation profiling by high througput sequencing"
meth_metadata <- getGEOMetadata(geo_con, experiment_type='Methylation profiling by high throughput sequencing', organism = 'Homo sapiens')

#Retrieving Human gene expression metadata, knowing the GEO platform identifier, e.g. the Affymetrix Human Genome U133 Plus 2.0 Array
expression <- getGEOMetadata(geo_con, experiment_type='Expression profiling by array', gpl='GPL570')

Some of the experiment types available are the following:

Table: (#tab:experimentTypesshow) Table 1: Experiments available in GEO

x
Expression profiling by MPSS
Expression profiling by RT-PCR
Expression profiling by SAGE
Expression profiling by SNP array
Expression profiling by array
Expression profiling by genome tiling array
Expression profiling by high throughput sequencing
Genome binding/occupancy profiling by SNP array
Genome binding/occupancy profiling by array
Genome binding/occupancy profiling by genome tiling array

Some of the organisms available are the following:

Table: (#tab:speciesShow) Table 1: Species available in GEO

x
Homo sapiens
Drosophila melanogaster
Mus musculus
Zea mays
Arabidopsis thaliana
Caenorhabditis elegans
Helicobacter pylori
Escherichia coli
Rattus norvegicus
Saccharomyces cerevisiae

To avoid installing GEOmetadb meth_metadata was previously saved and can be loaded from Onassis:

meth_metadata <- readRDS(system.file('extdata', 'vignette_data', 'GEOmethylation.rds', package='Onassis'))

Table: (#tab:printmeta) Table 1: Methylation profiling by high througput sequencing metadata from GEOmetadb.

FALSE	series_id	gsm	title	gpl	source_name_ch1	organism_ch1	characteristics_ch1	description	experiment_title	experiment_summary
1251	GSE42590	GSM1045538	2316_DLPFC_Control	GPL10999	Brain (dorsolateral prefrontal cortex)	Homo sapiens	tissue: Heterogeneous brain tissue	NA	Genome-wide DNA methylation profiling of human dorsolateral prefrontal cortex	Reduced representation bisulfite sequencing (RRBS)
511	GSE27432	GSM678217	hEB16d_H9_p65_RRBS	GPL9115	embryoid body from hES H9 p65	Homo sapiens	cell type: hEB16d_H9_p65	reduced representation bisulfite sequencing	Genomic distribution and inter-sample variation of non-CG methylation across human cell types	DNA methylation plays an important role in develop
2731	GSE58889	GSM1421876	Normal_CD19_11	GPL11154	Normal CD19+ cells	Homo sapiens	cell type: Normal CD19+ cells; disease status: healthy	NA	Methylation disorder in CLL	We performed RRBS and WGBS on primary human chroni
1984	GSE50761	GSM1228607	Time Course Off-target Day 7 1 HBB133	GPL15520	K562 cells	Homo sapiens	cell line: K562 cells; target loci: Time Course Off-target Day 7 1	2013.03.16._MM364_analysis.csv	Targeted DNA demethylation using TALE-TET1 fusion proteins	Recent large-scale studies have defined genomewide
851	GSE36173	GSM882245	H1 human ES cells	GPL10999	H1 human ES cells	Homo sapiens	cell line: H1	5-hmC whole genome bisulfite sequencing	Base Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome	The study of 5-hydroxylmethylcytosines (5hmC), the
1966	GSE50761	GSM1228589	Time Course HB-6 Day 4 1 HBB115	GPL15520	K562 cells	Homo sapiens	cell line: K562 cells; target loci: Time Course HB-6 Day 4 1	2013.03.16._MM364_analysis.csv	Targeted DNA demethylation using TALE-TET1 fusion proteins	Recent large-scale studies have defined genomewide
1827	GSE50761	GSM1228450	Off target -650 to -850 3 RHOX117	GPL15520	293 cells	Homo sapiens	cell line: 293 cells; target loci: Off target -650 to -850 3	2013-07-23-MM195-288-394_analysis.csv	Targeted DNA demethylation using TALE-TET1 fusion proteins	Recent large-scale studies have defined genomewide
378	GSE26592	GSM655200	Endometrial Recurrent 5	GPL9052	Human endometrial specimen	Homo sapiens	tissue: Human endometrial specimen; cell type: primary tissues; disease status: Recurrent; chromatin selection: MBD protein	MBDCap using MethylMiner Methylated DNA Enrichment Kit (Invitrogen, ME 10025); library strategy: Endometrial samples: MBDCao-seq. Breast cells: MBDCap-seq.; library selection: Endometrial samples: MBDCap. Breast cells: MBDCap-se	q. Neighboring genomic regions influence differential methylation patterns of CpG islands in endometrial and breast cance	rs We report the global methylation patterns by MBDCa
1754	GSE50761	GSM1228377	Initial Screen RH-3 -250-+1 2 RHOX44	GPL15520	HeLa cells	Homo sapiens	cell line: HeLa cells; target loci: Initial Screen RH-3 -250-+1 2	2013-07-12-MM564_analysis.csv	Targeted DNA demethylation using TALE-TET1 fusion proteins	Recent large-scale studies have defined genomewide
2371	GSE54961	GSM1327281	Healthy Control	GPL9052	Healthy Control	Homo sapiens	etiology: Healthy Control; tissue: Peripheral venous blood; molecule subtype: serum cell-free DNA	Sample 1	Epigenome analysis of serum cell-free circulating DNA in progression of HBV-related Hepatocellular carcinoma	Purpose: Aberrantly methylated DNA are hallmarks

2.2 Handling SRA (Sequence Read Archive) Metadata

As for GEO, the getSRAMetadata function allows the retrieval of metadata of high througput sequencing data stored in SRA through the BiocStyle::Biocpkg("SRAdb") package. To facilitate the retrieval of experiment types in SRA, the Onassis function library_strategies can be used. Filters for the sample’s material (GENOMIC, TRANSCRIPTOMIC, METAGENOMIC…), the species and the center hosting the data are allowed. For example, to obtain SRA metadata of ChIP-Seq human samples and Bisulfite sequencing samples the following code can be used.

# Connection to the SRAmetadb and potential download of the sqlite file
sra_con <- connectToSRADB()

# Query for the ChIP-Seq experiments contained in GEO for human samples 
sra_chip_seq <- getSRAMetadata(sra_con, library_strategy='ChIP-Seq', library_source='GENOMIC', taxon_id=9606, center_name='GEO')

# The following example allows to retrieve Bisulfite sequencing samples' metadata.
bisulfite_seq <- getSRAMetadata(sra_con, library_strategy='Bisulfite-Seq', library_source='GENOMIC', taxon_id=9606, center_name='GEO' )

To avoid installing GEOmetadb sra_chip_seq was previously saved and can be loaded from Onassis:

sra_chip_seq <- readRDS(system.file('extdata', 'vignette_data', 'GEO_human_chip.rds',  package='Onassis'))

Table: (#tab:printchromatinIP) Table: ChIP-Seq metadata obtained from SRAdb.

FALSE	sample_accession	experiment_accession	center_name	title	library_strategy	library_layout	experiment_url_link	experiment_attribute	description	sample_attribute	sample_url_link
5904	SRS421364	SRX278504	GEO	GSM1142700: p53 ChIP LCL nutlin-3 treated; Homo sapiens; ChIP-Seq	ChIP-Seq	SINGLE -	GEO Sample: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1142700	GEO Accession: GSM1142700	NA	source_name: lymphoblastoid cells \|\| cell type: nutlin-3 treated lymphoblastoid cells \|\| coriell id: GM12878 \|\| chip antibody: mouse monoclonal anti-human p53 (BD Pharmingen, cat# 554294) \|\| BioSampleModel: Generic	GEO Sample GSM1142700: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1142700
4981	SRS371783	SRX199902	GEO	GSM1022674: UW_ChipSeq_A549_InputRep1	ChIP-Seq	SINGLE -	GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1022674	GEO Accession: GSM1022674	NA	source_name: A549 \|\| biomaterial_provider: ATCC \|\| lab: UW \|\| lab description: Stamatoyannopoulous - University of Washington \|\| datatype: ChipSeq \|\| datatype description: Chromatin IP Sequencing \|\| cell: A549 \|\| cell organism: human \|\| cell description: epithelial cell line derived from a lung carcinoma tissue. (PMID: 175022), “This line was initiated in 1972 by D.J. Giard, et al. through explant culture of lung carcinomatous tissue from a 58-year-old caucasian male.” - ATCC, newly promoted to tier 2: not in 2011 analysis \|\| cell karyotype: cancer \|\| cell lineage: endoderm \|\| cell sex: M \|\| antibody: Input \|\| antibody description: Control signal which may be subtracted from experimental raw signal before peaks are called. \|\| treatment: None \|\| treatment description: No special treatment or protocol applies \|\| control: std \|\| control description: Standard input signal for most experiments. \|\| controlid: wgEncodeEH001904 \|\| labexpid: DS18301 \|\| labversion: WindowDensity-bin20-win+/-75 \|\| replicate: 1 \|\| BioSampleModel: Generic	GEO Sample GSM1022674: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1022674
4619	SRS365824	SRX190055	GEO	GSM945272: UW_ChipSeq_HRPEpiC_Input	ChIP-Seq	SINGLE -	GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM945272	GEO Accession: GSM945272	NA	source_name: HRPEpiC \|\| biomaterial_provider: ScienCell \|\| lab: UW \|\| lab description: Stamatoyannopoulous - University of Washington \|\| datatype: ChipSeq \|\| datatype description: Chromatin IP Sequencing \|\| cell: HRPEpiC \|\| cell organism: human \|\| cell description: retinal pigment epithelial cells \|\| cell karyotype: normal \|\| cell lineage: ectoderm \|\| cell sex: U \|\| antibody: Input \|\| antibody description: Control signal which may be subtracted from experimental raw signal before peaks are called. \|\| treatment: None \|\| treatment description: No special treatment or protocol applies \|\| control: std \|\| control description: Standard input signal for most experiments. \|\| controlid: wgEncodeEH000962 \|\| labexpid: DS16014 \|\| labversion: Bowtie 0.12.7 \|\| replicate: 1 \|\| BioSampleModel: Generic	GEO Sample GSM945272: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM945272
911	SRS117344	SRX028649	GEO	GSM608166: H3K27me3_K562_ChIP-seq_rep1	ChIP-Seq	SINGLE -	GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM608166	GEO Accession: GSM608166	NA	source_name: chronic myeloid leukemia cell line \|\| cell line: K562 \|\| harvest date: 2008-06-12 \|\| chip antibody: CST monoclonal rabbit rabbit anti-H3K27me3 \|\| BioSampleModel: Generic	GEO Sample GSM608166: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM608166
4244	SRS362733	SRX186665	GEO	GSM1003469: Broad_ChipSeq_Dnd41_H3K79me2	ChIP-Seq	SINGLE -	GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1003469	GEO Accession: GSM1003469	NA	source_name: Dnd41 \|\| biomaterial_provider: DSMZ \|\| datatype: ChipSeq \|\| datatype description: Chromatin IP Sequencing \|\| antibody antibodydescription: Rabbit polyclonal antibody raised against a peptide containing K79 di-methylation. Antibody Target: H3K79me2 \|\| antibody targetdescription: H3K79me2 is a mark of the transcriptional transition region - the region between the initiation marks (K4me3, etc) and the elongation marks (K36me3). \|\| antibody vendorname: Active Motif \|\| antibody vendorid: 39143 \|\| controlid: wgEncodeEH002434 \|\| replicate: 1,2 \|\| softwareversion: ScriptureVPaperR3 \|\| cell sex: M \|\| antibody: H3K79me2 \|\| antibody antibodydescription: Rabbit polyclonal antibody raised against a peptide containing K79 di-methylation. Antibody Target: H3K79me2 \|\| antibody targetdescription: H3K79me2 is a mark of the transcriptional transition region - the region between the initiation marks (K4me3, etc) and the elongation marks (K36me3). \|\| antibody vendorname: Active Motif \|\| antibody vendorid: 39143 \|\| treatment: None \|\| treatment description: No special treatment or protocol applies \|\| control: std \|\| control description: Standard input signal for most experiments. \|\| controlid: Dnd41/Input/std \|\| softwareversion: ScriptureVPaperR3 \|\| BioSampleModel: Generic	GEO Sample GSM1003469: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1003469
7502	SRS494656	SRX369112	GEO	GSM1252315: CHG092; Homo sapiens; ChIP-Seq	ChIP-Seq	SINGLE -	GEO Sample GSM1252315: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1252315	GEO Accession: GSM1252315	NA	source_name: Gastric Primary Sample \|\| tissuetype: Tumor \|\| chip antibody: H3K4me1 \|\| reads length: 101 \|\| BioSampleModel: Generic	GEO Sample GSM1252315: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1252315
2127	SRS266173	SRX099863	GEO	GSM808752: MCF7_CTCF_REP1	ChIP-Seq	SINGLE -	GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM808752	GEO Accession: GSM808752:	NA	source_name: breast adenocarcinoma cells \|\| cell type: breast adenocarcinoma cells \|\| cell line: MCF7 \|\| antibody: CTCF \|\| BioSampleModel: Generic	GEO Sample GSM808752: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM808752
6299	SRS468164	SRX332680	GEO	GSM1204476: Input DNA for ChIP; Homo sapiens; ChIP-Seq	ChIP-Seq	SINGLE -	GEO Sample GSM1204476: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1204476	GEO Accession: GSM1204476	NA	source_name: MDAMB231 \|\| cell line: MDAMB231 \|\| chip antibody: input \|\| BioSampleModel: Generic	GEO Sample GSM1204476: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1204476
832	SRS115184	SRX027300	GEO	GSM593367: H3K4me3_H3	ChIP-Seq	SINGLE -	GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM593367	GEO Accession: GSM593367	NA	source_name: LCL \|\| chip antibody: H3K4me3 \|\| cell type: lymphoblastoid cell line \|\| BioSampleModel: Generic	GEO Sample GSM593367: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM593367
8638	SRS598154	SRX528309	GEO	GSM1375207: H3_ChIPSeq_Human; Homo sapiens; ChIP-Seq	ChIP-Seq	SINGLE -	GEO Sample GSM1375207: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1375207	GEO Accession: GSM1375207	NA	source_name: H3_ChIPSeq_Human \|\| donor age: adult \|\| cell type: sperm \|\| chip antibody: H3F3B \|\| chip antibody vendor: Abnova \|\| BioSampleModel: Generic	GEO Sample GSM1375207: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1375207

3 Annotating text with Ontology Concepts

The Onassis EntityFinder class has methods for annotating any text with dictionary terms. More specifically, Onassis can take advantage of the OBO dictionaries (http://www.obofoundry.org/), as shown in the next section where we will define relationships between different samples annotated with different ontology concepts thanks to the structure of the ontology.

3.1 Data preparation

Input text can be provided as:

The path of a directory containing named documents (findEntities method).
The path of a single file containing multiple documents. In this case each row contains the name/identifier of the document followed by a ‘|’ separator and the text to annotate (findEntities method).
A data frame. In this case each row represents a document, first column has to be the document identifier, and the remaining columns will be combined and contain the text to analyze (annotateDF method). This option can be conveniently used with the metadata retrieved from GEOmetadb and SRAdb, possibly selecting a subset of columns.

3.2 Creation of a Conceptmapper Dictionary

Conceptmapper dictionaries are XML files with a set of entries specified by the xml tag <token> with a canonical name (the name of the entry) and one or more variants (synonyms). Additional properties are allowed. The following code represents a sample of the Conceptmapper dictionary obtained from the Brenda tissue ontology.

   <?xml version="1.0" encoding="UTF-8" ?>
   <synonym>
      <token id="http://purl.obolibrary.org/obo/BTO_0005205" canonical="cerebral artery">
          <variant base="cerebral artery"/>
      </token>
      <token id="http://purl.obolibrary.org/obo/BTO_0002179" canonical="184A1N4 cell">
          <variant base="184A1N4 cell"/>
          <variant base="A1N4 cell"/>
      </token>
      <token id="http://purl.obolibrary.org/obo/BTO_0003871" canonical="uterine endometrial cancer cell">
          <variant base="uterine endometrial cancer cell"/>
          <variant base="endometrial cancer cell"/>
          <variant base="uterine endometrial carcinoma cell"/>
          <variant base="endometrial carcinoma cell"/>
      </token>
  </synonym>

The function dict dictionary creates an instance of the class CMdictionary (by internally calling the method buildDictionary).

If an XML file containing the Conceptmapper dictionary is already available, it can be uploaded into Onassis indicating its path and setting the dictType option to “CMDICT”.
If the dictionary has to be built from an OBO ontology (as a file in the OBO or OWL format), its path has to be provided and dictType has to be set to “OBO”. The synonymType argument can be set to EXACT_ONLY or ALL to consider only canonical concept names or also to include any synonym. The resulting XML file is written in the indicated outputdir.
To build a dictionary containing only gene/protein names, dictType has to be set to either TARGET or ENTREZ, to include histone types and marks or not, respetively. If a specific Org.xx.eg.db Bioconductor library is indicated in the inputFileOrDb parameter as a character string, gene names will be derived from it. Alterantively, if a specific species is indicated in the taxID parameter, the gene_info.gz file hosted at NCBI is used. If available, this file can be located with the inputFile parameter. Otherwise, it will be automatically downloaded (300MB).

# If a Conceptmapper dictionary is already available the dictType CMDICT can be specified and the corresponding file loaded
sample_dict <- dictionary(inputFileOrDb=system.file('extdata', 'cmDict-sample.cs.xml', package = 'Onassis'), dictType = 'CMDICT')

#Creation of a dictionary from the file sample.cs.obo available in OnassisJavaLibs
obo <- system.file('extdata', 'sample.cs.obo', package='OnassisJavaLibs')

sample_dict <- dictionary(inputFileOrDb=obo, outputdir=getwd(), synonymType='ALL')

# Creation of a dictionary for human genes/proteins
require(org.Hs.eg.db)
targets <- dictionary(dictType='TARGET', inputFileOrDb = 'org.Hs.eg.db')

3.3 Setting the options for the annotator

Conceptmapper includes 7 different options controlling the annotation step. These are documented in detail in the documentation of the CMoptions function. They can be listed through the listCombinations method after creating a ‘CMoptions’ object, and they can be set through the CMoptions function. The set options are thus stored in an object of class CMoptions that will be required for the subsequent step of annotation.

#Showing configuration permutations 
opts <- CMoptions()  
combinations <- listCombinations(opts)
#Setting the combination of parameters through the paramValueIndex value
myopts <- CMoptions(c("CONTIGUOUS_MATCH", "CASE_INSENSITIVE", 'BIOLEMMATIZER', 'PUBMED', 'ON', 'YES', 'ALL'))

3.4 Running the entity finder

The class EntityFinder has methods to find concepts of any OBO ontology in a given text. The findEntities and annotateDF methods accept text within files or data.frame, respetively, as described in Section 3.1. Alternatively, the function annotate automatically adapts to the provided input type. For example, to annotate the metadata derived from ChIP-seq experiments obtained from SRA with tissue and cell type concepts belonging to BRENDA ontology the following code can be used:

chipseq_dict_annot <- annotate(sra_chip_seq[1:20,c('sample_accession', 'title', 'experiment_attribute', 'sample_attribute', 'description')], dictionary=sample_dict, options=myopts)

## [1] "/tmp/Rtmpzmmn37"
## [1] "/tmp/Rtmpzmmn37/SRS115184.a1"
## [1] "/tmp/Rtmpzmmn37/SRS117344.a1"
## [1] "/tmp/Rtmpzmmn37/SRS213443.a1"
## [1] "/tmp/Rtmpzmmn37/SRS241934.a1"
## [1] "/tmp/Rtmpzmmn37/SRS266173.a1"
## [1] "/tmp/Rtmpzmmn37/SRS285318.a1"
## [1] "/tmp/Rtmpzmmn37/SRS336079.a1"
## [1] "/tmp/Rtmpzmmn37/SRS346539.a1"
## [1] "/tmp/Rtmpzmmn37/SRS362733.a1"
## [1] "/tmp/Rtmpzmmn37/SRS362801.a1"
## [1] "/tmp/Rtmpzmmn37/SRS365824.a1"
## [1] "/tmp/Rtmpzmmn37/SRS371783.a1"
## [1] "/tmp/Rtmpzmmn37/SRS376074.a1"
## [1] "/tmp/Rtmpzmmn37/SRS410226.a1"
## [1] "/tmp/Rtmpzmmn37/SRS421364.a1"
## [1] "/tmp/Rtmpzmmn37/SRS468164.a1"
## [1] "/tmp/Rtmpzmmn37/SRS494656.a1"
## [1] "/tmp/Rtmpzmmn37/SRS598154.a1"
## [1] "/tmp/Rtmpzmmn37/SRS636394.a1"
## [1] "/tmp/Rtmpzmmn37/SRS724993.a1"

## Conceptmapper annotations created in directory: /tmp/Rtmpzmmn37

The resulting data.frame contains for each row a match to the provided dictionary for a specific document/sample (indicated in the first column). The annotation is reported with the id of the concept (term_id), its canonical name (term name), its URL in the obo format, and the matching sentence of the document.

Table: (#tab:showchipresults) Table: Annotations of the methylation profiling by high througput sequencing metadata obtained from GEO with BRENDA ontology concepts

sample_id	term_id	term_name	term_url	matched_sentence
SRS115184	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS117344	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS213443	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS213443	CL_0000066	epithelial cell	http://purl.obolibrary.org/obo/CL_0000066	Epithelial Cells
SRS241934	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS266173	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cells
SRS266173	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS285318	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS285318	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cells
SRS336079	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS336079	CL_0000236	B cell	http://purl.obolibrary.org/obo/CL_0000236	B-lymphocyte
SRS346539	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS362733	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS362801	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS365824	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS365824	CL_0000066	epithelial cell	http://purl.obolibrary.org/obo/CL_0000066	epithelial cells
SRS371783	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell
SRS371783	CL_0000066	epithelial cell	http://purl.obolibrary.org/obo/CL_0000066	epithelial cell
SRS376074	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cells
SRS376074	CL_0000000	cell	http://purl.obolibrary.org/obo/CL_0000000	cell

The function annotate can also be used to identify the targeted entity of each ChIP-seq experiment, by retrieving gene names and histone types or modifications in the ChIP-seq metadata.

#Finding the TARGET entities
target_entities <- annotate(inputFileorDf=sra_chip_seq[1:20,c('sample_accession', 'title', 'experiment_attribute', 'sample_attribute', 'description')], options = myopts, dictionary=targets)

## [1] "/tmp/Rtmpzmmn37"
## [1] "/tmp/Rtmpzmmn37/SRS115184.a1"
## [1] "/tmp/Rtmpzmmn37/SRS117344.a1"
## [1] "/tmp/Rtmpzmmn37/SRS213443.a1"
## [1] "/tmp/Rtmpzmmn37/SRS241934.a1"
## [1] "/tmp/Rtmpzmmn37/SRS266173.a1"
## [1] "/tmp/Rtmpzmmn37/SRS285318.a1"
## [1] "/tmp/Rtmpzmmn37/SRS336079.a1"
## [1] "/tmp/Rtmpzmmn37/SRS346539.a1"
## [1] "/tmp/Rtmpzmmn37/SRS362733.a1"
## [1] "/tmp/Rtmpzmmn37/SRS362801.a1"
## [1] "/tmp/Rtmpzmmn37/SRS365824.a1"
## [1] "/tmp/Rtmpzmmn37/SRS371783.a1"
## [1] "/tmp/Rtmpzmmn37/SRS376074.a1"
## [1] "/tmp/Rtmpzmmn37/SRS410226.a1"
## [1] "/tmp/Rtmpzmmn37/SRS421364.a1"
## [1] "/tmp/Rtmpzmmn37/SRS468164.a1"
## [1] "/tmp/Rtmpzmmn37/SRS494656.a1"
## [1] "/tmp/Rtmpzmmn37/SRS598154.a1"
## [1] "/tmp/Rtmpzmmn37/SRS636394.a1"
## [1] "/tmp/Rtmpzmmn37/SRS724993.a1"

Table: (#tab:printKable) Table: Annotations of ChIP-seq test metadata obtained from SRAdb and stored into files with the TARGETs (genes and histone variants)

FALSE	sample_id	term_id	term_name	term_url	matched_sentence
1	SRS115184	Reference T1 H3K4me3	H3K4me3	NA	H3K4me3
2	SRS117344	Reference T1 9514	GAL3ST1	NA	CST
3	SRS117344	Reference T2 10559	SLC35A1	NA	CST
4	SRS117344	Reference T3 54897	CASZ1	NA	CST
5	SRS117344	Reference T4 H3K27me3	H3K27me3	NA	H3K27me3
6	SRS213443	Reference T1 10664	CTCF	NA	CTCF
7	SRS213443	Reference T2 10664	CTCF	NA	CTCF
8	SRS241934	Reference T1 604	BCL6	NA	BCL6
9	SRS266173	Reference T1 10664	CTCF	NA	CTCF
19	SRS285318	Reference T10 23133	PHF8	NA	PHF8
25	SRS336079	Reference T6 100616102	ERVK-9	NA	polymerase
26	SRS336079	Reference T7 100862685	ERVK-19	NA	polymerase
27	SRS336079	Reference T8 100862688	ERVK-11	NA	polymerase
33	SRS336079	Reference T14 1283	CTD	NA	CTD
36	SRS336079	Reference T17 100616102	ERVK-9	NA	polymerase
37	SRS336079	Reference T18 100862685	ERVK-19	NA	polymerase
38	SRS336079	Reference T19 100862688	ERVK-11	NA	polymerase
49	SRS362733	Reference T5 H3K79me2	H3K79me2	NA	H3K79me2
50	SRS362733	Reference T6 H3K79me2	H3K79me2	NA	H3K79me2
51	SRS362733	Reference T7 H3K79me2	H3K79me2	NA	H3K79me2
52	SRS362733	Reference T8 H3K79me2	H3K79me2	NA	H3K79me2
53	SRS362733	Reference T9 H3K79me2	H3K79me2	NA	H3K79me2
54	SRS362801	Reference T1 929	CD14	NA	CD14
55	SRS362801	Reference T2 4695	NDUFA2	NA	CD14
56	SRS362801	Reference T3 929	CD14	NA	CD14
57	SRS362801	Reference T4 4695	NDUFA2	NA	CD14
63	SRS362801	Reference T10 H3K79me2	H3K79me2	NA	H3K79me2
64	SRS362801	Reference T11 H3K79me2	H3K79me2	NA	H3K79me2
65	SRS362801	Reference T12 H3K79me2	H3K79me2	NA	H3K79me2
66	SRS362801	Reference T13 H3K79me2	H3K79me2	NA	H3K79me2
67	SRS362801	Reference T14 H3K79me2	H3K79me2	NA	H3K79me2
68	SRS362801	Reference T15 929	CD14	NA	CD14
69	SRS362801	Reference T16 4695	NDUFA2	NA	CD14
78	SRS371783	Reference T5 10195	ALG3	NA	not
79	SRS376074	Reference T1 6886	TAL1	NA	TAL1
80	SRS376074	Reference T2 6886	TAL1	NA	TAL1
82	SRS376074	Reference T4 3855	KRT7	NA	SCL
83	SRS376074	Reference T5 6886	TAL1	NA	SCL
84	SRS376074	Reference T6 51540	SCLY	NA	SCL
86	SRS410226	Reference T1 H3K27ac	H3K27ac	NA	H3K27ac
87	SRS494656	Reference T1 H3K4me1	H3K4me1	NA	H3K4me1
88	SRS598154	Reference T1 3021	H3F3B	NA	H3F3B
89	SRS636394	Reference T1 3169	FOXA1	NA	FOXA1

4 Semantic similarity

With Onassis it is possible to quantify the semantic similarity between the concepts of a given ontology using different semantic similarity measures. Similarity is an Onassis class applying methods of the Java library slib (Harispe et al. 2014), which builds a semantic graph starting from OBO ontology concepts and their hierarchical relationships. The following methods are available and are automatically chosen depending on the settings of the Similarity function. The sim and groupsim methods allow the computation of semantic similarity between single terms (pairwise measures) and between group of terms (groupwise measures), respectively. Pairwise measures can be edge based, if they rely only on the structure of the ontology, or information-content based if they also consider the information that each term in the ontology carries. Rather, groupwise measures can be indirect, if they compute the pairwise similarity between each couple of terms, or direct if they consider each set of concepts as a whole. The samplesim method allows to determine the semantic similarity between two documents, each possibly associated to multiple concepts, using groupwise measures. Finally, the multisim method allows to determine the semantic similarity between documents annotated with two or more ontologies: first samplesim is run for each ontology, then a user defined function can be used to aggregate the resulting semantic similarities for each pair of documents.

The function showSimilarities shows all the measures supported by Onassis. For details about the measures run {?Similarity}.

#Instantiating the Similarity
similarities <- showSimilarities()

The following example shows the pairwise similarities between sample cell concepts obtained annotating the ChIP-seq metadata. The resnik similarity measure is used by default, which is based on the information content of the most informative common ancestor of the considered concepts. In particular, the seco information content is used by default, which determines the specificity of each concept based on the number of concepts it subsumes.

found_terms <- unique(chipseq_dict_annot$term_url)
n <- length(found_terms)

ontologyfile <- obo
pairwise_results <- data.frame(term1 = character(0), term2= character(0), value = double(0L))
for(i in 1:(n-1)){
  term1 <- as.character(found_terms[i])
  j = i + 1 
  for(k in j:n){
    term2 <- as.character(found_terms[k])
    two_term_similarity <- similarity(ontologyfile,  term1, term2 )
    new_row <- cbind(term1, term2, two_term_similarity)
    pairwise_results <- rbind(pairwise_results, new_row )
  }
}
pairwise_results <- unique(pairwise_results)
pairwise_results <- merge(pairwise_results, chipseq_dict_annot[, c('term_url', 'term_name')], by.x='term2', by.y='term_url', all.x=TRUE)
colnames(pairwise_results)[length(colnames(pairwise_results))] <- 'term2_name'
pairwise_results <- merge(pairwise_results, chipseq_dict_annot[, c('term_url', 'term_name')], by.x='term1', by.y='term_url', all.x=TRUE)
colnames(pairwise_results)[length(colnames(pairwise_results))] <- 'term1_name'
pairwise_results <- unique(pairwise_results)

Table: (#tab:showing_similarity1) Table: Pairwise similarities of cell line terms annotating the ChIP-seq metadata

FALSE	term1	term2	two_term_similarity	term2_name	term1_name
1	http://purl.obolibrary.org/obo/CL_0000000	http://purl.obolibrary.org/obo/CL_0000066	0.226149129142891	epithelial cell	cell
67	http://purl.obolibrary.org/obo/CL_0000000	http://purl.obolibrary.org/obo/CL_0000236	0.226149129142891	B cell	cell
89	http://purl.obolibrary.org/obo/CL_0000066	http://purl.obolibrary.org/obo/CL_0000236	0.268130656074674	B cell	epithelial cell

In the following code the semantic similarity between two groups of terms is computed using the ui measure, a groupwise direct measure combining the intersection and the union of the set of ancestors of the two groups of concepts.

similarity(obo, found_terms[1:2], found_terms[3])

## [1] 0.1875

Finally, the pariwise semantic similarity between ChIP-seq samples is illustrated.

annotated_samples <- as.character(as.vector(unique(chipseq_dict_annot$sample_id)))
n <- length(annotated_samples)


samples_results <- data.frame(sample1 = character(0), sample2= character(0), value = double(0L))
samples_results <- matrix(0, nrow=n, ncol=n)
rownames(samples_results) <- colnames(samples_results) <- annotated_samples
for(i in 1:(n-1)){
  sample1 <- as.character(annotated_samples[i])
  j = i + 1 
  for(k in j:n){
    sample2 <- as.character(annotated_samples[k])
    two_samples_similarity <- similarity(ontologyfile, sample1, sample2, chipseq_dict_annot)
    samples_results[i, k] <- samples_results[k, i] <- two_samples_similarity
  }
}
diag(samples_results) <- 1
heatmap.2(samples_results, density.info = "none", trace="none", main='Semantic similarity of annotated samples', margins=c(5,5))

References

Galeota, Eugenia, and Mattia Pelizzola. 2016. “Ontology-Based Annotations and Semantic Relations in Large-Scale (Epi) Genomics Data.” Briefings in Bioinformatics. Oxford Univ Press, bbw036.

Harispe, Sébastien, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. 2014. “The Semantic Measures Library and Toolkit: Fast Computation of Semantic Similarity and Relatedness Using Biomedical Ontologies.” Bioinformatics 30 (5). Oxford Univ Press:740–42.

Verspoor, K., W. Baumgartner Jr, C. Roeder, and L. Hunter. 2009. “Abstracting the Types away from a UIMA Type System.” From Form to Meaning: Processing Texts Automatically. Tübingen:Narr, 249–56.

Onassis: Ontology Annotation and Semantic Similarity software

2018-02-06

Contents