1 Introduction

The contents of the database entries, once parsed, are stored by biodb into objects of the class BiodbEntry.

The BiodbEntry class is an RC (aka R5) class (not S3 or S4). RC instances are never copied implicitly by R. This means that each instance is shared by all parts of your code. If one part of your code modifies or deletes a BiodbEntry object, any other part of your code will be affected by this modification. See Reference classes and the vignette Details on biodb , for more explanations.

biodb uses identifiers (IDs) to retrieve and manipulate BiodbEntry instances indirectly. Those identifiers are, in case of web server databases, the official accession numbers provided by these databases.

We will see in this vignette how to retrieve entries using a connector, manipulate fields of an entry, free entry instances from memory and delete their content from disk cache, search for entries in a database, convert entries into data frames or JSON, copying all entries of a database into a new empty database, and merge the entries of several databases into a single database.

To start we need to instantiate the package main class:

mybiodb <- biodb::BiodbMain$new()

## INFO  [08:58:36.729] Loading definitions from package biodb version 1.0.4.

For the demonstration of this vignette, we will use an extract of the ChEBI (Hastings et al. 2012) database, that we have put inside a TSV file.

Here is the TSV file:

chebi.tsv <- system.file("extdata", "chebi_extract.tsv", package='biodb')

And now we create the connector to this CSV File database:

chebi <- mybiodb$getFactory()$createConn('comp.csv.file', url=chebi.tsv)

2 Getting entries

To retrieve entries, we first need to get their identifiers. We can either ask the connector to give us the full list of all entry identifiers:

chebi$getEntryIds()

## INFO  [08:58:37.088] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/chebi_extract.tsv".

##  [1] "1018"  "1390"  "1456"  "1549"  "1894"  "1932"  "1997"  "10561" "15939"
## [10] "16750" "35485" "40304" "64679"

or get the first n entry IDs:

chebi$getEntryIds(max.results=3)

## [1] "1018" "1390" "1456"

Another way of getting entry IDs, is to search the database using a filter. Here we search for entries by name:

chebi$searchForEntries(list(name='deoxyguanosine'))

## [1] 40304

Now we search by mass:

chebi$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)))

## [1] 16750 35485 40304

And finally by both name and mass:

chebi$searchForEntries(list(name='guanosine', monoisotopic.mass=list(value=283.0917, delta=0.1)))

## [1] 16750 40304

Now that we have identifiers, we can get entry objects. First we choose two identifiers:

ids <- chebi$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)), max.results=2)

Then we get the corresponding list of entry instances:

chebi$getEntry(ids)

## [[1]]
## Biodb Compound CSV File entry instance 16750.
## 
## [[2]]
## Biodb Compound CSV File entry instance 35485.

3 Entry fields

The content of an entry is stored inside its fields. To access the values contained in the fields or information about the fields, you need to call methods onto the entry object.

First, we get an entry object:

e <- chebi$getEntry(ids[[1]])

To get a list of all fields having a value inside an entry object, call:

e$getFieldNames()

## [1] "accession"         "comp.csv.file.id"  "description"      
## [4] "formula"           "kegg.compound.id"  "molecular.mass"   
## [7] "monoisotopic.mass" "name"              "smiles"

To get the value of a field, call:

e$getFieldValue('name')

## [1] "guanosine"

To get all the mass fields, run:

e$getFieldsByType('mass')

## [1] "monoisotopic.mass" "molecular.mass"

If you want more information about a field, you have to access the entry fields instance:

mybiodb$getEntryFields()$get('monoisotopic.mass')

## Entry field "monoisotopic.mass".
##   Description: Monoisotopic mass, in u (unified atomic mass units) or Da (Dalton). It is computed using the mass of the primary isotope of the elements including the mass defect (mass difference between neutron and proton, and nuclear binding energy). Used with high resolution mass spectrometers. See https://en.wikipedia.org/wiki/Monoisotopic_mass.
##   Class: double.
##   Type: mass.
##   Cardinality: one.
##   Aliases: exact.mass.

4 Conversion

Entries may be converted into lists of values, data frames, and JSON.

To convert a single entry into a data frame, run (result in 1):

x <- e$getFieldsAsDataframe()

Table 1: Converting an entry to a data frame
accession	formula	monoisotopic.mass	molecular.mass	kegg.compound.id	name	smiles	description	comp.csv.file.id
16750	C10H13N5O5	283.0917	283.2409	C00387	guanosine	Nc1nc2n(cnc2c(=O)[nH]1)[C@@H]1OC@H C@(???)[C@H]1O	A purine nucleoside in which guanine is attached to ribofuranose via a beta-N(9)-glycosidic bond.	16750

Several options are available to control which fields are output. For instance, you can select the set of fields by their name (result in 2):

x <- e$getFieldsAsDataframe(fields=c('name', 'monoisotopic.mass'))

Table 2: Selecting fields by names
name	monoisotopic.mass
guanosine	283.0917

or by their type (result in 3):

x <- e$getFieldsAsDataframe(fields.type='mass')

Table 3: Selecting fields by type
monoisotopic.mass	molecular.mass
283.0917	283.2409

In case of entries with fields that contain multiple values, other options are useful. This is the case for mass spectrum entries. If we get an entry from an extract of Massbank (Horai et al. 2010):

massSqliteFile <- system.file("extdata", "generated", "massbank_extract_full.sqlite", package='biodb')
massbank <- mybiodb$getFactory()$createConn('mass.sqlite', url=massSqliteFile)
massbankEntry <- massbank$getEntry('KNA00776')

we can select the fields of cardinality one only (result in 4):

x <- massbankEntry$getFieldsAsDataframe(only.card.one=TRUE)

Table 4: Selecting fields with only one value
accession	chrom.col.id	chrom.col.name	chrom.rt	chrom.rt.unit	formula	inchi	inchikey	monoisotopic.mass	ms.level	ms.mode	smiles
KNA00776	TOSOH TSKgel ODS-100V 5um Part no. 21456	TOSOH TSKgel ODS-100V 5um Part no. 21456	329.8511	s	C10H13N5O5	InChI=1S/C10H13N5O5/c11-10-13-7-4(8(19)14-10)12-2-15(7)9-6(18)5(17)3(1-16)20-9/h2-3,5-6,9,16-18H,1H2,(H3,11,13,14,19)/t3-,5-,6-,9-/m1/s1		283.0917	1	neg	C1=NC2=C(N1[C@H]3C@(???)O)N=C(NC2=O)N

or get all the fields, in which case fields with more than one value will have their values concatenated into a string using a default separator (result in 5):

x <- massbankEntry$getFieldsAsDataframe(only.card.one=FALSE)

Table 5: Concatenate multiple values
accession	cas.id	chebi.id	chrom.col.id	chrom.col.name	chrom.rt	chrom.rt.unit	formula	inchi	inchikey	kegg.compound.id	mass.csv.file.id	mass.sqlite.id	monoisotopic.mass	ms.level	ms.mode	name	smiles	peak.mz	peak.mztheo	peak.relative.intensity
KNA00776	118-00-3	16750	TOSOH TSKgel ODS-100V 5um Part no. 21456	TOSOH TSKgel ODS-100V 5um Part no. 21456	329.8511	s	C10H13N5O5	InChI=1S/C10H13N5O5/c11-10-13-7-4(8(19)14-10)12-2-15(7)9-6(18)5(17)3(1-16)20-9/h2-3,5-6,9,16-18H,1H2,(H3,11,13,14,19)/t3-,5-,6-,9-/m1/s1		C00387	KNA00776	KNA00776	283.0917	1	neg	Guanosine	C1=NC2=C(N1[C@H]3C@(???)O)N=C(NC2=O)N	282.083871;150.041965;133.015586	282.083871;150.041965;133.015586	100;56;5

It is also possible to get one value per line for fields with cardinality greater than one (result in 6):

x <- massbankEntry$getFieldsAsDataframe(only.card.one=FALSE, flatten=FALSE)

Table 6: Output one value per row
accession	cas.id	chebi.id	chrom.col.id	chrom.col.name	chrom.rt	chrom.rt.unit	formula	inchi	kegg.compound.id	mass.csv.file.id	mass.sqlite.id	monoisotopic.mass	ms.level	ms.mode	name	smiles	peak.mz	peak.mztheo	peak.relative.intensity
KNA00776	118-00-3	16750	TOSOH TSKgel ODS-100V 5um Part no. 21456	TOSOH TSKgel ODS-100V 5um Part no. 21456	329.8511	s	C10H13N5O5	InChI=1S/C10H13N5O5/c11-10-13-7-4(8(19)14-10)12-2-15(7)9-6(18)5(17)3(1-16)20-9/h2-3,5-6,9,16-18H,1H2,(H3,11,13,14,19)/t3-,5-,6-,9-/m1/s1	C00387	KNA00776	KNA00776	283.0917	1	neg	Guanosine	C1=NC2=C(N1[C@H]3C@(???)O)N=C(NC2=O)N	282.0839	282.0839	100
KNA00776	118-00-3	16750	TOSOH TSKgel ODS-100V 5um Part no. 21456	TOSOH TSKgel ODS-100V 5um Part no. 21456	329.8511	s	C10H13N5O5	InChI=1S/C10H13N5O5/c11-10-13-7-4(8(19)14-10)12-2-15(7)9-6(18)5(17)3(1-16)20-9/h2-3,5-6,9,16-18H,1H2,(H3,11,13,14,19)/t3-,5-,6-,9-/m1/s1	C00387	KNA00776	KNA00776	283.0917	1	neg	Guanosine	C1=NC2=C(N1[C@H]3C@(???)O)N=C(NC2=O)N	150.0420	150.0420	56
KNA00776	118-00-3	16750	TOSOH TSKgel ODS-100V 5um Part no. 21456	TOSOH TSKgel ODS-100V 5um Part no. 21456	329.8511	s	C10H13N5O5	InChI=1S/C10H13N5O5/c11-10-13-7-4(8(19)14-10)12-2-15(7)9-6(18)5(17)3(1-16)20-9/h2-3,5-6,9,16-18H,1H2,(H3,11,13,14,19)/t3-,5-,6-,9-/m1/s1	C00387	KNA00776	KNA00776	283.0917	1	neg	Guanosine	C1=NC2=C(N1[C@H]3C@(???)O)N=C(NC2=O)N	133.0156	133.0156	5

And we can limit the number of values for each field (result in 7):

x <- massbankEntry$getFieldsAsDataframe(only.card.one=FALSE, limit=1)

Table 7: Output only one value for each field
accession	cas.id	chebi.id	chrom.col.id	chrom.col.name	chrom.rt	chrom.rt.unit	formula	inchi	inchikey	kegg.compound.id	mass.csv.file.id	mass.sqlite.id	monoisotopic.mass	ms.level	ms.mode	name	smiles	peak.mz	peak.mztheo	peak.relative.intensity
KNA00776	118-00-3	16750	TOSOH TSKgel ODS-100V 5um Part no. 21456	TOSOH TSKgel ODS-100V 5um Part no. 21456	329.8511	s	C10H13N5O5	InChI=1S/C10H13N5O5/c11-10-13-7-4(8(19)14-10)12-2-15(7)9-6(18)5(17)3(1-16)20-9/h2-3,5-6,9,16-18H,1H2,(H3,11,13,14,19)/t3-,5-,6-,9-/m1/s1		C00387	KNA00776	KNA00776	283.0917	1	neg	Guanosine	C1=NC2=C(N1[C@H]3C@(???)O)N=C(NC2=O)N	282.0839	282.0839	100

A list of several entries can also be convert into a data frame (result in 8):

entries <- chebi$getEntry(chebi$getEntryIds(max.results=3))
x <- mybiodb$entriesToDataframe(entries)

Table 8: Converting a list of entries into a data frame
accession	formula	monoisotopic.mass	molecular.mass	kegg.compound.id	name	smiles	description	comp.csv.file.id
16750	C10H13N5O5	283.0917	283.2409	C00387	guanosine	Nc1nc2n(cnc2c(=O)[nH]1)[C@@H]1OC@H C@(???)[C@H]1O	A purine nucleoside in which guanine is attached to ribofuranose via a beta-N(9)-glycosidic bond.	16750
35485	C10H13N5O5	283.0917	283.2409	NA	adenosine 1-oxide	Nc1c2ncn([C@@H]3OC@H C@(???)[C@H]3O)c2ncn1=O		35485
1018	C2H8AsNO3	168.9720	169.0120	C07279	2-Aminoethylarsonate	NCCAs(O)=O		1018

or to JSON

mybiodb$entriesToJson(entries)

## [1] "{\n  \"accession\": \"16750\",\n  \"formula\": \"C10H13N5O5\",\n  \"monoisotopic.mass\": 283.0917,\n  \"molecular.mass\": 283.2409,\n  \"kegg.compound.id\": \"C00387\",\n  \"name\": \"guanosine\",\n  \"smiles\": \"Nc1nc2n(cnc2c(=O)[nH]1)[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O\",\n  \"description\": \"A purine nucleoside in which guanine is attached to ribofuranose via a beta-N(9)-glycosidic bond.\",\n  \"comp.csv.file.id\": \"16750\"\n}"
## [2] "{\n  \"accession\": \"35485\",\n  \"formula\": \"C10H13N5O5\",\n  \"monoisotopic.mass\": 283.0917,\n  \"molecular.mass\": 283.2409,\n  \"name\": \"adenosine 1-oxide\",\n  \"smiles\": \"Nc1c2ncn([C@@H]3O[C@H](CO)[C@@H](O)[C@H]3O)c2ncn1=O\",\n  \"description\": \"\",\n  \"comp.csv.file.id\": \"35485\"\n}"                                                                                                                                   
## [3] "{\n  \"accession\": \"1018\",\n  \"formula\": \"C2H8AsNO3\",\n  \"monoisotopic.mass\": 168.97201,\n  \"molecular.mass\": 169.012,\n  \"kegg.compound.id\": \"C07279\",\n  \"name\": \"2-Aminoethylarsonate\",\n  \"smiles\": \"NCC[As](O)(O)=O\",\n  \"description\": \"\",\n  \"comp.csv.file.id\": \"1018\"\n}"

5 Memory usage

Each time you call the getEntry() method, biodb checks first if the entries you requested are already in memory. If this is the case, it returns them, otherwise it looks into the cache on disk for downloaded contents. If the entry contents have never been downloaded, the connector contacts the database to get the missing contents and save them into the cache. From the contents, biodb create the corresponding BiodbEntry objects.

You may want either to free memory usage by removing entry objects in memory, or delete entry contents from cache in order to download more recent versions of entries. To remove entries from memory, run:

chebi$deleteAllEntriesFromVolatileCache()

To remove entry content files in cache folder, run:

chebi$deleteAllEntriesFromPersistentCache()

To remove all cache files attached to a connector, run:

chebi$deleteWholePersistentCache()

## INFO  [08:58:38.982] No cache files exist for comp.csv.file-1ca83c9b26288b0c06a8acd97be33857.

This will also delete the caching of all HTTP requests and all downloads, including the possible download of the database, thus forcing to download again data from the database.

6 Copy

Entry objects from any connector can be copied into a writable connector.

If we create a new connector to a SQLite file that does not exist:

sqliteOutputFile <- tempfile(pattern="biodb_copy_entries_new_db", fileext='.sqlite')
newDbConn <- mybiodb$getFactory()$createConn('comp.sqlite', url=sqliteOutputFile)

And allow modifications for this connector:

newDbConn$allowEditing()
newDbConn$allowWriting()

We can copy all entries from another connector into it:

mybiodb$copyDb(chebi, newDbConn)

And finally write the entries into the SQLite file:

newDbConn$write()

## INFO  [08:58:40.458] Write all new entries into "/tmp/RtmpQqyRBJ/biodb_copy_entries_new_db66b865a125c88.sqlite".

7 Merging databases

In this vignette we will merge entries from three different databases into a single database.

For the demonstration we will use the ChEBI connector already created, and create two other connectors.

A connector to the Uniprot (Consortium 2016) database:

uniprot.tsv <- system.file("extdata", "uniprot_extract.tsv", package='biodb')
uniprot <- mybiodb$getFactory()$createConn('comp.csv.file', url=uniprot.tsv)

A connector to the ExPASy enzyme (Bairoch 2000) database:

expasy.tsv <- system.file("extdata", "expasy_enzyme_extract.tsv", package='biodb')
expasy <- mybiodb$getFactory()$createConn('comp.csv.file', url=expasy.tsv)

7.1 Merging the entries

We will now merge the entries into a single database. However we will use differently the entries of the three databases. The ChEBI and Uniprot will just be put together since they have no link between them. But we will use the ExPASy entries to add missing fields to the uniprot entries. We will be able to do that because the uniprot entries have a field 'expasy.enzyme.id' that we can use to make the link with the ExPASy entries.

We will write a function that takes a Uniprot entry and search for the ExPASy entry referenced and take missing fields from it:

completeUniprotEntry <- function(e) {
    expasy.id <- e$getFieldValue('expasy.enzyme.id');
    if ( ! is.na(expasy.id)) {
        ex <- expasy$getEntry(expasy.id)
        if ( ! is.null(ex)) {
            for (field in c('catalytic.activity', 'cofactor')) {
                v <- ex$getFieldValue(field)
                if ( ! is.na(v) && length(v) > 0)
                    e$setFieldValue(field, v)
            }
        }
    }
}

Remember that we use RC (Reference Classes, or R5) OOP model in biodb. This means that we use references to objects. Thus we can modify an instance at any place inside the code.

Now we will get all entries from Uniprot and run the function to complete all entries:

uniprot.entries <- uniprot$getEntry(uniprot$getEntryIds())

## INFO  [08:58:42.368] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/uniprot_extract.tsv".

invisible(lapply(uniprot.entries, completeUniprotEntry))

## INFO  [08:58:42.806] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/expasy_enzyme_extract.tsv".

Finally we get all entries from our ChEBI extract, merge all our entries into a single data frame and save it in a file (see content in 9):

chebi.entries <- chebi$getEntry(chebi$getEntryIds())
all.entries.df <- mybiodb$entriesToDataframe(c(chebi.entries, uniprot.entries))
output.file <- tempfile(pattern="biodb_merged_entries", fileext='.tsv')
write.table(all.entries.df, file=output.file, sep="\t", row.names=FALSE)

Table 9: Merged data
accession	formula	monoisotopic.mass	molecular.mass	kegg.compound.id	name	smiles	comp.csv.file.id	gene.symbol	expasy.enzyme.id	kegg.genes.id	aa.seq.length	aa.seq	catalytic.activity	cofactor
1018	C2H8AsNO3	168.97201	169.012	C07279	2-Aminoethylarsonate	NCCAs(O)=O	1018	NA	NA	NA	NA	NA	NA	NA
1390	C8H8O2	136.05243	136.148	C06224	3,4-Dihydroxystyrene	Oc1ccc(C=C)cc1O	1390	NA	NA	NA	NA	NA	NA	NA
1456	C3H9NO2	91.06333	91.109	C06057	3-aminopropane-1,2-diol	NCC@HCO	1456	NA	NA	NA	NA	NA	NA	NA
1549	C3H5O3R	89.02387	89.070	C03834	3-hydroxymonocarboxylic acid	OC([*])CC(O)=O	1549	NA	NA	NA	NA	NA	NA	NA
1894	C5H11NO	101.08406	101.147	C10974	4-Methylaminobutanal	CNCCCC=O	1894	NA	NA	NA	NA	NA	NA	NA
1932	C6H6NR	92.05002	92.119	C03084	4-Substituted aniline	Nc1ccc([*])cc1	1932	NA	NA	NA	NA	NA	NA	NA

7.2 Use a writable database

Instead of building the data frame, we could have used a writable database as seen earlier. Here is a new file database for which we enable edition (for inserting new entries) and writing (for saving it onto disk):

newDbOutputFile <- tempfile(pattern="biodb_merged_entries_new_db", fileext='.tsv')
newDbConn <- mybiodb$getFactory()$createConn('comp.csv.file', url=newDbOutputFile)
newDbConn$allowEditing()
newDbConn$allowWriting()

Now we copy entries into this new database:

mybiodb$copyDb(chebi, newDbConn)

## INFO  [08:58:44.470] Creating empty database.

mybiodb$copyDb(uniprot, newDbConn)

And finally we write the database:

newDbConn$write()

## INFO  [08:58:45.065] Write all entries into "/tmp/RtmpQqyRBJ/biodb_merged_entries_new_db66b866149e492.tsv".

8 Closing biodb instance

Do not forget to terminate your biodb instance once you are done with it:

mybiodb$terminate()

## INFO  [08:58:45.975] Closing BiodbMain instance... 
## INFO  [08:58:45.977] Connector "comp.csv.file" deleted. 
## INFO  [08:58:45.979] Connector "mass.sqlite" deleted. 
## INFO  [08:58:45.981] Connector "comp.sqlite" deleted. 
## INFO  [08:58:45.983] Connector "comp.csv.file.1" deleted. 
## INFO  [08:58:45.985] Connector "comp.csv.file.2" deleted. 
## INFO  [08:58:45.986] Connector "comp.csv.file.3" deleted.

References

Bairoch, A. 2000. “The Enzyme Database in 2000.” Nucleic Acids Research 28 (1): 304–5. https://doi.org/10.1093/nar/28.1.304.

Consortium, The UniProt. 2016. “UniProt: the universal protein knowledgebase.” Nucleic Acids Research 45 (D1): D158–D169. https://doi.org/10.1093/nar/gkw1099.

Hastings, Janna, Paula de Matos, Adriano Dekker, Marcus Ennis, Bhavana Harsha, Namrata Kale, Venkatesh Muthukrishnan, et al. 2012. “The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013.” Nucleic Acids Research 41 (D1): D456–D463. https://doi.org/10.1093/nar/gks1146.

Horai, Hisayuki, Masanori Arita, Shigehiko Kanaya, Yoshito Nihei, Tasuku Ikeda, Kazuhiro Suwa, Yuya Ojima, et al. 2010. “MassBank: A Public Repository for Sharing Mass Spectral Data for Life Sciences.” Journal of Mass Spectrometry 45 (7): 703–14. https://doi.org/https://doi.org/10.1002/jms.1777.

Manipulating entry objects

10 June 2021

Abstract

Package