1 Introduction

biodb can handle in-house compound databases stored either in a CSV file, using the comp.csv.file connector, or in an SQLite file, using the comp.sqlite connector.

Both connectors support the creation of new entries through biodb methods. The CSV file connector can read any CSV file, whereas the SQLite connector can only interact with a database file created by biodb.

The vignette Manipulating entry objects explains how to create a new empty SQLite or CSV file connector and copy entries into it from another connector, thus creating your own database.

To start we create an instance of the BiodbMain class:

mybiodb <- biodb::newInst()

## INFO  [08:58:47.312] Loading definitions from package biodb version 1.0.4.

2 CSV File connector

To make loading easier, use the tabulation character as the column separator and name the columns of your file with biodb standard field names. If your CSV file does not follow the biodb standard, you can instead declare a custom column separator and a mapping between the column names of your file and biodb field names, just before the file is loaded.

Once the connection to the file is defined, you can use the connector to your in-house file like any other compound database connector.
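
As an illustration, a file that follows these defaults can be produced with base R alone. The sketch below is not part of biodb: the names compounds and myFile are made up for the example, and the two rows are copied from the ChEBI extract shown in table 1.

```r
# Column names are biodb standard field names; rows are copied from table 1.
compounds <- data.frame(
    accession = c("1018", "1390"),
    formula = c("C2H8AsNO3", "C8H8O2"),
    monoisotopic.mass = c(168.97201, 136.05243),
    molecular.mass = c(169.012, 136.148),
    name = c("2-Aminoethylarsonate", "3,4-Dihydroxystyrene"),
    stringsAsFactors = FALSE
)

# Write with the tabulation character as separator and double quotes around
# character values.
myFile <- tempfile(fileext = ".tsv")
write.table(compounds, file = myFile, sep = "\t", quote = TRUE,
            row.names = FALSE)
```

The resulting path could then be given as the url parameter when creating a comp.csv.file connector.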

2.1 Creating a connector

In order to create a connector to a CSV database, you have to provide the path to your CSV file. This is done with the url parameter of the createConn() method.

If your CSV file respects the biodb defaults (see below), then no further information is required.

If your CSV file does not follow the biodb standard, you must modify the defaults on the connector instance before the CSV file is loaded, that is, immediately after creating the connector.

In the following sub-sections we are going to see how to load a biodb standard CSV file and a custom CSV file.

2.1.1 Loading a biodb standard CSV file

Here is a biodb standard CSV file containing an extract of the ChEBI database:

csvUrl <- system.file("extdata", "chebi_extract.tsv", package='biodb')

See table 1 for the content of this file.

Table 1: Excerpt from compound database TSV file.
accession	formula	monoisotopic.mass	molecular.mass	kegg.compound.id	name	smiles
1018	C2H8AsNO3	168.97201	169.012	C07279	2-Aminoethylarsonate	`NCC[As](O)(O)=O`
1390	C8H8O2	136.05243	136.148	C06224	3,4-Dihydroxystyrene	`Oc1ccc(C=C)cc1O`
1456	C3H9NO2	91.06333	91.109	C06057	3-aminopropane-1,2-diol	`NC[C@H](O)CO`
1549	C3H5O3R	89.02387	89.070	C03834	3-hydroxymonocarboxylic acid	`OC([*])CC(O)=O`
1894	C5H11NO	101.08406	101.147	C10974	4-Methylaminobutanal	`CNCCCC=O`
1932	C6H6NR	92.05002	92.119	C03084	4-Substituted aniline	`Nc1ccc([*])cc1`

This CSV file respects the biodb defaults. The column separator is the tabulation character. Column names use biodb standard entry field names. String values may be quoted with double quotes (").

We instantiate the connector by passing the URL to the factory:

conn <- mybiodb$getFactory()$createConn('comp.csv.file', url=csvUrl)

We will later use this connector to run the examples of this vignette.

2.1.2 Loading a custom CSV file

Here is a custom CSV file containing the same extract of the ChEBI database as the biodb standard CSV file:

csvUrl2 <- system.file("extdata", "chebi_extract_custom.csv", package='biodb')

Only the column separator character and some column names have been changed.

We now create a connector for this custom CSV file:

conn2 <- mybiodb$getFactory()$createConn('comp.csv.file', url=csvUrl2)

At this step the file has not yet been loaded. We can thus customize the connector in order for the CSV file parsing to proceed correctly. The effective loading of the CSV file will happen when you run a method of the connector that requires the data.

The first step to customize your connector is to set the separator character:

conn2$setCsvSep(';')

Then you may change the quote characters:

conn2$setCsvQuote('')

Here we specify with an empty string that this CSV file does not use quotes for character values.

Finally, you have to map each custom column name to the name of a biodb entry field. To do so, call the setField() method once for each column, giving the biodb field name as the first argument and the column name as the second. In our case this gives:

conn2$setField('accession',         'ID')

## INFO  [08:58:47.702] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/chebi_extract_custom.csv".

conn2$setField('kegg.compound.id',  'kegg')
conn2$setField('monoisotopic.mass', 'mass')
conn2$setField('molecular.mass',    'molmass')

You will notice that with the first call to setField() an information message tells you that the CSV file has been loaded.
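
To make the effect of these settings concrete, here is a base R sketch of the same parsing and renaming. It does not call biodb, and it is not a description of how the connector is implemented. The file content in csvText is hypothetical (the actual content of chebi_extract_custom.csv is not shown in this vignette), and the names csvText, customDf and fieldMap are invented for the example.

```r
# Hypothetical custom file content: semicolon-separated, no quoting.
csvText <- "ID;formula;mass;molmass;kegg;name
1018;C2H8AsNO3;168.97201;169.012;C07279;2-Aminoethylarsonate
1390;C8H8O2;136.05243;136.148;C06224;3,4-Dihydroxystyrene"

# sep = ";" plays the role of setCsvSep(';'), quote = "" of setCsvQuote('').
customDf <- read.table(text = csvText, header = TRUE, sep = ";", quote = "",
                       stringsAsFactors = FALSE,
                       colClasses = c(ID = "character"))

# Mapping in the same direction as setField(): biodb field name -> column name.
fieldMap <- c(accession = "ID", kegg.compound.id = "kegg",
              monoisotopic.mass = "mass", molecular.mass = "molmass")

# Rename the mapped columns; unmapped columns keep their names.
idx <- match(fieldMap, names(customDf))
names(customDf)[idx] <- names(fieldMap)
```

Columns whose names already match a biodb field name, such as formula and name here, need no mapping.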

It is possible to associate several column names with a single biodb field, in which case you have to provide a character vector containing your column names. The values of the resulting biodb field will be the concatenation of the values of the selected columns, in the order specified. Because the values are concatenated, the type of the targeted biodb field must be character. This is particularly useful for the accession field, whose value must identify a unique entry inside your CSV file. Depending on your CSV file, you may need to combine several columns to create a valid accession value.
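
The idea can be pictured with a small base R sketch. The data frame sampleDf and its columns source and number are hypothetical, and the sketch assumes that the values are joined without any separator, which the text above does not specify.

```r
# Hypothetical table in which neither column alone identifies a row.
sampleDf <- data.frame(source = c("lab1", "lab1", "lab2"),
                       number = c("001", "002", "001"),
                       stringsAsFactors = FALSE)

# Concatenate the columns in the given order to build a character accession.
# With a connector, the corresponding mapping would presumably be written
# conn2$setField('accession', c('source', 'number')).
sampleDf$accession <- paste0(sampleDf$source, sampleDf$number)

# Check that the combined value identifies each row uniquely.
accessionIsUnique <- !anyDuplicated(sampleDf$accession)
```

If accessionIsUnique were FALSE, an additional column would be needed to build a valid accession.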

2.2 Retrieving entries

Retrieving entries is done as with any other connector in biodb, using their accession numbers. The returned value is a list of BiodbEntry objects:

entries <- conn$getEntry(c('1018', '1456', '16750', '64679'))

## INFO  [08:58:47.732] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/chebi_extract.tsv".

entries

## [[1]]
## Biodb Compound CSV File entry instance 1018.
## 
## [[2]]
## Biodb Compound CSV File entry instance 1456.
## 
## [[3]]
## Biodb Compound CSV File entry instance 16750.
## 
## [[4]]
## Biodb Compound CSV File entry instance 64679.

From a list of entries, you can obtain a data frame with their values:

entriesDf <- mybiodb$entriesToDataframe(entries)

See table 2 for the content of this data frame.

Table 2: Some entries from the compound database.
accession	formula	monoisotopic.mass	molecular.mass	kegg.compound.id	name	smiles	description	comp.csv.file.id
1018	C2H8AsNO3	168.97201	169.0120	C07279	2-Aminoethylarsonate	`NCC[As](O)(O)=O`		1018
1456	C3H9NO2	91.06333	91.1090	C06057	3-aminopropane-1,2-diol	`NC[C@H](O)CO`		1456
16750	C10H13N5O5	283.09170	283.2409	C00387	guanosine	`Nc1nc2n(cnc2c(=O)[nH]1)[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O`	A purine nucleoside in which guanine is attached to ribofuranose via a beta-N(9)-glycosidic bond.	16750
64679	C9H18NO11P	347.06180	347.2131	NA	O-(alpha-D-mannose-1-phosphoryl)-L-serine	`N[C@@H](COP(O)(=O)O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]1O)C(O)=O`	A mannose phosphate in which in which the phosphate group of alpha-D-mannose 1-phosphate is esterified by the alcoholic hydroxy group of L-serine.	64679

See the vignette Manipulating entry objects and the help page ?biodb::BiodbEntry to learn what you can do with biodb entry objects.

2.3 Searching for entries

It is possible to search for entries by mass inside a compound database:

conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)))

## [1] 16750 35485 40304

The function returns a list of accession numbers that you can use with the getEntry() method to retrieve full entry objects.

The tolerance can also be expressed in PPM:

conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, ppm=10)))

## [1] 16750 35485 40304

or with a range:

conn$searchForEntries(list(monoisotopic.mass=list(min=283.091, max=283.093)))

## [1] 16750 35485 40304
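
The three ways of expressing the tolerance describe the same kind of mass window. The following base R sketch converts a PPM tolerance into an explicit range, assuming the conventional definition of PPM (parts per million of the searched value). The exact computation done by biodb is not shown in this vignette, so this is only an approximation; the variable names are invented for the example.

```r
# Searched mass and tolerance, as in the PPM example above.
mass <- 283.0917
ppm <- 10

# Conventional conversion: 1 ppm is one millionth of the searched value.
delta <- mass * ppm * 1e-6

# Equivalent explicit window, in the style of the min/max form.
massRange <- c(min = mass - delta, max = mass + delta)
```

With these values, delta is slightly below 0.003, so the window is much narrower than the one obtained with delta=0.1, even though both searches return the same three entries here.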

You can limit the number of entries returned with the max.results parameter:

conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)), max.results=2)

## [1] 16750 35485

To get a list of all possible mass fields in biodb, run:

mybiodb$getEntryFields()$getFieldNames(type='mass')

## [1] "average.mass"      "molecular.mass"    "monoisotopic.mass"
## [4] "nominal.mass"

To get information on these fields run:

mybiodb$getEntryFields()$get(c('monoisotopic.mass', 'nominal.mass'))

## $monoisotopic.mass
## Entry field "monoisotopic.mass".
##   Description: Monoisotopic mass, in u (unified atomic mass units) or Da (Dalton). It is computed using the mass of the primary isotope of the elements including the mass defect (mass difference between neutron and proton, and nuclear binding energy). Used with high resolution mass spectrometers. See https://en.wikipedia.org/wiki/Monoisotopic_mass.
##   Class: double.
##   Type: mass.
##   Cardinality: one.
##   Aliases: exact.mass.
## 
## $nominal.mass
## Entry field "nominal.mass".
##   Description: Nominal mass, in u (unified atomic mass units) or Da (Dalton). It is computed using the mass number of the most abundant isotope of each atom. Typically used with low resolution mass spectrometers. See https://en.wikipedia.org/wiki/Monoisotopic_mass.
##   Class: integer.
##   Type: mass.
##   Cardinality: one.
##   Aliases: NA.

To check if a connector is searchable by a field, use the following method:

conn$isSearchableByField('monoisotopic.mass')

## [1] TRUE

To get the list of searchable fields for a connector, run:

conn$getSearchableFields()

## [1] "name"              "monoisotopic.mass" "molecular.mass"

Entries are also searchable by name:

conn$searchForEntries(list(name='deoxyguanosine'))

## [1] 40304

And it is possible to combine a search by mass with a search by name:

conn$searchForEntries(list(name='guanosine', monoisotopic.mass=list(value=283.0917, delta=0.1)))

## [1] 16750 40304
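
The results above are consistent with a name search that matches substrings and with criteria that are combined with a logical AND. The base R sketch below reproduces that behaviour on the three entries of mass 283.0917 listed in table 4. Both the substring matching and the AND combination are assumptions inferred from the outputs, and the names candidates, byName and byMass are invented for the example.

```r
# The three entries returned by the mass search, with names from table 4.
candidates <- data.frame(
    accession = c("16750", "35485", "40304"),
    name = c("guanosine", "adenosine 1-oxide", "8-hydroxy-2'-deoxyguanosine"),
    monoisotopic.mass = c(283.0917, 283.0917, 283.0917),
    stringsAsFactors = FALSE
)

# Name criterion: the searched text appears somewhere in the name.
byName <- grepl("guanosine", candidates$name, fixed = TRUE)

# Mass criterion: absolute difference within the delta tolerance.
byMass <- abs(candidates$monoisotopic.mass - 283.0917) <= 0.1

# Keep only entries satisfying both criteria.
combined <- candidates$accession[byName & byMass]
```

Here combined holds the accessions 16750 and 40304; adenosine 1-oxide satisfies the mass criterion but not the name criterion.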

2.4 Annotation of an MS file

Your in-house chemical database can be used to annotate a mass spectrum, using a data frame or a vector as input. Annotation is done using the annotateMzValues() method, which is a generic method. It is thus available for all compound databases that allow search on masses. You will obtain a new data frame with appended columns taken from the chemical database.

Here is an input data frame example with M/Z values in a column:

msTsv <- system.file("extdata", "ms.tsv", package='biodb')
mzDf <- read.table(msTsv, header=TRUE, sep="\t")

See table 3 for the content of the input.

Table 3: Input M/Z values.
mz	rt
282.0839	334
283.0623	872
346.0546	536
821.3964	740

We run the annotation with the annotateMzValues() method:

annotDf <- conn$annotateMzValues(mzDf, mz.tol=1e-3, ms.mode='neg', mz.tol.unit='plain',
                                 fields=c('accession', 'name', 'formula',
                                          'molecular.mass', 'monoisotopic.mass'),
                                 prefix='mydb.', fieldsLimit=1)

See table 4 for the results.

Inside this table, the values coming from the database entry fields have been prefixed with the value provided inside the prefix parameter. The default value of this parameter would be the name of the database but you can set it to any value you like.

The first parameter is the input, as a data frame or a numeric vector. In case of a data frame the column containing the M/Z values must be named mz or you have to specify its name using the mz.col parameter.

The mz.tol and mz.tol.unit parameters are used to set the tolerance, see the manual of the class BiodbCompounddbConn. You can set the mass field to use in the database with the mass.field parameter (default is monoisotopic.mass).

By default all entry fields from the database will be copied inside the output data frame, but you can restrict to a custom set of fields using the fields parameter.

The fieldsLimit parameter is used to limit the number of values output for fields that may contain more than one value. Here it is used for the 'name' field, which may contain more than one name for each entry. By setting the parameter to 1 we select only the first name for each entry.

You will find a complete description of this method and other compound methods by running ?biodb::BiodbCompounddbConn.

Table 4: The annotated mass spectrum
Columns prefixed with “mydb.” come from the compound database.
mz	rt	mydb.accession	mydb.formula	mydb.molecular.mass	mydb.monoisotopic.mass	mydb.name
282.0839	334	16750	C10H13N5O5	283.2409	283.0917	guanosine
282.0839	334	35485	C10H13N5O5	283.2409	283.0917	adenosine 1-oxide
282.0839	334	40304	C10H13N5O5	283.2407	283.0917	8-hydroxy-2’-deoxyguanosine
283.0623	872	NA	NA	NA	NA	NA
346.0546	536	64679	C9H18NO11P	347.2131	347.0618	O-(alpha-D-mannose-1-phosphoryl)-L-serine
821.3964	740	15939	C42H62O16	822.9321	822.4038	glycyrrhizinic acid
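
The matching behind table 4 can be approximated in base R. This sketch assumes that in negative mode each peak is a deprotonated ion [M-H]-, so that the neutral mass is the M/Z value plus the mass of a proton (taken here as 1.007276 Da), and that mz.tol with the plain unit is an absolute tolerance. Both points are assumptions about annotateMzValues(), not statements from its documentation, and the names protonMass, refDb and matches are invented for the example.

```r
# Approximate proton mass in Da (assumption; biodb may use another constant).
protonMass <- 1.007276

# M/Z values from table 3.
mz <- c(282.0839, 283.0623, 346.0546, 821.3964)

# Accessions and monoisotopic masses copied from table 4.
refDb <- data.frame(
    accession = c("16750", "35485", "40304", "64679", "15939"),
    monoisotopic.mass = c(283.0917, 283.0917, 283.0917, 347.0618, 822.4038),
    stringsAsFactors = FALSE
)

# Negative mode, [M-H]- assumed: neutral mass = M/Z + proton mass.
neutralMass <- mz + protonMass

# For each peak, keep the accessions whose mass lies within the tolerance.
mzTol <- 1e-3
matches <- lapply(neutralMass, function(m) {
    refDb$accession[abs(refDb$monoisotopic.mass - m) <= mzTol]
})
names(matches) <- mz
```

Under these assumptions the second peak (283.0623) has no match, which corresponds to the row of NA values in table 4.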

See also vignette In-house mass spectra database for annotation using a mass spectra database.

3 SQLite connector

The SQLite connector operates in the same way as the CSV file connector, except for the instantiation step for which it needs an SQLite file as input instead of a CSV file. Moreover the SQLite file needs to be in biodb format (the relations need to be created by biodb).

Here is a biodb SQLite compounds file, already filled with entries:

sqliteFile <- system.file("extdata", "generated", "chebi_extract.sqlite", package='biodb')

We create a connector from this file:

sqliteConn <- mybiodb$getFactory()$createConn('comp.sqlite', url=sqliteFile)

We can search inside this database the same way we have been searching inside the CSV file database:

sqliteConn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)))

## [1] "16750" "35485" "40304"

4 Closing biodb instance

Do not forget to terminate your biodb instance once you are done with it:

mybiodb$terminate()

## INFO  [08:58:49.076] Closing BiodbMain instance... 
## INFO  [08:58:49.079] Connector "comp.csv.file" deleted. 
## INFO  [08:58:49.081] Connector "comp.csv.file.1" deleted. 
## INFO  [08:58:49.083] Connector "comp.sqlite" deleted.

In-house compound database

10 June 2021
