biodb 1.0.4
biodb is able to handle in-house compound databases stored either inside a CSV file, using the comp.csv.file connector, or inside an SQLite file, using the comp.sqlite connector.
Both connectors accept the creation of new entries through biodb methods. The CSV file connector is able to read any CSV file as input, while the SQLite connector is only able to interact with a database file created by biodb.
In the vignette Manipulating entry objects you can learn how to create a new empty SQLite or CSV file connector and copy entries into it from another connector, and thus create your own database.
To start, we create an instance of the BiodbMain class:
mybiodb <- biodb::newInst()
## INFO [08:58:47.312] Loading definitions from package biodb version 1.0.4.
To make the file easier to load, you should use the tabulation character as the column separator and name the columns of your file with biodb standard field names. However, if your CSV file does not respect the biodb standard, you can also declare a custom column separator character and the mapping between the column names of your file and the biodb field names just before loading the file.
Once the connection to the file is defined, you can use the connector to your in-house file as any other compound database connector.
In order to create a connector to a CSV database, you have to provide the path to your CSV file.
This is done with the url parameter of the createConn() method.
If your CSV file respects the biodb defaults (see below), then no further information is required.
If your CSV file does not respect the biodb standard, then you will have to modify the defaults on the connector instance, before the CSV file is loaded. Thus it has to be done immediately after the connector creation.
In the following sub-sections we are going to see how to load a biodb standard CSV file and a custom CSV file.
Here is a biodb standard CSV file containing an extract of the ChEBI database:
csvUrl <- system.file("extdata", "chebi_extract.tsv", package='biodb')
See table 1 for the content of this file.
accession | formula | monoisotopic.mass | molecular.mass | kegg.compound.id | name | smiles | description |
---|---|---|---|---|---|---|---|
1018 | C2H8AsNO3 | 168.97201 | 169.012 | C07279 | 2-Aminoethylarsonate | NCC[As](O)(O)=O | |
1390 | C8H8O2 | 136.05243 | 136.148 | C06224 | 3,4-Dihydroxystyrene | Oc1ccc(C=C)cc1O | |
1456 | C3H9NO2 | 91.06333 | 91.109 | C06057 | 3-aminopropane-1,2-diol | NC[C@H](O)CO | |
1549 | C3H5O3R | 89.02387 | 89.070 | C03834 | 3-hydroxymonocarboxylic acid | OC([*])CC(O)=O | |
1894 | C5H11NO | 101.08406 | 101.147 | C10974 | 4-Methylaminobutanal | CNCCCC=O | |
1932 | C6H6NR | 92.05002 | 92.119 | C03084 | 4-Substituted aniline | Nc1ccc([*])cc1 | |
This CSV file respects the biodb defaults.
The columns separator is the tabulation character.
Column names use biodb standard entry field names.
String values may be quoted with double quotes (").
We instantiate the connector by passing the URL to the factory:
conn <- mybiodb$getFactory()$createConn('comp.csv.file', url=csvUrl)
We will later use this connector to run the examples of this vignette.
Here is a custom CSV file containing the same extract of the ChEBI database as the biodb standard CSV file:
csvUrl2 <- system.file("extdata", "chebi_extract_custom.csv", package='biodb')
Only the columns separator character and some column names have been changed.
We now create a connector for this custom CSV file:
conn2 <- mybiodb$getFactory()$createConn('comp.csv.file', url=csvUrl2)
At this step the file has not yet been loaded. We can thus customize the connector in order for the CSV file parsing to proceed correctly. The effective loading of the CSV file will happen when you run a method of the connector that requires the data.
The first step to customize your connector is to set the separator character:
conn2$setCsvSep(';')
Then you may change the quote characters:
conn2$setCsvQuote('')
Here we specify with an empty string that this CSV file does not use quotes for character values.
Finally you have to map each custom column name to the name of a biodb entry field.
For this you call the setField() method for each column name, giving the biodb field name as first argument and the column name as second argument.
In our case this gives:
conn2$setField('accession', 'ID')
## INFO [08:58:47.702] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/chebi_extract_custom.csv".
conn2$setField('kegg.compound.id', 'kegg')
conn2$setField('monoisotopic.mass', 'mass')
conn2$setField('molecular.mass', 'molmass')
You will notice that with the first call to setField() an information message tells you that the CSV file has been loaded.
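To make the effect of these settings more concrete, here is a minimal base R sketch of what the separator, quote and field mapping amount to. It does not use biodb and is not how the connector is implemented. The two-row text is a hypothetical excerpt (the actual content of chebi_extract_custom.csv is not shown in this vignette), and the names txt, df and fieldMap are introduced only for illustration.

```r
# Hypothetical two-row excerpt in a custom layout: ';' as separator, no quotes,
# and non-standard column names (ID, kegg, mass, molmass).
txt <- "ID;formula;mass;molmass;kegg
1018;C2H8AsNO3;168.97201;169.012;C07279
1390;C8H8O2;136.05243;136.148;C06224"

# sep and quote play the role of setCsvSep(';') and setCsvQuote('').
# The ID column is read as character so that accessions stay strings.
df <- read.table(text = txt, header = TRUE, sep = ";", quote = "",
                 stringsAsFactors = FALSE, colClasses = c(ID = "character"))

# Mapping from biodb field name to custom column name, mirroring the
# setField() calls above.
fieldMap <- c(accession = "ID", kegg.compound.id = "kegg",
              monoisotopic.mass = "mass", molecular.mass = "molmass")

# Rename the mapped columns; unmapped columns (formula) keep their names.
idx <- match(fieldMap, names(df))
names(df)[idx] <- names(fieldMap)

names(df)
# "accession" "formula" "monoisotopic.mass" "molecular.mass" "kegg.compound.id"
```

Columns whose names already follow the biodb standard, such as formula here, need no mapping.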
It is possible to associate several column names with a single biodb field, in which case you have to provide a character vector containing your column names. The values of the resulting biodb field will be the concatenation of the values of your selected columns, in the order specified. Because of this concatenation, the type of the targeted biodb field must be character. This is particularly useful for the accession field, which must correspond to a unique entry inside your CSV file. Depending on your CSV file, you may need to associate several columns to create a valid accession value that identifies a unique entry.
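The idea of building a unique accession from several columns can be illustrated with plain R. The sketch below uses a hypothetical table with made-up column names (source, localid), and it assumes a direct concatenation without separator; the exact way biodb joins the values is not described here.

```r
# Hypothetical table in which neither column alone identifies an entry.
df <- data.frame(source = c("chebi", "chebi", "kegg"),
                 localid = c(1018, 1390, 1018),
                 stringsAsFactors = FALSE)

# Concatenate the columns in the chosen order; the result is always of
# type character, whatever the types of the input columns.
accession <- paste0(df$source, df$localid)

# A valid accession must identify a unique entry.
stopifnot(anyDuplicated(accession) == 0)

accession
# "chebi1018" "chebi1390" "kegg1018"
```

Here localid alone would be ambiguous (1018 appears twice), while the combined value is unique.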
Retrieving entries is done as with any other connector in biodb, using their accession numbers.
The returned value is a list of BiodbEntry objects:
entries <- conn$getEntry(c('1018', '1456', '16750', '64679'))
## INFO [08:58:47.732] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/chebi_extract.tsv".
entries
## [[1]]
## Biodb Compound CSV File entry instance 1018.
##
## [[2]]
## Biodb Compound CSV File entry instance 1456.
##
## [[3]]
## Biodb Compound CSV File entry instance 16750.
##
## [[4]]
## Biodb Compound CSV File entry instance 64679.
From a list of entries, you can obtain a data frame with their values:
entriesDf <- mybiodb$entriesToDataframe(entries)
See table 2 for the content of this data frame.
accession | formula | monoisotopic.mass | molecular.mass | kegg.compound.id | name | smiles | description | comp.csv.file.id |
---|---|---|---|---|---|---|---|---|
1018 | C2H8AsNO3 | 168.97201 | 169.0120 | C07279 | 2-Aminoethylarsonate | NCC[As](O)(O)=O | | 1018 |
1456 | C3H9NO2 | 91.06333 | 91.1090 | C06057 | 3-aminopropane-1,2-diol | NC[C@H](O)CO | | 1456 |
16750 | C10H13N5O5 | 283.09170 | 283.2409 | C00387 | guanosine | Nc1nc2n(cnc2c(=O)[nH]1)[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O | A purine nucleoside in which guanine is attached to ribofuranose via a beta-N(9)-glycosidic bond. | 16750 |
64679 | C9H18NO11P | 347.06180 | 347.2131 | NA | O-(alpha-D-mannose-1-phosphoryl)-L-serine | N[C@@H](COP(O)(=O)O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]1O)C(O)=O | A mannose phosphate in which in which the phosphate group of alpha-D-mannose 1-phosphate is esterified by the alcoholic hydroxy group of L-serine. | 64679 |
See the vignette Manipulating entry objects to learn everything you can do with biodb entry objects, and also the help page of the class (?biodb::BiodbEntry).
It is possible to search for entries by mass inside a compounds database:
conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)))
## [1] 16750 35485 40304
The function returns a list of accession numbers that you can use with the getEntry() method to retrieve full entry objects.
The tolerance can also be expressed in PPM:
conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, ppm=10)))
## [1] 16750 35485 40304
or with a range:
conn$searchForEntries(list(monoisotopic.mass=list(min=283.091, max=283.093)))
## [1] 16750 35485 40304
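The three ways of expressing a tolerance all come down to a mass interval. The following base R sketch shows the conversion, assuming the conventional definition of PPM (value * ppm * 1e-6). The helper massRange() is hypothetical and is not part of biodb; how biodb computes the interval internally is not shown in this vignette.

```r
# Convert a search criterion (value with either delta or ppm) into a
# closed mass interval. Hypothetical helper, not a biodb function.
massRange <- function(value, delta = NULL, ppm = NULL) {
    if (is.null(delta) == is.null(ppm))
        stop("Provide exactly one of delta or ppm.")
    # Assumption: a tolerance of 1 ppm equals one millionth of the value.
    tol <- if (is.null(ppm)) delta else value * ppm * 1e-6
    c(min = value - tol, max = value + tol)
}

massRange(283.0917, delta = 0.1)   # min = 282.9917, max = 283.1917
massRange(283.0917, ppm = 10)      # about 283.0889 to 283.0945
```

Under this assumption, 10 ppm at 283.0917 corresponds to roughly 0.0028 u on each side, a much narrower window than delta=0.1, although both return the same three entries in this small database.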
You can set a maximum to the number of entries returned with the max.results parameter:
conn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)), max.results=2)
## [1] 16750 35485
To get a list of all possible mass fields in biodb, run:
mybiodb$getEntryFields()$getFieldNames(type='mass')
## [1] "average.mass" "molecular.mass" "monoisotopic.mass"
## [4] "nominal.mass"
To get information on these fields run:
mybiodb$getEntryFields()$get(c('monoisotopic.mass', 'nominal.mass'))
## $monoisotopic.mass
## Entry field "monoisotopic.mass".
## Description: Monoisotopic mass, in u (unified atomic mass units) or Da (Dalton). It is computed using the mass of the primary isotope of the elements including the mass defect (mass difference between neutron and proton, and nuclear binding energy). Used with high resolution mass spectrometers. See https://en.wikipedia.org/wiki/Monoisotopic_mass.
## Class: double.
## Type: mass.
## Cardinality: one.
## Aliases: exact.mass.
##
## $nominal.mass
## Entry field "nominal.mass".
## Description: Nominal mass, in u (unified atomic mass units) or Da (Dalton). It is computed using the mass number of the most abundant isotope of each atom. Typically used with low resolution mass spectrometers. See https://en.wikipedia.org/wiki/Monoisotopic_mass.
## Class: integer.
## Type: mass.
## Cardinality: one.
## Aliases: NA.
To check if a connector is searchable by a field, use the following method:
conn$isSearchableByField('monoisotopic.mass')
## [1] TRUE
To get the list of searchable fields for a connector, run:
conn$getSearchableFields()
## [1] "name" "monoisotopic.mass" "molecular.mass"
Entries are also searchable by name:
conn$searchForEntries(list(name='deoxyguanosine'))
## [1] 40304
And it is possible to combine a search by mass with a search by name:
conn$searchForEntries(list(name='guanosine', monoisotopic.mass=list(value=283.0917, delta=0.1)))
## [1] 16750 40304
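The result above is consistent with both criteria having to hold at once, and with the name criterion matching a substring. The base R sketch below reproduces that behaviour on the three entries whose names and masses appear in the annotation results table of this vignette. It is an illustration of the observed output, not the biodb implementation; substring matching and case sensitivity are assumptions.

```r
# Three entries taken from the annotation results table of this vignette.
db <- data.frame(
    accession = c("16750", "35485", "40304"),
    name = c("guanosine", "adenosine 1-oxide", "8-hydroxy-2'-deoxyguanosine"),
    monoisotopic.mass = c(283.0917, 283.0917, 283.0917),
    stringsAsFactors = FALSE)

# Name criterion: substring match (assumed).
nameOk <- grepl("guanosine", db$name, fixed = TRUE)

# Mass criterion: value plus or minus delta.
massOk <- abs(db$monoisotopic.mass - 283.0917) <= 0.1

# Both criteria must hold for an entry to be returned.
db$accession[nameOk & massOk]
# "16750" "40304"
```

Entry 35485 passes the mass criterion but is excluded because its name does not contain "guanosine".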
Your in-house chemical database can be used to annotate a mass spectrum, using a data frame or a vector as input.
Annotation is done using the annotateMzValues() method, which is a generic method. It is thus available for all compound databases that allow search on masses.
You will obtain a new data frame with appended columns taken from the chemical database.
Here is an input data frame example with M/Z values in a column:
msTsv <- system.file("extdata", "ms.tsv", package='biodb')
mzDf <- read.table(msTsv, header=TRUE, sep="\t")
See table 3 for the content of the input.
mz | rt |
---|---|
282.0839 | 334 |
283.0623 | 872 |
346.0546 | 536 |
821.3964 | 740 |
We run the annotation with the annotateMzValues() method:
annotDf <- conn$annotateMzValues(mzDf, mz.tol=1e-3, ms.mode='neg', mz.tol.unit='plain',
                                 fields=c('accession', 'name', 'formula',
                                          'molecular.mass', 'monoisotopic.mass'),
                                 prefix='mydb.', fieldsLimit=1)
See table 4 for the results.
Inside this table, the values coming from the database entry fields have been prefixed with the value provided inside the prefix parameter. The default value of this parameter would be the name of the database but you can set it to any value you like.
The first parameter is the input, as a data frame or a numeric vector. In case of a data frame, the column containing the M/Z values must be named mz, or you have to specify its name using the mz.col parameter.
The mz.tol and mz.tol.unit parameters are used to set the tolerance; see the manual of the class BiodbCompounddbConn.
You can set the mass field to use in the database with the mass.field parameter (default is monoisotopic.mass).
By default all entry fields from the database will be copied inside the output data frame, but you can restrict the output to a custom set of fields using the fields parameter.
The fieldsLimit parameter is used to limit the number of values output for fields that may contain more than one value. Here it is used for the 'name' field, which may contain more than one name for each entry. By setting the parameter to 1 we select only the first name for each entry.
You will find a complete description of this method and other compound methods by running ?biodb::BiodbCompounddbConn.
mz | rt | mydb.accession | mydb.formula | mydb.molecular.mass | mydb.monoisotopic.mass | mydb.name |
---|---|---|---|---|---|---|
282.0839 | 334 | 16750 | C10H13N5O5 | 283.2409 | 283.0917 | guanosine |
282.0839 | 334 | 35485 | C10H13N5O5 | 283.2409 | 283.0917 | adenosine 1-oxide |
282.0839 | 334 | 40304 | C10H13N5O5 | 283.2407 | 283.0917 | 8-hydroxy-2’-deoxyguanosine |
283.0623 | 872 | NA | NA | NA | NA | NA |
346.0546 | 536 | 64679 | C9H18NO11P | 347.2131 | 347.0618 | O-(alpha-D-mannose-1-phosphoryl)-L-serine |
821.3964 | 740 | 15939 | C42H62O16 | 822.9321 | 822.4038 | glycyrrhizinic acid |
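The matches in this table can be checked by hand. The sketch below is a plain R approximation, not the biodb implementation: it assumes that in negative mode each peak is a deprotonated ion, so that the neutral mass is the M/Z value plus the proton mass, and it assumes a proton mass of 1.007276 u. The variable names are introduced only for this illustration.

```r
# Assumed proton mass, in u.
protonMass <- 1.007276

# M/Z values of the input data frame.
mz <- c(282.0839, 283.0623, 346.0546, 821.3964)

# Distinct monoisotopic masses found in the results table.
dbMass <- c(283.0917, 347.0618, 822.4038)

# Negative mode, assuming deprotonated ions: neutral mass = M/Z + proton mass.
neutral <- mz + protonMass

# A peak is annotated if some database mass lies within mz.tol
# (here 1e-3, with the 'plain' unit meaning an absolute difference).
mzTol <- 1e-3
matched <- vapply(neutral, function(m) any(abs(dbMass - m) <= mzTol),
                  logical(1))

matched
# TRUE FALSE TRUE TRUE
```

Under these assumptions the second peak (283.0623) has no database mass within the tolerance, which agrees with the row of NA values in the table.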
See also vignette In-house mass spectra database for annotation using a mass spectra database.
The SQLite connector operates in the same way as the CSV file connector, except for the instantiation step for which it needs an SQLite file as input instead of a CSV file. Moreover the SQLite file needs to be in biodb format (the relations need to be created by biodb).
Here is a biodb SQLite compounds file, already filled with entries:
sqliteFile <- system.file("extdata", "generated", "chebi_extract.sqlite", package='biodb')
We create a connector from this file:
sqliteConn <- mybiodb$getFactory()$createConn('comp.sqlite', url=sqliteFile)
We can search inside this database the same way we have been searching inside the CSV file database:
sqliteConn$searchForEntries(list(monoisotopic.mass=list(value=283.0917, delta=0.1)))
## [1] "16750" "35485" "40304"
Do not forget to terminate your biodb instance once you are done with it:
mybiodb$terminate()
## INFO [08:58:49.076] Closing BiodbMain instance...
## INFO [08:58:49.079] Connector "comp.csv.file" deleted.
## INFO [08:58:49.081] Connector "comp.csv.file.1" deleted.
## INFO [08:58:49.083] Connector "comp.sqlite" deleted.