1 Introduction

biodb can handle in-house mass spectra databases stored either in a CSV file, through the mass.csv.file connector, or in an SQLite file, through the mass.sqlite connector.

Both connectors support the creation of new entries through biodb methods. The CSV file connector can read any CSV file as input, while the SQLite connector can only interact with a database file created by biodb.

The vignette Manipulating entry objects explains how to create a new, empty SQLite or CSV file connector and copy entries into it from another connector, thus creating your own database.

To start we create an instance of the BiodbMain class:

mybiodb <- biodb::newInst()

## INFO  [08:58:50.144] Loading definitions from package biodb version 1.0.4.

2 CSV File connector

We are going to use the mass.csv.file database connector to load an in-house LCMS database. This connector is able to access any in-house database CSV file, containing LCMS entries and/or MSMS entries.

We will use an extract of the Massbank database (Horai et al. 2010) for our example:

fileUrl <- system.file("extdata", "massbank_extract_lcms_2.tsv", package='biodb')

We create the connector from this file, and will use it for the search and peak annotation examples that follow.

conn <- mybiodb$getFactory()$createConn('mass.csv.file', url=fileUrl)

Two fields are missing from this database: the MS level and the unit used for chromatographic retention times. We define them using the addField() method:

conn$addField('ms.level', 1)

## INFO  [08:58:50.500] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/massbank_extract_lcms_2.tsv".

conn$addField('chrom.rt.unit', 's')

3 Searching for entries

Both peak.mzexp and peak.mztheo are defined inside the database. Because of this, asking the connector which field it will use for matching M/Z values emits a warning:

conn$getMatchingMzField()

## WARN  [08:58:50.777] Field "peak.mztheo" has been automatically chosen among several possibilities (peak.mztheo, peak.mzexp) for matching M/Z values. Use setMatchingMzField() method explicitly to avoid this warning in the future.

## Warning in warn0("Field \"", field, "\" has been automatically chosen", :
## Field "peak.mztheo" has been automatically chosen among several possibilities
## (peak.mztheo, peak.mzexp) for matching M/Z values. Use setMatchingMzField()
## method explicitly to avoid this warning in the future.

## [1] "peak.mztheo"

Setting the M/Z field to use:

conn$setMatchingMzField('peak.mztheo')

Searching by M/Z:

conn$searchForMassSpectra(mz.min=145.96, mz.max=147.13)

## [1] "PR010007"

Searching with multiple M/Z values:

conn$searchForMassSpectra(mz=c(86.000002, 146.999997), mz.tol=10, mz.tol.unit='ppm')

## [1] "PR010001" "PR010007"
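To make the tolerance arguments concrete, here is a minimal base R sketch of how a tolerance can be turned into a search window around each M/Z value. It assumes a symmetric window, with a 'ppm' tolerance scaled by the M/Z value and a 'plain' tolerance used as an absolute difference. The helper name `mz_window()` is hypothetical and this is not biodb's own code, whose exact computation may differ.

```r
# Hypothetical helper (not part of biodb): converts a tolerance into a
# symmetric [mz.min, mz.max] window around each M/Z value.
mz_window <- function(mz, tol, unit = c("plain", "ppm")) {
  unit <- match.arg(unit)
  # A ppm tolerance is relative to the M/Z value, a plain one is absolute.
  delta <- if (unit == "ppm") mz * tol * 1e-6 else rep(tol, length(mz))
  data.frame(mz = mz, mz.min = mz - delta, mz.max = mz + delta)
}

# Windows for the two M/Z values searched above, with a 10 ppm tolerance.
win_ppm <- mz_window(c(86.000002, 146.999997), tol = 10, unit = "ppm")

# Window for a single M/Z value with an absolute tolerance of 0.1.
win_plain <- mz_window(115, tol = 0.1, unit = "plain")
```

Under this assumption, a 10 ppm tolerance at M/Z 86 gives a window less than 0.002 wide, which is much narrower than a plain tolerance of 0.1.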

Searching by M/Z and RT:

conn$searchForMassSpectra(mz.min=72.78, mz.max=73.12, rt=45.5, rt.tol=1, rt.unit='s')

## [1] "PR010003"

Other parameters can be used in searchForMassSpectra():

MS level.
MS mode.
Minimum relative intensity.
Matching of precursor peak.
Maximum number of results.

4 Peak annotation

Peak annotation can be done with the searchMsPeaks() method.

We must first define the data frame of inputs:

input <- data.frame(mz=c(73.01, 116.04, 174.2),
                    rt=c(79, 173, 79))

Then we run the annotation:

conn$searchMsPeaks(input, mz.tol=0.1, rt.unit='s', rt.tol=10, match.rt=TRUE, prefix='match.')

##       mz  rt match.accession match.chrom.col.id match.chrom.col.name
## 1  73.01  79        PR010001              mycol                mycol
## 2 116.04 173        PR010006              mycol                mycol
## 3 174.20  79            <NA>               <NA>                 <NA>
##   match.chrom.rt match.chrom.rt.unit match.formula match.mass.csv.file.id
## 1             78                   s       C3H10N2               PR010001
## 2            176                   s      C9H13NO2               PR010006
## 3             NA                <NA>          <NA>                   <NA>
##   match.molecular.mass match.ms.level match.ms.mode            match.name
## 1              74.0844              1           pos    1,3-Diaminopropane
## 2             167.0946              1           pos (R)-(-)-Phenylephrine
## 3                   NA             NA          <NA>                  <NA>
##   match.peak.intensity match.peak.mz match.peak.mzexp match.peak.mztheo
## 1                  999        73.012           73.012                73
## 2                  999       116.011          116.011               116
## 3                   NA            NA               NA                NA
##               match.smiles
## 1                    NCCCN
## 2 CNC[C@H](O)c(c1)cc(O)cc1
## 3                     <NA>
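The matching rule behind this output can be illustrated with a minimal base R sketch. It is not biodb's implementation: the reference table `ref` is a toy two-row table built from values printed above, and the `annotate()` helper and its column names are hypothetical. It assumes that an input peak matches a reference peak when both the M/Z difference and the RT difference fall within their tolerances, and that unmatched input rows are kept and filled with NA.

```r
# Toy reference table built from values shown in the output above.
ref <- data.frame(accession = c("PR010001", "PR010006"),
                  peak.mz = c(73.012, 116.011),
                  chrom.rt = c(78, 176),
                  stringsAsFactors = FALSE)

input <- data.frame(mz = c(73.01, 116.04, 174.2),
                    rt = c(79, 173, 79))

# Hypothetical helper: annotates each input row with the reference rows
# lying within both tolerances, or with NA values if there is none.
annotate <- function(input, ref, mz.tol, rt.tol, prefix) {
  rows <- lapply(seq_len(nrow(input)), function(i) {
    ok <- abs(ref$peak.mz - input$mz[i]) <= mz.tol &
          abs(ref$chrom.rt - input$rt[i]) <= rt.tol
    hit <- ref[ok, , drop = FALSE]
    # No match: keep the input row and fill the annotation with NA.
    if (nrow(hit) == 0) hit <- ref[NA_integer_, , drop = FALSE]
    names(hit) <- paste0(prefix, names(hit))
    data.frame(input[rep(i, nrow(hit)), , drop = FALSE], hit,
               row.names = NULL, check.names = FALSE,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

res <- annotate(input, ref, mz.tol = 0.1, rt.tol = 10, prefix = "match.")
```

As in the output above, the first two input peaks are annotated and the third one, which has no reference peak within tolerance, receives NA values.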

5 MS Annotation

5.1 Annotation using an LCMS database

Here is the input data frame we want to annotate. It contains both M/Z values and RT values in seconds:

ms.tsv <- system.file("extdata", "ms.tsv", package='biodb')
mzdf <- read.table(ms.tsv, header=TRUE, sep="\t")

For this example we annotate against an in-house LCMS database saved as a TSV file:

lcmsdb <- system.file("extdata", "massbank_extract_lcms_1.tsv", package="biodb")
massbank <- mybiodb$getFactory()$createConn('mass.csv.file', url=lcmsdb)

The database file was built using Massbank data. The accession numbers correspond to real Massbank entries.

Now that the database is loaded, we define some missing fields (MS level and RT unit):

massbank$addField('ms.level', 1)

## INFO  [08:58:52.122] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/massbank_extract_lcms_1.tsv".

massbank$addField('chrom.rt.unit', 's')

We will now annotate our M/Z peaks, selecting a set of fields from the database to write in the output data frame:

massbank$searchMsPeaks(mzdf, mz.tol=1e-3, fields=c('accession', 'name', 'formula', 'chebi.id'), prefix='mydb.')

##         mz  rt mydb.accession mydb.chebi.id mydb.formula
## 1 282.0839 334       KNA00776         16750   C10H13N5O5
## 2 283.0623 872       BS003495          <NA>     C16H12O5
## 3 346.0546 536       KNA00596         16027  C10H14N5O7P
## 4 821.3964 740       TY000021          <NA>    C42H62O16
##                         mydb.name
## 1                       Guanosine
## 2 5,7-Dihydroxy-2'-methoxyflavone
## 3                             AMP
## 4               Glycyrrhizic acid

Annotation can also take retention times into account. In that case we choose the chromatographic columns on which we want to match, and the set of fields we want in the output:

chromColIds <- c('TOSOH TSKgel ODS-100V  5um Part no. 21456')
fields <- c('accession', 'name', 'formula', 'chebi.id', 'chrom.rt', 'chrom.col.id')

Finally we run the annotation on M/Z and RT values:

massbank$searchMsPeaks(mzdf, mz.tol=1e-3, fields=fields, prefix='mydb.', chrom.col.ids=chromColIds, rt.unit='s', rt.tol=10, match.rt=TRUE)

##         mz  rt mydb.accession mydb.chebi.id
## 1 282.0839 334       KNA00776         16750
## 2 283.0623 872           <NA>          <NA>
## 3 346.0546 536       KNA00596         16027
## 4 821.3964 740           <NA>          <NA>
##                           mydb.chrom.col.id mydb.chrom.rt mydb.formula
## 1 TOSOH TSKgel ODS-100V  5um Part no. 21456      329.8511   C10H13N5O5
## 2                                      <NA>            NA         <NA>
## 3 TOSOH TSKgel ODS-100V  5um Part no. 21456      532.0656  C10H14N5O7P
## 4                                      <NA>            NA         <NA>
##   mydb.name
## 1 Guanosine
## 2      <NA>
## 3       AMP
## 4      <NA>

For more details on the searchMsPeaks() method, please see the help page of BiodbMassdbConn.

6 MS/MS

6.1 Defining the database

We create the connector to a CSV file database built with MS2 spectra extracted from Massbank:

db.tsv <- system.file("extdata", "massbank_extract_msms.tsv", package='biodb')
conn <- mybiodb$getFactory()$createConn('mass.csv.file', url=db.tsv)

6.2 Searching for entries

We can search for entries containing certain M/Z values with the following method:

conn$searchForMassSpectra(mz.min=115, mz.max=115.1, max.results=5)

## INFO  [08:58:53.825] Loading file database "/tmp/RtmpPuvVqj/Rinst3c210b213500d9/biodb/extdata/massbank_extract_msms.tsv".

## [1] "AU116602" "AU116606" NA         NA         NA

It is also possible to call it with a tolerance:

conn$searchForMassSpectra(mz=115, mz.tol=0.1, mz.tol.unit='plain', max.results=5)

## [1] "AU116602" "AU116606" NA         NA         NA
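Both results printed above contain NA values after the two accession numbers. A minimal sketch for dropping them before further processing, using a copy of the printed vector rather than a live connector call:

```r
# Copy of the vector printed above (not a live call to the connector).
ids <- c("AU116602", "AU116606", NA, NA, NA)

# Keep only the real accession numbers.
found <- ids[!is.na(ids)]
```

After filtering, `found` holds only the two accession numbers.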

6.3 Spectrum matching

The spectrum to search for must be defined inside a data frame:

spectrum <- data.frame(mz=c(286.1456, 287.1488, 288.1514), rel.int=c(100, 45, 18))

Then we call the msmsSearch() method:

conn$msmsSearch(spectrum, precursor.mz=286.1438, mz.tol=0.1, mz.tol.unit='plain', ms.mode='pos')

##         id     score peak.1 peak.2 peak.3
## 1 AU158001 0.7804225      1      2      3
## 2 AU158002 0.7429446      1      2      4
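To give an intuition of what a spectrum matching score measures, here is a minimal base R sketch using a plain cosine similarity between peak intensities. This is a deliberately simplified stand-in and not the scoring function used by msmsSearch(), so it will not reproduce the scores above. The helper `cosine_score()` and the `other` spectrum are hypothetical.

```r
# Hypothetical helper: plain cosine similarity between a query spectrum and
# a reference spectrum. Each query peak is paired with the closest reference
# peak if it lies within mz.tol, otherwise it is paired with intensity 0.
cosine_score <- function(query, ref, mz.tol) {
  ref_int <- vapply(query$mz, function(m) {
    d <- abs(ref$mz - m)
    if (min(d) <= mz.tol) ref$rel.int[which.min(d)] else 0
  }, numeric(1))
  sum(query$rel.int * ref_int) /
    (sqrt(sum(query$rel.int^2)) * sqrt(sum(ref$rel.int^2)))
}

# The query spectrum used above.
query <- data.frame(mz = c(286.1456, 287.1488, 288.1514),
                    rel.int = c(100, 45, 18))

# A made-up spectrum with no peak close to the query peaks.
other <- data.frame(mz = c(100.05, 150.10), rel.int = c(100, 30))

score_same <- cosine_score(query, query, mz.tol = 0.1)
score_other <- cosine_score(query, other, mz.tol = 0.1)
```

A spectrum compared with itself scores 1, and a spectrum sharing no peak within tolerance scores 0. This simple version lets one reference peak be paired with several query peaks, which a production scoring function would typically prevent.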

7 SQLite connector

The SQLite connector operates in the same way as the CSV file connector, except for the instantiation step, which needs an SQLite file as input instead of a CSV file. Moreover, the SQLite file needs to be in biodb format (the relations need to be created by biodb).

Here is a biodb SQLite mass spectra file, already filled with entries:

sqliteFile <- system.file("extdata", "generated", "massbank_extract_full.sqlite", package='biodb')

We create a connector from this file:

sqliteConn <- mybiodb$getFactory()$createConn('mass.sqlite', url=sqliteFile)

We can search inside this database the same way we have been searching inside the CSV file database:

sqliteConn$searchMsPeaks(mzdf, mz.tol=1e-3, fields=c('accession', 'name', 'formula', 'chebi.id'), prefix='mydb.')

##         mz  rt mydb.accession mydb.chebi.id mydb.formula
## 1 282.0839 334       KNA00776         16750   C10H13N5O5
## 2 283.0623 872       BS003495          <NA>     C16H12O5
## 3 346.0546 536       KNA00596         16027  C10H14N5O7P
## 4 821.3964 740       TY000021          <NA>    C42H62O16
##                         mydb.name
## 1                       Guanosine
## 2 5,7-Dihydroxy-2'-methoxyflavone
## 3                             AMP
## 4               Glycyrrhizic acid

8 Closing biodb instance

Do not forget to terminate your biodb instance once you are done with it:

mybiodb$terminate()

## INFO  [08:58:55.655] Closing BiodbMain instance... 
## INFO  [08:58:55.658] Connector "mass.csv.file" deleted. 
## INFO  [08:58:55.660] Connector "mass.csv.file.1" deleted. 
## INFO  [08:58:55.661] Connector "mass.csv.file.2" deleted. 
## INFO  [08:58:55.664] Connector "mass.sqlite" deleted.

References

Horai, Hisayuki, Masanori Arita, Shigehiko Kanaya, Yoshito Nihei, Tasuku Ikeda, Kazuhiro Suwa, Yuya Ojima, et al. 2010. “MassBank: A Public Repository for Sharing Mass Spectral Data for Life Sciences.” Journal of Mass Spectrometry 45 (7): 703–14. https://doi.org/10.1002/jms.1777.

In-house mass spectra database

10 June 2021
