Foreword

topdownr is free and open-source software. If you use it, please support the project by citing it in publications:

P.V. Shliaha, S. Gibb, V. Gorshkov, M.S. Jespersen, G.R. Andersen, D. Bailey, J. Schwartz, S. Eliuk, V. Schwämmle, and O.N. Jensen. 2018. Maximizing Sequence Coverage in Top-Down Proteomics By Automated Multi-modal Gas-phase Protein Fragmentation. Analytical Chemistry. DOI: 10.1021/acs.analchem.8b02344

Questions and bugs

For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/sgibb/topdownr/issues), providing as much information as possible, including a reproducible example and the output of sessionInfo().

If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.

1 Introduction/Working with topdownr

Load the package.

library("topdownr")

1.1 Importing Files

Some example files are provided in the topdownrdata package. For a full analysis you need a .fasta file with the protein sequence, the .experiments.csv files containing the method information, the .txt files containing the scan header information and the .mzML files with the deconvoluted spectra.

## list.files(topdownrdata::topDownDataPath("myoglobin"))
$csv
[1] ".../20170629_myo/experiments/myo_1211_ETDReagentTarget_1e6_1.experiments.csv.gz"
[2] ".../20170629_myo/experiments/myo_1211_ETDReagentTarget_1e6_2.experiments.csv.gz"
[3] "..."                                                                            

$fasta
[1] ".../20170629_myo/fasta/myoglobin.fasta.gz"
[2] "..."                                      

$mzML
[1] ".../20170629_myo/mzml/myo_1211_ETDReagentTarget_1e6_1.mzML.gz"
[2] ".../20170629_myo/mzml/myo_1211_ETDReagentTarget_1e6_2.mzML.gz"
[3] "..."                                                          

$txt
[1] ".../20170629_myo/header/myo_1211_ETDReagentTarget_1e6_1.txt.gz"
[2] ".../20170629_myo/header/myo_1211_ETDReagentTarget_1e6_2.txt.gz"
[3] "..."                                                           
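
Before importing, it can be useful to verify that a directory really contains all four file types. The following is a minimal sketch in base R; listByType is a hypothetical helper (not part of topdownr), the file names are made up, and the patterns are only assumptions derived from the file extensions listed above.

```r
## Hypothetical helper, not part of topdownr: group the files below a
## directory by the four file types that a full analysis needs.
listByType <- function(path) {
    patterns <- c(
        csv = "\\.experiments\\.csv(\\.gz)?$",
        fasta = "\\.fasta(\\.gz)?$",
        mzML = "\\.mzML(\\.gz)?$",
        txt = "\\.txt(\\.gz)?$"
    )
    files <- list.files(path, recursive = TRUE)
    lapply(patterns, function(p) grep(p, files, value = TRUE))
}

## toy directory with empty placeholder files (made-up names)
d <- tempfile("topdown-demo-")
dir.create(d)
invisible(file.create(file.path(d, c(
    "run_1.experiments.csv.gz", "run_1.txt.gz",
    "run_1.mzML.gz", "protein.fasta.gz"
))))

byType <- listByType(d)
## number of files found per type
lengths(byType)
## types without any file (empty if the directory is complete)
missingTypes <- names(byType)[lengths(byType) == 0L]
missingTypes
```

For a real data set the path would be the directory that is later passed to readTopDownFiles.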

All these files have to be in a directory. You can import them with readTopDownFiles. Its most important arguments are the path of the directory containing the files, the protein modifications (e.g. initiator methionine removal, "Met-loss") and the adducts (e.g. a proton transfer from the c- to the z-fragment, which often occurs after the ETD reaction).

## adduct mass for the proton transfer (monoisotopic mass of a hydrogen atom)
H <- 1.0078250321

myoglobin <- readTopDownFiles(
    ## directory path
    path = topdownrdata::topDownDataPath("myoglobin"),
    ## fragmentation types
    type = c("a", "b", "c", "x", "y", "z"),
    ## adducts: add -H/+H to c/z and name
    ## them cmH/zpH (c minus H, z plus H)
    adducts = data.frame(
        mass=c(-H, H),
        to=c("c", "z"),
        name=c("cmH", "zpH")),
    ## initiator methionine removal
    modifications = "Met-loss",
    ## don't use neutral loss
    neutralLoss = NULL,
    ## tolerance for fragment matching
    tolerance = 5e-6
)
## Warning in FUN(X[[i]], ...): 61 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 63 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 53 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 55 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 50 FilterString entries modified because of
## duplicated ID for different conditions.

## Warning in FUN(X[[i]], ...): 50 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): ID in FilterString are not sorted in ascending
## order. Introduce own condition ID via 'cumsum'.

## Warning in FUN(X[[i]], ...): ID in FilterString are not sorted in ascending
## order. Introduce own condition ID via 'cumsum'.
myoglobin
## TopDownSet object (7.12 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (61): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 1216x5882 (5.15% != 0)
## Number of matched fragments: 368296 
## Intensity range: [87.61;10704001.00]
## - - - Processing information - - -
## [2019-01-04 21:59:00] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2019-01-04 21:59:01] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2019-01-04 21:59:01] Recalculate median injection time based on: Mz, AgcTarget.
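
The argument tolerance = 5e-6 is reported as "5 ppm" in the processing information, i.e. it is a relative tolerance. The sketch below illustrates what matching within a relative tolerance means in plain base R. It uses made-up masses and is not the matching code of topdownr; in particular, it ignores the ion/fragment strategies for ambiguous matches that are mentioned in the log.

```r
tolerance <- 5e-6   # relative tolerance, i.e. 5 ppm

## made-up theoretical fragment masses and observed peak masses
theoretical <- c(f1 = 500, f2 = 1000, f3 = 2000)
observed <- c(500.002, 1000.010, 1999.995, 1500)

## absolute width of the matching window grows with the mass
theoretical * tolerance

## relative deviation of every observed peak (columns) from every
## theoretical mass (rows)
relDiff <- abs(outer(theoretical, observed, "-")) / theoretical

## a theoretical fragment is matched if any peak is within the tolerance
matched <- apply(relDiff <= tolerance, 1, any)
matched
```

Here f1 (4 ppm off) and f3 (2.5 ppm off) are matched, whereas the closest peak to f2 is 10 ppm away and therefore rejected.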

1.2 The TopDownSet Anatomy

The assembled object is a TopDownSet object. Briefly, it is composed of three interconnected tables:

  1. rowViews/fragment data: holds the information on the type of fragments, their modifications and adducts.
  2. colData/condition data: contains the corresponding fragmentation condition for every spectrum.
  3. assayData: contains the intensity of assigned fragments.

TopDownSet anatomy, image adapted from [Morgan et al. (2017)].
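
To make the idea of three interconnected tables more concrete, here is a toy sketch that mimics the layout with a plain list. The names toy and subsetToy and all values are hypothetical; the real TopDownSet is an S4 class and is not implemented like this.

```r
library(Matrix)

## fragment data: one row per theoretical fragment
fragments <- data.frame(
    name = c("b1", "c1", "y1"),
    type = c("b", "c", "y"),
    stringsAsFactors = FALSE
)
## condition data: one row per spectrum
conditions <- data.frame(
    File = c("run_1", "run_1", "run_2", "run_2"),
    Scan = c(33, 34, 33, 34),
    row.names = paste0("cond_", 1:4),
    stringsAsFactors = FALSE
)
## assay data: fragments (rows) x spectra (columns), intensities as entries
intensity <- sparseMatrix(
    i = c(1, 3, 3), j = c(1, 2, 4), x = c(1500, 230, 410),
    dims = c(nrow(fragments), nrow(conditions)),
    dimnames = list(fragments$name, rownames(conditions))
)
toy <- list(rowViews = fragments, colData = conditions, assay = intensity)

## subsetting has to keep all three tables in sync
subsetToy <- function(x, i, j) {
    list(
        rowViews = x$rowViews[i, , drop = FALSE],
        colData = x$colData[j, , drop = FALSE],
        assay = x$assay[i, j, drop = FALSE]
    )
}
small <- subsetToy(toy, fragments$type == "y", conditions$File == "run_2")
dim(small$assay)
```

The rows of the assay always correspond to the rows of the fragment table and its columns to the rows of the condition table, which is why a subset has to be applied to the matching tables at the same time.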

1.3 Technical Details

This section explains the implementation details of the TopDownSet class. It is not necessary to understand everything written here to use topdownr for the analysis of fragmentation data.

The TopDownSet contains the following components: Fragment data, Condition data, Assay data.

1.3.1 Fragment data

rowViews(myoglobin)
## FragmentViews on a 153-letter sequence:
##   GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGH...KHPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG
## Mass:
##   16922.95406
## Modifications:
##   Met-loss
## Views:
##        start end width     mass name   type   z                              
##    [1]     1   1     1    30.03 a1     a      1 [G]                          
##    [2]     1   1     1    58.03 b1     b      1 [G]                          
##    [3]     1   1     1    59.01 z1     z      1 [G]                          
##    [4]     1   1     1    60.02 zpH1   z      1 [G]                          
##    [5]     1   1     1    74.05 cmH1   c      1 [G]                          
##    ...   ... ...   ...      ... ...    ...  ... ...                          
## [1212]     2 153   152 16868.93 zpH152 z      1 [LSDGEWQQVLNV...IAAKYKELGFQG]
## [1213]     1 152   152 16882.96 cmH152 c      1 [GLSDGEWQQVLN...DIAAKYKELGFQ]
## [1214]     1 152   152 16883.97 c152   c      1 [GLSDGEWQQVLN...DIAAKYKELGFQ]
## [1215]     2 153   152 16884.95 y152   y      1 [LSDGEWQQVLNV...IAAKYKELGFQG]
## [1216]     2 153   152 16910.93 x152   x      1 [LSDGEWQQVLNV...IAAKYKELGFQG]

The fragment data are represented by a FragmentViews object, which is an extended XStringViews object. It contains one AAString (the protein sequence) and an IRanges object that stores the start, end (and width) values of the fragments. Additionally, it has a DataFrame with the mass, type and z information of each fragment.
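
The start/end/width columns in the output above follow a simple pattern: N-terminal fragments start at the first residue, C-terminal fragments end at the last one. The following base R sketch reproduces this pattern for the first ten residues of the sequence, using a plain data.frame instead of IRanges, and shows how an adduct shifts a mass. It is an illustration only, not the FragmentViews implementation.

```r
## first ten residues of the myoglobin sequence shown above
sequence <- "GLSDGEWQQV"
n <- nchar(sequence)
i <- seq_len(n - 1L)

## N-terminal fragments (here: c) start at the first residue,
## C-terminal fragments (here: z) end at the last residue
nterm <- data.frame(name = paste0("c", i), start = 1L, end = i,
                    stringsAsFactors = FALSE)
cterm <- data.frame(name = paste0("z", i), start = n - i + 1L, end = n,
                    stringsAsFactors = FALSE)
views <- rbind(nterm, cterm)
views$width <- views$end - views$start + 1L
views$sequence <- substring(sequence, views$start, views$end)
views[views$name %in% c("c3", "z3"), ]

## an adduct only shifts the mass: z1 (59.01 in the table above) plus H
H <- 1.0078250321
zpH1 <- round(59.01 + H, 2)
zpH1
```

The shifted mass agrees with the zpH1 entry (60.02) in the fragment table.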

1.3.2 Condition data

conditionData(myoglobin)[, 1:5]
## DataFrame with 5882 rows and 5 columns
##                                                                    File
##                                                                   <Rle>
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_1   myo_707_ETDReagentTarget_1e6_1
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_2   myo_707_ETDReagentTarget_1e6_1
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_3   myo_707_ETDReagentTarget_1e6_2
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_4   myo_707_ETDReagentTarget_1e6_2
## C0707.30_1.0e+05_1.0e+06_02.50_14_00_1   myo_707_ETDReagentTarget_1e6_1
## ...                                                                 ...
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_07 myo_1211_ETDReagentTarget_1e7_2
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_08 myo_1211_ETDReagentTarget_5e6_1
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_09 myo_1211_ETDReagentTarget_5e6_1
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_10 myo_1211_ETDReagentTarget_5e6_2
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_11 myo_1211_ETDReagentTarget_5e6_2
##                                              Scan SpectrumIndex PeaksCount
##                                         <numeric>     <integer>  <integer>
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_1         33            22        161
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_2         34            23        175
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_3         33            23        180
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_4         34            24        171
## C0707.30_1.0e+05_1.0e+06_02.50_14_00_1         36            25        172
## ...                                           ...           ...        ...
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_07       223           203        213
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_08       221           202        250
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_09       222           203        145
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_10       223           203        207
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_11       224           204        158
##                                            TotIonCurrent
##                                                <numeric>
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_1  27224936.7177734
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_2  29167765.3955078
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_3   26132872.484375
## C0707.30_1.0e+05_1.0e+06_02.50_07_00_4  25475501.0371094
## C0707.30_1.0e+05_1.0e+06_02.50_14_00_1  27347105.4853516
## ...                                                  ...
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_07 2566120.42312622
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_08 2348707.19299316
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_09 2305899.88635254
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_10 2262800.11270142
## C1211.70_1.0e+06_0.0e+00_00.00_00_35_11  2212188.7298584

Condition data is a DataFrame that contains the combined header information for each MS run. It is combined from the method table (.experiments.csv files), the scan header table (.txt files) and the metadata from the .mzML files.
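
The row names of the condition data appear to encode the variables listed in the processing information (Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation), followed by a running number. Assuming this layout, which is inferred from the output and not documented here, a name could be split back into its parts as sketched below. The column name Replicate for the trailing number is hypothetical.

```r
## two row names copied from the output above
ids <- c(
    "C0707.30_1.0e+05_1.0e+06_02.50_07_00_1",
    "C1211.70_1.0e+06_0.0e+00_00.00_00_35_07"
)
## assumed order of the encoded variables ("Replicate" is a made-up name)
fields <- c("Mz", "AgcTarget", "EtdReagentTarget", "EtdActivation",
            "CidActivation", "HcdActivation", "Replicate")

## drop the leading "C", split at "_" and convert to numbers
parts <- strsplit(sub("^C", "", ids), "_", fixed = TRUE)
parsed <- as.data.frame(do.call(rbind, lapply(parts, as.numeric)))
colnames(parsed) <- fields
rownames(parsed) <- ids
parsed
```

In practice the columns of the condition data themselves should be preferred over parsed names; the sketch is only meant to help with reading the row names.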

1.3.3 Assay data

assayData(myoglobin)[206:215, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 'C0707.30_1.0e+05_1.0e+06_02.50_07_00_1', 'C0707.30_1.0e+05_1.0e+06_02.50_07_00_2', 'C0707.30_1.0e+05_1.0e+06_02.50_07_00_3' ... ]]
##                                                                     
## z26        .        .        .        .        .        .        .  
## zpH26 491328.4 446301.1 407389.1 473200.9 470679.3 493244.8 390025.8
## y26        .        .        .        .        .        .        .  
## b27        .        .        .        .        .        .        .  
## cmH27      .        .        .        .        .        .        .  
## c27        .        .        .        .        .        .        .  
## x26        .        .        .        .        .        .        .  
## z27        .        .        .        .        .        .        .  
## zpH27 534307.6 534135.1 434296.8 436866.2 550887.3 513038.8 460476.4
## y27        .        .        .        .        .        .        .  
##                                  
## z26        .         .        .  
## zpH26 389430.25 496551.3 554295.7
## y26    23648.63      .        .  
## b27        .         .        .  
## cmH27      .         .        .  
## c27        .         .        .  
## x26        .         .        .  
## z27        .         .        .  
## zpH27 456524.97 602207.0 579989.8
## y27        .         .        .

Assay data is a sparseMatrix from the Matrix package (specifically a dgCMatrix) in which the rows correspond to the fragments, the columns to the runs/conditions and the entries to the intensity values. A sparseMatrix is similar to the classic matrix in R but stores only the values that differ from zero.
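
A small sketch with the Matrix package shows how such a matrix behaves. The fragment and condition names and the intensities are toy values; only the class and the functions of the Matrix package are real.

```r
library(Matrix)

## toy intensity matrix: 4 fragments x 5 spectra, only 3 non-zero entries
m <- sparseMatrix(
    i = c(2, 2, 4), j = c(1, 3, 5), x = c(1200, 3400, 560),
    dims = c(4, 5),
    dimnames = list(c("z26", "zpH26", "y26", "b27"), paste0("cond_", 1:5))
)
## numeric sparse matrices are stored column-compressed (dgCMatrix)
class(m)
## number of stored non-zero entries
nnzero(m)
## fraction of non-zero entries, cf. the "% != 0" line of a TopDownSet
nnzero(m) / prod(dim(m))

## conversion to a classic dense matrix; absent entries become 0
dense <- as.matrix(m)
dense["z26", "cond_1"]
```

Because most fragments are not observed in most spectra, storing only the non-zero intensities keeps the object small.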

1.4 Subsetting a TopDownSet

A TopDownSet can be subset by the fragment data (first index) and by the condition data (second index).

# select the first 100 fragments
myoglobin[1:100]
## TopDownSet object (3.40 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 100 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;1426.70]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (61): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 100x5882 (9.68% != 0)
## Number of matched fragments: 56955 
## Intensity range: [105.70;1076768.00]
## - - - Processing information - - -
## [2019-01-04 21:59:00] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2019-01-04 21:59:01] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2019-01-04 21:59:01] Recalculate median injection time based on: Mz, AgcTarget.
## [2019-01-04 21:59:01] Subsetted 368296 fragments [1216;5882] to 56955 fragments [100;5882].
# select all "c" fragments
myoglobin["c"]
## TopDownSet object (4.35 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 304 
## Theoretical fragment types (1): c
## Theoretical mass range: [74.05;16883.97]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (61): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 304x5882 (7.69% != 0)
## Number of matched fragments: 137461 
## Intensity range: [87.61;1203763.75]
## - - - Processing information - - -
## [2019-01-04 21:59:00] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2019-01-04 21:59:01] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2019-01-04 21:59:01] Recalculate median injection time based on: Mz, AgcTarget.
## [2019-01-04 21:59:01] Subsetted 368296 fragments [1216;5882] to 137461 fragments [304;5882].
# select just the 100th "c" fragment
myoglobin["c100"]
## TopDownSet object (2.73 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1 
## Theoretical fragment types (1): c
## Theoretical mass range: [11085.96;11085.96]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (61): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 1x5882 (0.09% != 0)
## Number of matched fragments: 5 
## Intensity range: [1276.91;17056.12]
## - - - Processing information - - -
## [2019-01-04 21:59:00] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2019-01-04 21:59:01] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2019-01-04 21:59:01] Recalculate median injection time based on: Mz, AgcTarget.
## [2019-01-04 21:59:01] Subsetted 368296 fragments [1216;5882] to 5 fragments [1;5882].
# select all "a" and "b" fragments but just the first 100 "c"
myoglobin[c("a", "b", paste0("c", 1:100))]
## TopDownSet object (4.43 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 404 
## Theoretical fragment types (3): a, b, c
## Theoretical mass range: [30.03;16866.94]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (61): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 404x5882 (6.04% != 0)
## Number of matched fragments: 143582 
## Intensity range: [87.61;1630533.12]
## - - - Processing information - - -
## [2019-01-04 21:59:00] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2019-01-04 21:59:01] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2019-01-04 21:59:01] Recalculate median injection time based on: Mz, AgcTarget.
## [2019-01-04 21:59:01] Subsetted 368296 fragments [1216;5882] to 143582 fragments [404;5882].
# select condition/run 1 to 10
myoglobin[, 1:10]
## TopDownSet object (0.26 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 3 
## Number of scans: 10 
## Condition variables (61): File, Scan, ..., Sample, MedianIonInjectionTimeMs
## - - - Intensity data - - -
## Size of array: 1216x10 (8.38% != 0)
## Number of matched fragments: 1019 
## Intensity range: [7872.05;1036892.19]
## - - - Processing information - - -
## [2019-01-04 21:59:00] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2019-01-04 21:59:01] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2019-01-04 21:59:01] Recalculate median injection time based on: Mz, AgcTarget.
## [2019-01-04 21:59:01] Subsetted 368296 fragments [1216;5882] to 1019 fragments [1216;10].
# select all conditions from one file
myoglobin[, myoglobin$File == "myo_1211_ETDReagentTarget_1e+06_1"]
## TopDownSet object (0.24 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Processing information - - -
## [2019-01-04 21:59:00] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2019-01-04 21:59:01] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2019-01-04 21:59:01] Recalculate median injection time based on: Mz, AgcTarget.
## [2019-01-04 21:59:01] Subsetted 368296 fragments [1216;5882] to 0 fragments [1216;0].
# select all "c" fragments from a single file
myoglobin["c", myoglobin$File == "myo_1211_ETDReagentTarget_1e+06_1"]
## TopDownSet object (0.11 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 304 
## Theoretical fragment types (1): c
## Theoretical mass range: [74.05;16883.97]
## - - - Processing information - - -
## [2019-01-04 21:59:00] 368296 fragments [1216;5882] matched (tolerance: 5 ppm, strategies ion/fragment: remove/remove).
## [2019-01-04 21:59:01] Condition names updated based on: Mz, AgcTarget, EtdReagentTarget, EtdActivation, CidActivation, HcdActivation. Order of conditions changed. 1852 conditions.
## [2019-01-04 21:59:01] Recalculate median injection time based on: Mz, AgcTarget.
## [2019-01-04 21:59:01] Subsetted 368296 fragments [1216;5882] to 0 fragments [304;0].
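
In the two file-based examples above the log reports a subset with 0 conditions ([1216;0] and [304;0]). A likely reason is that the string used in the comparison ("..._1e+06_1") does not occur in the File column, whose values are spelled like "myo_707_ETDReagentTarget_1e6_1" in the condition data shown earlier. The following base R sketch reproduces this pitfall with toy stand-ins for the File column and the assay matrix; none of the objects below are topdownr objects.

```r
## toy stand-ins for the File column and the assay matrix
File <- c("myo_1211_ETDReagentTarget_1e6_1",
          "myo_1211_ETDReagentTarget_1e6_1",
          "myo_1211_ETDReagentTarget_1e6_2")
assay <- matrix(1:6, nrow = 2,
                dimnames = list(c("c1", "c2"), paste0("cond_", 1:3)))

## spelling as used in the examples above
wanted <- "myo_1211_ETDReagentTarget_1e+06_1"
## check the spelling before subsetting
found <- wanted %in% File
found
## a comparison that never matches silently selects zero columns
nNone <- ncol(assay[, File == wanted, drop = FALSE])
nNone

## spelling as it occurs in the File column
wanted <- "myo_1211_ETDReagentTarget_1e6_1"
nSome <- ncol(assay[, File == wanted, drop = FALSE])
nSome
```

With the real object, inspecting unique(myoglobin$File) before subsetting should reveal the available spellings in the same way.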

1.5 Plotting a TopDownSet

Each condition represents one spectrum. We can plot a single condition interactively or write all spectra to a PDF file (or any other R device that supports multiple pages/plots).

# plot a single condition
plot(myoglobin[, "C0707.30_1.0e+05_1.0e+06_10.00_00_28_3"])
## [[1]]

## # example to plot the first ten conditions into a pdf
## # (not evaluated in the vignette)
## pdf("topdown-conditions.pdf", paper="a4r", width=12)
## plot(myoglobin[, 1:10])
## dev.off()

plot returns a list of ggplot objects (one item per condition), which can be modified further or investigated interactively by calling plotly::ggplotly().

2 Fragmentation Data Analysis of Myoglobin

We use the following workflow: