Foreword

topdownr is free and open-source software.

Questions and bugs

For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/sgibb/topdownr/issues) providing as much information as possible, a reproducible example and the output of sessionInfo().

If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.

1 Introduction/Working with topdownr

Load the package.

library("topdownr")

1.1 Importing Files

Some example files are provided in the topdownrdata package. For a full analysis you need a .fasta file with the protein sequence, the .experiments.csv files containing the method information, the .txt files containing the scan header information and the .mzML files with the deconvoluted spectra.

## list.files(topdownrdata::topDownDataPath("myoglobin"))
$csv
[1] ".../20170629_myo/experiments/myo_1211_ETDReagentTarget_1e6_1.experiments.csv.gz"
[2] ".../20170629_myo/experiments/myo_1211_ETDReagentTarget_1e6_2.experiments.csv.gz"
[3] "..."                                                                            

$fasta
[1] ".../20170629_myo/fasta/myoglobin.fasta.gz"
[2] "..."                                      

$mzML
[1] ".../20170629_myo/mzml/myo_1211_ETDReagentTarget_1e6_1.mzML.gz"
[2] ".../20170629_myo/mzml/myo_1211_ETDReagentTarget_1e6_2.mzML.gz"
[3] "..."                                                          

$txt
[1] ".../20170629_myo/header/myo_1211_ETDReagentTarget_1e6_1.txt.gz"
[2] ".../20170629_myo/header/myo_1211_ETDReagentTarget_1e6_2.txt.gz"
[3] "..."                                                           

All these files have to be in a directory. You could import them via readTopDownFiles. This function has some arguments. The most important ones are the path of the directory containing the files, the protein modification (e.g. initiator methionine removal, "Met-loss"), and adducts (e.g. proton transfer often occurs from c to z-fragment after ETD reaction).

## the mass adduct for a proton
H <- 1.0078250321

myoglobin <- readTopDownFiles(
    ## directory path
    path = topdownrdata::topDownDataPath("myoglobin"),
    ## fragmentation types
    type = c("a", "b", "c", "x", "y", "z"),
    ## adducts (add -H/H to c/z and name
    ## them cmH/zpH (c minus H, z plus H)
    adducts = data.frame(
        mass=c(-H, H),
        to=c("c", "z"),
        name=c("cmH", "zpH")),
    ## initiator methionine removal
    modifications = "Met-loss",
    ## don't use neutral loss
    neutralLoss = NULL,
    ## tolerance for fragment matching
    tolerance = 5e-6
)
## Warning in FUN(X[[i]], ...): 61 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 63 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 53 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 55 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 50 FilterString entries modified because of
## duplicated ID for different conditions.

## Warning in FUN(X[[i]], ...): 50 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): ID in FilterString are not sorted in ascending
## order. Introduce own condition ID via 'cumsum'.

## Warning in FUN(X[[i]], ...): ID in FilterString are not sorted in ascending
## order. Introduce own condition ID via 'cumsum'.
myoglobin
## TopDownSet object (7.00 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x5882 (5.21% != 0)
## Number of matched fragments: 372991 
## Intensity range: [87.61;10704001.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).

1.2 The TopDownSet Anatomy

The assembled object is an TopDownSet object. Briefly it is composed of three interconnected tables:

  1. rowViews/fragment data: holds the information on the type of fragments, their modifications and adducts.
  2. colData/condition data: contains the corresponding fragmentation condition for every spectrum.
  3. assayData: contains the intensity of assigned fragments.
TopDownSet anatomy, image adopted from (Morgan et al. 2017).

TopDownSet anatomy, image adopted from (Morgan et al. 2017).

1.3 Technical Details

This section explains the implementation details of the TopDownSet class. It is not necessary to understand everything written here to use topdownr for the analysis of fragmentation data.

The TopDownSet contains the following components: Fragment data, Condition data, Assay data.

1.3.1 Fragment data

rowViews(myoglobin)
## FragmentViews on a 153-letter sequence:
##   GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGH...KHPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG
## Mass:
##   16922.95406
## Modifications:
##   Met-loss
## Views:
##        start end width     mass name   type   z                              
##    [1]     1   1     1    30.03 a1     a      1 [G]                          
##    [2]     1   1     1    58.03 b1     b      1 [G]                          
##    [3]     1   1     1    59.01 z1     z      1 [G]                          
##    [4]     1   1     1    60.02 zpH1   z      1 [G]                          
##    [5]     1   1     1    74.05 cmH1   c      1 [G]                          
##    ...   ... ...   ...      ... ...    ...  ... ...                          
## [1212]     2 153   152 16868.93 zpH152 z      1 [LSDGEWQQVLNV...IAAKYKELGFQG]
## [1213]     1 152   152 16882.96 cmH152 c      1 [GLSDGEWQQVLN...DIAAKYKELGFQ]
## [1214]     1 152   152 16883.97 c152   c      1 [GLSDGEWQQVLN...DIAAKYKELGFQ]
## [1215]     2 153   152 16884.95 y152   y      1 [LSDGEWQQVLNV...IAAKYKELGFQG]
## [1216]     2 153   152 16910.93 x152   x      1 [LSDGEWQQVLNV...IAAKYKELGFQG]

The fragmentation data are represented by an FragmentViews object that is an overloaded XStringViews object. It contains one AAString (the protein sequence) and an IRanges object that stores the start, end (and width) values of the fragments. Additionally it has a DataFrame for the mass, type and z information of each fragment.

1.3.2 Condition data

conditionData(myoglobin)[, 1:5]
## DataFrame with 5882 rows and 5 columns
##                                                                    File
##                                                                   <Rle>
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_01  myo_707_ETDReagentTarget_1e6_1
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_02  myo_707_ETDReagentTarget_1e6_1
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_03  myo_707_ETDReagentTarget_1e6_2
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_04  myo_707_ETDReagentTarget_1e6_2
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_05  myo_707_ETDReagentTarget_1e7_1
## ...                                                                 ...
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_1  myo_1211_ETDReagentTarget_1e7_1
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_2  myo_1211_ETDReagentTarget_1e7_2
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_3  myo_1211_ETDReagentTarget_1e7_2
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_1  myo_1211_ETDReagentTarget_1e7_1
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_2  myo_1211_ETDReagentTarget_1e7_2
##                                          Scan PeaksCount TotIonCurrent
##                                         <Rle>      <Rle>     <numeric>
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_01    13         64      16904906
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_02    14         69      17520226
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_03    13        109      19653865
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_04    14         85      20301836
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_05    13         75      17174796
## ...                                       ...        ...           ...
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_1    293        306       4618113
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_2    294        188       2603466
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_3    295        150       2104302
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_1    295        256       3725114
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_2    297        135       1786274
##                                         RetentionTime
##                                             <numeric>
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_01      65.69495
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_02      80.23453
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_03      65.57049
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_04      80.12018
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_05      65.56137
## ...                                               ...
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_1       6778.102
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_2       6751.959
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_3       6779.711
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_1       6834.398
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_2       6835.891

Condition data is a DataFrame that contains the combined header information for each MS run (combined from method (.experiments.csv files)/scan header (.txt files) table and metadata from the .mzML files).

1.3.3 Assay data

assayData(myoglobin)[206:215, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##    [[ suppressing 10 column names 'C0707.30_1.0e+05_0.0e+00_00.00_00_07_01', 'C0707.30_1.0e+05_0.0e+00_00.00_00_07_02', 'C0707.30_1.0e+05_0.0e+00_00.00_00_07_03' ... ]]
##                                                                          
## z26        .         .         .         .         .   124981.9      .   
## zpH26      .         .         .         .         .        .        .   
## y26   508116.78 454086.72 483405.88 428070.62 374857.8 453643.2 367292.47
## b27        .         .         .         .         .        .        .   
## cmH27      .         .         .         .         .        .        .   
## c27        .         .         .         .         .        .        .   
## x26        .         .         .         .         .        .        .   
## z27    18677.47  24941.96  42561.77  36388.28      .        .    24822.07
## zpH27      .         .         .         .         .        .        .   
## y27   366331.19 377103.00 309914.03 322329.38 323019.6 335588.8 277860.31
##                                  
## z26        .         .        .  
## zpH26      .         .        .  
## y26   389549.88 354230.6 340419.0
## b27        .         .        .  
## cmH27      .         .        .  
## c27        .         .        .  
## x26        .         .        .  
## z27    11467.17      .        .  
## zpH27      .         .        .  
## y27   305515.78 287764.5 271953.4

Assay data is a sparseMatrix from the Matrix package (in detail a dgCMatrix) where the rows correspond to the fragments, the columns to the runs/conditions and the entries to the intensity values. A sparseMatrix is similar to the classic matrix in R but stores just the values that are different from zero.

1.4 Subsetting a TopDownSet

A TopDownSet could be subsetted by the fragment and the condition data.

# select the first 100 fragments
myoglobin[1:100]
## TopDownSet object (3.25 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 100 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;1426.70]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 100x5882 (9.68% != 0)
## Number of matched fragments: 56956 
## Intensity range: [105.70;1076768.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:08] Subsetted 372991 fragments [1216;5882] to 56956 fragments [100;5882].
# select all "c" fragments
myoglobin["c"]
## TopDownSet object (4.22 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 304 
## Theoretical fragment types (1): c
## Theoretical mass range: [74.05;16883.97]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 304x5882 (7.80% != 0)
## Number of matched fragments: 139458 
## Intensity range: [87.61;1203763.75]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:08] Subsetted 372991 fragments [1216;5882] to 139458 fragments [304;5882].
# select just the 100. "c" fragment
myoglobin["c100"]
## TopDownSet object (2.58 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1 
## Theoretical fragment types (1): c
## Theoretical mass range: [11085.96;11085.96]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1x5882 (0.09% != 0)
## Number of matched fragments: 5 
## Intensity range: [1276.91;17056.12]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:08] Subsetted 372991 fragments [1216;5882] to 5 fragments [1;5882].
# select all "a" and "b" fragments but just the first 100 "c"
myoglobin[c("a", "b", paste0("c", 1:100))]
## TopDownSet object (4.30 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 404 
## Theoretical fragment types (3): a, b, c
## Theoretical mass range: [30.03;16866.94]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 5882 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 404x5882 (6.12% != 0)
## Number of matched fragments: 145348 
## Intensity range: [87.61;1630533.12]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:08] Subsetted 372991 fragments [1216;5882] to 145348 fragments [404;5882].
# select condition/run 1 to 10
myoglobin[, 1:10]
## TopDownSet object (0.23 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1 
## Number of scans: 10 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x10 (2.23% != 0)
## Number of matched fragments: 271 
## Intensity range: [7996.26;5499596.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 271 fragments [1216;10].
# select all conditions from one file
myoglobin[, myoglobin$File == "myo_1211_ETDReagentTarget_1e+06_1"]
## TopDownSet object (0.22 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 0 fragments [1216;0].
# select all "c" fragments from a single file
myoglobin["c", myoglobin$File == "myo_1211_ETDReagentTarget_1e+06_1"]
## TopDownSet object (0.10 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 304 
## Theoretical fragment types (1): c
## Theoretical mass range: [74.05;16883.97]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 0 fragments [304;0].

2 Fragmentation Data Analysis of Myoglobin

We follow the following workflow:

topdownr workflow

topdownr workflow

We use the example data loaded in Importing Files.

The data contains several replicates for each fragmentation condition. Before aggregation can be performed we need to remove scans with inadequate injection times and fragments with low intensity or poor intensity reproducibility.

2.1 Filter Conditions on Injection Times

Injection times should be consistent for a particular m/z and particular AGC target. High or low injection times indicate problems with on-the-flight AGC calculation or spray instability for a particular scan. Hence the topdownr automatically calculates median injection time for a given m/z and AGC target combination. The user can choose to remove all scans that deviate more than a certain amount from the corresponding median and/or choose to keep N scans with the lowest deviation from the median for every condition.

Here we show an example of such filtering and the effect on the distribution of injection times.

injTimeBefore <- colData(myoglobin)
injTimeBefore$Status <- "before filtering"

## filtering on max deviation and just keep the
## 2 technical replicates per condition with the
## lowest deviation
myoglobin <- filterInjectionTime(
    myoglobin,
    maxDeviation = log2(3),
    keepTopN = 2
)

myoglobin
## TopDownSet object (4.89 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 3696 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x3696 (5.70% != 0)
## Number of matched fragments: 256215 
## Intensity range: [109.29;8493567.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
injTimeAfter <- colData(myoglobin)
injTimeAfter$Status <- "after filtering"

injTime <- as.data.frame(rbind(injTimeBefore, injTimeAfter))

## use ggplot for visualisation
library("ggplot2")

ggplot(injTime,
    aes(x = as.factor(AgcTarget),
        y = IonInjectionTimeMs,
        group = AgcTarget)) +
    geom_boxplot() +
    facet_grid(Status ~ Mz)

2.2 Filter Fragments on CV

High CV of intensity for a fragment suggests either fragment contamination by another m/z species or problems with deisotoping and we recommend removing all fragments with CV > 30, as shown below.

myoglobin <- filterCv(myoglobin, threshold=30)
myoglobin
## TopDownSet object (4.43 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 3696 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x3696 (4.82% != 0)
## Number of matched fragments: 216651 
## Intensity range: [109.29;8493567.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
## [2017-10-30 21:37:13] 39564 fragments with CV > 30% filtered; 216651 fragments [1216;3696].

2.3 Filter Fragments on Intensity

When optimizing protein fragmentation we also want to focus on the most intense fragments, hence we recommend removing all low intensity fragments from analysis.

Low intensity is defined relatively to the most intense observation for this fragment (i.e. relatively to the maximum value in an assayData row). In the example below all intensity values, which have less than 10% intensity of the highest intensity to their corresponding fragment (in their corresponding row) are removed.

myoglobin <- filterIntensity(myoglobin, threshold=0.1)
myoglobin
## TopDownSet object (3.57 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 3696 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x3696 (3.14% != 0)
## Number of matched fragments: 141020 
## Intensity range: [219.52;8493567.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
## [2017-10-30 21:37:13] 39564 fragments with CV > 30% filtered; 216651 fragments [1216;3696].
## [2017-10-30 21:37:13] 75631 intensity values < 0.1 (relative) filtered; 141020 fragments [1216;3696].

2.4 Data Aggregation

The next step of analysis is aggregating technical replicates of fragmentation conditions (columns of assayData).

myoglobin <- aggregate(myoglobin)
myoglobin
## TopDownSet object (2.13 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216 
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 1852 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x1852 (3.95% != 0)
## Number of matched fragments: 89063 
## Intensity range: [219.52;8492743.50]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
## [2017-10-30 21:37:13] 39564 fragments with CV > 30% filtered; 216651 fragments [1216;3696].
## [2017-10-30 21:37:13] 75631 intensity values < 0.1 (relative) filtered; 141020 fragments [1216;3696].
## [2017-10-30 21:37:14] Aggregated 141020 fragments [1216;3696] to 89063 fragments [1216;1852].

2.5 Random Forest

To examine which of the features (fragmentation parameters) have the highest overall impact for a protein we perform random forest machine learning using the ranger (Wright and Ziegler 2017) R-package.

Before we compute some fragmentation statistics (number of assigned fragments, total assigned intensity, etc.).

library("ranger")

## statistics
head(summary(myoglobin))
##   Fragments     Total      Min       Q1    Median      Mean       Q3
## 1        27 9728506.2 16043.55 36608.61 123825.95 8000.4163 324536.7
## 2        30 7529430.7 21894.47 71422.85 103939.98 6191.9661 215970.9
## 3         9  756978.5 40572.99 49291.46  60046.72  622.5152 121674.7
## 4         3  335156.1 55048.98 90398.54 125748.09  275.6218 140053.6
## 5         2  211671.4 75538.09 90686.89 105835.69  174.0719 120984.5
## 6         0       0.0     0.00     0.00      0.00    0.0000      0.0
##         Max
## 1 4371568.4
## 2 2545802.2
## 3  158469.6
## 4  154359.0
## 5  136133.3
## 6       0.0
## number of fragments
nFragments <- summary(myoglobin)$Fragments

## features of interest
foi <- c(
    "AgcTarget",
    "EtdReagentTarget",
    "EtdActivation",
    "CidActivation",
    "HcdActivation",
    "Charge"
)

rfTable <- as.data.frame(cbind(
    scale(as.data.frame(colData(myoglobin)[foi])),
    Fragments = nFragments
))

featureImportance <- ranger(
    Fragments ~ .,
    data = rfTable,
    importance = "impurity"
)$variable.importance

barplot(
    featureImportance/sum(featureImportance),
    cex.names = 0.7
)

The two parameters having the lowest overall impact in the myoglobin dataset across all conditions are ETD reagent Target (EtdReagentTarget) and CID activation energy (CidActivation), while ETD reaction time (EtdReactionTime) demonstrates the highest overall impact.

2.6 Combining Fragmentation Conditions to Maximize Coverage

The purpose of topdownr is to investigate how maximum coverage with high intensity fragments can be achieved with minimal instrument time. Therefore topdownr reports the best combination of fragmentation conditions (with user specified number of conditions) that covers the highest number of different bonds.

Different fragmentation methods predominantly generate different types of fragments (e.g. b and y for HCD and CID, c and z for ETD, a and x for UVPD).

However N-terminal (a, b and c) as well as C-terminal (x, y and z) fragments originating from the same bond, cover the same number of amino acid sidechains. Hence different types of N-terminal (a, b and c) or C-terminal (x, y and z) fragments from the same bond add no extra sequence information.

Before we compute combinations all the fragments are converted to either N- or C-terminal, as shown in the image below.

Schema of N-/C-terminal fragments or bidirectional

Schema of N-/C-terminal fragments or bidirectional

In topdownr we convert the TopDownSet into an NCBSet object (N-terminal/C-terminal/Bidirectional).

myoglobinNcb <- as(myoglobin, "NCBSet")
myoglobinNcb
## NCBSet object (1.94 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG 
## - - - Fragment data - - -
## Number of N-terminal fragments: 30569
## Number of C-terminal fragments: 29579
## Number of N- and C-terminal fragments: 9306
## - - - Condition data - - -
## Number of conditions: 1852 
## Number of scans: 1852 
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Assay data - - -
## Size of array: 152x1852 (24.67% != 0)
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
## [2017-10-30 21:37:13] 39564 fragments with CV > 30% filtered; 216651 fragments [1216;3696].
## [2017-10-30 21:37:13] 75631 intensity values < 0.1 (relative) filtered; 141020 fragments [1216;3696].
## [2017-10-30 21:37:14] Aggregated 141020 fragments [1216;3696] to 89063 fragments [1216;1852].
## [2017-10-30 21:37:16] Coerced TopDownSet into an NCBSet object; 69454 fragments [152;1852].

An NCBSet is very similar to a TopDownSet but instead of an FragmentViews the rowViews are an XStringViews for the former. Another difference is that the NCBSet has one row per bond instead one row per fragment. Also the assayData contains no intensity information but a 1 for an N-terminal, a 2 for a C-terminal and a 3 for bidirectional fragments.

The NCBSet can be used to select the combination of conditions that provide the best fragment coverage. While computing coverage topdownr awards 1 point for every fragment going from every bond in either N or C directions. This means that bonds covered in both directions increase the score of a condition by 2 points. For the myoglobin fragmentation example we get the following table for the best three conditions:

bestConditions(myoglobinNcb, n=3)
##                                         index fragments
## C0893.10_1.0e+06_1.0e+06_05.00_14_00_1   1065       143
## C1211.70_1.0e+05_0.0e+00_00.00_28_00_05  1253        63
## C0707.30_5.0e+05_5.0e+06_02.50_07_00_1    291        27

2.7 Building a Fragmentation Map

Fragmentation maps allow visualising the type of fragments produced by fragmentation conditions and their overall distribution along the protein backbone. It also illustrates how the combination of conditions results in a cumulative increase in fragment coverage. Shown below is a fragmentation map for myoglobin m/z 707.3, AGC target 1e6 and ETD reagent target of 1e7 for ETD (plotting more conditions is not practical for the vignette):

sel <-
    myoglobin$Mz == 707.3 &
    myoglobin$AgcTarget == 1e6 &
    myoglobin$EtdReagentTarget == 1e7

myoglobinNcbSub <- myoglobinNcb[, sel]

fragmentationMap(
    myoglobinNcbSub,
    nCombinations = 10,
    labels = seq_len(ncol(myoglobinNcbSub))
)

3 Session Information

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ggplot2_2.2.1       ranger_0.8.0        topdownrdata_0.99.3
##  [4] topdownr_1.0.0      Biostrings_2.46.0   XVector_0.18.0     
##  [7] IRanges_2.12.0      S4Vectors_0.16.0    BiocGenerics_0.24.0
## [10] BiocStyle_2.6.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13          highr_0.6             BiocInstaller_1.28.0 
##  [4] compiler_3.4.2        plyr_1.8.4            iterators_1.0.8      
##  [7] ProtGenerics_1.10.0   tools_3.4.2           zlibbioc_1.24.0      
## [10] MALDIquant_1.16.4     digest_0.6.12         preprocessCore_1.40.0
## [13] evaluate_0.10.1       tibble_1.3.4          gtable_0.2.0         
## [16] lattice_0.20-35       rlang_0.1.2           Matrix_1.2-11        
## [19] foreach_1.4.3         yaml_2.1.14           stringr_1.2.0        
## [22] knitr_1.17            rprojroot_1.2         grid_3.4.2           
## [25] Biobase_2.38.0        impute_1.52.0         XML_3.98-1.9         
## [28] BiocParallel_1.12.0   rmarkdown_1.6         bookdown_0.5         
## [31] limma_3.34.0          reshape2_1.4.2        mzR_2.12.0           
## [34] magrittr_1.5          pcaMethods_1.70.0     backports_1.1.1      
## [37] scales_0.5.0          codetools_0.2-15      htmltools_0.3.6      
## [40] mzID_1.16.0           MSnbase_2.4.0         colorspace_1.3-2     
## [43] labeling_0.3          affy_1.56.0           stringi_1.1.5        
## [46] doParallel_1.0.11     lazyeval_0.2.1        munsell_0.4.3        
## [49] vsn_3.46.0            affyio_1.48.0

References

Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2017. SummarizedExperiment: SummarizedExperiment Container.

Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. doi:10.18637/jss.v077.i01.