This vignette describes the fragmentation analysis functionality of the topdownr
package.
topdownr 1.0.0
topdownr is free and open-source software.
For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/sgibb/topdownr/issues) providing as much information as possible, a reproducible example and the output of sessionInfo()
.
If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.
topdownr
Load the package.
library("topdownr")
Some example files are provided in the topdownrdata
package. For a full analysis you need a .fasta
file with the protein sequence, the .experiments.csv
files containing the method information, the .txt
files containing the scan header information and the .mzML
files with the deconvoluted spectra.
## list.files(topdownrdata::topDownDataPath("myoglobin"))
$csv
[1] ".../20170629_myo/experiments/myo_1211_ETDReagentTarget_1e6_1.experiments.csv.gz"
[2] ".../20170629_myo/experiments/myo_1211_ETDReagentTarget_1e6_2.experiments.csv.gz"
[3] "..."
$fasta
[1] ".../20170629_myo/fasta/myoglobin.fasta.gz"
[2] "..."
$mzML
[1] ".../20170629_myo/mzml/myo_1211_ETDReagentTarget_1e6_1.mzML.gz"
[2] ".../20170629_myo/mzml/myo_1211_ETDReagentTarget_1e6_2.mzML.gz"
[3] "..."
$txt
[1] ".../20170629_myo/header/myo_1211_ETDReagentTarget_1e6_1.txt.gz"
[2] ".../20170629_myo/header/myo_1211_ETDReagentTarget_1e6_2.txt.gz"
[3] "..."
All these files have to be in a directory. You could import them via readTopDownFiles
. This function has some arguments. The most important ones are the path
of the directory containing the files, the protein modification
(e.g. initiator methionine removal, "Met-loss"
), and adducts (e.g. proton transfer often occurs from c to z-fragment after ETD reaction).
## the mass adduct for a proton
H <- 1.0078250321
myoglobin <- readTopDownFiles(
## directory path
path = topdownrdata::topDownDataPath("myoglobin"),
## fragmentation types
type = c("a", "b", "c", "x", "y", "z"),
## adducts (add -H/H to c/z and name
## them cmH/zpH (c minus H, z plus H)
adducts = data.frame(
mass=c(-H, H),
to=c("c", "z"),
name=c("cmH", "zpH")),
## initiator methionine removal
modifications = "Met-loss",
## don't use neutral loss
neutralLoss = NULL,
## tolerance for fragment matching
tolerance = 5e-6
)
## Warning in FUN(X[[i]], ...): 61 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 63 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 53 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 55 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 50 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): 50 FilterString entries modified because of
## duplicated ID for different conditions.
## Warning in FUN(X[[i]], ...): ID in FilterString are not sorted in ascending
## order. Introduce own condition ID via 'cumsum'.
## Warning in FUN(X[[i]], ...): ID in FilterString are not sorted in ascending
## order. Introduce own condition ID via 'cumsum'.
myoglobin
## TopDownSet object (7.00 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x5882 (5.21% != 0)
## Number of matched fragments: 372991
## Intensity range: [87.61;10704001.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
TopDownSet
AnatomyThe assembled object is an TopDownSet
object. Briefly it is composed of three interconnected tables:
rowViews
/fragment data: holds the information on the type of fragments, their modifications and adducts.colData
/condition data: contains the corresponding fragmentation condition for every spectrum.assayData
: contains the intensity of assigned fragments.This section explains the implementation details of the TopDownSet
class. It is not necessary to understand everything written here to use topdownr
for the analysis of fragmentation data.
The TopDownSet
contains the following components: Fragment data, Condition data, Assay data.
rowViews(myoglobin)
## FragmentViews on a 153-letter sequence:
## GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGH...KHPGDFGADAQGAMTKALELFRNDIAAKYKELGFQG
## Mass:
## 16922.95406
## Modifications:
## Met-loss
## Views:
## start end width mass name type z
## [1] 1 1 1 30.03 a1 a 1 [G]
## [2] 1 1 1 58.03 b1 b 1 [G]
## [3] 1 1 1 59.01 z1 z 1 [G]
## [4] 1 1 1 60.02 zpH1 z 1 [G]
## [5] 1 1 1 74.05 cmH1 c 1 [G]
## ... ... ... ... ... ... ... ... ...
## [1212] 2 153 152 16868.93 zpH152 z 1 [LSDGEWQQVLNV...IAAKYKELGFQG]
## [1213] 1 152 152 16882.96 cmH152 c 1 [GLSDGEWQQVLN...DIAAKYKELGFQ]
## [1214] 1 152 152 16883.97 c152 c 1 [GLSDGEWQQVLN...DIAAKYKELGFQ]
## [1215] 2 153 152 16884.95 y152 y 1 [LSDGEWQQVLNV...IAAKYKELGFQG]
## [1216] 2 153 152 16910.93 x152 x 1 [LSDGEWQQVLNV...IAAKYKELGFQG]
The fragmentation data are represented by an FragmentViews
object that is an overloaded XStringViews
object. It contains one AAString
(the protein sequence) and an IRanges
object that stores the start
, end
(and width
) values of the fragments. Additionally it has a DataFrame
for the mass
, type
and z
information of each fragment.
conditionData(myoglobin)[, 1:5]
## DataFrame with 5882 rows and 5 columns
## File
## <Rle>
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_01 myo_707_ETDReagentTarget_1e6_1
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_02 myo_707_ETDReagentTarget_1e6_1
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_03 myo_707_ETDReagentTarget_1e6_2
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_04 myo_707_ETDReagentTarget_1e6_2
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_05 myo_707_ETDReagentTarget_1e7_1
## ... ...
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_1 myo_1211_ETDReagentTarget_1e7_1
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_2 myo_1211_ETDReagentTarget_1e7_2
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_3 myo_1211_ETDReagentTarget_1e7_2
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_1 myo_1211_ETDReagentTarget_1e7_1
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_2 myo_1211_ETDReagentTarget_1e7_2
## Scan PeaksCount TotIonCurrent
## <Rle> <Rle> <numeric>
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_01 13 64 16904906
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_02 14 69 17520226
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_03 13 109 19653865
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_04 14 85 20301836
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_05 13 75 17174796
## ... ... ... ...
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_1 293 306 4618113
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_2 294 188 2603466
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_3 295 150 2104302
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_1 295 256 3725114
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_2 297 135 1786274
## RetentionTime
## <numeric>
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_01 65.69495
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_02 80.23453
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_03 65.57049
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_04 80.12018
## C0707.30_1.0e+05_0.0e+00_00.00_00_07_05 65.56137
## ... ...
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_1 6778.102
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_2 6751.959
## C1211.70_1.0e+06_1.0e+07_50.00_28_00_3 6779.711
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_1 6834.398
## C1211.70_1.0e+06_1.0e+07_50.00_35_00_2 6835.891
Condition data is a DataFrame
that contains the combined header information for each MS run (combined from method (.experiments.csv
files)/scan header (.txt
files) table and metadata from the .mzML
files).
assayData(myoglobin)[206:215, 1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'C0707.30_1.0e+05_0.0e+00_00.00_00_07_01', 'C0707.30_1.0e+05_0.0e+00_00.00_00_07_02', 'C0707.30_1.0e+05_0.0e+00_00.00_00_07_03' ... ]]
##
## z26 . . . . . 124981.9 .
## zpH26 . . . . . . .
## y26 508116.78 454086.72 483405.88 428070.62 374857.8 453643.2 367292.47
## b27 . . . . . . .
## cmH27 . . . . . . .
## c27 . . . . . . .
## x26 . . . . . . .
## z27 18677.47 24941.96 42561.77 36388.28 . . 24822.07
## zpH27 . . . . . . .
## y27 366331.19 377103.00 309914.03 322329.38 323019.6 335588.8 277860.31
##
## z26 . . .
## zpH26 . . .
## y26 389549.88 354230.6 340419.0
## b27 . . .
## cmH27 . . .
## c27 . . .
## x26 . . .
## z27 11467.17 . .
## zpH27 . . .
## y27 305515.78 287764.5 271953.4
Assay data is a sparseMatrix
from the Matrix
package (in detail a dgCMatrix
) where the rows correspond to the fragments, the columns to the runs/conditions and the entries to the intensity values. A sparseMatrix
is similar to the classic matrix
in R but stores just the values that are different from zero.
TopDownSet
A TopDownSet
could be subsetted by the fragment and the condition data.
# select the first 100 fragments
myoglobin[1:100]
## TopDownSet object (3.25 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 100
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;1426.70]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 100x5882 (9.68% != 0)
## Number of matched fragments: 56956
## Intensity range: [105.70;1076768.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:08] Subsetted 372991 fragments [1216;5882] to 56956 fragments [100;5882].
# select all "c" fragments
myoglobin["c"]
## TopDownSet object (4.22 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 304
## Theoretical fragment types (1): c
## Theoretical mass range: [74.05;16883.97]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 304x5882 (7.80% != 0)
## Number of matched fragments: 139458
## Intensity range: [87.61;1203763.75]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:08] Subsetted 372991 fragments [1216;5882] to 139458 fragments [304;5882].
# select just the 100. "c" fragment
myoglobin["c100"]
## TopDownSet object (2.58 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1
## Theoretical fragment types (1): c
## Theoretical mass range: [11085.96;11085.96]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1x5882 (0.09% != 0)
## Number of matched fragments: 5
## Intensity range: [1276.91;17056.12]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:08] Subsetted 372991 fragments [1216;5882] to 5 fragments [1;5882].
# select all "a" and "b" fragments but just the first 100 "c"
myoglobin[c("a", "b", paste0("c", 1:100))]
## TopDownSet object (4.30 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 404
## Theoretical fragment types (3): a, b, c
## Theoretical mass range: [30.03;16866.94]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 5882
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 404x5882 (6.12% != 0)
## Number of matched fragments: 145348
## Intensity range: [87.61;1630533.12]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:08] Subsetted 372991 fragments [1216;5882] to 145348 fragments [404;5882].
# select condition/run 1 to 10
myoglobin[, 1:10]
## TopDownSet object (0.23 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1
## Number of scans: 10
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x10 (2.23% != 0)
## Number of matched fragments: 271
## Intensity range: [7996.26;5499596.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 271 fragments [1216;10].
# select all conditions from one file
myoglobin[, myoglobin$File == "myo_1211_ETDReagentTarget_1e+06_1"]
## TopDownSet object (0.22 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 0 fragments [1216;0].
# select all "c" fragments from a single file
myoglobin["c", myoglobin$File == "myo_1211_ETDReagentTarget_1e+06_1"]
## TopDownSet object (0.10 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 304
## Theoretical fragment types (1): c
## Theoretical mass range: [74.05;16883.97]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 0 fragments [304;0].
We follow the following workflow:
We use the example data loaded in Importing Files.
The data contains several replicates for each fragmentation condition. Before aggregation can be performed we need to remove scans with inadequate injection times and fragments with low intensity or poor intensity reproducibility.
Injection times should be consistent for a particular m/z and particular AGC target. High or low injection times indicate problems with on-the-flight AGC calculation or spray instability for a particular scan. Hence the topdownr
automatically calculates median injection time for a given m/z and AGC target combination. The user can choose to remove all scans that deviate more than a certain amount from the corresponding median and/or choose to keep N
scans with the lowest deviation from the median for every condition.
Here we show an example of such filtering and the effect on the distribution of injection times.
injTimeBefore <- colData(myoglobin)
injTimeBefore$Status <- "before filtering"
## filtering on max deviation and just keep the
## 2 technical replicates per condition with the
## lowest deviation
myoglobin <- filterInjectionTime(
myoglobin,
maxDeviation = log2(3),
keepTopN = 2
)
myoglobin
## TopDownSet object (4.89 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 3696
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x3696 (5.70% != 0)
## Number of matched fragments: 256215
## Intensity range: [109.29;8493567.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
injTimeAfter <- colData(myoglobin)
injTimeAfter$Status <- "after filtering"
injTime <- as.data.frame(rbind(injTimeBefore, injTimeAfter))
## use ggplot for visualisation
library("ggplot2")
ggplot(injTime,
aes(x = as.factor(AgcTarget),
y = IonInjectionTimeMs,
group = AgcTarget)) +
geom_boxplot() +
facet_grid(Status ~ Mz)
High CV of intensity for a fragment suggests either fragment contamination by another m/z species or problems with deisotoping and we recommend removing all fragments with CV > 30, as shown below.
myoglobin <- filterCv(myoglobin, threshold=30)
myoglobin
## TopDownSet object (4.43 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 3696
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x3696 (4.82% != 0)
## Number of matched fragments: 216651
## Intensity range: [109.29;8493567.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
## [2017-10-30 21:37:13] 39564 fragments with CV > 30% filtered; 216651 fragments [1216;3696].
When optimizing protein fragmentation we also want to focus on the most intense fragments, hence we recommend removing all low intensity fragments from analysis.
Low intensity is defined relatively to the most intense observation for this fragment (i.e. relatively to the maximum value in an assayData
row). In the example below all intensity values, which have less than 10% intensity of the highest intensity to their corresponding fragment (in their corresponding row) are removed.
myoglobin <- filterIntensity(myoglobin, threshold=0.1)
myoglobin
## TopDownSet object (3.57 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 3696
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x3696 (3.14% != 0)
## Number of matched fragments: 141020
## Intensity range: [219.52;8493567.00]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
## [2017-10-30 21:37:13] 39564 fragments with CV > 30% filtered; 216651 fragments [1216;3696].
## [2017-10-30 21:37:13] 75631 intensity values < 0.1 (relative) filtered; 141020 fragments [1216;3696].
The next step of analysis is aggregating technical replicates of fragmentation conditions (columns of assayData
).
myoglobin <- aggregate(myoglobin)
myoglobin
## TopDownSet object (2.13 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## Mass : 16922.95
## Modifications (1): Met-loss
## - - - Fragment data - - -
## Number of theoretical fragments: 1216
## Theoretical fragment types (6): a, b, c, x, y, z
## Theoretical mass range: [30.03;16910.93]
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 1852
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Intensity data - - -
## Size of array: 1216x1852 (3.95% != 0)
## Number of matched fragments: 89063
## Intensity range: [219.52;8492743.50]
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
## [2017-10-30 21:37:13] 39564 fragments with CV > 30% filtered; 216651 fragments [1216;3696].
## [2017-10-30 21:37:13] 75631 intensity values < 0.1 (relative) filtered; 141020 fragments [1216;3696].
## [2017-10-30 21:37:14] Aggregated 141020 fragments [1216;3696] to 89063 fragments [1216;1852].
To examine which of the features (fragmentation parameters) have the highest overall impact for a protein we perform random forest machine learning using the ranger
(Wright and Ziegler 2017) R
-package.
Before we compute some fragmentation statistics (number of assigned fragments, total assigned intensity, etc.).
library("ranger")
## statistics
head(summary(myoglobin))
## Fragments Total Min Q1 Median Mean Q3
## 1 27 9728506.2 16043.55 36608.61 123825.95 8000.4163 324536.7
## 2 30 7529430.7 21894.47 71422.85 103939.98 6191.9661 215970.9
## 3 9 756978.5 40572.99 49291.46 60046.72 622.5152 121674.7
## 4 3 335156.1 55048.98 90398.54 125748.09 275.6218 140053.6
## 5 2 211671.4 75538.09 90686.89 105835.69 174.0719 120984.5
## 6 0 0.0 0.00 0.00 0.00 0.0000 0.0
## Max
## 1 4371568.4
## 2 2545802.2
## 3 158469.6
## 4 154359.0
## 5 136133.3
## 6 0.0
## number of fragments
nFragments <- summary(myoglobin)$Fragments
## features of interest
foi <- c(
"AgcTarget",
"EtdReagentTarget",
"EtdActivation",
"CidActivation",
"HcdActivation",
"Charge"
)
rfTable <- as.data.frame(cbind(
scale(as.data.frame(colData(myoglobin)[foi])),
Fragments = nFragments
))
featureImportance <- ranger(
Fragments ~ .,
data = rfTable,
importance = "impurity"
)$variable.importance
barplot(
featureImportance/sum(featureImportance),
cex.names = 0.7
)
The two parameters having the lowest overall impact in the myoglobin
dataset across all conditions are ETD reagent Target (EtdReagentTarget
) and CID activation energy (CidActivation
), while ETD reaction time (EtdReactionTime
) demonstrates the highest overall impact.
The purpose of topdownr
is to investigate how maximum coverage with high intensity fragments can be achieved with minimal instrument time. Therefore topdownr
reports the best combination of fragmentation conditions (with user specified number of conditions) that covers the highest number of different bonds.
Different fragmentation methods predominantly generate different types of fragments (e.g. b and y for HCD and CID, c and z for ETD, a and x for UVPD).
However N-terminal (a, b and c) as well as C-terminal (x, y and z) fragments originating from the same bond, cover the same number of amino acid sidechains. Hence different types of N-terminal (a, b and c) or C-terminal (x, y and z) fragments from the same bond add no extra sequence information.
Before we compute combinations all the fragments are converted to either N- or C-terminal, as shown in the image below.
In topdownr
we convert the TopDownSet
into an NCBSet
object (N-terminal/C-terminal/Bidirectional).
myoglobinNcb <- as(myoglobin, "NCBSet")
myoglobinNcb
## NCBSet object (1.94 Mb)
## - - - Protein data - - -
## Amino acid sequence (153): GLSDGEWQQVLNVWGKVEADIAGH...AMTKALELFRNDIAAKYKELGFQG
## - - - Fragment data - - -
## Number of N-terminal fragments: 30569
## Number of C-terminal fragments: 29579
## Number of N- and C-terminal fragments: 9306
## - - - Condition data - - -
## Number of conditions: 1852
## Number of scans: 1852
## Condition variables (58): File, Scan, ..., Sample, Charge
## - - - Assay data - - -
## Size of array: 152x1852 (24.67% != 0)
## - - - Processing information - - -
## [2017-10-30 21:37:07] 372991 fragments [1216;5882] matched (tolerance: 5 ppm).
## [2017-10-30 21:37:09] Subsetted 372991 fragments [1216;5882] to 256215 fragments [1216;3696].
## [2017-10-30 21:37:09] 2186 scans filtered with injection time deviation >= 1.58496250072116 or rank >= 3; 256215 fragments [1216;3696].
## [2017-10-30 21:37:13] 39564 fragments with CV > 30% filtered; 216651 fragments [1216;3696].
## [2017-10-30 21:37:13] 75631 intensity values < 0.1 (relative) filtered; 141020 fragments [1216;3696].
## [2017-10-30 21:37:14] Aggregated 141020 fragments [1216;3696] to 89063 fragments [1216;1852].
## [2017-10-30 21:37:16] Coerced TopDownSet into an NCBSet object; 69454 fragments [152;1852].
An NCBSet
is very similar to a TopDownSet
but instead of an FragmentViews
the rowViews
are an XStringViews
for the former. Another difference is that the NCBSet
has one row per bond instead one row per fragment. Also the assayData
contains no intensity information but a 1
for an N-terminal, a 2
for a C-terminal and a 3
for bidirectional fragments.
The NCBSet
can be used to select the combination of conditions that provide the best fragment coverage. While computing coverage topdownr
awards 1 point for every fragment going from every bond in either N or C directions. This means that bonds covered in both directions increase the score of a condition by 2 points. For the myoglobin fragmentation example we get the following table for the best three conditions:
bestConditions(myoglobinNcb, n=3)
## index fragments
## C0893.10_1.0e+06_1.0e+06_05.00_14_00_1 1065 143
## C1211.70_1.0e+05_0.0e+00_00.00_28_00_05 1253 63
## C0707.30_5.0e+05_5.0e+06_02.50_07_00_1 291 27
Fragmentation maps allow visualising the type of fragments produced by fragmentation conditions and their overall distribution along the protein backbone. It also illustrates how the combination of conditions results in a cumulative increase in fragment coverage. Shown below is a fragmentation map for myoglobin m/z 707.3, AGC target 1e6
and ETD reagent target of 1e7
for ETD (plotting more conditions is not practical for the vignette):
sel <-
myoglobin$Mz == 707.3 &
myoglobin$AgcTarget == 1e6 &
myoglobin$EtdReagentTarget == 1e7
myoglobinNcbSub <- myoglobinNcb[, sel]
fragmentationMap(
myoglobinNcbSub,
nCombinations = 10,
labels = seq_len(ncol(myoglobinNcbSub))
)
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ggplot2_2.2.1 ranger_0.8.0 topdownrdata_0.99.3
## [4] topdownr_1.0.0 Biostrings_2.46.0 XVector_0.18.0
## [7] IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0
## [10] BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.13 highr_0.6 BiocInstaller_1.28.0
## [4] compiler_3.4.2 plyr_1.8.4 iterators_1.0.8
## [7] ProtGenerics_1.10.0 tools_3.4.2 zlibbioc_1.24.0
## [10] MALDIquant_1.16.4 digest_0.6.12 preprocessCore_1.40.0
## [13] evaluate_0.10.1 tibble_1.3.4 gtable_0.2.0
## [16] lattice_0.20-35 rlang_0.1.2 Matrix_1.2-11
## [19] foreach_1.4.3 yaml_2.1.14 stringr_1.2.0
## [22] knitr_1.17 rprojroot_1.2 grid_3.4.2
## [25] Biobase_2.38.0 impute_1.52.0 XML_3.98-1.9
## [28] BiocParallel_1.12.0 rmarkdown_1.6 bookdown_0.5
## [31] limma_3.34.0 reshape2_1.4.2 mzR_2.12.0
## [34] magrittr_1.5 pcaMethods_1.70.0 backports_1.1.1
## [37] scales_0.5.0 codetools_0.2-15 htmltools_0.3.6
## [40] mzID_1.16.0 MSnbase_2.4.0 colorspace_1.3-2
## [43] labeling_0.3 affy_1.56.0 stringi_1.1.5
## [46] doParallel_1.0.11 lazyeval_0.2.1 munsell_0.4.3
## [49] vsn_3.46.0 affyio_1.48.0
Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2017. SummarizedExperiment: SummarizedExperiment Container.
Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. doi:10.18637/jss.v077.i01.