Load the required packages
This Vignette provides an example workflow for how to use the package MSstatsLOBD.
To install this package, start R (version “4.0”) and enter:
We will estimate the LOB/LOD for a few peptides an assay available on the CPTAC (Clinical Proteomic Tumor Analysis Consortium) assay portal c.f. [@Thomas]. The dataset contains spike in data for 43 distinct peptides. For each peptide, 8 distinct concentration spikes for 3 different replicates are measured. The Skyline files for the assay along with details about the experiment can be obtained from this webpage: https://assays.cancer.gov. The particular dataset examined here (called JHU_DChan_HZhang_ZZhang) can be found at https://panoramaweb.org/labkey/project/CPTAC%20Assay%20Portal/JHU_DChan_HZhang_ZZhang/Serum_QExactive_GlycopeptideEnrichedPRM/begin.view?. It should be downloaded from the MSStats website http://msstats.org/?smd_process_download=1&download_id=548.
The data is then exported in a csv file (calibration_data_raw.csv) from Skyline. This is done in Skyline by selecting File → Export → Report → QuaSAR input and then clicking Export. The csv file contains the measured peak area for each fragment of each light and heavy version of each peptide. Depending on the format of the Skyline file and depending on whether standards were used, the particular outputs obtained in the csv file may vary. In this particular case the following variables are obtained in the output file calibration_data_raw.csv: File Name,Sample Name,Replicate Name,Protein Name,Peptide Sequence,Peptide Modified Sequence,
Precursor Charge,Product Charge, Fragment Ion,Average Measured Retention Time, SampleGroup,IS Spike, Concentration,Replicate,light Area,heavy Area. A number of variables are byproducts of the acquisition process and will not be considered for the following, i.e. File Name,Sample Name,Replicate Name,SampleGroup,IS Spike.
Variables that are important for the assay characterization are detailed below (others are assumed to be self explanatory):
## File.Name Sample.Name Replicate.Name Protein.Name
## 1 Blank_0_1.raw NA Blank_0_1 sp|Q9HDC9|APMAP_HUMAN
## 2 Blank_0_2.raw NA Blank_0_2 sp|Q9HDC9|APMAP_HUMAN
## 3 Blank_0_3.raw NA Blank_0_3 sp|Q9HDC9|APMAP_HUMAN
## 4 A_1.raw NA A_1 sp|Q9HDC9|APMAP_HUMAN
## 5 B_1.raw NA B_1 sp|Q9HDC9|APMAP_HUMAN
## 6 C_1.raw NA C_1 sp|Q9HDC9|APMAP_HUMAN
## Peptide.Sequence Peptide.Modified.Sequence Precursor.Charge Product.Charge
## 1 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 2 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 3 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 4 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 5 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## 6 AGPNGTLFVADAYK AGPN[+1]GTLFVADAYK 2 1
## Fragment.Ion Average.Measured.Retention.Time SampleGroup IS.Spike
## 1 y10 35.19 Bl NA
## 2 y10 35.19 Bl NA
## 3 y10 35.19 Bl NA
## 4 y10 35.19 A NA
## 5 y10 35.19 B NA
## 6 y10 35.19 C NA
## Concentration Replicate light.Area heavy.Area
## 1 0.0000 1 59322 0
## 2 0.0000 2 75627 0
## 3 0.0000 3 62117 0
## 4 0.0576 1 75369 0
## 5 0.2880 1 77955 21216
## 6 1.4400 1 81893 329055
We normalize the intensity of the light peptides using that of the heavy peptides. This corrects any systematic errors that can occur during a run or across replicates. The calculation is greatly simplified by the use of the tidyr and dplyr packages. The area from all the different peptide fragments is first summed then log transformed. The median intensity of the reference heavy peptides medianlog2heavy is calculated. Their intensities should ideally remain constant across runs since the spiked concentration of the heavy peptide is constant. The difference between the median for all the heavy peptide spikes is calculated. It is then used to correct (i.e. to normalize) the intensity of the light peptides log2light to obtain the adjusted intensity log2light_norm. The intensity is finally converted back to original space.
#Select variable that are need
df <- raw_data %>% select(Peptide.Sequence,Precursor.Charge,
Product.Charge, Fragment.Ion,Concentration,
Replicate, light.Area, heavy.Area,
SampleGroup, File.Name)
#Convert factors to numeric and remove NA values:
df <- df %>% mutate(heavy.Area = as.numeric(heavy.Area)) %>%
filter(!is.na(heavy.Area))
#Sum area over all fragments
df2 <- df %>% group_by(Peptide.Sequence, Replicate, SampleGroup,
Concentration, File.Name) %>%
summarize(A_light = sum(light.Area), A_heavy = sum(heavy.Area))
## `summarise()` has grouped output by 'Peptide.Sequence', 'Replicate',
## 'SampleGroup', 'Concentration'. You can override using the `.groups` argument.
#Convert to log scale
df2 <- df2 %>% mutate(log2light = log2(A_light), log2heavy = log2(A_heavy))
#Calculate median of heavy(reference) for a run
df3 <- df2 %>% group_by(Peptide.Sequence) %>%
summarize(medianlog2light = median(log2light),
medianlog2heavy= median(log2heavy))
#Modify light intensity so that the intensity of the heavy is constant (=median) across a run.
df4 <- left_join(df2,df3, by = "Peptide.Sequence") %>%
mutate(log2light_delta = medianlog2light - log2light) %>%
mutate(log2heavy_norm = log2heavy + log2light_delta,
log2light_norm = log2light + log2light_delta) %>%
mutate(A_heavy_norm = 2**log2heavy_norm, A_light_norm = 2**log2light_norm)
#Format the data for MSstats:
#Select the heavy area, concentration, peptide name and Replicate
df_out <- df4 %>% ungroup() %>%
select(A_heavy_norm, Concentration, Peptide.Sequence, Replicate)
#Change the names according to MSStats requirement:
df_out <- df_out %>% rename(INTENSITY = A_heavy_norm,
CONCENTRATION = Concentration,
NAME = Peptide.Sequence, REPLICATE = Replicate)
# We choose NAME as the peptide sequence
head(df_out)
## # A tibble: 6 × 4
## INTENSITY CONCENTRATION NAME REPLICATE
## <dbl> <dbl> <chr> <int>
## 1 34177. 0.0576 AAPAPQEATATFNSTADR 1
## 2 448636. 0.288 AAPAPQEATATFNSTADR 1
## 3 0 0 AAPAPQEATATFNSTADR 1
## 4 0 0 AAPAPQEATATFNSTADR 1
## 5 62968. 0 AAPAPQEATATFNSTADR 1
## 6 43668. 0 AAPAPQEATATFNSTADR 1
In the following we estimate the LOB and LOD for individual peptides. The first step in the estimation is to fit a function to all the (Spiked Concentration, Measured Intensity) points. When the nonlinear_quantlim function is used, the function that is fit automatically adapts to the data. For instance, when the data is linear, a straight line is used, while when a threshold (i.e. a leveling off of the measured intensity at low concentrations) an elbow like function is fit. The fit is called MEAN in the output of function nonlinear_quantlim as shown in Fig.1. Each value of MEAN is given for a particular CONCENTRATION value CONCENTRATION is thus a discretization of x–Spiked Concentration axis. The lower and upper bound of the 90% prediction interval of the fit are called LOW and UP in the output of nonlinear_quantlim. They correspond respectively to the 5% and 95% percentile of predictions.
The second step in the procedure is to estimate the upper bound of the noise in the blank sample (blue dashed line in Fig. 1). It is found by assuming that blank sample measurements are normally distributed.
We define the LOB as the highest apparent concentration of a peptide expected when replicates of a blank sample containing no peptides are measured. This amounts to finding the concentration at the intersection of the fit (which represents the averaged measured intensity) with the 95% upper prediction bound of the noise.
The LOD is defined as the measured concentration value for which the probability of falsely claiming the absence of a peptide in the sample is 0.05, given a probability 0.05 of falsely claiming its presence. Estimating the LOD thus amounts to finding the concentration at the intersection between the 5% percentile line of the prediction interval of the fit (i.e. the lower bound of the 90% prediction interval) and the 95% percentile line of the blank sample. At the LOB concentration, there is an 0.05 probability of false positive and a 50% chance of false negative. At the LOD concentration there is 0.05 probability of false negative and a false positive probability of 0.05 in accordance with its definition. By default, a probability of 0.05 for the LOB/LOD estimation is used but it can be changed, as detailed in the manual.
#Select peptide of interest: LPPGLLANFTLLR
spikeindata <- df_out %>% filter(NAME == "LPPGLLANFTLLR")
#This contains the measured intensity for the peptide of interest
head(spikeindata)
## # A tibble: 6 × 4
## INTENSITY CONCENTRATION NAME REPLICATE
## <dbl> <dbl> <chr> <int>
## 1 26291. 0.0576 LPPGLLANFTLLR 1
## 2 244841. 0.288 LPPGLLANFTLLR 1
## 3 0 0 LPPGLLANFTLLR 1
## 4 774274. 0 LPPGLLANFTLLR 1
## 5 482008. 0 LPPGLLANFTLLR 1
## 6 780792. 0 LPPGLLANFTLLR 1
## CONCENTRATION MEAN LOW UP LOB LOD SLOPE
## 1 0.000000e+00 195487.8 -260478.7 750434.9 1.111985 1.252285 2381531
## 2 1.776357e-15 195487.8 -269892.2 764430.1 1.111985 1.252285 2381531
## 3 5.760000e-02 195487.8 159319.6 224231.8 1.111985 1.252285 2381531
## 4 2.880000e-01 195487.8 140136.0 244624.1 1.111985 1.252285 2381531
## 5 7.040283e-01 347294.4 217594.0 485330.0 1.111985 1.252285 2381531
## 6 1.440000e+00 1426016.6 1194376.2 1631834.4 1.111985 1.252285 2381531
## INTERCEPT NAME METHOD
## 1 -5750522 LPPGLLANFTLLR NONLINEAR
## 2 -5750522 LPPGLLANFTLLR NONLINEAR
## 3 -5750522 LPPGLLANFTLLR NONLINEAR
## 4 -5750522 LPPGLLANFTLLR NONLINEAR
## 5 -5750522 LPPGLLANFTLLR NONLINEAR
## 6 -5750522 LPPGLLANFTLLR NONLINEAR
After estimating LOB/LOD we can plot the results.
#plot results in the directory
plot_quantlim(spikeindata = spikeindata, quantlim_out = quant_out, address = FALSE)
## [[1]]
##
## [[2]]
The threshold is captured by the fit at low concentrations. The MEAN of the output of the function is the red line (mean prediction) in the plots. LOW is the orange line (5% percentile of predictions) while UP is the upper boundary of the red shaded area. The LOB is the concentration at the intersection of the fit and the estimate for the 95% upper bound of the noise (blue line). A more accurate “smoother” fit can be obtained by increasing the number of points Npoints used to discretize the concentration axis (see manual for nonlinear_quantlim).
The nonlinear MSStats function (nonlinear_quantlim) works for all peptides (those with a linear response and those with a non-linear response). We now examine a peptide with a linear behavior.
#Select peptide of interest: FLNDTMAVYEAK
spikeindata2 <- df_out %>% filter(NAME == "FVGTPEVNQTTLYQR")
#This contains the measured intensity for the peptide of interest
head(spikeindata2)
FALSE # A tibble: 6 × 4
FALSE INTENSITY CONCENTRATION NAME REPLICATE
FALSE <dbl> <dbl> <chr> <int>
FALSE 1 323763. 0.0576 FVGTPEVNQTTLYQR 1
FALSE 2 2036098. 0.288 FVGTPEVNQTTLYQR 1
FALSE 3 205. 0 FVGTPEVNQTTLYQR 1
FALSE 4 1431235. 0 FVGTPEVNQTTLYQR 1
FALSE 5 1244348. 0 FVGTPEVNQTTLYQR 1
FALSE 6 1455085. 0 FVGTPEVNQTTLYQR 1
## CONCENTRATION MEAN LOW UP LOB LOD SLOPE
## 1 0.000000e+00 323150.9 -854266.7 1251296.1 0.258187 0.2690786 5020096
## 2 1.776357e-15 323150.9 -597032.1 1354924.7 0.258187 0.2690786 5020096
## 3 5.760000e-02 659090.2 543574.7 758687.1 0.258187 0.2690786 5020096
## 4 2.880000e-01 2002847.6 1943976.7 2068885.9 0.258187 0.2690786 5020096
## 5 7.040283e-01 4429241.1 4298242.3 4539952.7 0.258187 0.2690786 5020096
## 6 1.440000e+00 8721634.3 8543147.5 8923922.8 0.258187 0.2690786 5020096
## INTERCEPT NAME METHOD
## 1 10349527 FVGTPEVNQTTLYQR NONLINEAR
## 2 10349527 FVGTPEVNQTTLYQR NONLINEAR
## 3 10349527 FVGTPEVNQTTLYQR NONLINEAR
## 4 10349527 FVGTPEVNQTTLYQR NONLINEAR
## 5 10349527 FVGTPEVNQTTLYQR NONLINEAR
## 6 10349527 FVGTPEVNQTTLYQR NONLINEAR
#plot results in the directory: "/Users/cyrilg/Desktop/Workflow/Results"
#Change directory appropriately for your computer
plot_quantlim(spikeindata = spikeindata2, quantlim_out = quant_out2,
address = FALSE)
## [[1]]
##
## [[2]]
The plots indicate that the fit is observed to be linear as the response is linear.
## CONCENTRATION MEAN LOW UP LOB LOD SLOPE
## 1 0.000000e+00 -78939.744 -733589.18 395721.20 0.7348839 0.8353927 2381531
## 2 1.776357e-15 -78939.744 -609804.01 379439.89 0.7348839 0.8353927 2381531
## 3 5.760000e-02 1335.281 -26175.01 32163.01 0.7348839 0.8353927 2381531
## 4 2.880000e-01 322435.381 275446.64 365879.82 0.7348839 0.8353927 2381531
## 5 7.040283e-01 902238.866 783119.47 1027269.40 0.7348839 0.8353927 2381531
## 6 1.440000e+00 1927935.880 1691409.40 2157122.74 0.7348839 0.8353927 2381531
## INTERCEPT NAME METHOD
## 1 -5750522 LPPGLLANFTLLR LINEAR
## 2 -5750522 LPPGLLANFTLLR LINEAR
## 3 -5750522 LPPGLLANFTLLR LINEAR
## 4 -5750522 LPPGLLANFTLLR LINEAR
## 5 -5750522 LPPGLLANFTLLR LINEAR
## 6 -5750522 LPPGLLANFTLLR LINEAR
After estimating LOB/LOD we can plot the results.
#plot results in the directory
plot_quantlim(spikeindata = spikeindata, quantlim_out = quant_out, address = FALSE)
## [[1]]
##
## [[2]]
## CONCENTRATION MEAN LOW UP LOB LOD SLOPE
## 1 0.000000e+00 64364.27 -834607.7 1148426.8 0.261008 0.2694706 5020096
## 2 1.776357e-15 64364.27 -832697.5 1039464.3 0.261008 0.2694706 5020096
## 3 5.760000e-02 453782.46 344853.8 568337.5 0.261008 0.2694706 5020096
## 4 2.880000e-01 2011455.20 1958764.7 2062007.5 0.261008 0.2694706 5020096
## 5 7.040283e-01 4824111.08 4715891.6 4942530.8 0.261008 0.2694706 5020096
## 6 1.440000e+00 9799818.92 9614262.0 10011274.2 0.261008 0.2694706 5020096
## INTERCEPT NAME METHOD
## 1 10349524 FVGTPEVNQTTLYQR LINEAR
## 2 10349524 FVGTPEVNQTTLYQR LINEAR
## 3 10349524 FVGTPEVNQTTLYQR LINEAR
## 4 10349524 FVGTPEVNQTTLYQR LINEAR
## 5 10349524 FVGTPEVNQTTLYQR LINEAR
## 6 10349524 FVGTPEVNQTTLYQR LINEAR
After estimating LOB/LOD we can plot the results.
#plot results in the directory
plot_quantlim(spikeindata = spikeindata2, quantlim_out = quant_out, address = FALSE)
## [[1]]
##
## [[2]]
C. Galitzine et al. “Nonlinear regression improves accuracy of characterization of multiplexed mass spectrometric assays.” Molecular & Cellular Proteomics (2018), doi:10.1074/mcp.RA117.000322