flowPloidy: Getting Started

Tyler Smith

2017-04-07

Overview

This document provides a walk-through of the normal workflow for analyzing flow cytometry (FCM) histograms with flowPloidy. See the vignette “flowPloidy: Flow Cytometry Histograms” (aka “histogram-tour”) for a general overview of FCM histograms.

Installation

Stable Release

flowPloidy is part of the BioConductor repository. You can install the stable version of the package directly from there. First, install bioconductor in your R session:

source("https://bioconductor.org/biocLite.R")
biocLite()

Then install flowPloidy:

biocLite("flowPloidy")

The examples in this vignette, and the flowPloidy help files, use data in the flowPloidyData package, which is also on BioConductor. (You don’t need flowPloidyData to use flowPloidy on your own data, only for the examples).

biocLite("flowPloidyData")

Development Release

The Bioconductor system emphasizes stable packages. As a consequence, updates are only released twice a year. If you need the latest version of flowPloidy, you may prefer to use the developmental version. This is available directly from the GitHub repository, but requires a few more steps to install.

First, you need to install bioconductor, if you haven’t already:

source("https://bioconductor.org/biocLite.R")
biocLite()

Next, you’ll need to install two dependencies from bioconductor:

biocLite("flowCore")
biocLite("flowPloidyData")

You also need the devtools package, if you don’t have it already:

install.packages("devtools", dependencies = TRUE)

Now you can use devtools to import the latest version of flowPloidy direct from the repository:

library("devtools")
install_github("plantarum/flowPloidy", dependencies = TRUE,
               build_vignettes = TRUE)

This should install flowPloidy along with all its dependencies. However, sometimes R gets confused and will complain about missing packages, even though we’ve asked it to automatically install everything that flowPloidy needs (i.e., dependencies = TRUE). If you see messages in the terminal indicating a package name that isn’t found, try installing that package directly and then re-run the install_github line above. You may need to repeat this process several times to get all the dependencies installed.

Loading the Package and Importing Data

You only need to install the package once (other than updating the package to get new versions as they are released). Once that’s done, you can load it from your R session:

library(flowPloidy)

Once you have the package loaded, you’re ready to process your files. You can load single files, but we usually process a directory-full of FCM files at once. For the purposes of this tutorial, we’ll use the sample files provided in the flowPloidyData package, which we installed in the previous section.

library(flowPloidyData)

After loading this package, you’ll have access to the variable flowPloidyFiles. This is a vector containing the full paths to the sample data files on your system. Whenever we use it in the examples below, you can substitute a vector containing the paths to your FCM data files. Subsetting this variable (i.e., flowPloidyFiles[1]) produces a single file name.


NOTE

To generate a list of your own files, use the R function list.files(). For example, if all your files are the directory ~/flow/data/, you can create a vector containing your files with:

my.files <- list.files("~/flow/data", full.names = TRUE)

If the directory contains additional files, you can provide a pattern to match only your FCM files. For example, if your FCM files have the suffix .LMD, you could use this:

my.files <- list.files("~/flow/data", full.names = TRUE, pattern = ".LMD")

Once you’ve generated a list of files, you can pass it to batchFlowHist as we do with flowPloidyFiles in the following example.


Before we can load our flow data, we need to know which channel to use for our histogram. We can see a list of all channel names with the function viewFlowChannels:

viewFlowChannels(flowPloidyFiles[1])
## [1] "FS.INT.LIN"   "SS.INT.LIN"   "TIME"         "FL3.INT.LIN" 
## [5] "FL3.PEAK.LIN" "FS.PEAK.LIN"  "SS.PEAK.LIN"  "FS.TOF.LIN"  
## [9] "SS.TOF.LIN"
## or viewFlowcChannels("my-flow-file.LMD")

For our flow cytometer, the primary data channel is called “FL3.INT.LIN”. Knowing this, we can now import all the files in a directory with the function batchFlowHist():

batch1 <- batchFlowHist(flowPloidyFiles, channel = "FL3.INT.LIN")
## processing /home/biocbuild/bbs-3.5-bioc/R/library/flowPloidyData/extdata//188-15.LMD
## analyzing 188-15.LMD
## processing /home/biocbuild/bbs-3.5-bioc/R/library/flowPloidyData/extdata//222.LMD
## analyzing 222.LMD
##     *** Model Fit Needs Attention: 222.LMD ***
## processing /home/biocbuild/bbs-3.5-bioc/R/library/flowPloidyData/extdata//240+S.LMD
## analyzing 240+S.LMD
##     *** Model Fit Needs Attention: 240+S.LMD ***
## processing /home/biocbuild/bbs-3.5-bioc/R/library/flowPloidyData/extdata//240-4-2+rad.LMD
## analyzing 240-4-2+rad.LMD

[ … output truncated to save space …]

The output indicates that there were problems with some of the histograms (i.e., *** Model Fit Needs Attention ***). This is expected, and we’ll correct the problems in the next step.

Reviewing and Correcting Histogram Analyses

The function browseFlowHist() provides an interactive way to view and correct our analyses.


IMPORTANT

To save your corrections, you need to store the output of this function in a variable. For example myresults <- browseFlowHist(batch1) will store the updated results in the variable myresults. It’s often easiest to update the analyses ‘in place’, assigning the output of browseFlowHist() to the variable you pass to is as an argument, as in the following example.


batch1 <- browseFlowHist(batch1) ## update the histograms in batch1

Histogram presentation

Calling browseFlowHist() opens a window in your internet browser that will display your histograms. The first one looks like this:

On the right we see the histogram. The graphical elements are:

You also see coloured circles. These are the initial peak estimates, blue for sample A, and orange for sample B. The solid circles are the G1 estimates, and there is a hollow circle to indicate the G2 peak (only visible for sample A in this plot).

The upper right corner of the plot contains some additional information regarding the model fit:

For this example, the default settings have worked perfectly. The model fit is acceptable (RCS 2.208), and the CVs for the peaks are well below the usual 5% threshold for acceptable.

One thing we don’t see is an indication of which peak is the standard, and which is our unknown sample. That is because flowPloidy doesn’t know, and has no way to infer this automatically. You will need to indicate this yourself, based on your understanding of the expected peak sizes and ratios, and perhaps running the standard on its own without an unknown. You can tell flowPloidy which peak is the standard by selecting the value in the “Standard Peak” drop-down selector, on the left side of the plot. By default this is “X”, indicating the standard has not been identified. If you set it to “A” or “B”, the value will be stored in the results. If you don’t set the value, you will still be able to recover the peak ratio from your results, but will have to calculate the pg values yourself. We’ll return to this below.

We’ll ignore the other options in the navigation panel for now (and the gating interface below), but will discuss them further below. For now, click the “Next” button to move on to the next histogram.

Correcting a Failed Model Fit

The second histogram didn’t work at all:

Since the model fitting failed, the plot shows only the raw data and the initial estimates. The fitting parameters are also absent.

Looking at the initial estimates, it’s easy to see what went wrong - a false peak in the noise near position 60 was incorrectly identified as a G1 peak. We can correct this by re-positioning the peak estimates. The “Peak” drop-down list at the left allows you to select which peak to move, and left-clicking on the plot will set the new position for that peak.

Moving Peak B to the second big peak, and Peak A to the first peak, the analysis works as expected:

In most cases, the exact placement of the initial peak isn’t critical. As long as you put it somewhere near the top of the appropriate peak, flowPloidy will be able to find the true value. You can confirm this by clicking around within the peaks: you end up with the same result (within a small rounding error) for quite a range of initial peak positions.

Note that the A peak must have a lower mean (i.e., be further left on the histogram) than the B peak. If you try to put the A peak above the B peak, flowPloidy will swap them around for you, to make the lower peak A and the upper peak B. This is necessary for some of the internal model-fitting code to work properly.

Changing Model Components: Debris Model

Moving on to the next histogram, we see that there were also problems with peak detection here:

We can fix this the same way:

Quick and easy. However, our RCS value is now off the charts! Two things are contributing to this. First, RCS is sensitive to a number of factors, not all of which are related to model fit (see ?RCS). Second, the most appropriate way to model histogram debris varies for different samples. As botanists, we will analyze samples from many different species, tissue types, and using different preparation protocols. In some cases, the Single Cut debris model will be most effective; in others, the Multi-Cut model works better. When your analysis produces high RCS values, as in this case, it may be a clue that the other debris model is more appropriate. We can check this by selecting “MC” from the “Debris Model” drop-down list at the left:

With the Multi-Cut debris model, we have a much more sensible RCS value. You can also see the improved fit visually - note the top of the fitted model is much closer to the raw data with the “MC” debris model, especially in the region between 80 and 140.

For the nitty-gritty details of the debris models, see ?DebrisModels. In practice, use whichever one gives you a lower RCS value. Also keep in mind that there is usually only a very small difference between the two options when it comes to peak means and CVs, even if the difference in RCS can be very large.

Local Minima Traps

Despite its power, non-linear regressions can get stuck in ‘local minima’. Unlike linear regression, there is no single unique solution to a non-linear regression. Fitting a model requires testing out different parameters before determining which combination is best. Sometimes, the algorithm gets stuck on incorrect parameters. Luckily, this is usually easily detected by humans.

An example is file 11, “734.LMD” in our sample data:

Once again, we need to correct our peak estimates. If we move the B peak over we get the following fit: