Purpose

The purpose of this package is to provide the infrastructure to store, represent and exchange the gated flow data. By this we mean, accessing the samples, groups, transformations, compensation matrices, gates, and population statistics in the gating tree, which is represented as GatingSet object in R.

The GatingSet can be built from scratch within R or imported from flowJo XML workspaces (i.e. .xml or .wsp files) or GatingML files . Note that we cannot import .jo files directly. You will have to save them in XML workspace format.

Import flowJo workspace

The following section walks through opening and importing a flowJo workspace.

Opening a Workspace

We represent flowJo workspaces using flowjo_workspace objects. We only need to know the path to, and filename of the flowJo workspace.

library(flowWorkspace)
path <- system.file("extdata",package="flowWorkspaceData");
wsfile <- list.files(path, pattern="A2004Analysis.xml", full = TRUE)

In order to open this workspace we need CytoML package:

library(CytoML)
ws <- open_flowjo_xml(wsfile)
ws
## FlowJo Workspace Version  2.0 
## File location:  /home/biocbuild/bbs-3.10-bioc/R/library/flowWorkspaceData/extdata 
## File name:  A2004Analysis.xml 
## Workspace is open. 
## 
## Groups in Workspace
##          Name Num.Samples
## 1 All Samples           2

We see that this a version 2.0 workspace file. It's location and filename are printed. Additionally, you are notified that the workspace file is open. This refers to the fact that the XML document is internally represented using 'C' data structures from the XML package. After importing the file, the workspace must be explicitly closed using flowjo_ws_close() in order to free up that memory.

With the workspace file open,some basic sample information can be accessed through some helper methods. For example, the list of samples in a workspace can be accessed by:

fj_ws_get_samples(ws)
##   sampleID                       name count compID pop.counts
## 1        1 a2004_O1T2pb05i_A1_A01.fcs 61832      1         19
## 2        2 a2004_O1T2pb05i_A2_A02.fcs 45363      1         19

The compID column tells you which compensation matrix to apply to a group of files, and similarly, based on the name of the compensation matrix, which transformations to apply.

And the groups can be accessed by:

fj_ws_get_sample_groups(ws)
##     groupName groupID sampleID
## 1 All Samples       0        1
## 2 All Samples       0        2

Keywords stored in xml workspace can also retrieved by:

sn <- "a2004_O1T2pb05i_A1_A01.fcs"
fj_ws_get_keywords(ws, sn)[1:5]
## $`$BEGINANALYSIS`
## [1] "0"
## 
## $`$BEGINDATA`
## [1] "3803"
## 
## $`$BEGINSTEXT`
## [1] "0"
## 
## $`$BTIM`
## [1] "09:20:24"
## 
## $`$BYTEORD`
## [1] "4,3,2,1"

Parsing the Workspace

These are all retrieved by directly querying xml file. In order to get more information about the gating tree, we need to actually parse the XML workspace into R data structures to represent some of the information therein. Specifically, by calling flowjo_to_gatingset() the user will be presented with a list of groups in the workspace file and need to choose one group to import. Why only one? Because of the way flowJo handles data transformation and compensation. Each group of samples is associated with a compensation matrix and specific data transformation. These are applied to all samples in the group. When a particular group of samples is imported, the package generates a GatingHierarchy for each sample, describing the set of gates applied to the data (note: polygons, rectangles, quadrants, and ovals and boolean gates are supported). The set of GatingHierarchies for the group of samples is stored in a GatingSet object. Calling flowjo_to_gatingset() is quite verbose, informing the user as each gate is created. The parsing can also be done non–interactively by specifying which group to import directly in the function call (either an index or a group name). An additional optional argument execute=T/F specifies whether you want to load, compensate, transform the data and compute statistics immediately after parsing the XML tree. Argument path can be used to specify where the FCS files are stored.

gs <- flowjo_to_gatingset(ws,name = 1); #import the first group
## mac version of flowJo workspace recognized.
## invalid zeroChan: -2147483648
## caused by the invalid biexp parameters!Downcast the biexp to Calibration table instead!
## invalid zeroChan: -2147483648
## caused by the invalid biexp parameters!Downcast the biexp to Calibration table instead!
## invalid zeroChan: -2147483648
## caused by the invalid biexp parameters!Downcast the biexp to Calibration table instead!
## invalid zeroChan: -2147483648
## caused by the invalid biexp parameters!Downcast the biexp to Calibration table instead!
## invalid zeroChan: -2147483648
## caused by the invalid biexp parameters!Downcast the biexp to Calibration table instead!
## invalid zeroChan: -2147483648
## caused by the invalid biexp parameters!Downcast the biexp to Calibration table instead!
## invalid zeroChan: -2147483648
## caused by the invalid biexp parameters!Downcast the biexp to Calibration table instead!
## invalid zeroChan: -2147483648
## caused by the invalid biexp parameters!Downcast the biexp to Calibration table instead!
## invalid zeroChan: -2147483648
## caused by the invalid biexp parameters!Downcast the biexp to Calibration table instead!
#Lots of output here suppressed for the vignette.
gs
## A GatingSet with 2 samples

We have generated a GatingSet with 2 samples, each of which has 19 associated gates.

To list the samples stored in GatingSet:

sampleNames(gs)
## [1] "a2004_O1T2pb05i_A1_A01.fcs_61832" "a2004_O1T2pb05i_A2_A02.fcs_45363"

Note that it is different from the previous call fj_ws_get_samples on workspace where the latter list all samples stored in xml file and here are the ones actually get parsed. Because sometime not all of samples in xml will be imported for various reason. Also we've seen an extra string _xxx is attached to the end of each sample name. It is due to the argument additional.keys has the default value set to '$TOT'. See more details on these parsing options from How to parse a flowJo workspace.

Import gatingML

We currently support gatingML2.0 files exported from Cytobank system. It can be done with one convevient function parse.gatingML from CytoML package that simply takes file paths of gatingML and FCS.

xmlfile <- system.file("extdata/cytotrol_tcell_cytobank.xml", package = "CytoML")
fcsFiles <- list.files(pattern = "CytoTrol", system.file("extdata", package = "flowWorkspaceData"), full = T)
gs1 <- parse.gatingML(xmlfile, fcsFiles)

If you want to dive into details and sub-steps of the parsing process, see vignette of CytoML.

Basics on GatingSet

Subsets of GatingSet can be accessed using the standard R subset syntax [.

gs[1]
## A GatingSet with 1 samples

At this point we have parsed the workspace file and generate the gating hierarchy associated with each sample imported from the file. The data have been loaded, compensated, and transformed in the workspace, and gating has been executed. The resulting GatingSet contains a replicated analysis of the original flowJo workspace.

We can plot the gating tree:

plot(gs)

plot of chunk plotTree

We can list the nodes (populations) in the gating hierarchy:

gs_get_pop_paths(gs, path = 1)
##  [1] "root"       "Live"       "APC"        "B Cell"     "mDC"       
##  [6] "IFNa+"      "IL-6+"      "IL-12+"     "TNFa+"      "pDC"       
## [11] "IFNa+"      "IL-6+"      "IL-12+"     "TNFa+"      "CD14-MHC2-"
## [16] "Monocytes"  "IFNa+"      "IL-6+"      "IL-12+"     "TNFa+"

Note that path argument specifies the depth of the gating path for each population. As shown, depth of 1 (i.e. leaf or terminal node name) may not be sufficient to uniquely identify each population. The issue can be resolved by increasing the path or simply returning the full path of the node:

gs_get_pop_paths(gs, path = "full")
##  [1] "root"                   "/Live"                  "/Live/APC"             
##  [4] "/Live/APC/B Cell"       "/Live/APC/mDC"          "/Live/APC/mDC/IFNa+"   
##  [7] "/Live/APC/mDC/IL-6+"    "/Live/APC/mDC/IL-12+"   "/Live/APC/mDC/TNFa+"   
## [10] "/Live/APC/pDC"          "/Live/APC/pDC/IFNa+"    "/Live/APC/pDC/IL-6+"   
## [13] "/Live/APC/pDC/IL-12+"   "/Live/APC/pDC/TNFa+"    "/Live/CD14-MHC2-"      
## [16] "/Live/Monocytes"        "/Live/Monocytes/IFNa+"  "/Live/Monocytes/IL-6+" 
## [19] "/Live/Monocytes/IL-12+" "/Live/Monocytes/TNFa+"

But full path may not be neccessary and could be too long to be visualized. So we provide the path = 'auto' option to determine the shortest path that is still unique within the gating tree.

nodelist <- gs_get_pop_paths(gs, path = "auto")
nodelist
##  [1] "root"             "Live"             "APC"              "B Cell"          
##  [5] "mDC"              "mDC/IFNa+"        "mDC/IL-6+"        "mDC/IL-12+"      
##  [9] "mDC/TNFa+"        "pDC"              "pDC/IFNa+"        "pDC/IL-6+"       
## [13] "pDC/IL-12+"       "pDC/TNFa+"        "CD14-MHC2-"       "Monocytes"       
## [17] "Monocytes/IFNa+"  "Monocytes/IL-6+"  "Monocytes/IL-12+" "Monocytes/TNFa+"

We can get the gate associated with the specific population:

node <- nodelist[3]
g <- gs_pop_get_gate(gs, node)
g
## $a2004_O1T2pb05i_A1_A01.fcs_61832
## Polygonal gate 'APC' with 14 vertices in dimensions <PerCP-CY5-5-A> and <PE-CY7-A>
## 
## $a2004_O1T2pb05i_A2_A02.fcs_45363
## Polygonal gate 'APC' with 14 vertices in dimensions <PerCP-CY5-5-A> and <PE-CY7-A>

We can retrieve the population statistics :

gs_pop_get_count_fast(gs)[1:10,]
##                                 name           Population        Parent Count
##  1: a2004_O1T2pb05i_A1_A01.fcs_61832                /Live          root 49484
##  2: a2004_O1T2pb05i_A1_A01.fcs_61832            /Live/APC         /Live  4154
##  3: a2004_O1T2pb05i_A1_A01.fcs_61832     /Live/APC/B Cell     /Live/APC  2311
##  4: a2004_O1T2pb05i_A1_A01.fcs_61832        /Live/APC/mDC     /Live/APC   512
##  5: a2004_O1T2pb05i_A1_A01.fcs_61832  /Live/APC/mDC/IFNa+ /Live/APC/mDC     2
##  6: a2004_O1T2pb05i_A1_A01.fcs_61832  /Live/APC/mDC/IL-6+ /Live/APC/mDC    22
##  7: a2004_O1T2pb05i_A1_A01.fcs_61832 /Live/APC/mDC/IL-12+ /Live/APC/mDC     2
##  8: a2004_O1T2pb05i_A1_A01.fcs_61832  /Live/APC/mDC/TNFa+ /Live/APC/mDC    72
##  9: a2004_O1T2pb05i_A1_A01.fcs_61832        /Live/APC/pDC     /Live/APC   433
## 10: a2004_O1T2pb05i_A1_A01.fcs_61832  /Live/APC/pDC/IFNa+ /Live/APC/pDC     0
##     ParentCount
##  1:       61832
##  2:       49484
##  3:        4154
##  4:        4154
##  5:         512
##  6:         512
##  7:         512
##  8:         512
##  9:        4154
## 10:         433

We can plot individual gates: note the scale of the transformed axes. Second argument is the node path of any depths as long as it is uniquely identifieable.

plotGate(gs, "pDC")

plot of chunk plotGate-nodeName More details about gate visualization can be found here.

If we have metadata associated with the experiment, it can be attached to the GatingSet.

d <- data.frame(sample=factor(c("sample 1", "sample 2")),treatment=factor(c("sample","control")) )
pd <- pData(gs)
pd <- cbind(pd,d)
pData(gs) <- pd
pData(gs)
##                                                        name   sample treatment
## a2004_O1T2pb05i_A1_A01.fcs_61832 a2004_O1T2pb05i_A1_A01.fcs sample 1    sample
## a2004_O1T2pb05i_A2_A02.fcs_45363 a2004_O1T2pb05i_A2_A02.fcs sample 2   control

We cann subset the GatingSet by its pData directly:

subset(gs, treatment == "control")
## A GatingSet with 1 samples

The underling flow data(flowSet or ncdfFlowSet ) can be retrieved by:

fs <- gs_pop_get_data(gs)
class(fs)
## [1] "ncdfFlowSet"
## attr(,"package")
## [1] "ncdfFlow"
nrow(fs[[1]])
## events 
##  61832

Note that the data is already compensated and transformed during the parsing.

We can retrieve the subset of data associated with a population node:

fs <- gs_pop_get_data(gs, node)
nrow(fs[[1]])
## events 
##   4154

GatingHierarchy

We can retrieve a single gating hierarchical tree (corresponding to one sample) by [[ operator

gh <- gs[[1]]
gh
## Sample:  a2004_O1T2pb05i_A1_A01.fcs_61832 
## GatingHierarchy with  20  gates

Note that the index can be either numeric or character (the guid returned by sampleNames method)

We can do similar operations on this GatingHierarchy object and same methods behave differently from GatingSet

head(gh_pop_compare_stats(gh))
##    openCyto.freq    xml.freq openCyto.count xml.count      node
## 1:    1.00000000 1.000000000          61832     61832      root
## 2:    0.80029758 0.801235606          49484     49542      Live
## 3:    0.08394633 0.083585645           4154      4141       APC
## 4:    0.55633125 0.548418256           2311      2271    B Cell
## 5:    0.12325469 0.121226757            512       502       mDC
## 6:    0.00390625 0.003984064              2         2 mDC/IFNa+

Here gh_pop_compare_stats returns both the stats directly stored in flowJo xml workspace and one calcuated by GatingSet through the gating. There are could be minor difference between the two due to the numerical errors. However the difference should not be significant. Therore this can be used as the validity check for the parsing accuracy.

gh_plot_pop_count_cv(gh)

plot of chunk gh_plot_pop_count_cv

plotGate method without specifying any node will layout all the gates in the same plot

plotGate(gh)

plot of chunk unnamed-chunk-5

We can retrieve the indices specifying if an event is included inside or outside a gate using:

table(gh_pop_get_indices(gh,node))
## 
## FALSE  TRUE 
## 57678  4154

The indices returned are relative to the parent population (member of parent AND member of current gate), so they reflect the true hierarchical gating structure.

We can retrieve all the compensation matrices from the GatingHierarchy in case we wish to use the compensation or transformation for the new data,

C <- gh_get_compensations(gh);
C
## Compensation object 'defaultCompensation':
##                Am Cyan-A Pacific Blue-A    APC-A APC-CY7-A Alexa 700-A
## Am Cyan-A        1.00000        0.04800 0.000000    0.0000     0.00000
## Pacific Blue-A   0.38600        1.00000 0.000529    0.0000     0.00000
## APC-A            0.00642        0.00235 1.000000    0.0611     0.19800
## APC-CY7-A        0.03270        0.02460 0.084000    1.0000     0.02870
## Alexa 700-A      0.07030        0.05800 0.016200    0.3990     1.00000
## FITC-A           0.74500        0.02090 0.001870    0.0000     0.00000
## PerCP-CY5-5-A    0.00368        0.00178 0.015300    0.0269     0.07690
## PE-CY7-A         0.01330        0.00948 0.000951    0.1380     0.00182
##                   FITC-A PerCP-CY5-5-A PE-CY7-A
## Am Cyan-A       0.028500       0.00104  0.00000
## Pacific Blue-A  0.000546       0.00000  0.00000
## APC-A          -0.000611       0.00776  0.00076
## APC-CY7-A       0.002690       0.00304  0.01010
## Alexa 700-A     0.001530       0.10800  0.00679
## FITC-A          1.000000       0.04180  0.00281
## PerCP-CY5-5-A   0.000000       1.00000  0.07030
## PE-CY7-A        0.002340       0.03360  1.00000

Or we can retrieve transformations:

T <- gh_get_transformations(gh)
names(T)
##  [1] "<Alexa 700-A>"    "<Am Cyan-A>"      "<APC-A>"          "<APC-CY7-A>"     
##  [5] "<FITC-A>"         "<Pacific Blue-A>" "<PE-CY7-A>"       "<PerCP-CY5-5-A>" 
##  [9] "Alexa 700-H"      "Am Cyan-H"        "APC-CY7-H"        "APC-H"           
## [13] "FITC-H"           "Pacific Blue-H"   "PE-CY7-H"         "PerCP-CY5-5-H"
T[[1]]
## function (x, deriv = 0) 
## {
##     deriv <- as.integer(deriv)
##     if (deriv < 0 || deriv > 3) 
##         stop("'deriv' must be between 0 and 3")
##     if (deriv > 0) {
##         z0 <- double(z$n)
##         z[c("y", "b", "c")] <- switch(deriv, list(y = z$b, b = 2 * 
##             z$c, c = 3 * z$d), list(y = 2 * z$c, b = 6 * z$d, 
##             c = z0), list(y = 6 * z$d, b = z0, c = z0))
##         z[["d"]] <- z0
##     }
##     res <- stats:::.splinefun(x, z)
##     if (deriv > 0 && z$method == 2 && any(ind <- x <= z$x[1L])) 
##         res[ind] <- ifelse(deriv == 1, z$y[1L], 0)
##     res
## }
## <bytecode: 0x555ca6385658>
## <environment: 0x555c94711298>
## attr(,"type")
## [1] "biexp"
## attr(,"parameters")
## attr(,"parameters")$channelRange
## [1] 4096
## 
## attr(,"parameters")$maxValue
## [1] 262144
## 
## attr(,"parameters")$neg
## [1] 0
## 
## attr(,"parameters")$pos
## [1] 4.5
## 
## attr(,"parameters")$widthBasis
## [1] -10

gh_get_transformations returns a list of functions to be applied to different dimensions of the data. Above, the transformation is applied to this sample, the appropriate dimension is transformed using a channel–specific function from the list.

Build the gating hierarchy from scratch

GatingSet provides methods to build a gating tree from raw FCS files and add or remove flowCore gates(or populations) to or from it.

Firstly,we start from a flowSet that contains three ungated flow samples:

library(flowCore)
data(GvHD)
#select raw flow data
fs <- GvHD[1:2]

Then construct a \code{GatingSet} from \code{flowSet}:

gs <- GatingSet(fs)

Then compensate it:

cfile <- system.file("extdata","compdata","compmatrix", package="flowCore")
comp.mat <- read.table(cfile, header=TRUE, skip=2, check.names = FALSE)
## create a compensation object 
comp <- compensation(comp.mat)
#compensate GatingSet
gs <- compensate(gs, comp)

New: You can now pass a list of compensation objects with elements named by sampleNames(gs) to achieve sample-specific compensations. e.g.

gs <- compensate(gs, comp.list)

Then we can transform it with any transformation defined by user through trans_new function of scales package.

require(scales)
trans.func <- asinh
inv.func <- sinh
trans.obj <- trans_new("myAsinh", trans.func, inv.func)

The inverse transformation is required so that the gates and data can be visualized in transformed scale with axis label still remains in the raw scale. Optionally breaks and format function can be supplied to further customize the appearance of axis labels.

Besides doing all these by hand, we also provide some buildin transformations: asinhtGml2_trans, flowjo_biexp_trans, flowjo_fasinh_trans and logicle_trans. These are all very commonly used transformations in flow data analysis. User can construct the transform object by simply one-line of code. e.g.

trans.obj <- asinhtGml2_trans()
trans.obj
## Transformer:  asinhtGml2

Once transformer object is created, we must convert it to transformerList for GatingSet to use.

chnls <- colnames(fs)[3:6] 
transList <- transformerList(chnls, trans.obj)

Alternatively, the overloaded estimateLogicle method can be used directly on GatingHierarchy to generate a transformerList object automatically.

estimateLogicle(gs[[1]], chnls)
## $`FL1-H`
## Transformer:  logicle 
## 
## $`FL2-H`
## Transformer:  logicle 
## 
## $`FL3-H`
## Transformer:  logicle 
## 
## $`FL2-A`
## Transformer:  logicle 
## 
## attr(,"class")
## [1] "transformerList" "list"

Now we can transform GatingSet with transformerList object. It will also store the transformation in the GatingSet and can be used to inverse-transform the data.

gs <- transform(gs, transList)
gs_get_pop_paths(gs) 
## [1] "root"

It now only contains the root node. We can add our first rectangleGate:

rg <- rectangleGate("FSC-H"=c(200,400), "SSC-H"=c(250, 400), filterId="rectangle")
nodeID <- gs_pop_add(gs, rg)
nodeID
## [1] 2
gs_get_pop_paths(gs)  
## [1] "root"       "/rectangle"

Note that the gate is added to root node by default if parent is not specified. Then we add a quadGate to the new population generated by the rectangeGate which is named after filterId of the gate because the name is not specified when add method is called.

qg <- quadGate("FL1-H"= 0.2, "FL2-H"= 0.4)
nodeIDs <- gs_pop_add(gs,qg,parent="rectangle")
nodeIDs 
## [1] 3 4 5 6
gs_get_pop_paths(gs)
## [1] "root"                          "/rectangle"                   
## [3] "/rectangle/CD15 FITC-CD45 PE+" "/rectangle/CD15 FITC+CD45 PE+"
## [5] "/rectangle/CD15 FITC+CD45 PE-" "/rectangle/CD15 FITC-CD45 PE-"

Here quadGate produces four population nodes/populations whose names are named after dimensions of gate if not specified.

Boolean Gate can also be defined and added to GatingSet:

bg <- booleanFilter(`CD15 FITC-CD45 PE+|CD15 FITC+CD45 PE-`)
bg
## booleanFilter filter 'CD15 FITC-CD45 PE+|CD15 FITC+CD45 PE-' evaluating the expression:
## CD15 FITC-CD45 PE+|CD15 FITC+CD45 PE-
nodeID2 <- gs_pop_add(gs,bg,parent="rectangle")
nodeID2
## [1] 7
gs_get_pop_paths(gs)
## [1] "root"                                            
## [2] "/rectangle"                                      
## [3] "/rectangle/CD15 FITC-CD45 PE+"                   
## [4] "/rectangle/CD15 FITC+CD45 PE+"                   
## [5] "/rectangle/CD15 FITC+CD45 PE-"                   
## [6] "/rectangle/CD15 FITC-CD45 PE-"                   
## [7] "/rectangle/CD15 FITC-CD45 PE+|CD15 FITC+CD45 PE-"

The gating hierarchy is plotted by:

plot(gs, bool=TRUE)

Note that boolean gate is skipped by default and thus need to be enabled explictily.

Now all the gates are added to the gating tree but the actual data is not gated yet. This is done by calling recompute method explictily:

recompute(gs)

After gating is finished,gating results can be visualized by plotGate method:

plotGate(gs,"rectangle") #plot one Gate

plot of chunk plotGate-rect Multiple gates can be plotted on the same pannel:

plotGate(gs,gs_pop_get_children(gs[[1]], "rectangle")) 

plot of chunk plotGate-multiple We may also want to plot all the gates without specifying the gate index:

plotGate(gs[[1]], bool=TRUE)

If we want to remove one node, simply:

Rm('rectangle', gs)
## Warning: 'Rm' is deprecated.
## Use 'gs_pop_remove' instead.
## See help("Deprecated")
gs_get_pop_paths(gs)
## [1] "root"

As we see,removing one node causes all its descendants to be removed as well.

archive and clone

Oftentime, we need to save a GatingSet including the gated flow data,gates and populations to disk and reload it later on. It can be done by:

tmp <- tempdir()
save_gs(gs,path = file.path(tmp,"my_gs"))
gs <- load_gs(file.path(tmp,"my_gs"))

We also provide the clone method to make a full copy of an existing GatingSet:

gs_gs_cloned <- clone(gs)

Note that the GatingSet contains environment slots and external pointer that point to the internal C data structure. So make sure to use these methods in order to save or make a copy of existing object. The regular R assignment (<-) or save routine doesn't work as expected for the GatingSet object.

Troubleshooting and error reporting

If this package is throwing errors when parsing your workspace, contact the package author by emails for post an issue on https://github.com/RGLab/flowWorkspace/issues. If you can send your workspace by email, we can test, debug, and fix the package so that it works for you. Our goal is to provide a tool that works, and that people find useful.