The **cytofkit** package is designed to facilitate the analysis of mass cytometry data from one or multiple FCS files. User can customize their data merging strategy to combine data from multiple files with proper transformation; apply state-of-the-art clustering methods like *DensVM* [1] and *PhenoGraph* [2] as well as a in-house developed algorithm *ClusterX* to automatically detect cell subsets; visualize the high dimensional data with cluster labels by either linear transformation like *PCA* or non-linear dimensionality reduction like *ISOMAP* [3] or *t-SNE* [4]; map the progression of cellular subtypes estimated by *ISOMAP* or *Diffusion map* [5] and profile the marker expression along the cell progression. In addition to providing the research community with an one-stop analysis platform, **cytofkit** is shipped together with an intuitively designed GUI and an interactive web app to facilitate the easy analysis and interpretation of the mass cytometry data.

A schematic overview of the **cytofkit** pipeline is as below, it contains four major components:

- Pre-processing
- Cell subset detection
- Cell subset visualization and interpretation
- Inference of subset progression.

In this step, one or multiple selected FCS files will be loaded into the R environment using the `read.FCS`

function of **flowCore** package. Expression values of selected markers are then extracted and transformed with a proper transformation method (including *cytofAsinh* (suggest for CyTOF data), *autoLgcl* (suggest for FCM data), *logicle* and *arcsinh*), expression matrixes from multiple files are then combined into a single matrix. For multiple FCS files, **cytofkit** provides four data combining methods including i) *ceil* which samples up to a user specified number of cells without replacement from each FCS file then combine, ii) *all* which takes all cells from each FCS file then combine, iii) *min* which samples a minimum number of cells among all FCS files from each FCS file then combine and iv) *fixed* which samples an user specified number of cells (with replacement when the total number of cell in the file is less than the specified number) from each FCS file then combine. After combination, the whole dataset will be stored in a data matrix, with row names represented as *filename_cellID*, column mames represented as *marker_name*. This step is implemented in funciton

`cytof_exprsMerge`

, read the details of this funcion:`?cytof_exprsMerge`

The subset detection is performed by clustering algorithms. **cytofkit** is shipped with two state-of-the-art clustering methods *DensVM* and *PhenoGraph* and an in-house developed clustering algorithm *ClusterX*. Among them, *DensVM* and *ClusterX* are density-based clustering algorithms estimated from the *t-SNE* embedded map. While *PhenoGraph* is a graph based clustering algorithm performed directly on the high dimensional data. For version later than 1.4.4, we plugged in the *FlowSOM* clustering algorithm [6]. Check the function `cytof_cluster`

for implementation of this step.

`?cytof_cluster`

Three dimensionality reduction methods are integrated into **cytofkit** for visualizing the high dimensional mass cytometry data into an information-condensed and relationship-preserved two-dimensional map. These included one linear transformation method *PCA* and two non-linear transformation methods *ISOMAP* and *t-SNE*. *PCA* is a widely used technique to remove the linear correlation between variables. After *PCA*, data is represented by principal components, and those two or three principal components with highest variance are used to visualize the data. *ISOMAP* is used to embed the high dimensional data points into low dimensions while preserving the intrinsic geometry of the data structure. *t-SNE* finds the two-dimensional representation of high-dimensional data that best preserve both local and global geometry. After dimensional reduction, **cytofkit** plots the cell points into these transformed two-dimensional map with point color representing its cell type detected from last step, and point shape representing which sample the cell is from. To summarize the expression profile of each cell subset, a heat map can be generated to visualize the mean or median expression for each marker in each cell type. This heat map can greatly facilitate the interpretation of known cell types with prior knowledge of their special marker expression features as well as detection of new cell types with novel expression patterns. Check the function `cytof_dimReduction`

for implementation of this step.

`?cytof_dimReduction`

Instead of inferring the subset progression using all cells, we perform the progression path estimation based on the detected subsets of cells. Specifically, we down sample the number of cells to a fixed number for each cell subset, thus to make every subset has the equal number of cells. This removes the potential dominant influence from cell subtypes with most number of cells. Then *Diffusion Map* or *ISOMAP* is used to infer the relationships among the down sampled subsets. *Diffusion Map* and *ISOMAP* are both able to compare the overall relatedness of cells and embed the cells taking into account the nonlinear geodesic distances between cells. By plotting the marker expression along the *Diffusion Map* or *ISOMAP* dimension 1 or dimension 2 with lowess regression, markers with expression patterns that highly correlated with progression can be defined as regulatory markers. Check the function `cytof_progression`

for implementation of this step.

`?cytof_progression`

The cluster results can be visualized with dimension transformed data from *t-SNE*, *PCA* and *ISOMAP*. The cluster coordinates, together with the *t-SNE*, *PCA* and *ISOMAP* coordinates, will be stored to the FCS files as additional parameters and for post analysis, all result files and plots can be saved using the function `cytof_writeResults`

.

`?cytof_writeResults`

Among the saved results, an R data object will be saved with suffix of *.RData*, this is specially for loading to the shiny web APP for interactive visualization and exploration of the analysis results. The shiny APP can be launched with command `cytofkitShinyAPP()`

or through this link https://chenhao.shinyapps.io/cytofkitShinyAPP/.

[1] Becher B, Schlitzer A, Chen J, Mair F, Sumatoh HR, Wei K, et al. High-dimensional analysis of the murine myeloid cell system. Nat Immunol. 2014;15.doi:10.1038/ni.3006.nature

[2] Levine JH, Simonds EF, Bendall SC, Davis KL, Amir ED, Tadmor MD, et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell. Elsevier Inc.; 2015; 1???14. doi:10.1016/j.cell.2015.05.047

[3] Tenenbaum JB, de Silva V, Langford JC, Silva V De, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. American Association for the Advancement of Science; 2000;290: 2319-23. doi:10.1126/science.290.5500.2319

[4] Maaten L Van Der. Accelerating t-SNE using Tree-Based Algorithms. J Mach Learn Res. 2014;15: 1-21.

[5] Haghverdi L, Buettner F, Theis FJ. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics. 2015; 1-10. doi:10.1093/bioinformatics/btv325

[6] Van Gassen S, Callebaut B, Van Helden MJ, Lambrecht BN, Demeester P, Dhaene T, et al. FlowSOM: Using Self-Organizing Maps for Visualization and Interpretation of Cytometry Data. Cytometry A. 2015; 1-10. doi:10.1002/cyto.a.22625