1 Introduction

Cell-free (cf) DNA enters blood circulation via various biological processes. There are increasing evidences showing cfDNA fragmentation patterns could be used in cancer detection and monitoring. However, a lack of R package for automated analysis of fragmentation data hindered the research in this field. Hence, I wrote this tutorial showing how to use cfDNAPro functions to perform characterisation and visualisation of cfDNA whole genome sequencing data. The functions of cfDNAPro help reach a high degree of clarity, transparency and reproducibility of analyses.

cfDNAPro now includes two sets of functions for data characterisation and visualisation respectively. Data characterisation functions consist of examplePath, callMode, callSize, callMetrics, callPeakDistance, and callValleyDistance;

Data visualisation functions are: plotAllToOne, plotMetrics, plotMode, plotModeSummary, plotPeakDistance, plotValleyDistance, and plotSingleGroup.

Details about each function please refer to their built-in documentations in R. For example, typing ?callMode in R console will show you the user manual of callMode function.

2 Installation

For steady release, install via bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("cfDNAPro")

For develop version, install via github:

if(!require(devtools)) {
  install.packages("devtools", repos="https://cloud.r-project.org")}

devtools::install_github("hw538/cfDNAPro")

3 Fragment Size Distribution of Each Group

In cfDNAPro package, gathering all your insert size data into sub-folders named by cohort name is required, even if when you have only one cohort. Example data were installed with cfDNAPro into the library folder on you server or personal computer. Here is how to get the data path:

library(cfDNAPro)
#> Loading required package: magrittr
#> Registered S3 method overwritten by 'quantmod':
#>   method            from
#>   as.zoo.data.frame zoo
# Get the path to example data in cfDNAPro package library path.
data_path <- examplePath("groups_picard")

The example data has four groups separated into four sub-folders: “cohort_1”, “cohort_2”, “cohort_3”, and “cohort_4”. Each group contains their own samples (i.e. txt files). Feel feel to use list.files(data_path, full.names = TRUE,recursive = TRUE) to check.

3.1 Generate a plot for a single group/cohort.

Alright, you might be wondering how easy it is to generate the fragmentation patterns…Let’s move on… There are data from four groups at the very beginning, we want to simply peep into one cohort (e.g. cohort_1) to confirm that nothing has been messed up. How to plot those three txt files in “cohort_1”?

cohort1_plot <- cfDNAPro::callSize(path = data_path) %>%
  dplyr::filter(group == as.character("cohort_1")) %>%
  cfDNAPro::plotSingleGroup()
#> setting default outfmt to df.
#> setting default input_type to picard.

In cohort1_plot (actually it is a list in R!) we could see three ggplot2 objects: prop_plot, cdf_plot, and one_minus_cdf_plot. This means you could modify those plots using ggplot2 syntax! Imagine how we could multiplex the plot in one figure, here is a good example!

3.2 Matipulate your plots.

library(scales)
library(ggpubr)
#> Loading required package: ggplot2
library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union


# Define a list for the groups/cohorts.
grp_list<-list("cohort_1"="cohort_1",
               "cohort_2"="cohort_2",
               "cohort_3"="cohort_3",
               "cohort_4"="cohort_4")

# Generating the plots and store them in a list.
result<-sapply(grp_list, function(x){
  result <-callSize(path = data_path) %>% 
    dplyr::filter(group==as.character(x)) %>% 
    plotSingleGroup()
}, simplify = FALSE)
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.

# Multiplexing the plots in one figure
multiplex <-
  ggarrange(result$cohort_1$prop_plot + 
              theme(axis.title.x = element_blank()),
            result$cohort_4$prop_plot + 
              theme(axis.title = element_blank()),
            result$cohort_1$cdf_plot,
            result$cohort_4$cdf_plot + 
              theme(axis.title.y = element_blank()),
            labels = c("Cohort 1 (n=3)", "Cohort 4 (n=1)"),
            label.x = 0.2,
            ncol = 2,
            nrow = 2)
multiplex

4 Median Size Metrics

In SESSION 2 we plotted all samples in each group. However, you might be wondering how to calculate the median size fragment distribution of each group and then plot those median distribution together?

# Set an oder for those groups (i.e. the levels of factors).
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")
# Generateplots.
compare_grps<-callMetrics(data_path) %>% plotMetrics(order=order)
#> setting default input_type to picard.

# Modify plots.
p1<-compare_grps$median_prop_plot +
  ylim(c(0, 0.028)) +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_text(size=12,face="bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text( size = 11),
        legend.title = element_blank())
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.

p2<-compare_grps$median_cdf_plot +
  scale_y_continuous(labels = scales::number_format(accuracy = 0.001)) +
  theme(axis.title=element_text(size=12,face="bold")) +
  theme(legend.position = c(0.7, 0.5),
        legend.text = element_text( size = 11),
        legend.title = element_blank())

# Finalize plots.
median_grps<-ggpubr::ggarrange(p1,
                       p2,
                       label.x = 0.3,
                       ncol = 1,
                       nrow = 2
                       )

median_grps

5 Modal Fragment Size

5.1 Bar chart.

You might want to calculate the modal fragment size of each sample. Remember that you could modify the object generated using ggplot2 syntax.

# Set an oder for your groups, it will affect the group oder along x axis!
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")
# Generate mode bin chart.
mode_bin <- callMode(data_path) %>% plotMode(order=order,hline = c(167,111,81))
#> setting default input_type to picard.
#> setting default mincount as 0.
# Show the plot.
mode_bin

5.2 Stacked bar chart.

We have another way to visualize the modal fragment size: stacked bar chart. .

# Set an order for your groups, it will affect the group order along x axis.
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")

# Generate mode stacked bar chart. You could specify how to stratify the modes
# using 'mode_partition' arguments. If other modes exist other than you 
# specified, an 'other' group will be added to the plot.

mode_stacked <- 
  callMode(data_path) %>% 
  plotModeSummary(order=order,
                  mode_partition = list(c(166,167)))
#> setting default input_type to picard.
# Modify the plot using ggplot syntax.
mode_stacked <- mode_stacked + theme(legend.position = "top")
# Show the plot.
mode_stacked

6 Inter-peak/valley Distance

10 bp periodical oscillations were observed in cell-free DNA fragment size distributions. How to calculate them?

6.1 Inter-peak distance

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")
# Plot and modify inter-peak distances.
inter_peak_dist<-callPeakDistance(path = data_path,  limit = c(50, 135)) %>%
  plotPeakDistance(order = order) +
  labs(y="Fraction") +
  theme(axis.title =  element_text(size=12,face="bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.
#> setting the mincount to 0.
#>  setting the xlim to c(7,13). 
#> 
# Show the plot.
inter_peak_dist

6.2 Inter-valley distance

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")
# Plot and modify inter-peak distances.
inter_valley_dist<-callValleyDistance(path = data_path,  
                                      limit = c(50, 135)) %>%
  plotValleyDistance(order = order) +
  labs(y="Fraction") +
  theme(axis.title =  element_text(size=12,face="bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.
#> setting the mincount to 0. 
#>  setting the xlim to c(7,13). 
#> 
# Show the plot.
inter_valley_dist

7 Others

Imagine that you want to label the peaks and valleys in the size distribution of a sample. How to do this? We use cohort_4 sample as an example:

library(ggplot2)
library(cfDNAPro)
# Set the path to the example sample.
exam_path <- examplePath("step6")
# Calculate peaks and valleys.
peaks <- callPeakDistance(path = exam_path) 
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.
valleys <- callValleyDistance(path = exam_path) 
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.
# A line plot showing the fragmentation pattern of the example sample.
exam_plot_all <- callSize(path=exam_path) %>% plotSingleGroup(vline = NULL)
#> setting default outfmt to df.
#> setting default input_type to picard.
# Label peaks and valleys with dashed and solid lines.
exam_plot_prop <- exam_plot_all$prop + 
  coord_cartesian(xlim = c(90,135),ylim = c(0,0.0065)) +
  geom_vline(xintercept=peaks$insert_size, colour="red",linetype="dashed") +
  geom_vline(xintercept = valleys$insert_size,colour="blue")
# Show the plot.
exam_plot_prop
#> Warning: Removed 500 row(s) containing missing values (geom_path).


# Label peaks and valleys with dots.
exam_plot_prop_dot<- exam_plot_all$prop + 
  coord_cartesian(xlim = c(90,135),ylim = c(0,0.0065)) +
  geom_point(data= peaks, 
             mapping = aes(x= insert_size, y= prop),
             color="blue",alpha=0.5,size=3) +
  geom_point(data= valleys, 
             mapping = aes(x= insert_size, y= prop),
             color="red",alpha=0.5,size=3) 
# Show the plot.
exam_plot_prop_dot
#> Warning: Removed 500 row(s) containing missing values (geom_path).

8 Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.0.2      ggpubr_0.4.0     ggplot2_3.3.2    scales_1.1.1    
#> [5] cfDNAPro_1.0.0   magrittr_1.5     BiocStyle_2.18.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyr_1.1.2            carData_3.0-4          TTR_0.24.2            
#>  [4] BiocManager_1.30.10    stats4_4.0.3           GenomeInfoDbData_1.2.4
#>  [7] cellranger_1.1.0       Rsamtools_2.6.0        yaml_2.2.1            
#> [10] pillar_1.4.6           backports_1.1.10       lattice_0.20-41       
#> [13] glue_1.4.2             digest_0.6.27          GenomicRanges_1.42.0  
#> [16] ggsignif_0.6.0         XVector_0.30.0         colorspace_1.4-1      
#> [19] cowplot_1.1.0          htmltools_0.5.0        pkgconfig_2.0.3       
#> [22] broom_0.7.2            magick_2.5.0           haven_2.3.1           
#> [25] bookdown_0.21          zlibbioc_1.36.0        purrr_0.3.4           
#> [28] openxlsx_4.2.3         rio_0.5.16             BiocParallel_1.24.0   
#> [31] tibble_3.0.4           generics_0.0.2         farver_2.0.3          
#> [34] IRanges_2.24.0         car_3.0-10             ellipsis_0.3.1        
#> [37] withr_2.3.0            BiocGenerics_0.36.0    quantmod_0.4.17       
#> [40] crayon_1.3.4           readxl_1.3.1           evaluate_0.14         
#> [43] rstatix_0.6.0          forcats_0.5.0          xts_0.12.1            
#> [46] foreign_0.8-80         tools_4.0.3            data.table_1.13.2     
#> [49] hms_0.5.3              lifecycle_0.2.0        stringr_1.4.0         
#> [52] S4Vectors_0.28.0       munsell_0.5.0          zip_2.1.1             
#> [55] Biostrings_2.58.0      compiler_4.0.3         GenomeInfoDb_1.26.0   
#> [58] rlang_0.4.8            grid_4.0.3             RCurl_1.98-1.2        
#> [61] bitops_1.0-6           labeling_0.4.2         rmarkdown_2.5         
#> [64] gtable_0.3.0           abind_1.4-5            curl_4.3              
#> [67] R6_2.4.1               zoo_1.8-8              knitr_1.30            
#> [70] stringi_1.5.3          parallel_4.0.3         Rcpp_1.0.5            
#> [73] vctrs_0.3.4            tidyselect_1.1.0       xfun_0.18

cfDNAPro

27 October 2020

Abstract

Package