cfDNAPro 1.0.0
Cell-free (cf) DNA enters blood circulation via various biological processes. There are increasing evidences showing cfDNA fragmentation patterns could be used in cancer detection and monitoring. However, a lack of R package for automated analysis of fragmentation data hindered the research in this field. Hence, I wrote this tutorial showing how to use cfDNAPro functions to perform characterisation and visualisation of cfDNA whole genome sequencing data. The functions of cfDNAPro help reach a high degree of clarity, transparency and reproducibility of analyses.
cfDNAPro now includes two sets of functions for data characterisation and
visualisation respectively. Data characterisation functions consist of
examplePath
, callMode
, callSize
, callMetrics
, callPeakDistance
, and
callValleyDistance
;
Data visualisation functions are: plotAllToOne
,
plotMetrics
, plotMode
, plotModeSummary
, plotPeakDistance
,
plotValleyDistance
, and plotSingleGroup
.
Details about each function please
refer to their built-in documentations in R. For example, typing
?callMode
in R console will show you the user manual of callMode
function.
For steady release, install via bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("cfDNAPro")
For develop version, install via github:
if(!require(devtools)) {
install.packages("devtools", repos="https://cloud.r-project.org")}
devtools::install_github("hw538/cfDNAPro")
In cfDNAPro package, gathering all your insert size data into sub-folders named by cohort name is required, even if when you have only one cohort. Example data were installed with cfDNAPro into the library folder on you server or personal computer. Here is how to get the data path:
library(cfDNAPro)
#> Loading required package: magrittr
#> Registered S3 method overwritten by 'quantmod':
#> method from
#> as.zoo.data.frame zoo
# Get the path to example data in cfDNAPro package library path.
data_path <- examplePath("groups_picard")
The example data has four groups separated into four sub-folders:
“cohort_1”, “cohort_2”, “cohort_3”, and “cohort_4”. Each group contains
their own samples (i.e. txt files). Feel feel to use
list.files(data_path, full.names = TRUE,recursive = TRUE)
to check.
Alright, you might be wondering how easy it is to generate the fragmentation patterns…Let’s move on… There are data from four groups at the very beginning, we want to simply peep into one cohort (e.g. cohort_1) to confirm that nothing has been messed up. How to plot those three txt files in “cohort_1”?
cohort1_plot <- cfDNAPro::callSize(path = data_path) %>%
dplyr::filter(group == as.character("cohort_1")) %>%
cfDNAPro::plotSingleGroup()
#> setting default outfmt to df.
#> setting default input_type to picard.
In cohort1_plot (actually it is a list in R!)
we could see three ggplot2 objects:
prop_plot
, cdf_plot
, and one_minus_cdf_plot
.
This means you could modify those plots using ggplot2 syntax! Imagine how we
could multiplex the plot in one figure, here is a good example!
library(scales)
library(ggpubr)
#> Loading required package: ggplot2
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# Define a list for the groups/cohorts.
grp_list<-list("cohort_1"="cohort_1",
"cohort_2"="cohort_2",
"cohort_3"="cohort_3",
"cohort_4"="cohort_4")
# Generating the plots and store them in a list.
result<-sapply(grp_list, function(x){
result <-callSize(path = data_path) %>%
dplyr::filter(group==as.character(x)) %>%
plotSingleGroup()
}, simplify = FALSE)
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
#> setting default outfmt to df.
#> setting default input_type to picard.
# Multiplexing the plots in one figure
multiplex <-
ggarrange(result$cohort_1$prop_plot +
theme(axis.title.x = element_blank()),
result$cohort_4$prop_plot +
theme(axis.title = element_blank()),
result$cohort_1$cdf_plot,
result$cohort_4$cdf_plot +
theme(axis.title.y = element_blank()),
labels = c("Cohort 1 (n=3)", "Cohort 4 (n=1)"),
label.x = 0.2,
ncol = 2,
nrow = 2)
multiplex
In SESSION 2 we plotted all samples in each group. However, you might be wondering how to calculate the median size fragment distribution of each group and then plot those median distribution together?
# Set an oder for those groups (i.e. the levels of factors).
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")
# Generateplots.
compare_grps<-callMetrics(data_path) %>% plotMetrics(order=order)
#> setting default input_type to picard.
# Modify plots.
p1<-compare_grps$median_prop_plot +
ylim(c(0, 0.028)) +
theme(axis.title.x = element_blank(),
axis.title.y = element_text(size=12,face="bold")) +
theme(legend.position = c(0.7, 0.5),
legend.text = element_text( size = 11),
legend.title = element_blank())
#> Scale for 'y' is already present. Adding another scale for 'y', which will
#> replace the existing scale.
p2<-compare_grps$median_cdf_plot +
scale_y_continuous(labels = scales::number_format(accuracy = 0.001)) +
theme(axis.title=element_text(size=12,face="bold")) +
theme(legend.position = c(0.7, 0.5),
legend.text = element_text( size = 11),
legend.title = element_blank())
# Finalize plots.
median_grps<-ggpubr::ggarrange(p1,
p2,
label.x = 0.3,
ncol = 1,
nrow = 2
)
median_grps
You might want to calculate the modal fragment size of each sample. Remember that you could modify the object generated using ggplot2 syntax.
# Set an oder for your groups, it will affect the group oder along x axis!
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")
# Generate mode bin chart.
mode_bin <- callMode(data_path) %>% plotMode(order=order,hline = c(167,111,81))
#> setting default input_type to picard.
#> setting default mincount as 0.
# Show the plot.
mode_bin
We have another way to visualize the modal fragment size: stacked bar chart. .
# Set an order for your groups, it will affect the group order along x axis.
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")
# Generate mode stacked bar chart. You could specify how to stratify the modes
# using 'mode_partition' arguments. If other modes exist other than you
# specified, an 'other' group will be added to the plot.
mode_stacked <-
callMode(data_path) %>%
plotModeSummary(order=order,
mode_partition = list(c(166,167)))
#> setting default input_type to picard.
# Modify the plot using ggplot syntax.
mode_stacked <- mode_stacked + theme(legend.position = "top")
# Show the plot.
mode_stacked
10 bp periodical oscillations were observed in cell-free DNA fragment size distributions. How to calculate them?
# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")
# Plot and modify inter-peak distances.
inter_peak_dist<-callPeakDistance(path = data_path, limit = c(50, 135)) %>%
plotPeakDistance(order = order) +
labs(y="Fraction") +
theme(axis.title = element_text(size=12,face="bold"),
legend.title = element_blank(),
legend.position = c(0.91, 0.5),
legend.text = element_text(size = 11))
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.
#> setting the mincount to 0.
#> setting the xlim to c(7,13).
#>
# Show the plot.
inter_peak_dist
# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")
# Plot and modify inter-peak distances.
inter_valley_dist<-callValleyDistance(path = data_path,
limit = c(50, 135)) %>%
plotValleyDistance(order = order) +
labs(y="Fraction") +
theme(axis.title = element_text(size=12,face="bold"),
legend.title = element_blank(),
legend.position = c(0.91, 0.5),
legend.text = element_text(size = 11))
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.
#> setting the mincount to 0.
#> setting the xlim to c(7,13).
#>
# Show the plot.
inter_valley_dist
Imagine that you want to label the peaks and valleys in the size distribution of a sample. How to do this? We use cohort_4 sample as an example:
library(ggplot2)
library(cfDNAPro)
# Set the path to the example sample.
exam_path <- examplePath("step6")
# Calculate peaks and valleys.
peaks <- callPeakDistance(path = exam_path)
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.
valleys <- callValleyDistance(path = exam_path)
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> setting the mincount to 0.
#> setting default input_type to picard.
# A line plot showing the fragmentation pattern of the example sample.
exam_plot_all <- callSize(path=exam_path) %>% plotSingleGroup(vline = NULL)
#> setting default outfmt to df.
#> setting default input_type to picard.
# Label peaks and valleys with dashed and solid lines.
exam_plot_prop <- exam_plot_all$prop +
coord_cartesian(xlim = c(90,135),ylim = c(0,0.0065)) +
geom_vline(xintercept=peaks$insert_size, colour="red",linetype="dashed") +
geom_vline(xintercept = valleys$insert_size,colour="blue")
# Show the plot.
exam_plot_prop
#> Warning: Removed 500 row(s) containing missing values (geom_path).
# Label peaks and valleys with dots.
exam_plot_prop_dot<- exam_plot_all$prop +
coord_cartesian(xlim = c(90,135),ylim = c(0,0.0065)) +
geom_point(data= peaks,
mapping = aes(x= insert_size, y= prop),
color="blue",alpha=0.5,size=3) +
geom_point(data= valleys,
mapping = aes(x= insert_size, y= prop),
color="red",alpha=0.5,size=3)
# Show the plot.
exam_plot_prop_dot
#> Warning: Removed 500 row(s) containing missing values (geom_path).
Here is the output of sessionInfo() on the system on which this document was compiled:
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.0.2 ggpubr_0.4.0 ggplot2_3.3.2 scales_1.1.1
#> [5] cfDNAPro_1.0.0 magrittr_1.5 BiocStyle_2.18.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyr_1.1.2 carData_3.0-4 TTR_0.24.2
#> [4] BiocManager_1.30.10 stats4_4.0.3 GenomeInfoDbData_1.2.4
#> [7] cellranger_1.1.0 Rsamtools_2.6.0 yaml_2.2.1
#> [10] pillar_1.4.6 backports_1.1.10 lattice_0.20-41
#> [13] glue_1.4.2 digest_0.6.27 GenomicRanges_1.42.0
#> [16] ggsignif_0.6.0 XVector_0.30.0 colorspace_1.4-1
#> [19] cowplot_1.1.0 htmltools_0.5.0 pkgconfig_2.0.3
#> [22] broom_0.7.2 magick_2.5.0 haven_2.3.1
#> [25] bookdown_0.21 zlibbioc_1.36.0 purrr_0.3.4
#> [28] openxlsx_4.2.3 rio_0.5.16 BiocParallel_1.24.0
#> [31] tibble_3.0.4 generics_0.0.2 farver_2.0.3
#> [34] IRanges_2.24.0 car_3.0-10 ellipsis_0.3.1
#> [37] withr_2.3.0 BiocGenerics_0.36.0 quantmod_0.4.17
#> [40] crayon_1.3.4 readxl_1.3.1 evaluate_0.14
#> [43] rstatix_0.6.0 forcats_0.5.0 xts_0.12.1
#> [46] foreign_0.8-80 tools_4.0.3 data.table_1.13.2
#> [49] hms_0.5.3 lifecycle_0.2.0 stringr_1.4.0
#> [52] S4Vectors_0.28.0 munsell_0.5.0 zip_2.1.1
#> [55] Biostrings_2.58.0 compiler_4.0.3 GenomeInfoDb_1.26.0
#> [58] rlang_0.4.8 grid_4.0.3 RCurl_1.98-1.2
#> [61] bitops_1.0-6 labeling_0.4.2 rmarkdown_2.5
#> [64] gtable_0.3.0 abind_1.4-5 curl_4.3
#> [67] R6_2.4.1 zoo_1.8-8 knitr_1.30
#> [70] stringi_1.5.3 parallel_4.0.3 Rcpp_1.0.5
#> [73] vctrs_0.3.4 tidyselect_1.1.0 xfun_0.18