RangedSummarizedExpriment
objectsuppressPackageStartupMessages({
library(comapr)
library(SummarizedExperiment)
})
In the previous vignette, we demonstrated how to calculate genetic distances using genotyped markers from a group of samples.
Genetic distance calculate from genotype shifts of markers
In this document, we will focus on building individualized genetic maps from
output files of sscocaller
which is available at
here.
The comapr
package includes a list of toy example output files from the sscocaller
command line tool. The follow code will get the file path that we will use later.
demo_path <-paste0(system.file("extdata",package = "comapr"),"/")
We can see that we have two samples (individual donors) and each of them has haplotype states inferred for chr1 to chr5.
list.files(demo_path)
#> [1] "s1_barcodes.txt" "s1_chr1_altCount.mtx" "s1_chr1_snpAnnot.txt"
#> [4] "s1_chr1_totalCount.mtx" "s1_chr1_vi.mtx" "s1_chr1_viSegInfo.txt"
#> [7] "s1_chr2_altCount.mtx" "s1_chr2_snpAnnot.txt" "s1_chr2_totalCount.mtx"
#> [10] "s1_chr2_vi.mtx" "s1_chr2_viSegInfo.txt" "s1_chr3_altCount.mtx"
#> [13] "s1_chr3_snpAnnot.txt" "s1_chr3_totalCount.mtx" "s1_chr3_vi.mtx"
#> [16] "s1_chr3_viSegInfo.txt" "s1_chr4_altCount.mtx" "s1_chr4_snpAnnot.txt"
#> [19] "s1_chr4_totalCount.mtx" "s1_chr4_vi.mtx" "s1_chr4_viSegInfo.txt"
#> [22] "s1_chr5_altCount.mtx" "s1_chr5_snpAnnot.txt" "s1_chr5_totalCount.mtx"
#> [25] "s1_chr5_vi.mtx" "s1_chr5_viSegInfo.txt" "s2_barcodes.txt"
#> [28] "s2_chr1_snpAnnot.txt" "s2_chr1_vi.mtx" "s2_chr1_viSegInfo.txt"
#> [31] "s2_chr2_snpAnnot.txt" "s2_chr2_vi.mtx" "s2_chr2_viSegInfo.txt"
#> [34] "s2_chr3_snpAnnot.txt" "s2_chr3_vi.mtx" "s2_chr3_viSegInfo.txt"
#> [37] "s2_chr4_snpAnnot.txt" "s2_chr4_vi.mtx" "s2_chr4_viSegInfo.txt"
#> [40] "s2_chr5_snpAnnot.txt" "s2_chr5_vi.mtx" "s2_chr5_viSegInfo.txt"
comapr
provides quality-check functions for examining SNP coverage per chr and
per cell, chromosome segregation pattern checks, and summary statics for
filtering low confidence crossovers.
pcQC <- perCellChrQC(sampleName="s1",chroms=c("chr1"),
path=paste0(demo_path,"/"),
barcodeFile=NULL)
Input-parsing functions are implemented in comapr
to construct
RangedSummarizedExpriment
object that parses files generated
from sscocaller
and filter out low-confidence COs at the same time. For the
demo dataset, these filters do not have any effects:
minSNP = 0,
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1
RangedSummarizedExpriment
objectWe first construct RangedSummarizedExpriment
object from parsing output files
from sscocaller
and filter out low-confidence crossovers.
s1_rse_state <- readHapState("s1",chroms=c("chr1","chr2"),
path=demo_path,barcodeFile=NULL,
minSNP = 0,
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1)
s2_rse_state <- readHapState("s2",chroms=c("chr1","chr2"),
path=demo_path,
barcodeFile=NULL,
minSNP = 0,
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1)
s1_rse_state
#> class: RangedSummarizedExperiment
#> dim: 12 5
#> metadata(10): ithSperm Seg_start ... bp_dist barcode
#> assays(1): vi_state
#> rownames: NULL
#> rowData names(0):
#> colnames(5): BC1 BC2 BC3 BC4 BC5
#> colData names(1): barcodes
The Viterbi states for SNP markers are stored in the “assay” slot:
assay(s1_rse_state)
#> 12 x 5 sparse Matrix of class "dgCMatrix"
#> BC1 BC2 BC3 BC4 BC5
#> [1,] 1 2 1 2 1
#> [2,] 1 . . . .
#> [3,] . 2 . . .
#> [4,] . . . . 1
#> [5,] 2 1 . . .
#> [6,] 2 1 1 2 .
#> [7,] 1 2 1 2 1
#> [8,] 1 . . . .
#> [9,] . 2 . . .
#> [10,] . . . . 1
#> [11,] 2 1 . . .
#> [12,] 2 1 1 2 .
The rowRanges
contains the SNP’s positions:
rowRanges(s1_rse_state)
#> GRanges object with 12 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 3000 *
#> [2] chr1 3200 *
#> [3] chr1 4000 *
#> [4] chr1 4500 *
#> [5] chr1 5500 *
#> ... ... ... ...
#> [8] chr2 3200 *
#> [9] chr2 4000 *
#> [10] chr2 4500 *
#> [11] chr2 5500 *
#> [12] chr2 6000 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
We have read in the Viterbi states for cells from two samples: s1 and s2. We now combine them into one data object for downstream analysis.
colData(s1_rse_state)
#> DataFrame with 5 rows and 1 column
#> barcodes
#> <character>
#> BC1 BC1
#> BC2 BC2
#> BC3 BC3
#> BC4 BC4
#> BC5 BC5
colData(s1_rse_state)$sampleGroup <- "s1"
colData(s2_rse_state)$sampleGroup <- "s2"
We now call combineHapState
function to combine sample s1 and sample s2
into one RangedSummarizedExperiment
object.
twoSample <- combineHapState(s1_rse_state,
s2_rse_state)
Now the assay
data slot contains the Viterbi states across SNP positions for
the combined samples.
twoSample <- combineHapState(s1_rse_state,s2_rse_state)
Now the twoSample
object contains the cells from both samples with assay
slot
contains the Viterbi states and rowRanges
contains position annotaitons for
the list SNP markers.
assay(twoSample)
#> 12 x 10 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 10 column names 'BC1', 'BC2', 'BC3' ... ]]
#>
#> [1,] 1 2 1 2 1 1 2 1 2 1
#> [2,] 1 . . . . . . . . .
#> [3,] . 2 . . . . 2 . . .
#> [4,] . . . . 1 . . . . 1
#> [5,] 2 1 . . . . 1 . . .
#> [6,] 2 1 1 2 . 1 1 1 2 .
#> [7,] 1 2 1 2 1 1 2 1 2 1
#> [8,] 1 . . . . . . . . .
#> [9,] . 2 . . . . 2 . . .
#> [10,] . . . . 1 . . . . 1
#> [11,] 2 1 . . . . 1 . . .
#> [12,] 2 1 1 2 . 1 1 1 2 .
The countCOs
function can then find out the crossover intervals according to
the Viterbi states for each cell and the number of crossovers per cell is then
calculated.
countCOs
function will find the crossover intervals for each samples and
attribute the observed crossovers from each sample to the corresponding intervals.
cocounts <- countCOs(twoSample)
The rowRanges
from the resulting data object now denotes the crossover interval
and the assay
slot now contains the number of crossovers in each cell.
Now rowRanges
contains the intervals
rowRanges(cocounts)
#> GRanges object with 6 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 3201-3999 *
#> [2] chr1 4001-4499 *
#> [3] chr1 4501-5499 *
#> [4] chr2 3201-3999 *
#> [5] chr2 4001-4499 *
#> [6] chr2 4501-5499 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
The colData slot still contains the annotations of each cell.
colData(cocounts)
#> DataFrame with 10 rows and 2 columns
#> barcodes sampleGroup
#> <character> <character>
#> BC1 BC1 s1
#> BC2 BC2 s1
#> BC3 BC3 s1
#> BC4 BC4 s1
#> BC5 BC5 s1
#> BCa BCa s2
#> BCb BCb s2
#> BCc BCc s2
#> BCd BCd s2
#> BCe BCe s2
The genetic distances can be calculated by using mapping functions such as the
Kosambi or Haldane and assay
slot contains the number of crossovers
found in each sample across these intervals.
assay(cocounts)
#> DataFrame with 6 rows and 10 columns
#> BC1 BC2 BC3 BC4 BC5 BCa BCb
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1 0.347826 0.000000 0 0 0 0 0.000000
#> 2 0.217391 0.333333 0 0 0 0 0.333333
#> 3 0.434783 0.666667 0 0 0 0 0.666667
#> 4 0.347826 0.000000 0 0 0 0 0.000000
#> 5 0.217391 0.333333 0 0 0 0 0.333333
#> 6 0.434783 0.666667 0 0 0 0 0.666667
#> BCc BCd BCe
#> <numeric> <numeric> <numeric>
#> 1 0 0 0
#> 2 0 0 0
#> 3 0 0 0
#> 4 0 0 0
#> 5 0 0 0
#> 6 0 0 0
calGeneticDist
function will convert the raw crossover frequencies to genetic
distances via selected mapping function (ie. kosambi or haldane).
dist_gr <- calGeneticDist(co_count = cocounts, mapping_fun = "k")
dist_gr
#> class: RangedSummarizedExperiment
#> dim: 6 10
#> metadata(0):
#> assays(1): co_count
#> rownames: NULL
#> rowData names(2): raw_rate kosambi
#> colnames(10): BC1 BC2 ... BCd BCe
#> colData names(2): barcodes sampleGroup
The genetic distances for each interval are stored in rowData
.
rowData(dist_gr)
#> DataFrame with 6 rows and 2 columns
#> raw_rate kosambi
#> <numeric> <numeric>
#> 1 0.0347826 3.48389
#> 2 0.0884058 8.93447
#> 3 0.1768116 18.47894
#> 4 0.0347826 3.48389
#> 5 0.0884058 8.93447
#> 6 0.1768116 18.47894
The above genetic distances have been calculated using all samples. We can also specify the group factor so that we can calculate genetic distances for different sample groups:
## sampleGroup is a column in the colData slot
dist_gr <- calGeneticDist(co_count = cocounts,
group_by = "sampleGroup",
mapping_fun = "k")
Now the group/Sample specific distances are stored in rowData
rowData(dist_gr)$kosambi
#> s1 s2
#> [1,] 7.001937 0.00000
#> [2,] 11.198036 6.70660
#> [3,] 23.647496 13.66359
#> [4,] 7.001937 0.00000
#> [5,] 11.198036 6.70660
#> [6,] 23.647496 13.66359
We construct a GRange
object from the dist_gr
first.
p_gr <- granges(dist_gr)
mcols(p_gr) <- rowData(dist_gr)$kosambi
We can plot whole-genome genetic distances
plotWholeGenome(p_gr)
We can also do per chromosome plot
plotGeneticDist(p_gr,chr = "chr1",cumulative = TRUE)
bootstrapDist
function generates the sampling distribution of the difference
in total genetic distances for the two groups under comparisons.
set.seed(100)
bootsDiff <- bootstrapDist(co_gr = cocounts,B=10,
group_by = "sampleGroup")
hist(bootsDiff)
From the bootsDiff
data object, we can find a 95% confidence interval to test
whether the difference is statistically significant.
quantile(bootsDiff,c(0.025,0.975),na.rm =TRUE)
#> 2.5% 97.5%
#> -3.11145 213.24397
An alternative re-sampling method, permuteDist
, can be applied to generate a
null distribution for the group difference by reshuffling the group labels across
the samples.
set.seed(100)
perms <- permuteDist(cocounts,B=1000,group_by = "sampleGroup")
A p-value can be calculated using the statmod::permp
function (Phipson and Smyth 2010).
statmod::permp(x = sum(perms$permutes>=perms$observed_diff,
na.rm = TRUE),
nperm = sum(!is.na(perms$permutes)),
n1 = perms$nSample[1],
n2 = perms$nSample[2])
#> [1] 0.4885392
sessionInfo()
#> R version 4.3.0 RC (2023-04-13 r84269)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.30.0 Biobase_2.60.0
#> [3] MatrixGenerics_1.12.0 matrixStats_0.63.0
#> [5] BiocParallel_1.34.0 GenomicRanges_1.52.0
#> [7] GenomeInfoDb_1.36.0 IRanges_2.34.0
#> [9] S4Vectors_0.38.0 BiocGenerics_0.46.0
#> [11] comapr_1.4.0 BiocStyle_2.28.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 rstudioapi_0.14 jsonlite_1.8.4
#> [4] shape_1.4.6 magrittr_2.0.3 magick_2.7.4
#> [7] GenomicFeatures_1.52.0 farver_2.1.1 rmarkdown_2.21
#> [10] GlobalOptions_0.1.2 BiocIO_1.10.0 zlibbioc_1.46.0
#> [13] vctrs_0.6.2 memoise_2.0.1 Rsamtools_2.16.0
#> [16] RCurl_1.98-1.12 base64enc_0.1-3 htmltools_0.5.5
#> [19] progress_1.2.2 curl_5.0.0 Formula_1.2-5
#> [22] sass_0.4.5 bslib_0.4.2 htmlwidgets_1.6.2
#> [25] plyr_1.8.8 Gviz_1.44.0 plotly_4.10.1
#> [28] cachem_1.0.7 GenomicAlignments_1.36.0 lifecycle_1.0.3
#> [31] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.5-4
#> [34] R6_2.5.1 fastmap_1.1.1 GenomeInfoDbData_1.2.10
#> [37] digest_0.6.31 colorspace_2.1-0 AnnotationDbi_1.62.0
#> [40] Hmisc_5.0-1 RSQLite_2.3.1 labeling_0.4.2
#> [43] filelock_1.0.2 fansi_1.0.4 httr_1.4.5
#> [46] compiler_4.3.0 withr_2.5.0 bit64_4.0.5
#> [49] htmlTable_2.4.1 backports_1.4.1 DBI_1.1.3
#> [52] highr_0.10 biomaRt_2.56.0 rappdirs_0.3.3
#> [55] DelayedArray_0.26.0 rjson_0.2.21 tools_4.3.0
#> [58] foreign_0.8-84 nnet_7.3-18 glue_1.6.2
#> [61] restfulr_0.0.15 grid_4.3.0 checkmate_2.1.0
#> [64] reshape2_1.4.4 cluster_2.1.4 generics_0.1.3
#> [67] gtable_0.3.3 BSgenome_1.68.0 tidyr_1.3.0
#> [70] ensembldb_2.24.0 data.table_1.14.8 hms_1.1.3
#> [73] xml2_1.3.3 utf8_1.2.3 XVector_0.40.0
#> [76] foreach_1.5.2 pillar_1.9.0 stringr_1.5.0
#> [79] circlize_0.4.15 dplyr_1.1.2 BiocFileCache_2.8.0
#> [82] lattice_0.21-8 rtracklayer_1.60.0 bit_4.0.5
#> [85] deldir_1.0-6 biovizBase_1.48.0 tidyselect_1.2.0
#> [88] Biostrings_2.68.0 knitr_1.42 gridExtra_2.3
#> [91] bookdown_0.33 ProtGenerics_1.32.0 xfun_0.39
#> [94] statmod_1.5.0 stringi_1.7.12 lazyeval_0.2.2
#> [97] yaml_2.3.7 evaluate_0.20 codetools_0.2-19
#> [100] interp_1.1-4 tibble_3.2.1 BiocManager_1.30.20
#> [103] cli_3.6.1 rpart_4.1.19 munsell_0.5.0
#> [106] jquerylib_0.1.4 dichromat_2.0-0.1 Rcpp_1.0.10
#> [109] dbplyr_2.3.2 png_0.1-8 XML_3.99-0.14
#> [112] parallel_4.3.0 ggplot2_3.4.2 blob_1.2.4
#> [115] prettyunits_1.1.1 latticeExtra_0.6-30 jpeg_0.1-10
#> [118] AnnotationFilter_1.24.0 bitops_1.0-7 viridisLite_0.4.1
#> [121] VariantAnnotation_1.46.0 scales_1.2.1 purrr_1.0.1
#> [124] crayon_1.5.2 rlang_1.1.0 KEGGREST_1.40.0
Phipson, Belinda, and Gordon K Smyth. 2010. “Permutation P-Values Should Never Be Zero: Calculating Exact P-Values When Permutations Are Randomly Drawn.” Stat. Appl. Genet. Mol. Biol. 9 (October): Article39.