screenCounter 1.0.1
The screenCounter package implements several functions to quantify the frequency of barcodes in a sequencing screen experiment. These functions on the raw FASTQ files, yielding a matrix of counts for each barcode in each sample for downstream analysis. Quantification can be performed for both single and combinatorial barcodes, though only single-end sequencing data are currently supported.
The “barcode” is considered to be the entire sequence structure that is used to label a particular molecular or biological entity. This includes regions of variable sequence that distinguish one barcode from another as well as constant sequences that flank or separate the variable regions. The most common barcode design uses only one variable region, which we refer to as a “single barcode”.
# Example of a single barcode:
CAGCAGCATGCTGNNNNNNNNNNNNNNCAGCATCGTGC
-------------^^^^^^^^^^^^^^-----------
constant variable constant
flank region flank
To demonstrate, let’s mock up a FASTQ file from a single-end sequencing screen. For simplicity, we will assume that each read only contains the barcode. However, it is entirely permissible to have additional (unknown) flanking sequences in the read.
# Our pool of known variable sequences, one per barcode:
known <- c("AAAAAAAA", "CCCCCCCC", "GGGGGGGG", "TTTTTTTT")
# Mocking up some sequence data, where each read randomly contains
# one of the variable sequences, flanked by constant regions.
library(Biostrings)
chosen <- sample(known, 1000, replace=TRUE)
reads <- sprintf("GTAC%sCATG", chosen)
names(reads) <- sprintf("READ_%i", seq_along(reads))
# Writing to a FASTQ file.
single.fq <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(reads), file=single.fq, format="fastq")
We quantify single barcodes across one or more files using the matrixOfSingleBarcodes()
function.
This produces a SummarizedExperiment
containing a count matrix of frequencies of each barcode (row) in each file (column).
The order of rows corresponds to the order of variable regions in known
that distinguishes each barcode.
library(screenCounter)
out <- matrixOfSingleBarcodes(single.fq,
flank5="GTAC", flank3="CATG", choices=known)
out
## class: SummarizedExperiment
## dim: 4 1
## metadata(0):
## assays(1): counts
## rownames(4): AAAAAAAA CCCCCCCC GGGGGGGG TTTTTTTT
## rowData names(1): choices
## colnames(1): file2089ad75e6fdb7.fastq
## colData names(3): paths nreads nmapped
assay(out)
## file2089ad75e6fdb7.fastq
## AAAAAAAA 248
## CCCCCCCC 263
## GGGGGGGG 253
## TTTTTTTT 236
We specify the constant sequences that immediately flank the variable region1 A template structure can also be specified in template=
, which may be more convenient..
This improves stringency of barcode identification as the barcode will not be recognized in the read sequence unless the constant sequences match perfectly.
It also improves speed as the matching of the variable sequence is only performed at positions along the read where the flanking constant sequences have matched.
Obviously, users should not supply the full length of the flanking sequence if that sequence is not captured in the read.
For example, the 3’ flanking sequence may be lost in experiments using short reads that only reach to the end of the variable region.
In such cases, an empty string should be specified as flank3=
.
A more complex experimental design involves combinatorial barcodes where multiple variable regions are randomly ligated to form a single sequence. This exploits combinatorial complexity to generate a large number of unique barcodes from a limited pool of known variable sequences.
# Example of a combinatorial barcode:
CAGCTANNNNNNNNNNCACGNNNNNNNNNNCAGCT
------^^^^^^^^^^----^^^^^^^^^^-----
variable variable
To demonstrate, let’s mock up another FASTQ file of single-end read data. Again, for simplicity, we will assume that each read sequence contains only the barcode.
# Our pool of known variable sequences:
known1 <- c("AAAA", "CCCC", "GGGG", "TTTT")
known2 <- c("ATTA", "CGGC", "GCCG", "TAAT")
# Mocking up some sequence data, where each read randomly contains
# two of the variable sequences within a template structure.
library(Biostrings)
chosen1 <- sample(known1, 1000, replace=TRUE)
chosen2 <- sample(known2, 1000, replace=TRUE)
reads <- sprintf("GTAC%sCATG%sGTAC", chosen1, chosen2)
names(reads) <- sprintf("READ_%i", seq_along(reads))
# Writing to a FASTQ file.
combo.fq <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(reads), file=combo.fq, format="fastq")
We quantify combinatorial barcodes across one or more files using the matrixOfComboBarcodes()
function.
This requires a template for the barcode structure to specify how the variable sequences are used to construct the final barcode.
It is permissible for each variable sequence to be sampled from a different known pool.
out <- matrixOfComboBarcodes(combo.fq,
template="GTACNNNNCATGNNNNGTAC",
choices=list(first=known1, second=known2))
out
## class: SummarizedExperiment
## dim: 16 1
## metadata(0):
## assays(1): counts
## rownames(16): BARCODE_1 BARCODE_2 ... BARCODE_15 BARCODE_16
## rowData names(2): first second
## colnames(1): file2089ad57ae3f8.fastq
## colData names(3): paths nreads nmapped
This function yields a SummarizedExperiment
object containing a matrix of frequencies of each barcode (row) in each file (column).
assay(out)
## file2089ad57ae3f8.fastq
## BARCODE_1 58
## BARCODE_2 60
## BARCODE_3 53
## BARCODE_4 60
## BARCODE_5 61
## BARCODE_6 58
## BARCODE_7 59
## BARCODE_8 58
## BARCODE_9 64
## BARCODE_10 71
## BARCODE_11 66
## BARCODE_12 78
## BARCODE_13 68
## BARCODE_14 57
## BARCODE_15 75
## BARCODE_16 54
The identities of the variable regions in each combinatorial barcode are specified in the rowData
,
which contains the variable sequences in known1
and known2
that define each barcode.
rowData(out)
## DataFrame with 16 rows and 2 columns
## first second
## <character> <character>
## BARCODE_1 AAAA ATTA
## BARCODE_2 AAAA CGGC
## BARCODE_3 AAAA GCCG
## BARCODE_4 AAAA TAAT
## BARCODE_5 CCCC ATTA
## ... ... ...
## BARCODE_12 GGGG TAAT
## BARCODE_13 TTTT ATTA
## BARCODE_14 TTTT CGGC
## BARCODE_15 TTTT GCCG
## BARCODE_16 TTTT TAAT
Another experimental design involves the use of dual barcodes, often to modulate the expression of two genes at once. The aim is to, again, count the frequency of each combination of barcodes among the observed read pairs. This differs from the combinatorial barcodes in that the dual barcodes are present on paired reads, with one barcode per read; in addition, not all combinations of two barcodes may be valid.
# Example of a dual barcode:
# READ 1:
CAGCTANNNNNNNNNNCACG
------^^^^^^^^^^----
variable
# READ 2:
CACGGTTNNNNNNNNNNCAGC
------^^^^^^^^^^-----
variable
To demonstrate, let’s mock up another FASTQ file of single-end read data. Again, for simplicity, we will assume that each read sequence contains only the barcode.
# Creating an example dual barcode sequencing experiment.
known.pool1 <- c("AGAGAGAGA", "CTCTCTCTC",
"GTGTGTGTG", "CACACACAC")
known.pool2 <- c("ATATATATA", "CGCGCGCGC",
"GAGAGAGAG", "CTCTCTCTC")
# Mocking up the barcode sequences.
N <- 1000
read1 <- sprintf("CAGCTACGTACG%sCCAGCTCGATCG",
sample(known.pool1, N, replace=TRUE))
names(read1) <- seq_len(N)
read2 <- sprintf("TGGGCAGCGACA%sACACGAGGGTAT",
sample(known.pool2, N, replace=TRUE))
names(read2) <- seq_len(N)
# Writing them to FASTQ files.
tmp <- tempfile()
tmp1 <- paste0(tmp, "_1.fastq")
writeXStringSet(DNAStringSet(read1), filepath=tmp1, format="fastq")
tmp2 <- paste0(tmp, "_2.fastq")
writeXStringSet(DNAStringSet(read2), filepath=tmp2, format="fastq")
Let us imagine that only a subset of the barcode combinations are actually valid. This is often the case because, in these studies, the combinations are explicitly designed to refer to known phenomena (e.g., gene combinations) rather than being randomized as described for combinatorial barcodes.
choices <- expand.grid(known.pool1, known.pool2)
choices <- DataFrame(barcode1=choices[,1], barcode2=choices[,2])
choices <- choices[sample(nrow(choices), nrow(choices)*0.9),]
We quantify dual barcodes across one or more pairs of files using the matrixOfDualBarcodes()
function.
This requires templates for the barcode structure to specify how the variable sequences are used to construct each final barcode;
the first template corresponds to the first barcode, while the second template corresponds to the second barcode.
Many of the options available for single barcode matching in matrixOfSingleBarcodes()
are also available here
and can be specificed separately for each barcode.
out <- matrixOfDualBarcodes(list(c(tmp1, tmp2)),
choices=choices,
template=c("CAGCTACGTACGNNNNNNNNNCCAGCTCGATCG",
"TGGGCAGCGACANNNNNNNNNACACGAGGGTAT"))
out
## class: SummarizedExperiment
## dim: 14 1
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): barcode1 barcode2
## colnames(1): file2089ad30038ec3_1.fastq
## colData names(3): paths1 paths2 npairs
This function yields a SummarizedExperiment
object containing a matrix of frequencies
of each barcode pair (row) in each pair of files (column).
assay(out)
## file2089ad30038ec3_1.fastq
## [1,] 58
## [2,] 79
## [3,] 55
## [4,] 64
## [5,] 76
## [6,] 61
## [7,] 74
## [8,] 59
## [9,] 53
## [10,] 71
## [11,] 46
## [12,] 64
## [13,] 60
## [14,] 69
Further diagnostics about the mapping are provided in the colData
:
colData(out)
## DataFrame with 1 row and 3 columns
## paths1 paths2
## <character> <character>
## file2089ad30038ec3_1.fastq /tmp/RtmprxQHyg/file.. /tmp/RtmprxQHyg/file..
## npairs
## <integer>
## file2089ad30038ec3_1.fastq 1000
The identities of the variable regions corresponding to each row are specified in the rowData
,
which contains the variable sequences in barcode1
and barcode1
that define each barcode combination.
rowData(out)
## DataFrame with 14 rows and 2 columns
## barcode1 barcode2
## <factor> <factor>
## 1 AGAGAGAGA CGCGCGCGC
## 2 GTGTGTGTG CTCTCTCTC
## 3 CACACACAC CTCTCTCTC
## 4 GTGTGTGTG GAGAGAGAG
## 5 AGAGAGAGA CTCTCTCTC
## ... ... ...
## 10 GTGTGTGTG CGCGCGCGC
## 11 GTGTGTGTG ATATATATA
## 12 CACACACAC GAGAGAGAG
## 13 CACACACAC ATATATATA
## 14 CTCTCTCTC CTCTCTCTC
By default, we assume that the read sequence in the first FASTQ file contains the first barcode (i.e., choices[,1]
)
and the second FASTQ file contains the second barcode.
If the arrangement is randomized, we can set randomized=TRUE
to search the first FASTQ file for the second barcode and vice versa.
Note that this will filter out read pairs that match different valid combinations in both arrangements.
In this barcode design, the variable region is not sampled from a pool of known sequences but is instead randomly synthesized from a nucleotide mixture. We cannot use any of the previous functions as they need a known pool; rather, we need to extract the observed sequences (and their frequencies) from the reads. To demonstrate, let’s mock up a FASTQ file from a single-end sequencing screen with a single random variable region.
# Mocking up a 8-bp random variable region.
N <- 1000
randomized <- lapply(1:N, function(i) {
paste(sample(c("A", "C", "G", "T"), 8, replace=TRUE), collapse="")
})
barcodes <- sprintf("CCCAGT%sGGGATAC", randomized)
names(reads) <- sprintf("READ_%i", seq_along(reads))
# Writing to a FASTQ file.
single.fq <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(reads), file=single.fq, format="fastq")
For this design, we can count the frequency of each observed barcode using the matrixOfRandomBarcodes()
function.
This produces a SummarizedExperiment
containing a count matrix of frequencies for each barcode.
library(screenCounter)
out <- matrixOfRandomBarcodes(single.fq,
template="CCCAGTNNNNNNNNGGGATAC")
out
## class: SummarizedExperiment
## dim: 0 1
## metadata(0):
## assays(1): counts
## rownames(0):
## rowData names(1): sequences
## colnames(1): file2089ad16a938d8.fastq
## colData names(3): paths nreads nmapped
assay(out)
## file2089ad16a938d8.fastq
Mismatch tolerance can be enabled by setting substitutions
to the desired number of mismatches.
This may improve barcode recovery in the presence of sequencing errors.
While such errors are rare per base2 At least on Illumina machines., they will be more likely to occur when considering the entire length of the barcode.
In practice, mismatch tolerance is turned off by default in all counting functions. This is because many errors are introduced during barcode synthesis, more than those due to library preparation or sequencing. Synthesis errors result in genuinely different barcodes (e.g., guides targeting different regions, or different molecular tags) that should not be counted together. Nonetheless, this setting may be useful for debugging experiments with low mapping rates.
Indels are not tolerated in any matches.
By default, the counting functions will search the read sequence on both the original strand reported by the sequencer and on the reverse complement.
It is possible to change this behaviour with the strand=
argument to only search one of the strands.
The most appropriate setting requires knowledge of the sequencing protocol and the barcode design.
For example:
strand="original"
if the barcode is on the forward strand of the reachable end, or strand="reverse"
if it is on the reverse strand.
If the sequencing is performed from a randomly chosen end, only 50% of the reads will contain barcodes.strand="both"
to search both strands of each read.
This avoids loss of 50% of reads provided that the constant regions are captured by reads from both ends.Paired-end sequencing of single or combinatorial barcodes requires some more consideration:
strand="original"
or strand="reverse"
.strand="original"
or strand="reverse"
and sum the counts.
(We assume that the chance of randomly encountering the barcode sequence and its flanking constant regions is negligible,
so each read pair will not contribute more than once to the final count.)strand="both"
.
Alternatively, users can consolidate read pairs into a consensus read3 For example, see FLASh or vsearch with --fastq_mergepairs
. prior to using strand="both"
.For paired-end sequencing of dual barcodes, each paired barcode is covered by exactly one paired read in the same direction.
This means that we should never have to set strand="both"
as the strand orientation should be known ahead of time.
The most that should need to be done is to set strand
separately for each barcode,
which is achieved by passing a vector of length 2 to specify the orientation for the first and second barcodes, respectively;
or to set randomized=TRUE
as described above.
Counting functions are parallelized across files using the BiocParallel framework. Each file is processed separately on a single worker to create the final matrix of counts.
# Pretend that these are different samples:
all.files <- c(single.fq, single.fq, single.fq)
# Parallel execution:
library(BiocParallel)
out <- matrixOfSingleBarcodes(all.files,
flank5="GTAC", flank3="CATG", choices=known,
BPPARAM=SnowParam(2))
out
## class: SummarizedExperiment
## dim: 4 3
## metadata(0):
## assays(1): counts
## rownames(4): AAAAAAAA CCCCCCCC GGGGGGGG TTTTTTTT
## rowData names(1): choices
## colnames(3): file2089ad16a938d8.fastq file2089ad16a938d8.fastq
## file2089ad16a938d8.fastq
## colData names(3): paths nreads nmapped
Users should read the relevant documentation to determine the most effective parallelization approach for their system.
For example, users working on a cluster should use BatchToolsParam()
, while users on a local non-Windows machines may prefer to use MulticoreParam()
.
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BiocParallel_1.34.2 screenCounter_1.0.1
## [3] SummarizedExperiment_1.30.2 Biobase_2.60.0
## [5] GenomicRanges_1.52.0 MatrixGenerics_1.12.2
## [7] matrixStats_1.0.0 Biostrings_2.68.1
## [9] GenomeInfoDb_1.36.1 XVector_0.40.0
## [11] IRanges_2.34.1 S4Vectors_0.38.1
## [13] BiocGenerics_0.46.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.5-4.1 jsonlite_1.8.7 compiler_4.3.1
## [4] BiocManager_1.30.21 crayon_1.5.2 Rcpp_1.0.10
## [7] bitops_1.0-7 parallel_4.3.1 jquerylib_0.1.4
## [10] yaml_2.3.7 fastmap_1.1.1 lattice_0.21-8
## [13] R6_2.5.1 S4Arrays_1.0.4 knitr_1.43
## [16] DelayedArray_0.26.6 bookdown_0.34 snow_0.4-4
## [19] GenomeInfoDbData_1.2.10 bslib_0.5.0 rlang_1.1.1
## [22] cachem_1.0.8 xfun_0.39 sass_0.4.6
## [25] cli_3.6.1 zlibbioc_1.46.0 digest_0.6.32
## [28] grid_4.3.1 evaluate_0.21 codetools_0.2-19
## [31] RCurl_1.98-1.12 rmarkdown_2.23 tools_4.3.1
## [34] htmltools_0.5.5