Contents

1 Overview

The screenCounter package implements several functions to quantify the frequency of barcodes in a sequencing screen experiment. These functions on the raw FASTQ files, yielding a matrix of counts for each barcode in each sample for downstream analysis. Quantification can be performed for both single and combinatorial barcodes, though only single-end sequencing data are currently supported.

2 Counting single barcodes

The “barcode” is considered to be the entire sequence structure that is used to label a particular molecular or biological entity. This includes regions of variable sequence that distinguish one barcode from another as well as constant sequences that flank or separate the variable regions. The most common barcode design uses only one variable region, which we refer to as a “single barcode”.

# Example of a single barcode:

CAGCAGCATGCTGNNNNNNNNNNNNNNCAGCATCGTGC
-------------^^^^^^^^^^^^^^-----------
constant     variable      constant
flank        region        flank

To demonstrate, let’s mock up a FASTQ file from a single-end sequencing screen. For simplicity, we will assume that each read only contains the barcode. However, it is entirely permissible to have additional (unknown) flanking sequences in the read.

# Our pool of known variable sequences, one per barcode:
known <- c("AAAAAAAA", "CCCCCCCC", "GGGGGGGG", "TTTTTTTT")

# Mocking up some sequence data, where each read randomly contains 
# one of the variable sequences, flanked by constant regions. 
library(Biostrings)
chosen <- sample(known, 1000, replace=TRUE)
reads <- sprintf("GTAC%sCATG", chosen)
names(reads) <- sprintf("READ_%i", seq_along(reads))

# Writing to a FASTQ file.
single.fq <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(reads), file=single.fq, format="fastq")

We quantify single barcodes across one or more files using the matrixOfSingleBarcodes() function. This produces a SummarizedExperiment containing a count matrix of frequencies of each barcode (row) in each file (column). The order of rows corresponds to the order of variable regions in known that distinguishes each barcode.

library(screenCounter)
out <- matrixOfSingleBarcodes(single.fq, 
    flank5="GTAC", flank3="CATG", choices=known)
out
## class: SummarizedExperiment 
## dim: 4 1 
## metadata(0):
## assays(1): counts
## rownames(4): AAAAAAAA CCCCCCCC GGGGGGGG TTTTTTTT
## rowData names(1): choices
## colnames(1): file2089ad75e6fdb7.fastq
## colData names(3): paths nreads nmapped
assay(out)
##          file2089ad75e6fdb7.fastq
## AAAAAAAA                      248
## CCCCCCCC                      263
## GGGGGGGG                      253
## TTTTTTTT                      236

We specify the constant sequences that immediately flank the variable region1 A template structure can also be specified in template=, which may be more convenient.. This improves stringency of barcode identification as the barcode will not be recognized in the read sequence unless the constant sequences match perfectly. It also improves speed as the matching of the variable sequence is only performed at positions along the read where the flanking constant sequences have matched.

Obviously, users should not supply the full length of the flanking sequence if that sequence is not captured in the read. For example, the 3’ flanking sequence may be lost in experiments using short reads that only reach to the end of the variable region. In such cases, an empty string should be specified as flank3=.

3 Counting combinatorial barcodes

A more complex experimental design involves combinatorial barcodes where multiple variable regions are randomly ligated to form a single sequence. This exploits combinatorial complexity to generate a large number of unique barcodes from a limited pool of known variable sequences.

# Example of a combinatorial barcode:

CAGCTANNNNNNNNNNCACGNNNNNNNNNNCAGCT
------^^^^^^^^^^----^^^^^^^^^^-----
      variable      variable

To demonstrate, let’s mock up another FASTQ file of single-end read data. Again, for simplicity, we will assume that each read sequence contains only the barcode.

# Our pool of known variable sequences:
known1 <- c("AAAA", "CCCC", "GGGG", "TTTT")
known2 <- c("ATTA", "CGGC", "GCCG", "TAAT")

# Mocking up some sequence data, where each read randomly contains 
# two of the variable sequences within a template structure.
library(Biostrings)
chosen1 <- sample(known1, 1000, replace=TRUE)
chosen2 <- sample(known2, 1000, replace=TRUE)
reads <- sprintf("GTAC%sCATG%sGTAC", chosen1, chosen2)
names(reads) <- sprintf("READ_%i", seq_along(reads))

# Writing to a FASTQ file.
combo.fq <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(reads), file=combo.fq, format="fastq")

We quantify combinatorial barcodes across one or more files using the matrixOfComboBarcodes() function. This requires a template for the barcode structure to specify how the variable sequences are used to construct the final barcode. It is permissible for each variable sequence to be sampled from a different known pool.

out <- matrixOfComboBarcodes(combo.fq,
    template="GTACNNNNCATGNNNNGTAC",
    choices=list(first=known1, second=known2))
out
## class: SummarizedExperiment 
## dim: 16 1 
## metadata(0):
## assays(1): counts
## rownames(16): BARCODE_1 BARCODE_2 ... BARCODE_15 BARCODE_16
## rowData names(2): first second
## colnames(1): file2089ad57ae3f8.fastq
## colData names(3): paths nreads nmapped

This function yields a SummarizedExperiment object containing a matrix of frequencies of each barcode (row) in each file (column).

assay(out)
##            file2089ad57ae3f8.fastq
## BARCODE_1                       58
## BARCODE_2                       60
## BARCODE_3                       53
## BARCODE_4                       60
## BARCODE_5                       61
## BARCODE_6                       58
## BARCODE_7                       59
## BARCODE_8                       58
## BARCODE_9                       64
## BARCODE_10                      71
## BARCODE_11                      66
## BARCODE_12                      78
## BARCODE_13                      68
## BARCODE_14                      57
## BARCODE_15                      75
## BARCODE_16                      54

The identities of the variable regions in each combinatorial barcode are specified in the rowData, which contains the variable sequences in known1 and known2 that define each barcode.

rowData(out)
## DataFrame with 16 rows and 2 columns
##                  first      second
##            <character> <character>
## BARCODE_1         AAAA        ATTA
## BARCODE_2         AAAA        CGGC
## BARCODE_3         AAAA        GCCG
## BARCODE_4         AAAA        TAAT
## BARCODE_5         CCCC        ATTA
## ...                ...         ...
## BARCODE_12        GGGG        TAAT
## BARCODE_13        TTTT        ATTA
## BARCODE_14        TTTT        CGGC
## BARCODE_15        TTTT        GCCG
## BARCODE_16        TTTT        TAAT

4 Counting dual barcodes

Another experimental design involves the use of dual barcodes, often to modulate the expression of two genes at once. The aim is to, again, count the frequency of each combination of barcodes among the observed read pairs. This differs from the combinatorial barcodes in that the dual barcodes are present on paired reads, with one barcode per read; in addition, not all combinations of two barcodes may be valid.

# Example of a dual barcode:

# READ 1:
CAGCTANNNNNNNNNNCACG
------^^^^^^^^^^----
      variable

# READ 2:
CACGGTTNNNNNNNNNNCAGC
------^^^^^^^^^^-----
      variable

To demonstrate, let’s mock up another FASTQ file of single-end read data. Again, for simplicity, we will assume that each read sequence contains only the barcode.

# Creating an example dual barcode sequencing experiment.
known.pool1 <- c("AGAGAGAGA", "CTCTCTCTC",
    "GTGTGTGTG", "CACACACAC")
known.pool2 <- c("ATATATATA", "CGCGCGCGC",
    "GAGAGAGAG", "CTCTCTCTC")

# Mocking up the barcode sequences.
N <- 1000
read1 <- sprintf("CAGCTACGTACG%sCCAGCTCGATCG",
   sample(known.pool1, N, replace=TRUE))
names(read1) <- seq_len(N)

read2 <- sprintf("TGGGCAGCGACA%sACACGAGGGTAT",
   sample(known.pool2, N, replace=TRUE))
names(read2) <- seq_len(N)

# Writing them to FASTQ files.
tmp <- tempfile()
tmp1 <- paste0(tmp, "_1.fastq")
writeXStringSet(DNAStringSet(read1), filepath=tmp1, format="fastq")
tmp2 <- paste0(tmp, "_2.fastq")
writeXStringSet(DNAStringSet(read2), filepath=tmp2, format="fastq")

Let us imagine that only a subset of the barcode combinations are actually valid. This is often the case because, in these studies, the combinations are explicitly designed to refer to known phenomena (e.g., gene combinations) rather than being randomized as described for combinatorial barcodes.

choices <- expand.grid(known.pool1, known.pool2)
choices <- DataFrame(barcode1=choices[,1], barcode2=choices[,2])
choices <- choices[sample(nrow(choices), nrow(choices)*0.9),]

We quantify dual barcodes across one or more pairs of files using the matrixOfDualBarcodes() function. This requires templates for the barcode structure to specify how the variable sequences are used to construct each final barcode; the first template corresponds to the first barcode, while the second template corresponds to the second barcode. Many of the options available for single barcode matching in matrixOfSingleBarcodes() are also available here and can be specificed separately for each barcode.

out <- matrixOfDualBarcodes(list(c(tmp1, tmp2)), 
    choices=choices,
    template=c("CAGCTACGTACGNNNNNNNNNCCAGCTCGATCG",
               "TGGGCAGCGACANNNNNNNNNACACGAGGGTAT"))
out
## class: SummarizedExperiment 
## dim: 14 1 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): barcode1 barcode2
## colnames(1): file2089ad30038ec3_1.fastq
## colData names(3): paths1 paths2 npairs

This function yields a SummarizedExperiment object containing a matrix of frequencies of each barcode pair (row) in each pair of files (column).

assay(out)
##       file2089ad30038ec3_1.fastq
##  [1,]                         58
##  [2,]                         79
##  [3,]                         55
##  [4,]                         64
##  [5,]                         76
##  [6,]                         61
##  [7,]                         74
##  [8,]                         59
##  [9,]                         53
## [10,]                         71
## [11,]                         46
## [12,]                         64
## [13,]                         60
## [14,]                         69

Further diagnostics about the mapping are provided in the colData:

colData(out)
## DataFrame with 1 row and 3 columns
##                                            paths1                 paths2
##                                       <character>            <character>
## file2089ad30038ec3_1.fastq /tmp/RtmprxQHyg/file.. /tmp/RtmprxQHyg/file..
##                               npairs
##                            <integer>
## file2089ad30038ec3_1.fastq      1000

The identities of the variable regions corresponding to each row are specified in the rowData, which contains the variable sequences in barcode1 and barcode1 that define each barcode combination.

rowData(out)
## DataFrame with 14 rows and 2 columns
##      barcode1  barcode2
##      <factor>  <factor>
## 1   AGAGAGAGA CGCGCGCGC
## 2   GTGTGTGTG CTCTCTCTC
## 3   CACACACAC CTCTCTCTC
## 4   GTGTGTGTG GAGAGAGAG
## 5   AGAGAGAGA CTCTCTCTC
## ...       ...       ...
## 10  GTGTGTGTG CGCGCGCGC
## 11  GTGTGTGTG ATATATATA
## 12  CACACACAC GAGAGAGAG
## 13  CACACACAC ATATATATA
## 14  CTCTCTCTC CTCTCTCTC

By default, we assume that the read sequence in the first FASTQ file contains the first barcode (i.e., choices[,1]) and the second FASTQ file contains the second barcode. If the arrangement is randomized, we can set randomized=TRUE to search the first FASTQ file for the second barcode and vice versa. Note that this will filter out read pairs that match different valid combinations in both arrangements.

5 Counting random barcodes

In this barcode design, the variable region is not sampled from a pool of known sequences but is instead randomly synthesized from a nucleotide mixture. We cannot use any of the previous functions as they need a known pool; rather, we need to extract the observed sequences (and their frequencies) from the reads. To demonstrate, let’s mock up a FASTQ file from a single-end sequencing screen with a single random variable region.

# Mocking up a 8-bp random variable region.
N <- 1000
randomized <- lapply(1:N, function(i) {
    paste(sample(c("A", "C", "G", "T"), 8, replace=TRUE), collapse="")
})
barcodes <- sprintf("CCCAGT%sGGGATAC", randomized)
names(reads) <- sprintf("READ_%i", seq_along(reads))

# Writing to a FASTQ file.
single.fq <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(reads), file=single.fq, format="fastq")

For this design, we can count the frequency of each observed barcode using the matrixOfRandomBarcodes() function. This produces a SummarizedExperiment containing a count matrix of frequencies for each barcode.

library(screenCounter)
out <- matrixOfRandomBarcodes(single.fq,
    template="CCCAGTNNNNNNNNGGGATAC")
out
## class: SummarizedExperiment 
## dim: 0 1 
## metadata(0):
## assays(1): counts
## rownames(0):
## rowData names(1): sequences
## colnames(1): file2089ad16a938d8.fastq
## colData names(3): paths nreads nmapped
assay(out)
##      file2089ad16a938d8.fastq

6 Further options

6.1 Supporting mismatches

Mismatch tolerance can be enabled by setting substitutions to the desired number of mismatches. This may improve barcode recovery in the presence of sequencing errors. While such errors are rare per base2 At least on Illumina machines., they will be more likely to occur when considering the entire length of the barcode.

In practice, mismatch tolerance is turned off by default in all counting functions. This is because many errors are introduced during barcode synthesis, more than those due to library preparation or sequencing. Synthesis errors result in genuinely different barcodes (e.g., guides targeting different regions, or different molecular tags) that should not be counted together. Nonetheless, this setting may be useful for debugging experiments with low mapping rates.

Indels are not tolerated in any matches.

6.2 Searching the other strand

By default, the counting functions will search the read sequence on both the original strand reported by the sequencer and on the reverse complement. It is possible to change this behaviour with the strand= argument to only search one of the strands. The most appropriate setting requires knowledge of the sequencing protocol and the barcode design. For example:

  • Single-end sequencing of a construct where the barcode is only covered by a read sequenced from one of the ends. This requires strand="original" if the barcode is on the forward strand of the reachable end, or strand="reverse" if it is on the reverse strand. If the sequencing is performed from a randomly chosen end, only 50% of the reads will contain barcodes.
  • Single-end sequencing of a construct where the barcode can be covered by reads from either end. This requires strand="both" to search both strands of each read. This avoids loss of 50% of reads provided that the constant regions are captured by reads from both ends.

Paired-end sequencing of single or combinatorial barcodes requires some more consideration:

  • A construct where the barcode can only be covered by a read sequenced from one of the ends and that end is known in advance. In this case, the data can be treated as single-end. Counting can be performed using the file containing reads from that end with strand="original" or strand="reverse".
  • A construct where the barcode can only be covered by a read sequenced from one of the ends and that end is randomly chosen. Users can run the counting functions on each file separately with strand="original" or strand="reverse" and sum the counts. (We assume that the chance of randomly encountering the barcode sequence and its flanking constant regions is negligible, so each read pair will not contribute more than once to the final count.)
  • Paired-end sequencing of a construct where the barcode can be covered by both reads. In this case, one of the files can be chosen for use with strand="both". Alternatively, users can consolidate read pairs into a consensus read3 For example, see FLASh or vsearch with --fastq_mergepairs. prior to using strand="both".

For paired-end sequencing of dual barcodes, each paired barcode is covered by exactly one paired read in the same direction. This means that we should never have to set strand="both" as the strand orientation should be known ahead of time. The most that should need to be done is to set strand separately for each barcode, which is achieved by passing a vector of length 2 to specify the orientation for the first and second barcodes, respectively; or to set randomized=TRUE as described above.

6.3 Parallelization across files

Counting functions are parallelized across files using the BiocParallel framework. Each file is processed separately on a single worker to create the final matrix of counts.

# Pretend that these are different samples:
all.files <- c(single.fq, single.fq, single.fq)

# Parallel execution:
library(BiocParallel)
out <- matrixOfSingleBarcodes(all.files, 
    flank5="GTAC", flank3="CATG", choices=known,
    BPPARAM=SnowParam(2))
out
## class: SummarizedExperiment 
## dim: 4 3 
## metadata(0):
## assays(1): counts
## rownames(4): AAAAAAAA CCCCCCCC GGGGGGGG TTTTTTTT
## rowData names(1): choices
## colnames(3): file2089ad16a938d8.fastq file2089ad16a938d8.fastq
##   file2089ad16a938d8.fastq
## colData names(3): paths nreads nmapped

Users should read the relevant documentation to determine the most effective parallelization approach for their system. For example, users working on a cluster should use BatchToolsParam(), while users on a local non-Windows machines may prefer to use MulticoreParam().

7 Session information

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocParallel_1.34.2         screenCounter_1.0.1        
##  [3] SummarizedExperiment_1.30.2 Biobase_2.60.0             
##  [5] GenomicRanges_1.52.0        MatrixGenerics_1.12.2      
##  [7] matrixStats_1.0.0           Biostrings_2.68.1          
##  [9] GenomeInfoDb_1.36.1         XVector_0.40.0             
## [11] IRanges_2.34.1              S4Vectors_0.38.1           
## [13] BiocGenerics_0.46.0         BiocStyle_2.28.0           
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.5-4.1          jsonlite_1.8.7          compiler_4.3.1         
##  [4] BiocManager_1.30.21     crayon_1.5.2            Rcpp_1.0.10            
##  [7] bitops_1.0-7            parallel_4.3.1          jquerylib_0.1.4        
## [10] yaml_2.3.7              fastmap_1.1.1           lattice_0.21-8         
## [13] R6_2.5.1                S4Arrays_1.0.4          knitr_1.43             
## [16] DelayedArray_0.26.6     bookdown_0.34           snow_0.4-4             
## [19] GenomeInfoDbData_1.2.10 bslib_0.5.0             rlang_1.1.1            
## [22] cachem_1.0.8            xfun_0.39               sass_0.4.6             
## [25] cli_3.6.1               zlibbioc_1.46.0         digest_0.6.32          
## [28] grid_4.3.1              evaluate_0.21           codetools_0.2-19       
## [31] RCurl_1.98-1.12         rmarkdown_2.23          tools_4.3.1            
## [34] htmltools_0.5.5