1 Overview of crisprScore

What makes a single-guide RNA (sgRNA) desirable for CRISPR-induced knockout? It usually boils down to two criteria:

  • Maximal on-target knockout efficiency (it cuts efficiently where it should)
  • Minimal off-targeting effects (it does not cut where it shouldn’t)

For both Cas9 and Cas12, several on-target and off-target scoring methods have been developed to predict these characteristics from nucleotide content. These methods have been made available through a heterogeneous array of software packages, and therefore there is currently no easy way to easily apply all methods at once to a set of sgRNAs. We provide in crisprScore a user-friendly unified framework for both developers and users to score sgRNAs using state-of-the-art algorithms. This is made possible by the use of the basilisk package, an elegant and powerful R package that enables the management and use of incompatible versions of Python modules in the course of a single R session.

2 Installation and getting started

When calling testCrisprScore for the first time after package installation, all Python modules and conda environments needed for crisprScore will be automatically downloaded and installed. This should take a few minutes.

Note that RStudio users will need to add the following line to their .Rprofile file in order for crisprScore to work properly:

options(reticulate.useImportHook=FALSE)

The scoringMethodsInfo data.frame contains a succinct summary of scoring methods available in crisprScore:

library(crisprScore)
## Loading required package: crisprScoreData
## Loading required package: ExperimentHub
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
## snapshotDate(): 2022-04-19
data(scoringMethodsInfo)
print(scoringMethodsInfo)
##        method   nuclease left right       type      label len
## 1    ruleset1     SpCas9  -24     5  On-target   RuleSet1  30
## 2     azimuth     SpCas9  -24     5  On-target    Azimuth  30
## 3      deephf     SpCas9  -20     2  On-target     DeepHF  23
## 4      lindel     SpCas9  -33    31  On-target     Lindel  65
## 5         mit     SpCas9  -20     2 Off-target        MIT  23
## 6         cfd     SpCas9  -20     2 Off-target        CFD  23
## 7    deepcpf1   AsCas12a   -4    29  On-target   DeepCpf1  34
## 8     enpamgb enAsCas12a   -4    29  On-target    EnPAMGB  34
## 9  crisprscan     SpCas9  -26     8  On-target CRISPRscan  35
## 10    casrxrf      CasRx   NA    NA  On-target   CasRx-RF  NA
## 11   crisprai     SpCas9  -19     2  On-target   CRISPRai  22

See ?scoringMethodsInfo for more information about the different columns.

3 On-targeting efficiency scores

Predicting on-target cutting efficiency is an extensive area of research, and we try to provide in crisprScore the latest state-of-the-art algorithms as they become available. Different algorithms require different input nucleotide sequences to predict cutting efficiency as illustrated in the two figures below.

Sequence inputs for Cas9 scoring methods

Figure 1: Sequence inputs for Cas9 scoring methods

Sequence inputs for Cas12a scoring methods

Figure 2: Sequence inputs for Cas12a scoring methods

3.1 Rule Set 1 (Cas9)

The Rule Set 1 algorithm is one of the first on-target efficiency methods developed for the Cas9 nuclease (Doench et al. 2014). It generates a probability (therefore a score between 0 and 1) that a given sgRNA will cut at its intended target. 4 nucleotides upstream and 3 nucleotides downstream of the PAM sequence are needed for scoring:

flank5 <- "ACCT" #4bp
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam    <- "AGG" #3bp 
flank3 <- "TTG" #3bp
input  <- paste0(flank5, spacer, pam, flank3) 
results <- getRuleSet1Scores(input)

The Azimuth score described below is an improvement over Rule Set 1 from the same lab.

3.2 Azimuth score (Cas9)

The Azimuth algorithm is an improved version of the popular Rule Set 2 score for the Cas9 nuclease (Doench et al. 2016). It generates a probability (therefore a score between 0 and 1) that a given sgRNA will cut at its intended target. 4 nucleotides upstream and 3 nucleotides downstream of the PAM sequence are needed for scoring:

flank5 <- "ACCT" #4bp
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam    <- "AGG" #3bp 
flank3 <- "TTG" #3bp
input  <- paste0(flank5, spacer, pam, flank3) 
results <- getAzimuthScores(input)

3.3 DeepHF score (Cas9)

The DeepHF algorithm is an on-target cutting efficiency prediction algorithm for several variants of the Cas9 nuclease (Wang et al. 2019) using a recurrent neural network (RNN) framework. Similar to the Azimuth score, it generates a probability of cutting at the intended on-target. The algorithm only needs the protospacer and PAM sequences as inputs:

spacer  <- "ATCGATGCTGATGCTAGATA" #20bp
pam     <- "AGG" #3bp 
input   <- paste0(spacer, pam) 
results <- getDeepHFScores(input)

Users can specify for which Cas9 they wish to score sgRNAs by using the argument enzyme: “WT” for Wildtype Cas9 (WT-SpCas9), “HF” for high-fidelity Cas9 (SpCas9-HF), or “ESP” for enhancedCas9 (eSpCas9). For wildtype Cas9, users can also specify the promoter used for expressing sgRNAs using the argument promoter (“U6” by default). See ?getDeepHFScores for more details.

3.4 DeepCpf1 score (Cas12a)

The DeepCpf1 algorithm is an on-target cutting efficiency prediction algorithm for the Cas12a nuclease (Kim et al. 2018) using a convolutional neural network (CNN) framework. It generates a score between 0 and 1 to quantify the likelihood of Cas12a to cut for a given sgRNA. 3 nucleotides upstream and 4 nucleotides downstream of the PAM sequence are needed for scoring:

flank5 <- "ACC" #3bp
pam    <- "TTTT" #4bp
spacer <- "AATCGATGCTGATGCTAGATATT" #23bp
flank3 <- "AAGT" #4bp
input  <- paste0(flank5, pam, spacer, flank3) 
results <- getDeepCpf1Scores(input)

3.5 enPAM+GB score (enCas12a)

The enPAM+GB algorithm is an on-target cutting efficiency prediction algorithm for the enhanced Cas12a (enCas12a) nuclease (DeWeirdt et al. 2020) using a gradient-booster (GB) model. The enCas12a nuclease as an extended set of active PAM sequences in comparison to the wildtype Cas12 nuclease (Kleinstiver et al. 2019), and the enPAM+GB algorithm takes PAM activity into account in the calculation of the final score. It generates a probability (therefore a score between 0 and 1) of a given sgRNA to cut at the intended target. 3 nucleotides upstream of the PAM sequence and 4 nucleotides downstream of the protospacer sequence are needed for scoring:

flank5 <- "ACC" #3bp
pam    <- "TTTT" #4bp
spacer <- "AATCGATGCTGATGCTAGATATT" #23bp
flank3 <- "AAGT" #4bp
input  <- paste0(flank5, pam, spacer, flank3) 
results <- getEnPAMGBScores(input)

3.6 CRISPRscan (Moreno-Mateos score)

The CRISPRscan algorithm, also known as the Moreno-Mateos score), is an on-target efficiency method for the SpCas9 nuclease developed for sgRNAs expressed from a T7 promoter, and trained on zebrafish data (Moreno-Mateos et al. 2015). It generates a probability (therefore a score between 0 and 1) that a given sgRNA will cut at its intended target. 6 nucleotides upstream of the protospacer sequence and 6 nucleotides downstream of the PAM sequence are needed for scoring:

flank5 <- "ACCTAA" #6bp
spacer <- "ATCGATGCTGATGCTAGATA" #20bp
pam    <- "AGG" #3bp 
flank3 <- "TTGAAT" #6bp
input  <- paste0(flank5, spacer, pam, flank3) 
results <- getCRISPRscanScores(input)

4 Off-target specificity scores

For CRISPR knockout systems, off-targeting effects can occur when the CRISPR nuclease tolerates some levels of imperfect complementarity between gRNA spacer sequences and protospacer sequences of the targeted genome. Generally, a greater number of mismatches between spacer and protospacer sequences decreases the likelihood of cleavage by a nuclease, but the nature of the nucleotide substitution can module the likelihood as well. Several off-target specificity scores were developed to predict the likelihood of a nuclease to cut at an unintended off-target site given a position-specific set of nucleotide mismatches.

We provide in crisprScore two popular off-target specificity scoring methods for CRISPR/Cas9 knockout systems: the MIT score (Hsu et al. 2013) and the cutting frequency determination (CFD) score (Doench et al. 2016).

4.1 MIT score

The MIT score was an early off-target specificity prediction algorithm developed for the CRISPR/Cas9 system (Hsu et al. 2013). It predicts the likelihood that the Cas9 nuclease will cut at an off-target site using position-specific mismatch tolerance weights. It also takes into consideration the total number of mismatches, as well as the average distance between mismatches. However, it does not take into account the nature of the nucleotide substitutions. The exact formula used to estimate the cutting likelihood is

\[\text{MIT} = \biggl(\prod_{p \in M}{w_p}\biggr)\times\frac{1}{\frac{19-d}{19}\times4+1}\times\frac{1}{m^2}\]

where \(M\) is the set of positions for which there is a mismatch between the sgRNA spacer sequence and the off-target sequence, \(w_p\) is an experimentally-derived mismatch tolerance weight at position \(p\), \(d\) is the average distance between mismatches, and \(m\) is the total number of mismatches. As the number of mismatches increases, the cutting likelihood decreases. In addition, off-targets with more adjacent mismatches will have a lower cutting likelihood.

The getMITScores function takes as argument a character vector of 20bp sequences specifying the spacer sequences of sgRNAs (spacers argument), as well as a vector of 20bp sequences representing the protospacer sequences of the putative off-targets in the targeted genome (protospacers argument). PAM sequences (pams) must also be provided. If only one spacer sequence is provided, it will reused for all provided protospacers.

The following code will generate MIT scores for 3 off-targets with respect to the sgRNA ATCGATGCTGATGCTAGATA:

spacer   <- "ATCGATGCTGATGCTAGATA"
protospacers  <- c("ACCGATGCTGATGCTAGATA",
                   "ATCGATGCTGATGCTAGATT",
                   "ATCGATGCTGATGCTAGATA")
pams <- c("AGG", "AGG", "AGA")
getMITScores(spacers=spacer,
             protospacers=protospacers,
             pams=pams)
##                 spacer          protospacer      score
## 1 ATCGATGCTGATGCTAGATA ACCGATGCTGATGCTAGATA 0.68500000
## 2 ATCGATGCTGATGCTAGATA ATCGATGCTGATGCTAGATT 0.00000000
## 3 ATCGATGCTGATGCTAGATA ATCGATGCTGATGCTAGATA 0.06944444

4.2 CFD score

The CFD off-target specificity prediction algorithm was initially developed for the CRISPR/Cas9 system, and was shown to be superior to the MIT score (Doench et al. 2016). Unlike the MIT score, position-specific mismatch weights vary according to the nature of the nucleotide substitution (e.g. an A->G mismatch at position 15 has a different weight than an A->T mismatch at position 15).

Similar to the getMITScores function, the getCFDScores function takes as argument a character vector of 20bp sequences specifying the spacer sequences of sgRNAs (spacers argument), as well as a vector of 20bp sequences representing the protospacer sequences of the putative off-targets in the targeted genome (protospacers argument). pams must also be provided. If only one spacer sequence is provided, it will be used for all provided protospacers.

The following code will generate CFD scores for 3 off-targets with respect to the sgRNA ATCGATGCTGATGCTAGATA:

spacer   <- "ATCGATGCTGATGCTAGATA"
protospacers  <- c("ACCGATGCTGATGCTAGATA",
                   "ATCGATGCTGATGCTAGATT",
                   "ATCGATGCTGATGCTAGATA")
pams <- c("AGG", "AGG", "AGA")
getCFDScores(spacers=spacer,
             protospacers=protospacers,
             pams=pams)
##                 spacer          protospacer      score
## 1 ATCGATGCTGATGCTAGATA ACCGATGCTGATGCTAGATA 0.85714286
## 2 ATCGATGCTGATGCTAGATA ATCGATGCTGATGCTAGATT 0.60000000
## 3 ATCGATGCTGATGCTAGATA ATCGATGCTGATGCTAGATA 0.06944444

5 Indel prediction score

5.1 Lindel score (Cas9)

Non-homologous end-joining (NHEJ) plays an important role in double-strand break (DSB) repair of DNA. Error patterns of NHEJ can be strongly biased by sequence context, and several studies have shown that microhomology can be used to predict indels resulting from CRISPR/Cas9-mediated cleavage. Among other useful metrics, the frequency of frameshift-causing indels can be estimated for a given sgRNA.

Lindel (Chen et al. 2019) is a logistic regression model that was trained to use local sequence context to predict the distribution of mutational outcomes. In crisprScore, the function getLindelScores return the proportion of “frameshifting” indels estimated by Lindel. By chance, assuming a random distribution of indel lengths, frameshifting proportions should be roughly around 0.66. A Lindel score higher than 0.66 indicates a higher than by chance probability that a sgRNA induces a frameshift mutation.

The Lindel algorithm requires nucleotide context around the protospacer sequence; the following full sequence is needed: [13bp upstream flanking sequence][23bp protospacer sequence] [29bp downstream flanking sequence], for a total of 65bp. The function getLindelScores takes as inputs such 65bp sequences:

flank5 <- "ACCTTTTAATCGA" #13bp
spacer <- "TGCTGATGCTAGATATTAAG" #20bp
pam    <- "TGG" #3bp
flank3 <- "CTTTTAATCGATGCTGATGCTAGATATTA" #29bp
input <- paste0(flank5, spacer, pam, flank3)
results <- getLindelScores(input)
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] crisprScore_1.0.0       crisprScoreData_0.99.15 ExperimentHub_2.4.0    
## [4] AnnotationHub_3.4.0     BiocFileCache_2.4.0     dbplyr_2.1.1           
## [7] BiocGenerics_0.42.0     BiocStyle_2.24.0       
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.56.0                httr_1.4.2                   
##  [3] sass_0.4.1                    bit64_4.0.5                  
##  [5] jsonlite_1.8.0                bslib_0.3.1                  
##  [7] shiny_1.7.1                   assertthat_0.2.1             
##  [9] interactiveDisplayBase_1.34.0 highr_0.9                    
## [11] BiocManager_1.30.17           stats4_4.2.0                 
## [13] blob_1.2.3                    GenomeInfoDbData_1.2.8       
## [15] yaml_2.3.5                    BiocVersion_3.15.2           
## [17] pillar_1.7.0                  RSQLite_2.2.12               
## [19] lattice_0.20-45               glue_1.6.2                   
## [21] reticulate_1.24               digest_0.6.29                
## [23] promises_1.2.0.1              XVector_0.36.0               
## [25] randomForest_4.7-1            htmltools_0.5.2              
## [27] httpuv_1.6.5                  Matrix_1.4-1                 
## [29] pkgconfig_2.0.3               dir.expiry_1.4.0             
## [31] bookdown_0.26                 zlibbioc_1.42.0              
## [33] purrr_0.3.4                   xtable_1.8-4                 
## [35] later_1.3.0                   tibble_3.1.6                 
## [37] KEGGREST_1.36.0               generics_0.1.2               
## [39] IRanges_2.30.0                ellipsis_0.3.2               
## [41] cachem_1.0.6                  cli_3.3.0                    
## [43] magrittr_2.0.3                crayon_1.5.1                 
## [45] mime_0.12                     memoise_2.0.1                
## [47] evaluate_0.15                 fansi_1.0.3                  
## [49] tools_4.2.0                   lifecycle_1.0.1              
## [51] basilisk.utils_1.8.0          stringr_1.4.0                
## [53] S4Vectors_0.34.0              AnnotationDbi_1.58.0         
## [55] Biostrings_2.64.0             compiler_4.2.0               
## [57] jquerylib_0.1.4               GenomeInfoDb_1.32.0          
## [59] rlang_1.0.2                   grid_4.2.0                   
## [61] RCurl_1.98-1.6                rappdirs_0.3.3               
## [63] bitops_1.0-7                  rmarkdown_2.14               
## [65] basilisk_1.8.0                DBI_1.1.2                    
## [67] curl_4.3.2                    R6_2.5.1                     
## [69] knitr_1.38                    dplyr_1.0.8                  
## [71] fastmap_1.1.0                 bit_4.0.4                    
## [73] utf8_1.2.2                    filelock_1.0.2               
## [75] stringi_1.7.6                 parallel_4.2.0               
## [77] Rcpp_1.0.8.3                  vctrs_0.4.1                  
## [79] png_0.1-7                     tidyselect_1.1.2             
## [81] xfun_0.30

References

Chen, Wei, Aaron McKenna, Jacob Schreiber, Maximilian Haeussler, Yi Yin, Vikram Agarwal, William Stafford Noble, and Jay Shendure. 2019. “Massively Parallel Profiling and Predictive Modeling of the Outcomes of Crispr/Cas9-Mediated Double-Strand Break Repair.” Nucleic Acids Research 47 (15): 7989–8003.

DeWeirdt, Peter C, Kendall R Sanson, Annabel K Sangree, Mudra Hegde, Ruth E Hanna, Marissa N Feeley, Audrey L Griffith, et al. 2020. “Optimization of Ascas12a for Combinatorial Genetic Screens in Human Cells.” Nature Biotechnology, 1–11.

Doench, John G, Nicolo Fusi, Meagan Sullender, Mudra Hegde, Emma W Vaimberg, Katherine F Donovan, Ian Smith, et al. 2016. “Optimized sgRNA Design to Maximize Activity and Minimize Off-Target Effects of Crispr-Cas9.” Nature Biotechnology 34 (2): 184.

Doench, John G, Ella Hartenian, Daniel B Graham, Zuzana Tothova, Mudra Hegde, Ian Smith, Meagan Sullender, Benjamin L Ebert, Ramnik J Xavier, and David E Root. 2014. “Rational Design of Highly Active sgRNAs for Crispr-Cas9–Mediated Gene Inactivation.” Nature Biotechnology 32 (12): 1262–7.

Hsu, Patrick D, David A Scott, Joshua A Weinstein, F Ann Ran, Silvana Konermann, Vineeta Agarwala, Yinqing Li, et al. 2013. “DNA Targeting Specificity of Rna-Guided Cas9 Nucleases.” Nature Biotechnology 31 (9): 827.

Kim, Hui Kwon, Seonwoo Min, Myungjae Song, Soobin Jung, Jae Woo Choi, Younggwang Kim, Sangeun Lee, Sungroh Yoon, and Hyongbum Henry Kim. 2018. “Deep Learning Improves Prediction of Crispr–Cpf1 Guide Rna Activity.” Nature Biotechnology 36 (3): 239.

Kleinstiver, Benjamin P, Alexander A Sousa, Russell T Walton, Y Esther Tak, Jonathan Y Hsu, Kendell Clement, Moira M Welch, et al. 2019. “Engineered Crispr–Cas12a Variants with Increased Activities and Improved Targeting Ranges for Gene, Epigenetic and Base Editing.” Nature Biotechnology 37 (3): 276–82.

Moreno-Mateos, Miguel A, Charles E Vejnar, Jean-Denis Beaudoin, Juan P Fernandez, Emily K Mis, Mustafa K Khokha, and Antonio J Giraldez. 2015. “CRISPRscan: Designing Highly Efficient sgRNAs for Crispr-Cas9 Targeting in Vivo.” Nature Methods 12 (10): 982–88.

Wang, Daqi, Chengdong Zhang, Bei Wang, Bin Li, Qiang Wang, Dong Liu, Hongyan Wang, et al. 2019. “Optimized Crispr Guide Rna Design for Two High-Fidelity Cas9 Variants by Deep Learning.” Nature Communications 10 (1): 1–14.