cleanUpdTSeq 1.24.0
3’ ends of transcripts have generally been poorly annotated. With the advent of deep sequencing, many methods have been developed to identify 3’ ends. The majority of these methods use an oligo-dT primer, which can bind to internal adenine-rich sequences, and lead to artifactual identification of polyadenylation sites. Heuristic filtering methods rely on a certain number of adenines in the genomic sequence downstream of a putative polyadenylation site to remove internal priming events. We introduce a package to provide a robust method to classify putative polyadenylation sites. cleanUpdTSeq uses a naïve Bayes classifier, implemented through the e1071 [1], and sequence features surrounding the putative polyadenylation sites for classification.
The package includes a training dataset constructed from 6 different Zebrafish sequencing dataset, and functions for fetching surrounding sequences using BSgenome [2], building feature vectors and classifying whether the putative polyadenylations site is a true polyadenylation site or a mis-primed false site.
If you use cleanUpdTSeq, please cite:
Sheppard, S., Lawson, N.D. and Zhu, L.J., 2013. Accurate identification of polyadenylation sites from 3’ end deep sequencing using a naive Bayes classifier. Bioinformatics, 29(20), pp.2564-2571 and 2014, 30(4), pp.596, https://doi.org/10.1093/bioinformatics/btt714
Here is a step-by-step guide on using cleanUpdTSeq to classify a list of putative polyadenylation sites
library(cleanUpdTSeq)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect,
## is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
## pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply,
## setdiff, sort, table, tapply, union, unique, unsplit, which,
## which.max, which.min
testFile <- system.file("extdata", "test.bed", package="cleanUpdTSeq")
testSet <- read.table(testFile, sep="\t", header=TRUE)
peaks <- BED2GRangesSeq(testSet, withSeq=FALSE)
If test dataset contains sequence information already, then use the following command instead.
peaks <- BED2GRangesSeq(testSet, upstream.seq.ind=7,
downstream.seq.ind=8, withSeq=TRUE)
To work with your own test dataset, please set testFile to the file path that contains the putative sites.
Here is how the test dataset look like.
head(testSet)
## chr start stop name score strand
## 1 chr10 2965327 2965327 6hpas-22249 1 -
## 2 chr10 2966558 2966558 6hpas-22250 1 -
## 3 chr10 2974251 2974251 6hpas-22251 2 -
## 4 chr10 2978441 2978441 6hpas-22252 1 -
## 5 chr11 16772291 16772291 6hpas-33204 1 -
## 6 chr11 16777848 16777848 6hpas-33205 1 -
## upstream downstream
## 1 TCTTCATCATGGTCATCTCGCACCAGAGAGTGTGCCAGGG CAGGAAGTTTTACCTGTCTGTCATTATCGT
## 2 ACCCTGGTGAGGGTATAGAGCTGGTCCAGTGTGCCACGGC AAAGAGGAAAACAGCATTGTTCCTCCTGGA
## 3 TGATTTGTTTGTAACTGATTTTATCTTTTAATAAAAAAGA AAAAAGAAAGTCAAGCCAAGAGGCAAATAC
## 4 GGAGCGCGACCGCATCAACAAAATCTTGCAGGATTATCAG AAGAAAAAGATGGTGAGTTATTATCATTCA
## 5 AGGGAAATAAATACAAAAGAATAAAAATATGATTCATTGT AAGAAAAACACTTTAGCTACAAAAGTCCTT
## 6 ATTTAGTTGGGTATTATTTCAAATAAAGAGAGAGAGAGAC ACAAAAACTACATCAAATTTGAGGACAAAA
The zebrafish genome from BSgenome is used in this example for obtaining surrounding sequences. For a list of other genomes available through BSgenome, please refer to the BSgenome package documentation [2].
library(BSgenome.Drerio.UCSC.danRer7)
## Loading required package: BSgenome
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: rtracklayer
testSet.NaiveBayes <- buildFeatureVector(peaks, BSgenomeName=Drerio,
upstream=40, downstream=30,
wordSize=6, alphabet=c("ACGT"),
sampleType="unknown",
replaceNAdistance=30,
method="NaiveBayes",
ZeroBasedIndex=1, fetchSeq=TRUE)
## Not all the sequence names are in the genome.
## Try to convert.
## Not all the sequence names are in the genome.
## Try to convert.
If sequences are present in the test dataset already, then set fetchSeq=FALSE.
The output file is a tab-delimited file containing the name of the putative polyadenylation sites, the probability that the putative polyadenylation site is false/oligodT internally primed, the probability the putative polyadenylation site is true, the predicted class based on the assignment cutoff and the sequence surrounding the putative polyadenylation site.
data(data.NaiveBayes)
if(interactive()){
predictTestSet(data.NaiveBayes$Negative, data.NaiveBayes$Positive,
testSet.NaiveBayes=testSet.NaiveBayes,
outputFile="test-predNaiveBayes.tsv",
assignmentCutoff=0.5)
}
Alternatively, instead of passing in a positive and a negative training dataset, set the parameter classifier to a pre-built PASclassifier to speed up the process. To built a PASclassifier using the training dataset, please use function buildClassifier. A PASclassifier named as classifier is included in the package which is generated using the included training dataset with upstream=40, downstream=30, and wordSize=6. Please note that in order to use this pre-built classier, you need to build feature vector using buildFeatureVector from your test dataset with the same setting, i.e., upstream=40, downstream=30, and wordSize=6.
data(classifier)
testResults <- predictTestSet(testSet.NaiveBayes=testSet.NaiveBayes,
classifier=classifier,
outputFile=NULL,
assignmentCutoff=0.5)
head(testResults)
## PeakName prob False/oligodT internally primed prob True pred.class
## 1 6hpas-78439 9.999997e-01 3.137403e-07 0
## 2 6hpas-78440 1.000000e+00 7.431591e-14 0
## 3 6hpas-78441 4.474675e-06 9.999955e-01 1
## 4 6hpas-78442 3.085156e-07 9.999997e-01 1
## 5 6hpas-22249 9.899582e-01 1.004181e-02 0
## 6 6hpas-22250 1.000000e+00 1.636415e-09 0
## UpstreamSeq DownstreamSeq
## 1 GGTCATTGTCCTGCAAAATGGACTACTTAACCGAACTGGA GAAGTATAAGAAGTAAGTACATTAAAGCTAC
## 2 TGGATTTAAATAACAAACAAGTTAAATAAAACGATTTGTA AAAAAATAAAACAACTGAAGAAGAAAATGAA
## 3 ATCTGCTTCAAAATGGATGCTCTGTTGAATCCTGAGCTCA GGTAATCTTTCAAGTGCTGCTATTGAGCCAA
## 4 AAATGCTTGCACATAATAAATGTAGGCTTAAAAGATTTCA AAACGTTTGTGAGAGACGGATTTTACTTTGC
## 5 TCTTCATCATGGTCATCTCGCACCAGAGAGTGTGCCAGGG CAGGAAGTTTTACCTGTCTGTCATTATCGTC
## 6 ACCCTGGTGAGGGTATAGAGCTGGTCCAGTGTGCCACGGC AAAGAGGAAAACAGCATTGTTCCTCCTGGAT
Meyer, D., et al., e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. 2012.
Pages, H., BSgenome: Infrastructure for Biostrings-based genome data packages.
Sheppard, S., Lawson, N.D. and Zhu, L.J., 2013. Accurate identification of polyadenylation sites from 3’ end deep sequencing using a naive Bayes classifier. Bioinformatics, 29(20), pp.2564-2571.
sessionInfo()
R version 3.6.1 (2019-07-05) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.3 LTS
Matrix products: default BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base
other attached packages:
[1] BSgenome.Drerio.UCSC.danRer7_1.4.0 BSgenome_1.54.0
[3] rtracklayer_1.46.0 Biostrings_2.54.0
[5] XVector_0.26.0 GenomicRanges_1.38.0
[7] GenomeInfoDb_1.22.0 IRanges_2.20.0
[9] S4Vectors_0.24.0 cleanUpdTSeq_1.24.0
[11] BiocGenerics_0.32.0 BiocStyle_2.14.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 compiler_3.6.1
[3] BiocManager_1.30.9 class_7.3-15
[5] bitops_1.0-6 tools_3.6.1
[7] zlibbioc_1.32.0 digest_0.6.22
[9] lattice_0.20-38 evaluate_0.14
[11] rlang_0.4.1 Matrix_1.2-17
[13] DelayedArray_0.12.0 yaml_2.2.0
[15] xfun_0.10 e1071_1.7-2
[17] GenomeInfoDbData_1.2.2 stringr_1.4.0
[19] knitr_1.25 grid_3.6.1
[21] ade4_1.7-13 Biobase_2.46.0
[23] XML_3.98-1.20 BiocParallel_1.20.0
[25] rmarkdown_1.16 bookdown_0.14
[27] seqinr_3.6-1 magrittr_1.5
[29] matrixStats_0.55.0 GenomicAlignments_1.22.0
[31] Rsamtools_2.2.0 htmltools_0.4.0
[33] MASS_7.3-51.4 SummarizedExperiment_1.16.0
[35] stringi_1.4.3 RCurl_1.95-4.12