Alternative polyadenylation (APA) is one of the important post-transcriptional regulation mechanisms which occurs in most human genes. InPAS facilitates the discovery of novel APA sites from RNAseq data. It leverages cleanUpdTSeq to fine tune identified APA sites.
Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms which occurs in most human genes. APA in a gene can result in altered expression of the gene, which can lead pathological effect to the cells, such as uncontrolled cell cycle and growth. However, there are only limited ways to identify and quantify APA in genes, and most of them suffers from complicated process for library construction and requires large amount of RNAs.
RNA-seq has become one of the most commonly used methods for quantifying genome-wide gene expression. There are massive RNA-seq datasets available publicly such as GEO and TCGA. For this reason, we develop the InPAS algorithm for identifying APA from conventional RNA-seq data.
The workflow for InPAS is:
To use InPAS, BSgenome and TxDb object have to be loaded before run.
suppressPackageStartupMessages(library(InPAS))
suppressPackageStartupMessages(library(BSgenome.Mmusculus.UCSC.mm10))
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
path <- file.path(find.package("InPAS"), "extdata")
Users can prepare annotaiton by utr3Annotation with a TxDb and org annotation. The 3UTR annotation prepared by utr3Annotation includes the gaps to next annotation region.
suppressPackageStartupMessages(library(org.Hs.eg.db))
samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(samplefile)
utr3.sample.anno <- utr3Annotation(txdb=txdb,
orgDbSYMBOL="org.Hs.egSYMBOL")
utr3.sample.anno
## GRanges object with 177 ranges and 6 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## uc001bum.2_5|IQCC|utr3 chr1 [ 32673684, 32674288] + |
## uc001fbq.3_3|S100A9|utr3 chr1 [153333315, 153333503] + |
## uc001gde.2_2|LRRC52|utr3 chr1 [165533062, 165533185] + |
## uc001hfg.3_15|PFKFB2|utr3 chr1 [207245717, 207251162] + |
## uc010psc.2_17|PFKFB2|utr3 chr1 [207252365, 207254368] + |
## ... ... ... ... .
## uc004dst.3_22|PHF8|CDS chrX [ 53965591, 53965679] - |
## uc004dsx.3_15|PHF8|CDS chrX [ 53969797, 53969835] - |
## uc004ehz.1_5|ARMCX3|CDS chrX [100879970, 100881109] + |
## uc004elw.3_6|FAM199X|CDS chrX [103434289, 103434459] + |
## uc004fmj.1_10|GAB3|CDS chrX [153906455, 153906571] - |
## feature id exon
## <character> <character> <character>
## uc001bum.2_5|IQCC|utr3 unknown utr3 uc001bum.2_5
## uc001fbq.3_3|S100A9|utr3 unknown utr3 uc001fbq.3_3
## uc001gde.2_2|LRRC52|utr3 unknown utr3 uc001gde.2_2
## uc001hfg.3_15|PFKFB2|utr3 unknown utr3 uc001hfg.3_15
## uc010psc.2_17|PFKFB2|utr3 unknown utr3 uc010psc.2_17
## ... ... ... ...
## uc004dst.3_22|PHF8|CDS unknown CDS uc004dst.3_22
## uc004dsx.3_15|PHF8|CDS unknown CDS uc004dsx.3_15
## uc004ehz.1_5|ARMCX3|CDS unknown CDS uc004ehz.1_5
## uc004elw.3_6|FAM199X|CDS unknown CDS uc004elw.3_6
## uc004fmj.1_10|GAB3|CDS unknown CDS uc004fmj.1_10
## transcript gene symbol
## <character> <character> <character>
## uc001bum.2_5|IQCC|utr3 uc001bum.2 55721 IQCC
## uc001fbq.3_3|S100A9|utr3 uc001fbq.3 6280 S100A9
## uc001gde.2_2|LRRC52|utr3 uc001gde.2 440699 LRRC52
## uc001hfg.3_15|PFKFB2|utr3 uc001hfg.3 5208 PFKFB2
## uc010psc.2_17|PFKFB2|utr3 uc010psc.2 5208 PFKFB2
## ... ... ... ...
## uc004dst.3_22|PHF8|CDS uc004dst.3 23133 PHF8
## uc004dsx.3_15|PHF8|CDS uc004dsx.3 23133 PHF8
## uc004ehz.1_5|ARMCX3|CDS uc004ehz.1 51566 ARMCX3
## uc004elw.3_6|FAM199X|CDS uc004elw.3 139231 FAM199X
## uc004fmj.1_10|GAB3|CDS uc004fmj.1 139716 GAB3
## -------
## seqinfo: 27 sequences from hg19 genome; no seqlengths
Users can load mm10 and hg19 annotation from pre-prepared data. Here we loaded the prepared mm10 3UTR annotation file.
##step1 annotation
# load from dataset
data(utr3.mm10)
The coverage is calculated from BEDGraph file. The RNA-seq BAM files could be converted to BEDGraph files by bedtools genomecov tool (parameter: -bg -split). PWM and a classifier of polyA signal can be used for adjusting CP sites prediction.
load(file.path(path, "polyA.rds"))
suppressPackageStartupMessages(library(cleanUpdTSeq))
data(classifier)
bedgraphs <- c(file.path(path, "Baf3.extract.bedgraph"),
file.path(path, "UM15.extract.bedgraph"))
hugeData <- FALSE
##step1 Calculate coverage
coverage <- coverageFromBedGraph(bedgraphs,
tags=c("Baf3", "UM15"),
genome=BSgenome.Mmusculus.UCSC.mm10,
hugeData=hugeData)
##step2 Predict cleavage sites
CPs <- CPsites(coverage=coverage,
genome=BSgenome.Mmusculus.UCSC.mm10,
utr3=utr3.mm10,
search_point_START=200,
cutEnd=.2,
long_coverage_threshold=3,
background="10K",
txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
PolyA_PWM=pwm,
classifier=classifier,
shift_range=50)
##step3 Estimate 3UTR usage
res <- testUsage(CPsites=CPs,
coverage=coverage,
genome=BSgenome.Mmusculus.UCSC.mm10,
utr3=utr3.mm10,
method="fisher.exact",
gp1="Baf3", gp2="UM15")
##step4 view the results
as(res, "GRanges")
## GRanges object with 4 ranges and 25 metadata columns:
## seqnames ranges strand | transcript
## <Rle> <IRanges> <Rle> | <character>
## uc009daz.2 chr6 [ 98018176, 98021358] + | uc009daz.2
## uc009dhz.2 chr6 [114859343, 114862071] + | uc009dhz.2
## uc009eet.1 chr6 [128846244, 128850081] - | uc009eet.1
## uc012ero.1 chr6 [119315133, 119317055] - | uc012ero.1
## gene symbol fit_value Predicted_Proximal_APA
## <character> <character> <numeric> <numeric>
## uc009daz.2 17342 Mitf 12594.6737 98018978
## uc009dhz.2 74244 Atg7 27383.4125 114859674
## uc009eet.1 232406 BC035044 128.8275 128849128
## uc012ero.1 211187 Lrtm2 168.5509 119316541
## Predicted_Distal_APA type utr3start utr3end
## <numeric> <character> <numeric> <numeric>
## uc009daz.2 98021358 novel proximal 98018276 98021358
## uc009dhz.2 114862071 novel distal 114859443 114860614
## uc009eet.1 128846244 novel distal 128849981 128848044
## uc012ero.1 119315133 novel proximal 119316955 119315133
## Baf3_short.form.usage UM15_short.form.usage
## <numeric> <numeric>
## uc009daz.2 33.531338 1.0521497
## uc009dhz.2 520.914790 172.1523492
## uc009eet.1 22.671784 0.7334097
## uc012ero.1 8.857865 49.5356160
## Baf3_long.form.usage UM15_long.form.usage Baf3_PDUI UM15_PDUI
## <numeric> <numeric> <numeric> <numeric>
## uc009daz.2 282.824024 189.14112 0.8940074 0.9944680
## uc009dhz.2 208.024604 456.54462 0.2853798 0.7261760
## uc009eet.1 7.876603 21.96014 0.2578402 0.9676820
## uc012ero.1 8.535131 70.81263 0.4907223 0.5883977
## short.mean.gp1 long.mean.gp1 short.mean.gp2 long.mean.gp2
## <numeric> <numeric> <numeric> <numeric>
## uc009daz.2 33.531338 282.824024 1.0521497 189.14112
## uc009dhz.2 520.914790 208.024604 172.1523492 456.54462
## uc009eet.1 22.671784 7.876603 0.7334097 21.96014
## uc012ero.1 8.857865 8.535131 49.5356160 70.81263
## PDUI.gp1 PDUI.gp2 dPDUI logFC P.Value
## <numeric> <numeric> <numeric> <numeric> <numeric>
## uc009daz.2 0.8940074 0.9944680 -0.10046063 -0.1536382 1.586980e-06
## uc009dhz.2 0.2853798 0.7261760 -0.44079612 -1.3474358 1.709476e-60
## uc009eet.1 0.2578402 0.9676820 -0.70984179 -1.9080557 2.632879e-08
## uc012ero.1 0.4907223 0.5883977 -0.09767539 -0.2618847 5.930309e-01
## adj.P.Val
## <numeric>
## uc009daz.2 2.115973e-06
## uc009dhz.2 6.837904e-60
## uc009eet.1 5.265757e-08
## uc012ero.1 5.930309e-01
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
filterRes(res, gp1="Baf3", gp2="UM15",
background_coverage_threshold=3,
adj.P.Val_cutoff=0.05,
dPDUI_cutoff=0.3,
PDUI_logFC_cutoff=0.59)
## GRanges object with 4 ranges and 26 metadata columns:
## seqnames ranges strand | transcript
## <Rle> <IRanges> <Rle> | <character>
## uc009daz.2 chr6 [ 98018176, 98021358] + | uc009daz.2
## uc009dhz.2 chr6 [114859343, 114862071] + | uc009dhz.2
## uc009eet.1 chr6 [128846244, 128850081] - | uc009eet.1
## uc012ero.1 chr6 [119315133, 119317055] - | uc012ero.1
## gene symbol fit_value Predicted_Proximal_APA
## <character> <character> <numeric> <numeric>
## uc009daz.2 17342 Mitf 12594.6737 98018978
## uc009dhz.2 74244 Atg7 27383.4125 114859674
## uc009eet.1 232406 BC035044 128.8275 128849128
## uc012ero.1 211187 Lrtm2 168.5509 119316541
## Predicted_Distal_APA type utr3start utr3end
## <numeric> <character> <numeric> <numeric>
## uc009daz.2 98021358 novel proximal 98018276 98021358
## uc009dhz.2 114862071 novel distal 114859443 114860614
## uc009eet.1 128846244 novel distal 128849981 128848044
## uc012ero.1 119315133 novel proximal 119316955 119315133
## Baf3_short.form.usage UM15_short.form.usage
## <numeric> <numeric>
## uc009daz.2 33.531338 1.0521497
## uc009dhz.2 520.914790 172.1523492
## uc009eet.1 22.671784 0.7334097
## uc012ero.1 8.857865 49.5356160
## Baf3_long.form.usage UM15_long.form.usage Baf3_PDUI UM15_PDUI
## <numeric> <numeric> <numeric> <numeric>
## uc009daz.2 282.824024 189.14112 0.8940074 0.9944680
## uc009dhz.2 208.024604 456.54462 0.2853798 0.7261760
## uc009eet.1 7.876603 21.96014 0.2578402 0.9676820
## uc012ero.1 8.535131 70.81263 0.4907223 0.5883977
## short.mean.gp1 long.mean.gp1 short.mean.gp2 long.mean.gp2
## <numeric> <numeric> <numeric> <numeric>
## uc009daz.2 33.531338 282.824024 1.0521497 189.14112
## uc009dhz.2 520.914790 208.024604 172.1523492 456.54462
## uc009eet.1 22.671784 7.876603 0.7334097 21.96014
## uc012ero.1 8.857865 8.535131 49.5356160 70.81263
## PDUI.gp1 PDUI.gp2 dPDUI logFC P.Value
## <numeric> <numeric> <numeric> <numeric> <numeric>
## uc009daz.2 0.8940074 0.9944680 -0.10046063 -0.1536382 1.586980e-06
## uc009dhz.2 0.2853798 0.7261760 -0.44079612 -1.3474358 1.709476e-60
## uc009eet.1 0.2578402 0.9676820 -0.70984179 -1.9080557 2.632879e-08
## uc012ero.1 0.4907223 0.5883977 -0.09767539 -0.2618847 5.930309e-01
## adj.P.Val PASS
## <numeric> <logical>
## uc009daz.2 2.115973e-06 FALSE
## uc009dhz.2 6.837904e-60 TRUE
## uc009eet.1 5.265757e-08 TRUE
## uc012ero.1 5.930309e-01 FALSE
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
The process described above can be done in one step.
if(interactive()){
res <- inPAS(bedgraphs=bedgraphs, tags=c("Baf3", "UM15"),
genome=BSgenome.Mmusculus.UCSC.mm10,
utr3=utr3.mm10, gp1="Baf3", gp2="UM15",
txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
search_point_START=200,
short_coverage_threshold=15,
long_coverage_threshold=3,
cutStart=0, cutEnd=.2,
hugeData=FALSE,
shift_range=50,
PolyA_PWM=pwm, classifier=classifier,
method="fisher.exact",
adj.P.Val_cutoff=0.05,
dPDUI_cutoff=0.3,
PDUI_logFC_cutoff=0.59)
}
InPAS can handle single group data.
inPAS(bedgraphs=bedgraphs[1], tags=c("Baf3"),
genome=BSgenome.Mmusculus.UCSC.mm10,
utr3=utr3.mm10, gp1="Baf3", gp2=NULL,
txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
search_point_START=200,
short_coverage_threshold=15,
long_coverage_threshold=3,
cutStart=0, cutEnd=.2,
hugeData=FALSE,
PolyA_PWM=pwm, classifier=classifier,
method="singleSample",
adj.P.Val_cutoff=0.05)
## converged at iteration 1 with logLik: -1835.501
## converged at iteration 5 with logLik: -838.8306
## converged at iteration 1 with logLik: -1487.249
## converged at iteration 7 with logLik: -711.1667
## converged at iteration 1 with logLik: -998.7353
## converged at iteration 25 with logLik: -554.8191
## converged at iteration 1 with logLik: -459.6845
## converged at iteration 11 with logLik: -213.0043
## converged at iteration 2 with logLik: -191.9692
## converged at iteration 11 with logLik: -153.2558
## dPDUI is calculated by gp2 - gp1.
## GRanges object with 6 ranges and 20 metadata columns:
## seqnames ranges strand | transcript
## "<Rle>" "<IRanges>" "<Rle>" "|" "<character>"
## uc009daz.2 "chr6" "[ 98018176, 98021358]" "+" "|" "uc009daz.2"
## uc009dhz.2 "chr6" "[114859343, 114862071]" "+" "|" "uc009dhz.2"
## uc009die.2 "chr6" "[114860615, 114862164]" "-" "|" "uc009die.2"
## uc009eet.1 "chr6" "[128847265, 128850081]" "-" "|" "uc009eet.1"
## uc009eol.1 "chr6" "[140651362, 140651622]" "+" "|" "uc009eol.1"
## uc012ero.1 "chr6" "[119315133, 119317055]" "-" "|" "uc012ero.1"
## gene symbol fit_value
## "<character>" "<character>" "<numeric>"
## uc009daz.2 "17342" "Mitf" 17842.87
## uc009dhz.2 "74244" "Atg7" 7576.055
## uc009die.2 "232334" "Vgll4" 10716.47
## uc009eet.1 "232406" "BC035044" 239.932
## uc009eol.1 "11569" "Aebp2" 190776
## uc012ero.1 "211187" "Lrtm2" 19.16645
## Predicted_Proximal_APA Predicted_Distal_APA type
## "<numeric>" "<numeric>" "<character>"
## uc009daz.2 98019024 98021358 "novel proximal"
## uc009dhz.2 114860286 114862071 "novel distal"
## uc009die.2 114861444 114860615 "novel distal"
## uc009eet.1 128848709 128847265 "novel distal"
## uc009eol.1 140651561 140651622 "novel proximal"
## uc012ero.1 119316666 119315133 "novel proximal"
## utr3start utr3end data2 Baf3_short.form.usage
## "<numeric>" "<numeric>" "<list>" "<numeric>"
## uc009daz.2 98018176 98021358 Integer,848 29.6839
## uc009dhz.2 114859343 114860614 Integer,943 2.934986
## uc009die.2 114862164 114862092 Integer,73 122.1074
## uc009eet.1 128850081 128848044 Integer,1372 20.88953
## uc009eol.1 140651362 140651622 Integer,199 1057.506
## uc012ero.1 119317055 119315133 Integer,389 8.935672
## Baf3_long.form.usage Baf3_PDUI short.mean long.mean
## "<numeric>" "<numeric>" "<list>" "<list>"
## uc009daz.2 283.3645 0.9051779 "########" "########"
## uc009dhz.2 210.5062 0.9862492 "########" "########"
## uc009die.2 173.9036 0.5874903 "########" "########"
## uc009eet.1 9.841522 0.3202469 "########" "########"
## uc009eol.1 248.3387 0.1901748 "########" "########"
## uc012ero.1 9.095176 0.5044231 "########" "########"
## PDUI P.Value adj.P.Val dPDUI PASS
## "<list>" "<list>" "<list>" "<numeric>" "<logical>"
## uc009daz.2 "########" "########" "########" NaN FALSE
## uc009dhz.2 "########" "########" "########" NaN FALSE
## uc009die.2 "########" "########" "########" NaN FALSE
## uc009eet.1 "########" "########" "########" NaN FALSE
## uc009eol.1 "########" "########" "########" NaN FALSE
## uc012ero.1 "########" "########" "########" NaN FALSE
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
sessionInfo()
R version 3.3.0 (2016-05-03) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.4 LTS
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base
other attached packages: [1] cleanUpdTSeq_1.10.2
[2] e1071_1.6-7
[3] seqinr_3.1-5
[4] BSgenome.Drerio.UCSC.danRer7_1.4.0
[5] org.Hs.eg.db_3.3.0
[6] TxDb.Mmusculus.UCSC.mm10.knownGene_3.2.2 [7] BSgenome.Mmusculus.UCSC.mm10_1.4.0
[8] BSgenome_1.40.1
[9] rtracklayer_1.32.1
[10] Biostrings_2.40.2
[11] XVector_0.12.0
[12] InPAS_1.4.4
[13] GenomicFeatures_1.24.3
[14] AnnotationDbi_1.34.3
[15] GenomicRanges_1.24.2
[16] GenomeInfoDb_1.8.2
[17] IRanges_2.6.1
[18] S4Vectors_0.10.1
[19] Biobase_2.32.0
[20] BiocGenerics_0.18.0
[21] BiocStyle_2.0.2
loaded via a namespace (and not attached): [1] httr_1.2.0 AnnotationHub_2.4.2
[3] splines_3.3.0 Formula_1.2-1
[5] shiny_0.13.2 interactiveDisplayBase_1.10.3 [7] latticeExtra_0.6-28 Rsamtools_1.24.0
[9] yaml_2.1.13 RSQLite_1.0.0
[11] lattice_0.20-33 biovizBase_1.20.0
[13] limma_3.28.10 chron_2.3-47
[15] digest_0.6.9 RColorBrewer_1.1-2
[17] colorspace_1.2-6 preprocessCore_1.34.0
[19] htmltools_0.3.5 httpuv_1.3.3
[21] Matrix_1.2-6 plyr_1.8.4
[23] XML_3.98-1.4 biomaRt_2.28.0
[25] zlibbioc_1.18.0 xtable_1.8-2
[27] scales_0.4.0 BiocParallel_1.6.2
[29] ggplot2_2.1.0 SummarizedExperiment_1.2.3
[31] nnet_7.3-12 Gviz_1.16.1
[33] survival_2.39-4 magrittr_1.5
[35] mime_0.4 evaluate_0.9
[37] MASS_7.3-45 truncnorm_1.0-7
[39] foreign_0.8-66 class_7.3-14
[41] BiocInstaller_1.22.2 tools_3.3.0
[43] data.table_1.9.6 formatR_1.4
[45] matrixStats_0.50.2 stringr_1.0.0
[47] depmixS4_1.3-3 munsell_0.4.3
[49] cluster_2.0.4 ensembldb_1.4.6
[51] ade4_1.7-4 grid_3.3.0
[53] RCurl_1.95-4.8 dichromat_2.0-0
[55] VariantAnnotation_1.18.1 Rsolnp_1.16
[57] bitops_1.0-6 rmarkdown_0.9.6
[59] gtable_0.2.0 DBI_0.4-1
[61] R6_2.1.2 GenomicAlignments_1.8.3
[63] gridExtra_2.2.1 knitr_1.13
[65] Hmisc_3.17-4 stringi_1.1.1
[67] Rcpp_0.12.5 rpart_4.1-10
[69] acepack_1.3-3.3