1 Introduction

Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms and is prevalent in eukaryotes. Like alternative splicing, APA can increase transcriptome diversity. In addition, it defines the 3’ UTR and can alter the expression of the gene. It is a tightly controlled process, and mis-regulation of APA can disrupt many biological processes, for example leading to uncontrolled cell cycle progression and growth. Although several high-throughput sequencing methods have been developed, data dedicated to identifying APA events are still limited.

However, massive RNA-seq datasets, originally created to quantify genome-wide gene expression, are available in public databases such as GEO and TCGA. These datasets also contain information on genome-wide APA. We therefore developed the InPAS package to identify APA from conventional RNA-seq data.

The major procedures in the InPAS workflow are as follows:

1. Extract genome-wide 3’ UTR annotation from known genome annotation.
2. Set up a SQLite database for storing experimental metadata and tracking intermediate files generated during analysis.
3. Convert genome-wide read coverage per sample from a BEDGraph file to a run-length encoding (Rle) format.
4. Identify putative cleavage and polyadenylation (CP) sites for each gene based on the read coverage profile along 3’ UTR regions. Optionally, remove potential false-positive CP sites caused by technical artifacts, using either the Naive Bayes classifier (NBC) model from the cleanUpdTSeq package or polyadenylation scores obtained by matching the position-weight matrix (PWM) for the hexamer polyadenylation signal (AAUAAA and the like).
5. Estimate usage of proximal and distal CP sites based on read coverage along the short and long 3’ UTRs.
6. Identify differential usage of proximal and distal CP sites between conditions, using different statistical models depending on the experimental design.

In addition, the InPAS package provides functions to perform quality control of RNA-seq data coverage, to visualize differential usage of proximal and distal CP sites for genes of interest, and to prepare the files needed for gene set enrichment analysis (GSEA), which can reveal biological insights from genes with alternative CP sites.

2 How to run InPAS

First, load the required packages, including InPAS and the species-specific genome and genome annotation databases: BSgenome, TxDb and EnsDb.

suppressPackageStartupMessages({
    library(InPAS)
    library(BSgenome.Mmusculus.UCSC.mm10)
    library(TxDb.Mmusculus.UCSC.mm10.knownGene)
    library(EnsDb.Hsapiens.v86)
    library(EnsDb.Mmusculus.v79)
    library(cleanUpdTSeq)
    library("RSQLite")
    library("DBI")
})

2.1 Step 1: Extracting 3’ UTR annotation

3’ UTR annotation is extracted with the function extract_UTR3Anno from two genome annotation databases for the species of interest: a TxDb database and an EnsDb database. The annotation includes the start and end coordinates and strand information of 3’ UTRs, the last CDS, and the gaps between the 3’ extremities of 3’ UTRs and the immediately downstream exons. For demonstration, the following R snippet shows how to extract 3’ UTR annotation from an abridged TxDb for a human reference genome (hg19). In practice, users should use a TxDb built from the most reliable genome annotation of the PRIMARY reference genome assembly (NOT including the alternative patches) used for RNA-seq read alignment. If a TxDb is not available for the species of interest, users can build one using the function makeTxDbFromUCSC, makeTxDbFromBiomart, makeTxDbFromEnsembl, or makeTxDbFromGFF from the GenomicFeatures package, depending on the source of the genome annotation file.

samplefile <- system.file("extdata",
                          "hg19_knownGene_sample.sqlite",
                          package = "GenomicFeatures")
TxDb <- loadDb(samplefile)
utr3_anno <- extract_UTR3Anno(TxDb = TxDb,
                              edb = EnsDb.Hsapiens.v86,
                              removeScaffolds = TRUE)

## Warning: Unable to map 3 of 42 requested IDs.

head(utr3_anno$chr1)

## GRanges object with 6 ranges and 8 metadata columns:
##                                              seqnames              ranges
##                                                 <Rle>           <IRanges>
##         chr1:lastutr3:uc001bum.2_5|IQCC|utr3     chr1   32673684-32674288
##       chr1:lastutr3:uc001fbq.3_3|S100A9|utr3     chr1 153333315-153333503
##   chr1:lastutr3:uc031pqa.1_13|100129405|utr3     chr1 155719929-155720673
##       chr1:lastutr3:uc001gde.2_2|LRRC52|utr3     chr1 165533062-165533185
##      chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3     chr1 207245717-207251162
##      chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3     chr1 207252365-207254368
##                                              strand | exon_rank  transcript
##                                               <Rle> | <integer> <character>
##         chr1:lastutr3:uc001bum.2_5|IQCC|utr3      + |         5  uc001bum.2
##       chr1:lastutr3:uc001fbq.3_3|S100A9|utr3      + |         3  uc001fbq.3
##   chr1:lastutr3:uc031pqa.1_13|100129405|utr3      + |        13  uc031pqa.1
##       chr1:lastutr3:uc001gde.2_2|LRRC52|utr3      + |         2  uc001gde.2
##      chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3      + |        15  uc001hfg.3
##      chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3      + |        15  uc001hfh.3
##                                                  feature        gene
##                                              <character> <character>
##         chr1:lastutr3:uc001bum.2_5|IQCC|utr3        utr3       55721
##       chr1:lastutr3:uc001fbq.3_3|S100A9|utr3        utr3        6280
##   chr1:lastutr3:uc031pqa.1_13|100129405|utr3        utr3   100129405
##       chr1:lastutr3:uc001gde.2_2|LRRC52|utr3        utr3      440699
##      chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3        utr3        5208
##      chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3        utr3        5208
##                                                                exon      symbol
##                                                         <character> <character>
##         chr1:lastutr3:uc001bum.2_5|IQCC|utr3 chr1:lastutr3:uc001b..        IQCC
##       chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 chr1:lastutr3:uc001f..      S100A9
##   chr1:lastutr3:uc031pqa.1_13|100129405|utr3 chr1:lastutr3:uc031p..   100129405
##       chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 chr1:lastutr3:uc001g..      LRRC52
##      chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 chr1:lastutr3:uc001h..      PFKFB2
##      chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 chr1:lastutr3:uc001h..      PFKFB2
##                                               annotatedProximalCP truncated
##                                                       <character> <logical>
##         chr1:lastutr3:uc001bum.2_5|IQCC|utr3              unknown     FALSE
##       chr1:lastutr3:uc001fbq.3_3|S100A9|utr3              unknown     FALSE
##   chr1:lastutr3:uc031pqa.1_13|100129405|utr3 proximalCP_155720479     FALSE
##       chr1:lastutr3:uc001gde.2_2|LRRC52|utr3              unknown     FALSE
##      chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3              unknown     FALSE
##      chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3              unknown     FALSE
##   -------
##   seqinfo: 24 sequences from an unspecified genome
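
The range names in the output above appear to pack several fields into a single string: chromosome, a "lastutr3" label, transcript ID with exon rank, gene symbol, and feature type. The following is a minimal base R sketch for recovering those fields, assuming that this layout holds for every range. The layout is inferred from the printed example rather than from a documented guarantee, and the helper name parse_utr3_names is hypothetical, not part of InPAS.

```r
# Hypothetical helper (not part of InPAS): split range names of the form
# "<chr>:lastutr3:<transcript>_<exon rank>|<symbol>|<feature>".
parse_utr3_names <- function(range_names) {
  parts <- strsplit(range_names, "|", fixed = TRUE)
  prefix <- vapply(parts, `[`, character(1), 1)
  # Keep only the text after the last colon, e.g. "uc001bum.2_5"
  tx_rank <- sub("^.*:", "", prefix)
  data.frame(
    chr = sub(":.*$", "", prefix),
    transcript = sub("_[0-9]+$", "", tx_rank),
    exon_rank = as.integer(sub("^.*_", "", tx_rank)),
    symbol = vapply(parts, `[`, character(1), 2),
    feature = vapply(parts, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

# Two names copied from the printed output above
parsed <- parse_utr3_names(c(
  "chr1:lastutr3:uc001bum.2_5|IQCC|utr3",
  "chr1:lastutr3:uc031pqa.1_13|100129405|utr3"
))
print(parsed)
```

In a real session the same helper could be applied to names(utr3_anno$chr1), although the metadata columns of the GRanges object already expose these fields directly.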

This vignette will use the prepared 3’ UTR annotation for the mouse reference genome mm10 for subsequent demonstrations.

# load R object: utr3.mm10
data(utr3.mm10)

## convert the GRanges into GRangesList for the 3' UTR annotation
utr3.mm10 <- split(utr3.mm10, seqnames(utr3.mm10))

2.2 Step 2: Setting up a SQLite database

Seven tables are created in the database. The table “metadata” stores the sample metadata: tag (sample name), condition (experimental treatment group), bedgraph_file (path to the BEDGraph file), and depth (whole-genome coverage depth), which is initially set to zero and updated later during analysis. The tables “sample_coverage”, “chromosome_coverage”, “total_coverage”, “utr3_total_coverage”, “CPsites”, and “utr3cds_coverage” store the names of intermediate files, along with the chromosome and tag (sample name) each file relates to.

data_dir <- system.file("extdata", package = "InPAS")
bedgraphs <- c(file.path(data_dir, "Baf3.extract.bedgraph"), 
               file.path(data_dir, "UM15.extract.bedgraph"))
hugeData <- FALSE
genome <- BSgenome.Mmusculus.UCSC.mm10

tags <- c("Baf3", "UM15")
metadata <- data.frame(tag = tags, 
                      condition = c("Baf3", "UM15"),
                      bedgraph_file = bedgraphs)

## In reality, don't use a temporary directory for your analysis. Instead, use a
## persistent directory to save your analysis output.
outdir <- tempdir()
write.table(metadata, file = file.path(outdir, "metadata.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
    
sqlite_db <- setup_sqlitedb(metadata = file.path(outdir, 
                                                 "metadata.txt"),
                           outdir)

## check the database
db_conn <- dbConnect(drv = RSQLite::SQLite(), dbname = sqlite_db)
dbListTables(db_conn)

## [1] "CPsites"             "chromosome_coverage" "metadata"           
## [4] "sample_coverage"     "total_coverage"      "utr3_total_coverage"
## [7] "utr3cds_coverage"

dbReadTable(db_conn, "metadata")

##    tag condition
## 1 Baf3      Baf3
## 2 UM15      UM15
##                                                            bedgraph_file depth
## 1 /tmp/RtmpZT7Z2B/Rinst583b910cca42f/InPAS/extdata/Baf3.extract.bedgraph     0
## 2 /tmp/RtmpZT7Z2B/Rinst583b910cca42f/InPAS/extdata/UM15.extract.bedgraph     0

dbDisconnect(db_conn)
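
Before writing the metadata file and calling setup_sqlitedb, it can help to check that the metadata table has the expected shape. The following base R sketch is one possible pre-flight check. The helper check_metadata is hypothetical (not part of InPAS), the file names are placeholders, and the requirement that tags be unique is an assumption made here because tags identify samples throughout the workflow.

```r
# Toy metadata mirroring the columns used above; file names are placeholders.
metadata <- data.frame(
  tag = c("Baf3", "UM15"),
  condition = c("Baf3", "UM15"),
  bedgraph_file = c("Baf3.extract.bedgraph", "UM15.extract.bedgraph"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of InPAS): return a character vector of
# problems, which is empty when the table looks usable.
check_metadata <- function(md,
                           required = c("tag", "condition", "bedgraph_file"),
                           check_files = FALSE) {
  problems <- character(0)
  missing_cols <- setdiff(required, colnames(md))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }
  # Assumption: each sample tag should appear only once
  if ("tag" %in% colnames(md) && anyDuplicated(md$tag) > 0) {
    problems <- c(problems, "duplicated sample tags")
  }
  # Optionally confirm that every BEDGraph path points to an existing file
  if (check_files && "bedgraph_file" %in% colnames(md)) {
    absent <- md$bedgraph_file[!file.exists(md$bedgraph_file)]
    problems <- c(problems, sprintf("file not found: %s", absent))
  }
  problems
}

ok_report <- check_metadata(metadata)
bad_report <- check_metadata(metadata[, c("tag", "condition")])
print(bad_report)
```

With real data, setting check_files = TRUE would also flag BEDGraph paths that do not exist on disk.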

2.3 Step 3: Reformatting coverage data

Before this step, genome coverage in the BEDGraph format should be prepared from the BAM files resulting from RNA-seq read alignment, using the genomecov command in the BEDTools suite. The BAM files can first be filtered to remove multi-mapping alignments, alignments with low mapping quality, and so on. Reference commands are as follows:

## for single end RNA-seq data aligned with STAR
## -q 255, unique mapping
samtools view -bu -h -q 255 /path/to/XXX.SE.bam | \
    bedtools genomecov -ibam  - -bga -split  > XXX.SE.uniq.bedgraph

## for paired-end RNA-seq data aligned with STAR
samtools view -bu -h -q 255 /path/to/XXX.PE.bam | \
    bedtools genomecov -ibam  - -bga -split  > XXX.PE.uniq.bedgraph

The genome coverage data in the BEDGraph format is converted into R objects of Rle-class using the get_ssRleCov function for each chromosome of each sample. The Rle objects for each individual chromosome are saved to outdir, and the filename, tag (sample name), and chromosome name are saved to the table “sample_coverage”. Subsequently, the chromosome-specific Rle objects for all samples are assembled with assemble_allCov into a two-level list of Rle objects, with level 1 being the chromosome name and level 2 being the Rle for each tag (sample name). Notably, the sample BEDGraph files used here only contain coverage data for “chr6” of the mouse reference genome mm10.

coverage <- list()
for (i in seq_along(bedgraphs)) {
    coverage[[tags[i]]] <- get_ssRleCov(bedgraph = bedgraphs[i],
                                        tag = tags[i],
                                        genome = genome,
                                        sqlite_db = sqlite_db,
                                        outdir = outdir,
                                        removeScaffolds = TRUE,
                                        BPPARAM = NULL)
}
coverage_files <- assemble_allCov(sqlite_db, 
                                  outdir, 
                                  genome, 
                                  removeScaffolds = FALSE)
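
To make the BEDGraph-to-Rle conversion concrete, here is a minimal base R sketch of the idea. It uses the base "rle" class rather than the S4Vectors Rle class that InPAS works with, and the three toy intervals are invented for illustration. It relies only on the facts that BEDGraph intervals are 0-based and half-open, and that bedtools genomecov -bga also reports zero-coverage intervals, so consecutive records tile the chromosome.

```r
# Toy BEDGraph records (0-based, half-open intervals); values are invented.
bedgraph <- data.frame(
  chr = "chr6",
  start = c(0L, 3L, 5L),
  end = c(3L, 5L, 9L),
  value = c(0, 2, 5)
)

# Each interval becomes one run: its length is end - start and its value is
# the coverage reported for that interval.
cov_rle <- structure(
  list(lengths = bedgraph$end - bedgraph$start, values = bedgraph$value),
  class = "rle"
)

# Expanding the runs gives per-base coverage; re-encoding recovers the runs.
per_base <- inverse.rle(cov_rle)
round_trip <- rle(per_base)
print(per_base)
```

The run-length form stores three runs instead of nine per-base values, which is why it suits genome-wide coverage with long stretches of constant depth.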

At this point, users can check data quality in terms of coverage for all genes, expressed genes, and 3’ UTRs using run_coverageQC. This function outputs summarized coverage metrics: gene.coverage.rate, expressed.gene.coverage.rate, UTR3.coverage.rate, and UTR3.expressed.gene.subset.coverage.rate. For good-quality data, the coverage rate for 3’ UTRs of expressed genes (UTR3.expressed.gene.subset.coverage.rate) should be greater than 0.75.

edb <- EnsDb.Mmusculus.v79
TxDb <- TxDb.Mmusculus.UCSC.mm10.knownGene
run_coverageQC(sqlite_db, TxDb, edb, genome,
               removeScaffolds = TRUE,
               which = GRanges("chr6",
               ranges = IRanges(98013000, 140678000)))

## strand information will be ignored.

## Warning: Unable to map 6025 of 24568 requested IDs.

##      gene.coverage.rate expressed.gene.coverage.rate UTR3.coverage.rate
## Baf3        0.003463505                    0.5778441         0.01419771
## UM15        0.003428528                    0.5719748         0.01405159
##      UTR3.expressed.gene.subset.coverage.rate
## Baf3                                0.8035821
## UM15                                0.7953112
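
The 0.75 guideline can be applied programmatically. The sketch below re-enters the numbers printed above by hand as a plain matrix, which is an assumption about the shape of the result and not the object actually returned by run_coverageQC, and then lists any sample whose 3’ UTR coverage rate for expressed genes does not exceed the threshold.

```r
# QC values copied by hand from the output above (rows = samples).
qc <- matrix(
  c(0.003463505, 0.5778441, 0.01419771, 0.8035821,
    0.003428528, 0.5719748, 0.01405159, 0.7953112),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Baf3", "UM15"),
    c("gene.coverage.rate", "expressed.gene.coverage.rate",
      "UTR3.coverage.rate", "UTR3.expressed.gene.subset.coverage.rate")
  )
)

# Guideline from the text: the rate should be greater than 0.75
min_rate <- 0.75
utr3_rate <- qc[, "UTR3.expressed.gene.subset.coverage.rate"]
low_quality <- names(utr3_rate)[utr3_rate <= min_rate]
print(low_quality)
```

Both demonstration samples pass, so low_quality is empty here.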

2.4 Step 4: Identifying potential CP sites

The setup_CPsSearch function returns the depth weight, the Z-score cutoff thresholds, and the total coverage along 3’ UTRs, merged across biological replicates within each condition (huge data) or computed for each individual sample (non-huge data). Potential novel CP sites are then identified for each chromosome using the search_CPs function. These potential CP sites can be filtered and/or adjusted using the Naive Bayes classifier provided by cleanUpdTSeq and/or polyadenylation scores obtained by matching the position-weight matrix (PWM) for the hexamer polyadenylation signal (AAUAAA and the like).
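
As a rough illustration of what matching a PWM for the hexamer signal means, the following base R sketch scores every 6-nt window of a toy sequence against a toy log-odds matrix. Everything in it is an assumption for illustration: the matrix is built by hand for the DNA form of the canonical hexamer (AATAAA) with invented probabilities, the background is uniform, and the scoring is a generic log-odds sum. It is not the PWM shipped with InPAS nor necessarily the scoring InPAS applies.

```r
# Toy position probability matrix for AATAAA. The 0.85 / 0.05 probabilities
# are hypothetical, chosen only so that each column sums to 1.
bases <- c("A", "C", "G", "T")
consensus <- strsplit("AATAAA", "")[[1]]
ppm <- matrix(0.05, nrow = 4, ncol = 6, dimnames = list(bases, NULL))
ppm[cbind(match(consensus, bases), seq_along(consensus))] <- 0.85

# Log-odds against a uniform background (0.25 per base)
log_odds <- log2(ppm / 0.25)

# Score one hexamer by summing the log-odds of its bases, position by position
score_hexamer <- function(hexamer) {
  chars <- strsplit(hexamer, "")[[1]]
  sum(log_odds[cbind(match(chars, bases), seq_along(chars))])
}

# Slide a 6-nt window along a toy sequence and keep the best-scoring start
utr_seq <- "GGCAATAAAGCT"
starts <- seq_len(nchar(utr_seq) - 5)
scores <- vapply(starts,
                 function(i) score_hexamer(substr(utr_seq, i, i + 5)),
                 numeric(1))
best_start <- starts[which.max(scores)]
print(best_start)
```

A candidate CP site with a high-scoring hexamer a short distance upstream is more plausible than one without, which is the intuition behind using such scores as a filter.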

## PWM for hexamer polyadenylation signal (AAUAAA and the like) from the InPAS
## package
load(system.file("extdata", "polyA.rda", package = "InPAS"))

## load the Naive Bayes classifier model for classifying CP sites from the
## cleanUpdTSeq package
data(classifier)

prepared_data <- setup_CPsSearch(sqlite_db,
                                 genome, 
                                 utr3 = utr3.mm10,
                                 background = "10K",
                                 TxDb = TxDb,
                                 removeScaffolds = TRUE,
                                 BPPARAM = NULL,
                                 hugeData = TRUE,
                                 outdir=outdir,
                                 silence = TRUE)

cpsites <-  search_CPs(seqname = "chr6",
                       sqlite_db = sqlite_db, 
                       utr3 = utr3.mm10,
                       background = prepared_data$background, 
                       z2s = prepared_data$z2s,
                       depth.weight = prepared_data$depth.weight,
                       genome = genome, 
                       MINSIZE = 10, 
                       window_size = 100,
                       search_point_START = 50,
                       search_point_END = NA,
                       cutStart = 10, 
                       cutEnd = 0,
                       adjust_distal_polyA_end = TRUE,
                       coverage_threshold = 5,
                       long_coverage_threshold = 2,
                       PolyA_PWM = pwm, 
                       classifier = classifier,
                       classifier_cutoff = 0.8,
                       shift_range = 100,
                       step = 5,
                       two_way = FALSE,
                       hugeData = TRUE,
                       outdir = outdir, 
                       silence = TRUE)
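
The idea of locating a CP site from the read coverage profile can be sketched with a simple change-point search: pick the position that best splits the coverage into a high segment followed by a low one. The base R code below is a generic illustration under that assumption, using invented coverage values and a plain sum-of-squared-errors criterion. It is not presented as the procedure that search_CPs implements.

```r
# Toy per-base coverage along a 3' UTR: high over the first six positions,
# then a drop, as expected downstream of a used proximal CP site.
utr_cov <- c(20, 21, 19, 22, 20, 21, 5, 4, 6, 5, 4, 5)
n <- length(utr_cov)

# For each candidate split k, fit one mean to positions 1..k and another to
# positions (k + 1)..n, and record the total squared error of the two fits.
candidates <- seq_len(n - 1)
sse <- vapply(candidates, function(k) {
  left <- utr_cov[seq_len(k)]
  right <- utr_cov[(k + 1):n]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}, numeric(1))

# The split with the smallest error is the putative drop in coverage
best_split <- candidates[which.min(sse)]
print(best_split)
```

Parameters such as window_size, search_point_START, and the coverage thresholds in the call above presumably constrain where and when such a search is attempted, but their exact roles should be taken from the search_CPs documentation.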

2.5 Step 5: Estimating usage of proximal and distal CP sites

Usage of proximal and distal CP sites is estimated from read coverage along the short and long 3’ UTRs, using the get_regionCov and get_UTR3eSet functions.

utr3_cds_cov <- get_regionCov(chr.utr3 = utr3.mm10[["chr6"]],
                              sqlite_db,
                              outdir,
                              BPPARAM = NULL,
                              phmm = FALSE)

eSet <- get_UTR3eSet(sqlite_db,
                     normalize = "none",
                     singleSample = FALSE)
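
Relative usage of the two sites is usually summarized as the PDUI (percentage of distal polyadenylation site usage index), commonly defined as the abundance of the long isoform divided by the total abundance of the long and short isoforms. The base R sketch below shows that arithmetic on invented coverage values. It assumes that coverage downstream of the proximal CP site comes from the long isoform only, while coverage upstream of it comes from both isoforms. The exact estimator used by get_UTR3eSet may differ.

```r
# Invented per-base coverage for one 3' UTR.
common_region_cov <- c(30, 28, 31, 29)  # upstream of proximal CP: short + long
distal_region_cov <- c(10, 9, 11, 10)   # proximal to distal CP: long only

# Long isoform abundance is read directly from the distal-only region
long_abundance <- mean(distal_region_cov)

# Short isoform abundance is what remains in the common region, floored at 0
short_abundance <- max(mean(common_region_cov) - long_abundance, 0)

# PDUI: fraction of transcripts that use the distal CP site
pdui <- long_abundance / (long_abundance + short_abundance)
print(round(pdui, 3))
```

A PDUI near 1 indicates that the distal site dominates (long 3’ UTR), and a PDUI near 0 indicates predominant use of the proximal site (short 3’ UTR).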

2.6 Step 6: Identifying differential PDUI events

InPAS provides the function test_dPDUI to identify differential usage of proximal and distal CP sites between conditions, using different statistical models depending on the experimental design. InPAS offers statistical methods for single-sample differential PDUI analysis and for single-group analysis. In addition, InPAS provides Fisher’s exact test for two-group unreplicated designs, and an empirical Bayes linear model based on the limma package for more complex designs. The test results can be further filtered using the filter_testOut function, based on the fraction of samples within each condition that have coverage data for the identified differential PDUI events, and/or on cutoffs for nominal p-values, adjusted p-values, or log2(fold change).

test_out <- test_dPDUI(eset = eSet, 
                       method = "fisher.exact",
                       normalize = "none",
                       sqlite_db = sqlite_db)

filter_out <- filter_testOut(res = test_out,
                             gp1 = "Baf3",
                             gp2 = "UM15",
                             background_coverage_threshold = 2,
                             P.Value_cutoff = 0.05,
                             adj.P.Val_cutoff = 0.05,
                             dPDUI_cutoff = 0.3,
                             PDUI_logFC_cutoff = 0.59)
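
For a two-group unreplicated design, the Fisher’s exact test can be pictured as a test on a 2 x 2 table of short and long isoform abundances in the two conditions. The base R sketch below uses fisher.test from the stats package on invented counts for a single gene, then applies the same p-value and dPDUI cutoffs as in the filter_testOut call above. The counts, the direction of the difference (Baf3 minus UM15), and the way the table is assembled are assumptions for illustration, not a description of InPAS internals.

```r
# Invented abundances for one gene (rows = conditions, columns = isoforms).
counts <- matrix(
  c(40, 10,
    15, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(condition = c("Baf3", "UM15"),
                  isoform = c("short", "long"))
)

# Test whether isoform usage is independent of condition
ft <- fisher.test(counts)

# PDUI per condition and their difference (direction chosen for illustration)
pdui <- counts[, "long"] / rowSums(counts)
dPDUI <- unname(pdui["Baf3"] - pdui["UM15"])

# Apply the cutoffs used above: p-value < 0.05 and |dPDUI| >= 0.3
is_hit <- ft$p.value < 0.05 && abs(dPDUI) >= 0.3
print(c(p.value = ft$p.value, dPDUI = dPDUI))
```

When many genes are tested, the nominal p-values would also need multiple-testing correction, for example with p.adjust(p, method = "BH"), before applying an adjusted p-value cutoff.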

2.7 Step 7: Visualizing dPDUI events and preparing files for GSEA

The InPAS package also provides the functions get_usage4plot, plot_utr3Usage, and setup_GSEA to visualize differential usage of proximal and distal CP sites for genes of interest, and to prepare the files needed for gene set enrichment analysis (GSEA), which can reveal biological insights from genes with alternative CP sites.

## Visualize dPDUI events                       
gr <- GRanges("chr6", IRanges(128846245, 128850081), strand = "-")
names(gr) <- "128846245-128850081"
data4plot <- get_usage4plot(gr, 
                            proximalSites = 128849193, 
                            sqlite_db,
                            hugeData = TRUE) 

plot_utr3Usage(usage_data = data4plot, 
               vline_color = "purple", 
               vline_type = "dashed")

## prepare a rank file for GSEA
setup_GSEA(eset = test_out,
           groupList = list(Baf3 = "Baf3", UM15 = "UM15"),
           outdir = outdir,
           preranked = TRUE,
           rankBy = "logFC",
           rnkFilename = "InPAS.rnk")
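
A preranked GSEA input (.rnk) is a two-column, tab-delimited text file holding a gene identifier and a ranking score. The base R sketch below writes such a file from an invented table, to show the expected shape of the output. The gene names, scores, and file name are placeholders, and the exact columns written by setup_GSEA should be checked against its documentation.

```r
# Invented ranking table: one score per gene (placeholder names and values).
ranks <- data.frame(
  symbol = c("GeneB", "GeneA", "GeneC"),
  logFC = c(-0.8, 1.2, 0.3),
  stringsAsFactors = FALSE
)

# Sort from the most positive to the most negative score
ranks <- ranks[order(ranks$logFC, decreasing = TRUE), ]

# Write a headerless, tab-delimited file: identifier, then score
rnk_file <- file.path(tempdir(), "example.rnk")
write.table(ranks, file = rnk_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

rnk_lines <- readLines(rnk_file)
print(rnk_lines)
```

Reading the file back is a quick way to confirm that identifiers and scores are separated by a single tab and that no quoting was added.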

3 Session Info

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] DBI_1.1.1
[2] RSQLite_2.2.7
[3] cleanUpdTSeq_1.30.0
[4] BSgenome.Drerio.UCSC.danRer7_1.4.0
[5] EnsDb.Mmusculus.v79_2.99.0
[6] EnsDb.Hsapiens.v86_2.99.0
[7] ensembldb_2.16.0
[8] AnnotationFilter_1.16.0
[9] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
[10] GenomicFeatures_1.44.0
[11] AnnotationDbi_1.54.0
[12] BSgenome.Mmusculus.UCSC.mm10_1.4.0
[13] BSgenome_1.60.0
[14] rtracklayer_1.52.0
[15] Biostrings_2.60.0
[16] XVector_0.32.0
[17] InPAS_2.0.0
[18] GenomicRanges_1.44.0
[19] GenomeInfoDb_1.28.0
[20] IRanges_2.26.0
[21] S4Vectors_0.30.0
[22] Biobase_2.52.0
[23] BiocGenerics_0.38.0
[24] BiocStyle_2.20.0

loaded via a namespace (and not attached):
[1] colorspace_2.0-1 seqinr_4.2-5
[3] rjson_0.2.20 ellipsis_0.3.2
[5] class_7.3-19 depmixS4_1.5-0
[7] proxy_0.4-25 farver_2.1.0
[9] bit64_4.0.5 fansi_0.4.2
[11] cachem_1.0.5 knitr_1.33
[13] ade4_1.7-16 jsonlite_1.7.2
[15] Rsamtools_2.8.0 dbplyr_2.1.1
[17] png_0.1-7 BiocManager_1.30.15
[19] readr_1.4.0 compiler_4.1.0
[21] httr_1.4.2 lazyeval_0.2.2
[23] assertthat_0.2.1 Matrix_1.3-3
[25] fastmap_1.1.0 limma_3.48.0
[27] htmltools_0.5.1.1 prettyunits_1.1.1
[29] tools_4.1.0 gtable_0.3.0
[31] glue_1.4.2 GenomeInfoDbData_1.2.6
[33] reshape2_1.4.4 dplyr_1.0.6
[35] rappdirs_0.3.3 Rcpp_1.0.6
[37] jquerylib_0.1.4 vctrs_0.3.8
[39] preprocessCore_1.54.0 nlme_3.1-152
[41] xfun_0.23 stringr_1.4.0
[43] plyranges_1.12.0 lifecycle_1.0.0
[45] restfulr_0.0.13 XML_3.99-0.6
[47] zlibbioc_1.38.0 MASS_7.3-54
[49] scales_1.1.1 ProtGenerics_1.24.0
[51] hms_1.1.0 MatrixGenerics_1.4.0
[53] SummarizedExperiment_1.22.0 yaml_2.2.1
[55] curl_4.3.1 memoise_2.0.0
[57] ggplot2_3.3.3 sass_0.4.0
[59] biomaRt_2.48.0 stringi_1.6.2
[61] highr_0.9 BiocIO_1.2.0
[63] e1071_1.7-6 filelock_1.0.2
[65] BiocParallel_1.26.0 truncnorm_1.0-8
[67] rlang_0.4.11 pkgconfig_2.0.3
[69] matrixStats_0.58.0 bitops_1.0-7
[71] Rsolnp_1.16 evaluate_0.14
[73] lattice_0.20-44 purrr_0.3.4
[75] labeling_0.4.2 GenomicAlignments_1.28.0
[77] bit_4.0.4 tidyselect_1.1.1
[79] plyr_1.8.6 magrittr_2.0.1
[81] bookdown_0.22 R6_2.5.0
[83] magick_2.7.2 generics_0.1.0
[85] DelayedArray_0.18.0 pillar_1.6.1
[87] KEGGREST_1.32.0 RCurl_1.98-1.3
[89] nnet_7.3-16 tibble_3.1.2
[91] crayon_1.4.1 utf8_1.2.1
[93] BiocFileCache_2.0.0 rmarkdown_2.8
[95] progress_1.2.2 grid_4.1.0
[97] blob_1.2.1 digest_0.6.27
[99] munsell_0.5.0 bslib_0.2.5.1

1. Sheppard, S., Lawson, N. D. & Zhu, L. J. Accurate identification of polyadenylation sites from 3′ end deep sequencing using a naive bayes classifier. Bioinformatics 29, 2564–2571 (2013).

InPAS Vignette

2021-05-19
