1 Introduction

An ever-growing variety of short read RNA sequencing methods is available to study the various aspects of transcript biosynthesis, processing and degradation (Wang, Gerstein, and Snyder 2010). Some methods, such as mRNA sequencing (mRNA-seq), measure signals derived from processed mRNA and are thus ideally suitable for steady-state mature transcript level measurements and for the investigation of splice variants (Morozova, Hirst, and Marra 2009). Other techniques, such as Global Run-On sequencing (GRO-seq), nuclear RNA-seq (nucRNA-seq) and chromatin-associated RNA-seq (chrRNA-seq) provide information on primary transcription and give a more comprehensive picture of transcriptional activity, also for non-polyadenylated RNA (Mitchell et al. 2012; Werner and Ruthenburg 2015; Hah et al. 2013; Gaidatzis et al. 2015). The differences in the RNA types being sequenced have an impact on the resulting sequencing profiles. mRNA-seq data is enriched with reads derived from exons, while GRO-, nucRNA- and chrRNA-seq demonstrate a substantial broader coverage of both exonic and intronic regions (Zaghlool et al. 2013). The presence of intronic reads in GRO-seq type of data makes it possible to use it to computationally identify and quantify all de novo continuous regions of transcription distributed across the genome. This type of data, however, is more challenging to interpret and less common practice compared to mRNA-seq. One of the challenges for primary transcript detection concerns the simultaneous transcription of closely spaced genes, which needs to be properly divided into individually transcribed units. The R package transcriptR combines RNA-seq data with ChIP-seq data of histone modifications that mark active Transcription Start Sites (TSSs), such as, H3K4me3 or H3K9/14Ac to overcome this challenge. The advantage of this approach over the use of, for example, gene annotations is that this approach is data driven and therefore able to deal also with novel and case specific events. Furthermore, the integration of ChIP- and RNA-seq data allows the identification all known and novel active transcription start sites within a given sample.

transcriptR is an R package with a pipeline that is made up out of two main parts; an RNA-seq and a ChIP-seq part, of which the outputs are integrated to ultimately yield a comprehensive database of de novo identified primary transcripts and their abundance (Figure 1). In the first (RNA-seq) part, strand-specific GRO-seq, nucRNA-seq or chrRNA-seq short-reads in Binary Sequence Alignment Map (BAM) format are converted to coverage profiles. Background noise levels are estimated from regions with the low reads coverage using a Poisson-based approach. Genomic regions with read densities above background levels are considered to be expressed and small gaps in otherwise continuously transcribed regions are bridged when the gaps sizes are below a certain threshold which is extrapolated from the sequencing data and reference gene annotations. The second part, operates on ChIP-seq data and requires two input files: 1) a BAM file with the sequencing reads and 2) a peak file - output of a peak calling algorithm (for example MACS2) (Zhang et al. 2008). As a first step, a classification model, based on the logistic regression, is used to predict and discriminate gene associated peaks from background peaks, using estimated characteristics of the peaks. Next, transcription initiation within a peak region is investigated by comparing RNA-seq read densities upstream and downstream of empirically determined transcription start sites. Putative transcription of both forward and reverse genomic strands is tested and the results are stored with each ChIP-seq peak.

At the end of the pipeline, both parts are combined and, where applicable, closely spaced transcripts are divided into individually transcribed units using the detected active transcription start sites. Additionally, the read count and FPKM value is calculated for each transcript in the dataset to facilitate further quantitative analysis.

The advantage of the two-part approach presented here is that the transcript detection and quantification can still be performed even in the absence of ChIP-seq data, bearing in mind that some adjacent transcripts may be detected as one transcribed unit.