SGSeq is a software package for analyzing splice events from RNA-seq data. Input data are RNA-seq reads mapped to a reference genome in BAM format. Genes are represented as a splice graph, which can be obtained from existing annotation or predicted from the mapped sequence reads. Splice events are identified from the graph and are quantified locally using structurally compatible reads at the start or end of each splice variant. The software includes functions for splice event prediction, quantification, visualization and interpretation.
SGSeq 1.10.0
SGSeq provides functions and data structures for analyzing splice events from RNA-seq data. An overview of SGSeq classes and how they are related is shown in Figure 1 and summarized below:
If you use SGSeq, please cite:
library(SGSeq)
When starting a new project, SGSeq requires information about the samples to be analyzed. This information is obtained once initially, and can then be used for all future analyses. Sample information is provided as a data frame with the following columns:
Sample information can be stored in a data.frame or DataFrame object (if BAM files are specified as a BamFileList, it must be stored in a DataFrame). Sample information can be obtained automatically with function getBamInfo(), which takes as input a data frame with columns sample_name and file_bam and extracts all other information from the specified BAM files.
For SGSeq to work correctly it is essential that reads were mapped with a splice-aware alignment program, such as GSNAP (T. D. Wu and Nacu 2010), HISAT (Kim, Langmead, and Salzberg 2015) or STAR (Dobin et al. 2013), which generates SAM/BAM files with a custom tag ‘XS’ for spliced reads, indicating the direction of transcription. BAM files must be indexed. Note that lib_size should be the total number of aligned fragments, even if BAM files were subset to genomic regions of interest. The total number of fragments is required for accurate calculation of FPKM values (fragments per kilobase and million sequenced fragments). Here, the term ‘fragment’ denotes a sequenced cDNA fragment, which is represented by a single read in single-end data, or a pair of reads in paired-end data.
This vignette illustrates an analysis of paired-end RNA-seq data from four tumor and four normal colorectal samples, which are part of a data set published in (Seshagiri et al. 2012). RNA-seq reads were mapped to the human reference genome using GSNAP (T. D. Wu and Nacu 2010). The analysis is based on BAM files that were subset to reads mapping to a single gene of interest (FBXO31). A data.frame si with sample information was generated from the original BAM files with function getBamInfo(). Note that column lib_size reflects the total number of aligned fragments in the original BAM files.
si
## sample_name file_bam paired_end read_length frag_length lib_size
## 1 N1 N1.bam TRUE 75 293 12405197
## 2 N2 N2.bam TRUE 75 197 13090179
## 3 N3 N3.bam TRUE 75 206 14983084
## 4 N4 N4.bam TRUE 75 207 15794088
## 5 T1 T1.bam TRUE 75 284 14345976
## 6 T2 T2.bam TRUE 75 235 15464168
## 7 T3 T3.bam TRUE 75 259 15485954
## 8 T4 T4.bam TRUE 75 247 15808356
The following code block sets the correct BAM file paths for the current SGSeq installation.
path <- system.file("extdata", package = "SGSeq")
si$file_bam <- file.path(path, "bams", si$file_bam)
Transcript annotation can be obtained from a TxDb object or imported from GFF format using function importTranscripts(). Alternatively, transcripts can be specified as a GRangesList of exons grouped by transcripts. In the following code block, the UCSC knownGene table is loaded as a TxDb object. Transcripts on chromosome 16 (where the FBXO31 gene is located) are retained, and chromosome names are changed to match the naming convention used in the BAM files.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb <- keepSeqlevels(txdb, "chr16")
seqlevelsStyle(txdb) <- "NCBI"
To work with the annotation in the SGSeq framework, transcript features are extracted from the TxDb object using function convertToTxFeatures(). Only features overlapping the genomic locus of the FBXO31 gene are retained. The genomic coordinates of FBXO31 are stored in a GRanges object gr.
txf_ucsc <- convertToTxFeatures(txdb)
txf_ucsc <- txf_ucsc[txf_ucsc %over% gr]
head(txf_ucsc)
## TxFeatures object with 6 ranges and 0 metadata columns:
## seqnames ranges strand type
## <Rle> <IRanges> <Rle> <factor>
## [1] 16 [87362942, 87365116] - L
## [2] 16 [87365116, 87367492] - J
## [3] 16 [87367492, 87367892] - I
## [4] 16 [87367892, 87368910] - J
## [5] 16 [87368910, 87369063] - I
## [6] 16 [87369063, 87369761] - J
## txName geneName
## <CharacterList> <CharacterList>
## [1] uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [2] uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [3] uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [4] uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [5] uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [6] uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## -------
## seqinfo: 1 sequence from hg19 genome
convertToTxFeatures() returns a TxFeatures object, which is a GRanges-like object with additional columns. Column type indicates the feature type and can take values
Columns txName and geneName indicate the transcript and gene each feature belongs to. Note that a feature can belong to more than one transcript, and accordingly these columns can store multiple values for each feature. For TxFeatures and other data structures in SGSeq, columns can be accessed with accessor functions as shown in the following code block.
type(txf_ucsc)
## [1] L J I J I J I J I J I J I J I F J J F U I J F
## Levels: J I F L U
head(txName(txf_ucsc))
## CharacterList of length 6
## [[1]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[2]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[3]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[4]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[5]] uc002fjv.3 uc002fjw.3 uc010vot.2
## [[6]] uc002fjv.3 uc002fjw.3 uc010vot.2
head(geneName(txf_ucsc))
## CharacterList of length 6
## [[1]] 79791
## [[2]] 79791
## [[3]] 79791
## [[4]] 79791
## [[5]] 79791
## [[6]] 79791
Exons stored in a TxFeatures object represent the exons spliced together in an RNA molecule. In the context of the splice graph, exons are represented by unique non-overlapping exonic regions. Function convertToSGFeatures() converts TxFeatures to SGFeatures. In the process, overlapping exons are disjoined into non-overlapping exon bins.
sgf_ucsc <- convertToSGFeatures(txf_ucsc)
head(sgf_ucsc)
## SGFeatures object with 6 ranges and 0 metadata columns:
## seqnames ranges strand type splice5p splice3p
## <Rle> <IRanges> <Rle> <factor> <logical> <logical>
## [1] 16 [87362942, 87365116] - E TRUE FALSE
## [2] 16 [87365116, 87365116] - A <NA> <NA>
## [3] 16 [87365116, 87367492] - J <NA> <NA>
## [4] 16 [87367492, 87367492] - D <NA> <NA>
## [5] 16 [87367492, 87367892] - E TRUE TRUE
## [6] 16 [87367892, 87367892] - A <NA> <NA>
## featureID geneID txName geneName
## <integer> <integer> <CharacterList> <CharacterList>
## [1] 1 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [2] 2 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [3] 3 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [4] 4 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [5] 5 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [6] 6 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## -------
## seqinfo: 1 sequence from hg19 genome
Similar to TxFeatures, SGFeatures are GRanges-like objects with additional columns. Column type for an SGFeatures object takes values
By convention, splice donor and acceptor sites correspond to the exonic positions immediately flanking the intron. SGFeatures has additional columns not included in TxFeatures: spliced5p and spliced3p indicate whether exon bins have a mandatory splice at the 5\(^\prime\) and 3\(^\prime\) end, respectively. This information is used to determine whether a read is structurally compatible with an exon bin, and whether an exon bin is consistent with an annotated transcript. featureID provides a unique identifier for each feature, geneID indicates the unique component of the splice graph a feature belongs to.
This section illustrates an analysis based on annotated transcripts. Function analyzeFeatures() converts transcript features to splice graph features and obtains compatible fragment counts for each feature and each sample.
sgfc_ucsc <- analyzeFeatures(si, features = txf_ucsc)
sgfc_ucsc
## class: SGFeatureCounts
## dim: 42 8
## metadata(0):
## assays(2): counts FPKM
## rownames: NULL
## rowData names(0):
## colnames(8): N1 N2 ... T3 T4
## colData names(6): sample_name file_bam ... frag_length lib_size
analyzeFeatures() returns an SGFeatureCounts object. SGFeatureCounts contains the sample information as colData, splice graph features as rowRanges and assays counts and FPKM, which store compatible fragment counts and FPKMs, respectively. The different data types can be accessed using accessor functions with the same name.
colData(sgfc_ucsc)
rowRanges(sgfc_ucsc)
head(counts(sgfc_ucsc))
head(FPKM(sgfc_ucsc))
Counts for exons and splice junctions are based on structurally compatible fragments. In the case of splice donors and acceptors, counts indicate the number of fragments with reads spanning the spliced boundary (overlapping the splice site and the flanking intronic position).
FPKM values are calculated as \(\frac{x}{NL}10^6\), where \(x\) is the number of compatible fragments, \(N\) is the library size (stored in lib_size) and L is the effective feature length (the number of possible positions for a compatible fragment). For paired-end data it is assumed that fragment length is equal to frag_length.
FPKMs for splice graph features can be visualized with function plotFeatures. plotFeatures generates a two-panel figure with a splice graph shown in the top panel and a heatmap of expression levels for individual features in the bottom panel. For customization of plotFeatures output, see section Visualization. The plotting function invisibly returns a data.frame with details about the splice graph shown in the plot.
df <- plotFeatures(sgfc_ucsc, geneID = 1)
Note that the splice graph includes three alternative transcript start sites (TSSs). However, the heatmap indicates that the first TSS is not used in this data set.
Instead of relying on existing annotation, annotation can be augmented with predictions from RNA-seq data, or the splice graph can be constructed from RNA-seq data without the use of annotation. The following code block predicts transcript features supported by RNA-seq reads, converts them into splice graph features, and then obtains compatible fragment counts. For details on how predictions are obtained, please see (Goldstein et al. 2016).
sgfc_pred <- analyzeFeatures(si, which = gr)
head(rowRanges(sgfc_pred))
## SGFeatures object with 6 ranges and 0 metadata columns:
## seqnames ranges strand type splice5p splice3p
## <Rle> <IRanges> <Rle> <factor> <logical> <logical>
## [1] 16 [87362930, 87365116] - E TRUE FALSE
## [2] 16 [87365116, 87365116] - A <NA> <NA>
## [3] 16 [87365116, 87367492] - J <NA> <NA>
## [4] 16 [87367492, 87367492] - D <NA> <NA>
## [5] 16 [87367492, 87367892] - E TRUE TRUE
## [6] 16 [87367892, 87367892] - A <NA> <NA>
## featureID geneID txName geneName
## <integer> <integer> <CharacterList> <CharacterList>
## [1] 1 1
## [2] 2 1
## [3] 3 1
## [4] 4 1
## [5] 5 1
## [6] 6 1
## -------
## seqinfo: 84 sequences from an unspecified genome
For interpretation, predicted features can be annotated with respect to known transcripts. The annotate() function assigns compatible transcripts to each feature and stores the corresponding transcript and gene name in columns txName and geneName, respectively.
sgfc_pred <- annotate(sgfc_pred, txf_ucsc)
head(rowRanges(sgfc_pred))
## SGFeatures object with 6 ranges and 0 metadata columns:
## seqnames ranges strand type splice5p splice3p
## <Rle> <IRanges> <Rle> <factor> <logical> <logical>
## [1] 16 [87362930, 87365116] - E TRUE FALSE
## [2] 16 [87365116, 87365116] - A <NA> <NA>
## [3] 16 [87365116, 87367492] - J <NA> <NA>
## [4] 16 [87367492, 87367492] - D <NA> <NA>
## [5] 16 [87367492, 87367892] - E TRUE TRUE
## [6] 16 [87367892, 87367892] - A <NA> <NA>
## featureID geneID txName geneName
## <integer> <integer> <CharacterList> <CharacterList>
## [1] 1 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [2] 2 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [3] 3 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [4] 4 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [5] 5 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## [6] 6 1 uc002fjv.3,uc002fjw.3,uc010vot.2 79791
## -------
## seqinfo: 84 sequences from an unspecified genome
The predicted splice graph and FPKMs can be visualized as previously. Splice graph features with missing annotation are highlighted using argument color_novel.
df <- plotFeatures(sgfc_pred, geneID = 1, color_novel = "red")