ORFik 1.2.1
Welcome to the ORFik
package. This vignette will walk you through our main
package usage with examples. ORFik
is an R package containing various
functions for analysis of RiboSeq, RNASeq and CageSeq data.
ORFik
currently supports:
In molecular genetics, an Open Reading Frame (ORF) is the part of a reading frame that has the ability to be translated. It does not mean that every ORF is being translated or is functional, but to be able to find novel genes we must be able to firstly identify potential ORFs.
To find all Open Reading Frames (ORFs) and possibly map them to genomic
coordinates ORFik
gives you three main functions:
findORFs
- find ORFs in sequences of interest,findMapORFs
- find ORFs and map them to their respective genomic coordinatesfindORFsFasta
- find ORFs in Fasta file or BSGenome
(supports circular genomes!)library(ORFik)
library(GenomicFeatures)
After loading libraries, load example data from GenomicFeatures
. We load gtf
file as txdb. We will extract the 5’ leaders to find all upstream open reading
frames.
txdbFile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package = "GenomicFeatures")
txdb <- loadDb(txdbFile)
fiveUTRs <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fiveUTRs
## GRangesList object of length 150:
## $uc001bum.2
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr1 32671236-32671282 + | 1 <NA> 1
##
## $uc009vua.2
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## [1] chr1 32671236-32671282 + | 2 <NA> 1
##
## $uc010ogz.1
## GRanges object with 2 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## [1] chr1 32671236-32671324 + | 1 <NA> 1
## [2] chr1 32671755-32672223 + | 4 <NA> 2
##
## ...
## <147 more elements>
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
As we can see we have extracted 5’ UTRs for hg19 annotations. Now we can load
BSgenome
version of human genome (hg19). If you don’t have this package
installed you will not see the result from the code below. You might have to
install BSgenome.Hsapiens.UCSC.hg19
and run the code for yourself as we don’t
install this package together with ORFik
.
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
# Extract sequences of fiveUTRs.
tx_seqs <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19::Hsapiens, fiveUTRs)
# Find all ORFs on those transcripts and get their genomic coordinates
fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs)
fiveUTR_ORFs
}
## GRangesList object of length 127:
## $uc010ogz.1
## GRanges object with 10 ranges and 1 metadata column:
## seqnames ranges strand | names
## <Rle> <IRanges> <Rle> | <character>
## uc010ogz.1 chr1 32671314-32671324 + | uc010ogz.1_1
## uc010ogz.1 chr1 32671755-32671902 + | uc010ogz.1_1
## uc010ogz.1 chr1 32672038-32672076 + | uc010ogz.1_2
## uc010ogz.1 chr1 32671237-32671324 + | uc010ogz.1_3
## uc010ogz.1 chr1 32671755-32671807 + | uc010ogz.1_3
## uc010ogz.1 chr1 32671934-32672044 + | uc010ogz.1_4
## uc010ogz.1 chr1 32672048-32672152 + | uc010ogz.1_5
## uc010ogz.1 chr1 32671274-32671324 + | uc010ogz.1_6
## uc010ogz.1 chr1 32671755-32671913 + | uc010ogz.1_6
## uc010ogz.1 chr1 32672034-32672192 + | uc010ogz.1_7
##
## ...
## <126 more elements>
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
In the example above you can see that fiveUTR_ORFs are grouped by transcript, the first group is from transcript “uc010ogz.1”. Meta-column names contains name of the transcript and identifier of the ORF separated by “_“. When ORF is separated into two exons you can see it twice, like the first ORF with name”uc010ogz.1_1“. The first ORF will always be the one most upstream for”+“” strand, and least upstream for “-”" strand.
In the prerevious example we used the refence annotation of the 5’ UTRs from the package GenomicFeatures. Here we will use advantage of CageSeq data to set new Transcription Start Sites (TSS) and re-annotate 5’ UTRs.
# path to example CageSeq data from hg19 heart sample
cageData <- system.file("extdata", "cage-seq-heart.bed.bgz",
package = "ORFik")
# get new Transcription Start Sites using CageSeq dataset
newFiveUTRs <- reassignTSSbyCage(fiveUTRs, cageData)
## Warning in .Seqinfo.mergexy(x, y): Each of the 2 combined objects has sequence levels not in the other:
## - in 'x': chr17, chr21, chrM
## - in 'y': chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7
## Make sure to always combine/compare objects based on the same reference
## genome (use suppressWarnings() to suppress this warning).
## Warning in .Seqinfo.mergexy(x, y): Each of the 2 combined objects has sequence levels not in the other:
## - in 'x': chr17, chr21, chrM
## - in 'y': chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7
## Make sure to always combine/compare objects based on the same reference
## genome (use suppressWarnings() to suppress this warning).
newFiveUTRs
## GRangesList object of length 150:
## $uc001bum.2
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## uc001bum.2 chr1 32671236-32671282 + | 1 <NA>
## exon_rank
## <integer>
## uc001bum.2 1
##
## $uc009vua.2
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## uc009vua.2 chr1 32671236-32671282 + | 2 <NA> 1
##
## $uc010ogz.1
## GRanges object with 2 ranges and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## uc010ogz.1 chr1 32671236-32671324 + | 1 <NA> 1
## uc010ogz.1 chr1 32671755-32672223 + | 4 <NA> 2
##
## ...
## <147 more elements>
## -------
## seqinfo: 93 sequences from an unspecified genome; no seqlengths
You will now see that most of the transcription start sites have changed. Depending on the species, regular annotations might be incomplete or not specific enough for your purposes.
In RiboSeq data ribosomal footprints are restricted to their p-site positions
and shifted with respect to the shifts visible over the start and stop
codons. ORFik
has multiple functions for processing of RiboSeq data. We will
go through an example processing of RiboSeq data below.
Load example raw RiboSeq footprints (unshifted).
bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik")
footprints <- GenomicAlignments::readGAlignments(bam_file)
Investigate what footprint lengths are present in our data.
table(qwidth(footprints))
##
## 28 29 30
## 5547 5576 5526
For the sake of this example we will focus only on most abundant length of 29.
footprints <- footprints[qwidth(footprints) == 29]
footprintsGR <- granges(footprints, use.mcols = TRUE)
footprintsGR
## GRanges object with 5576 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr23 17599129-17599157 -
## [2] chr23 17599129-17599157 -
## [3] chr23 17599129-17599157 -
## [4] chr23 17599129-17599157 -
## [5] chr23 17599129-17599157 -
## ... ... ... ...
## [5572] chr8 24068755-24068783 +
## [5573] chr8 24068755-24068783 +
## [5574] chr8 24068769-24068797 +
## [5575] chr8 24068802-24068830 +
## [5576] chr8 24068894-24068922 +
## -------
## seqinfo: 1133 sequences from an unspecified genome
Restrict footprints to their 5’ starts (after shifting it will be a p-site).
footprintsGR <- resize(footprintsGR, 1)
footprintsGR
## GRanges object with 5576 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr23 17599157 -
## [2] chr23 17599157 -
## [3] chr23 17599157 -
## [4] chr23 17599157 -
## [5] chr23 17599157 -
## ... ... ... ...
## [5572] chr8 24068755 +
## [5573] chr8 24068755 +
## [5574] chr8 24068769 +
## [5575] chr8 24068802 +
## [5576] chr8 24068894 +
## -------
## seqinfo: 1133 sequences from an unspecified genome
Now, lets prepare annotations and focus on START and STOP codons.
gtf_file <- system.file("extdata", "annotations.gtf", package = "ORFik")
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_file, format = "gtf")
cds <- GenomicFeatures::cdsBy(txdb, by = "tx", use.names = TRUE)
cds[1]
## GRangesList object of length 1:
## $ENSDART00000032996
## GRanges object with 4 ranges and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] chr8 24067789-24067843 + | 1 <NA> 2
## [2] chr8 24068170-24068247 + | 2 <NA> 3
## [3] chr8 24068353-24068449 + | 4 <NA> 4
## [4] chr8 24068538-24068661 + | 6 <NA> 5
##
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
Filter cds to only those who have some minimum trailer and leader lengths, as well as cds. And get start and stop codons with extra window of 30bp around them.
txNames <- txNamesWithLeaders(txdb)
windows <- getStartStopWindows(txdb, txNames)
windows
## $starts
## GRangesList object of length 2:
## $ENSDART00000000070
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr24 22711351-22711410 -
##
## $ENSDART00000032996
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## [1] chr8 24067759-24067818 +
##
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
##
## $stops
## GRangesList object of length 2:
## $ENSDART00000000070
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr24 22679596-22679655 -
##
## $ENSDART00000032996
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## [1] chr8 24068632-24068691 +
##
## -------
## seqinfo: 2 sequences from an unspecified genome; no seqlengths
Calculate meta-coverage over start and stop windowed regions.
hitMapStart <- metaWindow(footprintsGR, windows$start)
hitMapStop <- metaWindow(footprintsGR, windows$stop)
Plot start/stop windows for length 29.
if (requireNamespace("ggplot2")) {
library(ggplot2)
ggplot(hitMapStart, aes(x = factor(position), y = avg_counts, fill = factor(frame))) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(title = paste0("Length 29 over START of canonical CDS")) +
xlab("\nshift from first START nucleotide [bp]") +
ylab("Averaged counts") +
guides(fill = FALSE)
}
## Loading required namespace: ggplot2
if (requireNamespace("ggplot2")) {
ggplot(hitMapStop, aes(x = factor(position), y = avg_counts, fill = factor(frame))) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(title = paste0("Length 29 over STOP of canonical CDS")) +
xlab("\nshift from last STOP nucleotide [bp]") +
ylab("Averaged counts") +
guides(fill = FALSE)
}
We can also use automatic detection of RiboSeq shifts using the code below. As we can see reasonable conclusion from the plots would be to shift length 29 by 12, it is in agreement with the automatic detection of the offsets.
shifts <- detectRibosomeShifts(footprints, txdb, stop = TRUE)
shifts
## fragment_length offsets_start offsets_stop
## 1 29 -12 -12
Fortunately ORFik
has function that can be used to shift footprints using
desired shifts. Check documentation for more details.
shiftedFootprints <- shiftFootprints(
footprints, shifts$fragment_length, shifts$offsets_start)
## Shifting footprints of length 29
## Sorting shifted footprints...
shiftedFootprints
## GRanges object with 5576 ranges and 1 metadata column:
## seqnames ranges strand | size
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr8 24066297 + | 29
## [2] chr8 24066297 + | 29
## [3] chr8 24066297 + | 29
## [4] chr8 24066297 + | 29
## [5] chr8 24066330 + | 29
## ... ... ... ... . ...
## [5572] chr24 22711491 - | 29
## [5573] chr24 22711503 - | 29
## [5574] chr24 22711503 - | 29
## [5575] chr24 22711503 - | 29
## [5576] chr24 22711503 - | 29
## -------
## seqinfo: 1133 sequences from an unspecified genome
ORFik
contains functions of gene identity that can be used to predict
which ORFs are potentially coding and functional.
floss
,coverage
,orfScore
,entropy
,translationalEff
,insideOutsideScore
,distToCds
,All of the features are implemented based on scientific article published in
peer reviewed journal. ORFik
supports seemingles calculation of all available
features. See example below.
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
library(GenomicFeatures)
# Extract sequences of fiveUTRs.
fiveUTRs <- fiveUTRs[1:10]
faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens
tx_seqs <- extractTranscriptSeqs(faFile, fiveUTRs)
# Find all ORFs on those transcripts and get their genomic coordinates
fiveUTR_ORFs <- findMapORFs(fiveUTRs, tx_seqs)
unlistedORFs <- unlistGrl(fiveUTR_ORFs)
# group GRanges by ORFs instead of Transcripts, use 4 first ORFs
fiveUTR_ORFs <- groupGRangesBy(unlistedORFs, unlistedORFs$names)[1:4]
# make some toy ribo seq and rna seq data
starts <- unlist(ORFik:::firstExonPerGroup(fiveUTR_ORFs), use.names = FALSE)
RFP <- promoters(starts, upstream = 0, downstream = 1)
score(RFP) <- rep(29, length(RFP)) # the original read widths
# set RNA seq to duplicate transcripts
RNA <- unlist(exonsBy(txdb, by = "tx", use.names = TRUE), use.names = TRUE)
# transcript database
txdb <- loadDb(txdbFile)
dt <- computeFeatures(fiveUTR_ORFs, RFP, RNA, txdb, faFile,
orfFeatures = TRUE)
print(dt)
}
## floss entropyRFP disengagementScores RRS RSS fractionLengths
## 1: 0 0 0.6666667 7.610063 26.50000 0.07022968
## 2: 0 0 2.0000000 31.025641 6.50000 0.01722615
## 3: 0 1 1.0000000 12.872340 15.66667 0.06227915
## 4: 0 1 3.0000000 16.351351 12.33333 0.04902827
## te fpkmRFP fpkmRNA ORFScores ioScore kozak distORFCDS inFrameCDS
## 1: Inf 1572327 0 1.584963 0.40 0.3390461 322 0
## 2: Inf 6410256 0 1.584963 0.40 0.1949422 148 0
## 3: Inf 3546099 0 -1.000000 0.75 0.6140954 417 2
## 4: Inf 4504505 0 -1.000000 0.75 0.7079892 180 2
## isOverlappingCds rankInTx
## 1: FALSE 1
## 2: FALSE 2
## 3: FALSE 3
## 4: FALSE 4
You will now get a data.table with one column per score, the columns are named after the different scores, you can now go further with prediction, or making plots.
Instead of getting all features, we can also extract single features.
To understand how strong the binding affinitity of an ORF promoter region might be, we can use kozak sequence score. The kozak functions supports several species. In the first example we use human kozak sequence, then we make a self defined kozak sequence.
# In this example we will find kozak score of cds'
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
cds <- cdsBy(txdb, by = "tx", use.names = TRUE)[1:10]
faFile <- BSgenome.Hsapiens.UCSC.hg19::Hsapiens
kozakSequenceScore(cds, faFile, species = "human")
# A few species are pre supported, if not, make your own input pfm.
# here is an example where the human pfm is sent in again, even though
# it is already supported.
pfm <- t(matrix(as.integer(c(29,26,28,26,22,35,62,39,28,24,27,17,
21,26,24,16,28,32,5,23,35,12,42,21,
25,24,22,33,22,19,28,17,27,47,16,34,
25,24,26,25,28,14,5,21,10,17,15,28)),
ncol = 4))
kozakSequenceScore(cds, faFile, species = pfm)
}
## [1] 0.5531961 0.5531961 0.6652532 0.6925763 0.6370870 0.6370870 0.4854677
## [8] 0.4706279 0.6381237 0.6529909
ORFik
contains couple functions that can be utilized to speed up your coding.
Check documentations for these functions: unlistGrl
, sortPerGroup
,
strandBool
, tile1
.
Sometimes you want a GRangesList of ORFs grouped by transcript, or you might
want each ORF as groups in the GRangesList. To do this more easily you can use
the function groupGRangesBy
.
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
# the orfs are now grouped by orfs. If we want to go back to transcripts we do:
unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
unlisted_ranges
test_ranges <- groupGRangesBy(unlisted_ranges, names(unlisted_ranges))
# test_ranges is now grouped by transcript, but we want them grouped by ORFs:
# we use the orfs exon column called ($names) to group, it is made by ORFik.
unlisted_ranges <- unlistGrl(test_ranges)
test_ranges <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
}
Lets say you found some ORFs, and you want to filter out some of them. ORFik provides several functions for filtering. A problem with the original GenomicRanges container, is that filtering on GRanges objects are much easier than on GRangesList objects, ORFik tries to fix this.
In this example we will filter out all orfs as following: 1. First group GRangesList by ORFs 2. width < 60 3. number of exons < 2 4. strand is negative
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
# lets use the fiveUTR_ORFs
#1. Group by ORFs
unlisted_ranges <- unlistGrl(fiveUTR_ORFs)
ORFs <- groupGRangesBy(unlisted_ranges, unlisted_ranges$names)
length(ORFs)
#2. Remove widths < 60
ORFs <- ORFs[widthPerGroup(ORFs) >= 60]
length(ORFs)
#3. Keep only ORFs with at least 2 exons
ORFs <- ORFs[numExonsPerGroup(ORFs) > 1]
length(ORFs)
#4. Keep only positive ORFs
ORFs <- ORFs[strandPerGroup(ORFs) == "+"]
# all remaining ORFs where on positive strand, so no change
length(ORFs)
}
## [1] 2
Specific part of the ORF are usually of interest, like start and stop codons. Here we run an example to show what ORFik can do for you.
if (requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
# let's use the ORFs from the previous examples
#1. Find the start and stop sites
startSites(fiveUTR_ORFs, asGR = TRUE, keep.names = TRUE, is.sorted = TRUE)
stopSites(fiveUTR_ORFs, asGR = TRUE, keep.names = TRUE, is.sorted = TRUE)
#2. Lets find the start and stop codons,
# this takes care of potential 1 base exons etc.
starts <- startCodons(fiveUTR_ORFs, is.sorted = TRUE)
starts
stops <- stopCodons(fiveUTR_ORFs, is.sorted = TRUE)
stops
#3. Lets get the bases of the start and stop codons from the fasta file
# It's very important to check that ORFs are sorted here, else you could get
# the end of the ORF instead of the beginning etc.
txSeqsFromFa(starts, faFile, is.sorted = TRUE)
txSeqsFromFa(stops, faFile, is.sorted = TRUE)
}
## A DNAStringSet instance of length 4
## width seq names
## [1] 3 TAG uc010ogz.1_1
## [2] 3 TAG uc010ogz.1_2
## [3] 3 TGA uc010ogz.1_3
## [4] 3 TAG uc010ogz.1_4
Many more operations are also supported for manipulation
ORFik supports multiple ORF finding functions,here we describe their specific use.
If you have a DNAStringSet or a character vector use findORFs. DNAStringSet is safer since all characters are forced to uppercase.
findORFs will give you only 5’ to 3’ direction, so if you want both directions, you can do (for double stranded):
library(Biostrings)
library(S4Vectors)
seqs <- "ATGAAATGAAGTAAATCAAAACAT" # strand with ORFs in both directions
# positive strands
pos <- findORFs(seqs, startCodon = "ATG", minimumLength = 0)
# negative strands
neg <- findORFs(reverseComplement(DNAStringSet(seqs)),
startCodon = "ATG", minimumLength = 0)
# make GRanges since we want strand information
pos <- GRanges(pos, strand = "+")
neg <- GRanges(neg, strand = "-")
# as GRanges
res <- c(pos, neg)
# or merge together and make GRangesList
res <- split(res, seq.int(1, length(pos) + length(neg)))
res
## GRangesList object of length 3:
## $1
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 1-9 +
##
## $2
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## [1] 1 6-14 +
##
## $3
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## [1] 1 1-9 -
##
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note that findORFsFasta automaticly finds (-) strand ORFs. Since that is normally used for genomes.
If you have transcriptomes, you dont want the (-) strand. If you get both (+/-) strand and only want (+) ORFs, do:
res[strandBool(res)]
## GRangesList object of length 2:
## $1
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 1-9 +
##
## $2
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## [1] 1 6-14 +
##
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
If you want to find ORFs in spliced transcripts, use findMapORFs. It supports automatic exon splitting, see above for example.
If you want to find ORFs on circular genomes, use findORFsFasta.
Eucaryote splicing: findMapORFs Procaryote/circular: findORFsFasta Direct ORFs from character vector: findORFs