ORFik 1.10.13
Welcome to the ORFik
package.
ORFik
is an R package for analysis of transcript and translation features through manipulation of sequence data and NGS data.
This vignette will walk you through how to how to download annotation and align data with ORFik.
Here we will show a full example of aligning RNA-seq from yeast using the SacCer3 genome.
First specify where you want to save the different data types: 1. fastq files (raw_data) 2. bam files (processed_data) 3. annotation reference files (references)
library(ORFik) # This package
conf <- config.exper(experiment = "CHALMERS_Yeast", # Name
assembly = "Yeast_SacCer3", # Reference folder
type = c("RNA-seq")) # fastq and bam type
We need some data to align, if you have in-lab data, you don’t need this step, since you already have access to the fastq files.
On the other hand, if you want to use published data, you need to download it. I here show what would work for the paired end RNA-seq experiment SRP012047.
ORFik comes with a SRA run downloader, just specificy the SRR numbers, or a SRA experiment information csv file containing a column called ‘Run’. We will now show how to get data from SRA. You can also get data from ERA or DRA.
info <- download.SRA.metadata("SRP012047", outdir = conf["fastq RNA-seq"])
# Let's take 2 first runs in this experiment:
info <- info[1:2,]
# 18 MB, ~ 40 sec download time ->
download.SRA(info, conf["fastq RNA-seq"], subset = 50000)
# 1.6 GB, ~ 100 sec download time (faster download) ->
# download.SRA(info, conf["fastq RNA-seq"])
We now have the RNA-seq run, separated into 2 files, since this is paired end data. We could for ease also just have specified the SRR number in download.SRA, but then we get no meta-data csv file, which is handy for auto-detection of paired end data, the organism name etc. This is shown below:
organism <- info$ScientificName[1]
is_paired_end <- all(info$LibraryLayout == "PAIRED")
To download annotation we use the getGenomeAndAnnotation function. We need to decide 3 things:
annotation <- getGenomeAndAnnotation(
organism = organism,
output.dir = conf["ref"],
assembly_type = "toplevel"
)
The function will also create a txdb object to speed up loading of gtf annotation, and index your genome to a .fai file.
If you run this function again after you have run this function and downloaded the data once, it will not re-download, but just output the correct object paths, this makes it easy to rerun the script, when you have some steps already finished.
If you you want to remove contaminants: phix, non coding RNAs, ribosomal RNAs, or tRNAs, also specify these in the function. By default it will download phix from refseq and the other contaminants are within the genome of the species, so they are extracted from the .gtf file. Note that some species does not have well annotated rRNAs, tRNAs etc, so you can either set rRNA = “silva” to download the Silva database (~ 2GB file) or manually download and add the sequences from tRNAs from tRNA scan or similar databases. If the gtf does not have Non coding RNAs, they can be extracted by setting ncRNA = “auto”, it will then check if the species exists in the NONCODE database and automatically download them for you if they exists.
ORFik uses the STAR aligner, which is splice aware and fast. This will only work on unix systems (Linux or Mac) for now. To align the data we need two steps, the indexing of the genome step and the alignment to the genome step.
To index the genome just give it the annotation output from previous step. This will also make an index for each of the depletion steps like phix, if you specified them in the earlier step.
index <- STAR.index(annotation)
If you run this function again after index exists in current file location, it will not re-index, but just output the correct object paths. Do remake = TRUE if you want to re-index.
ORFik uses the fastp for trimming reads, this also only works on unix (Linux or Mac OS). If you are on windows, or you want to trim the reads yourself, just run the trimming and give the folder with the trimmed reads as input in next step. Also if you are unsure of what the 3’ adapter was, run first FASTQC to which adapters are detected. The great thing with fastp is that it has auto detection and removal of adapters, if you check out the resulting files you will see fastp has auto removed the Illumina adapters.
Now let’s see what we need as inputs for the alignment pipeline: We need usually 9 arguments (more are possible if you need them):
alignment <-
STAR.align.folder(conf["fastq RNA-seq"], conf["bam RNA-seq"], index,
paired.end = is_paired_end,
steps = "tr-ge", # (trim needed: adapters found, then genome)
adapter.sequence = "auto",
max.cpus = 30, trim.front = 3, min.length = 20)
If you used the fastp (tr step), you will get a pre-alignment QC report. Just like FastQC in html format. You will also get a MultiQC report from STAR runs made by ORFik for you.
STAR is very memory hungry, therefor if you want to index and align large genomes like human (3 GB usually), you might need to adjust 2 parameters. If you have less than 40 GB free memory (32 GB might work), adjust this during indexing:
index <- STAR.index(annotation, max.ram = 20, SAsparse = 2)
STAR can use a lot of threads, which makes many small files open. Some systems have a restriction on how many files you can have open. This is found by doing “ulimit -Hn” in the terminal. If STAR crashes from this error, you need either increase amount of open files allowed (requires root access) or decrease the amount of cores used by STAR (does not require root access):
STAR.align.folder(conf["fastq RNA-seq"], conf["bam RNA-seq"], index,
max.cpus = 12) # Reduce cores to 12 usually works for most systems
To simplify coding and sharing of your work, you should make a ORFik experiment, check out the ORFik experiment vignette if you are unfamiliar with this class. You should first rename the bam files to more meaningful names, like RNA_WT_1 etc. Remember to keep a table of which SRA numbers correspond to which new file name. You do not need to do this, but this will make the ORFik experiment able to guess correctly what the data is. If there are replicates etc.
We can now easily make an ORFik experiment from the data we have:
txdb_file <- paste0(annotation["gtf"], ".db") # Get txdb file, not raw gtf
fa <- annotation["genome"]
create.experiment(exper = "yeast_exp_RNA",
dir = paste0(conf["bam RNA-seq"], "/aligned/"),
txdb = txdb_file, fa = fa,
organism = organism,
viewTemplate = FALSE,
pairedEndBam = is_paired_end # True/False per bam file
)
The files is now saved to default directory which is: saveDir = “~/Bio_data/ORFik_experiments/”
df <- read.experiment("yeast_exp_RNA")
If you are not happy with the libtype, stage, replicates and so on for the file, you can edit the ORFik experiment in R (recreate experiment, wanted slots) or edit in Libre Office, Excel or another spreadsheet viewer.
See ?QCreport for details of what you will get as output
QCreport(df)
Now you have an experiment, but bam files are big and slow to load. Let’s convert to some faster formats.
If you want optimzed format identical to bam file (contains cigar information), use .ofst. (Fastest, not readable in IGV) (ofst files are made when running ORFikQC)
remove.experiments(df) # Remove loaded libraries
convertLibs(df, type = "ofst")
If you want peaks only, use wig files (Fast, readable in IGV)
remove.experiments(df)
convertLibs(df, type = "wig")
As an example of how to load the data to R in the optimized format .ofst.
This will output the libraries to the environment specified, default .GlobalEnv (the default R environment). The files are named from the experiment table RNA_1_WT, RNA_1_treated etc.
remove.experiments(df)
outputLibs(df, type = "ofst")
After you have run QCreport you will have count tables of peaks over the mrna’s, 5’ UTRs, CDS and 3’ UTRs.
Let’s do an example to find the ratio between fpkm of between the CDS and mRNAs transcript regions.
mrna <- countTable(df, region = "mrna", type = "fpkm")
cds <- countTable(df, region = "cds", type = "fpkm")
ratio <- cds / mrna
We now have a ratio of fpkm values between CDS and mrna.
You can now continue to the Ribo-seq pipeline to see a more advanced example.