Contents

1 Introduction

1.1 RNA-Seq Overview

RNA-Seq is a revolutionary approach for discovering and investigating the transcriptome using next-generation sequencing technologies (Wang et al. 2009). Typically, this transcriptome analysis aims to identify genes differentially expressed across different conditions, treatments or tissues, resulting in an understanding of the important pathways that are associated with the conditions (Wang et al. 2009). Following are the overview steps of RNA-Seq technique.

  1. Set up experiment:
    • First, scientists make their hypotheses and choose their target samples for RNA extraction.
  2. RNA isolation & purify:
    • RNA is extracted and isolated from sample tissue. Then quality and total amount of RNA are checked including the amount of RNA degradation in order to assess the samples.
  3. RNA filtration:
    • Ribosomal RNA (rRNA) which accounts for around 90% of transcriptome from total RNA-Seq sample needs to be removed. The interest areas of RNA-Seq analysis are mRNAs, small RNAs as well as lincRNAs.
  4. Reverse-transcribed to cDNA:
    • cDNA is more stable than RNA, thus would be a better polymer to do PCR amplification and storage.
  5. Sequencing:
    • Samples are sequenced by next generation sequencing machine. Raw fastq reads files are generated.
  6. Reads alignment & transcript assembly:
    • Sequenced reads are algined to reference genome (traditional mapping) and then assembled into potential transcripts by assembler. A different approach is to directly compare reads with target transcripts (pseudo alignment method) to quantify estimated expression quickly.
  7. Differential expression gene analysis:
    • Statistical methods are used in this step in order to get potential differential expressed genes.
  8. Functional Analysis:
    • Differential expressed genes are selected as inputs of GO functional analysis as well as KEGG pathway analysis in order to get further biology information.

1.2 RNASeqR Overview

RNASeqR is an user-friendly R-based tool for running a six-step automation RNA-Seq analysis pipeline including quality assessment, read alignment and transcript quantification, differential expression analysis, and functional analysis. The main features of this package are an automated workflow and comprehensive reports with data visualization. In this package, the new tuxedo pipeline published in Nature Protocols in 2016 can be fully implemented in the R environment, including extra functions such as reads quality assessment and functional analysis. RNASeqR is beneficial for clinical researchers without significant computational background. There are only six lines of code that users need to execute in order to finish an end-to-end RNA-Seq analysis.

The central concept behind this package is that each step involved in RNA-Seq data analysis is a function call in R. For the subsequent parts of this documentation, inputs, outputs as well as detail implementation for these six steps will be elaborated upon in order. Following are the six steps and the each corresponding function that users need to execute.

  1. RNASeqRParam S4 Object Creation
    • Function: RNASeqRParam()
    • In the beginning, users will create an RNASeqRParam S4 object by running the RNASeqRParam() constructor function for all variables being checked. This object will be used as input for the following steps.
  2. Environment Setup
    • Function: RNASeqEnvironmenenvironmenttSet_CMD() or RNASeqEnvironmentSet()
    • In the second step, the environment is set up automatically with necessary tools installed for the subsequent steps in RNASeqR.
  3. Quality Assessment (optional)
    • Function: RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment()
    • In the third step (optional), users check the quality of raw fastq files.
  4. Reads alignment and transcripts quantification
    • Function: RNASeqReadProcess_CMD() or RNASeqReadProcess()
    • In the fourth step, raw reads are processed and all intermediate files are stored properly.
  5. Differential Expression Gene Analysis
    • Function: RNASeqDifferentialAnalysis_CMD() or RNASeqDifferentialAnalysis()
    • In the fifth step, differential analysis is done by edgeR, DESeq2 and ballgown R / Bioconductor packages.
  6. Functional Analysis
    • Function: RNASeqGoKegg_CMD() or RNASeqGoKegg()
    • In the sixth step, differential expressed genes are used as the input for the GO functional analysis and KEGG pathway analysis. Enriched pathways are displayed visually.

Functions with the CMD suffix create an R script and run nohup R CMD BATCH script.R in the background. Functions with no CMD suffix process in the R shell. After running the above functions, the whole RNA-Seq analysis is done and the files generated in each step are stored in an organized file directory. The RNASeqR package makes two-group comparative RNA-Seq analysis more efficient and easier for users.

The following are the main tools that are used in this package: ‘HISAT2’ (Kim et al. 2015) and ‘STAR’ (Dobin et al. 2013) for read alignment; ‘StringTie’ (Pertea et al. 2015) for alignment assembly and transcripts quantification; ‘Rsamtools’ (Morgan et al. 2018) for converting SAM files to BAM files; ‘Gffcompare’ for comparing merged GTF files with reference GTF files; ‘systemPipeR’ (Backman et al. 2016) for quality assessment; ‘ballgown’ (Fu et al. 2018), ‘DESeq2’ (Love et al. 2014) and ‘edgeR’ (Robinson et al. 2010;McCarthy et al. 2012) for finding potential differentially expressed genes; and ‘clusterProfiler’ (Yu et al. 2012) for Gene Ontology(GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes(KEGG) pathway analysis.

1.3 Sample Definition

Sample data used in this vignette can be downloaded from the RNASeqRData experiment package. It originated from NCBI’s Sequence Read Archive for the entries SRR3396381, SRR3396382, SRR3396384, SRR3396385, SRR3396386, and SRR3396387. These samples are from Saccharomyces cerevisiae. Suitable reference genome and gene annotation files for this species can be further downloaded from iGenomes, Ensembl, R64-1-1. To create a mini dataset for demonstration purposes, reads aligned to the region from 0 to 100000 on chromosome XV were extracted. The analysis of this mini dataset will be shown in this vignette. The experimental data package is located here.

For more real case-control data and RNA-Seq analysis results from this package, please go to this website: https://github.com/HowardChao/RNASeqR_analysis_result.

1.4 System Requirements

Necessary:

  1. R version >= 3.5.0

  2. Operating System: Linux and macOS are supported in the RNASeqR package. Windows is not supported. (because StringTie and HISAT2 are not available for Windows).

  3. Third-party software used in this package includes HISAT2, StringTie and Gffcompare. The availability of these commands will be checked by system2() through the R shell at the end of the ‘Environment Setup’ step. The environment must successfully built before running the following RNA-Seq analysis. By default, binaries will be installed based on the operating system of the workstation; therefore there is no additional compiling required. Alternatively, users can still decide to skip certain software binary installation. For more details, please refer to the ‘Environment Setup’ chapter.

Recommended:

  1. Python: Python2 or Python3.

  2. 2to3: If the Python version on the workstation is 3, this command will be used. Generally, 2to3 is available if Python3 is available.
    • Python and 2to3 are used for creating raw read counts for DESeq2 and edgeR.

    • The following are two conditions that will create a raw read count from StringTie output.
      1. Python2
      2. Python3 with 2to3 command available.
    • If one of these conditions is met, the raw read count will be created and DESeq2 and edgeR will be run automatically in the ‘Gene-level Differential Analyses’ step. If not, DESeq2 and edgeR will be skipped during ‘Gene-level Differential Analysis’ step. Checking the Python version and 2to3 command on the workstation beforehand is highly recommended but not necessary.

  3. HISAT2 indexex: Users are advised to provide an ‘indices/’ directory in ‘inputfiles/’. HISAT2 requires at least 160 GB of RAM and several hours to index the entire human genome.

2 “RNASeqRParam” S4 Object Creation

This is the first step of the RNA-Seq analysis workflow in this package. Prior to conducting RNA-Seq analysis, it is necessary to implement a constructor function, called RNASeqRParam() and to create a RNASeqRParam S4 object which stores parameters not only for pre-checking but also for utilizing as input the parameters in the subsequent steps.

2.1 RNASeqRParam Slots Explanation

There are 11 slots in RNASeqRParam:

  1. os.type : The operating system type. Value is linux or osx. This package only supports ‘Linux’ and ‘macOS’ (not ‘Windows’). If another operating system is detected, ERROR will be reported.

  2. python.variable : A Python-related variable. The value is a list of whether Python is available and the Python version (TRUE or FALSE, 2 or 3).

  3. python.2to3 : Availability of the 2to3 command. THe value is TRUE or FALSE.

  4. path.prefix : Path prefix of the ‘gene_data/’, ‘RNASeq_bin/’, ‘RNASeq_results/’, ‘Rscript/’ and ‘Rscript_out/’ directories. It is recommended that you create a new, empty directory in which all the subsequent RNA-Seq results can be saved.

  1. input.path.prefix : Path prefix of the ‘input_files/’ directory. Users should prepare an ‘input_file/’ directory with the following rules:
    • genome.name.fa: Reference genome in FASTA file formation.

    • genome.name.gtf: Gene annotation in GTF file formation.

    • raw_fastq.gz/: Directory storing FASTQ files.

      • Supports both paired-end and single-end reads files.

      • Names of FASTQ files : ’sample.pattern_1.fastq.gz’ and ’sample.pattern_2.fastq.gz’. sample.pattern must be distinct for each sample.

    • phenodata.csv: Information about the RNA-Seq experiment design.

      • First column : Distinct ids for each sample. The value of each sample of this column must match sample.pattern in FASTQ files in ‘raw_fastq.gz/’. The column name must be ids.

      • Second column : An independent variable for the RNA-Seq experiment. Values of each sample of this column can only be parameter case.group or control.group. The column name is the parameter independent.variable which can be any string defined by users.
      • Third column : The color coding for case.group and control.group. Samples in the same group must have same color coding and their values are HEX color code. The column name must be color.

    • indices/ : The directory storing HT2 index files for the HISAT2 alignment tool.

      • This directory is optional. HT2 index files corresponding to the reference genome can be installed at HISAT2 official website. Providing HT2 files can accelerate the subsequent analysis steps. It is highly advised that you install HT2 files.

      • If HT2 index files are not provided, the ‘input_files/indices/’ directory should be deleted.

An example ‘phenodata.csv’ file. File is stored in ‘CSV’ format.