FASTQ
Sequence DataRNA-Seq is a revolutionary approach for discovering and investigating the transcriptome using next-generation sequencing technologies (Wang et al. 2009). Typically, this transcriptome analysis aims to identify genes differentially expressed across different conditions, treatments or tissues, resulting in an understanding of the important pathways that are associated with the conditions (Wang et al. 2009). Following are the overview steps of RNA-Seq technique.
RNASeqR is an user-friendly R-based tool for running a six-step automation RNA-Seq analysis pipeline including quality assessment, read alignment and transcript quantification, differential expression analysis, and functional analysis. The main features of this package are an automated workflow and comprehensive reports with data visualization. In this package, the new tuxedo pipeline published in Nature Protocols in 2016 can be fully implemented in the R environment, including extra functions such as reads quality assessment and functional analysis. RNASeqR is beneficial for clinical researchers without significant computational background. There are only six lines of code that users need to execute in order to finish an end-to-end RNA-Seq analysis.
The central concept behind this package is that each step involved in RNA-Seq data analysis is a function call in R. For the subsequent parts of this documentation, inputs, outputs as well as detail implementation for these six steps will be elaborated upon in order. Following are the six steps and the each corresponding function that users need to execute.
RNASeqRParam
S4 Object Creation
RNASeqRParam()
RNASeqRParam
S4 object by running the RNASeqRParam()
constructor function for all variables being checked. This object will be used as input for the following steps.RNASeqEnvironmenenvironmenttSet_CMD()
or RNASeqEnvironmentSet()
RNASeqQualityAssessment_CMD()
or RNASeqQualityAssessment()
RNASeqReadProcess_CMD()
or RNASeqReadProcess()
RNASeqDifferentialAnalysis_CMD()
or RNASeqDifferentialAnalysis()
RNASeqGoKegg_CMD()
or RNASeqGoKegg()
Functions with the CMD
suffix create an R script and run nohup R CMD BATCH script.R
in the background. Functions with no CMD
suffix process in the R shell. After running the above functions, the whole RNA-Seq analysis is done and the files generated in each step are stored in an organized file directory. The RNASeqR
package makes two-group comparative RNA-Seq analysis more efficient and easier for users.
The following are the main tools that are used in this package: ‘HISAT2’ (Kim et al. 2015) and ‘STAR’ (Dobin et al. 2013) for read alignment; ‘StringTie’ (Pertea et al. 2015) for alignment assembly and transcripts quantification; ‘Rsamtools’ (Morgan et al. 2018) for converting SAM
files to BAM
files; ‘Gffcompare’ for comparing merged GTF
files with reference GTF
files; ‘systemPipeR’ (Backman et al. 2016) for quality assessment; ‘ballgown’ (Fu et al. 2018), ‘DESeq2’ (Love et al. 2014) and ‘edgeR’ (Robinson et al. 2010;McCarthy et al. 2012) for finding potential differentially expressed genes; and ‘clusterProfiler’ (Yu et al. 2012) for Gene Ontology(GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes(KEGG) pathway analysis.
Sample data used in this vignette can be downloaded from the RNASeqRData
experiment package. It originated from NCBI’s Sequence Read Archive for the entries SRR3396381, SRR3396382, SRR3396384, SRR3396385, SRR3396386, and SRR3396387. These samples are from Saccharomyces cerevisiae. Suitable reference genome and gene annotation files for this species can be further downloaded from iGenomes, Ensembl, R64-1-1. To create a mini dataset for demonstration purposes, reads aligned to the region from 0 to 100000 on chromosome XV were extracted. The analysis of this mini dataset will be shown in this vignette. The experimental data package is located here.
For more real case-control data and RNA-Seq analysis results from this package, please go to this website: https://github.com/HowardChao/RNASeqR_analysis_result.
Necessary:
R version >= 3.5.0
Operating System: Linux and macOS are supported in the RNASeqR
package. Windows is not supported. (because StringTie and HISAT2 are not available for Windows).
Third-party software used in this package includes HISAT2, StringTie and Gffcompare. The availability of these commands will be checked by system2()
through the R shell at the end of the ‘Environment Setup’ step. The environment must successfully built before running the following RNA-Seq analysis. By default, binaries will be installed based on the operating system of the workstation; therefore there is no additional compiling required. Alternatively, users can still decide to skip certain software binary installation. For more details, please refer to the ‘Environment Setup’ chapter.
Recommended:
Python: Python2 or Python3.
2to3
: If the Python version on the workstation is 3, this command will be used. Generally, 2to3
is available if Python3 is available.
Python and 2to3
are used for creating raw read counts for DESeq2 and edgeR.
2to3
command available.If one of these conditions is met, the raw read count will be created and DESeq2 and edgeR will be run automatically in the ‘Gene-level Differential Analyses’ step. If not, DESeq2 and edgeR will be skipped during ‘Gene-level Differential Analysis’ step. Checking the Python version and 2to3
command on the workstation beforehand is highly recommended but not necessary.
HISAT2 indexex: Users are advised to provide an ‘indices/’ directory in ‘inputfiles/’. HISAT2 requires at least 160 GB of RAM and several hours to index the entire human genome.
This is the first step of the RNA-Seq analysis workflow in this package. Prior to conducting RNA-Seq analysis, it is necessary to implement a constructor function, called RNASeqRParam()
and to create a RNASeqRParam
S4 object which stores parameters not only for pre-checking but also for utilizing as input the parameters in the subsequent steps.
RNASeqRParam
Slots ExplanationThere are 11 slots in RNASeqRParam
:
os.type : The operating system type. Value is linux
or osx
. This package only supports ‘Linux’ and ‘macOS’ (not ‘Windows’). If another operating system is detected, ERROR will be reported.
python.variable
: A Python-related variable. The value is a list of whether Python is available and the Python version (TRUE
or FALSE
, 2
or 3
).
python.2to3
: Availability of the 2to3
command. THe value is TRUE
or FALSE
.
path.prefix
: Path prefix of the ‘gene_data/’, ‘RNASeq_bin/’, ‘RNASeq_results/’, ‘Rscript/’ and ‘Rscript_out/’ directories. It is recommended that you create a new, empty directory in which all the subsequent RNA-Seq results can be saved.
input.path.prefix
: Path prefix of the ‘input_files/’ directory. Users should prepare an ‘input_file/’ directory with the following rules:
genome.name
.fa: Reference genome in FASTA file formation.
genome.name
.gtf: Gene annotation in GTF file formation.
raw_fastq.gz/: Directory storing FASTQ
files.
Supports both paired-end and single-end reads files.
Names of FASTQ
files : ’sample.pattern
_1.fastq.gz’ and ’sample.pattern
_2.fastq.gz’. sample.pattern
must be distinct for each sample.
phenodata.csv: Information about the RNA-Seq experiment design.
First column : Distinct ids for each sample. The value of each sample of this column must match sample.pattern
in FASTQ
files in ‘raw_fastq.gz/’. The column name must be ids.
case.group
or control.group
. The column name is the parameter independent.variable
which can be any string defined by users.Third column : The color coding for case.group
and control.group
. Samples in the same group must have same color coding and their values are HEX color code. The column name must be color.
indices/ : The directory storing HT2
index files for the HISAT2 alignment tool.
This directory is optional. HT2
index files corresponding to the reference genome can be installed at HISAT2 official website. Providing HT2
files can accelerate the subsequent analysis steps. It is highly advised that you install HT2
files.
If HT2
index files are not provided, the ‘input_files/indices/’ directory should be deleted.
An example ‘phenodata.csv’ file. File is stored in ‘CSV’ format.
genome.name
: The genome name defined in this RNA-Seq workflow (ex. ‘genome.name
.fa’, ‘genome.name
.gtf’)
sample.pattern
: A regular expression of paired-end or single-end fastq.gz files under ‘input_files/raw_fastq.gz/’. IMPORTANT!! The expression shouldn’t have _[1,2].fastq.gz
at the end.
independent.variable
: Independent variable for the biological experiment design of a two-group RNA-Seq analysis workflow.
case.group
: Name of the case group.
control.group
: Name of the control group.
indices.optional
: A logical value indicating whether ‘input_files/indices/’ exists. Value is TRUE
or FALSE
RNASeqRParam
Constructor CheckingCreate a new directory for the RNA-Seq analysis. It is highly recommended to create a new, completely empty directory. The parameter path.prefix
of RNASeqRParam()
constructor should be the absolute path of this new directory. All the RNA-SeqR-related files are generated in the subsequent steps will be stored inside of this directory.
Create a valid ‘input_files/’ directory. You should create a file directory named ‘input_files/’ with the neccessary files inside. It should follow the rules mentioned above.
RNASeqRParam
S4 object. This constructor will check the validity of the input parameters before creating the S4 objects.
Operating system
Python version
2to3 command
Structure, content, and rules of ‘inputfiles/’
Validity of input parameters
input_files.path <- system.file("extdata/", package = "RNASeqRData")
#rnaseq_result.path <- "/Users/chaokuan-hao/Documents/ANU_2019_Semester_2/Lanfear_Lab/HI"
rnaseq_result.path <- "/tmp/RNASeqR/"
dir.create(rnaseq_result.path, recursive = TRUE)
list.files(input_files.path, recursive = TRUE)
## [1] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.fa"
## [2] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.gtf"
## [3] "input_files/phenodata.csv"
## [4] "input_files/raw_fastq.gz/SRR3396381_XV_1.fastq.gz"
## [5] "input_files/raw_fastq.gz/SRR3396381_XV_2.fastq.gz"
## [6] "input_files/raw_fastq.gz/SRR3396382_XV_1.fastq.gz"
## [7] "input_files/raw_fastq.gz/SRR3396382_XV_2.fastq.gz"
## [8] "input_files/raw_fastq.gz/SRR3396384_XV_1.fastq.gz"
## [9] "input_files/raw_fastq.gz/SRR3396384_XV_2.fastq.gz"
## [10] "input_files/raw_fastq.gz/SRR3396385_XV_1.fastq.gz"
## [11] "input_files/raw_fastq.gz/SRR3396385_XV_2.fastq.gz"
## [12] "input_files/raw_fastq.gz/SRR3396386_XV_1.fastq.gz"
## [13] "input_files/raw_fastq.gz/SRR3396386_XV_2.fastq.gz"
## [14] "input_files/raw_fastq.gz/SRR3396387_XV_1.fastq.gz"
## [15] "input_files/raw_fastq.gz/SRR3396387_XV_2.fastq.gz"
Check the reads files in ‘input_files/’ directory. Users can set fastq.gz.type as “PE” (paired-end) or “SE” (single-end) to specify which kinds of fastq.gz types they are.
exp <- RNASeqRParam(path.prefix = rnaseq_result.path,
input.path.prefix = input_files.path,
genome.name = "Saccharomyces_cerevisiae_XV_Ensembl",
sample.pattern = "SRR[0-9]*_XV",
independent.variable = "state",
case.group = "60mins_ID20_amphotericin_B",
control.group = "60mins_ID20_control",
fastq.gz.type = "PE")
exp <- RNASeqRParam(path.prefix = rnaseq_result.path,
input.path.prefix = input_files.path,
genome.name = "Saccharomyces_cerevisiae_XV_Ensembl",
sample.pattern = "SRR[0-9]*_XV",
independent.variable = "state",
case.group = "60mins_ID20_amphotericin_B",
control.group = "60mins_ID20_control",
fastq.gz.type = "SE")
## RNASeqRParam S4 object
## os.type : linux
## python.variable : (Availability: TRUE , Version: 2 )
## python.2to3 : FALSE
## path.prefix : /tmp/RNASeqR/
## input.path.prefix : /home/biocbuild/bbs-3.10-bioc/R/library/RNASeqRData/extdata/
## genome.name : Saccharomyces_cerevisiae_XV_Ensembl
## sample.pattern : SRR[0-9]*_XV
## independent.variable : state
## case.group : 60mins_ID20_amphotericin_B
## control.group : 60mins_ID20_control
## indices.optional : FALSE
## fastq.gz.type : PE
In this example, the RNASeqRParam
S4 object is stored in exp
for subsequent RNA-Seq analysis steps. Any ERROR occurring in the checking steps will terminate the program.
This is the second step of the RNA-Seq analysis workflow in this package. To set up the environment, run RNASeqEnvironmentSet_CMD()
to execute the process in the background, or run RNASeqEnvironmentSet()
to execute the process in the R shell.
path.prefix
directory. Here is the usage of these five main directories:
‘gene_data/’: Symbolic links of ‘input_files/’ and files that are created in each step of RNA-Seq analysis will be stored in this directory.
‘RNASeq_bin/’: The binaries of necessary tools, HISAT2, SAMtools, StringTie and Gffcompare, are installed in this directory.
‘RNASeq_results’: The RNA-Seq results, for example, alignment results, quality assessment results, differential analysis results etc., will be stored in this directory.
‘Rscript’: If your run the XXX_CMD()
function, the corresponding R script(XXX.R
) for certain steps will be created in this directory.
‘Rscript_out’: The corresponding output report for R scripts (XXX.Rout
) will be stored in this directory.
path.prefix
directory.The operating system of your workstation will be detected. If the operating system is not Linux
or macOS
, ERROR will be reported. Users can decide whether the installation of essential programs(HISAT2, StringTie and Gffcompare) is going be automatically processed.
Third-party softwares used in this package includes HISAT2, StringTie and Gffcompare. Binaries are available for these three programs, and by default, they will be installed automatically based on the operating system of the workstation. Zipped binaries will be unpacked and exported to the R environment PATH. No compilation is needed.
To specify, there are three parameters(install.hisat2
, install.stringtie
and install.gffcompare
) in both the RNASeqEnvironmentSet_CMD()
and RNASeqEnvironmentSet()
functions for users to determine which software is going to be installed automatically or skipped. The default settings of these parameters are TRUE
so that these three programs will be installed directly. Otherwise, users can skip certain software installation processes by turning the values to FALSE
. Please make sure to check that the skipped programs are available using system2()
through the R shell. Unavailability of any of the programs will cause failure in the ‘Environment Setup’ step.
Here is the version information of each software binary.
hisat2-2.1.0-Linux_x86_64.zip
or hisat2-2.1.0-OSX_x86_64.zip
zipped file will be installed.stringtie-1.3.4d.Linux_x86_64.tar.gz
or stringtie-1.3.4d.Linux_x86_64
will be installed.gffcompare-0.10.4.Linux_x86_64.tar.gz
or gffcompare-0.10.4.Linux_x86_64.tar.gz
will be installed.‘RNASeq_bin/’ will be added to the R environment PATH so that these binaries can be found in the R environment in the R shell through system2()
. In the last step of environment setup, the hisat2 --version
,stringtie --version
,gffcompare --version
, and samtools --version
commands will be checked in order to make sure the environment is correctly constructed. The environment must be set up successfully before the subsequent analyses.
Run RNASeqEnvironmentSet_CMD()
or RNASeqEnvironmentSet()
.
FASTQ
Sequence DataThis is the third step of the RNA-Seq analysis workflow in this package. Different from other necessary steps, it is optional and can be run several times with each result stored separately. Although this step can be skipped, it is strongly recommended that it be performed before processing the alignment step. To evaluate the quality of the raw reads in the FASTQ
files, run RNASeqQualityAssessment_CMD()
in the background or run RNASeqQualityAssessment()
to execute the process in the R shell.
In this step, the systemPipeR package is used for evaluating sequencing reads and the details are as follows:
Check the number of times that the user has run the quality assessment process and create the corresponding files ‘RNASeq_results/QA_results/QA_{times}’.
RNA-Seq environment set up. The ‘rnaseq/’ directory will be created by systemPipeR package.
Create the ‘data.list.txt’ file.
Reading FASTQ
files and create ‘fastqReport.pdf’ containing the results of the quality assessment.
Remove the ‘rnaseq/’ directory.
This quality assessment result (example below) is generated by systemPipeR package. It will be stored as a PDF
.
Run RNASeqQualityAssessment_CMD()
or RNASeqQualityAssessment()
.
## [1] "Generated rnaseq directory. Next run in rnaseq directory, the R code from *.Rmd (*.Rnw) template interactively. Alternatively, workflows can be exectued with a single command as instructed in the vignette."
Sliding through quality score in each base would be computational-intensive. Moreover, due to R package limitation, it is not suitable to trim large RNA-Seq fastq.gz in R environment and is not supported in RNASeqR. Sample with an average quality score below Q30 is suggested to be omitted from the sub-sequent analyses.
In order to let users update trimmed fastq.gz files easily, an update function is provided. User need the RNASeqRParam
S4 object created at the first step, the prepared new ‘raw_fasrq.gz/’ directory with trimmed files and the samples stored in list as Update_Fastq_gz
function inputs.
This is the fourth step of the RNA-Seq analysis workflow in this package. To process raw reads in FASTQ
files, users can either run RNASeqRawReadProcess_CMD()
to execute the process in background or run RNASeqRawReadProcess()
to execute the process in the R shell. For further details about the commands and parameters that executed during each step, please check the reported ‘RNASeq_results/COMMAND.txt’.
The defalt indexer is Hisat2. In a preparation step (the RNASeqRParam
creation step), the ‘indices/’ directory is checked for whether HT2
index files already exist. If not, the following commands will be executed:
Input: ‘genome.name
.gtf’, ‘genome.name
.fa’
Output: ‘genome.name
.ss’, ‘genome.name
.exon’, ’genome.name
_tran.{number}.ht2’
extract_splice_sites.py
, extract_exons.py
execution
hisat2-build
index creation
genome.name
.ss and genome.name
.exon created in the previous step. Be aware that the index building step requires a larger amount of memory and longer time than other steps, and it might not be possible to run on some personal workstations. It is highly recommended that you check the availability of HT2
index files at the HISAT2 official website for your target reference genome beforehands. Pre-installing HT2
index files will greatly shorten the analysis time.Input: ’genome.name
_tran.{number}.ht2’, ‘sample.pattern
.fastq.gz’
Output: ‘sample.pattern
.sam’
The defalt Aligner is Hisat2.
hisat2
command is executed on paired-end or single-end FASTQ
files. SAM
files will be created.
SAM
files are stored in ‘gene_data/raw_sam/’.CSV
) and picture (PNG
) format is created and kept in the directory ‘RNASeq_results/Alignment_Report’.Input: ‘genome.name
.gtf’, ‘genome.name
.fa’, ‘sample.pattern
.fastq.gz’
Output: ‘`sample.pattern
.sam`’
Users could select STAR as their aligner in this pipeline by setting STAR.Alignment.run
to TRUE
and Hisat2.Index.run
, Hisat2.Alignment.run
to FALSE
. Indexed results would be stored in ‘indices/’ directory, and generated SAM
files are stored in ‘gene_data/raw_sam/’.
SAM
to BAM
ConverterIn this step, users can choose whether they want to use ‘Rsamtools’(an R package) or ‘SAMtools’ ( a command-line-based tool) to do files conversion by setting the SAMtools.or.Rsamtools
parameter to Rsamtools
or SAMtools
. By default, Rsamtools
will be used. However, if the amount of RNA-Seq data is too large, ‘Rsamtools’ might not be able to finish this process due to the Rtmp file issue, and therefore ‘SAMtools’ is recommended. Users have to make sure the ‘samtools’ command is available on the workstation beforehands or ERROR will be reported.
Input: ‘sample.pattern
.sam’
Output: ‘sample.pattern
.bam’
samtools
in the R environment. In this step, SAM
files from HISAT2 will be converted to BAM
files by running the asBam()
function.
BAM
files are stored in ‘gene_data/raw_sam/’.Input: ‘genome.name
.gtf’, ‘sample.pattern
.bam’
Output: ‘sample.pattern
.gtf’
stringtie
command is executed.
GTF
files which are from each FASTQ
files are stored in ‘gene_data/raw_gtf/’GTF
MergerInput: ‘sample.pattern
.gtf’
Output: ‘stringtiemerged.gtf’, ‘mergelist.txt’
stringtie
command is executed.
sample.pattern
.gtf into stringtiemerged.gtfInput: ‘genome.name
.gtf’, ‘stringtie_merged.gtf’
Output: ‘merged.annotated.gtf’, ‘merged.loci’, ‘merged.stats’, ‘merged.stringtie_merged.gtf.refmap’, ‘merged.stringtie_merged.gtf.tmap’, ‘merged.tracking’
gffcompare
command is executed.
GTF
file and reference annotation files is reported in the ‘merged/’ directory.Input: ‘stringtie_merged.gtf’
Output: ‘ballgown/’, ‘gene_abundance/’
stringtie
command is executed.
TSV
file under each sample name in ‘gene_data/gene_abundance/’.Whether this step is executed depends on the availability of Python on your workstation.
Input: ‘samplelst.txt’
Output: ‘gene_count_matrix.csv’, ‘transcript_count_matrix.csv’
The reads count table converter Python script is downloaded as prepDE.py
When Python is not available, this step is skipped.
When Python2 is available, prepDE.py
is executed.
When Python3 is available, the 2to3
command will be checked.(Usually, if Python3 is installed, 2to3
command will be installed too.)
When Python3 is available but the 2to3
command is unavailable, the raw read count step will be skipped.
When Python3 and the 2to3
command are available, prepDE.py
is converted to a file that can be executed by Python2 and is automatically executed.
Run RNASeqReadProcess_CMD()
or RNASeqReadProcess()
. Parameters for index, alignmet and assembly are set by default. Only S4 RNASeqRParam
object is need. If users want to change their parameters, please check function inputs of RNASeqReadProcess_CMD()
and RNASeqReadProcess()
.
HISAT2
.STAR
and run in the background, run this oneRNASeqReadProcess_CMD(exp, STAR.Alignment.run=TRUE,
Hisat2.Index.run=FALSE,
Hisat2.Alignment.run = FALSE)
HISAT2
.## [1] "Time loading forward index: 00:00:00"
## [2] "Time loading reference: 00:00:00"
## [3] "Multiseed full-index search: 00:00:04"
## [4] "74641 reads; of these:"
## [5] " 74641 (100.00%) were paired; of these:"
## [6] " 5505 (7.38%) aligned concordantly 0 times"
## [7] " 69136 (92.62%) aligned concordantly exactly 1 time"
## [8] " 0 (0.00%) aligned concordantly >1 times"
## [9] " ----"
## [10] " 5505 pairs aligned concordantly 0 times; of these:"
## [11] " 882 (16.02%) aligned discordantly 1 time"
## [12] " ----"
## [13] " 4623 pairs aligned 0 times concordantly or discordantly; of these:"
## [14] " 9246 mates make up the pairs; of these:"
## [15] " 5372 (58.10%) aligned 0 times"
## [16] " 3874 (41.90%) aligned exactly 1 time"
## [17] " 0 (0.00%) aligned >1 times"
## [18] "96.40% overall alignment rate"
## [19] "Time searching: 00:00:04"
## [20] "Overall time: 00:00:04"
## [1] "Time loading forward index: 00:00:00"
## [2] "Time loading reference: 00:00:00"
## [3] "Multiseed full-index search: 00:00:04"
## [4] "84711 reads; of these:"
## [5] " 84711 (100.00%) were paired; of these:"
## [6] " 7499 (8.85%) aligned concordantly 0 times"
## [7] " 77212 (91.15%) aligned concordantly exactly 1 time"
## [8] " 0 (0.00%) aligned concordantly >1 times"
## [9] " ----"
## [10] " 7499 pairs aligned concordantly 0 times; of these:"
## [11] " 1250 (16.67%) aligned discordantly 1 time"
## [12] " ----"
## [13] " 6249 pairs aligned 0 times concordantly or discordantly; of these:"
## [14] " 12498 mates make up the pairs; of these:"
## [15] " 7691 (61.54%) aligned 0 times"
## [16] " 4807 (38.46%) aligned exactly 1 time"
## [17] " 0 (0.00%) aligned >1 times"
## [18] "95.46% overall alignment rate"
## [19] "Time searching: 00:00:04"
## [20] "Overall time: 00:00:04"
## [1] "Time loading forward index: 00:00:00"
## [2] "Time loading reference: 00:00:00"
## [3] "Multiseed full-index search: 00:00:05"
## [4] "89449 reads; of these:"
## [5] " 89449 (100.00%) were paired; of these:"
## [6] " 7441 (8.32%) aligned concordantly 0 times"
## [7] " 82008 (91.68%) aligned concordantly exactly 1 time"
## [8] " 0 (0.00%) aligned concordantly >1 times"
## [9] " ----"
## [10] " 7441 pairs aligned concordantly 0 times; of these:"
## [11] " 1197 (16.09%) aligned discordantly 1 time"
## [12] " ----"
## [13] " 6244 pairs aligned 0 times concordantly or discordantly; of these:"
## [14] " 12488 mates make up the pairs; of these:"
## [15] " 7431 (59.51%) aligned 0 times"
## [16] " 5057 (40.49%) aligned exactly 1 time"
## [17] " 0 (0.00%) aligned >1 times"
## [18] "95.85% overall alignment rate"
## [19] "Time searching: 00:00:06"
## [20] "Overall time: 00:00:06"
## [1] "Time loading forward index: 00:00:00"
## [2] "Time loading reference: 00:00:00"
## [3] "Multiseed full-index search: 00:00:05"
## [4] "83081 reads; of these:"
## [5] " 83081 (100.00%) were paired; of these:"
## [6] " 7663 (9.22%) aligned concordantly 0 times"
## [7] " 75418 (90.78%) aligned concordantly exactly 1 time"
## [8] " 0 (0.00%) aligned concordantly >1 times"
## [9] " ----"
## [10] " 7663 pairs aligned concordantly 0 times; of these:"
## [11] " 1551 (20.24%) aligned discordantly 1 time"
## [12] " ----"
## [13] " 6112 pairs aligned 0 times concordantly or discordantly; of these:"
## [14] " 12224 mates make up the pairs; of these:"
## [15] " 7775 (63.60%) aligned 0 times"
## [16] " 4449 (36.40%) aligned exactly 1 time"
## [17] " 0 (0.00%) aligned >1 times"
## [18] "95.32% overall alignment rate"
## [19] "Time searching: 00:00:05"
## [20] "Overall time: 00:00:05"
## [1] "Time loading forward index: 00:00:00"
## [2] "Time loading reference: 00:00:00"
## [3] "Multiseed full-index search: 00:00:04"
## [4] "93579 reads; of these:"
## [5] " 93579 (100.00%) were paired; of these:"
## [6] " 5984 (6.39%) aligned concordantly 0 times"
## [7] " 87593 (93.60%) aligned concordantly exactly 1 time"
## [8] " 2 (0.00%) aligned concordantly >1 times"
## [9] " ----"
## [10] " 5984 pairs aligned concordantly 0 times; of these:"
## [11] " 693 (11.58%) aligned discordantly 1 time"
## [12] " ----"
## [13] " 5291 pairs aligned 0 times concordantly or discordantly; of these:"
## [14] " 10582 mates make up the pairs; of these:"
## [15] " 5823 (55.03%) aligned 0 times"
## [16] " 4759 (44.97%) aligned exactly 1 time"
## [17] " 0 (0.00%) aligned >1 times"
## [18] "96.89% overall alignment rate"
## [19] "Time searching: 00:00:04"
## [20] "Overall time: 00:00:04"
## [1] "Time loading forward index: 00:00:00"
## [2] "Time loading reference: 00:00:00"
## [3] "Multiseed full-index search: 00:00:05"
## [4] "90077 reads; of these:"
## [5] " 90077 (100.00%) were paired; of these:"
## [6] " 6826 (7.58%) aligned concordantly 0 times"
## [7] " 83250 (92.42%) aligned concordantly exactly 1 time"
## [8] " 1 (0.00%) aligned concordantly >1 times"
## [9] " ----"
## [10] " 6826 pairs aligned concordantly 0 times; of these:"
## [11] " 1077 (15.78%) aligned discordantly 1 time"
## [12] " ----"
## [13] " 5749 pairs aligned 0 times concordantly or discordantly; of these:"
## [14] " 11498 mates make up the pairs; of these:"
## [15] " 6876 (59.80%) aligned 0 times"
## [16] " 4622 (40.20%) aligned exactly 1 time"
## [17] " 0 (0.00%) aligned >1 times"
## [18] "96.18% overall alignment rate"
## [19] "Time searching: 00:00:05"
## [20] "Overall time: 00:00:05"
STAR
and run in the R shell, then run this oneThis is the fifth step of the RNA-Seq analysis workflow in this package. To identify differentially expressed genes, users can either run RNASeqDifferentialAnalysis_CMD()
to execute the process in the background or run RNASeqDifferentialAnalysis()
to execute the process in the R shell. In this package, we provide three normalized expression values: fragments per kilobase per million reads (FPKM) (Mortazavi et al. 2008), normalized counts by means of median of ratios normalization (MRN), or trimmed mean of m-values (TMM), each with proper statistical analyses using the R packages, ballgown, DESeq2, and edgeR. Gene IDs from StringTie and the ballgown R package will be mapped to ‘gene_name’ in the GTF file for further functional analysis.
For each differential expression analysis tool, all statistical results would be stored properly under ‘ballgown_analysis/’, ‘DESeq2_analysis/’ and ‘edgeR_analysis/’. In the file, known gene annotations would be listed in the front rows while unknown gene annotations would be listed in the back rows. The following picture shows partial result of an example dataset where ‘.’ represents unknow gene id.
Here we illustrate general data visualization before and after differential expression analysis. The results based on each differential analysis tool(ballgown, DESeq2, edgeR) are kept in the directory ‘RNASeq_results/’. The plots shown below are the statistical results visualization of example data in the RNASeqRData
package based on MRN-normalized expression values from DESeq2.
For real data analysis results, please go to this website: https://howardchao.github.io/RNASeqR_analysis_result/.
When visualizing the frequency of expression value per sample using the ggplot2 R package, the x-axis represents the range of normalized counts value by MRN or log2(MRN+1) value and the y-axis represents the frequency of each value on the x-axis
‘Frequency_Plot_normalized_count_ggplot2.png’
‘Frequency_Plot_log_normalized_count_ggplot2.png’
You can display the distribution of normalized expression values (e.g., log2(MRN+1) values) in a boxplot and a violin plot using the ggplot2 R package. Samples are colored as follows: blue for the case group and yellow for the control group.
‘Box_Plot_ggplot2.png’
‘Violin_Plot_ggplot2.png’
To display how the biological samples compare in overall similarities and differences using principal component analysis(PCA); the principal component scores of the top five dimensions are calculated using the FactoMineR package and the results are extracted and visualized using factoextra or ggplot2.
‘Dimension_PCA_Plot_factoextra.png’
‘PCA_Plot_factoextra.png’: Samples are colored as follows: blue for the case group and yellow for the control group. The small point represents each sample while the big one represents each comparison group. Ellipases can be further added for grouping samples.
‘PCA_Plot_ggplot2.png’: Samples are colored as described above.
You can display Pearson correlation coefficient of a pairwise correlation analysis of changes in gene expression from all samples calculated by stats using ggplot2(correlation heat plot), corrplot(correlation dot plot), and PerformanceAnalytics(correlation bar plot). The colors from red to blue indicate the value of the correlation, from maximum value to minimum value, among all samples.
‘Correlation_Heat_Plot_ggplot2.png’
‘Correlation_Dot_Plot_corrplot.png’
‘Correlation_Bar_Plot_PerformanceAnalytics.png’