1 Introduction

Rqc is an optimized tool designed for quality control and assessment of high-throughput sequencing data. It performs parallel processing of entire files and produces a report which contains a set of high-resolution graphics that can be used for quality assessment.

This version of Rqc produces high-quality images for the following statistics:

  • Average Quality: this plot describes the average quality pattern by showing on the X-axis quality thresholds and on the Y-axis the percentage of reads that exceed that quality level.
  • Cycle-specific Average Quality: this describes the average quality scores for each cycle of sequencing.
  • Read Length Distribution: a bar plot that presents the distribution of the lengths of the reads available in the FASTQ file.
  • Cycle-specific GC Content: a line plot showing the average GC content for every cycle of sequencing.
  • Cycle-specific Quality Distribution: a bar plot showing the proportion of quality calls per cycle. Colors follow a red-to-blue gradient, where red identifies calls of lower quality. This visualization is preferred because it is cleaner than the boxplots described below.
  • Cycle-specific Quality Distribution - Boxplots: boxplots describing empirical patterns of quality distribution on each cycle of sequencing.
  • Cycle-specific Base Call Proportion: this bar plot describes the proportion of each nucleotide called for every cycle of sequencing.

1.1 Basic Workflow

The main goal of Rqc is to provide graphical tools for quality assessment of reads contained in FASTQ files. The package is designed for simplicity of use, so the whole workflow is available through a single function, rqc. This function processes a set of input files and generates an HTML report containing several plots that can be used for quality assessment.

To access this functionality, the user needs to load the Rqc package.

library(Rqc)

The next step is to determine the location of the FASTQ files to be analyzed. The example below uses sample files provided by the ShortRead package; the user must change this location to point to the files that actually need QA.

folder <- system.file(package="ShortRead", "extdata/E-MTAB-1147")

The basic usage of the rqc function requires two arguments. The first, path, is the location where the files of interest are saved (defined in the step above). The second, pattern, is a regular expression that identifies all files of interest. Below, we use .fastq.gz to select all files whose names contain that string. Strictly speaking, because pattern is a regular expression, each unescaped dot matches any single character rather than only a literal period.

rqc(path = folder, pattern = ".fastq.gz")
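To make the regular-expression behaviour of pattern concrete, here is a small base R sketch. The file names are made up, grepl is used only to illustrate the matching, and the stricter anchored pattern is a suggestion rather than something the package requires.

```r
# Hypothetical file names, for illustration only.
files <- c("sample_1.fastq.gz", "sample_2.fastq.gz",
           "notes_fastq_gz.txt", "sample_1.fastq.gz.md5")

loose  <- ".fastq.gz"        # pattern used above: "." matches any character
strict <- "\\.fastq\\.gz$"   # literal dots, anchored at the end of the name

# Keep the names that match each pattern.
loose_hits  <- files[grepl(loose, files)]
strict_hits <- files[grepl(strict, files)]

print(loose_hits)   # all four names match the loose pattern
print(strict_hits)  # only the two names ending in ".fastq.gz"
```

This aside does not change the rqc call above. With the sample data, both patterns should select the same two files.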

At this point, the user’s default Internet browser will open an HTML file. This file is the report generated by Rqc, which, by default, is stored in a temporary directory. A sample report is shown below:


2 Quality control report

2.1 File Information

This table describes the input files. The reads column holds either the total number of reads (sample=FALSE) or the sample size.

knitr::kable(perFileInformation(qa))
filename                     pair  format  group  reads  total.reads  path
ERR127302_1_subset.fastq.gz  1     FASTQ   None   20000  20000        /home/biocbuild/bbs-3.7-bioc/R/library/ShortRead/extdata/E-MTAB-1147
ERR127302_2_subset.fastq.gz  2     FASTQ   None   20000  20000        /home/biocbuild/bbs-3.7-bioc/R/library/ShortRead/extdata/E-MTAB-1147
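
The code chunks in this report refer to an object named qa that is not defined in the text shown here. The following is a minimal sketch of how such an object could be built outside the report, assuming it is the result of calling rqcQA on the input files. That assumption, the files variable, and the workers = 1 setting are illustrative, not taken from the report itself. The requireNamespace guard only skips the example when the packages are not installed.

```r
# Sketch only: skipped when Rqc or ShortRead is not installed.
if (requireNamespace("Rqc", quietly = TRUE) &&
    requireNamespace("ShortRead", quietly = TRUE)) {
  library(Rqc)

  # Same sample data location as in the Basic Workflow section.
  folder <- system.file(package = "ShortRead", "extdata/E-MTAB-1147")
  files <- list.files(folder, pattern = "\\.fastq\\.gz$", full.names = TRUE)

  # Assumed step: build the quality-assessment result that the report calls qa.
  qa <- rqcQA(files, workers = 1)

  # Per-file summary, as tabulated in section 2.1.
  print(perFileInformation(qa))
}
```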

2.2 Per Read Mean Quality Distribution of Files

This plot gives an overview of the per-read mean quality distribution of all files.

rqcReadQualityBoxPlot(qa)

2.3 Average Quality

This plot describes the average quality pattern by showing on the X-axis quality thresholds and on the Y-axis the percentage of reads that exceed that quality level.

rqcReadQualityPlot(qa)
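
The statistic behind this plot can be illustrated with a toy example in base R. This is a sketch under stated assumptions, not the package's implementation: the quality values are made up, and each read is assumed to be summarized by the mean of its per-cycle quality scores.

```r
# Toy quality scores: one row per read, one column per cycle (made-up values).
quality <- rbind(
  c(30, 32, 34, 36),
  c(20, 22, 24, 26),
  c(10, 12, 14, 16),
  c(38, 38, 38, 38)
)

# Summarize each read by its mean quality (an assumption about the summary used).
read_means <- rowMeans(quality)

# For each threshold (X-axis), compute the percentage of reads whose
# mean quality exceeds it (Y-axis).
thresholds <- c(0, 15, 25, 35)
pct_exceeding <- sapply(thresholds, function(q) 100 * mean(read_means > q))

print(data.frame(threshold = thresholds, percent = pct_exceeding))
```

The resulting curve can only stay flat or decrease as the threshold grows, so a curve that stays high for longer indicates a file with better overall read quality.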

2.4 Cycle-specific Average Quality

This describes the average quality scores for each cycle of sequencing.

rqcCycleAverageQualityPlot(qa)
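
As a rough illustration of what a cycle-specific average means, the base R sketch below averages made-up quality scores column by column. The matrix layout (reads in rows, cycles in columns, NA where a read is shorter than the longest one) is an assumption chosen for the example, not a description of how the package stores its data.

```r
# Toy quality scores: one row per read, one column per cycle (made-up values).
# The first read is one cycle shorter, so its last entry is NA.
quality <- rbind(
  c(30, 32, 34, NA),
  c(20, 22, 24, 26),
  c(40, 36, 32, 28)
)

# Average quality per cycle, ignoring reads that do not reach that cycle.
cycle_means <- colMeans(quality, na.rm = TRUE)

print(data.frame(cycle = seq_along(cycle_means), average_quality = cycle_means))
```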