FASTQ Quality Control


Documentation for package ‘qckitfastq’ version 1.16.0

Help Pages

adapter_content Creates an abundance table of adapters, sorted from most to least frequent, for adapters present in more than 0.1% of the reads. If output_file is specified, the entire set of adapters and counts is saved. Only available for macOS/Linux due to a dependency on C++14.
calc_adapter_content Compute adapter content in reads. This function is only available for macOS/Linux.
calc_format_score Calculate score based on Illumina format
calc_over_rep_seq Calculate sequence counts for each unique sequence and create a table with unique sequences and corresponding counts
dimensions Extract the number of columns and rows for a FASTQ file using seqTools.
find_format Gets the quality score encoding format from the FASTQ file. Possible return values are Sanger(/Illumina1.8), Solexa(/Illumina1.0), Illumina1.3, and Illumina1.5. The detection is heuristic-based and may not be 100% accurate because the encodings overlap, so it is best if you already know the format.
GC_content Calculates GC content percentage for each read in the dataset.
gc_per_read Calculate GC nucleotide sequence content per read of the FASTQ gzipped file
kmer_count Return kmer count per sequence for the length of kmer desired
overrep_kmer Generate overrepresented kmers of length k based on their observed to expected ratio at each position across all sequences in the dataset. The expected proportion of a length k kmer assumes site independence and is computed as the sum of the count of each base pair in the kmer times the probability of observing that base pair in the data set, i.e. P(A) × count_in_kmer(A) + P(C) × count_in_kmer(C) + ... The observed to expected ratio is computed as log2(obs/exp). Kmers with obsexp_ratio > 2 are considered overrepresented and appear in the returned data frame along with their position in the sequence.
overrep_reads Sort all sequences per read by count.
per_base_quality Compute the mean, median, and percentiles of quality score per base. This is returned as a data frame.
per_read_quality Compute the mean quality score per read.
plot_adapter_content Creates a bar plot of the top 5 most present adapter sequences.
plot_GC_content Generate mean GC content histogram.
plot_outliers Determine how to plot outliers. The heuristic used is whether their obsexp_ratio differs by more than 1 and whether they fall into the same bin or not. If two outliers have obsexp_ratio values that differ by less than 0.4 and are in the same bin, they are combined into a single plotting point. NOT FULLY FUNCTIONAL
plot_overrep_kmer Create a box plot of the log2(observed/expected) ratio across the length of the sequence as well as top overrepresented kmers. Only ratios greater than 2 are included in the box plot. Default is 20 bins across the length of the sequence and the top 2 overrepresented kmers, but this can be changed by the user.
plot_overrep_reads Plot the top 5 sequences
plot_per_base_quality Generate a boxplot of the per position quality score.
plot_per_read_quality Plot the mean quality score per sequence as a histogram. High quality sequences are those mostly distributed over 30. Low quality sequences are those mostly under 30.
plot_read_content Plot the per position nucleotide content.
plot_read_length Plot a histogram of the number of reads with each read length.
qual_score_per_read Calculate the mean quality score per read of the FASTQ gzipped file
read_base_content Compute nucleotide content per position for a single base pair. Wrapper function around seqTools.
read_content Compute nucleotide content per position. Wrapper function around seqTools.
read_length Creates a data frame of read lengths and the number of reads with that read length.
run_all Runs all functions in the qckitfastq suite and saves the data frames and plots to a user-provided directory. Plot names are supplied by default.
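
Note on encoding detection. The find_format entry above says detection is heuristic because the encodings overlap. The sketch below illustrates the general idea in Python: guess the encoding from the lowest ASCII code seen in the quality strings. It is not the package's implementation. The function name guess_quality_encoding is hypothetical, and the cutoffs (59, 64, 66) are an assumption taken from the commonly cited character ranges for each encoding.

```python
def guess_quality_encoding(quality_strings):
    """Guess a FASTQ quality encoding from the lowest ASCII code observed.

    Assumed lowest possible character per encoding (not taken from qckitfastq):
      Sanger / Illumina 1.8  -> '!' (33)
      Solexa / Illumina 1.0  -> ';' (59)
      Illumina 1.3           -> '@' (64)
      Illumina 1.5           -> 'B' (66)
    """
    # Flatten every quality string into a list of ASCII codes.
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")

    lowest = min(codes)
    if lowest < 33 or max(codes) > 126:
        raise ValueError("quality characters outside the printable ASCII range")

    # A character below an encoding's minimum rules that encoding out,
    # so test from the widest range to the narrowest.
    if lowest < 59:
        return "Sanger(/Illumina1.8)"
    if lowest < 64:
        return "Solexa(/Illumina1.0)"
    if lowest < 66:
        return "Illumina1.3"
    return "Illumina1.5"
```

The overlap shows up directly in this sketch: a set of uniformly high-quality Sanger reads such as "IIII" contains no character below 'B', so the function reports Illumina1.5. Only the presence of low characters can exclude the narrower encodings, which is why knowing the format in advance is preferable.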