adapter_content |
Creates an abundance table of adapters, sorted from most to least frequent, containing the adapters found in more than 0.1% of the reads. If output_file is given, the entire set of adapters and counts is saved. Only available on macOS/Linux due to a dependency on C++14. |
calc_adapter_content |
Compute adapter content in reads. This function is only available for macOS/Linux. |
calc_format_score |
Calculate score based on Illumina format |
calc_over_rep_seq |
Calculate sequence counts for each unique sequence and create a table of the unique sequences and their corresponding counts |
dimensions |
Extract the number of columns and rows for a FASTQ file using seqTools. |
find_format |
Gets the quality score encoding format from the FASTQ file. Possible return values are Sanger(/Illumina1.8), Solexa(/Illumina1.0), Illumina1.3, and Illumina1.5. The detection is heuristic and may not be 100% accurate, since the ranges used by the encodings overlap, so it is best if you already know the format. |
GC_content |
Calculates GC content percentage for each read in the dataset. |
gc_per_read |
Calculate GC nucleotide sequence content per read of the FASTQ gzipped file |
kmer_count |
Return the kmer count per sequence for the desired kmer length |
overrep_kmer |
Generate overrepresented kmers of length k based on their observed to expected ratio at each position across all sequences in the dataset. The expected proportion of a length k kmer assumes site independence and is computed as the sum of the count of each base pair in the kmer times the probability of observing that base pair in the data set, i.e. P(A) × count_in_kmer(A) + P(C) × count_in_kmer(C) + ... The observed to expected ratio is computed as log2(obs/exp). Kmers with obsexp_ratio > 2 are considered overrepresented and appear in the returned data frame along with their position in the sequence. |
overrep_reads |
Sort all sequences per read by count. |
per_base_quality |
Compute the mean, median, and percentiles of quality score per base. This is returned as a data frame. |
per_read_quality |
Compute the mean quality score per read. |
plot_adapter_content |
Creates a bar plot of the top 5 most present adapter sequences. |
plot_GC_content |
Generate mean GC content histogram. |
plot_outliers |
Determine how to plot outliers. The heuristic is whether their obsexp_ratio values differ by more than 1 and whether they fall into the same bin. If the obsexp_ratio of two outliers differs by less than 0.4 and they are in the same bin, they are combined into a single plotting point. NOT FULLY FUNCTIONAL |
plot_overrep_kmer |
Create a box plot of the log2(observed/expected) ratio across the length of the sequence as well as top overrepresented kmers. Only ratios greater than 2 are included in the box plot. Default is 20 bins across the length of the sequence and the top 2 overrepresented kmers, but this can be changed by the user. |
plot_overrep_reads |
Plot the top 5 sequences |
plot_per_base_quality |
Generate a boxplot of the per position quality score. |
plot_per_read_quality |
Plot the mean quality score per sequence as a histogram. High quality sequences are those mostly distributed over 30. Low quality sequences are those mostly under 30. |
plot_read_content |
Plot the per position nucleotide content. |
plot_read_length |
Plot a histogram of the number of reads with each read length. |
qual_score_per_read |
Calculate the mean quality score per read of the FASTQ gzipped file |
read_base_content |
Compute nucleotide content per position for a single base pair. Wrapper function around seqTools. |
read_content |
Compute nucleotide content per position. Wrapper function around seqTools. |
read_length |
Creates a data frame of read lengths and the number of reads with that read length. |
run_all |
Runs all functions in the qckitfastq suite and saves the data frames and plots to a user-provided directory. Plot names are supplied by default. |
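
The overlap between encodings that find_format warns about can be seen in a small sketch. The Python snippet below is an illustration only and not the package's implementation: the function name `guess_quality_encoding` is hypothetical, and the cutoffs are an assumption based on the commonly documented lowest ASCII codes for each encoding (33 for Sanger/Illumina 1.8, 59 for Solexa, 64 for Illumina 1.3, 66 for Illumina 1.5). It guesses the encoding from the lowest quality character seen.

```python
def guess_quality_encoding(quality_strings):
    """Guess a FASTQ quality encoding from the lowest ASCII code observed.

    Hypothetical helper for illustration; the cutoffs are assumptions taken
    from the commonly documented ASCII ranges, not from qckitfastq itself.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        raise ValueError("no quality characters supplied")

    lowest = min(codes)
    if lowest < 59:
        # Only the Phred+33 encodings use characters below ';' (ASCII 59).
        return "Sanger(/Illumina1.8)"
    if lowest < 64:
        # Solexa scores can be negative, so characters 59..63 may appear.
        return "Solexa(/Illumina1.0)"
    if lowest < 66:
        # Illumina 1.3 starts at '@' (ASCII 64).
        return "Illumina1.3"
    # Nothing below 'B' (ASCII 66) was seen; this is the ambiguous case.
    return "Illumina1.5"


if __name__ == "__main__":
    print(guess_quality_encoding(["IIII!", "FFFF#"]))
    print(guess_quality_encoding(["hhhh", "gggg"]))
```

In this sketch, a read set whose lowest quality character is `B` or higher falls through to the last branch, even though it could equally be Solexa or Illumina 1.3 data with uniformly good scores. That ambiguity is why knowing the format in advance is preferable.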