1 Introduction

seqArchR is a non-negative matrix factorization (NMF)-based unsupervised learning approach for identifying different core promoter sequence architectures. seqArchR implements an algorithm based on chunking and iterative processing. While matrix factorization-based applications are known to scale poorly for large amounts of data, seqArchR’s algorithm enables scalable processing of large number of sequences. A notable advantage of seqArchR is that the sequence motifs – the lengths and positional specificities of individual motifs, and complex inter-relationships where multiple motifs are at play in tandem, all are simultaneously inferred from the data. To our knowledge, this is a novel application of NMF on biological sequence data capable of simultaneously discovering the sequence motifs and their positions. For a more detailed discussion, see preprint/publication.

This vignette demonstrates seqArchR’s usage with the help of a synthetic DNA sequences data set. Please refer to the paper for a detailed description of seqArchR’s algorithm. The paper also discusses the various parameters and their settings. For completeness, the following section gives a brief overview of the algorithm.

2 seqArchR’s algorithm

seqArchR implements a chunking-based iterative procedure. Below is a schematic of seqArchR’s algorithm.

Further details to follow.

3 Installation

3.1 Python scikit-learn dependency

seqArchR requires the Python module scikit-learn. Please see installation instructions here.

seqArchR is available on Bioconductor, and can be installed using:

if (!require("BiocManager", quietly = TRUE))


In case of any errors, please consider looking up: If none of the already noted points with regards to troubleshooting seqArchR’s installation help, please file a new issue.

4 Working with seqArchR

# Load seqArchR
library(Biostrings, quietly = TRUE)
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>, basename, cbind, colnames, dirname,,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax,, pmin,
#>, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#>     I, expand.grid, unname
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>     strsplit

# Set seed for reproducibility

4.1 Synthetic data explained

In order to demonstrate the efficacy of seqArchR, we use seqArchR to cluster DNA sequences in a synthetic data set which was generated as follows. A set of 200 simulated DNA sequences was generated, each 100 nucleotides long and with uniform probability for all nucleotides. These sequences have four clusters in them, each with 50 sequences. The profiles of the four clusters are:

Cluster Characteristic Motifs Motif Occurrence Position #Sequences
A Dinucleotide repeat AT every 10 nt 50
C GAGAG 60 50
D GAGAG 80 50

All the motifs across the clusters were planted with a mutation rate of 0.

4.2 Input and feature representation

We use one-hot encoding to represent the dinucleotide profiles of each sequence in the data set. seqArchR provides functions to read input from (a) a FASTA file, and (b) Biostrings::DNAStringSet object.

4.2.1 Reading input as FASTA file

The function seqArchR::prepare_data_from_FASTA() enables one-hot-encoding the DNA sequences in the given FASTA file. The one-hot-encoded sequences are returned as a sparse matrix with as many columns as the number of sequences in the FASTA file and (sequence length x \(4^{2}\)) rows when dinucleotide profiles is selected. The number of rows will be (sequence length x \(4\)) when mononucleotide profiles is selected. See the sinuc_or_dinuc argument.

Upon setting the logical argument rawSeq to TRUE, the function returns the raw sequences as a Biostrings::DNAStringSet object, with FALSE it returns the column-wise one-hot encoded representation as noted above. When raw_seq is TRUE, sinuc_or_dinuc argument is ignored.

# Creation of one-hot encoded data matrix from FASTA file
inputFname <- system.file("extdata", "example_data.fa.gz", 
                                  package = "seqArchR", 
                                  mustWork = TRUE)

# Specifying `dinuc` generates dinucleotide features
inputSeqsMat <- seqArchR::prepare_data_from_FASTA(fasta_fname = inputFname,
                                                  sinuc_or_dinuc = "dinuc")
#> Sequences OK,
#> Read 200 sequences
#> Generating dinucleotide profiles

inputSeqsRaw <- seqArchR::prepare_data_from_FASTA(fasta_fname = inputFname, 
                                               raw_seq = TRUE)

nSeqs <- length(inputSeqsRaw)
positions <- seq(1, Biostrings::width(inputSeqsRaw[1]))

4.2.2 Reading input as a DNAStringSet object

If you already have a Biostrings::DNAStringSet object, you can use the seqArchR::get_one_hot_encoded_seqs() function which directly accepts a Biostrings::DNAStringSet object.

# Creation of one-hot encoded data matrix from a DNAStringSet object
inputSeqs_direct <- seqArchR::get_one_hot_encoded_seqs(seqs = inputSeqsRaw, 
                                                  sinuc_or_dinuc = "dinuc")
#> Generating dinucleotide profiles

identical(inputSeqs_direct, inputSeqsMat)
#> [1] TRUE

4.3 Visualize input sequences as an image

# Visualize the sequences in a image matrix where the DNA bases are 
# assigned fixed colors

                          pos_lab = positions, save_fname = NULL)