This package provides a couple of functions that make the sWeeP method available in R. The method was developed to support the analysis of amino acid sequences and to assist alignment-free phylogenetic studies. It is based on the concept of spaced words, which is applied while scanning biological sequences and converting them into a matrix of occurrences, with the aim of generating low-dimensional matrices of amino acid sequence occurrences.
rSWeeP 1.12.0
The “Spaced Words Projection (sWeeP)” is a method for representing biological sequences using relatively compact vectors. It uses the spaced-words concept: sequences are scanned and indices are generated to create a high-dimensional vector, which is later projected onto a smaller, randomly oriented orthonormal base. The method is suitable for making high-quality comparisons between sequences, allowing analyses that are otherwise not feasible because of the computational limitations of traditional techniques. The method is available at sWeeP (PIERRI, 2019). Its main speed gain comes from a constant processing time per input: the response time grows linearly with the number of inputs, whereas in other methods it grows exponentially.
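The scanning step described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper `spaced_word_counts`, the mask `c(1, 1, 0, 1, 1)`, and the base-20 indexing scheme are all assumptions chosen for the example. With four kept positions over a 20-letter amino acid alphabet, the resulting vector has 20^4 = 160,000 entries.

```r
# Hypothetical helper (not part of rSWeeP): count spaced-word occurrences
# in a single amino acid sequence. Mask and indexing scheme are assumptions.
amino_acids <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

spaced_word_counts <- function(sequence, mask = c(1, 1, 0, 1, 1)) {
  # Map each residue to 0..19; unknown letters become NA
  residues <- match(strsplit(sequence, "")[[1]], amino_acids) - 1L
  keep <- which(mask == 1)                        # positions read in each window
  n_words <- length(residues) - length(mask) + 1L
  counts <- numeric(length(amino_acids)^length(keep))
  if (n_words < 1L) return(counts)
  # Positional weights for a base-20 number: 8000, 400, 20, 1
  weights <- length(amino_acids)^(rev(seq_along(keep)) - 1L)
  for (start in seq_len(n_words)) {
    word <- residues[start + keep - 1L]
    if (anyNA(word)) next                         # skip non-standard residues
    index <- sum(word * weights) + 1L             # 1-based index of this word
    counts[index] <- counts[index] + 1
  }
  counts
}

counts <- spaced_word_counts("MKVLAAGIVGLMKVLA")
length(counts)   # one entry per possible spaced word
sum(counts)      # one count per window scanned
```

The vector is almost entirely zeros for a short sequence, which is why projecting it onto a much smaller base loses little information in practice.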
The package has two functions: orthBase, which generates an orthonormal matrix of a chosen size, and sWeeP, which applies the sWeeP method.
The orthBase function creates a quasi-orthonormal matrix of any desired size. Here it is used to create the matrix onto which the sWeeP method projects the sequences, so it must have 160,000 rows and as many columns as the desired projection size.
library(rSWeeP)
# 160,000 rows (required by sWeeP) and 10 columns (the projection size)
baseMatrix <- orthBase(160000, 10)
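To make the idea of an orthonormal base more concrete, the sketch below builds a stand-in matrix with base R and checks its defining property. It is only an illustration: the QR-based construction is an assumption and is not necessarily how orthBase builds its matrix, and the sparse vector is invented data. The row count of 160,000 presumably corresponds to 20^4, the number of four-residue words over a 20-letter amino acid alphabet.

```r
set.seed(42)  # make the random base reproducible

# Stand-in for illustration only: orthonormalise a random Gaussian matrix
# with a QR decomposition. Not necessarily what orthBase() does internally.
n_rows <- 160000  # 20^4
n_cols <- 10
random_base <- qr.Q(qr(matrix(rnorm(n_rows * n_cols), nrow = n_rows)))

# For orthonormal columns, t(B) %*% B is (numerically close to) the identity.
gram <- crossprod(random_base)
max_deviation <- max(abs(gram - diag(n_cols)))

# Project a sparse high-dimensional vector (invented counts) onto the base.
high_dim <- numeric(n_rows)
high_dim[c(5, 1200, 159000)] <- c(2, 1, 3)
projected <- drop(high_dim %*% random_base)
length(projected)  # one coordinate per column of the base
```

The same `crossprod` check could be applied to `baseMatrix` to see how close the quasi-orthonormal matrix is to the identity.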
The exdna.fas dataset consists of a list of three strings that simulate DNA sequences; it is used for demonstration purposes only.
path <- system.file(package = "rSWeeP", "extdata", "exdna.fas")
The sWeeP method is then applied, and it returns a matrix that represents the compared sequences as vectors. The result can then be displayed graphically as a phylogenetic tree.
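# Project the sequences; the name avoids shadowing R's return()
sweepMatrix <- sWeeP(path, baseMatrix)
# Pairwise Euclidean distances between the projected sequences
distancia <- dist(sweepMatrix, method = "euclidean")
# Hierarchical clustering with Ward's method, drawn as a dendrogram
tree <- hclust(distancia, method = "ward.D")
plot(tree, hang = -1, cex = 1)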
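The distance and clustering steps work on any numeric matrix with one row per sequence, so they can be tried without the package. The sketch below uses an invented toy matrix (the row names and values are assumptions, not sWeeP output) in which two rows are nearly identical and the third is far from both.

```r
# Toy stand-in for a projected matrix: one row per sequence (invented values)
toy <- rbind(
  seqA = c(0.10, 0.20, 0.30),
  seqB = c(0.11, 0.19, 0.31),
  seqC = c(0.90, 0.80, 0.10)
)

toy_dist <- dist(toy, method = "euclidean")      # pairwise distances
toy_tree <- hclust(toy_dist, method = "ward.D")  # same linkage as above

# Cut the tree into two groups to read off which sequences cluster together
groups <- cutree(toy_tree, k = 2)
groups
```

Here seqA and seqB are merged first and end up in the same group, which is the behaviour expected for closely related sequences in the dendrogram.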
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rSWeeP_1.12.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] crayon_1.5.2 cli_3.6.1 knitr_1.42
## [4] magick_2.7.4 rlang_1.1.0 xfun_0.39
## [7] highr_0.10 jsonlite_1.8.4 S4Vectors_0.38.0
## [10] RCurl_1.98-1.12 Biostrings_2.68.0 htmltools_0.5.5
## [13] pracma_2.4.2 sass_0.4.5 stats4_4.3.0
## [16] rmarkdown_2.21 evaluate_0.20 jquerylib_0.1.4
## [19] bitops_1.0-7 fastmap_1.1.1 GenomeInfoDb_1.36.0
## [22] yaml_2.3.7 IRanges_2.34.0 bookdown_0.33
## [25] BiocManager_1.30.20 compiler_4.3.0 Rcpp_1.0.10
## [28] XVector_0.40.0 digest_0.6.31 R6_2.5.1
## [31] GenomeInfoDbData_1.2.10 magrittr_2.0.3 bslib_0.4.2
## [34] tools_4.3.0 zlibbioc_1.46.0 BiocGenerics_0.46.0
## [37] cachem_1.0.7