Maintainer: Ji-Ping Wang, <jzwang@northwestern.edu>
References for methods:
References for chemical map data used:
NuPoP V2 added a funciton predNuPoP_chem for prediction of nucleosomes using profiles trained based on chemical maps of nucleosomes for yeast, pombe, mouse and human (profiles for other species are extrapolated based on yeast profile). To show a clear improvement, the following figure compares the prediction based on MNase profile vs chemical profile for yeast where the red curve super-imposed is the occupancy from the chemical map for yeast data.
NuPoP is an R package for Nucleosome Positioning Prediction. NuPoP is built upon a duration hidden Markov model, in which the linker DNA length is explicitly modeled. The nucleosome or linker DNA state model can be chosen as either a fourth order or first order Markov chain. NuPoP outputs the Viterbi prediction of optimal nucleosome position map, nucleosome occupancy score (from backward and forward algorithms) and nucleosome affinity score.
In addition to the R package, we also developed a stand-alone Fortran program available at https://github.com/jipingw/NuPoP_Fortran. NuPoP R package and the Fortran program can predict nucleosome positioning for a DNA sequence of any length.
NuPoP does not depend on any other R packages. It has four major functions, predNuPoP, predNuPoP_chem, readNuPoP, and plotNuPoP. The predNuPoP function predicts the nucleosome positioning and nucleosome occupancy using MNase data trained nucleosome profiles while predNuPoP_chem predicts using profiles trained based on chemical maps. the readNuPoP function reads in the prediction results, and the function plotNuPoP visualizes the predictions.
The predNuPoP and predNuPoP_chem call a Fortran subroutine to process the DNA sequence and make predictions, and outputs the predictions into a text file into the current working directory. This method is based on a duration Hidden Markov model consisting of two states, one for the nucleosome and the other for the linker state. For each state, a first order Markov chain and a fourth order Markov chain models are built in. For example, a sample DNA sequence is test.seq
located in inst/extdata
subdirectory of the package. Call the predNuPoP
function as follows:
Note that the argument file
must be specified as the complete path and file name of the DNA sequence in FASTA format in any directory. For example, if the “test.seq” file is located in /Users/jon/DNA
, the function can be called by
or for chemical map prediction:
The user should not use file="~/DNA/test.seq"
to speficify the path to avoid error messages. The argument species
can be specified as follows: 1 = Human; 2 = Mouse; 3 = Rat; 4 = Zebrafish; 5 = D. melanogaster; 6 = C. elegans; 7 = S. cerevisiae; 8 = C. albicans; 9 = S. pombe; 10 = A. thaliana; 11 = Maize; 0 = other. If species = 0 is specified, the algorithm will identify a species from 1-11 that has most similar base composition to the input sequence, and use the models from the selected species for prediction. The default value is 7. The argument model
can be either 1 or 4, standing for the order of Markov chain models used for the nucleosome and linker states.
For predNuPoP_chem
, only species 1 (human), 2 (mouse), 7 (yeast), 9(S.pombe) are using actual profiles trained from chemical mapps of corresponding species. For other species, there are no chemical map data vailable. Extrapolated profiles based on yeast data with adjustment for base composition are used.
The output file, generated in the current working directory, will be named after the sequence file name, with an added extension as _Prediction1.txt
or _Prediction4.txt
. For the above codes, the output file will be test.seq_Prediction4.txt
. Note that predNuPoP_chem
and predNuPoP
do not distinguish the output names. The five columns in the output file are
Position
: position in the input DNA sequence.P-start
: probability that the current position is the start of a nucleosome.Occup
: nucleosome occupancy score. The nucleosome occupancy score is defined as the probability that the given position is covered by a nucleosome.N/L
: 1 indicates the given position is covered by nucleosome and 0 for linker linker based on Viterbi prediction.Affinity
: nucleosome binding affinity score. This affinity score is defined for every 147 bp of DNA sequence centered at the given position. Therefore for the first and last 73 bp of the DNA sequence, the affinity score is not defined (indicated as NA).The output file can be imported into R by readNuPoP
function:
results=readNuPoP(system.file("extdata", "test.seq_Prediction4.txt", package="NuPoP"),startPos=1,endPos=5000)
results[1:5,]
#> Position P.start Occup N/L Affinity
#> 1 1 0.000 0.000 0 NA
#> 2 2 0.412 0.412 1 NA
#> 3 3 0.075 0.487 1 NA
#> 4 4 0.005 0.492 1 NA
#> 5 5 0.005 0.497 1 NA
The genomic sequence can be extremely long. The user can import a part of the predictions by specifying the start position startPos
and the end position endPos
in the readNuPoP
function. For example, to visualize prediction results from startPos=1
to endPos=5000
,
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] NuPoP_2.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.27 R6_2.5.0 jsonlite_1.7.2 magrittr_2.0.1
#> [5] evaluate_0.14 highr_0.9 rlang_0.4.11 stringi_1.6.2
#> [9] jquerylib_0.1.4 bslib_0.2.5.1 rmarkdown_2.8 tools_4.1.0
#> [13] stringr_1.4.0 xfun_0.23 yaml_2.2.1 compiler_4.1.0
#> [17] htmltools_0.5.1.1 knitr_1.33 sass_0.4.0