library(idpr)
Intrinsically Disordered Proteins (IDPs) are a unique class of proteins that challenge the typical “structure leads to function” paradigm in biology. IDPs serve critical roles in various cellular processes while lacking a single, rigid structure under native conditions (1-4). This is not a rare phenomenon either: proteins experiencing intrinsic disorder make up roughly 1/4 to 1/2 of the eukaryotic proteome (5-7). In addition to typical cellular processes, these proteins have been implicated in several human diseases such as Parkinson’s Disease, Alzheimer’s Disease, and various cancers (8-10). Due to their abundance and relevance, interest in IDPs has been increasing over the last few decades (11). In this regard, dozens of computational tools have been developed to help predict intrinsic disorder from a protein sequence (12-14). See our peer-reviewed article published in PLOS ONE (https://doi.org/10.1371/journal.pone.0266929) to learn more on this topic!
These tools utilize known differences between disordered and ordered proteins. IDPs and Intrinsically Disordered Regions (IDRs) have distinct compositional profiles, evolutionary rates, and biochemical properties when compared to proteins or protein-regions with compact structure (15-18).
Almost all of the current tools make their predictions solely from the primary sequence of amino acids, represented as a character string of individual letters (12-14). Since their inception, the single-letter abbreviations of amino acids, designed by Margaret Dayhoff, have allowed for computational methods to easily handle biologically-relevant protein data as character strings (19, 20). For this reason, computer programs analyzing protein data are able to assign experimentally-determined and computationally-derived values to each letter within a character string for further analysis.
IDPs have decreased levels of secondary and tertiary structure (1), leaving the primary structure as the principal source of information on IDPs for computational studies. In fact, computational biology and bioinformatics have served a critical role in uncovering many of the properties and prevalence of IDPs (11).
Several R packages have been developed for protein analysis based on the amino acid sequence; a few examples are ‘seqinr’ (21), ‘Peptides’ (22), ‘ProteinDescriptors’ (23), and ‘SCORER2’ (24). However, to our knowledge, there is no package focused on the unique characteristics of IDPs and IDRs.
The abbreviation IDPR has been used to describe Intrinsically Disordered Protein Regions, or regions of proteins that experience intrinsic disorder. Borrowing this same acronym, the R package ‘idpr’ stands for a few things: “Intrinsically Disordered Proteins in R” and “IDp PRofiles”. The goal of this package is to match several distinct features of IDPs (or lack thereof) to a protein sequence of interest as well as integrate additional tools for IDP analysis in R. These characteristics include amino acid composition, charge, and hydropathy. In addition to these mentioned properties, several amino acid substitution matrices specific to IDPs (25-27) and a connection to the suite of disorder predictions by IUPred2A (28, 29), retrieved by connection to their REST API, have been included for additional IDP-based work. ‘idpr’ can analyze a protein of interest and return several basic features of IDPs, resulting in a summary that we are calling the ‘idprofile’. This will help a user begin to investigate the biochemical and physical properties of a protein within the context of intrinsic disorder.
From a computational standpoint, ‘idpr’ has a goal of generating visualizations with ease while balancing a workflow that allows for dynamic input and custom output. Therefore, most functions within ‘idpr’ have the ability to either return a data frame of results or a visualization of said data. This reduces the burden on less-experienced users and users seeking expedited results while allowing others to utilize the data in any manner that they so choose. ‘idpr’ also aims to easily integrate with other packages. For this reason, most functions will accept amino acid sequences as various structures: single-letter amino acids as a character string, single-letter amino acids as a vector of individual characters, or a character string specifying the path to a fasta file containing a sequence of interest. All forms are handled automatically without user specification, and fasta files will be loaded using the ‘Bioconductor’ package. Additionally, all visualizations generated by ‘idpr’ are made using the ‘ggplot2’ package (30). This is to allow further customization on returned graphics.
Overall, ‘idpr’ aims to integrate tools for the computational analysis of intrinsically disordered proteins within R. This package is used to identify known characteristics of IDPs within a sequence of interest with easily reported and dynamic results. Additionally, this package includes tools for IDP-based sequence analysis to be used in conjunction with other R packages. ‘idpr’ represents one of the first attempts, if not the first, at bringing IDP sequence-based analysis into R.
The package can be installed from Bioconductor with the following line of code. This requires the BiocManager package to be installed.
#BiocManager::install("idpr")
The most recent version of the package can be installed with the following line of code. This requires the devtools package to be installed.
#devtools::install_github("wmm27/idpr")
To quickly generate the profile for a protein of interest, a UniProt ID and a protein sequence are needed. The UniProt ID must be a character string; the sequence can be a character string of single-letter amino acid residues, a character string specifying the path to a .fasta file, or amino acid residues as a vector of individual letters.
P53_HUMAN <- TP53Sequences[2] #Getting a preloaded sequence from idpr
print(P53_HUMAN)
#> P04637|P53_HUMAN
#> "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD"
P53_ID <- "P04637" #Human TP53 UniProt ID
Then we can generate the profile with idprofile():
idprofile(sequence = P53_HUMAN,
uniprotAccession = P53_ID)
#> (a list of six plots, [[1]] through [[6]]; the rendered figures are not reproduced here)
idprofile() returns 5-6 plots. Detailed descriptions of each plot can be found in the following vignettes:
* Charge Hydropathy Vignette
† Structural Tendency Vignette
‡ IUPred Vignette; see Erdős, G., & Dosztányi, Z. (2020) (29).
A brief explanation of each plot is given below:
Uversky, Gillespie, & Fink (2000) showed that both high net charge and low mean hydropathy are properties of IDPs (15). One explanation is that a high net charge increases repulsion between residues, causing an extended structure, while low hydrophobicity reduces the hydrophobic interactions that drive protein packing. When both average net charge and mean scaled hydropathy are plotted, extended IDPs occupy a unique area on the plot. Therefore, this graphic can be used to distinguish proteins that are extended or compact under native conditions. However, it is important to note that IDPs can have the characteristics of a collapsed protein or an extended protein. Therefore, a protein within the “collapsed protein” field does not necessarily lack intrinsic disorder under native conditions (15, 31). The charge-hydropathy relationship was later applied to a method of predicting unfolded peptides using a sliding window of charge and hydropathy in FoldIndex (44). When FoldIndex scores are negative, a region is predicted as unfolded; when scores are positive, a region is predicted as folded.
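As a rough illustration of the quantities discussed above, the following base-R sketch computes the mean scaled hydropathy and a simplified mean net charge of a sequence, then combines them using the published FoldIndex formula, 2.785<H> - |<R>| - 1.151. It is not idpr's implementation: the helper names are hypothetical, and net charge is approximated by counting K/R as +1 and D/E as -1 rather than being derived from pKa values.

```r
# Kyte-Doolittle hydropathy values for the 20 standard residues
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
        Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
        L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
        S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Rescale so that Arg = 0 and Ile = 1
kd_scaled <- (kd + 4.5) / 9

# Hypothetical helper: split a single character string into residues
split_residues <- function(sequence) strsplit(sequence, "")[[1]]

# Mean scaled hydropathy <H> over the whole sequence
mean_scaled_hydropathy <- function(sequence) {
  mean(kd_scaled[split_residues(sequence)])
}

# Simplified mean net charge <R>: K/R = +1, D/E = -1
# (assumption for illustration; ignores pKa, His, and the termini)
mean_net_charge <- function(sequence) {
  aa <- split_residues(sequence)
  (sum(aa %in% c("K", "R")) - sum(aa %in% c("D", "E"))) / length(aa)
}

# FoldIndex score: negative = predicted unfolded, positive = predicted folded
fold_index <- function(sequence) {
  2.785 * mean_scaled_hydropathy(sequence) -
    abs(mean_net_charge(sequence)) - 1.151
}

fold_index("EEEEKKKKEEEE")  # charged and hydrophilic: negative score
fold_index("IIIILLLLVVVV")  # hydrophobic and uncharged: positive score
```

The two toy sequences are deliberately extreme; real proteins usually fall much closer to the boundary, which is why the caveat about collapsed IDPs matters.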
For further theory and details, please refer to idpr’s “Charge and Hydropathy Vignette” file.
The composition of amino acids and the overall chemistry of IDPs are distinctly different from those of ordered proteins. Each amino acid has a tendency to favor compact or extended secondary and tertiary structures based on the chemistry of the residue. Disorder-promoting residues, those enriched in IDPs, are typically hydrophilic, charged, or small. Order-promoting residues, those enriched in structured proteins, tend to be aliphatic, hydrophobic, aromatic, or form tertiary structures. Disorder-neutral residues favor neither ordered nor disordered structures (17).
Disorder-promoting residues are P, E, S, Q, K, A, and G; order-promoting residues are M, N, V, H, L, F, Y, I, W, and C; disorder‐neutral residues are D, T, and R (32).
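The grouping above can be applied residue by residue. The following base-R sketch labels each residue using the three lists given in the text and tabulates the composition; `classify_tendency()` is a hypothetical helper written for this example, not an idpr function.

```r
# Residue groups as listed above (32)
disorder_promoting <- c("P", "E", "S", "Q", "K", "A", "G")
order_promoting    <- c("M", "N", "V", "H", "L", "F", "Y", "I", "W", "C")
disorder_neutral   <- c("D", "T", "R")

# Hypothetical helper: label each residue of a single-string sequence.
# Letters outside the 20 listed residues are left as NA.
classify_tendency <- function(sequence) {
  aa <- strsplit(sequence, "")[[1]]
  tendency <- rep(NA_character_, length(aa))
  tendency[aa %in% disorder_promoting] <- "Disorder Promoting"
  tendency[aa %in% order_promoting]    <- "Order Promoting"
  tendency[aa %in% disorder_neutral]   <- "Disorder Neutral"
  data.frame(Position = seq_along(aa), AA = aa, Tendency = tendency,
             stringsAsFactors = FALSE)
}

tendency_df <- classify_tendency("MEEPQSDPSV")  # first ten residues of human p53

# Percentage of residues in each class
round(100 * prop.table(table(tendency_df$Tendency)), 1)
```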
For further theory and details, please refer to idpr’s “Structural Tendency Vignette” file.
As stated, IDPs are enriched in charged residues. Residues of similar charge tend to repel one another, which can prevent protein packing and promote an unstructured protein configuration under native conditions (15). There are many pKa data sets; we utilize the IPC pKa data set for calculations (33). Beyond its use in IDP predictions, local charge is an important biochemical measurement with many applications. Charges are calculated using a sliding window to help identify regions of extreme charge. The resulting figure is similar to ProtScale from ExPASy (34).
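To show why a pKa data set is needed at all, here is a minimal sketch of the standard Henderson-Hasselbalch relation for the fractional charge of a single ionizable group. The function name and the pKa values in the calls are illustrative assumptions; they are not taken from the IPC data set, and this is not necessarily how idpr structures its own calculation.

```r
# Fractional charge of one ionizable group from the Henderson-Hasselbalch relation.
# type = "acid": neutral when protonated, negative when deprotonated (e.g. Asp, Glu)
# type = "base": positive when protonated, neutral when deprotonated (e.g. Lys, Arg)
partial_charge <- function(pKa, pH = 7, type = c("acid", "base")) {
  type <- match.arg(type)
  if (type == "acid") {
    -1 / (1 + 10^(pKa - pH))
  } else {
    1 / (1 + 10^(pH - pKa))
  }
}

# Illustrative pKa values only (assumed for this example; not the IPC data set)
partial_charge(pKa = 4.0,  pH = 7, type = "acid")  # close to -1: almost fully deprotonated
partial_charge(pKa = 10.5, pH = 7, type = "base")  # close to +1: almost fully protonated
partial_charge(pKa = 7.0,  pH = 7, type = "base")  # 0.5: half protonated when pH equals pKa
```

Summing such per-residue charges within a window gives a local charge profile, and the result shifts with the pKa set chosen, which is why the choice of data set is stated explicitly.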
For further theory and details, please refer to idpr’s “Charge and Hydropathy Vignette” file.
As stated, hydrophobic residues are disfavored in IDPs (15). The hydrophobic effect is a significant driving force in protein packing as it leads to rigid structures (35). IDPs lack this driving force and have residues that preferentially interact with the solvent (17). This plot uses the Kyte and Doolittle measurement of hydropathy (36), scaled with Arg having a hydropathy of 0 and Ile having a hydropathy of 1. This was done to parallel the work in the Charge-Hydropathy plot which utilizes the same, normalized scale (15, 17). The resulting figure is similar to ProtScale from ExPASy (34). Scaled hydropathy is averaged locally along the protein using a sliding window to identify regions devoid of hydropathic characteristics.
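The scaling and local averaging described above can be sketched in base R as follows. The Kyte and Doolittle values are rescaled so that Arg is 0 and Ile is 1, then averaged with a centered sliding window. `sliding_mean()` and the window size of 9 are assumptions made for this example and may differ from idpr's defaults and edge handling.

```r
# Kyte-Doolittle hydropathy values for the 20 standard residues
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5,
        Q = -3.5, E = -3.5, G = -0.4, H = -3.2, I = 4.5,
        L = 3.8, K = -3.9, M = 1.9, F = 2.8, P = -1.6,
        S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)

# Rescale so that Arg = 0 and Ile = 1
kd_scaled <- (kd + 4.5) / 9

# Hypothetical helper: centered sliding-window mean with an odd window size.
# Positions too close to either end for a full window are returned as NA.
sliding_mean <- function(values, window = 9) {
  stopifnot(window %% 2 == 1, window <= length(values))
  as.numeric(stats::filter(values, rep(1 / window, window), sides = 2))
}

# First 40 residues of human p53, split into single letters
aa <- strsplit("MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAM", "")[[1]]

local_hydropathy <- sliding_mean(kd_scaled[aa], window = 9)
hydropathy_df <- data.frame(Position = seq_along(aa), AA = aa,
                            WindowHydropathy = round(local_hydropathy, 3))
head(hydropathy_df, 10)
```

A larger window smooths the profile and highlights long regions; a smaller one preserves local detail at the cost of noise.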
For further theory and details, please refer to idpr’s “Charge and Hydropathy Vignette” file.
IUPred2 analyzes an amino acid sequence and returns a score of intrinsic disorder based on a model of the estimated energy potential for residue interactions (28). The rationale is that structured proteins can form a network of favorable interactions, while IDPs lack abundant interactions. The reduced number of interactions leads to an IDP’s lack of secondary and tertiary structure (37). Predictions are made on a scale of 0-1, where any residues with a score over 0.5 are predicted to be in a disordered region, and any residues scoring below 0.5 are predicted to be ordered (28, 29, 37).
The IUPred graph shown uses the default setting, which is the prediction of long disordered regions. IUPred2A offers several predictors. An additional prediction of protein-protein interactions is made with the ANCHOR2 method, and a prediction of redox-sensitive disorder is made with IUPred2A Redox (28, 29, 37).
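The 0.5 cutoff described above turns a per-residue score into a per-residue call, and consecutive calls can be collapsed into regions. The sketch below uses a short, made-up score vector purely for illustration; it does not query IUPred2A and the numbers are not real predictions.

```r
# Made-up per-residue scores for illustration only (not real IUPred2 output)
scores <- c(0.82, 0.77, 0.64, 0.55, 0.41, 0.32, 0.28, 0.45, 0.58, 0.71)

# Scores above 0.5 are called disordered; a score of exactly 0.5 is
# treated as ordered here, which is an arbitrary choice for this sketch
predicted <- ifelse(scores > 0.5, "Disordered", "Ordered")

# Collapse consecutive residues with the same call into regions
runs <- rle(predicted)
regions <- data.frame(
  Start = cumsum(runs$lengths) - runs$lengths + 1,
  End = cumsum(runs$lengths),
  Prediction = runs$values,
  stringsAsFactors = FALSE
)
regions
```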
These plots can be generated independently, shown here, or with idprofile() using the iupredType argument.
iupredAnchor(P53_ID) #IUPred2 long + ANCHOR2 prediction of scaffolding
Redox-sensitive regions are shaded with a green background.
iupredRedox(P53_ID) #IUPred2 long with environmental context
For further theory, use, and details, please refer to idpr’s “IUPred Vignette” file.
As mentioned in the “Structural Tendency Plot” section above, there are specific amino acid residues that are enriched in unstructured protein regions and other amino acids that favor ordered protein regions (32). While the total composition is important to know, the location of these residues is also important to visualize, especially in proteins with both ordered and disordered domains.
Continuous values, like charge and disorder predictions, are easy to visualize along a protein sequence, while discrete values can be more challenging. ‘idpr’ contains a way to visualize discrete values of a protein sequence with the sequenceMap() function. The workflow shown here generates and visualizes a data frame of structural tendency for each residue in a sequence of interest, exemplifying one application for visualizing discrete values.
p53_tendency_DF <- structuralTendency(P53_HUMAN)
head(p53_tendency_DF) #see the first few rows of the generated data frame
#>   Position AA           Tendency
#> 1        1  M    Order Promoting
#> 2        2  E Disorder Promoting
#> 3        3  E Disorder Promoting
#> 4        4  P Disorder Promoting
#> 5        5  Q Disorder Promoting
#> 6        6  S Disorder Promoting
sequenceMap(sequence = P53_HUMAN,
property = p53_tendency_DF$Tendency,
customColors = c("#F0B5B3", "#A2CD5A", "#BF3EFF")) #generate the map
sequenceMap() also accepts continuous values. Additionally, custom plots that match the theme of plots generated by idprofile() and other idpr functions can be created using the sequencePlot() function within idpr.
In addition to having different biochemistry, IDPs tend to experience evolution differently from ordered proteins. IDPs and IDRs tend to evolve faster than ordered proteins or folded domains because the restraint is in maintaining charge/disorder rather than a particular structure or function (16, 38). Therefore, IDPs tend to accept different (and more) point mutations when compared to ordered proteins (26).
Currently, the most commonly used amino acid substitution matrices are PAM and BLOSUM (39, 40). These are integrated into many web-based tools like those on EMBOSS and NCBI-BLAST+ (41, 42). These tools do not allow for custom matrices. BLOSUM and PAM matrices can be loaded via the Biostrings package and can be used with alignment programs in R (43).
Something important to note when using these matrices is that both PAM and BLOSUM are derived from or favor structured proteins and therefore are not the most appropriate to use when analyzing IDPs (25-27). Trivedi & Nagarajaram (2019) provide comparisons between commonly used substitution matrices and matrices developed for IDPs. They additionally developed a matrix that is better at identifying homologs of IDPs (27).
Three groups of IDP-derived substitution matrices are incorporated into the idpr package for use in alignments done with R. These are the matrices from Trivedi & Nagarajaram (2019) (27), Brown et al. (2009) (26), and Radivojac et al. (2001) (25).
R Version
R.version.string
#> [1] "R version 4.3.1 (2023-06-16)"
System Information
as.data.frame(Sys.info())
#> Sys.info()
#> sysname Linux
#> release 5.15.0-87-generic
#> version #97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023
#> nodename nebbiolo2
#> machine x86_64
#> login biocbuild
#> user biocbuild
#> effective_user biocbuild
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] msa_1.34.0 Biostrings_2.70.0 GenomeInfoDb_1.38.0
#> [4] XVector_0.42.0 IRanges_2.36.0 S4Vectors_0.40.0
#> [7] BiocGenerics_0.48.0 idpr_1.12.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.7 utf8_1.2.4 generics_0.1.3
#> [4] bitops_1.0-7 lattice_0.22-5 digest_0.6.33
#> [7] magrittr_2.0.3 evaluate_0.22 grid_4.3.1
#> [10] fastmap_1.1.1 seqinr_4.2-30 plyr_1.8.9
#> [13] jsonlite_1.8.7 ape_5.7-1 fansi_1.0.5
#> [16] scales_1.2.1 jquerylib_0.1.4 ade4_1.7-22
#> [19] cli_3.6.1 rlang_1.1.1 crayon_1.5.2
#> [22] munsell_0.5.0 withr_2.5.1 cachem_1.0.8
#> [25] yaml_2.3.7 parallel_4.3.1 tools_4.3.1
#> [28] dplyr_1.1.3 colorspace_2.1-0 ggplot2_3.4.4
#> [31] GenomeInfoDbData_1.2.11 vctrs_0.6.4 R6_2.5.1
#> [34] lifecycle_1.0.3 zlibbioc_1.48.0 MASS_7.3-60
#> [37] pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.1
#> [40] gtable_0.3.4 glue_1.6.2 Rcpp_1.0.11
#> [43] xfun_0.40 tibble_3.2.1 tidyselect_1.2.0
#> [46] knitr_1.44 farver_2.1.1 nlme_3.1-163
#> [49] htmltools_0.5.6.1 rmarkdown_2.25 labeling_0.4.3
#> [52] compiler_4.3.1 RCurl_1.98-1.12
citation()
To cite R in publications use:
R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
A BibTeX entry for LaTeX users is
@Manual{,
  title = {R: A Language and Environment for Statistical Computing},
  author = {{R Core Team}},
  organization = {R Foundation for Statistical Computing},
  address = {Vienna, Austria},
  year = {2023},
  url = {https://www.R-project.org/},
}
We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also ‘citation(“pkgname”)’ for citing R packages.