idpr Package Overview Vignette

William McFadden

2020-12-23

library(idpr)

idpr: A Package for Profiling and Analyzing Intrinsically Disordered Proteins in R.

Intrinsically Disordered Proteins (IDPs) are a unique class of proteins that challenge the typical “structure leads to function” paradigm in biology. IDPs serve critical roles in various cellular processes while lacking a single, rigid structure under native conditions (1-4). This is not a rare phenomenon either: proteins experiencing intrinsic disorder make up an estimated 1/4 to 1/2 of the eukaryotic proteome (5-7). In addition to typical cellular processes, these proteins have been implicated in several human diseases such as Parkinson’s Disease, Alzheimer’s Disease, and various cancers (8-10). Due to their abundance and relevance, interest in IDPs has been increasing over the last few decades (11). As a result, dozens of computational tools have been developed to help predict intrinsic disorder from a protein sequence (12-14).

These tools utilize known differences between disordered and ordered proteins. IDPs and Intrinsically Disordered Regions (IDRs) have distinct compositional profiles, evolutionary rates, and biochemical properties when compared to proteins or protein-regions with compact structure (15-18).

Almost all of the current tools make their predictions solely from the primary sequence of amino acids, represented as a character string of individual letters (12-14). Since their inception, the single-letter abbreviations of amino acids, designed by Margaret Dayhoff, have allowed for computational methods to easily handle biologically-relevant protein data as character strings (19, 20). For this reason, computer programs analyzing protein data are able to assign experimentally-determined and computationally-derived values to each letter within a character string for further analysis.

IDPs have decreased levels of secondary and tertiary structure (1), leaving the primary structure as the principal source of information on IDPs for computational studies. In fact, computational biology and bioinformatics have served a critical role in uncovering many of the properties and prevalence of IDPs (11).

Several R packages have been developed for protein analysis based on the amino acid sequence; a few examples are ‘seqinr’ (21), ‘Peptides’ (22), ‘ProteinDescriptors’ (23), and ‘SCORER2’ (24). However, to our knowledge, there is no package focused on the unique characteristics of IDPs and IDRs.

The abbreviation IDPR has been used to describe Intrinsically Disordered Protein Regions, or regions of proteins that experience intrinsic disorder. Borrowing this same acronym, the R package ‘idpr’ stands for a few things: “Intrinsically Disordered Proteins in R” and “IDp PRofiles”. The goal of this package is to match several distinct features of IDPs (or lack thereof) to a protein sequence of interest as well as integrate additional tools for IDP analysis in R. These characteristics include amino acid composition, charge, and hydropathy. In addition to these mentioned properties, several amino acid substitution matrices specific to IDPs (25-27) and a connection to the suite of disorder predictions by IUPred2A (28, 29), retrieved by connection to their REST API, have been included for additional IDP-based work. ‘idpr’ can analyze a protein of interest and return several basic features of IDPs, resulting in a summary that we are calling the ‘idprofile’. This will help a user begin to investigate the biochemical and physical properties of a protein within the context of intrinsic disorder.

From a computational standpoint, ‘idpr’ has a goal of generating visualizations with ease while balancing a workflow that allows for dynamic input and custom output. Therefore, most functions within ‘idpr’ have the ability to either return a data frame of results or a visualization of said data. This reduces the burden on less-experienced users and users seeking expedited results while allowing others to utilize the data in any manner that they so choose. ‘idpr’ also aims to easily integrate with other packages. For this reason, most functions will accept amino acid sequences as various structures: single-letter amino acids as a character string, single-letter amino acids as a vector of individual characters, or a character string specifying the path to a fasta file containing a sequence of interest. All forms are handled automatically without user specification, and fasta files will be loaded using the ‘Bioconductor’ package. Additionally, all visualizations generated by ‘idpr’ are made using the ‘ggplot2’ package (30). This is to allow further customization on returned graphics.
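As a rough illustration of the three input forms described above, the base R sketch below normalizes each form to a vector of single-letter residues. The helper `as_residue_vector()` is hypothetical and is not part of ‘idpr’; it only shows the idea of accepting a string, a character vector, or a path to a fasta file, and it makes no claim about how ‘idpr’ handles these internally.

```r
# Three ways to hold the same sequence (first ten residues of P53_HUMAN).
seq_string <- "MEEPQSDPSV"                    # one character string
seq_vector <- strsplit(seq_string, "")[[1]]   # vector of individual letters

# A small fasta file written to a temporary location for this example only.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">example|first ten residues of P53_HUMAN", "MEEPQ", "SDPSV"),
           fasta_path)

# Hypothetical helper (an assumption, not an idpr function):
# returns a character vector with one upper-case letter per residue.
as_residue_vector <- function(x) {
  if (length(x) == 1 && file.exists(x)) {
    lines <- readLines(x)
    # drop header lines and join the wrapped sequence lines
    x <- paste(lines[!startsWith(lines, ">")], collapse = "")
  }
  if (length(x) == 1) {
    x <- strsplit(x, "")[[1]]
  }
  toupper(x)
}

# All three forms reduce to the same residue vector.
identical(as_residue_vector(seq_string), as_residue_vector(seq_vector))
identical(as_residue_vector(seq_string), as_residue_vector(fasta_path))
```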

Overall, ‘idpr’ aims to integrate tools for the computational analysis of intrinsically disordered proteins within R. The package identifies known characteristics of IDPs within a sequence of interest and reports the results in an accessible, flexible form. Additionally, this package includes tools for IDP-based sequence analysis to be used in conjunction with other R packages. ‘idpr’ represents one of the first attempts, if not the first, at bringing IDP sequence-based analysis into R.

Installation

The package can be installed from Bioconductor with the following line of code. This requires the BiocManager package to be installed.

#BiocManager::install("idpr") 

The most recent version of the package can be installed with the following line of code. This requires the devtools package to be installed.

#devtools::install_github("wmm27/idpr")

Profiling

To quickly generate the profile for a protein of interest, a UniProt ID and a protein sequence are needed. The UniProt ID must be a character string; the sequence can be a character string of single-letter amino acid residues, a character string specifying the path to a .fasta file, or amino acid residues as a vector of individual letters.

P53_HUMAN <- TP53Sequences[2] #Getting a preloaded sequence from idpr
print(P53_HUMAN)
#>                                                                                                                                                                                                                                                                                                                                                                                            P04637|P53_HUMAN 
#> "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD"

P53_ID <- "P04637" #Human TP53 UniProt ID

Then we can generate the profile with idprofile():

idprofile(sequence = P53_HUMAN,
          uniprotAccession = P53_ID)
#> [[1]] to [[5]]: a list of five plots is printed here (figures not reproduced in this text)

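idprofile() returns 4-5 plots: a charge-hydropathy plot, a structural tendency plot, a local charge plot, a local hydropathy plot, and an IUPred plot.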

Detailed descriptions of each plot can be found in specific vignettes:

* Charge Hydropathy Vignette
* Structural Tendency Vignette
* IUPred Vignette; see Erdős, G., & Dosztányi, Z. (2020) (29).

A brief explanation of each plot is given below:

Charge-Hydropathy Plot

Uversky, Gillespie, & Fink (2000) showed that both high net charge and low mean hydropathy are properties of IDPs (15). One explanation is that a high net charge increases repulsion between residues, favoring an extended structure, while low hydrophobicity weakens the hydrophobic interactions that drive protein packing. When both average net charge and mean scaled hydropathy are plotted, extended IDPs occupy a unique area on the plot. Therefore, this graphic can be used to distinguish proteins that are extended or compact under native conditions. However, it is important to note that IDPs can have the characteristics of either a collapsed protein or an extended protein. Therefore, a protein within the “collapsed protein” field does not necessarily lack intrinsic disorder under native conditions (15, 31).
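The two axes of this plot can be sketched in base R as follows. This is a simplified illustration rather than the calculation performed by ‘idpr’: the function `charge_hydropathy()` is hypothetical, charge is approximated by counting K and R as +1 and D and E as -1 (no pKa values, no termini), and the boundary line, mean net charge = 2.785 × mean scaled hydropathy - 1.151, is the one commonly quoted from (15).

```r
# Kyte-Doolittle hydropathy values (36), rescaled so that Arg = 0 and Ile = 1.
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5, Q = -3.5, E = -3.5,
        G = -0.4, H = -3.2, I = 4.5, L = 3.8, K = -3.9, M = 1.9, F = 2.8,
        P = -1.6, S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)
scaled_kd <- (kd + 4.5) / 9

# Hypothetical helper (not an idpr function).
charge_hydropathy <- function(sequence) {
  residues <- strsplit(toupper(sequence), "")[[1]]
  stopifnot(all(residues %in% names(scaled_kd)))   # standard residues only

  mean_hydropathy <- mean(scaled_kd[residues])

  # Assumption: integer charges instead of pKa-based charges.
  net_charge <- sum(residues %in% c("K", "R")) - sum(residues %in% c("D", "E"))
  mean_net_charge <- abs(net_charge) / length(residues)

  # Boundary between the extended and collapsed fields, as quoted from (15).
  boundary_charge <- 2.785 * mean_hydropathy - 1.151

  list(mean_scaled_hydropathy = mean_hydropathy,
       mean_net_charge = mean_net_charge,
       extended = mean_net_charge > boundary_charge)
}

str(charge_hydropathy("MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAP"))
```

A sequence is labeled `extended` when its mean net charge lies above the boundary for its mean scaled hydropathy; as noted above, a sequence falling on the other side is not necessarily ordered.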

For further theory and details, please refer to idpr’s “Charge and Hydropathy Vignette” file.

Structural Tendency Plot

The composition of amino acids and the overall chemistry of IDPs are distinctly different from those of ordered proteins. Each amino acid tends to favor compact or extended secondary and tertiary structures based on the chemistry of the residue. Disorder-promoting residues, those enriched in IDPs, are typically hydrophilic, charged, or small. Order-promoting residues, those enriched in structured proteins, tend to be aliphatic, hydrophobic, aromatic, or form tertiary structures. Disorder-neutral residues favor neither ordered nor disordered structures (17).

Disorder-promoting residues are P, E, S, Q, K, A, and G; order-promoting residues are M, N, V, H, L, F, Y, I, W, and C; disorder-neutral residues are D, T, and R (32).
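The grouping above can be applied to a sequence with a few lines of base R. The sketch below uses a hypothetical helper, `classify_tendency()`, which is not the ‘idpr’ implementation; it only maps each residue to one of the three groups listed in the text.

```r
# Residue groups exactly as listed in the text (32).
disorder_promoting <- c("P", "E", "S", "Q", "K", "A", "G")
order_promoting    <- c("M", "N", "V", "H", "L", "F", "Y", "I", "W", "C")
disorder_neutral   <- c("D", "T", "R")

# Hypothetical helper (not an idpr function): one row per residue.
classify_tendency <- function(sequence) {
  residues <- strsplit(toupper(sequence), "")[[1]]
  tendency <- rep(NA_character_, length(residues))  # NA for unlisted letters
  tendency[residues %in% disorder_promoting] <- "Disorder Promoting"
  tendency[residues %in% order_promoting]    <- "Order Promoting"
  tendency[residues %in% disorder_neutral]   <- "Disorder Neutral"
  data.frame(Position = seq_along(residues),
             AA = residues,
             Tendency = tendency,
             stringsAsFactors = FALSE)
}

# First ten residues of the P53 sequence used in this vignette.
tendency_df <- classify_tendency("MEEPQSDPSV")
print(tendency_df)
table(tendency_df$Tendency)   # composition by group
```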

For further theory and details, please refer to idpr’s “Structural Tendency Vignette” file.

Local Charge Calculations

As stated, IDPs are enriched in charged residues. Residues of similar charge tend to repel one another, which can prevent protein packing and promote an unstructured protein configuration under native conditions (15). Many pKa data sets exist; we use the IPC pKa data set for calculations (33). Beyond its use in IDP prediction, local charge is an important biochemical measurement with many applications. Charges are calculated using a sliding window to help identify regions of extreme charge. The resulting figure is similar to ProtScale from ExPASy (34).
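To show how a pKa value translates into a charge, the sketch below applies the Henderson-Hasselbalch relationship to individual side chains. It is a minimal illustration under stated assumptions: the pKa values are rounded textbook approximations and are not the IPC values used by ‘idpr’, only D, E, K, and R are treated as ionizable, termini are ignored, and `residue_charge()` is a hypothetical helper.

```r
# Illustrative pKa values (assumption: approximate textbook numbers,
# NOT the IPC data set referenced in the text).
pka_acidic <- c(D = 3.9, E = 4.1)     # side chains that lose a proton
pka_basic  <- c(K = 10.5, R = 12.5)   # side chains that gain a proton

# Hypothetical helper: fractional charge of one residue at a given pH.
residue_charge <- function(residue, pH = 7) {
  if (residue %in% names(pka_acidic)) {
    -1 / (1 + 10^(pka_acidic[[residue]] - pH))
  } else if (residue %in% names(pka_basic)) {
    1 / (1 + 10^(pH - pka_basic[[residue]]))
  } else {
    0   # all other residues treated as uncharged in this sketch
  }
}

residues <- strsplit("MEEPQSDPSV", "")[[1]]
charges <- vapply(residues, residue_charge, numeric(1), pH = 7)
round(charges, 3)
sum(charges)   # approximate net side-chain charge of the fragment
```

Averaging such per-residue charges over a window of neighboring positions gives the local charge described in the text.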

For further theory and details, please refer to idpr’s “Charge and Hydropathy Vignette” file.

Local Hydropathy

As stated, hydrophobic residues are disfavored in IDPs (15). The hydrophobic effect is a significant driving force in protein packing as it leads to rigid structures (35). IDPs lack this driving force and have residues that preferentially interact with the solvent (17). This plot uses the Kyte and Doolittle measurement of hydropathy (36), scaled with Arg having a hydropathy of 0 and Ile having a hydropathy of 1. This was done to parallel the work in the Charge-Hydropathy plot, which utilizes the same normalized scale (15, 17). Scaled hydropathy is averaged locally along the protein using a sliding window to identify regions of low hydropathy. The resulting figure is similar to ProtScale from ExPASy (34).
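The sliding-window average can be sketched in base R as below. This is an illustration only: `local_hydropathy()` is a hypothetical helper, the window size of 5 is an arbitrary choice for a short example, and positions too close to either end to hold a full window are simply omitted, which may differ from how ‘idpr’ treats the termini.

```r
# Kyte-Doolittle hydropathy values (36), rescaled so that Arg = 0 and Ile = 1.
kd <- c(A = 1.8, R = -4.5, N = -3.5, D = -3.5, C = 2.5, Q = -3.5, E = -3.5,
        G = -0.4, H = -3.2, I = 4.5, L = 3.8, K = -3.9, M = 1.9, F = 2.8,
        P = -1.6, S = -0.8, T = -0.7, W = -0.9, Y = -1.3, V = 4.2)
scaled_kd <- (kd + 4.5) / 9

# Hypothetical helper (not an idpr function): centered sliding-window mean.
local_hydropathy <- function(sequence, window = 5) {
  stopifnot(window %% 2 == 1)                 # odd window has a center residue
  residues <- strsplit(toupper(sequence), "")[[1]]
  stopifnot(length(residues) >= window,
            all(residues %in% names(scaled_kd)))

  values  <- unname(scaled_kd[residues])
  half    <- (window - 1) / 2
  centers <- (half + 1):(length(residues) - half)

  window_mean <- vapply(centers,
                        function(i) mean(values[(i - half):(i + half)]),
                        numeric(1))

  data.frame(Position = centers,
             CenterResidue = residues[centers],
             WindowHydropathy = window_mean,
             stringsAsFactors = FALSE)
}

local_hydropathy("MEEPQSDPSVEPPLSQETFSDLWKLLPENNVL", window = 5)
```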

For further theory and details, please refer to idpr’s “Charge and Hydropathy Vignette” file.

IUPred

IUPred2 analyzes an amino acid sequence and returns a score of intrinsic disorder based on a model of the estimated energy potential for residue interactions (28). The rationale is that structured proteins are able to form a network of interactions, while IDPs lack abundant interactions. The reduced number of interactions leads to an IDP’s lack of secondary and tertiary structure (37). Predictions are made on a scale of 0-1, where any residues with a score over 0.5 are predicted to be in a disordered region, and any residues scoring below 0.5 are predicted to be ordered (28, 29, 37).
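The 0.5 cutoff can be applied to a vector of per-residue scores as in the base R sketch below. The scores are made-up numbers for illustration, not IUPred2 output, and the handling of a score of exactly 0.5 (treated here as not disordered) is an assumption, since the text only describes scores over and below 0.5.

```r
# Hypothetical per-residue disorder scores (NOT real IUPred2 predictions).
scores <- c(0.82, 0.77, 0.64, 0.41, 0.30, 0.35, 0.58, 0.91)

# Residues scoring over 0.5 are called disordered.
is_disordered <- scores > 0.5

# Collapse consecutive residues with the same call into regions.
runs   <- rle(is_disordered)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

regions <- data.frame(Start = starts,
                      End = ends,
                      State = ifelse(runs$values, "Disordered", "Ordered"),
                      stringsAsFactors = FALSE)
print(regions)
```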

The IUPred graph shown uses the default setting, which is the prediction of long disordered regions. IUPred2A offers several predictors. An additional prediction of protein-protein interactions is done with the ANCHOR2 method, and a prediction of redox-sensitive disorder is done with IUPred2A Redox (28, 29, 37).

These plots can be generated independently, as shown here, or with idprofile() using the iupredType argument.

iupredAnchor(P53_ID) #IUPred2 long + ANCHOR2 prediction of scaffolding

Redox-sensitive regions are shaded with a green background.

iupredRedox(P53_ID) #IUPred2 long with environmental context

For further theory, use, and details, please refer to idpr’s “IUPred Vignette” file.


Visualizing Discrete Values

As mentioned in the “Structural Tendency Plot” section above, there are specific amino acid residues that are enriched in unstructured protein regions and other amino acids that favor ordered protein regions (32). While the total composition is important to know, the location of these residues is also important to visualize, especially in proteins with both ordered and disordered domains.

Continuous values, like charge and disorder predictions, are easy to visualize along a protein sequence, while discrete values can be more challenging. ‘idpr’ contains a way to visualize discrete values of a protein sequence with the sequenceMap() function. The workflow shown here generates and visualizes a data frame of structural tendency for each residue in a sequence of interest, exemplifying one application for visualizing discrete values.

p53_tendency_DF <- structuralTendency(P53_HUMAN)
head(p53_tendency_DF) #see the first few rows of the generated data frame
#>   Position AA           Tendency
#> 1        1  M    Order Promoting
#> 2        2  E Disorder Promoting
#> 3        3  E Disorder Promoting
#> 4        4  P Disorder Promoting
#> 5        5  Q Disorder Promoting
#> 6        6  S Disorder Promoting

sequenceMap(sequence = P53_HUMAN,
            property = p53_tendency_DF$Tendency,
            customColors = c("#F0B5B3", "#A2CD5A", "#BF3EFF"))  #generate the map

sequenceMap() accepts continuous values as well. Additionally, custom plots that match the theme of plots generated by idprofile() and other idpr functions can be created using the sequencePlot() function within idpr.


Substitution Matrices for Analyzing IDPs

In addition to having different biochemistry, IDPs tend to experience evolution differently from ordered proteins. IDPs and IDRs tend to evolve faster than ordered proteins or folded domains because the constraint lies in maintaining charge/disorder rather than a particular structure or function (16, 38). Therefore, IDPs tend to accept different (and more) point mutations when compared to ordered proteins (26).

Currently, the most commonly used amino acid substitution matrices are PAM and BLOSUM (39, 40). These are integrated into many web-based tools like those on EMBOSS and NCBI-BLAST+ (41, 42). These web-based tools do not allow for custom matrices. BLOSUM and PAM matrices can be loaded via the Biostrings package and can be used with alignment programs in R (43).

Something important to note when using these matrices is that both PAM and BLOSUM are derived from or favor structured proteins and therefore are not the most appropriate to use when analyzing IDPs (25-27). Trivedi & Nagarajaram (2019) provide comparisons between commonly used substitution matrices and matrices developed for IDPs. They additionally developed a matrix that is better at identifying homologs of IDPs (27).

Three groups of IDP-derived substitution matrices are incorporated into the idpr package for use in alignments done with R. These are the matrices from Trivedi & Nagarajaram (2019) (27), Brown et al. (2009) (26), and Radivojac et al. (2001) (25).
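To make concrete how a substitution matrix is used once an alignment exists, the base R sketch below scores an ungapped alignment by summing the matrix entry for each aligned pair. The three-letter matrix contains made-up values chosen for illustration; it is not taken from any of the published matrices above, and `score_alignment()` is a hypothetical helper rather than an ‘idpr’ or Biostrings function.

```r
# Toy substitution matrix over three residues (hypothetical values,
# NOT from PAM, BLOSUM, or any IDP-derived matrix).
aa <- c("A", "E", "K")
toy_matrix <- matrix(c( 4, -1, -1,
                       -1,  5,  1,
                       -1,  1,  5),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(aa, aa))

# Hypothetical helper: score two equal-length, already aligned, gap-free
# sequences by summing the matrix entry of every aligned residue pair.
score_alignment <- function(seq1, seq2, mat) {
  a <- strsplit(seq1, "")[[1]]
  b <- strsplit(seq2, "")[[1]]
  stopifnot(length(a) == length(b),
            all(a %in% rownames(mat)),
            all(b %in% colnames(mat)))
  sum(mat[cbind(a, b)])   # look up each (row, column) pair by name
}

score_alignment("AEK", "AKK", toy_matrix)
score_alignment("AEK", "AEK", toy_matrix)
```

Swapping in a different matrix changes which substitutions are rewarded or penalized, which is why a matrix derived from disordered proteins can rank candidate homologs differently from one derived from structured proteins.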

References

  1. Dunker AK, Lawson JD, Brown CJ, Williams RM, Romero P, Oh JS, et al. Intrinsically disordered protein. Journal of Molecular Graphics and Modelling. 2001;19(1):26-59.
  2. Tompa P. Intrinsically unstructured proteins. Trends in biochemical sciences. 2002;27(10):527-33.
  3. Uversky VN. Intrinsically disordered proteins from A to Z. The International Journal of Biochemistry & Cell Biology. 2011;43(8):1090-103.
  4. Wright PE, Dyson HJ. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. Journal of Molecular Biology. 1999;293(2):321-31.
  5. Pancsa R, Tompa P. Structural disorder in eukaryotes. PLoS One. 2012;7(4):e34687.
  6. Xue B, Dunker AK, Uversky VN. Orderly order in protein intrinsic disorder distribution: disorder in 3500 proteomes from viruses and the three domains of life. Journal of Biomolecular Structure and Dynamics. 2012;30(2):137-49.
  7. Kulkarni P, Uversky VN. Intrinsically Disordered Proteins: The Dark Horse of the Dark Proteome. PROTEOMICS. 2018;18(21-22):1800061.
  8. Uversky VN, Oldfield CJ, Midic U, Xie H, Xue B, Vucetic S, et al. Unfoldomics of human diseases: linking protein intrinsic disorder with diseases. BMC Genomics. 2009;10(1):S7.
  9. Uversky VN, Oldfield CJ, Dunker AK. Intrinsically disordered proteins in human diseases: introducing the D2 concept. Annu Rev Biophys. 2008;37:215-46.
  10. Radivojac P, Iakoucheva LM, Oldfield CJ, Obradovic Z, Uversky VN, Dunker AK. Intrinsic Disorder and Functional Proteomics. Biophysical Journal. 2007;92(5):1439-56.
  11. Kurgan L, Radivojac P, Sussman JL, Dunker AK. On the Importance of Computational Biology and Bioinformatics to the Origins and Rapid Progression of the Intrinsically Disordered Proteins Field. Biocomputing 2020: WORLD SCIENTIFIC; 2019. p. 149-58.
  12. Li J, Feng Y, Wang X, Li J, Liu W, Rong L, et al. An Overview of Predictors for Intrinsically Disordered Proteins over 2010-2014. Int J Mol Sci. 2015;16(10):23446-62.
  13. Meng F, Uversky V, Kurgan L. Computational prediction of intrinsic disorder in proteins. Current protocols in protein science. 2017;88(1):2.16.1-2.16.14.
  14. Liu Y, Wang X, Liu B. A comprehensive review and comparison of existing computational methods for intrinsically disordered protein and region prediction. Briefings in Bioinformatics. 2017;20(1):330-46.
  15. Uversky VN, Gillespie JR, Fink AL. Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins: Structure, Function, and Bioinformatics. 2000;41(3):415-27.
  16. Brown CJ, Takayama S, Campen AM, Vise P, Marshall TW, Oldfield CJ, et al. Evolutionary rate heterogeneity in proteins with long disordered regions. Journal of molecular evolution. 2002;55(1):104.
  17. Uversky VN. Intrinsically Disordered Proteins and Their “Mysterious” (Meta)Physics. Frontiers in Physics. 2019;7(10).
  18. Forcelloni S, Giansanti A. Evolutionary Forces and Codon Bias in Different Flavors of Intrinsic Disorder in the Human Proteome. Journal of Molecular Evolution. 2020;88(2):164-78.
  19. IUPAC-IUB Commission on Biochemical Nomenclature. A one-letter notation for amino acid sequences. Tentative rules. Biochemistry. 1968;7(8):2703-5.
  20. Dayhoff MO. Atlas of protein sequence and structure: National Biomedical Research Foundation.; 1972.
  21. Charif D, Lobry JR. SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. Structural approaches to sequence evolution: Springer; 2007. p. 207-32.
  22. Osorio D, Rondón-Villarreal P, Torres R. Peptides: a package for data mining of antimicrobial peptides. R Journal. 2015;7(1).
  23. Kurar B, Sneineh SA, Ashhab Y. ProteinDescriptors: Generates Various Protein Descriptors for Machine Learning Algorithms. 0.1.0 ed. 2016.
  24. Armstrong CT, Vincent TL, Green PJ, Woolfson DN. SCORER 2.0: an algorithm for distinguishing parallel dimeric and trimeric coiled-coil sequences. Bioinformatics. 2011;27(14):1908-14.
  25. Radivojac P, Obradovic Z, Brown CJ, Dunker AK. Improving sequence alignments for intrinsically disordered proteins. Biocomputing 2002: World Scientific; 2001. p. 589-600.
  26. Brown CJ, Johnson AK, Daughdrill GW. Comparing Models of Evolution for Ordered and Disordered Proteins. Molecular Biology and Evolution. 2009;27(3):609-21.
  27. Trivedi R, Nagarajaram HA. Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Scientific Reports. 2019;9(1):16380.
  28. Mészáros B, Erdős G, Dosztányi Z. IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research. 2018;46(W1):W329-W37.
  29. Erdős G, Dosztányi Z. Analyzing Protein Disorder with IUPred2A. Current Protocols in Bioinformatics. 2020;70(1):e99.
  30. Wickham H. ggplot2: elegant graphics for data analysis: Springer; 2016.
  31. Uversky VN. Paradoxes and wonders of intrinsic disorder: Complexity of simplicity. Intrinsically Disordered Proteins. 2016;4(1):e1135015.
  32. Uversky VN. Unusual biophysics of intrinsically disordered proteins. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics. 2013;1834(5):932-51.
  33. Kozlowski LP. IPC – Isoelectric Point Calculator. Biology Direct. 2016;11(1):55.
  34. Gasteiger E, Hoogland C, Gattiker A, Wilkins MR, Appel RD, Bairoch A. Protein identification and analysis tools on the ExPASy server. The proteomics protocols handbook: Springer; 2005. p. 571-607.
  35. Durell SR, Ben-Naim A. Hydrophobic-hydrophilic forces in protein folding. Biopolymers. 2017;107(8):10.1002/bip.23020.
  36. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. Journal of molecular biology. 1982;157(1):105-32.
  37. Dosztányi Z. Prediction of protein disorder based on IUPred. Protein Sci. 2018;27(1):331-40.
  38. Franzosa EA, Xia Y. Structural Determinants of Protein Evolution Are Context-Sensitive at the Residue Level. Molecular Biology and Evolution. 2009;26(10):2387-95.
  39. Dayhoff M, Schwartz R, Orcutt B. A model of evolutionary change in proteins. Atlas of protein sequence and structure. 5: National Biomedical Research Foundation Silver Spring MD; 1978. p. 345-52.
  40. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89(22):10915-9.
  41. Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL. NCBI BLAST: a better web interface. Nucleic Acids Research. 2008;36(suppl_2):W5-W9.
  42. Madeira F, Park YM, Lee J, Buso N, Gur T, Madhusoodanan N, et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic acids research. 2019;47(W1):W636-W41.
  43. Pagès H, Aboyoun P, Gentleman R, DebRoy S. Biostrings: Efficient manipulation of biological strings. R package version. 2020;2(0).

Additional Information

R Version

R.version.string
#> [1] "R version 4.0.3 (2020-10-10)"

System Information

as.data.frame(Sys.info())
#>                                                  Sys.info()
#> sysname                                               Linux
#> release                                   4.4.0-145-generic
#> version        #171-Ubuntu SMP Tue Mar 26 12:43:40 UTC 2019
#> nodename                                            malbec1
#> machine                                              x86_64
#> login                                             biocbuild
#> user                                              biocbuild
#> effective_user                                    biocbuild
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#> [1] msa_1.22.0          Biostrings_2.58.0   XVector_0.30.0     
#> [4] IRanges_2.24.1      S4Vectors_0.28.1    BiocGenerics_0.36.0
#> [7] idpr_1.0.007       
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.5        pillar_1.4.7      compiler_4.0.3    plyr_1.8.6       
#>  [5] prettyunits_1.1.1 progress_1.2.2    zlibbioc_1.36.0   tools_4.0.3      
#>  [9] digest_0.6.27     jsonlite_1.7.2    nlme_3.1-151      lattice_0.20-41  
#> [13] evaluate_0.14     lifecycle_0.2.0   tibble_3.0.4      gtable_0.3.0     
#> [17] pkgconfig_2.0.3   rlang_0.4.9       curl_4.3          yaml_2.2.1       
#> [21] xfun_0.19         stringr_1.4.0     dplyr_1.0.2       knitr_1.30       
#> [25] hms_0.5.3         generics_0.1.0    vctrs_0.3.6       ade4_1.7-16      
#> [29] grid_4.0.3        tidyselect_1.1.0  glue_1.4.2        R6_2.5.0         
#> [33] rmarkdown_2.6     seqinr_4.2-5      ggplot2_3.3.2     purrr_0.3.4      
#> [37] farver_2.0.3      magrittr_2.0.1    MASS_7.3-53       scales_1.1.1     
#> [41] ellipsis_0.3.1    htmltools_0.5.0   ape_5.4-1         colorspace_2.0-0 
#> [45] labeling_0.4.2    stringi_1.5.3     munsell_0.5.0     crayon_1.3.4
citation()

To cite R in publications use:

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

A BibTeX entry for LaTeX users is

@Manual{, title = {R: A Language and Environment for Statistical Computing}, author = {{R Core Team}}, organization = {R Foundation for Statistical Computing}, address = {Vienna, Austria}, year = {2020}, url = {https://www.R-project.org/}, }

We have invested a lot of time and effort in creating R, please cite it when using it for data analysis. See also ‘citation(“pkgname”)’ for citing R packages.