KinSwingR aims to predict kinase activity from phoshoproteomics data. It implements the alogorithm described in: Engholm-Keller and Waardenberg (2018) and Waardenberg (2018) (described in greater detail below). KinSwingR predicts kinase activity by integrating kinase-substrate predictions and the fold change and signficance of change for peptide sequences obtained from phospho-proteomics studies. The score is based on the network connectivity of kinase-substrate networks and is weighted for the number of substrates as well as the size of the local network. P-values are provided to assess the significance of the KinSwing scores, which are determined through random permuations of the overall kinase-substrate network.
KinSwingR is implemented as 3 core functions:
buildPWM()
builds position weight matrices (PWMs) from known kinase-substrate sequencesscoreSequences()
score PWMs build using buildPWM()
against input phosphoproteome dataswing()
integrates PWM scores, direction of phosphopeptide change and significance of phosphopeptide change into a “swing” score.The KinSwing score is a metric of kinase activity, ranging from positive to negative, and p-values are provided to determine significance.
Additional functions are also provided:
cleanAnnotation()
function to tidy annotations and extract peptide sequences.viewPWM()
function to view PWM modelsDetailed information for each of these functions can be accessed using the ?
command before the function of interest. E.g. ?buildPWM
We will now consider an example dataset to predict kinase activity. Kinase-substrate sequences and phosphoproteomics data are provided as example data in the KinSwingR package.
Begin by loading the KinSwingR library and the two data libraries included in the package.
library(KinSwingR)
data(example_phosphoproteome)
data(phosphositeplus_human)
# View the datasets:
head(example_phosphoproteome)
## annotation peptide fc pval
## 1 A0A096MJ61|NA|89|PRRVRNLSAVLAART NA -0.08377538 0.218815889
## 2 A0A096MJB0|Adcy9|1296|LDKASLGSDDGAQTK NA 0.03707147 0.751069301
## 3 A0A096MJB0|Adcy9|610|PRGQGTASPGSVSDL NA -0.06885408 0.594494965
## 4 A0A096MJB0|Adcy9|613|QGTASPGSVSDLAQT NA -0.29418446 0.002806832
## 5 A0A096MJN4|Sept4|49|ILEPRPQSPDLCDDD NA 0.09097982 0.078667811
## 6 A0A096MJN4|Sept4|81|FCPPAPLSPSSRPRS NA -0.12246661 0.078619010
## kinase substrate
## [1,] "EIF2AK1" "MILLSELSRRRIRSI"
## [2,] "EIF2AK1" "RILLSELSR______"
## [3,] "EIF2AK1" "IEGMILLSELSRRRI"
## [4,] "PRKCD" "MKKKDEGSYDLGKKP"
## [5,] "PRKCD" "FPLRKTASEPNLKVR"
## [6,] "PRKCD" "PLLARSPSTNRKYPP"
## sample 100 data points for demonstration
sample_data <- head(example_phosphoproteome, 1000)
# Random sample for demosntration purposes
set.seed(1234)
sample_pwm <- phosphositeplus_human[sample(nrow(phosphositeplus_human), 1000),]
# for visualising a motif, sample only CAMK2A
CAMK2A_example <- phosphositeplus_human[phosphositeplus_human[,1]== "CAMK2A",]
Where the centered peptide sequences (on the phosphosite of interest) are not provided in the format required for scoreSequences()
(see the argument “input_data”, in ?scoreSequences), these can be required to be extracted from another column of annotated data. NB. “input_data” table format must contain columns for “annotation”, “peptide”, “fold-change” and “p-values”.
In the example dataset provided, example_phosphoproteome
, peptides have not been extracted into a stand-a-lone peptide column. cleanAnnotation()
is provided as a function to extract peptides from annotation columns and place into the peptide column.
In the example dataset, example_phosphoproteome
, the peptide sequence is the 4th component of the annotation, which corresponds to using the argument seq_number = 4
below, and is seperated by |
, which corresponds to the argument annotation_delimiter = "|"
. In this case, the annotated data also contains multi-mapped and multi-site information. For example the following annotation A1L1I3|Numbl|263;270|PAQPGHVSPTPATTS;SPTPATTSPGEKGEA
contains two peptides PAQPGHVSPTPATTS
and SPTPATTSPGEKGEA
that map to different sites from the same reference gene Numbl
, where the peptides are seperated by ;
. The annotated data also includes multi-protein mapped (where a peptide could map to more than one protein - not shown) and contains X
instead of _
to indicate sequences that were outside of the length of the coding sequences. KinSwingR requires that these sequences outside of the coding region are marked with _
as deafult and therefore replace_search = "X"
and replace_with = "_"
can be used as arguments in cleanAnnotation()
to replace these. This allows for full flexibility of the input data here, depending of the software used to generate determine the peptide sequences. NB: characters other than _
can be used, but these need to be declared when calling buildPWM and scoreSequences functions later (see their help files).
Calling cleanAnnotation()
will produce a new table with the unique combinations of peptide sequences extracted from the annotation column into the peptide column:
annotated_data <- cleanAnnotation(input_data = sample_data,
annotation_delimiter = "|",
multi_protein_delimiter = ":",
multi_site_delimiter = ";",
seq_number = 4,
replace = TRUE,
replace_search = "X",
replace_with = "_")
head(annotated_data)
## annotation peptide fc
## 1 A0A096MJ61|NA|89|PRRVRNLSAVLAART PRRVRNLSAVLAART -0.08377538
## 2 A0A096MJB0|Adcy9|1296|LDKASLGSDDGAQTK LDKASLGSDDGAQTK 0.03707147
## 3 A0A096MJB0|Adcy9|610|PRGQGTASPGSVSDL PRGQGTASPGSVSDL -0.06885408
## 4 A0A096MJB0|Adcy9|613|QGTASPGSVSDLAQT QGTASPGSVSDLAQT -0.29418446
## 5 A0A096MJN4|Sept4|49|ILEPRPQSPDLCDDD ILEPRPQSPDLCDDD 0.09097982
## 6 A0A096MJN4|Sept4|81|FCPPAPLSPSSRPRS FCPPAPLSPSSRPRS -0.12246661
## pval
## 1 0.218815889
## 2 0.751069301
## 3 0.594494965
## 4 0.002806832
## 5 0.078667811
## 6 0.078619010
The first step to inferring kinase activity, is to build Position Weight Matrices (PWMs) for kinases. This can be done using buildPWM()
for any table containing centered substrate peptide sequences for a list of kinases. The example data data(phosphositeplus_human)
indicates the required format for building PWM models. Below, for demosntration, we use a subset that was sampled above sample_pwm
To generate the PWMs:
This will build the PWM models, accessible as PWM$pwm
and list the number of substrate sequences used to build each PWM, accesible as PWM$kinase
.
To view the list of kinases and the number of sequences used:
## kinase n
## 2 PLK1 25
## 4 PRKCA 42
## 5 CDK5 18
## 6 PRKCD 14
## 7 MAPK3 29
## 9 CDK2 51
To visualise a motif:
Next, we will use the PWM models generated, pwms
, to identify matches in the annotated_data
table that was cleaned using cleanAnnotation()
above. ``scoreSequencessupports multi-core processing - see the example below for setting the number of workers to 4.
scoreSequencesdraws a random background by default of size
n = 1000. It is recommended to use
set.seed()prior to calling
scoreSequencesif you wish to reproduce your results. To access the help file, which explains all the arguments, type
?scoreSequences``` into the console.
# As an example of control over multi-core processing
# load BiocParallel library
library(BiocParallel)
# finally set/register the number of cores to use
register(SnowParam(workers = 4))
# set seed for reproducible results
set.seed(1234)
scores <- scoreSequences(input_data = annotated_data,
pwm_in = pwms,
n = 100)
The outputs of scores
are transparent and accessible. These are however primarily intermediate tables for obtaining swing scores. scores
is a simple list object that contains peptide scores (scores$peptide_scores)
, p-values for the peptide scores (scores$peptide_p)
and the background peptides used to score significance (scores$background)
for reproducibility (i.e. the background can saved and reused for reproducibility).
In summary, scoreSequences()
scores each input sequence for a match against all PWMs provided using `buildPWM()
and generates p-values for scores. This is effectively one large network of kinase-substrate edges of dimensions kinase, k, by substrate, s.
Having built a kinase-substrate network, swing()
then integrates the kinase-substrate predictions, directionality and significance of phosphopeptide fold change to assess local connectivity (or swing) of kinase-substrate networks. The final score is a normalised score of predicted kinase activity that is weighted for the number of substrates used in the PWM model and number of peptides in the local kinase-substrate network. By default, this will permute the network 1000 times (here we use 10 for example purposes). It is recommended to use set.seed()
prior to calling swing
if you wish to reproduce your results. ``swing``` supports multi-core processing - see the example below for setting the number of workers to 4.
# after loading BiocParallel library, set/register the number of cores to use
register(SnowParam(workers = 4))
# set seed for reproducible results
set.seed(1234)
swing_out <- swing(input_data = annotated_data,
pwm_in = pwms,
pwm_scores = scores,
permutations = 10)
# This will produce two tables, one is a network for use with e.g. Cytoscape and the other is the scores. To access the scores:
head(swing_out$scores)
## kinase pos neg all pk nk swing_raw n swing
## 12 EGFR 8 4 12 0.6666667 0.3333333 13.64922 14 1.1052625
## 20 PLK1 5 1 6 0.8333333 0.1666667 27.87288 25 2.1248974
## 3 ATM 3 3 6 0.5000000 0.5000000 0.00000 21 0.1268058
## 22 PRKACA 4 2 6 0.6666667 0.3333333 15.50978 64 1.2386377
## 24 PRKCB 7 3 10 0.7000000 0.3000000 15.46053 14 1.2351072
## 2 AKT1 4 2 6 0.6666667 0.3333333 11.52746 22 0.9531622
## p_greater p_less
## 12 0.09090909 1.0000000
## 20 0.09090909 1.0000000
## 3 0.18181818 0.7272727
## 22 0.18181818 0.8181818
## 24 0.18181818 0.9090909
## 2 0.27272727 0.8181818
The outputs of this table indicate the following:
kinase
: The kinasepos
: Number of positively regulated kinase substratesneg
: Number of negatively regulated kinase substratesall
: Total number of regulated kinase substratespk
: Proportion of positively regulated kinase substratesnk
: Proportion of negatively regulated kinase substratesswing_raw
: Raw - weighted scoren
: Number of subtrate sequence in kinase
PWMswing
: Normalised (Z-score transformed) - weighted scorep_greater
: probability of observing a swing score greater thanp_less
: probability of observing a swing score less thanNote that the pos
, neg
and all
include a pseudo-count, that is set in ?swing
, note pseudo_count
.
*** See Engholm-Keller and Waardenberg (2018) and Waardenberg (2018) for methods description ***
*** For a full description of the KinSwing algorith, see Engholm-Keller and Waardenberg (2018) and Waardenberg (2018) ***
In brief:
buildPWM()
generates Position Weight Matrices (PWMs) for kinases based on known substrate sequence (Equation 1), where each kinase, \(K\), is considered as the log-likelihood ratio of the average frequency of amino acid, \(a\), at each position, \(p\), divided by background frequencies, \(B\) (\(C\) is a pseudo count to avoid log zero):
Equation 1: \(PWM_{(a,p)}=log((1/n∑^n_{i=1}K_i)+C)/B_a+C)\)
scoreSequences()
scores each kinase, \(K\), match to a substrate \(S\), given as \(S_{score}=∑^n_{(i=1)}f(a,p)\) , which corresponds to the sum of the corresponding amino acid, \(a\), of peptide sequence length, \(i\), from position, \(p\), of \(PWM_{(a,p)}\) and \(f(a,p)=PWM_{ap}∈PWM_{(a,p)}\). The probability of observing \(S_score\) for kinase, \(K\), is determined as conditional on a randomly sampled reference distribution of size \(N\) sequences \(P(S_{score}|R,N)\), where \(R\) sequences are determined to have a test statistic less than or equal to \(S_{score}\):
Equation 2: \(R= ∑^N_{n=1}I((S_{score})n* ≥ (S_{score})i)\)
swing()
integrates phosphosite data and kinase-substrate scores from scoreSequences()
into a network for scoring kinase activity based on local connectivity, \(swing_k\), (Equation 3). \(swing_k\) is the weighted product of the proportion of positive, \(Pos_k\), and negative, \(Neg_k\), network edges, determined as the product of a logic function (described here: Waardenberg (2018) and Engholm-Keller and Waardenberg (2018)) given a local network of size, \(C_k\), with \(n\) substrates for kinase, \(K\):
Equation 3: \(swing_k=log_2((Pos_k+c)/(Neg_k+c))*log_2(C_k)*log_2(S_n)\)
\(swing_k\), is transformed into a z-score, \(Z(swing_k)\), where, \(μ\), is the mean and, \(σ\), the standard deviation of swing scores, thus allowing for comparison of predicted kinase activity across multiple timepoints and/or conditions.
KinSwingR addresses the question of “how likely is it is observe the predicted activity of kinase, \(K\), by random chance?” by computing \(swing_k\) given \(N\) permutations of kinase node labels, \(K\), to substrates, \(S\), of the total network, \(M_{ks}\). Thus, the probability of observing \(swing_k\) is conditional on this permuted reference distribution, of size, \(N\) (Equation 2). This is computed for each tail of the distribution, that is, positive and negative \(swing_k\) scores.
Engholm-Keller, Kasper, and Ashley Jacob Waardenberg. 2018. “In Press.” XX XX (X). XX:XX–XX. https://doi.org/XX.
Waardenberg, Ashley Jacob. 2018. “In Press.” XX XX (X). XX:XX–XX. https://doi.org/XX.