xMLrandomforest | R Documentation |
xMLrandomforest
integrates a predictor matrix in a
supervised manner using the random forest machine learning algorithm. It
requires three inputs: 1) Gold Standard Positive (GSP) targets; 2) Gold
Standard Negative (GSN) targets; 3) a predictor matrix containing genes
in rows and predictors in columns, with their predictive scores inside
it. It returns an object of class 'pTarget'.
xMLrandomforest(df_predictor, GSP, GSN, nfold = 3, mtry = NULL, ntree = 2000, fold.aggregateBy = c("Ztransform", "logistic", "fishers", "orderStatistic"), verbose = TRUE, ...)
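As a minimal sketch of the three inputs, the toy objects below show the expected shapes. The gene symbols, predictor names and scores are invented for illustration, and the sanity checks are only reasonable assumptions about what well-formed inputs look like, not requirements stated by this page.

```r
# Toy inputs; gene symbols and scores are invented for illustration only.
set.seed(1)
genes <- c("TNF", "IL6", "STAT3", "JAK2", "GAPDH", "ACTB", "ALB", "TP53")

# Predictor matrix: genes in rows, predictors in columns, scores inside.
# Gene symbols are stored as row names.
df_predictor <- data.frame(
  predictorA = runif(length(genes)),
  predictorB = runif(length(genes)),
  predictorC = runif(length(genes)),
  row.names  = genes
)

# Gold standards are plain character vectors of gene symbols.
GSP <- c("TNF", "IL6", "STAT3")
GSN <- c("GAPDH", "ACTB", "ALB")

# Assumed pre-checks (not taken from the package): row names exist,
# gold standards are found among them, and the two sets do not overlap.
stopifnot(
  !is.null(rownames(df_predictor)),
  all(GSP %in% rownames(df_predictor)),
  all(GSN %in% rownames(df_predictor)),
  length(intersect(GSP, GSN)) == 0
)
```

Genes in neither GSP nor GSN (here "JAK2" and "TP53") would correspond to the 'Putative' label described for the returned object.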
df_predictor |
a data frame containing genes (in rows) and predictors (in columns), with their predictive scores inside it. This data frame must have gene symbols as row names |
GSP |
a vector containing Gold Standard Positive (GSP) |
GSN |
a vector containing Gold Standard Negative (GSN) |
nfold |
an integer specifying the number of folds for cross validation |
mtry |
an integer specifying the number of predictors randomly sampled as candidates at each split. If NULL, it will be tuned by 'randomForest::tuneRF', with a starting value of sqrt(p), where p is the number of predictors. The minimum value is 3 |
ntree |
an integer specifying the number of trees to grow. By default, it is set to 2000 |
fold.aggregateBy |
the method used to aggregate results from k-fold cross validation. It can be "orderStatistic" for the method based on the order statistics of p-values, "fishers" for Fisher's method, "Ztransform" for the Z-transform method, or "logistic" for the logistic method. Generally speaking, the Z-transform method does well in problems where evidence against the combined null is spread widely (on an equal footing) or when the total evidence is weak; Fisher's method does best in problems where the evidence is concentrated in a relatively small fraction of the individual tests or when the evidence is at least moderately strong; the logistic method provides a compromise between these two. Notably, the aggregate methods 'Ztransform' and 'logistic' are preferred here |
verbose |
a logical indicating whether messages will be displayed on the screen. By default, it is set to TRUE to display them |
... |
additional parameters. Please refer to 'randomForest::randomForest' for the complete list. |
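To make the difference between two of the 'fold.aggregateBy' options concrete, here is a minimal base-R sketch of Fisher's method and the Z-transform (Stouffer) method for combining per-fold p-values. The helper names `combine_fisher` and `combine_ztransform` are hypothetical, and the code uses the textbook formulas; it is not claimed to be the implementation used inside the package.

```r
# Fisher's method: -2 * sum(log(p)) follows a chi-squared distribution
# with 2k degrees of freedom under the combined null (k = number of p-values).
combine_fisher <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# Z-transform (Stouffer) method: convert each p-value to a z-score,
# sum the z-scores and rescale by sqrt(k).
combine_ztransform <- function(p) {
  z <- sum(qnorm(p, lower.tail = FALSE)) / sqrt(length(p))
  pnorm(z, lower.tail = FALSE)
}

# Hypothetical per-fold p-values for one gene across nfold = 3 folds.
p_fold <- c(0.01, 0.04, 0.10)
c(fisher = combine_fisher(p_fold), ztransform = combine_ztransform(p_fold))
```

With a single p-value both helpers return that p-value unchanged, which is a quick way to check the formulas.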
an object of class "pTarget", a list with the following components:
model
: a list of models, one per fold, each resulting from that fold's training set
priority
: a data frame of nGene X 6 containing gene
priority information, where nGene is the number of genes in the input
data frame, and the 6 columns are "GS" (either 'GSP', or 'GSN', or
'Putative'), "name" (gene names), "rank" (ranks of the priority
scores), "pvalue" (the cross-fold aggregated p-value of being GSP,
per-fold p-value converted from empirical cumulative distribution of
the probability of being GSP), "fdr" (fdr adjusted from the aggregated
p-value), "priority" (-log10(fdr))
predictor
: a data frame, which is the same as the input
data frame but with an additional column 'GS' inserted as the first column
pred2fold
: a list of data frames, one per fold, each holding results from that fold's
test set
prob2fold
: a data frame of nGene X 2+nfold containing the
probability of being GSP, where nGene is the number of genes in the
input data frame, nfold is the number of folds for cross validation,
and the first two columns are "GS" (either 'GSP', or 'GSN', or
'Putative') and "name" (gene names), with the remaining columns storing the
per-fold probability of being GSP
importance2fold
: a data frame of nPredictor X 4+nfold
containing the predictor importance info per fold, where nPredictor is
the number of predictors, nfold is the number of folds for cross
validation, and the first 4 columns are "median" (the median of the
importance across folds), "mad" (the median absolute deviation of
the importance across folds), "min" (the minimum of the importance
across folds), "max" (the maximum of the importance across folds), with
the remaining columns storing the per-fold importance
roc2fold
: a data frame of 1+nPredictor X 4+nfold
containing the supervised/predictor ROC info (AUC values), where
nPredictor is the number of predictors, nfold is the number of folds
for cross validation, and the first 4 columns are "median" (the median
of the AUC values across folds), "mad" (the median absolute
deviation of the AUC values across folds), "min" (the minimum of the
AUC values across folds), "max" (the maximum of the AUC values across
folds), with the remaining columns storing the per-fold AUC values
fmax2fold
: a data frame of 1+nPredictor X 4+nfold
containing the supervised/predictor PR info (F-max values), where
nPredictor is the number of predictors, nfold is the number of folds
for cross validation, and the first 4 columns are "median" (the median
of the F-max values across folds), "mad" (the median absolute
deviation of the F-max values across folds), "min" (the minimum of the
F-max values across folds), "max" (the maximum of the F-max values
across folds), with the remaining columns storing the per-fold F-max values
importance
: a data frame of nPredictor X 2 containing the
predictor importance info, where nPredictor is the number of
predictors, and the two columns hold two types ("MeanDecreaseAccuracy" and
"MeanDecreaseGini") of predictor importance measures.
"MeanDecreaseAccuracy" measures how much worse the model performs without each
predictor (a high decrease in accuracy would be expected for very
informative predictors), while "MeanDecreaseGini" measures how pure the
nodes are at the end of the tree (a high score means the predictor was
important)
performance
: a data frame of 1+nPredictor X 2 containing
the supervised/predictor performance info,
where nPredictor is the number of predictors, and the two columns are "ROC"
(AUC values) and "Fmax" (F-max values)
call
: the call that produced this result
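The relationship between the "pvalue", "fdr", "priority" and "rank" columns of the 'priority' component can be sketched in base R as follows. The p-values are invented, and both the Benjamini-Hochberg adjustment and the tie-handling rule are assumptions made for illustration; this page only states that "fdr" is adjusted from the aggregated p-value and that "priority" is -log10(fdr).

```r
# Hypothetical aggregated p-values for five genes (not real results).
pvalue <- c(GENE1 = 0.0004, GENE2 = 0.03, GENE3 = 0.2,
            GENE4 = 0.0004, GENE5 = 0.6)

# "fdr": multiple-testing adjustment (Benjamini-Hochberg assumed here).
fdr <- p.adjust(pvalue, method = "BH")

# "priority": -log10(fdr), so a smaller FDR gives a higher priority.
priority <- -log10(fdr)

# "rank": 1 = highest priority; the tie handling is an assumption.
rnk <- rank(-priority, ties.method = "min")

priority_df <- data.frame(
  name     = names(pvalue),
  rank     = rnk,
  pvalue   = pvalue,
  fdr      = fdr,
  priority = priority
)

# Show genes from highest to lowest priority.
priority_df[order(priority_df$rank), ]
```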
## Not run: 
# Load the library
library(Pi)

## End(Not run)
RData.location <- "http://galahad.well.ox.ac.uk/bigdata_dev"
## Not run: 
pTarget <- xMLrandomforest(df_prediction, GSP, GSN)

## End(Not run)
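The layout described for 'importance2fold', 'roc2fold' and 'fmax2fold' (four summary columns followed by per-fold values) can be imitated on a toy matrix, which may help when reading those components of a returned object. The AUC values and the name `roc2fold_like` are hypothetical, and using `stats::mad` with its default scaling constant is an assumption rather than something this page specifies.

```r
# Hypothetical per-fold AUC values: predictors in rows, folds in columns.
auc_fold <- matrix(
  c(0.80, 0.70, 0.90,
    0.60, 0.65, 0.55),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("predictorA", "predictorB"), paste0("fold", 1:3))
)

# Four summary columns across folds, followed by the per-fold values.
roc2fold_like <- data.frame(
  median = apply(auc_fold, 1, median),
  mad    = apply(auc_fold, 1, mad),  # stats::mad scales by 1.4826 by default
  min    = apply(auc_fold, 1, min),
  max    = apply(auc_fold, 1, max),
  auc_fold
)
roc2fold_like
```

A small "mad" relative to the "median" suggests that a predictor performs consistently across folds.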