| Title: | Run Predictions Inside the Database |
| Version: | 1.1.0 |
| Description: | It parses a fitted 'R' model object, and returns a formula in 'Tidy Eval' code that calculates the predictions. It works with several database back-ends because it leverages 'dplyr' and 'dbplyr' for the final 'SQL' translation of the algorithm. It currently supports lm(), glm(), randomForest(), ranger(), rpart(), earth(), xgb.Booster.complete(), lgb.Booster(), catboost.Model(), cubist(), and ctree() models. |
| License: | MIT + file LICENSE |
| URL: | https://tidypredict.tidymodels.org, https://github.com/tidymodels/tidypredict |
| BugReports: | https://github.com/tidymodels/tidypredict/issues |
| Depends: | R (≥ 3.6) |
| Imports: | cli, dplyr (≥ 0.7), generics, jsonlite, knitr, purrr, rlang (≥ 1.1.1), tibble, tidyr |
| Suggests: | bonsai, covr, Cubist (≥ 0.5.1), DBI, dbplyr, earth (≥ 5.1.2), glmnet, lightgbm, methods, mlbench, modeldata, nycflights13, parsnip, partykit, randomForest, ranger (≥ 0.14.1), rpart (≥ 4.1.0), rmarkdown, RSQLite, survival, testthat (≥ 3.2.0), withr, xgboost, yaml |
| VignetteBuilder: | knitr |
| Config/Needs/website: | tidyverse/tidytemplate |
| Config/testthat/edition: | 3 |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| NeedsCompilation: | no |
| Packaged: | 2026-02-26 22:11:56 UTC; emilhvitfeldt |
| Author: | Emil Hvitfeldt [aut, cre], Edgar Ruiz [aut], Max Kuhn [aut] |
| Maintainer: | Emil Hvitfeldt <emil.hvitfeldt@posit.co> |
| Repository: | CRAN |
| Date/Publication: | 2026-02-27 06:30:14 UTC |
tidypredict: Run Predictions Inside the Database
Description
It parses a fitted 'R' model object, and returns a formula in 'Tidy Eval' code that calculates the predictions. It works with several database back-ends because it leverages 'dplyr' and 'dbplyr' for the final 'SQL' translation of the algorithm. It currently supports lm(), glm(), randomForest(), ranger(), rpart(), earth(), xgb.Booster.complete(), lgb.Booster(), catboost.Model(), cubist(), and ctree() models.
Author(s)
Maintainer: Emil Hvitfeldt emil.hvitfeldt@posit.co
Authors:
Edgar Ruiz edgar@posit.co
Max Kuhn max@posit.co
See Also
Useful links:
- https://tidypredict.tidymodels.org
- https://github.com/tidymodels/tidypredict
- Report bugs at https://github.com/tidymodels/tidypredict/issues
Build case_when expression from nodes with predictions and paths
Description
Shared helper for building tree expressions used by ranger and randomForest classification extractors.
Usage
.build_case_when_tree(nodes)
Arguments
nodes |
A list of lists, each with |
Build linear predictor expression from coefficient names and values
Description
Shared helper for building linear predictor strings from coefficients. Used by orbital package for glmnet models.
Usage
.build_linear_pred(coef_names, coef_values)
Arguments
coef_names |
Character vector of coefficient names (including "(Intercept)") |
coef_values |
Numeric vector of coefficient values |
Build nested case_when expression from tree info
Description
Shared helper for building nested tree expressions. This is the nested
equivalent of .build_case_when_tree().
Usage
.build_nested_case_when_tree(tree_info)
Arguments
tree_info |
A tree info list with nodeID, leftChild, rightChild, splitvarName, terminal, prediction, and node_splits |
Extract processed CatBoost trees
Description
For use in orbital package.
Usage
.extract_catboost_trees(model)
Arguments
model |
A CatBoost model object |
Extract multiclass linear predictors for earth models
Description
For use in orbital package.
Usage
.extract_earth_multiclass(model)
Arguments
model |
An earth model object with multiple classes (glm.list with >1 elements) |
Extract multiclass linear predictors for glmnet models
Description
For use in orbital package.
Usage
.extract_glmnet_multiclass(model, penalty = NULL)
Arguments
model |
A glmnet model object with class "multnet" |
penalty |
The penalty value to use for coefficient extraction |
Extract processed LightGBM trees
Description
For use in orbital package.
Usage
.extract_lgb_trees(model)
Arguments
model |
A LightGBM model object |
Extract classprob trees for partykit models
Description
For use in orbital package.
Usage
.extract_partykit_classprob(model)
Arguments
model |
A partykit model object |
Extract classification probability trees for ranger models
Description
For use in orbital package.
Usage
.extract_ranger_classprob(model)
Arguments
model |
A ranger model object fitted with |
Extract regression trees for ranger models
Description
For use in orbital package.
Usage
.extract_ranger_trees(model)
Arguments
model |
A ranger model object (regression) |
Extract classification vote trees for randomForest models
Description
For use in orbital package.
Usage
.extract_rf_classprob(model)
Arguments
model |
A randomForest model object |
Extract regression trees for randomForest models
Description
For use in orbital package.
Usage
.extract_rf_trees(model)
Arguments
model |
A randomForest model object (regression) |
Extract classprob trees for rpart models
Description
For use in orbital package.
Usage
.extract_rpart_classprob(model)
Arguments
model |
An rpart model object |
Extract processed xgboost trees
Description
For use in orbital package.
Usage
.extract_xgb_trees(model)
Arguments
model |
An xgb.Booster model |
Extract comprehensive tree info for partykit models
Description
Returns tree structure in format needed by nested case_when generator. For use in orbital package.
Usage
.partykit_tree_info_full(model)
Arguments
model |
A partykit model object |
Extract comprehensive tree info for rpart models
Description
Returns tree structure in format needed by nested case_when generator. For use in orbital package.
Usage
.rpart_tree_info_full(model)
Arguments
model |
An rpart model object |
Checks that the formula can be parsed
Description
Uses an S3 method to check that a given formula can be parsed based on its class. It currently scans for unsupported contrasts and for in-line functions (e.g., lm(wt ~ as.factor(am))). Since this function is meant to be called by other functions, as opposed to interactively, a successful check is silent.
Usage
acceptable_formula(model)
Arguments
model |
An R model object |
Examples
model <- lm(mpg ~ wt, mtcars)
acceptable_formula(model)
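Because a successful check is silent, calling code usually only needs to react to a failure. The sketch below assumes, based on the description above, that an unsupported formula raises a regular R error. The check_formula() wrapper is a hypothetical helper and not part of tidypredict.

```r
library(tidypredict)

# Hypothetical wrapper: TRUE if the formula passes the check,
# FALSE if acceptable_formula() signals an error.
check_formula <- function(model) {
  tryCatch(
    {
      acceptable_formula(model)
      TRUE
    },
    error = function(e) FALSE
  )
}

ok_model <- lm(mpg ~ wt, data = mtcars)
# In-line function in the formula, the unsupported case named above
bad_model <- lm(wt ~ as.factor(am), data = mtcars)

check_formula(ok_model)
check_formula(bad_model)
```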
Prepares parsed model object
Description
Prepares parsed model object
Usage
as_parsed_model(x)
Arguments
x |
A parsed model object |
Build a nested case_when expression for a single node
Description
Build a nested case_when expression for a single node
Usage
build_nested_node(node_id, tree_info)
Arguments
node_id |
The node ID to build (0-indexed) |
tree_info |
Tree info list with nodeID, leftChild, rightChild, splitvarName, terminal, prediction, and node_splits |
Build a split condition expression for nested trees (left branch)
Description
Build a split condition expression for nested trees (left branch)
Usage
build_nested_split_condition(split)
Arguments
split |
A split info list with col, val/vals, is_categorical |
Generate trees
Description
Each tree is generated as a flat tree, with each node being a separate part of the case_when(). The Details section shows how an example tree is flattened.
Usage
generate_case_when_trees(parsedmodel, default = TRUE)
Details
The following tree:
               +-----+
          +----|x > 0|----+
          |    +-----+    |
          v               v
       +------+       +-------+
    +--|y < 20|--+ +--|z <= 10|--+
    |  +------+  | |  +-------+  |
    v            v v             v
    a            b c             d
will be turned into the following case_when() statement:
case_when(
  x > 0 & y < 20 ~ "a",
  x > 0 & y >= 20 ~ "b",
  x <= 0 & z <= 10 ~ "c",
  x <= 0 & z > 10 ~ "d"
)
instead of nested case_when() calls like this:
case_when(
  x > 0 ~ case_when(
    y < 20 ~ "a",
    y >= 20 ~ "b"
  ),
  x <= 0 ~ case_when(
    z <= 10 ~ "c",
    z > 10 ~ "d"
  )
)
The functions in this file generate these trees.
generate_case_when_tree() generates a single tree, with
generate_case_when_trees() being a convenience wrapper for multiple trees.
generate_tree_node() generates the expression for a single node in
the tree, and generate_tree_nodes() is a convenience wrapper for
calculating all nodes.
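The flattening described above can be sketched with rlang and dplyr. This is an illustration of the technique only, not the package's internal code: the leaves list and the combine_path() helper are hypothetical names, and the leaf format is an assumption made for this example.

```r
library(dplyr)
library(rlang)

# Hypothetical leaf descriptions for the example tree: each leaf lists the
# conditions on its path from the root, plus its prediction.
leaves <- list(
  list(path = exprs(x > 0, y < 20), prediction = "a"),
  list(path = exprs(x > 0, y >= 20), prediction = "b"),
  list(path = exprs(x <= 0, z <= 10), prediction = "c"),
  list(path = exprs(x <= 0, z > 10), prediction = "d")
)

# Join all conditions on one path with `&`.
combine_path <- function(conds) {
  Reduce(function(lhs, rhs) call2("&", lhs, rhs), conds)
}

# Build one `condition ~ prediction` branch per leaf.
branches <- lapply(leaves, function(leaf) {
  call2("~", combine_path(leaf$path), leaf$prediction)
})

# Splice the branches into a single flat case_when() call.
flat_tree <- call2("case_when", !!!branches)
flat_tree

# The expression can then be injected into mutate().
df <- tibble(x = c(1, 1, -1, -1), y = c(5, 30, 0, 0), z = c(0, 0, 3, 40))
scored <- mutate(df, pred = !!flat_tree)
scored
```

Each row of df is chosen to fall into a different leaf, so the pred column should contain one of each label.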
Generate nested case_when trees
Description
These functions generate nested case_when() expressions for decision trees,
which are more efficient than flat case_when() for both R/dplyr and SQL
execution.
Usage
generate_nested_case_when_tree(tree_info)
Arguments
tree_info |
A tree info list from |
Details
The following tree:
               +-----+
          +----|x > 0|----+
          |    +-----+    |
          v               v
       +------+       +-------+
    +--|y < 20|--+ +--|z <= 10|--+
    |  +------+  | |  +-------+  |
    v            v v             v
    a            b c             d
will be turned into the following nested case_when() statement:
case_when(
  x > 0 ~ case_when(
    y < 20 ~ "a",
    .default = "b"
  ),
  .default = case_when(
    z <= 10 ~ "c",
    .default = "d"
  )
)
NA values in predictor columns are not handled by the generated expression. Users should ensure that predictor columns do not contain NA values before using the generated expression, or the results will be NA for those rows.
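As a quick check of the equivalence between the two forms, the sketch below evaluates the flat and the nested expression for the example tree on simulated data without missing values. The data frame and its column ranges are made up for illustration, and the .default argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

set.seed(1)
# Simulated predictors with no NA values (illustrative only)
df <- tibble(
  x = rnorm(200),
  y = runif(200, 0, 40),
  z = runif(200, 0, 20)
)

res <- df %>%
  mutate(
    # Flat form: every leaf spells out its full path
    flat = case_when(
      x > 0 & y < 20 ~ "a",
      x > 0 & y >= 20 ~ "b",
      x <= 0 & z <= 10 ~ "c",
      x <= 0 & z > 10 ~ "d"
    ),
    # Nested form: mirrors the tree, using .default for the right branch
    nested = case_when(
      x > 0 ~ case_when(y < 20 ~ "a", .default = "b"),
      .default = case_when(z <= 10 ~ "c", .default = "d")
    )
  )

# Both columns should agree on data without missing values
identical(res$flat, res$nested)
```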
Construct a single node of a tree
Description
Construct a single node of a tree
Usage
generate_tree_node(node, calc_mode = "")
Arguments
node |
a list with named elements |
calc_mode |
character, takes values The The This list can contain 0 or more elements. The elements must each be of the following format:
The It can either be a singular value or a list.
If it is a list it will have the following 4 named elements
|
Knit print method for test predictions results
Description
Knit print method for test predictions results
Usage
## S3 method for class 'tidypredict_test'
knit_print(x, ...)
Converts an R model object into a parsed model
Description
Parses a fitted R model's structure and extracts the components needed to create a dplyr formula for prediction. The parsed model can be serialized (e.g., saved to YAML) and later used to generate predictions without the original model object.
Usage
parse_model(model)
Arguments
model |
An R model object. |
Value
A parsed model object with class parsed_model and a model-specific
subclass (e.g., pm_xgb, pm_tree, pm_regression). The object contains:
- $general: List with model metadata including model (model type), type (used for S3 dispatch), version (parsed model format version), and model-specific parameters.
- Model-specific fields containing coefficients, tree structures, etc.
Parsed model versions
The $general$version field indicates the parsed model format:
- Version 1: Original format. Linear models store coefficients in a data frame. Tree models use flat case_when() expressions where all leaf conditions are at the same level.
- Version 2: Improved coefficient storage for linear models (lm, earth). Tree models still use flat case_when().
- Version 3: Current format. Tree models (rpart, ranger, randomForest, xgboost, lightgbm, catboost, partykit, cubist) use nested case_when() expressions that mirror the tree structure. This produces more efficient SQL and R code because conditions are evaluated hierarchically rather than checking all leaf paths.
When loading a parsed model saved with an older version, tidypredict automatically uses the appropriate formula builder for backwards compatibility.
Model types
Each parsed model has a type that determines the S3 class used for dispatch:
- pm_regression: Linear models (lm, glm, earth, glmnet)
- pm_tree: Single trees and forests (rpart, partykit, ranger, randomForest, cubist)
- pm_xgb: XGBoost gradient boosting models
- pm_lgb: LightGBM gradient boosting models
- pm_catboost: CatBoost gradient boosting models
Examples
library(dplyr)
df <- mutate(mtcars, cyl = paste0("cyl", cyl))
model <- lm(mpg ~ wt + cyl * disp, offset = am, data = df)
parse_model(model)
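The serialization workflow mentioned in the Description can be sketched as a YAML round trip. This is a minimal example assuming the yaml package is installed. The temporary file path is only for illustration.

```r
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)
pm <- parse_model(model)

# Metadata fields described in the Value section
pm$general$model
pm$general$version

# Write the parsed model to YAML and read it back
path <- tempfile(fileext = ".yml")
yaml::write_yaml(pm, path)
reloaded <- as_parsed_model(yaml::read_yaml(path))

# Build the prediction formula without the original lm object
tidypredict_fit(reloaded)
```

Only the YAML file is needed to rebuild the formula, so the original model object does not have to be shipped to the environment that runs the predictions.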
Turn a path object into an expression
Description
Turn a path object into an expression
Usage
path_formula(x)
Arguments
x |
a list. The input of this function is a list with 4 values.
|
Turn a path object into a combined expression
Description
Turn a path object into a combined expression
Usage
path_formulas(path)
Arguments
path |
a list of lists. This list can contain 0 or more elements. The elements must each be of the following format:
|
print method for test predictions results
Description
print method for test predictions results
Usage
## S3 method for class 'tidypredict_test'
print(x, ...)
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- generics
Set categorical feature mappings for CatBoost model
Description
CatBoost stores categorical features as hash values internally. This function establishes the mapping between hash values and category names by examining a data frame with the same factor columns used during training.
Usage
set_catboost_categories(parsed_model, model, data)
Arguments
parsed_model |
A parsed CatBoost model from |
model |
The original CatBoost model object |
data |
A data frame containing factor columns matching the categorical features used in the model. The factor levels must match those from training. |
Details
This function is only needed when using raw CatBoost models (trained with
catboost.train()). When using parsnip/bonsai, categorical features are
handled automatically and this function is not required.
Value
The parsed model with category mappings added
Examples
## Not run:
# For raw CatBoost models with categorical features:
pm <- parse_model(catboost_model)
pm <- set_catboost_categories(pm, catboost_model, training_data)
tidypredict_fit(pm)
# For parsnip/bonsai models, this is not needed:
# tidypredict_fit(parsnip_model_fit) # works automatically
## End(Not run)
Tidy the parsed model results
Description
Tidy the parsed model results
Usage
## S3 method for class 'pm_regression'
tidy(x, ...)
Arguments
x |
A parsed_model object |
... |
Reserved for future use |
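A minimal usage sketch, assuming a parsed linear model: tidy() is applied to the result of parse_model() and should return the model terms in tabular form. The exact columns are not documented in this section, so none are assumed here.

```r
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)
pm <- parse_model(model)

# Tabular summary of the parsed regression model
tidied <- tidy(pm)
tidied
```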
Returns a Tidy Eval formula to calculate fitted values
Description
It parses a model or uses an already parsed model to return a Tidy Eval formula that can then be used inside a dplyr command.
Usage
tidypredict_fit(model)
Arguments
model |
An R model or a list with a parsed model. |
Examples
model <- lm(mpg ~ wt + cyl * disp, offset = am, data = mtcars)
tidypredict_fit(model)
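The returned formula is typically injected into a dplyr verb with the !! operator. The sketch below uses a simpler model than the example above and compares the result with base predict(). The column name fit is an arbitrary choice.

```r
library(dplyr)
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)

# Inject the prediction formula into mutate()
scored <- mtcars %>%
  mutate(fit = !!tidypredict_fit(model))

head(select(scored, mpg, wt, cyl, fit))

# Compare against base R predictions
all.equal(scored$fit, unname(predict(model, mtcars)))
```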
Returns a Tidy Eval formula to calculate prediction interval.
Description
It parses a model or uses an already parsed model to return a Tidy Eval formula that can then be used inside a dplyr command.
Usage
tidypredict_interval(model, interval = 0.95)
Arguments
model |
An R model or a list with a parsed model |
interval |
The prediction interval, defaults to 0.95 |
Details
The result still has to be added to and subtracted from the fit to obtain the upper and lower bound respectively.
Examples
model <- lm(mpg ~ wt + cyl * disp, offset = am, data = mtcars)
tidypredict_interval(model)
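As the Details note, the interval formula yields a margin rather than the bounds themselves. A minimal sketch of turning it into lower and upper bounds follows. The column names margin, lower, and upper are arbitrary choices for this example.

```r
library(dplyr)
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)

bounds <- mtcars %>%
  mutate(
    fit = !!tidypredict_fit(model),
    margin = !!tidypredict_interval(model, interval = 0.95),
    # Subtract and add the margin to get the bounds
    lower = fit - margin,
    upper = fit + margin
  ) %>%
  select(mpg, fit, lower, upper)

head(bounds)
```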
Returns a SQL query with formula to calculate fitted values
Description
Returns a SQL query with formula to calculate fitted values
Usage
tidypredict_sql(model, con)
Arguments
model |
An R model or a list with a parsed model |
con |
Database connection object. It is used to select the correct SQL translation syntax. |
Examples
library(dbplyr)
model <- lm(mpg ~ wt + am + cyl, data = mtcars)
tidypredict_sql(model, simulate_dbi())
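With a live connection, the same model can be scored inside the database. The sketch below uses an in-memory SQLite database as a stand-in and assumes the DBI, RSQLite, and dbplyr packages are installed. The table name mtcars is an arbitrary choice.

```r
library(dplyr)
library(tidypredict)

model <- lm(mpg ~ wt + am + cyl, data = mtcars)

# In-memory SQLite database, used here only as a convenient stand-in
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

# SQL translated for this connection's dialect
tidypredict_sql(model, con)

# The same formula applied lazily to the remote table;
# the arithmetic runs in SQLite and only the result is collected.
db_scored <- tbl(con, "mtcars") %>%
  mutate(fit = !!tidypredict_fit(model)) %>%
  select(mpg, fit) %>%
  collect()

DBI::dbDisconnect(con)

head(db_scored)
```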
Returns a SQL query with formula to calculate predicted interval
Description
Returns a SQL query with formula to calculate predicted interval
Usage
tidypredict_sql_interval(model, con, interval = 0.95)
Arguments
model |
An R model or a tibble with a parsed model |
con |
Database connection object. It is used to select the correct SQL translation syntax. |
interval |
The prediction interval, defaults to 0.95 |
Examples
library(dbplyr)
model <- lm(mpg ~ wt + am + cyl, data = mtcars)
tidypredict_sql_interval(model, simulate_dbi())
Tests base predict function against tidypredict
Description
Compares the results of predict() and tidypredict_to_column() functions.
Usage
tidypredict_test(
model,
df = model$model,
threshold = 1e-12,
include_intervals = FALSE,
max_rows = NULL,
xg_df = NULL
)
Arguments
model |
An R model or a list with a parsed model. It currently supports lm(), glm() and randomForest() models. |
df |
A data frame that contains all of the needed fields to run the prediction. It defaults to the "model" data frame object inside the model object. |
threshold |
The number that a given result difference, between predict() and tidypredict_to_column() should not exceed. For continuous predictions, the default value is 0.000000000001 (1e-12), and for categorical predictions, the default value is 0. |
include_intervals |
Switch to indicate if the prediction intervals should be included in the test. It defaults to FALSE. |
max_rows |
The maximum number of rows from the object passed in the df argument to use in the test. Highly recommended for large data sets. |
xg_df |
An xgb.DMatrix object, required only for XGBoost models. It defaults to NULL. |
Examples
model <- lm(mpg ~ wt + cyl * disp, offset = am, data = mtcars)
tidypredict_test(model)
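A sketch of using the test as an automated check, for example in a script that should stop when results diverge. The alert element used below is an assumption about the structure of the returned object. It can be confirmed with str() before relying on it.

```r
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)

# Limit the comparison to 10 rows and include the prediction intervals
res <- tidypredict_test(model, max_rows = 10, include_intervals = TRUE)
res

# Assumed element: `alert` is TRUE when a difference exceeds the threshold
if (isTRUE(res$alert)) {
  stop("tidypredict results differ from predict() beyond the threshold")
}
```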
Adds the prediction columns to a piped command set.
Description
Adds a new column with the results from tidypredict_fit() to a piped command set. If add_interval is set to TRUE, it will add two additional columns: one for the lower and another for the upper prediction interval bound.
Usage
tidypredict_to_column(
df,
model,
add_interval = FALSE,
interval = 0.95,
vars = c("fit", "upper", "lower")
)
Arguments
df |
A data.frame or tibble |
model |
An R model or a parsed model inside a data frame |
add_interval |
Switch that indicates if the prediction interval columns should be added. Defaults to FALSE |
interval |
The prediction interval, defaults to 0.95. Ignored if add_interval is set to FALSE |
vars |
The name of the variables that this function will produce. Defaults to "fit", "upper", and "lower". |
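A minimal usage sketch with the default column names and with custom names passed through vars. The custom names pred, pred_hi, and pred_lo are arbitrary and follow the documented order of fit, upper, lower.

```r
library(dplyr)
library(tidypredict)

model <- lm(mpg ~ wt + cyl, data = mtcars)

# Default column names: fit, upper, lower
out <- mtcars %>%
  tidypredict_to_column(model, add_interval = TRUE)

head(select(out, mpg, fit, lower, upper))

# Custom column names, given in the order fit, upper, lower
out_named <- mtcars %>%
  tidypredict_to_column(
    model,
    add_interval = TRUE,
    vars = c("pred", "pred_hi", "pred_lo")
  )

head(select(out_named, mpg, pred, pred_lo, pred_hi))
```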