Title: Privacy-Preserving Synthetic Data for 'LLM' Workflows
Version: 0.2.2
Description: Generate privacy-preserving synthetic datasets that mirror structure, types, factor levels, and missingness; export bundles for 'LLM' workflows (data plus 'JSON' schema and guidance); and build fake data directly from 'SQL' database tables without reading real rows. Methods are related to approaches in Nowok, Raab and Dibben (2016) <doi:10.32614/RJ-2016-019> and the foundation-model overview by Bommasani et al. (2021) <doi:10.48550/arXiv.2108.07258>.
License: MIT + file LICENSE
URL: https://zobaer09.github.io/FakeDataR/, https://github.com/zobaer09/FakeDataR
BugReports: https://github.com/zobaer09/FakeDataR/issues
Encoding: UTF-8
RoxygenNote: 7.3.2
Imports: dplyr, jsonlite, zip
Suggests: readr, testthat (>= 3.0.0), knitr, rmarkdown, DBI, RSQLite, tibble, nycflights13, palmerpenguins, gapminder, arrow, withr
VignetteBuilder: knitr, rmarkdown
Config/testthat/edition: 3
Language: en-US
NeedsCompilation: no
Packaged: 2025-09-30 03:48:13 UTC; Zobaer Ahmed
Author: Zobaer Ahmed [aut, cre]
Maintainer: Zobaer Ahmed <zunnun09@gmail.com>
Repository: CRAN
Date/Publication: 2025-10-06 08:10:19 UTC
Detect sensitive columns by name
Description
Uses a broad, configurable regex library to match likely PII columns. You can extend it with extra_patterns (the extra patterns are ORed in) or replace everything with a single override_regex.
Usage
detect_sensitive_columns(x_names, extra_patterns = NULL, override_regex = NULL)
Arguments
x_names
Character vector of column names to check.
extra_patterns
Character vector of additional regexes to OR in. Examples: c("MRN", "NHS", "Aadhaar", "passport").
override_regex
Optional single regex string that fully replaces the defaults (case-insensitive). When supplied, the default patterns (and any extra_patterns) are ignored.
Value
Character vector of names from x_names that matched.
Examples
detect_sensitive_columns(c("id","email","home_phone","zip","notes"))
detect_sensitive_columns(names(mtcars), extra_patterns = c("^vin$", "passport"))
Save a fake dataset to disk
Description
Save a data.frame to CSV, RDS, or Parquet based on the file extension.
Usage
export_fake(x, path)
Arguments
x
A data.frame (e.g., output of generate_fake_data()).
path
File path. Supported extensions: .csv, .rds, .parquet.
Value
(Invisibly) the path written.
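Examples
A minimal sketch (illustrative; the output format is chosen by the file extension, as described above):

```r
# Generate a small fake copy of mtcars and write it to CSV in tempdir()
fake <- generate_fake_data(mtcars, n = 10, seed = 1)
out  <- export_fake(fake, file.path(tempdir(), "fake_mtcars.csv"))
out  # (invisibly) the path written
```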
Generate Fake Data from Real Dataset Structure
Description
Generate Fake Data from Real Dataset Structure
Usage
generate_fake_data(
data,
n = 30,
category_mode = c("preserve", "generic", "custom"),
numeric_mode = c("range", "distribution"),
column_mode = c("keep", "generic", "custom"),
custom_levels = NULL,
custom_names = NULL,
seed = NULL,
verbose = FALSE,
sensitive = NULL,
sensitive_detect = TRUE,
sensitive_strategy = c("fake", "drop"),
normalize = TRUE
)
Arguments
data
A tabular object; coerced to a base data.frame (see prepare_input_data()).
n
Rows to generate (default 30).
category_mode
One of "preserve", "generic", "custom".
numeric_mode
One of "range", "distribution".
column_mode
One of "keep", "generic", "custom".
custom_levels
Optional named list of allowed levels per column (used when category_mode = "custom").
custom_names
Optional named character vector of old -> new names (used when column_mode = "custom").
seed
Optional RNG seed.
verbose
Logical; print progress.
sensitive
Optional character vector of original column names to treat as sensitive.
sensitive_detect
Logical; auto-detect common sensitive columns by name.
sensitive_strategy
One of "fake", "drop". Only applied if any sensitive columns exist.
normalize
Logical; lightly normalize inputs (trim, %→numeric, short date-times→POSIXct).
Value
A data.frame of n rows with attributes:
- name_map (named chr: original -> output)
- column_mode (chr)
- sensitive_columns (chr; original names)
- dropped_columns (chr; original names that were dropped)
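Examples
An illustrative sketch using the documented defaults (column names kept, factor levels preserved, numerics sampled within the observed range):

```r
# Fake copy of iris: same columns and types, 20 synthetic rows
fake <- generate_fake_data(iris, n = 20, seed = 42,
                           category_mode = "preserve",
                           numeric_mode  = "range")
str(fake)
attr(fake, "name_map")  # original -> output column names
```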
Generate fake data from a DB schema data.frame
Description
Generate fake data from a DB schema data.frame
Usage
generate_fake_from_schema(sch_df, n = 30, seed = NULL)
Arguments
sch_df
A data.frame returned by schema_from_db().
n
Number of rows to generate.
seed
Optional integer seed for reproducibility.
Value
A base data.frame with n rows and one column per schema entry. Column classes follow the schema type values (integer, numeric, character, logical, Date, POSIXct); missingness is injected when nullable is TRUE.
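Examples
A sketch pairing this function with schema_from_db(), which (per the sch_df argument docs) produces the expected input; guarded so it only runs when DBI and RSQLite are available:

```r
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbWriteTable(con, "iris", head(iris, 5))
  # Read only the schema, then synthesize 50 rows from it
  sch  <- schema_from_db(con, "iris")
  fake <- generate_fake_from_schema(sch, n = 50, seed = 1)
  str(fake)
}
```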
Generate a Fake POSIXct Column
Description
Create synthetic timestamps either by mimicking an existing POSIXct vector (using its range and NA rate) or by sampling uniformly between start and end.
Usage
generate_fake_posixct_column(
like = NULL,
n = NULL,
start = NULL,
end = NULL,
tz = "UTC",
na_prop = NULL
)
Arguments
like
Optional POSIXct vector to mimic. If supplied, its range and NA rate are used.
n
Number of rows to generate. Required when like is NULL.
start, end
Optional POSIXct bounds to sample between when like is NULL.
tz
Timezone to use when like is NULL. Default "UTC".
na_prop
Optional NA proportion to enforce in the output (0–1). If NULL and like is supplied, the NA rate of like is used.
Value
A POSIXct vector of length n.
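Examples
A sketch of both documented modes (uniform sampling between bounds, and mimicking an existing vector):

```r
# Sample 10 timestamps uniformly within 2024
x <- generate_fake_posixct_column(
  n     = 10,
  start = as.POSIXct("2024-01-01", tz = "UTC"),
  end   = as.POSIXct("2024-12-31", tz = "UTC")
)

# Mimic an existing POSIXct vector's range and NA rate
y <- generate_fake_posixct_column(like = x)
```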
Generate fake data with privacy controls
Description
Generates a synthetic copy of data, then optionally detects/handles sensitive columns by name. Detection uses the ORIGINAL column names and maps to output via attr(fake, "name_map") if present.
Usage
generate_fake_with_privacy(
data,
n = 30,
level = c("low", "medium", "high"),
seed = NULL,
sensitive = NULL,
sensitive_detect = TRUE,
sensitive_strategy = c("fake", "drop"),
normalize = TRUE,
sensitive_patterns = NULL,
sensitive_regex = NULL
)
Arguments
data
A data.frame (or coercible) to mirror.
n
Rows to generate (default same as input if NULL).
level
One of "low", "medium", "high".
seed
Optional RNG seed.
sensitive
Character vector of original column names to treat as sensitive.
sensitive_detect
Logical; auto-detect common sensitive columns by name.
sensitive_strategy
One of "fake" or "drop".
normalize
Logical; lightly normalize inputs.
sensitive_patterns
Optional named list of patterns to treat as sensitive (e.g., list(id = "...", email = "...", phone = "...")). Overrides defaults.
sensitive_regex
Optional fully combined regex (single string) to detect sensitive columns by name. If supplied, it is used instead of the defaults.
Value
A data.frame with attributes: sensitive_columns, dropped_columns, name_map.
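Examples
An illustrative sketch: PII-named columns in a toy data.frame are detected and dropped at the "high" level:

```r
df <- data.frame(
  email = c("a@x.com", "b@y.org"),
  phone = c("555-0100", "555-0101"),
  score = c(0.7, 0.9)
)
fake <- generate_fake_with_privacy(df, n = 5, level = "high",
                                   sensitive_strategy = "drop", seed = 1)
attr(fake, "sensitive_columns")  # original names flagged as sensitive
attr(fake, "dropped_columns")    # names removed under strategy "drop"
```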
Create a copy-paste prompt for LLMs
Description
Create a copy-paste prompt for LLMs
Usage
generate_llm_prompt(
fake_path,
schema_path = NULL,
notes = NULL,
write_file = TRUE,
path = dirname(fake_path),
filename = "README_FOR_LLM.txt"
)
Arguments
fake_path
Path to the fake data file (CSV/RDS/Parquet).
schema_path
Optional path to the JSON schema.
notes
Optional extra notes to append for the analyst/LLM.
write_file
Write a README txt next to the files? Default TRUE.
path
Output directory for the README if write_file = TRUE.
filename
README file name. Default "README_FOR_LLM.txt".
Value
The prompt string (invisibly returns the file path if written).
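Examples
A sketch chaining export_fake() with this function; with write_file = FALSE only the prompt string is returned:

```r
fake   <- generate_fake_data(mtcars, n = 10, seed = 1)
csv    <- export_fake(fake, file.path(tempdir(), "fake_mtcars.csv"))
prompt <- generate_llm_prompt(csv, notes = "Focus on mpg vs wt.",
                              write_file = FALSE)
cat(prompt)
```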
Create a fake-data bundle for LLM workflows
Description
Generates fake data, writes files (CSV/RDS/Parquet), writes a scrubbed JSON schema, and optionally writes a README prompt and a single ZIP file containing everything.
Usage
llm_bundle(
data,
n = 30,
level = c("medium", "low", "high"),
formats = c("csv", "rds"),
path = tempdir(),
filename = "fake_bundle",
seed = NULL,
write_prompt = TRUE,
zip = FALSE,
prompt_filename = "README_FOR_LLM.txt",
zip_filename = NULL,
sensitive = NULL,
sensitive_detect = TRUE,
sensitive_strategy = c("fake", "drop"),
normalize = FALSE
)
Arguments
data
A data.frame (or coercible) to mirror.
n
Number of rows in the fake dataset (default 30).
level
Privacy level: "low", "medium", or "high". Controls stricter defaults.
formats
Which data files to write: any of "csv", "rds", "parquet".
path
Folder to write outputs. Default: tempdir().
filename
Base file name (no extension). Example: "demo_bundle". This becomes files like "demo_bundle.csv", "demo_bundle.rds", etc.
seed
Optional RNG seed for reproducibility.
write_prompt
Write a README_FOR_LLM.txt next to the data? Default TRUE.
zip
Create a single zip archive containing data + schema + README? Default FALSE.
prompt_filename
Name for the README file. Default "README_FOR_LLM.txt".
zip_filename
Optional custom name for the ZIP file (no path). If NULL, a name derived from filename is used.
sensitive
Character vector of column names to treat as sensitive (optional).
sensitive_detect
Logical; auto-detect common sensitive columns (id/email/phone). Default TRUE.
sensitive_strategy
"fake" (replace with realistic fakes) or "drop". Default "fake".
normalize
Logical; if TRUE, attempt light auto-normalization before faking.
Details
Tips
Avoid using angle brackets in examples; prefer plain tokens like NAME or FILE_NAME. If you truly want bracket glyphs, use Unicode angle brackets such as ⟨name⟩.
Value
List with paths: $data_paths (named), $schema_path, $readme_path (optional), $zip_path (optional), and $fake (data.frame).
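Examples
An illustrative one-call sketch writing a CSV bundle plus schema, README, and ZIP into tempdir():

```r
out <- llm_bundle(mtcars, n = 25, level = "medium",
                  formats = "csv", path = tempdir(),
                  filename = "demo_bundle", seed = 1, zip = TRUE)
out$data_paths   # named paths to the written data files
out$schema_path  # scrubbed JSON schema
out$zip_path     # single archive containing everything
```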
Build an LLM bundle directly from a database table
Description
Reads just the schema from table
on conn
, synthesizes n
fake rows,
writes a schema JSON, fake dataset(s), and a README prompt, and optionally
zips them into a single archive.
Usage
llm_bundle_from_db(
conn,
table,
n = 30,
level = c("medium", "low", "high"),
formats = c("csv", "rds"),
path = tempdir(),
filename = "fake_from_db",
seed = NULL,
write_prompt = TRUE,
zip = FALSE,
zip_filename = NULL,
sensitive_strategy = c("fake", "drop")
)
Arguments
conn
A DBI connection.
table
Character scalar: table name to read.
n
Number of rows in the fake dataset (default 30).
level
Privacy level: "low", "medium", or "high". Controls stricter defaults.
formats
Which data files to write: any of "csv", "rds", "parquet".
path
Folder to write outputs. Default: tempdir().
filename
Base file name (no extension). Example: "demo_bundle". This becomes files like "demo_bundle.csv", "demo_bundle.rds", etc.
seed
Optional RNG seed for reproducibility.
write_prompt
Write a README_FOR_LLM.txt next to the data? Default TRUE.
zip
Create a single zip archive containing data + schema + README? Default FALSE.
zip_filename
Optional custom name for the ZIP file (no path). If NULL, a name derived from filename is used.
sensitive_strategy
"fake" (replace with realistic fakes) or "drop". Default "fake".
Value
Invisibly, a list with useful paths:
- schema_path – schema JSON
- files – vector of written fake-data files
- zip_path – zip archive path (if zip = TRUE)
Examples
if (requireNamespace("DBI", quietly = TRUE) &&
requireNamespace("RSQLite", quietly = TRUE)) {
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
on.exit(DBI::dbDisconnect(con), add = TRUE)
DBI::dbWriteTable(con, "cars", head(cars, 20), overwrite = TRUE)
out <- llm_bundle_from_db(
con, "cars",
n = 100, level = "medium",
formats = c("csv","rds"),
path = tempdir(), filename = "db_bundle",
seed = 1, write_prompt = TRUE, zip = TRUE
)
}
Prepare Input Data: Coerce to data.frame and (optionally) normalize values
Description
Converts common tabular objects to a base data.frame, and if normalize = TRUE it applies light, conservative value normalization:
- Converts common date/time strings to POSIXct (best-effort across several formats)
- Converts percent-like character columns (e.g., "85%") to numeric (85)
- Maps a configurable set of "NA-like" strings to NA, while keeping common survey responses like "not applicable" or "prefer not to answer" as real levels
- Normalizes yes/no character columns to an ordered factor c("no", "yes")
Usage
prepare_input_data(
data,
normalize = TRUE,
na_strings = c("", "NA", "N/A", "na", "No data", "no data"),
keep_as_levels = c("not applicable", "prefer not to answer", "unsure"),
percent_detect_threshold = 0.6,
datetime_formats = c("%m/%d/%Y %H:%M:%S", "%m/%d/%Y %H:%M",
"%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%dT%H:%M:%S",
"%Y-%m-%dT%H:%M", "%m/%d/%Y", "%Y-%m-%d")
)
Arguments
data
An object coercible to data.frame (e.g., a tibble).
normalize
Logical; run the value normalization step (default TRUE).
na_strings
Character vector of values that should become NA.
keep_as_levels
Character vector of values that should be kept as real levels (not converted to NA).
percent_detect_threshold
Proportion of non-missing values that must contain "%" before a character column is treated as percent-like (default 0.6).
datetime_formats
Candidate formats tried (in order) when parsing date-time strings. The best-fitting format (most successful parses) is used. Defaults cover common US and ISO 8601 date and date-time patterns.
Value
A base data.frame
.
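Examples
A sketch showing each documented normalization on a deliberately messy data.frame (percent strings, an NA-like string, and yes/no values):

```r
messy <- data.frame(
  done  = c("yes", "no", "N/A"),                          # yes/no + NA-like
  share = c("85%", "40%", "12%"),                         # percent-like
  when  = c("2024-01-05 10:30", "2024-01-06 09:15", "NA"),# date-time strings
  stringsAsFactors = FALSE
)
clean <- prepare_input_data(messy, normalize = TRUE)
str(clean)  # done -> ordered factor, share -> numeric, when -> POSIXct
```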
Extract a table schema from a DB connection
Description
Returns a data frame describing the columns of a database table.
Usage
schema_from_db(conn, table, level = c("medium", "low", "high"))
Arguments
conn
A DBI connection.
table
Character scalar: table name to introspect.
level
Privacy preset to annotate in schema metadata: one of "low", "medium", "high". Default "medium".
Value
A data.frame with column metadata (e.g., name, type).
Examples
if (requireNamespace("DBI", quietly = TRUE) &&
requireNamespace("RSQLite", quietly = TRUE)) {
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
on.exit(DBI::dbDisconnect(con), add = TRUE)
DBI::dbWriteTable(con, "mtcars", mtcars[1:3, ])
sc <- schema_from_db(con, "mtcars")
head(sc)
}
Validate a fake dataset against the original
Description
Compares classes, NA/blank proportions, and simple numeric ranges.
Usage
validate_fake(original, fake, tol = 0.15)
Arguments
original
The original data.frame.
fake
The fake data.frame (same columns).
tol
Numeric tolerance for proportion differences (default 0.15).
Value
A data.frame summarizing the comparison by column.
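Examples
A sketch validating a same-size fake copy against its source:

```r
fake   <- generate_fake_data(mtcars, n = nrow(mtcars), seed = 1)
report <- validate_fake(mtcars, fake)
head(report)  # per-column comparison of classes, NA rates, numeric ranges
```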
Zip a set of files for easy sharing
Description
Zip a set of files for easy sharing
Usage
zip_llm_bundle(files, zipfile)
Arguments
files
Character vector of file paths.
zipfile
Path to the zip file to create.
Value
The path to the created zip file.
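Examples
A minimal sketch zipping two throwaway files written to tempdir():

```r
f1 <- file.path(tempdir(), "a.csv"); write.csv(mtcars, f1, row.names = FALSE)
f2 <- file.path(tempdir(), "b.rds"); saveRDS(iris, f2)
z  <- zip_llm_bundle(c(f1, f2), file.path(tempdir(), "bundle.zip"))
z  # path to the created archive
```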