| Title: | Cell Key Perturbation |
| Version: | 3.0.0 |
| Description: | Provides functions to generate frequency tables and apply cell key perturbation to protect against statistical disclosure in tabular outputs. The implemented methods are described in "Cell Key Perturbation User Guide" https://github.com/ONSdigital/cell-key-perturbation-R/blob/main/documentation/SML_UserDoc_CKP_R.md. Developed at the UK Office for National Statistics. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Suggests: | bigrquery, DBI, knitr, rmarkdown, testthat (≥ 3.0.0) |
| Config/testthat/edition: | 3 |
| Imports: | data.table |
| Depends: | R (≥ 3.5.0) |
| LazyData: | true |
| VignetteBuilder: | knitr |
| URL: | https://github.com/ONSdigital/cell-key-perturbation-R |
| BugReports: | https://github.com/ONSdigital/cell-key-perturbation-R/issues |
| NeedsCompilation: | no |
| Packaged: | 2026-05-21 10:57:32 UTC; aydina |
| Author: | Iain Dove [aut, cph], Ahmet Aydin [aut, cre] |
| Maintainer: | Ahmet Aydin <SDC.Queries@ons.gov.uk> |
| Repository: | CRAN |
| Date/Publication: | 2026-05-28 11:40:06 UTC |
Check perturbed table for missingness in tabulation variables
Description
Check perturbed table for missingness in tabulation variables
Usage
check_for_na(DT, cols)
Arguments
DT |
– |
cols |
– |
Value
A warning message if any tabulation variable contains missing values
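The check described above can be sketched in base R. This is only an illustration under stated assumptions: check_for_na_sketch() is a hypothetical helper, not the package's implementation, and the wording of the warning is invented.

```r
# Hypothetical helper (not the package's code): warn when any of the
# named columns contains missing values.
check_for_na_sketch <- function(df, cols) {
  # Count NA values in each requested column
  na_counts <- vapply(cols, function(col) sum(is.na(df[[col]])), integer(1))
  cols_with_na <- names(na_counts)[na_counts > 0]
  if (length(cols_with_na) > 0) {
    warning("Missing values found in: ", paste(cols_with_na, collapse = ", "))
  }
  invisible(cols_with_na)
}

toy <- data.frame(var1 = c(1, NA, 3), var2 = c("A", "B", "C"))

# Capture the warning text instead of letting it print
msg <- tryCatch(
  check_for_na_sketch(toy, c("var1", "var2")),
  warning = function(w) conditionMessage(w)
)
print(msg)
```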
Create a frequency table with cell key perturbation applied
Description
create_perturbed_table() creates a frequency table whose counts have had
cell key perturbation applied.
A p-table must be supplied, which determines which cells are
perturbed.
The data must contain a 'record key' variable which, together with the
p-table, makes the process repeatable and consistent.
Usage
create_perturbed_table(
data,
ptable,
geog,
tab_vars,
record_key,
use_existing_ons_id = TRUE,
threshold = 10
)
Arguments
data |
The data should contain one row per statistical unit (person, household, business or other) and one column per variable (for example age, sex, health status) |
ptable |
A |
geog |
A |
tab_vars |
A |
record_key |
A |
use_existing_ons_id |
A |
threshold |
An |
Value
Returns a data.table giving a frequency table which has had
cell key perturbation applied according to the ptable supplied.
Examples
if (requireNamespace("data.table", quietly = TRUE)) {
data.table::setDTthreads(1)
}
geog <- "var1"
tab_vars <- c("var5","var8")
record_key <- "record_key"
perturbed_table <- create_perturbed_table(micro,
ptable_10_5,
geog,
tab_vars,
record_key)
# Alternatively
perturbed_table <- create_perturbed_table(data = micro,
ptable = ptable_10_5,
geog = c(),
tab_vars = c("var1","var5","var8"),
record_key = "record_key",
threshold = 10)
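To show why a record key makes the result repeatable, here is a minimal base R sketch of how a cell key could be derived for each cell. It assumes the cell key is the sum of the record keys in a cell taken modulo 256 (record keys 0-255); that rule, the toy data and the column names are assumptions for illustration and are not taken from the package source.

```r
# Toy microdata; values and column names are illustrative only
toy <- data.frame(
  region     = c("N", "N", "N", "S", "S"),
  sex        = c("F", "F", "M", "F", "F"),
  record_key = c(200L, 100L, 17L, 255L, 3L)
)

# Number of records in each cell
counts <- aggregate(record_key ~ region + sex, data = toy, FUN = length)
names(counts)[3] <- "count"

# Sum of record keys in each cell
key_sums <- aggregate(record_key ~ region + sex, data = toy, FUN = sum)

cells <- merge(counts, key_sums, by = c("region", "sex"))

# Assumed rule: cell key = summed record keys modulo 256
cells$ckey <- cells$record_key %% 256L
print(cells)
```

Because the cell key depends only on the records that fall in the cell, the same cell receives the same key, and therefore the same perturbation, every time it is tabulated.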
Create a perturbed frequency table in BigQuery and return it as a data frame
Description
This function runs the perturbation method fully in BigQuery (via SQL) and only downloads the result, which allows handling large datasets efficiently.
Usage
create_perturbed_table_bigquery(
con,
data,
ptable,
geog,
tab_vars,
record_key,
use_existing_ons_id = TRUE,
threshold = 10,
return_query = FALSE
)
Arguments
con |
– |
data |
– |
ptable |
– |
geog |
– |
tab_vars |
– |
record_key |
– |
use_existing_ons_id |
– |
threshold |
– |
return_query |
– |
Details
Function workflow:
Generate BigQuery SQL query to run perturbation on BigQuery
If return_query = TRUE, return the query text and exit; otherwise, continue with the remaining steps
Validate inputs using BigQuery
Run perturbation using BigQuery
Convert perturbed table to data.table and sort
The query built by this function does the following when executed:
Computes counts and cell keys for each unique combination of geographic and tabulation variables.
Includes zero-count cells by generating the full cartesian product of variable combinations.
Calculates the pcv so that rows 501-750 of the ptable are reused for cell values above 750.
Applies perturbation values from a perturbation table based on cell keys and pseudo cell values (pcv).
Suppresses cells below a specified threshold by setting their perturbed count to NULL.
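The pcv step in the list above can be sketched in base R. This is a minimal illustration that assumes a p-table covering pcv 1-750 and assumes counts above 750 cycle through rows 501-750 with a period of 250; pseudo_cell_value() is a hypothetical name, and the query generated by the package may express the rule differently.

```r
# Hypothetical helper: map a cell count to a pseudo cell value (pcv).
# Assumption: counts up to 750 map to themselves, larger counts wrap
# around within 501..750.
pseudo_cell_value <- function(count) {
  ifelse(count <= 750, count, (count - 501) %% 250 + 501)
}

counts <- c(5, 750, 751, 1000, 1001)
pcv <- pseudo_cell_value(counts)
print(data.frame(count = counts, pcv = pcv))
```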
Value
When return_query = FALSE: a data.table containing the perturbed frequency table, sorted by geog and tab_vars.
When return_query = TRUE: a character string containing the query.
Examples
# --- Return query text without executing it ---
query <- create_perturbed_table_bigquery(
con = NULL,
data = "my-gcp-project.survey.microdata",
ptable = "my-gcp-project.sdc.ptable",
geog = c("Region"),
tab_vars = c("AgeGroup", "HealthStatus", "Occupation"),
record_key = "Record_Key",
threshold = 10,
return_query = TRUE
)
cat(query)
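The generated query includes zero-count cells by building the full cartesian product of variable combinations. The same idea can be sketched locally in base R; the toy data and column names below are illustrative, and this is not the SQL the function produces.

```r
# Toy microdata with one unobserved combination (region "S", age "old")
toy <- data.frame(
  region = c("N", "N", "S"),
  age    = c("young", "old", "young"),
  stringsAsFactors = FALSE
)
toy$n <- 1L

# Counts for the combinations that actually occur
observed <- aggregate(n ~ region + age, data = toy, FUN = sum)

# Every possible combination of the tabulation variables
all_cells <- expand.grid(
  region = unique(toy$region),
  age    = unique(toy$age),
  stringsAsFactors = FALSE
)

# Left join, then turn unmatched combinations into zero counts
full <- merge(all_cells, observed, by = c("region", "age"), all.x = TRUE)
full$n[is.na(full$n)] <- 0L
print(full)
```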
Generate ptable (10-5 rule)
Description
generate_ptable_10_5_rule() generates a sample p-table based on the 10-5 rule,
which means a suppression threshold of 10 and rounding to the nearest 5.
Usage
generate_ptable_10_5_rule(max_pcv = 750, ckey_range = 255)
Arguments
max_pcv |
Max value for pcv. Default is 750. |
ckey_range |
The max range for cell keys. Default is 255. |
Value
A data.table assigning a pvalue to each ckey and pcv combination
Examples
if (requireNamespace("data.table", quietly = TRUE)) {
data.table::setDTthreads(1)
}
ptable <- generate_ptable_10_5_rule()
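As a rough picture of the table's shape, the sketch below builds only the (pcv, ckey) grid implied by the default arguments, assuming pcv runs from 1 to max_pcv and ckey from 0 to ckey_range. It does not reproduce how the package assigns a pvalue to each row.

```r
max_pcv <- 750
ckey_range <- 255

# One row per (pcv, ckey) combination; pvalue is deliberately left out
grid <- expand.grid(pcv = 1:max_pcv, ckey = 0:ckey_range)

nrow(grid)  # 192000, i.e. 750 * 256
```

The row count agrees with the 192000 observations documented for the bundled ptable_10_5 data set.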
Generate and attach random record keys to microdata
Description
generate_random_rkey() attaches randomly generated record keys to microdata
tables for testing purposes.
Usage
generate_random_rkey(data, rkey_range = 255, seed = NULL)
Arguments
data |
A data.table or data.frame containing the microdata |
rkey_range |
The max range for record keys. Default is 255. |
seed |
A seed for the random number generator |
Value
A data.table with a new integer column record_key
Examples
library(data.table)
data <- data.table(id = 1:1000)
data <- generate_random_rkey(data, rkey_range = 255, seed = 2005)
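For readers who want to see the idea without the package, a minimal base R sketch follows. It assumes keys are drawn uniformly from 0 to rkey_range inclusive; add_random_rkey_sketch() is a hypothetical helper, and its output will not match generate_random_rkey() for the same seed.

```r
# Hypothetical helper (not the package's code): attach random record keys
add_random_rkey_sketch <- function(df, rkey_range = 255, seed = NULL) {
  # Seed the generator only when a seed is supplied
  if (!is.null(seed)) set.seed(seed)
  df$record_key <- sample(0:rkey_range, nrow(df), replace = TRUE)
  df
}

toy <- add_random_rkey_sketch(data.frame(id = 1:1000), seed = 2005)
range(toy$record_key)
```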
Generate Record Key from ONS ID
Description
This function creates a new record key column by taking the modulo 4096 of
the ons_id column. It converts ons_id to numeric, preserving NA for
non-numeric values, and assigns the result as an integer.
Usage
generate_record_key_from_ons_id(data, record_key_col)
Arguments
data |
A data.table containing an ons_id column. |
record_key_col |
A character string specifying the name of the new record key column to create. |
Details
The function checks that data is a data.table.
Non-numeric values in ons_id are converted to NA.
The record key is computed as ons_id %% 4096 and stored as integer.
Value
A data.table with the new record key column added.
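The computation described in Details can be sketched in base R as follows. The toy ons_id values are invented, and a plain data.frame is used here rather than a data.table to keep the sketch dependency-free.

```r
# Toy data: ons_id stored as text, including one non-numeric value
toy <- data.frame(
  ons_id = c("4096", "4097", "12345", "abc"),
  stringsAsFactors = FALSE
)

# Non-numeric values become NA; the coercion warning is silenced
ons_numeric <- suppressWarnings(as.numeric(toy$ons_id))

# Record key = ons_id modulo 4096, stored as integer
toy$record_key <- as.integer(ons_numeric %% 4096)
print(toy)
```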
Generate sample microdata
Description
generate_test_data() creates sample microdata containing randomly
generated columns and record keys for testing purposes.
Note: You can set a seed for the random number generator to obtain the same output across runs. However, the sample microdata included in the package will differ from this output, because it was generated with the corresponding Python package for consistency in test output.
Usage
generate_test_data(size = 1000, rkey_range = 255, seed = NULL)
Arguments
size |
Number of rows in the sample microdata. Default is 1000. |
rkey_range |
The max range for record keys. Default is 255. |
seed |
A seed for the random number generator |
Value
A data.table containing randomly generated microdata and record keys
Examples
if (requireNamespace("data.table", quietly = TRUE)) {
data.table::setDTthreads(1)
}
data <- generate_test_data(size = 1000)
data <- generate_test_data(size = 1000, rkey_range = 255, seed = 111)
Example data (micro)
Description
A data set containing randomly generated data to showcase the cell key perturbation method.
Usage
data(micro)
Format
A data.table containing 1000 observations of 11 variables
Details
record_key. record key value (0-255)
var1. example variable 1 (1-5)
var2. example variable 2 (1,2)
var3. example variable 3 (1-4)
var4. example variable 4 (1-4)
var5. example variable 5 (1-10)
var6. example variable 6 (1-5)
var7. example variable 7 (1-5)
var8. example variable 8 (A-D)
var9. example variable 9 (A-H)
var10. example variable 10 (1-49)
Perturbation table
Description
A data set containing the rules to apply cell key perturbation with a threshold of 10, and rounding to base 5. In other words, counts less than 10 will be removed, and all others will be rounded to the nearest 5.
Usage
data(ptable_10_5)
Format
A data.table containing 192000 observations of 3 variables
Details
pcv. perturbation cell value (1-750)
ckey. cell key value (0-255)
pvalue. perturbation value to be applied
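A minimal base R sketch of how such a table could be used: each cell's (pcv, ckey) pair is looked up and the matching pvalue is added to the count. The toy p-table below is invented for illustration and its pvalue entries are not taken from ptable_10_5.

```r
# Toy p-table; the pvalue entries are made up for this sketch
toy_ptable <- data.frame(
  pcv    = c(12L, 12L, 13L, 13L),
  ckey   = c(0L, 1L, 0L, 1L),
  pvalue = c(-2L, 3L, 2L, -3L)
)

# Two cells with their counts and cell keys
cells <- data.frame(count = c(12L, 13L), ckey = c(1L, 0L))
cells$pcv <- cells$count  # counts up to 750 are used directly as pcv

# Look up the perturbation value for each (pcv, ckey) pair
looked_up <- merge(cells, toy_ptable, by = c("pcv", "ckey"), all.x = TRUE)
looked_up$perturbed <- looked_up$count + looked_up$pvalue
print(looked_up)
```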
Validate Inputs Before Perturbation
Description
Validates inputs for a perturbation process.
Validate type of input data & ptable
Validate other input arguments
Check that at least one variable is specified for geog or tab_vars
Check geog and tab_vars are either character vectors or NULL
Check specified record_key is character vector or NULL
Check threshold is an integer and non-negative
Validate microdata and ptable contain required columns
Check data contains the specified geog, tab_vars & record_key
Check ptable contains required columns
Validate the range of record keys and cell keys
Validate data has sufficient records with record keys to apply perturbation
Usage
validate_inputs(data, ptable, geog, tab_vars, record_key, threshold)
Arguments
data |
The data should contain one row per statistical unit (person, household, business or other) and one column per variable (for example age, sex, health status) |
ptable |
A |
geog |
A |
tab_vars |
A |
record_key |
A |
threshold |
An |
Value
Invisibly returns TRUE on success. Stops with an error or issues a warning if any validation fails.
Examples
if (requireNamespace("data.table", quietly = TRUE)) {
data.table::setDTthreads(1)
}
validate_inputs(data = micro,
ptable = ptable_10_5,
record_key = "record_key",
geog = c("var1"),
tab_vars = c("var5","var8"),
threshold = 10)
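A few of the argument checks listed in the Description can be sketched in base R. validate_args_sketch() is a hypothetical helper with invented error messages; it covers only some of the checks and is not the package's implementation.

```r
# Hypothetical helper (not the package's code): a subset of the checks
validate_args_sketch <- function(data, geog, tab_vars, record_key, threshold) {
  # At least one variable must be given in geog or tab_vars
  if (length(geog) == 0 && length(tab_vars) == 0) {
    stop("Specify at least one variable in geog or tab_vars.")
  }
  # geog, tab_vars and record_key must be character vectors or NULL
  for (arg in list(geog, tab_vars, record_key)) {
    if (!is.null(arg) && !is.character(arg)) {
      stop("geog, tab_vars and record_key must be character vectors or NULL.")
    }
  }
  # threshold must be a single non-negative whole number
  if (!is.numeric(threshold) || length(threshold) != 1 || is.na(threshold) ||
      threshold < 0 || threshold != round(threshold)) {
    stop("threshold must be a non-negative integer.")
  }
  # All named columns must exist in the data
  missing_cols <- setdiff(c(geog, tab_vars, record_key), names(data))
  if (length(missing_cols) > 0) {
    stop("Columns not found in data: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

toy <- data.frame(var1 = 1:3, var5 = 4:6, record_key = c(10L, 20L, 30L))

ok <- validate_args_sketch(toy, "var1", "var5", "record_key", 10)

# A misspelt column name triggers an error, captured here as text
err <- tryCatch(
  validate_args_sketch(toy, "var1", "missing_var", "record_key", 10),
  error = function(e) conditionMessage(e)
)
print(err)
```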
Validate Inputs Before Perturbation using BigQuery
Description
Validates BigQuery inputs for a perturbation process.
Validate input arguments
Check that at least one variable is specified for geog or tab_vars
Check geog and tab_vars are either character vectors or NULL
Check specified record_key is character vector or NULL
Check threshold is an integer and non-negative
Validate microdata and ptable contain required columns
Check data contains the specified geog, tab_vars & record_key
Check ptable contains required columns
Validate the range of record keys and cell keys
Validate data has sufficient records with record keys to apply perturbation
Usage
validate_inputs_bigquery(
con,
data,
ptable,
geog,
tab_vars,
record_key,
use_existing_ons_id,
threshold
)
Arguments
con |
– |
data |
– |
ptable |
– |
geog |
– |
tab_vars |
– |
record_key |
– |
use_existing_ons_id |
– |
threshold |
– |
Value
Invisibly returns TRUE on success. Stops with an error or issues a warning if any validation fails.