Requirements

This method requires microdata and a perturbation table (ptable) file.
The microdata and the ptable both need to be supplied as pandas.DataFrame/data.table or BigQuery tables.
The microdata must include a record key variable for cell key perturbation to be applied, unless ons_id will be used to create record key.
You must use the provided ptable that corresponds to your microdata, e.g. ptable_census21 for census 2021.

Microdata and Record Keys

Microdata must be row-level, i.e. one row per statistical unit such as person or household. Microdata must contain one column per variable, which are expected to be categorical (they can be numeric but categorical is more suitable for frequency tables).

Record keys should already be attached to the microdata as a column of integers in the range 0-255 or 0-4095, except certain ONS datasets with ons_id. The name of the record key column could change in different microdata tables. For example, record key columns in census data tables are named as resident_record_key, household_record_key, or family_record_key depending on the table type.

Certain ONS datasets contain ons_id column and use it as the basis for record keys to keep the perturbation consistent. If ons_id is available as a column in microdata, then record keys will be derived from ons_id by default. (This can be switched off by setting use_existing_ons_id = False)

The range of record keys should match the range of cell keys in the ptable. A warning message will be generated if those ranges do not match.

Cell Key Perturbation is consistent and repeatable, so the same cells are always perturbed in the same way.

The record keys need to be unchanged, changing the record keys would create inconsistent results and provide much less protection. You should use record keys attached to your microdata if provided instead of creating new ones to obtain consistent perturbation across different runs.

Perturbation Table (P-table)

The perturbation table contains the parameters which determine which cells are perturbed by how much and which are not (most cells are perturbed by +0). The ptable contains each possible combination of cell key (ckey) and cell value (pcv), and the perturbation value (pvalue) for each combination.

A sample ptable that applies the ‘10-5 rule’ is provided with the package and works with record keys in the range 0-255. This ptable will remove all cells below the threshold of 10, and round all others to the nearest 5. This provides more protection and will ensure safe outputs.

Other ptables may be available depending on the microdata used, for example census 2021 data will require the ptable_census21 to be used and is based on cell keys in the range 0-255.

You must use the specific ptable provided with the microdata you are working with to ensure sufficient and consistent protection, e.g. ptable_census21 for census 2021.

User Instructions

Installing the SML method

This method requires R version 3.5 or higher and uses the data.table package.

You can install the released version of cellkeyperturbation from CRAN:

install.packages("cellkeyperturbation")

In your code you can load the cell key perturbation package using:

library(cellkeyperturbation)

Using the SML method

You can call the main functions for cell key perturbation with the following parameters:

# for data.table
create_perturbed_table(data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)

# for BigQuery
create_perturbed_table_bigquery(con, data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)

Parameters specific for BigQuery version:

con - (DBIConnection) - An active BigQuery connection created with DBI::dbConnect()
data - (Microdata) - a character for the full name of micro-level data in BigQuery in “<PROJECT>.<DATASET>.<TABLE>” format.
ptable - (Perturbation table) - a character for the full name of ptable in BigQuery in “<PROJECT>.<DATASET>.<TABLE>” format.

Parameters specific for data.table version:

data - (Microdata) - a data.table containing the micro-level data to be tabulated and perturbed.
ptable - (Perturbation table) - a data.table containing the ptable file which determines when perturbation is applied.

Common parameters for both versions:

geog - (Geography) - a character vector giving the column name in data that contains the desired geography level you wish to tabulate at, e.g. c("Local_Authority", "Ward"). This can be the empty vector, geog = c(), if no geography level is required.
tab_vars - (Variables to tabulate) - a character vector giving the column names in data of the variables to be tabulated e.g. c("Age","Health","Occupation"). This can also be the empty vector, tab_vars = c(). However, at least one of tab_vars or geog must be populated. If both are left blank an error message will be returned.
record_key - a character containing the column name in data giving the record keys required for perturbation. If ons_id is available as a column in data and use_existing_ons_id = TRUE, set record_key = NULL, as record keys will be generated from ons_id.
use_existing_ons_id - TRUE or FALSE, with a default of TRUE. If ons_id is available as a column in data, then record keys will be derived from ons_id by default.
threshold - the value below which a count is suppressed (default 10).

How to Use the Method in BigQuery

Create a BigQuery connection:

# Import libraries
library(DBI)
library(bigrquery)

# Assign the project_id
project_id <- system("gcloud config get project", intern = TRUE)

# Create the connection using DBI
con <- DBI::dbConnect(
  bigrquery::bigquery(),
  project = project_id,
  bigint = "integer64"
)

Import cellkeyperturbation package and call the main function with parameters:

library(cellkeyperturbation)

perturbed_table <- create_perturbed_table_bigquery(
  con        = con,
  data       = "<PROJECT>.<DATASET>.<microdata>",
  ptable     = "<PROJECT>.<DATASET>.<ptable>",
  geog       = c("Region"),
  tab_vars   = c("AgeGroup", "HealthStatus", "Occupation"),
  record_key = "Record_Key",
  threshold  = 10
)

# or
perturbed_table <- create_perturbed_table_bigquery(
  con        = con,
  data       = "<PROJECT>.<DATASET>.<microdata>",
  ptable     = "<PROJECT>.<DATASET>.<ptable>",
  geog       = c("Region"),
  tab_vars   = c("AgeGroup", "HealthStatus", "Occupation"),
  record_key = NULL,
  use_existing_ons_id = True
  threshold  = 10
)

The returned perturbed_table is a data.table. You need to drop disclosive columns before exporting the output from the secure data environment. Please refer to the Interpreting the Output and Saving the Output sections below for more details.

perturbed_table[, c("pre_sdc_count", "ckey", "pcv", "pvalue") := NULL]

Worked Example with Synthetic Data (`data.table`)

This is an example showing how to create a perturbed table from synthetic test data provided in the package (micro and ptable_10_5). You can access and view these data tables after loading the package.

library(cellkeyperturbation)
View(micro)
View(ptable_10_5)

You can also generate different sample data or generate random record keys for testing purposes for your own test data with the following code:

data = generate_test_data(size = 1000, rkey_range = 255, seed = 123)
ptable = generate_ptable_10_5_rule(ckey_range = 255)

library(data.table)
data <- fread("input_microdata.csv")
data = generate_random_rkey(data, rkey_range = 255, seed = 123)

micro: A sample data.table containing randomly generated microdata and record keys.

Example rows of a microdata table are shown below:

record_key	var1	var5	var8
84	2	9	D
108	1	9	C
212	1	1	D
212	2	2	A
86	2	4	A

ptable_10_5: A sample perturbation table (data.table) that defines the cell key perturbation rules. This specific table applies the ’10 to 5 rule’, which means a suppression threshold of 10 and rounding to the nearest 5. In other words, this ptable will remove all cells under 10, and round all others to the nearest 5.

Example rows of a ptable are shown below:

pcv	ckey	pvalue
1	0	-1
1	1	-1
1	2	-1
…	…	…
750	255	0

Use the following code to generate the perturbed table using the sample microdata and perturbation table provided:

perturbed_table <- create_perturbed_table(
  data       = micro,
  ptable     = ptable_10_5,
  geog       = c("var1"),
  tab_vars   = c("var5","var8"),
  record_key = "record_key",
  threshold  = 10
)

Interpreting the Output

The output from the code is a data.table containing a frequency table with the counts having been affected by perturbation, as specified in the ptable.

For most ptables, the most obvious effect will be that all counts lower than the threshold of 10 will have been removed. Suppressing counts below the threshold is a condition that need to be met when exporting data from IDS (Integrated Data Service) and many other secure environments such as SRS (Secure Research Service).

The perturbation code will treat categories for missing data in the same way as it treats other categories. If you would like to exclude missing data from your outputs, you will need to remove the missing data categories either before or after applying the perturbation.

The table will be in the following format:

var1	var5	var8	pre_sdc_count	ckey	pcv	pvalue	count
1	1	A	10	173	10	0	10
1	1	B	10	88	10	0	10
1	1	C	7	180	7	-7	nan
1	1	D	14	66	14	1	15
1	2	A	11	190	11	-1	10
…	…	…	…	…	…	…	…

The table contains the variables used to summarise the data (in this example var1, var5 & var8), and five other columns:

ckey is the sum of record keys for each combination of variables.
pcv is the perturbation cell value, the pre-perturbation count modulo 750.
pre_sdc_count is the pre-perturbation count.
pvalue is the perturbation applied to the original count, most commonly it will be 0. This is obtained from the ptable using a join on ckey and pcv.
count is the post-perturbation count, the values to be output. It will be set to NA if the value is suppressed for being below the threshold.

The columns you are most likely interested in are the variables, which are the categories you’ve summarised by, plus the count column.

WARNING! - The ckey, pcv, pre_sdc_count and pvalue columns should be dropped before the contingency table is published. Otherwise, the perturbation can be unpicked and the output will be disclosive.

Saving the Output

Before the table is ready to be published the disclosive columns must be dropped. These cannot be output as they would allow for the perturbation to be unpicked. This code assumes that you have not changed the default column names; please update it if you have.

perturbed_table[, c("pre_sdc_count", "ckey", "pcv", "pvalue") := NULL]

To save this dataframe as a csv file, you can use data.table fast write function:

fwrite(perturbed_table, "perturbed_table.csv")

Appendix - Help Pages

The package includes further help pages like Introduction to Cell Key Perturbation vignette and documentation for each function. You can access these pages by selecting the cellkeyperturbation package name in the packages tab of RStudio or using:

help(package=cellkeyperturbation)

Methodology

The user is required to supply microdata and to specify which columns in the data they want to tabulate by. They must also supply a ptable which will determine which cells get perturbed and by how much.

The microdata needs to contain a column for record key. Record keys are random, uniformly distributed integers within the chosen range. Previously, record keys between 0-255 have been used (as for census-2021). The method has been extended to also handle record keys in the range 0-4095 for the purpose of processing administrative data.

It is expected that users will tabulate 1-4 variables for a particular geography level e.g. tabulate age by sex at local authority level.

The create_perturbed_table() function counts how many rows in the data contain each combination of categories e.g. how many respondents are of each age category in each local authority area. The sum of the record keys for each record in each cell is also calculated. Modulo 256 or 4096 of the sum is taken so this cell key is within range. The table now has perturbation cell values (pcv) and cell keys (ckey).

The ptable is merged with the data, matching on pcv and ckey. The merge provides a pvalue for each cell. The post perturbation count (count) is the pre-perturbation count (pre_sdc_count), plus the perturbation value (pvalue). After this step, the counts have had the required perturbation applied. The output is the frequency table with the post-perturbation count (count) column. The result is that counts have been deliberately changed based on the ptable, for the purpose of disclosure protection.

To limit the size of the ptable, only 750 rows are used, and rows 501-750 are used repeatedly for larger cell values. E.g. instead of containing 100,001 rows, when the cell value is 100,001 the 501st row is used. Rows 501-750 will be used for cell values of 501-750, as well as 751-1000, 1001-1250, 1251-1500 and so on. To achieve this effect an alternative cell value column (pcv) is calculated which will be between 0-750. For cell values 0-750 the pcv will be the same as the cell value. For cell values above 750, the values are transformed by -1, modulo 250, +501. This achieves the looping effect so that cell values 751, 1001, 1251 and so on will have a pcv of 501.

After cell key perturbation is applied, a threshold is applied so that any counts below the threshold will be suppressed (set to missing). The user can specify the value for the threshold, but if they do not, the default value of 10 will be applied. Setting the threshold to zero would mean no suppression is applied.

As well as specifying the level of perturbation, the ptable can also be used to apply rounding, and a threshold for small counts. The example ptable supplied with this method, ptable_10_5, applies the 10_5 rule (supressing values less than 10 and rounding others to the nearest 5) for record keys in the range 0-255.

Cell Key Perturbation in R

Summary

BigQuery

Terminology

Requirements

Microdata and Record Keys

Perturbation Table (P-table)

User Instructions

Installing the SML method

Using the SML method

How to Use the Method in BigQuery

Worked Example with Synthetic Data (`data.table`)

Interpreting the Output

Saving the Output

Appendix - Help Pages

Methodology

Additional Information

License

var1	var5	var8	pre_sdc_count	ckey	pcv	pvalue	count
1	1	A	10	173	10	0	10
1	1	B	10	88	10	0	10
1	1	C	7	180	7	-7	nan
1	1	D	14	66	14	1	15
1	2	A	11	190	11	-1	10
…	…	…	…	…	…	…	…

var1	var5	var8	pre_sdc_count	ckey	pcv	pvalue	count
1	1	A	10	173	10	0	10
1	1	B	10	88	10	0	10
1	1	C	7	180	7	-7	nan
1	1	D	14	66	14	1	15
1	2	A	11	190	11	-1	10
…	…	…	…	…	…	…	…

Cell Key Perturbation in R

Summary

BigQuery

Terminology

Requirements

Microdata and Record Keys

Perturbation Table (P-table)

User Instructions

Installing the SML method

Using the SML method

How to Use the Method in BigQuery

Worked Example with Synthetic Data (data.table)

Interpreting the Output

Saving the Output

Appendix - Help Pages

Methodology

Additional Information

License

Worked Example with Synthetic Data (`data.table`)

var1	var5	var8	pre_sdc_count	ckey	pcv	pvalue	count
1	1	A	10	173	10	0	10
1	1	B	10	88	10	0	10
1	1	C	7	180	7	-7	nan
1	1	D	14	66	14	1	15
1	2	A	11	190	11	-1	10
…	…	…	…	…	…	…	…