Cell Key Perturbation in R

library(cellkeyperturbation)

Summary

This method creates a frequency table which has had cell key perturbation applied to the counts to protect against disclosure.

Cell key perturbation adds small amounts of noise to frequency tables. Noise is added to change the counts that appear in the frequency table by small amounts, for example a 14 is changed to a 15. This noise introduces uncertainty in the counts and makes it harder to identify individuals, especially when taking the ‘difference’ between two similar tables. It protects against the risk of disclosure by differencing since it cannot be determined whether a difference between two similar tables represents a real person, or is caused by the perturbation.

Cell Key Perturbation is consistent and repeatable, so the same cells are always perturbed in the same way.

It is expected that users will tabulate 1 to 4 variables for a particular geography level - for example, tabulate age by sex at local authority level. 

Full details of the methodology and statistical process flow are given in the Methodology section.

The Cell Key Perturbation method is available in Python and R, each with integrated BigQuery functionality.

BigQuery

The BigQuery version allows users to perform perturbation without reading raw data into local memory. The package creates the frequency table and runs the perturbation with an SQL query, then returns the final perturbed table as a pandas.DataFrame (Python) or a data.table (R).

This allows users to run the method on large datasets without exceeding memory limits.

Terminology

Requirements

Microdata and Record Keys

Microdata must be row-level, i.e. one row per statistical unit such as a person or household. Microdata must contain one column per variable. Variables are expected to be categorical (numeric variables are accepted, but categorical variables are more suitable for frequency tables).

Record keys should already be attached to the microdata as a column of integers in the range 0-255 or 0-4095. The exception is certain ONS datasets that contain ons_id, described below. The name of the record key column may differ between microdata tables. For example, record key columns in census data tables are named resident_record_key, household_record_key, or family_record_key depending on the table type.

Certain ONS datasets contain an ons_id column and use it as the basis for record keys to keep the perturbation consistent. If ons_id is available as a column in the microdata, record keys are derived from ons_id by default. (This can be switched off by setting use_existing_ons_id = FALSE.)

The range of record keys should match the range of cell keys in the ptable. A warning message will be generated if those ranges do not match.
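
If you want to check the ranges yourself before running the method, the comparison can be sketched in base R as below. This is a minimal illustration, not the package's own check: infer_key_range is a hypothetical helper, and it assumes that only the two ranges named above (0-255 and 0-4095) are in use.

```r
# Hypothetical helper, not part of cellkeyperturbation: guesses which of the
# two supported key ranges (0-255 or 0-4095) a vector of keys belongs to.
infer_key_range <- function(keys) {
  top <- max(keys, na.rm = TRUE)
  if (top <= 255) 255L else if (top <= 4095) 4095L else NA_integer_
}

record_keys  <- c(84, 108, 212, 212, 86)  # record keys from the microdata
ptable_ckeys <- 0:4095                    # cell keys of a 0-4095 ptable

# Compare the inferred ranges and report a mismatch
ranges_match <- identical(infer_key_range(record_keys),
                          infer_key_range(ptable_ckeys))
if (!ranges_match) {
  message("Record key range does not match the ptable cell key range")
}
```

In this example the record keys fall in 0-255 while the ptable covers 0-4095, so the mismatch message is shown.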

Perturbation is only consistent and repeatable if the record keys stay the same, so record keys must not be changed. Changing them would create inconsistent results and provide much less protection. If record keys are already attached to your microdata, use them instead of creating new ones, so that perturbation is consistent across different runs.

Perturbation Table (P-table)

The perturbation table contains the parameters which determine which cells are perturbed by how much and which are not (most cells are perturbed by +0). The ptable contains each possible combination of cell key (ckey) and cell value (pcv), and the perturbation value (pvalue) for each combination.

A sample ptable that applies the ‘10-5 rule’ is provided with the package and works with record keys in the range 0-255. This ptable will remove all cells below the threshold of 10, and round all others to the nearest 5. This provides more protection and will ensure safe outputs.
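
To make the 10-5 rule concrete, here is a rough base R sketch of how such a ptable could be built. It is an illustration under stated assumptions, not the code behind ptable_10_5 or generate_ptable_10_5_rule: sketch_ptable_10_5 is a hypothetical function, and the column names pcv, ckey and pvalue are taken from the description above.

```r
# Illustrative only: builds a table with one row per (pcv, ckey) combination.
# Counts below 10 get pvalue = -pcv (so the perturbed count is 0);
# all other counts get the pvalue that rounds them to the nearest 5.
sketch_ptable_10_5 <- function(ckey_range = 255, max_pcv = 750) {
  ptable <- expand.grid(ckey = 0:ckey_range, pcv = 1:max_pcv)
  rounded <- 5 * round(ptable$pcv / 5)
  ptable$pvalue <- ifelse(ptable$pcv < 10, -ptable$pcv, rounded - ptable$pcv)
  ptable[, c("pcv", "ckey", "pvalue")]
}

ptable_sketch <- sketch_ptable_10_5()

# Look up a few cell values for cell key 0
ptable_sketch[ptable_sketch$ckey == 0 & ptable_sketch$pcv %in% c(7, 11, 14), ]
```

In this sketch the pvalue depends only on pcv, because the 10-5 rule is deterministic. Other ptables may assign different pvalues to different cell keys.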

Other ptables may be available depending on the microdata used. For example, census 2021 data requires ptable_census21, which is based on cell keys in the range 0-255.

You must use the specific ptable provided with the microdata you are working with to ensure sufficient and consistent protection, e.g. ptable_census21 for census 2021.

User Instructions

Installing the SML method

This method requires R version 3.5 or higher and uses the data.table package.

You can install the released version of cellkeyperturbation from CRAN:

install.packages("cellkeyperturbation")

In your code you can load the cell key perturbation package using:

library(cellkeyperturbation)

Using the SML method

You can call the main functions for cell key perturbation with the following parameters:

# for data.table
create_perturbed_table(data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)

# for BigQuery
create_perturbed_table_bigquery(con, data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)

Parameters specific for BigQuery version:

Parameters specific for data.table version:

Common parameters for both versions:

How to Use the Method in BigQuery

  1. Create a BigQuery connection:
# Import libraries
library(DBI)
library(bigrquery)

# Assign the project_id
project_id <- system("gcloud config get project", intern = TRUE)

# Create the connection using DBI
con <- DBI::dbConnect(
  bigrquery::bigquery(),
  project = project_id,
  bigint = "integer64"
)
  2. Import cellkeyperturbation package and call the main function with parameters:
library(cellkeyperturbation)

perturbed_table <- create_perturbed_table_bigquery(
  con        = con,
  data       = "<PROJECT>.<DATASET>.<microdata>",
  ptable     = "<PROJECT>.<DATASET>.<ptable>",
  geog       = c("Region"),
  tab_vars   = c("AgeGroup", "HealthStatus", "Occupation"),
  record_key = "Record_Key",
  threshold  = 10
)

# or
perturbed_table <- create_perturbed_table_bigquery(
  con        = con,
  data       = "<PROJECT>.<DATASET>.<microdata>",
  ptable     = "<PROJECT>.<DATASET>.<ptable>",
  geog       = c("Region"),
  tab_vars   = c("AgeGroup", "HealthStatus", "Occupation"),
  record_key = NULL,
  use_existing_ons_id = TRUE,
  threshold  = 10
)
  3. The returned perturbed_table is a data.table. You need to drop disclosive columns before exporting the output from the secure data environment. Please refer to the Interpreting the Output and Saving the Output sections below for more details.
perturbed_table[, c("pre_sdc_count", "ckey", "pcv", "pvalue") := NULL]

Worked Example with Synthetic Data (data.table)

This is an example showing how to create a perturbed table from synthetic test data provided in the package (micro and ptable_10_5). You can access and view these data tables after loading the package.

library(cellkeyperturbation)
View(micro)
View(ptable_10_5)

For testing purposes, you can also generate different sample data, or attach random record keys to your own test data, with the following code:

# Generate synthetic microdata and a 10-5 rule ptable for testing
data <- generate_test_data(size = 1000, rkey_range = 255, seed = 123)
ptable <- generate_ptable_10_5_rule(ckey_range = 255)

# Or read your own test data and attach random record keys to it
library(data.table)
data <- fread("input_microdata.csv")
data <- generate_random_rkey(data, rkey_range = 255, seed = 123)

Example rows of a microdata table are shown below:

record_key  var1  var5  var8
84          2     9     D
108         1     9     C
212         1     1     D
212         2     2     A
86          2     4     A

Example rows of a ptable are shown below:

pcv  ckey  pvalue
1    0     -1
1    1     -1
1    2     -1
750  255   0

Use the following code to generate the perturbed table using the sample microdata and perturbation table provided:

perturbed_table <- create_perturbed_table(
  data       = micro,
  ptable     = ptable_10_5,
  geog       = c("var1"),
  tab_vars   = c("var5","var8"),
  record_key = "record_key",
  threshold  = 10
)

Interpreting the Output

The output from the code is a data.table containing a frequency table with the counts having been affected by perturbation, as specified in the ptable.

For most ptables, the most obvious effect will be that all counts below the threshold (10 by default) have been suppressed. Suppressing counts below the threshold is a condition that needs to be met when exporting data from IDS (Integrated Data Service) and many other secure environments such as SRS (Secure Research Service).

The perturbation code will treat categories for missing data in the same way as it treats other categories. If you would like to exclude missing data from your outputs, you will need to remove the missing data categories either before or after applying the perturbation.
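
As a rough illustration of removing missing-data categories before perturbation, the base R sketch below filters the microdata rows. The data frame micro_sketch and the missing-data code "-9" are assumptions made for this example only. Check how missing values are coded in your own microdata.

```r
# Toy microdata: var5 uses the hypothetical code "-9" for missing data,
# and one value is a true NA
micro_sketch <- data.frame(
  record_key = c(84, 108, 212, 86),
  var5       = c("1", "-9", "2", NA),
  stringsAsFactors = FALSE
)

missing_codes <- c("-9")  # assumption: adjust to the codes used in your data

# Keep only rows whose var5 is neither NA nor a missing-data code
keep <- !is.na(micro_sketch$var5) & !(micro_sketch$var5 %in% missing_codes)
micro_complete <- micro_sketch[keep, ]

micro_complete
```

Only whole rows are dropped here. The record keys of the remaining rows are left unchanged, as required for consistent perturbation.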

The table will be in the following format:

var1  var5  var8  pre_sdc_count  ckey  pcv  pvalue  count
1     1     A     10             173   10   0       10
1     1     B     10             88    10   0       10
1     1     C     7              180   7    -7      nan
1     1     D     14             66    14   1       15
1     2     A     11             190   11   -1      10

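The table contains the variables used to summarise the data (in this example var1, var5 & var8), and five other columns: pre_sdc_count (the pre-perturbation count), ckey (the cell key), pcv (the perturbation cell value), pvalue (the perturbation value) and count (the post-perturbation count, set to missing where it falls below the threshold).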

The columns you are most likely interested in are the variables, which are the categories you’ve summarised by, plus the count column.

WARNING! - The ckey, pcv, pre_sdc_count and pvalue columns should be dropped before the contingency table is published. Otherwise, the perturbation can be unpicked and the output will be disclosive.

Saving the Output

Before the table is ready to be published the disclosive columns must be dropped. These cannot be output as they would allow for the perturbation to be unpicked. This code assumes that you have not changed the default column names; please update it if you have.

perturbed_table[, c("pre_sdc_count", "ckey", "pcv", "pvalue") := NULL]

To save this data.table as a CSV file, you can use fwrite, data.table's fast write function:

fwrite(perturbed_table, "perturbed_table.csv")

Appendix - Help Pages

The package includes further help pages, such as the Introduction to Cell Key Perturbation vignette and documentation for each function. You can access these pages by selecting the cellkeyperturbation package name in the Packages tab of RStudio, or by using:

help(package=cellkeyperturbation)

Methodology

The user is required to supply microdata and to specify which columns in the data they want to tabulate by. They must also supply a ptable which will determine which cells get perturbed and by how much.

The microdata needs to contain a column for record key. Record keys are random, uniformly distributed integers within the chosen range. Previously, record keys between 0-255 have been used (as for census-2021). The method has been extended to also handle record keys in the range 0-4095 for the purpose of processing administrative data.

It is expected that users will tabulate 1-4 variables for a particular geography level e.g. tabulate age by sex at local authority level.

The create_perturbed_table() function counts how many rows in the data contain each combination of categories e.g. how many respondents are of each age category in each local authority area. The sum of the record keys of all records in each cell is also calculated. This sum is then taken modulo 256 or 4096 (depending on the record key range) so that the resulting cell key is within range. The table now has perturbation cell values (pcv) and cell keys (ckey).
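
The counting and cell key step can be sketched in base R as follows. This is a simplified illustration rather than the package's implementation: micro_sketch is made-up data, rkey_sum is a hypothetical intermediate column, and record keys are assumed to be in the range 0-255 (hence modulo 256).

```r
# Toy microdata with record keys in the range 0-255
micro_sketch <- data.frame(
  record_key = c(84, 108, 212, 212, 86, 200),
  var1       = c(1, 1, 1, 2, 2, 2),
  var5       = c("A", "A", "B", "A", "A", "A")
)

# Count the rows in each cell (each combination of var1 and var5)
counts <- aggregate(record_key ~ var1 + var5, data = micro_sketch, FUN = length)
names(counts)[names(counts) == "record_key"] <- "pre_sdc_count"

# Sum the record keys in each cell
key_sums <- aggregate(record_key ~ var1 + var5, data = micro_sketch, FUN = sum)
names(key_sums)[names(key_sums) == "record_key"] <- "rkey_sum"

freq <- merge(counts, key_sums, by = c("var1", "var5"))

# Cell key: sum of record keys modulo 256, so it stays within 0-255
freq$ckey <- freq$rkey_sum %% 256

# For counts up to 750 the pcv equals the count
freq$pcv <- freq$pre_sdc_count

print(freq)
```

For the cell var1 = 2, var5 = "A" the record keys 212, 86 and 200 sum to 498, which gives a cell key of 242. Because the cell key depends only on which records are in the cell, the same cell always receives the same cell key.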

The ptable is merged with the data, matching on pcv and ckey. The merge provides a pvalue for each cell. The post perturbation count (count) is the pre-perturbation count (pre_sdc_count), plus the perturbation value (pvalue). After this step, the counts have had the required perturbation applied. The output is the frequency table with the post-perturbation count (count) column. The result is that counts have been deliberately changed based on the ptable, for the purpose of disclosure protection.
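
The merge step can be sketched in base R as below. The counts, cell keys and pvalues are copied from three rows of the example output table in the Interpreting the Output section. The object names freq and ptable_rows are chosen for this sketch, and the real ptable contains a row for every pcv and ckey combination, not just these three.

```r
# Frequency table before perturbation
freq <- data.frame(
  var1          = c(1, 1, 1),
  var5          = c(1, 1, 2),
  var8          = c("C", "D", "A"),
  pre_sdc_count = c(7, 14, 11),
  ckey          = c(180, 66, 190),
  pcv           = c(7, 14, 11)
)

# The matching ptable rows
ptable_rows <- data.frame(
  pcv    = c(7, 14, 11),
  ckey   = c(180, 66, 190),
  pvalue = c(-7, 1, -1)
)

# Match on pcv and ckey to attach a pvalue to every cell
perturbed <- merge(freq, ptable_rows, by = c("pcv", "ckey"), all.x = TRUE)

# Post-perturbation count = pre-perturbation count + perturbation value
perturbed$count <- perturbed$pre_sdc_count + perturbed$pvalue

print(perturbed)
```

The count of 14 becomes 15 and the count of 11 becomes 10. The count of 7 becomes 0 and is later suppressed by the threshold.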

To limit the size of the ptable, only 750 rows are used, and rows 501-750 are used repeatedly for larger cell values. E.g. instead of containing 100,001 rows, when the cell value is 100,001 the 501st row is used. Rows 501-750 will be used for cell values of 501-750, as well as 751-1000, 1001-1250, 1251-1500 and so on. To achieve this effect an alternative cell value column (pcv) is calculated which will be between 0-750. For cell values 0-750 the pcv will be the same as the cell value. For cell values above 750, the values are transformed by -1, modulo 250, +501. This achieves the looping effect so that cell values 751, 1001, 1251 and so on will have a pcv of 501.
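
The transformation described above can be written as a short base R function. The name cell_value_to_pcv is hypothetical and used only for this sketch. The arithmetic follows the description in the text.

```r
# Map a cell value to its perturbation cell value (pcv):
# values up to 750 are unchanged, larger values loop over rows 501-750
cell_value_to_pcv <- function(cell_value) {
  ifelse(cell_value <= 750,
         cell_value,
         (cell_value - 1) %% 250 + 501)
}

cell_value_to_pcv(c(14, 750, 751, 1000, 1001, 100001))
# returns 14 750 501 750 501 501
```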

After cell key perturbation is applied, a threshold is applied so that any counts below the threshold will be suppressed (set to missing). The user can specify the value for the threshold, but if they do not, the default value of 10 will be applied. Setting the threshold to zero would mean no suppression is applied.
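
A minimal base R sketch of the threshold step is shown below. apply_threshold is a hypothetical helper written for this illustration, not a function from the package, and it assumes that suppressed counts are represented as NA.

```r
# Set any count below the threshold to missing (NA)
apply_threshold <- function(counts, threshold = 10) {
  counts[!is.na(counts) & counts < threshold] <- NA
  counts
}

apply_threshold(c(10, 10, 0, 15, 10))
# returns 10 10 NA 15 10

# A threshold of zero leaves every count in place
apply_threshold(c(3, 0, 12), threshold = 0)
# returns 3 0 12
```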

As well as specifying the level of perturbation, the ptable can also be used to apply rounding and a threshold for small counts. The example ptable supplied with this method, ptable_10_5, applies the 10-5 rule (suppressing values less than 10 and rounding others to the nearest 5) for record keys in the range 0-255.

Additional Information

The ONS Statistical Methods Library (statisticalmethodslibrary.ons.gov.uk) contains:

License

Unless stated otherwise, the SML codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation. The documentation is available under the terms of the Open Government Licence v3.0.