This method creates a frequency table which has had cell key perturbation applied to the counts to protect against disclosure.
Cell key perturbation adds small amounts of noise to frequency tables. Noise is added to change the counts that appear in the frequency table by small amounts, for example a 14 is changed to a 15. This noise introduces uncertainty in the counts and makes it harder to identify individuals, especially when taking the ‘difference’ between two similar tables. It protects against the risk of disclosure by differencing since it cannot be determined whether a difference between two similar tables represents a real person, or is caused by the perturbation.
Cell Key Perturbation is consistent and repeatable, so the same cells are always perturbed in the same way.
It is expected that users will tabulate 1 to 4 variables for a particular geography level - for example, tabulate age by sex at local authority level.
Full details of the methodology and statistical process flow are given in the Methodology section.
Cell Key Perturbation method is available in Python and R, each with integrated BigQuery functionality.
The BigQuery version allows users to perform perturbation without
reading raw data into local memory. The package creates the frequency
table and runs perturbation with an SQL query. Then, it converts the
final perturbed table into a
pandas.DataFrame/data.table as an output.
This will allow users to run the method on large datasets without breaking the memory limits.
pandas.DataFrame/data.table or BigQuery
tables.ons_id will be used to
create record key.ptable_census21 for census 2021.Microdata must be row-level, i.e. one row per statistical unit such as person or household. Microdata must contain one column per variable, which are expected to be categorical (they can be numeric but categorical is more suitable for frequency tables).
Record keys should already be attached to the
microdata as a column of integers in the range 0-255 or
0-4095, except certain ONS datasets with ons_id. The name
of the record key column could change in different
microdata tables. For example, record
key columns in census data tables are named as
resident_record_key, household_record_key, or
family_record_key depending on the table type.
Certain ONS datasets contain ons_id column and use it as
the basis for record keys to keep the perturbation consistent. If
ons_id is available as a column in
microdata, then record keys will be
derived from ons_id by default. (This can be switched off
by setting use_existing_ons_id = False)
The range of record keys should match the range of cell keys in the ptable. A warning message will be generated if those ranges do not match.
Cell Key Perturbation is consistent and repeatable, so the same cells are always perturbed in the same way.
The record keys need to be unchanged, changing the record keys would create inconsistent results and provide much less protection. You should use record keys attached to your microdata if provided instead of creating new ones to obtain consistent perturbation across different runs.
The perturbation table contains the parameters which
determine which cells are perturbed by how much and which are not (most
cells are perturbed by +0). The ptable contains each
possible combination of cell key (ckey)
and cell value (pcv), and the
perturbation value (pvalue) for each
combination.
A sample ptable that applies the ‘10-5 rule’ is provided with the package and works with record keys in the range 0-255. This ptable will remove all cells below the threshold of 10, and round all others to the nearest 5. This provides more protection and will ensure safe outputs.
Other ptables may be available depending on the
microdata used, for example census 2021 data will
require the ptable_census21 to be used and is based on cell
keys in the range 0-255.
You must use the specific ptable provided with the microdata
you are working with to ensure sufficient and consistent protection,
e.g. ptable_census21 for census 2021.
This method requires R version 3.5 or higher and uses the
data.table package.
You can install the released version of
cellkeyperturbation from CRAN:
In your code you can load the cell key perturbation package using:
You can call the main functions for cell key perturbation with the following parameters:
# for data.table
create_perturbed_table(data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)
# for BigQuery
create_perturbed_table_bigquery(con, data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)Parameters specific for BigQuery version:
con - (DBIConnection) - An active
BigQuery connection created with DBI::dbConnect()data - (Microdata) - a
character for the full name of micro-level
data in BigQuery in
“<PROJECT>.<DATASET>.<TABLE>” format.ptable - (Perturbation table) - a
character for the full name of ptable in
BigQuery in “<PROJECT>.<DATASET>.<TABLE>” format.Parameters specific for data.table version:
data - (Microdata) - a
data.table containing the micro-level data to
be tabulated and perturbed.ptable - (Perturbation table) - a
data.table containing the ptable file which
determines when perturbation is applied.Common parameters for both versions:
geog - (Geography) - a character
vector giving the column name in data that contains the
desired geography level you wish to tabulate at,
e.g. c("Local_Authority", "Ward"). This can be the empty
vector, geog = c(), if no geography level is required.tab_vars - (Variables to tabulate) - a
character vector giving the column names in data of the
variables to be tabulated
e.g. c("Age","Health","Occupation"). This can also be the
empty vector, tab_vars = c(). However, at least one of
tab_vars or geog must be populated. If both
are left blank an error message will be returned.record_key - a character containing
the column name in data giving the record keys required for
perturbation. If ons_id is available as a column in
data and use_existing_ons_id = TRUE, set
record_key = NULL, as record keys will be generated from
ons_id.use_existing_ons_id -
TRUE or FALSE, with a default of
TRUE. If ons_id is available as a column in
data, then record keys will be derived from
ons_id by default.threshold - the value below which a
count is suppressed (default 10).# Import libraries
library(DBI)
library(bigrquery)
# Assign the project_id
project_id <- system("gcloud config get project", intern = TRUE)
# Create the connection using DBI
con <- DBI::dbConnect(
bigrquery::bigquery(),
project = project_id,
bigint = "integer64"
)cellkeyperturbation package and call the main
function with parameters:library(cellkeyperturbation)
perturbed_table <- create_perturbed_table_bigquery(
con = con,
data = "<PROJECT>.<DATASET>.<microdata>",
ptable = "<PROJECT>.<DATASET>.<ptable>",
geog = c("Region"),
tab_vars = c("AgeGroup", "HealthStatus", "Occupation"),
record_key = "Record_Key",
threshold = 10
)
# or
perturbed_table <- create_perturbed_table_bigquery(
con = con,
data = "<PROJECT>.<DATASET>.<microdata>",
ptable = "<PROJECT>.<DATASET>.<ptable>",
geog = c("Region"),
tab_vars = c("AgeGroup", "HealthStatus", "Occupation"),
record_key = NULL,
use_existing_ons_id = True
threshold = 10
)perturbed_table is a
data.table. You need to drop disclosive columns before
exporting the output from the secure data environment. Please refer to
the Interpreting the Output and
Saving the Output sections below for
more details.data.table)This is an example showing how to create a perturbed table from
synthetic test data provided in the package (micro and
ptable_10_5). You can access and view these data tables
after loading the package.
You can also generate different sample data or generate random record keys for testing purposes for your own test data with the following code:
data = generate_test_data(size = 1000, rkey_range = 255, seed = 123)
ptable = generate_ptable_10_5_rule(ckey_range = 255)
library(data.table)
data <- fread("input_microdata.csv")
data = generate_random_rkey(data, rkey_range = 255, seed = 123)micro: A sample
data.table containing randomly generated microdata and
record keys.Example rows of a microdata table are shown below:
| record_key | var1 | var5 | var8 |
|---|---|---|---|
| 84 | 2 | 9 | D |
| 108 | 1 | 9 | C |
| 212 | 1 | 1 | D |
| 212 | 2 | 2 | A |
| 86 | 2 | 4 | A |
ptable_10_5: A sample perturbation
table (data.table) that defines the cell key perturbation
rules. This specific table applies the ’10 to 5 rule’, which means a
suppression threshold of 10 and rounding to the nearest
5. In other words, this ptable will remove all cells under
10, and round all others to the nearest 5.Example rows of a ptable are shown below:
| pcv | ckey | pvalue |
|---|---|---|
| 1 | 0 | -1 |
| 1 | 1 | -1 |
| 1 | 2 | -1 |
| … | … | … |
| 750 | 255 | 0 |
Use the following code to generate the perturbed table using the sample microdata and perturbation table provided:
The output from the code is a data.table containing a
frequency table with the counts having been affected by perturbation, as
specified in the ptable.
For most ptables, the most obvious effect will be that all counts lower than the threshold of 10 will have been removed. Suppressing counts below the threshold is a condition that need to be met when exporting data from IDS (Integrated Data Service) and many other secure environments such as SRS (Secure Research Service).
The perturbation code will treat categories for missing data in the same way as it treats other categories. If you would like to exclude missing data from your outputs, you will need to remove the missing data categories either before or after applying the perturbation.
The table will be in the following format:
| var1 | var5 | var8 | pre_sdc_count | ckey | pcv | pvalue | count |
|---|---|---|---|---|---|---|---|
| 1 | 1 | A | 10 | 173 | 10 | 0 | 10 |
| 1 | 1 | B | 10 | 88 | 10 | 0 | 10 |
| 1 | 1 | C | 7 | 180 | 7 | -7 | nan |
| 1 | 1 | D | 14 | 66 | 14 | 1 | 15 |
| 1 | 2 | A | 11 | 190 | 11 | -1 | 10 |
| … | … | … | … | … | … | … | … |
The table contains the variables used to summarise the data (in this
example var1, var5 & var8),
and five other columns:
ckey is the sum of record keys for each combination of
variables.pcv is the perturbation cell value, the
pre-perturbation count modulo 750.pre_sdc_count is the pre-perturbation count.pvalue is the perturbation applied to the original
count, most commonly it will be 0. This is obtained from the ptable
using a join on ckey and pcv.count is the post-perturbation count, the values to be
output. It will be set to NA if the value is suppressed for
being below the threshold.The columns you are most likely interested in are the variables,
which are the categories you’ve summarised by, plus the
count column.
WARNING! - The ckey, pcv,
pre_sdc_count and pvalue columns should be
dropped before the contingency table is published. Otherwise, the
perturbation can be unpicked and the output will be
disclosive.
Before the table is ready to be published the disclosive columns must be dropped. These cannot be output as they would allow for the perturbation to be unpicked. This code assumes that you have not changed the default column names; please update it if you have.
To save this dataframe as a csv file, you can use
data.table fast write function:
The package includes further help pages like Introduction to Cell Key
Perturbation vignette and documentation for each function. You can
access these pages by selecting the cellkeyperturbation
package name in the packages tab of RStudio or using:
The user is required to supply microdata and to specify which columns in the data they want to tabulate by. They must also supply a ptable which will determine which cells get perturbed and by how much.
The microdata needs to contain a column for record key. Record keys are random, uniformly distributed integers within the chosen range. Previously, record keys between 0-255 have been used (as for census-2021). The method has been extended to also handle record keys in the range 0-4095 for the purpose of processing administrative data.
It is expected that users will tabulate 1-4 variables for a particular geography level e.g. tabulate age by sex at local authority level.
The create_perturbed_table() function counts how many
rows in the data contain each combination of categories e.g. how many
respondents are of each age category in each local authority area. The
sum of the record keys for each record in each cell is
also calculated. Modulo 256 or 4096 of the sum is taken so this
cell key is within range. The table now has
perturbation cell values (pcv) and
cell keys (ckey).
The ptable is merged with the data, matching on
pcv and ckey. The merge provides a
pvalue for each cell. The post perturbation
count (count) is the pre-perturbation
count (pre_sdc_count), plus the
perturbation value (pvalue). After this
step, the counts have had the required perturbation applied. The output
is the frequency table with the post-perturbation count
(count) column. The result is that counts have been
deliberately changed based on the ptable, for the
purpose of disclosure protection.
To limit the size of the ptable, only 750 rows are
used, and rows 501-750 are used repeatedly for larger cell values. E.g.
instead of containing 100,001 rows, when the cell value is 100,001 the
501st row is used. Rows 501-750 will be used for cell values of 501-750,
as well as 751-1000, 1001-1250, 1251-1500 and so on. To achieve this
effect an alternative cell value column (pcv) is calculated
which will be between 0-750. For cell values 0-750 the pcv
will be the same as the cell value. For cell values above 750, the
values are transformed by -1, modulo 250, +501. This achieves the
looping effect so that cell values 751, 1001, 1251 and so on will have a
pcv of 501.
After cell key perturbation is applied, a threshold is applied so that any counts below the threshold will be suppressed (set to missing). The user can specify the value for the threshold, but if they do not, the default value of 10 will be applied. Setting the threshold to zero would mean no suppression is applied.
As well as specifying the level of perturbation, the
ptable can also be used to apply rounding, and a
threshold for small counts. The example
ptable supplied with this method,
ptable_10_5, applies the 10_5 rule (supressing values less
than 10 and rounding others to the nearest 5) for record
keys in the range 0-255.
The ONS Statistical Methods Library (statisticalmethodslibrary.ons.gov.uk) contains:
Unless stated otherwise, the SML codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation. The documentation is available under the terms of the Open Government 3.0 license.