---
title: "Cell Key Perturbation in R"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Cell Key Perturbation in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(cellkeyperturbation)
```

## Summary

This method creates a frequency table which has had cell key perturbation 
applied to the counts to protect against disclosure. 

Cell key perturbation adds small amounts of noise to frequency tables. 
Noise is added to change the counts that appear 
in the frequency table by small amounts, for example a 14 is changed to a 15. 
This noise introduces uncertainty in the counts and makes it harder to identify
individuals, especially when taking the 'difference' between two similar
tables. It protects against the risk of disclosure by differencing since it 
cannot be determined whether a difference between two similar tables represents 
a real person, or is caused by the perturbation.

Cell Key Perturbation is consistent and repeatable, so the same cells are 
always perturbed in the same way.

It is expected that users will tabulate 1 to 4 variables for a particular 
geography level - for example, tabulate age by sex at local authority level. 

Full details of the methodology and statistical process flow are given in the [Methodology](#methodology) section.

Cell Key Perturbation method is available in [Python](https://github.com/ONSdigital/cell-key-perturbation) and 
[R](https://github.com/ONSdigital/cell-key-perturbation-R), each with integrated BigQuery functionality.

### BigQuery

The BigQuery version allows users to perform perturbation without reading raw data into local memory. 
The package creates the frequency table and runs perturbation with an SQL query. 
Then, it converts the final perturbed table into a `pandas.DataFrame`/`data.table` as an output. 

This will allow users to run the method on large datasets without breaking the memory limits. 

### Terminology

- ***Microdata*** - Data at the level of individual respondents
- ***Record key*** - A random number assigned to each record 
- ***Cell value*** - The number of records or frequency for a cell
- ***Cell key*** - The sum of record keys for a given cell
- ***pvalue*** - Perturbation value. The value of noise added to cells, e.g. +1, -1
- ***pcv*** - Perturbation cell value. This is an amended cell value needed to merge on the ptable
- ***ptable*** - Perturbation table. The look-up file containing the pvalues, this determines which cells get perturbed and by how much.

## Requirements 

- This method requires microdata and a perturbation table (ptable) file. 
- The microdata and the ptable both need to be supplied as `pandas.DataFrame`/`data.table` or BigQuery tables.
- The microdata must include a record key variable for cell key perturbation to be applied, unless `ons_id` will be used to create record key.
- You must use the provided ptable that corresponds to your microdata, e.g. `ptable_census21` for census 2021.

### Microdata and Record Keys

**Microdata** must be row-level, i.e. one row per statistical unit such as person or household. **Microdata** must contain one column per variable, which are expected to be categorical (they can be numeric but categorical is more suitable for frequency tables). 

**Record keys** should already be attached to the **microdata** as a column of integers in the range 0-255 or 0-4095, except certain ONS datasets with `ons_id`. 
The name of the **record key** column could change in different **microdata** tables. 
For example, **record key** columns in census data tables are named as `resident_record_key`, `household_record_key`, or `family_record_key` depending on the table type.

Certain ONS datasets contain `ons_id` column and use it as the basis for record keys to keep the perturbation consistent. 
If `ons_id` is available as a column in **microdata**, then **record keys** will be derived from `ons_id` by default. 
(This can be switched off by setting `use_existing_ons_id = False`)

The range of **record keys** should match the range of **cell keys** in the **ptable**. A warning message will be generated if those ranges do not match.

Cell Key Perturbation is consistent and repeatable, so the same cells are always perturbed in the same way. 

**The record keys need to be unchanged, changing the record keys would create inconsistent results and provide much less protection. You should use record keys attached to your microdata if provided instead of creating new ones to obtain consistent perturbation across different runs.**

### Perturbation Table (P-table)

The **perturbation table** contains the parameters which determine which cells are perturbed by how much and which are not (most cells are perturbed by +0). The **ptable** contains each possible combination of **cell key** (`ckey`) and **cell value** (`pcv`), and the **perturbation value** (`pvalue`) for each combination. 

A sample **ptable** that applies the '10-5 rule' is provided with the package and works with **record keys** in the range 0-255. This **ptable** will remove all cells below the threshold of 10, and round all others to the nearest 5. This provides more protection and will ensure safe outputs.

Other **ptables** may be available depending on the **microdata** used, for example census 2021 data will require the `ptable_census21` to be used and is based on cell keys in the range 0-255.

**You must use the specific ptable provided with the microdata you are working with to ensure sufficient and consistent protection, e.g. `ptable_census21` for census 2021.**


# User Instructions

## Installing the SML method

This method requires R version 3.5 or higher and uses the `data.table` package.

You can install the released version of `cellkeyperturbation` from CRAN:

```r
install.packages("cellkeyperturbation")
```

In your code you can load the cell key perturbation package using:

```r
library(cellkeyperturbation)
```

## Using the SML method

You can call the main functions for cell key perturbation with the following parameters:
```R
# for data.table
create_perturbed_table(data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)

# for BigQuery
create_perturbed_table_bigquery(con, data, ptable, geog, tab_vars, record_key, use_existing_ons_id, threshold)
```

Parameters specific for BigQuery version:

- **`con`** - (DBIConnection) - An active BigQuery connection created with `DBI::dbConnect()`
- **`data`** - (Microdata) - a `character` for the full name of micro-level `data` in BigQuery in "\<PROJECT>.\<DATASET>.\<TABLE>" format.
- **`ptable`** - (Perturbation table) - a `character` for the full name of `ptable` in BigQuery in "\<PROJECT>.\<DATASET>.\<TABLE>" format.

Parameters specific for data.table version:

- **`data`** - (Microdata) - a `data.table` containing the micro-level `data` to be tabulated and perturbed.
- **`ptable`** - (Perturbation table) - a `data.table` containing the `ptable` file which determines when 
perturbation is applied.

Common parameters for both versions:

- **`geog`** - (Geography) - a character vector giving the column name in `data` that contains the desired geography level you wish to tabulate at, e.g. `c("Local_Authority", "Ward")`. This can be the empty vector, `geog = c()`, if no geography level is required.
- **`tab_vars`** - (Variables to tabulate) - a character vector giving the column names in `data` of the variables to be tabulated e.g. `c("Age","Health","Occupation")`. This can also be the empty vector, `tab_vars = c()`. However, at least one of `tab_vars` or `geog` must be populated. If both are left blank an error message will be returned.
- **`record_key`** - a character containing the column name in `data` giving the record keys required for perturbation. If `ons_id` is available as a column in `data` and `use_existing_ons_id = TRUE`, set `record_key = NULL`, as record keys will be generated from `ons_id`.
- **`use_existing_ons_id`** - `TRUE` or `FALSE`, with a default of `TRUE`. If `ons_id` is available as a column in `data`, then record keys will be derived from `ons_id` by default.
- **`threshold`** - the value below which a count is suppressed (default 10).


## How to Use the Method in BigQuery

1. Create a BigQuery connection:
```R
# Import libraries
library(DBI)
library(bigrquery)

# Assign the project_id
project_id <- system("gcloud config get project", intern = TRUE)

# Create the connection using DBI
con <- DBI::dbConnect(
  bigrquery::bigquery(),
  project = project_id,
  bigint = "integer64"
)
```

2. Import `cellkeyperturbation` package and call the main function with parameters:
```R
library(cellkeyperturbation)

perturbed_table <- create_perturbed_table_bigquery(
  con        = con,
  data       = "<PROJECT>.<DATASET>.<microdata>",
  ptable     = "<PROJECT>.<DATASET>.<ptable>",
  geog       = c("Region"),
  tab_vars   = c("AgeGroup", "HealthStatus", "Occupation"),
  record_key = "Record_Key",
  threshold  = 10
)

# or
perturbed_table <- create_perturbed_table_bigquery(
  con        = con,
  data       = "<PROJECT>.<DATASET>.<microdata>",
  ptable     = "<PROJECT>.<DATASET>.<ptable>",
  geog       = c("Region"),
  tab_vars   = c("AgeGroup", "HealthStatus", "Occupation"),
  record_key = NULL,
  use_existing_ons_id = True
  threshold  = 10
)
```

3. The returned `perturbed_table` is a `data.table`. You need to drop disclosive columns before exporting the output from the secure data environment. Please refer to the [Interpreting the Output](#interpreting-the-output) and [Saving the Output](#saving-the-output) sections below for more details.
```R
perturbed_table[, c("pre_sdc_count", "ckey", "pcv", "pvalue") := NULL]
```


## Worked Example with Synthetic Data (`data.table`)

This is an example showing how to create a perturbed table from synthetic test data provided in the package (`micro` and `ptable_10_5`). You can access and view these data tables after loading the package.

```R
library(cellkeyperturbation)
View(micro)
View(ptable_10_5)
```

You can also generate different sample data or generate random record keys for testing purposes for your own test data with the following code:
```R
data = generate_test_data(size = 1000, rkey_range = 255, seed = 123)
ptable = generate_ptable_10_5_rule(ckey_range = 255)

library(data.table)
data <- fread("input_microdata.csv")
data = generate_random_rkey(data, rkey_range = 255, seed = 123)
```

- **`micro`**: A sample `data.table` containing randomly generated microdata and record keys.

Example rows of a microdata table are shown below:

 | record_key  | var1  | var5  | var8  | 
 |       :---  | :---- | :---- | :---- | 
 |      84     | 2     | 9     | D     | 
 |     108     | 1     | 9     | C     | 
 |     212     | 1     | 1     | D     | 
 |     212     | 2     | 2     | A     | 
 |      86     | 2     | 4     | A     | 

- **`ptable_10_5`**: A sample perturbation table (`data.table`) that defines the cell key perturbation rules. This specific table applies the ’10 to 5 rule’, which means a suppression threshold of *10* and rounding to the nearest *5*. In other words, this ptable will remove all cells under *10*, and round all others to the nearest *5*.

Example rows of a ptable are shown below:  

 | pcv  | ckey  | pvalue |
 |:---  | :---- | :----  |
 |   1  |    0  |    -1  | 
 |   1  |    1  |    -1  | 
 |   1  |    2  |    -1  | 
 | ...  |  ...  |   ...  |    
 | 750  |  255  |     0  |

Use the following code to generate the perturbed table using the sample microdata and perturbation table provided:

```R
perturbed_table <- create_perturbed_table(
  data       = micro,
  ptable     = ptable_10_5,
  geog       = c("var1"),
  tab_vars   = c("var5","var8"),
  record_key = "record_key",
  threshold  = 10
)
```


## Interpreting the Output

The output from the code is a `data.table` containing a frequency table with 
the counts having been affected by perturbation, as specified in the ptable. 

For most ptables, the most obvious effect will be that all counts lower than the 
threshold of 10 will have been removed. Suppressing counts below the threshold is a 
condition that need to be met when exporting data from IDS (Integrated Data Service)
and many other secure environments such as SRS (Secure Research Service).

The perturbation code will treat categories for missing data in the same way as it treats other categories. 
If you would like to exclude missing data from your outputs, you will need to remove the missing data categories either before or after applying the perturbation.

The table will be in the following format:

  | var1  | var5  | var8  | pre_sdc_count | ckey  | pcv   | pvalue | count  |
  |  :--- | :---- | :---- |       :----   | :---- | :---- | :----  | :----  | 
  |  1    |   1   |   A   |         10    | 173   |  10   |    0   |   10   | 
  |  1    |   1   |   B   |         10    |  88   |  10   |    0   |   10   | 
  |  1    |   1   |   C   |          7    | 180   |   7   |   -7   |  nan   | 
  |  1    |   1   |   D   |         14    |  66   |  14   |    1   |   15   | 
  |  1    |   2   |   A   |         11    | 190   |  11   |   -1   |   10   | 
  | ...   | ...   | ...   |        ...    | ...   | ...   |  ...   |  ...   | 
  

The table contains the variables used to summarise the 
data (in this example `var1`, `var5` & `var8`), and five other columns:

- `ckey` is the sum of record keys for each combination of variables.
- `pcv` is the perturbation cell value, the pre-perturbation count modulo 750.
- `pre_sdc_count` is the pre-perturbation count.
- `pvalue` is the perturbation applied to the original count, most commonly 
it will be 0. This is obtained from the ptable using a join on `ckey` and `pcv`.
- `count` is the post-perturbation count, the values to be output. It will be
set to `NA` if the value is suppressed for being below the threshold.

The columns you are most likely interested in are the variables, which 
are the categories you've summarised by, plus the `count` column.

**WARNING! - The `ckey`, `pcv`, `pre_sdc_count` and `pvalue` columns should be dropped before the 
contingency table is published. Otherwise, the perturbation can be unpicked and the output will be disclosive.**


## Saving the Output

Before the table is ready to be published the disclosive columns must be dropped. 
These cannot be output as they would allow for the perturbation to be unpicked. 
This code assumes that you have not changed the default column names; please update it if you have.

```R
perturbed_table[, c("pre_sdc_count", "ckey", "pcv", "pvalue") := NULL]
```

To save this dataframe as a csv file, you can use `data.table` fast write function:
```R
fwrite(perturbed_table, "perturbed_table.csv")
```


## Appendix - Help Pages

The package includes further help pages like Introduction to Cell Key Perturbation vignette 
and documentation for each function. You can access these pages by selecting the 
`cellkeyperturbation` package name in the packages tab of RStudio or using: 

```r
help(package=cellkeyperturbation)
```


# Methodology

The user is required to supply **microdata** and to specify which columns in the
data they want to tabulate by. They must also supply a **ptable** which will 
determine which cells get perturbed and by how much.

The **microdata** needs to contain a column for **record key**. **Record keys** are 
random, uniformly distributed integers within the chosen range. Previously, 
**record keys** between 0-255 have been used (as for census-2021). The method has 
been extended to also handle **record keys** in the range 0-4095 for the purpose of 
processing administrative data. 

It is expected that users will tabulate 1-4 variables for a particular geography 
level e.g. tabulate age by sex at local authority level. 

The `create_perturbed_table()` function counts how many rows in the data
contain each combination of categories e.g. how many respondents are of
each age category in each local authority area. The sum of the **record
keys** for each record in each cell is also calculated. Modulo 256 or 4096
of the sum is taken so this **cell key** is within range. The table now has 
**perturbation cell values** (`pcv`) and **cell keys** (`ckey`).

The **ptable** is merged with the data, matching on `pcv` and `ckey`. The merge 
provides a `pvalue` for each cell. The **post perturbation count** (`count`) 
is the **pre-perturbation count** (`pre_sdc_count`), plus the **perturbation value**
(`pvalue`). After this step, the counts have had the required perturbation 
applied. The output is the frequency table with the **post-perturbation count** (`count`) 
column. The result is that counts have been deliberately changed based on the 
**ptable**, for the purpose of disclosure protection.

To limit the size of the **ptable**, only 750 rows are used, and rows
501-750 are used repeatedly for larger cell values. E.g. instead of
containing 100,001 rows, when the cell value is 100,001 the 501st row
is used. Rows 501-750 will be used for cell values of 501-750, as well
as 751-1000, 1001-1250, 1251-1500 and so on. To achieve this effect an
alternative cell value column (`pcv`) is calculated which will be between 0-750.
For cell values 0-750 the `pcv` will be the same as the cell value. For
cell values above 750, the values are transformed by -1, modulo 250,
+501. This achieves the looping effect so that cell values 751, 1001,
1251 and so on will have a `pcv` of 501.

After cell key perturbation is applied, a **threshold** is applied so that any 
counts below the **threshold** will be suppressed (set to missing). The user can 
specify the value for the **threshold**, but if they do not, the default value of 
10 will be applied. Setting the **threshold** to zero would mean no suppression is 
applied.

As well as specifying the level of perturbation, the **ptable** can also be used 
to apply rounding, and a **threshold** for small counts. The example **ptable** 
supplied with this method, `ptable_10_5`, applies the 10_5 rule (supressing 
values less than 10 and rounding others to the nearest 5) for **record keys** 
in the range 0-255.


# Additional Information

The ONS Statistical Methods Library (statisticalmethodslibrary.ons.gov.uk) contains:

-	Further information about the methods including a link to the GitHub 
repository which contains detailed API information as part of the method code.
-	Information about other methods available through the library.


## License

Unless stated otherwise, the SML codebase is released under the MIT License. 
This covers both the codebase and any sample code in the documentation.
The documentation is available under the terms of the Open Government 3.0 
license. 
