Genomic Privacy

Purpose

Probe masking is important for preventing leakage of private genetic information. The goal of data sanitization is to modify IDAT files in place so that they can be released to the public domain without a privacy leak. This is achieved by de-identification. The functions described below require the R package SeSAMe.

Let’s take DNA methylation data from the HM450 platform as an example.

De-identify by Masking

This first method of de-identification masks the SNP probes by setting their mean intensity to zero. As a consequence, the allele frequency reported for every SNP probe will be 0.5.

## rs10033147  rs1019916  rs1040870 rs10457834 rs10774834 rs10796216 
## 0.48831587 0.44527168 0.85344189 0.90835267 0.04676394 0.45331875
## rs10033147  rs1019916  rs1040870 rs10457834 rs10774834 rs10796216 
##        0.5        0.5        0.5        0.5        0.5        0.5

Note that before deIdentify, the rs probes all have different values. After deIdentify, every rs probe is masked and reports the same value of 0.5.
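To see why zeroed intensities lead to a value of 0.5, one can work through a minimal sketch in base R. It assumes that the beta value is the methylated signal over the total signal, with a floor of 1 on the numerator and 2 on the denominator to avoid dividing zero by zero. The helper name beta_value and the intensities are made up for illustration; this is not a call into SeSAMe itself.

```r
# Hypothetical helper: beta value from methylated (M) and unmethylated (U)
# intensities. The floors of 1 and 2 are an assumption that avoids 0/0.
beta_value <- function(M, U) {
    pmax(M, 1) / pmax(M + U, 2)
}

# Made-up intensities for three SNP probes
M <- c(rs_a = 4200, rs_b = 150, rs_c = 2100)
U <- c(rs_a = 300, rs_b = 3900, rs_c = 2300)
before <- beta_value(M, U)

# Masking: both intensities are set to zero
after <- beta_value(M * 0, U * 0)

print(round(before, 3))
print(after)
```

Under this assumption, every masked probe evaluates to 1/2 regardless of its original signal, which matches the uniform 0.5 values in the output above.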

De-identify by Scrambling

This second method of de-identification scrambles the SNP probe intensities. A secret key is used to seed the random number generator, so that the same shuffle can be reproduced later. To use this method, randomize needs to be set to TRUE.

## rs10033147  rs1019916  rs1040870 rs10457834 rs10774834 rs10796216 
## 0.48831587 0.44527168 0.85344189 0.90835267 0.04676394 0.45331875
## rs10033147  rs1019916  rs1040870 rs10457834 rs10774834 rs10796216 
## 0.06952204 0.05275335 0.91886629 0.89602848 0.48673376 0.86940527

Note that the rs values are scrambled after deIdentify.
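The role of the secret key can be illustrated with a small base R sketch. It shuffles the six beta values shown above among themselves, seeded by a key. The helper name scramble and the key are hypothetical, and the sketch shuffles beta values of six probes only, whereas the real function operates on the intensities stored in the IDAT files, so the numbers will not match the output above.

```r
# Beta values of the six SNP probes before de-identification
snp_betas <- c(rs10033147 = 0.48831587, rs1019916 = 0.44527168,
               rs1040870 = 0.85344189, rs10457834 = 0.90835267,
               rs10774834 = 0.04676394, rs10796216 = 0.45331875)

# Hypothetical helper: shuffle the values across probes, seeded by a key.
# The probe names stay in place; only the values move.
scramble <- function(x, key) {
    set.seed(key)
    perm <- sample(length(x))
    setNames(unname(x)[perm], names(x))
}

my_secret <- 13412084  # made-up key; a real key must be kept private
scrambled <- scramble(snp_betas, my_secret)
print(scrambled)

# The same key always reproduces the same shuffle
print(identical(scramble(snp_betas, my_secret), scrambled))
```

Because the shuffle depends only on the key, anyone without the key sees values that no longer belong to their probes, while the key holder can reproduce the exact permutation.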

Re-identify

To restore the original order of the scrambled intensities, one can re-identify the IDATs with the reIdentify function. This requires the same secret key that was used for scrambling.

## rs10033147  rs1019916  rs1040870 rs10457834 rs10774834 rs10796216 
## 0.48831587 0.44527168 0.85344189 0.90835267 0.04676394 0.45331875
## rs10033147  rs1019916  rs1040870 rs10457834 rs10774834 rs10796216 
## 0.48831587 0.44527168 0.85344189 0.90835267 0.04676394 0.45331875

Note that reIdentify restored the values. Consequently, they are the same as the original values in betas1.
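Why restoring is possible can be sketched in base R: regenerating the permutation from the same key allows each value to be written back to its original position. The helper names scramble_values and restore_values, the key, and the intensities are all hypothetical; this is an illustration of the idea, not SeSAMe's implementation.

```r
# Hypothetical helper: shuffle a vector with a permutation derived from a key
scramble_values <- function(x, key) {
    set.seed(key)
    perm <- sample(length(x))
    x[perm]
}

# Hypothetical helper: regenerate the same permutation and invert it
restore_values <- function(y, key) {
    set.seed(key)
    perm <- sample(length(y))
    out <- y
    out[perm] <- y  # y[i] came from position perm[i], so put it back there
    out
}

intensities <- c(5120, 830, 2410, 3975, 660, 1290)  # made-up intensities
my_secret <- 13412084                               # made-up key

scrambled <- scramble_values(intensities, my_secret)
restored <- restore_values(scrambled, my_secret)
print(identical(restored, intensities))
```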

Extract Genotypes

SeSAMe can output genotypes from explicit SNP probes and from Infinium-I-derived SNPs in VCF format. This information can be used to identify sample swaps.

## Retrieving annotation from  https://zhouserver.research.chop.edu/InfiniumAnnotation/current/EPIC/EPIC.hg19.snp_overlap_b151.rds ... Done.
## Retrieving annotation from  https://zhouserver.research.chop.edu/InfiniumAnnotation/current/EPIC/EPIC.hg19.typeI_overlap_b151.rds ... Done.

One can write an actual VCF file with a header using formatVCF(sdf, vcf=path_to_vcf).
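As a rough illustration of how SNP information can flag a sample swap, the sketch below bins SNP beta values into genotype calls and compares samples by the fraction of matching calls. It works on plain numeric vectors rather than on VCF output. The thresholds of 0.2 and 0.8, the helper names, and the two comparison samples are assumptions made for illustration.

```r
# Hypothetical helper: turn SNP beta values into genotype calls
# 0 = homozygous for one allele, 1 = heterozygous, 2 = homozygous for the other
call_genotype <- function(betas, lower = 0.2, upper = 0.8) {
    as.integer(cut(betas, breaks = c(-Inf, lower, upper, Inf))) - 1L
}

# Hypothetical helper: fraction of SNPs with the same genotype call
concordance <- function(a, b) {
    mean(call_genotype(a) == call_genotype(b))
}

# SNP beta values from the example above
sample_a <- c(0.48831587, 0.44527168, 0.85344189,
              0.90835267, 0.04676394, 0.45331875)

# Made-up repeat measurement of the same individual (small noise added)
replicate_a <- sample_a + c(0.02, -0.03, 0.01, -0.02, 0.01, 0.03)

# Made-up sample from a different individual
other <- c(0.91, 0.05, 0.47, 0.06, 0.52, 0.93)

print(concordance(sample_a, replicate_a))
print(concordance(sample_a, other))
```

Two samples that are supposed to come from the same individual but show low concordance are candidates for a swap. In practice many more SNPs would be compared than the six used here.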

The FileSet

Preprocessing IDATs to FileSets

When a large number of samples are being analyzed, it is desirable to have random access to the methylation of specific CpGs without loading all the data. SeSAMe provides such an interface through the fileSet object, which is in essence an indexed, file-based numeric matrix.

One way to generate a fileSet is the openSesameToFile function. Rather than returning the beta values in memory, the function writes them to a file at the given path. One can then operate on the fileSet by referencing the path to the file.

The following openSesameToFile call does three things:

- generates a file called mybetas,
- generates an index file called mybetas_idx.rds,
- returns a fileSet object which serves as an interface to the two files.

## Allocating space for 2 HM27 samples at mybetas.
## Mapping 2 HM27 samples to mybetas.
## Successfully processed 2 IDATs (0 failed).
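The idea of an indexed, file-based numeric matrix can be sketched in base R. The layout below (raw 8-byte doubles stored one sample after another, plus an RDS index holding the probe and sample names) is an assumption chosen for illustration and is not necessarily the format SeSAMe uses. The helper read_one and the beta values are hypothetical.

```r
path <- tempfile("mybetas_sketch")
probes <- c("cg00000292", "cg00002426", "cg00003994")
samples <- c("4207113116_A", "4207113116_B")

# Made-up beta values, probes in rows and samples in columns
betas <- matrix(c(0.91, 0.12, 0.05, 0.88, 0.15, 0.07), nrow = 3,
                dimnames = list(probes, samples))

# Data file: raw doubles, one sample (column) after another
con <- file(path, "wb")
writeBin(as.vector(betas), con, size = 8)
close(con)

# Index file: the probe and sample names that give the numbers meaning
saveRDS(list(probes = probes, samples = samples), paste0(path, "_idx.rds"))

# Random access: read a single value without loading the whole matrix
read_one <- function(path, probe, sample) {
    idx <- readRDS(paste0(path, "_idx.rds"))
    i <- match(probe, idx$probes)
    j <- match(sample, idx$samples)
    con <- file(path, "rb")
    on.exit(close(con))
    # Byte offset of element (i, j) in column-major order
    seek(con, where = 8 * ((j - 1) * length(idx$probes) + (i - 1)))
    readBin(con, what = "numeric", n = 1, size = 8)
}

print(read_one(path, "cg00002426", "4207113116_B"))
```

Because the position of every value can be computed from the index, only the requested bytes are read from disk, which is what makes this approach practical for large cohorts.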

Introduction to fileSet

When printed to console, the number of samples and the number of probes are shown.

## File Set for 27578 probes and 2 samples.

One can obtain the sample and probe information with the $ operator.

## [1] "4207113116_A" "4207113116_B"
## [1] "cg00000292" "cg00002426" "cg00003994" "cg00005847" "cg00006414"
## [6] "cg00007981"

Query fileSet

One can query specific CpGs by probe name(s) and sample name(s). Note that every query to fset is a disk read, so it can be slower than in-memory processing. Here we retrieve only the beta values for the two probes cg00006414 and cg00007981 in the sample 4207113116_B.

##            4207113116_B
## cg00006414    0.1387231
## cg00007981    0.0262714

Read Existing fileSet

In the previous example, we preprocessed IDATs directly to a fileSet. We can also read a pre-existing fileSet from its file path using the readFileSet function.

##            4207113116_A
## cg00000292    0.9052752

Write fileSet by Allocation and Filling

The size of a fileSet is always fixed. One cannot dynamically expand or shrink a fileSet. We can, however, write a fileSet by filling the allocated space one sample at a time. This is achieved by first allocating the space given the number of samples and the probe IDs (optional if the platform is one of HM27, HM450 or EPIC).

## Allocating space for 2 HM450 samples at mybetas2.

Then one can fill in the beta values with mapFileSet. Here we illustrate this using randomly generated beta values.

## File Set for 486427 probes and 2 samples.

The mapped value should be equal to the generated beta value. Let’s spot-check.

##            sample2
## cg00000108    TRUE
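The allocate-then-fill pattern can also be sketched in base R. A fixed-size file is first filled with NA, and the block belonging to one sample is then overwritten in place. The file layout, the helper fill_sample, the probe IDs, and the sample names are assumptions made for illustration and do not reflect SeSAMe's internal format.

```r
path <- tempfile("mybetas2_sketch")
probes <- sprintf("cg%08d", 1:5)   # hypothetical probe IDs
samples <- c("sample1", "sample2")

# Allocate: a file of fixed size, pre-filled with NA for every probe and sample
con <- file(path, "wb")
writeBin(rep(NA_real_, length(probes) * length(samples)), con, size = 8)
close(con)

# Hypothetical helper: overwrite the block of sample j with its beta values
fill_sample <- function(path, j, values, n_probes) {
    con <- file(path, "r+b")
    on.exit(close(con))
    seek(con, where = 8 * (j - 1) * n_probes, origin = "start", rw = "write")
    writeBin(as.numeric(values), con, size = 8)
    invisible(NULL)
}

# Fill the second sample with randomly generated beta values
set.seed(1)
fake_betas <- runif(length(probes))
fill_sample(path, 2, fake_betas, length(probes))

# Read everything back to inspect the result
stored <- matrix(readBin(path, "numeric", n = 10, size = 8), nrow = 5,
                 dimnames = list(probes, samples))
print(stored)

# Spot-check one probe of the filled sample
print(stored["cg00000003", "sample2"] == fake_betas[3])
```

The file size never changes after allocation, which mirrors the fixed-size constraint described above: samples that have not been filled yet simply remain NA.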

Session Info

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] e1071_1.7-9                 tidyr_1.2.0                
##  [3] dplyr_1.0.7                 knitr_1.37                 
##  [5] SummarizedExperiment_1.24.0 Biobase_2.54.0             
##  [7] MatrixGenerics_1.6.0        matrixStats_0.61.0         
##  [9] scales_1.1.1                DNAcopy_1.68.0             
## [11] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
## [13] IRanges_2.28.0              S4Vectors_0.32.3           
## [15] wheatmap_0.1.0              ggplot2_3.3.5              
## [17] sesame_1.12.9               sesameData_1.12.0          
## [19] rmarkdown_2.11              ExperimentHub_2.2.1        
## [21] AnnotationHub_3.2.1         BiocFileCache_2.2.1        
## [23] dbplyr_2.1.1                BiocGenerics_0.40.0        
## 
## loaded via a namespace (and not attached):
##  [1] fgsea_1.20.0                  colorspace_2.0-2             
##  [3] ellipsis_0.3.2                class_7.3-20                 
##  [5] XVector_0.34.0                base64_2.0                   
##  [7] proxy_0.4-26                  farver_2.1.0                 
##  [9] ggrepel_0.9.1                 bit64_4.0.5                  
## [11] interactiveDisplayBase_1.32.0 AnnotationDbi_1.56.2         
## [13] fansi_1.0.2                   splines_4.1.2                
## [15] cachem_1.0.6                  jsonlite_1.7.3               
## [17] png_0.1-7                     shiny_1.7.1                  
## [19] BiocManager_1.30.16           compiler_4.1.2               
## [21] httr_1.4.2                    assertthat_0.2.1             
## [23] Matrix_1.4-0                  fastmap_1.1.0                
## [25] cli_3.1.1                     later_1.3.0                  
## [27] htmltools_0.5.2               tools_4.1.2                  
## [29] gtable_0.3.0                  glue_1.6.1                   
## [31] GenomeInfoDbData_1.2.7        reshape2_1.4.4               
## [33] rappdirs_0.3.3                fastmatch_1.1-3              
## [35] Rcpp_1.0.8                    jquerylib_0.1.4              
## [37] vctrs_0.3.8                   Biostrings_2.62.0            
## [39] nlme_3.1-155                  preprocessCore_1.56.0        
## [41] xfun_0.29                     stringr_1.4.0                
## [43] mime_0.12                     lifecycle_1.0.1              
## [45] zlibbioc_1.40.0               MASS_7.3-55                  
## [47] BiocStyle_2.22.0              promises_1.2.0.1             
## [49] parallel_4.1.2                RColorBrewer_1.1-2           
## [51] yaml_2.2.2                    curl_4.3.2                   
## [53] memoise_2.0.1                 gridExtra_2.3                
## [55] sass_0.4.0                    stringi_1.7.6                
## [57] RSQLite_2.2.9                 BiocVersion_3.14.0           
## [59] highr_0.9                     randomForest_4.7-1           
## [61] filelock_1.0.2                BiocParallel_1.28.3          
## [63] rlang_1.0.1                   pkgconfig_2.0.3              
## [65] bitops_1.0-7                  evaluate_0.14                
## [67] lattice_0.20-45               purrr_0.3.4                  
## [69] labeling_0.4.2                bit_4.0.4                    
## [71] tidyselect_1.1.1              plyr_1.8.6                   
## [73] magrittr_2.0.2                R6_2.5.1                     
## [75] generics_0.1.2                DelayedArray_0.20.0          
## [77] DBI_1.1.2                     mgcv_1.8-38                  
## [79] pillar_1.7.0                  withr_2.4.3                  
## [81] KEGGREST_1.34.0               RCurl_1.98-1.5               
## [83] tibble_3.1.6                  crayon_1.4.2                 
## [85] KernSmooth_2.23-20            utf8_1.2.2                   
## [87] grid_4.1.2                    data.table_1.14.2            
## [89] blob_1.2.2                    digest_0.6.29                
## [91] xtable_1.8-4                  httpuv_1.6.5                 
## [93] illuminaio_0.36.0             openssl_1.4.6                
## [95] munsell_0.5.0                 bslib_0.3.1                  
## [97] askpass_1.1