
1 Introduction

rexposome is an R package designed for the analysis of exposome data. The exposome can be defined as the measure of all the exposures of an individual in a lifetime and how those exposures relate to health. Hence, the aim of rexposome is to offer a set of functions to bring exposome data into the R framework, and to provide a series of tools to analyse exposome data using standard methods from Bioconductor.

1.1 Installation

rexposome is currently in development and is not available from CRAN or Bioconductor. However, the package can be installed using the devtools R package, taking the source from the GitHub repository of the Bioinformatic Research Group in Epidemiology.

This can be done by opening an R session and typing the following code:

devtools::install_github("isglobal-brge/rexposome")

Note that this command does not install the package’s dependencies.

1.2 Pipeline

The following figure illustrates the rexposome pipeline:

Pipeline for exposome analysis


The first step is to load the exposome data into R. rexposome provides two functions for this purpose: one to load three TXT files and another to use three data.frames. Next, missing data and values under the limit of detection (LOD) are quantified, which helps to decide on imputation processes. The exposome characterization is useful to understand the nature of the exposome and the relations between exposures. Clustering of individual exposure data is performed to find exposure-signatures, whose association with health outcomes can be tested in the next step. For both exposures and exposure-signatures, the association with health outcomes is tested using Exposome-Wide Association Studies (ExWAS).

1.3 Data Format

1.3.1 Three table format

rexposome defines the exposome data as three different data-sets:

  1. Description Data
  2. Exposure Data
  3. Phenotype Data

The description data is a file describing the exposome. It has a row for each exposure and, at least, defines the families of exposures. Usually, this file also incorporates a description of the exposures, the matrix where each one was measured and the units of measurement, among others.

The following is an example of a description data file:

exposure  family  matrix          description
bde100    PBDEs   colostrum       BDE 100 - log10
bde138    PBDEs   colostrum       BDE 138 - log10
bde209    PBDEs   colostrum       BDE 209 - log10
PFOA      PFAS    cord blood      PFOA - log10
PFNA      PFAS    cord blood      PFNA - log10
PFOA      PFAS    maternal serum  PFOA - log10
PFNA      PFAS    maternal serum  PFNA - log10
hg        Metals  cord blood      hg - log 10
Co        Metals  urine           Co (creatinine) - log10
Zn        Metals  urine           Zn (creatinine) - log10
Pb        Metals  urine           Pb (creatinine) - log10
THM       Water   ---             Average total THM uptake - log10
CHCL3     Water   ---             Average Chloroform uptake - log10
BROM      Water   ---             Average Brominated THM uptake - log10
NO2       Air     ---             NO2 levels whole pregnancy- log10
Ben       Air     ---             Benzene levels whole pregnancy- log10

The exposures data file contains the measurements of each exposure for all the individuals included in the analysis. It is a matrix-like file with a row per individual and a column per exposure. It must include a column with the subject’s identifier.

The following is an example of an exposures data file:

id    bde100  bde138  bde209  PFOA    ...
sub01  2.4665  0.7702  1.6866  2.0075 ...
sub02  0.7799  1.4147  1.2907  1.0153 ...  
sub03 -1.6583 -0.9851 -0.8902 -0.0806 ... 
sub04 -1.0812 -0.6639 -0.2988 -0.4268 ... 
sub05 -0.2842 -0.1518 -1.5291 -0.7365 ... 
...   ...     ...     ...     ...

The last of the data-sets is the phenotype data file. This file contains the covariates to be included in the analysis as well as the health outcomes of interest. It contains a row per individual included in the analysis and a column for each covariate and outcome. Moreover, it must include a column with the individual’s identifier.

The following is an example of a phenotype data file:

id    asthma   BMI      sex  age  ...
sub01 control  23.2539  boy  4    ...
sub02 asthma   24.4498  girl 5    ...
sub03 asthma   15.2356  boy  4    ...
sub04 control  25.1387  girl 4    ...
sub05 control  22.0477  boy  5    ...
...   ...      ...      ...  ...

To properly coordinate the exposome data, the information included in the three data-sets must follow some rules:

  • The description data file has a column identifying the exposures
  • The exposures data file has a column for each exposure defined in the description data file
  • The exposures data file has a column identifying the individuals
  • The phenotype data file has a column identifying the same individuals included in the exposures data file

These rules are easily seen in the following figure:

Links Between Files


In summary: all the exposures (rows) in the description data file are columns in the exposures data file (plus the column identifying subjects). All the subjects in the exposures data file are also in the phenotype data file.
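The linking rules can be checked before loading anything into rexposome. The following is a minimal base R sketch, not part of the package: the helper check_links(), its argument names and the toy values (including the NO2 column) are hypothetical and only illustrate the rules stated above.

```r
# Toy versions of the three tables (hypothetical values, for illustration only).
description <- data.frame(
  exposure = c("bde100", "bde138", "NO2"),
  family   = c("PBDEs", "PBDEs", "Air"),
  stringsAsFactors = FALSE
)
exposures <- data.frame(
  id     = c("sub01", "sub02", "sub03"),
  bde100 = c(2.4665, 0.7799, -1.6583),
  bde138 = c(0.7702, 1.4147, -0.9851),
  NO2    = c(0.12, -0.40, 1.05),
  stringsAsFactors = FALSE
)
phenotypes <- data.frame(
  id     = c("sub01", "sub02", "sub03"),
  asthma = c("control", "asthma", "asthma"),
  stringsAsFactors = FALSE
)

# check_links() is a hypothetical helper, not a rexposome function.
# It returns the names that break each rule; empty vectors mean the
# tables are consistent.
check_links <- function(description, exposures, phenotypes,
                        exp_col = "exposure", id_col = "id") {
  exposure_cols <- setdiff(colnames(exposures), id_col)
  list(
    # exposures that are described but have no column in the exposures table
    missing_in_exposures = setdiff(description[[exp_col]], exposure_cols),
    # exposure columns that have no row in the description table
    missing_in_description = setdiff(exposure_cols, description[[exp_col]]),
    # subjects with exposures but no row in the phenotype table
    missing_in_phenotypes = setdiff(exposures[[id_col]], phenotypes[[id_col]])
  )
}

problems <- check_links(description, exposures, phenotypes)
all_ok <- all(lengths(problems) == 0)
```

If a subject is dropped from the phenotype table, its identifier shows up in problems$missing_in_phenotypes, which makes the broken link easy to locate.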

1.3.2 Single table format

To make life easier for researchers who have their datasets as one big table (exposures and phenotypes combined in a single table), we offer the option of using it as an input to rexposome. Please look into the documentation of the loadExposome_plain() function for further details on how to load this type of data. A few remarks on that approach:

  • The exposures can be grouped into families by passing a list argument
  • Internally this function converts an exposures/phenotypes table into the three individual tables needed by rexposome
  • There is no option of adding description fields to the exposures

An example of a single table is the following:

id    bde100  bde138  bde209    asthma   BMI      ...
sub01  2.4665  0.7702  1.6866   control  23.2539  ...
sub02  0.7799  1.4147  1.2907   asthma   24.4498  ...  
sub03 -1.6583 -0.9851 -0.8902   asthma   15.2356  ... 
sub04 -1.0812 -0.6639 -0.2988   control  25.1387  ... 
sub05 -0.2842 -0.1518 -1.5291   control  22.0477  ...
...   ...     ...      ...      ...      ...
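To give an idea of what converting a single table into the three individual tables involves, here is a rough base R sketch. It is not the implementation of loadExposome_plain(): the helper split_plain(), its arguments and the way families are passed are assumptions made for illustration.

```r
# Combined table with the same layout as the example above (first three subjects).
combined <- data.frame(
  id     = c("sub01", "sub02", "sub03"),
  bde100 = c(2.4665, 0.7799, -1.6583),
  bde138 = c(0.7702, 1.4147, -0.9851),
  bde209 = c(1.6866, 1.2907, -0.8902),
  asthma = c("control", "asthma", "asthma"),
  BMI    = c(23.2539, 24.4498, 15.2356),
  stringsAsFactors = FALSE
)

# Families given as a named list: family name -> exposure columns.
families <- list(PBDEs = c("bde100", "bde138", "bde209"))

# split_plain() is a hypothetical helper written for this sketch.
split_plain <- function(data, families, id_col = "id") {
  exposure_names  <- unlist(families, use.names = FALSE)
  # every column that is neither the identifier nor an exposure is a phenotype
  phenotype_names <- setdiff(colnames(data), c(id_col, exposure_names))

  # description table: one row per exposure, holding only the family
  description <- data.frame(
    family = rep(names(families), lengths(families)),
    row.names = exposure_names,
    stringsAsFactors = FALSE
  )
  exposures  <- data[, exposure_names, drop = FALSE]
  phenotypes <- data[, phenotype_names, drop = FALSE]
  rownames(exposures)  <- data[[id_col]]
  rownames(phenotypes) <- data[[id_col]]

  list(description = description, exposures = exposures, phenotypes = phenotypes)
}

tables <- split_plain(combined, families)
```

The description table produced this way only carries the family, which matches the remark that no extra description fields can be attached with the single table format.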

A visual representation of this single table is the following:

2 Analysis

The rexposome R package is loaded using the standard library command:

library(rexposome)

rexposome provides two functions to load the exposome data: readExposome and loadExposome. The function readExposome loads the exposome data from txt files and loadExposome does the same from standard R data.frames. Both functions create an ExposomeSet object. ExposomeSet is a standard S4 class that encapsulates the exposome data.

2.1 Loading Exposome Data

2.1.1 From TXT files

The function readExposome creates an ExposomeSet from the three txt files. The following lines locate these three files, which are included in the package for demonstration purposes.

path <- file.path(path.package("rexposome"), "extdata")
description <- file.path(path, "description.csv")
phenotype <- file.path(path, "phenotypes.csv")
exposures <- file.path(path, "exposures.csv")

These files follow the rules described in the Data Format section. They are csv files, meaning each value is separated from the others by a comma (,). The function readExposome can load almost any type of file containing exposome data:

args(readExposome)
## function (exposures, description, phenotype, sep = ",", na.strings = c("NA", 
##     "-", "?", " ", ""), exposures.samCol = "sample", description.expCol = "exposure", 
##     description.famCol = "family", phenotype.samCol = "sample", 
##     exposures.asFactor = 5, warnings = TRUE) 
## NULL

readExposome expects, by default, csv files. Changing the argument sep allows other file types to be loaded. The missing values are set using the argument na.strings: every string assigned to this argument is interpreted as a missing value. By default, those strings are "NA", "-", "?", " " and "". Then, the columns with the exposures’ names and the individuals’ names need to be indicated. The arguments exposures.samCol and phenotype.samCol indicate the column with the individuals’ names in the exposures file and in the phenotypes file, respectively. The arguments description.expCol and description.famCol indicate the columns containing the exposures’ names and the exposures’ families in the description file.
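As a minimal sketch of how sep and na.strings behave, the example below uses base R’s read.table on a small in-memory table. It assumes that readExposome interprets these two arguments the same way base R does; the file content and its values are hypothetical.

```r
# A small semicolon-separated table held in memory (hypothetical content).
raw <- "idnum;PFOA;Cotinine
id001;2.0075;-
id002;?;0.8906
id003;1.0153;NA"

# Same missing-value markers as the readExposome defaults listed above.
na_markers <- c("NA", "-", "?", " ", "")

# sep = ";" switches from comma-separated to semicolon-separated input.
toy <- read.table(text = raw, header = TRUE, sep = ";",
                  na.strings = na_markers, stringsAsFactors = FALSE)

# Number of values read as missing in each column.
n_missing <- colSums(is.na(toy))
```

Because "-" and "?" are read as missing values rather than as text, the PFOA and Cotinine columns stay numeric.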

exp <- readExposome(exposures = exposures, description = description, phenotype = phenotype,
    exposures.samCol = "idnum", description.expCol = "Exposure", description.famCol = "Family",
    phenotype.samCol = "idnum")

The result is an object of class ExposomeSet, that can show all the information of the loaded exposome:

exp
## Object of class 'ExposomeSet' (storageMode: environment)
##  . exposures description:
##     . categorical:  4 
##     . continuous:  84 
##  . exposures transformation:
##     . categorical: 0 
##     . transformed: 0 
##     . standardized: 0 
##     . imputed: 0 
##  . assayData: 88 exposures 109 individuals
##     . element names: exp 
##     . exposures: AbsPM25, ..., Zn 
##     . individuals: id001, ..., id108 
##  . phenoData: 109 individuals 9 phenotypes
##     . individuals: id001, ..., id108 
##     . phenotypes: whistling_chest, ..., cbmi 
##  . featureData: 88 exposures 7 explanations
##     . exposures: AbsPM25, ..., Zn 
##     . descriptions: Family, ..., .imp 
## experimentData: use 'experimentData(object)'
## Annotation:

Under the section exposures description, the number of continuous (84) and categorical (4) exposures is shown. The assayData, phenoData and featureData sections show the content of the files we loaded with readExposome.

2.1.2 From data.frame

The function loadExposome creates an ExposomeSet from three data.frames: one as description data, one as exposures data and one as phenotypes data. The arguments are similar to the ones from readExposome:

args(loadExposome)
## function (exposures, description, phenotype, description.famCol = "family", 
##     exposures.asFactor = 5, warnings = TRUE) 
## NULL

In order to illustrate how to use loadExposome, we are loading the previous csv files as data.frames:

dd <- read.csv(description, header = TRUE)
ee <- read.csv(exposures, header = TRUE)
pp <- read.csv(phenotype, header = TRUE)

Then we rearrange the data.frames to fulfil the requirements of the exposome data. The data.frame corresponding to description data needs to have the exposures’ names as rownames.

rownames(dd) <- dd[, 2]
dd <- dd[, -2]

The data.frame corresponding to exposures data needs to have the individual’s identifiers as rownames:

rownames(ee) <- ee[, 1]
ee <- ee[, -1]

The data.frame corresponding to phenotypes data needs to have the individuals’ identifiers as rownames, like the previous data.frame:

rownames(pp) <- pp[, 1]
pp <- pp[, -1]

Then, the ExposomeSet is created by passing the three data.frames to loadExposome:

exp <- loadExposome(exposures = ee, description = dd, phenotype = pp, description.famCol = "Family")
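The rearrangement applied to the three data.frames above follows one pattern: move an identifier column into the rownames and drop it from the columns. A small sketch of that pattern, selecting the column by name instead of by position, could look as follows. The helper id_to_rownames() and the toy data are hypothetical and not part of rexposome.

```r
# id_to_rownames() is a hypothetical helper, not a rexposome function.
id_to_rownames <- function(df, id_col) {
  # the identifier must exist and be unique to be usable as rownames
  stopifnot(id_col %in% colnames(df), !anyDuplicated(df[[id_col]]))
  rownames(df) <- df[[id_col]]
  # drop = FALSE keeps a data.frame even if only one column remains
  df[, setdiff(colnames(df), id_col), drop = FALSE]
}

# Toy phenotype table (hypothetical values).
toy_pheno <- data.frame(
  idnum = c("id001", "id002", "id003"),
  sex   = c("male", "male", "female"),
  stringsAsFactors = FALSE
)
toy_pheno <- id_to_rownames(toy_pheno, "idnum")
```

Selecting by name avoids depending on the column order of the input file, and drop = FALSE prevents a table with a single remaining column from collapsing into a plain vector.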

2.1.3 Accessing Exposome Data

The class ExposomeSet has several accessors to get the data stored in it. There are four basic methods that return the names of the individuals (sampleNames), the names of the exposures (exposureNames), the names of the families of exposures (familyNames) and the names of the phenotypes (phenotypeNames).

head(sampleNames(exp))
## [1] "id001" "id002" "id003" "id004" "id005" "id006"
head(exposureNames(exp))
## [1] "AbsPM25" "As"      "BDE100"  "BDE138"  "BDE153"  "BDE154"
familyNames(exp)
##  [1] "Air Pollutants"    "Metals"            "PBDEs"            
##  [4] "Bisphenol A"       "Water Pollutants"  "Built Environment"
##  [7] "Cotinine"          "Organochlorines"   "Home Environment" 
## [10] "Phthalates"        "Noise"             "PFOAs"            
## [13] "Temperature"
phenotypeNames(exp)
## [1] "whistling_chest" "flu"             "rhinitis"        "wheezing"       
## [5] "birthdate"       "sex"             "age"             "cbmi"           
## [9] "blood_pre"

fData will return the description of the exposures (including internal information to manage them).

head(fData(exp), n = 3)
##                 Family                                          Name .fct .trn
## AbsPM25 Air Pollutants Measurement of the blackness of PM2.5 filters          
## As              Metals                                        Asenic          
## BDE100           PBDEs            Polybrominated diphenyl ether -100          
##         .std .imp   .type
## AbsPM25           numeric
## As                numeric
## BDE100            numeric

pData will return the phenotypes information.

head(pData(exp), n = 3)
##       whistling_chest flu rhinitis wheezing  birthdate  sex age cbmi blood_pre
## id001           never  no       no       no 2004-12-29 male 4.2 16.3       120
## id002           never  no       no       no 2005-01-05 male 4.2 16.4       121
## id003        7-12 epi  no       no      yes 2005-01-05 male 4.2 19.0       120

Finally, the method expos returns the matrix of exposures as a data.frame:

expos(exp)[1:10, c("Cotinine", "PM10V", "PM25", "X5cxMEPP")]
##          Cotinine       PM10V     PM25 X5cxMEPP
## id001  0.03125173  0.10373078 1.176255       NA
## id002  1.59401990 -0.47768393 1.155122       NA
## id003  1.46251090          NA 1.215834 1.859045
## id004  0.89059991          NA 1.171610       NA
## id005          NA          NA 1.145765       NA
## id006  0.34818304          NA 1.145382       NA
## id007  1.53591130          NA 1.174642       NA
## id008  2.26864700          NA 1.165078 1.291871
## id009  1.24842660          NA 1.171406 1.650948
## id010 -0.36758339  0.01593277 1.179240 2.112357

2.2 Exposome Pre-process

2.2.1 Missing Data in Exposures and Phenotypes

The number of missing values in each exposure and in each phenotype can be found using the function tableMissings. This function returns a vector with the amount of missing data in each exposure or phenotype. The argument set indicates whether the missing values are counted on exposures or on phenotypes. The argument output indicates whether the result is shown as counts (output="n") or as percentages (output="p").

The missing data in the exposures and in the phenotypes of the current exposome data are:

tableMissings(exp, set = "exposures", output = "n")
##         Dens         Temp         Conn      AbsPM25           NO          NO2 
##            0            0            1            2            2            2 
##          NOx         PM10       PM10Cu       PM10Fe        PM10K       PM10Ni 
##            2            2            2            2            2            2 
##        PM10S       PM10SI       PM10Zn         PM25       PM25CU       PM25FE 
##            2            2            2            2            2            2 
##        PM25K       PM25Ni        PM25S       PM25Sl       PM25Zn     PMcoarse 
##            2            2            2            2            2            2 
##      Benzene        PM25V          ETS G_pesticides          Gas         BTHM 
##            3            3            5            5            5            6 
##        CHCl3 H_pesticides      Noise_d      Noise_n          THM     Cotinine 
##            6            6            6            6            6            7 
##          DDE          DDT          HCB       PCB118       PCB138       PCB153 
##           13           13           13           13           13           13 
##       PCB180         bHCH          BPA           As           Cs           Mo 
##           13           13           21           24           24           24 
##           Ni           Tl           Zn           Hg           Cd           Sb 
##           24           24           24           27           28           30 
##        Green           Cu        PM10V           Se         MBzP        MEHHP 
##           31           40           41           45           46           46 
##         MEHP        MEOHP          MEP         MiBP         MnBP     X5cxMEPP 
##           46           46           46           46           46           46 
##           Co        PFHxS         PFNA         PFOA         PFOS    X7OHMMeOP 
##           47           48           48           48           48           49 
##           Pb     X2cxMMHP       BDE100       BDE138       BDE153       BDE154 
##           59           64           76           76           76           76 
##        BDE17       BDE183       BDE190       BDE209        BDE28        BDE47 
##           76           76           76           76           76           76 
##        BDE66        BDE71        BDE85        BDE99 
##           76           76           76           76
tableMissings(exp, set = "phenotypes", output = "n")
## whistling_chest             flu        rhinitis        wheezing             sex 
##               0               0               0               0               0 
##             age            cbmi       blood_pre       birthdate 
##               0               0               2               3

As an alternative to tableMissings, the function plotMissings draws a bar plot with the percentage of missing data in each exposure or phenotype.

plotMissings(exp, set = "exposures")
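For intuition, the kind of summary produced by tableMissings can be approximated in base R on a plain data.frame. The sketch below is not the package’s code. The toy values are hypothetical, and expressing the percentage on a 0 to 100 scale is an assumption, since the scale used with output="p" is not stated here.

```r
# Toy exposures table (hypothetical values).
toy_expos <- data.frame(
  Cotinine = c(0.03, 1.59, NA, 0.89),
  PM10V    = c(0.10, NA, NA, NA),
  PM25     = c(1.18, 1.16, 1.22, 1.17)
)

# Counts of missing values per exposure, sorted in increasing order
# like the output shown above.
missing_n <- sort(colSums(is.na(toy_expos)))

# The same information as a percentage of individuals (0-100 scale assumed).
missing_p <- 100 * missing_n / nrow(toy_expos)
```

Sorting the counts puts the exposures that are most affected by missing data at the end of the vector.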

2.2.2 Exposures Normality

Most of the tests done in exposome analysis require the exposures to follow a normal distribution. The function normalityTest tests each exposure for normality. The result is a data.frame with the exposures’ names, a TRUE/FALSE flag for normality and the p-value obtained from the Shapiro-Wilk normality test (if the p-value is under the threshold, the exposure is considered not normal).

nm <- normalityTest(exp)
table(nm$normality)
## 
## FALSE  TRUE 
##    55    29

So, the exposures that do not follow a normal distribution are:

nm$exposure[!nm$normality]
##  [1] "DDT"      "PM10SI"   "PM25K"    "PM25Sl"   "PCB118"   "Tl"      
##  [7] "PM10V"    "PM25Zn"   "PM25FE"   "PM10K"    "BDE17"    "PM25"    
## [13] "PMcoarse" "PM10"     "BPA"      "Green"    "NO2"      "Cs"      
## [19] "PFNA"     "PCB153"   "PM25CU"   "MEOHP"    "Cu"       "HCB"     
## [25] "MEHHP"    "DDE"      "BDE190"   "bHCH"     "PM10Zn"   "MnBP"    
## [31] "NO"       "NOx"      "PM10S"    "MEHP"     "PCB138"   "Zn"      
## [37] "X2cxMMHP" "PCB180"   "PFOA"     "PFHxS"    "Cotinine" "PM25S"   
## [43] "Co"       "Conn"     "PM25Ni"   "PM10Ni"   "Cd"       "Dens"    
## [49] "Se"       "X5cxMEPP" "BDE183"   "BDE28"    "Sb"       "BDE138"  
## [55] "PM25V"
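The logic behind such a table can be sketched with base R’s shapiro.test applied column by column. This is an illustration under stated assumptions, not the code of normalityTest: the helper normality_sketch(), the 0.05 threshold and the toy data are hypothetical.

```r
# Deterministic toy exposures: quantiles of a normal and of an exponential
# distribution (hypothetical data, chosen so the outcome is clear-cut).
toy_expos <- data.frame(
  symmetric = qnorm(ppoints(100)),
  skewed    = qexp(ppoints(100))
)

# normality_sketch() is a hypothetical helper, not the package's code.
# th is the p-value threshold; 0.05 is an assumed default.
normality_sketch <- function(data, th = 0.05) {
  # Shapiro-Wilk p-value for every column
  p <- vapply(data, function(x) shapiro.test(x)$p.value, numeric(1))
  data.frame(
    exposure  = names(p),
    normality = unname(p >= th),  # FALSE when the p-value is under the threshold
    p.value   = unname(p),
    stringsAsFactors = FALSE
  )
}

nm_toy <- normality_sketch(toy_expos)
non_normal <- nm_toy$exposure[!nm_toy$normality]
```

In this toy case only the exponential-shaped column is flagged as not normal.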

Some of these exposures are categorical, so they are not expected to follow a normal distribution. This is the case, for example, of G_pesticides. Plotting the histogram of the values of the exposure makes this clear:

library(ggplot2)
plotHistogram(exp, select = "G_pesticides") + ggtitle("Garden Pesticides")

Other exposures are continuous variables that do not pass the normality test. A visual inspection is required in this case.

plotHistogram(exp, select = "BDE209") + ggtitle("BDE209 - Histogram")

If an exposure follows a non-normal distribution, the method plotHistogram has an argument show.trans that, when set to TRUE, draws the histogram of the exposure plus three typical transformations:

plotHistogram(exp, select = "BDE209", show.trans = TRUE)
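The visual comparison can be complemented with a numerical one by re-running the Shapiro-Wilk test after each candidate transformation. The text does not name the three transformations, so the set used below (log, square root and exponential) is an assumption made for illustration, and the toy exposure is hypothetical.

```r
# Deterministic right-skewed toy exposure: quantiles of a log-normal
# distribution (hypothetical data).
x <- qlnorm(ppoints(100))

# Candidate transformations; this particular set is an assumption.
transformations <- list(
  none = identity,
  log  = log,
  sqrt = sqrt,
  exp  = exp
)

# Shapiro-Wilk p-value after each transformation
# (a higher p-value means less evidence against normality).
p_values <- vapply(transformations,
                   function(f) shapiro.test(f(x))$p.value,
                   numeric(1))

# Name of the transformation with the highest p-value.
best <- names(which.max(p_values))
```

For log-normal data the log transformation comes out on top, as expected. With real exposures the p-values should be read together with the histograms rather than on their own.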