An introductory guide to analysing exposome data with R package rexposome
. The areas covered in this document are: loading exposome data from files and matrices, exploration the exposome data including missing data quantification and individual clustering, and testing association between exposures and health outcomes.
rexposome 1.12.4
rexposome
is an R package designed for the analysis of exposome data. The exposome can be defined as the measure of all the exposures of an individual in a lifetime and how those exposures relate to health. Hence, the aim or rexposome
is to offer a set of functions to incorporate exposome data to R framework. Also to provide a series of tools to analyse exposome data using standard methods from Biocondcutor.
rexposome
is currently in development and not available from CRAN nor Bioconductor. Anyway, the package can be installed by using devtools
R package and taking the source from Bioinformatic Research Group in Epidemiology’s GitHub repository.
This can be done by opening an R session and typing the following code:
devtools::install_github("isglobal-brge/rexposome")
User must take into account that this sentence do not install the packages dependencies.
The following pictures illustrates the rexposome
’s pipeline:
The first step is to load exposome data on R. rexposome
provides to functions for this aim: one to load three TXT
files and another to use three data.frames
. Then the quantification of missing data and values under limit of detection (LOD) is done, and it helps to consider imputation processes. The exposome characterization is useful to understand the nature of the exposome and the relations between exposures. The clustering processes on individual exposure data is done to find exposure-signatures which association with health outcomes can be tested in the next step. From both exposures and exposure-signatures levels, the association with health outcomes is tested using Exposome-Wide Association Studies (ExWAS).
rexposome
defines the exposome data as a three different data-sets:
The description data is a file describing the exposome. This means that has a row for each exposure and, at last, defined the families of exposures. Usually, this file incorporates a description of the exposures, the matrix where it was obtained and the units of measurement among others.
The following is an example of a description data file:
exposure family matrix description
bde100 PBDEs colostrum BDE 100 - log10
bde138 PBDEs colostrum BDE 138 - log10
bde209 PBDEs colostrum BDE 209 - log10
PFOA PFAS cord blood PFOA - log10
PFNA PFAS cord blood PFNA - log10
PFOA PFAS maternal serum PFOA - log10
PFNA PFAS maternal serum PFNA - log10
hg Metals cord blood hg - log 10
Co Metals urine Co (creatinine) - log10
Zn Metals urine Zn (creatinine) - log10
Pb Metals urine Pb (creatinine) - log10
THM Water --- Average total THM uptake - log10
CHCL3 Water --- Average Chloroform uptake - log10
BROM Water --- Average Brominated THM uptake - log10
NO2 Air --- NO2 levels whole pregnancy- log10
Ben Air --- Benzene levels whole pregnancy- log10
The exposures data file is the one containing the measures of each exposures for all the individuals included in the analysis. It is a matrix-like file having a row per individual and a column per exposures. It must includes a column with the subject’s identifier.
The following is an example of a exposures data file:
id bde100 bde138 bde209 PFOA ...
sub01 2.4665 0.7702 1.6866 2.0075 ...
sub02 0.7799 1.4147 1.2907 1.0153 ...
sub03 -1.6583 -0.9851 -0.8902 -0.0806 ...
sub04 -1.0812 -0.6639 -0.2988 -0.4268 ...
sub05 -0.2842 -0.1518 -1.5291 -0.7365 ...
... ... ... ... ...
The last of the data-sets is the phenotype data files. This file contains the covariates to be included in the analysis as well as the health outcomes of interest. It contains a row per individual included in the analysis and a column for each covariate and outcome. Moreover, it must include a column with the individual’s identifier.
The following is an example of a phenotype data file:
id asthma BMI sex age ...
sub01 control 23.2539 boy 4 ...
sub02 asthma 24.4498 girl 5 ...
sub03 asthma 15.2356 boy 4 ...
sub04 control 25.1387 girl 4 ...
sub05 control 22.0477 boy 5 ...
... ... ... ... ...
To properly coordinate the exposome data, the information included in the three data-sets must follow some rules:
This rules are easy seen in the following figure:
In summary: All the exposures, rows, in the description data file are columns in the exposures data file (plus the column for identifying subjects). All the subjects in the exposures data files are, also, in the phenotype data file.
To ease the life of researchers that have their datasets as one big table (exposures and phenotypes combined in a single table), we offer the option of using it as a input to rexposome
. Please look into the documentation of the loadExposome_plain()
function for further details on how to load this type of data. A few remarks on that methodology:
rexposome
An example of single table is the following:
id bde100 bde138 bde209 asthma BMI ...
sub01 2.4665 0.7702 1.6866 control 23.2539 ...
sub02 0.7799 1.4147 1.2907 asthma 24.4498 ...
sub03 -1.6583 -0.9851 -0.8902 asthma 15.2356 ...
sub04 -1.0812 -0.6639 -0.2988 control 25.1387 ...
sub05 -0.2842 -0.1518 -1.5291 control 22.0477 ...
... ... ... ... ... ...
And a visual representation of this single tables is the following:
rexposome
R package is loaded using the standard library
command:
library(rexposome)
rexposome
provides two functions to load the exposome data: readExposome
and loadexposome
. The function readExposome
will load the exposome data from txt files and loadExposome
will do the same from standard R data.frame
s. Both functions will create an ExposomeSet
object. The ExposomeSet
is a standard S4 class that will encapsulate the exposome data.
TXT
filesThe function readExposome
will create an ExposomeSet
from the three txt
files. The following lines are used to locate these three files, that were included in the package for demonstration purposes.
path <- file.path(path.package("rexposome"), "extdata")
description <- file.path(path, "description.csv")
phenotype <- file.path(path, "phenotypes.csv")
exposures <- file.path(path, "exposures.csv")
These files follows the rules described in Data Format section. They are csv
files, meaning each values is split from the others by a comma (,
). Function readExposome
allows to load most any type of files containing exposome data:
args(readExposome)
## function (exposures, description, phenotype, sep = ",", na.strings = c("NA",
## "-", "?", " ", ""), exposures.samCol = "sample", description.expCol = "exposure",
## description.famCol = "family", phenotype.samCol = "sample",
## exposures.asFactor = 5, warnings = TRUE)
## NULL
readExposome
expects, by default, csv
files. Changing the content of the argument sep
will allow to load other files types. The missing values are set using the argument na.strings
. This means that the character assigned to this argument will be interpreted as a missing value. By default, those characters are "NA"
, "-"
, "?"
, " "
and ""
. Then, the columns with the exposures’ names and the individual’s names need to be indicated. Arguments exposures.samCol
and phenotype.samCol
indicates the column with the individuals’ names at exposures file and phenotypes file. The arguments description.expCol
and description.famCol
indicates the column containing the exposures’ names and the exposures’ family in the description file.
exp <- readExposome(exposures = exposures, description = description, phenotype = phenotype,
exposures.samCol = "idnum", description.expCol = "Exposure", description.famCol = "Family",
phenotype.samCol = "idnum")
The result is an object of class ExposomeSet
, that can show all the information of the loaded exposome:
exp
## Object of class 'ExposomeSet' (storageMode: environment)
## . exposures description:
## . categorical: 4
## . continuous: 84
## . exposures transformation:
## . categorical: 0
## . transformed: 0
## . standardized: 0
## . imputed: 0
## . assayData: 88 exposures 109 individuals
## . element names: exp
## . exposures: AbsPM25, ..., Zn
## . individuals: id001, ..., id108
## . phenoData: 109 individuals 9 phenotypes
## . individuals: id001, ..., id108
## . phenotypes: whistling_chest, ..., cbmi
## . featureData: 88 exposures 7 explanations
## . exposures: AbsPM25, ..., Zn
## . descriptions: Family, ..., .imp
## experimentData: use 'experimentData(object)'
## Annotation:
Under the section exposures description the number of continuous (84) and categorical (4) exposures are shown. The assayData, phenoData and featureData shows the content of the files we loaded with readExposome
.
data.frame
The function loadExposome
allows to create an ExposomeSet
through three data.frames
: one as description data, one as exposures data and one as phenotypes data. The arguments are similar to the ones from readExposome
:
args(loadExposome)
## function (exposures, description, phenotype, description.famCol = "family",
## exposures.asFactor = 5, warnings = TRUE)
## NULL
In order to illustrate how to use loadExposome
, we are loading the previous csv
files as data.frames
:
dd <- read.csv(description, header = TRUE)
ee <- read.csv(exposures, header = TRUE)
pp <- read.csv(phenotype, header = TRUE)
Then we rearrange the data.frames
to fulfil with the requirements of the exposome data. The data.frame
corresponding to description data needs to have the exposure’s names as rownames.
rownames(dd) <- dd[, 2]
dd <- dd[, -2]
The data.frame
corresponding to exposures data needs to have the individual’s identifiers as rownames:
rownames(ee) <- ee[, 1]
ee <- ee[, -1]
The data.frame
corresponding to phenotypes data needs to have the individual’s identifiers as a rownames, as the previous data.frame
:
rownames(pp) <- pp[, 1]
pp <- pp[, -1]
Then, the ExposomeSet
is creating by giving the three data.frames
to loadExposome
:
exp <- loadExposome(exposures = ee, description = dd, phenotype = pp, description.famCol = "Family")
The class ExposomeSet
has several accessors to get the data stored in it. There are four basic methods that returns the names of the individuals (sampleNames
), the name of the exposures (exposureNames
), the name of the families of exposures (familyNames
) and the name of the phenotypes (phenotypeNames
).
head(sampleNames(exp))
## [1] "id001" "id002" "id003" "id004" "id005" "id006"
head(exposureNames(exp))
## [1] "AbsPM25" "As" "BDE100" "BDE138" "BDE153" "BDE154"
familyNames(exp)
## [1] "Air Pollutants" "Metals" "PBDEs"
## [4] "Bisphenol A" "Water Pollutants" "Built Environment"
## [7] "Cotinine" "Organochlorines" "Home Environment"
## [10] "Phthalates" "Noise" "PFOAs"
## [13] "Temperature"
phenotypeNames(exp)
## [1] "whistling_chest" "flu" "rhinitis" "wheezing"
## [5] "birthdate" "sex" "age" "cbmi"
## [9] "blood_pre"
fData
will return the description of the exposures (including internal information to manage them).
head(fData(exp), n = 3)
## Family Name .fct .trn
## AbsPM25 Air Pollutants Measurement of the blackness of PM2.5 filters
## As Metals Asenic
## BDE100 PBDEs Polybrominated diphenyl ether -100
## .std .imp .type
## AbsPM25 numeric
## As numeric
## BDE100 numeric
pData
will return the phenotypes information.
head(pData(exp), n = 3)
## whistling_chest flu rhinitis wheezing birthdate sex age cbmi blood_pre
## id001 never no no no 2004-12-29 male 4.2 16.3 120
## id002 never no no no 2005-01-05 male 4.2 16.4 121
## id003 7-12 epi no no yes 2005-01-05 male 4.2 19.0 120
Finally, the method expos
allows to obtain the matrix of exposures as a data.frame
:
expos(exp)[1:10, c("Cotinine", "PM10V", "PM25", "X5cxMEPP")]
## Cotinine PM10V PM25 X5cxMEPP
## id001 0.03125173 0.10373078 1.176255 NA
## id002 1.59401990 -0.47768393 1.155122 NA
## id003 1.46251090 NA 1.215834 1.859045
## id004 0.89059991 NA 1.171610 NA
## id005 NA NA 1.145765 NA
## id006 0.34818304 NA 1.145382 NA
## id007 1.53591130 NA 1.174642 NA
## id008 2.26864700 NA 1.165078 1.291871
## id009 1.24842660 NA 1.171406 1.650948
## id010 -0.36758339 0.01593277 1.179240 2.112357
The number of missing data on each exposure and on each phenotype can be found by using the function tableMissings
. This function returns a vector with the amount of missing data in each exposure or phenotype. The argument set
indicates if the number of missing values is counted on exposures of phenotypes. The argument output
indicates if it is shown as counts (output="n"
) or as percentage (output="p"
).
The current exposome data has no missing in the exposures nor in the phenotypes:
tableMissings(exp, set = "exposures", output = "n")
## Dens Temp Conn AbsPM25 NO NO2
## 0 0 1 2 2 2
## NOx PM10 PM10Cu PM10Fe PM10K PM10Ni
## 2 2 2 2 2 2
## PM10S PM10SI PM10Zn PM25 PM25CU PM25FE
## 2 2 2 2 2 2
## PM25K PM25Ni PM25S PM25Sl PM25Zn PMcoarse
## 2 2 2 2 2 2
## Benzene PM25V ETS G_pesticides Gas BTHM
## 3 3 5 5 5 6
## CHCl3 H_pesticides Noise_d Noise_n THM Cotinine
## 6 6 6 6 6 7
## DDE DDT HCB PCB118 PCB138 PCB153
## 13 13 13 13 13 13
## PCB180 bHCH BPA As Cs Mo
## 13 13 21 24 24 24
## Ni Tl Zn Hg Cd Sb
## 24 24 24 27 28 30
## Green Cu PM10V Se MBzP MEHHP
## 31 40 41 45 46 46
## MEHP MEOHP MEP MiBP MnBP X5cxMEPP
## 46 46 46 46 46 46
## Co PFHxS PFNA PFOA PFOS X7OHMMeOP
## 47 48 48 48 48 49
## Pb X2cxMMHP BDE100 BDE138 BDE153 BDE154
## 59 64 76 76 76 76
## BDE17 BDE183 BDE190 BDE209 BDE28 BDE47
## 76 76 76 76 76 76
## BDE66 BDE71 BDE85 BDE99
## 76 76 76 76
tableMissings(exp, set = "phenotypes", output = "n")
## whistling_chest flu rhinitis wheezing sex
## 0 0 0 0 0
## age cbmi blood_pre birthdate
## 0 0 2 3
Alternatively to tableMissings
, the function plotMissings
draw a bar plot with the percentage of missing data in each exposure of phenotype.
plotMissings(exp, set = "exposures")
Most of the test done in exposome analysis requires that the exposures must follow a normal distribution. The function normalityTest
performs a test on each exposure for normality behaviour. The result is a data.frame
with the exposures’ names, a flag TRUE
/FALSE
for normality and the p-value obtained from the Shapiro-Wilk Normality Test (if the p-value is under the threshold, then the exposure is not normal).
nm <- normalityTest(exp)
table(nm$normality)
##
## FALSE TRUE
## 55 29
So, the exposures that do not follow a normal distribution are:
nm$exposure[!nm$normality]
## [1] "DDT" "PM10SI" "PM25K" "PM25Sl" "PCB118" "Tl"
## [7] "PM10V" "PM25Zn" "PM25FE" "PM10K" "BDE17" "PM25"
## [13] "PMcoarse" "PM10" "BPA" "Green" "NO2" "Cs"
## [19] "PFNA" "PCB153" "PM25CU" "MEOHP" "Cu" "HCB"
## [25] "MEHHP" "DDE" "BDE190" "bHCH" "PM10Zn" "MnBP"
## [31] "NO" "NOx" "PM10S" "MEHP" "PCB138" "Zn"
## [37] "X2cxMMHP" "PCB180" "PFOA" "Cotinine" "PM25S" "Co"
## [43] "Conn" "PM25Ni" "PFHxS" "PM10Ni" "Cd" "Dens"
## [49] "Se" "X5cxMEPP" "BDE183" "BDE28" "Sb" "BDE138"
## [55] "PM25V"
Some of these exposures are categorical so they must not follow a normal distribution. This is the case, for example, of G_pesticides
. If we plot the histogram of the values of the exposures it will make clear:
library(ggplot2)
plotHistogram(exp, select = "G_pesticides") + ggtitle("Garden Pesticides")
Some others exposures are continuous variables that do not overpass the normality test. A visual inspection is required in this case.
plotHistogram(exp, select = "BDE209") + ggtitle("BDE209 - Histogram")