CoreGx 1.2.0
The current implementation for the @sensitivity slot in a PharmacoSet has some limitations.
Firstly, it does not natively support dose-response experiments with multiple drugs and/or cancer cell lines. As a result, we have not been able to include such data in a PharmacoSet thus far.
Secondly, drug combination data has the potential to scale to high dimensionality. As a result, we need an object that is highly performant to ensure computations on such data can be completed in a timely manner.
The current use case is supporting drug and cell-line combinations in PharmacoGx, but we wanted to create something flexible enough to fit other use cases. As such, the current class makes no mention of drugs or cell-lines, nor anything specifically related to Bioinformatics or Computational Biology. Rather, we tried to design a general purpose data structure which could support high dimensional data for any use case.
Our design takes the best aspects of the SummarizedExperiment and MultiAssayExperiment classes and implements them using the data.table package, which provides an R API to a rich set of tools for high performance data processing implemented in C.
We have borrowed directly from the SummarizedExperiment class for the rowData, colData, metadata and assays slot names.
We also implemented the SummarizedExperiment accessor generics for the LongTable.
There are, however, some important differences which make this object more flexible when dealing with high dimensional data.
Unlike a SummarizedExperiment, there are three distinct classes of columns in rowData and colData.
The first is the rowKey or colKey. These are implemented internally to keep mappings between each assay and the associated samples or drugs, and are not returned by the accessors by default. The second is the rowIDs and colIDs. These hold all of the information necessary to uniquely identify a row or column and are used to generate the rowKey and colKey. Finally, there are the rowMeta and colMeta columns, which store any additional data about samples or drugs not required to uniquely identify a row in either table.
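The three classes of columns can be illustrated with a small data.table sketch. This is not the LongTable constructor. The table `rowDataSketch` and the `passage` column are hypothetical, and generating the key with data.table's `.GRP` is only one way such a key could be assigned.

```r
library(data.table)

# Hypothetical row annotations; `passage` is invented for illustration only
rowDataSketch <- data.table(
    cell_line1 = c("A2058", "A2780", "A2780"),  # rowID column
    BatchID    = c(1L, 1L, 2L),                 # rowID column
    passage    = c(3L, 5L, 7L)                  # rowMeta column
)

# rowIDs: the columns that together uniquely identify a row
rowIDs <- c("cell_line1", "BatchID")

# rowKey: one integer per unique combination of the rowID columns
rowDataSketch[, rowKey := .GRP, by = c("cell_line1", "BatchID")]

# rowMeta: everything that is neither an ID column nor the key
rowMeta <- setdiff(colnames(rowDataSketch), c(rowIDs, "rowKey"))
```

Here `rowKey` takes one integer value per row because every combination of `cell_line1` and `BatchID` is unique, and `rowMeta` contains only `passage`.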
Within the assays, the rowKey and colKey are combined to form a primary key for each assay row. This is required because each assay is stored in ‘long’ format, instead of wide format as in the assay matrices within a SummarizedExperiment. Thanks to the fast implementation of binary search within the data.table package, assay tables can scale up to tens or even hundreds of millions of rows while still being relatively performant.
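As a rough illustration of a long-format assay with a two-column key, the sketch below builds a toy table and uses data.table's keyed subsetting, which relies on binary search. The table `assaySketch` and its values are hypothetical and are not taken from the Merck data.

```r
library(data.table)

# Toy assay in long format: one row per (rowKey, colKey) pair that was measured
assaySketch <- data.table(
    rowKey    = c(1L, 1L, 2L, 3L),
    colKey    = c(1L, 2L, 1L, 2L),
    viability = c(0.97, 0.89, 0.92, 0.80)
)

# Setting a key sorts the table and enables binary-search lookups
setkey(assaySketch, rowKey, colKey)

# Keyed lookup of the single row with rowKey == 1 and colKey == 2
hit <- assaySketch[.(1L, 2L)]

# A key pair with no measurement returns zero rows when unmatched keys are dropped
miss <- assaySketch[.(2L, 2L), nomatch = NULL]
```

Because only measured pairs are stored, a combination that was never assayed takes up no space in the table, which is the main trade-off against a wide matrix layout.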
Also worth noting is the cardinality between rowData and colData for a given assay within the assays list. As indicated by the lower connection between these tables and an assay, for each row or column key there may be zero or more rows in the assay table. Conversely, for each row in the assay there may be zero or one key in colData or rowData. When combined, the rowKey and colKey for a given row in an assay become a composite key for that row.
The current implementation of the buildLongTable function is able to assemble a LongTable object from two sources. The first is a single large table with all assays, row and column data contained within it. This is the structure of the Merck drug combination data that has been used to test the data structure thus far.
library(data.table)  # provides fread
filePath <- '../data/merckLongTable.csv'
merckDT <- fread(filePath, na.strings=c('NULL'))
colnames(merckDT)
## [1] "cell_line" "combination_name" "BatchID" "drugA_name"
## [5] "drugA Conc (uM)" "drugB_name" "drugB Conc (uM)" "viability1"
## [9] "viability2" "viability3" "viability4" "mu/muMax"
## [13] "X/X0"
knitr::kable(head(merckDT)[, 1:5])
cell_line | combination_name | BatchID | drugA_name | drugA Conc (uM) |
---|---|---|---|---|
A2058 | 5-FU & ABT-888 | 1 | 5-FU | 0.35 |
A2058 | 5-FU & BEZ-235 | 1 | 5-FU | 0.35 |
A2058 | 5-FU & Bortezomib | 1 | 5-FU | 0.35 |
A2058 | 5-FU & Dasatinib | 1 | 5-FU | 0.35 |
A2058 | 5-FU & L778123 | 1 | 5-FU | 0.35 |
A2058 | 5-FU & geldanamycin | 1 | 5-FU | 0.35 |
knitr::kable(head(merckDT)[, 5:ncol(merckDT)])
drugA Conc (uM) | drugB_name | drugB Conc (uM) | viability1 | viability2 | viability3 | viability4 | mu/muMax | X/X0 |
---|---|---|---|---|---|---|---|---|
0.35 | ABT-888 | 0.35000 | 0.971 | 1.090 | 0.949 | 0.996 | 0.992 | 0.988 |
0.35 | BEZ-235 | 0.00450 | 0.921 | 0.947 | 0.915 | 0.956 | 0.965 | 0.953 |
0.35 | Bortezomib | 0.00045 | 0.983 | 0.962 | 0.950 | 0.954 | 0.978 | 0.970 |
0.35 | Dasatinib | 0.02400 | 0.798 | 0.778 | 0.946 | 0.312 | 0.879 | 0.846 |
0.35 | L778123 | 0.32500 | 1.117 | 1.020 | 0.920 | 0.927 | 0.986 | 0.981 |
0.35 | geldanamycin | 0.02230 | 1.023 | 1.018 | 0.912 | 0.897 | 0.982 | 0.975 |
We can see that all the data related to the treatment response experiment is contained within this table.
To build a LongTable object from this file:
rowDataCols <- list(
    c(cell_line1="cell_line", BatchID="BatchID"))
colDataCols <- list(
    c(drug1='drugA_name', drug2='drugB_name',
        drug1dose='drugA Conc (uM)', drug2dose='drugB Conc (uM)'),
    c(comboName='combination_name'))
assayCols <- list(viability=paste0('viability', seq_len(4)),
    viability_summary=c('mu/muMax', 'X/X0'))
longTable <- buildLongTable(from=filePath, rowDataCols,
    colDataCols, assayCols)
## < LongTable >
## dim: 8 583
## assays(2): viability viability_summary
## rownames(8): A2058:1 A2780:1 A2780:2 ... A427:1 CAOV3:1 CAOV3:2
## rowData(2): cell_line1 BatchID
## colnames(583): 5-FU:ABT-888:0.35:0.35 5-FU:AZD1775:0.35:0.0325 5-FU:BEZ-235:0.35:0.0045 ... geldanamycin:SN-38:0.0223:0.000115 geldanamycin:Sorafenib:0.0223:2.5 geldanamycin:Topotecan:0.0223:0.0223
## colData(5): drug1 drug2 drug1dose drug2dose comboName
## metadata(0): none
This function will also work if directly passed a data.table or data.frame object:
## [1] "All equal? TRUE"
The second option for building a LongTable is to pass it a list of different assays with a shared set of row and column identifiers. We haven’t had a chance to test this functionality with real data yet, but do have a toy example.
assayList <- assays(longTable, withDimnames=TRUE, metadata=TRUE, key=FALSE)
assayList$new_viability <- assayList$viability # Add a fake additional assay
assayCols$new_viability <- assayCols$viability # Add column names for fake assay
longTable2 <- buildLongTable(from=assayList, lapply(rowDataCols, names), lapply(colDataCols, names), assayCols)
## < LongTable >
## dim: 8 583
## assays(3): viability viability_summary new_viability
## rownames(8): A2058:1 A2780:1 A2780:2 ... A427:1 CAOV3:1 CAOV3:2
## rowData(2): cell_line1 BatchID
## colnames(583): 5-FU:ABT-888:0.35:0.35 5-FU:AZD1775:0.35:0.0325 5-FU:BEZ-235:0.35:0.0045 ... geldanamycin:SN-38:0.0223:0.000115 geldanamycin:Sorafenib:0.0223:2.5 geldanamycin:Topotecan:0.0223:0.0223
## colData(5): drug1 drug2 drug1dose drug2dose comboName
## metadata(0): none
We can see that a new assay has been added to the LongTable object when passed a list of assay tables containing the required row and column IDs. Additionally, any row or column IDs not already in rowData or colData will be appended to these slots automatically!
As mentioned previously, a LongTable has both list-like and table-like behaviours.
For table-like operations, a given LongTable can be thought of as a rowKey by colKey rectangular object.
To support data.frame-like subsetting for this object, the constructor makes pseudo row and column names, which are the ID columns for each row of rowData or colData pasted together with a ‘:’.
head(rownames(longTable))
## [1] "A2058:1" "A2780:1" "A2780:2" "A375:1" "A375:2" "A427:1"
We see that the rownames for the Merck LongTable are the cell line name pasted to the batch ID.
head(colnames(longTable))
## [1] "5-FU:ABT-888:0.35:0.35" "5-FU:AZD1775:0.35:0.0325"
## [3] "5-FU:BEZ-235:0.35:0.0045" "5-FU:Bortezomib:0.35:0.00045"
## [5] "5-FU:Dasatinib:0.35:0.024" "5-FU:Dinaciclib:0.35:0.000925"
For the column names, a similar pattern is followed by combining the colID columns in the form ‘drug1:drug2:drug1dose:drug2dose’.
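The pasting step can be sketched in base R. This is a minimal illustration, not the constructor's actual code. The data.frame `rowIDs` is a hypothetical stand-in whose values are taken from the first three rownames printed above.

```r
# Hypothetical ID columns; values match the first three rownames shown above
rowIDs <- data.frame(
    cell_line1 = c("A2058", "A2780", "A2780"),
    BatchID    = c(1, 1, 2)
)

# Paste the ID columns of each row together, separated by ':'
pseudoRownames <- do.call(paste, c(rowIDs, sep = ":"))
```

The same call applied to the four colID columns would yield names of the form ‘drug1:drug2:drug1dose:drug2dose’.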
data.frame Subsetting
We can subset a LongTable using the same row and column name syntax as with a data.frame or matrix.
row <- rownames(longTable)[1]
columns <- colnames(longTable)[1:2]
longTable[row, columns]
## < LongTable >
## dim: 1 2
## assays(2): viability viability_summary
## rownames(1): A2058:1
## rowData(2): cell_line1 BatchID
## colnames(2): 5-FU:ABT-888:0.35:0.35 5-FU:AZD1775:0.35:0.0325
## colData(5): drug1 drug2 drug1dose drug2dose comboName
## metadata(0): none
However, unlike a data.frame or matrix, this subsetting also accepts partial row and column names as well as regex queries.
head(rowData(longTable), 3)
## cell_line1 BatchID
## 1: A2058 1
## 2: A2780 1
## 3: A2780 2
head(colData(longTable), 3)
## drug1 drug2 drug1dose drug2dose comboName
## 1: 5-FU ABT-888 0.35 0.3500 5-FU & ABT-888
## 2: 5-FU AZD1775 0.35 0.0325 5-FU & AZD1775
## 3: 5-FU BEZ-235 0.35 0.0045 5-FU & BEZ-235
For example, if we want to get all instances where ‘5-FU’ is the drug:
longTable[, '5-FU']
## < LongTable >
## dim: 5 22
## assays(2): viability viability_summary
## rownames(5): A2058:1 A2780:1 A375:1 A427:1 CAOV3:1
## rowData(2): cell_line1 BatchID
## colnames(22): 5-FU:ABT-888:0.35:0.35 5-FU:AZD1775:0.35:0.0325 5-FU:BEZ-235:0.35:0.0045 ... 5-FU:geldanamycin:0.35:0.0223 MK-4541:5-FU:0.045:0.35 MRK-003:5-FU:0.35:0.35
## colData(5): drug1 drug2 drug1dose drug2dose comboName
## metadata(0): none
This has matched all colnames where 5-FU was in either drug1 or drug2. If we only want to match drug1, we have several options:
all.equal(longTable[, '5-FU:*:*:*'], longTable[, '^5-FU'])
## [1] TRUE
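The difference between the unanchored and anchored patterns can be reproduced on a plain character vector with base R's `grepl`. This only illustrates the regex behaviour and is not necessarily how the LongTable subset method is implemented. The vector `cols` holds a handful of column names copied from the output above.

```r
# A few column names copied from the output above
cols <- c(
    "5-FU:ABT-888:0.35:0.35",
    "5-FU:BEZ-235:0.35:0.0045",
    "MK-4541:5-FU:0.045:0.35",
    "geldanamycin:Topotecan:0.0223:0.0223"
)

# Unanchored pattern: matches 5-FU in either the drug1 or drug2 position
anyPosition <- cols[grepl("5-FU", cols)]

# Anchored pattern: matches only names that start with 5-FU, i.e. drug1
drug1Only <- cols[grepl("^5-FU", cols)]
```

In this sketch `anyPosition` keeps the first three names, while `drug1Only` keeps only the first two.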
data.table Subsetting
In addition to regex queries, a LongTable object supports arbitrarily complex subset queries using the data.table API. To access this API, you will need to use the . function, which allows you to pass raw R expressions to be evaluated inside the i and j arguments for dataTable[i, j].
For example, if we want to subset to rows where the cell line is CAOV3 and columns where drug1 is Temozolomide and drug2 is either Lapatinib or Bortezomib:
longTable[.(cell_line1 == 'CAOV3'), # row query
.(drug1 == 'Temozolomide' & drug2 %in% c('Lapatinib', 'Bortezomib'))] # column query
## < LongTable >
## dim: 1 2
## assays(2): viability viability_summary
## rownames(1): CAOV3:1
## rowData(2): cell_line1 BatchID
## colnames(2): Temozolomide:Bortezomib:2.75:0.00045 Temozolomide:Lapatinib:2.75:0.055
## colData(5): drug1 drug2 drug1dose drug2dose comboName
## metadata(0): none
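One way to think about passing raw expressions is as deferred evaluation: the expression is captured unevaluated and later evaluated with the table's columns in scope. The sketch below shows that general R idiom. The helper `capture` and the table `rowDataSketch` are hypothetical, and this is an assumption about the mechanism rather than the package's actual implementation of the . function.

```r
library(data.table)

# Hypothetical stand-in for the `.` helper: capture an expression unevaluated
capture <- function(expr) substitute(expr)

# Hypothetical row annotations for illustration
rowDataSketch <- data.table(
    cell_line1 = c("A2058", "CAOV3", "CAOV3"),
    BatchID    = c(1L, 1L, 2L)
)

# Capture the query without evaluating it
query <- capture(cell_line1 == "CAOV3" & BatchID != 2)

# Evaluate the captured call with the table's columns in scope
keep <- eval(query, envir = rowDataSketch)

# Use the resulting logical vector to subset the rows
selected <- rowDataSketch[keep]
```

Only the CAOV3 row from batch 1 survives the filter in this toy example.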
We can also invert matches or subset on other columns in rowData or colData:
subLongTable <-
longTable[.(BatchID != 2),
.(drug1 == 'Temozolomide' & drug2 != 'Lapatinib')]
To show that it works as expected:
print(paste0('BatchID: ', paste0(unique(rowData(subLongTable)$BatchID), collapse=', ')))
## [1] "BatchID: 1"
print(paste0('drug2: ', paste0(unique(colData(subLongTable)$drug2), collapse=', ')))
## [1] "drug2: ABT-888, AZD1775, BEZ-235, Bortezomib, Dasatinib, Dinaciclib, Erlotinib, MK-2206, MK-4827, MK-5108, MK-8669, MK-8776, Oxaliplatin, PD325901, SN-38, Sorafenib, Topotecan, geldanamycin"
head(rowData(longTable), 3)
## cell_line1 BatchID
## 1: A2058 1
## 2: A2780 1
## 3: A2780 2
head(rowData(longTable, key=TRUE), 3)
## cell_line1 BatchID rowKey
## 1: A2058 1 1
## 2: A2780 1 2
## 3: A2780 2 3
head(colData(longTable), 3)
## drug1 drug2 drug1dose drug2dose comboName
## 1: 5-FU ABT-888 0.35 0.3500 5-FU & ABT-888
## 2: 5-FU AZD1775 0.35 0.0325 5-FU & AZD1775
## 3: 5-FU BEZ-235 0.35 0.0045 5-FU & BEZ-235
head(colData(longTable, key=TRUE), 3)
## drug1 drug2 drug1dose drug2dose comboName colKey
## 1: 5-FU ABT-888 0.35 0.3500 5-FU & ABT-888 1
## 2: 5-FU AZD1775 0.35 0.0325 5-FU & AZD1775 2
## 3: 5-FU BEZ-235 0.35 0.0045 5-FU & BEZ-235 3
assays <- assays(longTable)
assays[[1]]
## viability1 viability2 viability3 viability4 rowKey colKey
## 1: 0.971 1.090 0.949 0.996 1 1
## 2: 0.893 1.106 0.907 1.029 1 2
## 3: 0.921 0.947 0.915 0.956 1 3
## 4: 0.983 0.962 0.950 0.954 1 4
## 5: 0.798 0.778 0.946 0.312 1 5
## ---
## 2911: 0.824 0.817 0.988 0.835 8 499
## 2912: 0.926 0.871 1.069 0.995 8 560
## 2913: 0.815 0.845 0.753 0.677 8 561
## 2914: 0.670 0.779 0.647 0.822 8 568
## 2915: 1.028 1.020 1.021 1.032 8 574
assays[[2]]
## mu/muMax X/X0 rowKey colKey
## 1: 0.992 0.988 1 1
## 2: 0.984 0.977 1 2
## 3: 0.965 0.953 1 3
## 4: 0.978 0.970 1 4
## 5: 0.879 0.846 1 5
## ---
## 2911: 0.757 0.714 8 499
## 2912: 0.947 0.930 8 560
## 2913: 0.683 0.645 8 561
## 2914: 0.581 0.559 8 568
## 2915: 1.032 1.045 8 574
assays <- assays(longTable, withDimnames=TRUE)
colnames(assays[[1]])
## [1] "drug1" "drug2" "drug1dose" "drug2dose" "comboName"
## [6] "cell_line1" "BatchID" "viability1" "viability2" "viability3"
## [11] "viability4"
assays <- assays(longTable, withDimnames=TRUE, metadata=TRUE)
colnames(assays[[2]])
## [1] "drug1" "drug2" "drug1dose" "drug2dose" "comboName"
## [6] "cell_line1" "BatchID" "mu/muMax" "X/X0"
assayNames(longTable)
## [1] "viability" "viability_summary"
Using these names we can access specific assays within a LongTable.
colnames(assay(longTable, 'viability'))
## [1] "viability1" "viability2" "viability3" "viability4" "rowKey"
## [6] "colKey"
assay(longTable, 'viability')
## viability1 viability2 viability3 viability4 rowKey colKey
## 1: 0.971 1.090 0.949 0.996 1 1
## 2: 0.893 1.106 0.907 1.029 1 2
## 3: 0.921 0.947 0.915 0.956 1 3
## 4: 0.983 0.962 0.950 0.954 1 4
## 5: 0.798 0.778 0.946 0.312 1 5
## ---
## 2911: 0.824 0.817 0.988 0.835 8 499
## 2912: 0.926 0.871 1.069 0.995 8 560
## 2913: 0.815 0.845 0.753 0.677 8 561
## 2914: 0.670 0.779 0.647 0.822 8 568
## 2915: 1.028 1.020 1.021 1.032 8 574
colnames(assay(longTable, 'viability', withDimnames=TRUE))
## [1] "drug1" "drug2" "drug1dose" "drug2dose" "comboName"
## [6] "cell_line1" "BatchID" "viability1" "viability2" "viability3"
## [11] "viability4"
assay(longTable, 'viability', withDimnames=TRUE)
## drug1 drug2 drug1dose drug2dose comboName
## 1: 5-FU ABT-888 0.3500 0.350000 5-FU & ABT-888
## 2: 5-FU AZD1775 0.3500 0.032500 5-FU & AZD1775
## 3: 5-FU BEZ-235 0.3500 0.004500 5-FU & BEZ-235
## 4: 5-FU Bortezomib 0.3500 0.000450 5-FU & Bortezomib
## 5: 5-FU Dasatinib 0.3500 0.024000 5-FU & Dasatinib
## ---
## 2911: Temozolomide Erlotinib 2.7500 0.055000 Temozolomide & Erlotinib
## 2912: Zolinza Dinaciclib 0.0925 0.000925 Zolinza & Dinaciclib
## 2913: Zolinza Erlotinib 0.0925 0.055000 Zolinza & Erlotinib
## 2914: Zolinza MK-8776 0.0925 0.092500 Zolinza & MK-8776
## 2915: Zolinza Temozolomide 0.0925 2.750000 Zolinza & Temozolomide
## cell_line1 BatchID viability1 viability2 viability3 viability4
## 1: A2058 1 0.971 1.090 0.949 0.996
## 2: A2058 1 0.893 1.106 0.907 1.029
## 3: A2058 1 0.921 0.947 0.915 0.956
## 4: A2058 1 0.983 0.962 0.950 0.954
## 5: A2058 1 0.798 0.778 0.946 0.312
## ---
## 2911: CAOV3 2 0.824 0.817 0.988 0.835
## 2912: CAOV3 2 0.926 0.871 1.069 0.995
## 2913: CAOV3 2 0.815 0.845 0.753 0.677
## 2914: CAOV3 2 0.670 0.779 0.647 0.822
## 2915: CAOV3 2 1.028 1.020 1.021 1.032