date: “last edit: 9/21/2022”
Load packages into R session. It will automatically load the package of dplyr
and dbplyr
.
library(SQLDataFrame)
library(DBI)
dbfile <- system.file("extdata/test.db", package = "SQLDataFrame")
conn <- DBI::dbConnect(dbDriver("SQLite"), dbname = dbfile)
obj <- SQLDataFrame(
conn = conn, dbtable = "state", dbkey = "state")
obj
#> SQLDataFrame with 50 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Alabama | East South Central South 3615 medium
#> Alaska | Pacific West 365 small
#> Arizona | Mountain West 2280 medium
#> Arkansas | West South Central South 2110 medium
#> California | Pacific West 21198 large
#> ... . ... ... ... ...
#> Virginia | South Atlantic South 4981 medium
#> Washington | Pacific West 3559 medium
#> West Virginia | South Atlantic South 1799 medium
#> Wisconsin | East North Central North Central 4589 medium
#> Wyoming | Mountain West 376 small
To make the SQLDataFrame
object as light and compact as possible, there are only 5 slots contained in the object: tblData
, dbkey
, dbnrows
, dbconcatKey
, indexes
. Metadata information could be returned through these 5 slots using slot accessors or other utility functions.
slotNames(obj)
#> [1] "dbkey" "dbnrows" "tblData" "indexes"
#> [5] "dbconcatKey" "elementType" "elementMetadata" "metadata"
dbtable(obj)
#> [1] "state"
dbkey(obj)
#> [1] "state"
tblData
slotThe tblData
slot saves the dbplyr::tbl_dbi
version of the database table, which is a light-weight representation of the database table in R. Of note is that this lazy tbl only contains unique rows. It could also be sorted by the dbkey(obj)
if the SQLDataFrame
object was generated from union
or rbind
. So when the saveSQLDataFrame()
function was called, a database table will be written into a physical disk space and have the unique records.
Accessor function is made avaible for this slot:
tblData(obj)
#> # Source: table<state> [?? x 5]
#> # Database: sqlite 3.39.4 [/tmp/RtmpRM3VSz/Rinst26bffd2872677e/SQLDataFrame/extdata/test.db]
#> division region state population size
#> <chr> <chr> <chr> <dbl> <chr>
#> 1 East South Central South Alabama 3615 medium
#> 2 Pacific West Alaska 365 small
#> 3 Mountain West Arizona 2280 medium
#> 4 West South Central South Arkansas 2110 medium
#> 5 Pacific West California 21198 large
#> 6 Mountain West Colorado 2541 medium
#> 7 New England Northeast Connecticut 3100 medium
#> 8 South Atlantic South Delaware 579 small
#> 9 South Atlantic South Florida 8277 large
#> 10 South Atlantic South Georgia 4931 medium
#> # … with more rows
dbnrows
and dbconcatKey
The dbnrows
slot saves the number of rows corresponding to the tblData
, and dbconcatKey
saves the realized (concatenated if multiple) key columns corresponding to the tblData
. Accessor functions are also available for these 2 slots:
dbnrows(obj)
#> [1] 50
dbconcatKey(obj)
#> [1] "Alabama" "Alaska" "Arizona" "Arkansas"
#> [5] "California" "Colorado" "Connecticut" "Delaware"
#> [9] "Florida" "Georgia" "Hawaii" "Idaho"
#> [13] "Illinois" "Indiana" "Iowa" "Kansas"
#> [17] "Kentucky" "Louisiana" "Maine" "Maryland"
#> [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
#> [25] "Missouri" "Montana" "Nebraska" "Nevada"
#> [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
#> [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
#> [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
#> [41] "South Dakota" "Tennessee" "Texas" "Utah"
#> [45] "Vermont" "Virginia" "Washington" "West Virginia"
#> [49] "Wisconsin" "Wyoming"
indexes
slotThe indexes
slots is an unnamed list saving the row and column indexes respectively corresponding to the tblData
slot, so that the SQLDataFrame
could possibly have duplicate rows or only a subset of data records from the tblData
, while the tblData
slot doesn’t need to be changed. To be consistent, the slots of dbnrows
and dbconcatKey
will also remain unchanged.
obj@indexes
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
obj_sub <- obj[sample(5, 3, replace = TRUE), 2:3]
obj_sub
#> SQLDataFrame with 3 rows and 2 columns
#> state | region population
#> <character> | <character> <numeric>
#> California | West 21198
#> California | West 21198
#> Arizona | West 2280
obj_sub@indexes
#> [[1]]
#> [1] 5 5 3
#>
#> [[2]]
#> [1] 2 3
identical(tblData(obj), tblData(obj_sub))
#> [1] TRUE
With a filter
or select
function (which is similar to [i, ]
subsetting), only the indexes
slot will be updated for the row or column index pointing to the tblData
.
obj_filter <- obj %>% filter(division == "South Atlantic" & size == "medium")
obj_filter@indexes
#> [[1]]
#> [1] 10 20 40 46 48
#>
#> [[2]]
#> NULL
identical(tblData(obj), tblData(obj_filter))
#> [1] TRUE
obj_select <- obj %>% select(division, size)
obj_select@indexes
#> [[1]]
#> NULL
#>
#> [[2]]
#> [1] 1 4
identical(tblData(obj), tblData(obj_select))
#> [1] TRUE
The ROWNAMES,SQLDataFrame
method was defined to return the (concatenated if multiple) key column(s) value, so that the row subsetting with character vector works for the SQLDataFrame
objects.
rnms <- ROWNAMES(obj)
obj[sample(rnms, 3), ]
#> SQLDataFrame with 3 rows and 4 columns
#> state | division region population size
#> <character> | <character> <character> <numeric> <character>
#> Tennessee | East South Central South 4173 medium
#> Maryland | South Atlantic South 4122 medium
#> New York | Middle Atlantic Northeast 18076 large
For SQLDataFrame
object with composite keys:
obj1 <- SQLDataFrame(conn = conn, dbtable = "state",
dbkey = c("region", "population"))
ROWNAMES(obj1[1:10,])
#> [1] "South:3615.0" "West:365.0" "West:2280.0" "South:2110.0"
#> [5] "West:21198.0" "West:2541.0" "Northeast:3100.0" "South:579.0"
#> [9] "South:8277.0" "South:4931.0"
obj1[c("South:3615.0", "West:365.0"), ]
#> SQLDataFrame with 2 rows and 3 columns
#> region population | division state size
#> <character> <numeric> | <character> <character> <character>
#> South 3615 | East South Central Alabama medium
#> West 365 | Pacific Alaska small
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] DBI_1.1.3 SQLDataFrame_1.12.0 S4Vectors_0.36.0
#> [4] BiocGenerics_0.44.0 dbplyr_2.2.1 dplyr_1.0.10
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.9 pillar_1.8.1 bslib_0.4.0 compiler_4.2.1
#> [5] jquerylib_0.1.4 tools_4.2.1 bit_4.0.4 digest_0.6.30
#> [9] memoise_2.0.1 jsonlite_1.8.3 evaluate_0.17 RSQLite_2.2.18
#> [13] lifecycle_1.0.3 tibble_3.1.8 pkgconfig_2.0.3 rlang_1.0.6
#> [17] cli_3.4.1 yaml_2.3.6 xfun_0.34 fastmap_1.1.0
#> [21] withr_2.5.0 stringr_1.4.1 knitr_1.40 generics_0.1.3
#> [25] vctrs_0.5.0 sass_0.4.2 bit64_4.0.5 tidyselect_1.2.0
#> [29] glue_1.6.2 R6_2.5.1 fansi_1.0.3 rmarkdown_2.17
#> [33] purrr_0.3.5 blob_1.2.3 magrittr_2.0.3 htmltools_0.5.3
#> [37] assertthat_0.2.1 utf8_1.2.2 stringi_1.7.8 lazyeval_0.2.2
#> [41] cachem_1.0.6