Prepared references let bigKNN cache metric-specific
information about a fixed reference matrix and reuse it across later
exact searches. They are the right tool when the reference data stays
put but queries arrive in batches over time.
This article walks through that pattern end to end: preparing a
big.matrix reference, searching it with fresh query batches, streaming
results into destination matrices, and persisting the prepared cache.
Prepared references are most useful when the reference matrix is fixed
but queries keep arriving over time. They do not change the search
result: the advantage is that repeated searches can reuse cached
row-wise quantities instead of recomputing them every time.
For this vignette we will use a file-backed big.matrix,
because persisted prepared caches are easiest to demonstrate when the
reference can be reattached through files on disk.
scratch_dir <- file.path(tempdir(), "bigknn-prepared-search")
dir.create(scratch_dir, recursive = TRUE, showWarnings = FALSE)
reference_points <- data.frame(
id = paste0("r", 1:8),
x1 = c(1, 1, 2, 2, 3, 3, 4, 4),
x2 = c(1, 2, 1, 2, 2, 3, 3, 4),
x3 = c(0.5, 0.5, 1.0, 1.0, 1.5, 1.5, 2.0, 2.5)
)
reference <- filebacked.big.matrix(
nrow = nrow(reference_points),
ncol = 3,
type = "double",
backingfile = "reference.bin",
descriptorfile = "reference.desc",
backingpath = scratch_dir
)
reference[,] <- as.matrix(reference_points[c("x1", "x2", "x3")])
query_batch_a <- matrix(
c(1.1, 1.2, 0.5,
2.7, 2.2, 1.4),
ncol = 3,
byrow = TRUE
)
query_batch_b <- matrix(
c(3.6, 3.1, 1.9,
1.5, 1.8, 0.8),
ncol = 3,
byrow = TRUE
)
query_ids_a <- c("a1", "a2")
query_ids_b <- c("b1", "b2")
reference_points
#> id x1 x2 x3
#> 1 r1 1 1 0.5
#> 2 r2 1 2 0.5
#> 3 r3 2 1 1.0
#> 4 r4 2 2 1.0
#> 5 r5 3 2 1.5
#> 6 r6 3 3 1.5
#> 7 r7 4 3 2.0
#> 8 r8 4 4 2.5
All rows are non-zero, which matters because cosine distance requires non-zero reference and query vectors.
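As a quick sanity check, a base-R sketch (reusing the reference_points data frame created above) confirms that every reference row has a non-zero norm:

```r
# Compute each row's Euclidean norm and confirm none is zero,
# i.e. all rows are valid inputs for cosine distance.
coords <- as.matrix(reference_points[, c("x1", "x2", "x3")])
row_norms <- sqrt(rowSums(coords^2))
all(row_norms > 0)
#> [1] TRUE
```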
knn_prepare_bigmatrix()
prepared <- knn_prepare_bigmatrix(reference, metric = "cosine")
prepared
#> <bigknn_prepared>
#> metric: cosine
#> block_size: 1024
#> shape: 8 x 3
#> validated: TRUE
Internally, a prepared object stores the metric, the reference
dimensions, and a row_cache of per-row quantities. The print method
keeps that summary compact; summary() exposes the individual fields:
summary(prepared)
#> $metric
#> [1] "cosine"
#>
#> $block_size
#> [1] 1024
#>
#> $n_ref
#> [1] 8
#>
#> $n_col
#> [1] 3
#>
#> $validated
#> [1] TRUE
#>
#> $cache_path
#> NULL
length(prepared$row_cache)
#> [1] 8
head(prepared$row_cache, 4)
#> [1] 1.500000 2.291288 2.449490 3.000000
For cosine distance, row_cache contains row-wise
quantities (here, the Euclidean norm of each reference row) that are
reused during later searches. In normal workflows
you rarely need to manipulate it directly; it is included here so you
can see that a prepared object is more than just a wrapper around the
original big.matrix.
knn_search_prepared()
batch_a_result <- knn_search_prepared(
prepared,
query = query_batch_a,
k = 2,
exclude_self = FALSE
)
batch_b_result <- knn_search_prepared(
prepared,
query = query_batch_b,
k = 2,
exclude_self = FALSE
)
batch_a_result
#> <bigknn_knn_result>
#> metric: cosine
#> k: 2
#> queries: 2
#> references: 8
#> backend: bruteforce
knn_table(batch_a_result, query_ids = query_ids_a, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 a1 1 r6 0.00172560
#> 2 a1 2 r1 0.00172560
#> 3 a2 1 r7 0.00069773
#> 4 a2 2 r5 0.00399290
knn_table(batch_b_result, query_ids = query_ids_b, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 b1 1 r7 0.0019579
#> 2 b1 2 r8 0.0029916
#> 3 b2 1 r6 0.0037227
#> 4 b2 2 r1 0.0037227
The result contract is the same as knn_bigmatrix(). The
difference is that the reference preparation step has already been done,
so you can reuse the same prepared object across many query
batches.
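For this vignette's two batches, that reuse can be sketched as a simple loop over the objects defined above:

```r
# Illustrative sketch: one preparation step serves every later batch.
batches <- list(a = query_batch_a, b = query_batch_b)
results <- lapply(batches, function(q) {
  knn_search_prepared(prepared, query = q, k = 2, exclude_self = FALSE)
})
```

Each element of results carries the same result contract shown above.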
To make that explicit, we can compare a prepared search with the one-shot API:
direct_batch_a <- knn_bigmatrix(
reference,
query = query_batch_a,
k = 2,
metric = "cosine",
exclude_self = FALSE
)
identical(batch_a_result$index, direct_batch_a$index)
#> [1] TRUE
all.equal(batch_a_result$distance, direct_batch_a$distance)
#> [1] TRUE
Prepared search is therefore an ergonomics and performance feature, not a different search algorithm.
knn_search_stream_prepared()
If you want the prepared search to write directly into destination
big.matrix objects, use
knn_search_stream_prepared(). This is helpful when the
query set is larger or when you want to keep results in shared-memory or
file-backed structures instead of dense R matrices.
index_store <- big.matrix(nrow(query_batch_b), 2, type = "integer")
distance_store <- big.matrix(nrow(query_batch_b), 2, type = "double")
streamed_batch_b <- knn_search_stream_prepared(
prepared,
query = query_batch_b,
xpIndex = index_store,
xpDistance = distance_store,
k = 2,
exclude_self = FALSE
)
bigmemory::as.matrix(streamed_batch_b$index)
#> [,1] [,2]
#> [1,] 7 8
#> [2,] 6 1
round(bigmemory::as.matrix(streamed_batch_b$distance), 6)
#> [,1] [,2]
#> [1,] 0.001958 0.002992
#> [2,] 0.003723 0.003723
all.equal(bigmemory::as.matrix(streamed_batch_b$distance), batch_b_result$distance)
#> [1] TRUE
The neighbour indices and distances are the same as the in-memory prepared search; the only difference is where the results land.
cache_path
Prepared references can be serialized with cache_path,
which is useful when a project repeatedly opens the same file-backed
reference over many sessions.
cache_path <- file.path(scratch_dir, "prepared-cosine-cache.rds")
prepared_cached <- knn_prepare_bigmatrix(
reference,
metric = "cosine",
cache_path = cache_path
)
prepared_cached
#> <bigknn_prepared>
#> metric: cosine
#> block_size: 1024
#> shape: 8 x 3
#> validated: TRUE
#> cache_path: /private/var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T/RtmpomqI5n/bigknn-prepared-search/prepared-cosine-cache.rds
file.exists(cache_path)
#> [1] TRUE
Persisted prepared references are especially helpful for long-running projects and reproducible pipelines.
knn_load_prepared()
loaded <- knn_load_prepared(cache_path)
loaded
#> <bigknn_prepared>
#> metric: cosine
#> block_size: 1024
#> shape: 8 x 3
#> validated: TRUE
#> cache_path: /private/var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T/RtmpomqI5n/bigknn-prepared-search/prepared-cosine-cache.rds
knn_load_prepared() restores the cached metadata and
reattaches the underlying big.matrix through its stored
descriptor. That means the prepared cache is tied to the original
reference backing files: if those files move or disappear, the cache can
no longer be reattached.
knn_validate_prepared()
Validation is usually worth calling after loading a cache from disk, or any time you want to confirm that the descriptor, cached dimensions, and row cache still match the underlying reference.
Once the cache has been loaded and validated, it behaves like any other prepared reference.
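A minimal sketch of that workflow, assuming knn_validate_prepared() accepts the prepared object directly (its exact return value is not shown in this vignette):

```r
# Hypothetical usage: validate the reloaded cache, then search with it
# exactly as with a freshly prepared reference.
knn_validate_prepared(loaded)

loaded_result <- knn_search_prepared(
  loaded,
  query = query_batch_a,
  k = 2,
  exclude_self = FALSE
)
all.equal(loaded_result$distance, batch_a_result$distance)
```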
In summary:
- The cache reattaches through the big.matrix descriptor, so the reference files need to remain accessible.
- Use validate = TRUE when preparing cosine references so incompatible rows are caught early.
- Replace repeated knn_bigmatrix() calls with one knn_prepare_bigmatrix() plus knn_search_prepared() per batch.
Prepared references are a small API feature with a big practical payoff: you do the setup work once, and then exact search against the same reference becomes easier to repeat, easier to stream, and easier to persist.