Large exact jobs are not only about arithmetic. They are also about
controlling how much work is done per block and where the results are
stored. bigKNN addresses those two questions with:

- execution plans that derive a working block size from a memory budget
- streaming results into big.matrix objects instead of returning everything as dense R objects

This vignette uses a moderately sized toy example so the code stays readable, but the same patterns scale to much larger references and query batches.
For a small exploratory search, the simplest thing is often best: call knn_bigmatrix() and get dense index and distance matrices back. That becomes less attractive as the job grows. The exact computation may still fit, but the output itself can become large: k-NN output scales like n_query x k. Execution plans and streaming let you control those two pressure points directly.
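A quick back-of-envelope estimate in plain R makes the pressure concrete. The 2 million query rows and k = 50 below are made-up numbers, and the 4-byte index entries are an assumption; the point is only that the output grows with n_query x k no matter how the distances are computed.

```r
# Hypothetical job size: 2 million query rows, 50 neighbours each.
n_query <- 2e6
k <- 50

# A dense k-NN result is one index matrix and one distance matrix,
# both n_query x k (assuming 4-byte integers and 8-byte doubles).
index_bytes    <- n_query * k * 4
distance_bytes <- n_query * k * 8

round((index_bytes + distance_bytes) / 1024^3, 2)
#> [1] 1.12
```

Over a gigabyte of ordinary R memory just for the answer, before any working buffers, is exactly the situation streaming is meant for.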
The reference matrix below has 160 rows and 4 columns. It is large enough to show real changes in derived block size, but still small enough that we can inspect outputs directly in the vignette.
i <- seq_len(160)
reference_matrix <- cbind(
x1 = i,
x2 = (i %% 7) + 1,
x3 = (i %% 11) + 0.5,
x4 = (i %% 13) + 2
)
reference <- as.big.matrix(reference_matrix)
dense_query <- rbind(
reference_matrix[5, ] + c(0.2, 0.0, 0.1, 0.0),
reference_matrix[50, ] + c(-0.3, 0.2, 0.0, 0.1),
reference_matrix[120, ] + c(0.4, -0.1, 0.2, 0.0),
reference_matrix[151, ] + c(0.1, 0.2, -0.2, 0.3)
)
query_ids <- paste0("q", seq_len(nrow(dense_query)))
dim(reference_matrix)
#> [1] 160 4
dense_query
#> x1 x2 x3 x4
#> [1,] 5.2 6.0 5.6 7.0
#> [2,] 49.7 2.2 6.5 13.1
#> [3,] 120.4 1.9 10.7 5.0
#> [4,] 151.1 5.2 8.3 10.3

Execution plans with knn_plan_bigmatrix()

knn_plan_bigmatrix() converts a target memory budget
into a search plan. The plan stores the metric, the requested thread
count, whether progress reporting should be enabled, and most
importantly the derived block_size.
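The budget is an ordinary size string. As a rough plain-R sketch of how such a string maps to a byte count (bigKNN's actual parser may accept different units or spellings; budget_bytes is a made-up helper for illustration):

```r
# Illustration: normalize a human-readable budget such as "64KB" to bytes.
budget_bytes <- function(budget) {
  units <- c(KB = 1024, MB = 1024^2, GB = 1024^3)
  unit  <- toupper(gsub("[0-9. ]", "", budget))  # keep only the unit suffix
  as.numeric(gsub("[^0-9.]", "", budget)) * units[[unit]]
}

budget_bytes("64KB")
#> [1] 65536
```
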
plan <- knn_plan_bigmatrix(
reference,
metric = "euclidean",
memory_budget = "64KB",
num_threads = 2L,
progress = FALSE
)
plan
#> <bigknn_plan>
#> metric: euclidean
#> memory_budget: 64KB
#> block_size: 72
#> num_threads: 2
#> shape: 160 x 4

The printed object is the key summary you usually need:

- metric: the exact distance used by plan-aware calls
- memory_budget: the requested budget in a normalized form
- block_size: the number of rows processed per block
- num_threads: the thread request forwarded to common BLAS/OpenMP variables
- shape: the reference dimensions the plan was built for

Block size is derived rather than guessed. With the same reference matrix, a larger budget yields a larger working block.
plan_small <- knn_plan_bigmatrix(
reference,
metric = "euclidean",
memory_budget = "4KB",
num_threads = 2L,
progress = FALSE
)
plan_large <- knn_plan_bigmatrix(
reference,
metric = "euclidean",
memory_budget = "1MB",
num_threads = 2L,
progress = FALSE
)
data.frame(
memory_budget = c(plan_small$memory_budget, plan$memory_budget, plan_large$memory_budget),
block_size = c(plan_small$block_size, plan$block_size, plan_large$block_size),
row.names = NULL
)
#> memory_budget block_size
#> 1 4KB 13
#> 2 64KB 72
#> 3 1MB 160

This is useful even though the search remains exact. The plan does not change which neighbours are returned. It changes how much data is processed at once, which makes resource use more predictable across runs and machines.
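To see why the answer is unaffected, here is a plain-R sketch of blocked exact search. It is an illustration, not bigKNN's implementation: blocked_nn1 is a made-up helper that tracks only the single nearest neighbour per query, but it shows how the per-block working set, not the result, depends on the block size.

```r
# Illustration only: exact nearest-neighbour search that visits the
# reference in blocks, so peak memory is governed by the block size.
blocked_nn1 <- function(reference, query, block_size) {
  n_ref <- nrow(reference)
  best_dist <- rep(Inf, nrow(query))
  best_idx  <- rep(NA_integer_, nrow(query))
  start <- 1L
  while (start <= n_ref) {
    end <- min(start + block_size - 1L, n_ref)
    block <- reference[start:end, , drop = FALSE]  # the only large slice
    for (q in seq_len(nrow(query))) {
      d <- sqrt(colSums((t(block) - query[q, ])^2))  # exact Euclidean
      j <- which.min(d)
      if (d[j] < best_dist[q]) {
        best_dist[q] <- d[j]
        best_idx[q]  <- start + j - 1L
      }
    }
    start <- end + 1L
  }
  list(index = best_idx, distance = best_dist)
}

ref <- cbind(c(0, 1, 10), c(0, 0, 0))
qry <- rbind(c(0.9, 0))
blocked_nn1(ref, qry, block_size = 2L)  # nearest reference row is 2
```

Any block_size returns the same neighbour; only the size of each block slice, and hence the peak memory per pass, changes.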
In practice, plans plug into the same search API you would use without them.
planned_knn <- knn_bigmatrix(
reference,
query = dense_query,
k = 3,
plan = plan,
exclude_self = FALSE
)
planned_knn
#> <bigknn_knn_result>
#> metric: euclidean
#> k: 3
#> queries: 4
#> references: 160
#> backend: bruteforce
knn_table(planned_knn, query_ids = query_ids)
#> query rank neighbor distance
#> 1 q1 1 5 0.22361
#> 2 q1 2 6 1.85740
#> 3 q1 3 4 2.15640
#> 4 q2 1 50 0.37417
#> 5 q2 2 49 2.03470
#> 6 q2 3 51 2.03470
#> 7 q3 1 120 0.45826
#> 8 q3 2 119 2.28250
#> 9 q3 3 118 6.37260
#> 10 q4 1 151 0.42426
#> 11 q4 2 152 1.83850
#> 12 q4 3 150 2.23160

The result contract is unchanged. You still get dense
index and distance matrices. The difference is
that bigKNN uses the plan’s block size and thread policy
while computing them.
Streaming k-NN output with knn_stream_bigmatrix()

When n_query x k is large enough that you do not want the full result held in ordinary R matrices, switch to knn_stream_bigmatrix().
The destination requirements are straightforward:
- xpIndex must be a writable big.matrix with n_query rows and k columns
- xpDistance, if supplied, must have the same dimensions
- xpIndex should use integer or double storage
- xpDistance should use double storage

index_store <- big.matrix(nrow(dense_query), 3, type = "integer")
distance_store <- big.matrix(nrow(dense_query), 3, type = "double")
streamed_knn <- knn_stream_bigmatrix(
reference,
query = dense_query,
xpIndex = index_store,
xpDistance = distance_store,
k = 3,
plan = plan,
exclude_self = FALSE
)
bigmemory::as.matrix(streamed_knn$index)
#> [,1] [,2] [,3]
#> [1,] 5 6 4
#> [2,] 50 49 51
#> [3,] 120 119 118
#> [4,] 151 152 150
round(bigmemory::as.matrix(streamed_knn$distance), 4)
#> [,1] [,2] [,3]
#> [1,] 0.2236 1.8574 2.1564
#> [2,] 0.3742 2.0347 2.0347
#> [3,] 0.4583 2.2825 6.3726
#> [4,] 0.4243 1.8385 2.2316

Those streamed outputs match the in-memory result exactly.
Streaming radius output with radius_stream_bigmatrix()

Radius search needs one extra step before streaming: you have to know how much space to allocate for the flattened matches. The simplest pattern is:

- call count_within_radius_bigmatrix() to get per-query counts
- allocate sum(counts) rows for the flattened index and distance
- allocate length(counts) + 1 rows for the offset vector
- run radius_stream_bigmatrix()

radius_counts <- count_within_radius_bigmatrix(
reference,
query = dense_query,
radius = 2.2,
plan = plan,
exclude_self = FALSE
)
radius_counts
#> [1] 3 3 1 2
total_matches <- sum(radius_counts)
total_matches
#> [1] 9

radius_index_store <- big.matrix(total_matches, 1, type = "integer")
radius_distance_store <- big.matrix(total_matches, 1, type = "double")
radius_offset_store <- big.matrix(length(radius_counts) + 1L, 1, type = "double")
streamed_radius <- radius_stream_bigmatrix(
reference,
query = dense_query,
xpIndex = radius_index_store,
xpDistance = radius_distance_store,
xpOffset = radius_offset_store,
radius = 2.2,
plan = plan,
exclude_self = FALSE
)
streamed_radius
#> <bigknn_radius_result>
#> metric: euclidean
#> radius: 2.2
#> queries: 4
#> references: 160
#> matches: 9
streamed_radius$n_match
#> [1] 3 3 1 2

The offset vector is what makes flattened radius output usable. It tells you where each query’s matches begin and end.
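The offsets follow from the counts by simple cumulative arithmetic, which is easy to check in plain R using the counts from above (rows_for_query is a made-up helper, but the slicing rule is the one the flattened layout implies):

```r
# Rebuild the 1-based offsets from the per-query match counts: one
# start position per query, plus one trailing entry past the end.
counts  <- c(3L, 3L, 1L, 2L)
offsets <- c(1L, 1L + cumsum(counts))
offsets
#> [1]  1  4  7  8 10

# Matches for query i sit in flattened rows offsets[i] .. offsets[i + 1] - 1.
rows_for_query <- function(i) seq(offsets[i], offsets[i + 1] - 1L)
rows_for_query(2)
#> [1] 4 5 6
```
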
radius_offset <- as.vector(bigmemory::as.matrix(streamed_radius$offset))
radius_index <- as.vector(bigmemory::as.matrix(streamed_radius$index))
radius_distance <- as.vector(bigmemory::as.matrix(streamed_radius$distance))
radius_offset
#> [1] 1 4 7 8 10
radius_slice_table(radius_index, radius_distance, radius_offset, query_ids, 1)
#> query neighbor distance
#> 1 q1 5 0.22361
#> 2 q1 6 1.85740
#> 3 q1 4 2.15640
radius_slice_table(radius_index, radius_distance, radius_offset, query_ids, 2)
#> query neighbor distance
#> 1 q2 50 0.37417
#> 2 q2 49 2.03470
#> 3 q2 51 2.03470

If you prefer, you can skip the streamed route and use
radius_bigmatrix() to have bigKNN allocate the
flattened vectors for you. Streaming becomes useful when those flattened
vectors are large or when you want explicit control over their
storage.
The public API also accepts sparse query matrices. At the moment they are densified on the R side before exact computation, so sparse input is primarily an interface convenience rather than a sparse-native backend.
sparse_query <- Matrix::Matrix(dense_query, sparse = TRUE)
sparse_knn <- knn_bigmatrix(
reference,
query = sparse_query,
k = 3,
plan = plan,
exclude_self = FALSE
)
identical(sparse_knn$index, planned_knn$index)
#> [1] TRUE
all.equal(sparse_knn$distance, planned_knn$distance)
#> [1] TRUE

That makes sparse queries especially handy when your upstream
workflow already produces Matrix objects, and you want to
use the same exact search call without manually converting first.
A simple rule of thumb works well:

- knn_bigmatrix() when n_query x k is small enough that ordinary R matrices are convenient
- knn_stream_bigmatrix() when n_query x k is large and you want control over where those matrices live
- radius_bigmatrix() when flattened radius output is still manageable in memory
- radius_stream_bigmatrix() when the total number of matches is large or highly variable across query rows
- knn_plan_bigmatrix() whenever you want repeatable memory and threading policy instead of ad hoc defaults

Plans and streaming do not make the search approximate. They make the exact workflow more operationally predictable, which is often the difference between an algorithm that works in principle and one that is pleasant to run at scale.