Execution Plans and Streaming Workflows

Large exact jobs are not only about arithmetic. They are also about controlling how much work is done per block and where the results are stored. bigKNN addresses those two questions with:

  - execution plans, built with knn_plan_bigmatrix(), which turn a memory budget into a concrete block_size, and
  - streaming variants, knn_stream_bigmatrix() and radius_stream_bigmatrix(), which write results directly into big.matrix destinations that you allocate.

This vignette uses a moderately sized toy example so the code stays readable, but the same patterns scale to much larger references and query batches.

Why plans and streaming matter for large matrices

For a small exploratory search, the simplest thing is often best: call knn_bigmatrix() directly and keep the index and distance results as ordinary R matrices.

That becomes less attractive as the job grows. The exact computation may still fit, but the output itself can become large: an n_query x k pair of index and distance matrices, or a flattened radius result whose total length you cannot predict in advance.

Execution plans and streaming let you control those two pressure points directly.
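Before looking at the API, the underlying technique is worth seeing in plain R: a blocked exact search partitions the reference rows, scores one block at a time, and keeps a running best k per query. The sketch below is a simplified base-R illustration of that idea, not bigKNN's implementation:

```r
# Simplified base-R sketch of blocked exact k-NN (Euclidean distance).
# Every reference row is still scored, so the block size changes
# memory behaviour but never the answer. Not bigKNN internals.
blocked_knn <- function(reference, query, k, block_size) {
  best_d <- matrix(Inf, nrow(query), k)          # running best distances
  best_i <- matrix(NA_integer_, nrow(query), k)  # running best indices
  for (start in seq(1L, nrow(reference), by = block_size)) {
    rows <- start:min(start + block_size - 1L, nrow(reference))
    block <- reference[rows, , drop = FALSE]
    for (q in seq_len(nrow(query))) {
      d <- sqrt(colSums((t(block) - query[q, ])^2))
      merged_d <- c(best_d[q, ], d)              # merge this block's candidates
      merged_i <- c(best_i[q, ], rows)           # with the current best k
      keep <- order(merged_d)[seq_len(k)]
      best_d[q, ] <- merged_d[keep]
      best_i[q, ] <- merged_i[keep]
    }
  }
  list(index = best_i, distance = best_d)
}
```

Any choice of block_size returns the same neighbours; only the size of the per-block working set changes. That invariance is exactly the property the plans below rely on.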

Build a repeatable example

The reference matrix below has 160 rows and 4 columns. It is large enough to show real changes in derived block size, but still small enough that we can inspect outputs directly in the vignette.

library(bigKNN)
library(bigmemory)

i <- seq_len(160)

reference_matrix <- cbind(
  x1 = i,
  x2 = (i %% 7) + 1,
  x3 = (i %% 11) + 0.5,
  x4 = (i %% 13) + 2
)

reference <- as.big.matrix(reference_matrix)

dense_query <- rbind(
  reference_matrix[5, ] + c(0.2, 0.0, 0.1, 0.0),
  reference_matrix[50, ] + c(-0.3, 0.2, 0.0, 0.1),
  reference_matrix[120, ] + c(0.4, -0.1, 0.2, 0.0),
  reference_matrix[151, ] + c(0.1, 0.2, -0.2, 0.3)
)

query_ids <- paste0("q", seq_len(nrow(dense_query)))

dim(reference_matrix)
#> [1] 160   4
dense_query
#>         x1  x2   x3   x4
#> [1,]   5.2 6.0  5.6  7.0
#> [2,]  49.7 2.2  6.5 13.1
#> [3,] 120.4 1.9 10.7  5.0
#> [4,] 151.1 5.2  8.3 10.3

Building a plan with knn_plan_bigmatrix()

knn_plan_bigmatrix() converts a target memory budget into a search plan. The plan stores the metric, the requested thread count, whether progress reporting should be enabled, and most importantly the derived block_size.

plan <- knn_plan_bigmatrix(
  reference,
  metric = "euclidean",
  memory_budget = "64KB",
  num_threads = 2L,
  progress = FALSE
)

plan
#> <bigknn_plan>
#>   metric: euclidean
#>   memory_budget: 64KB
#>   block_size: 72
#>   num_threads: 2
#>   shape: 160 x 4

The printed object is the key summary you usually need: the metric, the requested memory budget, the derived block_size, the number of threads, and the shape of the reference matrix.

How memory budget maps to block size

Block size is derived rather than guessed. With the same reference matrix, a larger budget yields a larger working block.

plan_small <- knn_plan_bigmatrix(
  reference,
  metric = "euclidean",
  memory_budget = "4KB",
  num_threads = 2L,
  progress = FALSE
)

plan_large <- knn_plan_bigmatrix(
  reference,
  metric = "euclidean",
  memory_budget = "1MB",
  num_threads = 2L,
  progress = FALSE
)

data.frame(
  memory_budget = c(plan_small$memory_budget, plan$memory_budget, plan_large$memory_budget),
  block_size = c(plan_small$block_size, plan$block_size, plan_large$block_size),
  row.names = NULL
)
#>   memory_budget block_size
#> 1           4KB         13
#> 2          64KB         72
#> 3           1MB        160

This is useful even though the search remains exact. The plan does not change which neighbours are returned. It changes how much data is processed at once, which makes resource use more predictable across runs and machines.

In practice:

  - a tighter budget gives smaller blocks and a lower peak memory footprint, at the cost of more passes over the reference;
  - a more generous budget gives fewer, larger blocks;
  - block_size is capped at the number of reference rows, which is why the 1MB plan above processes all 160 rows in a single block.
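The exact mapping from budget to block size is internal to knn_plan_bigmatrix(), but its overall shape can be sketched in base R. The divisor below is an illustrative assumption, not bigKNN's actual rule; the one behaviour it is designed to share is the cap at the number of reference rows:

```r
# Illustrative heuristic only: how a planner might turn a byte budget
# into a block size. The divisor is an assumption, not bigKNN's formula.
derive_block_size <- function(budget_bytes, n_ref, n_col,
                              bytes_per_value = 8) {
  rows <- budget_bytes %/% (n_col * bytes_per_value)
  # Cap at n_ref: a generous budget simply means one block.
  max(1L, min(n_ref, as.integer(rows)))
}

derive_block_size(1024 * 1024, n_ref = 160, n_col = 4)  # 160: capped at n_ref
```

Whatever the real formula, the same two endpoints hold: a tiny budget still yields at least one row per block, and a large budget degenerates to a single full pass.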

Streaming k-NN output with knn_stream_bigmatrix()

When n_query x k is large enough that you do not want the full result held in ordinary R matrices, switch to knn_stream_bigmatrix().

The destination requirements are straightforward: an integer big.matrix with one row per query and one column per neighbour for the indices, and a double big.matrix of the same shape for the distances.

index_store <- big.matrix(nrow(dense_query), 3, type = "integer")
distance_store <- big.matrix(nrow(dense_query), 3, type = "double")

streamed_knn <- knn_stream_bigmatrix(
  reference,
  query = dense_query,
  xpIndex = index_store,
  xpDistance = distance_store,
  k = 3,
  plan = plan,
  exclude_self = FALSE
)

bigmemory::as.matrix(streamed_knn$index)
#>      [,1] [,2] [,3]
#> [1,]    5    6    4
#> [2,]   50   49   51
#> [3,]  120  119  118
#> [4,]  151  152  150
round(bigmemory::as.matrix(streamed_knn$distance), 4)
#>        [,1]   [,2]   [,3]
#> [1,] 0.2236 1.8574 2.1564
#> [2,] 0.3742 2.0347 2.0347
#> [3,] 0.4583 2.2825 6.3726
#> [4,] 0.4243 1.8385 2.2316

Those streamed outputs match the in-memory result exactly (planned_knn is the output of knn_bigmatrix() run with the same plan):

identical(bigmemory::as.matrix(streamed_knn$index), planned_knn$index)
#> [1] TRUE
all.equal(bigmemory::as.matrix(streamed_knn$distance), planned_knn$distance)
#> [1] TRUE

Streaming radius output with radius_stream_bigmatrix()

Radius search needs one extra step before streaming: you have to know how much space to allocate for the flattened matches. The simplest pattern is:

  1. call count_within_radius_bigmatrix() to get per-query counts
  2. allocate sum(counts) rows for the flattened index and distance
  3. allocate length(counts) + 1 rows for the offset vector
  4. run radius_stream_bigmatrix()

radius_counts <- count_within_radius_bigmatrix(
  reference,
  query = dense_query,
  radius = 2.2,
  plan = plan,
  exclude_self = FALSE
)

radius_counts
#> [1] 3 3 1 2
total_matches <- sum(radius_counts)
total_matches
#> [1] 9
radius_index_store <- big.matrix(total_matches, 1, type = "integer")
radius_distance_store <- big.matrix(total_matches, 1, type = "double")
radius_offset_store <- big.matrix(length(radius_counts) + 1L, 1, type = "double")

streamed_radius <- radius_stream_bigmatrix(
  reference,
  query = dense_query,
  xpIndex = radius_index_store,
  xpDistance = radius_distance_store,
  xpOffset = radius_offset_store,
  radius = 2.2,
  plan = plan,
  exclude_self = FALSE
)

streamed_radius
#> <bigknn_radius_result>
#>   metric: euclidean
#>   radius: 2.2
#>   queries: 4
#>   references: 160
#>   matches: 9
streamed_radius$n_match
#> [1] 3 3 1 2

The offset vector is what makes flattened radius output usable. It tells you where each query’s matches begin and end.

radius_offset <- as.vector(bigmemory::as.matrix(streamed_radius$offset))
radius_index <- as.vector(bigmemory::as.matrix(streamed_radius$index))
radius_distance <- as.vector(bigmemory::as.matrix(streamed_radius$distance))

radius_offset
#> [1]  1  4  7  8 10
radius_slice_table(radius_index, radius_distance, radius_offset, query_ids, 1)
#>   query neighbor distance
#> 1    q1        5  0.22361
#> 2    q1        6  1.85740
#> 3    q1        4  2.15640
radius_slice_table(radius_index, radius_distance, radius_offset, query_ids, 2)
#>   query neighbor distance
#> 1    q2       50  0.37417
#> 2    q2       49  2.03470
#> 3    q2       51  2.03470
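The helper above hides a simple pattern: query q's matches occupy rows offset[q] through offset[q + 1] - 1 of the flattened vectors. Here is a base-R sketch of that slicing, reusing the values printed above (slice_query is a hypothetical stand-in for radius_slice_table()):

```r
# Slice one query's matches out of flattened radius output: rows
# offset[q] .. offset[q + 1] - 1 of the flat vectors belong to query q.
slice_query <- function(index, distance, offset, q) {
  rows <- seq.int(offset[q], offset[q + 1] - 1)
  data.frame(neighbor = index[rows], distance = distance[rows])
}

offset   <- c(1, 4, 7, 8, 10)                      # from the output above
index    <- c(5, 6, 4, 50, 49, 51, 120, 151, 152)
distance <- c(0.2236, 1.8574, 2.1564, 0.3742, 2.0347,
              2.0347, 0.4583, 0.4243, 1.8385)

slice_query(index, distance, offset, 4)  # q4's two matches: rows 151 and 152
```

A production version would also handle queries with zero matches, where offset[q] equals offset[q + 1].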

If you prefer, you can skip the streamed route and use radius_bigmatrix() to have bigKNN allocate the flattened vectors for you. Streaming becomes useful when those flattened vectors are large or when you want explicit control over their storage.

Dense versus sparse query inputs

The public API also accepts sparse query matrices. At the moment they are densified on the R side before exact computation, so sparse input is primarily an interface convenience rather than a sparse-native backend.

sparse_query <- Matrix::Matrix(dense_query, sparse = TRUE)

sparse_knn <- knn_bigmatrix(
  reference,
  query = sparse_query,
  k = 3,
  plan = plan,
  exclude_self = FALSE
)

identical(sparse_knn$index, planned_knn$index)
#> [1] TRUE
all.equal(sparse_knn$distance, planned_knn$distance)
#> [1] TRUE

That makes sparse queries especially handy when your upstream workflow already produces Matrix objects, and you want to use the same exact search call without manually converting first.
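To make the densification step concrete, here is what it amounts to using only the Matrix package (which ships with R). The values are arbitrary; the point is that a sparse matrix becomes an ordinary dense double matrix before any distances are computed:

```r
library(Matrix)

# A mostly-zero query batch stored in sparse column-compressed form...
sparse_q <- sparseMatrix(i = c(1, 2), j = c(1, 3), x = c(5.2, 10.7),
                         dims = c(2, 4))

# ...is densified into an ordinary base-R matrix, zeros and all.
dense_q <- as.matrix(sparse_q)
dense_q
```

The dense copy is what actually feeds the exact search, so a very wide sparse batch still pays the dense memory cost.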

Practical guidance for choosing output modes

A simple rule of thumb works well:

  - if the full result fits comfortably in ordinary R matrices, call knn_bigmatrix() or radius_bigmatrix() and let bigKNN allocate the output for you;
  - if the result is large, or you want explicit control over where it is stored, preallocate big.matrix destinations and use the streaming variants.

Plans and streaming do not make the search approximate. They make the exact workflow more operationally predictable, which is often the difference between an algorithm that works in principle and one that is pleasant to run at scale.