bigKNN provides exact nearest-neighbour and
radius-search routines that work directly on
bigmemory::big.matrix references. This quickstart walks
through a small end-to-end example: create a reference matrix, run exact
k-nearest neighbour and radius queries, and interpret the
objects that come back.
The reference data for bigKNN lives in a
bigmemory::big.matrix. For this quickstart we will use six
points in two dimensions and keep separate labels so that the returned
row indices are easy to read.
library(bigKNN)
library(bigmemory)
reference_points <- data.frame(
  id = paste0("p", 1:6),
  x1 = c(1, 2, 1, 2, 3, 4),
  x2 = c(1, 1, 2, 2, 2, 3)
)
query_points <- data.frame(
  id = c("q1", "q2"),
  x1 = c(1.2, 2.8),
  x2 = c(1.1, 2.2)
)
reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))
query_matrix <- as.matrix(query_points[c("x1", "x2")])
reference_points
#> id x1 x2
#> 1 p1 1 1
#> 2 p2 2 1
#> 3 p3 1 2
#> 4 p4 2 2
#> 5 p5 3 2
#> 6 p6 4 3
query_points
#> id x1 x2
#> 1 q1 1.2 1.1
#> 2 q2 2.8 2.2

In the examples below, reference is the
big.matrix searched by bigKNN, while
query_matrix is an ordinary dense R matrix. The same APIs
also accept queries stored in another big.matrix, and the
v3 search paths accept sparse query matrices as well.
k-Nearest-Neighbour Search

With query = NULL, knn_bigmatrix() performs
a self-search. In that case exclude_self = TRUE by default,
so no row returns itself as its own nearest neighbour.
self_knn <- knn_bigmatrix(reference, k = 2)
self_knn
#> <bigknn_knn_result>
#> metric: euclidean
#> k: 2
#> queries: 6
#> references: 6
#> backend: bruteforce

The raw result stores two n_query x k matrices:

- index: 1-based row indices into the reference matrix
- distance: the corresponding exact distances

self_knn$index
#> [,1] [,2]
#> [1,] 2 3
#> [2,] 1 4
#> [3,] 1 4
#> [4,] 2 3
#> [5,] 4 2
#> [6,] 5 4
round(self_knn$distance, 3)
#> [,1] [,2]
#> [1,] 1.000 1.000
#> [2,] 1.000 1.000
#> [3,] 1.000 1.000
#> [4,] 1.000 1.000
#> [5,] 1.000 1.414
#> [6,] 1.414 2.236

A small helper table makes the same result easier to read:
knn_table(self_knn, query_ids = reference_points$id, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 p1 1 p2 1.000
#> 2 p1 2 p3 1.000
#> 3 p2 1 p1 1.000
#> 4 p2 2 p4 1.000
#> 5 p3 1 p1 1.000
#> 6 p3 2 p4 1.000
#> 7 p4 1 p2 1.000
#> 8 p4 2 p3 1.000
#> 9 p5 1 p4 1.000
#> 10 p5 2 p2 1.414
#> 11 p6 1 p5 1.414
#> 12 p6 2 p4 2.236

To search new observations against the same reference, pass them
through the query argument. Here we ask for the three
nearest reference rows for two new points.
query_knn <- knn_bigmatrix(
  reference,
  query = query_matrix,
  k = 3,
  exclude_self = FALSE
)
query_knn
#> <bigknn_knn_result>
#> metric: euclidean
#> k: 3
#> queries: 2
#> references: 6
#> backend: bruteforce
knn_table(query_knn, query_ids = query_points$id, ref_ids = reference_points$id)
#> query rank neighbor distance
#> 1 q1 1 p1 0.2236
#> 2 q1 2 p2 0.8062
#> 3 q1 3 p3 0.9220
#> 4 q2 1 p5 0.2828
#> 5 q2 2 p4 0.8246
#> 6 q2 3 p6 1.4420

The returned indices are still row numbers in reference.
That makes the object compact and easy to use in later workflows, while
a lookup table like reference_points can translate those
indices into human-readable labels when needed.
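That translation is just R's vector indexing. A minimal base-R sketch, using the index values printed by knn_table() above:

```r
# Map neighbour row indices back to labels via vector indexing.
# The idx matrix below copies the neighbour rows from the query_knn output.
ref_ids <- paste0("p", 1:6)
idx <- rbind(c(1, 2, 3),   # q1's three nearest reference rows
             c(5, 4, 6))   # q2's three nearest reference rows
neighbor_labels <- matrix(ref_ids[idx], nrow = nrow(idx))
neighbor_labels
#>      [,1] [,2] [,3]
#> [1,] "p1" "p2" "p3"
#> [2,] "p5" "p4" "p6"
```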
Radius search returns every neighbour within a fixed distance
threshold instead of a fixed k. This is useful when the
local neighbourhood size should vary by query.
radius_result <- radius_bigmatrix(
  reference,
  query = query_matrix,
  radius = 1.15,
  exclude_self = FALSE
)
radius_result
#> <bigknn_radius_result>
#> metric: euclidean
#> radius: 1.15
#> queries: 2
#> references: 6
#> matches: 5
radius_result$n_match
#> [1] 3 2
radius_result$offset
#> [1] 1 4 6

radius_bigmatrix() uses a flattened output format:

- index and distance hold all matches back-to-back
- n_match gives the number of matches per query
- offset tells you where each query's slice starts and ends

For query i, the matching rows live in
index[offset[i]:(offset[i + 1] - 1)], with the same slice
in distance.
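The slice arithmetic can be checked with plain vectors. This base-R sketch copies the flattened values printed above:

```r
# Flattened radius output, values copied from radius_result above.
index  <- c(1, 2, 3, 5, 4)   # all matches, back-to-back
offset <- c(1, 4, 6)         # 1-based start of each query's slice

# Query 1's matching reference rows: index[offset[1]:(offset[2] - 1)]
i <- 1
q1_rows <- index[offset[i]:(offset[i + 1] - 1)]
q1_rows
#> [1] 1 2 3

# Per-query match counts fall out of the offsets as well.
diff(offset)
#> [1] 3 2
```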
radius_slice(radius_result, 1, reference_points$id)
#> neighbor distance
#> 1 p1 0.2236
#> 2 p2 0.8062
#> 3 p3 0.9220
radius_slice(radius_result, 2, reference_points$id)
#> neighbor distance
#> 1 p5 0.2828
#> 2 p4 0.8246

If you only need the counts,
count_within_radius_bigmatrix() avoids returning the
flattened neighbour vectors.
bigKNN currently supports three exact metrics:
- "euclidean" for ordinary Euclidean distance
- "sqeuclidean" for squared Euclidean distance
- "cosine" for cosine distance, defined as 1 - cosine similarity

The squared Euclidean metric preserves the same neighbour ordering as ordinary Euclidean distance, but reports squared values. Cosine distance can choose a different neighbour because it prefers similar direction rather than similar absolute location.
metric_summary <- do.call(rbind, lapply(
  c("euclidean", "sqeuclidean", "cosine"),
  function(metric) {
    result <- knn_bigmatrix(
      reference,
      query = query_matrix,
      k = 1,
      metric = metric,
      exclude_self = FALSE
    )
    data.frame(
      metric = metric,
      query = query_points$id,
      nearest = reference_points$id[result$index[, 1]],
      distance = signif(result$distance[, 1], 4),
      row.names = NULL
    )
  }
))
metric_summary
#> metric query nearest distance
#> 1 euclidean q1 p1 0.2236000
#> 2 euclidean q2 p5 0.2828000
#> 3 sqeuclidean q1 p1 0.0500000
#> 4 sqeuclidean q2 p5 0.0800000
#> 5 cosine q1 p1 0.0009438
#> 6 cosine q2 p6 0.0002524

In this example, Euclidean and squared Euclidean agree on the nearest row for each query, while cosine distance can favour a different point because its direction is more similar. Cosine distance requires non-zero rows in both the reference and the query data.
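As a sanity check on the cosine column, the distance for q1 against p1 can be recomputed by hand from the 1 - cosine similarity definition:

```r
# Recompute cosine distance for q1 vs p1 from first principles.
q1 <- c(1.2, 1.1)
p1 <- c(1, 1)
cos_dist <- 1 - sum(q1 * p1) / (sqrt(sum(q1^2)) * sqrt(sum(p1^2)))
signif(cos_dist, 4)
#> [1] 0.0009438
```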
This vignette focused on the smallest useful workflow. For larger or repeated jobs, the next places to look are:
- knn_prepare_bigmatrix() and knn_search_prepared() for repeated exact queries against the same reference
- knn_plan_bigmatrix(), knn_stream_bigmatrix(), and radius_stream_bigmatrix() for memory-aware and streamed workflows
- knn_graph_bigmatrix(), mutual_knn_graph_bigmatrix(), snn_graph_bigmatrix(), and radius_graph_bigmatrix() for exact graph construction
- recall_against_exact() and rerank_candidates_bigmatrix() when bigKNN is being used as the exact ground-truth engine for approximate search
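As one illustration of the prepared-search idea, a sketch of how a repeated-query workflow might look. The argument names below are assumptions for illustration, not the documented signatures; consult the function reference pages before use.

```r
# Hypothetical sketch: argument names are assumed, check the package
# reference for the real signatures of these functions.
prepared <- knn_prepare_bigmatrix(reference)
batch1   <- knn_search_prepared(prepared, query_matrix, k = 3)
```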