Quick Start with bigKNN

bigKNN provides exact nearest-neighbour and radius-search routines that work directly on bigmemory::big.matrix references. This quickstart walks through a small end-to-end example: create a reference matrix, run exact k-nearest neighbour and radius queries, and interpret the objects that come back.

Create a Small Reference Matrix

The reference data for bigKNN lives in a bigmemory::big.matrix. For this quickstart we will use six points in two dimensions and keep separate labels so that the returned row indices are easy to read.

library(bigKNN)
library(bigmemory)

reference_points <- data.frame(
  id = paste0("p", 1:6),
  x1 = c(1, 2, 1, 2, 3, 4),
  x2 = c(1, 1, 2, 2, 2, 3)
)

query_points <- data.frame(
  id = c("q1", "q2"),
  x1 = c(1.2, 2.8),
  x2 = c(1.1, 2.2)
)

reference <- as.big.matrix(as.matrix(reference_points[c("x1", "x2")]))
query_matrix <- as.matrix(query_points[c("x1", "x2")])

reference_points
#>   id x1 x2
#> 1 p1  1  1
#> 2 p2  2  1
#> 3 p3  1  2
#> 4 p4  2  2
#> 5 p5  3  2
#> 6 p6  4  3
query_points
#>   id  x1  x2
#> 1 q1 1.2 1.1
#> 2 q2 2.8 2.2

In the examples below, reference is the big.matrix searched by bigKNN, while query_matrix is an ordinary dense R matrix. The same APIs also accept queries stored in another big.matrix, and the v3 search paths accept sparse query matrices as well.
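Since queries can themselves live in a big.matrix, a minimal sketch of that setup, using only bigmemory's as.big.matrix as above; later calls that take query = query_matrix accept this reference in the same way:

```r
library(bigmemory)

# Store the query points as a big.matrix reference instead of a dense
# R matrix; the search APIs accept either form for `query`.
query_big <- as.big.matrix(query_matrix)
dim(query_big)
#> [1] 2 2
```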

Searching New Query Points

To search new observations against the same reference, pass them through the query argument. Here we ask for the three nearest reference rows for two new points.

query_knn <- knn_bigmatrix(
  reference,
  query = query_matrix,
  k = 3,
  exclude_self = FALSE
)

query_knn
#> <bigknn_knn_result>
#>   metric: euclidean
#>   k: 3
#>   queries: 2
#>   references: 6
#>   backend: bruteforce
knn_table(query_knn, query_ids = query_points$id, ref_ids = reference_points$id)
#>   query rank neighbor distance
#> 1    q1    1       p1   0.2236
#> 2    q1    2       p2   0.8062
#> 3    q1    3       p3   0.9220
#> 4    q2    1       p5   0.2828
#> 5    q2    2       p4   0.8246
#> 6    q2    3       p6   1.4422

The returned indices are still row numbers in reference. That makes the object compact and easy to use in later workflows, while a lookup table like reference_points can translate those indices into human-readable labels when needed.
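As a sketch of that translation in base R, assuming the result's index field is the queries-by-ranks matrix of row numbers shown in the metric example below (result$index):

```r
# Map raw row indices in `reference` to the human-readable ids kept in
# `reference_points`. Matrix indexing preserves the queries-by-ranks shape.
neighbour_labels <- matrix(
  reference_points$id[query_knn$index],
  nrow = nrow(query_knn$index),
  dimnames = list(
    query_points$id,
    paste0("rank", seq_len(ncol(query_knn$index)))
  )
)
neighbour_labels
#>    rank1 rank2 rank3
#> q1 "p1"  "p2"  "p3"
#> q2 "p5"  "p4"  "p6"
```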

Choosing a Metric

bigKNN currently supports three exact metrics: "euclidean", "sqeuclidean" (squared Euclidean), and "cosine".

The squared Euclidean metric preserves the same neighbour ordering as ordinary Euclidean distance but reports squared values. Cosine distance can select a different neighbour because it compares direction rather than absolute location.

metric_summary <- do.call(rbind, lapply(
  c("euclidean", "sqeuclidean", "cosine"),
  function(metric) {
    result <- knn_bigmatrix(
      reference,
      query = query_matrix,
      k = 1,
      metric = metric,
      exclude_self = FALSE
    )

    data.frame(
      metric = metric,
      query = query_points$id,
      nearest = reference_points$id[result$index[, 1]],
      distance = signif(result$distance[, 1], 4),
      row.names = NULL
    )
  }
))

metric_summary
#>        metric query nearest  distance
#> 1   euclidean    q1      p1 0.2236000
#> 2   euclidean    q2      p5 0.2828000
#> 3 sqeuclidean    q1      p1 0.0500000
#> 4 sqeuclidean    q2      p5 0.0800000
#> 5      cosine    q1      p1 0.0009438
#> 6      cosine    q2      p6 0.0002524

In this example, Euclidean and squared Euclidean agree on the nearest row for each query. Cosine distance instead picks p6 for q2, because p6 points in almost the same direction from the origin as q2 even though it lies farther away. Note that cosine distance requires every row of both the reference and the query data to be non-zero.
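The values in the table can be reproduced with a few lines of base R, which makes the difference between the metrics concrete (the helper names here are for illustration only):

```r
# Plain-vector implementations of the three metrics, for checking the
# distances reported above by hand.
euclidean   <- function(a, b) sqrt(sum((a - b)^2))
sqeuclidean <- function(a, b) sum((a - b)^2)
cosine      <- function(a, b) 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

q1 <- c(1.2, 1.1); p1 <- c(1, 1)
q2 <- c(2.8, 2.2); p6 <- c(4, 3)

# q1's nearest reference row under all three metrics is p1:
signif(c(euclidean(q1, p1), sqeuclidean(q1, p1), cosine(q1, p1)), 4)
#> [1] 0.2236000 0.0500000 0.0009438

# Under cosine distance, q2's nearest row is p6 rather than p5:
signif(cosine(q2, p6), 4)
#> [1] 0.0002524
```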

Where to Go Next

This vignette focused on the smallest useful workflow. For larger or repeated jobs, the next places to look are: