One of the main goals of bigANNOY is to work comfortably with bigmemory
data that already lives on disk. Instead of forcing a large reference matrix
through dense in-memory copies, the package can build and query Annoy indexes
directly from file-backed big.matrix objects and their descriptors.
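This vignette focuses on the most common disk-oriented workflows:
- building an index from a file-backed big.matrix or its descriptor file
- querying with file-backed data, descriptor objects, or descriptor paths
- streaming search results into file-backed destination matrices
- querying with separated-column big.matrix query layouts
library(bigANNOY)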
library(bigmemory)
For reproducibility, we will create all backing files inside a temporary directory. In real work this would usually be a project directory or a shared data location.
workspace_dir <- tempfile("bigannoy-filebacked-")
dir.create(workspace_dir, recursive = TRUE, showWarnings = FALSE)
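# Create a file-backed big.matrix under `backingpath` and fill it with `values`.
# `name` is the stem of the backing (.bin) and descriptor (.desc) files.
make_filebacked_matrix <- function(values, type, backingpath, name) {
  bm <- filebacked.big.matrix(
    nrow = nrow(values),
    ncol = ncol(values),
    type = type,
    backingfile = sprintf("%s.bin", name),
    descriptorfile = sprintf("%s.desc", name),
    backingpath = backingpath
  )
  bm[, ] <- values
  bm
}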
We will create a reference dataset and store it in a file-backed
big.matrix. The corresponding descriptor file is what lets later R sessions
reattach to the same on-disk data.
ref_dense <- matrix(
  c(
    0.0, 0.0,
    5.0, 0.0,
    0.0, 5.0,
    5.0, 5.0,
    9.0, 9.0
  ),
  ncol = 2,
  byrow = TRUE
)
ref_fb <- make_filebacked_matrix(
  values = ref_dense,
  type = "double",
  backingpath = workspace_dir,
  name = "ref"
)
ref_desc <- describe(ref_fb)
ref_desc_path <- file.path(workspace_dir, "ref.desc")
file.exists(ref_desc_path)
#> [1] TRUE
dim(ref_fb)
#> [1] 5 2
At this point we have:
- the backing file ref.bin
- the descriptor file ref.desc
- a big.matrix object currently attached in this R session
The simplest persisted workflow is to build directly from the descriptor file
path instead of from the live big.matrix object. That mirrors how later
sessions typically work.
index_path <- file.path(workspace_dir, "ref.ann")
index <- annoy_build_bigmatrix(
  x = ref_desc_path,
  path = index_path,
  n_trees = 25L,
  metric = "euclidean",
  seed = 99L,
  load_mode = "lazy"
)
index
#> <bigannoy_index>
#> path: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//RtmpBEyDSE/bigannoy-filebacked-b1f658a0aee9/ref.ann
#> metadata: /var/folders/h9/npmqbtmx4wlblg4wks47yj5c0000gn/T//RtmpBEyDSE/bigannoy-filebacked-b1f658a0aee9/ref.ann.meta
#> index_id: annoy-20260327203933-c2eab887babb
#> metric: euclidean
#> trees: 25
#> items: 5
#> dimension: 2
#> build_seed: 99
#> build_threads: -1
#> build_backend: cpp
#> load_mode: lazy
#> loaded: FALSE
#> file_size: 2496
#> file_md5: c2eab887babb44656824ba64e545d6c9
#> prefault: FALSE
This pattern is useful because the build call no longer depends on a particular in-memory object being alive. As long as the descriptor can be reattached, the reference matrix can be used.
For x, query, xpIndex, and xpDistance, bigANNOY accepts several
bigmemory-oriented forms:
- a big.matrix
- a big.matrix.descriptor object
- a path to a descriptor file
For queries only, a dense numeric matrix is also accepted.
That flexibility matters most in persisted workflows where one part of the pipeline writes descriptors and another part reattaches them later.
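To make the interchangeable forms concrete, here is a minimal sketch that resolves a live object, a descriptor object, and a descriptor file path to the same on-disk data. The helper resolve_big_matrix() is hypothetical and is not part of bigANNOY or bigmemory; it only illustrates the idea with bigmemory's own attach.big.matrix() and does not claim to mirror how bigANNOY handles its inputs internally.

```r
library(bigmemory)

# Hypothetical helper (illustration only): turn any of the three
# bigmemory-oriented forms into an attached big.matrix.
resolve_big_matrix <- function(x) {
  if (is.big.matrix(x)) {
    return(x)                       # already a live object
  }
  if (methods::is(x, "big.matrix.descriptor") || is.character(x)) {
    return(attach.big.matrix(x))    # descriptor object or .desc file path
  }
  stop("unsupported input form")
}

# Small file-backed matrix in its own temporary directory.
demo_dir <- tempfile("forms-demo-")
dir.create(demo_dir)
live <- filebacked.big.matrix(
  nrow = 2, ncol = 2, type = "double",
  backingfile = "demo.bin",
  descriptorfile = "demo.desc",
  backingpath = demo_dir
)
live[, ] <- matrix(c(1, 2, 3, 4), ncol = 2)

forms <- list(
  live       = live,
  descriptor = describe(live),
  path       = file.path(demo_dir, "demo.desc")
)
resolved <- lapply(forms, resolve_big_matrix)

# Each form should give access to the same stored values.
same_values <- vapply(
  resolved,
  function(bm) identical(bm[, ], live[, ]),
  logical(1)
)
same_values
```

The descriptor path is usually the form that survives best across sessions, because it is just a string that can be stored in a config file or passed between pipeline steps.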
Now we will create a file-backed query matrix and search the persisted Annoy index against it.
query_dense <- matrix(
  c(
    0.2, 0.1,
    4.7, 5.1
  ),
  ncol = 2,
  byrow = TRUE
)
query_fb <- make_filebacked_matrix(
  values = query_dense,
  type = "double",
  backingpath = workspace_dir,
  name = "query"
)
query_result_big <- annoy_search_bigmatrix(
  index,
  query = query_fb,
  k = 2L,
  search_k = 100L
)
query_result_big$index
| 1 | 2 |
| 4 | 3 |
round(query_result_big$distance, 3)
| 0.224 | 4.801 |
| 0.316 | 4.701 |
The query matrix itself is file-backed, but the search call looks the same as
it would for an in-memory big.matrix.
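Because the example is tiny, the neighbours reported above can be checked against an exhaustive search in base R. The sketch below is not part of bigANNOY; exact_knn() is a hypothetical helper that computes exact Euclidean nearest neighbours for the same reference and query points.

```r
# Same reference and query points as in the vignette.
ref <- matrix(c(0, 0,  5, 0,  0, 5,  5, 5,  9, 9), ncol = 2, byrow = TRUE)
query <- matrix(c(0.2, 0.1,  4.7, 5.1), ncol = 2, byrow = TRUE)

# Hypothetical helper: exact k nearest neighbours by exhaustive search.
exact_knn <- function(ref, query, k) {
  n_query <- nrow(query)
  index <- matrix(NA_integer_, n_query, k)
  distance <- matrix(NA_real_, n_query, k)
  for (i in seq_len(n_query)) {
    # Euclidean distance from query row i to every reference row
    diffs <- sweep(ref, 2, query[i, ])
    d <- sqrt(rowSums(diffs^2))
    nearest <- order(d)[seq_len(k)]
    index[i, ] <- nearest
    distance[i, ] <- d[nearest]
  }
  list(index = index, distance = distance)
}

exact <- exact_knn(ref, query, k = 2L)
exact$index               # rows: (1, 2) and (4, 3)
round(exact$distance, 3)  # rows: (0.224, 4.801) and (0.316, 4.701)
```

With five reference points and search_k = 100L, the approximate search agrees with the exact answer. On larger data the two can differ, which is the trade-off an approximate index makes.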
The same persisted query data can be supplied through its descriptor object or through the descriptor file path. This is often the most convenient way to reattach query data across sessions.
query_desc <- describe(query_fb)
query_desc_path <- file.path(workspace_dir, "query.desc")
query_result_desc <- annoy_search_bigmatrix(
  index,
  query = query_desc,
  k = 2L,
  search_k = 100L
)
query_result_path <- annoy_search_bigmatrix(
  index,
  query = query_desc_path,
  k = 2L,
  search_k = 100L
)
query_result_desc$index
| 1 | 2 |
| 4 | 3 |
query_result_path$index
| 1 | 2 |
| 4 | 3 |
These should match the result obtained from the live big.matrix query.
identical(query_result_big$index, query_result_desc$index)
#> [1] TRUE
identical(query_result_big$index, query_result_path$index)
#> [1] TRUE
all.equal(query_result_big$distance, query_result_desc$distance)
#> [1] TRUE
Large search results can be expensive to keep in ordinary R memory. To avoid
that, bigANNOY can stream neighbour ids and distances directly into
destination big.matrix objects.
For file-backed workflows, this means you can keep both the inputs and the outputs on disk.
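To get a rough sense of when results become expensive, the following sketch estimates the memory needed to hold an n_query x k result in ordinary R matrices. The helper result_bytes() is hypothetical, and the workload size is an invented example; the only facts used are that R stores integers in 4 bytes and doubles in 8 bytes.

```r
# Hypothetical helper: bytes needed to hold an n_query x k result in RAM,
# with 4-byte integer ids and 8-byte double distances.
result_bytes <- function(n_query, k) {
  n_cells <- as.numeric(n_query) * as.numeric(k)  # avoid integer overflow
  c(index = 4 * n_cells, distance = 8 * n_cells)
}

# Invented workload: ten million queries, fifty neighbours each.
est <- result_bytes(n_query = 1e7, k = 50)
round(est / 2^30, 2)  # size in GiB: about 1.86 for ids, 3.73 for distances
```

At that scale the two result matrices alone take more than 5 GiB, before counting any temporary copies R may make, which is the situation streaming to disk is meant to avoid.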
index_store <- filebacked.big.matrix(
  nrow = nrow(query_dense),
  ncol = 2L,
  type = "integer",
  backingfile = "nn_index.bin",
  descriptorfile = "nn_index.desc",
  backingpath = workspace_dir
)
distance_store <- filebacked.big.matrix(
  nrow = nrow(query_dense),
  ncol = 2L,
  type = "double",
  backingfile = "nn_distance.bin",
  descriptorfile = "nn_distance.desc",
  backingpath = workspace_dir
)
streamed_result <- annoy_search_bigmatrix(
  index,
  query = query_desc,
  k = 2L,
  xpIndex = describe(index_store),
  xpDistance = file.path(workspace_dir, "nn_distance.desc")
)
bigmemory::as.matrix(index_store)
| 1 | 2 |
| 4 | 3 |
round(bigmemory::as.matrix(distance_store), 3)
| 0.224 | 4.801 |
| 0.316 | 4.701 |
The important practical details are:
- xpIndex must be integer-compatible
- xpDistance must be double-compatible
- both destinations must have shape n_query x k
- xpDistance can only be supplied when xpIndex is also supplied
Because the result matrices are file-backed, they can be reattached later in
the same way as any other bigmemory artifact.
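The destination requirements listed above can be checked before launching a long search. The sketch below is a hypothetical pre-flight helper, not a bigANNOY function. It assumes that "integer-compatible" and "double-compatible" can be approximated by an exact type match, which may be stricter than what bigANNOY actually accepts, and it only handles live big.matrix objects.

```r
library(bigmemory)

# Hypothetical pre-flight check (illustration only). Making index_store a
# required argument reflects the rule that a distance destination needs an
# index destination alongside it.
check_destinations <- function(index_store, distance_store = NULL, n_query, k) {
  expected_dim <- c(n_query, k)
  ok_index <- is.big.matrix(index_store) &&
    typeof(index_store) == "integer" &&
    all(dim(index_store) == expected_dim)
  ok_distance <- is.null(distance_store) || (
    is.big.matrix(distance_store) &&
      typeof(distance_store) == "double" &&
      all(dim(distance_store) == expected_dim)
  )
  ok_index && ok_distance
}

ids <- big.matrix(nrow = 2, ncol = 2, type = "integer")
dists <- big.matrix(nrow = 2, ncol = 2, type = "double")

ok_good    <- check_destinations(ids, dists, n_query = 2, k = 2)  # valid pair
ok_swapped <- check_destinations(dists, ids, n_query = 2, k = 2)  # types swapped
ok_shape   <- check_destinations(ids, dists, n_query = 2, k = 3)  # wrong shape
c(good = ok_good, swapped = ok_swapped, shape = ok_shape)
```

Only the first call passes. Catching a swapped or mis-sized destination up front is cheaper than discovering it after a large search has started.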
index_store_again <- attach.big.matrix(file.path(workspace_dir, "nn_index.desc"))
distance_store_again <- attach.big.matrix(file.path(workspace_dir, "nn_distance.desc"))
bigmemory::as.matrix(index_store_again)
| 1 | 2 |
| 4 | 3 |
round(bigmemory::as.matrix(distance_store_again), 3)
| 0.224 | 4.801 |
| 0.316 | 4.701 |
That is useful in longer pipelines where one step performs ANN search and a later step consumes the neighbour graph or distance matrix.
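As one example of such a later step, the sketch below flattens an n_query x k pair of result matrices into a long table with one row per query and neighbour pair. It uses plain matrices holding the rounded values printed above so that it runs on its own; the column names are illustrative choices, not a bigANNOY convention.

```r
# Neighbour ids and rounded distances for the two example queries.
nn_index <- matrix(c(1L, 2L,  4L, 3L), ncol = 2, byrow = TRUE)
nn_distance <- matrix(c(0.224, 4.801,  0.316, 4.701), ncol = 2, byrow = TRUE)

# One row per (query, neighbour) pair; row() and col() supply the
# query number and the neighbour rank for every cell.
edges <- data.frame(
  query     = as.vector(row(nn_index)),
  rank      = as.vector(col(nn_index)),
  neighbour = as.vector(nn_index),
  distance  = as.vector(nn_distance)
)
edges <- edges[order(edges$query, edges$rank), ]
rownames(edges) <- NULL
edges
```

In a real pipeline the two matrices would come from the reattached stores instead, for example index_store_again[, ] and distance_store_again[, ], ideally read in row blocks when the result is large.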
bigANNOY also supports separated-column big.matrix layouts. These are not
necessarily file-backed, but they are common in bigmemory workflows and are
worth knowing about because they use a different memory layout from the usual
contiguous matrix case.
query_sep <- big.matrix(
  nrow = nrow(query_dense),
  ncol = ncol(query_dense),
  type = "double",
  separated = TRUE
)
query_sep[, ] <- query_dense
sep_result <- annoy_search_bigmatrix(
  index,
  query = describe(query_sep),
  k = 2L,
  search_k = 100L
)
sep_result$index
| 1 | 2 |
| 4 | 3 |
round(sep_result$distance, 3)
| 0.224 | 4.801 |
| 0.316 | 4.701 |
For the same query values, the separated-column result should match the ordinary file-backed query result.
identical(sep_result$index, query_result_big$index)
#> [1] TRUE
all.equal(sep_result$distance, query_result_big$distance)
#> [1] TRUE
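The layout difference itself can be inspected with bigmemory alone. This short sketch creates one contiguous and one separated-column matrix with the same values and uses is.separated() to show that only the layout flag differs.

```r
library(bigmemory)

values <- matrix(c(0.2, 0.1,  4.7, 5.1), ncol = 2, byrow = TRUE)

contiguous <- big.matrix(nrow = 2, ncol = 2, type = "double")
separated <- big.matrix(nrow = 2, ncol = 2, type = "double", separated = TRUE)
contiguous[, ] <- values
separated[, ] <- values

# The layout flag differs, but the stored values do not.
layout_flags <- c(
  contiguous = is.separated(contiguous),
  separated  = is.separated(separated)
)
layout_flags
identical(contiguous[, ], separated[, ])
```

Since the values are the same, any difference between the two layouts shows up in how the data is read, not in the search results.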
Taken together, the main file-backed pattern looks like this:
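- store the reference data in a file-backed big.matrix
- build the Annoy index from the big.matrix or its descriptor file
- query with a big.matrix, a descriptor object, or a descriptor path
- stream results into file-backed destination matrices when they are large
This is often the most practical way to use bigANNOY in large-data settings,
because every major artifact in the workflow can be reopened later.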
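Before a later session tries to reopen anything, it can be worth confirming that the artifacts are still where they are expected. The sketch below is a hypothetical base R helper, not a bigANNOY function. The ".meta" suffix follows the metadata path printed for the index earlier in this vignette, and the demonstration uses empty placeholder files, not real index or descriptor content.

```r
# Hypothetical helper: report which artifacts of a persisted workflow exist.
artifact_status <- function(index_path, descriptor_paths = character()) {
  paths <- c(index_path, paste0(index_path, ".meta"), descriptor_paths)
  data.frame(path = basename(paths), exists = file.exists(paths))
}

# Demonstration with empty placeholder files in a temporary directory.
demo_dir <- tempfile("artifact-demo-")
dir.create(demo_dir)
invisible(file.create(
  file.path(demo_dir, c("ref.ann", "ref.ann.meta", "ref.desc"))
))

status <- artifact_status(
  index_path = file.path(demo_dir, "ref.ann"),
  descriptor_paths = file.path(demo_dir, c("ref.desc", "query.desc"))
)
status  # query.desc is reported as missing
```

A descriptor file is only useful together with its backing file, so a fuller check would also look for the matching .bin files.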
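- Keep the .ann file together with its .meta sidecar file.
- Stream results into destination big.matrix objects when n_query x k is too large to hold comfortably in ordinary R matrices.
This vignette covered the main bigmemory persistence features in bigANNOY: building from descriptor files, querying file-backed and descriptor-based data, streaming results to disk, and querying separated-column layouts.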
The natural next vignette after this one is Benchmarking Recall and Latency, which shows how to evaluate these workflows against runtime and quality targets.