BiocNeighbors 1.20.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW methods demonstrated in this document.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 6649 4017 4114 5141 6982 1228 6385 5476 2688 8651
## [2,] 4669 2077 6355 5221 5073 2967 9759 1216 1100 8403
## [3,] 7453 6422 4984 6750 1442 8988 9552 9845 7498 3808
## [4,] 6020 6294 9880 5764 808 4437 8533 6043 6684 1331
## [5,] 363 3047 9722 3764 7093 1567 94 1356 405 5498
## [6,] 3998 8665 9038 8324 6908 2691 4197 4894 5401 9556
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9037870 0.9762648 0.9771307 0.9913849 1.0184532 1.0326996 1.0598290
## [2,] 0.7933953 0.8281871 0.8318267 0.9254500 0.9305344 0.9350958 0.9621301
## [3,] 0.6717967 0.7544437 0.7819871 0.8434197 0.9598715 0.9668364 0.9942203
## [4,] 0.9162563 0.9225003 0.9883295 0.9969557 1.0114713 1.0227057 1.0397508
## [5,] 0.8880574 0.9687862 0.9971341 1.0626482 1.0642688 1.0848354 1.1284107
## [6,] 0.9755152 0.9953957 0.9956528 0.9985347 1.0009269 1.0092028 1.0114200
## [,8] [,9] [,10]
## [1,] 1.0633745 1.0716157 1.086399
## [2,] 0.9885423 1.0132899 1.018920
## [3,] 0.9953395 0.9955587 1.008898
## [4,] 1.0436453 1.0525502 1.062184
## [5,] 1.1309519 1.1392366 1.145807
## [6,] 1.0221270 1.0239488 1.028367
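Because Annoy is approximate, the reported neighbors may differ from the true nearest neighbors. As a rough sketch of how one might gauge this, the following compares the Annoy results against an exact search with KmknnParam() on a smaller simulated dataset. The dataset size, the variable names and the recall summary are illustrative choices rather than part of the package's own workflow.

```r
library(BiocNeighbors)

set.seed(100)
k <- 10
small <- matrix(runif(2000 * 20), ncol = 20)

# Approximate search (Annoy) and exact search (KMKNN) on the same data.
approx <- findKNN(small, k = k, BNPARAM = AnnoyParam())
exact <- findKNN(small, k = k, BNPARAM = KmknnParam())

# Recall: fraction of the true k nearest neighbors recovered per point,
# averaged over all points.
recall <- mean(vapply(seq_len(nrow(small)), function(i) {
    length(intersect(approx$index[i, ], exact$index[i, ])) / k
}, numeric(1)))
recall
```

A recall close to 1 indicates that the approximation rarely misses a true neighbor for this dataset. The value depends on the data and on the Annoy settings.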
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2794 2478 4275 2876 1889
## [2,] 8963 2798 3706 9973 1802
## [3,] 7822 4507 9366 3892 5604
## [4,] 1853 5272 7094 1342 987
## [5,] 2979 138 9436 3127 3006
## [6,] 7932 3919 4137 6037 5184
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9685761 0.9757761 1.0446761 1.0492578 1.0558292
## [2,] 1.0626312 1.1003242 1.1101084 1.1129403 1.1442555
## [3,] 0.9041288 0.9332999 0.9391312 0.9569269 0.9664629
## [4,] 0.9663554 1.0250485 1.0419847 1.0441931 1.0741812
## [5,] 0.8676831 0.9360979 0.9800010 1.0438293 1.0461012
## [6,] 0.7470924 0.9051411 0.9382781 0.9399055 0.9547424
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
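As a minimal sketch, the same two calls with HNSW might look as follows. The smaller dataset sizes and the object names (hout, hqout) are assumptions made here to keep the example quick to run.

```r
library(BiocNeighbors)

set.seed(42)
hdata <- matrix(runif(2000 * 20), ncol = 20)
hquery <- matrix(runif(100 * 20), ncol = 20)

# Only the BNPARAM argument changes relative to the Annoy calls.
hout <- findKNN(hdata, k = 10, BNPARAM = HnswParam())
hqout <- queryKNN(hdata, hquery, k = 5, BNPARAM = HnswParam())

dim(hout$index)      # one row per point in 'hdata', one column per neighbor
dim(hqout$distance)  # one row per point in 'hquery'
```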
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
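The subset and get.distance options from the list above can be sketched in the same way. The example below restricts the search to ten points of interest and skips the distance matrix. The dataset and the chosen indices are arbitrary and purely illustrative.

```r
library(BiocNeighbors)

set.seed(7)
sdata <- matrix(runif(2000 * 20), ncol = 20)

# Find neighbors only for points 1 to 10, searching against the full dataset,
# and do not return the distances.
chosen <- 1:10
sub <- findKNN(sdata, k = 5, subset = chosen, get.distance = FALSE,
    BNPARAM = AnnoyParam())

dim(sub$index)  # one row per entry of 'chosen'
```

Each row of sub$index corresponds to one entry of chosen, and the reported indices still refer to rows of the full dataset.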
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
Users are referred to the documentation of each function for specific details on the available arguments.
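As a small sketch of the Manhattan option, the following runs an Annoy search with distance="Manhattan" and recomputes one reported distance by hand as a sanity check. The dataset and the variable names are illustrative. A tolerance is used in the comparison on the assumption that the index stores coordinates at reduced precision.

```r
library(BiocNeighbors)

set.seed(11)
mdata <- matrix(runif(1000 * 20), ncol = 20)

mout <- findKNN(mdata, k = 3, BNPARAM = AnnoyParam(distance = "Manhattan"))

# Recompute the distance between point 1 and its reported first neighbor
# as the sum of absolute coordinate differences.
nn <- mout$index[1, 1]
manual <- sum(abs(mdata[1, ] - mdata[nn, ]))

isTRUE(all.equal(manual, mout$distance[1, 1], tolerance = 1e-4))
```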
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can set the TMPDIR environment variable before starting R so that tempdir() points to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpEECBo7/filee49e7813c5fe.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
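One possible cleanup pattern, sketched under the assumption of the BNINDEX-based interface shown in this document, is to record the index path with AnnoyIndex_path() and delete the file with base R's unlink() once all searches are done. The example uses a default temporary index for simplicity; the same steps would apply to an index written to a user-chosen location.

```r
library(BiocNeighbors)

set.seed(3)
cdata <- matrix(runif(1000 * 20), ncol = 20)

# Build the index and record where it was written.
cpre <- buildIndex(cdata, BNPARAM = AnnoyParam())
idx.path <- AnnoyIndex_path(cpre)

# Run all searches that need the index.
cout <- findKNN(BNINDEX = cpre, k = 5)

# Remove the index file once it is no longer needed.
# 'cpre' must not be used for further searches after this point.
if (file.exists(idx.path)) {
    unlink(idx.path)
}
```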
sessionInfo()
## R version 4.3.2 Patched (2023-11-01 r85457)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.7.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.20.2 knitr_1.45 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.2 xfun_0.41
## [4] jsonlite_1.8.8 S4Vectors_0.40.2 htmltools_0.5.7
## [7] stats4_4.3.2 sass_0.4.8 rmarkdown_2.25
## [10] grid_4.3.2 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.37 BiocManager_1.30.22 compiler_4.3.2
## [19] codetools_0.2-19 Rcpp_1.0.11 BiocParallel_1.36.0
## [22] lattice_0.22-5 digest_0.6.33 R6_2.5.1
## [25] parallel_4.3.2 bslib_0.6.1 Matrix_1.6-4
## [28] tools_4.3.2 BiocGenerics_0.48.1 cachem_1.0.8