BiocNeighbors 1.23.0
The BiocNeighbors package provides algorithms for approximate neighbor searches: Annoy, which indexes the data as a forest of trees, and HNSW, which indexes it as a series of graphs.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 7593 6474 8397 3207 1904 2510 5229 2913 7054  4627
## [2,] 5244 3840 8548 4586 8519 3564 2747  496 2816  4245
## [3,] 9921  189 6637 3177 6602 1814 7195 8035 1303  8097
## [4,] 1573 9377 3675   78  317  875 5220 8738 5745  6350
## [5,] 5172 2399 7582 2244 9744 4079 3963 5428  322  3087
## [6,] 8540 4682 6324 4985 6035 5411 1567 9676 2769  4768
head(fout$distance)
##           [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.7815234 0.9229839 0.9608401 1.0236636 1.0295267 1.0314099 1.0324094
## [2,] 0.9560331 0.9654326 1.0704964 1.0970221 1.1042212 1.1182202 1.1303931
## [3,] 0.9326977 0.9487914 0.9667958 1.0254927 1.0453124 1.0474528 1.0617453
## [4,] 0.8205906 0.8387248 0.8467078 0.8521984 0.8801771 0.8913614 0.9009713
## [5,] 0.8481647 0.9324974 0.9722405 0.9999254 1.0165464 1.0184181 1.0298846
## [6,] 0.9661646 0.9843574 1.0380453 1.0476668 1.0616553 1.0891678 1.0939600
##           [,8]      [,9]     [,10]
## [1,] 1.0362610 1.0392092 1.0534132
## [2,] 1.1315485 1.1321986 1.1404413
## [3,] 1.0680672 1.0714984 1.0838408
## [4,] 0.9177495 0.9205099 0.9303882
## [5,] 1.0314233 1.0455111 1.0505949
## [6,] 1.1107712 1.1188346 1.1334125
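Because only BNPARAM changes between algorithms, an approximate search can be checked against an exact one on the same data. The following is a minimal sketch rather than a benchmark: it assumes that KmknnParam() (from the package's exact-search interface, not shown in this section) is available, and the data size and the name `overlap` are chosen purely for illustration.

```r
library(BiocNeighbors)

# Small simulated data set; the dimensions are arbitrary.
data <- matrix(runif(1000 * 20), ncol = 20)

# Same call, different BNPARAM: exact (KMKNN) versus approximate (Annoy).
exact <- findKNN(data, k = 10, BNPARAM = KmknnParam())
ann <- findKNN(data, k = 10, BNPARAM = AnnoyParam())

# For each point, the fraction of its exact neighbors that the
# approximate search also reported.
overlap <- vapply(seq_len(nrow(data)), function(i) {
    mean(exact$index[i, ] %in% ann$index[i, ])
}, numeric(1))

# Average agreement across all points (between 0 and 1).
mean(overlap)
```

The value will vary with the data and with the Annoy settings, so it is best read as a rough indication of how much accuracy is traded for speed.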
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
##      [,1] [,2] [,3] [,4] [,5]
## [1,] 1232 1538 2247 4646 4440
## [2,] 1076 4427 8641 8738    9
## [3,] 9934 4632 7066 2233 8622
## [4,] 8710 6205 5247 1905 6338
## [5,] 4772 9051 3041 7179 3405
## [6,]   20 7224 5498 2242 6878
head(qout$distance)
##           [,1]     [,2]      [,3]     [,4]     [,5]
## [1,] 0.9995661 1.012411 1.0685118 1.112615 1.148740
## [2,] 0.8331856 0.954421 0.9913397 1.031744 1.032894
## [3,] 0.9938676 1.014864 1.0242976 1.025031 1.037188
## [4,] 1.0032258 1.066891 1.0769411 1.076952 1.078563
## [5,] 1.0622971 1.088637 1.1027135 1.119287 1.162734
## [6,] 1.0666075 1.073998 1.0790703 1.090754 1.116814
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
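As a brief sketch of the HNSW equivalent, the calls below mirror the Annoy examples with only BNPARAM changed. The smaller data sizes and the names `hout` and `hq` are illustrative choices, not part of the original example.

```r
library(BiocNeighbors)

# Simulated reference and query points; the sizes are arbitrary.
data <- matrix(runif(1000 * 20), ncol = 20)
query <- matrix(runif(100 * 20), ncol = 20)

# Neighbors of each point within 'data'.
hout <- findKNN(data, k = 10, BNPARAM = HnswParam())

# Neighbors in 'data' for each point in 'query'.
hq <- queryKNN(data, query, k = 5, BNPARAM = HnswParam())

# One row per searched point, one column per neighbor.
dim(hout$index)
dim(hq$index)
```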
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
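The subset and get.distance options can be sketched in the same way. This example assumes they behave as they do for the exact methods; the data size and the names `sub` and `nodist` are illustrative.

```r
library(BiocNeighbors)

data <- matrix(runif(1000 * 20), ncol = 20)

# Only search for the neighbors of the first 10 points.
sub <- findKNN(data, k = 5, subset = 1:10, BNPARAM = AnnoyParam())
dim(sub$index)

# Skip the distance matrix when only neighbor identities are needed.
nodist <- findKNN(data, k = 5, get.distance = FALSE, BNPARAM = AnnoyParam())
names(nodist)
```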
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by setting distance="Manhattan" in AnnoyParam() or HnswParam().
Users are referred to the documentation of each function for specific details on the available arguments.
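A short sketch of a Manhattan search follows, with a manual check of one reported distance. The names `mout`, `hout`, `manual` and `reported` are illustrative, and the comparison is approximate because the index may store coordinates at reduced precision.

```r
library(BiocNeighbors)

data <- matrix(runif(1000 * 20), ncol = 20)

# Manhattan searches with each approximate algorithm.
mout <- findKNN(data, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))
hout <- findKNN(data, k = 5, BNPARAM = HnswParam(distance = "Manhattan"))

# Recompute the distance between point 1 and its first reported neighbor
# as the sum of absolute coordinate differences.
j <- mout$index[1, 1]
manual <- sum(abs(data[1, ] - data[j, ]))
reported <- mout$distance[1, 1]

# The two values should agree up to floating-point precision.
all.equal(manual, reported, tolerance = 1e-4)
```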
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can change the temporary directory, for example through the TMPDIR environment variable, to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpVmLZb5/file8f585f677e26.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
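The sketch below illustrates writing the index to a user-chosen file and cleaning it up afterwards. It assumes that the file path is supplied through an argument named `fname` that buildIndex() passes on to the Annoy builder; check the function documentation before relying on this. The path is a stand-in created with tempfile() so that the example is safe to run, whereas a persistent index would live outside tempdir().

```r
library(BiocNeighbors)

data <- matrix(runif(1000 * 20), ncol = 20)

# Stand-in for a user-chosen location (hypothetical path).
idx.path <- tempfile(fileext = ".idx")

# ASSUMPTION: 'fname' names the index file for the Annoy builder.
pre <- buildIndex(data, BNPARAM = AnnoyParam(), fname = idx.path)
AnnoyIndex_path(pre)

# The pre-built index is used as before.
out <- findKNN(BNINDEX = pre, k = 5)

# Cleaning up is now the user's job; do this only once the index
# is no longer needed.
unlink(AnnoyIndex_path(pre))
```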
sessionInfo()
## R version 4.4.0 Patched (2024-04-24 r86482)
## Platform: x86_64-apple-darwin20
## Running under: macOS Monterey 12.7.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.23.0 knitr_1.46 BiocStyle_2.33.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.43.0 htmltools_0.5.8.1
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.39 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-20 Rcpp_1.0.12 BiocParallel_1.39.0
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.7.0 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.51.0 cachem_1.0.8