BiocNeighbors 1.22.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW methods demonstrated below.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 7185 6235 9282 2873 5047 8585 6871 4713 2789 3316
## [2,] 6741 9689 7636 6359 2993 4731 1581 4386 9981 1681
## [3,] 5412 2374 3733 302 2129 6824 7520 9180 5357 3684
## [4,] 6571 8002 3109 2092 2286 363 9654 8747 3922 6185
## [5,] 3456 8448 7858 5895 7870 860 507 2462 118 7448
## [6,] 6534 4985 200 7427 9027 4778 7579 2558 6726 4197
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8076415 0.9000395 0.9005997 0.9801784 1.0043234 1.0091287 1.0101857
## [2,] 0.8706856 0.9875263 1.0073107 1.0092914 1.0869757 1.0896205 1.1030468
## [3,] 0.8577539 0.9109335 0.9523353 0.9560369 0.9569755 0.9608607 0.9939891
## [4,] 0.7341259 0.8933441 0.9612451 1.0078721 1.0298184 1.0537697 1.0744098
## [5,] 0.8997407 0.9145612 0.9612712 0.9838849 0.9842784 0.9926714 0.9929001
## [6,] 0.9207253 0.9832393 0.9843387 1.0050380 1.0198334 1.0308520 1.0423095
## [,8] [,9] [,10]
## [1,] 1.037611 1.050631 1.052028
## [2,] 1.109873 1.113673 1.127582
## [3,] 1.007553 1.008265 1.010376
## [4,] 1.075423 1.080180 1.097522
## [5,] 1.022584 1.027323 1.029023
## [6,] 1.045436 1.051560 1.058640
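Because the search is approximate, it can be useful to check how often the reported neighbors agree with those from an exact search. The following is a minimal sketch, assuming the exact KmknnParam() algorithm from the same package as the reference. The smaller dataset and the names small, overlap and recall are illustrative choices rather than part of the original example.

```r
library(BiocNeighbors)

set.seed(100)  # illustrative seed for the simulated data
small <- matrix(runif(2000 * 20), ncol=20)

k <- 10
approx <- findKNN(small, k=k, BNPARAM=AnnoyParam())
exact <- findKNN(small, k=k, BNPARAM=KmknnParam())

# Fraction of the exact neighbors that Annoy also reports, per point.
overlap <- vapply(seq_len(nrow(small)), function(i) {
    length(intersect(approx$index[i,], exact$index[i,])) / k
}, numeric(1))

# Average over all points; values closer to 1 indicate closer agreement.
recall <- mean(overlap)
recall
```

The value obtained depends on the data and on the algorithm's settings, so it is best treated as a rough diagnostic rather than a fixed property of the method.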
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2029 9727 7104 4337 7649
## [2,] 5758 3584 8983 3863 884
## [3,] 2435 925 5892 1385 3492
## [4,] 6615 6693 9567 9469 134
## [5,] 5201 8773 3078 3615 790
## [6,] 1152 501 9812 9760 8282
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9775436 0.9846010 0.9996586 1.0210204 1.0483401
## [2,] 1.0020682 1.0271850 1.0344262 1.0970036 1.1021838
## [3,] 0.8052995 0.9297472 0.9565524 1.0217369 1.0438033
## [4,] 0.9118630 0.9746927 1.0185562 1.0784415 1.0974541
## [5,] 0.9531181 0.9647527 0.9669973 0.9699942 0.9768789
## [6,] 0.8380350 0.8832462 0.9639988 0.9690495 0.9718001
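The structure of the queryKNN output can be inspected with a few simple checks. This is a small sketch on freshly simulated data; the names ref, pts and res are illustrative and do not appear in the original example.

```r
library(BiocNeighbors)

set.seed(200)  # illustrative seed for the simulated data
ref <- matrix(runif(5000 * 20), ncol=20)  # reference dataset to search
pts <- matrix(runif(100 * 20), ncol=20)   # query points

res <- queryKNN(ref, pts, k=5, BNPARAM=AnnoyParam())

# One row per query point and one column per neighbor.
dim(res$index)

# The indices refer to rows of 'ref', not rows of 'pts'.
range(res$index)

# Check whether the distances in each row are in increasing order.
all(apply(res$distance, 1, function(d) !is.unsorted(d)))
```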
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
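As a brief sketch, the earlier Annoy calls might be repeated with HNSW as follows. The data are simulated again here so that the example stands alone; the names fout.hnsw and qout.hnsw are illustrative.

```r
library(BiocNeighbors)

set.seed(300)  # illustrative seed for the simulated data
data <- matrix(runif(2000 * 20), ncol=20)
query <- matrix(runif(100 * 20), ncol=20)

# Same calls as for Annoy; only the BNPARAM argument changes.
fout.hnsw <- findKNN(data, k=10, BNPARAM=HnswParam())
qout.hnsw <- queryKNN(data, query, k=5, BNPARAM=HnswParam())

head(fout.hnsw$index)
head(qout.hnsw$distance)
```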
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
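Two of the other options, subset and get.distance, might be used as in the sketch below. It assumes that these arguments are passed through findKNN in the same way as described for the exact methods; the names sub and idx.only are illustrative.

```r
library(BiocNeighbors)

set.seed(400)  # illustrative seed for the simulated data
data <- matrix(runif(2000 * 20), ncol=20)

# Only search for the neighbors of the first 10 points.
sub <- findKNN(data, k=5, subset=1:10, BNPARAM=AnnoyParam())
dim(sub$index)

# Skip the distances when only the neighbor identities are needed.
idx.only <- findKNN(data, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())
names(idx.only)
```

Restricting the search in this way can save time and memory when only part of the output is needed downstream.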
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
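A minimal sketch of a Manhattan-distance search with Annoy is shown here, together with a manual recomputation of one reported distance as a sanity check. The names man, nn and manual are illustrative, and small discrepancies between the two values may arise from the numerical precision used internally.

```r
library(BiocNeighbors)

set.seed(500)  # illustrative seed for the simulated data
data <- matrix(runif(2000 * 20), ncol=20)

man <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the distance from point 1 to its first reported neighbor
# as the sum of absolute coordinate differences.
nn <- man$index[1, 1]
manual <- sum(abs(data[1,] - data[nn,]))

c(reported=man$distance[1, 1], manual=manual)
```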
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively - that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can change the TMPDIR environment variable to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/var/folders/r0/l4fjk6cj5xj0j3brt4bplpl40000gt/T//RtmpkDuFzV/file60522d7e9fe.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.4.0 alpha (2024-03-27 r86216)
## Platform: aarch64-apple-darwin20
## Running under: macOS Ventura 13.6.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.22.0 knitr_1.45 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.42.0 htmltools_0.5.8
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.38 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-19 Rcpp_1.0.12 BiocParallel_1.38.0
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.6.2 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.50.0 cachem_1.0.8