BiocNeighbors 1.18.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including Annoy and HNSW.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 663 7349 4350 7224 5544 3678 4212 24 4400 2453
## [2,] 2117 3024 4922 5535 7912 6501 8210 3835 2360 634
## [3,] 772 1015 9907 9272 4674 3953 490 4006 8344 5789
## [4,] 2709 4650 4054 3516 5648 867 3322 3050 8509 594
## [5,] 8634 265 8855 4397 6139 3679 4488 6547 2902 778
## [6,] 3106 1461 3860 8674 4577 9105 3887 9278 5201 9248
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9590609 1.0070078 1.0104954 1.0170457 1.0438687 1.0531012 1.0635365
## [2,] 1.0302229 1.0478259 1.0501170 1.0517058 1.0660744 1.0665904 1.0697542
## [3,] 0.9085701 0.9839107 1.0030575 1.0502628 1.0586632 1.0618154 1.0711921
## [4,] 0.9112773 0.9600726 0.9864589 0.9972513 1.0092118 1.0143824 1.0194058
## [5,] 0.6485839 0.7989683 0.8251042 0.8303503 0.8903715 0.8962286 0.9170837
## [6,] 0.8852639 1.0074463 1.0421774 1.0638520 1.0668644 1.0736978 1.0789726
## [,8] [,9] [,10]
## [1,] 1.0664017 1.0748776 1.1017776
## [2,] 1.1138246 1.1192961 1.1352495
## [3,] 1.0767018 1.0892571 1.0930864
## [4,] 1.0355130 1.0624614 1.0752249
## [5,] 0.9426028 0.9502914 0.9519216
## [6,] 1.0795145 1.0895756 1.0967888
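Because the search is approximate, some of the reported neighbors may differ from the true nearest neighbors. One way to gauge this is to compare against an exact search on the same data. The following is a minimal sketch using a smaller simulated matrix; the names toy, approx.out, exact.out and recall are introduced here for illustration only, and the exact search uses KmknnParam() as one possible choice.

```r
library(BiocNeighbors)

# Small simulated dataset; 'toy' is a hypothetical name used only here.
set.seed(1000)
toy <- matrix(runif(2000 * 20), ncol = 20)

# Approximate search with Annoy and exact search with KMKNN.
approx.out <- findKNN(toy, k = 10, BNPARAM = AnnoyParam())
exact.out <- findKNN(toy, k = 10, BNPARAM = KmknnParam())

# For each point, the fraction of its exact neighbors recovered by Annoy.
recall <- vapply(seq_len(nrow(toy)), function(i) {
    mean(exact.out$index[i, ] %in% approx.out$index[i, ])
}, numeric(1))
mean(recall)
```

A mean close to 1 indicates that the approximate search recovers most of the exact neighbors; the value will vary with the data and the algorithm settings.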
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5806 9400 6202 4482 1439
## [2,] 428 8022 7726 4620 6289
## [3,] 8158 8827 5910 5328 7781
## [4,] 4528 6644 8346 8984 6702
## [5,] 2170 7822 1017 3761 9397
## [6,] 386 3682 7062 8086 8988
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9136798 1.0422038 1.0435807 1.0464723 1.1319344
## [2,] 0.8103591 0.9882891 1.0350436 1.0809746 1.1400504
## [3,] 0.7821624 0.9137707 0.9324439 0.9398239 0.9643954
## [4,] 0.9799287 1.0153875 1.0770674 1.0964017 1.1124548
## [5,] 0.8264561 0.9451679 0.9608215 0.9924807 1.0001674
## [6,] 1.0949714 1.1124891 1.1208382 1.1230546 1.1243676
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
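As a brief sketch of the switch to HNSW, the same findKNN and queryKNN calls can be repeated with only the BNPARAM argument changed. The toy and toy.query matrices below are hypothetical stand-ins for the larger datasets used above.

```r
library(BiocNeighbors)

# Small simulated reference and query sets (hypothetical names).
set.seed(1001)
toy <- matrix(runif(1000 * 20), ncol = 20)
toy.query <- matrix(runif(100 * 20), ncol = 20)

# Same calls as for Annoy, with only BNPARAM changed.
hout <- findKNN(toy, k = 10, BNPARAM = HnswParam())
hqout <- queryKNN(toy, toy.query, k = 5, BNPARAM = HnswParam())

# One row per point (or query point), one column per neighbor.
dim(hout$index)
dim(hqout$index)
```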
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
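The other options can be combined with a pre-built index in the same call. The sketch below restricts the search to the first ten points and skips the distance matrix; toy, toy.pre and sub.out are hypothetical names, and the data is a small simulated matrix rather than the dataset used above.

```r
library(BiocNeighbors)

# Small simulated dataset and a pre-built Annoy index (hypothetical names).
set.seed(1002)
toy <- matrix(runif(1000 * 20), ncol = 20)
toy.pre <- buildIndex(toy, BNPARAM = AnnoyParam())

# Search only for the first 10 points and do not return distances.
sub.out <- findKNN(BNINDEX = toy.pre, k = 5, subset = 1:10,
    get.distance = FALSE)

# One row per requested point.
dim(sub.out$index)
```

BPPARAM can be supplied in the same way, using a parameter object from the BiocParallel package.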
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
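A short sketch of a Manhattan distance search with Annoy follows. To check that the reported values are indeed Manhattan distances, we recompute them by hand for the first point; the names toy, man.out, nn and manual are illustrative only. Small discrepancies are expected, as we assume the index stores coordinates at reduced precision.

```r
library(BiocNeighbors)

# Small simulated dataset (hypothetical name).
set.seed(1003)
toy <- matrix(runif(1000 * 20), ncol = 20)

# Annoy search using the Manhattan distance.
man.out <- findKNN(toy, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))

# Recompute the distances from the first point to its reported neighbors:
# subtract the first point from each neighbor, then sum absolute differences.
nn <- man.out$index[1, ]
manual <- rowSums(abs(sweep(toy[nn, , drop = FALSE], 2, toy[1, ])))
cbind(reported = man.out$distance[1, ], manual = manual)
```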
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can set the TMPDIR environment variable before starting R so that tempdir() points to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpIUmBdb/file49ad59649f4e.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
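The following sketch shows one way this might look for Annoy. It assumes that buildIndex() forwards an fname argument to the underlying Annoy index builder; this should be checked against the documentation of the installed version. The file name toy-annoy.idx and the variables idx.path, toy.pre, existed and out are hypothetical, and tempdir() is used here only so that the example leaves nothing behind.

```r
library(BiocNeighbors)

# Small simulated dataset (hypothetical name).
set.seed(1004)
toy <- matrix(runif(1000 * 20), ncol = 20)

# Hypothetical location; in practice this would be a persistent directory.
idx.path <- file.path(tempdir(), "toy-annoy.idx")

# ASSUMPTION: 'fname' is passed through to the Annoy index builder.
toy.pre <- buildIndex(toy, BNPARAM = AnnoyParam(), fname = idx.path)
AnnoyIndex_path(toy.pre)
existed <- file.exists(idx.path)

# The index is used in the same way as one stored in the default location.
out <- findKNN(BNINDEX = toy.pre, k = 5)

# Manual clean-up once the index is no longer needed.
unlink(AnnoyIndex_path(toy.pre))
```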
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84266)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.18.0 knitr_1.42 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.1 rlang_1.1.0 xfun_0.38
## [4] jsonlite_1.8.4 S4Vectors_0.38.1 htmltools_0.5.5
## [7] stats4_4.3.0 sass_0.4.5 rmarkdown_2.21
## [10] grid_4.3.0 evaluate_0.20 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.7 bookdown_0.33
## [16] BiocManager_1.30.20 compiler_4.3.0 codetools_0.2-19
## [19] Rcpp_1.0.10 BiocParallel_1.34.1 lattice_0.21-8
## [22] digest_0.6.31 R6_2.5.1 parallel_4.3.0
## [25] bslib_0.4.2 Matrix_1.5-4 tools_4.3.0
## [28] BiocGenerics_0.46.0 cachem_1.0.7