BiocNeighbors 1.20.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW methods used below.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 9465 3315 8194 2881 8939 7348 7583 1397 8346 6897
## [2,] 1393 7994 3189 1124 4135 6473 6568 8237 9629 5428
## [3,] 7691 4756 6473 884 7635 1758 9698 7810 8371 7750
## [4,] 6718 7114 1455 2213 7550 9227 6256 4465 5587 7997
## [5,] 918 7817 3660 1444 4852 8193 6821 3053 1915 3700
## [6,] 1349 1804 2394 2173 3973 526 6609 7423 4823 5221
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0051320 1.0099893 1.0120785 1.0415605 1.0469691 1.0470042 1.0716137
## [2,] 0.9907297 0.9929423 1.0283728 1.0313265 1.0769904 1.0847454 1.0893449
## [3,] 1.0016223 1.0292242 1.0752902 1.0813148 1.1228063 1.1346149 1.1486953
## [4,] 0.8453010 0.8457760 0.9024099 0.9904853 0.9946299 0.9978787 1.0070746
## [5,] 1.0259787 1.0434988 1.0596395 1.0709848 1.0898215 1.0908622 1.0965651
## [6,] 0.8108830 0.8645692 0.9020698 0.9153494 0.9194252 0.9339090 0.9384009
## [,8] [,9] [,10]
## [1,] 1.072527 1.0848043 1.092572
## [2,] 1.092362 1.1038890 1.112921
## [3,] 1.152456 1.1618799 1.167553
## [4,] 1.008721 1.0220501 1.042942
## [5,] 1.096973 1.1115184 1.123486
## [6,] 0.940537 0.9729785 1.009081
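Because the search is approximate, it can be useful to check how many of the true neighbors are recovered. The following is a minimal sketch, assuming that the exact KmknnParam() search from the same package serves as the reference; the smaller data size, the seed, and the names exact, annoy.out and recall are illustrative choices, not part of the original example.

```r
library(BiocNeighbors)

set.seed(100)  # arbitrary seed, used only to make this sketch repeatable
nobs <- 2000   # smaller than the main example to keep the check quick
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

k <- 10
exact <- findKNN(data, k=k, BNPARAM=KmknnParam())      # exact reference
annoy.out <- findKNN(data, k=k, BNPARAM=AnnoyParam())  # approximate search

# For each point, the fraction of its exact neighbors that Annoy also reported.
recall <- vapply(seq_len(nobs), function(i) {
    mean(annoy.out$index[i,] %in% exact$index[i,])
}, numeric(1))
summary(recall)
```

A recall close to 1 for most points suggests that the default settings are adequate for the data at hand; lower values may justify revisiting the tuning arguments documented in ?AnnoyParam.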
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1154 1038 5125 9046 5987
## [2,] 6605 72 5681 2743 9627
## [3,] 9729 5339 8027 1836 6403
## [4,] 5138 9195 2623 8989 3337
## [5,] 5013 4981 6072 2958 1369
## [6,] 4561 7241 2159 4946 1424
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9291151 0.9306268 0.9635136 0.9637606 0.9806175
## [2,] 0.7492169 0.8793808 0.9994067 1.0043513 1.0054315
## [3,] 1.0779456 1.0814588 1.1091346 1.1144617 1.1496998
## [4,] 0.9416516 1.0309392 1.0521165 1.0890845 1.1049868
## [5,] 0.9274538 0.9664112 1.0005889 1.0277957 1.0444899
## [6,] 1.0424936 1.0440456 1.0574404 1.0612711 1.0681978
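To make the structure of the output concrete, the sketch below recomputes the reported distances for one query point by hand. It assumes the default Euclidean distance; the seed, the reduced data sizes and the names nn and manual are hypothetical, and a loose tolerance is used because the approximate methods may store values at lower precision.

```r
library(BiocNeighbors)

set.seed(200)  # arbitrary seed for this sketch
ndim <- 20
data <- matrix(runif(2000*ndim), ncol=ndim)
query <- matrix(runif(100*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())

# Row i of 'index' holds row numbers of 'data' that are neighbors of query[i,].
nn <- data[qout$index[1,],,drop=FALSE]

# Euclidean distance from the first query point to each reported neighbor.
manual <- sqrt(rowSums(sweep(nn, 2, query[1,])^2))
all.equal(manual, qout$distance[1,], tolerance=1e-5)
```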
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
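As a brief illustration, the calls shown earlier for Annoy can be repeated with HnswParam(). This is only a sketch on smaller simulated data; the seed and the names hfout and hqout are hypothetical.

```r
library(BiocNeighbors)

set.seed(300)  # arbitrary seed for this sketch
ndim <- 20
data <- matrix(runif(2000*ndim), ncol=ndim)
query <- matrix(runif(100*ndim), ncol=ndim)

# Same calls as for Annoy; only the BNPARAM argument changes.
hfout <- findKNN(data, k=10, BNPARAM=HnswParam())
hqout <- queryKNN(data, query, k=5, BNPARAM=HnswParam())

dim(hfout$index)     # one row per point in 'data', one column per neighbor
dim(hqout$distance)  # one row per point in 'query'
```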
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the index once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
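The other options listed above can be sketched in the same way. The example below assumes that the BiocParallel package is installed and uses SnowParam(2) as one possible backend; the choice of two workers, the seed and the names sub.out, idx.only and par.out are illustrative.

```r
library(BiocNeighbors)
library(BiocParallel)

set.seed(400)  # arbitrary seed for this sketch
data <- matrix(runif(2000*20), ncol=20)

# subset: only find neighbors for the first 10 points.
sub.out <- findKNN(data, k=5, subset=1:10, BNPARAM=AnnoyParam())
dim(sub.out$index)

# get.distance: skip the distance matrix when only the identities are needed.
idx.only <- findKNN(data, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())
names(idx.only)

# BPPARAM: distribute the search across two workers (an arbitrary choice).
par.out <- findKNN(data, k=5, BPPARAM=SnowParam(2), BNPARAM=AnnoyParam())
dim(par.out$index)
```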
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
Users are referred to the documentation of each function for specific details on the available arguments.
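A small sketch of a Manhattan search with Annoy follows, together with a manual check of the reported distances for the first point. The seed, the data size and the names mout, nn and manual are hypothetical, and the tolerance is deliberately loose in case values are stored at lower precision.

```r
library(BiocNeighbors)

set.seed(500)  # arbitrary seed for this sketch
data <- matrix(runif(2000*20), ncol=20)

mout <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Manhattan distance: sum of absolute coordinate differences.
nn <- data[mout$index[1,],,drop=FALSE]
manual <- rowSums(abs(sweep(nn, 2, data[1,])))
all.equal(manual, mout$distance[1,], tolerance=1e-5)
```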
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can change the temporary directory (for example, by setting the TMPDIR environment variable before starting R) to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmpu1abJr/file5aa15091cf36.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
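One way to control where the index file is written, and to clean it up afterwards, is sketched below. It assumes the directory argument of AnnoyParam() rather than a full file path, and the directory name "annoy-indices" is hypothetical; it is placed under tempdir() here only so that the sketch is safe to run, whereas a persistent index would need a location outside the session's temporary directory.

```r
library(BiocNeighbors)

set.seed(600)  # arbitrary seed for this sketch
data <- matrix(runif(2000*20), ncol=20)

# Hypothetical directory for index files; replace with a persistent location
# if the index is meant to outlive the session.
idx.dir <- file.path(tempdir(), "annoy-indices")
dir.create(idx.dir, showWarnings=FALSE)

pre <- buildIndex(data, BNPARAM=AnnoyParam(directory=idx.dir))
idx.file <- AnnoyIndex_path(pre)
file.exists(idx.file)

# The pre-built index is used as before.
out <- findKNN(BNINDEX=pre, k=5)

# Clean up manually once the index is no longer needed.
unlink(idx.file)
```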
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.20.2 knitr_1.45 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.1 rlang_1.1.1 xfun_0.41
## [4] jsonlite_1.8.7 S4Vectors_0.40.2 htmltools_0.5.7
## [7] stats4_4.3.2 sass_0.4.7 rmarkdown_2.25
## [10] grid_4.3.2 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.7 bookdown_0.36
## [16] BiocManager_1.30.22 compiler_4.3.2 codetools_0.2-19
## [19] Rcpp_1.0.11 BiocParallel_1.36.0 lattice_0.22-5
## [22] digest_0.6.33 R6_2.5.1 parallel_4.3.2
## [25] bslib_0.5.1 Matrix_1.6-1.1 tools_4.3.2
## [28] BiocGenerics_0.48.1 cachem_1.0.8