BiocNeighbors 1.20.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 7191 1462 4096 3121 2907 1516 1841 2792 7749 8251
## [2,] 438 5791 327 6616 2840 7244 4938 7626 1891 9606
## [3,] 6566 185 4028 5499 7924 4469 7770 1485 124 1483
## [4,] 7521 3625 1520 6983 3315 6498 4392 1281 3569 860
## [5,] 4648 3584 5404 8916 999 5006 3654 7227 9936 5280
## [6,] 4637 8330 1488 7770 7864 8986 185 5194 4892 8134
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8708924 0.9201971 0.9778174 1.0183202 1.0234785 1.0301259 1.0595636
## [2,] 0.8885668 0.8919418 0.9135027 0.9238924 0.9469747 0.9561982 0.9860298
## [3,] 0.7205864 1.0103204 1.0300944 1.0570358 1.0636051 1.0699366 1.1028218
## [4,] 1.0429268 1.0746384 1.0890863 1.1089108 1.1372968 1.1413801 1.1541716
## [5,] 0.9937217 1.0027014 1.0347068 1.0525789 1.1410241 1.1661737 1.1766642
## [6,] 0.8931364 0.9494966 0.9624577 0.9634194 0.9882307 0.9974183 0.9984540
## [,8] [,9] [,10]
## [1,] 1.0640060 1.0647048 1.069014
## [2,] 0.9873471 0.9949265 1.008371
## [3,] 1.1131444 1.1422718 1.145583
## [4,] 1.1713729 1.1752270 1.182554
## [5,] 1.1803154 1.1863912 1.210942
## [6,] 1.0074879 1.0219098 1.028828
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8431 2200 2114 4208 9732
## [2,] 9491 4517 6476 1727 6297
## [3,] 4697 166 3022 3391 1193
## [4,] 7114 206 4478 8399 1891
## [5,] 4439 3167 1112 7095 1774
## [6,] 9235 6646 1281 3133 3277
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0445930 1.0707749 1.0805953 1.1079738 1.1312418
## [2,] 0.9895988 0.9994723 1.0012251 1.0509707 1.0870900
## [3,] 0.8710847 0.9248948 0.9709811 0.9844105 0.9930291
## [4,] 0.7749339 0.8814268 0.8843094 0.9072486 0.9114492
## [5,] 0.8673612 0.9486401 0.9687132 0.9850274 0.9986081
## [6,] 0.8231861 0.8742477 0.8788669 0.9018787 0.9058378
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "F:\\biocbuild\\bbs-3.18-bioc\\tmpdir\\RtmpMHvReU\\file446040bf7356.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2022 x64 (build 20348)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.20.2 knitr_1.45 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.2 xfun_0.41
## [4] jsonlite_1.8.8 S4Vectors_0.40.2 htmltools_0.5.7
## [7] stats4_4.3.2 sass_0.4.8 rmarkdown_2.25
## [10] grid_4.3.2 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.37 BiocManager_1.30.22 compiler_4.3.2
## [19] codetools_0.2-19 Rcpp_1.0.11 BiocParallel_1.36.0
## [22] lattice_0.22-5 digest_0.6.33 R6_2.5.1
## [25] parallel_4.3.2 bslib_0.6.1 Matrix_1.6-4
## [28] tools_4.3.2 BiocGenerics_0.48.1 cachem_1.0.8