BiocNeighbors 1.16.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2052 5309 6282 2557 2168 8263 9205 7528 8789 3740
## [2,] 5736 9615 3647 1358 9782 1680 4137 5847 9136 6982
## [3,] 6651 5430 4595 8265 7810 8425 6423 330 2549 570
## [4,] 3696 3778 3165 1608 1901 3711 8224 577 8295 1629
## [5,] 4287 548 5115 3675 7887 8367 6388 4526 2628 6315
## [6,] 7735 7322 8125 6150 3509 5365 4657 3303 9131 6261
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8627000 0.9854210 1.001156 1.0017806 1.0360787 1.0406851 1.0484282
## [2,] 1.0128206 1.0780077 1.078687 1.1209522 1.1439979 1.1514652 1.1527506
## [3,] 0.7901211 0.8264780 0.863410 0.8977429 0.9352114 0.9389573 0.9490262
## [4,] 0.8244431 0.9287865 1.008486 1.0160102 1.0175763 1.0424658 1.0495710
## [5,] 0.9664762 0.9750541 0.992348 1.0133016 1.0293305 1.0478083 1.0694299
## [6,] 0.9233503 1.0199189 1.053357 1.0571043 1.0590004 1.0592493 1.0667665
## [,8] [,9] [,10]
## [1,] 1.0540121 1.0542150 1.0659641
## [2,] 1.1559690 1.1592537 1.1608156
## [3,] 0.9617434 0.9673256 0.9715796
## [4,] 1.1036350 1.1150218 1.1169617
## [5,] 1.0697476 1.0756302 1.0765107
## [6,] 1.0774070 1.0920420 1.0928895
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5275 9661 1609 9234 4824
## [2,] 643 4035 6056 2402 7981
## [3,] 2980 8271 3716 3533 1564
## [4,] 600 1649 3742 4262 4470
## [5,] 2453 3329 8097 736 1528
## [6,] 1261 1086 9393 2514 1234
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8067096 0.9223074 1.0235654 1.025655 1.069929
## [2,] 0.9043778 1.0257226 1.0618618 1.087329 1.103766
## [3,] 0.9051384 1.0243276 1.0300729 1.038151 1.059343
## [4,] 0.8496410 0.9191397 0.9824249 1.003406 1.033886
## [5,] 0.9433628 0.9715790 1.0286065 1.082799 1.109918
## [6,] 0.8933529 0.9920318 1.0104198 1.017549 1.040444
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpgdJPve/file842a243d0995.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.16.0 knitr_1.40 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 magrittr_2.0.3 BiocGenerics_0.44.0
## [4] BiocParallel_1.32.0 lattice_0.20-45 R6_2.5.1
## [7] rlang_1.0.6 fastmap_1.1.0 stringr_1.4.1
## [10] tools_4.2.1 parallel_4.2.1 grid_4.2.1
## [13] xfun_0.34 cli_3.4.1 jquerylib_0.1.4
## [16] htmltools_0.5.3 yaml_2.3.6 digest_0.6.30
## [19] bookdown_0.29 Matrix_1.5-1 BiocManager_1.30.19
## [22] S4Vectors_0.36.0 sass_0.4.2 codetools_0.2-18
## [25] cachem_1.0.6 evaluate_0.17 rmarkdown_2.17
## [28] stringi_1.7.8 compiler_4.2.1 bslib_0.4.0
## [31] stats4_4.2.1 jsonlite_1.8.3