BiocNeighbors 1.18.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 9662 3037 299 4397 726 5922 1325 7163 672 1337
## [2,] 1488 9290 7952 3676 4691 9474 3924 7058 4831 3837
## [3,] 844 4022 3508 6331 1039 7570 9114 7037 4939 3209
## [4,] 4885 5480 2471 3059 4728 9445 7564 9800 1160 4288
## [5,] 2498 123 7127 3073 4679 2648 1293 8873 4212 6237
## [6,] 4128 3969 1070 9264 5318 7792 2124 465 9227 262
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0018635 1.0734630 1.1102037 1.1157204 1.1295308 1.1385425 1.1708193
## [2,] 0.8835352 0.9777583 1.0060213 1.0081369 1.0122799 1.0259368 1.0492104
## [3,] 0.8123723 0.8647476 0.9139982 0.9534140 0.9876873 0.9924435 0.9940631
## [4,] 0.9842315 0.9888017 1.0033739 1.0249933 1.0479329 1.0561618 1.0610654
## [5,] 0.8673671 0.9623174 0.9827742 0.9977949 1.0064403 1.0284088 1.0384618
## [6,] 0.7851849 0.8384334 0.9159828 0.9715101 0.9834490 1.0078323 1.0196633
## [,8] [,9] [,10]
## [1,] 1.1721100 1.1844047 1.185606
## [2,] 1.0585682 1.0604984 1.065688
## [3,] 0.9975992 0.9996915 1.002198
## [4,] 1.0628065 1.0649843 1.067426
## [5,] 1.0565158 1.0652058 1.067776
## [6,] 1.0235919 1.0277270 1.033206
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3650 8442 28 3378 3349
## [2,] 328 7740 6927 2418 3180
## [3,] 4141 5745 2902 1041 5609
## [4,] 139 427 2301 422 4046
## [5,] 1557 7397 6540 5908 3096
## [6,] 8041 6676 871 7056 2647
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9686017 1.2081473 1.2145671 1.2264786 1.2347609
## [2,] 0.9217723 0.9357488 0.9521836 0.9589323 0.9655992
## [3,] 0.9635603 0.9915902 1.0060973 1.0188990 1.0341231
## [4,] 0.7947047 0.8804924 0.9119638 0.9358485 0.9468299
## [5,] 0.9376368 0.9625379 0.9931554 1.0463696 1.0515987
## [6,] 0.9765031 1.0734611 1.0976894 1.0983497 1.1001519
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpvaDkk3/file30a4cf51edb153.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.18.0 knitr_1.42 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.1 rlang_1.1.0 xfun_0.39
## [4] jsonlite_1.8.4 S4Vectors_0.38.0 htmltools_0.5.5
## [7] stats4_4.3.0 sass_0.4.5 rmarkdown_2.21
## [10] grid_4.3.0 evaluate_0.20 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.7 bookdown_0.33
## [16] BiocManager_1.30.20 compiler_4.3.0 codetools_0.2-19
## [19] Rcpp_1.0.10 BiocParallel_1.34.0 lattice_0.21-8
## [22] digest_0.6.31 R6_2.5.1 parallel_4.3.0
## [25] bslib_0.4.2 Matrix_1.5-4 tools_4.3.0
## [28] BiocGenerics_0.46.0 cachem_1.0.7