BiocNeighbors 1.22.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3506 7656 9060 294 7982 5763 5474 2185 2560 5397
## [2,] 5151 7920 2937 7891 1893 9072 2090 7528 3900 8499
## [3,] 2397 7417 7970 2647 9880 8618 2825 3404 424 6403
## [4,] 6203 7936 530 7071 4525 899 4804 8264 5204 4180
## [5,] 6131 1074 952 6845 9337 2972 5813 6267 5682 477
## [6,] 2170 22 9687 3589 6577 582 237 1970 7244 9155
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8567343 0.9148225 0.9407513 0.9881785 1.039331 1.044142 1.045700
## [2,] 0.6668538 0.8987589 0.9708484 1.0178564 1.025155 1.025636 1.026060
## [3,] 0.9775957 0.9990369 1.0128471 1.0149100 1.021895 1.027442 1.027629
## [4,] 0.8519940 0.9045876 0.9454841 0.9798104 1.019344 1.113069 1.130685
## [5,] 0.9539289 0.9631653 1.0059111 1.0206730 1.069508 1.106110 1.116073
## [6,] 0.9079698 0.9231571 0.9897676 0.9988787 1.027678 1.049186 1.051881
## [,8] [,9] [,10]
## [1,] 1.051286 1.056190 1.064790
## [2,] 1.044135 1.048658 1.049732
## [3,] 1.058460 1.064249 1.094251
## [4,] 1.135794 1.140484 1.170250
## [5,] 1.128396 1.144618 1.145611
## [6,] 1.080945 1.083736 1.100816
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2753 8516 7796 6679 6884
## [2,] 3289 6218 3669 309 8970
## [3,] 3778 9850 9023 4961 1192
## [4,] 432 160 6157 8795 5380
## [5,] 4486 7019 1370 332 1204
## [6,] 2369 4021 1589 8072 9372
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.7266792 1.1280293 1.1945150 1.2018101 1.2285351
## [2,] 0.8395190 0.8967423 0.9254928 0.9273860 0.9366759
## [3,] 0.9343264 1.0064628 1.0398891 1.0494432 1.0512451
## [4,] 0.7891868 0.9628958 1.0200688 1.0563028 1.0563600
## [5,] 0.8804094 0.8907453 0.9851564 0.9880109 1.0105910
## [6,] 0.7642388 0.8639182 0.8784883 0.8930487 0.8942956
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpMDZF9O/file24d8b43ba45760.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.22.0 knitr_1.46 BiocStyle_2.32.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.2 rlang_1.1.3 xfun_0.43
## [4] jsonlite_1.8.8 S4Vectors_0.42.0 htmltools_0.5.8.1
## [7] stats4_4.4.0 sass_0.4.9 rmarkdown_2.26
## [10] grid_4.4.0 evaluate_0.23 jquerylib_0.1.4
## [13] fastmap_1.1.1 yaml_2.3.8 lifecycle_1.0.4
## [16] bookdown_0.39 BiocManager_1.30.22 compiler_4.4.0
## [19] codetools_0.2-20 Rcpp_1.0.12 BiocParallel_1.38.0
## [22] lattice_0.22-6 digest_0.6.35 R6_2.5.1
## [25] parallel_4.4.0 bslib_0.7.0 Matrix_1.7-0
## [28] tools_4.4.0 BiocGenerics_0.50.0 cachem_1.0.8