BiocNeighbors 1.16.0
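The BiocNeighbors package provides several algorithms for approximate neighbor searches, including Annoy (Approximate Nearest Neighbors Oh Yeah) and HNSW (Hierarchical Navigable Small Worlds).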
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1488 4100 6960 3210 7315 986 5158 214 8138 6064
## [2,] 1046 3489 630 2825 8518 1012 2964 7116 9727 8869
## [3,] 7682 7331 7542 3778 3894 3637 2806 1012 4397 1759
## [4,] 5805 7844 349 8098 1432 9033 3493 1503 4335 3922
## [5,] 5236 3597 5057 8497 5781 2617 9107 2569 6834 4684
## [6,] 7292 4071 9683 3849 1157 7921 8580 3710 9333 1058
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8980877 0.9016330 1.0025955 1.0135369 1.0913147 1.0928073 1.1074014
## [2,] 0.8442817 0.8833933 0.8835032 0.9719879 0.9725556 0.9903216 0.9953329
## [3,] 0.9005111 0.9195552 0.9760839 1.0521972 1.0542729 1.0573517 1.0579487
## [4,] 0.8925229 0.8941774 0.9950394 0.9991521 1.0039755 1.0101743 1.0197034
## [5,] 0.8919423 0.9140932 0.9193544 0.9543656 0.9553107 0.9813408 1.0118549
## [6,] 0.8464547 0.8740495 0.9008550 0.9276317 0.9474833 0.9721993 0.9756349
## [,8] [,9] [,10]
## [1,] 1.1101927 1.122265 1.1256909
## [2,] 1.0039561 1.033293 1.0359432
## [3,] 1.0626491 1.067146 1.0895292
## [4,] 1.0385362 1.059329 1.0664909
## [5,] 1.0281721 1.064905 1.0736519
## [6,] 0.9775385 0.978605 0.9906248
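Because these searches are approximate, it can be useful to check how closely they agree with an exact search. The following is a minimal sketch rather than part of the original workflow: it assumes a smaller simulated dataset, uses KmknnParam() (one of the exact methods) as the reference, and reports the average fraction of exact neighbors that Annoy recovers. The object names (small, exact, annoy, overlap) are illustrative only.

```r
library(BiocNeighbors)

set.seed(100)
small <- matrix(runif(2000 * 20), ncol=20)
k <- 10

# Exact reference search and approximate Annoy search on the same data.
exact <- findKNN(small, k=k, BNPARAM=KmknnParam())
annoy <- findKNN(small, k=k, BNPARAM=AnnoyParam())

# Per-point fraction of exact neighbors that also appear in the Annoy result.
overlap <- vapply(seq_len(nrow(small)), function(i) {
    length(intersect(exact$index[i,], annoy$index[i,])) / k
}, numeric(1))

# Average agreement across all points (1 means identical neighbor sets).
mean(overlap)
```

A value close to 1 suggests that the approximation loses little on this dataset. The result will vary with the data and with the parameters of the approximate method.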
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8241 1642 4623 659 7700
## [2,] 999 8224 3701 7500 5814
## [3,] 1600 479 9195 5187 4813
## [4,] 8531 1315 1509 6472 545
## [5,] 5732 2575 834 6667 347
## [6,] 9366 8968 3256 4082 9904
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9060953 0.9152490 0.9513214 0.9561808 0.9611506
## [2,] 0.8480575 0.9162175 0.9295247 0.9333763 0.9646035
## [3,] 0.7936448 0.9068733 0.9118121 1.0047711 1.0825424
## [4,] 0.8237086 0.9147242 0.9257399 0.9792080 0.9990725
## [5,] 0.7983628 0.8051614 0.8466321 0.9108397 0.9505035
## [6,] 0.8724666 1.0984951 1.1566696 1.1602634 1.1629437
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
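As a brief sketch of the switch to HNSW, the calls below mirror the Annoy examples and change only the BNPARAM object. The simulated matrices (hdata, hquery) and their sizes are assumptions made for illustration.

```r
library(BiocNeighbors)

set.seed(101)
hdata <- matrix(runif(2000 * 20), ncol=20)
hquery <- matrix(runif(100 * 20), ncol=20)

# Same calls as for Annoy; only the BNPARAM object changes.
hfout <- findKNN(hdata, k=10, BNPARAM=HnswParam())
hqout <- queryKNN(hdata, hquery, k=5, BNPARAM=HnswParam())

# One row per point (or query point), one column per neighbor.
dim(hfout$index)
dim(hqout$distance)
```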
Most of the options described for the exact methods are also applicable here. For example:
subset to identify neighbors for a subset of points.
get.distance to avoid retrieving distances when unnecessary.
BPPARAM to parallelize the calculations across multiple workers.
BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
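The other options listed above can be sketched in the same way. This example assumes a small simulated matrix (odata) and uses SerialParam() from BiocParallel as a stand-in backend so that it runs anywhere. A multi-worker backend such as MulticoreParam(2) or SnowParam(2) could be substituted to actually distribute the work.

```r
library(BiocNeighbors)
library(BiocParallel)

set.seed(102)
odata <- matrix(runif(2000 * 20), ncol=20)

# Neighbors for the first five points only.
sub.out <- findKNN(odata, k=5, subset=1:5, BNPARAM=AnnoyParam())

# Indices only; the distance matrix is not returned.
idx.only <- findKNN(odata, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())

# Pass a BiocParallel backend; swap in a multi-worker backend to parallelize.
par.out <- findKNN(odata, k=5, BNPARAM=AnnoyParam(), BPPARAM=SerialParam())

dim(sub.out$index)
names(idx.only)
```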
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
Users are referred to the documentation of each function for specific details on the available arguments.
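To make the Manhattan option concrete, the sketch below runs an Annoy search with distance="Manhattan" and recomputes the reported distances for the first point by hand. The simulated matrix (mdata) and the manual check are illustrative assumptions. A small tolerance is expected to be needed in the comparison if the index stores coordinates at reduced precision.

```r
library(BiocNeighbors)

set.seed(103)
mdata <- matrix(runif(2000 * 20), ncol=20)

mout <- findKNN(mdata, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the reported distances for the first point by hand:
# the Manhattan distance is the sum of absolute coordinate differences.
nbrs <- mout$index[1,]
manual <- colSums(abs(t(mdata[nbrs,,drop=FALSE]) - mdata[1,]))

# Compare with the distances reported by the search.
all.equal(manual, mout$distance[1,], tolerance=1e-4)
```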
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() [1] and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpiWwPKb/file3dd2ed8b9db.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This path can then be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
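A minimal sketch of this workflow follows. It assumes that buildIndex() forwards an fname argument to the underlying Annoy builder, as in the buildAnnoy() signature for this release; check the function's documentation before relying on it. The file name and the simulated matrix (pdata) are hypothetical.

```r
library(BiocNeighbors)

set.seed(104)
pdata <- matrix(runif(2000 * 20), ncol=20)

# Hypothetical location, used here only so that the example cleans up after
# itself; a real workflow would pick a directory that outlives the session.
idx.file <- file.path(tempdir(), "my_annoy_index.idx")

# Assumption: 'fname' is passed through buildIndex() to the Annoy builder.
keep <- buildIndex(pdata, BNPARAM=AnnoyParam(), fname=idx.file)
AnnoyIndex_path(keep)

# The pre-built index is used in the same way as before.
pout <- findKNN(BNINDEX=keep, k=5)

# Files at user-specified paths are the user's responsibility, so delete the
# index once it is no longer needed.
unlink(idx.file)
```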
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.0
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.16.0 knitr_1.39 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 magrittr_2.0.3 BiocGenerics_0.44.0
## [4] BiocParallel_1.32.1 lattice_0.20-45 R6_2.5.1
## [7] rlang_1.0.4 fastmap_1.1.0 stringr_1.4.0
## [10] tools_4.2.1 parallel_4.2.1 grid_4.2.1
## [13] xfun_0.31 cli_3.3.0 jquerylib_0.1.4
## [16] htmltools_0.5.2 yaml_2.3.5 digest_0.6.29
## [19] bookdown_0.27 Matrix_1.4-1 BiocManager_1.30.18
## [22] S4Vectors_0.36.0 sass_0.4.1 codetools_0.2-18
## [25] evaluate_0.15 rmarkdown_2.14 stringi_1.7.8
## [28] compiler_4.2.1 bslib_0.3.1 stats4_4.2.1
## [31] jsonlite_1.8.0
[1] On HPC file systems, you can set the TMPDIR environment variable, which R consults at startup to choose tempdir(), to a location that is more amenable to concurrent access.