BiocNeighbors 1.11.0
The BiocNeighbors package provides algorithms for approximate neighbor searches, namely Annoy (which indexes the data with a forest of trees) and HNSW (which uses a series of graphs).
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2421 2328 5329 420 8656 7727 4890 7183 4422 4277
## [2,] 8800 8363 1683 7828 552 3926 4670 3685 246 98
## [3,] 1995 3906 7727 275 1542 6125 7216 8304 9113 209
## [4,] 3146 7408 3157 3527 7108 4794 5268 3014 9382 209
## [5,] 2762 6045 1136 2049 3263 4119 4993 8606 5661 9066
## [6,] 5660 4269 4279 916 2026 357 3388 2331 9229 5943
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.0336167 1.1038778 1.1093621 1.1119086 1.1199956 1.1291362 1.1390663
## [2,] 0.8655013 0.8672772 0.9290387 0.9539437 0.9665293 0.9754884 0.9787839
## [3,] 0.9729924 1.0096172 1.0132091 1.0492468 1.0726197 1.0957625 1.0987935
## [4,] 0.7949862 0.8238237 0.9506414 0.9551117 0.9711037 0.9711465 0.9770277
## [5,] 0.8529103 0.9102007 0.9914404 1.0265529 1.0274782 1.0346222 1.0636659
## [6,] 0.8690536 0.9264944 1.0805883 1.0962285 1.1123945 1.1131320 1.1174803
## [,8] [,9] [,10]
## [1,] 1.1453700 1.1658609 1.179105
## [2,] 0.9876196 0.9969602 0.999236
## [3,] 1.1113429 1.1156112 1.118943
## [4,] 0.9844464 0.9858928 1.006664
## [5,] 1.0653194 1.0715467 1.073080
## [6,] 1.1883646 1.1917070 1.192599
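Because Annoy is approximate, the reported neighbors may differ from the true ones. One rough way to gauge this is to compare against an exact search. The sketch below assumes KmknnParam() as the exact method and uses a smaller, hypothetical dataset (small_data) to keep the run time short. The recall value is the fraction of exact neighbors recovered per point, averaged over all points.

```r
library(BiocNeighbors)

# Smaller illustrative dataset (hypothetical; not the one used above).
set.seed(1000)
small_data <- matrix(runif(2000 * 20), ncol = 20)
k <- 10

# Approximate search with Annoy and exact search with KMKNN.
approx_out <- findKNN(small_data, k = k, BNPARAM = AnnoyParam())
exact_out <- findKNN(small_data, k = k, BNPARAM = KmknnParam())

# Fraction of exact neighbors that Annoy also reports, for each point.
overlap <- vapply(seq_len(nrow(small_data)), function(i) {
    length(intersect(approx_out$index[i, ], exact_out$index[i, ])) / k
}, numeric(1))

# Average recall across all points.
recall <- mean(overlap)
recall
```

A recall close to 1 suggests that the approximation loses little for this dataset; the value will depend on the data and on the index parameters.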
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4895 8845 945 7610 4891
## [2,] 1041 9858 1560 6384 950
## [3,] 2756 4210 2705 6426 4445
## [4,] 6475 8717 8203 4218 2482
## [5,] 9947 3315 4160 1603 5208
## [6,] 6429 9096 9692 9549 3705
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8805130 0.9012313 0.9987613 1.0197636 1.0281456
## [2,] 0.9470706 0.9524254 1.0082129 1.0824685 1.0852799
## [3,] 1.0098195 1.0194396 1.0274774 1.0345684 1.0506550
## [4,] 0.9313117 1.0629472 1.1044185 1.1116905 1.1127110
## [5,] 0.6997288 0.7358833 0.8190524 0.8623688 0.8881356
## [6,] 0.8428691 0.8765065 0.8988195 0.9274227 0.9319549
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
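As a minimal sketch, the calls mirror the Annoy examples with only the BNPARAM argument changed. The dataset names here (hnsw_data, hnsw_query) are hypothetical and kept small for speed.

```r
library(BiocNeighbors)

# Small illustrative datasets (hypothetical names and sizes).
set.seed(2000)
hnsw_data <- matrix(runif(1000 * 20), ncol = 20)
hnsw_query <- matrix(runif(100 * 20), ncol = 20)

# Same calls as for Annoy, switching only the BNPARAM argument.
hnsw_fout <- findKNN(hnsw_data, k = 10, BNPARAM = HnswParam())
hnsw_qout <- queryKNN(hnsw_data, hnsw_query, k = 5, BNPARAM = HnswParam())

# One row per point (or query point), one column per neighbor.
dim(hnsw_fout$index)
dim(hnsw_qout$index)
```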
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
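The remaining options (subset, get.distance and BPPARAM) can be sketched in the same way. This assumes the findKNN interface of this version of the package. opt_data is a hypothetical dataset, and SerialParam() from BiocParallel is only a stand-in for a genuinely parallel backend such as MulticoreParam() or SnowParam().

```r
library(BiocNeighbors)
library(BiocParallel)

# Small illustrative dataset (hypothetical).
set.seed(3000)
opt_data <- matrix(runif(1000 * 20), ncol = 20)

# Find neighbors for the first five points only.
sub_out <- findKNN(opt_data, k = 5, subset = 1:5, BNPARAM = AnnoyParam())

# Skip the distance matrix when only neighbor identities are needed.
idx_only <- findKNN(opt_data, k = 5, get.distance = FALSE, BNPARAM = AnnoyParam())

# Pass a BiocParallel backend; SerialParam() is a placeholder that runs
# on a single worker and can be swapped for a parallel backend.
par_out <- findKNN(opt_data, k = 5, BPPARAM = SerialParam(), BNPARAM = AnnoyParam())

dim(sub_out$index)
dim(idx_only$index)
```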
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by setting distance="Manhattan" in AnnoyParam() or HnswParam().
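A minimal sketch of a Manhattan search with Annoy follows, using a hypothetical dataset (man_data). As a sanity check, the distance to the first reported neighbor of point 1 is recomputed by hand as the sum of absolute coordinate differences. The two values are expected to agree only up to numerical precision, as the index may store values at lower precision.

```r
library(BiocNeighbors)

# Small illustrative dataset (hypothetical).
set.seed(4000)
man_data <- matrix(runif(1000 * 20), ncol = 20)

# Request Manhattan rather than the default Euclidean distance.
man_out <- findKNN(man_data, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))

# Recompute the distance from point 1 to its first reported neighbor.
nn <- man_out$index[1, 1]
manual <- sum(abs(man_data[1, ] - man_data[nn, ]))

# Compare the reported and manually computed values side by side.
c(reported = man_out$distance[1, 1], manual = manual)
```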
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can change the temporary directory, e.g., by setting the TMPDIR environment variable before starting R, to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmpqr7fd8/file1728332a386c7.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
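The following sketch shows one way this might look for Annoy. It assumes that the Annoy-specific builder, buildAnnoy(), accepts an fname argument giving the index path in this version of the package; check the function documentation before relying on it. The dataset (persist_data) and file name are hypothetical, and the file is placed under tempdir() only to keep the example self-contained.

```r
library(BiocNeighbors)

# Small illustrative dataset (hypothetical).
set.seed(5000)
persist_data <- matrix(runif(1000 * 20), ncol = 20)

# Hypothetical index location; a real persistent index would live
# outside of tempdir().
index_path <- file.path(tempdir(), "my_annoy_index.idx")

# Build the index at the chosen path (assumes an 'fname' argument).
persistent <- buildAnnoy(persist_data, fname = index_path)
AnnoyIndex_path(persistent)

# The pre-built index is used like any other BNINDEX.
persist_out <- findKNN(BNINDEX = persistent, k = 5)

# Cleaning up is the user's responsibility once the index is no longer needed.
unlink(index_path)
```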
sessionInfo()
## R version 4.1.0 RC (2021-05-16 r80304)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.11.0 knitr_1.33 BiocStyle_2.21.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 magrittr_2.0.1 BiocGenerics_0.39.0
## [4] BiocParallel_1.27.0 lattice_0.20-44 R6_2.5.0
## [7] rlang_0.4.11 stringr_1.4.0 tools_4.1.0
## [10] parallel_4.1.0 grid_4.1.0 xfun_0.23
## [13] jquerylib_0.1.4 htmltools_0.5.1.1 yaml_2.2.1
## [16] digest_0.6.27 bookdown_0.22 Matrix_1.3-3
## [19] BiocManager_1.30.15 S4Vectors_0.31.0 sass_0.4.0
## [22] evaluate_0.14 rmarkdown_2.8 stringi_1.6.2
## [25] compiler_4.1.0 bslib_0.2.5.1 stats4_4.1.0
## [28] jsonlite_1.7.2