BiocNeighbors 1.8.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW methods used in the examples below.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 5975 4142 1823 9924 9784 150 451 7019 4092 998
## [2,] 5896 6885 3702 3655 6833 5695 1508 7567 2706 9389
## [3,] 8039 1997 2919 7738 101 5533 2383 9187 503 6802
## [4,] 7120 4984 3105 5911 358 533 6167 6221 6570 189
## [5,] 4324 1158 1941 3647 5559 8618 5578 9428 4256 4311
## [6,] 5181 3376 25 7025 5653 2971 936 6527 3307 9124
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8846774 0.9170939 0.9533285 1.0004638 1.0651345 1.069717 1.086231
## [2,] 1.0185841 1.0702676 1.1127722 1.1167427 1.1196262 1.135688 1.151511
## [3,] 0.9210580 0.9415088 0.9448202 0.9621202 0.9726055 1.004326 1.028555
## [4,] 0.8473746 0.8966746 1.0348847 1.0388007 1.0396987 1.050842 1.059177
## [5,] 0.9646571 0.9706110 1.0138236 1.0335057 1.0704336 1.070782 1.114955
## [6,] 0.9049893 0.9336991 0.9983528 1.0245627 1.0318879 1.043392 1.050258
## [,8] [,9] [,10]
## [1,] 1.097389 1.134364 1.143752
## [2,] 1.166504 1.187105 1.192744
## [3,] 1.034014 1.034294 1.043469
## [4,] 1.066719 1.073556 1.077692
## [5,] 1.132372 1.141067 1.142774
## [6,] 1.053191 1.058890 1.071351
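Because the search is approximate, the reported neighbors may occasionally differ from the true nearest neighbors. The following is a minimal sketch of one way to gauge this, by comparing the Annoy results against an exact search with KmknnParam() on a smaller simulated dataset. The dataset size, the seed and the per-point recall summary are illustrative choices, not part of the original analysis.

```r
library(BiocNeighbors)

set.seed(100)  # illustrative seed for the simulated data
nobs <- 2000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

# Approximate search with Annoy and exact search with KMKNN.
approx <- findKNN(data, k=10, BNPARAM=AnnoyParam())
exact <- findKNN(data, k=10, BNPARAM=KmknnParam())

# For each point, the fraction of exact neighbors recovered by Annoy.
recall <- vapply(seq_len(nobs), function(i) {
    mean(approx$index[i,] %in% exact$index[i,])
}, numeric(1))

summary(recall)
```

A mean recall close to 1 suggests that the approximation is adequate for the dataset at hand; lower values may warrant tuning the parameters of AnnoyParam().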
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1998 167 6773 8777 1199
## [2,] 4513 7074 304 4432 3717
## [3,] 9148 6861 9310 5435 2735
## [4,] 5410 4578 3419 144 2139
## [5,] 1799 9602 3889 1267 9263
## [6,] 1923 1200 5519 2025 2484
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0476940 1.0562769 1.0647473 1.0818518 1.107778
## [2,] 0.9535993 1.0420475 1.1164645 1.1409461 1.150593
## [3,] 0.9018676 0.9595568 0.9639738 0.9797339 1.055790
## [4,] 0.9961546 1.0349640 1.0856773 1.0950519 1.102567
## [5,] 1.0412649 1.0593468 1.0812370 1.1037289 1.132968
## [6,] 1.0490067 1.0828927 1.1120396 1.1184928 1.131061
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
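As a brief sketch, the same two searches can be repeated with HNSW by swapping only the BNPARAM argument. The smaller simulated matrices and the object names fout.h and qout.h are chosen here for illustration.

```r
library(BiocNeighbors)

# Smaller simulated datasets, for illustration only.
data <- matrix(runif(1000*20), ncol=20)
query <- matrix(runif(100*20), ncol=20)

# Same calls as for Annoy; only BNPARAM changes.
fout.h <- findKNN(data, k=10, BNPARAM=HnswParam())
qout.h <- queryKNN(data, query, k=5, BNPARAM=HnswParam())

dim(fout.h$index)  # one row per point in 'data', one column per neighbor
dim(qout.h$index)  # one row per point in 'query', one column per neighbor
```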
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
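The other options listed above can be sketched in the same way. The example below is a hedged illustration on a small simulated matrix; SerialParam() is used only as a placeholder so that the code runs anywhere, and a parallel backend from BiocParallel (for example MulticoreParam() or SnowParam()) would be substituted to actually distribute the work.

```r
library(BiocNeighbors)
library(BiocParallel)

data <- matrix(runif(1000*20), ncol=20)

# Only find neighbors for the first 10 points.
sub.out <- findKNN(data, k=5, subset=1:10, BNPARAM=AnnoyParam())

# Skip the distance matrix when only the indices are needed.
nodist.out <- findKNN(data, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())

# Pass a BiocParallel backend; SerialParam() is a stand-in here.
par.out <- findKNN(data, k=5, BPPARAM=SerialParam(), BNPARAM=AnnoyParam())

nrow(sub.out$index)  # one row per requested point
```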
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
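A minimal sketch of a Manhattan search with Annoy follows, using a small simulated matrix. As a sanity check, the reported distance for one neighbor pair is recomputed by hand as the sum of absolute coordinate differences; a small tolerance is assumed in the comparison because the index may store coordinates at reduced precision.

```r
library(BiocNeighbors)

data <- matrix(runif(1000*20), ncol=20)

# Search by Manhattan (L1) distance instead of the Euclidean default.
out.m <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the distance between point 1 and its first reported neighbor.
j <- out.m$index[1,1]
manual <- sum(abs(data[1,] - data[j,]))

c(reported=out.m$distance[1,1], manual=manual)
```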
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and a series of graphs, respectively - that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can change TEMPDIR to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpzsMl9q/file26ca5705e1da.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.8.2 knitr_1.30 BiocStyle_2.18.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 bookdown_0.21 lattice_0.20-41
## [4] digest_0.6.27 grid_4.0.3 stats4_4.0.3
## [7] magrittr_2.0.1 evaluate_0.14 rlang_0.4.9
## [10] stringi_1.5.3 S4Vectors_0.28.0 Matrix_1.2-18
## [13] rmarkdown_2.5 BiocParallel_1.24.1 tools_4.0.3
## [16] stringr_1.4.0 parallel_4.0.3 xfun_0.19
## [19] yaml_2.2.1 compiler_4.0.3 BiocGenerics_0.36.0
## [22] BiocManager_1.30.10 htmltools_0.5.0