BiocNeighbors 1.4.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, namely Annoy and HNSW.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 3799 707 9765 8776 7164 1241 5042 9003 3079 9487
## [2,] 3639 1775 6540 8377 4170 4105 4239 3780 623 9149
## [3,] 3193 5702 6131 3556 2278 1467 5905 7972 751 6794
## [4,] 8563 4240 4950 3937 3915 1173 637 7253 5242 8764
## [5,] 6958 9775 2348 1723 6309 6189 8332 2306 7131 9346
## [6,] 3535 7392 9955 341 1448 6327 1496 1221 5703 7
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8796675 0.9298825 0.9407840 0.9484969 0.9696676 0.9733458 1.009521
## [2,] 0.9334475 0.9498138 0.9767259 1.1378182 1.1410992 1.1478661 1.162463
## [3,] 0.8371768 0.9287036 0.9650908 1.0288686 1.0400424 1.0506155 1.061247
## [4,] 0.8504387 0.9287976 0.9349621 0.9534878 0.9718963 0.9721311 0.981039
## [5,] 0.9797012 1.0404111 1.0508379 1.0708305 1.0896506 1.1226237 1.126803
## [6,] 0.9662001 1.0002621 1.0080508 1.0088716 1.0454358 1.0702065 1.075105
## [,8] [,9] [,10]
## [1,] 1.0214065 1.029604 1.033413
## [2,] 1.1681520 1.170679 1.173934
## [3,] 1.0814985 1.088075 1.089289
## [4,] 0.9860948 0.996186 1.016510
## [5,] 1.1364032 1.139101 1.141199
## [6,] 1.0786674 1.092711 1.096879
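Because the search is approximate, the reported neighbors may differ from those of an exact search. The following is a minimal sketch of one way to gauge the agreement, assuming that the exact KMKNN algorithm (via KmknnParam()) is used as the reference. The smaller simulated dataset, the seed, and the recall variable are illustrative choices, not part of the original analysis.

```r
library(BiocNeighbors)

set.seed(100)  # arbitrary seed, only for a reproducible simulation
data <- matrix(runif(2000 * 20), ncol=20)

# Same search with an approximate and an exact algorithm.
approx <- findKNN(data, k=10, BNPARAM=AnnoyParam())
exact <- findKNN(data, k=10, BNPARAM=KmknnParam())

# For each point, the fraction of exact neighbors recovered by Annoy.
recall <- vapply(seq_len(nrow(data)), function(i) {
    mean(exact$index[i,] %in% approx$index[i,])
}, numeric(1))
summary(recall)
```

A recall close to 1 indicates that the approximation loses few true neighbors for this dataset; the acceptable level depends on the downstream application.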
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2248 8687 7970 6662 6771
## [2,] 1764 2156 9678 7172 9997
## [3,] 6334 7578 5116 6939 2555
## [4,] 2546 6219 2553 295 6261
## [5,] 1235 779 2902 7586 9467
## [6,] 8396 3550 200 7478 4880
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9522341 0.9728507 0.9741805 0.9752116 0.9843085
## [2,] 0.9348707 0.9561337 1.0239024 1.0491139 1.0654186
## [3,] 0.9712541 0.9960141 1.0080669 1.0858932 1.1007937
## [4,] 0.9419668 1.0071369 1.0388074 1.0424126 1.0685666
## [5,] 0.8917747 0.9413266 1.0215584 1.0394553 1.0646684
## [6,] 0.8049647 0.8999624 0.9166077 0.9314021 0.9382524
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
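As a short sketch of the switch to HNSW, the calls below repeat both types of search with HnswParam() in place of AnnoyParam(). The dataset sizes, the seed, and the hout and hqout names are illustrative assumptions.

```r
library(BiocNeighbors)

set.seed(200)  # arbitrary seed, only for a reproducible simulation
data <- matrix(runif(2000 * 20), ncol=20)
query <- matrix(runif(100 * 20), ncol=20)

# Neighbors of each point within the same dataset.
hout <- findKNN(data, k=10, BNPARAM=HnswParam())
dim(hout$index)

# Neighbors in 'data' for each point in 'query'.
hqout <- queryKNN(data, query, k=5, BNPARAM=HnswParam())
dim(hqout$distance)
```

Only the BNPARAM argument changes; the structure of the output is the same as for the Annoy calls.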
Most of the options described for the exact methods are also applicable here. For example:
subset, to identify neighbors for a subset of points.
get.distance, to avoid retrieving distances when unnecessary.
BPPARAM, to parallelize the calculations across multiple workers.
BNINDEX, to build the index once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
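The subset and get.distance options listed above can be combined in a single call. The following is a minimal sketch, assuming a small simulated dataset; the choice of the first ten points and the sub variable are purely illustrative.

```r
library(BiocNeighbors)

set.seed(300)  # arbitrary seed, only for a reproducible simulation
data <- matrix(runif(2000 * 20), ncol=20)

# Search only for the first 10 points and skip the distance matrix.
sub <- findKNN(data, k=5, subset=1:10, get.distance=FALSE,
    BNPARAM=AnnoyParam())

dim(sub$index)         # one row per requested point
is.null(sub$distance)  # distances were not retrieved
```

Skipping the distances is mainly useful when only the neighbor identities are needed downstream.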
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
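As a small sketch of a Manhattan search, the example below runs Annoy with distance="Manhattan" and recomputes one reported distance by hand as a sanity check. The simulated data, the seed, and the mout, nn and manual names are illustrative assumptions.

```r
library(BiocNeighbors)

set.seed(400)  # arbitrary seed, only for a reproducible simulation
data <- matrix(runif(2000 * 20), ncol=20)

mout <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the Manhattan distance between the first point and
# its first reported neighbor as the sum of absolute differences.
nn <- mout$index[1,1]
manual <- sum(abs(data[1,] - data[nn,]))
c(reported=mout$distance[1,1], manual=manual)
```

The two values should agree up to small numerical differences, since the index may store coordinates at lower precision than R.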
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
(On HPC file systems, you can change TMPDIR to a location that is more amenable to concurrent access.)
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmp0lflDQ/file7ca41f3074fe.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it then becomes the responsibility of the user to clean up any indexing files after calculations are complete.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.4.0 knitr_1.25 BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 bookdown_0.14 lattice_0.20-38
## [4] digest_0.6.22 grid_3.6.1 stats4_3.6.1
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.1
## [10] stringi_1.4.3 S4Vectors_0.24.0 Matrix_1.2-17
## [13] rmarkdown_1.16 BiocParallel_1.20.0 tools_3.6.1
## [16] stringr_1.4.0 parallel_3.6.1 xfun_0.10
## [19] yaml_2.2.0 compiler_3.6.1 BiocGenerics_0.32.0
## [22] BiocManager_1.30.9 htmltools_0.4.0