BiocNeighbors 1.4.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW algorithms demonstrated here.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 6048 9885 353 2971 8945 6637 4376 4840 7113 4829
## [2,] 4566 5533 4546 6554 3587 6773 781 6464 1773 7421
## [3,] 9301 4308 5739 571 5126 4764 9939 1072 8372 5958
## [4,] 9181 5582 2810 7584 8324 7026 6776 7403 489 2008
## [5,] 5176 4487 5412 9882 9251 3958 1164 8756 8020 7579
## [6,] 5306 271 9964 8753 7782 9840 8334 8626 638 6371
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8753423 1.0324337 1.0401479 1.0433857 1.0487950 1.0555128 1.072550
## [2,] 0.7456681 0.9882122 1.0138134 1.0305669 1.0346680 1.0355890 1.043177
## [3,] 1.0490947 1.0645473 1.0732491 1.0743933 1.0768465 1.1141785 1.150830
## [4,] 0.8340625 0.8681348 0.8935820 0.8972260 0.9046458 0.9814676 1.020612
## [5,] 0.8079657 0.8583915 0.8963977 0.9582708 0.9645010 0.9685468 1.029258
## [6,] 1.0281247 1.0578119 1.0763279 1.0804365 1.0964895 1.0981416 1.099325
## [,8] [,9] [,10]
## [1,] 1.077113 1.088926 1.090039
## [2,] 1.065462 1.070953 1.089345
## [3,] 1.155441 1.171077 1.176814
## [4,] 1.025402 1.026696 1.027654
## [5,] 1.030429 1.030792 1.042128
## [6,] 1.105819 1.107315 1.128199
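To gauge how much accuracy is traded for speed, the approximate results can be compared with those of an exact search. The following is a minimal sketch, assuming the KmknnParam() exact method from the same package and a smaller simulated dataset to keep the run time short. The `recall` name is introduced here for illustration only.

```r
library(BiocNeighbors)

set.seed(100)
nobs <- 2000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

# Approximate search with Annoy and exact search with KMKNN on the same data.
approx <- findKNN(data, k=10, BNPARAM=AnnoyParam())
exact <- findKNN(data, k=10, BNPARAM=KmknnParam())

# For each point, the fraction of its exact neighbors that the
# approximate search also reported.
recall <- vapply(seq_len(nobs), function(i) {
    mean(exact$index[i,] %in% approx$index[i,])
}, numeric(1))

# Average recall across all points (1 would mean identical neighbor sets).
mean(recall)
```

The exact value depends on the data and on the Annoy settings, so it is best treated as a diagnostic rather than a fixed property of the algorithm.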
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2588 9668 3292 428 2745
## [2,] 9731 8783 9113 217 5817
## [3,] 4088 2983 1205 9533 9397
## [4,] 2249 4930 4691 3497 789
## [5,] 6888 4084 7679 5406 6175
## [6,] 4421 8209 9718 5691 9212
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8495727 0.8568265 0.9252666 0.9386955 0.9692541
## [2,] 0.9696178 1.0059217 1.0248015 1.0293496 1.0342718
## [3,] 0.7496009 0.8444536 0.9318359 0.9610884 0.9643283
## [4,] 0.8002510 0.9452562 0.9696701 1.0036349 1.0458778
## [5,] 0.9468699 0.9619154 0.9694815 0.9710934 0.9857361
## [6,] 0.9074824 0.9646798 1.0710313 1.0750747 1.1089975
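As a sanity check on the query results, the reported distances can be recomputed directly from the data. This is a rough sketch under the assumption that the default Euclidean distance is in use. The variable names (`d`, `recomputed`, `exact.dist`) are illustrative, and a small tolerance is allowed for because Annoy works in single precision.

```r
library(BiocNeighbors)

set.seed(200)
data <- matrix(runif(5000*20), ncol=20)
query <- matrix(runif(100*20), ncol=20)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())

# Exact Euclidean distances from the first query point to every data point.
d <- sqrt(colSums((t(data) - query[1,])^2))

# Distances recomputed for the neighbors that were actually reported.
recomputed <- d[qout$index[1,]]
all.equal(recomputed, qout$distance[1,], tolerance=1e-4)

# The five smallest exact distances; the approximate distances can only
# be equal to or larger than these, position by position.
exact.dist <- sort(d)[1:5]
all(qout$distance[1,] >= exact.dist - 1e-4)
```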
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
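For completeness, the same two searches with HNSW might look as follows. This is only a sketch on a small simulated dataset, using the default HnswParam() settings. The object names `hout` and `hqout` are made up for this example.

```r
library(BiocNeighbors)

set.seed(300)
data <- matrix(runif(2000*20), ncol=20)
query <- matrix(runif(200*20), ncol=20)

# Neighbors of each point within the same dataset.
hout <- findKNN(data, k=10, BNPARAM=HnswParam())

# Neighbors in 'data' for each point in 'query'.
hqout <- queryKNN(data, query, k=5, BNPARAM=HnswParam())

dim(hout$index)
dim(hqout$index)
```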
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the index once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
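The subset and get.distance options can be combined with a pre-built index in the same way. The sketch below assumes a small simulated dataset. The names `sub` and `idx.only` are illustrative.

```r
library(BiocNeighbors)

set.seed(400)
data <- matrix(runif(2000*20), ncol=20)
pre <- buildIndex(data, BNPARAM=AnnoyParam())

# Search only for the neighbors of the first ten points.
sub <- findKNN(BNINDEX=pre, k=5, subset=1:10)
dim(sub$index)

# Skip the distances when only the neighbor identities are needed.
idx.only <- findKNN(BNINDEX=pre, k=5, get.distance=FALSE)
names(idx.only)
```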
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
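A short sketch of a Manhattan search with Annoy is given below, together with a manual recomputation of the distances for one point. The names `mout`, `nbrs` and `manual` are illustrative, and the comparison uses a tolerance because the index stores values in single precision.

```r
library(BiocNeighbors)

set.seed(500)
data <- matrix(runif(2000*20), ncol=20)
mout <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))

# Recompute the Manhattan (L1) distance from the first point to each of
# its reported neighbors: the sum of absolute coordinate differences.
nbrs <- mout$index[1,]
manual <- colSums(abs(t(data[nbrs,,drop=FALSE]) - data[1,]))

all.equal(manual, mout$distance[1,], tolerance=1e-4, check.attributes=FALSE)
```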
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can change the temporary directory (for example, by setting the TMPDIR environment variable before starting R) to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmpk2FJHs/filee6283700ca92.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
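One possible way to do this for Annoy is sketched below. It assumes that buildIndex() forwards a `fname` argument to the underlying Annoy builder, as described in the buildAnnoy documentation for this release; this should be checked against the installed version. The file name `my_annoy_index.idx` is hypothetical and is placed in tempdir() here only so that the example leaves nothing behind.

```r
library(BiocNeighbors)

set.seed(600)
data <- matrix(runif(2000*20), ncol=20)

# Hypothetical index location; a real persistent index would live
# outside tempdir(). 'fname' is assumed to be passed on by buildIndex().
index.path <- file.path(tempdir(), "my_annoy_index.idx")
pre <- buildIndex(data, BNPARAM=AnnoyParam(), fname=index.path)

# The index object records where its file was written.
AnnoyIndex_path(pre)
existed <- file.exists(index.path)

out <- findKNN(BNINDEX=pre, k=5)

# The file is no longer managed by the session, so remove it explicitly
# once it is not needed anymore.
unlink(index.path)
```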
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.4.2 knitr_1.28 BiocStyle_2.14.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 bookdown_0.17 lattice_0.20-40
## [4] digest_0.6.25 grid_3.6.2 stats4_3.6.2
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.4
## [10] stringi_1.4.6 S4Vectors_0.24.3 Matrix_1.2-18
## [13] rmarkdown_2.1 BiocParallel_1.20.1 tools_3.6.2
## [16] stringr_1.4.0 parallel_3.6.2 xfun_0.12
## [19] yaml_2.2.1 compiler_3.6.2 BiocGenerics_0.32.0
## [22] BiocManager_1.30.10 htmltools_0.4.0