BiocNeighbors 1.4.1
The BiocNeighbors package provides several algorithms for approximate neighbor searches, namely the Annoy and HNSW methods.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 6517 5034 6083 6280 3946 1321 2573 3903 8792 1391
## [2,] 15 3647 4237 4479 8097 4620 839 8823 2664 9162
## [3,] 8729 962 7956 5344 10000 1102 4762 8057 2112 2868
## [4,] 5049 4376 6067 3980 9598 3815 3879 3650 1812 6682
## [5,] 2691 8359 416 388 2793 9033 3814 233 7444 2077
## [6,] 9726 6706 1705 1265 6953 9735 1765 3452 6240 4581
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8605417 0.9011844 0.9598055 0.9906568 1.0115165 1.0232428 1.0299177
## [2,] 0.8203998 0.8889937 0.9243639 0.9317232 0.9691805 0.9804098 1.0009714
## [3,] 0.7097339 0.8187694 0.8552614 0.9166747 0.9915828 0.9933022 0.9963664
## [4,] 0.8513043 0.9897790 0.9964214 1.0439152 1.0475429 1.0496386 1.0623977
## [5,] 0.9504657 0.9642309 0.9957683 0.9980729 1.0243515 1.0412277 1.0679247
## [6,] 0.9641368 0.9917504 0.9967526 1.0047925 1.0260675 1.0399781 1.0428491
## [,8] [,9] [,10]
## [1,] 1.035740 1.042568 1.069985
## [2,] 1.010033 1.013638 1.021102
## [3,] 1.009564 1.026669 1.034802
## [4,] 1.068869 1.070686 1.088625
## [5,] 1.069522 1.075323 1.076468
## [6,] 1.056430 1.060908 1.063288
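Because the search is approximate, the reported neighbors may differ from the true nearest neighbors. One way to gauge this, shown here as a minimal sketch rather than a recommended benchmark, is to compare the Annoy results against an exact search and compute the fraction of exact neighbors that were recovered. The smaller dataset size, the seed and the variable names are illustrative assumptions; KmknnParam() is used as the exact reference method.

```r
library(BiocNeighbors)

# Illustrative settings (assumptions): a smaller dataset and a fixed seed.
set.seed(100)
nobs <- 2000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

# Approximate search with Annoy, and an exact search for reference.
approx <- findKNN(data, k=10, BNPARAM=AnnoyParam())
exact <- findKNN(data, k=10, BNPARAM=KmknnParam())

# For each point, the proportion of exact neighbors found by Annoy.
recall <- vapply(seq_len(nobs), function(i) {
    mean(exact$index[i,] %in% approx$index[i,])
}, numeric(1))

mean(recall)
```

Values of mean(recall) close to 1 indicate that the approximation loses few neighbors for this particular dataset; the result will vary with the data and the algorithm parameters.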
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3748 1319 2076 7046 8557
## [2,] 4620 750 6533 5689 1725
## [3,] 4609 8092 3628 5591 3505
## [4,] 4938 1787 7226 4689 5716
## [5,] 5349 2805 1739 3241 6318
## [6,] 3703 1693 8180 4667 3696
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.7973912 0.8206508 0.8604583 0.8850521 0.9127023
## [2,] 0.9411416 0.9566611 0.9824429 0.9873756 1.0442573
## [3,] 0.8824219 0.9730743 1.1062536 1.1107501 1.1252599
## [4,] 0.7889428 0.8185468 0.9280216 0.9744157 0.9912663
## [5,] 1.0451206 1.0672330 1.0854192 1.1135842 1.1168857
## [6,] 0.8135959 0.9004993 0.9239533 0.9281310 0.9418005
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
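As a minimal sketch, the two searches shown earlier can be repeated with HNSW by changing only the BNPARAM argument. The data are regenerated here so that the block runs on its own, and the output variable names are illustrative.

```r
library(BiocNeighbors)

# Simulated data with the same dimensions as in the Annoy examples.
nobs <- 10000
nquery <- 1000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
query <- matrix(runif(nquery*ndim), ncol=ndim)

# Same calls as before, switching only the algorithm.
fout.hnsw <- findKNN(data, k=10, BNPARAM=HnswParam())
qout.hnsw <- queryKNN(data, query, k=5, BNPARAM=HnswParam())

head(fout.hnsw$index)
head(qout.hnsw$distance)
```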
Most of the options described for the exact methods are also applicable here. For example:
- subset, to identify neighbors for a subset of points.
- get.distance, to avoid retrieving distances when unnecessary.
- BPPARAM, to parallelize the calculations across multiple workers.
- BNINDEX, to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
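The other options listed above can be sketched in the same way. The subset indices, the choice of SnowParam() as the parallel backend and the number of workers are assumptions made for illustration; any BiocParallel backend should serve the same purpose.

```r
library(BiocNeighbors)
library(BiocParallel)

data <- matrix(runif(10000*20), ncol=20)

# Only search for the neighbors of the first 10 points.
sout <- findKNN(data, k=5, subset=1:10, BNPARAM=AnnoyParam())

# Skip the distances when only the neighbor identities are needed.
iout <- findKNN(data, k=5, get.distance=FALSE, BNPARAM=AnnoyParam())

# Spread the search across two workers (the worker count is arbitrary).
pout <- findKNN(data, k=5, BPPARAM=SnowParam(2), BNPARAM=AnnoyParam())
```

Here sout$index has one row per requested point, and iout contains the neighbor indices without the accompanying distances.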
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
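A minimal sketch of a Manhattan-distance search follows, with the data regenerated so that the block is self-contained. The final lines recompute the distance to the first reported neighbor of the first point by hand; a small discrepancy is to be expected if the index stores coordinates at lower precision.

```r
library(BiocNeighbors)

data <- matrix(runif(10000*20), ncol=20)

# Request Manhattan distances through the parameter object.
mout.annoy <- findKNN(data, k=5, BNPARAM=AnnoyParam(distance="Manhattan"))
mout.hnsw <- findKNN(data, k=5, BNPARAM=HnswParam(distance="Manhattan"))

# Manual check: sum of absolute coordinate differences between
# the first point and its first reported neighbor.
j <- mout.annoy$index[1,1]
manual <- sum(abs(data[1,] - data[j,]))
c(reported=mout.annoy$distance[1,1], manual=manual)
```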
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can change the TMPDIR environment variable to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmp1dmTIx/filedc223a75b28.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
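The following sketch shows one way this might look for Annoy. It assumes that buildIndex() forwards an fname argument giving the index file path to the underlying Annoy builder; check the buildIndex documentation for the exact argument name in your version. The file name is hypothetical, and tempdir() is used only as a placeholder for a directory that would outlive the session.

```r
library(BiocNeighbors)

data <- matrix(runif(10000*20), ncol=20)

# Hypothetical path; replace tempdir() with a persistent directory.
idx.path <- file.path(tempdir(), "my_annoy_index.idx")

# Assumption: 'fname' is passed through to the Annoy index builder.
pre2 <- buildIndex(data, BNPARAM=AnnoyParam(), fname=idx.path)
AnnoyIndex_path(pre2)

# The index can be used as usual.
out3 <- findKNN(BNINDEX=pre2, k=5)

# Clean up manually once the index is no longer needed.
unlink(AnnoyIndex_path(pre2))
```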
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.4.1 knitr_1.25 BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 bookdown_0.14 lattice_0.20-38
## [4] digest_0.6.22 grid_3.6.1 stats4_3.6.1
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.1
## [10] stringi_1.4.3 S4Vectors_0.24.0 Matrix_1.2-17
## [13] rmarkdown_1.16 BiocParallel_1.20.0 tools_3.6.1
## [16] stringr_1.4.0 parallel_3.6.1 xfun_0.10
## [19] yaml_2.2.0 compiler_3.6.1 BiocGenerics_0.32.0
## [22] BiocManager_1.30.9 htmltools_0.4.0