BiocNeighbors 1.11.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW methods demonstrated in this document.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 7587 5930 685 1429 9488 1430 2849 1528 5649 7248
## [2,] 4558 7243 4556 276 665 1881 6514 476 3055 3175
## [3,] 227 2985 386 1935 2110 4928 8247 1115 7831 4823
## [4,] 1144 8304 9278 3903 5968 7503 6781 2088 3321 7281
## [5,] 7613 8066 6474 2445 1979 6696 1298 5726 8959 9675
## [6,] 1017 1164 7118 617 3339 1039 4085 1078 2039 9628
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8299260 0.9151373 0.9236432 0.9262968 0.9476347 0.9749637 1.0106630
## [2,] 0.8828778 0.8972326 0.8974372 0.9064905 0.9451266 0.9452245 0.9545125
## [3,] 0.9951708 1.0747873 1.0878440 1.1131284 1.1210618 1.1348094 1.1358058
## [4,] 0.8396846 0.8567523 0.9629714 1.0144162 1.0147822 1.0157418 1.0186867
## [5,] 0.8551103 0.9662976 0.9822791 0.9958056 1.0296835 1.0762092 1.0847216
## [6,] 0.8324599 0.8752241 0.9100977 0.9402945 0.9446812 0.9519247 0.9567233
## [,8] [,9] [,10]
## [1,] 1.0151135 1.0318009 1.0377684
## [2,] 0.9666264 0.9762674 0.9926321
## [3,] 1.1446716 1.1498840 1.1549600
## [4,] 1.0310694 1.0472790 1.0554855
## [5,] 1.1214069 1.1265185 1.1303506
## [6,] 0.9777825 0.9783190 1.0098292
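Because the search is approximate, it can be useful to check how often the reported neighbors agree with those from an exact search. The following is a minimal sketch rather than a benchmark: it assumes that KmknnParam() (one of the exact algorithms in this package) serves as the reference, and the names exact, approx and recall are introduced here purely for illustration.

```r
library(BiocNeighbors)

set.seed(100)
small <- matrix(runif(2000 * 20), ncol = 20)

# Exact reference search and approximate Annoy search on the same data.
exact  <- findKNN(small, k = 10, BNPARAM = KmknnParam())
approx <- findKNN(small, k = 10, BNPARAM = AnnoyParam())

# For each point, the fraction of its exact neighbors that Annoy also reported.
recall <- vapply(seq_len(nrow(small)), function(i) {
    mean(exact$index[i, ] %in% approx$index[i, ])
}, numeric(1))

summary(recall)
```

The only difference between the two calls is the BNPARAM argument, which is what makes this kind of comparison cheap to set up.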
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 371 8435 6367 6788 6273
## [2,] 2689 7100 7084 4642 3184
## [3,] 196 3823 5149 2611 1814
## [4,] 7264 9471 5269 191 5853
## [5,] 7772 181 7882 4108 3052
## [6,] 3320 3100 9980 4704 382
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8750058 0.8844784 0.9592286 0.9758261 1.0101227
## [2,] 0.9639787 0.9783370 0.9798650 0.9832655 0.9994668
## [3,] 0.9974281 1.1182429 1.1473372 1.1493706 1.1566035
## [4,] 0.9676893 1.0117347 1.0508809 1.0568327 1.0929832
## [5,] 0.7819328 0.9336281 0.9580468 1.0115699 1.0165854
## [6,] 0.6443267 0.8989388 0.9033276 0.9315614 0.9652831
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
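As a short sketch of the HNSW equivalent, the calls below mirror the Annoy examples with only BNPARAM changed. The smaller matrices and the names hdata, hquery, hout and hqout are chosen here for illustration.

```r
library(BiocNeighbors)

set.seed(1)
hdata  <- matrix(runif(1000 * 20), ncol = 20)
hquery <- matrix(runif(100 * 20), ncol = 20)

# Neighbors within the same dataset.
hout <- findKNN(hdata, k = 10, BNPARAM = HnswParam())

# Neighbors in 'hdata' for each point in 'hquery'.
hqout <- queryKNN(hdata, hquery, k = 5, BNPARAM = HnswParam())

dim(hout$index)
dim(hqout$index)
```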
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
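The other options listed above can be sketched in the same way. This example assumes the interface described in this document (subset, get.distance and BPPARAM arguments to findKNN) and uses SnowParam() from BiocParallel with two workers as an arbitrary choice of backend. The variable names are illustrative only.

```r
library(BiocNeighbors)
library(BiocParallel)

set.seed(2)
odata <- matrix(runif(1000 * 20), ncol = 20)

# Search only for the neighbors of the first 10 points.
sub <- findKNN(odata, k = 5, subset = 1:10, BNPARAM = AnnoyParam())

# Skip the distances when only neighbor identities are needed.
nodist <- findKNN(odata, k = 5, get.distance = FALSE, BNPARAM = AnnoyParam())

# Spread the search across two workers.
par <- findKNN(odata, k = 5, BNPARAM = AnnoyParam(),
    BPPARAM = SnowParam(workers = 2))

dim(sub$index)
```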
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
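A minimal sketch of a Manhattan search with Annoy follows. As a sanity check, it recomputes the distance to one reported neighbor by hand. The two values should agree up to the reduced floating-point precision of the index, so they are compared with a tolerance rather than exactly. The names mdata, mout, first and manual are illustrative.

```r
library(BiocNeighbors)

set.seed(3)
mdata <- matrix(runif(1000 * 20), ncol = 20)

mout <- findKNN(mdata, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))

# Manhattan (L1) distance from point 1 to its first reported neighbor.
first  <- mout$index[1, 1]
manual <- sum(abs(mdata[1, ] - mdata[first, ]))

all.equal(manual, mout$distance[1, 1], tolerance = 1e-4)
```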
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and a series of graphs, respectively - that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can change TEMPDIR to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpNLW6NZ/file9e415578d2c8d.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
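The sketch below shows one way this might look for Annoy. It assumes that buildIndex() forwards an fname argument to the underlying Annoy builder, as in the 1.x series of this package; please check ?buildAnnoy for the installed version before relying on it. The file name annoy-demo.idx is hypothetical, and tempdir() is used only to keep the example self-contained; a persistent location would be chosen in practice.

```r
library(BiocNeighbors)

set.seed(4)
pdata <- matrix(runif(1000 * 20), ncol = 20)

# Hypothetical index file name; 'fname' is assumed to be passed to the Annoy builder.
idx_file <- file.path(tempdir(), "annoy-demo.idx")
pre2 <- buildIndex(pdata, BNPARAM = AnnoyParam(), fname = idx_file)

# Confirm where the index was written, then use it for a search.
idx_path <- AnnoyIndex_path(pre2)
existed  <- file.exists(idx_path)
pout <- findKNN(BNINDEX = pre2, k = 5)

# Manual clean-up once the index is no longer needed.
unlink(idx_path)
```

The explicit unlink() call reflects the point above: once the path is chosen by the user, the file is not tidied away automatically.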
sessionInfo()
## R version 4.1.0 beta (2021-05-03 r80259)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.11.0 knitr_1.33 BiocStyle_2.21.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 magrittr_2.0.1 BiocGenerics_0.39.0
## [4] BiocParallel_1.27.0 lattice_0.20-44 R6_2.5.0
## [7] rlang_0.4.11 stringr_1.4.0 tools_4.1.0
## [10] parallel_4.1.0 grid_4.1.0 xfun_0.23
## [13] jquerylib_0.1.4 htmltools_0.5.1.1 yaml_2.2.1
## [16] digest_0.6.27 bookdown_0.22 Matrix_1.3-3
## [19] BiocManager_1.30.15 S4Vectors_0.31.0 sass_0.4.0
## [22] evaluate_0.14 rmarkdown_2.8 stringi_1.6.2
## [25] compiler_4.1.0 bslib_0.2.5.1 stats4_4.1.0
## [28] jsonlite_1.7.2