BiocNeighbors 1.6.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches, including the Annoy and HNSW methods demonstrated in this document.
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM argument in findKNN and queryKNN.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam().
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 8812 5180 4883 1506 8666 5510 6842 8685 149 3181
## [2,] 2562 7380 589 1891 3704 9129 5011 6193 1199 6164
## [3,] 4823 8986 3297 9889 3008 2677 4492 1892 7720 5476
## [4,] 6291 964 8595 364 9728 4878 1829 3752 5644 9400
## [5,] 7526 9746 1407 2504 7044 3358 5845 7017 6339 189
## [6,] 3184 7707 1589 7921 7340 500 9188 7774 3972 3173
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.8768666 0.8888551 0.9227604 0.9561289 0.9579893 0.9615831 0.9795825
## [2,] 0.9847859 1.0386254 1.0623527 1.0730896 1.0869210 1.0911182 1.0948753
## [3,] 1.0189785 1.0751929 1.0790136 1.0828307 1.1209067 1.1281173 1.1373672
## [4,] 0.9766369 0.9774971 0.9884105 1.0181866 1.0184928 1.0620124 1.0708467
## [5,] 0.8348617 0.9138915 0.9650772 0.9767015 0.9847749 0.9981797 1.0188067
## [6,] 0.7504042 0.8388906 0.9507595 0.9864888 1.0168992 1.0173223 1.0325357
## [,8] [,9] [,10]
## [1,] 0.9894164 1.011822 1.025235
## [2,] 1.1003511 1.101923 1.122902
## [3,] 1.1424726 1.150046 1.157533
## [4,] 1.0747107 1.074921 1.091504
## [5,] 1.0306041 1.048890 1.059096
## [6,] 1.0354717 1.048204 1.052870
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8481 6168 2309 3262 8705
## [2,] 7233 2576 9122 9566 727
## [3,] 8648 9538 7687 6143 1775
## [4,] 4233 2967 4364 3682 4646
## [5,] 75 4766 4000 9447 1811
## [6,] 4378 6664 9181 8924 4000
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9407953 0.9868281 0.9962595 1.0039060 1.0261729
## [2,] 0.8328750 0.9287263 0.9302588 0.9559189 0.9755356
## [3,] 1.0644226 1.0754666 1.1123490 1.1208977 1.1230382
## [4,] 1.0067418 1.0109611 1.0242090 1.0313118 1.0379146
## [5,] 0.8990247 0.9934020 0.9957891 1.0538554 1.0577562
## [6,] 0.9122109 0.9538555 0.9600627 0.9918503 1.0299059
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam().
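Because both methods are approximate, it can be useful to check how often they recover the true neighbors. The following is a minimal sketch, assuming the exact KmknnParam() search from the exact-methods documentation is an acceptable reference. The recall() helper, the seed, and the smaller dataset are illustrative choices and not part of the package.

```r
library(BiocNeighbors)

set.seed(100)  # arbitrary seed, only for reproducibility of this sketch
data <- matrix(runif(2000 * 20), ncol = 20)
k <- 10

# Same call each time; only BNPARAM changes.
exact <- findKNN(data, k = k, BNPARAM = KmknnParam())
annoy <- findKNN(data, k = k, BNPARAM = AnnoyParam())
hnsw  <- findKNN(data, k = k, BNPARAM = HnswParam())

# Hypothetical helper: average fraction of the exact neighbors that
# are also reported by the approximate search.
recall <- function(approx, truth) {
    hits <- vapply(seq_len(nrow(truth)), function(i) {
        length(intersect(approx[i, ], truth[i, ]))
    }, integer(1))
    mean(hits) / ncol(truth)
}

recall(annoy$index, exact$index)
recall(hnsw$index, exact$index)
```

Values close to 1 indicate that the approximate search returns nearly the same neighbor sets as the exact one for this dataset.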
Most of the options described for the exact methods are also applicable here. For example:
- subset to identify neighbors for a subset of points.
- get.distance to avoid retrieving distances when unnecessary.
- BPPARAM to parallelize the calculations across multiple workers.
- BNINDEX to build the forest once for a given data set and re-use it across calls.
The use of a pre-built BNINDEX is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
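The subset and get.distance options can be sketched in the same way. This is a minimal example that assumes these arguments behave as described for the exact methods. The dataset size and seed are arbitrary.

```r
library(BiocNeighbors)

set.seed(200)  # arbitrary seed for this sketch
data <- matrix(runif(1000 * 20), ncol = 20)

# Search only for the neighbors of the first 10 points, and skip the
# distance matrix because only the neighbor identities are needed.
sub <- findKNN(data, k = 5, subset = 1:10, get.distance = FALSE,
    BNPARAM = AnnoyParam())

dim(sub$index)  # one row per point in 'subset', one column per neighbor
names(sub)      # inspect which components were returned
```

Restricting the search to a subset still uses the full dataset as the pool of candidate neighbors, so the reported indices refer to rows of data.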
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan" in AnnoyParam() or HnswParam().
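As a minimal sketch of the Manhattan option, the example below runs an Annoy search with distance="Manhattan" and recomputes the distances for one point in base R. The comparison tolerance is an assumption, chosen loosely because the index may store values at reduced precision.

```r
library(BiocNeighbors)

set.seed(300)  # arbitrary seed for this sketch
data <- matrix(runif(1000 * 20), ncol = 20)

man <- findKNN(data, k = 5, BNPARAM = AnnoyParam(distance = "Manhattan"))

# Recompute the Manhattan (L1) distance from the first point to each
# of its reported neighbors: sum of absolute coordinate differences.
nn <- man$index[1, ]
by.hand <- rowSums(abs(sweep(data[nn, , drop = FALSE], 2, data[1, ])))

# Compare against the distances reported by the search.
all.equal(by.hand, man$distance[1, ], tolerance = 1e-4)
```

The same check applies to HnswParam(distance="Manhattan") by swapping the BNPARAM argument.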
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures (a forest of trees and a series of graphs, respectively) that are saved to file when calling buildIndex().
By default, this file is located in tempdir() and will be removed when the session finishes.
On HPC file systems, you can set the TMPDIR environment variable to a location that is more amenable to concurrent access.
AnnoyIndex_path(pre)
## [1] "/tmp/Rtmpw5nYAA/file690e1ac0bc41.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex.
The saved file can then be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex(), HnswIndex().
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
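A minimal sketch of this workflow is shown below for the API of this release. It assumes that the file path is supplied through an argument named fname, which is not stated in the text above and should be checked against the buildIndex documentation. The path itself is hypothetical, and it sits in tempdir() only so that the sketch leaves nothing behind; a real use would point to a persistent directory.

```r
library(BiocNeighbors)

set.seed(400)  # arbitrary seed for this sketch
data <- matrix(runif(1000 * 20), ncol = 20)

# Hypothetical location for the index file.
idx.path <- file.path(tempdir(), "my-annoy-index.idx")

# ASSUMPTION: 'fname' is the argument that sets the index file path.
pre <- buildIndex(data, BNPARAM = AnnoyParam(), fname = idx.path)
AnnoyIndex_path(pre)   # reports where the index file was written
file.exists(idx.path)

out <- findKNN(BNINDEX = pre, k = 5)

# The file is no longer managed for us, so remove it once we are done.
unlink(idx.path)
```

After unlink() is called, the index object should no longer be used for searches because its backing file is gone.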
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.6.0 knitr_1.28 BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 bookdown_0.18 lattice_0.20-41
## [4] digest_0.6.25 grid_4.0.0 stats4_4.0.0
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.5
## [10] stringi_1.4.6 S4Vectors_0.26.0 Matrix_1.2-18
## [13] rmarkdown_2.1 BiocParallel_1.22.0 tools_4.0.0
## [16] stringr_1.4.0 parallel_4.0.0 xfun_0.13
## [19] yaml_2.2.1 compiler_4.0.0 BiocGenerics_0.34.0
## [22] BiocManager_1.30.10 htmltools_0.4.0