BiocNeighbors 1.11.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 5796 3655 2204 6896 6859 1061 2434 8725 8936 1683
## [2,] 6700 9413 4673 3875 3676 4696 6695 3737 4271 6786
## [3,] 5330 3477 3599 3593 1861 7147 7012 397 8981 418
## [4,] 9276 7180 972 5425 8305 7834 2571 7249 1024 3904
## [5,] 5955 6425 4511 9799 2959 5248 1182 5839 16 676
## [6,] 9528 3973 8353 4668 8451 6445 698 4227 6529 2973
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9693387 1.0451729 1.0512965 1.0533886 1.0644118 1.0943296 1.1098098
## [2,] 1.0216324 1.0768965 1.0953947 1.1275071 1.1336306 1.1414694 1.1437887
## [3,] 1.0078336 1.0252762 1.0308529 1.0728158 1.0785601 1.1061784 1.1146740
## [4,] 0.9039552 0.9212173 0.9265746 0.9397625 0.9415513 0.9431368 0.9756497
## [5,] 0.8823097 1.0141798 1.1352360 1.2069786 1.2085961 1.2092415 1.2314954
## [6,] 0.8681335 0.8928884 0.9188645 0.9968033 1.0448685 1.0526092 1.0531902
## [,8] [,9] [,10]
## [1,] 1.1172045 1.1181917 1.1217240
## [2,] 1.1499655 1.1669014 1.1708187
## [3,] 1.1338359 1.1385583 1.1503313
## [4,] 0.9803168 0.9859107 0.9921471
## [5,] 1.2716721 1.2749056 1.2804711
## [6,] 1.0730172 1.0758401 1.0762510
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8768 1175 8490 4666 7426
## [2,] 8451 882 5424 7831 6
## [3,] 3673 9487 7647 8866 3218
## [4,] 7095 4962 5165 9873 9791
## [5,] 7222 3465 2412 8249 2967
## [6,] 9829 3607 6097 8243 6771
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8125736 0.8508520 0.9123031 0.9305382 0.9497266
## [2,] 0.9453325 0.9635710 1.0339614 1.0465548 1.1328304
## [3,] 0.7222402 0.8206985 0.8820139 0.8918611 0.9317644
## [4,] 0.7508256 0.9995013 1.0059122 1.0091230 1.0260259
## [5,] 0.9141772 0.9601210 0.9748088 0.9748784 0.9952974
## [6,] 0.9300561 1.0575197 1.1367359 1.1402951 1.1483943
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "D:\\biocbuild\\bbs-3.14-bioc\\tmpdir\\RtmpiIrcPy\\file35087f733267.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.11.0 knitr_1.33 BiocStyle_2.21.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 magrittr_2.0.1 BiocGenerics_0.39.0
## [4] BiocParallel_1.27.0 lattice_0.20-44 R6_2.5.0
## [7] rlang_0.4.11 stringr_1.4.0 tools_4.1.0
## [10] parallel_4.1.0 grid_4.1.0 xfun_0.23
## [13] jquerylib_0.1.4 htmltools_0.5.1.1 yaml_2.2.1
## [16] digest_0.6.27 bookdown_0.22 Matrix_1.3-3
## [19] BiocManager_1.30.15 S4Vectors_0.31.0 sass_0.4.0
## [22] evaluate_0.14 rmarkdown_2.8 stringi_1.6.2
## [25] compiler_4.1.0 bslib_0.2.5.1 stats4_4.1.0
## [28] jsonlite_1.7.2