BiocNeighbors 1.16.0
The BiocNeighbors package implements a few algorithms for exact nearest neighbor searching, including the k-means for k-nearest neighbors (KMKNN) algorithm and vantage point (VP) trees.
Both KMKNN and VP-trees involve a component of randomness during index construction, though the k-nearest neighbors result is fully deterministic.
The most obvious application is to perform a k-nearest neighbors search. We’ll mock up an example here with a hypercube of points, for which we want to identify the 10 nearest neighbors for each point.
library(BiocNeighbors)
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
The findKNN() method expects a numeric matrix as input with data points as the rows and variables/dimensions as the columns.
We indicate that we want to use the KMKNN algorithm by setting BNPARAM=KmknnParam() (which is also the default, so this is not strictly necessary here).
We could use a VP tree instead by setting BNPARAM=VptreeParam().
fout <- findKNN(data, k=10, BNPARAM=KmknnParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1230 3494 488 6272 4171 5110 9532 1514 369 4692
## [2,] 3295 6022 297 4946 776 9755 7314 7588 5932 4773
## [3,] 344 6539 3364 4461 4441 4914 4337 1863 8022 6210
## [4,] 169 8479 3685 4313 7862 2393 4728 2895 6885 3350
## [5,] 6264 1879 9180 6942 42 2045 9798 1196 8591 2441
## [6,] 3661 3006 3882 8238 5030 103 4665 8917 3213 6648
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9387179 0.9421655 0.9541736 0.9617318 0.9701266 1.0041094 1.0251458
## [2,] 0.8429318 0.9237294 0.9607030 0.9805009 0.9817203 0.9853284 0.9873492
## [3,] 1.0168365 1.0263692 1.0547203 1.0825883 1.0929322 1.0951547 1.1003726
## [4,] 0.9357928 0.9886514 0.9888640 0.9942486 1.0176580 1.0285775 1.0428388
## [5,] 1.0401678 1.0816584 1.0901354 1.1032688 1.1145581 1.1262050 1.1393630
## [6,] 0.8501148 0.8919946 0.8962771 0.9191365 0.9252817 0.9434827 0.9487024
## [,8] [,9] [,10]
## [1,] 1.0373496 1.0418113 1.0557300
## [2,] 0.9928638 0.9944717 1.0096288
## [3,] 1.1005845 1.1044661 1.1268476
## [4,] 1.0445013 1.0447898 1.0476792
## [5,] 1.1402687 1.1452201 1.1563972
## [6,] 0.9616424 0.9756474 0.9978414
Each row of the index matrix corresponds to a point in data and contains the row indices in data that are its nearest neighbors.
For example, the 3rd point in data has the following nearest neighbors:
fout$index[3,]
## [1] 344 6539 3364 4461 4441 4914 4337 1863 8022 6210
… with the following distances to those neighbors:
fout$distance[3,]
## [1] 1.016836 1.026369 1.054720 1.082588 1.092932 1.095155 1.100373 1.100585
## [9] 1.104466 1.126848
Note that the reported neighbors are sorted by distance.
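As a quick sanity check on the claims above, the following sketch runs the same search with both exact algorithms on a small mock dataset and compares the output. The object names (small, res.kmknn, res.vptree) and the dataset size are our own choices for illustration; we assume there are no tied distances, which is effectively guaranteed for uniformly distributed random points.

```r
library(BiocNeighbors)

# Small mock dataset; the size is arbitrary and only keeps the run fast.
set.seed(100)
small <- matrix(runif(500 * 5), ncol = 5)

# Run the same search with the two exact algorithms.
res.kmknn <- findKNN(small, k = 10, BNPARAM = KmknnParam())
res.vptree <- findKNN(small, k = 10, BNPARAM = VptreeParam())

# Both algorithms are exact, so the neighbor identities should match
# and the distances should agree up to numerical precision.
same.index <- identical(res.kmknn$index, res.vptree$index)
same.dist <- isTRUE(all.equal(res.kmknn$distance, res.vptree$distance))

# Within each row, distances should be non-decreasing.
sorted <- all(apply(res.kmknn$distance, 1, function(d) !is.unsorted(d)))
```

If all three flags are TRUE, the choice between KmknnParam() and VptreeParam() affects speed only, not the reported neighbors.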
Another application is to identify the k-nearest neighbors in one dataset based on query points in another dataset. Again, we mock up a small data set:
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
We then use the queryKNN() function to identify the 5 nearest neighbors in data for each point in query.
qout <- queryKNN(data, query, k=5, BNPARAM=KmknnParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 9886 4966 2482 8528 6413
## [2,] 1735 6913 4796 3843 9232
## [3,] 6437 1591 1956 2213 5661
## [4,] 34 6518 2403 6079 4398
## [5,] 3988 2545 7781 3784 1106
## [6,] 2247 2489 481 3756 5408
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9359542 0.9432785 0.9997144 1.0141157 1.0918081
## [2,] 0.7941403 0.8003494 0.8487098 0.9084944 0.9095805
## [3,] 0.8442306 0.8776378 0.8787388 0.9293020 0.9305006
## [4,] 0.9133952 0.9608883 0.9703476 0.9798192 0.9957778
## [5,] 0.9162590 0.9875159 0.9977371 1.0357519 1.0370905
## [6,] 0.9355371 0.9570418 0.9764381 0.9795691 0.9893509
Each row of the index matrix contains the row indices in data that are the nearest neighbors of a point in query.
For example, the 3rd point in query has the following nearest neighbors in data:
qout$index[3,]
## [1] 6437 1591 1956 2213 5661
… with the following distances to those neighbors:
qout$distance[3,]
## [1] 0.8442306 0.8776378 0.8787388 0.9293020 0.9305006
Again, the reported neighbors are sorted by distance.
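To make the structure of the query output concrete, here is a brute-force reference written in base R only. It is a minimal sketch of what a query search computes, not the package's implementation; the function brute_query_knn() and the objects ref and qry are hypothetical names introduced for this illustration.

```r
# Brute-force query search in base R (illustration only, not the package code).
set.seed(42)
ref <- matrix(runif(200 * 4), ncol = 4)  # points to search in
qry <- matrix(runif(10 * 4), ncol = 4)   # points to search from
k <- 5

brute_query_knn <- function(ref, qry, k) {
    index <- matrix(0L, nrow(qry), k)
    distance <- matrix(0, nrow(qry), k)
    for (i in seq_len(nrow(qry))) {
        # Euclidean distance from the i-th query point to every reference point.
        d <- sqrt(colSums((t(ref) - qry[i, ])^2))
        # Keep the k closest reference points, nearest first.
        o <- order(d)[seq_len(k)]
        index[i, ] <- o
        distance[i, ] <- d[o]
    }
    list(index = index, distance = distance)
}

bq <- brute_query_knn(ref, qry, k)
dim(bq$index)
```

The output has one row per query point and one column per neighbor, and every index refers to a row of the reference matrix. Unlike a search within a single dataset, no point needs to be excluded as its own neighbor.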
Users can perform the search for a subset of query points using the subset= argument.
This yields the same result as, but is more efficient than, performing the search for all points and subsetting the output.
findKNN(data, k=5, subset=3:5)
## $index
## [,1] [,2] [,3] [,4] [,5]
## [1,] 344 6539 3364 4461 4441
## [2,] 169 8479 3685 4313 7862
## [3,] 6264 1879 9180 6942 42
##
## $distance
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0168365 1.0263692 1.054720 1.0825883 1.092932
## [2,] 0.9357928 0.9886514 0.988864 0.9942486 1.017658
## [3,] 1.0401678 1.0816584 1.090135 1.1032688 1.114558
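The equivalence described above can be checked directly. The sketch below uses a small mock dataset (small, full and sub are names of our own choosing) and compares the subset search against the corresponding rows of the full search.

```r
library(BiocNeighbors)

set.seed(1)
small <- matrix(runif(500 * 5), ncol = 5)

# Search for all points, then for points 3 to 5 only.
full <- findKNN(small, k = 5)
sub <- findKNN(small, k = 5, subset = 3:5)

# The subset search should reproduce rows 3 to 5 of the full result.
index.match <- identical(sub$index, full$index[3:5, , drop = FALSE])
dist.match <- isTRUE(all.equal(sub$distance, full$distance[3:5, , drop = FALSE]))
```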
If only the indices are of interest, users can set get.distance=FALSE to avoid returning the matrix of distances.
This will save some time and memory.
names(findKNN(data, k=2, get.distance=FALSE))
## [1] "index"
It is also simple to speed up functions by parallelizing the calculations with the BiocParallel framework.
library(BiocParallel)
out <- findKNN(data, k=10, BPPARAM=MulticoreParam(3))
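Parallelization should change only the run time, not the result. The sketch below compares a serial run with a parallel one on a small mock dataset. We use SnowParam(2) here instead of MulticoreParam(3) purely as an assumption about portability, since forked workers are not available on Windows; any BiocParallel backend could be substituted.

```r
library(BiocNeighbors)
library(BiocParallel)

set.seed(2)
small <- matrix(runif(1000 * 5), ncol = 5)

# Same search, once serially and once across two worker processes.
serial <- findKNN(small, k = 10, BPPARAM = SerialParam())
para <- findKNN(small, k = 10, BPPARAM = SnowParam(2))

# The neighbors and distances should not depend on the backend.
same.index <- identical(serial$index, para$index)
same.dist <- isTRUE(all.equal(serial$distance, para$distance))
```

For datasets this small, the cost of starting the workers will likely outweigh any gain; the benefit is expected only for larger inputs.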
For multiple queries to a constant data, the pre-clustering can be performed in a separate step with buildIndex().
The result can then be passed to multiple calls, avoiding the overhead of repeated clustering.
pre <- buildIndex(data, BNPARAM=KmknnParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
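Reusing a prebuilt index should give the same answer as letting each call build its own. The following sketch checks this for a query search on mock data; ref, qry, direct and reused are names introduced here for illustration, and we again assume no tied distances.

```r
library(BiocNeighbors)

set.seed(3)
ref <- matrix(runif(500 * 5), ncol = 5)
qry <- matrix(runif(50 * 5), ncol = 5)

# Build the index once ...
pre <- buildIndex(ref, BNPARAM = KmknnParam())

# ... and compare a call that reuses it with a call that rebuilds it.
direct <- queryKNN(ref, qry, k = 3, BNPARAM = KmknnParam())
reused <- queryKNN(BNINDEX = pre, query = qry, k = 3)

same.index <- identical(direct$index, reused$index)
same.dist <- isTRUE(all.equal(direct$distance, reused$distance))
```

The two indices are built from different random clusterings, so agreement here also illustrates that the randomness in index construction does not carry over to the search result.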
The default setting is to search on the Euclidean distance.
Alternatively, we can use the Manhattan distance by setting distance="Manhattan" in the BiocNeighborParam object.
out.m <- findKNN(data, k=5, BNPARAM=KmknnParam(distance="Manhattan"))
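To confirm what the Manhattan setting computes, the sketch below compares the package output against a brute-force reference built with base R's dist(). The dataset and object names are hypothetical, and the brute-force approach is only practical for small inputs because it materializes the full distance matrix.

```r
library(BiocNeighbors)

set.seed(4)
small <- matrix(runif(300 * 5), ncol = 5)
k <- 5

res <- findKNN(small, k = k, BNPARAM = KmknnParam(distance = "Manhattan"))

# Brute-force reference: all pairwise Manhattan (L1) distances.
dm <- as.matrix(dist(small, method = "manhattan"))
diag(dm) <- Inf  # a point is not reported as its own neighbor

# For each point, take the k closest others, nearest first.
ref.index <- unname(t(apply(dm, 1, function(d) order(d)[seq_len(k)])))
ref.dist <- unname(t(apply(dm, 1, function(d) sort(d)[seq_len(k)])))

index.match <- all(res$index == ref.index)
dist.match <- isTRUE(all.equal(res$distance, ref.dist, check.attributes = FALSE))
```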
Advanced users may also be interested in the raw.index= argument, which returns indices that refer directly to the precomputed object rather than to data.
This may be useful inside package functions where it is more convenient to work on a common precomputed object.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.0
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.32.1 BiocNeighbors_1.16.0 knitr_1.39
## [4] BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 magrittr_2.0.3 BiocGenerics_0.44.0
## [4] lattice_0.20-45 R6_2.5.1 rlang_1.0.4
## [7] fastmap_1.1.0 stringr_1.4.0 tools_4.2.1
## [10] parallel_4.2.1 grid_4.2.1 xfun_0.31
## [13] cli_3.3.0 jquerylib_0.1.4 htmltools_0.5.2
## [16] yaml_2.3.5 digest_0.6.29 bookdown_0.27
## [19] Matrix_1.4-1 BiocManager_1.30.18 S4Vectors_0.36.0
## [22] sass_0.4.1 codetools_0.2-18 evaluate_0.15
## [25] rmarkdown_2.14 stringi_1.7.8 compiler_4.2.1
## [28] bslib_0.3.1 stats4_4.2.1 jsonlite_1.8.0