BiocNeighbors 1.2.0

The *BiocNeighbors* package implements a few algorithms for exact nearest neighbor searching:

- The k-means for k-nearest neighbors (KMKNN) algorithm (Wang 2012) uses k-means clustering to create an index. Within each cluster, the distance of each of that cluster’s points to the cluster center are computed and used to sort all points. Given a query point, the distance to each cluster center is determined and the triangle inequality is applied to determine which points in each cluster warrant a full distance calculation.
- The vantage point (VP) tree algorithm (Yianilos 1993) involves constructing a tree where each node is located at a data point and is associated with a subset of neighboring points. Each node progressively partitions points into two subsets that are either closer or further to the node than a given threshold. Given a query point, the triangle inequality is applied at each node in the tree to determine if the child nodes warrant searching.

Both methods involve a component of randomness during index construction, though the k-nearest neighbors result is fully deterministic1 Except in the presence of ties, see `?findKNN`

for details..

The most obvious application is to perform a k-nearest neighbors search. We’ll mock up an example here with a hypercube of points, for which we want to identify the 10 nearest neighbors for each point.

```
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
```

The `findKNN()`

method expects a numeric matrix as input with data points as the rows and variables/dimensions as the columns.
We indicate that we want to use the KMKNN algorithm by setting `BNPARAM=KmknnParam()`

(which is also the default, so this is not strictly necessary here).
We could use a VP tree instead by setting `BNPARAM=VptreeParam()`

.

```
fout <- findKNN(data, k=10, BNPARAM=KmknnParam())
head(fout$index)
```

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 6709 9627 8683 8907 4811 8728 9935 5093 1868 121
## [2,] 7593 9203 6652 961 2856 8711 4412 8907 4666 2811
## [3,] 8361 2534 7096 1443 8355 5135 730 9233 5057 3333
## [4,] 6116 9935 8994 4613 7533 7671 4938 587 9627 6673
## [5,] 4363 7309 3126 6247 4800 2470 1907 285 7956 3990
## [6,] 8871 5867 7618 1557 1016 7067 473 187 271 1517
```

`head(fout$distance)`

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9134994 0.9279878 0.9321199 0.9432648 0.9759880 0.9883781 0.9904984
## [2,] 0.8612897 0.9499456 0.9703708 0.9782614 1.0085772 1.0099281 1.0102099
## [3,] 1.0378627 1.0478542 1.0709121 1.0858699 1.0869769 1.0923892 1.0965950
## [4,] 0.9568614 1.0717803 1.0742441 1.0990840 1.1128575 1.1139495 1.1360398
## [5,] 0.8285719 0.8375260 0.8930268 0.9053241 0.9169200 0.9226835 0.9358931
## [6,] 0.7955337 0.9129179 0.9275191 0.9435631 0.9749024 0.9931279 0.9980455
## [,8] [,9] [,10]
## [1,] 0.9966502 1.0032015 1.0052059
## [2,] 1.0129669 1.0177056 1.0362768
## [3,] 1.1051209 1.1088504 1.1158236
## [4,] 1.1429454 1.1429834 1.1538995
## [5,] 0.9490294 0.9546187 0.9621188
## [6,] 0.9994188 1.0027133 1.0101063
```

Each row of the `index`

matrix corresponds to a point in `data`

and contains the row indices in `data`

that are its nearest neighbors.
For example, the 3rd point in `data`

has the following nearest neighbors:

`fout$index[3,]`

`## [1] 8361 2534 7096 1443 8355 5135 730 9233 5057 3333`

… with the following distances to those neighbors:

`fout$distance[3,]`

```
## [1] 1.037863 1.047854 1.070912 1.085870 1.086977 1.092389 1.096595 1.105121
## [9] 1.108850 1.115824
```

Note that the reported neighbors are sorted by distance.

Another application is to identify the k-nearest neighbors in one dataset based on query points in another dataset. Again, we mock up a small data set:

```
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
```

We then use the `queryKNN()`

function to identify the 5 nearest neighbors in `data`

for each point in `query`

.

```
qout <- queryKNN(data, query, k=5, BNPARAM=KmknnParam())
head(qout$index)
```

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4271 6336 1896 66 5255
## [2,] 577 2631 4942 8018 9216
## [3,] 1282 7391 6915 9507 6113
## [4,] 8115 5553 2903 1729 1585
## [5,] 9979 5919 7712 1312 6032
## [6,] 7124 1550 5867 4266 1743
```

`head(qout$distance)`

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8255378 0.9534278 0.9662206 0.9895064 0.9973368
## [2,] 0.9082208 0.9134062 0.9218634 0.9518809 0.9672650
## [3,] 0.9416898 0.9466910 0.9773505 0.9956400 0.9989399
## [4,] 1.0114986 1.0219494 1.0297100 1.0405589 1.0559577
## [5,] 0.8845057 0.8972909 0.8973070 0.9025898 0.9055773
## [6,] 1.0793351 1.0811115 1.1228046 1.1345023 1.1360885
```

Each row of the `index`

matrix contains the row indices in `data`

that are the nearest neighbors of a point in `query`

.
For example, the 3rd point in `query`

has the following nearest neighbors in `data`

:

`qout$index[3,]`

`## [1] 1282 7391 6915 9507 6113`

… with the following distances to those neighbors:

`qout$distance[3,]`

`## [1] 0.9416898 0.9466910 0.9773505 0.9956400 0.9989399`

Again, the reported neighbors are sorted by distance.

Users can perform the search for a subset of query points using the `subset=`

argument.
This yields the same result as but is more efficient than performing the search for all points and subsetting the output.

`findKNN(data, k=5, subset=3:5)`

```
## $index
## [,1] [,2] [,3] [,4] [,5]
## [1,] 8361 2534 7096 1443 8355
## [2,] 6116 9935 8994 4613 7533
## [3,] 4363 7309 3126 6247 4800
##
## $distance
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0378627 1.047854 1.0709121 1.0858699 1.086977
## [2,] 0.9568614 1.071780 1.0742441 1.0990840 1.112857
## [3,] 0.8285719 0.837526 0.8930268 0.9053241 0.916920
```

If only the indices are of interest, users can set `get.distance=FALSE`

to avoid returning the matrix of distances.
This will save some time and memory.

`names(findKNN(data, k=2, get.distance=FALSE))`

`## [1] "index"`

It is also simple to speed up functions by parallelizing the calculations with the *BiocParallel* framework.

```
library(BiocParallel)
out <- findKNN(data, k=10, BPPARAM=MulticoreParam(3))
```

For multiple queries to a constant `data`

, the pre-clustering can be performed in a separate step with `buildIndex()`

.
The result can then be passed to multiple calls, avoiding the overhead of repeated clustering2 The algorithm type is automatically determined when `BNINDEX`

is specified, so there is no need to also specify `BNPARAM`

in the later functions..

```
pre <- buildIndex(data, BNPARAM=KmknnParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
```

Advanced users may also be interested in the `raw.index=`

argument, which returns indices directly to the precomputed object rather than to `data`

.
This may be useful inside package functions where it may be more convenient to work on a common precomputed object.

`sessionInfo()`

```
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.18.0 BiocNeighbors_1.2.0 knitr_1.22
## [4] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 bookdown_0.9 digest_0.6.18
## [4] stats4_3.6.0 magrittr_1.5 evaluate_0.13
## [7] stringi_1.4.3 S4Vectors_0.22.0 rmarkdown_1.12
## [10] tools_3.6.0 stringr_1.4.0 parallel_3.6.0
## [13] xfun_0.6 yaml_2.2.0 compiler_3.6.0
## [16] BiocGenerics_0.30.0 BiocManager_1.30.4 htmltools_0.3.6
```

Wang, X. 2012. “A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality.” *Proc Int Jt Conf Neural Netw* 43 (6):2351–8.

Yianilos, P. N. 1993. “Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces.” In *SODA*, 93:311–21. 194.