BiocNeighbors 1.4.2

The *BiocNeighbors* package implements a few algorithms for exact nearest neighbor searching:

- The k-means for k-nearest neighbors (KMKNN) algorithm (Wang 2012) uses k-means clustering to create an index. Within each cluster, the distance of each of that cluster’s points to the cluster center are computed and used to sort all points. Given a query point, the distance to each cluster center is determined and the triangle inequality is applied to determine which points in each cluster warrant a full distance calculation.
- The vantage point (VP) tree algorithm (Yianilos 1993) involves constructing a tree where each node is located at a data point and is associated with a subset of neighboring points. Each node progressively partitions points into two subsets that are either closer or further to the node than a given threshold. Given a query point, the triangle inequality is applied at each node in the tree to determine if the child nodes warrant searching.

Both methods involve a component of randomness during index construction, though the k-nearest neighbors result is fully deterministic1 Except in the presence of ties, see `?findKNN`

for details..

The most obvious application is to perform a k-nearest neighbors search. We’ll mock up an example here with a hypercube of points, for which we want to identify the 10 nearest neighbors for each point.

```
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
```

The `findKNN()`

method expects a numeric matrix as input with data points as the rows and variables/dimensions as the columns.
We indicate that we want to use the KMKNN algorithm by setting `BNPARAM=KmknnParam()`

(which is also the default, so this is not strictly necessary here).
We could use a VP tree instead by setting `BNPARAM=VptreeParam()`

.

```
fout <- findKNN(data, k=10, BNPARAM=KmknnParam())
head(fout$index)
```

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 7264 7089 3730 6925 9917 4061 8288 9758 3671 8560
## [2,] 3778 8629 4347 2562 1139 6380 2350 7012 5873 3619
## [3,] 1416 5978 2811 8398 8676 8782 7806 3172 1173 8261
## [4,] 1727 9839 455 264 1541 9373 4863 847 4999 4607
## [5,] 9620 5660 6691 7 5325 1974 5933 4445 2097 9634
## [6,] 1019 2272 2262 474 3935 3887 3075 1652 9177 8812
```

`head(fout$distance)`

```
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9484602 0.9708714 0.9784185 0.9964415 1.0065982 1.0068451 1.0077159
## [2,] 0.8495990 0.8577315 0.9097091 0.9244535 0.9306995 0.9640296 0.9911031
## [3,] 0.7450924 0.8531180 0.8688548 0.8920542 0.9042700 0.9671633 0.9711515
## [4,] 0.9130970 1.0636489 1.1358683 1.1912135 1.2303144 1.2483183 1.2519752
## [5,] 0.6625614 0.8574672 0.9252329 0.9362569 0.9610968 0.9625113 0.9891207
## [6,] 0.8970165 0.9215320 1.0082941 1.0657953 1.0705579 1.0721446 1.0896080
## [,8] [,9] [,10]
## [1,] 1.0156997 1.0300333 1.0393344
## [2,] 0.9950853 0.9998471 1.0247501
## [3,] 0.9752463 0.9802696 0.9889851
## [4,] 1.2561708 1.2582664 1.2590226
## [5,] 0.9974890 1.0234002 1.0249384
## [6,] 1.0968779 1.0970466 1.1116739
```

Each row of the `index`

matrix corresponds to a point in `data`

and contains the row indices in `data`

that are its nearest neighbors.
For example, the 3rd point in `data`

has the following nearest neighbors:

`fout$index[3,]`

`## [1] 1416 5978 2811 8398 8676 8782 7806 3172 1173 8261`

… with the following distances to those neighbors:

`fout$distance[3,]`

```
## [1] 0.7450924 0.8531180 0.8688548 0.8920542 0.9042700 0.9671633 0.9711515
## [8] 0.9752463 0.9802696 0.9889851
```

Note that the reported neighbors are sorted by distance.

Another application is to identify the k-nearest neighbors in one dataset based on query points in another dataset. Again, we mock up a small data set:

```
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
```

We then use the `queryKNN()`

function to identify the 5 nearest neighbors in `data`

for each point in `query`

.

```
qout <- queryKNN(data, query, k=5, BNPARAM=KmknnParam())
head(qout$index)
```

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3854 3859 8296 657 5535
## [2,] 8870 4109 7197 1144 1582
## [3,] 6488 1362 5774 7573 5826
## [4,] 5410 1363 9927 5043 8746
## [5,] 8401 3729 3703 8537 5494
## [6,] 4668 8457 1802 7084 6467
```

`head(qout$distance)`

```
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8978305 1.0143729 1.0159165 1.0613681 1.061982
## [2,] 0.8807007 1.0118383 1.0348805 1.0440376 1.082465
## [3,] 0.9602804 1.0178952 1.0402025 1.0437240 1.050159
## [4,] 0.8427553 0.8878139 0.8914375 0.8998593 1.034443
## [5,] 0.9688056 1.0379531 1.0843502 1.1034637 1.117487
## [6,] 0.9426396 0.9894729 0.9914380 1.0008863 1.004094
```

Each row of the `index`

matrix contains the row indices in `data`

that are the nearest neighbors of a point in `query`

.
For example, the 3rd point in `query`

has the following nearest neighbors in `data`

:

`qout$index[3,]`

`## [1] 6488 1362 5774 7573 5826`

… with the following distances to those neighbors:

`qout$distance[3,]`

`## [1] 0.9602804 1.0178952 1.0402025 1.0437240 1.0501585`

Again, the reported neighbors are sorted by distance.

Users can perform the search for a subset of query points using the `subset=`

argument.
This yields the same result as but is more efficient than performing the search for all points and subsetting the output.

`findKNN(data, k=5, subset=3:5)`

```
## $index
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1416 5978 2811 8398 8676
## [2,] 1727 9839 455 264 1541
## [3,] 9620 5660 6691 7 5325
##
## $distance
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.7450924 0.8531180 0.8688548 0.8920542 0.9042700
## [2,] 0.9130970 1.0636489 1.1358683 1.1912135 1.2303144
## [3,] 0.6625614 0.8574672 0.9252329 0.9362569 0.9610968
```

If only the indices are of interest, users can set `get.distance=FALSE`

to avoid returning the matrix of distances.
This will save some time and memory.

`names(findKNN(data, k=2, get.distance=FALSE))`

`## [1] "index"`

It is also simple to speed up functions by parallelizing the calculations with the *BiocParallel* framework.

```
library(BiocParallel)
out <- findKNN(data, k=10, BPPARAM=MulticoreParam(3))
```

For multiple queries to a constant `data`

, the pre-clustering can be performed in a separate step with `buildIndex()`

.
The result can then be passed to multiple calls, avoiding the overhead of repeated clustering2 The algorithm type is automatically determined when `BNINDEX`

is specified, so there is no need to also specify `BNPARAM`

in the later functions..

```
pre <- buildIndex(data, BNPARAM=KmknnParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
```

The default setting is to search on the Euclidean distance.
Alternatively, we can use the Manhattan distance by setting `distance="Manhattan"`

in the `BiocNeighborParam`

object.

`out.m <- findKNN(data, k=5, BNPARAM=KmknnParam(distance="Manhattan"))`

Advanced users may also be interested in the `raw.index=`

argument, which returns indices directly to the precomputed object rather than to `data`

.
This may be useful inside package functions where it may be more convenient to work on a common precomputed object.

`sessionInfo()`

```
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocParallel_1.20.1 BiocNeighbors_1.4.2 knitr_1.28
## [4] BiocStyle_2.14.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 bookdown_0.17 lattice_0.20-40
## [4] digest_0.6.25 grid_3.6.2 stats4_3.6.2
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.4
## [10] stringi_1.4.6 S4Vectors_0.24.3 Matrix_1.2-18
## [13] rmarkdown_2.1 tools_3.6.2 stringr_1.4.0
## [16] parallel_3.6.2 xfun_0.12 yaml_2.2.1
## [19] compiler_3.6.2 BiocGenerics_0.32.0 BiocManager_1.30.10
## [22] htmltools_0.4.0
```

Wang, X. 2012. “A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality.” *Proc Int Jt Conf Neural Netw* 43 (6):2351–8.

Yianilos, P. N. 1993. “Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces.” In *SODA*, 93:311–21. 194.