1 Identifying all neighbors within range

Another application of the KMKNN or VP tree algorithms is to identify all neighboring points within a certain distance1 The default here is Euclidean, but again, we can set distance="Manhattan" in the BNPARAM object if so desired. of the current point. We first mock up some data:

nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

We apply the findNeighbors() function to data:

fout <- findNeighbors(data, threshold=1)
head(fout$index)
## [[1]]
## [1] 8632 8992 1663    1 5703 5681
## 
## [[2]]
## [1]    2 5267
## 
## [[3]]
##  [1] 9162    3  157 1244 8491 9217 8554  197 4172 3456 6484 3366 9988 6165 3691
## [16] 9229 2050 6164 7163 9617 2886 1771 1492 8855 1454 4993  746 3637  222 6969
## [31]  539 2125 8583 8628 1312 9144 6573 1313 1345 6465
## 
## [[4]]
## [1] 8893 6089    4 8686 4092
## 
## [[5]]
## [1] 3259 4114    5
## 
## [[6]]
## [1] 1832 2998    6 7597 9123
head(fout$distance)
## [[1]]
## [1] 0.9679688 0.9955637 0.9529184 0.0000000 0.9455772 0.9465536
## 
## [[2]]
## [1] 0.0000000 0.8882458
## 
## [[3]]
##  [1] 0.9638099 0.0000000 0.9993844 0.9768464 0.9927234 0.9765588 0.9522297
##  [8] 0.9915558 0.9611522 0.8413193 0.9994964 0.8834591 0.9948502 0.9624836
## [15] 0.8056670 0.9544794 0.9659481 0.9337453 0.9639685 0.8808415 0.9567267
## [22] 0.9856127 0.9956212 0.9606468 0.8722711 0.9886010 0.7948116 0.7879434
## [29] 0.9946219 0.9710092 0.9572142 0.8698447 0.9555942 0.9465146 0.8392669
## [36] 0.9354918 0.9912927 0.9700482 0.9973354 0.8249090
## 
## [[4]]
## [1] 0.8933662 0.9847649 0.0000000 0.9405773 0.9140731
## 
## [[5]]
## [1] 0.9427589 0.9928982 0.0000000
## 
## [[6]]
## [1] 0.9551407 0.9371954 0.0000000 0.9861542 0.9415436

Each entry of the index list corresponds to a point in data and contains the row indices in data that are within threshold. For example, the 3rd point in data has the following neighbors:

fout$index[[3]]
##  [1] 9162    3  157 1244 8491 9217 8554  197 4172 3456 6484 3366 9988 6165 3691
## [16] 9229 2050 6164 7163 9617 2886 1771 1492 8855 1454 4993  746 3637  222 6969
## [31]  539 2125 8583 8628 1312 9144 6573 1313 1345 6465

… with the following distances to those neighbors:

fout$distance[[3]]
##  [1] 0.9638099 0.0000000 0.9993844 0.9768464 0.9927234 0.9765588 0.9522297
##  [8] 0.9915558 0.9611522 0.8413193 0.9994964 0.8834591 0.9948502 0.9624836
## [15] 0.8056670 0.9544794 0.9659481 0.9337453 0.9639685 0.8808415 0.9567267
## [22] 0.9856127 0.9956212 0.9606468 0.8722711 0.9886010 0.7948116 0.7879434
## [29] 0.9946219 0.9710092 0.9572142 0.8698447 0.9555942 0.9465146 0.8392669
## [36] 0.9354918 0.9912927 0.9700482 0.9973354 0.8249090

Note that, for this function, the reported neighbors are not sorted by distance. The order of the output is completely arbitrary and will vary depending on the random seed. However, the identity of the neighbors is fully deterministic.

2 Querying another data set for neighbors

The queryNeighbors() function is also provided for identifying all points within a certain distance of a query point. Given a query data set:

nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)

… we apply the queryNeighbors() function:

qout <- queryNeighbors(data, query, threshold=1)
length(qout$index)
## [1] 1000

… where each entry of qout$index corresponds to a row of query and contains its neighbors in data. Again, the order of the output is arbitrary but the identity of the neighbors is deterministic.

3 Further options

Most of the options described for findKNN() are also applicable here. For example:

  • subset to identify neighbors for a subset of points.
  • get.distance to avoid retrieving distances when unnecessary.
  • BPPARAM to parallelize the calculations across multiple workers.
  • raw.index to return the raw indices from a precomputed index.

Note that the argument for a precomputed index is precomputed:

pre <- buildIndex(data, BNPARAM=KmknnParam())
fout.pre <- findNeighbors(BNINDEX=pre, threshold=1)
qout.pre <- queryNeighbors(BNINDEX=pre, query=query, threshold=1)

Users are referred to the documentation of each function for specific details.

4 Session information

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocParallel_1.20.1 BiocNeighbors_1.4.2 knitr_1.28         
## [4] BiocStyle_2.14.4   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3          bookdown_0.17       lattice_0.20-40    
##  [4] digest_0.6.25       grid_3.6.2          stats4_3.6.2       
##  [7] magrittr_1.5        evaluate_0.14       rlang_0.4.4        
## [10] stringi_1.4.6       S4Vectors_0.24.3    Matrix_1.2-18      
## [13] rmarkdown_2.1       tools_3.6.2         stringr_1.4.0      
## [16] parallel_3.6.2      xfun_0.12           yaml_2.2.1         
## [19] compiler_3.6.2      BiocGenerics_0.32.0 BiocManager_1.30.10
## [22] htmltools_0.4.0