[Changes from version 1 and migration: https://github.com/omid-arhami/topolow/blob/main/NEWS.md]
topolow
is an R package that implements a novel, physics-inspired algorithm for Euclidean embedding of potentially non-metric, sparse, and noisy dissimilarity data. The algorithm converts dissimilarity matrices into valid Euclidean coordinate systems, making the data compatible with standard statistical and machine learning tools such as PCA, clustering, and regression.
The Problem: Many datasets contain dissimilarity measurements that violate metric axioms (symmetry, triangle inequality) or are highly sparse with missing values. Standard methods like Multidimensional Scaling (MDS) struggle with such data, leading to poor embeddings or complete failure.
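For example, a dissimilarity matrix can violate the triangle inequality, in which case no exact Euclidean configuration of points can reproduce it. A minimal, purely illustrative example:

# d(A, C) = 5 exceeds d(A, B) + d(B, C) = 1 + 2 = 3, so these values are non-metric.
non_metric <- matrix(c(0, 1, 5,
                       1, 0, 2,
                       5, 2, 0),
                     nrow = 3,
                     dimnames = list(c("A", "B", "C"), c("A", "B", "C")))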
The Solution: Topolow uses a physics-inspired approach that models objects as particles connected by springs (for known dissimilarities) and repulsive forces (for missing pairs). This gradient-free optimization is robust to local optima and handles non-metric data naturally.
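To make the physical picture concrete, the sketch below is a deliberately simplified toy force update. It is not the package's implementation; the constants k and c_rep are only loose analogues of topolow's k0 and c_repulsion parameters.

# Toy illustration only: one pass of pairwise force updates.
# Measured pairs act as springs pulling the current distance toward the
# measured value; unmeasured (NA) pairs repel each other.
toy_force_step <- function(pos, D, k = 0.1, c_rep = 0.05) {
  n <- nrow(pos)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      delta <- pos[j, ] - pos[i, ]
      d <- sqrt(sum(delta^2)) + 1e-9
      dir <- delta / d
      if (!is.na(D[i, j])) {
        f <- k * (d - D[i, j]) * dir   # spring: proportional to the residual
      } else {
        f <- -c_rep / d^2 * dir        # repulsion, decaying with distance
      }
      pos[i, ] <- pos[i, ] + f / 2
      pos[j, ] <- pos[j, ] - f / 2
    }
  }
  pos
}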
(Download version 2 from GitHub to use the wizard function Euclidify.)
install.packages("topolow")
# Install devtools if needed
if (!require("devtools")) install.packages("devtools")
# Install topolow
devtools::install_github("omid-arhami/topolow")
Let's embed 5 points with known coordinates to validate the algorithm:
library(topolow)
# Known coordinates:
# S1 at (0,0), S2 at (3,0), S3 at (4,4), V1 at (2,2), V2 at (0,4)
# We'll provide some distances and let Topolow infer the missing ones
coordinates <- data.frame(
  point = c("S1", "S2", "S3", "V1", "V2"),
  x = c(0, 3, 4, 2, 0),
  y = c(0, 0, 4, 2, 4)
)

# Create a matrix with just the x,y coordinates for distance calculation
coord_matrix <- as.matrix(coordinates[, c("x", "y")])
rownames(coord_matrix) <- coordinates$point

# Calculate the distance matrix using Euclidean distance
dist_mat <- as.matrix(dist(coord_matrix))

# Remove a known distance to see if topolow predicts it accurately
dist_mat["V1", "V2"] <- NA
dist_mat["V2", "V1"] <- NA
# Run Topolow (manual parameters for quick demo)
result <- euclidean_embedding(
  dissimilarity_matrix = dist_mat,
  ndim = 2,
  mapping_max_iter = 1000,
  k0 = 5,
  cooling_rate = 0.03,
  c_repulsion = 0.7,
  verbose = TRUE
)
# Check results
print("Original matrix:")
print(dist_mat)
print("Estimated distances:")
print(result$est_distances)
# The missing distance V1-V2 should be approximately 2.83
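# Optional sanity check (assumes est_distances is a named matrix like dist_mat):
# the true V1-V2 distance is sqrt((2 - 0)^2 + (2 - 4)^2) = sqrt(8) ~= 2.83
cat("Estimated V1-V2 distance:", result$est_distances["V1", "V2"], "\n")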
For real applications, use Euclidify(), which automatically optimizes all parameters:
# Using automatic parameter optimization
result_auto <- Euclidify(
  dissimilarity_matrix = dist_mat,
  ndim_range = c(2, 4),
  output_dir = tempdir(),   # Required for optimization files
  n_initial_samples = 50,   # Reduced for quick demo
  n_adaptive_samples = 200,
  folds = 4,
  verbose = "standard"
)

# Extract optimized distances
est_dist <- result_auto$est_distances
print(est_dist)
# View optimal parameters found
print(result_auto$optimal_params)
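# Optional check (a sketch; assumes est_distances carries the row/column names
# of dist_mat): compare the estimates against the full ground-truth distances
# computed from the known coordinates.
true_dist <- as.matrix(dist(coord_matrix))
nm <- rownames(true_dist)
abs_err <- abs(est_dist[nm, nm] - true_dist)
cat("Mean absolute error vs. ground truth:", round(mean(abs_err[upper.tri(abs_err)]), 3), "\n")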
For immunologists studying viral evolution and vaccine effectiveness, Topolow can generate insightful visualizations. A key feature for this application is the calculation of antigenic velocity vectors, which show the rate and direction of antigenic drift for each virus relative to its recent predecessors. These vectors can reveal evolutionary trends and flag potential vaccine-escape variants as strains moving rapidly through antigenic space.
# Example: H3N2 Influenza A antigenic evolution with temporal mapping and velocity vectors
# Create a more comprehensive antigenic dataset with temporal information
antigen_data <- data.frame(
  virus = c("A/H3N2/HK/1968", "A/H3N2/EN/1972", "A/H3N2/VI/1975", "A/H3N2/TX/1977",
"A/H3N2/BK/1979", "A/H3N2/SI/1987", "A/H3N2/BE/1989", "A/H3N2/BE/1992",
"A/H3N2/WU/1995", "A/H3N2/SY/1997", "A/H3N2/FU/2002", "A/H3N2/WI/2005"),
serum = rep(c("anti-HK68/1968", "anti-EN72/1972", "anti-VI75/1975", "anti-TX77/1977", "anti-BK79/1979",
"anti-SI87/1987"), each = 12),
titer = c(2560, 1280, 640, 320, 160, 80, 40, "<40", "<40", "<40", "<40", "<40",
640, 2560, 1280, 640, 320, 160, 80, 40, "<40", "<40", "<40", "<40",
320, 640, 2560, 1280, 640, 320, 160, 80, 40, "<40", "<40", "<40",
160, 320, 640, 2560, 1280, 640, 320, 160, 80, 40, "<40", "<40",
80, 160, 320, 640, 2560, 1280, 640, 320, 160, 80, 40, "<40",
"<40", 80, 160, 320, 640, 2560, 1280, 640, 320, 160, 80, 40),
year = rep(c(1968, 1972, 1975, 1977, 1979, 1987, 1989, 1992, 1995, 1997, 2002, 2005), 6)
)
# Convert titers to dissimilarity matrix
results <- process_antigenic_data(
  data = antigen_data,
  antigen_col = "virus",
  serum_col = "serum",
  value_col = "titer",
  is_similarity = TRUE,  # Titers are similarities
  scale_factor = 10      # Base dilution factor of HI assay
)

antigenic_matrix <- results$matrix
# Create antigenic map with temporal information
antigenic_map <- Euclidify(
  dissimilarity_matrix = antigenic_matrix,
  ndim_range = c(3, 6),
  folds = 5,
  output_dir = tempdir(),
  verbose = "standard"
)
# Prepare data for temporal visualization
positions_df <- data.frame(
  V1 = antigenic_map$positions[, 1],
  V2 = antigenic_map$positions[, 2],
  name = rownames(antigenic_map$positions),
  year = as.numeric(sub(".*/([0-9]+).*", "\\1", rownames(antigenic_map$positions))),
  antigen = grepl("^V/", rownames(antigenic_map$positions)),
  antiserum = grepl("^S/", rownames(antigenic_map$positions))
)
# Configure visualization aesthetics
aesthetic_config <- new_aesthetic_config(
  point_size = 3.0,
  point_alpha = 0.8,
  point_shapes = c(antigen = 16, antiserum = 5),  # Circle for antigens, diamond for sera
  gradient_colors = list(low = "blue", high = "red"),
  show_labels = TRUE,
  show_title = TRUE,
  title_size = 14,
  axis_text_size = 11,
  show_legend = TRUE,
  legend_position = "right",
  arrow_alpha = 0.7
)
layout_config <- new_layout_config(
  width = 10,
  height = 8,
  save_plot = FALSE,
  arrow_plot_threshold = 0.15,  # Show arrows for significant movements
  show_grid = TRUE,
  grid_type = "major"
)
annotation_config <- new_annotation_config(
  notable_points = c("V/A/H3N2/HK/1968", "V/A/H3N2/FU/2002", "V/A/H3N2/WI/2005")
)
# Create temporal antigenic map with velocity vectors
temporal_plot <- plot_temporal_mapping(
  df_coords = positions_df,
  ndim = 3,
  draw_arrows = TRUE,       # Enable velocity vectors
  annotate_arrows = TRUE,   # Label arrows with strain names
  sigma_t = 2.0,            # Temporal bandwidth (years)
  sigma_x = 1.5,            # Spatial bandwidth (antigenic units)
  aesthetic_config = aesthetic_config,
  layout_config = layout_config,
  annotation_config = annotation_config,
  output_dir = tempdir()
)
print(temporal_plot)
# Create clustered view by antigenic epoch
positions_df$cluster <- cut(positions_df$year,
                            breaks = c(1965, 1975, 1985, 1995, 2010),
                            labels = c("Pre-1975", "1975-1985", "1985-1995", "Post-1995"))
# Visualize antigenic clusters with evolutionary relationships
cluster_plot <- plot_cluster_mapping(
  df_coords = positions_df,
  ndim = 3,
  draw_arrows = TRUE,
  show_one_arrow_per_cluster = TRUE,  # One representative arrow per cluster
  aesthetic_config = aesthetic_config,
  layout_config = layout_config,
  annotation_config = annotation_config
)
print(cluster_plot)
# Generate 3D visualization for enhanced perspective
if (requireNamespace("rgl", quietly = TRUE)) {
# Interactive 3D antigenic map
  plot_3d <- plot_3d_mapping(
    positions_df,
    ndim = 3,
    aesthetic_config = aesthetic_config,
    layout_config = layout_config
  )
cat("3D antigenic map created. Use mouse to rotate and zoom.\n")
}
# Example: Customer behavior dissimilarity
customer_data <- data.frame(
  customer = rep(paste0("Cust", 1:5), each = 5),
product = rep(paste0("Prod", 1:5), 5),
dissimilarity = c(0, 2.1, 3.5, 1.8, 4.2,
2.1, 0, 1.9, 3.1, 2.8,
3.5, 1.9, 0, 2.4, 3.7,
1.8, 3.1, 2.4, 0, 2.9,
4.2, 2.8, 3.7, 2.9, 0)
)
# Convert to matrix format
dissim_matrix <- list_to_matrix(
  data = customer_data,
object_col = "customer",
reference_col = "product",
value_col = "dissimilarity",
is_similarity = FALSE
)
# Embed in Euclidean space
customer_map <- Euclidify(
  dissimilarity_matrix = dissim_matrix,
output_dir = tempdir(),
ndim_range = c(2, 4),
verbose = "standard"
)
plot(customer_map$positions, main = "Customer Behavior Map")
text(customer_map$positions[, 1] + jitter(rep(0, nrow(customer_map$positions)), amount = 0.2),
     customer_map$positions[, 2] + jitter(rep(0, nrow(customer_map$positions)), amount = 0.2),
     labels = rownames(customer_map$positions), pos = 3, cex = 0.5)
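# Because the embedding is an ordinary Euclidean coordinate system, standard
# tools such as clustering can be applied directly. This is an illustrative
# add-on using base R, not part of the topolow workflow:
customer_clusters <- hclust(dist(customer_map$positions), method = "ward.D2")
plot(customer_clusters, main = "Hierarchical Clustering of Embedded Customers")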
# Example: Large symmetric sparse matrix with realistic structure
set.seed(12345) # For reproducibility
# Generate a large, realistic sparse dissimilarity matrix
n_objects <- 50
object_names <- paste0("Object_", sprintf("%02d", 1:n_objects))
# Create base coordinates in 3D space with clustered structure
cluster_centers <- matrix(c(
  c(0, 0, 0),      # Cluster 1
  c(5, 0, 0),      # Cluster 2
  c(0, 5, 0),      # Cluster 3
  c(5, 5, 0),      # Cluster 4
  c(2.5, 2.5, 3)   # Cluster 5
), ncol = 3, byrow = TRUE)
# Assign objects to clusters
cluster_assignments <- sample(1:5, n_objects, replace = TRUE,
                              prob = c(0.25, 0.25, 0.20, 0.20, 0.10))
# Generate coordinates with cluster structure + noise
true_coordinates <- matrix(0, n_objects, 3)
for (i in 1:n_objects) {
  cluster_id <- cluster_assignments[i]
  true_coordinates[i, ] <- cluster_centers[cluster_id, ] + rnorm(3, 0, 0.8)
}
rownames(true_coordinates) <- object_names
# Calculate complete Euclidean distance matrix
complete_distances <- as.matrix(dist(true_coordinates))
# Add realistic measurement noise (5% coefficient of variation)
noisy_distances <- complete_distances * (1 + rnorm(n_objects^2, 0, 0.05))
noisy_distances <- pmax(noisy_distances, 0.1)  # Minimum distance threshold
# Make symmetric and zero diagonal
noisy_distances[lower.tri(noisy_distances)] <- t(noisy_distances)[lower.tri(noisy_distances)]
diag(noisy_distances) <- 0
# Introduce structured sparsity (85% missing data)
# Objects in the same cluster are more likely to have measurements
total_pairs <- n_objects * (n_objects - 1) / 2
target_missing_pairs <- round(total_pairs * 0.85)  # 85% sparsity
# Generate upper triangular indices for sampling
upper_tri_indices <- which(upper.tri(noisy_distances), arr.ind = TRUE)
# Create sampling weights: higher probability for within-cluster pairs
sampling_weights <- numeric(nrow(upper_tri_indices))
for (k in 1:nrow(upper_tri_indices)) {
  i <- upper_tri_indices[k, 1]
  j <- upper_tri_indices[k, 2]
  if (cluster_assignments[i] == cluster_assignments[j]) {
    sampling_weights[k] <- 0.3  # Lower chance of being missing (within cluster)
  } else {
    sampling_weights[k] <- 1.0  # Higher chance of being missing (between clusters)
  }
}
# Sample pairs to remove
missing_pair_indices <- sample(nrow(upper_tri_indices),
                               target_missing_pairs, prob = sampling_weights)
# Create sparse matrix
sparse_matrix <- noisy_distances
sparse_matrix[upper_tri_indices[missing_pair_indices, ]] <- NA
sparse_matrix[upper_tri_indices[missing_pair_indices, c(2, 1)]] <- NA
rownames(sparse_matrix) <- colnames(sparse_matrix) <- object_names
# Calculate actual sparsity
actual_sparsity <- sum(is.na(sparse_matrix)) / (n_objects * (n_objects - 1)) * 100
cat("=== SPARSE DATA EXAMPLE ===\n")
cat("Generated sparse dissimilarity matrix:\n")
cat("- Matrix size:", n_objects, "x", n_objects, "objects\n")
cat("- Actual sparsity:", round(actual_sparsity, 1), "% missing data\n")
cat("- Clustering structure: 5 clusters with", table(cluster_assignments), "objects each\n")
cat("- Noise level: 5% coefficient of variation\n")
cat("- Within-cluster connectivity preserved for realism\n")
cat("=== Check the connectivity of the data graph ===\n")
# Network structure analysis to make sure there are no separate islands in the data
network_analysis <- analyze_network_structure(sparse_matrix)
network_plot <- plot_network_structure(network_analysis)
print(network_plot)
# Demonstrate Topolow's superior sparse data handling
cat("\n=== EMBEDDING SPARSE DATA ===\n")
# Topolow embedding with automatic optimization
sparse_result <- Euclidify(
  dissimilarity_matrix = sparse_matrix,
  ndim_range = c(2, 8),
  output_dir = tempdir(),
  n_initial_samples = 50,
  n_adaptive_samples = 150,
  folds = 20,
  verbose = "standard"
)

print(sparse_result$optimal_params)
# Evaluate embedding quality by visualizing the results
# and coloring by original cluster for validation
plot_colors <- rainbow(5)[cluster_assignments]
plot(sparse_result$positions[, 1:2],
col = plot_colors,
pch = 19,
cex = 1.2,
main = paste("Topolow Embedding of Sparse Data\n",
round(actual_sparsity, 1), "% Missing Values"),
xlab = "Dimension 1",
ylab = "Dimension 2")
legend("topleft",
legend = paste("Cluster", 1:5),
col = rainbow(5),
pch = 19,
cex = 0.6)
# Add text labels for some points
text(sparse_result$positions[1:10, 1:2],
labels = object_names[1:10],
pos = 3,
cex = 0.6)
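# Optional validation sketch (assumes est_distances is a matrix in the same
# object order as sparse_matrix): compare estimated distances on the held-out
# (missing) pairs against the noise-free ground truth used for simulation.
est_d <- sparse_result$est_distances
held_out <- is.na(sparse_matrix) & upper.tri(sparse_matrix)
held_out_error <- abs(est_d[held_out] - complete_distances[held_out])
cat("Median absolute error on held-out pairs:", round(median(held_out_error), 2), "\n")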
Topolow employs a novel physical model in which objects are treated as particles: each measured dissimilarity acts as a spring whose rest length is the measured value, while unmeasured pairs exert repulsive forces on each other.
Key Distinction from MDS: While MDS methods impute missing values and calculate eigenvalues or gradient vectors, Topolow works directly with the structure in the data and uses physics-inspired forces for robust optimization.
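As a quick, illustrative contrast, classical MDS cannot even accept the incomplete matrix from the first example above (dist_mat contains NA entries), whereas euclidean_embedding() ran on it directly:

# cmdscale() from base R fails on missing values; topolow needs no imputation.
mds_attempt <- tryCatch(
  cmdscale(dist_mat, k = 2),
  error = function(e) paste("cmdscale error:", conditionMessage(e))
)
print(mds_attempt)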
Topolow accepts data in multiple formats:
# Direct dissimilarity matrix
dist_matrix <- matrix(c(0, 1.2, 2.1, 1.2, 0, 1.8, 2.1, 1.8, 0), nrow = 3)
# Convert long format to matrix
long_data <- data.frame(
  object = c("A", "B", "C"),
  reference = c("X", "Y", "Z"),
  value = c(1.2, 1.8, 2.1)
)
matrix_data <- list_to_matrix(long_data, "object", "reference", "value")
# Data with detection limits
<- matrix(c(0, ">64", "<40", ">64", 0, "20", "<40", "20", 0), nrow=3) threshold_matrix
Based on empirical evaluations in the Bioinformatics paper (Arhami and Rohani, 2025, https://doi.org/10.1093/bioinformatics/btaf372), the key parameters of euclidean_embedding() for manual optimization are ndim, mapping_max_iter, k0, cooling_rate, and c_repulsion, as used in the quick-start example above (see ?euclidean_embedding for their definitions and recommended ranges).
Recommendation: use Euclidify() for automatic parameter optimization unless you are willing to invest time in manual tuning and investigation.
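If you do want to tune manually, one minimal sketch (illustrative only; it reuses dist_mat from the quick-start example, scores fits by in-sample error on the observed entries, and assumes est_distances is a matrix aligned with the input; Euclidify() does this properly with cross-validation) is to evaluate a small grid of values:

grid <- expand.grid(k0 = c(1, 5, 10), cooling_rate = c(0.01, 0.03))
errors <- sapply(seq_len(nrow(grid)), function(r) {
  fit <- euclidean_embedding(
    dissimilarity_matrix = dist_mat,
    ndim = 2,
    mapping_max_iter = 500,
    k0 = grid$k0[r],
    cooling_rate = grid$cooling_rate[r],
    c_repulsion = 0.7,
    verbose = FALSE
  )
  observed <- !is.na(dist_mat) & upper.tri(dist_mat)
  mean(abs(fit$est_distances[observed] - dist_mat[observed]))
})
grid[which.min(errors), ]  # parameter combination with the lowest observed-entry error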
Method | Non-metric Compatible | Missing Data | Sparse Data | Gradient-free | Stability |
---|---|---|---|---|---|
Topolow | ✅ | ✅ | ✅ | ✅ | ✅ |
Classical MDS | ❌ | ❌ (requires imputation) | ❌ | ✅ | ✅ |
Iterative MDS | ❌ | ❌ (requires imputation) | ❌ | ❌ | ❌ |
t-SNE | ❌ | ❌ | ❌ | ❌ | ❌ |
UMAP | ❌ | ❌ | ❌ | ❌ | ⚠️ |
topolow can be used on a single HPC system and leverage a larger number of cores by increasing the max_cores parameter. Distributed processing via SLURM was supported only in versions prior to 1.0.0.
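A hedged sketch of using more cores on one machine (this assumes Euclidify() accepts the max_cores argument mentioned above; check ?Euclidify in your installed version for the exact argument name and default):

n_cores <- max(1, parallel::detectCores() - 1)  # leave one core free
result_parallel <- Euclidify(
  dissimilarity_matrix = dist_mat,
  ndim_range = c(2, 4),
  output_dir = tempdir(),
  max_cores = n_cores,  # assumption: parameter name as referenced above
  verbose = "standard"
)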
For 3D visualization capabilities, install the rgl package:
install.packages("rgl")
Note for macOS users: the rgl package requires XQuartz. Download it from https://www.xquartz.org/, install it, and restart your computer.
Full documentation is available through the R help system:
# View documentation for specific functions
?Euclidify
?euclidean_embedding
?initial_parameter_optimization
# Package overview
help(package = "topolow")
If you use this package, please cite either the original paper introducing Topolow (with applications in the context of viral evolution):
Omid Arhami, Pejman Rohani, Topolow: a mapping algorithm for antigenic cross-reactivity and binding affinity assays, Bioinformatics, Volume 41, Issue 7, July 2025, btaf372, https://doi.org/10.1093/bioinformatics/btaf372
@article{10.1093/bioinformatics/btaf372,
author = {Arhami, Omid and Rohani, Pejman},
title = {Topolow: a mapping algorithm for antigenic cross-reactivity and binding affinity assays},
journal = {Bioinformatics},
volume = {41},
number = {7},
pages = {btaf372},
year = {2025},
month = {06},
abstract = {Understanding antigenic evolution through cross-reactivity assays is crucial for tracking rapidly evolving pathogens requiring regular vaccine updates. However, existing cartography methods, commonly based on multidimensional scaling (MDS), face significant challenges with sparse and complex data, producing incomplete and inconsistent maps. There is an urgent need for robust computational methods that can accurately map antigenic relationships from incomplete experimental data while maintaining biological relevance, especially given that more than 95\% of possible measurements could be missing in large-scale studies. We present Topolow, an algorithm that transforms cross-reactivity and binding affinity measurements into accurate positions in a phenotype space. Using a physics-inspired model, Topolow achieved comparable prediction accuracy to MDS for H3N2 influenza and 56\% and 41\% improved accuracy for dengue and HIV, while maintaining complete positioning of all antigens. The method effectively reduces experimental noise and bias, determines optimal dimensionality through likelihood-based estimation, avoiding distortions due to insufficient dimensions, and demonstrates orders of magnitude better stability across multiple runs. We also introduce antigenic velocity vectors, which measure the rate of antigenic advancement of each isolate per unit of time against its temporal and evolutionary related background, revealing the underlying antigenic relationships and cluster transitions. Topolow is implemented in R and freely available at https://doi.org/10.5281/zenodo.15620983 and https://github.com/omid-arhami/topolow.},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btaf372},
url = {https://doi.org/10.1093/bioinformatics/btaf372},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/41/7/btaf372/63582086/btaf372.pdf},
}
Or cite the preprint, which details the mathematical properties of the algorithm, the Euclidean embedding, and the evaluations:
Omid Arhami, Pejman Rohani, Topolow: Force-Directed Euclidean Embedding of Dissimilarity Data with Robustness Against Non-Metricity and Sparsity, arXiv:2508.01733, https://doi.org/10.48550/arXiv.2508.01733
@misc{arhami2025topolowforcedirectedeuclideanembedding,
title={Topolow: Force-Directed Euclidean Embedding of Dissimilarity Data with Robustness Against Non-Metricity and Sparsity},
author={Omid Arhami and Pejman Rohani},
year={2025},
eprint={2508.01733},
archivePrefix={arXiv},
primaryClass={cs.CG},
doi={10.48550/arXiv.2508.01733},
url={https://arxiv.org/abs/2508.01733},
}
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is distributed under the BSD-3-Clause license (YEAR: 2025, COPYRIGHT HOLDER: Omid Arhami).