| Title: | Construct Mapper Graphs for Topological and Exploratory Data Analysis |
|---|---|
| Description: | Topological data analysis (TDA) is a method of data analysis that uses techniques from topology to analyze high-dimensional data. Here we implement Mapper, an algorithm from this area developed by Singh, Mémoli and Carlsson (2007) which generalizes the concept of a Reeb graph <https://en.wikipedia.org/wiki/Reeb_graph>. |
| Authors: | George Clare Kennedy [aut, cre] |
| Maintainer: | George Clare Kennedy <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 2.4.0 |
| Built: | 2026-05-27 07:55:47 UTC |
| Source: | https://github.com/uiowa-applied-topology/mapper |
Run Mapper using a one-dimensional filter, a cover of the codomain of intervals, and a clusterer.
create_1D_mapper_object( data, dists, lens, cover, clusterer = global_hierarchical_clusterer("single", dists) )create_1D_mapper_object( data, dists, lens, cover, clusterer = global_hierarchical_clusterer("single", dists) )
data |
A data frame. |
dists |
A distance matrix associated to the data frame. Can be a |
lens |
The result of a function applied to the rows of |
cover |
An |
clusterer |
A function which accepts a list of distance matrices as input, and returns the results of clustering done on each distance matrix; that is, it should return a list of named vectors, whose name are the names of data points and whose values are cluster assignments (integers). If this value is omitted, then trivial clustering will be done. |
A list of two data frames, nodes and edges, which contain information about the Mapper graph constructed from the given parameters.
The node data frame consists of:
id: vertex ID
cluster_size: number of data points in cluster
medoid: the name of the medoid of the vertex
mean_dist_to_medoid: mean distance to medoid of cluster
max_dist_to_medoid: max distance to medoid of cluster
cluster_width: maximum pairwise distance within cluster
wcss: sum of squares of distances to cluster medoid
data: names of data points in cluster
patch: level set ID
The edge data frame contains consists of:
source: vertex ID of edge source
target: vertex ID of edge target
weight: Jaccard index of edge; this is the size of the intersection between the vertices divided by the union
overlap_data: names of data points in overlap
overlap_size: number of data points overlap
# Create noisy circle data data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) # Project to horizontal axis as lens projx = data$x # Create a one-dimensional cover num_bins = 5 percent_overlap = 25 cover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap) # Build Mapper object create_1D_mapper_object(data, dist(data), projx, cover)# Create noisy circle data data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) # Project to horizontal axis as lens projx = data$x # Create a one-dimensional cover num_bins = 5 percent_overlap = 25 cover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap) # Build Mapper object create_1D_mapper_object(data, dist(data), projx, cover)
Run Mapper using the identity function as a lens and an -net cover, greedily generated using a distance matrix.
create_ball_mapper_object(data, dists, eps)create_ball_mapper_object(data, dists, eps)
data |
A data frame. |
dists |
A distance matrix for the data frame. Can be a |
eps |
A positive real number for the desired ball radius. |
A list of two data frames, nodes and edges, which contain information about the Mapper graph constructed from the given parameters.
The node data frame consists of:
id: vertex ID
cluster_size: number of data points in cluster
medoid: the name of the medoid of the vertex
mean_dist_to_medoid: mean distance to medoid of cluster
max_dist_to_medoid: max distance to medoid of cluster
cluster_width: maximum pairwise distance within cluster
wcss: sum of squares of distances to cluster medoid
data: names of data points in cluster
The edge data frame contains consists of:
source: vertex ID of edge source
target: vertex ID of edge target
weight: Jaccard index of edge; this is the size of the intersection between the vertices divided by the union
overlap_data: names of data points in overlap
overlap_size: number of data points overlap
# Create noisy cirle data set data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) # Set ball radius eps = .25 # Create Mapper object create_ball_mapper_object(data, dist(data), eps)# Create noisy cirle data set data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) # Set ball radius eps = .25 # Create Mapper object create_ball_mapper_object(data, dist(data), eps)
Make a greedy epsilon net of a data set.
create_balls(data, dists, eps)create_balls(data, dists, eps)
data |
A data frame. |
dists |
A distance matrix for the data frame. |
eps |
A positive real number. |
A list of vectors of data point names, one list element per ball. The output is such that every data point is contained in a ball of radius , and no ball center is contained in more than one ball. The centers themselves are data points.
# Create a data set from 5000 points sampled from a parametric curve, plus some noise num_points = 5000 P.data = data.frame( x = sapply(1:num_points, function(x) sin(x) * 10) + rnorm(num_points, 0, 0.1), y = sapply(1:num_points, function(x) cos(x) ^ 2 * sin(x) * 10) + rnorm(num_points, 0, 0.1), z = sapply(1:num_points, function(x) 10 * sin(x) ^ 2 * cos(x)) + rnorm(num_points, 0, 0.1) ) P.dist = dist(P.data) # Ball it up balls = create_balls(data = P.data, dists = P.dist, eps = .25)# Create a data set from 5000 points sampled from a parametric curve, plus some noise num_points = 5000 P.data = data.frame( x = sapply(1:num_points, function(x) sin(x) * 10) + rnorm(num_points, 0, 0.1), y = sapply(1:num_points, function(x) cos(x) ^ 2 * sin(x) * 10) + rnorm(num_points, 0, 0.1), z = sapply(1:num_points, function(x) 10 * sin(x) ^ 2 * cos(x)) + rnorm(num_points, 0, 0.1) ) P.dist = dist(P.data) # Ball it up balls = create_balls(data = P.data, dists = P.dist, eps = .25)
Run Ball Mapper, but non-trivially cluster within the balls. You can use two different distance matrices to for the balling and clustering.
create_clusterball_mapper_object( data, dist1, dist2, eps, clusterer = local_hierarchical_clusterer("single") )create_clusterball_mapper_object( data, dist1, dist2, eps, clusterer = local_hierarchical_clusterer("single") )
data |
A data frame. |
dist1 |
A distance matrix for the data frame; this will be used to ball the data. It can be a |
dist2 |
Another distance matrix for the data frame; this will be used to cluster the data after balling. It can be a |
eps |
A positive real number for the desired ball radius. |
clusterer |
A function which accepts a list of distance matrices as input, and returns the results of clustering done on each distance matrix; that is, it should return a list of named vectors, whose name are the names of data points and whose values are cluster assignments (integers). If this value is omitted, then single-linkage clustering will be done (and cutting heights will be decided for you). |
A list of two data frames, nodes and edges, which contain information about the Mapper graph constructed from the given parameters.
The node data frame consists of:
id: vertex ID
cluster_size: number of data points in cluster
medoid: the name of the medoid of the vertex
mean_dist_to_medoid: mean distance to medoid of cluster
max_dist_to_medoid: max distance to medoid of cluster
cluster_width: maximum pairwise distance within cluster
wcss: sum of squares of distances to cluster medoid
data: names of data points in cluster
patch: level set ID
The edge data frame contains consists of:
source: vertex ID of edge source
target: vertex ID of edge target
weight: Jaccard index of edge; this is the size of the intersection between the vertices divided by the union
overlap_data: names of data points in overlap
overlap_size: number of data points overlap
# Create noisy circle data set data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) data.dists = dist(data) # Set ball radius eps = 1 # Do single-linkage clustering in the balls to produce Mapper graph create_clusterball_mapper_object(data, data.dists, data.dists, eps)# Create noisy circle data set data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) data.dists = dist(data) # Set ball radius eps = 1 # Do single-linkage clustering in the balls to produce Mapper graph create_clusterball_mapper_object(data, data.dists, data.dists, eps)
Run the Mapper algorithm on a data frame.
create_mapper_object(data, dists, lens, cover_element_tests, clusterer = NULL)create_mapper_object(data, dists, lens, cover_element_tests, clusterer = NULL)
data |
A data frame. |
dists |
A distance matrix for the data frame. Can be a |
lens |
The result of a function applied to the data frame. There should be one value per observation in the original data frame, and, if the values are not named, they should be in the same order as their inputs in the original data frame. |
cover_element_tests |
A list of membership test functions for a set of cover elements. In other words, each element of |
clusterer |
A function which accepts a list of distance matrices as input, and returns the results of clustering done on each distance matrix; that is, it should return a list of named vectors, whose names are the names of data points and whose values are cluster assignments (integers). If this value is omitted, then trivial clustering will be done. |
A list of two data frames, one with node data and one with edge data. The node data includes:
id: vertex ID
cluster_size: number of data points in cluster
medoid: the name of the medoid of the vertex
mean_dist_to_medoid: mean distance to medoid of cluster
max_dist_to_medoid: max distance to medoid of cluster
cluster_width: maximum pairwise distance within cluster
wcss: sum of squares of distances to cluster medoid
data: names of data points in cluster
patch: level set ID
The edge data includes:
source: vertex ID of edge source
target: vertex ID of edge target
jaccard: Jaccard index of edge; intersection divided by union
overlap_data: names of data points in overlap
overlap_size: number of data points overlap
# Create noisy data around a circle data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) # Apply lens function to data projx = data$x # Build a width-balanced cover with 10 intervals and 25 percent overlap num_bins = 10 percent_overlap = 25 xcover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap) # Write a function which can check if a data point lives in an interval of our lens function check_in_interval <- function(endpoints) { return(function(x) (endpoints[1] - x <= 0) & (endpoints[2] - x >= 0)) } # Each of the "cover" elements will really be a function that checks if a data point lives in it xcovercheck = apply(xcover, 1, check_in_interval) # Build the mapper object xmapper = create_mapper_object( data = data, dists = dist(data), lens = projx, cover_element_tests = xcovercheck )# Create noisy data around a circle data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) # Apply lens function to data projx = data$x # Build a width-balanced cover with 10 intervals and 25 percent overlap num_bins = 10 percent_overlap = 25 xcover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap) # Write a function which can check if a data point lives in an interval of our lens function check_in_interval <- function(endpoints) { return(function(x) (endpoints[1] - x <= 0) & (endpoints[2] - x >= 0)) } # Each of the "cover" elements will really be a function that checks if a data point lives in it xcovercheck = apply(xcover, 1, check_in_interval) # Build the mapper object xmapper = create_mapper_object( data = data, dists = dist(data), lens = projx, cover_element_tests = xcovercheck )
Generate a cover of an interval with uniformly sized and spaced subintervals.
create_width_balanced_cover(min_val, max_val, num_bins, percent_overlap)create_width_balanced_cover(min_val, max_val, num_bins, percent_overlap)
min_val |
The left endpoint |
max_val |
The right endpoint |
num_bins |
The number of cover intervals with which to cover the interval. A positive integer. |
percent_overlap |
How much overlap desired between the cover intervals
(the percent of the intersection of each interval with its immediate
neighbor relative to its length, e.g., |
A 2D numeric array.
left_ends - The left endpoints of the cover intervals.
right_ends - The right endpoints of the cover intervals.
# Cover `[0, 100]` in 10 patches with 15% overlap create_width_balanced_cover(min_val=0, max_val=100, num_bins=10, percent_overlap=15) # Cover `[-11.5, 10.33]` in 100 patches with 2% overlap create_width_balanced_cover(-11.5, 10.33, 100, 2)# Cover `[0, 100]` in 10 patches with 15% overlap create_width_balanced_cover(min_val=0, max_val=100, num_bins=10, percent_overlap=15) # Cover `[-11.5, 10.33]` in 100 patches with 2% overlap create_width_balanced_cover(-11.5, 10.33, 100, 2)
Compute eccentricity of data points.
eccentricity_filter(dists)eccentricity_filter(dists)
dists |
A distance matrix associated to a data frame. Can be a |
A vector of centrality measures, calculated per data point as the sum of its distances to every other data point, divided by the number of points.
# Generate some noisy data along a 2D curve num_points = 100 P.data = data.frame( x = sapply(1:num_points, function(x) sin(x) * 10) + rnorm(num_points, 0, 0.1), y = sapply(1:num_points, function(x) cos(x) ^ 2 * sin(x) * 10) + rnorm(num_points, 0, 0.1) ) P.dist = dist(P.data) # Apply eccentricity filter eccentricity = eccentricity_filter(P.dist)# Generate some noisy data along a 2D curve num_points = 100 P.data = data.frame( x = sapply(1:num_points, function(x) sin(x) * 10) + rnorm(num_points, 0, 0.1), y = sapply(1:num_points, function(x) cos(x) ^ 2 * sin(x) * 10) + rnorm(num_points, 0, 0.1) ) P.dist = dist(P.data) # Apply eccentricity filter eccentricity = eccentricity_filter(P.dist)
Create a dude to perform hierarchical clustering in a global context using the hclust package.
global_hierarchical_clusterer(method, dists, cut_height = -1)global_hierarchical_clusterer(method, dists, cut_height = -1)
method |
A string to pass to hclust to tell it what kind of clustering to do. |
dists |
The global distance matrix on which to run clustering to determine a global cutting height. |
cut_height |
The cutting height at which you want all dendrograms to be cut. If this is not specified then the clusterer will use a cut height 5% above the merge point preceding the tallest branch in the global dendrogram. |
This clusterer cuts all dendrograms it is given at a uniform cutting height, defaulting to a heuristic if necessary.
A function that inputs a list of distance matrices and returns a list containing one vector per matrix, whose element names are data point names and whose values are cluster labels (relative to each matrix).
data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) projx = data$x names(projx) = row.names(data) dists = dist(data) num_bins = 10 percent_overlap = 25 cover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap) create_1D_mapper_object(data, dists, projx, cover, global_hierarchical_clusterer("mcquitty", dists))data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) projx = data$x names(projx) = row.names(data) dists = dist(data) num_bins = 10 percent_overlap = 25 cover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap) create_1D_mapper_object(data, dists, projx, cover, global_hierarchical_clusterer("mcquitty", dists))
Create a dude to perform hierarchical clustering in a local context using the hclust package.
local_hierarchical_clusterer(method)local_hierarchical_clusterer(method)
method |
A string to pass to hclust to tell it what kind of clustering to do. |
This clusterer determines cutting heights for dendrograms by cutting them individually, 5% above the merge point with the longest unbroken gap until the next merge point.
A function that inputs a list of distance matrices and returns a list containing one vector per matrix, whose element names are data point names and whose values are cluster labels (within each patch).
data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) projx = data$x names(projx) = row.names(data) dists = dist(data) num_bins = 10 percent_overlap = 25 cover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap) create_1D_mapper_object(data, dists, projx, cover, local_hierarchical_clusterer("mcquitty"))data = data.frame(x = sapply(1:1000, function(x) cos(x)) + runif(1000, 0, .25), y = sapply(1:1000, function(x) sin(x)) + runif(1000, 0, .25)) projx = data$x names(projx) = row.names(data) dists = dist(data) num_bins = 10 percent_overlap = 25 cover = create_width_balanced_cover(min(projx), max(projx), num_bins, percent_overlap) create_1D_mapper_object(data, dists, projx, cover, local_hierarchical_clusterer("mcquitty"))