# Introduction to Topological Data Analysis¶

## TODO:¶

[x] Make Goals

[x] Make Outline

[x] Make abstract

[ ] Make Deck text

[ ] Make Deck drawings

[ ] Make deck pdf

## Goals¶

Understand the general “zen” of TDA

Understand concrete example(s)

Understand a few applications

Validation (time permitting)

## Abstract¶

Topological data analysis (TDA) is an ecosystem of (unsupervised) learning algorithms based on the ansatz: “Data has shape, shape has meaning.” After a high level overview, we present the primary workhorse of TDA (Mapper) and two applications to healthcare. Time permitting, I will outline a couple validation strategies.

## Outline:¶

Part I: Zen of TDA

Data -> Space

Space -> low dimensional embedding

Space -> structured features ((persistent) homology)

Qualitative/computable features, e.g. number of connected components

Part II: Examples

DBSCAN

Mapper

Example of Mapper on a point cloud from a circle

circle has two points connected by two qualitatively distinct paths

Part III: Applications

Breast Cancer

Malaria

## Slides¶

Loose draft of the contents of my slide deck.

### Slide 1: Title Slide¶

Title: Introduction to TDA: “Data has shape, shape has meaning”

Author: Jeremy Mann, PhD

### Slide 2: What is TDA?¶

TDA is the application of homotopy theory to data science, primarily through an ecosystem of unsupervised learning algorithms

- Homotopy theory = (De)Construction and analysis of “spaces”
“connectivity”: when two things can be connected via a continuous path

“locality”: computable via divide-and-conquer

Examples of spaces: sets, graphs, circles, torii, spheres, solutions to f(x)=0, invertible matrices, ….

Examples of analysis: number of connected components.

### Slide 3: Step I¶

We want something of the form Data –> Insight

Idea: break this process into two steps:

Data –> Space

Space –> Insight

Space –> Structured data

Space –> Low dimensional embedding

### Slide 3: Step I¶

Idea: given input of some kind (a physical theory, a protein, a point cloud of structured data, an image, …) “naturally” construct a shape whose properties reflect something interesting about the input.

Family tree, connectivity ~ kinship

Gauge theory, connectivity ~ physical indistinguishability

Point: All of these notions of connectivity (and more!) are specified once you construct a space!

### Slide 5: DBSCAN¶

DSCAN is a classical clustering algorithm

Primary hyperparameter is a scale parameter \(\epsilon > 0\) determining when two points are close/similar

Number of clusters is not explicitly specified ahead of time.

Breaks into two steps

### Slide 6: DBSCAN¶

DBSCAN constructs a space (graph) from a point cloud \(\{x_i\}\) and a

Points: there is a point for every example

Edges: there is an edge between \(x_i\) and \(x_j\) whenever they are less than \(\epsilon\) units away from each other.

Two data points are declared to be in the same cluster if they can be connected by a path in this graph.

number of clusters = number of connected components of this space.

### Slide 6: DCSCAN¶

Explicit example

### Slide 7: Mapper¶

Mapper modifies this construction in by integrating two pieces: - Allows for/leverages intersections of subpopulations - Integrates the data of a “lens” function :

particularly interesting feature(s)

highest component generated by PCA

loss function of a learning algorithm

points ~ subpopulations

edges ~ intersections of subpopulations

color ~ (average/median) value of original function

Mapper requires additional choice: (overlapping) cover of range of function

### Slide 8: Covering the range¶

Essentially hyperparameter means

For every element of the cover, we have a subset of the data

- For every intersectin in the data, we have a subset of the data.
These subsets sit inside of two larger subsets

### Slide 8: Mapper Constructs a Space¶

#. Mapper constructs a space (graph) - First step: stratify data by the “lens” function.

KEY POINT: strata should be overlapping!

Perform DBSCAN

Say when two points are not being connected

### Slide 9: Mapper Constructs a Space¶

Slide : Embed the Graph - There are many ways to embed graphs in low dimensions - These + mapper gives low dimensional visualizations of the data - Can remember the original lens function

The colors should vary continuously as you move along the edges of the graph

### Slide 9¶

Mapper breast cancer application

Interpretation of picture

### Slide 10¶

Mapper malaria example

### Drawings:¶

point cloud approximation of \(S^1 \coprod *\) clustered by DBSCAN - draw the length scale \(\epsilon\) on the side

picture of an overlapping cover.

3. picture of point cloud broken into independent parts and clustered accordingly 4.