# Introduction to Topological Data Analysis¶

## TODO:¶

[x] Make Goals

[x] Make Outline

[x] Make abstract

[ ] Make Deck text

[ ] Make Deck drawings

[ ] Make deck pdf

## Goals¶

1. Understand the general “zen” of TDA

2. Understand concrete example(s)

3. Understand a few applications

4. Validation (time permitting)

## Abstract¶

Topological data analysis (TDA) is an ecosystem of (unsupervised) learning algorithms based on the ansatz: “Data has shape, shape has meaning.” After a high level overview, we present the primary workhorse of TDA (Mapper) and two applications to healthcare. Time permitting, I will outline a couple validation strategies.

## Outline:¶

1. Part I: Zen of TDA

1. Data -> Space

2. Space -> low dimensional embedding

3. Space -> structured features ((persistent) homology)

1. Qualitative/computable features, e.g. number of connected components

2. Part II: Examples

1. DBSCAN

2. Mapper

3. Example of Mapper on a point cloud from a circle

1. circle has two points connected by two qualitatively distinct paths

3. Part III: Applications

1. Breast Cancer

2. Malaria

## Slides¶

Loose draft of the contents of my slide deck.

### Slide 1: Title Slide¶

Title: Introduction to TDA: “Data has shape, shape has meaning”

Author: Jeremy Mann, PhD

### Slide 2: What is TDA?¶

• TDA is the application of homotopy theory to data science, primarily through an ecosystem of unsupervised learning algorithms

• Homotopy theory = (De)Construction and analysis of “spaces”
• “connectivity”: when two things can be connected via a continuous path

• “locality”: computable via divide-and-conquer

• Examples of spaces: sets, graphs, circles, torii, spheres, solutions to f(x)=0, invertible matrices, ….

• Examples of analysis: number of connected components.

### Slide 3: Step I¶

• We want something of the form Data –> Insight

• Idea: break this process into two steps:

1. Data –> Space

2. Space –> Insight

• Space –> Structured data

• Space –> Low dimensional embedding

### Slide 3: Step I¶

• Idea: given input of some kind (a physical theory, a protein, a point cloud of structured data, an image, …) “naturally” construct a shape whose properties reflect something interesting about the input.

• Family tree, connectivity ~ kinship

• Gauge theory, connectivity ~ physical indistinguishability

• Point: All of these notions of connectivity (and more!) are specified once you construct a space!

### Slide 5: DBSCAN¶

• DSCAN is a classical clustering algorithm

• Primary hyperparameter is a scale parameter $$\epsilon > 0$$ determining when two points are close/similar

• Number of clusters is not explicitly specified ahead of time.

• Breaks into two steps

### Slide 6: DBSCAN¶

1. DBSCAN constructs a space (graph) from a point cloud $$\{x_i\}$$ and a

• Points: there is a point for every example

• Edges: there is an edge between $$x_i$$ and $$x_j$$ whenever they are less than $$\epsilon$$ units away from each other.

2. Two data points are declared to be in the same cluster if they can be connected by a path in this graph.

• number of clusters = number of connected components of this space.

Explicit example

### Slide 7: Mapper¶

• Mapper modifies this construction in by integrating two pieces: - Allows for/leverages intersections of subpopulations - Integrates the data of a “lens” function :

• particularly interesting feature(s)

• highest component generated by PCA

• loss function of a learning algorithm

• points ~ subpopulations

• edges ~ intersections of subpopulations

• color ~ (average/median) value of original function

• Mapper requires additional choice: (overlapping) cover of range of function

### Slide 8: Covering the range¶

• Essentially hyperparameter means

• For every element of the cover, we have a subset of the data

• For every intersectin in the data, we have a subset of the data.
• These subsets sit inside of two larger subsets

### Slide 8: Mapper Constructs a Space¶

#. Mapper constructs a space (graph) - First step: stratify data by the “lens” function.

• KEY POINT: strata should be overlapping!

• Perform DBSCAN

• Say when two points are not being connected

### Slide 9: Mapper Constructs a Space¶

Slide : Embed the Graph - There are many ways to embed graphs in low dimensions - These + mapper gives low dimensional visualizations of the data - Can remember the original lens function

• The colors should vary continuously as you move along the edges of the graph

### Slide 9¶

• Mapper breast cancer application

• Interpretation of picture

### Slide 10¶

• Mapper malaria example

### Drawings:¶

1. point cloud approximation of $$S^1 \coprod *$$ clustered by DBSCAN - draw the length scale $$\epsilon$$ on the side

2. picture of an overlapping cover.

3. picture of point cloud broken into independent parts and clustered accordingly 4.