---
date: '2025-08-28'
description: and holistic overview
id: '1'
modified: 2026-06-05 15:08:32 GMT-04:00
seealso:
  - '[[thoughts/tsfm/lecture-1-exercise/index|exercise]]'
  - '[[thoughts/Manifold hypothesis|Manifold hypothesis]]'
socials:
  link: https://tsfm.ca/lecture-one
tags:
  - ml
  - tsfm
  - ml/DL
title: lecture one
created: '2025-08-28'
published: '2025-08-28'
pageLayout: default
slug: thoughts/tsfm/1
permalink: https://aarnphm.xyz/thoughts/tsfm/1.md
generator:
  quartz: v4.6.0
  hostedProvider: Cloudflare
  baseUrl: aarnphm.xyz
full: https://aarnphm.xyz/llms-full.txt
---
## Manifold Hypothesis

Strings sampled from an order-1 Markov model (bigrams) versus strings sampled from a uniform distribution.

> As dimensionality grows, the fraction of meaningful strings shrinks to zero (curse of dimensionality)

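One way to make the quote above concrete, as a sketch under stated assumptions: take an alphabet of size $V$ and suppose (hypothetically) that the number of meaningful strings of length $n$ grows like $2^{hn}$ for some per-character rate $h < \log_2 V$. Neither $V$ nor $h$ comes from the lecture; they are illustrative symbols. The fraction of meaningful strings among all $V^{n}$ strings is then:

$$
\frac{2^{hn}}{V^{n}} = 2^{-n(\log_2 V - h)} \longrightarrow 0 \quad \text{as } n \to \infty
$$
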
Structured data is more compressible; random data is near max entropy. (Entropy readout in text panel; RLE ratio in image panel.)

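The compressibility claim can be checked with a small sketch. It is not the lecture's demo: the sample phrase and alphabet are hypothetical, entropy is measured on single characters only, and `zlib` stands in for the RLE ratio mentioned for the image panel.

```python
import math
import random
import zlib
from collections import Counter


def entropy_bits_per_char(text: str) -> float:
    """Empirical Shannon entropy of the single-character distribution."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size (smaller means more compressible)."""
    raw = text.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)
alphabet = "abcdefghijklmnopqrstuvwxyz "

# Hypothetical "structured" sample: a short phrase repeated many times.
structured = "the cat sat on the mat " * 200
# Uniform sample of the same length over the same alphabet.
uniform = "".join(rng.choice(alphabet) for _ in range(len(structured)))

for name, text in [("structured", structured), ("uniform", uniform)]:
    print(name, entropy_bits_per_char(text), compression_ratio(text))
```

The uniform sample should sit close to the maximum of $\log_2 27$ bits per character and barely compress, while the repeated phrase has lower entropy and a much smaller compression ratio.
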
> <ref slug="tags/ml/DL"> uses data to fit the regularities of this structured slice; we approximate, we don’t enumerate.

![[thoughts/gradient descent]]

![[thoughts/FFN#backpropagation]]

## exercise

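Given `Y = X @ W`, with $X \in \mathbb{R}^{N\times D}, W \in \mathbb{R}^{D\times M}, Y\in \mathbb{R}^{N\times M}$, and the upstream gradient $dY = \partial L / \partial Y \in \mathbb{R}^{N\times M}$ of a scalar loss $L$, the gradients (products with the [[thoughts/Vector calculus#Jacobian matrix|Jacobian]]) are: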

$$
dX = dY\,W^{T}, \qquad dW = X^{T}\,dY
$$

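For `Y = X @ W` and an upstream gradient `dY` of a scalar loss, the standard results are `dX = dY @ W.T` and `dW = X.T @ dY`. A minimal NumPy sketch can check both against central finite differences. The shapes, the random seed, the helper `numerical_grad`, and the linear loss `sum(G * Y)` are all assumptions chosen for illustration, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 4, 3, 2  # hypothetical small shapes
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, M))
G = rng.standard_normal((N, M))  # fixed weights defining a scalar loss


def loss(X, W):
    # Scalar loss L = sum(G * (X @ W)), so dL/dY equals G exactly.
    return np.sum(G * (X @ W))


dY = G
dX = dY @ W.T  # (N, M) @ (M, D) -> (N, D), same shape as X
dW = X.T @ dY  # (D, N) @ (N, M) -> (D, M), same shape as W


def numerical_grad(f, A, eps=1e-6):
    """Central finite differences of scalar f() w.r.t. each entry of A (perturbed in place)."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        plus = f()
        A[idx] = old - eps
        minus = f()
        A[idx] = old  # restore the original entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


num_dX = numerical_grad(lambda: loss(X, W), X)
num_dW = numerical_grad(lambda: loss(X, W), W)
print(np.allclose(dX, num_dX, atol=1e-6), np.allclose(dW, num_dW, atol=1e-6))
```

A quick sanity rule follows from the shapes: each gradient must match the shape of the tensor it belongs to, which leaves only one valid way to order and transpose the factors.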
