
Think of using autoencoders to extract representations.

sparsity allows us to interpret hidden layers and internal representations of Transformer models.

graph TD
    A[Input X] --> B[Layer 1]
    B --> C[Layer 2]
    C --> D[Latent Features Z]
    D --> E[Layer 3]
    E --> F[Layer 4]
    F --> G[Output X']

    subgraph Encoder
        A --> B --> C
    end

    subgraph Decoder
        E --> F
    end

    style D fill:#c9a2d8,stroke:#000,stroke-width:2px,color:#fff
    style A fill:#98FB98,stroke:#000,stroke-width:2px
    style G fill:#F4A460,stroke:#000,stroke-width:2px

see also latent space

definition

$$\begin{aligned} \text{Enc}_{\Theta_1}&: \mathbb{R}^d \to \mathbb{R}^q \\ \text{Dec}_{\Theta_2}&: \mathbb{R}^q \to \mathbb{R}^d \\[12pt] &\because q \ll d \end{aligned}$$

loss function: $l(x) = \|\text{Dec}_{\Theta_2}(\text{Enc}_{\Theta_1}(x)) - x\|$
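
A minimal PyTorch sketch of this definition (the sizes `d=784`, `q=32`, the hidden width, and the L1 weight are assumptions for illustration, not taken from the note): the encoder maps $\mathbb{R}^d \to \mathbb{R}^q$, the decoder maps back, and training minimizes the reconstruction error; an L1 penalty on $z$ is one common way to encourage the sparse, interpretable latents mentioned above.

```python
import torch
import torch.nn as nn

d, q = 784, 32  # assumed input and latent dimensions (q << d)

class Autoencoder(nn.Module):
    def __init__(self, d: int, q: int):
        super().__init__()
        # Enc_{Theta_1}: R^d -> R^q
        self.enc = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, q))
        # Dec_{Theta_2}: R^q -> R^d
        self.dec = nn.Sequential(nn.Linear(q, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

model = Autoencoder(d, q)
x = torch.randn(16, d)                          # dummy batch
x_hat, z = model(x)

recon = ((x_hat - x) ** 2).sum(dim=1).mean()    # ||Dec(Enc(x)) - x||^2
sparsity = 1e-3 * z.abs().mean()                # optional L1 term for sparse latents
loss = recon + sparsity
loss.backward()
```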

contrastive representation learning

The goal of contrastive representation learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart. Contrastive learning can be applied to both supervised and unsupervised settings. article

intuition: provide positive and negative pairs when optimizing the loss function.
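
A sketch of one common contrastive objective, an InfoNCE-style loss (the temperature and the way pairs are formed here are assumptions for illustration): each anchor is pulled toward its positive pair and pushed away from the other samples in the batch, which serve as negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.1):
    """InfoNCE-style contrastive loss.

    anchor, positive: (N, q) embeddings; row i of `positive` is the positive
    pair for row i of `anchor`, and all other rows act as negatives.
    """
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature       # (N, N) cosine similarities
    targets = torch.arange(a.size(0))      # diagonal entries hold the positives
    return F.cross_entropy(logits, targets)

# toy usage: two augmented "views" of the same batch share a row index
z1, z2 = torch.randn(8, 32), torch.randn(8, 32)
loss = info_nce(z1, z2)
```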


training objective

we want a small reconstruction error, i.e. to minimize

$$\|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|_2^2$$

we also want the latent space distribution to look similar to an isotropic Gaussian!

Kullback-Leibler divergence

denoted as $D_{\text{KL}}(P \parallel Q)$

definition

The statistical distance measuring how a model probability distribution $Q$ differs from a true probability distribution $P$:

$$D_{\text{KL}}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right)$$

alternative form:

$$\begin{aligned} \text{KL}(p \parallel q) &= E_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right] \\ &= \int_x p(x) \log \frac{p(x)}{q(x)} \, dx \end{aligned}$$
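
A small numeric sketch of the discrete form (the two example distributions are made up for illustration):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.4, 0.4, 0.2])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q
print(kl_divergence(p, q))       # small positive value; 0 only when P == Q
print(kl_divergence(q, p))       # note: KL divergence is not symmetric
```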

variational autoencoders

idea: add a Gaussian sampler after computing the latent space.

objective function:

$$\min \left(\sum_{x} \|\text{Dec}(\text{Sampler}(\text{Enc}(x))) - x\|^2_2 + \lambda \sum_{i=1}^{q}\left(-\log (\sigma_i^2) + \sigma_i^2 + \mu_i^2\right)\right)$$
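
The penalty term matches, up to an additive constant and the weight $\lambda$, the closed-form KL divergence between the encoder's Gaussian $\mathcal{N}(\mu_i, \sigma_i^2)$ and the standard normal, $\text{KL} = \tfrac{1}{2}(\mu_i^2 + \sigma_i^2 - \log \sigma_i^2 - 1)$, which is what pulls the latent distribution toward the isotropic Gaussian discussed above. A minimal PyTorch sketch of this objective (the network sizes, $\lambda$, and the use of a `log_var` head are assumptions for illustration):

```python
import torch
import torch.nn as nn

d, q, lam = 784, 32, 1.0   # assumed dimensions and KL weight

enc = nn.Linear(d, 2 * q)  # predicts mu and log(sigma^2) for each latent dimension
dec = nn.Linear(q, d)

def vae_loss(x: torch.Tensor) -> torch.Tensor:
    mu, log_var = enc(x).chunk(2, dim=1)
    # Sampler: reparameterization trick, z = mu + sigma * eps with eps ~ N(0, I)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    x_hat = dec(z)

    recon = ((x_hat - x) ** 2).sum(dim=1)                       # ||Dec(Sampler(Enc(x))) - x||^2
    kl_term = (-log_var + log_var.exp() + mu ** 2).sum(dim=1)   # sum_i (-log sigma_i^2 + sigma_i^2 + mu_i^2)
    return (recon + lam * kl_term).mean()

loss = vae_loss(torch.randn(16, d))
loss.backward()
```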