Raghu et al. (2017) proposed a way to compare two representations that is both invariant to affine transformation and fast to compute,¹ based on canonical correlation analysis (CCA), which is invariant to invertible linear transformation.

definition

Given a dataset $X = \{x_1, \cdots, x_m\}$ and a neuron $i$ on layer $l$, we define $z_i^l$ to be the vector of that neuron's outputs on $X$:

$$z^l_i = \left(z^l_i(x_1), \cdots, z^l_i(x_m)\right)$$
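As a minimal sketch (the `model`, `layer`, and `dataset` objects are hypothetical stand-ins, assuming a PyTorch model), the activation matrix whose columns are the $z_i^l$ can be collected with a forward hook:

```python
import torch

def layer_activations(model, layer, dataset):
    """Return an (m, num_neurons) matrix whose column i is z_i^l over X."""
    acts = []
    hook = layer.register_forward_hook(
        lambda module, inp, out: acts.append(out.detach().flatten(1))
    )
    with torch.no_grad():
        for x in dataset:          # batches covering x_1, ..., x_m
            model(x)
    hook.remove()
    return torch.cat(acts, dim=0)  # row j holds all neuron outputs on x_j
```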

SVCCA proceeds as follows (a numpy sketch of the full procedure is given after the list):

  1. Input: two (not necessarily different) sets of neurons $l_1 = \{z_1^{l_1}, \cdots, z_{m_1}^{l_1}\}$ and $l_2 = \{z_1^{l_2}, \cdots, z_{m_2}^{l_2}\}$

  2. Step 1: Perform SVD of each subspace to get subspaces $l'_1 \subset l_1$ and $l'_2 \subset l_2$ spanned by the top singular directions (those explaining most of the variance; 99% in the paper)

  3. Step 2: Compute the canonical correlation similarity between $l'_1$ and $l'_2$; the maximal correlation between $X$ and $Y$ can be expressed as:

    $$\max_{a,b} \frac{a^T \Sigma_{XY} b}{\sqrt{a^T \Sigma_{XX} a}\,\sqrt{b^T \Sigma_{YY} b}}$$

    where $\Sigma_{XX}, \Sigma_{XY}, \Sigma_{YX}, \Sigma_{YY}$ are the covariance and cross-covariance matrices.

    By performing the change of basis $\tilde{x}_1 = \Sigma_{XX}^{\frac{1}{2}} a$ and $\tilde{y}_1 = \Sigma_{YY}^{\frac{1}{2}} b$ and applying Cauchy-Schwarz, we recover an eigenvalue problem:

    $$\tilde{x}_1 = \arg\max_x \left[\frac{x^T \Sigma_{XX}^{-\frac{1}{2}} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-\frac{1}{2}} x}{\|x\|^2}\right]$$

  4. Output: aligned directions $(\tilde{z}_i^{l_1}, \tilde{z}_i^{l_2})$ and correlations $\rho_i$
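
A minimal numpy sketch of the full procedure (the 99% variance threshold follows the paper; the `eps` regularizer, function names, and the use of the whitened cross-covariance $\Sigma_{XX}^{-1/2}\Sigma_{XY}\Sigma_{YY}^{-1/2}$, whose singular values are the $\rho_i$, are my own choices):

```python
import numpy as np

def svcca(Z1, Z2, var_kept=0.99, eps=1e-10):
    """Z1: (m, n1), Z2: (m, n2) activation matrices; column i is the vector z_i^l."""
    # Center each neuron's responses over the dataset X.
    Z1 = Z1 - Z1.mean(axis=0)
    Z2 = Z2 - Z2.mean(axis=0)

    # Step 1: SVD of each subspace, keeping the top singular directions
    # that explain `var_kept` of the variance (l'_1 and l'_2).
    def reduce(Z):
        U, s, _ = np.linalg.svd(Z, full_matrices=False)
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_kept)) + 1
        return U[:, :k] * s[:k]      # data projected onto the kept directions

    X, Y = reduce(Z1), reduce(Z2)

    # Step 2: CCA between the reduced subspaces. The singular values of
    # Sigma_XX^{-1/2} Sigma_XY Sigma_YY^{-1/2} are the canonical correlations,
    # which is equivalent to the eigenvalue problem above.
    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Sxx = X.T @ X + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y + eps * np.eye(Y.shape[1])
    A = inv_sqrt(Sxx) @ (X.T @ Y) @ inv_sqrt(Syy)
    rho = np.clip(np.linalg.svd(A, compute_uv=False), 0.0, 1.0)
    return rho                       # rho_1 >= rho_2 >= ...
```

The aligned directions could be recovered from the singular vectors of the whitened cross-covariance; the sketch only returns the correlations $\rho_i$, whose mean is a common single-number summary.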

distributed representations

SVCCA has no preference for representations that are neuron (axis) aligned.²
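
To see this, here is a self-contained numpy check (synthetic data, my own helper names): rotating a representation so that each feature is spread across all neurons leaves the canonical correlations at ~1, i.e. the distributed copy is judged identical to the axis-aligned one.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 20
Z = rng.standard_normal((m, n))                   # m examples, n neurons (axis-aligned features)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random rotation
Z_rot = Z @ Q                                     # same information, distributed across all neurons

def cca_correlations(X, Y, eps=1e-10):
    """Canonical correlations via the whitened cross-covariance (Step 2 above)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    A = (inv_sqrt(X.T @ X + eps * np.eye(X.shape[1]))
         @ (X.T @ Y)
         @ inv_sqrt(Y.T @ Y + eps * np.eye(Y.shape[1])))
    return np.clip(np.linalg.svd(A, compute_uv=False), 0.0, 1.0)

print(cca_correlations(Z, Z_rot).min())           # ~1.0: no penalty for the rotated copy
```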

References

  • Raghu, M., Gilmer, J., Yosinski, J., & Sohl-Dickstein, J. (2017). SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. arXiv preprint arXiv:1706.05806.

Footnotes

  1. Invariance to affine transformation allows comparison between different layers of a network, and the speed allows many more comparisons to be computed than with previous methods.

  2. Experiments were conducted with a convolutional network and a residual network:

    convnet: conv --> conv --> bn --> pool --> conv --> conv --> conv --> conv --> bn --> pool --> fc --> bn --> fc --> bn --> out

    resnet: conv --> (x10 c/bn/r block) --> (x10 c/bn/r block) --> (x10 c/bn/r block) --> bn --> fc --> out

    Note that SVD and CCA work with $\text{span}(z_1, \cdots, z_m)$ instead of being axis-aligned to the $z_i$ directions. This is important if representations are distributed across many dimensions, which we observe in cross-branch superpositions!
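
For concreteness, the convnet sequence in footnote 2 might look like the sketch below in PyTorch; the channel widths, kernel sizes, ReLU nonlinearities, and the 32x32 RGB (CIFAR-10-like) input are my assumptions, not values from the paper.

```python
import torch.nn as nn

# A rough rendering of: conv conv bn pool conv conv conv conv bn pool fc bn fc bn out.
# All widths and kernel sizes below are illustrative guesses.
convnet = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.BatchNorm2d(64),
    nn.MaxPool2d(2),                       # 32x32 -> 16x16
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
    nn.BatchNorm2d(128),
    nn.MaxPool2d(2),                       # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Linear(256, 256), nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Linear(256, 10),                    # out
)
```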