# Practicing PCA on Two DatasetsÂ¶

In this exercise, we explore the application of PCA on two different datasets. One of them is synthetic and the other one is available through the sklearn package.

# SubmissionÂ¶

- There are four tasks for you.
- Report the results and answer the questions in the pdf file that you would submit along with your other solutions.
- Additionally, submit your code in the same Jupiter notebook format. (keep the overall format of the notebook unchanged)

## PackagesÂ¶

First of all, let's import the packages we need for this assignment.

```
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from IPython.utils.frame import extract_module_locals
from scipy.linalg import svd
from sklearn.datasets import fetch_lfw_people
from sklearn.utils import shuffle
from sklearn.decomposition import KernelPCA, PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# get correct path
try: WORKING_DIR = os.path.dirname(extract_module_locals()[1]['__session__'])
except: WORKING_DIR = os.getcwd()
```

## Datasets:Â¶

First dataset is LFW (Labeled Faces in the Wild) dataset. And the second one is a synthetic dataset that will be explained in a few

### First Dataset: Labeled Faces in the Wild (LFW)Â¶

The **Labeled Faces in the Wild (LFW)** dataset is a collection of face images. This dataset contains images of famous individuals collected from the web, with each image labeled by the identity of the person in the image.

#### Features:Â¶

**Image data**: Each image is a grayscale 2D image of size 62x47 pixels, representing a person's face.**Label**: The name of the person in the image, which serves as the class label.

#### Dataset Properties:Â¶

**Number of Samples**:- Full dataset: 13,233 images.
- Reduced version (available via
`sklearn`

): 1,288 images. - "George W Bush" class: 530 images.

**Number of Classes**:- Full dataset: 5749 unique individuals, with varying numbers of images per person.
- Reduced version (available via
`sklearn`

): 7 classes.

**Image Size**: Each image is 62x47 pixels (grayscale), giving 2,914 features per image (62 Ã— 47 = 2,914).**Format**: The dataset provides images as NumPy arrays and labels as strings.

#### Loading LFW dataset:Â¶

```
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.5)
n_samples_LFW, image_height, image_width = lfw_people.images.shape
X_LFW = lfw_people.data
Y_LFW = lfw_people.target
Y_names_LFW = lfw_people.target_names
n_classes_LFW = Y_names_LFW.shape[0]
print('Dataset properties:')
print('\t Number of data points: %d' % X_LFW.shape[0])
print('\t Number of features: %d' % X_LFW.shape[1])
print('\t Number of classes: %d' % n_classes_LFW)
print('\t Width of each image: %d' % image_width)
print('\t Height of each image: %d' % image_height)
```

Dataset properties: Number of data points: 1288 Number of features: 2914 Number of classes: 7 Width of each image: 47 Height of each image: 62

#### Visualization:Â¶

```
def plot_faces(images, labels, names, n_row, n_col):
"""Helper function to plot a gallery of portraits"""
f = plt.figure(figsize=(8, 3))
for i in range(n_row * n_col):
subfigure = f.add_subplot(n_row, n_col, i + 1)
subfigure.imshow(images[i].reshape((image_height, image_width)), cmap=plt.cm.gray)
subfigure.set_title(names[labels[i]], fontsize=10)
# Removing the axes
plt.xticks(())
plt.yticks(())
plt.show()
plot_faces(X_LFW, Y_LFW, Y_names_LFW, 2, 5)
```

### Second Dataset: Two Noisy Circles (TNC)Â¶

This dataset contains synthetic samples of two noisy circles in 2D space. Each circle has a different radius, and noise is added to the points to simulate imperfect real-world data.

#### Features:Â¶

**coord_x**: The x-coordinate of the data point in 2D space.**coord_y**: The y-coordinate of the data point in 2D space.**label**: A categorical label indicating to which circle the point belongs:`1`

: Points belonging to the inner circle (radius = 2).`2`

: Points belonging to the outer circle (radius = 10).

#### Data Description:Â¶

**Inner Circle (Label 1)**:- Radius: 2.
- Number of points: 1500.
- Gaussian noise added to both x and y coordinates with a noise level of 0.2.

**Outer Circle (Label 2)**:- Radius: 10 units.
- Number of points: 1500.
- Gaussian noise added with the same level of 0.2.

#### Dataset Properties:Â¶

**Number of Samples**: 3000 (1500 samples for each circle).**Noise Level**: 0.2 (Gaussian noise applied independently to the x and y coordinates).**Labels**: 1 for the inner circle, 2 for the outer circle.

#### Generating TNC Dataset:Â¶

```
# Parameters for both circles
radius1 = 2
radius2 = 10
num_points = 1500
noise_level = 0.2
```

```
# Function to generate noisy points on a circle with a label
def generate_noisy_circle_points_with_labels(radius: int, num_points: int, noise_level: float, label: int):
# Random angles for polar coordinates
theta = np.random.uniform(0, 2 * np.pi, num_points)
# Coordinates on the circle (without noise)
coord_x = radius*np.cos(theta)
coord_y = radius*np.sin(theta)
# Add noise
coord_x += np.random.normal(0,noise_level,num_points)
coord_y += np.random.normal(0,noise_level,num_points)
# Create labels
labels = np.full(num_points, label)
return coord_x, coord_y, labels
if not os.path.exists(SYNTHETIC_FILE:=os.path.join(WORKING_DIR,'synthetic_circle_data.csv')):
# Generate points for both circles
coord_x1, coord_y1, labels1 = generate_noisy_circle_points_with_labels(radius1, num_points, noise_level, 1)
coord_x2, coord_y2, labels2 = generate_noisy_circle_points_with_labels(radius2, num_points, noise_level, 2)
# Combine the two circles
coord_x = np.concatenate([coord_x1, coord_x2])
coord_y = np.concatenate([coord_y1, coord_y2])
labels = np.concatenate([labels1, labels2])
# Create DataFrame for easy manipulation and shuffle
data = pd.DataFrame({'coord_x': coord_x, 'coord_y': coord_y, 'label': labels})
data = shuffle(data).reset_index(drop=True)
# Save the shuffled data to a CSV file
data.to_csv(SYNTHETIC_FILE, index=False)
```

#### Loading TNC dataset:Â¶

```
# Load the dataset
data = pd.read_csv(SYNTHETIC_FILE)
# Split features and labels
X_TNC = data[['coord_x', 'coord_y']].values
Y_TNC = data['label'].values
```

#### Visualization:Â¶

When plotted, the dataset exhibits two concentric, noisy circles. The inner circle is labeled as `1`

and colored red, while the outer circle is labeled as `2`

and colored blue. Both circles have Gaussian noise applied, which makes the data more realistic by introducing variance in the points' positions.

```
# Plot the points, color-coded by label (1 -> red, 2 -> blue)
plt.figure(figsize=(6, 6))
plt.scatter(data['coord_x'], data['coord_y'], c=data['label'], cmap='bwr', s=5)
# Set equal aspect ratio to make the circles look accurate
plt.gca().set_aspect('equal', adjustable='box')
# Set labels and title
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Synthetic Samples of Two Noisy Circles')
plt.show()
```

# Task 1: Eigenfaces (5 points) Â¶

**NOTE:** In this task you only work with LFW dataset.

The below code is supposed to generate the "average faces" as well as the first "eigen faces" for George W Bush class.

**NOTE:** For the rest of the assignment you just work with "George W Bush" class

## Select "George W Bush" classÂ¶

Run the below code to select a specific class of faces.

```
class_name_bush = 'George W Bush'
class_indx_bush = list(Y_names_LFW).index(class_name_bush)
X_bush = X_LFW[Y_LFW == class_indx_bush, :]
Y_bush = Y_LFW[Y_LFW == class_indx_bush]
plot_faces(X_bush, Y_bush, Y_names_LFW, 2, 5)
print(X_bush.shape)
```

(530, 2914)

## ImplementationÂ¶

In this section you need to implement `centeralize_data()`

and `pca_components()`

functions.

We use SVD factorization in order to implement PCA. In the following link you can find useful information of how to do so. It explains the connection between PCA and SVD. https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca

You may check the video with link https://youtu.be/nbBvuuNVfco?si=SfJSE85fjMUMiboK for detailed explanation of SVD.

```
def centeralize_data(data):return data-(data_mean:=np.mean(data, axis=0).reshape(1,-1)), data_mean
def pca_components(Vt, n_components):return Vt[:n_components]
def normalized_svd(data):
XN, data_mean = centeralize_data(data)
return *svd(XN), data_mean
def average_image_class(images):return np.mean(images, axis=0)
def eigen_face_class(images):return pca_components(normalized_svd(images)[2], n_components=1)
def plot_class_representatives(images, class_name, aggregator):
f = plt.figure(figsize=(2, 2))
subfigure = f.add_subplot(1, 1, 1)
subfigure.imshow(aggregator(images).reshape((image_height, image_width)), cmap=plt.cm.gray)
subfigure.set_title(class_name)
# Removing the axes
plt.xticks(())
plt.yticks(())
plt.show()
plot_class_representatives(X_bush, class_name_bush, aggregator=average_image_class)
plot_class_representatives(X_bush, class_name_bush, aggregator=eigen_face_class)
```

# Task 2: PCA Transformation and Reconstructing (15 points) Â¶

In this task, we focus on implementing two fundamental functions for Principal Component Analysis (PCA):

`pca_transform()`

`pca_inverse_transform()`

## Part A (10 points): Â¶

: Projects the original data onto a new space using PCA via Singular Value Decomposition (SVD).`pca_transform()`

### Inputs:Â¶

:`X`

(ndarray)- The input data matrix to be transformed. Each row represents a sample, and each column represents a feature.

:`n_components`

(int)- The number of principal components to retain.
- Controls the dimensionality reduction by selecting the top
`n_components`

. - For instance, setting
`n_components=2`

reduces the data to 2D space by projecting it onto the first two principal components.

### Outputs:Â¶

:`transformed_data`

(ndarray)- The data projected onto the top
`n_components`

principal components, resulting in a reduced-dimension representation of the original data.

- The data projected onto the top
:`Vt`

(ndarray)- The right singular vectors from SVD, needed for inverse transformation to reconstruct the original data.

:`data_mean`

(ndarray)- The mean of the original data, which is subtracted during the transformation process and added back during the reconstruction.

```
def pca_transform(X, n_components):
"""
Transforms the data into a new space using PCA via SVD.
Parameters:
- X: ndarray
The input data matrix to be transformed.
- n_components: int
The number of principal components to retain.
Returns:
- transformed_data: ndarray
The data projected onto the top n_components principal components.
- Vt: ndarray
The right singular vectors from SVD, needed for inverse transformation.
- data_mean: ndarray
The mean of the original data, required to reconstruct the data.
"""
U,s,*result=normalized_svd(X)
return U[:, :n_components] * s[:n_components], *result
```

### We apply PCA transform to both datasets:Â¶

For TNC dataset we set n_components=2

For LFW dataset we set n_components=100

**IMPORTANT NOTE:** Remember that in TNC dataset, we are working with X_TNC that can have two labels. But in the LFW dataset, we are just working with X_bush.

```
n_components = 100
# Transform the X_bush data using the specified number of principal components
T_bush, Vt_bush, mean_bush = pca_transform(X_bush, n_components)
n_components = 2
# Transform the X_TNC data using the specified number of principal components
T_TNC, Vt_TNC, mean_TNC = pca_transform(X_TNC, n_components)
```

## Part B (5 points): Â¶

: Reconstructs the original data from its transformed representation.`pca_inverse_transform()`

:`transformed_data`

(ndarray)- The data that has been projected into the lower-dimensional space using PCA.
- This is the result of the
`pca_transform()`

function and represents the coordinates of the original data in the new principal component space.

:`Vt`

(ndarray)- The right singular vectors from SVD (the principal components), used for the reconstruction.
- This matrix is obtained from the SVD process, and only the top
`n_components`

vectors are used for reconstruction.

:`n_components`

(int)- The number of principal components that were used for the transformation.
- Determines how many components to use in the reconstruction process.

:`data_mean`

(ndarray)- The mean of the original data.
- This is used to reverse the centering operation that was applied to the data before performing PCA.

```
def pca_inverse_transform(transformed_data, Vt, n_components, data_mean):
"""
Reconstructs the original data from its transformed representation.
Parameters:
- transformed_data: ndarray
The data in the reduced-dimensional space.
- Vt: ndarray
The right singular vectors from SVD, used to reconstruct the original space.
- n_components: int
The number of principal components that were used for the transformation.
- data_mean: ndarray
The mean of the original data (used for centering).
Returns:
- reconstructed_data: ndarray
The data reconstructed back into the original space.
"""
return transformed_data @ pca_components(Vt, n_components) + data_mean
```

### We apply PCA inverse transform to both new datasets:Â¶

For TNC dataset we set n_components=2

For LFW dataset we set n_components=100

```
n_components = 100
# Reconstruct the X_bush data back to the original image space.
reconstructed_bush = pca_inverse_transform(T_bush, Vt_bush, n_components, mean_bush)
n_components = 2
# Reconstruct the X_TNC data back to the original 2D space.
reconstructed_TNC = pca_inverse_transform(T_TNC, Vt_TNC, n_components, mean_TNC)
```

### TNC Visualization:Â¶

Since we are in the 2 dimension for the circle dataset it is easy to plot the transformation results:

As shown in the TNC visualization, the dimensionality of the data is not being reduced, as we maintain the number of components at two. This ensures no information is lost. The only noticeable difference is a rotation of the data.

Observe that here, we plot each coordinate with its corresponding label.

```
# Plotting the results
fig, axs = plt.subplots(1, 3, figsize=(12, 4))
# Plotting Original Data for X_TNC
axs[0].scatter(X_TNC[:, 0], X_TNC[:, 1], c=Y_TNC, cmap='bwr', s=5)
axs[0].set_title('Original X_TNC')
# Plotting Transformed Data for X_TNC
axs[1].scatter(T_TNC[:, 0], T_TNC[:, 1], c=Y_TNC, cmap='bwr', s=5)
axs[1].set_title('Transformed X_TNC')
# Plotting Reconstructed Data for X_TNC
axs[2].scatter(reconstructed_TNC[:, 0], reconstructed_TNC[:, 1], c=Y_TNC, cmap='bwr', s=5)
axs[2].set_title('Reconstructed X_TNC')
plt.tight_layout()
plt.show()
```

### LFW Visualization:Â¶

In the case of the LFW dataset, we are working with around 500 samples, each represented in an approximately 3000-dimensional feature space. Visualizing data in such a high-dimensional space is not feasible, and even after applying PCA to reduce the dimensions to around 200, visual interpretation remains challenging. To address this, we project the data onto a 2D plane, enabling us to better visualize and understand the underlying structure. Although this 2D representation no longer reflects the pixel values of the original images, it captures the most important variance in the data, allowing us to interpret patterns and relationships more easily in the reduced space.

As a result, we can only see the scatter plot of the first two most important new features.

```
# Extract the first two principal components for plotting
pc1 = T_bush[:, 0] # First principal component (PC1)
pc2 = T_bush[:, 1] # Second principal component (PC2)
# Plotting the first two principal components
plt.figure(figsize=(5, 4)); plt.scatter(pc1, pc2, alpha=0.7, edgecolors='b')
plt.xlabel('Principal Component 1'); plt.ylabel('Principal Component 2'); plt.title('Projection onto the First Two Principal Components')
plt.grid(True); plt.show()
```

Moreover, some information is lost when reducing the dimensionality of the data. This can be observed by plotting the reconstructed image for a specific instance.

```
index = 0 # Choose an index of the image you want to show (e.g., the first image)
original_image = X_bush[index].reshape(image_height, image_width)
reconstructed_image = reconstructed_bush[index].reshape(image_height, image_width)
# Plotting the results
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
# Plotting Original Image from X_bush
axs[0].imshow(original_image, cmap='gray')
axs[0].set_title('Original Image (X_bush)')
# Plotting Reconstructed Image from X_bush
axs[1].imshow(reconstructed_image, cmap='gray')
axs[1].set_title('Reconstructed Image (X_bush)')
plt.tight_layout()
plt.show()
```

# Task 3: Average Reconstruction Error for LFW (30 points)Â¶

Split the image dataset into train (400 images) and test (rest of the data) datasets. Train PCA with [2, 10, 30, 60, 100] components respectively. Reconstruction error is defined as

\begin{align} \text{error}=\frac{1}{n}\sum_{i=1}^n||x_i-\text{reconstruct}(pca(x_i))||^2_2 \end{align}

## Part A (20 points): Â¶

Plot average reconstruction error on training and testing data points with the following requirements:

- X-axis shows number of components.
- Y-axis shows reconstruction error.
- Draw two graphs, one for training and the other line for testing.

```
# Define the number of components to test in PCA
c_components = [2, 10, 30, 60, 100]
# Initialize lists to store the reconstruction errors for training and testing data
train_errors, test_errors = [], []
# Initialize deterministic seed
SEED = 42
X_train, X_test = train_test_split(X_bush, train_size=400, random_state=SEED)
# \text{error}=\frac{1}{n}\sum_{i=1}^n||x_i-\text{reconstruct}(pca(x_i))||^2_2
def mse(train_data, reconstructed): return np.mean(np.sum((train_data - reconstructed)**2, axis=1))
# Loop through each specified number of components for PCA
for n_components in c_components:
# Apply PCA and then inverse PCA to the training data
transformed_train, Vt_train, mean_train = pca_transform(X_train, n_components)
# Calculate the Mean Squared Error (MSE) as the reconstruction error for the training set
train_errors.append(mse(X_train,
pca_inverse_transform(transformed_train,
Vt_train, n_components, mean_train)))
# Normalize the test data. Transform the test data using the train data's PCA components # and reconstruct the test data.
# Calculate the Mean Squared Error (MSE) as the reconstruction error for the test set
test_errors.append(mse(X_test,
pca_inverse_transform((X_test - mean_train) @ pca_components(Vt_train, n_components).T,
Vt_train, n_components, mean_train)))
# Print the average reconstruction errors for each number of components
for i,n_components in enumerate(c_components):
print(f"Components: {n_components}\n Train Error: {train_errors[i]:.4f}\n Test Error: {test_errors[i]:.4f}")
```

Components: 2 Train Error: 40.2048 Test Error: 44.1277 Components: 10 Train Error: 21.6275 Test Error: 25.1425 Components: 30 Train Error: 11.6392 Test Error: 15.6092 Components: 60 Train Error: 6.6892 Test Error: 11.4092 Components: 100 Train Error: 3.7635 Test Error: 8.7075

```
def plot_eval_results(c_components, train_error, test_error):
plt.figure(figsize=(6 * 1.5, 5 * 1.2))
plt.plot(c_components, train_error, 'b-o', label='train error')
plt.plot(c_components, test_error, 'r-o', label='test error')
plt.xlabel('n_components'); plt.ylabel('loss');
plt.legend(); plt.grid(True); plt.show()
plot_eval_results(c_components, train_errors, test_errors)
```

## Part B (10 points): Â¶

Explains the difference between the two graphs (8 points).

What would the error be if we compute it for the TNC dataset while using two components and 2000 samples for training? (2 points)

```
# Define the number of components to test in PCA
n_components = 2
TNC_train, TNC_test = train_test_split(X_TNC, train_size=2000, random_state=SEED)
transformed_TNC_train, Vt_TNC_train, mean_TNC_train = pca_transform(TNC_train, n_components)
train_error = mse(TNC_train,
pca_inverse_transform(transformed_TNC_train,
Vt_TNC_train, n_components, mean_TNC_train))
test_error = mse(TNC_test,
pca_inverse_transform((TNC_test - mean_TNC_train) @ pca_components(Vt_TNC_train, n_components).T,
Vt_TNC_train, n_components, mean_TNC_train))
# Print the average reconstruction errors for each number of components
print(f'Average reconstruction error for train data is {train_error}')
print(f'Average reconstruction error for test data is {test_error}')
```

Average reconstruction error for train data is 6.217495281310197e-30 Average reconstruction error for test data is 1.2387388672271008e-30

# Task 4: Kernel PCA (30 points)Â¶

KernelPCA is an extension of PCA that uses kernel methods to perform dimensionality reduction in a high-dimensional space. Unlike standard PCA, which finds linear components, Kernel PCA can capture non-linear relationships in the data by implicitly mapping it into a higher-dimensional space using a kernel function. In this task, we will use the Radial Basis Function (RBF) kernel, which is similar to a Gaussian kernel.

The `gamma`

parameter in the RBF kernel influences the spread of the kernel function. It is analogous to the inverse of the standard deviation (`gamma=1/sigma`

) of the Gaussian. A lower `gamma`

value results in a wider influence of each training point, while a higher `gamma`

value makes the influence more localized. You can read more about the `gamma`

parameter here.

For this task, you don't need to implement Kernel PCA from scratch; use scikit-learn's `KernelPCA`

(https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html).

An example setup for scikit-learn's Kernel PCA is:

```
KernelPCA(kernel='rbf', n_components=100, gamma=0.1, fit_inverse_transform=True)
```

Now, let's proceed with the following parts:

## Part A (10 points): Â¶

**Apply Kernel PCA and Plot Transformed Data**:Â¶

First plot the original TNC data again. Now start by applying Kernel PCA on the TNC dataset using different values of `gamma`

(e.g., 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1). For each value of `gamma`

, plot the transformed data to visualize the effect of the RBF kernel on the dataset. This step will help you understand how `gamma`

influences the data's representation in the transformed space.

For this part you need to plot 12 figures (one for original data, and the other 11 ones for different values of `gamma`

).

```
# Load the dataset
data = pd.read_csv(SYNTHETIC_FILE)
# Split features and labels
X_TNC = data[['coord_x', 'coord_y']].values
Y_TNC = data['label'].values
```

```
# Define different values of gamma to test
gamma_values = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1]
n_components = 2
# Standardize the features
scaler = StandardScaler()
X_TNC_scaled = scaler.fit_transform(X_TNC)
# Create subplots to visualize the transformed data for each gamma
plt.figure(figsize=(20, 15))
# Plot the original data before applying Kernel PCA
plt.subplot(3, 4, 1); plt.scatter(X_TNC_scaled[:, 0], X_TNC_scaled[:, 1], c=Y_TNC, cmap='bwr')
plt.title('Original Data'); plt.xlabel('coord_x'); plt.ylabel('coord_y')
# Set the limits for the x and y axes
x_limits, y_limits = (-4, 4), (-4, 4)
# Apply Kernel PCA for each gamma value
for idx, gamma in enumerate(gamma_values):
# Apply Kernel PCA
kpca = KernelPCA(n_components=n_components,kernel='rbf',gamma=gamma)
X_kpca = kpca.fit_transform(X_TNC_scaled)
# Plot the transformed data
plt.subplot(3, 4, idx + 2);
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=Y_TNC, cmap='viridis')
plt.title(f'Gamma = {gamma}')
plt.xlabel('First principal component'); plt.ylabel('Second principal component');
# Set fixed x and y axis limits
plt.xlim(x_limits); plt.ylim(y_limits);
plt.tight_layout(); plt.show();
```