Practicing PCA on Two Datasets¶
In this exercise, we explore the application of PCA on two different datasets. One of them is synthetic and the other one is available through the sklearn package.
Submission¶
- There are four tasks for you.
- Report the results and answer the questions in the PDF file that you submit along with your other solutions.
- Additionally, submit your code in the same Jupyter notebook format (keep the overall format of the notebook unchanged).
Packages¶
First of all, let's import the packages we need for this assignment.
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from IPython.utils.frame import extract_module_locals
from scipy.linalg import svd
from sklearn.datasets import fetch_lfw_people
from sklearn.utils import shuffle
from sklearn.decomposition import KernelPCA, PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# get correct path
try:
    # Running inside a notebook: derive the directory from the notebook session file
    WORKING_DIR = os.path.dirname(extract_module_locals()[1]['__session__'])
except Exception:
    # Fallback (e.g., when not running as a notebook): use the current working directory
    WORKING_DIR = os.getcwd()
Datasets:¶
The first dataset is LFW (Labeled Faces in the Wild); the second is a synthetic dataset that is described a few sections below.
First Dataset: Labeled Faces in the Wild (LFW)¶
The Labeled Faces in the Wild (LFW) dataset is a collection of face images. This dataset contains images of famous individuals collected from the web, with each image labeled by the identity of the person in the image.
Features:¶
- Image data: Each image is a grayscale 2D image of size 62x47 pixels, representing a person's face.
- Label: The name of the person in the image, which serves as the class label.
Dataset Properties:¶
- Number of Samples:
  - Full dataset: 13,233 images.
  - Reduced version (available via sklearn): 1,288 images.
  - "George W Bush" class: 530 images.
- Number of Classes:
  - Full dataset: 5,749 unique individuals, with varying numbers of images per person.
  - Reduced version (available via sklearn): 7 classes.
- Image Size: Each image is 62x47 pixels (grayscale), giving 2,914 features per image (62 × 47 = 2,914).
- Format: The dataset provides images as NumPy arrays and labels as strings.
Loading LFW dataset:¶
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.5)
n_samples_LFW, image_height, image_width = lfw_people.images.shape
X_LFW = lfw_people.data
Y_LFW = lfw_people.target
Y_names_LFW = lfw_people.target_names
n_classes_LFW = Y_names_LFW.shape[0]
print('Dataset properties:')
print('\t Number of data points: %d' % X_LFW.shape[0])
print('\t Number of features: %d' % X_LFW.shape[1])
print('\t Number of classes: %d' % n_classes_LFW)
print('\t Width of each image: %d' % image_width)
print('\t Height of each image: %d' % image_height)
Dataset properties:
	 Number of data points: 1288
	 Number of features: 2914
	 Number of classes: 7
	 Width of each image: 47
	 Height of each image: 62
Visualization:¶
def plot_faces(images, labels, names, n_row, n_col):
"""Helper function to plot a gallery of portraits"""
f = plt.figure(figsize=(8, 3))
for i in range(n_row * n_col):
subfigure = f.add_subplot(n_row, n_col, i + 1)
subfigure.imshow(images[i].reshape((image_height, image_width)), cmap=plt.cm.gray)
subfigure.set_title(names[labels[i]], fontsize=10)
# Removing the axes
plt.xticks(())
plt.yticks(())
plt.show()
plot_faces(X_LFW, Y_LFW, Y_names_LFW, 2, 5)
Second Dataset: Two Noisy Circles (TNC)¶
This dataset contains synthetic samples of two noisy circles in 2D space. Each circle has a different radius, and noise is added to the points to simulate imperfect real-world data.
Features:¶
- coord_x: The x-coordinate of the data point in 2D space.
- coord_y: The y-coordinate of the data point in 2D space.
- label: A categorical label indicating to which circle the point belongs:
  - 1: Points belonging to the inner circle (radius = 2).
  - 2: Points belonging to the outer circle (radius = 10).
Data Description:¶
- Inner Circle (Label 1):
- Radius: 2 units.
- Number of points: 1500.
- Gaussian noise added to both x and y coordinates with a noise level of 0.2.
- Outer Circle (Label 2):
- Radius: 10 units.
- Number of points: 1500.
- Gaussian noise added with the same level of 0.2.
Dataset Properties:¶
- Number of Samples: 3000 (1500 samples for each circle).
- Noise Level: 0.2 (Gaussian noise applied independently to the x and y coordinates).
- Labels: 1 for the inner circle, 2 for the outer circle.
Generating TNC Dataset:¶
# Parameters for both circles
radius1 = 2
radius2 = 10
num_points = 1500
noise_level = 0.2
# Function to generate noisy points on a circle with a label
def generate_noisy_circle_points_with_labels(radius: int, num_points: int, noise_level: float, label: int):
# Random angles for polar coordinates
theta = np.random.uniform(0, 2 * np.pi, num_points)
# Coordinates on the circle (without noise)
coord_x = radius * np.cos(theta)
coord_y = radius * np.sin(theta)
# Add noise
coord_x += np.random.normal(0, noise_level, num_points)
coord_y += np.random.normal(0, noise_level, num_points)
# Create labels
labels = np.full(num_points, label)
return coord_x, coord_y, labels
if not os.path.exists(SYNTHETIC_FILE := os.path.join(WORKING_DIR, 'synthetic_circle_data.csv')):
# Generate points for both circles
coord_x1, coord_y1, labels1 = generate_noisy_circle_points_with_labels(radius1, num_points, noise_level, 1)
coord_x2, coord_y2, labels2 = generate_noisy_circle_points_with_labels(radius2, num_points, noise_level, 2)
# Combine the two circles
coord_x = np.concatenate([coord_x1, coord_x2])
coord_y = np.concatenate([coord_y1, coord_y2])
labels = np.concatenate([labels1, labels2])
# Create DataFrame for easy manipulation and shuffle
data = pd.DataFrame({'coord_x': coord_x, 'coord_y': coord_y, 'label': labels})
data = shuffle(data).reset_index(drop=True)
# Save the shuffled data to a CSV file
data.to_csv(SYNTHETIC_FILE, index=False)
Loading TNC dataset:¶
# Load the dataset
data = pd.read_csv(SYNTHETIC_FILE)
# Split features and labels
X_TNC = data[['coord_x', 'coord_y']].values
Y_TNC = data['label'].values
Visualization:¶
When plotted, the dataset exhibits two concentric, noisy circles. With the 'bwr' colormap used below, the inner circle (label 1) appears blue and the outer circle (label 2) appears red. Both circles have Gaussian noise applied, which makes the data more realistic by introducing variance in the points' positions.
# Plot the points, color-coded by label (1 -> blue, 2 -> red with the 'bwr' colormap)
plt.figure(figsize=(6, 6))
plt.scatter(data['coord_x'], data['coord_y'], c=data['label'], cmap='bwr', s=5)
# Set equal aspect ratio to make the circles look accurate
plt.gca().set_aspect('equal', adjustable='box')
# Set labels and title
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Synthetic Samples of Two Noisy Circles')
plt.show()
Task 1: Eigenfaces (5 points) ¶
NOTE: In this task you only work with the LFW dataset.
The code below is supposed to generate the "average face" as well as the first "eigenface" for the George W Bush class.
NOTE: For the rest of the assignment you work only with the "George W Bush" class.
Select "George W Bush" class¶
Run the code below to select a specific class of faces.
class_name_bush = 'George W Bush'
class_indx_bush = list(Y_names_LFW).index(class_name_bush)
X_bush = X_LFW[Y_LFW == class_indx_bush, :]
Y_bush = Y_LFW[Y_LFW == class_indx_bush]
plot_faces(X_bush, Y_bush, Y_names_LFW, 2, 5)
print(X_bush.shape)
(530, 2914)
Implementation¶
In this section you need to implement the centeralize_data() and pca_components() functions.
We use SVD factorization to implement PCA. The following link explains the connection between PCA and SVD and how to use SVD to perform PCA: https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
You may also check this video for a detailed explanation of SVD: https://youtu.be/nbBvuuNVfco?si=SfJSE85fjMUMiboK
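As a quick, optional sanity check of the PCA-SVD relationship described above (not part of the required implementation), the sketch below compares the PCA scores obtained from an SVD of the centered data with scikit-learn's PCA on a small random matrix; X_demo is a made-up array used only for illustration. Individual components are defined only up to sign, so the comparison uses absolute values.
# Sketch: PCA scores via SVD vs. sklearn's PCA (illustrative only).
# If X_c = X - mean is factored as X_c = U @ diag(s) @ Vt, the PCA scores are
# U * s (equivalently X_c @ Vt.T) and the principal axes are the rows of Vt.
import numpy as np
from scipy.linalg import svd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))           # small random data matrix (illustration only)
X_c = X_demo - X_demo.mean(axis=0)          # center the data

U, s, Vt = svd(X_c, full_matrices=False)
scores_svd = U * s                          # projection onto all principal axes

scores_sklearn = PCA(n_components=5).fit_transform(X_demo)

# Components may differ in sign between implementations, so compare absolute values.
print(np.allclose(np.abs(scores_svd), np.abs(scores_sklearn)))   # expected: True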
def centeralize_data(data):
    # Subtract the per-feature mean; return the centered data and the mean (shape (1, n_features))
    return data - (data_mean := np.mean(data, axis=0).reshape(1, -1)), data_mean

def pca_components(Vt, n_components):
    # The top principal components are the first n_components rows of Vt
    return Vt[:n_components]

def normalized_svd(data):
    # Center the data and factor it as U @ diag(s) @ Vt; also return the mean for later reconstruction
    XN, data_mean = centeralize_data(data)
    return *svd(XN), data_mean

def average_image_class(images):
    # Pixel-wise mean over all images of the class (the "average face")
    return np.mean(images, axis=0)

def eigen_face_class(images):
    # First principal component of the class (the first "eigenface")
    return pca_components(normalized_svd(images)[2], n_components=1)
def plot_class_representatives(images, class_name, aggregator):
f = plt.figure(figsize=(2, 2))
subfigure = f.add_subplot(1, 1, 1)
subfigure.imshow(aggregator(images).reshape((image_height, image_width)), cmap=plt.cm.gray)
subfigure.set_title(class_name)
# Removing the axes
plt.xticks(())
plt.yticks(())
plt.show()
plot_class_representatives(X_bush, class_name_bush, aggregator=average_image_class)
plot_class_representatives(X_bush, class_name_bush, aggregator=eigen_face_class)
Task 2: PCA Transformation and Reconstruction (15 points) ¶
In this task, we focus on implementing two fundamental functions for Principal Component Analysis (PCA):
pca_transform()
pca_inverse_transform()
Part A (10 points): ¶
pca_transform(): Projects the original data onto a new space using PCA via Singular Value Decomposition (SVD).
Inputs:¶
- X (ndarray):
  - The input data matrix to be transformed. Each row represents a sample, and each column represents a feature.
- n_components (int):
  - The number of principal components to retain.
  - Controls the dimensionality reduction by selecting the top n_components.
  - For instance, setting n_components=2 reduces the data to 2D space by projecting it onto the first two principal components.
Outputs:¶
- transformed_data (ndarray):
  - The data projected onto the top n_components principal components, resulting in a reduced-dimension representation of the original data.
- Vt (ndarray):
  - The right singular vectors from SVD, needed for inverse transformation to reconstruct the original data.
- data_mean (ndarray):
  - The mean of the original data, which is subtracted during the transformation process and added back during the reconstruction.
def pca_transform(X, n_components):
"""
Transforms the data into a new space using PCA via SVD.
Parameters:
- X: ndarray
The input data matrix to be transformed.
- n_components: int
The number of principal components to retain.
Returns:
- transformed_data: ndarray
The data projected onto the top n_components principal components.
- Vt: ndarray
The right singular vectors from SVD, needed for inverse transformation.
- data_mean: ndarray
The mean of the original data, required to reconstruct the data.
"""
    U, s, *result = normalized_svd(X)   # result holds (Vt, data_mean)
    # Scores on the first n_components axes: U_k * s_k (equivalently X_centered @ Vt_k.T)
    return U[:, :n_components] * s[:n_components], *result
We apply PCA transform to both datasets:¶
For the TNC dataset we set n_components=2.
For the LFW dataset we set n_components=100.
IMPORTANT NOTE: Remember that for the TNC dataset we work with X_TNC, whose points carry one of two labels, whereas for the LFW dataset we work only with X_bush.
n_components = 100
# Transform the X_bush data using the specified number of principal components
T_bush, Vt_bush, mean_bush = pca_transform(X_bush, n_components)
n_components = 2
# Transform the X_TNC data using the specified number of principal components
T_TNC, Vt_TNC, mean_TNC = pca_transform(X_TNC, n_components)
Part B (5 points): ¶
pca_inverse_transform(): Reconstructs the original data from its transformed representation.
- transformed_data (ndarray):
  - The data that has been projected into the lower-dimensional space using PCA.
  - This is the result of the pca_transform() function and represents the coordinates of the original data in the new principal component space.
- Vt (ndarray):
  - The right singular vectors from SVD (the principal components), used for the reconstruction.
  - This matrix is obtained from the SVD process, and only the top n_components vectors are used for reconstruction.
- n_components (int):
  - The number of principal components that were used for the transformation.
  - Determines how many components to use in the reconstruction process.
- data_mean (ndarray):
  - The mean of the original data.
  - This is used to reverse the centering operation that was applied to the data before performing PCA.
def pca_inverse_transform(transformed_data, Vt, n_components, data_mean):
"""
Reconstructs the original data from its transformed representation.
Parameters:
- transformed_data: ndarray
The data in the reduced-dimensional space.
- Vt: ndarray
The right singular vectors from SVD, used to reconstruct the original space.
- n_components: int
The number of principal components that were used for the transformation.
- data_mean: ndarray
The mean of the original data (used for centering).
Returns:
- reconstructed_data: ndarray
The data reconstructed back into the original space.
"""
return transformed_data @ pca_components(Vt, n_components) + data_mean
We apply the PCA inverse transform to both transformed datasets:¶
For the TNC dataset we set n_components=2.
For the LFW dataset we set n_components=100.
n_components = 100
# Reconstruct the X_bush data back to the original image space.
reconstructed_bush = pca_inverse_transform(T_bush, Vt_bush, n_components, mean_bush)
n_components = 2
# Reconstruct the X_TNC data back to the original 2D space.
reconstructed_TNC = pca_inverse_transform(T_TNC, Vt_TNC, n_components, mean_TNC)
TNC Visualization:¶
Since the circle dataset is two-dimensional, it is easy to plot the transformation results.
As shown in the TNC visualization, the dimensionality of the data is not reduced, since we keep the number of components at two. This ensures that no information is lost; the only noticeable difference is a rotation of the data.
Observe that here we plot each point with its corresponding label.
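If you want to verify numerically that nothing is lost when both components are kept, a quick optional check on the arrays computed above is:
# Optional check: with n_components equal to the original dimensionality (2),
# the inverse transform should recover X_TNC up to floating-point error.
print(np.allclose(X_TNC, reconstructed_TNC))        # expected: True
print(np.max(np.abs(X_TNC - reconstructed_TNC)))    # should be on the order of machine precision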
# Plotting the results
fig, axs = plt.subplots(1, 3, figsize=(12, 4))
# Plotting Original Data for X_TNC
axs[0].scatter(X_TNC[:, 0], X_TNC[:, 1], c=Y_TNC, cmap='bwr', s=5)
axs[0].set_title('Original X_TNC')
# Plotting Transformed Data for X_TNC
axs[1].scatter(T_TNC[:, 0], T_TNC[:, 1], c=Y_TNC, cmap='bwr', s=5)
axs[1].set_title('Transformed X_TNC')
# Plotting Reconstructed Data for X_TNC
axs[2].scatter(reconstructed_TNC[:, 0], reconstructed_TNC[:, 1], c=Y_TNC, cmap='bwr', s=5)
axs[2].set_title('Reconstructed X_TNC')
plt.tight_layout()
plt.show()
LFW Visualization:¶
In the case of the LFW dataset, we are working with 530 samples, each represented in a 2,914-dimensional feature space. Visualizing data in such a high-dimensional space is not feasible, and even after applying PCA to reduce the dimensionality to 100, visual interpretation remains challenging. To address this, we project the data onto a 2D plane, enabling us to better visualize and understand the underlying structure. Although this 2D representation no longer reflects the pixel values of the original images, it captures the most important variance in the data, allowing us to interpret patterns and relationships more easily in the reduced space.
As a result, we plot only the scatter of the first two (most important) principal components.
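To back up the claim that the leading components capture most of the variance, a small optional check (using the normalized_svd() helper defined above) computes the fraction of total variance carried by the first two components:
# Optional: fraction of total variance captured by the first two principal components.
# The variance along each principal axis is proportional to the squared singular value.
_, s_bush, _, _ = normalized_svd(X_bush)
explained_2d = np.sum(s_bush[:2] ** 2) / np.sum(s_bush ** 2)
print(f'Fraction of variance explained by the first two components: {explained_2d:.2%}')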
# Extract the first two principal components for plotting
pc1 = T_bush[:, 0] # First principal component (PC1)
pc2 = T_bush[:, 1] # Second principal component (PC2)
# Plotting the first two principal components
plt.figure(figsize=(5, 4))
plt.scatter(pc1, pc2, alpha=0.7, edgecolors='b')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Projection onto the First Two Principal Components')
plt.grid(True)
plt.show()
Moreover, some information is lost when reducing the dimensionality of the data. This can be observed by plotting the reconstructed image for a specific instance.
index = 0 # Choose an index of the image you want to show (e.g., the first image)
original_image = X_bush[index].reshape(image_height, image_width)
reconstructed_image = reconstructed_bush[index].reshape(image_height, image_width)
# Plotting the results
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
# Plotting Original Image from X_bush
axs[0].imshow(original_image, cmap='gray')
axs[0].set_title('Original Image (X_bush)')
# Plotting Reconstructed Image from X_bush
axs[1].imshow(reconstructed_image, cmap='gray')
axs[1].set_title('Reconstructed Image (X_bush)')
plt.tight_layout()
plt.show()
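To quantify the loss visible in the reconstruction above, you can also compute the squared reconstruction error of that single instance (an optional check on the arrays from this task):
# Optional: squared L2 reconstruction error for the displayed image,
# using the 100-component reconstruction computed above.
single_error = np.sum((X_bush[index] - reconstructed_bush[index]) ** 2)
print(f'Squared reconstruction error of image {index}: {single_error:.4f}')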
Task 3: Average Reconstruction Error for LFW (30 points)¶
Split the image dataset into train (400 images) and test (rest of the data) datasets. Train PCA with [2, 10, 30, 60, 100] components respectively. Reconstruction error is defined as
\begin{align} \text{error}=\frac{1}{n}\sum_{i=1}^n||x_i-\text{reconstruct}(pca(x_i))||^2_2 \end{align}
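In NumPy terms this is one line; the sketch below uses tiny placeholder arrays (X and X_rec are made-up names for the original samples and their reconstructions) just to make the formula concrete.
# Sketch of the reconstruction-error formula on placeholder data.
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0]])        # two original samples (illustration only)
X_rec = np.array([[1.1, 2.0], [2.9, 4.2]])    # their (imperfect) reconstructions
error = np.mean(np.sum((X - X_rec) ** 2, axis=1))
print(error)   # (0.01 + 0.05) / 2 = 0.03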
Part A (20 points): ¶
Plot average reconstruction error on training and testing data points with the following requirements:
- X-axis shows number of components.
- Y-axis shows reconstruction error.
- Draw two curves, one for training and one for testing.
# Define the number of components to test in PCA
c_components = [2, 10, 30, 60, 100]
# Initialize lists to store the reconstruction errors for training and testing data
train_errors, test_errors = [], []
# Initialize deterministic seed
SEED = 42
X_train, X_test = train_test_split(X_bush, train_size=400, random_state=SEED)
# \text{error}=\frac{1}{n}\sum_{i=1}^n||x_i-\text{reconstruct}(pca(x_i))||^2_2
def mse(data, reconstructed):
    # Average squared L2 reconstruction error per sample, matching the formula above
    return np.mean(np.sum((data - reconstructed) ** 2, axis=1))
# Loop through each specified number of components for PCA
for n_components in c_components:
# Apply PCA and then inverse PCA to the training data
transformed_train, Vt_train, mean_train = pca_transform(X_train, n_components)
# Calculate the Mean Squared Error (MSE) as the reconstruction error for the training set
train_errors.append(mse(X_train, pca_inverse_transform(transformed_train, Vt_train, n_components, mean_train)))
    # Center the test data with the training mean, project it onto the training PCA components,
    # and reconstruct it.
# Calculate the Mean Squared Error (MSE) as the reconstruction error for the test set
test_errors.append(
mse(
X_test,
pca_inverse_transform(
(X_test - mean_train) @ pca_components(Vt_train, n_components).T, Vt_train, n_components, mean_train
),
)
)
# Print the average reconstruction errors for each number of components
for i, n_components in enumerate(c_components):
print(f'Components: {n_components}\n Train Error: {train_errors[i]:.4f}\n Test Error: {test_errors[i]:.4f}')
Components: 2
 Train Error: 40.2048
 Test Error: 44.1277
Components: 10
 Train Error: 21.6275
 Test Error: 25.1425
Components: 30
 Train Error: 11.6392
 Test Error: 15.6092
Components: 60
 Train Error: 6.6892
 Test Error: 11.4092
Components: 100
 Train Error: 3.7635
 Test Error: 8.7075
def plot_eval_results(c_components, train_error, test_error):
plt.figure(figsize=(6 * 1.5, 5 * 1.2))
plt.plot(c_components, train_error, 'b-o', label='train error')
plt.plot(c_components, test_error, 'r-o', label='test error')
plt.xlabel('n_components')
plt.ylabel('loss')
plt.legend()
plt.grid(True)
plt.show()
plot_eval_results(c_components, train_errors, test_errors)
Part B (10 points): ¶
Explain the difference between the two graphs (8 points).
What would the error be if we compute it for the TNC dataset while using two components and 2000 samples for training? (2 points)
# Define the number of components to test in PCA
n_components = 2
TNC_train, TNC_test = train_test_split(X_TNC, train_size=2000, random_state=SEED)
transformed_TNC_train, Vt_TNC_train, mean_TNC_train = pca_transform(TNC_train, n_components)
train_error = mse(TNC_train, pca_inverse_transform(transformed_TNC_train, Vt_TNC_train, n_components, mean_TNC_train))
test_error = mse(
TNC_test,
pca_inverse_transform(
(TNC_test - mean_TNC_train) @ pca_components(Vt_TNC_train, n_components).T,
Vt_TNC_train,
n_components,
mean_TNC_train,
),
)
# Print the average reconstruction errors for each number of components
print(f'Average reconstruction error for train data is {train_error}')
print(f'Average reconstruction error for test data is {test_error}')
Average reconstruction error for train data is 6.217495281310197e-30
Average reconstruction error for test data is 1.2387388672271008e-30
Task 4: Kernel PCA (30 points)¶
KernelPCA is an extension of PCA that uses kernel methods to perform dimensionality reduction in a high-dimensional space. Unlike standard PCA, which finds linear components, Kernel PCA can capture non-linear relationships in the data by implicitly mapping it into a higher-dimensional space using a kernel function. In this task, we will use the Radial Basis Function (RBF) kernel, which is similar to a Gaussian kernel.
The gamma parameter in the RBF kernel controls the spread of the kernel function. It plays the role of the inverse width of the Gaussian (for a Gaussian with standard deviation sigma, gamma = 1/(2 sigma^2)). A lower gamma value results in a wider influence of each training point, while a higher gamma value makes the influence more localized. You can read more about the gamma parameter in the scikit-learn documentation.
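To get a feel for this, the short sketch below (illustrative only, not required) evaluates the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) for a made-up pair of points at several gamma values; larger gamma makes the similarity drop off much faster with distance.
# Illustrative only: effect of gamma on the RBF kernel value for one pair of points.
import numpy as np

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])              # Euclidean distance ||x - y|| = 5
sq_dist = np.sum((x - y) ** 2)        # squared distance = 25

for gamma in [0.0001, 0.01, 0.1, 1.0]:
    print(f'gamma = {gamma:<7} k(x, y) = {np.exp(-gamma * sq_dist):.6f}')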
For this task, you don't need to implement Kernel PCA from scratch; use scikit-learn's KernelPCA
(https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html).
An example setup for scikit-learn's Kernel PCA is:
KernelPCA(kernel='rbf', n_components=100, gamma=0.1, fit_inverse_transform=True)
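As a reference for the API, here is a minimal usage sketch on made-up toy data (X_toy is not one of the assignment datasets): fit_transform projects the training data, transform handles new data, and inverse_transform gives an approximate reconstruction in the original space (this requires fit_inverse_transform=True).
# Minimal KernelPCA workflow sketch on random toy data (illustration only).
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 5))

kpca = KernelPCA(kernel='rbf', n_components=2, gamma=0.1, fit_inverse_transform=True)
X_toy_kpca = kpca.fit_transform(X_toy)            # project onto 2 kernel principal components
X_toy_back = kpca.inverse_transform(X_toy_kpca)   # approximate reconstruction in input space

print(X_toy_kpca.shape)   # (50, 2)
print(X_toy_back.shape)   # (50, 5)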
Now, let's proceed with the following parts:
Part A (10 points): ¶
Apply Kernel PCA and Plot Transformed Data:¶
First, plot the original TNC data again. Then apply Kernel PCA to the TNC dataset using different values of gamma (e.g., 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1). For each value of gamma, plot the transformed data to visualize the effect of the RBF kernel on the dataset. This step will help you understand how gamma influences the data's representation in the transformed space.
For this part you need to plot 12 figures (one for the original data, and the remaining 11 for the different values of gamma).
# Load the dataset
data = pd.read_csv(SYNTHETIC_FILE)
# Split features and labels
X_TNC = data[['coord_x', 'coord_y']].values
Y_TNC = data['label'].values
# Define different values of gamma to test
gamma_values = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1]
n_components = 2
# Standardize the features
scaler = StandardScaler()
X_TNC_scaled = scaler.fit_transform(X_TNC)
# Create subplots to visualize the transformed data for each gamma
plt.figure(figsize=(20, 15))
# Plot the original data before applying Kernel PCA
plt.subplot(3, 4, 1)
plt.scatter(X_TNC_scaled[:, 0], X_TNC_scaled[:, 1], c=Y_TNC, cmap='bwr')
plt.title('Original Data')
plt.xlabel('coord_x')
plt.ylabel('coord_y')
# Set the limits for the x and y axes
x_limits, y_limits = (-4, 4), (-4, 4)
# Apply Kernel PCA for each gamma value
for idx, gamma in enumerate(gamma_values):
# Apply Kernel PCA
kpca = KernelPCA(n_components=n_components, kernel='rbf', gamma=gamma)
X_kpca = kpca.fit_transform(X_TNC_scaled)
# Plot the transformed data
plt.subplot(3, 4, idx + 2)
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=Y_TNC, cmap='viridis')
plt.title(f'Gamma = {gamma}')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
# Set fixed x and y axis limits
plt.xlim(x_limits)
plt.ylim(y_limits)
plt.tight_layout()
plt.show();
Part B (5 points): ¶
Based on your observations, how does Kernel PCA compare to Linear PCA on this dataset with red and blue labels? In what ways does Kernel PCA affect the distribution of the data points, particularly in terms of how well the red and blue points are organized? Choose the best value(s) for gamma and report it (them). What criteria did you use to determine the optimal gamma value?
Part C (5 points): ¶
Find the Best Values for Reconstruction Error of Kernel PCA¶
In this part, you will tune two hyperparameters: gamma (values: [0.01, 0.05, 0.1, 0.5, 1]) and n_components (values: [2, 10, 30, 60, 100]) for the LFW dataset. Set the train size to 400 samples. Your objective is to find the best combination of gamma and n_components based on the reconstruction error. Present your results in two tables showing the reconstruction error for each combination on the training set and the test set.
# Define the hyperparameter ranges
gamma_values = [0.005, 0.01, 0.05, 0.1, 0.5]
n_components_values = [2, 10, 30, 60, 100]
X_train, X_test = train_test_split(X_bush, train_size=400, random_state=SEED)
train_errors = {gamma: [] for gamma in gamma_values}
test_errors = {gamma: [] for gamma in gamma_values}
# Loop through each combination of gamma and n_components
for gamma in gamma_values:
for n_components in n_components_values:
# Fit KernelPCA
kpca = KernelPCA(n_components=n_components, kernel='rbf', gamma=gamma, fit_inverse_transform=True)
X_train_kpca = kpca.fit_transform(X_train)
X_test_kpca = kpca.transform(X_test)
# Reconstruct the data
X_train_reconstructed = kpca.inverse_transform(X_train_kpca)
X_test_reconstructed = kpca.inverse_transform(X_test_kpca)
train_errors[gamma].append(mse(X_train, X_train_reconstructed))
test_errors[gamma].append(mse(X_test, X_test_reconstructed))
train_errors_df = pd.DataFrame(train_errors, index=n_components_values)
test_errors_df = pd.DataFrame(test_errors, index=n_components_values)
# Display the results as tables
print('Reconstruction Error for Training Set:')
print(train_errors_df)
print('\nReconstruction Error for Testing Set:')
print(test_errors_df)
# Find the best combination
best_n_components, best_gamma = np.unravel_index(test_errors_df.values.argmin(), test_errors_df.shape)
best_error = test_errors_df.values[best_n_components, best_gamma]
print(
f'Best combination:\n Gamma: {gamma_values[best_gamma]} - n_components: {n_components_values[best_n_components]}\n test error: {best_error}'
)
Reconstruction Error for Training Set:
         0.005      0.010      0.050      0.100      0.500
2    56.035629  52.460106  59.473942  62.702892  62.768883
10   54.013870  48.537128  54.149078  59.908009  62.227745
30   53.707760  47.722954  52.062382  57.340527  61.118011
60   53.642696  47.483402  50.792248  54.513866  57.952980
100  53.620888  47.394039  49.331505  52.528706  51.324707

Reconstruction Error for Testing Set:
         0.005      0.010      0.050      0.100      0.500
2    60.903149  57.393143  65.655075  68.015724  68.021919
10   58.914120  53.705605  62.367989  67.361328  67.999069
30   58.638409  53.019703  61.464310  67.065262  68.198509
60   58.598793  52.895687  61.163082  66.958488  69.903511
100  58.587997  52.866302  61.069897  67.134178  78.636238

Best combination:
 Gamma: 0.01 - n_components: 100
 test error: 52.866302490234375
Part D (10 points): ¶
Visualization of Reconstruction Error (5 points)¶
Plot the average reconstruction error on both the training and testing data with the following requirements:
- Use n_components = 60.
- Use the same gamma values as before (values: [0.01, 0.05, 0.1, 0.5, 1]).
- Train size = 400.
- The X-axis should display the different values of gamma.
- The Y-axis should show the reconstruction error.
- Draw two graphs: one for the training set and one for the test set.
Explanation (5 points)¶
How does Kernel PCA compare to Linear PCA on this dataset? If Kernel PCA shows improved performance, please justify your answer. If Linear PCA performs better, explain the reasons for its effectiveness.
Hint: Consider factors such as the choice of kernel, the characteristics of the data, and how the gamma parameter influences the transformed space.
APPLY_SCALED = False
USE_SKLEARN_PCA = True
# Choose n_components = 60 for visualization
n_components = 60
scaler = StandardScaler()
# Initialize lists to store errors
train_errors_kpca = []
test_errors_kpca = []
train_error_pca = []
test_error_pca = []
X_train, X_test = train_test_split(X_bush, train_size=400, random_state=SEED)
if APPLY_SCALED:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
else:
X_train_scaled = X_train
X_test_scaled = X_test
# Compute errors for Kernel PCA
for gamma in gamma_values:
kpca = KernelPCA(n_components=n_components, kernel='rbf', gamma=gamma, fit_inverse_transform=True)
# Train set
X_train_kpca = kpca.fit_transform(X_train_scaled)
X_train_reconstructed = kpca.inverse_transform(X_train_kpca)
train_errors_kpca.append(mse(X_train_scaled, X_train_reconstructed))
# Test set
X_test_kpca = kpca.transform(X_test_scaled)
X_test_reconstructed = kpca.inverse_transform(X_test_kpca)
test_errors_kpca.append(mse(X_test_scaled, X_test_reconstructed))
# Linear PCA
if not USE_SKLEARN_PCA:
    # Manual PCA via SVD on the (optionally scaled) training data
    transformed_bush_train, Vt_bush_train, mean_bush_train = pca_transform(X_train_scaled, n_components)
    train_error_pca.append(
        mse(
            X_train_scaled,
            pca_inverse_transform(transformed_bush_train, Vt_bush_train, n_components, mean_bush_train),
        )
    )
    test_error_pca.append(
        mse(
            X_test_scaled,
            pca_inverse_transform(
                (X_test_scaled - mean_bush_train) @ pca_components(Vt_bush_train, n_components).T,
                Vt_bush_train,
                n_components,
                mean_bush_train,
            ),
        )
    )
else:
    # sklearn PCA on the same (optionally scaled) data, for a fair comparison with Kernel PCA
    pca = PCA(n_components=n_components)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_train_reconstructed_pca = pca.inverse_transform(X_train_pca)
    train_error_pca.append(mse(X_train_scaled, X_train_reconstructed_pca))
    X_test_pca = pca.transform(X_test_scaled)
    X_test_reconstructed_pca = pca.inverse_transform(X_test_pca)
    test_error_pca.append(mse(X_test_scaled, X_test_reconstructed_pca))
# Plot the results
plt.figure(figsize=(12, 5))
# Plot for training set
plt.subplot(1, 2, 1)
plt.plot(gamma_values, train_errors_kpca, marker='o', label='Kernel PCA')
plt.axhline(y=train_error_pca[0], color='r', linestyle='--', label='Linear PCA')
plt.xlabel('Gamma Values')
plt.ylabel('Reconstruction Error')
plt.title(f'Training Set (n_components = {n_components})')
plt.legend()
plt.xscale('log')
plt.grid(True)
# Plot for test set
plt.subplot(1, 2, 2)
plt.plot(gamma_values, test_errors_kpca, marker='o', label='Kernel PCA')
plt.axhline(y=test_error_pca[0], color='r', linestyle='--', label='Linear PCA')
plt.xlabel('Gamma Values')
plt.ylabel('Reconstruction Error')
plt.title(f'Test Set (n_components = {n_components})')
plt.legend()
plt.xscale('log')
plt.grid(True)
plt.tight_layout()
plt.show()
print(f'Linear PCA - Train Error: {train_error_pca[0]:.4f}, Test Error: {test_error_pca[0]:.4f}')
print('Kernel PCA Errors:')
for gamma, train_err, test_err in zip(gamma_values, train_errors_kpca, test_errors_kpca):
print(f' Gamma {gamma:.3f} - Train Error: {train_err:.4f}, Test Error: {test_err:.4f}')
Linear PCA - Train Error: 6.7114, Test Error: 11.4403
Kernel PCA Errors:
 Gamma 0.005 - Train Error: 53.6427, Test Error: 58.5988
 Gamma 0.010 - Train Error: 47.4834, Test Error: 52.8957
 Gamma 0.050 - Train Error: 50.7922, Test Error: 61.1631
 Gamma 0.100 - Train Error: 54.5139, Test Error: 66.9585
 Gamma 0.500 - Train Error: 57.9530, Test Error: 69.9035