See also jupyter notebook, pdf, solutions
question 1.
task 1: eigenfaces
Implementation of `centeralize_data()` and `pca_components()` yields the following result when running `plot_class_representatives`:
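The actual implementations live in the notebook; as a minimal sketch of the idea (the signatures below are assumptions, not the assignment's exact API), centering and component extraction via SVD might look like:

```python
import numpy as np

def centeralize_data(X):
    # Subtract the per-feature mean so every column of X has zero mean.
    mean = X.mean(axis=0)
    return X - mean, mean

def pca_components(X_centered, n_components):
    # SVD of the centered data: the rows of Vt are the principal directions,
    # already ordered by decreasing singular value (explained variance).
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return Vt[:n_components]
```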
task 2: PCA transformation and reconstruction
part A
Implement `pca_tranform`.
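A minimal sketch of this function, reusing the mean and components from Task 1 (the signature is an assumption):

```python
import numpy as np

def pca_tranform(X, components, mean):
    # Project the data onto the principal components (rows of `components`).
    return (X - mean) @ components.T
```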
part B
Implement `pca_inverse_transform`.
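A matching sketch for the inverse mapping (again, the signature is assumed):

```python
def pca_inverse_transform(X_proj, components, mean):
    # Map the low-dimensional projection back to the original space; anything
    # outside the span of the kept components is lost in the round trip.
    return X_proj @ components + mean
```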
Running these yields the following TNC visualisation:
and LFW visualisation:
We also expect some loss of information when reconstructing:
task 3: average reconstruction error for LFW
part A
Plot the average reconstruction error on the training and testing data points.
Training code:
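The full training code is in the notebook; a rough sketch of the kind of loop it runs, assuming sklearn's `PCA`, a mean-squared reconstruction error, and hypothetical `X_train`/`X_test` arrays from the LFW split:

```python
import numpy as np
from sklearn.decomposition import PCA

def avg_reconstruction_error(model, X):
    # Average squared distance between each point and its reconstruction.
    X_rec = model.inverse_transform(model.transform(X))
    return np.mean(np.sum((X - X_rec) ** 2, axis=1))

train_errors, test_errors = [], []
component_range = range(1, 101, 5)  # hypothetical sweep over n_components
for k in component_range:
    # X_train / X_test: assumed names for the LFW train/test split
    pca = PCA(n_components=k).fit(X_train)
    train_errors.append(avg_reconstruction_error(pca, X_train))
    test_errors.append(avg_reconstruction_error(pca, X_test))
```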
This yields the following observation.
The eval results graph:
part B
- Explain the difference between the two graphs.
- What would the error be if we compute it for the TNC dataset while using two components and 2000 samples?
- The following observations can be made:
- Both errors decrease as the number of components increases (lower means better reconstruction quality). However, the test error line (red) is higher than the train error (blue). This suggests some overfitting given the smaller training set size (400) relative to the LFW dataset (which includes 1288 entries).
- Both show diminishing returns, yet this effect is more pronounced for the test error.
- As `n_components` increases, we see a decrease in bias (improving reconstruction for both train and test data). However, the test error decreases more slowly, since later components are less effective at reconstructing features of unseen data.
- The average reconstruction error for the TNC dataset is shown below:
task 4: Kernel PCA
part A
Apply Kernel PCA and plot the transformed data.
Applied a `StandardScaler` to `X_TNC` and plotted a 3x4 grid, with position (1,1) showing the original data, followed by 11 slots for a range of gamma values.
Run with `n_components=2`.
This yields the following graph:
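A sketch of how such a grid could be produced, assuming sklearn's `KernelPCA` with the RBF kernel and matplotlib; `X_TNC`/`y_TNC` are the assignment's arrays, and the gamma values listed are hypothetical since the actual range was not preserved here:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X_TNC)            # X_TNC, y_TNC: assumed names
gammas = [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 20, 50, 100]   # hypothetical values

fig, axes = plt.subplots(3, 4, figsize=(16, 10))
axes = axes.ravel()
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_TNC, cmap="bwr", s=10)
axes[0].set_title("Original data")
for ax, gamma in zip(axes[1:], gammas):
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma)
    X_kpca = kpca.fit_transform(X_scaled)
    ax.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y_TNC, cmap="bwr", s=10)
    ax.set_title(f"gamma = {gamma}")
plt.tight_layout()
plt.show()
```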
part B
Based on your observations, how does Kernel PCA compare to Linear PCA on this dataset with red and blue labels? In what ways does Kernel PCA affect the distribution of the data points, particularly in terms of how well the red and blue points are organized? Choose the best value(s) for `gamma` and report it (them). What criteria did you use to determine the optimal `gamma` value?
Comparison:
- Kernel PCA is more effective at capturing the non-linear relationships in the data: we see the blue and red circles spread apart, which changes the data distribution. Linear PCA, by contrast, maintains the circular structure, meaning it does not alter the data distribution much.
Effects:
- For small values of gamma, the points are highly concentrated, meaning the kernel is too wide (this makes sense given that gamma is inversely related to the kernel width).
- For gamma , we notice a separation between the blue and red circles.
- For gamma , we start to see features similar to the original data, albeit scaled down by the RBF kernel.
- At gamma , the points spread out, forming elongated features.
For gamma , we seem to get the best representation of the original data.
Criteria:
- class separation: how well the blue and red circles are separated from each other
- compactness: how tightly clustered the points within each class are
- structure preservation: how well the circular nature of the original dataset is preserved
- dimensionality reduction: how well the data is projected into the lower-dimensional space
part C
Find the best values for the reconstruction error of Kernel PCA.
The training loop yields the following:
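A sketch of the kind of search such a loop might do, assuming sklearn's `KernelPCA` with `fit_inverse_transform=True` for the reconstruction, a hypothetical grid of gamma values, and `n_components=60` to mirror the linear PCA comparison:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

gammas = np.logspace(-3, 2, 20)   # hypothetical search grid
errors = {}
for gamma in gammas:
    kpca = KernelPCA(n_components=60, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)   # learns an approximate pre-image map
    # X_train: assumed name for the LFW training data
    X_rec = kpca.inverse_transform(kpca.fit_transform(X_train))
    errors[gamma] = np.mean(np.sum((X_train - X_rec) ** 2, axis=1))

best_gamma = min(errors, key=errors.get)
print(f"best gamma = {best_gamma}, error = {errors[best_gamma]:.2f}")
```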
part D
- Visualisation of Reconstruction Error
- How does kernel PCA compare to Linear PCA on this dataset? If Kernel PCA shows improved performance, please justify your answer. If Linear PCA performs better, explain the reasons for its effectiveness.
Reconstruction Error from kernel PCA as well as linear PCA:
Performance:
- Linear PCA has significantly better reconstruction error than Kernel PCA (6.68 for linear PCA against 47.48 for Kernel PCA at its best gamma)
- Regardless of `gamma`, Kernel PCA shows a much higher error
Reasoning for Linear PCA:
- Data characteristics: most likely LFW contains mostly linear relationships between features (face images have strong linear correlations in pixel intensities and structure)
- Dimensionality: this aligns with Task 3 Part B, where we observe the same value with `n_components=60` for linear PCA
- Overfitting: linear PCA is less prone to overfitting, whereas Kernel PCA might find local optima that overfit to patterns in the data (in this case, face features). Additionally, the RBF kernel is more sensitive to outliers
Explanation why Kernel PCA doesn’t work as well:
- Kernel: the RBF kernel assumes local, non-linear relationships. This might not work well with facial data given the strong linear correlations among facial features.
- Gamma: even the gamma value that achieves the lowest error still underperforms compared to linear PCA.
- Noise: non-linear kernel mappings are more prone to capturing noise or irrelevant patterns in facial images.
question 2.
problem statement
“Driving high” is prohibited in the city, and the police have started using a tester that shows whether a driver is high on cannabis. The tester is a binary classifier (1 for a positive result, 0 for a negative result) which is not accurate all the time:
- if the driver is truly high, then the test will be positive with probability and negative with probability (so the probability of wrong result is in this case)
- if the driver is not high, then the test will be positive with probability and negative with probability (so the probability of wrong result is in this case)
Assume the probability of (a randomly selected driver from the population) being “truly high” is
part 1
What is the probability that the tester shows a positive result for a (randomly selected) driver? (write your answer in terms of )
Probability of a driver being truly high:
Probability of a driver not being high:
Probability of a positive test given the driver is high:
Probability of a positive test given the driver is not high:
Using the law of total probability to find the overall probability of a positive test result:
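The symbols were dropped in this export; with assumed notation $p_1 = P(+\mid \text{high})$, $p_0 = P(+\mid \text{not high})$ and prior $\theta = P(\text{high})$, the computation reads:

$$
P(+) = P(+\mid\text{high})\,P(\text{high}) + P(+\mid\text{not high})\,P(\text{not high}) = p_1\,\theta + p_0\,(1-\theta).
$$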
part 2
The police have collected test results for n randomly selected drivers (i.i.d. samples). What is the likelihood that there are exactly positive samples among the samples? Write your solution in terms of
Let the probability of a positive test result for a randomly selected driver be
Now, apply the binomial distribution to find the likelihood of the observed number of positive samples among the n samples:
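With the same assumed notation, writing $P = p_1\theta + p_0(1-\theta)$ and $k$ for the number of positive results among the $n$ samples:

$$
\mathcal{L}(\theta) = P(K = k) = \binom{n}{k}\,P^{\,k}\,(1-P)^{\,n-k}.
$$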
part 3
What is the maximum likelihood estimate of given a set of random samples from which are positive results? In this part, you can assume that and are fixed and given. Simplify your final result in terms of
Assumption: using the natural log, ln.
MLE of :
Let the likelihood function be:
Take the log of both sides and drop the constant term:
To find the maximum likelihood, we differentiate with respect to  and set the derivative to zero:
Substituting:
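Putting the steps together under the same assumed notation ($P = p_1\theta + p_0(1-\theta)$, $k$ positives out of $n$ samples):

$$
\ell(\theta) = k\ln P + (n-k)\ln(1-P) + \text{const}
$$

$$
\frac{d\ell}{d\theta} = \left(\frac{k}{P} - \frac{n-k}{1-P}\right)(p_1 - p_0) = 0
\;\Longrightarrow\; P = \frac{k}{n}
\;\Longrightarrow\; \hat{\theta} = \frac{k/n - p_0}{p_1 - p_0}.
$$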
part 4
What will be the maximum likelihood estimate of  for the special cases of
For :
For :
note: this makes sense; when the test is completely random, it gives no information about the true proportion of high drivers.
For :