See also the Jupyter notebook.
task 1: linear SVM for MNIST classification
part a
Is the implementation of the multi-class linear SVM similar to the end-to-end multi-class SVM that we learned in the class? Are there any significant differences?
| Differences | multi-class linear SVM | end-to-end multi-class SVM |
|---|---|---|
| Loss function | Uses `MultiMarginLoss`, a criterion that optimises a multi-class classification hinge loss[^1] | The generalised hinge loss defined through the multi-vector encoding |
| Architecture | A single linear layer sized by the given `input_size` and `num_classes` | Optimised over pairs of class scores using the multi-vector encoding |
| Parameter learning | Uses SGD with minibatches to optimise the multi-margin loss | A theoretical formulation that optimises over the multi-vector encoded space[^2] |
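For reference, a minimal sketch of the Task 1 setup summarised above; the learning rate and batch size are assumptions, not the notebook's actual values:

```python
import torch
import torch.nn as nn

input_size, num_classes = 28 * 28, 10

model = nn.Linear(input_size, num_classes)  # single linear layer producing class scores
criterion = nn.MultiMarginLoss()            # multi-class hinge loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is an assumption

# one SGD minibatch step on dummy data (x: [B, 784] flattened images, y: [B] labels)
x = torch.randn(32, input_size)
y = torch.randint(0, num_classes, (32,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```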
part b
- Compute the accuracy on the train and test set after each epoch in the training. Plot these accuracies as a function of the epoch number and include it in the report (include only the plot in your report, not all the 2*100 numbers).
- Compute the hinge loss on the train and test set after each epoch in the training. Plot these loss values as a function of the epoch number and include it in the report (include only the plot in your report, not all the 2*100 numbers).
- Report the last epoch results (including loss values and accuracies) for both train and test sets.
- Does the model show significant overfitting? Or do you think there might be other factors that are more significant in the mediocre performance of the model?
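A minimal sketch of how these per-epoch metrics can be computed, assuming the `model` and `criterion` from part a and standard `DataLoader`s (all names here are illustrative):

```python
import torch

@torch.no_grad()
def evaluate(model, criterion, loader):
    """Return (mean loss, accuracy) over a data loader."""
    model.eval()
    total_loss, correct, n = 0.0, 0, 0
    for x, y in loader:
        scores = model(x.flatten(1))  # flatten the 28x28 images
        total_loss += criterion(scores, y).item() * y.size(0)
        correct += (scores.argmax(dim=1) == y).sum().item()
        n += y.size(0)
    return total_loss / n, correct / n

# called once per epoch for each of the train and test loaders;
# the returned values are appended to lists and plotted afterwards
```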
The following graphs show both accuracy and loss on the train/test sets over 100 epochs:
Last epoch results for both train and test sets:
We observe training accuracy continuing to improve while test accuracy plateaus. The same pattern appears in the loss-vs-epochs graph, where the gap between training and test loss widens as the epochs increase.
While this shows evidence of overfitting, one can argue that other factors matter more for the model's performance:
Limited training data:
- We currently use only 0.25% of the MNIST dataset (around 150 samples)[^3]
- This makes it difficult for the model to learn generalizable patterns
Model limitation:
- Linear SVM can only learn linear decision boundaries
- MNIST requires non-linear decision boundaries to achieve high accuracy (we observe this in the test accuracy plateauing relatively quickly at around 78.5%)

We also do not observe degrading test performance, which would be the primary symptom of overfitting.
part c
Weight decay works like regularization. Set weight decay to each of the values (0.1, 1, 10) when defining the SGD optimizer (see the SGD optimizer documentation for how to do that).
Plot the train/test losses and accuracies per epoch. Also report the last epoch results (loss and accuracy for both train and test).
Important: Does weight decay help in this case? Justify the results.
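As a sketch, the only code change relative to part a is the `weight_decay` argument of `torch.optim.SGD`, which adds an L2 penalty on the parameters to every update:

```python
import torch
import torch.nn as nn

for wd in (0.1, 1.0, 10.0):
    model = nn.Linear(28 * 28, 10)  # fresh model per run, as in Task 1
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=wd)
    # ...train and record per-epoch metrics exactly as before...
```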
The following are the training logs for the weight decay values (0.1, 1, 10):
Yes, but the result is highly sensitive to the chosen weight decay value.
- With `weight_decay = 0.1` we observe the best performance: training accuracy reaches 99.33%, the gap between train and test loss is smaller, and the learning curves are smooth with stable convergence.
- With `weight_decay = 1` we see a decrease in training accuracy and a larger gap between training and test loss; training becomes somewhat unstable, with fluctuating accuracy, because the regularisation is too strong and hinders learning.
- With `weight_decay = 10` model performance is severely impaired: the regularisation is far too aggressive, training is unstable, and loss values stay high.
The small dataset makes the model more sensitive to regularisation, and the linear model's limited capacity means it needs relatively little regularisation in the first place.
Weight decay does help when properly tuned, and it makes learning somewhat more stable.
task 2: logistic regression for MNIST classification
part a
Use Cross Entropy Loss (rather than Hinge loss) to implement logistic regression
context:
- Hinge loss: penalises predictions that are not sufficiently confident; it only cares about correct classification with a sufficient margin.
- Cross-entropy:
  - for the binary case it is defined as $\ell(y, \hat{p}) = -\big[\,y \log \hat{p} + (1 - y)\log(1 - \hat{p})\,\big]$
  - for the multi-class case it is defined as $\ell(y, \hat{p}) = -\sum_{c=1}^{C} \mathbb{1}[y = c] \log \hat{p}_c = -\log \hat{p}_y$, where $\hat{p} = \mathrm{softmax}(z)$ over the class scores $z$
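A minimal sketch of the change (names are illustrative): only the criterion is swapped, since `nn.CrossEntropyLoss` applies log-softmax to the raw linear scores internally.

```python
import torch
import torch.nn as nn

model = nn.Linear(28 * 28, 10)        # same linear architecture as Task 1

svm_criterion = nn.MultiMarginLoss()  # Task 1: multi-class hinge loss
ce_criterion = nn.CrossEntropyLoss()  # Task 2: cross-entropy (softmax regression)

x = torch.randn(4, 28 * 28)           # dummy batch of flattened images
y = torch.tensor([0, 3, 7, 1])        # dummy integer class labels
logits = model(x)
print(svm_criterion(logits, y).item(), ce_criterion(logits, y).item())
```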
part b
- Compute the accuracy on the train and test set after each epoch in the training. Plot these accuracies as a function of the epoch number.
- Compute the cross-entropy loss on the train and test set after each epoch in the training. Plot these loss values as a function of the epoch number.
- Report the last epoch results (including loss values and accuracies) for both train and test sets.
- Does the model show significant overfitting? Or do you think there might be other factors that are more significant in the mediocre performance of the model?
The following graphs show both accuracy and loss on the train/test datasets:
There is no sign of overfitting: train and test accuracy stay very close together, and the training and test loss curves track each other closely.
The reasons for the poor performance are as follows:
- Random-chance baseline: for a 10-class problem, random guessing would give ~10% accuracy, and the model performs slightly worse than that.
- The model does not seem to learn at all; it performs significantly worse than the SVM.
- The cross-entropy setup might need additional hyperparameter tuning.
- Non-linearity: since the MNIST classes are not linearly separable, logistic regression may struggle to capture all the information in the training data.
part c
Does it work better, worse, or similar?
Significantly worse, due to the difference in loss function.
task 3: adding a hidden layer
part a
Add a hidden layer with 5000 neurons and a ReLU layer to both the logistic regression and SVM models from Task 1 and Task 2.
- For both models, plot the train loss and the test loss.
- For both models, plot the train and test accuracies.
- For both models, report the loss and accuracy for both train and test sets.
The following is the modified version of `LinearSVM` with a hidden layer:
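The full class is in the notebook; the following is a sketch of its plausible shape (class and argument names are assumptions):

```python
import torch.nn as nn

class ModifiedLinearSVM(nn.Module):
    """Linear SVM scores preceded by a 5000-unit hidden layer and ReLU (sketch)."""

    def __init__(self, input_size=28 * 28, hidden_size=5000, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),   # new hidden layer
            nn.ReLU(),                            # non-linearity
            nn.Linear(hidden_size, num_classes),  # class scores for the hinge loss
        )

    def forward(self, x):
        return self.net(x.flatten(1))
```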
With training/test accuracy and loss graph:
Final epoch result:
The following is the modified version of `LogisticRegression` with a hidden layer:
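The logistic-regression variant can share the same architecture; only the training criterion differs (continuing the sketch above; names are assumptions):

```python
import torch.nn as nn

modified_logreg = ModifiedLinearSVM()  # same hidden layer + ReLU architecture
criterion = nn.CrossEntropyLoss()      # cross-entropy instead of the hinge loss
```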
With training/test accuracy and loss graph:
Final epoch result:
part b
Compare the results with the linear model (without weight decay, to keep the comparison fair). Which approach works better? Why? Which approach is more prone to overfitting? Explain your findings and justify them.
The linear model works better in this case: even though the hidden-layer model achieves a lower loss, its test accuracy is only similar, so the added complexity of the hidden layer and ReLU activation did not improve performance given the small dataset.
The problem may be linearly separable enough that the model simply learns to generalise the overall behaviour of the dataset (compare grokking[^4]).
Note that the overfitting suggests there is not enough data in the given training set: we observe similar test metrics for both `LinearSVM` and `ModifiedModel` (with ReLU and a hidden layer).
So the question is not really "which works better"; the results are governed by the limited training data rather than by the architectural choice.
task 4: data augmentation
instruction
In this task, we will explore the concept of data augmentation, which is a powerful technique used to enhance the diversity of our training dataset without collecting new data. By applying various transformations to the original training images, we can create modified versions of these images. We can then use these modified images to train our model with a “richer” set of examples. The use of data augmentation helps to improve the robustness and generalization of our models. Data augmentation is particularly beneficial in tasks like image classification, where we expect the model to be invariant to slight variations of images (e.g., rotation, cropping, blurring, etc.)
For this task, you are given code that uses Gaussian Blur augmentation, which applies a Gaussian filter to slightly blur the images. If you run the code, you will see that this type of augmentation actually makes the model less accurate (compared with the Task 3 SVM test accuracy).
For this task, you must explore other types of data augmentation and find one that improves the test accuracy by at least 1 percent compared with not using any augmentation (i.e., compared with Task 3, SVM test accuracy). Only change the augmentation approach, and keep the other parts of the code unchanged. Read the PyTorch documentation on different augmentation techniques here, and then try to identify a good augmentation method from them.
Report the augmentation approach that you used, and explain why you think it helps. Also include train/test accuracy plots per epoch, and the train/test accuracy at the final epoch.
The following augmentation achieves higher test accuracy compared to `ModifiedModel` without any transformation (a sketch of the pipeline follows the list below). `ToTensor` is self-explanatory. An additional augmentation playground can also be found in the Jupyter notebook.
- We use a small `degrees` value, since digits can appear at slightly different angles in the dataset; a small rotation preserves readability while increasing variety.
- `fill` is set to 0 to preserve the black background.
- A small distortion scale simulates viewing-angle variations and helps with robustness to viewpoint changes.
- Normalising with the MNIST mean and std makes training more stable.
- We also simulate some random noise; one could use `RandomErasing` instead, but the two work essentially the same way.
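Since the exact transform pipeline lives in the notebook, the following is a reconstruction consistent with the bullets above; the specific transforms and parameter values are assumptions:

```python
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=10, fill=0),      # small rotation, black fill
    transforms.RandomPerspective(distortion_scale=0.2,  # slight viewpoint change
                                 p=0.5, fill=0),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),         # MNIST mean / std
    transforms.Lambda(lambda t: t + 0.05 * torch.randn_like(t)),  # small random noise
])
```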
The following is the final epoch result:
With graphs:
Notes

[^1]: PyTorch `MultiMarginLoss` documentation: https://pytorch.org/docs/stable/generated/torch.nn.MultiMarginLoss.html
[^2]: Given input parameters: a regularisation parameter $\lambda$, a loss function $\Delta$, and a class-sensitive feature mapping $\Psi$. In this case, we solve $\min_{w}\ \lambda\lVert w\rVert^{2} + \frac{1}{m}\sum_{i=1}^{m}\max_{y' \in \mathcal{Y}}\big(\Delta(y', y_i) + \langle w, \Psi(x_i, y') - \Psi(x_i, y_i)\rangle\big)$.
[^3]: The MNIST training set consists of 60,000 28×28 grayscale images, so the 0.25% subset used here amounts to roughly 150 samples.
[^4]: Grokking is a phenomenon in which a neural network first memorises the training data and then, with further training, abruptly generalises to unseen data, with performance jumping from random chance to near-perfect generalisation. It is typically observed in larger networks, well beyond the point of overfitting.