See also jupyter notebook, pdf, solutions

question 1.

problem 1.

  1. For the homogeneous model, the MSE on the training data is 26.1649 and on the validation data is 77.0800.

    With the non-homogeneous model, the MSE on the training data is 2.5900 and on the validation data is 8.8059.

  2. We can observe that the non-homogeneous model clearly performs better than the homogeneous model, given its significantly lower MSE (predictions are closer to the actual values). The smaller gap between its training and validation MSE also indicates better consistency, i.e. better generalisation.

    The test set MSE for the non-homogeneous model is 2.5900.

  3. In both cases the training MSE is significantly lower than the validation MSE, indicating overfitting. The non-homogeneous model shows a smaller gap between training and validation MSE, suggesting only mild overfitting. The homogeneous model shows more severe overfitting due to its constraint (the intercept is forced to zero); see the sketch below.

  1. For the homogeneous model, the MSE on the training data is 0.000 and on the validation data is 151.2655.

    With the non-homogeneous model, the MSE on the training data is 0.000 and on the validation data is 15.8158.

  2. We observe increased overfitting, given the perfect fit on the training data versus the validation MSE for both models. The non-homogeneous model still outperforms the homogeneous model, but the gap between training and validation MSE is significantly larger than in the previous case.

    This is largely due to the smaller training set (200 training samples versus 1800): the models have less data to learn from.
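For reference, here is a minimal sketch of how both fits might be computed in NumPy (the variable names `X_train`, `y_train`, `X_val`, `y_val` are placeholders; the actual data loading is in the notebook):

```python
import numpy as np

def fit_least_squares(X, y, homogeneous=True):
    """Closed-form least squares; the non-homogeneous variant appends a bias column."""
    if not homogeneous:
        X = np.hstack([X, np.ones((X.shape[0], 1))])  # add intercept feature
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def mse(X, y, w, homogeneous=True):
    if not homogeneous:
        X = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean((X @ w - y) ** 2))

# Usage with the assignment's arrays (placeholder names):
# w_hom = fit_least_squares(X_train, y_train, homogeneous=True)
# w_non = fit_least_squares(X_train, y_train, homogeneous=False)
# print(mse(X_val, y_val, w_hom), mse(X_val, y_val, w_non, homogeneous=False))
```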

problem 2.

The following graph shows the training and validation MSE as functions of $\lambda$.

  1. The best $\lambda$ is the one corresponding to the lowest point on the validation MSE curve, i.e. the one that minimizes the validation MSE. From the graph, we observe it is around $\lambda \approx 7.3891$.

  2. Using $\lambda \approx 7.3891$, we get a test MSE of around 1.3947 (see the sketch below).
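A minimal sketch of the regularized least squares (RLS) sweep over $\lambda$, assuming closed-form ridge regression and a log-spaced grid (the reported optimum $7.3891 \approx e^2$ suggests such a grid); the data arrays below are synthetic stand-ins:

```python
import numpy as np

def fit_rls(X, y, lam):
    """Regularized least squares (ridge): w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic stand-in data; replace with the assignment's training/validation arrays.
rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(200, 50)), rng.normal(size=(100, 50))
w_true = rng.normal(size=50)
y_train = X_train @ w_true + rng.normal(scale=0.5, size=200)
y_val = X_val @ w_true + rng.normal(scale=0.5, size=100)

lambdas = np.exp(np.linspace(-2.0, 4.0, 25))   # log-spaced grid; e^2 ≈ 7.389 is on it
train_mse, val_mse = [], []
for lam in lambdas:
    w = fit_rls(X_train, y_train, lam)
    train_mse.append(np.mean((X_train @ w - y_train) ** 2))
    val_mse.append(np.mean((X_val @ w - y_val) ** 2))

best_lam = lambdas[int(np.argmin(val_mse))]    # lambda at the lowest validation MSE
```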

problem 3.

We will use the 2D Discrete Cosine Transform (DCT) to transform our data, followed by feature selection to reduce dimensionality by keeping only the top-k coefficients.

Reason:

  1. The DCT is widely used in image compression (think of JPEG); it transforms an image from the spatial to the frequency domain, where most of the signal energy is concentrated in a few coefficients.
  2. Reducing the dimensionality helps mitigate overfitting, given that we only use 200 samples for training.

In this case, we will choose n_coeffs=100
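A minimal sketch of the preprocessing, assuming the inputs are images in an array of shape `(n_samples, H, W)` and taking the "top" coefficients to be the low-frequency top-left block of the DCT (one common choice; see the notebook for the exact selection used):

```python
import numpy as np
from scipy.fft import dctn

def dct_features(images, n_coeffs=100):
    """2D DCT per image, keeping the low-frequency (top-left) block of coefficients.

    images: array of shape (n_samples, H, W); n_coeffs should be a perfect square
    (here 100 -> a 10x10 block).
    """
    k = int(np.sqrt(n_coeffs))
    feats = np.empty((images.shape[0], n_coeffs))
    for i, img in enumerate(images):
        coeffs = dctn(img, norm="ortho")   # spatial domain -> frequency domain
        feats[i] = coeffs[:k, :k].ravel()  # keep the n_coeffs lowest-frequency coefficients
    return feats

# Example usage (placeholder name):
# X_train = dct_features(train_images, n_coeffs=100)
```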

See the Jupyter notebook for more information.

part 3

Report the MSE on the training and validation sets for different values of lambda and plot it. As mentioned, it should perform better in order to get the points. Choose the best value of lambda, apply your preprocessing approach to the test set, and then report the MSE after running RLS.

The following graph shows the training and validation MSE as functions of $\lambda$. The optimal value is found to be $\lambda \approx 4.4817$.

The resulting test MSE is found to be around 3.2911.
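Putting the pieces together, the test-time flow might look like the following sketch (reusing `dct_features` and `fit_rls` from the earlier sketches; `train_images`, `test_images`, `y_train`, `y_test` are placeholder names):

```python
# Assumes dct_features and fit_rls from the sketches above, plus the assignment's arrays.
X_train = dct_features(train_images, n_coeffs=100)
X_test = dct_features(test_images, n_coeffs=100)

w = fit_rls(X_train, y_train, lam=4.4817)          # best lambda from the validation sweep
test_mse = np.mean((X_test @ w - y_test) ** 2)
```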


question 2.

problem statement

In this question, we will use least squares to find the best line ($\hat{y} = ax + b$) that fits a non-linear function, namely $f(x) = 2x - x^3 - 1$.

For this, assume that you are given a set of $n$ training points $\{(x^i, y^i)\}_{i=1}^{n} = \{\left(i/n,\ 2(i/n) - (i/n)^3 - 1\right)\}_{i=1}^{n}$.

Find the line (i.e. $a, b \in \mathbb{R}$) that fits the training data best when $n \to \infty$. Write down your calculations as well as the final values for $a$ and $b$.

Additional notes: the $n \to \infty$ assumption basically means that we are dealing with an integral rather than a finite summation. You can also assume $x$ is uniformly distributed on $[0, 1]$.

We need to minimize the (limiting) sum of squared errors:

$$\text{MSE}(a,b) = \int_{0}^{1} \left(ax + b - f(x)\right)^2 \, dx$$

We can compute $\mu_{x}$ and $\mu_{y}$:

$$\begin{aligned} \mu_{x} &= \int_{0}^{1} x \, dx = \frac{1}{2} \\ \mu_{y} &= \int_{0}^{1} f(x) \, dx = \int_{0}^{1} (2x - x^3 - 1) \, dx = \left[x^2\right]_{0}^{1} - \left[\frac{x^4}{4}\right]_{0}^{1} - \left[x\right]_{0}^{1} = -\frac{1}{4} \end{aligned}$$

$$\begin{aligned} \text{Var}(x) &= E[x^2] - (E[x])^2 = \int_{0}^{1} x^2 \, dx - \left(\frac{1}{2}\right)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12} \\ \text{Cov}(x,y) &= E[xy] - E[x]\,E[y] = \int_{0}^{1} x(2x - x^3 - 1) \, dx - \left(\frac{1}{2}\right)\left(-\frac{1}{4}\right) \end{aligned}$$

Compute $E[xy] = \int_{0}^{1} (2x^2 - x^4 - x) \, dx = \frac{2}{3} - \frac{1}{5} - \frac{1}{2} = -\frac{1}{30}$.

Therefore, we can compute the covariance:

$$\text{Cov}(x,y) = -\frac{1}{30} + \frac{1}{8} = \frac{11}{120}$$

The slope $a$ and intercept $b$ can then be computed as:

$$\begin{aligned} a &= \frac{\text{Cov}(x,y)}{\text{Var}(x)} = \frac{11}{120} \times 12 = \frac{11}{10} = 1.1 \\ b &= \mu_{y} - a\,\mu_{x} = -\frac{1}{4} - \frac{11}{10} \times \frac{1}{2} = -\frac{4}{5} = -0.8 \end{aligned}$$

Thus, the best-fitting line is $\hat{y} = ax + b = \frac{11}{10}x - \frac{4}{5}$.
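As a quick sanity check (not required by the problem), the discrete least-squares fit for a large finite $n$ should approach this limit; a minimal NumPy sketch:

```python
import numpy as np

# With large n, the discrete least-squares fit should approach the analytical
# limit a = 1.1, b = -0.8 derived above.
n = 1_000_000
x = np.arange(1, n + 1) / n             # x^i = i/n, as in the problem statement
y = 2 * x - x**3 - 1
a_hat, b_hat = np.polyfit(x, y, deg=1)  # slope and intercept of the least-squares line
print(a_hat, b_hat)                     # approximately 1.1 and -0.8
```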

question 3.

problem statement

In this question, we would like to fit a line with zero y-intercept ($\hat{y} = ax$) to the curve $y = x^2$. However, instead of minimising the sum of squared errors, we want to minimise the following objective function:

$$\sum_{i} \left[\log \frac{\hat{y}^i}{y^i}\right]^2$$

Assume that the distribution of $x$ is uniform on $[2, 4]$. What is the optimal value for $a$? Show your work.

assumption: $\log$ denotes the natural logarithm (the antiderivative $x\log x - x$ and the final step $a = e^{\ell}$ used below both require this; the optimal $a$ is in fact the same for any base).

We need to minimize the objective function

$$\text{Objective}(a) = \sum_{i} \left[\log \frac{\hat{y}^i}{y^i}\right]^2,$$

where $\hat{y}^i = ax^i$ and $y^i = (x^i)^2$.

Given that $x$ is uniformly distributed on $[2, 4]$, we can express the sum as an integral (up to a constant factor that does not affect the minimizer):

$$\begin{aligned} \text{Objective}(a) &= \int_{2}^{4} \left[\log \frac{ax}{x^2}\right]^2 dx \\ &= \int_{2}^{4} \left[\log(a) + \log(x) - 2\log(x)\right]^2 dx \\ &= \int_{2}^{4} \left[\log(a) - \log(x)\right]^2 dx \end{aligned}$$

Let $\ell = \log(a)$; we can rewrite the objective function as:

$$\begin{aligned} \text{Objective}(\ell) &= \int_{2}^{4} \left[\ell - \log(x)\right]^2 dx \\ &= \int_{2}^{4} \left[\ell^2 - 2\ell\log(x) + \log^2(x)\right] dx \\ &= \ell^2 \int_{2}^{4} dx - 2\ell \int_{2}^{4} \log(x) \, dx + \int_{2}^{4} \log^2(x) \, dx \end{aligned}$$

Compute each integral:

$$\begin{aligned} I_0 &= \int_{2}^{4} dx = 4 - 2 = 2 \\ I_1 &= \int_{2}^{4} \log(x) \, dx = \left[x\log(x) - x\right]_{2}^{4} = 4\log(4) - 4 - 2\log(2) + 2 = 6\log(2) - 2 \\ I_2 &= \int_{2}^{4} \log^2(x) \, dx \end{aligned}$$

Given that we are only interested in finding the optimal $a$, we take the derivative of the objective function with respect to $\ell$ (so $I_2$ never needs to be evaluated):

$$\frac{\partial}{\partial \ell} \text{Objective}(\ell) = \frac{\partial}{\partial \ell} \left(\ell^2 I_0 - 2\ell I_1 + I_2\right) = 2\ell I_0 - 2 I_1$$

Setting this to zero gives the minimizing $\ell$:

$$\log(a) = \ell = \frac{I_1}{I_0} = \frac{6\log(2) - 2}{2} = 3\log(2) - 1$$

Therefore,

$$a_{\text{opt}} = e^{\ell} = e^{3\log(2) - 1} = e^{3\log(2)} \times \frac{1}{e} = \frac{8}{e}$$

Thus, the optimal value for $a$ is $a = 8/e \approx 2.943$.
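As a quick numerical sanity check (not required by the problem), minimizing the log-ratio objective over a dense grid of $x$ values in $[2, 4]$ lands near $8/e$; a minimal NumPy sketch using the natural logarithm:

```python
import numpy as np

# Dense sample of x on [2, 4] approximating the uniform distribution.
x = np.linspace(2.0, 4.0, 10_001)

def objective(a):
    # Mean squared log-ratio between the line a*x and the curve x**2 (natural log).
    return np.mean((np.log(a * x) - np.log(x**2)) ** 2)

# Grid search over candidate values of a.
a_grid = np.linspace(2.0, 4.0, 2_001)
a_best = a_grid[np.argmin([objective(a) for a in a_grid])]
print(a_best, 8 / np.e)   # both approximately 2.943
```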