See also: book
probability density function
if $X$ is a random variable, the probability density function (pdf) is a function $f_X$ such that: $P(a \le X \le b) = \int_a^b f_X(x)\,dx$
if the distribution of $X$ is uniform over $[a, b]$, then $f_X(x) = \frac{1}{b - a}$ for $x \in [a, b]$ and $0$ otherwise
curve fitting.
how do we fit a distribution of data over a curve?
Given a set of data points $\{(x_i, y_i)\}_{i=1}^{N}$
- (or )
In the case of 1-D ordinary least squares, the problem amounts to finding the $w$ that minimizes $\sum_{i=1}^{N} (w x_i - y_i)^2$
minimize $\sum_{i=1}^{N} (w^\top x_i - y_i)^2$
optimal solution $w^* = (X^\top X)^{-1} X^\top y$
where $x_i \in \mathbb{R}^D$, $w \in \mathbb{R}^D$, $X = \begin{bmatrix} x_1^\top \\ \vdots \\ x_N^\top \end{bmatrix} \in \mathbb{R}^{N \times D}$, $y = (y_1, \dots, y_N)^\top \in \mathbb{R}^N$
hyperplane
Hyperplane equation: $w^\top x + b = 0$
Homogeneous hyperplane: $w^\top x = 0$
Matrix form OLS:
minimize $\|Xw - y\|_2^2$
OLS solution: $w^* = (X^\top X)^{-1} X^\top y$
Example:
With
and
With
and
thus
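A minimal numpy sketch of the closed-form OLS solution above (the data here is hypothetical, not the numbers from the example):

```python
import numpy as np

# hypothetical data: N = 4 points in D = 2 dimensions
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([2.0, 1.5, 3.0, 4.5])

# closed-form OLS: w* = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over an explicit inverse for numerical stability)
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)
```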
See also Bias and intercept
adding bias in D-dimensional OLS
Model $\hat{y} = w^\top x + b$ with $w \in \mathbb{R}^D$ and $b \in \mathbb{R}$
Add a new auxiliary dimension to the input data: $\tilde{x} = \begin{bmatrix} x \\ 1 \end{bmatrix} \in \mathbb{R}^{D+1}$
Solve OLS on the augmented data: $\min_{\tilde{w}} \|\tilde{X}\tilde{w} - y\|_2^2$ with $\tilde{w} = \begin{bmatrix} w \\ b \end{bmatrix}$
Gradient for the objective: $\nabla_w \|Xw - y\|_2^2 = 2 X^\top (Xw - y)$
Jacobian for the residual $r(w) = Xw - y$: $\frac{\partial r}{\partial w} = X$
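A small numpy sketch of the augmented-data trick: the column of ones plays the role of the auxiliary dimension, and the last coefficient recovered is the bias (data and true parameters here are made up):

```python
import numpy as np

# hypothetical data: N = 5 points in D = 2 dimensions, generated with a known bias
X = np.random.randn(5, 2)
y = X @ np.array([1.5, -0.5]) + 2.0   # true weights [1.5, -0.5], true bias 2.0

# append the auxiliary constant dimension
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])

# OLS on the augmented data; the last entry of w_tilde is the bias b
w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
w, b = w_tilde[:-1], w_tilde[-1]
print(w, b)   # recovers [1.5, -0.5] and 2.0 (up to numerical error)
```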
result
overfitting.
strategies to avoid:
- add more training data
- L1 (Lasso) or L2 (Ridge) regularization
    - add a penalty term to the objective function
    - L1 yields sparse models: the penalty is the absolute value of the weights, which forces some model coefficients to become exactly 0 (and makes the fit more robust to outliers)
    - L2 does not perform feature selection: weights are only shrunk towards 0 instead of exactly to 0 like L1, so every feature stays in the model; it is often the better choice for highly non-linear settings (see the sketch after this list)
- Cross-validation
    - split data into k folds
- early stopping
- dropout, see example
    - randomly selected neurons are ignored ⇒ makes the network less sensitive
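A short illustration of the L1-vs-L2 behaviour described above; this is a sketch that assumes scikit-learn is available, and the data and regularization strength alpha are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# only the first two features actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: many coefficients become exactly 0
ridge = Ridge(alpha=0.1).fit(X, y)   # L2: coefficients shrink but stay non-zero

print("L1 coefficients:", np.round(lasso.coef_, 3))
print("L2 coefficients:", np.round(ridge.coef_, 3))
```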
sample complexity of learning multivariate polynomials
regularization.
L2 regularization: $\min_{w} \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2$
Solving
Solve $\nabla_w \left( \|Xw - y\|_2^2 + \lambda \|w\|_2^2 \right) = 0$, i.e. $(X^\top X + \lambda I) w = X^\top y$, giving $w^* = (X^\top X + \lambda I)^{-1} X^\top y$
Inverse exists as long as $\lambda > 0$, since $X^\top X + \lambda I$ is then positive definite
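A minimal numpy sketch of the ridge (L2-regularized) closed form above; lam is an arbitrary regularization strength and the data is made up:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# hypothetical data
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.5))
```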
polynomial curve-fitting revisited
feature map: $\phi(x) = (1, x, x^2, \dots, x^M)^\top$, where $M$ is the degree of the polynomial
training: $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$, where $\Phi = \begin{bmatrix} \phi(x_1)^\top \\ \vdots \\ \phi(x_N)^\top \end{bmatrix}$
prediction: $\hat{y}(x) = \phi(x)^\top w^*$
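A sketch of least squares with a polynomial feature map, using numpy's Vandermonde matrix as $\Phi$ (the degree M and the toy target are arbitrary choices):

```python
import numpy as np

M = 3                                  # polynomial degree (arbitrary)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)

# feature map phi(x) = (1, x, x^2, ..., x^M); rows of Phi are phi(x_i)
Phi = np.vander(x, M + 1, increasing=True)

# training: closed-form least squares on the features
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# prediction at a new point
x_new = np.array([0.25])
phi_new = np.vander(x_new, M + 1, increasing=True)[0]
print(phi_new @ w)
```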
choices of $\phi$
- Gaussian basis functions: $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$
- Polynomial basis functions: $\phi_j(x) = x^j$
- Fourier basis functions: DFT, FFT
kernels
compute inner products in a higher-dimensional feature space without forming the features explicitly: $k(x, y) = \phi(x)^\top \phi(y)$
Polynomial kernels of degree 2: $k(x, y) = (x^\top y)^2$
degree M polynomial: $k(x, y) = (1 + x^\top y)^M$
How many operations? Explicitly forming $\phi(x)$ and taking the inner product costs $O(D^M)$ ops
- improved: computing $k(x, y) = (1 + x^\top y)^M$ directly costs only $O(D)$ ops (see the check below)
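A quick numerical check of the degree-2 case: the kernel $(x^\top y)^2$ equals the inner product of explicit feature maps made of all pairwise products $x_i x_j$ (one common convention for $\phi$; the vectors are arbitrary):

```python
import numpy as np

def phi_deg2(x):
    # explicit degree-2 feature map: all D^2 products x_i * x_j
    return np.outer(x, x).ravel()

rng = np.random.default_rng(2)
x, y = rng.normal(size=5), rng.normal(size=5)

kernel = (x @ y) ** 2                    # O(D): one inner product, then a square
explicit = phi_deg2(x) @ phi_deg2(y)     # O(D^2) features formed explicitly

print(np.isclose(kernel, explicit))      # True
```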
kernel least squares
Steps:
- show that $w = X^\top \alpha$, or $w = \sum_{i=1}^{N} \alpha_i x_i$ for some $\alpha \in \mathbb{R}^N$ (the optimal $w$ lies in the span of the training points)
proof
- Use $K = X X^\top$ (the matrix of pairwise inner products, $K_{ij} = k(x_i, x_j)$) to form the dual representation of the problem.
Solution: $\alpha^* = (K + \lambda I)^{-1} y$, and predictions are $\hat{y}(x) = \sum_{i=1}^{N} \alpha_i^* \, k(x_i, x)$ (sketched in code below)
choices of kernel
- polynomial kernel: $k(x, y) = (1 + x^\top y)^M$
- Gaussian kernel: $k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$
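A compact numpy sketch of kernel least squares with a Gaussian kernel, assuming the dual form $\alpha = (K + \lambda I)^{-1} y$ from above (lam and sigma are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam = 1e-2
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

X_test = np.array([[0.5], [1.0]])
y_pred = gaussian_kernel(X_test, X) @ alpha            # sum_i alpha_i k(x, x_i)
print(y_pred)                                          # roughly sin(0.5), sin(1.0)
```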
mapping high-dimensional data
minimising reconstruction error
- Given $x_1, \dots, x_N \in \mathbb{R}^D$, find an orthonormal basis $U = [u_1, \dots, u_K] \in \mathbb{R}^{D \times K}$ that minimises the reconstruction error: $\sum_{i=1}^{N} \| x_i - U U^\top x_i \|_2^2$
if $K = D$, then the error is zero.
Solution: the columns of $U$ are the top-$K$ eigenvectors of the data covariance matrix.
- $U$ is subject to $U^\top U = I$ (orthonormal columns)
- assuming the data is centered, or $\frac{1}{N} \sum_{i=1}^{N} x_i = 0$
eigenvalue decomposition
pca
Idea: given input $x_1, \dots, x_N \in \mathbb{R}^D$, find the directions of maximum variance.
Thus form the covariance matrix $\Sigma = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^\top$ (for centered data).
Find the eigenvectors/values of $\Sigma$: $\Sigma u_j = \lambda_j u_j$
Optimal $U$ is: the $K$ eigenvectors of $\Sigma$ with the largest eigenvalues, $U = [u_1, \dots, u_K]$
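A minimal PCA sketch via eigendecomposition of the covariance matrix (the number of components K and the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))             # hypothetical data, N x D

Xc = X - X.mean(axis=0)                   # center the data
Sigma = Xc.T @ Xc / len(Xc)               # covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order

K = 2
U = eigvecs[:, ::-1][:, :K]               # top-K eigenvectors (largest eigenvalues)
Z = Xc @ U                                # low-dimensional representation
print(Z.shape)                            # (200, 2)
```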
bayes rules and chain rules
Joint distribution: $P(X, Y) = P(X \mid Y)\, P(Y)$
Conditional distribution of $X$ given $Y$: $P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$
Bayes rule: $P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)}$
Chain rule: $P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \dots, X_{i-1})$
i.i.d assumption
assume an underlying distribution $\mathcal{D}$, and that the train and test sets are independent and identically distributed (i.i.d.) samples from it
Example: flip a coin
Outcome is $x \in \{0, 1\}$ (heads or tails) with $P(x = 1) = \theta$ and $P(x = 0) = 1 - \theta$, or $P(x) = \theta^x (1 - \theta)^{1 - x}$; $x$ is the Bernoulli random variable.
The likelihood of $N$ i.i.d. flips $x_1, \dots, x_N$ is $L(\theta) = \prod_{i=1}^{N} \theta^{x_i} (1 - \theta)^{1 - x_i}$, and
$\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} x_i$ (the fraction of heads) would be the maximum likelihood estimate
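A tiny simulation of the Bernoulli MLE; the true bias of 0.7 and the number of flips are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
theta_true = 0.7                          # arbitrary true bias of the coin
flips = rng.binomial(1, theta_true, size=1000)

theta_mle = flips.mean()                  # MLE: fraction of heads
print(theta_mle)                          # close to 0.7
```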
maximum a posteriori estimation
What if we have a prior $p(\theta)$ over the parameter? The maximum a posteriori (MAP) estimate maximises the posterior instead: $\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta} p(x \mid \theta)\, p(\theta)$
expected error minimisation
Squared loss: $\ell(y, \hat{y}) = (y - \hat{y})^2$
The solution to $\min_{f} \; \mathbb{E}_{x, y}\big[ (y - f(x))^2 \big]$ is the conditional mean $f^*(x) = \mathbb{E}[y \mid x]$
Instead we only have a finite sample, so we minimise the empirical error $\frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2$
error decomposition
bias-variance decompositions
For a linear estimator $\hat{f}$: $\mathbb{E}\big[ (y - \hat{f}(x))^2 \big] = \big( \mathbb{E}[\hat{f}(x)] - f(x) \big)^2 + \mathbb{E}\big[ (\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2 \big] + \sigma^2$, i.e. bias² + variance + irreducible noise
nearest neighbour
See also: slides 13, slides 14, slides 15
Think of continuous loss functions: margin loss, cross-entropy/negative log-likelihood, etc.
linear programming
Given that data is linearly separable
So there exists $w$ such that $y_i \, w^\top x_i > 0$ for all $i$
So, after rescaling $w$, we can require $y_i \, w^\top x_i \ge 1$ for all $i$, which is a linear feasibility problem solvable by linear programming
perceptron
Rosenblatt’s perceptron algorithm
greedy update: whenever an example $(x_i, y_i)$ is misclassified (i.e. $y_i \, w^\top x_i \le 0$), update $w \leftarrow w + y_i x_i$
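A short sketch of Rosenblatt's perceptron with the greedy update above (homogeneous version, labels in {-1, +1}; the toy data is made up and linearly separable through the origin):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Greedy perceptron: w += y_i * x_i whenever example i is misclassified."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:        # misclassified (or on the boundary)
                w += yi * xi
                mistakes += 1
        if mistakes == 0:                 # converged: every point classified correctly
            break
    return w

X = np.array([[2.0, 1.0], [1.5, 2.0], [3.0, 0.5],
              [-2.0, -1.0], [-1.0, -2.5], [-3.0, -0.5]])
y = np.array([1, 1, 1, -1, -1, -1])
w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))            # True: all points on the correct side
```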
SVM
idea: maximize the margin, which makes the classifier more robust to “perturbations”
Euclidean distance between a point $x$ and the hyperplane parametrized by $(w, b)$ is: $\frac{|w^\top x + b|}{\|w\|_2}$
Assuming $|w^\top x + b| = 1$ for the closest points, then the distance is $\frac{1}{\|w\|_2}$
maximum margin hyperplane
$(w, b)$ has margin $\gamma$ if $y_i (w^\top x_i + b) \ge \gamma$ for all $i$
Margin: $\gamma = \min_i \frac{y_i (w^\top x_i + b)}{\|w\|_2}$
linear algebra review.
Diagonal matrix: every entry except the diagonal is zero.
trace: sum of the entries on the main diagonal: $\operatorname{tr}(A) = \sum_{i=1}^{n} A_{ii}$
Properties of transpose: $(A^\top)^\top = A$, $(A + B)^\top = A^\top + B^\top$, $(AB)^\top = B^\top A^\top$
Properties of inverse: $(A^{-1})^{-1} = A$, $(AB)^{-1} = B^{-1} A^{-1}$, $(A^{-1})^\top = (A^\top)^{-1}$
Inverse of a matrix
if the matrix $A^{-1}$ exists, then $A$ is invertible (non-singular), and vice versa.
quadratic form
Given a square matrix $A \in \mathbb{R}^{n \times n}$, the quadratic form is defined as: $x^\top A x = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j$
norms
A function $f: \mathbb{R}^n \to \mathbb{R}$ is a norm if it satisfies the following properties:
- non-negativity: $f(x) \ge 0$ for all $x \in \mathbb{R}^n$
- definiteness: $f(x) = 0$ if and only if $x = 0$
- homogeneity: $f(tx) = |t|\, f(x)$ for all $t \in \mathbb{R}$
- triangle inequality: $f(x + y) \le f(x) + f(y)$
symmetry
A square matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^\top$
Anti-symmetric (skew-symmetric) if $A = -A^\top$
Given any square matrix $A$, the matrix $\frac{1}{2}(A + A^\top)$ is symmetric, and $\frac{1}{2}(A - A^\top)$ is anti-symmetric.
positive definite
$A$ is positive definite if $x^\top A x > 0$ for all non-zero $x \in \mathbb{R}^n$.
- It is denoted by $A \succ 0$.
- The set of all positive definite matrices is denoted by $\mathbb{S}_{++}^{n}$
positive semi-definite
$A$ is positive semi-definite if $x^\top A x \ge 0$ for all $x \in \mathbb{R}^n$.
- It is denoted by $A \succeq 0$.
- The set of all positive semi-definite matrices is denoted by $\mathbb{S}_{+}^{n}$
negative definite
$A$ is negative definite if $x^\top A x < 0$ for all non-zero $x \in \mathbb{R}^n$.
- It is denoted by $A \prec 0$.
- The set of all negative definite matrices is denoted by $-\mathbb{S}_{++}^{n}$
negative semi-definite
$A$ is negative semi-definite if $x^\top A x \le 0$ for all $x \in \mathbb{R}^n$.
- It is denoted by $A \preceq 0$.
- The set of all negative semi-definite matrices is denoted by $-\mathbb{S}_{+}^{n}$
A symmetric matrix is indefinite if it is neither positive semi-definite nor negative semi-definite.
Given any matrix $A \in \mathbb{R}^{m \times n}$, the matrix $G = A^\top A$ is always positive semi-definite (known as the Gram matrix)
Proof: for any $x$, $x^\top (A^\top A) x = (Ax)^\top (Ax) = \|Ax\|_2^2 \ge 0$.
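A small numerical illustration of the Gram-matrix claim (A is an arbitrary rectangular matrix):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(4, 3))              # arbitrary (not even square) matrix
G = A.T @ A                              # Gram matrix

# a PSD matrix has only non-negative eigenvalues (up to floating-point error)
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))   # True
```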
eigenvalues and eigenvectors
The non-zero vector $x \in \mathbb{C}^n$ is an eigenvector of $A$ and $\lambda \in \mathbb{C}$ is called the eigenvalue of $A$ if: $Ax = \lambda x$
finding eigenvalues
Eigenvalues are the roots of the characteristic polynomial $\det(\lambda I - A) = 0$; solve for eigenvectors via $(\lambda I - A) x = 0$
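A quick numpy check of the definition $Ax = \lambda x$ (the 2×2 matrix is arbitrary):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)      # columns of eigvecs are the eigenvectors

x, lam = eigvecs[:, 0], eigvals[0]       # first eigenpair
print(np.allclose(A @ x, lam * x))       # True
```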
matrix representation of a system of linear equations
Equivalent matrix representation of the system: $Ax = b$, with $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$
Transpose of a matrix
$(A^\top)_{ij} = A_{ji}$, and for $A \in \mathbb{R}^{m \times n}$, $A^\top \in \mathbb{R}^{n \times m}$
dot product.
$x^\top y = \sum_{i=1}^{n} x_i y_i$
linear combination of columns
Let $A \in \mathbb{R}^{m \times n}$ with columns $a_1, \dots, a_n$, and $x \in \mathbb{R}^n$
Then $Ax = \sum_{j=1}^{n} x_j \, a_j$, a linear combination of the columns of $A$
inverse of a matrix
The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is the unique matrix denoted by $A^{-1}$ such that $A^{-1} A = A A^{-1} = I$
euclidean norm
L2 norm: $\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^\top x}$
L1 norm: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$
L∞ norm: $\|x\|_\infty = \max_i |x_i|$
p-norm: $\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
Comparison: $\|x\|_2 \le \|x\|_1 \le \sqrt{n}\, \|x\|_2$
One can prove this with Cauchy-Schwarz inequality
linear dependence of vectors
Given vectors $x_1, \dots, x_k \in \mathbb{R}^n$ and scalars $\alpha_1, \dots, \alpha_k$ not all zero: the vectors are linearly dependent if $\sum_{i=1}^{k} \alpha_i x_i = 0$ has such a solution, and linearly independent otherwise
Span
Given a set of vectors $\{x_1, \dots, x_k\}$, the span of the set is the set of all possible linear combinations of the vectors: $\operatorname{span}(\{x_1, \dots, x_k\}) = \left\{ \sum_{i=1}^{k} \alpha_i x_i : \alpha_i \in \mathbb{R} \right\}$
If $x_1, \dots, x_n \in \mathbb{R}^n$ are linearly independent, then the span of the set is the entire space $\mathbb{R}^n$
Rank
For a matrix $A \in \mathbb{R}^{m \times n}$:
- column rank: max number of linearly independent columns of $A$
- row rank: max number of linearly independent rows of $A$
If $\operatorname{rank}(A) = m$, then the rows are linearly independent. If $\operatorname{rank}(A) = n$, then the columns are linearly independent.
rank of a matrix is the number of linearly independent columns of $A$: $\operatorname{rank}(A) = \text{column rank}(A) = \text{row rank}(A) \le \min(m, n)$
- if $A$ is full rank, then $\operatorname{rank}(A) = \min(m, n)$
solving linear system of equations
If $A$ is invertible, there exists a unique solution: $x = A^{-1} b$
Range and Projection
Given a matrix $A \in \mathbb{R}^{m \times n}$, the range of $A$, denoted by $\mathcal{R}(A)$, is the span of the columns of $A$: $\mathcal{R}(A) = \{ Ax : x \in \mathbb{R}^n \}$
Projection of a vector $y \in \mathbb{R}^m$ onto $\mathcal{R}(A)$ is the vector in that span that is as close as possible to $y$ w.r.t. the $\ell_2$ norm: $\operatorname{Proj}(y; A) = \arg\min_{v \in \mathcal{R}(A)} \|v - y\|_2$
Null space of $A$
$\mathcal{N}(A)$ is the set of all vectors that satisfy $Ax = 0$: $\mathcal{N}(A) = \{ x \in \mathbb{R}^n : Ax = 0 \}$
probability theory
With Bayes rule we have $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
The chain rule states, for events $A_1, \dots, A_n$: $P(A_1 \cap \dots \cap A_n) = P(A_1)\, P(A_2 \mid A_1) \cdots P(A_n \mid A_1, \dots, A_{n-1})$
Law of Total Probability
If $B_1, \dots, B_n$ form a finite partition of the sample space, or $\bigcup_{i} B_i = \Omega$ with $B_i \cap B_j = \emptyset$ for $i \ne j$, then the law of total probability states that for an event $A$: $P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)$
cumulative distribution function
For a random variable $X$, a CDF is defined as: $F_X(x) = P(X \le x)$
probability mass function
for a discrete random variable $X$, the probability mass function is defined as: $p_X(x) = P(X = x)$
probability density function
for a continuous random variable $X$, the probability density function is defined as: $f_X(x) = \frac{d F_X(x)}{dx}$
Expectation
for a discrete random variable $X$ with PMF $p_X$ and a function $g$, the expectation of $g(X)$ is: $\mathbb{E}[g(X)] = \sum_{x} g(x)\, p_X(x)$
for a continuous random variable $X$ with PDF $f_X$, the expectation of $g(X)$ is: $\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx$
Therefore, the mean of a random variable $X$ is: $\mu = \mathbb{E}[X]$
Variance of a random variable $X$ is: $\operatorname{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
discrete random variables
Bernoulli distribution: $P(X = 1) = p$, $P(X = 0) = 1 - p$
Binomial distribution: $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$
Poisson distribution: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$
continuous random variables
Uniform distribution: $f_X(x) = \frac{1}{b - a}$ for $x \in [a, b]$, $0$ otherwise
Exponential distribution: $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$
Gaussian distribution: $f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$