Statistical Machine Learning

SGD > GD for online learning

mini-batch SGD trade-off: too large a mini-batch loses the speed and noise benefits and approaches full-batch GD; too small gives noisy gradient estimates

gated recurrent unit (GRU): an RNN variant with gating

principal component analysis (PCA): a linear autoencoder; used for dimensionality reduction

Bayesian decision theory

loss $\lambda(\alpha_i \mid C_j)$: the loss of taking action $\alpha_i$ when the sample belongs to class $C_j$

goal: minimize the conditional risk $R(\alpha_i \mid x) = \sum_j \lambda(\alpha_i \mid C_j)\, P(C_j \mid x)$

for 0-1 loss, the minimum-risk action is the class with maximum posterior $P(C_i \mid x)$; its risk is $1 - \max_i P(C_i \mid x)$

reject class: an additional $(K{+}1)$-th action with fixed loss $\lambda_r$

  • reject when $1 - \max_i P(C_i \mid x) > \lambda_r$ (even the best class is too risky)

or, equivalently, maximize a discriminant function $g_i(x)$, e.g. $g_i(x) = P(C_i \mid x)$ or $\log p(x \mid C_i) + \log P(C_i)$
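
A minimal numpy sketch of this decision rule; the function name and the reject_cost value are illustrative, and the posteriors are assumed to be given:

```python
import numpy as np

def bayes_decision(posteriors, reject_cost=0.3):
    """Minimum-risk decision under 0-1 loss with a reject option.

    posteriors : array of P(C_i | x) for one sample (sums to 1)
    reject_cost: fixed loss lambda_r for choosing the reject action
    Returns the chosen class index, or -1 for "reject".
    """
    posteriors = np.asarray(posteriors, dtype=float)
    best = int(np.argmax(posteriors))
    # Risk of classifying as `best` under 0-1 loss is 1 - max posterior;
    # reject when that risk exceeds the fixed rejection loss.
    if 1.0 - posteriors[best] > reject_cost:
        return -1
    return best

print(bayes_decision([0.80, 0.15, 0.05]))   # confident enough -> class 0
print(bayes_decision([0.40, 0.35, 0.25]))   # too uncertain    -> -1 (reject)
```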

maximum likelihood estimator (MLE)

parametric (assume a distribution family with a fixed set of parameters to estimate) vs non-parametric (no fixed parametric form, e.g. KNN)
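
A small parametric-MLE example, assuming a Gaussian model whose MLE has the well-known closed form (sample mean and biased sample variance); the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic sample

# Gaussian MLE has a closed form: the sample mean and the (biased) sample variance
# maximize the log-likelihood  sum_i log N(x_i | mu, sigma^2).
mu_hat = data.mean()
sigma2_hat = ((data - mu_hat) ** 2).mean()   # note: divides by n, not n-1

print(f"mu_hat     = {mu_hat:.3f}  (true 2.0)")
print(f"sigma2_hat = {sigma2_hat:.3f} (true {1.5 ** 2})")
```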

linear regression

regularization in lasso: L1 penalty $\lambda \lVert w \rVert_1$ (vs. ridge's L2 penalty $\lambda \lVert w \rVert_2^2$); encourages sparse weights

closed-form solution $w = (X^\top X)^{-1} X^\top y$, assuming $X^\top X$ is invertible; with an L2 (ridge) penalty, $X^\top X + \lambda I$ is always invertible (proof: it is positive definite)
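
A sketch of the closed-form solution with an L2 penalty (ridge shown here because lasso's L1 penalty has no closed form); the function name and the lam default are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam=1e-3):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y.

    X^T X + lam*I is positive definite for lam > 0, so the inverse exists.
    (Lasso's L1 penalty has no closed form; it needs an iterative solver.)
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)     # solve() is more stable than an explicit inverse

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)
print(ridge_fit(X, y).round(2))            # close to [ 1. , -2. ,  0.5]
```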

K-nearest neighbors (KNN)

need to try different values of $k$ (e.g. pick by cross-validation); small $k$ overfits, large $k$ oversmooths
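
A minimal KNN classifier sketch (Euclidean distance, majority vote); the toy data and the function name are made up for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    return np.bincount(votes).argmax()

# toy data: two clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
for k in (1, 3, 5):                        # try different k, e.g. via cross-validation
    print(k, knn_predict(X_train, y_train, np.array([0.8, 0.8]), k=k))
```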

support vector machine (SVM)

hard-margin binary SVM

objective, maximize minimum margin: $\max_{w,b} \min_i \frac{y_i (w^\top x_i + b)}{\lVert w \rVert}$, equivalent to $\min_{w,b} \frac{1}{2}\lVert w \rVert^2$ s.t. $y_i (w^\top x_i + b) \ge 1$ for all $i$

by rescaling $(w, b)$ so that $\min_i y_i (w^\top x_i + b) = 1$

apply Lagrange multiplier: $L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2 - \sum_i \alpha_i \bigl[ y_i (w^\top x_i + b) - 1 \bigr]$ with $\alpha_i \ge 0$; setting $\partial L / \partial w = 0$ gives $w = \sum_i \alpha_i y_i x_i$, and $\partial L / \partial b = 0$ gives $\sum_i \alpha_i y_i = 0$

final objective (dual): $\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$ s.t. $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$

solution: sequential minimal optimization (SMO)

  • fix all but 2 multipliers $\alpha_i, \alpha_j$, solve for that pair analytically, and iterate
  • 2 variables (not 1) because the constraint $\sum_i \alpha_i y_i = 0$ means changing one $\alpha$ requires adjusting another

soft-margin binary SVM

hinge loss $\max(0,\, 1 - y_i (w^\top x_i + b))$ replaces the hard constraint; objective: $\min_{w,b} \frac{1}{2}\lVert w \rVert^2 + C \sum_i \max(0,\, 1 - y_i (w^\top x_i + b))$
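
A minimal sketch of this soft-margin primal objective trained by subgradient descent (not SMO); the hyperparameters C, lr, and epochs are arbitrary illustrative values:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM trained by subgradient descent on the primal:
        (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)),  y_i in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svm(X, y)
print("train accuracy:", (np.sign(X @ w + b) == y).mean())
```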

kernel SVM

solve a non-linear problem with a linear classifier by mapping inputs into a feature space $\phi(x)$

kernel $k(x, x') = \phi(x)^\top \phi(x')$, computed without forming $\phi$ explicitly, so there is no restriction on the feature-space dimension (it may be infinite)

objective: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$ s.t. $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$

positive-definite kernel

a positive-definite kernel outputs a positive-definite Gram matrix for any set of inputs

  • positive-definite matrix: pivots > 0, or all eigenvalues > 0, or all leading subdeterminants (principal minors) > 0
  • Hilbert space: a complete vector space with an inner product that is symmetric, positive-definite, and linear

alternative definition of kernel:

  • symmetric
  • positive-definite: the Gram matrix is positive semi-definite for any finite set of inputs (numerical check in the sketch below)
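
A quick numerical check of this definition, assuming an RBF kernel: its Gram matrix should be symmetric with non-negative eigenvalues (up to floating-point error):

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the RBF kernel."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = rbf_kernel(X)
eigvals = np.linalg.eigvalsh(K)            # symmetric matrix => real eigenvalues
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", eigvals.min())    # >= 0 up to floating-point error
```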

dimensionality reduction by principal component analysis (PCA)

lossy transformation from $d$ dimensions to $k < d$ dimensions

  • mean: $\bar{x} = \frac{1}{n}\sum_i x_i$
  • covariance: $S = \frac{1}{n} X_c^\top X_c$, where $X_c$ is the centered data
    • centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, so $X_c = H X$
    • proof: $H$ is symmetric and idempotent ($H^\top = H$, $HH = H$), so $X_c^\top X_c = X^\top H X$

maximize the projected variance $u^\top S u$ subject to $\lVert u \rVert = 1$; the solution is the top eigenvector(s) of $S$

equivalently, minimize the distance between the full representation ($d$-dimensional) and the PCA reconstruction ($k$-dimensional code):

  • centering: $\tilde{x}_i = x_i - \bar{x}$
  • lossless transformation with an orthonormal basis $u_1, \dots, u_d$: $x_i = \bar{x} + \sum_{j=1}^{d} (u_j^\top \tilde{x}_i)\, u_j$
  • PCA transformation keeps only the top $k$ terms: $\hat{x}_i = \bar{x} + \sum_{j=1}^{k} (u_j^\top \tilde{x}_i)\, u_j$

objective: $\min_{u_1,\dots,u_k} \sum_i \lVert x_i - \hat{x}_i \rVert^2$, minimized by taking $u_1, \dots, u_k$ as the top-$k$ eigenvectors of the covariance $S$
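
A minimal PCA sketch via eigendecomposition of the covariance matrix; the synthetic data and the function name are illustrative:

```python
import numpy as np

def pca(X, k):
    """Project X (n x d) onto the top-k eigenvectors of its covariance matrix."""
    X_centered = X - X.mean(axis=0)                      # centering
    cov = X_centered.T @ X_centered / len(X)             # d x d covariance
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending eigenvalues
    U_k = eigvecs[:, ::-1][:, :k]                        # top-k principal directions
    Z = X_centered @ U_k                                 # k-dimensional codes
    X_hat = Z @ U_k.T + X.mean(axis=0)                   # lossy reconstruction
    return Z, X_hat

rng = np.random.default_rng(0)
# 3-D data that mostly lies in a 2-D plane
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(500, 3))
Z, X_hat = pca(X, k=2)
print("reconstruction error:", np.mean((X - X_hat) ** 2))   # small
```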

Monte Carlo sampling method

  • sampling
    • purpose: estimate quantities defined by sums/integrals (e.g. expectations) from samples; see the sketch after this list
    • good samples: drawn from regions of high probability & independent of each other
  • Monte Carlo drawbacks
    • basic samplers do not work well in high dimension
    • assume sample independence
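
A minimal Monte Carlo sketch: estimate $\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(0, 1)$ (true value 1) by a sample average:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(x)] for x ~ N(0, 1) with f(x) = x^2 by averaging i.i.d. samples.
n = 100_000
samples = rng.normal(size=n)
estimate = np.mean(samples ** 2)
print(f"Monte Carlo estimate: {estimate:.4f} (true 1.0)")
```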

transformation method

  1. assume we can draw $u \sim \mathrm{Uniform}(0, 1)$
  2. get a sample $u$
  3. assume the target CDF $F$ is known and invertible
  4. solve $F(x) = u$ for $x$, i.e. $x = F^{-1}(u)$
  5. get another sample and repeat (see the sketch after this list)
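
A sketch of the transformation method for an exponential distribution, whose CDF $F(x) = 1 - e^{-\lambda x}$ is easy to invert:

```python
import numpy as np

# Transformation (inverse-CDF) method for Exponential(rate):
#   CDF F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -log(1 - u) / rate.
rng = np.random.default_rng(0)
rate = 2.0
u = rng.uniform(size=100_000)              # step 2: uniform samples
x = -np.log(1.0 - u) / rate                # step 4: solve F(x) = u for x

print("sample mean:", x.mean(), "(true 1/rate =", 1 / rate, ")")
```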

rejection sampling

  1. define a proposal distribution $q$ and constant $M$ s.t. $M q(x) \ge p(x)$ for all $x$
  2. get a sample $x \sim q$
  3. rate $r = \frac{p(x)}{M q(x)}$
  4. select a random variable $u \sim \mathrm{Uniform}(0, 1)$
  5. if $u \le r$, accept sample $x$; else reject (see the sketch after this list)
  • rejects most samples when $M$ is large
  • hard to determine $M$
  • wastes iterations
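
A rejection-sampling sketch with target $p(x) = 6x(1-x)$ (a Beta(2, 2)) and a Uniform(0, 1) proposal, so $M = 1.5$ works as the envelope constant:

```python
import numpy as np

rng = np.random.default_rng(0)

def target_pdf(x):
    """Target density p(x) = 6 x (1 - x) on [0, 1] (a Beta(2, 2))."""
    return 6.0 * x * (1.0 - x)

# Proposal q = Uniform(0, 1) with envelope constant M such that M * q(x) >= p(x).
M = 1.5
accepted = []
for _ in range(20_000):
    x = rng.uniform()                       # sample from the proposal q
    r = target_pdf(x) / (M * 1.0)           # acceptance ratio p(x) / (M q(x))
    if rng.uniform() <= r:                  # accept with probability r
        accepted.append(x)

accepted = np.array(accepted)
print("acceptance rate:", len(accepted) / 20_000)   # about 1/M = 2/3
print("sample mean:", accepted.mean(), "(true 0.5)")
```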

importance sampling

for a value $x$ following a distribution with PDF $p$ whose CDF we do not know (cannot sample directly), want the expectation $\mathbb{E}_p[f(x)]$

define a proposal distribution $q$ with known CDF (easy to sample from)

weight (importance): $w(x) = \frac{p(x)}{q(x)}$, so $\mathbb{E}_p[f(x)] = \mathbb{E}_q[f(x)\, w(x)] \approx \frac{1}{n}\sum_i f(x_i)\, w(x_i)$ for $x_i \sim q$
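
An importance-sampling sketch; the target $\mathcal{N}(3, 1)$ and proposal $\mathcal{N}(0, 2)$ are arbitrary choices for illustration (the self-normalized variant at the end also works when $p$ is only known up to a constant):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Want E_p[f(x)] with p = N(3, 1), pretending we can evaluate p's PDF but not sample it.
# Sample from a proposal q = N(0, 2) instead and reweight by w(x) = p(x) / q(x).
n = 200_000
x = rng.normal(loc=0.0, scale=2.0, size=n)                    # samples from q
w = gaussian_pdf(x, 3.0, 1.0) / gaussian_pdf(x, 0.0, 2.0)     # importance weights

f = x                                                         # f(x) = x, so E_p[f] = 3
print("plain IS estimate:", np.sum(w * f) / n)
print("self-normalized  :", np.average(f, weights=w))         # divides by the weight sum
```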

sampling-importance-resampling

  1. get samples $x_1, \dots, x_n$ from a proposal $q$ with known CDF
  2. normalize the importance weights: $\tilde{w}_i = \frac{w_i}{\sum_j w_j}$ with $w_i = \frac{p(x_i)}{q(x_i)}$
  3. treat $\tilde{w}_i$ as the probability of $x_i$ and resample from $\{x_i\}$ with those probabilities

simulated annealing

  • to avoid getting trapped in a local minimum
  • still need to try multiple times (random restarts)

when seeking a minimum, accept an increase $\Delta E$ in the objective with probability $e^{-\Delta E / T}$

where the temperature $T$ decreases over time (cooling schedule); see the sketch below
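
A simulated-annealing sketch on a 1-D multimodal objective; the proposal scale, cooling rate, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def f(x):
    """Objective with many local minima."""
    return np.sin(5 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
x = 4.0                                    # deliberately start near a poor local minimum
best_x = x
T = 1.0                                    # initial temperature
for step in range(5_000):
    x_new = x + rng.normal(scale=0.5)      # random local move
    delta = f(x_new) - f(x)
    # Always accept improvements; accept an increase with probability exp(-delta / T).
    if delta < 0 or rng.uniform() < np.exp(-delta / T):
        x = x_new
    if f(x) < f(best_x):
        best_x = x
    T *= 0.999                             # cooling schedule

print("found x =", round(best_x, 3), "f(x) =", round(f(best_x), 3))
```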

Markov chain

  • transition matrix $P$, $P_{ij} = \Pr(X_{t+1} = j \mid X_t = i)$
    • stochastic matrix, because each row sums to 1
      • converges to a stationary distribution $\pi$ with $\pi = \pi P$ (for an irreducible, aperiodic chain)
        • all eigenvalues satisfy $|\lambda| \le 1$
        • $P^t = Q \Lambda^t Q^{-1}$, where $\Lambda$ is diagonal with entries $\lambda_i$, so the powers of $P$ converge
  • $\pi_t(j)$: probability of being at state $j$ at time $t$; $\pi_{t+1} = \pi_t P$
  • sampling method: given a PDF $p$, set the transition matrix s.t. the stationary distribution is $p$

detailed balance

a Markov chain has $\pi$ as a stationary distribution if $\pi(i)\, P_{ij} = \pi(j)\, P_{ji}$ for all $i, j$

Markov chain Monte Carlo (MCMC)

  • works in high dimension
  • samples are deliberately dependent (they form a Markov chain), dropping the independence assumption of plain Monte Carlo

Metropolis-Hastings algorithm (MH algorithm)

want to sample from a target distribution $p$ (possibly known only up to a normalizing constant)

design a Markov chain w/ stationary distribution $p$:

  1. get a Markov chain w/ proposal $q(x' \mid x)$ s.t. its stationary distribution is not necessarily $p$
  2. acceptance rate $\alpha(x, x') = \min\!\left(1,\ \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$
  3. use the new Markov chain w/ transition $q(x' \mid x)\, \alpha(x, x')$ for $x' \ne x$
    • the remaining (rejected) probability mass goes to staying at $x$, so each row still sums to 1
  • drawback: do not know when the chain is stationary/converged (burn-in needed); samples may be dependent (see the sketch after this list)
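
A Metropolis-Hastings sketch with a symmetric random-walk proposal (so the $q$ terms cancel in the acceptance rate); the target, step size, and burn-in length are illustrative:

```python
import numpy as np

def target_unnorm(x):
    """Unnormalized target: mixture of two Gaussians at -2 and +2."""
    return np.exp(-0.5 * (x + 2) ** 2) + np.exp(-0.5 * (x - 2) ** 2)

rng = np.random.default_rng(0)
x = 0.0
samples = []
for step in range(50_000):
    x_prop = x + rng.normal(scale=1.0)          # symmetric random-walk proposal q
    # Symmetric q cancels in the ratio, so alpha = min(1, p(x') / p(x)).
    alpha = min(1.0, target_unnorm(x_prop) / target_unnorm(x))
    if rng.uniform() < alpha:
        x = x_prop                              # accept; otherwise stay at x
    samples.append(x)

samples = np.array(samples[5_000:])             # discard burn-in
print("mean:", samples.mean().round(3), "(true 0 by symmetry)")
print("mass per mode:", (samples < 0).mean().round(2), (samples > 0).mean().round(2))
```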

Gibbs sampling

want to sample several variables $x_1, \dots, x_d$ from a joint distribution where each conditional $p(x_i \mid x_{-i})$ is easy to sample

fix all other variables $x_{-i}$ to their previous values when sampling $x_i \sim p(x_i \mid x_{-i})$; cycle through $i = 1, \dots, d$

a special case of the Metropolis-Hastings method (the acceptance rate is always 1); see the sketch below
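
A Gibbs-sampling sketch for a bivariate Gaussian with correlation $\rho$, whose conditionals $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\, 1-\rho^2)$ are easy to sample; the values are illustrative:

```python
import numpy as np

# Gibbs sampling from a bivariate Gaussian with correlation rho:
# the conditionals are x1 | x2 ~ N(rho * x2, 1 - rho^2) and symmetrically for x2 | x1.
rng = np.random.default_rng(0)
rho = 0.8
x1, x2 = 0.0, 0.0
samples = []
for step in range(20_000):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # sample x1 with x2 fixed
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # sample x2 with the new x1 fixed
    samples.append((x1, x2))

samples = np.array(samples[2_000:])                    # discard burn-in
print("empirical correlation:", np.corrcoef(samples.T)[0, 1].round(3), "(true", rho, ")")
```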

entropy

  • randomness, impurity, how hard it is to determine the outcome (low entropy = easy)
  • equal to the expected surprise: $H(p) = \mathbb{E}[-\log p(x)] = -\sum_i p_i \log p_i$
  • cross entropy loss: $H(p, q) = -\sum_i p_i \log q_i$

surprise $= -\log p(x)$: rarer events are more surprising

decision tree based on entropy

  • information gain = entropy of the parent node $-$ size-weighted entropy of the child nodes
  • maximize information gain on each split (see the sketch after this list)
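
A small sketch of entropy and information gain for a binary split; the function names and toy labels are illustrative:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H = -sum_i p_i log2 p_i of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]                       # a perfect split
print(information_gain(parent, left, right))               # 1.0 bit
mixed_left, mixed_right = parent[::2], parent[1::2]        # a useless split
print(information_gain(parent, mixed_left, mixed_right))   # 0.0 bits
```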

statistical learning theory

  • Bayes classifier $f^*$: minimizes the expected (true) risk
  • empirical risk minimization (ERM): minimize the loss on the training data
  • estimation error: because the sample size $n$ is finite; grows with the size/complexity of the function class $\mathcal{F}$
  • approximation error: because the function class $\mathcal{F}$ has limited complexity (it may not contain the Bayes classifier)

consistency wrt the function class $\mathcal{F}$ & the distribution $P$

empirical risk converges to the true risk as $n \to \infty$

  • universally consistent wrt $\mathcal{F}$: consistent for every distribution $P$
  • Bayes-consistent wrt $P$: the risk converges to the Bayes risk $R^*$
  • uniform convergence: $\sup_{f \in \mathcal{F}} |R_n(f) - R(f)| \to 0$ in probability
    • sufficiency: uniform convergence is sufficient for consistency of ERM

generalization bound

for a finite function class $\mathcal{F}$

  • proposition: w/ probability at least $1 - \delta$, for every $f \in \mathcal{F}$, $R(f) \le R_n(f) + \sqrt{\frac{\log \lvert \mathcal{F} \rvert + \log(1/\delta)}{2n}}$
    • by Hoeffding's inequality + a union bound over $\mathcal{F}$

for an infinite function class

  • proof: by symmetrization, bound $\sup_f (R(f) - R_n(f))$ via $\sup_f (R'_n(f) - R_n(f))$, where $R'_n(f)$ is the empirical risk on another sample (ghost sample)

    restricted to the sample & ghost sample, $\mathcal{F}$ has only finitely many behaviors, counted by the shattering coefficient $S(\mathcal{F}, 2n)$

  • problem: hard to compute the shattering coefficient

shattering coefficient

$S(\mathcal{F}, n)$: the maximum number of distinct functions (labelings of $n$ points) we can get by restricting $\mathcal{F}$ to $n$ sample points

VC dimension (Vapnik-Chervonenkis dimension)

the maximum $n$ s.t. some set of $n$ points can be shattered, i.e. every one of the $2^n$ labelings is classified completely correctly by some $f \in \mathcal{F}$

  • for a function class w/ VC dimension $d$
    • for $n \ge d$, $S(\mathcal{F}, n) \le \left(\frac{en}{d}\right)^d$ (Sauer's lemma)
  • ERM is consistent $\iff$ the VC dimension is finite

Rademacher complexity

richness of a function class $\mathcal{F}$ measured on a sample set $\{x_1, \dots, x_n\}$

sample random labels $\sigma_i \in \{-1, +1\}$ uniformly (Rademacher variables); empirical Rademacher complexity: $\hat{\mathcal{R}}_n(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_i \sigma_i f(x_i)\right]$

solution: for each possible assignment of $\sigma \in \{-1, +1\}^n$, find the function maximizing the inner sum $\sum_i \sigma_i f(x_i)$, then take the probability-weighted average over assignments (see the brute-force sketch below)

  • generalization bound: with probability $\ge 1 - \delta$, for all $f \in \mathcal{F}$, $R(f) \le R_n(f) + 2\,\mathcal{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}$ (for a loss in $[0, 1]$)
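
A brute-force sketch of the empirical Rademacher complexity for a tiny class of 1-D threshold classifiers, enumerating all $2^n$ sign assignments as described above; the sample points and threshold grid are arbitrary:

```python
import numpy as np
from itertools import product

# Empirical Rademacher complexity of a small class of 1-D threshold classifiers
#   f_t(x) = +1 if x >= t else -1,  t in a finite grid,
# computed by brute force: enumerate every sigma in {-1,+1}^n, take the sup over f,
# then average (each sigma assignment has equal probability 2^{-n}).
x = np.array([0.1, 0.25, 0.4, 0.6, 0.75, 0.9])                # n = 6 sample points
thresholds = np.linspace(0.0, 1.0, 11)
F = np.array([np.where(x >= t, 1, -1) for t in thresholds])   # |F| x n matrix of outputs

n = len(x)
total = 0.0
for sigma in product([-1, 1], repeat=n):
    sigma = np.array(sigma)
    total += (F @ sigma).max() / n                    # sup over the class for this sigma
rademacher = total / 2 ** n
print("empirical Rademacher complexity:", round(rademacher, 3))
```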

structural risk minimization (SRM)

to balance training error and model complexity

  • e.g. regularization; a linear SVM in $d$ dimensions has VC dimension $d + 1$, while an RBF-kernel SVM has infinite VC dimension