Statistical Machine Learning

SGD > GD for online learning

mini-batch SGD trade-off: too large a mini-batch loses the speed and noise benefits and approaches full-batch GD; too small gives noisy gradient estimates

gated recurrent unit (GRU): an RNN variant with gating

principal component analysis (PCA): a linear autoencoder; used for dimensionality reduction

Bayesian decision theory

loss $\lambda(\alpha_i \mid C_j)$: the loss of taking action $\alpha_i$ when the sample belongs to class $C_j$

goal: minimize the conditional risk $R(\alpha_i \mid x) = \sum_j \lambda(\alpha_i \mid C_j)\, P(C_j \mid x)$

for 0-1 loss, the minimum-risk action is the class with maximum posterior $P(C_i \mid x)$; its risk is $1 - \max_i P(C_i \mid x)$

reject class: an additional $(K{+}1)$-th action with fixed loss $\lambda_r$

  • reject when $1 - \max_i P(C_i \mid x) > \lambda_r$ (even the best class is too risky)

or, equivalently, maximize a discriminant function $g_i(x)$, e.g. $g_i(x) = P(C_i \mid x)$ or $\log p(x \mid C_i) + \log P(C_i)$
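
A minimal numpy sketch of this decision rule; the function name and the reject_cost value are illustrative, and the posteriors are assumed to be given:

```python
import numpy as np

def bayes_decision(posteriors, reject_cost=0.3):
    """Minimum-risk decision under 0-1 loss with a reject option.

    posteriors : array of P(C_i | x) for one sample (sums to 1)
    reject_cost: fixed loss lambda_r for choosing the reject action
    Returns the chosen class index, or -1 for "reject".
    """
    posteriors = np.asarray(posteriors, dtype=float)
    best = int(np.argmax(posteriors))
    # Risk of classifying as `best` under 0-1 loss is 1 - max posterior;
    # reject when that risk exceeds the fixed rejection loss.
    if 1.0 - posteriors[best] > reject_cost:
        return -1
    return best

print(bayes_decision([0.80, 0.15, 0.05]))   # confident enough -> class 0
print(bayes_decision([0.40, 0.35, 0.25]))   # too uncertain    -> -1 (reject)
```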

maximum likelihood estimator (MLE)

parametric (assume a distribution family with a fixed set of parameters to estimate) vs non-parametric (no fixed parametric form, e.g. KNN)
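
A small parametric-MLE example, assuming a Gaussian model whose MLE has the well-known closed form (sample mean and biased sample variance); the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic sample

# Gaussian MLE has a closed form: the sample mean and the (biased) sample variance
# maximize the log-likelihood  sum_i log N(x_i | mu, sigma^2).
mu_hat = data.mean()
sigma2_hat = ((data - mu_hat) ** 2).mean()   # note: divides by n, not n-1

print(f"mu_hat     = {mu_hat:.3f}  (true 2.0)")
print(f"sigma2_hat = {sigma2_hat:.3f} (true {1.5 ** 2})")
```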

linear regression

regularization in lasso: L1 penalty $\lambda \lVert w \rVert_1$ (vs. ridge's L2 penalty $\lambda \lVert w \rVert_2^2$); encourages sparse weights

closed-form solution $w = (X^\top X)^{-1} X^\top y$, assuming $X^\top X$ is invertible; with an L2 (ridge) penalty, $X^\top X + \lambda I$ is always invertible (proof: it is positive definite)
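
A sketch of the closed-form solution with an L2 penalty (ridge shown here because lasso's L1 penalty has no closed form); the function name and the lam default are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam=1e-3):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y.

    X^T X + lam*I is positive definite for lam > 0, so the inverse exists.
    (Lasso's L1 penalty has no closed form; it needs an iterative solver.)
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)     # solve() is more stable than an explicit inverse

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)
print(ridge_fit(X, y).round(2))            # close to [ 1. , -2. ,  0.5]
```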

K-nearest neighbors (KNN)

need to try different values of $k$ (e.g. pick by cross-validation); small $k$ overfits, large $k$ oversmooths
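
A minimal KNN classifier sketch (Euclidean distance, majority vote); the toy data and the function name are made up for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    return np.bincount(votes).argmax()

# toy data: two clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
for k in (1, 3, 5):                        # try different k, e.g. via cross-validation
    print(k, knn_predict(X_train, y_train, np.array([0.8, 0.8]), k=k))
```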

support vector machine (SVM)

hard-margin binary SVM

objective, maximize minimum margin: $\max_{w,b} \min_i \frac{y_i (w^\top x_i + b)}{\lVert w \rVert}$, equivalent to $\min_{w,b} \frac{1}{2}\lVert w \rVert^2$ s.t. $y_i (w^\top x_i + b) \ge 1$ for all $i$

by rescaling $(w, b)$ so that $\min_i y_i (w^\top x_i + b) = 1$

apply Lagrange multiplier: $L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2 - \sum_i \alpha_i \bigl[ y_i (w^\top x_i + b) - 1 \bigr]$ with $\alpha_i \ge 0$; setting $\partial L / \partial w = 0$ gives $w = \sum_i \alpha_i y_i x_i$, and $\partial L / \partial b = 0$ gives $\sum_i \alpha_i y_i = 0$

final objective (dual): $\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$ s.t. $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$

solution: sequential minimal optimization (SMO)

  • fix all but 2 multipliers $\alpha_i, \alpha_j$, solve for that pair analytically, and iterate
  • 2 variables (not 1) because the constraint $\sum_i \alpha_i y_i = 0$ means changing one $\alpha$ requires adjusting another

soft-margin binary SVM

hinge loss $\max(0,\, 1 - y_i (w^\top x_i + b))$ replaces the hard constraint; objective: $\min_{w,b} \frac{1}{2}\lVert w \rVert^2 + C \sum_i \max(0,\, 1 - y_i (w^\top x_i + b))$
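
A minimal sketch of this soft-margin primal objective trained by subgradient descent (not SMO); the hyperparameters C, lr, and epochs are arbitrary illustrative values:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM trained by subgradient descent on the primal:
        (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)),  y_i in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                       # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])
w, b = train_linear_svm(X, y)
print("train accuracy:", (np.sign(X @ w + b) == y).mean())
```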

kernel SVM

solve a non-linear problem with a linear classifier by mapping inputs into a feature space $\phi(x)$

kernel $k(x, x') = \phi(x)^\top \phi(x')$, computed without forming $\phi$ explicitly, so there is no restriction on the feature-space dimension (it may be infinite)

objective: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$ s.t. $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$

positive-definite kernel

a positive-definite kernel outputs a positive-definite Gram matrix for any set of inputs

  • positive-definite matrix: pivots > 0, or all eigenvalues > 0, or all leading subdeterminants (principal minors) > 0
  • Hilbert space: a complete vector space with an inner product that is symmetric, positive-definite, and linear

alternative definition of kernel:

  • symmetric
  • positive-definite: the Gram matrix is positive semi-definite for any finite set of inputs (numerical check in the sketch below)
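
A quick numerical check of this definition, assuming an RBF kernel: its Gram matrix should be symmetric with non-negative eigenvalues (up to floating-point error):

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the RBF kernel."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = rbf_kernel(X)
eigvals = np.linalg.eigvalsh(K)            # symmetric matrix => real eigenvalues
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", eigvals.min())    # >= 0 up to floating-point error
```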

dimensionality reduction by principal component analysis (PCA)

lossy transformation from $d$ dimensions to $k < d$ dimensions

  • mean: $\bar{x} = \frac{1}{n}\sum_i x_i$
  • covariance: $S = \frac{1}{n} X_c^\top X_c$, where $X_c$ is the centered data
    • centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, so $X_c = H X$
    • proof: $H$ is symmetric and idempotent ($H^\top = H$, $HH = H$), so $X_c^\top X_c = X^\top H X$

maximize the projected variance $u^\top S u$ subject to $\lVert u \rVert = 1$; the solution is the top eigenvector(s) of $S$

equivalently, minimize the distance between the full representation ($d$-dimensional) and the PCA reconstruction ($k$-dimensional code):

  • centering: $\tilde{x}_i = x_i - \bar{x}$
  • lossless transformation with an orthonormal basis $u_1, \dots, u_d$: $x_i = \bar{x} + \sum_{j=1}^{d} (u_j^\top \tilde{x}_i)\, u_j$
  • PCA transformation keeps only the top $k$ terms: $\hat{x}_i = \bar{x} + \sum_{j=1}^{k} (u_j^\top \tilde{x}_i)\, u_j$

objective: $\min_{u_1,\dots,u_k} \sum_i \lVert x_i - \hat{x}_i \rVert^2$, minimized by taking $u_1, \dots, u_k$ as the top-$k$ eigenvectors of the covariance $S$
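
A minimal PCA sketch via eigendecomposition of the covariance matrix; the synthetic data and the function name are illustrative:

```python
import numpy as np

def pca(X, k):
    """Project X (n x d) onto the top-k eigenvectors of its covariance matrix."""
    X_centered = X - X.mean(axis=0)                      # centering
    cov = X_centered.T @ X_centered / len(X)             # d x d covariance
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending eigenvalues
    U_k = eigvecs[:, ::-1][:, :k]                        # top-k principal directions
    Z = X_centered @ U_k                                 # k-dimensional codes
    X_hat = Z @ U_k.T + X.mean(axis=0)                   # lossy reconstruction
    return Z, X_hat

rng = np.random.default_rng(0)
# 3-D data that mostly lies in a 2-D plane
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(500, 3))
Z, X_hat = pca(X, k=2)
print("reconstruction error:", np.mean((X - X_hat) ** 2))   # small
```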

Monte Carlo sampling method

  • sampling
    • purpose: estimate quantities defined by sums/integrals (e.g. expectations) from samples; see the sketch after this list
    • good samples: drawn from regions of high probability & independent of each other
  • Monte Carlo drawbacks
    • basic samplers do not work well in high dimension
    • assume sample independence
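
A minimal Monte Carlo sketch: estimate $\mathbb{E}[x^2]$ for $x \sim \mathcal{N}(0, 1)$ (true value 1) by a sample average:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(x)] for x ~ N(0, 1) with f(x) = x^2 by averaging i.i.d. samples.
n = 100_000
samples = rng.normal(size=n)
estimate = np.mean(samples ** 2)
print(f"Monte Carlo estimate: {estimate:.4f} (true 1.0)")
```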

transformation method

  1. assume we can draw $u \sim \mathrm{Uniform}(0, 1)$
  2. get a sample $u$
  3. assume the target CDF $F$ is known and invertible
  4. solve $F(x) = u$ for $x$, i.e. $x = F^{-1}(u)$
  5. get another sample and repeat (see the sketch after this list)
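
A sketch of the transformation method for an exponential distribution, whose CDF $F(x) = 1 - e^{-\lambda x}$ is easy to invert:

```python
import numpy as np

# Transformation (inverse-CDF) method for Exponential(rate):
#   CDF F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -log(1 - u) / rate.
rng = np.random.default_rng(0)
rate = 2.0
u = rng.uniform(size=100_000)              # step 2: uniform samples
x = -np.log(1.0 - u) / rate                # step 4: solve F(x) = u for x

print("sample mean:", x.mean(), "(true 1/rate =", 1 / rate, ")")
```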

rejection sampling

  1. define a proposal distribution $q$ and constant $M$ s.t. $M q(x) \ge p(x)$ for all $x$
  2. get a sample $x \sim q$
  3. rate $r = \frac{p(x)}{M q(x)}$
  4. select a random variable $u \sim \mathrm{Uniform}(0, 1)$
  5. if $u \le r$, accept sample $x$; else reject (see the sketch after this list)
  • rejects most samples when $M$ is large
  • hard to determine $M$
  • wastes iterations
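
A rejection-sampling sketch with target $p(x) = 6x(1-x)$ (a Beta(2, 2)) and a Uniform(0, 1) proposal, so $M = 1.5$ works as the envelope constant:

```python
import numpy as np

rng = np.random.default_rng(0)

def target_pdf(x):
    """Target density p(x) = 6 x (1 - x) on [0, 1] (a Beta(2, 2))."""
    return 6.0 * x * (1.0 - x)

# Proposal q = Uniform(0, 1) with envelope constant M such that M * q(x) >= p(x).
M = 1.5
accepted = []
for _ in range(20_000):
    x = rng.uniform()                       # sample from the proposal q
    r = target_pdf(x) / (M * 1.0)           # acceptance ratio p(x) / (M q(x))
    if rng.uniform() <= r:                  # accept with probability r
        accepted.append(x)

accepted = np.array(accepted)
print("acceptance rate:", len(accepted) / 20_000)   # about 1/M = 2/3
print("sample mean:", accepted.mean(), "(true 0.5)")
```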

importance sampling

for a value $x$ following a distribution with PDF $p$ whose CDF we do not know (cannot sample directly), want the expectation $\mathbb{E}_p[f(x)]$

define a proposal distribution $q$ with known CDF (easy to sample from)

weight (importance): $w(x) = \frac{p(x)}{q(x)}$, so $\mathbb{E}_p[f(x)] = \mathbb{E}_q[f(x)\, w(x)] \approx \frac{1}{n}\sum_i f(x_i)\, w(x_i)$ for $x_i \sim q$
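
An importance-sampling sketch; the target $\mathcal{N}(3, 1)$ and proposal $\mathcal{N}(0, 2)$ are arbitrary choices for illustration (the self-normalized variant at the end also works when $p$ is only known up to a constant):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Want E_p[f(x)] with p = N(3, 1), pretending we can evaluate p's PDF but not sample it.
# Sample from a proposal q = N(0, 2) instead and reweight by w(x) = p(x) / q(x).
n = 200_000
x = rng.normal(loc=0.0, scale=2.0, size=n)                    # samples from q
w = gaussian_pdf(x, 3.0, 1.0) / gaussian_pdf(x, 0.0, 2.0)     # importance weights

f = x                                                         # f(x) = x, so E_p[f] = 3
print("plain IS estimate:", np.sum(w * f) / n)
print("self-normalized  :", np.average(f, weights=w))         # divides by the weight sum
```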

sampling-importance-resampling

  1. get samples $x_1, \dots, x_n$ from a proposal $q$ with known CDF
  2. normalize the importance weights: $\tilde{w}_i = \frac{w_i}{\sum_j w_j}$ with $w_i = \frac{p(x_i)}{q(x_i)}$
  3. treat $\tilde{w}_i$ as the probability of $x_i$ and resample from $\{x_i\}$ with those probabilities

simulated annealing

  • to avoid getting trapped in a local minimum
  • still need to try multiple times (random restarts)

when seeking a minimum, accept an increase $\Delta E$ in the objective with probability $e^{-\Delta E / T}$

where the temperature $T$ decreases over time (cooling schedule); see the sketch below
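
A simulated-annealing sketch on a 1-D multimodal objective; the proposal scale, cooling rate, and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def f(x):
    """Objective with many local minima."""
    return np.sin(5 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
x = 4.0                                    # deliberately start near a poor local minimum
best_x = x
T = 1.0                                    # initial temperature
for step in range(5_000):
    x_new = x + rng.normal(scale=0.5)      # random local move
    delta = f(x_new) - f(x)
    # Always accept improvements; accept an increase with probability exp(-delta / T).
    if delta < 0 or rng.uniform() < np.exp(-delta / T):
        x = x_new
    if f(x) < f(best_x):
        best_x = x
    T *= 0.999                             # cooling schedule

print("found x =", round(best_x, 3), "f(x) =", round(f(best_x), 3))
```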

Markov chain

  • transition matrix $P$, $P_{ij} = \Pr(X_{t+1} = j \mid X_t = i)$
    • stochastic matrix, because each row sums to 1
      • converges to a stationary distribution $\pi$ with $\pi = \pi P$ (for an irreducible, aperiodic chain)
        • all eigenvalues satisfy $|\lambda| \le 1$
        • $P^t = Q \Lambda^t Q^{-1}$, where $\Lambda$ is diagonal with entries $\lambda_i$, so the powers of $P$ converge
  • $\pi_t(j)$: probability of being at state $j$ at time $t$; $\pi_{t+1} = \pi_t P$
  • sampling method: given a PDF $p$, set the transition matrix s.t. the stationary distribution is $p$

detailed balance

a Markov chain has $\pi$ as a stationary distribution if $\pi(i)\, P_{ij} = \pi(j)\, P_{ji}$ for all $i, j$

Markov chain Monte Carlo (MCMC)

  • works in high dimension
  • samples are deliberately dependent (they form a Markov chain), dropping the independence assumption of plain Monte Carlo

Metropolis-Hastings algorithm (MH algorithm)

want to sample from a target distribution $p$ (possibly known only up to a normalizing constant)

design a Markov chain w/ stationary distribution $p$:

  1. get a Markov chain w/ proposal $q(x' \mid x)$ s.t. its stationary distribution is not necessarily $p$
  2. acceptance rate $\alpha(x, x') = \min\!\left(1,\ \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$
  3. use the new Markov chain w/ transition $q(x' \mid x)\, \alpha(x, x')$ for $x' \ne x$
    • the remaining (rejected) probability mass goes to staying at $x$, so each row still sums to 1
  • drawback: do not know when the chain is stationary/converged (burn-in needed); samples may be dependent (see the sketch after this list)
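
A Metropolis-Hastings sketch with a symmetric random-walk proposal (so the $q$ terms cancel in the acceptance rate); the target, step size, and burn-in length are illustrative:

```python
import numpy as np

def target_unnorm(x):
    """Unnormalized target: mixture of two Gaussians at -2 and +2."""
    return np.exp(-0.5 * (x + 2) ** 2) + np.exp(-0.5 * (x - 2) ** 2)

rng = np.random.default_rng(0)
x = 0.0
samples = []
for step in range(50_000):
    x_prop = x + rng.normal(scale=1.0)          # symmetric random-walk proposal q
    # Symmetric q cancels in the ratio, so alpha = min(1, p(x') / p(x)).
    alpha = min(1.0, target_unnorm(x_prop) / target_unnorm(x))
    if rng.uniform() < alpha:
        x = x_prop                              # accept; otherwise stay at x
    samples.append(x)

samples = np.array(samples[5_000:])             # discard burn-in
print("mean:", samples.mean().round(3), "(true 0 by symmetry)")
print("mass per mode:", (samples < 0).mean().round(2), (samples > 0).mean().round(2))
```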

Gibbs sampling

want to sample several variables $x_1, \dots, x_d$ from a joint distribution where each conditional $p(x_i \mid x_{-i})$ is easy to sample

fix all other variables $x_{-i}$ to their previous values when sampling $x_i \sim p(x_i \mid x_{-i})$; cycle through $i = 1, \dots, d$

a special case of the Metropolis-Hastings method (the acceptance rate is always 1); see the sketch below
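
A Gibbs-sampling sketch for a bivariate Gaussian with correlation $\rho$, whose conditionals $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\, 1-\rho^2)$ are easy to sample; the values are illustrative:

```python
import numpy as np

# Gibbs sampling from a bivariate Gaussian with correlation rho:
# the conditionals are x1 | x2 ~ N(rho * x2, 1 - rho^2) and symmetrically for x2 | x1.
rng = np.random.default_rng(0)
rho = 0.8
x1, x2 = 0.0, 0.0
samples = []
for step in range(20_000):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # sample x1 with x2 fixed
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # sample x2 with the new x1 fixed
    samples.append((x1, x2))

samples = np.array(samples[2_000:])                    # discard burn-in
print("empirical correlation:", np.corrcoef(samples.T)[0, 1].round(3), "(true", rho, ")")
```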

entropy

  • randomness, impurity, how hard it is to determine the outcome (low entropy = easy)
  • equal to the expected surprise: $H(p) = \mathbb{E}[-\log p(x)] = -\sum_i p_i \log p_i$
  • cross entropy loss: $H(p, q) = -\sum_i p_i \log q_i$

surprise $= -\log p(x)$: rarer events are more surprising

decision tree based on entropy

  • information gain = entropy of the parent node $-$ size-weighted entropy of the child nodes
  • maximize information gain on each split (see the sketch after this list)
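
A small sketch of entropy and information gain for a binary split; the function names and toy labels are illustrative:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H = -sum_i p_i log2 p_i of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]                       # a perfect split
print(information_gain(parent, left, right))               # 1.0 bit
mixed_left, mixed_right = parent[::2], parent[1::2]        # a useless split
print(information_gain(parent, mixed_left, mixed_right))   # 0.0 bits
```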

statistical learning theory

  • Bayes classifier $f^*$: minimizes the expected (true) risk
  • empirical risk minimization (ERM): minimize the loss on the training data
  • estimation error: because the sample size $n$ is finite; grows with the size/complexity of the function class $\mathcal{F}$
  • approximation error: because the function class $\mathcal{F}$ has limited complexity (it may not contain the Bayes classifier)

consistency wrt the function class $\mathcal{F}$ & the distribution $P$

empirical risk converges to the true risk as $n \to \infty$

  • universally consistent wrt $\mathcal{F}$: consistent for every distribution $P$
  • Bayes-consistent wrt $P$: the risk converges to the Bayes risk $R^*$
  • uniform convergence: $\sup_{f \in \mathcal{F}} |R_n(f) - R(f)| \to 0$ in probability
    • sufficiency: uniform convergence is sufficient for consistency of ERM

generalization bound

for a finite function class $\mathcal{F}$

  • proposition: w/ probability at least $1 - \delta$, for every $f \in \mathcal{F}$, $R(f) \le R_n(f) + \sqrt{\frac{\log \lvert \mathcal{F} \rvert + \log(1/\delta)}{2n}}$
    • by Hoeffding's inequality + a union bound over $\mathcal{F}$

for an infinite function class

  • proof: by symmetrization, bound $\sup_f (R(f) - R_n(f))$ via $\sup_f (R'_n(f) - R_n(f))$, where $R'_n(f)$ is the empirical risk on another sample (ghost sample)

    restricted to the sample & ghost sample, $\mathcal{F}$ has only finitely many behaviors, counted by the shattering coefficient $S(\mathcal{F}, 2n)$

  • problem: hard to compute the shattering coefficient

shattering coefficient

$S(\mathcal{F}, n)$: the maximum number of distinct functions (labelings of $n$ points) we can get by restricting $\mathcal{F}$ to $n$ sample points

VC dimension (Vapnik-Chervonenkis dimension)

the maximum $n$ s.t. some set of $n$ points can be shattered, i.e. every one of the $2^n$ labelings is classified completely correctly by some $f \in \mathcal{F}$

  • for a function class w/ VC dimension $d$
    • for $n \ge d$, $S(\mathcal{F}, n) \le \left(\frac{en}{d}\right)^d$ (Sauer's lemma)
  • ERM is consistent $\iff$ the VC dimension is finite

Rademacher complexity

richness of a function class $\mathcal{F}$ measured on a sample set $\{x_1, \dots, x_n\}$

sample random labels $\sigma_i \in \{-1, +1\}$ uniformly (Rademacher variables); empirical Rademacher complexity: $\hat{\mathcal{R}}_n(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_i \sigma_i f(x_i)\right]$

solution: for each possible assignment of $\sigma \in \{-1, +1\}^n$, find the function maximizing the inner sum $\sum_i \sigma_i f(x_i)$, then take the probability-weighted average over assignments (see the brute-force sketch below)

  • generalization bound: with probability $\ge 1 - \delta$, for all $f \in \mathcal{F}$, $R(f) \le R_n(f) + 2\,\mathcal{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}$ (for a loss in $[0, 1]$)
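
A brute-force sketch of the empirical Rademacher complexity for a tiny class of 1-D threshold classifiers, enumerating all $2^n$ sign assignments as described above; the sample points and threshold grid are arbitrary:

```python
import numpy as np
from itertools import product

# Empirical Rademacher complexity of a small class of 1-D threshold classifiers
#   f_t(x) = +1 if x >= t else -1,  t in a finite grid,
# computed by brute force: enumerate every sigma in {-1,+1}^n, take the sup over f,
# then average (each sigma assignment has equal probability 2^{-n}).
x = np.array([0.1, 0.25, 0.4, 0.6, 0.75, 0.9])                # n = 6 sample points
thresholds = np.linspace(0.0, 1.0, 11)
F = np.array([np.where(x >= t, 1, -1) for t in thresholds])   # |F| x n matrix of outputs

n = len(x)
total = 0.0
for sigma in product([-1, 1], repeat=n):
    sigma = np.array(sigma)
    total += (F @ sigma).max() / n                    # sup over the class for this sigma
rademacher = total / 2 ** n
print("empirical Rademacher complexity:", round(rademacher, 3))
```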

structural risk minimization (SRM)

to balance training error and model complexity

  • e.g. regularization; a linear SVM in $d$ dimensions has VC dimension $d + 1$, while an RBF-kernel SVM has infinite VC dimension