SGD > GD for online learning
GD problem: full batch too big to process at once; SGD uses mini-batches
gated recurrent unit (GRU): a type of RNN
principal component analysis (PCA): autoencoder, dimensionality reduction
loss $\lambda_{ik}$: loss of taking action $\alpha_i$ when the sample belongs to $C_k$
goal: minimize risk
$R(\alpha_i \mid X) = \sum_{k=1}^{K} \lambda_{ik} P(C_k \mid X)$
for 0-1 loss, $R(\alpha_i \mid X) = 1 - P(C_i \mid X)$
reject class: a $(K+1)$-th class w/ fixed loss $\lambda \in (0, 1)$
$\Rightarrow R(\alpha_{K+1} \mid X) \equiv \lambda$; reject when $\min_{i=1\dots K} R(\alpha_i \mid X) > \lambda$
or, equivalently, maximize a discriminant function $g_i(x)$, $i = 1 \dots K$
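The risk computation and reject rule can be sketched numerically; the loss matrix and posterior vectors below are made-up illustrative values:

```python
import numpy as np

# Hypothetical loss matrix lambda[i, k]: cost of action alpha_i when the
# true class is C_k (here 0-1 loss for 3 classes), plus a reject cost lam.
loss = np.ones((3, 3)) - np.eye(3)
lam = 0.3  # fixed rejection loss in (0, 1)

def decide(posterior):
    """Return the risk-minimizing action: a class index, or -1 for reject."""
    risks = loss @ posterior  # R(alpha_i | X) = sum_k lambda_ik P(C_k | X)
    i = int(np.argmin(risks))
    return -1 if risks[i] > lam else i

print(decide(np.array([0.9, 0.05, 0.05])))  # confident -> class 0
print(decide(np.array([0.4, 0.35, 0.25])))  # min risk 0.6 > 0.3 -> reject (-1)
```

With 0-1 loss the risk reduces to $1 - P(C_i \mid X)$, so rejecting is exactly "no posterior exceeds $1 - \lambda$".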
maximum likelihood estimator (MLE)
parametric (distribution based on known parameters) vs non-parametric
$X \in \mathbb{R}^{N \times P}$, $Y \in \mathbb{R}^{N \times 1}$
$L(W) = \|XW - Y\|^2 = W^T X^T X W - 2 W^T X^T Y + Y^T Y$
$\arg\min_W L(W) \Rightarrow \frac{\partial L(W)}{\partial W} = 0 \Rightarrow 2 X^T X W - 2 X^T Y = 0 \Rightarrow \hat{W} = (X^T X)^{-1} X^T Y$
$\arg\min_W [L(W) + \lambda P(W)] \Rightarrow \hat{W} = (X^T X + \lambda I)^{-1} X^T Y$
assuming $P(W) = \|W\|^2$
$X^T X + \lambda I$ is always invertible; proof: it is positive definite
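A minimal sketch of the closed-form ridge solution above, on synthetic data (problem sizes, noise level, and $\lambda$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.01 * rng.normal(size=N)

lam = 0.1
# Ridge closed form: W_hat = (X^T X + lam I)^{-1} X^T Y
# (solve the linear system rather than forming the inverse explicitly)
W_hat = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)
print(W_hat)  # close to w_true, shrunk slightly toward 0 by lam
```

Adding $\lambda I$ makes the system well-posed even when $X^T X$ is singular (e.g. $P > N$ or collinear columns).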
need to try different K
labels $y_i \in \{-1, +1\}$
objective: maximize the minimum margin:
$\arg\max_{W,b} \min_{i=1\dots N} \frac{|W^T X_i + b|}{\|W\|}$ s.t. $y_i(W^T X_i + b) > 0$
$\Rightarrow \arg\max_{W,b} \frac{1}{\|W\|}$ s.t. $y_i(W^T X_i + b) \ge 1$
$\Rightarrow \arg\min_{W,b} \frac{1}{2}\|W\|^2$ s.t. $y_i(W^T X_i + b) \ge 1$
by $|W^T X_i + b| = y_i(W^T X_i + b)$ and rescaling so that $\min_{i=1\dots N} |W^T X_i + b| := 1$
apply Lagrange multipliers:
$L(W, b, \lambda) = \frac{1}{2}\|W\|^2 + \sum_{i=1}^N \lambda_i (1 - y_i(W^T X_i + b))$
$\Rightarrow \min_{W,b} \max_{\lambda_i \ge 0} L(W, b, \lambda) \Rightarrow \max_{\lambda_i \ge 0} \min_{W,b} L(W, b, \lambda)$
$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial b} = 0 \Rightarrow \hat{W} = \sum_{i=1}^N \lambda_i y_i X_i$, $\sum_{i=1}^N \lambda_i y_i = 0$
$\Rightarrow \hat{L} = -\frac{1}{2}\|\hat{W}\|^2 + \sum_{i=1}^N \lambda_i = -\frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j X_i^T X_j + \sum_{i=1}^N \lambda_i$
final objective:
$\arg\min_{\lambda_i \ge 0} (-\hat{L}) = \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j X_i^T X_j - \sum_{i=1}^N \lambda_i$ s.t. $\sum_{i=1}^N \lambda_i y_i = 0$
solution: sequential minimal optimization (SMO)
fix all but 2 of the $\lambda_i$, optimize that pair jointly, and iterate; a pair (not a single variable) is needed because $\sum_{i=1}^N \lambda_i y_i = 0$ couples them
hinge loss
solve non-linear problem w/ linear classifier
kernel trick, without explicitly computing the feature map $\varphi$
objective:
$\arg\min_{\lambda_i \ge 0} (-\hat{L}) = \frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \lambda_i \lambda_j y_i y_j K(X_i, X_j) - \sum_{i=1}^N \lambda_i$
a positive-definite kernel outputs a positive-definite Gram matrix
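The claim can be checked numerically; the RBF kernel and the $\gamma$ value below are a standard illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))  # small synthetic sample

def rbf(a, b, gamma=0.5):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Gram matrix G_ij = K(X_i, X_j); a valid kernel gives a symmetric PSD G
G = np.array([[rbf(xi, xj) for xj in X] for xi in X])
eigvals = np.linalg.eigvalsh(G)
print(np.allclose(G, G.T), eigvals.min() >= -1e-10)  # True True
```

Checking symmetry plus non-negative eigenvalues is exactly the Gram-matrix condition in the alternative kernel definition below.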
positive-definite matrix: pivots $> 0$, or eigenvalues $\lambda_i > 0$, or all leading principal minors $> 0$
Hilbert space: complete inner-product space (inner product is symmetric, positive-definite, linear)
alternative definition of kernel:
symmetric positive-definite: the Gram matrix $G$ is positive semi-definite, $G := \begin{bmatrix} K(X_1, X_1) & \cdots & K(X_1, X_N) \\ \vdots & \ddots & \vdots \\ K(X_N, X_1) & \cdots & K(X_N, X_N) \end{bmatrix}$
lossy transformation from $p$ to $q$ dimensions
mean: $\frac{1}{N} X^T \mathbf{1}$; covariance: $\frac{1}{N} X^T H X$; centering matrix $H := I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$
proof: $S = \frac{1}{N}\sum_{i=1}^N \|X_i - \bar{X}\|^2 = \frac{1}{N}\left\|[X_1 - \bar{X} \ \cdots \ X_N - \bar{X}]\right\|^2$
maximize variance
$\arg\max_u u^T S u$ s.t. $u^T u = 1 \Rightarrow \arg\max_u L(u, \lambda) = u^T S u + \lambda(1 - u^T u) \Rightarrow S u = \lambda u$
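The eigenproblem $Su = \lambda u$ suggests a direct PCA sketch via the centering matrix and an eigendecomposition (synthetic data; sizes and scales are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, q = 200, 3, 2
X = rng.normal(size=(N, p)) @ np.diag([3.0, 1.0, 0.1])  # anisotropic cloud

# Centering matrix H = I - (1/N) 11^T, covariance S = (1/N) X^T H X
H = np.eye(N) - np.ones((N, N)) / N
S = X.T @ H @ X / N

# Principal directions: eigenvectors of S with the largest eigenvalues
eigvals, eigvecs = np.linalg.eigh(S)  # ascending eigenvalue order
U = eigvecs[:, ::-1][:, :q]           # top-q directions
Z = (X - X.mean(axis=0)) @ U          # q-dimensional projection
print(eigvals[::-1])                  # variances along the components
```

In practice $S$ is computed as the covariance of the centered data directly; the explicit $N \times N$ matrix $H$ is shown only to mirror the formula above.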
minimize the distance between the full projection ($p$-dimensional) and the principal component analysis (PCA, $q$-dimensional) reconstruction:
centering: $\tilde{X}_i = X_i - \bar{X}$
lossless transformation with orthonormal basis $u_k$: $X_i' = \sum_{k=1}^p (\tilde{X}_i^T u_k) u_k$
PCA transformation: $\hat{X}_i = \sum_{k=1}^q (\tilde{X}_i^T u_k) u_k$
objective:
$\arg\min_{u_k} \frac{1}{N}\sum_{i=1}^N \|\hat{X}_i - X_i'\|^2 = \frac{1}{N}\sum_{i=1}^N \left\|\sum_{k=q+1}^p (\tilde{X}_i^T u_k) u_k\right\|^2 = \frac{1}{N}\sum_{i=1}^N \sum_{k=q+1}^p (\tilde{X}_i^T u_k)^2$
$\Rightarrow \arg\min_{u_k} \sum_{k=q+1}^p u_k^T S u_k$ s.t. $u_k^T u_k = 1 \Rightarrow S u_k = \lambda u_k$
sampling purpose: estimate parameters when doing sums/integrals
good sampling: draws from areas of high probability & independent
Monte Carlo drawbacks: does not work well in high dimension; assumes sample independence
inverse-CDF sampling: to draw $z \sim p(z)$ with CDF $h$: draw $x \sim U(0,1)$, solve $z = h^{-1}(x)$; repeat to get more samples
rejection sampling: define a proposal distribution $q(z)$ s.t. $\exists k, \forall z: k q(z) \ge p(z)$; get a sample $z_i \sim q(z)$; acceptance rate $\alpha := \frac{p(z_i)}{k q(z_i)} \in (0, 1]$; draw $x \sim U(0, 1)$; if $\alpha \ge x$, accept $z_i$, else reject
drawbacks: rejects most samples when $k$ is large; $k$ is hard to determine; wastes iterations
importance sampling: for a value $f$ of a variable following a distribution with PDF $p$ (CDF unknown), want the expectation
define distribution q ( z ) with known CDF
weight (importance) $\omega_i := \frac{p(z_i)}{q(z_i)}$
$E(f) := \int f(z)\, p(z)\, dz = \int f(z) \frac{p(z)}{q(z)} q(z)\, dz \approx \frac{1}{L}\sum_{i=1}^L \omega_i f(z_i)$
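A sketch of the importance-sampling estimate above, with a standard normal target and a wider normal proposal (both chosen for illustration; $E_p[z^2] = 1$):

```python
import numpy as np

rng = np.random.default_rng(3)
L = 100_000

# Target p = N(0, 1): pretend we can only evaluate its PDF, not sample it.
# Proposal q = N(0, 2^2): easy to sample, covers the target's support.
def p_pdf(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def q_pdf(z):
    return np.exp(-z**2 / 8) / np.sqrt(8 * np.pi)

z = rng.normal(0.0, 2.0, size=L)  # z_i ~ q
w = p_pdf(z) / q_pdf(z)           # importance weights omega_i
est = np.mean(w * z**2)           # estimate of E_p[z^2] = 1
print(est)
```

The estimate is unbiased as long as $q > 0$ wherever $p > 0$; a proposal narrower than the target would blow up the weight variance.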
sampling-importance resampling (SIR): get $L$ samples $z_i$ from $q(z)$ with known CDF; $\omega_i := \frac{p(z_i)}{q(z_i)}$; normalize $\tilde{\omega}_i := \frac{\omega_i}{\sum_j \omega_j}$; treat $\tilde{\omega}_i$ as the probability of $z_i$ and resample
simulated annealing: to avoid being trapped in a local minimum (still need to try multiple times); when seeking a minimum, accept an increase in $f$ with probability
$e^{-\Delta f / T} < 1$
where temperature $T \in (0, 1)$ is cooled as $T \leftarrow \gamma T$ with $\gamma = 0.99 \in (0, 1)$
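A minimal simulated-annealing sketch on a made-up 1-D objective (step size, iteration count, and cooling rate are arbitrary choices):

```python
import math
import random

random.seed(4)

def f(x):
    """Bumpy 1-D objective; global minimum near x = -0.3."""
    return x * x + 3 * math.sin(5 * x) + 3

x, T, gamma = 4.0, 1.0, 0.99
for _ in range(5000):
    x_new = x + random.gauss(0, 0.5)   # random local move
    df = f(x_new) - f(x)
    # always accept a decrease; accept an increase with probability e^{-df/T}
    if df < 0 or random.random() < math.exp(-df / T):
        x = x_new
    T *= gamma                          # cooling schedule T <- gamma * T
print(x, f(x))
```

Early on (large $T$) uphill moves are accepted often, letting the walker escape shallow local minima; as $T$ shrinks the process reduces to greedy descent.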
transition matrix $Q$: a stochastic matrix, because each row sums to 1
$\{Q^n\}$ converges to a stationary distribution $\Leftarrow$ all eigenvalues $|\lambda| \le 1$: $Q = A \Lambda A^{-1} \Rightarrow Q^n = A \Lambda^n A^{-1}$ where $\Lambda^n$ is diagonal with entries $\lambda_i^n$
$\pi_t$: probability of being at state $x = 1 \dots N$ at time $t$
sampling method: given PDF $p(z)$, set $Q$ s.t. $\pi(z) = p(z)$
a Markov chain is in its stationary distribution if (detailed balance):
$\pi(x)\, p_{xy} = \pi(y)\, p_{yx}$
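Iterating $\pi_{t+1} = \pi_t Q$ on a small made-up stochastic matrix shows convergence to the stationary $\pi$:

```python
import numpy as np

# A 3-state stochastic matrix: each row sums to 1
Q = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])

# Start from a point mass and iterate pi_{t+1} = pi_t Q
pi = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    pi = pi @ Q
print(pi)                       # stationary distribution
print(np.allclose(pi @ Q, pi))  # pi Q = pi -> True
```

This chain is irreducible and aperiodic, so the limit is the same from any starting distribution.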
Markov chain Monte Carlo (MCMC): works in high dimension; honors probability dependencies between samples; goal: sample a target distribution $p$
design Markov chain w/ stationary distribution π = p :
Metropolis–Hastings: get a Markov chain with $Q$ s.t. not necessarily $\pi = p$
acceptance rate $\alpha(x, x^*) := \min\left(1, \frac{p(x^*)\, Q_{x^*,x}}{p(x)\, Q_{x,x^*}}\right) \Rightarrow p(x)\, Q_{x,x^*}\, \alpha(x, x^*) = p(x^*)\, Q_{x^*,x}\, \alpha(x^*, x)$
use the new Markov chain with $Q'_{x,x^*} := \alpha(x, x^*)\, Q_{x,x^*}$ for $x^* \ne x$; $Q'_{x,x}$ takes the remaining probability s.t. $\sum_y Q'_{x,y} = 1$
drawback: do not know when the chain is stationary/converged; samples may be dependent
Gibbs sampling: want to sample each variable $x_i$ following a different (conditional) distribution
fix $x_{-i} := \{x_1 \dots x_{i-1}, x_{i+1} \dots\}$ at their previous values when sampling $x_i$: $Q_{x,x^*} := p(x_i^* \mid x_{-i})$
a special case of the Metropolis–Hastings method $\Leftarrow$ $p(x_{-i}^*) = p(x_{-i})$ (only $x_i$ changes), so
$\alpha(x, x^*) = \frac{p(x^*)\, p(x_i \mid x_{-i}^*)}{p(x)\, p(x_i^* \mid x_{-i})} = \frac{p(x_i^* \mid x_{-i}^*)\, p(x_{-i}^*)\, p(x_i \mid x_{-i}^*)}{p(x_i \mid x_{-i})\, p(x_{-i})\, p(x_i^* \mid x_{-i})} = \frac{p(x_i^* \mid x_{-i})\, p(x_{-i})\, p(x_i \mid x_{-i})}{p(x_i \mid x_{-i})\, p(x_{-i})\, p(x_i^* \mid x_{-i})} = 1$
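A random-walk Metropolis sketch for a standard normal target; with a symmetric proposal the $Q$ terms cancel in $\alpha$ (proposal width, chain length, and burn-in below are arbitrary choices):

```python
import math
import random

random.seed(5)

def p_unnorm(z):
    """Target density, known only up to a normalizing constant."""
    return math.exp(-z * z / 2)

samples, x = [], 0.0
for t in range(50_000):
    x_star = x + random.gauss(0, 1.0)  # symmetric proposal Q(x, x*)
    alpha = min(1.0, p_unnorm(x_star) / p_unnorm(x))
    if random.random() < alpha:        # accept with probability alpha
        x = x_star
    if t >= 1000:                      # discard burn-in
        samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # approximately 0 and 1 for N(0, 1)
```

Note the two drawbacks above in action: convergence is only diagnosed empirically (hence the burn-in), and successive samples are correlated.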
entropy: randomness, impurity, how hard to predict; equal to the expected surprise
$H = E(\text{sup}) = \sum_x p(x)\, \text{sup}(x) = -\sum_x p(x) \ln p(x)$
where surprise $\text{sup}(x) = \ln \frac{1}{p(x)} = -\ln p(x)$; basis of the cross-entropy loss
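The entropy and cross-entropy formulas in a minimal sketch:

```python
import math

def entropy(p):
    """H = -sum p(x) ln p(x); the surprise of outcome x is -ln p(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """-sum p(x) ln q(x); equals H(p) exactly when q = p."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

print(entropy([0.5, 0.5]))         # ln 2 ~ 0.693, maximum for 2 outcomes
print(entropy([1.0, 0.0]))         # 0: no randomness
print(entropy([0.9, 0.05, 0.05]))  # low impurity
```

Skipping $p(x) = 0$ terms uses the convention $0 \ln 0 = 0$, which matches the limit.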
information gain $I(Y \mid x_i) = H(Y) - H(Y \mid x_i)$ where $H(Y \mid x_i) = \sum_x p(x_i = x)\, H(Y \mid x_i = x)$; maximize information gain on each split
Bayes classifier $f^*$: minimizes the expected risk
empirical risk minimization (ERM): minimize the loss on training data
estimation error (training error): because the sample $X$ is finite; grows with $|F|$
approximation error (model complexity): because the function class $F$ is restricted
consistency: empirical risk $R_n(f)$ close to true risk $R(f)$: $P(|R(f) - R_n(f)| \ge \varepsilon) \to 0$ as $n \to \infty$
universally consistent wrt $F$: consistent $\forall P$; Bayes-consistent wrt $P$: $P(|R(f^*) - R_n(f)| \ge \varepsilon) \to 0$ as $n \to \infty$
$\Leftrightarrow$ uniform convergence: $P\left(\sup_{f \in F} |R_n(f) - R(f)| > \varepsilon\right) \to 0$
sufficiency: $P(|R(f) - R_n(f)| \ge \varepsilon) \le P\left(\sup_{f \in F} |R_n(f) - R(f)| \ge \frac{\varepsilon}{2}\right)$
Hoeffding: $P(|R(f) - R_n(f)| \ge \varepsilon) \le 2\exp(-2n\varepsilon^2)$
for a finite class $F = \{f_i\}$, $i = 1, \dots, m$ (union bound): $P(|R(f) - R_n(f)| \ge \varepsilon) \le 2m\exp(-2n\varepsilon^2)$
proposition: choose $\delta \in (0, 1)$ $\Rightarrow$ with probability at least $1 - \delta$: $|R(f) - R_n(f)| \le \sqrt{\frac{\ln(2m) - \ln\delta}{2n}}$
for an infinite class $F$: $P\left(\sup_{f \in F} |R(f) - R_n(f)| > \varepsilon\right) \le 2\, N(F, 2n) \exp\left(-\frac{n\varepsilon^2}{4}\right)$
proof: $P(|R(f) - R_n(f)| \ge \varepsilon) \le P\left(\sup_{f \in F} |R(f) - R_n(f)| \ge \frac{\varepsilon}{2}\right) \le 2 P\left(\sup_{f \in F} |R_n(f) - R_n'(f)| \ge \frac{\varepsilon}{2}\right)$ by the symmetrization lemma, where $R_n'(f)$ is the empirical risk on another $n$ samples (ghost sample)
$\Rightarrow$ there are at most $c \le N(F, 2n)$ equivalence classes of $f$ restricted to the sample & ghost sample
problem: hard to compute shattering coefficient
shattering coefficient $N(F, n)$: the maximum number of distinct functions $F_{X_1, \dots, X_n}$ we can get by restricting $F$ to $n$ points $X_1, \dots, X_n$
VC dimension: maximum $n$ s.t. $\exists X = \{X_1, \dots, X_n\}$ shattered by $F$: for every labeling of $X$, some $f \in F$ classifies it completely correctly
for a function class $F$ with VC dimension $d$ (Sauer's lemma): $N(F, n) \le \sum_{i=0}^d \binom{n}{i}$
ERM is consistent $\Leftrightarrow$ the VC dimension is finite
Rademacher complexity: richness of the function class $F$ on the sample set $\{X_i\}$
random sample labels $\sigma_i = \pm 1$, $i = 1, \dots, n$
$\mathrm{Rad}_n(F) = E\left[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)\right]$
computation: for each possible assignment of the $\sigma_i$, find the function $f$ that maximizes the inner sum, then average over all assignments
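That enumeration is feasible for a tiny sample and a small finite class; the threshold classifiers below are a made-up example class:

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 8
X = rng.normal(size=n)  # small synthetic sample

# A small finite class F of threshold classifiers f_t(x) = sign(x - t),
# each already restricted to the sample points
thresholds = [-1.0, -0.5, 0.0, 0.5, 1.0]
F = [np.sign(X - t) for t in thresholds]

# Rad_n(F) = E_sigma[ sup_f (1/n) sum_i sigma_i f(X_i) ],
# computed exactly by enumerating all 2^n sign vectors sigma
total = 0.0
for sigma in itertools.product([-1, 1], repeat=n):
    s = np.array(sigma)
    total += max(float(s @ f) / n for f in F)
rad = total / 2 ** n
print(rad)  # in [0, 1]; grows with the richness of F
```

In practice $2^n$ enumeration is infeasible, so the expectation over $\sigma$ is itself estimated by Monte Carlo.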
generalization bound: with probability $\ge 1 - \delta$, $\forall f \in F$:
$R(f) \le R_n(f) + 2\,\mathrm{Rad}_n(F) + \sqrt{\frac{-\ln\delta}{2n}} \le R_n(f) + \sqrt{\frac{d \ln\frac{en}{d} - \ln\delta}{2n}}$
$\Rightarrow$ balance training error and model complexity, e.g. via regularization
linear SVM: VC dimension $= d + 1$; RBF kernel $\Rightarrow$ VC dimension $= \infty$