- Machine Learning
- polynomial data fitting
- k-nearest neighbors predictor
- gradient descent
- logistic-regression classifier
- support vector machine (SVM)
Machine Learning
general problem format
given input $x$, want output $y$
- traditional approach
hand-craft the function $f$
- machine learning
build another function (a learner) and use it to generate an approximation $h$ of $f$
machine learning is about building a function $h$ from
- training data $T$
- a class of allowed functions $H$
use a predefined algorithm to compute $h \in H$ with the goal: $h(x) \approx f(x)$ for new inputs $x$
simplification using feature vector
convert the input into a 1-D vector, making machine learning generally applicable
- feature vector $\mathbf{x} \in X \subseteq \mathbb{R}^d$
- training set $T = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$
- hypothesis space $H$: the class of allowed functions
the predefined algorithm produces $h \in H$ so that $h(\mathbf{x}_n) \approx y_n$
loss $\ell(y, \hat{y})$
estimate of the error of predicting $\hat{y}$ when the true value is $y$
zero-one loss: $\ell(y, \hat{y}) = 0$ if $y = \hat{y}$, else $1$
quadratic loss: $\ell(y, \hat{y}) = \|y - \hat{y}\|^2$
empirical risk $L_T(h)$
average loss over the training set: $L_T(h) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, h(\mathbf{x}_n))$
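A minimal sketch of these definitions in code; the function names are illustrative, not from the notes:

```python
import numpy as np

def zero_one_loss(y, y_hat):
    # 0 if the prediction matches the true value, 1 otherwise
    return 0.0 if y == y_hat else 1.0

def quadratic_loss(y, y_hat):
    # squared Euclidean distance between truth and prediction
    return float(np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))

def empirical_risk(h, T, loss):
    # average loss of predictor h over the training set T = [(x_n, y_n), ...]
    return sum(loss(y, h(x)) for x, y in T) / len(T)

T = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0)]
h = lambda x: 2.0                            # a toy constant predictor
print(empirical_risk(h, T, quadratic_loss))  # (1 + 1 + 0) / 3 = 0.667
```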
three types of machine learning problems
classification problem
assign a label $y$ from a finite set $Y$ to given data $\mathbf{x}$
- classifier (predictor): the produced function $h: X \to Y$
- $\mathcal{T}$: the set of all possible training sets
signature of the machine-learning function: $\mathcal{T} \to H$, mapping a training set to a classifier
- learn (or train): produce the classifier $h$ from the training set $T$
- inference: apply the classifier to any data
- testing: apply the classifier to unseen data
a classifier defines a partition of $X$ into regions, one per label
regression problem
given data $\mathbf{x}$, return a vector $y \in \mathbb{R}^m$
clustering problem
group given data
supervised / unsupervised machine learning
supervised learning: classification, regression
unsupervised learning: clustering
polynomial data fitting
not machine learning but similar
- number of monomials in a polynomial of degree $k$ in $d$ variables: $\binom{k+d}{d}$
interpolation in polynomial data fitting
achieve zero loss: $L_T(h) = 0$ (the fit passes through every training point)
- with at least as many coefficients as data points, one can always interpolate
- overfitting
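A small numpy sketch of the interpolation/overfitting trade-off; the data and degrees are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
    risk = np.mean((y - np.polyval(coeffs, x)) ** 2)
    print(f"degree {degree}: training risk = {risk:.2e}")

# degree 9 has 10 coefficients for 10 points, so it interpolates
# (training risk ~ 0) but oscillates between samples: overfitting
```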
k-nearest neighbors predictor
remember the whole training set
return the average of the $y_n$'s corresponding to the $k$ closest $\mathbf{x}_n$'s to the query $\mathbf{x}$
- useful for both classification and regression
- smaller $k$ results in worse overfitting
- good interpolation and poor extrapolation
Voronoi diagram: for $k = 1$, the prediction is constant within each Voronoi cell of the training points
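A minimal k-NN sketch under a Euclidean metric, assuming the averaging form above (rounding the average would give a binary classifier); names are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    # distances from the query x to every stored training point
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]        # indices of the k closest x_n's
    return y_train[nearest].mean()     # average of the corresponding y_n's

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 0.0, 1.0, 1.0])
print(knn_predict(X_train, y_train, np.array([1.6]), k=3))  # 2/3
```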
gradient descent
stochastic gradient descent (SGD)
randomly group the training set into mini-batches; take a descent step using the gradient of each mini-batch's risk
- mini-steps are in the right direction on average
- epoch: using all data once
step size
fixed
decreasing
momentum
line search
subgradient: generalizes the gradient to non-differentiable convex functions
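A minimal SGD sketch on a least-squares problem; the fixed step size, batch size, and toy data are illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(100)

w = np.zeros(3)
alpha, batch = 0.1, 10
for epoch in range(50):                  # one epoch = one pass over the data
    order = rng.permutation(len(y))      # regroup into random mini-batches
    for start in range(0, len(y), batch):
        idx = order[start:start + batch]
        grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= alpha * grad                # step along the mini-batch gradient
print(w)                                 # close to w_true
```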
logistic-regression classifier
- score-based
score function for linear boundary
signed distance to hyperplane
hyperplane: $\{\mathbf{x} : \mathbf{w}^T \mathbf{x} + b = 0\}$
$\mathbf{w}$ is perpendicular to the hyperplane
distance of the hyperplane from the origin: $\frac{|b|}{\|\mathbf{w}\|}$
signed distance of $\mathbf{x}$ from the hyperplane: $\Delta(\mathbf{x}) = \frac{\mathbf{w}^T \mathbf{x} + b}{\|\mathbf{w}\|}$
logistic function $f(z) = \frac{1}{1 + e^{-z}}$
then the score function is $s(\mathbf{x}) = f(\mathbf{w}^T \mathbf{x} + b)$
- the activation $z = \mathbf{w}^T \mathbf{x} + b = \|\mathbf{w}\| \, \Delta(\mathbf{x})$ is the signed distance scaled by $\|\mathbf{w}\|$
softmax function: $\mathrm{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$
applying the softmax function to the activations generalizes the logistic score to multiple classes
cross entropy loss for binary classification
cross-entropy loss of assigning score $p$ to a point whose true label is $y \in \{0, 1\}$: $\ell(y, p) = -y \log p - (1 - y) \log(1 - p)$
- differentiable so we can do gradient descent
- the base of the logarithm does not matter (it only scales the loss by a constant)
- resulting risk function is weakly convex
cross-entropy loss for the $K$-class case: $\ell(y, \mathbf{p}) = -\sum_{k=1}^{K} y_k \log p_k$
- true label $y \in \{1, \dots, K\}$
- prediction $\mathbf{p} = \mathrm{softmax}(\mathbf{z})$
- $\mathbf{y} = (y_1, \dots, y_K)$: one-hot encoding of $y$
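A minimal logistic-regression sketch trained by full-batch gradient descent on the binary cross-entropy risk; the step size, iteration count, and toy data are illustrative:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # true labels in {0, 1}

w, b, alpha = np.zeros(2), 0.0, 0.5
for _ in range(500):
    p = logistic(X @ w + b)                 # scores in (0, 1)
    g = p - y                               # d(cross-entropy)/d(activation)
    w -= alpha * X.T @ g / len(y)
    b -= alpha * g.mean()

pred = logistic(X @ w + b) > 0.5
print("training accuracy:", (pred == y.astype(bool)).mean())
```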
support vector machine (SVM)
binary support vector machine
separating hyperplane for the binary support vector machine, with labels $y_n \in \{-1, +1\}$
decision rule: $h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + b)$
margin of $(\mathbf{x}_n, y_n)$ for the hyperplane with parameters $\mathbf{w}, b$: $\mu_n(\mathbf{w}, b) = y_n \, \frac{\mathbf{w}^T \mathbf{x}_n + b}{\|\mathbf{w}\|}$
margin of the training set: $\mu(\mathbf{w}, b) = \min_n \mu_n(\mathbf{w}, b)$
linearly separable if $\mu(\mathbf{w}, b) > 0$ for some $\mathbf{w}, b$
hinge loss: $\ell(y_n, \mathbf{x}_n) = \max(0, 1 - y_n (\mathbf{w}^T \mathbf{x}_n + b))$
reference margin $\mu^* = \frac{1}{\|\mathbf{w}\|}$
where the loss is zero iff the sample's margin is at least $\mu^*$
- zero loss on all samples $\Rightarrow$ separating hyperplane
empirical risk of the binary support vector machine: $L_T(\mathbf{w}, b) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \mathbf{x}_n) + \frac{C_0}{2} \|\mathbf{w}\|^2$
- bigger $C_0$, smaller $\|\mathbf{w}\|$, hence a larger reference margin $\mu^*$
soft linear support vector machine
subgradient of the hinge function $\max(0, 1 - z)$: $-1$ for $z < 1$, $0$ for $z > 1$, any value in $[-1, 0]$ at $z = 1$
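A minimal soft linear SVM sketch: subgradient descent on the hinge risk plus the regularizer; the values of $C_0$, the step size, and the toy data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1.0, -1.0)  # labels in {-1, +1}

w, b = np.zeros(2), 0.0
C0, alpha = 0.01, 0.1
for _ in range(1000):
    margins = y * (X @ w + b)
    active = margins < 1.0                      # samples with nonzero hinge loss
    # subgradient of (1/N) sum_n max(0, 1 - y_n (w.x_n + b)) + (C0/2) ||w||^2
    gw = C0 * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
    gb = -y[active].sum() / len(y)
    w -= alpha * gw
    b -= alpha * gb

print("training accuracy:", (np.sign(X @ w + b) == y).mean())
```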
support vector machine with kernel
representer theorem
for a loss function in the form $L_T(\mathbf{w}, b) = g(\|\mathbf{w}\|) + \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \mathbf{w}^T \mathbf{x}_n + b)$
where $g$ is increasing, there is a minimizer
$\mathbf{w}^* = \sum_{n=1}^{N} \alpha_n \mathbf{x}_n$ s.t. $L_T(\mathbf{w}^*, b^*) \leq L_T(\mathbf{w}, b)$ for all $\mathbf{w}, b$
proof: by writing $\mathbf{w} = \mathbf{w}_s + \mathbf{w}_\perp$ with $\mathbf{w}_s \in \mathrm{span}\{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ and $\mathbf{w}_\perp$ orthogonal to it,
and proving by contradiction that $\mathbf{w}_\perp = \mathbf{0}$ at a minimizer
support vector
samples that are misclassified, or classified correctly with a margin not larger than the reference margin $\mu^*$
only support vectors contribute to $\mathbf{w}^* = \sum_n \alpha_n \mathbf{x}_n$ (i.e., $\alpha_n \neq 0$ only for support vectors)
kernel for support vector machine
$h(\mathbf{x}) = \mathrm{sign}\big( \sum_{n=1}^{N} \alpha_n K(\mathbf{x}_n, \mathbf{x}) + b \big)$, where kernel $K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$
for some feature map $\phi: X \to \mathbb{R}^D$
e.g., $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}')^2$ is a kernel of the quadratic map $\phi$
where $\phi(\mathbf{x})$ collects all degree-2 monomials of $\mathbf{x}$
Mercer's condition: there exists $\phi$ s.t. $K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')$ iff, for every $g$ such that $\int g(\mathbf{x})^2 \, d\mathbf{x}$
is finite, $\iint K(\mathbf{x}, \mathbf{x}') \, g(\mathbf{x}) \, g(\mathbf{x}') \, d\mathbf{x} \, d\mathbf{x}' \geq 0$
Gaussian kernel $K(\mathbf{x}, \mathbf{x}') = \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2 \sigma^2} \right)$
- radial basis function (RBF) SVM
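A minimal sketch of prediction with the Gaussian (RBF) kernel in the representer-theorem form; the coefficients $\alpha_n$ would normally come from training and are set by hand here for illustration:

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def kernel_predict(X_train, alphas, b, x):
    # h(x) = sign(sum_n alpha_n K(x_n, x) + b)
    s = sum(a * gaussian_kernel(xn, x) for a, xn in zip(alphas, X_train))
    return np.sign(s + b)

X_train = np.array([[0.0, 0.0], [2.0, 2.0]])
alphas = np.array([-1.0, 1.0])      # hypothetical coefficients, not trained
print(kernel_predict(X_train, alphas, b=0.0, x=np.array([1.9, 1.9])))  # 1.0
```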