Speech Recognition

broader speech processing

  • wake-up word detection: binary classification
    • small window → low latency
    • multiple models, threshold & size
  • echo cancellation: subtract the (complex) echo of the loudspeaker output
  • sound source localization: microphone array, delay

online/streaming mode: continuous conversion

omnidirectional / cardioid / hypercardioid microphone

sound sample rate

  • need at least 2x the highest frequency wanted, else aliasing (Nyquist theorem; see the sketch after this list)
  • 44.1kHz for CD
  • 16kHz good enough for speech
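
A minimal numpy sketch of aliasing (not from the notes): a tone above the Nyquist limit fs/2 folds back to a lower frequency.

```python
import numpy as np

fs = 12_000        # sample rate; Nyquist limit is fs/2 = 6 kHz
f_tone = 8_000     # tone above the Nyquist limit
n = np.arange(1024)
x = np.cos(2 * np.pi * f_tone * n / fs)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
print(f"peak near {freqs[np.argmax(spectrum)]:.0f} Hz")   # ~4 kHz: the 8 kHz tone aliased
```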

-curve

audio format: use PCM .wav

online endpointing: push-to-talk, hit-to-talk, continuous listening

feature extraction

spectrogram: energy distribution over frequency vs time

  1. preemphasize speech signal: boost high frequency

  2. spectrogram: from time series to frequency domain

    • discrete Fourier transform (DFT)
    • symmetric, only first half + midpoint meaningful (by Nyquist theorem)
    • magnitude: bin k corresponds to frequency k·fs/N Hz
      • fs: sample rate, N: number of samples
      • power: magnitude squared
    • flat noise from the jump at frame boundaries when windowing
      • solution: multiply input by a bell-shaped window function, and use half-overlapping windows to avoid losing information
    • before using fast Fourier transform (FFT): zero padding
      • causes fake interpolated detail (interpolation only, no new information)
  3. auditory perception: from frequency to Bark

    • mimic the human ear, which distinguishes low frequencies better
    • frequency warping with Mel curve
    • filter bank: triangular filters on the Mel curve, multiplied with the spectrum
      • evenly spaced, half-overlapping; 40 is enough, 80 is good
      • translate back to equal-area triangles on frequency axis
        • not directly on Mel curve for simplicity
  4. log Mel spectrum: log of the integral (output) of each filter

  5. Mel cepstrum: discrete cosine transform (DCT) compression

    • reduce redundancy from filter bank overlap
    • subtract the mean (cepstral mean normalization) to remove microphone/channel differences
      • the microphone response is convolved with the speech
      • convolution --DFT→ multiplication --log→ summation, so it becomes an additive offset that mean subtraction removes
    • end-to-end numpy sketch of steps 1–5 after this list
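
A compact numpy sketch of steps 1–5 above (preemphasis, windowed frames, power spectrum, Mel filter bank, log, DCT, cepstral mean subtraction). The 0.97 preemphasis factor, frame length, 40 filters and 13 coefficients are common defaults, not values fixed by the notes.

```python
import numpy as np

def mfcc(signal, fs=16_000, n_fft=512, hop=256, n_mels=40, n_ceps=13):
    """Minimal MFCC pipeline following steps 1-5 above (hop = n_fft/2: half-overlap)."""
    # 1. preemphasis: boost high frequencies
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2. half-overlapping, bell-shape (Hann) windowed frames -> power spectrum
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # first half + midpoint only

    # 3. triangular Mel filter bank, evenly spaced on the Mel curve
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_inv(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    # 4. log Mel spectrum: log of each filter's output
    log_mel = np.log(power @ fbank.T + 1e-10)

    # 5. DCT to decorrelate the overlapping filters, keep the first n_ceps coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    ceps = log_mel @ dct.T

    # cepstral mean subtraction: removes the (convolutive) microphone/channel offset
    return ceps - ceps.mean(axis=0)
```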

longest common subsequence: dynamic programming, dummy char padding, streaming, search trellis, sub-trellis, lexical tree (minimal DP sketch after this list)

  • pruning: hard threshold vs beam search
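
A minimal LCS sketch via dynamic programming; the dummy padding shows up as row/column 0, and the function name is mine.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b via DP."""
    # dp[i][j] = LCS length of a[:i] and b[:j]; row/column 0 acts as dummy padding
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(lcs_length("speech", "peach"))   # 4 ("pech")
```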

DTW (dynamic time warping): disallow skipping (vertical moves on the trellis); non-linear mapping (super-diagonal moves); sketch after this list

  • D(i, j): best path cost from the origin to node (i, j)
  • d(i, j): local node cost (vector distance)
  • global beam across templates for beam search
  • combine multiple templates
    • template averaging: first align templates by DTW
    • template segmentation: compress chunks of same phoneme
    • actual method: iterate from uniform segments to stabilize variance
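
A minimal DTW sketch under the transition rules above (horizontal, diagonal and super-diagonal predecessors, no vertical moves), with Euclidean frame distance as the local node cost d(i, j); the names and the exact transition set are my reading of the notes.

```python
import numpy as np

def dtw(template, inp):
    """Best-path DTW cost between template and input (rows = feature frames).

    Allowed predecessors of trellis node (i, j): (i, j-1) horizontal,
    (i-1, j-1) diagonal, (i-2, j-1) super-diagonal; no vertical moves.
    """
    T, N = len(template), len(inp)
    D = np.full((T, N), np.inf)                               # D[i, j]: best path cost to (i, j)
    cost = lambda i, j: np.linalg.norm(template[i] - inp[j])  # local node cost d(i, j)
    D[0, 0] = cost(0, 0)
    for j in range(1, N):
        for i in range(T):
            best_prev = min(D[i, j - 1],
                            D[i - 1, j - 1] if i >= 1 else np.inf,
                            D[i - 2, j - 1] if i >= 2 else np.inf)
            D[i, j] = best_prev + cost(i, j)
    return D[T - 1, N - 1]

# e.g. align two MFCC sequences (frames x coefficients)
a, b = np.random.randn(20, 13), np.random.randn(25, 13)
print(dtw(a, b))
```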

Mahalanobis distance

  • can be estimated with negative Gaussian log likelihood
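
A small sketch of that relation, assuming a diagonal covariance (as the next point suggests): the negative Gaussian log likelihood is half the squared Mahalanobis distance plus a normalization term.

```python
import numpy as np

def neg_gaussian_loglik(x, mean, var):
    """Negative log likelihood of x under a diagonal-covariance Gaussian.

    Equals half the squared Mahalanobis distance plus a constant, so it can
    stand in for the node cost in DTW/HMM scoring.
    """
    mahalanobis_sq = np.sum((x - mean) ** 2 / var)
    log_det = np.sum(np.log(2 * np.pi * var))
    return 0.5 * (mahalanobis_sq + log_det)
```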

covariance

  • estimated with a diagonal covariance matrix when elements are largely uncorrelated

self-transition penalty: model phoneme duration

MFCC (Mel frequency cepstrum coefficient)

dynamic time warping (DTW)

  • align multiple training samples
    1. segment all training samples uniformly by #phonemes, and average to get the model
    2. align each sample against the model to get new segments and a new average
    3. iterate until convergence
  • transition probability: #segment switches over #frames in the segment (sketch below)
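
A tiny sketch of this estimate from a frame-level state alignment (the alignment itself would come from the segmentation step above); the handling of the final segment is simplified.

```python
import numpy as np

def transition_probs(alignment):
    """Self- and exit-transition probabilities per state from a frame alignment.

    alignment: one state id per frame, e.g. [0, 0, 0, 1, 1, 2, 2, 2, 2].
    Exit probability of a state = (#switches out of it) / (#frames spent in it).
    """
    alignment = np.asarray(alignment)
    probs = {}
    for s in np.unique(alignment):
        frames = np.sum(alignment == s)
        switches = np.sum((alignment[:-1] == s) & (alignment[1:] != s))
        probs[int(s)] = {"self": 1 - switches / frames, "exit": switches / frames}
    return probs

print(transition_probs([0, 0, 0, 1, 1, 2, 2, 2, 2]))
# state 0: exit 1/3, self 2/3; state 1: exit 1/2; state 2: exit 0 (final segment)
```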

hidden Markov model for DTW

  • use log probability so total score is log total probability
  • simulated with a transition probability matrix (each row sums to 1); sampling sketch after this list
  • initial probability
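
A minimal sketch of simulating a left-to-right HMM state sequence from an initial distribution and a row-stochastic transition matrix; the matrix values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([1.0, 0.0, 0.0])            # initial probability: always start in state 0
A = np.array([[0.7, 0.3, 0.0],            # left-to-right transition matrix,
              [0.0, 0.8, 0.2],            # each row sums to 1
              [0.0, 0.0, 1.0]])

state = int(rng.choice(3, p=pi))
path = [state]
for _ in range(14):
    state = int(rng.choice(3, p=A[state]))   # self-loops model phoneme duration
    path.append(state)
print(path)   # e.g. [0, 0, 0, 1, 1, 1, 1, 2, ...]
```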

expectation-maximization (EM) algorithm

  1. initialize with k-means clustering

  2. auxiliary function: conditional expectation of the complete data log likelihood

  3. evaluate

  4. iterate until convergence
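
A compact 1-D Gaussian-mixture EM sketch (E-step responsibilities, M-step re-estimation of weights, means and variances); a quantile split stands in for the k-means initialization the notes mention.

```python
import numpy as np

def gmm_em(x, k=2, n_iter=50):
    """EM for a 1-D Gaussian mixture: E-step posteriors, M-step re-estimation."""
    # crude initialization (the notes suggest k-means; a quantile split is enough here)
    mu = np.quantile(x, np.linspace(0.25, 0.75, k))
    var = np.full(k, np.var(x))
    w = np.full(k, 1 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: maximize the auxiliary function (expected complete-data log likelihood)
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return w, mu, var

x = np.concatenate([np.random.normal(-2, 1, 500), np.random.normal(3, 0.5, 500)])
print(gmm_em(x))
```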

forward algorithm

  1. initialize
  2. iterate
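
A minimal forward-algorithm sketch in the linear probability domain (real systems use the log-domain math below); pi, A and B are assumed inputs, not names from the notes.

```python
import numpy as np

def forward(pi, A, B):
    """Total observation likelihood P(O | model) via the forward algorithm.

    pi: (S,) initial probabilities, A: (S, S) transition matrix,
    B:  (T, S) observation likelihoods b_s(o_t) per frame.
    """
    alpha = pi * B[0]                       # 1. initialize
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]          # 2. iterate: sum over predecessor states
    return alpha.sum()
```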

Baum-Welch: soft state alignment

log-domain math
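
The usual log-add (log-sum-exp) trick: sum two probabilities without leaving the log domain.

```python
import numpy as np

def log_add(log_a, log_b):
    """log(a + b) from log a and log b without leaving the log domain."""
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    return hi + np.log1p(np.exp(lo - hi))   # factor out the larger term to avoid underflow

print(log_add(np.log(1e-300), np.log(1e-300)))   # ~ -690.08, i.e. log(2e-300)
```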

continuous speech recognition: small-scale problems, e.g. voice commands

  • wrap back from the end of a template to the start dummy state
    • high wrap-back cost → discourages word breaks (spaces)
  • lextree
  • non-emitting state/ null state: only for connecting, no self-transition
    • no transition time
  • prior probability for word: add to the start of its HMM
  • word transition probability for edge cost between word HMM
  • approximate sentence probability with best path

grammar: focuses only on syntax, not semantics

  • finite-state grammar (FSG)
  • context-free grammar (CFG)
  • backpointer: only word-level; extra bookkeeping at word transitions

training with continuous speech: bootstrap & iterate

  • silence: silence model, bypass arc, self-loop through a non-emitting state
    • the loop must pass through an emitting state, else it is infinite

N-gram

  • N-gram assumption: P(w_k | w_1 … w_(k−1)) ≈ P(w_k | w_(k−N+1) … w_(k−1))
  • start/end of sentence: <s>, </s>
  • unigram, bigram, …; 5-gram is good enough for commercial use
  • an N-gram model over D words needs D^(N−1) transition nodes
  • Good-Turing smoothing
    • Zipf’s law
  • backoff: fall back to a lower-order model when an N-gram is unseen (see the bigram sketch below)
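
A minimal bigram sketch with <s>/</s> padding and a crude backoff to the unigram for unseen bigrams; the constant backoff weight is a placeholder, not the Good-Turing/Katz discounting the notes refer to, and the result is not properly normalized.

```python
from collections import Counter

def train_bigram(sentences):
    """Unigram and bigram counts with <s>/</s> padding."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        uni.update(padded)
        bi.update(zip(padded[:-1], padded[1:]))
    return uni, bi

def bigram_prob(uni, bi, w_prev, w, backoff_weight=0.4):
    """P(w | w_prev); unseen bigrams back off to a scaled unigram probability."""
    if (w_prev, w) in bi:
        return bi[(w_prev, w)] / uni[w_prev]
    total = sum(uni.values())
    return backoff_weight * uni[w] / total      # crude stand-in for Katz backoff

uni, bi = train_bigram([["open", "the", "door"], ["close", "the", "door"]])
print(bigram_prob(uni, bi, "the", "door"))    # seen bigram: 2/2 = 1.0
print(bigram_prob(uni, bi, "open", "door"))   # unseen bigram: backed-off unigram
```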

state-of-the-art system

  • units of sound learned by clustering a large unlabelled dataset
  • transfer learning with a small labeled dataset

phoneme: ~39 in English; Mandarin includes tones

  • few rare phonemes, and they are far more common than rare words → beats Zipf’s law
  • defined by linguistics, or learned by clustering
  • mono-phone/tri-phone: context-independent/dependent (on neighboring phonemes)
  • absorbing/generating state → non-emitting
  • locus: stable centers in spectrum

multiple pronunciations: multiple internal models + probabilities

mono-phone/ context-independent (CI) model

di-phone: models the previous and current phoneme; problem: cross-word effects

tri-phone: multiple models of the current phoneme, conditioned on the previous and next phonemes

  • many contexts never seen → back off to mono-phone
  • share Gaussians with the mono-phone, with different weights
    • decision tree

inexact search: run an n-gram search on an (n-1)-gram graph by applying word-transition probabilities