Speech Recognition
broader speech processing
- wake-up word detection: binary classification
- small window → low latency
- multiple models, threshold & size
- echo cancellation: subtract the (complex) echo of the sound played by the device's own speaker
- sound source localization: microphone array, delay
online/streaming mode: continuous conversion of audio to text as it arrives
omnidirectional / cardioid / hypercardioid microphone
sound sample rate
- need at least 2× the highest frequency of interest, else aliasing (Nyquist theorem)
- 44.1kHz for CD
- 16kHz good enough for speech
-curve
audio format: use PCM .wav
online endpointing modes: push-to-talk, hit-to-talk, continuous listening
feature extraction
spectrogram: energy distribution over frequency vs time
pre-emphasize the speech signal: boost high frequencies
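A minimal sketch of pre-emphasis; the 0.97 coefficient is a common choice, not one prescribed by these notes:

```python
import numpy as np

def preemphasize(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```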
spectrogram: from time series to frequency domain
- discrete Fourier transform (DFT)
- symmetric for real input, so only the first half + the midpoint are meaningful (consistent with the Nyquist limit)
- magnitude |X[k]|; bin k corresponds to frequency k·f_s/N Hz
- f_s: sample rate, N: number of samples
- power: magnitude squared
- spectral leakage (flat noise) from the jump at the edges of the sampling window
- solution: multiply the input by a bell-shaped windowing function, and use half-overlapping windows to avoid losing information
- before using fast Fourier transform (FFT): zero padding
- causes fake, interpolated detail (no new information)
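A rough sketch of these steps with illustrative frame/hop/FFT sizes (25 ms frames with half overlap at 16 kHz):

```python
import numpy as np

def power_spectrogram(x, frame_len=400, hop=200, n_fft=512):
    """Split x into half-overlapping frames, apply a bell-shaped (Hann)
    window to soften the edge discontinuities, zero-pad to n_fft, and
    keep only the first n_fft//2 + 1 bins (the rest mirror them)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)   # zero-pads to n_fft
    return np.abs(spectrum) ** 2                      # power = magnitude squared
```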
auditory perception: from frequency to Bark
- mimic the human ear, which distinguishes low frequencies better
- frequency warping with Mel curve
- filter bank: triangular filters on the Mel curve, multiplied with the power spectrum
- evenly spaced, half-overlapping; 40 filters are enough, 80 are good
- translate back to equal-area triangles on frequency axis
- applied on the linear frequency axis rather than directly on the Mel curve, for simplicity
log Mel spectrum: log of the energy integrated by each filter in the bank
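A sketch of the Mel filter bank and log-Mel spectrum, assuming the power spectrogram from the earlier sketch; 40 filters as noted above, and 2595·log10(1 + f/700) is the standard Mel warping:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrum(power_spec, sample_rate=16000, n_fft=512, n_filters=40):
    """Triangular filters evenly spaced on the Mel curve, half-overlapping,
    mapped back to the linear frequency axis and applied to the power spectrum."""
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bin_points = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bin_points[i], bin_points[i + 1], bin_points[i + 2]
        for k in range(left, center):
            fbank[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i, k] = (right - k) / max(right - center, 1)
    return np.log(power_spec @ fbank.T + 1e-10)   # integrate each filter, then log
```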
Mel cepstrum discrete cosine transform (DCT) compression
- reduce redundancy from filter bank overlap
- subtract the mean (cepstral mean subtraction) to remove microphone/noise differences
- microphone/channel response is convolved with the speech signal
- convolution --(DFT)→ multiplication --(log)→ summation, i.e. an additive offset that mean subtraction removes
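A sketch of the cepstral step, assuming the log-Mel matrix from the earlier sketch; keeping 13 coefficients is an illustrative choice:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(log_mel, n_coeffs=13):
    """DCT decorrelates the overlapping filter-bank outputs; subtracting the
    per-utterance mean removes the (convolutional) microphone/channel response,
    which the log has turned into an additive constant."""
    cepstra = dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_coeffs]
    return cepstra - cepstra.mean(axis=0)
```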
longest common subsequence: dynamic programming, dummy char padding, streaming, search trellis, sub-trellis, lexical tree
- pruning: hard threshold vs beam search
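For reference, a minimal LCS dynamic program; the extra row/column of zeros plays the role of the dummy-character padding (pruning omitted):

```python
def lcs_length(a: str, b: str) -> int:
    """Classic DP: table of size (len(a)+1) x (len(b)+1), with the dummy
    row/column of zeros standing in for the padded dummy character."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]
```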
DTW (dynamic time warping): disallow skipping (no vertical moves on the trellis); non-linear mapping via super-diagonal moves
- P(i, j): best path cost from the origin to node (i, j)
- C(i, j): local node cost (vector distance between input frame i and template frame j); see the sketch after this list
- global beam across templates for beam search
- combine multiple templates
- template averaging: first align templates by DTW
- template segmentation: compress chunks of same phoneme
- actual method: iterate from uniform segments to stabilize variance
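A minimal DTW sketch under the constraints above (every input frame advances by one column; the template index may stay, advance by one, or skip one); `frames` and `template` are assumed to be arrays of feature vectors:

```python
import numpy as np

def dtw_cost(frames, template):
    """Best-path cost through the trellis:
    P[i, j] = C[i, j] + min over allowed predecessors
    (same template row, one below, or two below = super-diagonal)."""
    INF = np.inf
    n, m = len(frames), len(template)
    P = np.full((n, m), INF)
    C = lambda i, j: np.linalg.norm(frames[i] - template[j])   # local node cost
    P[0, 0] = C(0, 0)
    for i in range(1, n):
        for j in range(m):
            best_prev = min(P[i - 1, j],
                            P[i - 1, j - 1] if j >= 1 else INF,
                            P[i - 1, j - 2] if j >= 2 else INF)
            if best_prev < INF:
                P[i, j] = best_prev + C(i, j)
    return P[-1, -1]   # cost of aligning all frames to the full template
```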
Mahalanobis distance
- can be estimated with negative Gaussian log likelihood
covariance
- estimated with a diagonal covariance matrix when the feature dimensions are largely uncorrelated
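As a sketch, the node cost as a negative log likelihood under a diagonal-covariance Gaussian (the quadratic term is the squared Mahalanobis distance):

```python
import numpy as np

def neg_log_gaussian(x, mean, var):
    """Diagonal-covariance Gaussian: -log N(x; mean, diag(var)).
    The quadratic term is the (squared) Mahalanobis distance."""
    mahalanobis_sq = np.sum((x - mean) ** 2 / var)
    log_norm = 0.5 * np.sum(np.log(2.0 * np.pi * var))
    return 0.5 * mahalanobis_sq + log_norm
```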
self-transition penalty: model phoneme duration
MFCC (Mel frequency cepstrum coefficient)
dynamic time warping (DTW)
- align multiple training samples
- segment all training samples uniformly by the number of phonemes, and average
- align each sample against the averaged model to get new segments and a new average
- iterate until convergence
- transition probability: number of segment switches over number of frames in the segment
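A sketch of this estimate; `segment_lengths` is a hypothetical input holding the frame counts of every occurrence of one segment/state:

```python
def transition_probs(segment_lengths):
    """Given the lengths (in frames) of every occurrence of one segment,
    estimate P(exit) = #switches / #frames and P(self-loop) = 1 - P(exit)."""
    n_frames = sum(segment_lengths)
    n_switches = len(segment_lengths)   # each occurrence ends with one switch
    p_exit = n_switches / n_frames
    return {"self": 1.0 - p_exit, "exit": p_exit}
```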
hidden Markov model for DTW
- use log probability so total score is log total probability
- simulation by transition probability matrix (each row sums to 1)
- initial probability
expectation-maximization (EM) algorithm
initialize with k-means clustering
auxiliary function Q(θ, θ_old): conditional expectation of the complete-data log likelihood
evaluate the expectation under the current parameters (E-step), then maximize Q over the parameters (M-step)
iterate until convergence
forward algorithm
- initialize: α_1(j) = π_j · b_j(x_1)
- iterate: α_t(j) = [ Σ_i α_{t-1}(i) · a_{ij} ] · b_j(x_t)
Baum-Welch: soft state alignment
log-domain math: sums of probabilities become log-sum-exp (see the sketch below)
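A sketch of the forward recursion in the log domain; `log_pi`, `log_A`, and `log_b` are assumed to be precomputed log initial, transition, and emission scores:

```python
import numpy as np
from scipy.special import logsumexp

def forward_log_likelihood(log_pi, log_A, log_b):
    """log_pi: (S,) initial log probs; log_A: (S, S) transition log probs
    (each row of exp(log_A) sums to 1); log_b: (T, S) emission log probs.
    Returns log P(observations), summed over all state paths."""
    T, S = log_b.shape
    alpha = log_pi + log_b[0]                                          # initialize
    for t in range(1, T):
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_b[t]   # iterate
    return logsumexp(alpha)
```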
continuous speech recognition: small-scale problems, e.g. voice commands
- wrap back from the end of each template to the start dummy node
- high wrap-back cost → discourages inserting extra word boundaries
- lextree
- non-emitting state / null state: only for connecting models, no self-transition
- consumes no time (no frame is emitted on the transition)
- prior probability for word: add to the start of its HMM
- word transition probability as the edge cost between word HMMs
- approximate sentence probability with best path
grammar: focuses only on syntax, not semantics
- finite-state grammar (FSG)
- context-free grammar (CFG)
- backpointer: kept only at the word level, as extra bookkeeping on word transitions
training with continuous speech: bootstrap & iterate
- silence: silence model, bypass arc, self-loop non-emitting state
- loops must pass through an emitting state, otherwise infinite non-emitting cycles are possible
N-gram
- N-gram assumption: P(w_i | w_1 … w_{i-1}) ≈ P(w_i | w_{i-N+1} … w_{i-1})
- start/end of sentence tokens: <s>, </s>
- unigram, bigram, …; a 5-gram is good enough for commercial use
- an N-gram over a vocabulary of D words needs D^(N-1) transition nodes (one per word history)
- Good-Turing smoothing
- Zipf’s law
- backoff: fall back to a lower-order N-gram when the higher-order count is zero
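The notes mention Good-Turing and backoff; as a runnable illustration, here is a simpler "stupid backoff"-style bigram score (not the Good-Turing/Katz scheme; the 0.4 factor is a conventional but illustrative choice):

```python
from collections import Counter

def stupid_backoff_score(tokens, w_prev, w, alpha=0.4):
    """Relative bigram frequency if the pair was seen, otherwise alpha times
    the unigram relative frequency. (A score, not a normalized probability.)"""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(w_prev, w)] > 0:
        return bigrams[(w_prev, w)] / unigrams[w_prev]
    return alpha * unigrams[w] / len(tokens)

# e.g. tokens = "<s> open the door </s> <s> open the window </s>".split()
```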
state-of-the-art system
- units of sound found by clustering a large unlabelled dataset
- transfer learning with a small labeled dataset
phonemes: 39 in English; Mandarin adds tones
- few rare phonemes, and even rare phonemes are far more common than rare words → beats Zipf's law
- defined by linguistics, or learned by clustering
- mono-phone/tri-phone: context-independent/context-dependent (on neighboring phonemes)
- absorbing/generating state → non-emitting
- locus: stable centers in spectrum
multiple pronunciations: multiple internal models + pronunciation probabilities
mono-phone/ context-independent (CI) model
di-phone: models the previous and current phoneme; problem: cross-word effects
tri-phone: models multiple variants of the current phoneme based on the previous and next phonemes
- many cases not seen, back off to mono-phone
- share Gaussians with the mono-phone model, with different mixture weights
- decision tree to cluster/tie tri-phone states
inexact search: run an n-gram language model on an (n-1)-gram decoding structure by applying the word-transition probabilities