Speech Recognition

broader speech processing

  • wake-up word detection: binary classification
    • small window → low latency
    • multiple models, threshold & size
  • echo cancellation: subtract the (complex) echo of the loudspeaker output
  • sound source localization: microphone array, delay

online/streaming mode: continuous conversion

omnidirectional / cardioid / hypercardioid microphone

sound sample rate

  • need at least 2x the highest frequency wanted, else aliasing (Nyquist theorem; see the sketch after this list)
  • 44.1kHz for CD
  • 16kHz good enough for speech
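
A minimal numpy sketch of aliasing (not from the notes): a tone above the Nyquist limit fs/2 folds back to a lower frequency.

```python
import numpy as np

fs = 12_000        # sample rate; Nyquist limit is fs/2 = 6 kHz
f_tone = 8_000     # tone above the Nyquist limit
n = np.arange(1024)
x = np.cos(2 * np.pi * f_tone * n / fs)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
print(f"peak near {freqs[np.argmax(spectrum)]:.0f} Hz")   # ~4 kHz: the 8 kHz tone aliased
```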

-curve

audio format: use PCM .wav

online endpointing: push-to-talk, hit-to-talk, continuous listening

feature extraction

spectrogram: energy distribution over frequency vs time

  1. preemphasize speech signal: boost high frequency

  2. spectrogram: from time series to frequency domain

    • discrete Fourier transform (DFT)
    • symmetric, only first half + midpoint meaningful (by Nyquist theorem)
    • magnitude: bin k corresponds to frequency k·fs/N Hz
      • fs: sample rate, N: number of samples
      • power: magnitude squared
    • flat noise from the jump at frame boundaries when windowing
      • solution: multiply input by a bell-shaped window function, and use half-overlapping windows to avoid losing information
    • before using fast Fourier transform (FFT): zero padding
      • causes fake interpolated detail (interpolation only, no new information)
  3. auditory perception: from frequency to Bark

    • mimic the human ear, which distinguishes low frequencies better
    • frequency warping with Mel curve
    • filter bank: triangular filters on the Mel curve, multiplied with the spectrum
      • evenly spaced, half-overlapping; 40 is enough, 80 is good
      • translate back to equal-area triangles on frequency axis
        • not directly on Mel curve for simplicity
  4. log Mel spectrum: log of the integral (output) of each filter

  5. Mel cepstrum: discrete cosine transform (DCT) compression

    • reduce redundancy from filter bank overlap
    • subtract the mean (cepstral mean normalization) to remove microphone/channel differences
      • the microphone response is convolved with the speech
      • convolution --DFT→ multiplication --log→ summation, so it becomes an additive offset that mean subtraction removes
    • end-to-end numpy sketch of steps 1–5 after this list
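
A compact numpy sketch of steps 1–5 above (preemphasis, windowed frames, power spectrum, Mel filter bank, log, DCT, cepstral mean subtraction). The 0.97 preemphasis factor, frame length, 40 filters and 13 coefficients are common defaults, not values fixed by the notes.

```python
import numpy as np

def mfcc(signal, fs=16_000, n_fft=512, hop=256, n_mels=40, n_ceps=13):
    """Minimal MFCC pipeline following steps 1-5 above (hop = n_fft/2: half-overlap)."""
    # 1. preemphasis: boost high frequencies
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2. half-overlapping, bell-shape (Hann) windowed frames -> power spectrum
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # first half + midpoint only

    # 3. triangular Mel filter bank, evenly spaced on the Mel curve
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(mel(0), mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_inv(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge

    # 4. log Mel spectrum: log of each filter's output
    log_mel = np.log(power @ fbank.T + 1e-10)

    # 5. DCT to decorrelate the overlapping filters, keep the first n_ceps coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    ceps = log_mel @ dct.T

    # cepstral mean subtraction: removes the (convolutive) microphone/channel offset
    return ceps - ceps.mean(axis=0)
```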

longest common subsequence: dynamic programming, dummy char padding, streaming, search trellis, sub-trellis, lexical tree (minimal DP sketch after this list)

  • pruning: hard threshold vs beam search
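
A minimal LCS sketch via dynamic programming; the dummy padding shows up as row/column 0, and the function name is mine.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b via DP."""
    # dp[i][j] = LCS length of a[:i] and b[:j]; row/column 0 acts as dummy padding
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(lcs_length("speech", "peach"))   # 4 ("pech")
```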

DTW (dynamic time warping): disallow skipping (vertical moves on the trellis); non-linear mapping (super-diagonal moves); sketch after this list

  • D(i, j): best path cost from the origin to node (i, j)
  • d(i, j): local node cost (vector distance)
  • global beam across templates for beam search
  • combine multiple templates
    • template averaging: first align templates by DTW
    • template segmentation: compress chunks of same phoneme
    • actual method: iterate from uniform segments to stabilize variance
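
A minimal DTW sketch under the transition rules above (horizontal, diagonal and super-diagonal predecessors, no vertical moves), with Euclidean frame distance as the local node cost d(i, j); the names and the exact transition set are my reading of the notes.

```python
import numpy as np

def dtw(template, inp):
    """Best-path DTW cost between template and input (rows = feature frames).

    Allowed predecessors of trellis node (i, j): (i, j-1) horizontal,
    (i-1, j-1) diagonal, (i-2, j-1) super-diagonal; no vertical moves.
    """
    T, N = len(template), len(inp)
    D = np.full((T, N), np.inf)                               # D[i, j]: best path cost to (i, j)
    cost = lambda i, j: np.linalg.norm(template[i] - inp[j])  # local node cost d(i, j)
    D[0, 0] = cost(0, 0)
    for j in range(1, N):
        for i in range(T):
            best_prev = min(D[i, j - 1],
                            D[i - 1, j - 1] if i >= 1 else np.inf,
                            D[i - 2, j - 1] if i >= 2 else np.inf)
            D[i, j] = best_prev + cost(i, j)
    return D[T - 1, N - 1]

# e.g. align two MFCC sequences (frames x coefficients)
a, b = np.random.randn(20, 13), np.random.randn(25, 13)
print(dtw(a, b))
```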

Mahalanobis distance

  • can be estimated with negative Gaussian log likelihood
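
A small sketch of that relation, assuming a diagonal covariance (as the next point suggests): the negative Gaussian log likelihood is half the squared Mahalanobis distance plus a normalization term.

```python
import numpy as np

def neg_gaussian_loglik(x, mean, var):
    """Negative log likelihood of x under a diagonal-covariance Gaussian.

    Equals half the squared Mahalanobis distance plus a constant, so it can
    stand in for the node cost in DTW/HMM scoring.
    """
    mahalanobis_sq = np.sum((x - mean) ** 2 / var)
    log_det = np.sum(np.log(2 * np.pi * var))
    return 0.5 * (mahalanobis_sq + log_det)
```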

covariance

  • estimated with a diagonal covariance matrix when elements are largely uncorrelated

self-transition penalty: model phoneme duration

MFCC (Mel frequency cepstrum coefficient)

dynamic time warping (DTW)

  • align multiple training samples
    1. segment all training samples uniformly by #phonemes, and average to get the model
    2. align each sample against the model to get new segments and a new average
    3. iterate until convergence
  • transition probability: #segment switches over #frames in the segment (sketch below)
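
A tiny sketch of this estimate from a frame-level state alignment (the alignment itself would come from the segmentation step above); the handling of the final segment is simplified.

```python
import numpy as np

def transition_probs(alignment):
    """Self- and exit-transition probabilities per state from a frame alignment.

    alignment: one state id per frame, e.g. [0, 0, 0, 1, 1, 2, 2, 2, 2].
    Exit probability of a state = (#switches out of it) / (#frames spent in it).
    """
    alignment = np.asarray(alignment)
    probs = {}
    for s in np.unique(alignment):
        frames = np.sum(alignment == s)
        switches = np.sum((alignment[:-1] == s) & (alignment[1:] != s))
        probs[int(s)] = {"self": 1 - switches / frames, "exit": switches / frames}
    return probs

print(transition_probs([0, 0, 0, 1, 1, 2, 2, 2, 2]))
# state 0: exit 1/3, self 2/3; state 1: exit 1/2; state 2: exit 0 (final segment)
```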

hidden Markov model for DTW

  • use log probability so total score is log total probability
  • simulated with a transition probability matrix (each row sums to 1); sampling sketch after this list
  • initial probability
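
A minimal sketch of simulating a left-to-right HMM state sequence from an initial distribution and a row-stochastic transition matrix; the matrix values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([1.0, 0.0, 0.0])            # initial probability: always start in state 0
A = np.array([[0.7, 0.3, 0.0],            # left-to-right transition matrix,
              [0.0, 0.8, 0.2],            # each row sums to 1
              [0.0, 0.0, 1.0]])

state = int(rng.choice(3, p=pi))
path = [state]
for _ in range(14):
    state = int(rng.choice(3, p=A[state]))   # self-loops model phoneme duration
    path.append(state)
print(path)   # e.g. [0, 0, 0, 1, 1, 1, 1, 2, ...]
```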

expectation-maximization (EM) algorithm

  1. initialize with k-means clustering

  2. auxiliary function: conditional expectation of the complete data log likelihood

  3. evaluate

  4. iterate until convergence
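
A compact 1-D Gaussian-mixture EM sketch (E-step responsibilities, M-step re-estimation of weights, means and variances); a quantile split stands in for the k-means initialization the notes mention.

```python
import numpy as np

def gmm_em(x, k=2, n_iter=50):
    """EM for a 1-D Gaussian mixture: E-step posteriors, M-step re-estimation."""
    # crude initialization (the notes suggest k-means; a quantile split is enough here)
    mu = np.quantile(x, np.linspace(0.25, 0.75, k))
    var = np.full(k, np.var(x))
    w = np.full(k, 1 / k)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: maximize the auxiliary function (expected complete-data log likelihood)
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return w, mu, var

x = np.concatenate([np.random.normal(-2, 1, 500), np.random.normal(3, 0.5, 500)])
print(gmm_em(x))
```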

forward algorithm

  1. initialize
  2. iterate
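
A minimal forward-algorithm sketch in the linear probability domain (real systems use the log-domain math below); pi, A and B are assumed inputs, not names from the notes.

```python
import numpy as np

def forward(pi, A, B):
    """Total observation likelihood P(O | model) via the forward algorithm.

    pi: (S,) initial probabilities, A: (S, S) transition matrix,
    B:  (T, S) observation likelihoods b_s(o_t) per frame.
    """
    alpha = pi * B[0]                       # 1. initialize
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]          # 2. iterate: sum over predecessor states
    return alpha.sum()
```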

Baum-Welch: soft state alignment

log-domain math
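
The usual log-add (log-sum-exp) trick: sum two probabilities without leaving the log domain.

```python
import numpy as np

def log_add(log_a, log_b):
    """log(a + b) from log a and log b without leaving the log domain."""
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    return hi + np.log1p(np.exp(lo - hi))   # factor out the larger term to avoid underflow

print(log_add(np.log(1e-300), np.log(1e-300)))   # ~ -690.08, i.e. log(2e-300)
```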

continuous speech recognition: small-scale problems, e.g. voice commands

  • wrap back from the end of a template to the start dummy state
    • high wrap-back cost → discourages word breaks (spaces)
  • lextree
  • non-emitting state/ null state: only for connecting, no self-transition
    • no transition time
  • prior probability for word: add to the start of its HMM
  • word transition probability for edge cost between word HMM
  • approximate sentence probability with best path

grammar: focuses only on syntax, not semantics

  • finite-state grammar (FSG)
  • context-free grammar (CFG)
  • backpointer: only word-level; extra bookkeeping at word transitions

training with continuous speech: bootstrap & iterate

  • silence: silence model, bypass arc, self-loop through a non-emitting state
    • the loop must pass through an emitting state, else it is infinite

N-gram

  • N-gram assumption: P(w_k | w_1 … w_(k−1)) ≈ P(w_k | w_(k−N+1) … w_(k−1))
  • start/end of sentence: <s>, </s>
  • unigram, bigram, …; 5-gram is good enough for commercial use
  • an N-gram model over D words needs D^(N−1) transition nodes
  • Good-Turing smoothing
    • Zipf’s law
  • backoff: fall back to a lower-order model when an N-gram is unseen (see the bigram sketch below)
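
A minimal bigram sketch with <s>/</s> padding and a crude backoff to the unigram for unseen bigrams; the constant backoff weight is a placeholder, not the Good-Turing/Katz discounting the notes refer to, and the result is not properly normalized.

```python
from collections import Counter

def train_bigram(sentences):
    """Unigram and bigram counts with <s>/</s> padding."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        uni.update(padded)
        bi.update(zip(padded[:-1], padded[1:]))
    return uni, bi

def bigram_prob(uni, bi, w_prev, w, backoff_weight=0.4):
    """P(w | w_prev); unseen bigrams back off to a scaled unigram probability."""
    if (w_prev, w) in bi:
        return bi[(w_prev, w)] / uni[w_prev]
    total = sum(uni.values())
    return backoff_weight * uni[w] / total      # crude stand-in for Katz backoff

uni, bi = train_bigram([["open", "the", "door"], ["close", "the", "door"]])
print(bigram_prob(uni, bi, "the", "door"))    # seen bigram: 2/2 = 1.0
print(bigram_prob(uni, bi, "open", "door"))   # unseen bigram: backed-off unigram
```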

state-of-the-art system

  • units of sound learned by clustering a large unlabelled dataset
  • transfer learning with a small labeled dataset

phoneme: ~39 in English; Mandarin includes tones

  • few rare phonemes, and they are far more common than rare words → beats Zipf’s law
  • defined by linguistics, or learned by clustering
  • mono-phone/tri-phone: context-independent/dependent (on neighboring phonemes)
  • absorbing/generating state → non-emitting
  • locus: stable centers in spectrum

multiple pronunciations: multiple internal models + probabilities

mono-phone/ context-independent (CI) model

di-phone: models the previous and current phoneme; problem: cross-word effects

tri-phone: multiple models of the current phoneme, conditioned on the previous and next phonemes

  • many contexts never seen → back off to mono-phone
  • share Gaussians with the mono-phone, with different weights
    • decision tree

inexact search: run an n-gram search on an (n-1)-gram graph by applying word-transition probabilities