Advanced Natural Language Processing

  • natural language is hard because of ambiguity

  • breakdown of language: phonetics (sound) → phonology (meaningfully distinct sounds) / orthography (characters)

    • → morphology (word structure) → lexeme (word)
    • → syntax (grammar) → semantics (meaning) → pragmatics (context) → discourse (relations between sentences)
  • current models struggle when samples are long and few

  • word frequencies overwhelmingly obey Zipf’s law: the r-th most frequent word has frequency roughly proportional to 1/r (see the sketch below)
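A minimal sketch of checking Zipf’s law on a toy corpus; the corpus string and the 1/rank comparison are illustrative assumptions, not from the lecture, and a real corpus would show the effect far more clearly.

```python
from collections import Counter

# Toy corpus; any large text file would show the rank-frequency pattern much better.
corpus = "the cat sat on the mat and the dog sat on the log".split()

counts = Counter(corpus)
ranked = counts.most_common()

# Zipf's law: frequency of the r-th most common word is roughly C / r.
c = ranked[0][1]  # use the top frequency as the constant C
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"rank {rank:2d}  {word:5s}  observed {freq}  Zipf-predicted {c / rank:.2f}")
```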

Introduction to Natural Language Processing, Jacob Eisenstein, The MIT Press, 2018

Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, with Language Models, Daniel Jurafsky, James H. Martin, 2025

paper to present

linear model

  • inter-annotator agreement: human annotators may disagree

    • raw agreement rate
    • need to subtract the agreement expected by chance (sum over labels of the products of the annotators' raw label rates), as in Cohen's kappa; see the kappa sketch after this list
    • what counts as a good rate varies by opinion: 0.8 is very good; some consider 0.4 acceptable
  • perceptron: linear model trained by gradient-style updates on a loss

    • loss (and hence the update) uses only the argmax y
  • logistic regression: softmax over all y instead of just the argmax y

    • more stable to train than the perceptron; see the update sketch after this list
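A minimal sketch of chance-corrected agreement (Cohen's kappa) for two annotators; the label lists are made-up examples, not real annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    raw = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # raw agreement rate
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement by chance: sum over labels of the product of each
    # annotator's rate for that label.
    expected = sum((freq_a[y] / n) * (freq_b[y] / n) for y in set(labels_a) | set(labels_b))
    return (raw - expected) / (1 - expected)

# Made-up annotations of 8 items by two annotators
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "neg"]
print(cohens_kappa(a, b))
```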
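A minimal sketch contrasting the two update rules for a multiclass linear classifier: the perceptron only touches the gold and argmax classes, while logistic regression spreads the gradient over all classes via the softmax. The feature vector, label, and learning rate are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_feats = 3, 5
W = np.zeros((num_classes, num_feats))
x = rng.normal(size=num_feats)   # toy feature vector
y = 1                            # toy gold label
lr = 0.1

# Perceptron: update only the argmax (predicted) class and the gold class.
y_hat = int(np.argmax(W @ x))
if y_hat != y:
    W[y] += lr * x
    W[y_hat] -= lr * x

# Logistic regression: softmax over all classes, gradient touches every class.
scores = W @ x
probs = np.exp(scores - scores.max())
probs /= probs.sum()
grad = np.outer(probs, x)        # gradient of the negative log-likelihood
grad[y] -= x
W -= lr * grad
```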

naive Bayes

  • assumes (wrongly) that features/words are conditionally independent given the label; minimal sketch below
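A minimal sketch of a multinomial naive Bayes text classifier with add-one smoothing; the tiny training set and test sentence are invented examples.

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: (words, label)
train = [
    ("good great fun".split(), "pos"),
    ("boring bad awful".split(), "neg"),
    ("great plot good acting".split(), "pos"),
    ("bad acting awful plot".split(), "neg"),
]

label_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for words, label in train:
    word_counts[label].update(words)
vocab = {w for words, _ in train for w in words}

def predict(words):
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log prior + sum of log likelihoods, assuming (wrongly) that words
        # are conditionally independent given the label
        score = math.log(label_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in words:
            # add-one smoothing so unseen words get non-zero probability
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("good fun plot".split()))  # -> "pos"
```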

non-linear model

  • needed when the features are not linearly separable
  • activation function
    • old & bad (saturate, vanishing gradients): sigmoid, tanh
    • contemporary: ReLU, GELU, Swish, SwiGLU (used in SoTA LLMs); see the sketch after this list
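A minimal numpy sketch of the activations listed above; the GELU uses the common tanh approximation, and the SwiGLU weight matrices are random stand-ins rather than trained parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x, beta=1.0):
    # also called SiLU when beta = 1
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

def swiglu(x, W, V):
    # gated variant used in recent LLM feed-forward blocks: Swish(xW) * (xV)
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))                                # toy batch of 2 vectors
W, V = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))    # random stand-in weights
print(relu(x).shape, gelu(x).shape, swish(x).shape, swiglu(x, W, V).shape)
```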

Thumbs up? Sentiment Classification using Machine Learning Techniques, Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, EMNLP, 2002

distributional feature representations

  • embedding: a learned, dense vector representation of a word or word sequence
  • distributional hypothesis: a word's meaning is determined by the contexts in which it is used
  • pointwise mutual information (PMI) for words: PMI(word, context) = log p(word, context) / (p(word) p(context))
  • positive PMI (PPMI): max(PMI, 0)
    • gives a very sparse word–context matrix
    • ⇒ reduce with (truncated) singular value decomposition (SVD) to get dense vectors; see the sketch after this list
  • word2vec
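A minimal sketch of building dense word vectors from a PPMI matrix plus truncated SVD; the toy corpus, window size, and number of kept dimensions are assumptions for illustration.

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the log".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a symmetric window of 2
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[idx[w], idx[corpus[j]]] += 1

# PMI(w, c) = log p(w, c) / (p(w) p(c)); PPMI clips negatives to zero
total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)
ppmi[~np.isfinite(ppmi)] = 0.0   # clean up log(0) entries

# Truncated SVD: keep the top-k singular directions as dense word vectors
U, S, _ = np.linalg.svd(ppmi)
k = 3
vectors = U[:, :k] * S[:k]
print(vectors.shape)  # (vocab size, k)
```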

language model (LM)

  • can generate text: assume what comes next is determined by what came before (autoregressive factorization)
  • n-gram: “windowed” conditional probabilities (Markov assumption: condition only on the previous n-1 words)
    • backoff: fall back to a shorter context when counts are missing
    • smoothing: reserve probability mass for unseen events
    • surprisingly good for such a smallish model
  • perplexity: how well the LM predicts real text, measured as exponentiated per-token cross-entropy; see the bigram sketch after this list
  • feed-forward neural network n-gram: embed the previous n-1 words and predict the next; see the second sketch after this list
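A minimal sketch of a bigram language model with add-k smoothing and perplexity on held-out text; the training and test sentences (and the "</s>" boundary token) are invented for illustration.

```python
import math
from collections import Counter

train = "the cat sat on the mat </s> the dog sat on the log </s>".split()
test = "the cat sat on the log </s>".split()

vocab = set(train)
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))

def prob(prev, word, k=1.0):
    # add-k smoothed bigram probability P(word | prev)
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * len(vocab))

# Perplexity: exp of the average negative log-probability per predicted token
log_prob = sum(math.log(prob(prev, word)) for prev, word in zip(test, test[1:]))
perplexity = math.exp(-log_prob / (len(test) - 1))
print(perplexity)
```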
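A minimal PyTorch sketch of a feed-forward n-gram LM (concatenate the embeddings of the previous n-1 words, pass through a hidden layer, predict the next word); the vocabulary size, dimensions, and random inputs are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class FeedForwardNgramLM(nn.Module):
    """Predict the next word from the embeddings of the previous n-1 words."""
    def __init__(self, vocab_size, n=3, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.context = n - 1
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(self.context * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):                   # (batch, n-1)
        e = self.embed(context_ids).flatten(1)        # concatenate context embeddings
        return self.out(torch.relu(self.hidden(e)))   # logits over the next word

model = FeedForwardNgramLM(vocab_size=1000)
context = torch.randint(0, 1000, (4, 2))   # toy batch: 4 contexts of n-1 = 2 word ids
logits = model(context)
print(logits.shape)                        # (4, 1000)
```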