Advanced Natural Language Processing

  • natural language is hard because of ambiguity

  • breakdown of language: phonetics (sound) → phonology (meaningfully distinct sound)/ orthography (character)

    • → morphology → lexeme (word)
    • → syntax (grammar) → semantics (meaning) → pragmatics (context) → discourse (relation between sentences)
  • current models suck when samples are long and few

  • word frequencies overwhelmingly obey Zipf’s law (frequency roughly ∝ 1/rank); see the sketch below
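
A minimal sketch of checking Zipf’s law on a toy corpus (the corpus is made up; the rank × frequency product only flattens out on real, large corpora):

```python
# Toy check of Zipf's law: frequency is roughly proportional to 1/rank,
# so rank * frequency should be roughly constant on a large natural corpus.
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the cat".split()
for rank, (word, freq) in enumerate(Counter(corpus).most_common(), start=1):
    print(word, freq, rank * freq)
```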

Introduction to Natural Language Processing, Jacob Eisenstein, The MIT Press, 2018

Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, with Language Models, Daniel Jurafsky, James H. Martin, 2025

paper to present

replication project

paper to question

paper presentations

  • What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
    • presented by Jinyi
    • fast&slow thinking: system 1&2, subconscious&conscious
    • for LLM: no chain-of-thought (CoT) vs CoT; different level of detail
    • probing & layer importance
      1. backpropagation on responses with different levels of CoT detail
      2. nuclear norm (via SVD) of each layer’s Q K V O gradients
        • more detailed CoT ⇒ smoother gradient
      3. compare mean absolute difference (MAD) of each layer
      4. compare relative difference (RD)
    • instruction tuning does not enable better nonsense detection
  • byte latent transformer (BLT): patches scale better than tokens
    • presented by Saba
    • LLM training is end-to-end except for tokenization
    • patching: dynamic group w/o fixed vocab
    • entropy patching: start a new patch when the next byte is hard to predict (see the sketch after this list)
      • small byte-level autoregressive model to produce entropy
      • small byte-level encoder&decoder surrounding 1 large latent transformer in middle
      • use byte encoding plus hashed n-gram
        • not learned!
      • larger patch size ⇒ fewer FLOPs per unit of accuracy gained
      • minor accuracy degradation
  • HumT DumT: Measuring and controlling human-like language in LLMs
    • presented by Sadra
    • human-like tone (HumT): ~anthropomorphization; users usually prefer less of it (bc wordy, not to-the-point)
      • like flirty; overly descriptive?
    • DPO + HumT (DUMT): DPO-style preference optimization to lower HumT
      • little performance degradation
  • Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
    • presented by Feiyu
    • train a small rewriter model to rewrite the query, feed the rewritten query to the large target model, then feed everything back to the target model again
      • use judge model for reward in training
    • higher token count & slower, in exchange for claimed performance gains
  • LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
    • presented by Narges
    • train an attacker LLM to make a victim LLM answer harmful questions
  • TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
    • presented by Saeed
    • entropy tree (EPTree): RL w/ tree of action
      • branch out from high-entropy (uncertain) token
      • only fork once in the paper
    • slightly higher #answers generated & pass rate per #tokens generated
      • but unclear if better when allowing more #tokens
    • forking points appear at relative locations that are roughly uniformly distributed
  • Mega: Moving Average Equipped Gated Attention, Xuezhe Ma, Chunting Zhou, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, Luke Zettlemoyer, ICLR, 2023
    • presented by Xuezhe
    • why: longer context practically means a more capable model
    • challenge: transformer compute & memory are quadratic in context length
      • communication challenge; huge KV cache
      • attention quality degrades at very long context
    • chunk-wise attention: only attend to chunk instead of global
    • damped exponential moving average (EMA): EMA w/ relaxed (decoupled) weights; see the sketch after this list
      • learned parameters
      • parallelize by unrolling the recurrence into a convolution, computed with FFT
      • theoretically unbounded context
        • practically, w/ chunking, perform worse than full attention
      • apply on layer input for Q K but not V
    • single-head gated attention: add reset gate to attention output
      • gated attention: multiply the attention output with the output of another transform of the input
      • can reduce #heads e.g. from 32 to 4 ⇒ computational efficiency
    • single Mega model perform well on various long-context task
  • Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length, Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou, NeurIPS 2024
    • better Mega
    • complex EMA:
      • reason: diagonal matrix much more expressive in complex space
    • time-step normalization: auto-regressive group normalization
      • cumulative mean&variance; no reset per document in training
    • reduce communication bc only need last output instead of whole KV cache
    • Gecko: ongoing improvements to Megalodon
      • running mean&variance via constant decay
      • sliding chunk attention: less wasted compute than flash attention
      • adaptive working memory (online softmax)
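
A hedged sketch of the entropy-patching idea from the BLT presentation above: start a new patch whenever a small byte-level model is uncertain about the next byte. The `next_byte_entropy` stub and the threshold are placeholders, not the paper’s actual model.

```python
# Entropy patching sketch: cut a new patch when the next byte is hard to predict.
def next_byte_entropy(prefix: bytes) -> float:
    # Placeholder: a real system queries a small byte-level autoregressive LM;
    # here we just pretend bytes right after whitespace are hard to predict.
    return 2.0 if prefix.endswith(b" ") else 0.5

def entropy_patches(data: bytes, threshold: float = 1.0):
    patches, current = [], bytearray()
    for i, b in enumerate(data):
        if current and next_byte_entropy(data[:i]) > threshold:
            patches.append(bytes(current))   # uncertainty spike -> close the patch
            current = bytearray()
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches

print(entropy_patches(b"byte latent transformer"))  # [b'byte ', b'latent ', b'transformer']
```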
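
A hedged sketch of the damped EMA from the Mega presentation above: the usual EMA y_t = α·x_t + (1 − α)·y_{t−1} is relaxed with a damping factor δ so the input weight and the decay are no longer tied together. In Mega, α and δ are learned per dimension and the recurrence is computed in parallel as a convolution via FFT; this sequential scalar version is only for illustration.

```python
import numpy as np

def damped_ema(x, alpha=0.3, delta=0.8):
    # y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}
    y, prev = np.zeros_like(x, dtype=float), 0.0
    for t, x_t in enumerate(x):
        prev = alpha * x_t + (1.0 - alpha * delta) * prev
        y[t] = prev
    return y

print(damped_ema(np.array([1.0, 0.0, 0.0, 1.0])))
```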

linear model

  • inter-annotator agreement: human annotators may disagree

    • raw agreement rate
    • need to subtract the expected agreement by chance: the sum of products of the annotators’ label rates (see the sketch after this list)
    • what counts as a good rate varies: 0.8 is very good; some consider 0.4 acceptable
  • perceptron: linear model trained by gradient updates on a simple loss (see the sketch after this list)

    • loss only on argmax y
  • logistic regression: softmax on all y instead of just argmax y

    • more stable than perceptron
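
A minimal sketch of chance-corrected agreement matching the description above (this is Cohen’s kappa); the labels are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n       # raw agreement
    rates_a, rates_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((rates_a[k] / n) * (rates_b[k] / n) for k in rates_a)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"]))  # 0.5
```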
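
A hedged sketch contrasting the two linear classifiers above, with scores s = W x: the perceptron’s gradient only involves the argmax class, while logistic regression spreads the gradient over a softmax of all classes. Weights and data are toy placeholders.

```python
import numpy as np

def perceptron_grad(W, x, y):
    y_hat = int(np.argmax(W @ x))
    grad = np.zeros_like(W)
    if y_hat != y:                 # loss (and gradient) only involve the argmax prediction
        grad[y_hat] += x
        grad[y] -= x
    return grad

def logreg_grad(W, x, y):
    scores = W @ x
    p = np.exp(scores - scores.max())
    p /= p.sum()                   # softmax over *all* classes
    grad = np.outer(p, x)
    grad[y] -= x
    return grad

W, x, y = np.zeros((3, 4)), np.ones(4), 1
print(perceptron_grad(W, x, y))
print(logreg_grad(W, x, y))
```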

naive Bayes

  • assuming (wrongly) that features are conditionally independent given the label (see the formula below)
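
As a reminder, the decision rule that falls out of that (wrong but useful) conditional-independence assumption:

```latex
\hat{y} \;=\; \arg\max_{y} \;\Big[ \log p(y) \;+\; \sum_{i=1}^{n} \log p(x_i \mid y) \Big]
```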

non-linear model

  • need: features that are not linearly separable
  • activation function
    • old & bad: sigmoid, tanh
    • contemporary: ReLU, GELU, Swish, SwiGLU (used in SoTA LLM)
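
A minimal sketch of the activations above; SwiGLU follows the common formulation SwiGLU(x) = Swish(xW) ⊙ (xV), with W and V random placeholders here.

```python
import numpy as np

def relu(x):  return np.maximum(0.0, x)
def swish(x): return x / (1.0 + np.exp(-x))          # a.k.a. SiLU
def gelu(x):  # tanh approximation commonly used in practice
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swiglu(x, W, V):
    return swish(x @ W) * (x @ V)                    # one branch gates the other

x = np.array([-1.0, 0.0, 2.0])
print(relu(x), gelu(x), swish(x))
rng = np.random.default_rng(0)
X, W, V = rng.normal(size=(2, 4)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(swiglu(X, W, V).shape)  # (2, 8)
```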

Thumbs up? Sentiment Classification using Machine Learning Techniques, Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, EMNLP, 2002

distributional feature representations

  • embedding: a vector representation of a word or word sequence
  • distributional hypothesis: a word’s meaning is determined by the contexts it is used in
  • pointwise mutual information (PMI) for words: log[ p(word, context) / (p(word) p(context)) ]
  • positive PMI (PPMI): max(PMI, 0)
    • very sparse matrix
    • ⇒ singular value decomposition (SVD) for dense low-dimensional vectors (see the sketch after this list)
  • word2vec
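
A minimal sketch of the PPMI → SVD pipeline above on a made-up word–context co-occurrence matrix.

```python
import numpy as np

counts = np.array([[2.0, 1.0, 0.0],       # toy word-context co-occurrence counts
                   [1.0, 3.0, 1.0],
                   [0.0, 1.0, 2.0]])
p = counts / counts.sum()
p_w, p_c = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    pmi = np.log(p / (p_w * p_c))         # log p(w,c) / (p(w) p(c))
ppmi = np.maximum(pmi, 0.0)               # clip negatives -> sparse, non-negative

U, S, Vt = np.linalg.svd(ppmi)
k = 2
word_vectors = U[:, :k] * S[:k]           # dense k-dimensional word vectors
print(word_vectors)
```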

language model (LM)

  • can generate: assume what came before determines what comes next
  • n-gram: “windowed” Bayes
    • backoff
    • smoothing
    • surprisingly good for such a small model
  • perplexity: how well the LM predicts real text; exponentiated average negative log-likelihood (see the sketch after this list)
  • feed forward neural network n-gram
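
A minimal sketch of perplexity as exponentiated average negative log-likelihood; the per-token probabilities are made up.

```python
import math

def perplexity(token_probs):
    # average negative log-likelihood per token, then exponentiate
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.2, 0.5, 0.1, 0.4]))  # lower is better; 1.0 = perfect prediction
```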

tokenization

reuse frequent letter sequences

  • byte pair encoding (BPE)
    1. start w/ character-level tokens
    2. count the most frequent adjacent token pair & greedily merge it into a single token; repeat (see the sketch after this list)
    • space sometimes part of token
    • not semantically meaningful
      • Reddit data causes weird tokens, e.g. user names
  • small vocabulary causes long token sequences (uses more context)
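
A hedged sketch of the BPE training loop above on a toy word list; real tokenizers (e.g. byte-level BPE) also weight words by frequency and handle whitespace specially.

```python
from collections import Counter

def bpe_train(words, num_merges):
    vocab = [list(w) for w in words]          # start with character-level tokens
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in vocab:
            pairs.update(zip(w, w[1:]))       # count adjacent token pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent pair
        merges.append((a, b))
        new_vocab = []
        for w in vocab:                       # greedily replace the pair with one token
            merged, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                    merged.append(a + b); i += 2
                else:
                    merged.append(w[i]); i += 1
            new_vocab.append(merged)
        vocab = new_vocab
    return merges, vocab

print(bpe_train(["low", "lower", "lowest"], num_merges=3))
```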

recurrent neural network (RNN)

  • each step: feed the new token embedding + the old hidden state to get a new hidden state
  • strong emphasis on the last tokens
  • problem: vanishing gradients bc of the long chain of hidden states
  • long short-term memory (LSTM)
    • forget/input/output gate
    • multiply the cell state by the forget gate, add the input-gated candidate, finally multiply by the output gate (see the sketch after this list)
    • Bidirectional LSTM (BiLSTM)
      • go both backward & forward, then concatenate
    • residual: directly add input to output
  • attention: how much to attend to each hidden state
  • minibatching: concatenate & broadcast multiple sequences for GPU efficiency
  • major problem: cannot parallelize; inherently sequential
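
A minimal sketch of a single LSTM cell step matching the gate description above; the weights are random placeholders and the gate ordering in the stacked matrix is just a convention chosen here.

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    # W maps [x; h_prev] to the stacked pre-activations of forget/input/output gates
    # and the candidate cell update g
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o, g = sigmoid(f), sigmoid(i), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g          # multiply by forget gate, add input-gated candidate
    h = o * np.tanh(c)              # finally multiply by output gate
    return h, c

d, hdim = 4, 3
rng = np.random.default_rng(0)
W, b = rng.normal(size=(4 * hdim, d + hdim)), np.zeros(4 * hdim)
h, c = lstm_step(rng.normal(size=d), np.zeros(hdim), np.zeros(hdim), W, b)
print(h.shape, c.shape)  # (3,) (3,)
```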

self-attention

  • of each token to every token: parallelizable
  • idea: soft weighted lookup (of “query”) in key-value store

step-by-step:

  1. for each token, compute a query for this token and a key for every token
  2. compute attention scores using e.g. dot products
  3. softmax over all scores to get attention weights
  4. compute a value for every token, multiply by the attention weights, and sum (see the sketch below)
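
A minimal sketch of the four steps above: single-head scaled dot-product self-attention, with random projection matrices standing in for learned ones.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K = X @ Wq, X @ Wk                          # 1. queries and keys for every token
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # 2. scaled dot-product scores
    weights = softmax(scores, axis=-1)             # 3. softmax -> attention weights
    V = X @ Wv                                     # 4. values, then weighted sum
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)         # (5, 8)
```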

problem & solution:

  • no sequence order
    • ⇒ position embedding: add a vector representing position e.g. Sinusoidal (OG)/ RoPE
  • no element-wise nonlinearity
    • ⇒ feed forward network (FFN) after attention layer
  • not looking into future
    • ⇒ attention mask: set attention scores to −∞ for future tokens (see the sketch after this list)
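
A minimal sketch of two of the fixes above: sinusoidal position embeddings (here concatenating sin and cos rather than interleaving them, a common simplification) and a causal mask that sets scores for future tokens to −∞ before the softmax.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # one vector per position, added to the token embeddings (d_model must be even here)
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def causal_mask(seq_len):
    # upper triangle (future positions) becomes -inf; softmax turns it into 0 weight
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

print(np.zeros((4, 4)) + causal_mask(4))   # row t can only attend to positions <= t
print(sinusoidal_positions(4, 8).shape)    # (4, 8)
```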

transformer

  • residual (add): smooths the loss landscape
  • layer normalization (norm): speed up training by normalizing mean & variance
  • multi-head attention: multiple attention & concatenation
  • optimizer: basic SGD
    • Adam: momentum plus per-parameter scaling by a running estimate of the gradient’s second moment
    • AdamW: Adam w/ decoupled weight decay; the de facto standard in 2025 (see the sketch after this list)
  • checkpoint/restart
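
A hedged sketch of a single AdamW step: Adam’s momentum and per-parameter second-moment scaling, plus decoupled weight decay applied directly to the weights. Hyperparameters are typical defaults, not prescriptions.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (per-parameter scale)
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)  # decoupled weight decay
    return w, m, v

w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
w, m, v = adamw_step(w, np.array([0.1, -0.2, 0.3]), m, v, t=1)
print(w)
```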

pretrained language model

  • ELMo (2018): contextual word vectors from a BiLSTM
  • fine-tuning: reuse a pretrained model for a new task
  • GPT (2018): decoder-only transformer; fine-tuned w/ a classifier head per downstream task
  • BERT (2018): convert a piece of text to a fixed-size contextual vector (768-dim for BERT-base) w/ a bidirectional transformer
    • mask certain tokens w/ [MASK], a random token, or the actual token, then try to predict them
    • popular partially bc good marketing as a Muppet just like Elmo
    • still fails the Winograd challenge: figure out what an ambiguous “it” refers to
  • GPT-2 (2019): did well on tasks it was not trained on
    • huge model: 1.5B parameters
    • PR stunt around the hallucinated unicorn story
  • GPT-3 (2020): SoTA on many tasks
    • 175B parameters
    • few-shot; hardly affected by irrelevant/misleading context
    • fixated on what it learned instead of adapting to flipped task
  • T0: like T5 but only w/ task phrased in natural language
  • LLaMA (2023): public architecture, weights, training data
    • no training&preprocessing code
    • hard to reproduce
  • Gemini (2023): started trend of white paper w/o technical details w/ tons of safety check

modified pretrained language model

  • T5: encoder-decoder transformer able to do multiple task
    • cross-attention: query from decoder, key&value from encoder
  • prefix tuning: prepend token to mark what the task is
    • additional parameter alongside each layer
    • do not fine-tune original parameter
  • adapter: linear layer before&after transformer to adapt to new task
  • LoRA: learn a low-rank update alongside each frozen weight matrix and add them (see the sketch after this list)
    • work very well
    • can add anywhere: Q, K, V, FFN
      • Jon: a student claimed adding to FFN was best
  • mixture of experts (MoE, from 1990s): route each token embedding to a different expert FFN
    • learn the router
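
A hedged sketch of a LoRA-style adapted linear layer: the frozen weight W gets a learned low-rank update B·A scaled by α/r. Names and initialization follow common practice, not any specific library.

```python
import numpy as np

class LoRALinear:
    def __init__(self, W, r=4, alpha=8, seed=0):
        d_out, d_in = W.shape
        rng = np.random.default_rng(seed)
        self.W = W                                        # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small init
        self.B = np.zeros((d_out, r))                     # trainable, zero init => no change at start
        self.scale = alpha / r

    def __call__(self, x):
        # W x + (alpha / r) * B A x, computed on row vectors
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(np.random.default_rng(1).normal(size=(6, 4)))
print(layer(np.ones((2, 4))).shape)  # (2, 6)
```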

introspective models (ouroboros)

  • scratchpad (2021): let the model generate hidden intermediate tokens
    • RNN and transformer are Turing complete if given infinite think time
  • chain-of-thought (CoT, 2022): encourage model to include reasoning
    • hand-written example for training
  • ReAct: agentic + reasoning

human feedback

  • problem
    • some words matter more
    • bad training data
    • exposure bias
  • evaluation is usually not differentiable ⇒ RL (see the policy-gradient sketch below)
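
A hedged sketch of why RL fits a non-differentiable evaluation signal: with a REINFORCE-style estimator, the gradient only needs the log-probability of the sampled output times a scalar reward, never a gradient through the reward itself.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_grad(logits, sampled_token, reward):
    # gradient (for ascent) of reward * log p(sampled_token | logits):
    # d/dlogits log softmax(logits)[k] = onehot(k) - softmax(logits)
    p = softmax(logits)
    grad_logp = -p
    grad_logp[sampled_token] += 1.0
    return reward * grad_logp          # reward is just a scalar, never differentiated

print(reinforce_grad(np.array([1.0, 0.0, -1.0]), sampled_token=0, reward=0.7))
```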