Advanced Natural Language Processing
natural language is hard because of ambiguity
breakdown of language: phonetics (sound) → phonology (meaningfully distinct sounds) / orthography (characters)
- → morphology (word formation) → lexeme (word)
- → syntax (grammar) → semantics (meaning) → pragmatics (meaning in context) → discourse (relations between sentences)
current models struggle when samples are long and few
word frequencies overwhelmingly obey Zipf’s law: the k-th most frequent word has frequency roughly proportional to 1/k (sketch below)
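A quick way to see this: count word frequencies in any sizable text and check that rank × frequency stays roughly constant. A minimal Python sketch; `corpus.txt` is a placeholder for whatever corpus is at hand:

```python
from collections import Counter

# Placeholder corpus file: swap in any sizable text.
text = open("corpus.txt", encoding="utf-8").read().lower().split()
counts = Counter(text).most_common()

# Zipf's law predicts freq(rank) ~ C / rank, i.e. rank * freq roughly constant.
for rank, (word, freq) in enumerate(counts[:10], start=1):
    print(f"{rank:>2} {word:<15} freq={freq:<8} rank*freq={rank * freq}")
```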
Introduction to Natural Language Processing, Jacob Eisenstein, The MIT Press, 2018
Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, with Language Models, Daniel Jurafsky, James H. Martin, 2025
papers to present
- Learning to Rewrite: Generalized LLM-Generated Text Detection, Wei Hao, Ran Li, Weiliang Zhao, Junfeng Yang, Chengzhi Mao, ACL, 2025
- MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts, Dominik Macko, Jakub Kopál, Robert Moro, Ivan Srba, ACL, 2025
- MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection, Ziyan Liu, Chunxiao Fan, Haoran Lou, Yuexin Wu, Kaiwei Deng, ACL, 2025
- UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench, Boxi Yu, Yuxuan Zhu, Pinjia He, Daniel Kang, ACL, 2025
- CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System, Li Hu, Guoqiang Chen, Xiuwei Shang, Shaoyin Cheng, Benlong Wu, Gangyang Li, Xu Zhu, Weiming Zhang, Nenghai Yu, ACL, 2025
linear model
inter-annotator agreement: human annotators may disagree
- raw agreement rate
- Cohen’s kappa: subtract the expected chance agreement (the sum over labels of the products of the annotators’ marginal rates): κ = (p_o - p_e) / (1 - p_e); sketch below
- what counts as a good rate varies by opinion: 0.8 is very good; some consider 0.4 acceptable
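A minimal sketch of that chance-corrected computation (Cohen’s κ) for two annotators; the labels are made up:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # kappa = (p_o - p_e) / (1 - p_e): observed agreement minus the
    # agreement expected by chance, rescaled.
    n = len(labels_a)
    # Raw (observed) agreement rate.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement: sum of products of the marginal rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in freq_a)
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling ten items as pos/neg.
a = ["pos", "pos", "neg", "pos", "neg", "pos", "pos", "neg", "neg", "pos"]
b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg", "pos"]
print(cohens_kappa(a, b))  # ~0.58: decent, not great
```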
perceptron: a linear model trained by gradient descent on the perceptron loss
- the loss involves only the argmax ŷ: on a mistake, add the features of the true label and subtract those of the prediction (sketch below)
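A sketch of that argmax-only update for the multiclass case, on a made-up toy set:

```python
import numpy as np

def perceptron_epoch(X, y, W):
    # W has one weight row per class; prediction is the argmax score.
    for x_i, y_i in zip(X, y):
        y_hat = int(np.argmax(W @ x_i))   # highest-scoring class
        if y_hat != y_i:                  # update only on mistakes
            W[y_i] += x_i                 # pull the true class's score up
            W[y_hat] -= x_i               # push the predicted one down
    return W

# Toy data: 4 examples, 3 features, 2 classes.
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
y = np.array([0, 1, 0, 1])
W = np.zeros((2, 3))
for _ in range(10):
    W = perceptron_epoch(X, y, W)
print((W @ X.T).argmax(axis=0))  # [0 1 0 1]: all training points correct
```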
logistic regression: softmax over all y instead of just the argmax ŷ
- smoother gradients, so training is more stable than the perceptron (sketch below)
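For contrast, the logistic-regression gradient, where every class contributes in proportion to its softmax probability rather than only the argmax (toy numbers again):

```python
import numpy as np

def softmax(scores):
    z = np.exp(scores - scores.max())   # subtract max for numerical stability
    return z / z.sum()

def lr_gradient(x, y, W):
    # Gradient of the negative log-likelihood for one example:
    # expected feature counts (under the model) minus observed ones.
    p = softmax(W @ x)                  # distribution over all classes
    grad = np.outer(p, x)
    grad[y] -= x
    return grad

# One gradient step on a toy example (2 classes, 3 features).
x = np.array([1.0, 0.0, 1.0])
W = np.zeros((2, 3))
W -= 0.1 * lr_gradient(x, y=0, W=W)
print(softmax(W @ x))  # class 0's probability has increased
```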
naive Bayes
- assumes (wrongly, but usefully) that features are conditionally independent given the label (sketch below)
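A bag-of-words sketch; the naive assumption shows up as a plain sum of per-word log-probabilities (docs and labels are toy data):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # Estimate log P(label) and smoothed log P(word | label).
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, lab in zip(docs, labels):
        word_counts[lab].update(doc)
        vocab.update(doc)
    log_prior = {l: math.log(c / len(docs)) for l, c in label_counts.items()}
    log_lik = {}
    for lab in label_counts:
        total = sum(word_counts[lab].values()) + alpha * len(vocab)
        log_lik[lab] = {w: math.log((word_counts[lab][w] + alpha) / total)
                        for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    # Naive independence: score is the prior plus a sum of word log-probs;
    # words outside the training vocabulary are simply ignored.
    def score(lab):
        return log_prior[lab] + sum(log_lik[lab].get(w, 0.0) for w in doc)
    return max(log_prior, key=score)

docs = [["good", "fun"], ["great", "fun"], ["bad", "boring"], ["awful", "bad"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
print(predict_nb(["fun", "great"], *model))  # pos
```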
non-linear model
- needed when the features are not linearly separable
- activation function
- old & bad: sigmoid, tanh
- contemporary: ReLU, GELU, Swish, SwiGLU (used in SoTA LLM)
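Sketches of those activations, assuming the usual definitions (the GELU here is the common tanh approximation; SwiGLU is the gated form Swish(xW) ⊙ xV, so it needs two weight matrices):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # Common tanh approximation of the exact (erf-based) GELU.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swish(x, beta=1.0):
    return x / (1 + np.exp(-beta * x))   # a.k.a. SiLU when beta = 1

def swiglu(x, W, V):
    # Gate one linear projection with the Swish of another.
    return swish(x @ W) * (x @ V)

xs = np.linspace(-3, 3, 7)
print(relu(xs), gelu(xs), swish(xs), sep="\n")

rng = np.random.default_rng(0)
x = rng.normal(size=8)                                  # one 8-feature input
W, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
print(swiglu(x, W, V))                                  # 4 gated hidden units
```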
Thumbs up? Sentiment Classification using Machine Learning Techniques, Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, EMNLP, 2002
distributional feature representations
- embedding: a learned, dense vector representation of a word or word sequence
- distributional hypothesis: a word’s meaning is determined by the contexts in which it is used
- pointwise mutual information (PMI) between word and context: PMI(w, c) = log [p(w, c) / (p(w) p(c))]
- positive PMI (PPMI): max(PMI, 0)
- the word-context matrix is very sparse
- ⇒ reduce dimensionality with singular value decomposition (SVD); full pipeline sketched below
- word2vec: learn the vectors directly by predicting context words instead of counting them
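The count-based pipeline end to end (co-occurrence counts → PPMI → truncated SVD) on a toy corpus; word2vec would instead learn the vectors by gradient descent:

```python
import numpy as np

def ppmi_svd_embeddings(corpus, dim=2, window=2):
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for c in sent[max(0, i - window): i + window + 1]:
                if c != w:
                    counts[idx[w], idx[c]] += 1
    total = counts.sum()
    p_w = counts.sum(axis=1) / total           # marginal word probabilities
    p_c = counts.sum(axis=0) / total           # marginal context probabilities
    with np.errstate(divide="ignore"):
        pmi = np.log((counts / total) / np.outer(p_w, p_c))
    ppmi = np.maximum(pmi, 0.0)                # clip negative / -inf PMI to 0
    U, S, _ = np.linalg.svd(ppmi)              # dense SVD; fine for a toy vocab
    return {w: U[idx[w], :dim] * S[:dim] for w in vocab}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
emb = ppmi_svd_embeddings(corpus)
print(emb["cat"], emb["dog"])  # similar contexts ⇒ similar vectors
```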
language model (LM)
- can generate text: assume that what came before determines what comes next
- n-gram: “windowed” application of the chain rule; condition each word only on the previous n-1 words
- backoff: fall back to a shorter context when the full n-gram is unseen
- smoothing: reserve probability mass for unseen n-grams (bigram sketch below)
- surprisingly good for smallish models
- perplexity: how well the LM predicts real text, as the exponentiated per-token cross-entropy
- feed-forward neural network n-gram: embed the n-1 context words and predict the next word with an MLP
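A bigram LM sketch with add-one smoothing standing in for fancier backoff schemes, plus the perplexity computation; the corpus is a toy:

```python
import math
from collections import Counter

def train_bigram(corpus, alpha=1.0):
    # Count unigram contexts and bigram transitions over padded sentences.
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab = {w for sent in corpus for w in sent} | {"</s>"}

    def logp(w, prev):
        # Add-alpha smoothing reserves mass for unseen bigrams.
        return math.log((bigrams[(prev, w)] + alpha) /
                        (unigrams[prev] + alpha * len(vocab)))
    return logp

def perplexity(logp, corpus):
    # exp of the average per-token negative log-likelihood.
    nll, n = 0.0, 0
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(toks[:-1], toks[1:]):
            nll -= logp(w, prev)
            n += 1
    return math.exp(nll / n)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
logp = train_bigram(corpus)
print(perplexity(logp, corpus))  # low on its own training text
```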