Literature
Content farm & scam
- An examination of content farms in web search using crowdsourcing, Richard McCreadie, Craig Macdonald, Iadh Ounis, Jim Giles, Ferris Jabr, CIKM, 2012
- Mechanical Turk to label search result
- found content farm prevalence in search results decreased
- Polls, clickbait, and commemorative $2 bills: problematic political advertising on news and media websites around the 2020 U.S. elections, Eric Zeng, Miranda Wei, Theo Gregersen, Tadayoshi Kohno, Franziska Roesner, IMC, 2021
- content farm served political ad
- extract ad w/ EasyList CSS selector
- 💡 can use uBlock Origin logger
- qualitative coding & BERT classifier to analyze
- Analyzing the (In)Accessibility of Online Advertisements, Christina Yeung, Tadayoshi Kohno, Franziska Roesner, IMC, 2024
- use UWCSESecurityLab adscraper to extract ad, which in turn uses EasyList
- Plagiarism-Bot? How Low-Quality Websites Are Using AI to Deceptively Rewrite Content from Mainstream News Outlets, Newsguard, 2023
- detection: seek “error message” like “as an AI”, then human review
- plagiarize from legit site like New York Times
- money from Google Ads, from top brand
- Scam Sites at Scale: LLMs Fueling a GenAI Criminal Revolution, Netcraft, 2024
- phishing website/ fake shop/ email use LLM
- identify thousands of websites each week using AI-generated content
- example w/ LLM “error message”/ beginning sentence of response
- “Certainly, here’s/ here are …:”
- “As of my last knowledge update…”
- Funding the Next Generation of Content Farms: Some of the World’s Largest Blue Chip Brands Unintentionally Support the Spread of Unreliable AI-Generated News Websites, Newsguard, 2023
- NewsGuard found 141 brand have ad on AI-driven site
- unreliable AI-generated news website (UAIN)
- mainly use Google Ads
- publish large volume, sometimes 1000 per day
- Junk websites filled with AI-generated text are pulling in money from programmatic ads, MIT Technology Review, 2023
- number for ad revenue
- People Are Spinning Up Low-Effort Content Farms Using AI, Futurism, 2023
- content farm low quality, misinformation, profit from Google Ads
- 💡 maybe we report them to Google bc against their ad policy
- Tracking AI-enabled Misinformation: 1,254 ‘Unreliable AI-Generated News’ Websites (and Counting), Plus the Top False Narratives Generated by Artificial Intelligence Tools, Newsguard, 2025
- NewsGuard found 141 brand have ad on AI-driven site
- AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian, Giovanni Puccetti, Anna Rogers, Chiara Alzetta, Felice Dell’Orletta, Andrea Esuli, ACL, 2024
- easy to fine-tune free model into Italian content farm model (CFM)
- literature review on NewsGuard report & detector
- human & DetectGPT low accuracy; infeasible to detect
- fine-tuning of detector help a lot, but need to know base LLM used
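💡 The "error message" detection noted for the NewsGuard and Netcraft reports could be sketched as a phrase filter that flags pages for human review. This is a minimal sketch, not either report's actual pipeline: the phrase list only covers the examples quoted in the notes above, and the names `LLM_TELLTALE_PATTERNS`, `find_llm_telltales`, and `needs_human_review` are hypothetical.

```python
import re

# Telltale phrases quoted in the notes above; illustrative, not exhaustive.
LLM_TELLTALE_PATTERNS = [
    re.compile(r"\bas an AI\b", re.IGNORECASE),
    re.compile(r"\bcertainly, here(?:['’]s| are)\b", re.IGNORECASE),
    re.compile(r"\bas of my last knowledge update\b", re.IGNORECASE),
]


def find_llm_telltales(text: str) -> list[str]:
    """Return the first match of each telltale pattern, in pattern order."""
    hits = []
    for pattern in LLM_TELLTALE_PATTERNS:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits


def needs_human_review(text: str) -> bool:
    """Flag a page if any telltale phrase appears (assumed first-pass filter)."""
    return bool(find_llm_telltales(text))


if __name__ == "__main__":
    page = "Certainly, here are five tips: ..."
    print(find_llm_telltales(page))
```

A phrase match alone is weak evidence, since a legitimate article about AI can contain "as an AI", which is why the notes pair it with human review.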
Search Engine Optimization (SEO)
See https://sichanghe.github.io/notes/research/web_user_facing.html#search-engine-optimization-seo.
Generative AI (GenAI)
See https://sichanghe.github.io/notes/research/gen_ai.html.
Training data curation
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay, arXiv, 2023
- heuristic-based URL filtering; no ML filtering to avoid bias
- 4.6M site list 💡 useful for spotting spam/scam, etc.
- use Trafilatura for main content extraction
- Quality at a glance: An audit of web-crawled multilingual datasets, Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, et al., ACL, 2022
- Common Crawl include large portion of machine spam & porn
- heuristic-based URL filtering; no ML filtering to avoid bias
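💡 The heuristic-based URL filtering noted above could look roughly like the sketch below: a domain blocklist plus a keyword score over the URL, with no ML model involved. This is an assumed simplification, not the papers' exact method; the blocklist entries, keyword weights, threshold, and the names `url_score` and `keep_url` are all hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist and keyword weights, for illustration only.
BLOCKED_DOMAINS = {"spam-example.test", "scam-example.test"}
KEYWORD_WEIGHTS = {"casino": 1.0, "porn": 1.0, "free": 0.3, "bonus": 0.4}
SCORE_THRESHOLD = 1.0  # assumed cutoff; URLs scoring at or above it are dropped


def url_score(url: str) -> float:
    """Sum the weights of all keywords that appear anywhere in the URL."""
    lowered = url.lower()
    return sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in lowered)


def keep_url(url: str) -> bool:
    """Keep a URL unless its host is blocklisted or its keyword score is too high."""
    host = urlsplit(url).hostname or ""
    # Match the blocked domain itself and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False
    return url_score(url) < SCORE_THRESHOLD


if __name__ == "__main__":
    for u in ["https://news.example.org/2023/budget-vote",
              "https://www.spam-example.test/article",
              "https://example.org/free-casino-bonus"]:
        print(u, keep_url(u))
```

Substring keyword matching is crude and will over-block some URLs, so a real list would need tuning.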
Synthetic data
- STaR: Bootstrapping Reasoning With Reasoning, Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman, NeurIPS, 2022
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, et al., Microsoft, 2024
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, ACL, 2024