Literature
Content farm
- An examination of content farms in web search using crowdsourcing, Richard McCreadie, Craig Macdonald, Iadh Ounis, Jim Giles, Ferris Jabr, CIKM, 2012
- use Mechanical Turk to label search results
- found content farms decreased
- Polls, clickbait, and commemorative $2 bills: problematic political advertising on news and media websites around the 2020 U.S. elections, Eric Zeng, Miranda Wei, Theo Gregersen, Tadayoshi Kohno, Franziska Roesner, IMC, 2021
- content farm served political ad
- extract ad w/ EasyList CSS selector
- 💡 can use uBlock Origin logger
- qualitative coding & BERT classifier to analyze
- Analyzing the (In)Accessibility of Online Advertisements, Christina Yeung, Tadayoshi Kohno, Franziska Roesner, IMC, 2024
- use UWCSESecurityLab adscraper to extract ads, which in turn uses EasyList
- Funding the Next Generation of Content Farms: Some of the World’s Largest Blue Chip Brands Unintentionally Support the Spread of Unreliable AI-Generated News Websites
- NewsGuard found 141 brand have ad on AI-driven site
- unreliable AI-generated news website (UAIN)
- mainly use Google Ads
- publish large volume, sometimes 1000 per day
- Junk websites filled with AI-generated text are pulling in money from programmatic ads
- number for ad revenue
- People Are Spinning Up Low-Effort Content Farms Using AI
- content farm low quality, misinformation, profit from Google Ads
- 💡 maybe we report them to Google bc against their ad policy
- NewsGuard found 141 brand have ad on AI-driven site
- AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian, Giovanni Puccetti, Anna Rogers, Chiara Alzetta, Felice Dell’Orletta, Andrea Esuli, ACL, 2024
- easy to fine-tune free model into Italian content farm model (CFM)
- literature review on NewsGuard report & detector
- humans & DetectGPT have low accuracy, so detection is infeasible
- fine-tuning the detector helps a lot, but requires knowing the base LLM used
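
Sketch for the EasyList-based ad extraction noted under Zeng et al. above: EasyList element-hiding rules use `##` before a CSS selector (optionally prefixed by a comma-separated domain list) and `#@#` for exceptions. The snippet below pulls the generic selectors out of filter-list text. It is only a minimal illustration, not how the papers or adscraper implement it; the names `CosmeticRule`, `parse_cosmetic_rule`, and `generic_selectors` are made up here, and extended syntaxes such as `#?#` are ignored.

```python
from typing import List, NamedTuple, Optional, Tuple


class CosmeticRule(NamedTuple):
    domains: Tuple[str, ...]  # empty means the rule applies on every site
    selector: str             # CSS selector of the element to hide
    is_exception: bool        # True for "#@#" rules, which cancel a hiding rule


def parse_cosmetic_rule(line: str) -> Optional[CosmeticRule]:
    """Parse one filter line; return None unless it is a plain element-hiding rule."""
    line = line.strip()
    # Skip blank lines, "!" comments, and the "[Adblock Plus 2.0]" header.
    if not line or line.startswith(("!", "[")):
        return None
    # Try "#@#" first: an exception such as "a.com#@##ad" also contains "##".
    for separator, is_exception in (("#@#", True), ("##", False)):
        domain_part, found, selector = line.partition(separator)
        if found and selector:
            domains = tuple(d for d in domain_part.split(",") if d)
            return CosmeticRule(domains, selector, is_exception)
    return None  # network rule, or an extended syntax this sketch ignores


def generic_selectors(filter_text: str) -> List[str]:
    """Return selectors of hiding rules that are not tied to any domain."""
    rules = (parse_cosmetic_rule(line) for line in filter_text.splitlines())
    return [
        rule.selector
        for rule in rules
        if rule is not None and not rule.domains and not rule.is_exception
    ]
```

The resulting selectors can be handed to anything that evaluates CSS selectors, e.g. `document.querySelectorAll` in a browser, to locate candidate ad elements on a page.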
Search Engine Optimization (SEO)
See https://sichanghe.github.io/notes/research/web_user_facing.html#search-engine-optimization-seo.
Generative AI (GenAI)
See https://sichanghe.github.io/notes/research/gen_ai.html.
Training data curation
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay, arXiv, 2023
- heuristic-based URL filtering; no ML filtering to avoid bias
- 4.6M site list 💡 useful for spotting spam/scam, etc.
- use Trafilatura for main content extraction
- Quality at a glance: An audit of web-crawled multilingual datasets, Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, et al., ACL, 2022
- Common Crawl includes a large portion of machine spam & porn
- heuristic-based URL filtering; no ML filtering to avoid bias
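
Sketch of what heuristic-based URL filtering (no ML) can look like, assuming a domain blocklist plus a weighted keyword score over the URL. This is not the papers' exact pipeline: the blocklist entries, word weights, threshold, and function names below are all hypothetical placeholders.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-ins; a real run would load a large curated site list.
BLOCKED_DOMAINS = {"spam.example", "scam.example"}
WORD_WEIGHTS = {"casino": 1.0, "porn": 1.0, "pills": 0.5}
SCORE_THRESHOLD = 1.0


def host_of(url: str) -> str:
    """Lower-cased hostname without a leading "www.", or "" if there is none."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_blocked_domain(host: str) -> bool:
    """True if the host or any parent domain is on the blocklist."""
    parts = host.split(".")
    return any(".".join(parts[i:]) in BLOCKED_DOMAINS for i in range(len(parts)))


def url_score(url: str) -> float:
    """Sum the weights of flagged words appearing as tokens in the URL."""
    words = re.split(r"[^a-z0-9]+", url.lower())
    return sum(WORD_WEIGHTS.get(word, 0.0) for word in words)


def keep_url(url: str) -> bool:
    """Keep a URL only if its domain is not blocked and its score is below the threshold."""
    return not is_blocked_domain(host_of(url)) and url_score(url) < SCORE_THRESHOLD
```

Because the rules are plain lookups and sums, every rejection can be traced to a specific blocklist entry or word, which is the appeal of heuristics over an ML classifier here.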