Literature

Website

Content farm & scam

An examination of content farms in web search using crowdsourcing, Richard McCreadie, Craig Macdonald, Iadh Ounis, Jim Giles, Ferris Jabr, CIKM, 2012
- use Mechanical Turk to label search results
- found prevalence of content farms in search results decreased
Polls, clickbait, and commemorative $2 bills: problematic political advertising on news and media websites around the 2020 U.S. elections, Eric Zeng, Miranda Wei, Theo Gregersen, Tadayoshi Kohno, Franziska Roesner, IMC, 2021
- content farm served political ad
- extract ad w/ EasyList CSS selector
  - 💡 can use uBlock Origin logger
- qualitative coding & BERT classifier to analyze
- Analyzing the (In)Accessibility of Online Advertisements, Christina Yeung, Tadayoshi Kohno, Franziska Roesner, IMC, 2024
  - use UWCSESecurityLab adscraper to extract ads, which in turn uses EasyList
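
A minimal sketch of the "EasyList CSS selector" idea above, assuming only the plain `##` element-hiding syntax. It is not the papers' actual pipeline. `CosmeticRule`, `parse_cosmetic_rules`, `selectors_for`, and `SAMPLE` are hypothetical names and data; exception rules (`#@#`), extended syntaxes, and negated domains (`~example.com`) are ignored.

```python
from typing import NamedTuple


class CosmeticRule(NamedTuple):
    domains: tuple[str, ...]  # empty tuple: rule applies on every site
    selector: str             # CSS selector for the ad element


def parse_cosmetic_rules(filter_text: str) -> list[CosmeticRule]:
    """Collect plain '##' element-hiding rules from an EasyList-style list."""
    rules = []
    for raw in filter_text.splitlines():
        line = raw.strip()
        # Skip blanks, "!" comments, and the "[Adblock Plus ...]" header.
        if not line or line.startswith(("!", "[")):
            continue
        domains_part, sep, selector = line.partition("##")
        if not sep or not selector.strip():
            continue  # network filter, exception rule, or other syntax
        domains = tuple(d for d in domains_part.split(",") if d)
        rules.append(CosmeticRule(domains, selector.strip()))
    return rules


def selectors_for(rules: list[CosmeticRule], hostname: str) -> list[str]:
    """Selectors to try on a page served from `hostname`."""
    selected = []
    for rule in rules:
        # Assumption: a domain-specific rule also covers its subdomains.
        if not rule.domains or any(
            hostname == d or hostname.endswith("." + d) for d in rule.domains
        ):
            selected.append(rule.selector)
    return selected


# Made-up filter lines in EasyList format, for illustration only.
SAMPLE = """[Adblock Plus 2.0]
! Title: sample list
||ads.example.com^
##.ad-banner
example.com,example.org##div[id^="sponsored-"]
example.net#@#.ad-banner
"""
```

The returned selectors could then be passed to any CSS-selector engine (a browser's `querySelectorAll`, for instance) to locate candidate ad elements.
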
Plagiarism-Bot? How Low-Quality Websites Are Using AI to Deceptively Rewrite Content from Mainstream News Outlets, Newsguard, 2023
- detection: seek “error message” like “as an AI”, then human review
- plagiarize from legit site like New York Times
- money from Google Ads, from top brand
Scam Sites at Scale: LLMs Fueling a GenAI Criminal Revolution, Netcraft, 2024
- phishing website/ fake shop/ email use LLM
- identify thousands of websites each week using AI-generated content
- example w/ LLM “error message”/ beginning sentence of response
  - “Certainly, here’s/ here are …:”
  - “As of my last knowledge update…”
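
A rough sketch of the phrase-based first pass that the NewsGuard and Netcraft notes describe: flag pages containing leftover LLM "error message" phrases, then hand them to human review. The phrases come from the notes above. The regexes and the names `LLM_RESIDUE_PATTERNS` and `find_llm_residue` are assumptions, not either organization's actual detector.

```python
import re

# Tell-tale phrases from the notes above; the exact regexes are assumptions.
LLM_RESIDUE_PATTERNS = [
    r"\bas an ai\b",
    r"\bas of my last knowledge update\b",
    r"\bcertainly, here(?:'s|’s| is| are)\b",
]

_RESIDUE_RE = re.compile(
    "|".join(f"(?:{p})" for p in LLM_RESIDUE_PATTERNS), re.IGNORECASE
)


def find_llm_residue(text: str) -> list[str]:
    """Return matched phrases in order of appearance (empty list if none)."""
    return [m.group(0) for m in _RESIDUE_RE.finditer(text)]
```

A non-empty result only marks a page as a candidate. Legitimate articles that quote these phrases would also match, which is why the sources pair this step with human review.
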
Funding the Next Generation of Content Farms: Some of the World’s Largest Blue Chip Brands Unintentionally Support the Spread of Unreliable AI-Generated News Websites, Newsguard, 2023
- NewsGuard found 141 brands with ads on AI-driven sites
  - unreliable AI-generated news website (UAIN)
- mainly use Google Ads
- publish large volume of articles, sometimes 1000 per day
- Junk websites filled with AI-generated text are pulling in money from programmatic ads, MIT Technology Review, 2023
  - number for ad revenue
- People Are Spinning Up Low-Effort Content Farms Using AI, Futurism, 2023
  - content farm low quality, misinformation, profit from Google Ads
  - 💡 maybe we can report them to Google because they violate its ad policy
- Tracking AI-enabled Misinformation: 1,254 ‘Unreliable AI-Generated News’ Websites (and Counting), Plus the Top False Narratives Generated by Artificial Intelligence Tools, Newsguard, 2025
AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian, Giovanni Puccetti, Anna Rogers, Chiara Alzetta, Felice Dell’Orletta, Andrea Esuli, ACL, 2024
- easy to fine-tune free model into Italian content farm model (CFM)
- literature review on NewsGuard report & detector
- humans & DetectGPT have low accuracy, so detection is infeasible
  - fine-tuning the detector helps a lot, but requires knowing the base LLM used

Search Engine Optimization (SEO)

See https://sichanghe.github.io/notes/research/web_user_facing.html#search-engine-optimization-seo.

Website clustering

Identification of Web Spam through Clustering of Website Structures, Filippo Geraci, ACM WWW, 2015

Website ownership

Domain and Website Attribution beyond WHOIS, Silvia Sebastián, Raluca-Georgia Diugan, Juan Caballero, Iskander Sanchez-Rola, Leyla Bilge, ACSAC, 2023
- use WHOIS, passive DNS, TLS certificate, website content
  - website content: copyright string, metadata, policy, terms of service (TOS), contact, security.txt
- F1 score 0.94
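
A minimal sketch of one content signal listed above, the copyright string, used to group domains by likely owner. This is an illustration under loose assumptions, not the paper's method: the regex, the normalization, and the names `extract_copyright_holder` and `group_by_holder` are hypothetical. The other signals (WHOIS, passive DNS, TLS certificate) are not covered.

```python
import re
from collections import defaultdict
from typing import Optional

_COPYRIGHT_RE = re.compile(
    r"(?:©|\(c\)|copyright)"               # marker
    r"(?:\s*(?:©|\(c\)))?"                 # optional second marker: "Copyright ©"
    r"\s*(?:\d{4}(?:\s*[-–]\s*\d{4})?)?"   # optional year or year range
    r"[\s,]*"
    r"(?P<holder>[^.|\n<>]{2,80})",        # holder text up to a delimiter
    re.IGNORECASE,
)


def extract_copyright_holder(page_text: str) -> Optional[str]:
    """Return a normalized copyright holder, or None if no notice is found."""
    match = _COPYRIGHT_RE.search(page_text)
    if match is None:
        return None
    holder = re.sub(r"\s*all rights reserved.*$", "", match.group("holder"),
                    flags=re.IGNORECASE).strip()
    return holder.casefold() or None


def group_by_holder(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each holder to the domains whose page text names it."""
    groups = defaultdict(list)
    for domain, text in pages.items():
        holder = extract_copyright_holder(text)
        if holder:
            groups[holder].append(domain)
    return dict(groups)
```

The pattern is noisy: a sentence such as "Copyright policy" would also match, so its output is a weak signal to combine with others rather than an attribution on its own.
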

Web genre classification

Web page genre classification, Guangyu Chen, Ben Choi, SAC, 2008
- genre: homepage, search, resource, shop, forum
- ❌ manual feature design & threshold tuning w/ regex/ HTML tag
  - unlikely to generalize, especially to today's web
Web Genre Classification via Hierarchical Multi-label Classification, Gjorgji Madjarov, Vedrana Vidulin, Ivica Dimitrovski, Dragi Kocev, Springer IDEAL, 2015
- fancy decision tree on web page features
- 2491 features from Multi-Label Approaches to Web Genre Identification, Vedrana Vidulin, Mitja Luštrek, Matjaž Gams, JLCL, 2009
  - keyword in URL
  - specific word rate, punctuation, HTML tag, out-domain hyperlink
  - part of speech, sentence type
  - ❌ blind and brute force, bad <40% accuracy
- ❌ bad 28% accuracy
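
To make the feature families above concrete, here is a small sketch that computes a handful of them (keyword in URL, punctuation rate, HTML tag counts, out-domain hyperlink rate) using only the standard library. It is not the 2491-feature set from the paper. `URL_KEYWORDS`, `extract_features`, and the chosen tags are hypothetical. Part-of-speech and sentence-type features are left out.

```python
import string
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

URL_KEYWORDS = ("shop", "forum", "blog", "faq")  # hypothetical keyword list


class _PageStats(HTMLParser):
    """Collect tag counts, link targets, and text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_data(self, data):
        # Note: this also picks up <script>/<style> content.
        self.text_parts.append(data)


def extract_features(url: str, html: str) -> dict[str, float]:
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    text = "".join(stats.text_parts)
    host = urlparse(url).hostname or ""
    # Resolve relative links against the page URL before comparing hosts.
    link_hosts = [urlparse(urljoin(url, h)).hostname for h in stats.hrefs]
    external = sum(1 for h in link_hosts if h and h != host)

    features = {f"url_has_{k}": float(k in url.lower()) for k in URL_KEYWORDS}
    features["punctuation_rate"] = (
        sum(c in string.punctuation for c in text) / max(len(text), 1)
    )
    features["out_domain_link_rate"] = external / max(len(link_hosts), 1)
    for tag in ("a", "form", "img", "p"):
        features[f"tag_{tag}"] = float(stats.tags[tag])
    return features
```

The resulting dictionary could be fed to any classifier. The low accuracy noted above suggests that features of this kind alone are unlikely to be enough.
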
Enhancing the identification of web genres by combining internal and external structures, Chaker Jebari, Elsevier Pattern Recognition Letters, 2021
- ❌ use word (term) in heading + link; combine multiple classifier
- dataset: KI04, SANTINIS
- >0.85 accuracy w/ combined classifier
Web Page Classification using LLMs for Crawling Support, Yuichi Sasazawa, Yasuhiro Sogawa, Hitachi Ltd., 2025
- GPT-4o reaches 0.89 F1 score when classifying homepage vs content page from extracted title + body

Training data

Generative AI (GenAI)

See https://sichanghe.github.io/notes/research/gen_ai.html.

Training data curation

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay, arXiv, 2023
- heuristic-based URL filtering; no ML filtering to avoid bias
  - 4.6M site list 💡 useful for spotting spam/scam, etc.
- use Trafilatura for main content extraction
- Quality at a glance: An audit of web-crawled multilingual datasets, Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, et al., TACL, 2022
  - Common Crawl includes a large portion of machine spam & porn
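
A sketch of what heuristic URL filtering without an ML classifier might look like. It assumes a domain blocklist plus a weighted word score over the URL, and is not RefinedWeb's actual code or lists. `BLOCKED_DOMAINS`, `BAD_URL_WORDS`, `SCORE_THRESHOLD`, and the function names are hypothetical stand-ins. The real blocklist is the 4.6M-site list noted above. Main content extraction with Trafilatura is not shown.

```python
import re
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example", "adult.example"}  # stand-in for the site list
BAD_URL_WORDS = {"casino": 1.0, "porn": 1.0, "xxx": 1.0, "free": 0.3}  # made-up weights
SCORE_THRESHOLD = 1.0  # hypothetical cutoff


def is_blocked_domain(host: str) -> bool:
    """True if the host or any parent domain is on the blocklist."""
    parts = host.lower().split(".")
    return any(".".join(parts[i:]) in BLOCKED_DOMAINS for i in range(len(parts)))


def url_word_score(url: str) -> float:
    """Sum the weights of flagged words appearing as tokens in the URL."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    return sum(BAD_URL_WORDS.get(token, 0.0) for token in tokens)


def keep_url(url: str) -> bool:
    host = urlparse(url).hostname or ""
    if is_blocked_domain(host):
        return False
    return url_word_score(url) < SCORE_THRESHOLD
```

Matching whole tokens rather than substrings avoids dropping URLs where a flagged word appears inside an unrelated one, at the cost of missing concatenated forms.
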

Synthetic data

STaR: Bootstrapping Reasoning With Reasoning, Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman, NeurIPS, 2022
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, et al., Microsoft, 2024
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, ACL, 2024

Steven Hé (Sīchàng)