Literature
Website
Content farm & scam
- An examination of content farms in web search using crowdsourcing, Richard McCreadie, Craig Macdonald, Iadh Ounis, Jim Giles, Ferris Jabr, CIKM, 2012
- Mechanical Turk to label search result
- found content farm decrease
- Polls, clickbait, and commemorative $2 bills: problematic political advertising on news and media websites around the 2020 U.S. elections, Eric Zeng, Miranda Wei, Theo Gregersen, Tadayoshi Kohno, Franziska Roesner, IMC, 2021
- content farm served political ad
- extract ad w/ EasyList CSS selector
- 💡 can use uBlock Origin logger
- qualitative coding & BERT classifier to analyze
- Analyzing the (In)Accessibility of Online Advertisements, Christina Yeung, Tadayoshi Kohno, Franziska Roesner, IMC, 2024
- use UWCSESecurityLab adscraper to extract ad, which in turn uses EasyList
- Plagiarism-Bot? How Low-Quality Websites Are Using AI to Deceptively Rewrite Content from Mainstream News Outlets, Newsguard, 2023
- detection: seek “error message” like “as an AI”, then human review
- plagiarize from legit site like New York Times
- money from Google Ads, from top brand
- Scam Sites at Scale: LLMs Fueling a GenAI Criminal Revolution, Netcraft, 2024
- phishing website/ fake shop/ email use LLM
- identify thousands of websites each week using AI-generated content
- example w/ LLM “error message”/ beginning sentence of response
- “Certainly, here’s/ here are …:”
- “As of my last knowledge update…”
- Funding the Next Generation of Content Farms: Some of the World’s Largest Blue Chip Brands Unintentionally Support the Spread of Unreliable AI-Generated News Websites, Newsguard, 2023
- NewsGuard found 141 brand have ad on AI-driven site
- unreliable AI-generated news website (UAIN)
- mainly use Google Ads
- publish large volume, sometimes 1000 per day
- Junk websites filled with AI-generated text are pulling in money from programmatic ads, MIT Technology Review, 2023
- number for ad revenue
- People Are Spinning Up Low-Effort Content Farms Using AI, Futurism, 2023
- content farm low quality, misinformation, profit from Google Ads
- 💡 maybe we report them to Google bc against their ad policy
- Tracking AI-enabled Misinformation: 1,254 ‘Unreliable AI-Generated News’ Websites (and Counting), Plus the Top False Narratives Generated by Artificial Intelligence Tools, Newsguard, 2025
- AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian, Giovanni Puccetti, Anna Rogers, Chiara Alzetta, Felice Dell’Orletta, Andrea Esuli, ACL, 2024
- easy to fine-tune free model into Italian content farm model (CFM)
- literature review on NewsGuard report & detector
- human & DetectGPT low accuracy, so detection infeasible
- fine-tuning of detector help a lot, but need to know base LLM used
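
The NewsGuard and Netcraft notes above describe detection by seeking LLM “error message” strings, then human review. A minimal sketch of that first pass, assuming plain page text as input. The pattern list is hypothetical and seeded only from the phrases quoted in these notes, and the function names are made up for illustration:

```python
import re

# Hypothetical pattern list, seeded only from the phrases quoted in the notes.
TELLTALE_PATTERNS = [
    re.compile(r"\bas an AI\b", re.IGNORECASE),
    re.compile(r"\bcertainly, here(?:['’]s| are)\b", re.IGNORECASE),
    re.compile(r"\bas of my last knowledge update\b", re.IGNORECASE),
]


def find_telltale_phrases(text: str) -> list[str]:
    """Return every telltale substring found in text, grouped by pattern order."""
    hits = []
    for pattern in TELLTALE_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits


def needs_human_review(text: str) -> bool:
    """Flag a page for manual review if any telltale phrase appears."""
    return bool(find_telltale_phrases(text))
```

A match is only a candidate: legitimate text such as “works as an AI researcher” also matches, which is why the cited reports pair the string search with human review.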
Search Engine Optimization (SEO)
See https://sichanghe.github.io/notes/research/web_user_facing.html#search-engine-optimization-seo.
Website clustering
- Identification of Web Spam through Clustering of Website Structures, Filippo Geraci, ACM WWW, 2015
Website ownership
- Domain and Website Attribution beyond WHOIS, Silvia Sebastián, Raluca-Georgia Diugan, Juan Caballero, Iskander Sanchez-Rola, Leyla Bilge, ACSAC, 2023
- use WHOIS, passive DNS, TLS certificate, website content
- website content: copyright string, metadata, policy, terms of service (TOS), contact, security.txt
- F1 score 0.94
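
The website content signal above includes the copyright string. A rough sketch of how such a string could be pulled from page text and used to group domains by holder. The regex, the normalization, and the function names are assumptions for illustration, not the method of the cited paper:

```python
import re
from collections import defaultdict
from typing import Optional

# Assumed notice shape: "Copyright", "©" or "(c)", an optional year or year
# range, then the holder name up to a terminator.
COPYRIGHT_RE = re.compile(
    r"(?:copyright\s*(?:©|\(c\))?|©|\(c\))\s*"
    r"(?:\d{4}(?:\s*[-–]\s*\d{4})?\s*)?"
    r"(?P<holder>[^.|\n<]+?)"
    r"\s*(?:\.|\||\n|<|$|all rights reserved)",
    re.IGNORECASE,
)


def extract_copyright_holder(page_text: str) -> Optional[str]:
    """Return a normalized holder name, or None if no notice is found."""
    match = COPYRIGHT_RE.search(page_text)
    if match is None:
        return None
    # Collapse whitespace and lowercase so trivial spelling variants compare equal.
    return " ".join(match.group("holder").split()).lower()


def group_domains_by_holder(pages: dict[str, str]) -> dict[str, list[str]]:
    """Map each extracted holder to the domains whose text names it."""
    groups = defaultdict(list)
    for domain, text in pages.items():
        holder = extract_copyright_holder(text)
        if holder:
            groups[holder].append(domain)
    return dict(groups)
```

The pattern also fires on prose that merely contains the word “copyright”, so a real pipeline would need to restrict it to footers or combine it with the other signals listed above.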
Web genre classification
- Web page genre classification, Guangyu Chen, Ben Choi, SAC, 2008
- genre: homepage, search, resource, shop, forum
- ❌ manual feature design & threshold tuning w/ regex/ HTML tag
- unlikely to generalize especially today
- Web Genre Classification via Hierarchical Multi-label Classification, Gjorgji Madjarov, Vedrana Vidulin, Ivica Dimitrovski, Dragi Kocev, Springer IDEAL, 2015
- fancy decision tree on web page features
- 2491 features from Multi-Label Approaches to Web Genre Identification, Vedrana Vidulin, Mitja Luštrek, Matjaž Gams, JLCL, 2009
- keyword in URL
- specific word rate, punctuation, HTML tag, out-domain hyperlink
- part of speech, sentence type
- ❌ blind and brute force, bad <40% accuracy
- ❌ bad 28% accuracy
- Enhancing the identification of web genres by combining internal and external structures, Chaker Jebari, Elsevier Pattern Recognition Letters, 2021
- ❌ use word (term) in heading + link; combine multiple classifier
- dataset: KI04, SANTINIS
- >0.85 accuracy w/ combined classifier
- Web Page Classification using LLMs for Crawling Support, Yuichi Sasazawa, Yasuhiro Sogawa, Hitachi Ltd., 2025
- GPT-4o 0.89 F1 score when classify homepage vs content on extracted title + body
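
Several entries above classify genre from hand-built page features such as keyword in URL, punctuation rate, HTML tag, and out-domain hyperlink. A small sketch of extracting a few of those with the Python standard library. The keyword list and feature names are hypothetical, and part of speech and sentence type features are left out:

```python
import string
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical genre keywords; the cited feature sets are far larger.
URL_KEYWORDS = ("shop", "forum", "search", "blog")


class PageScanner(HTMLParser):
    """Collect tag counts, link hosts, and text while parsing a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc
        self.tag_counts = Counter()
        self.link_hosts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL before comparing hosts.
                host = urlparse(urljoin(self.page_url, href)).netloc
                if host:
                    self.link_hosts.append(host)

    def handle_data(self, data):
        # Note: this also collects script and style text; a fuller version would skip it.
        self.text_parts.append(data)


def extract_features(url, html):
    scanner = PageScanner(url)
    scanner.feed(html)
    scanner.close()
    chars = [ch for ch in "".join(scanner.text_parts) if not ch.isspace()]
    punctuation = sum(ch in string.punctuation for ch in chars)
    external = sum(host != scanner.page_host for host in scanner.link_hosts)
    features = {f"url_has_{kw}": kw in url.lower() for kw in URL_KEYWORDS}
    features["punctuation_rate"] = punctuation / len(chars) if chars else 0.0
    features["external_link_rate"] = (
        external / len(scanner.link_hosts) if scanner.link_hosts else 0.0
    )
    features["tag_counts"] = dict(scanner.tag_counts)
    return features
```

The resulting dictionary could feed any classifier; the accuracy figures noted above suggest such features alone are weak, which is the point of the ❌ marks.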
Training data
Generative AI (GenAI)
See https://sichanghe.github.io/notes/research/gen_ai.html.
Training data curation
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay, arXiv, 2023
- heuristic-based URL filtering; no ML filtering to avoid bias
- 4.6M site list 💡 useful for spotting spam/scam, etc.
- use Trafilatura for main content extraction
- Quality at a glance: An audit of web-crawled multilingual datasets, Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, et al., ACL, 2022
- Common Crawl include large portion of machine spam & porn
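
The heuristic-based URL filtering noted above can be sketched as a domain blocklist plus a keyword score over the URL string. This is an assumed shape for such a filter: the blocklist entries, keyword weights, and threshold below are hypothetical stand-ins, and the 4.6M site list is not reproduced:

```python
from urllib.parse import urlparse

# Hypothetical stand-ins for a real blocklist.
BLOCKED_DOMAINS = {"spam.example", "scam.example"}
# Hypothetical keyword weights and cut-off for a soft URL score.
KEYWORD_WEIGHTS = {"casino": 2, "pills": 2, "free-money": 2}
SCORE_THRESHOLD = 3


def domain_suffixes(host):
    """Yield host and each parent domain, e.g. a.b.c -> a.b.c, b.c, c."""
    labels = host.split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])


def url_score(url):
    """Sum the weights of all keywords that appear anywhere in the URL."""
    lowered = url.lower()
    return sum(weight for kw, weight in KEYWORD_WEIGHTS.items() if kw in lowered)


def keep_url(url):
    """Drop URLs on a blocked domain or with a keyword score at or above the threshold."""
    host = (urlparse(url).hostname or "").lower()
    if any(suffix in BLOCKED_DOMAINS for suffix in domain_suffixes(host)):
        return False
    return url_score(url) < SCORE_THRESHOLD
```

No model is involved, which matches the “no ML filtering to avoid bias” note; the cost is that a single keyword match stays below the threshold and passes.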
Synthetic data
- STaR: Bootstrapping Reasoning With Reasoning, Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman, NeurIPS, 2022
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, et al., Microsoft, 2024
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, ACL, 2024