- An examination of content farms in web search using crowdsourcing, Richard McCreadie, Craig Macdonald, Iadh Ounis, Jim Giles, Ferris Jabr, CIKM, 2012
  - Mechanical Turk to label search result; found content farm decrease
- Polls, clickbait, and commemorative \$2 bills: problematic political advertising on news and media websites around the 2020 U.S. elections, Eric Zeng, Miranda Wei, Theo Gregersen, Tadayoshi Kohno, Franziska Roesner, IMC, 2021
  - content farm served political ad
  - extract ad w/ EasyList CSS selector
    - 💡 can use uBlock Origin logger
  - qualitative coding & BERT classifier to analyze
- Analyzing the (In)Accessibility of Online Advertisements, Christina Yeung, Tadayoshi Kohno, Franziska Roesner, IMC, 2024
  - use UWCSESecurityLab adscraper to extract ad, which in turn use EasyList
- Plagiarism-Bot? How Low-Quality Websites Are Using AI to Deceptively Rewrite Content from Mainstream News Outlets, NewsGuard, 2023
  - detection: seek "error message" like "as an AI", then human review
  - plagiarize from legit site like New York Times
  - money from Google Ads, from top brand
- Scam Sites at Scale: LLMs Fueling a GenAI Criminal Revolution, Netcraft, 2024
  - phishing website/ fake shop/ email use LLM
  - > identify thousands of websites each week using AI-generated content
  - example w/ LLM "error message"/ beginning sentence of response
    - "Certainly, here’s/ here are …:"
    - "As of my last knowledge update…"
- Funding the Next Generation of Content Farms: Some of the World’s Largest Blue Chip Brands Unintentionally Support the Spread of Unreliable AI-Generated News Websites, NewsGuard, 2023
  - NewsGuard found 141 brand have ad on AI-driven site
  - unreliable AI-generated news website (UAIN)
  - NewsGuard want to charge university \$3,750 for each paper
    - ⇒ predator org w/ fake data?
  - mainly use Google Ads
  - publish large volume, sometimes 1000 per day
- Junk websites filled with AI-generated text are pulling in money from programmatic ads, MIT Technology Review, 2023
  - number for ad revenue
- People Are Spinning Up Low-Effort Content Farms Using AI, Futurism, 2023
  - content farm low quality, misinformation, profit from Google Ads
  - 💡 maybe we report them to Google bc against their ad policy
- Tracking AI-enabled Misinformation: 1,254 ‘Unreliable AI-Generated News’ Websites (and Counting), Plus the Top False Narratives Generated by Artificial Intelligence Tools, NewsGuard, 2025
- AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian, Giovanni Puccetti, Anna Rogers, Chiara Alzetta, Felice Dell’Orletta, Andrea Esuli, ACL, 2024
  - easy to fine-tune free model into Italian content farm model (CFM)
  - literature review on NewsGuard report & detector
  - human & DetectGPT low accuracy; infeasible to detect
  - fine-tuning of detector help a lot, but need to know base LLM used

### Search Engine Optimization (SEO)

See <https://sichanghe.github.io/notes/research/web_user_facing.html#search-engine-optimization-seo>.
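
The phrase-based detection noted in the content farm entries above (NewsGuard seeks "error messages" like "as an AI"; Netcraft gives opening sentences such as "Certainly, here’s") can be sketched roughly as below. This is only an illustration, not either organization's actual method: the function name `find_llm_boilerplate` and the pattern list are assumptions, and the notes say hits still go to human review.

```python
import re

# Phrases quoted in the notes above; the list is illustrative, not exhaustive.
LLM_BOILERPLATE_PATTERNS = [
    r"\bas an ai\b",
    r"\bcertainly, here(?:'|’)s\b",
    r"\bcertainly, here are\b",
    r"\bas of my last knowledge update\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in LLM_BOILERPLATE_PATTERNS]


def find_llm_boilerplate(text: str) -> list[str]:
    """Return the first match of each pattern, in pattern order.

    A non-empty result only flags the page as a candidate for human review.
    """
    hits = []
    for pattern in _COMPILED:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits
```

A page whose extracted text yields a non-empty list would be queued for manual inspection rather than labeled automatically.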

### Website clustering and attribution

- limited top-venue work on website clustering
- Detecting Malware and Spam Web Sites by Mining their Structural Features, WWW, 2015 (spam parking domain look-and-feel)
- scam and phishing detection literature partial coverage
- technology stack analysis tools rely on regex, e.g., Wappalyzergo

### Website clustering

- Identification of Web Spam through Clustering of Website Structures, Filippo Geraci, ACM WWW, 2015

### Website ownership

- Domain and Website Attribution beyond WHOIS, Silvia Sebastián, Raluca-Georgia Diugan, Juan Caballero, Iskander Sanchez-Rola, Leyla Bilge, ACSAC, 2023
  - use WHOIS, passive DNS, TLS certificate, website content
    - website content: copyright string, metadata, policy, terms of service (TOS), contact, security.txt
  - F1 score 0.94

### Web genre classification

this is what detection of homepage, etc. is called

- Web page genre classification, Guangyu Chen, Ben Choi, SAC, 2008
  - genre: homepage, search, resource, shop, forum
  - ❌ manual feature design & threshold tuning w/ regex/ HTML tag
    - unlikely to generalize especially today
- Web Genre Classification via Hierarchical Multi-label Classification, Gjorgji Madjarov, Vedrana Vidulin, Ivica Dimitrovski, Dragi Kocev, Springer IDEAL, 2015
  - fancy decision tree on web page features
  - 2491 features from Multi-Label Approaches to Web Genre Identification, Vedrana Vidulin, Mitja Luštrek, Matjaž Gams, JLCL, 2009
    - keyword in URL
    - specific word rate, punctuation, HTML tag, out-domain hyperlink
    - part of speech, sentence type
    - ❌ blind and brute force, bad <40% accuracy
  - ❌ bad 28% accuracy
- Enhancing the identification of web genres by combining internal and external structures, Chaker Jebari, Elsevier Pattern Recognition Letters, 2021
  - ❌ use word (term) in heading + link; combine multiple classifier
  - dataset: KI04, SANTINIS
  - accuracy >0.85 w/ combined classifier
- Web Page Classification using LLMs for Crawling Support, Yuichi Sasazawa, Yasuhiro
Sogawa, Hitachi Ltd., 2025
  - GPT-4o 0.89 F1 score when classify homepage vs content on extracted title + body

## Training data

### Generative AI (GenAI)

See <https://sichanghe.github.io/notes/research/gen_ai.html>.

### Training data curation

- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave, LREC, 2020
- Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction, Adrien Barbaresi, ACL, 2021
- An Empirical Comparison of Web Content Extraction Algorithms, Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein, SIGIR, 2023
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay, arXiv, 2023
  - heuristic-based URL filtering; no ML filtering to avoid bias
    - 4.6M site list 💡 useful for spotting spam/scam, etc.
  - use Trafilatura for main content extraction
- Quality at a glance: An audit of web-crawled multilingual datasets, Julia Kreutzer, Isaac Caswell, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, et al., ACL, 2022
  - Common Crawl include large portion of machine spam & porn
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research, Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, et al., Kyle Lo, arXiv, 2024
  - for Dolma 1.6
    - C4 NoPunc + Gopher All for quality filtering

### Synthetic data

- STaR: Bootstrapping Reasoning With Reasoning, Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D.
Goodman, NeurIPS, 2022
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, et al., Microsoft, 2024
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly, ACL, 2024

(below generated)

## Common Crawl processing at terabyte scale

- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
  - Quote: “deduplicates documents and identifies their language.” (arxiv)
  - Quote: “augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.” (arxiv)
- cc_net code
  - Quote: “The full mining pipeline is divided in 3 steps: hashes downloads one Common-Crawl snapshot, and compute hashes for each paragraph.” (github)
- OSCAR project website
  - TODO: Extraction/dedup/filter details are not present in the provided source.
- Ungoliant (OSCAR-related pipeline) summary
  - TODO: Only a brief description was available in the provided source; extraction/dedup/filter details not documented there.
- Dolma: an Open Corpus of Three Trillion Tokens
  - Quote: “transforms the output of CCNet through URL and document-level deduplication, then quality and content filtering.” (aclanthology)
  - Quote: “filtering at a rate of 122 CPU hours per TB.” (aclanthology)
- RedPajama-Data-v2 blog
  - Quote: “pass each CommonCrawl snapshot through the CCNet pipeline.” (together)
  - Quote: “deduplicated … using a Bloom filter,” with “reduction … roughly 40%.” (together)
- Common Crawl FAQ
  - Quote: “stored on Amazon’s S3 service,” enabling “Map-Reduce processing in EC2.” (commoncrawl)
- TODO: C4/mC4 extraction, dedup, and filtering details were not present in the provided sources.
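
The paragraph-hashing step quoted from cc_net above ("compute hashes for each paragraph") can be illustrated with a minimal sketch. It is not cc_net's implementation: the lowercase/whitespace normalization, the truncated SHA-1 digest, the in-memory `set`, and the names `paragraph_hash` and `dedup_documents` are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(paragraph: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(paragraph.lower().split())


def paragraph_hash(paragraph: str) -> bytes:
    # Keep the first 8 bytes of SHA-1 to save memory (an assumption).
    return hashlib.sha1(normalize(paragraph).encode("utf-8")).digest()[:8]


def dedup_documents(documents: Iterable[str]) -> Iterator[str]:
    """Drop every paragraph whose hash was already seen in an earlier one."""
    seen: set[bytes] = set()
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n"):
            if not paragraph.strip():
                continue  # skip blank lines
            h = paragraph_hash(paragraph)
            if h in seen:
                continue  # duplicate paragraph, e.g. repeated boilerplate
            seen.add(h)
            kept.append(paragraph)
        if kept:  # documents left with no paragraphs are dropped
            yield "\n".join(kept)
```

At Common Crawl scale an exact in-memory set may not fit; the RedPajama quote above mentions a Bloom filter, which trades a small false-positive rate for much lower memory.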

## AWS cost/throughput for batch processing

- Comparing Burstable and On-Demand AWS EC2 Instances using NAS Parallel Benchmarks
  - Quote: “Burstable instances… provide a baseline level of CPU utilization with the ability to burst… governed by CPU credits.” (sol.sbc.org)
  - Quote: “a CPU-bound workload can run similarly in all instances.” (sol.sbc.org)
- Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2
  - TODO: No verbatim quote provided in the search results.
- Choosing the Right EC2 Instance Type for your Application
  - Quote: “If you are running any CPU-bound scale-out applications, you should look at compute-optimized instances first.” (aws.amazon)
- Diving Deep into EC2 Spot Instance Cost and Operational Practices
  - Quote: “Spot Instances… are available at up to a 90% discount compared to On-Demand EC2 instance prices.” (aws.amazon)
  - Quote: “Spot Instance Advisor populates the frequency of interruption and average savings… based on the last 30 days of historical data.” (aws.amazon)
  - Quote: “the past interruption behavior doesn’t predict the future availability of these instances.” (aws.amazon)
- Cost-effective Batch Processing with Amazon EC2 Spot
  - TODO: No verbatim quote provided in the search results.
- Scientific Workflow Applications on Amazon EC2 (Montage)
  - Quote: “Epigenomics is considered to be CPU-bound because it spends 99% of its runtime in the CPU and only 1% on I/O…” (montage.ipac.caltech)
- Parsing Common Crawl in a day for \$60
  - Quote: “4117 p/s.” (pierce)
  - Quote: “used the m7i.metal-48 spot instance…,” “around 30h for the full common crawl…,” and “32,000 items processed per second on average.” (pierce)
- Measuring the Carbon Cost of Crawling Five Billion Web Pages
  - Quote: “Tailpipe assessed Common Crawl’s AWS usage data for the months of February and March.” (tailpipe)
  - Quote: “during the post-processing phase, AWS billed Common Crawl less per day… suggests that… workload was being consolidated across fewer computing instances with a higher utilization rate…”. (tailpipe)
- AWS EC2 Instances Benchmark
  - TODO: No verbatim quote provided in the search results.
- Common Crawl on AWS Marketplace
  - TODO: No verbatim quote provided in the search results.
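
The throughput and cost figures quoted in these notes can be combined with simple arithmetic, sketched below. Only 32,000 items per second, about 30 hours, 122 CPU hours per TB, and the "up to 90%" Spot discount come from the quotes. The 10 TB input, the 96-core machine, the on-demand price of 10.0 per hour, and the function names are hypothetical placeholders.

```python
def total_items(items_per_second: float, hours: float) -> float:
    """Items processed over a run at a constant average rate."""
    return items_per_second * hours * 3600


def wall_clock_hours(cpu_hours_per_tb: float, terabytes: float, cores: int) -> float:
    """Wall-clock time, assuming perfect scaling across cores (optimistic)."""
    return cpu_hours_per_tb * terabytes / cores


def spot_price(on_demand_price: float, discount: float) -> float:
    """Hourly price after a fractional Spot discount (0.9 means 90% off)."""
    return on_demand_price * (1 - discount)


# Quoted figures: 32,000 items/s sustained for about 30 h.
items = total_items(32_000, 30)

# Quoted Dolma rate of 122 CPU hours per TB, applied to a hypothetical
# 10 TB input on a hypothetical 96-core machine.
hours = wall_clock_hours(122, 10, 96)

# Best case of the quoted "up to 90%" discount on a hypothetical
# on-demand price of 10.0 per hour.
best_case = spot_price(10.0, 0.9)
```

Under these assumptions the quoted run covers about 3.5 billion items, and the hypothetical filtering job takes roughly 12.7 wall-clock hours. Real runs will be slower than the perfect-scaling estimate, and the quoted caveat that past interruption behavior does not predict future availability still applies to any Spot-based plan.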