Arguments
(citations are in Quarto format w/ reference.bib for future writing)
Significance
- The Web is the largest public corpus of knowledge for human consumption and machine learning training
- AI-generated text often plagiarizes prior content and hallucinates misinformation, polluting content on the Web.
- Spammy content degrades the search engine user experience by distorting rankings
- Users want accountable, reliable, and informative content, but AI-generated content often is none of these.
- Arguments from Perplexity.ai: LLMs may hold stale information and do not cite their sources, which hinders verification.
- Generated content in training data may degrade LLM performance
- Generated content in RAG corpora may degrade LLM search performance
- Laws like the EU AI Act mandate disclosure of AI-generated content, but many AI-content websites do not comply and thus may be operating illegally.
Crawling
- we try to see the webpage as users see it, although scrapers may be served different results, and search results differ by location
- main body text extraction is difficult to do well, but we make a best effort w/ the SoTA method (see the sketch after this list)
- Trafilatura [@barbaresi2021trafilatura] is SoTA [@bevendorff2023empirical; @reeve2024evaluation]
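A minimal extraction sketch with Trafilatura; the URL is a placeholder and the option values are illustrative assumptions, not our final pipeline settings:

```python
import trafilatura

# Fetch a page and extract its main body text, discarding boilerplate
# such as navigation menus, comments, and footers.
html = trafilatura.fetch_url("https://example.com/article")  # placeholder URL
if html is not None:
    text = trafilatura.extract(html, include_comments=False,
                               include_tables=False)
    print(text)
```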
Generated text detection
- Binoculars score reflects information density (substance)
- it is calibrated perplexity: perplexity under one model normalized by the cross-perplexity between two models (see the sketch after this list)
- generated text is bad because it provides less actual information
- the score thus detects both generated and low-quality content
- a continuous score may carry more information than binary classification with a fixed threshold
- detection on an individual webpage may err, but aggregate analysis per website should increase accuracy (see the aggregation sketch after this list)
- NLP benchmarks evaluate individual detectors on single texts, which does not reflect our aggregate per-website analysis, so we need to run our pipeline over baseline websites
- when multiple detectors exhibit the same trend, the trend is well corroborated
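A minimal sketch of the Binoculars score as calibrated perplexity: the log-perplexity of the text under an observer model, divided by the cross-perplexity between observer and performer models. The Falcon pairing follows the Binoculars paper; treat the exact models and any decision threshold as assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

OBSERVER = "tiiuae/falcon-7b"            # observer/performer pair from the
PERFORMER = "tiiuae/falcon-7b-instruct"  # Binoculars paper; any pair sharing
                                         # a tokenizer could be substituted

tokenizer = AutoTokenizer.from_pretrained(OBSERVER)
observer = AutoModelForCausalLM.from_pretrained(OBSERVER)
performer = AutoModelForCausalLM.from_pretrained(PERFORMER)

@torch.no_grad()
def binoculars_score(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    obs_logits = observer(ids).logits[0, :-1]    # next-token predictions
    perf_logits = performer(ids).logits[0, :-1]
    targets = ids[0, 1:]

    # log-perplexity of the text under the observer
    log_ppl = F.cross_entropy(obs_logits, targets)

    # cross-perplexity: the observer's expected surprisal under the
    # performer's next-token distribution, averaged over positions
    x_ppl = (F.softmax(perf_logits, dim=-1)
             * -F.log_softmax(obs_logits, dim=-1)).sum(-1).mean()

    # low scores indicate likely machine-generated text; the threshold
    # must be tuned on labeled data
    return (log_ppl / x_ppl).item()
```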
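And a sketch of per-website aggregation, where single-page errors wash out as the number of pages grows: the per-site flag rate with a 95% Wilson confidence interval. The function name and the threshold parameter are illustrative assumptions.

```python
import math

def site_flag_rate(page_scores: list[float], threshold: float,
                   z: float = 1.96) -> tuple[float, float]:
    """Fraction of a site's pages flagged as generated (score below
    threshold), as the center and half-width of a 95% Wilson interval."""
    n = len(page_scores)
    p = sum(score < threshold for score in page_scores) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center, half
```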
Content farm
- moral issue: some sites, like GeeksforGeeks, look like content farms but may be useful; what is our stance?
Generalizing to non-article webpages
Text-based LLM detection cannot generalize to all webpages. Types of webpages from the viewpoint of such detection (enumeration based on DeepSeek):
Single or cohesive narrative/text blocks. Blog Posts, News Articles, Research Publications, Tutorials/Guides, E-books/Whitepapers, Podcast Transcripts, Recipes, Interactive Stories, Archived Content.
👌 These can be treated as a single block of text and directly classified.
Multi-Section Text Pages. Homepage, FAQ, Glossary, Forum (boards/posts), Directory, Wiki/Knowledge Base, Portfolio, Testimonials, Case Studies, Team Directory, Event Listings, Press Releases, User Profiles, Social Feeds, Q&A Platforms.
😰 Treating these as a single block of text introduces discontinuities that degrade detection performance. Segmenting them and classifying each segment separately may work (see the sketch after this enumeration).
Boilerplate/Legal/Standardized Content. Privacy Policy, Terms of Service, Disclaimer, Product Pages (descriptions), Download Pages, Account Settings, Pricing Pages, Services Pages, Career Listings, API Documentation, Client Dashboards, Affiliate Pages.
🤷 These follow standardized forms, so humans would write them much as LLMs do; detection makes little sense.
Media/Non-Text Pages. Image Galleries, Video Pages, Audio Streams, 3D/Virtual Tours.
❌ Text detection is not applicable to these.
Interactive/Functional Interfaces. Dashboards, Quizzes/Surveys, Calendars, Booking/Checkout Pages, Login/Registration Forms, Search Results, Advanced Filters, Live Chat, Calculators/Converters, Games, AR/VR Interfaces. Stock Tickers, Weather Forecasts, Order Tracking, Live Streams/Webinars, Auction/Bidding Pages, Real Estate Listings, Job Boards. Comparison Tools, Financial Calculators, Medical/Appointment Systems, Code Playgrounds, Maps.
❌ These are not really text-centric content, so text detection is not applicable.
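A sketch of the segment-level approach suggested for multi-section pages above; the blank-line splitting heuristic and the minimum-length filter are assumptions, and `binoculars_score` is the function sketched earlier.

```python
def score_segments(page_text: str, min_words: int = 50) -> list[float]:
    """Split a multi-section page into blocks and score each block
    independently instead of concatenating discontinuous text."""
    segments = (s.strip() for s in page_text.split("\n\n"))
    return [binoculars_score(s) for s in segments
            if len(s.split()) >= min_words]
```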