Web User-Facing

Search Engine Optimization (SEO)

  • The influence of search engine optimization on Google’s results: A multi-dimensional approach for detecting SEO, Dirk Lewandowski, Sebastian Sünkler, Nurce Yagci, ACM WebSci, 2021
    • insight from interview w/ “SEO expert”
    • questionable heuristics (e.g., HTTPS, manual website classification)
    • dataset: Google Trends, radical right, coronavirus
    • most search results are likely SEO-optimized
  • Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines, Janek Bevendorff, Matti Wiegmann, Martin Potthast, Benno Stein, Springer ECIR, 2024
    • collect search results for product-review queries & spot affiliate links
      • query: best <category>, where <category> comes from the GS1 Global Product Classification / Google Product Taxonomy
      • filter reviews with a keyword regex, but it reached only 80% accuracy in testing
      • manual classification of the top 30 domains: authentic review / magazine & news / content farm / spam / shop / social media / other
    • top SEO content: repetitive, less readable, shallower URLs, longer content, more headings, less heading-content overlap
      • lots of SEO metrics based on HTML
      • They are also indicators of lower-quality, possibly mass-produced, or even AI-generated content.
    • comparison w/ BM25 search engine ChatNoir: many more affiliate links
  • Adversarial Search Engine Optimization for Large Language Models, Fredrik Nestaas, Edoardo Debenedetti, Florian Tramèr, arXiv, 2024
    • embed instructions/defamation in web content to manipulate RAG-based LLM search engines (answer engines)
      • can imply other content is bad
    • test by searching w/ site: for owned domain
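
sketch (assumptions, not from the papers): the "best <category>" query template and affiliate-link spotting from the ECIR 2024 notes above could look roughly like this. The category names, the review keyword regex, and the affiliate heuristics (Amazon `tag` parameter, `amzn.to` short links) are illustrative placeholders; the paper's actual regex and link rules are not reproduced here.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical categories; the paper draws them from product taxonomies.
CATEGORIES = ["wireless headphones", "espresso machine"]

# Illustrative keyword pattern, NOT the regex used in the paper.
REVIEW_RE = re.compile(r"\b(review|reviewed|tested|best|top \d+)\b", re.IGNORECASE)


def build_queries(categories):
    """Instantiate the 'best <category>' template for each category."""
    return [f"best {category}" for category in categories]


def looks_like_review(title):
    """Keyword filter for review pages (expect false positives and negatives)."""
    return REVIEW_RE.search(title) is not None


def is_affiliate_link(url):
    """Assumed heuristic: Amazon links carrying a 'tag' parameter, or amzn.to short links."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    params = parse_qs(parsed.query)
    on_amazon = host == "amazon.com" or host.endswith(".amazon.com")
    if on_amazon and "tag" in params:
        return True
    return host == "amzn.to"


def affiliate_ratio(urls):
    """Share of outgoing links on a page that match the affiliate heuristic."""
    if not urls:
        return 0.0
    return sum(is_affiliate_link(u) for u in urls) / len(urls)
```

usage: run `build_queries(CATEGORIES)` against each search engine, keep hits where `looks_like_review(title)` holds, then compare `affiliate_ratio` per engine and rank position.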

examples:

ideas:

  • ranking based on user feedback
  • measuring retrieval of Perplexity, ChatGPT, etc. in search mode
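
sketch for the user-feedback idea (all names hypothetical): one simple option is to blend the retrieval score with a smoothed helpfulness ratio from up/down votes. `Result`, `feedback_score`, `rerank`, the `prior`, and the 0.3 weight are assumptions for illustration, not a tested design.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    base_score: float  # retrieval score, assumed to be scaled to [0, 1]
    upvotes: int = 0
    downvotes: int = 0


def feedback_score(result, prior=1.0):
    """Laplace-smoothed share of positive votes; 0.5 when there are no votes."""
    total = result.upvotes + result.downvotes
    return (result.upvotes + prior) / (total + 2 * prior)


def rerank(results, weight=0.3):
    """Sort by a convex blend of retrieval score and user feedback."""
    def blended(result):
        return (1 - weight) * result.base_score + weight * feedback_score(result)
    return sorted(results, key=blended, reverse=True)
```

note: smoothing keeps pages with few votes near neutral, but vote stuffing is an obvious attack surface here, so feedback would need its own spam defenses.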

🤖 Agent-added generated/synthetic web directions (2026-06-07)

  • 🤖 Source-critical agents: search and answer agents should model source ecology, not just rank pages. Live pages, archives, AI-generated pages, answer-engine summaries, consent-gated pages, and logged-in pages carry different evidence status.
  • 🤖 Generated web artifacts: generated sites/components should be treated as executable supply-chain claims with provenance, security obligations, tests, and runtime monitors. This is more useful than another AI-text detector.
  • 🤖 Why now: Retrieval Collapse reports synthetic sources can dominate retrieval exposure; LLM-generated web-app studies report exploitable auth, session, validation, XSS, upload, and HTTP-header failures; AI SEO/content spam changes what web-grounded agents see.
  • 🤖 Evaluation sketch: crawl a volatile topic daily, classify source types, archive evidence, and test whether agent answers shift toward synthetic or stale evidence. For generated artifacts, generate sites, attach obligations, run security checks, and measure review/repair quality.
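
sketch of the measurement step in the evaluation above (assumptions throughout): given one `(day, source_type)` record per source an agent cited, compute the daily share of synthetic sources and the shift between the first and last day. The source-type labels and function names are hypothetical, and the source classifier itself is out of scope here.

```python
from collections import Counter, defaultdict

# Hypothetical labels for source types counted as synthetic.
SYNTHETIC_TYPES = frozenset({"ai_generated", "answer_engine_summary"})


def synthetic_share_by_day(citations, synthetic_types=SYNTHETIC_TYPES):
    """Map each day to the fraction of cited sources labeled synthetic.

    `citations` is an iterable of (day, source_type) pairs; days are assumed
    to be ISO date strings so that string order matches time order.
    """
    per_day = defaultdict(Counter)
    for day, source_type in citations:
        per_day[day][source_type] += 1
    shares = {}
    for day in sorted(per_day):
        counts = per_day[day]
        total = sum(counts.values())
        synthetic = sum(counts[t] for t in synthetic_types)
        shares[day] = synthetic / total
    return shares


def shift(shares):
    """Change in synthetic share from the earliest to the latest day."""
    days = sorted(shares)
    if not days:
        return 0.0
    return shares[days[-1]] - shares[days[0]]
```

usage: a positive `shift` over the crawl window would suggest answers drifting toward synthetic evidence; the same tally works for a "stale" label if archive age is recorded.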

🤖 See literature_directions.md candidate 4 and candidate 5 for the question-first versions of these ideas.

Web Dependency

Phishing

Spam