Web User-Facing
- Investigating influencer VPN ads on YouTube, Omer Akgul, Richard Roberts, Moses Namara, Dave Levin, Michelle L. Mazurek, IEEE SP, 2022
- scrape YouTube & analyze video
- Evolving Bots: The New Generation of Comment Bots and their Underlying Scam Campaigns in YouTube, Seung Ho Na, Sumin Cho, Seungwon Shin, IMC, 2023
- scrape YouTube comment & cluster to find user w/ URL in profile
- fine-tune BERT variant to generate embedding for DBSCAN clustering
- filter URL: discard most common domain (twitter) and very rare domain (personal website)
- match URL w/ blocklist to find social scam bot (SSB)
- most are “romance scam” or “game scam”
- SSB strategy: steal top comment, self-engagement
- As Advertised? Understanding the Impact of Influencer VPN Ads, Omer Akgul, Richard Roberts, Emma Shroyer, Dave Levin, Michelle L. Mazurek, USENIX Security, 2025
- Reviving Dead Links on the Web with Fable, Jingyuan Zhu, Anish Nyayachavadi, Jiangchen Zhu, Vaspol Ruamviboonsuk, Harsha V. Madhyastha, IMC, 2023
- heuristic to find mapping between dead link and new link after page moved
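A minimal sketch of the URL filtering and blocklist matching steps noted under the Evolving Bots entry above. It is not the authors' implementation: the function names, the `max_share` and `min_count` thresholds, and the sample domains are all assumptions made for illustration, and the BERT embedding and DBSCAN clustering steps are left out.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL, without a leading 'www.'.

    Assumes the URL carries a scheme (e.g. https://); otherwise the host is empty.
    """
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def candidate_scam_users(profile_urls, blocklist, max_share=0.4, min_count=2):
    """profile_urls maps a user id to the URL found in that user's profile.

    max_share and min_count are hypothetical thresholds, not values from the paper.
    """
    domains = {user: domain_of(url) for user, url in profile_urls.items()}
    counts = Counter(domains.values())
    total = sum(counts.values())
    kept = {
        user: dom
        for user, dom in domains.items()
        # Discard the most common domains (e.g. a large social network)
        # and very rare ones (e.g. a one-off personal website).
        if counts[dom] / total <= max_share and counts[dom] >= min_count
    }
    # Flag users whose remaining domain appears on the blocklist.
    return sorted(user for user, dom in kept.items() if dom in blocklist)


# Made-up sample data: five Twitter profiles, three scam profiles, two personal sites.
profiles = {f"u{i}": "https://www.twitter.com/someone" for i in range(1, 6)}
profiles.update({f"u{i}": "https://scam-dating.example/join" for i in range(6, 9)})
profiles["u9"] = "https://alice-blog.example/"
profiles["u10"] = "https://bob-site.example/"

flagged = candidate_scam_users(profiles, blocklist={"scam-dating.example"})
print(flagged)
```

Filtering by domain frequency before the blocklist lookup keeps the expensive or manual matching step focused on domains that several accounts share.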
Search Engine Optimization (SEO)
- ❓ The influence of search engine optimization on Google’s results: A multi-dimensional approach for detecting SEO, Dirk Lewandowski, Sebastian Sünkler, Nurce Yagci, ACM WebSci, 2021
- insight from interview w/ “SEO expert”
- questionable heuristics (e.g., HTTPS, manual website classification)
- dataset: Google Trends, radical right, coronavirus
- most search result likely have SEO
- ⭐ Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines, Janek Bevendorff, Matti Wiegmann, Martin Potthast, Benno Stein, Springer ECIR, 2024
- search result on product review & spot affiliate link
- query: `best <category>` where `<category>` is in GS1 Global Product Classification / Google Product Taxonomy
- filter review based on keyword regex, but 80% accuracy in test
- manual classification of top 30 domain: authentic review / magazine & news / content farm / spam / shop / social media / other
- top SEO content: repetitive, less readable, shallower URL, longer content, more heading, less heading-content overlap
- lots of SEO metric based on HTML
- they are also indicators of lower-quality, possibly mass-produced, or even AI-generated content
- comparison w/ BM25 search engine ChatNoir: much more affiliate link
- Adversarial Search Engine Optimization for Large Language Models, Fredrik Nestaas, Edoardo Debenedetti, Florian Tramèr, arXiv, 2024
- embed instruction/defamation in web content to manipulate RAG LLM search engine (answer engine)
- can imply other content is bad
- test by searching w/ `site:` for owned domain
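A rough sketch of how a few of the HTML-based SEO signals listed under the Bevendorff et al. entry above (URL depth, heading count, content length, heading-content overlap) could be computed. The metric definitions here are assumptions made for illustration and are not taken from the paper; `seo_signals` and its output keys are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TextSplitter(HTMLParser):
    """Collect words inside h1..h6 separately from all other text.

    Simplification: text inside <script>/<style> is counted as body text.
    """

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.heading_count = 0
        self.heading_words = []
        self.body_words = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.heading_count += 1
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        words = data.lower().split()
        target = self.heading_words if self._in_heading else self.body_words
        target.extend(words)


def seo_signals(url, html):
    """Return a dict of simple page-level signals for one result page."""
    parser = _TextSplitter()
    parser.feed(html)
    parser.close()
    path_segments = [s for s in urlparse(url).path.split("/") if s]
    heading_vocab = set(parser.heading_words)
    body_vocab = set(parser.body_words)
    # Share of distinct heading words that also occur in the body text.
    overlap = len(heading_vocab & body_vocab) / len(heading_vocab) if heading_vocab else 0.0
    return {
        "url_depth": len(path_segments),
        "heading_count": parser.heading_count,
        "word_count": len(parser.heading_words) + len(parser.body_words),
        "heading_body_overlap": overlap,
    }


page = "<h1>Best Blender</h1><p>This blender is the best pick.</p><h2>Verdict</h2><p>Buy it.</p>"
signals = seo_signals("https://reviews.example/best-blender", page)
print(signals)
```

Only the standard library is used, so the tokenization is whitespace-based and punctuation stays attached to words; a real study would likely need a proper tokenizer and boilerplate removal.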
examples:
- AI SEO spam content when searching for “AI SEO” in Google Scholar: AI Revolutionizes Seo: How Chatgpt & Gemini Can Supercharge Your Rankings
- leading SEO company is huge: ahrefs
ideas:
- ranking based on user feedback
- measuring retrieval of Perplexity, ChatGPT, etc. in search mode
🤖 Agent-added generated/synthetic web directions (2026-06-07)
- 🤖 Source-critical agents: search and answer agents should model source ecology, not just rank pages. Live pages, archives, AI-generated pages, answer-engine summaries, consent-gated pages, and logged-in pages carry different evidence status.
- 🤖 Generated web artifacts: generated sites/components should be treated as executable supply-chain claims with provenance, security obligations, tests, and runtime monitors. This is more useful than another AI-text detector.
- 🤖 Why now: Retrieval Collapse reports synthetic sources can dominate retrieval exposure; LLM-generated web-app studies report exploitable auth, session, validation, XSS, upload, and HTTP-header failures; AI SEO/content spam changes what web-grounded agents see.
- 🤖 Evaluation sketch: crawl a volatile topic daily, classify source types, archive evidence, and test whether agent answers shift toward synthetic or stale evidence. For generated artifacts, generate sites, attach obligations, run security checks, and measure review/repair quality.
🤖 See literature_directions.md candidate 4 and candidate 5 for the question-first versions of these ideas.
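One possible way to quantify the "shift toward synthetic evidence" part of the evaluation sketch above, assuming each source an agent cites on a given crawl day has already been labeled with a source type. The label names, the input layout, and the function names are hypothetical; crawling and classification are out of scope here.

```python
from collections import Counter


def synthetic_share(cited_types):
    """Fraction of cited sources labeled 'synthetic' (0.0 if nothing was cited)."""
    counts = Counter(cited_types)
    total = sum(counts.values())
    return counts["synthetic"] / total if total else 0.0


def share_shift(daily_citations):
    """daily_citations maps an ISO date string to the source-type labels cited that day.

    Returns the per-day synthetic share and the change from the first to the last day.
    """
    days = sorted(daily_citations)  # ISO dates sort chronologically as strings
    if not days:
        return {}, 0.0
    shares = {day: synthetic_share(daily_citations[day]) for day in days}
    return shares, shares[days[-1]] - shares[days[0]]


# Made-up labels for two crawl days.
observations = {
    "2026-06-01": ["live", "live", "archive", "synthetic"],
    "2026-06-02": ["live", "synthetic", "synthetic", "archive"],
}
shares, shift = share_shift(observations)
print(shares, shift)
```

A first-to-last difference is the simplest possible summary; a longer crawl would probably call for a trend estimate over all days instead.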
Web Dependency
- Poster: Web Dependency Analyzer to Identify Resource Dependencies and their Impact on Rendering, Yasin Alhamwy, Paul Mertens, Oliver Hohlfeld, IMC, 2024
- which domain provide which assets, contribute how much to rendered content
- ablation: what happens to rendering when some domains are blocked (or down)
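A small sketch of the per-domain attribution and blocking ablation described above. It uses transferred bytes as a stand-in for "contribution to rendered content", which is an assumption and not necessarily what the poster measures; the function names and sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def bytes_by_domain(resources):
    """resources is an iterable of (url, size_in_bytes) pairs for one page load."""
    totals = defaultdict(int)
    for url, size in resources:
        totals[urlparse(url).netloc.lower()] += size
    return dict(totals)


def surviving_share(resources, blocked_domains):
    """Fraction of page bytes still delivered when the given domains are blocked or down."""
    totals = bytes_by_domain(resources)
    total = sum(totals.values())
    kept = sum(size for dom, size in totals.items() if dom not in blocked_domains)
    return kept / total if total else 0.0


# Made-up resource list for a single page.
page_resources = [
    ("https://site.example/index.html", 2000),
    ("https://cdn.example/app.js", 6000),
    ("https://fonts.example/a.woff2", 2000),
]
per_domain = bytes_by_domain(page_resources)
share_without_cdn = surviving_share(page_resources, {"cdn.example"})
print(per_domain, share_without_cdn)
```

Byte share ignores that a small blocked script can break the whole page, so a faithful ablation would compare actual renderings with and without the blocked domains.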
Phishing
- Phishing in the Free Waters: A Study of Phishing Attacks Created using Free Website Building Services, Sayak Saha Roy, Unique Karanjit, Shirin Nilizadeh, IMC, 2023
- description of free website building service
- heavy detection pressure on phishing domains ⇒ costly for attackers to keep buying new domains, hence the appeal of free website builders
Spam
- Click Trajectories: End-to-End Analysis of the Spam Value Chain, IEEE SP, 2011
- SybilGuard: defending against sybil attacks via social networks, Haifeng Yu, Michael Kaminsky, Phillip B. Gibbons, Abraham Flaxman, SIGCOMM, 2006