Literature

Website

Content farm & scam

Search Engine Optimization (SEO)

See https://sichanghe.github.io/notes/research/web_user_facing.html#search-engine-optimization-seo.

Website clustering

Website ownership

  • Domain and Website Attribution beyond WHOIS, Silvia Sebastián, Raluca-Georgia Diugan, Juan Caballero, Iskander Sanchez-Rola, Leyla Bilge, ACSAC, 2023
    • attributes domains and websites to their owners using WHOIS, passive DNS, TLS certificates, and website content
      • website content features: copyright string, metadata, policy, terms of service (TOS), contact, security.txt
    • reported F1 score: 0.94
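
To make the website content features above concrete, here is a minimal sketch of how two of them (the copyright string and security.txt contacts) might be pulled out and used to group sites. It is not the paper's pipeline: the function names, the regular expression, and the grouping rule are all assumptions made for illustration, and a real system would need far more robust parsing.

```python
import re
from html import unescape
from typing import Dict, List, Optional

# Hypothetical pattern: a copyright marker, an optional year or year range,
# then the holder name up to the next '.', '|', '<', or newline.
COPYRIGHT_RE = re.compile(
    r"(?:©|\(c\)|copyright)\s*(?:©|\(c\))?\s*"
    r"(?:\d{4}(?:\s*[-\u2013]\s*\d{4})?)?[\s,]*"
    r"([^.|<\n]{2,80})",
    re.IGNORECASE,
)


def extract_copyright_holder(html: str) -> Optional[str]:
    """Return the first copyright holder found in the page text, if any."""
    # Crude tag stripping; an HTML parser would be safer on real pages.
    text = unescape(re.sub(r"<[^>]+>", " ", html))
    match = COPYRIGHT_RE.search(text)
    if match is None:
        return None
    holder = re.sub(
        r"\s*all rights reserved.*$", "", match.group(1), flags=re.IGNORECASE
    )
    return holder.strip(" .,") or None


def parse_security_txt_contacts(body: str) -> List[str]:
    """Collect the values of Contact fields from a security.txt body."""
    contacts = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("#") or ":" not in line:
            continue  # skip comments and lines that are not "name: value"
        name, value = line.split(":", 1)
        if name.strip().lower() == "contact":
            contacts.append(value.strip())
    return contacts


def group_by_holder(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Group domains whose pages share a case-folded copyright holder."""
    groups: Dict[str, List[str]] = {}
    for domain, html in pages.items():
        holder = extract_copyright_holder(html)
        if holder is not None:
            groups.setdefault(holder.casefold(), []).append(domain)
    return groups
```

For example, calling `group_by_holder` on pages from `a.example` and `b.example` that both carry an "Example Corp" copyright line would place the two domains in the same group. A shared copyright string is only a hint of common ownership and would normally be combined with the other signals listed above.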

Web genre classification

Training data

Generative AI (GenAI)

See https://sichanghe.github.io/notes/research/gen_ai.html.

Training data curation

Synthetic data