Execution

(code and discussion have been moved to a private repo; please ask Steven for access)

Preliminary

  • run Binoculars over Common Crawl; see if the results make sense (preliminary_binoculars_eval.md)

Keyword acquisition

  • from Google Trends (google_trends.md)
    • Google autocomplete suggestions?
  • classify into topics; get the topic class from Google Trends
  • brainstorm related keywords manually

Web searching and crawling

  • search Google/Bing/Brave/Perplexity/ChatGPT/ChatNoir
  • crawl the top 10 results/references
  • extract body text
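
The body-text extraction step could be sketched as below. This is a minimal sketch using only the Python standard library, not the method used in the private repo; the names `BodyTextExtractor`, `extract_body_text`, and the `SKIP_TAGS` set are hypothetical, and a dedicated boilerplate-removal tool may well work better on real pages.

```python
from html.parser import HTMLParser

# Assumption: tags whose contents we never want in the body text.
SKIP_TAGS = {"script", "style", "noscript", "head", "nav", "footer"}


class BodyTextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_body_text(html):
    """Return the visible text of an HTML page, one text chunk per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_body_text(page_html)` on each crawled page would give the plain text that is passed on to the detection step.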

Generated text detection

  • text cleaning
  • Binoculars
    • speed-up for large-scale runs
  • get a more powerful processor
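
For reference, the Binoculars score is, as we understand the paper, the ratio of an observer model's log-perplexity on the text to the cross-perplexity between an observer and a performer model, with low scores suggesting generated text. The sketch below computes that ratio from per-position next-token distributions supplied as plain lists. It is an illustration of the formula under that reading, not the official implementation; the function names are hypothetical, and a real run would obtain the distributions from two language models.

```python
import math


def log_perplexity(observer_probs, token_ids):
    """Average negative log-probability the observer gives to the actual tokens."""
    total = 0.0
    for dist, tok in zip(observer_probs, token_ids):
        total -= math.log(dist[tok])
    return total / len(token_ids)


def log_cross_perplexity(observer_probs, performer_probs):
    """Average cross-entropy between observer and performer distributions."""
    total = 0.0
    for p_obs, p_perf in zip(observer_probs, performer_probs):
        # Observer probabilities weight the performer's log-probabilities.
        total -= sum(po * math.log(pp) for po, pp in zip(p_obs, p_perf) if po > 0)
    return total / len(observer_probs)


def binoculars_score(observer_probs, performer_probs, token_ids):
    """Ratio of log-perplexity to log cross-perplexity (lower = more machine-like)."""
    return log_perplexity(observer_probs, token_ids) / log_cross_perplexity(
        observer_probs, performer_probs
    )
```

With both models predicting `[0.9, 0.1]` at a position, an expected token (index 0) gives a score well below 1, while a surprising token (index 1) gives a score well above 1. Any decision threshold would still have to be calibrated on the Common Crawl data.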

Case studies

  • what are the “positive” pages?
    • manual inspection
    • clustering
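
One simple way to get a first look at the “positive” pages is to group them by textual similarity. The sketch below uses greedy threshold clustering over bag-of-words cosine similarity; this is a stand-in technique chosen for illustration, not a decided method, and the function names and the `threshold=0.5` default are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for one page."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def greedy_cluster(texts, threshold=0.5):
    """Assign each text to the most similar existing cluster, or start a new one."""
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(texts):
        vec = bag_of_words(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # centroid = summed word counts
            best["members"].append(i)
    return [c["members"] for c in clusters]
```

The returned index groups can then be sampled for manual inspection, for example by reading a few pages from each of the largest clusters.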