Execution

(code and discussion have been moved to a private repo; please ask Steven for access)

see also arguments.md and literature.md

Preliminary

  • run Binoculars over Common Crawl; see if the results make sense (preliminary_binoculars_eval.md) edit: … and WikiHow search results

Keyword acquisition

  • WikiHow article titles (wikihow.md)
  • [ ] from Google Trends (google_trends.md)
    • Google completion?
  • get topic class from Google Trends
  • [ ] human brainstorm related keywords
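
A minimal sketch of how keywords from the sources above could be merged into one deduplicated list while remembering where each came from. The function name, the source labels, and the normalization rule (lowercase, collapsed whitespace) are assumptions for illustration, not what the repo does.

```python
from collections import defaultdict


def merge_keywords(sources: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized keyword to the set of sources it came from."""
    merged: dict[str, set[str]] = defaultdict(set)
    for source_name, raw_keywords in sources.items():
        for raw in raw_keywords:
            # Assumed normalization: lowercase and collapse whitespace runs.
            keyword = " ".join(raw.lower().split())
            if keyword:
                merged[keyword].add(source_name)
    return dict(merged)


# Hypothetical inputs; real lists would come from wikihow.md / google_trends.md.
keywords = merge_keywords({
    "wikihow": ["How to Bake a Cake", "how to  bake a cake"],
    "google_trends": ["how to bake a cake", "Best Running Shoes"],
})
```

Keeping the source set per keyword makes it possible to compare later detection results across keyword origins.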

Web searching and crawling

web_search.md

  • search Google/Bing/Brave/Perplexity/ChatGPT/ChatNoir
  • crawl top 10 results/references
  • extract body text
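
The last two steps can be sketched with the standard library only. This is a stand-in, not the project's extractor (the notes elsewhere mention Trafilatura for cleaning); `top_results`, `extract_body_text`, and the list of skipped tags are hypothetical choices.

```python
from html.parser import HTMLParser


def top_results(urls: list[str], limit: int = 10) -> list[str]:
    """Keep the first `limit` distinct URLs, preserving rank order."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        if len(kept) >= limit:
            break
        if url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


class BodyTextExtractor(HTMLParser):
    """Collect text nodes, skipping script/style and page-chrome elements."""

    # Assumed list of non-content tags; a real extractor is far more careful.
    SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_body_text(html: str) -> str:
    """Return the text of an HTML page, one text node per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Fetching is left out on purpose: it depends on the search backend and on rate limits, which web_search.md covers.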

Generated text detection

  • [ ] text cleaning with Trafilatura
  • Binoculars
    • speed enhancement for large scale
  • get powerful processor
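
For orientation, a sketch of the Binoculars score on precomputed logits: one model's log-perplexity on the text divided by the cross-perplexity between two models, with lower scores pointing toward machine-generated text. Which model plays which role here (performer for perplexity, observer as the weighting distribution) follows my reading of the reference implementation and is an assumption to verify against it. It is pure Python for clarity, not speed.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def binoculars_score(
    token_ids: list[int],
    observer_logits: list[list[float]],
    performer_logits: list[list[float]],
) -> float:
    """Log-perplexity divided by cross-perplexity for one sequence.

    Position i holds each model's next-token logits and the token that
    actually appeared there. Role assignment is an assumption (see lead-in).
    """
    nll = 0.0            # summed negative log-likelihood of the observed tokens
    cross_entropy = 0.0  # summed cross-entropy between the two models
    for token, obs, perf in zip(token_ids, observer_logits, performer_logits):
        perf_logp = log_softmax(perf)
        obs_p = [math.exp(v) for v in log_softmax(obs)]
        nll -= perf_logp[token]
        cross_entropy -= sum(p * lp for p, lp in zip(obs_p, perf_logp))
    # Both sums run over the same positions, so the 1/length factors cancel.
    return nll / cross_entropy
```

At scale the same arithmetic would run batched on GPU tensors; the per-token loop above only shows what is being computed.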

Case studies

  • what are the websites w/ many “positive” pages
    • content farms w/ many ads (ad_extraction.md)
    • content farms selling products
    • false positives: forum/support
  • what are the “positive” pages
    • manual inspection
    • clustering
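
To find the websites with many “positive” pages, detection results can be grouped by host. A rough sketch, assuming each page comes as a (url, score) pair and that a score below some threshold counts as positive; the threshold value and the function name are placeholders, not tuned or taken from the repo.

```python
from collections import Counter
from urllib.parse import urlsplit


def positives_per_site(
    pages: list[tuple[str, float]], threshold: float
) -> list[tuple[str, int, int]]:
    """Return (host, positive_pages, total_pages), most positives first."""
    totals: Counter[str] = Counter()
    positives: Counter[str] = Counter()
    for url, score in pages:
        # hostname is lowercased by urlsplit; drop a leading "www." so both
        # spellings of a site are counted together.
        host = (urlsplit(url).hostname or "").removeprefix("www.")
        totals[host] += 1
        if score < threshold:  # assumption: lower score = "positive"
            positives[host] += 1
    rows = [(host, positives[host], totals[host]) for host in totals]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Hypothetical scores and threshold, for illustration only.
ranking = positives_per_site(
    [
        ("https://www.farm.example/a", 0.7),
        ("https://farm.example/b", 0.8),
        ("https://forum.example/t/1", 1.1),
    ],
    threshold=0.9,
)
```

Reporting the total next to the positive count helps separate a content farm (most pages positive) from a large forum with a handful of false positives.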

Development

  • clone w/ --recurse-submodules and remember to update submodules on pull (git submodule update --init --recursive)
  • use Rye to manage Python dependencies (rye sync, rye add)
    • note: some dependency versions for Binoculars (e.g. nvidia-cuda-runtime-cu12) are unfortunately hardcoded for Exxact; they need to be changed when running on another machine