Execution
(code and discussion have been made private; please ask Steven for repo access)
Preliminary
- run Binoculars over Common Crawl; check whether the results make sense (preliminary_binoculars_eval.md)
Keyword acquisition
- from Google Trends (google_trends.md)
- Google completion?
- classify into topics
  - get topic class from Google Trends
  - human brainstorm related keywords
Web searching and crawling
- search Google/Bing/Brave/Perplexity/ChatGPT/ChatNoir
- crawl top 10 results/references
- extract body text
  - DOM Distiller Reading Mode
  - Trafilatura
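
A minimal sketch of the body-text extraction step, assuming Trafilatura is the extractor and the page HTML has already been crawled. The helper names `fallback_extract` and `extract_body_text` are hypothetical, and the standard-library fallback is only a crude stand-in (it drops script/style/navigation tags), not an equivalent of Trafilatura or DOM Distiller.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect visible text, skipping tags that rarely hold body content."""

    SKIP = {"script", "style", "nav", "header", "footer", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fallback_extract(html):
    """Crude stdlib-only extraction (hypothetical helper, not Trafilatura)."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_body_text(html):
    """Prefer Trafilatura when installed; otherwise use the crude fallback."""
    try:
        import trafilatura
    except ImportError:
        return fallback_extract(html)
    # trafilatura.extract returns None when it finds no main content.
    return trafilatura.extract(html) or fallback_extract(html)


if __name__ == "__main__":
    page = "<html><body><nav>Menu</nav><p>Main text.</p></body></html>"
    print(extract_body_text(page))
```

Keeping the extractor behind one function should make it easier to compare Trafilatura against DOM Distiller output on the same pages later.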
Generated text detection
- text cleaning
- Binoculars
- speed enhancement for large scale
- get powerful processor
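
For reference, a toy sketch of how a Binoculars-style score can be computed once per-token probabilities from two language models are available. It follows the published definition as I understand it (log-perplexity divided by cross-perplexity); which model plays which role, the function names, and the toy inputs are all assumptions here, and the real pipeline would take these probabilities from LLM forward passes rather than hand-written lists.

```python
import math


def log_perplexity(token_ids, probs):
    """Mean negative log-probability that one model assigns to the observed tokens.

    probs[i] is that model's next-token distribution at position i.
    """
    return -sum(math.log(p[t]) for t, p in zip(token_ids, probs)) / len(token_ids)


def log_cross_perplexity(probs_a, probs_b):
    """Mean cross-entropy of model B's distribution under model A's distribution."""
    total = 0.0
    for pa, pb in zip(probs_a, probs_b):
        total += -sum(a * math.log(b) for a, b in zip(pa, pb) if a > 0)
    return total / len(probs_a)


def binoculars_score(token_ids, probs_m1, probs_m2):
    """Ratio of log-perplexity to log cross-perplexity.

    Lower scores suggest machine-generated text; the decision threshold
    has to be calibrated separately and is not shown here.
    """
    return log_perplexity(token_ids, probs_m1) / log_cross_perplexity(probs_m1, probs_m2)


if __name__ == "__main__":
    # Toy two-token vocabulary, three positions, both models identical.
    peaked = [[0.9, 0.1]] * 3
    print(binoculars_score([0, 0, 0], peaked, peaked))
```

In this toy case the text always picks the most likely token, so the score comes out well below 1, which is the direction a detector would read as "likely generated".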
Case studies
- what are the “positive” pages?
- manual inspection
- clustering
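
The clustering method is not decided in these notes. As one possible starting point, here is a minimal sketch that groups near-duplicate "positive" pages by Jaccard similarity over word shingles. The function names, the shingle size, and the 0.5 threshold are illustrative assumptions, not project choices.

```python
def shingles(text, k=3):
    """Set of k-word shingles for a text (the whole text if shorter than k words)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(texts, threshold=0.5):
    """Assign each text to the first cluster whose representative is similar enough.

    Returns a list of clusters, each a list of indices into `texts`.
    The first member of a cluster serves as its representative.
    """
    clusters = []  # list of (representative shingle set, member indices)
    for i, text in enumerate(texts):
        s = shingles(text)
        for rep, members in clusters:
            if jaccard(s, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((s, [i]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    pages = [
        "best cheap flights to paris book now",
        "best cheap flights to paris book today",
        "how to bake sourdough bread at home",
    ]
    print(greedy_cluster(pages))
```

This only catches templated or lightly reworded pages; topic-level grouping would need a different representation, and manual inspection of each cluster is still the main check.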