Execution
(Code and discussion have been taken private; please ask Steven for repo access.)
See also arguments.md and literature.md.
Preliminary
- run Binoculars over Common Crawl; see if results make sense (preliminary_binoculars_eval.md); edit: … and WikiHow search results
Keyword acquisition
- WikiHow article titles (wikihow.md)
- [ ] from Google Trends (google_trends.md)
- Google completion?
  - get topic class from Google Trends
- [ ] human brainstorm related keywords
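
A minimal sketch of turning article titles into search keywords, assuming titles follow the usual "How to ..." pattern. The function name `titles_to_keywords` and the normalization rules (strip the prefix, lowercase, dedupe) are illustrative assumptions, not what wikihow.md actually does.

```python
import re


def titles_to_keywords(titles):
    """Turn WikiHow-style titles into deduplicated, lowercased keywords.

    Hypothetical helper: the normalization rules are assumptions.
    """
    seen = set()
    keywords = []
    for title in titles:
        # Drop a leading "How to " (case-insensitive), if present.
        keyword = re.sub(r"^how to\s+", "", title.strip(), flags=re.IGNORECASE)
        # Collapse runs of whitespace and lowercase for deduplication.
        keyword = re.sub(r"\s+", " ", keyword).lower()
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords
```

Order of first appearance is kept, so the output can be fed to the search step in the same order as the source list.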
Web searching and crawling
web_search.md
- search Google/Bing/Brave/Perplexity/ChatGPT/ChatNoir
- crawl top 10 results/references
- extract body text
  - DOM Distiller Reading Mode
  - Trafilatura
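
The crawl-and-extract steps above could look roughly like this. It is a sketch under assumptions: the search step is taken to have already produced a ranked list of URLs, `crawl_and_extract` is a hypothetical name, and the defaults call Trafilatura's `fetch_url` and `extract`, which both return `None` on failure. The fetcher and extractor are passed in as callables so DOM Distiller (or a test stub) can be swapped in.

```python
from typing import Callable, Dict, Iterable, Optional


def _trafilatura_fetch(url: str) -> Optional[str]:
    import trafilatura  # imported lazily; only needed for the defaults
    return trafilatura.fetch_url(url)


def _trafilatura_extract(html: str) -> Optional[str]:
    import trafilatura
    return trafilatura.extract(html)


def crawl_and_extract(
    urls: Iterable[str],
    top_k: int = 10,
    fetch: Callable[[str], Optional[str]] = _trafilatura_fetch,
    extract: Callable[[str], Optional[str]] = _trafilatura_extract,
) -> Dict[str, str]:
    """Fetch the first top_k URLs and return {url: body text}."""
    texts = {}
    for url in list(urls)[:top_k]:
        html = fetch(url)
        if html is None:  # download failed
            continue
        text = extract(html)
        if text:  # skip pages with no extractable body
            texts[url] = text
    return texts
```

Pages that fail to download or yield no body text are silently dropped here; a real run would probably want to log them.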
Generated text detection
- [ ] text cleaning
  - cleaned by Trafilatura
- Binoculars
- speed enhancement for large scale
  - get a powerful processor
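
One cheap speed-up at scale is to score texts in batches rather than one at a time. The sketch below assumes a `score_batch` callable that maps a list of texts to a list of scores (a stand-in for however the repo wraps Binoculars). The names and the batch size are hypothetical, and the threshold has to be supplied by the caller. It relies only on lower Binoculars scores indicating more likely machine-generated text.

```python
def score_in_batches(texts, score_batch, batch_size=32):
    """Score texts in fixed-size batches; returns scores in input order.

    score_batch is an assumed callable: list[str] -> list[float].
    """
    scores = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        scores.extend(score_batch(batch))
    return scores


def flag_generated(scores, threshold):
    """Mark a text as "positive" when its score falls below the threshold."""
    return [score < threshold for score in scores]
```

Sorting texts by length before batching would reduce padding waste on a GPU, at the cost of having to restore the original order afterwards.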
Case studies
- what are those websites w/ many “positive” pages
  - content farms w/ many ads (ad_extraction.md)
  - content farms selling products
    - identify scam sellers
      - perhaps use the blacklist used by RefinedWeb
  - false positives: forums/support
- what are those “positive” pages
  - manual inspection
  - clustering
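
To find websites with many "positive" pages, detection results can be grouped by domain. This is a sketch, assuming results arrive as `(url, is_positive)` pairs. The function name and the `min_pages` cutoff are illustrative assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_domains_by_positive_rate(results, min_pages=2):
    """Return [(domain, positives, total)] sorted by positive rate, highest first.

    results: iterable of (url, is_positive) pairs (assumed input shape).
    Domains with fewer than min_pages crawled pages are skipped.
    """
    totals = Counter()
    positives = Counter()
    for url, is_positive in results:
        domain = urlparse(url).netloc.lower()
        totals[domain] += 1
        if is_positive:
            positives[domain] += 1
    ranked = [
        (domain, positives[domain], totals[domain])
        for domain in totals
        if totals[domain] >= min_pages
    ]
    # Sort by rate first, then by absolute count to break ties.
    ranked.sort(key=lambda row: (row[1] / row[2], row[1]), reverse=True)
    return ranked
```

The top of this ranking is a starting point for manual inspection; forum and support domains that show up there are candidates for the false-positive bucket.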
Development
- clone w/ `--recurse-submodules` and remember to update submodules on pull
  - automatically do these w/ these Git config
- use Rye to manage Python dependencies (`rye sync`, `rye add`)
  - note: some dependency versions for Binoculars, like `nvidia-cuda-runtime-cu12`, are unfortunately hardcoded for Exxact; need to change if used on other machines