Changelog
2025-11-11
- finished DB migration & bulk Git history cleanup
- re-statted search result sites and observed filtering too strict
2025-10-07
- tried decision tree/forest for filtering; default settings high FPR; tree uninterpretable
2025-09-29
- removed TimescaleDB after confirming PostgreSQL handled compression
2025-09-18
- re-crawling Common Crawl 10k subdomains; many rejected as non-articles
2025-09-09
- DB migration blocked by TimescaleDB slowness until adding index and fixing chunking
- re-crawling Common Crawl 10k subdomains remains slow after filter adjustments but likely correct
2025-07-23
- recomputed incorrect Binoculars scores
- re-downloaded Common Crawl index for uniform sampling and filtered IP hostnames
2025-07-08
- analyzed Common Crawl 10k subdomains: lower %AI overall, more when crawled later
- quantized Falcon-7B to fp8 and improved dynamic batching for Binoculars
- SVMs need only 15 pages per site; Binoculars SVMs highly confident on baseline
- enabled CDC-based dedup per site, removed unused Common Crawl downloads
2025-06-24
- NLP researchers hinted worse Binocular results from new models are due to model being less similar to ChatGPT-3.5
- removed Playwright thread bottleneck and scaled task spawning by available resources
- set up SQLite concurrent writes and Rust+Python codebase for maintenance
- building Common Crawl index DB faced gzip issues, SSL errors, and rate limiting
- sampled subdomains with 10 concurrent downloads; observed CPU-bound extraction and subdomain completion stats
- filtered by duplication rate using CDC
2025-06-03
- fixed low-hanging non-article filtering issues
- sampled subdomains from Common Crawl for classification
2025-05-06
- completed per-site deduplication and additional filtering tweaks
- decided against moving extraction and ad counting into the browser
2025-04-29
- improved ad counting by counting only elements with content and considered ads-per-token metric
- debugged asyncio crawler issues; noted resource balancing contribution
- fixed false positive on www.w3docs.com by filtering code text
2025-04-15
- settled on decile-based feature vectors
- sampled subdomains via how-to queries to broaden coverage
- generated positive websites for baseline
- built website feature vector SVM with train/test split
2025-04-08
- surveyed website generator capabilities
2025-03-25
- expanded AI and human website datasets, including IndieWeb sources
2025-02-24
- confirmed high-score pages in low-score sites rarely good
- subdomain SVMs for statistical scoring
- reviewed blacklist matching, sites near threshold, and WHOIS checks
2025-02-11
- switched to Wayback Machine prefix search instead of recursive crawling