Changelog

2025-12-03

tried decision tree/forest for filtering; default settings gave a high false-positive rate (FPR), and the tree was not interpretable

DB migration blocked by TimescaleDB slowness; unblocked after adding an index and fixing chunking
re-crawl of the Common Crawl 10k subdomains remains slow after filter adjustments, but results are likely correct

analyzed Common Crawl 10k subdomains: lower %AI overall, higher when crawled later
quantized Falcon-7B to fp8 and improved dynamic batching for Binoculars
SVMs need only 15 pages per site; Binoculars SVMs highly confident on baseline
enabled CDC-based dedup per site, removed unused Common Crawl downloads
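
Sketch for the dynamic batching note above. This is one common way to batch variable-length inputs for a model like Falcon-7B, not necessarily what the crawler does: sort texts by token length and pack them so that the padded size (batch size times longest sequence) stays under a budget. The function name and the max_tokens parameter are hypothetical.

```python
def token_budget_batches(lengths: list[int], max_tokens: int) -> list[list[int]]:
    """Group sequence indices into batches whose padded size fits max_tokens.

    Padded size is (number of sequences) * (longest sequence in the batch).
    A single sequence longer than max_tokens still gets its own batch.
    """
    # Sorting by length keeps similar lengths together and limits padding.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches: list[list[int]] = []
    current: list[int] = []
    current_max = 0
    for i in order:
        new_max = max(current_max, lengths[i])
        if current and new_max * (len(current) + 1) > max_tokens:
            # Adding this sequence would exceed the budget: close the batch.
            batches.append(current)
            current = []
            new_max = lengths[i]
        current.append(i)
        current_max = new_max
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(token_budget_batches([5, 100, 7, 90, 6], max_tokens=200))
```

Short pages end up batched together in large groups, while long pages run in small batches, which keeps memory use roughly constant per batch.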

NLP researchers suggested that worse Binoculars results on newer models are due to those models being less similar to ChatGPT-3.5
removed Playwright thread bottleneck and scaled task spawning by available resources
set up SQLite concurrent writes and Rust+Python codebase for maintenance
building the Common Crawl index DB ran into gzip issues, SSL errors, and rate limiting
sampled subdomains with 10 concurrent downloads; extraction was CPU-bound; recorded subdomain completion stats
filtered by duplication rate using CDC
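
Sketch of a per-site duplication rate based on content-defined chunking (CDC). This is a minimal illustration under stated assumptions, not the crawler's actual code: it uses a Gear-style rolling hash, the chunk size constants are arbitrary, and all names (cdc_chunks, duplication_rate) are hypothetical.

```python
import hashlib
import random

# Assumed parameters, chosen only for illustration.
MIN_CHUNK = 32          # bytes; also ensures the 32-bit hash window is full
MAX_CHUNK = 256         # bytes; hard upper bound on chunk length
MASK = (1 << 6) - 1     # boundary roughly every 64 bytes past MIN_CHUNK

# Fixed pseudo-random table so chunk boundaries are reproducible.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split data where a rolling hash of recent bytes hits a bit pattern."""
    chunks: list[bytes] = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear hash: older bytes shift out of the 32-bit state over time.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (h >> 20) & MASK == 0
        if at_boundary or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def duplication_rate(pages: list[str]) -> float:
    """Fraction of bytes on a site that fall in already-seen chunks."""
    seen: set[bytes] = set()
    total = 0
    duplicated = 0
    for text in pages:
        for chunk in cdc_chunks(text.encode("utf-8")):
            digest = hashlib.sha1(chunk).digest()
            total += len(chunk)
            if digest in seen:
                duplicated += len(chunk)
            else:
                seen.add(digest)
    return duplicated / total if total else 0.0
```

A site could then be filtered when duplication_rate(pages) exceeds some threshold; the threshold itself would need tuning and is not given in these notes.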

improved ad counting by counting only elements with content; considered an ads-per-token metric
debugged asyncio crawler issues; noted resource balancing contribution
fixed false positive on www.w3docs.com by filtering code text
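
Sketch of the ads-per-token idea from the first note in this group, assuming ad candidates arrive as a list of their extracted content strings and that tokens are whitespace-separated words. Both are simplifying assumptions, and the function name is hypothetical.

```python
def ads_per_token(ad_contents: list[str], page_text: str) -> float:
    """Count ad elements that have non-empty content, per page token."""
    # Empty or whitespace-only ad containers are ignored.
    ads = sum(1 for content in ad_contents if content.strip())
    tokens = len(page_text.split())
    return ads / tokens if tokens else 0.0


if __name__ == "__main__":
    print(ads_per_token(["<img src='ad.png'>", "", "   "], "a b c d"))
```

Normalizing by token count keeps long pages with many ad slots from looking worse than short pages with the same ad density.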