Classifying Websites

SVM website classifier

classifying/site_svm.py

  • webpage filtering
    • filter by a URL path regex to rid index page, tag page, etc.
    • filter by Content-Type: text/html
    • filter by English & #tokens > 200 (filter_non_article.md)
  • compute 11 Binoculars score deciles among webpage for each website
    • also tried 101 percentiles, 9 deciles, 5 quartiles, 3 quartiles; little difference
  • train linear SVM classifier on deciles as feature vector
    • train on company/personal website dataset (baseline_sites.md)
    • out-of-distribution test on personal/company/other website dataset
    • perfect performance in every combination
    • when comparing whether filter by beforeGPT, perfect performance except if trained on company deciles/quartiles beforeGPT and tested on non-beforeGPT (98.3% accuracy)
  • ⇒ aggregate Binoculars score analysis perform well regardless of the noise in data (e.g., boilerplate page)
    • generalize across different kinds of website

Applying in the wild

  • full SVM model: train SVM on all baseline website, w/ 9 deciles (classifying/full_site_svm.py)

Crawling websites for classification

browser/bing_search.py

  • Sample 1000 WikiHow article how-to questions as queries by SHA256.
  • Search Bing API for 20 results for each query.
  • For each result link, extract subdomain.
  • For each subdomain, fetch the sitemap.
    • If no sitemap, fetch last 2000 pages from Wayback Machine CDX.
  • Randomly sample pages from the sitemap to crawl until 20 non-filtered.
    • respect robots.txt
    • give up on 3 error connecting, e.g., DNS resolution failure, connection timeout