Classifying Websites

SVM website classifier

classifying/site_svm.py

webpage filtering
- filter by a URL path regex to drop index pages, tag pages, etc.
- filter by Content-Type: text/html
- keep only English pages with more than 200 tokens, etc. (filter_non_article.md)
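
The three filters above can be sketched as a single predicate. This is only an illustration: the regex, the `is_article` name, the whitespace tokenization, and the `is_english` flag (assumed to come from some external language detector) are assumptions, not the project's actual rules, which live in filter_non_article.md.

```python
import re

# Hypothetical pattern: the notes do not give the actual regex.
# It rejects the site root and common listing paths such as /tag/... or /category/...
NON_ARTICLE_PATH = re.compile(
    r"^/?$|/(tag|tags|category|categories|page|author)(/|$)", re.IGNORECASE
)
MIN_TOKENS = 200  # threshold taken from the notes


def is_article(path: str, content_type: str, text: str, is_english: bool) -> bool:
    """Return True if a fetched page passes all three filters."""
    if NON_ARTICLE_PATH.search(path):
        return False
    # Content-Type headers may carry parameters, e.g. "text/html; charset=utf-8".
    if content_type.split(";")[0].strip().lower() != "text/html":
        return False
    # Whitespace splitting is a stand-in for whatever tokenizer the project uses.
    return is_english and len(text.split()) > MIN_TOKENS
```

For example, `is_article("/tag/python", "text/html", text, True)` is rejected by the path filter regardless of the page length.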
compute 9 Binoculars score deciles over the webpages of each website
- also tried 101 percentiles, 11 deciles, 5 quartiles, 3 quartiles; little difference
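
A minimal sketch of the per-website feature vector, assuming the 9 deciles are the interior cut points (10th through 90th percentile) of the page scores. The function name `decile_features` and the choice of the "inclusive" interpolation method are assumptions; the notes do not say how the deciles are interpolated.

```python
from statistics import quantiles


def decile_features(scores: list[float]) -> list[float]:
    """Return the 9 interior decile cut points of one website's page scores."""
    # n=10 splits the data into 10 groups, which yields 9 cut points.
    # method="inclusive" treats the scores as the whole population (assumption).
    return quantiles(scores, n=10, method="inclusive")
```

Changing `n` gives the other variants in the same way, e.g. `n=4` for 3 interior quartiles; variants that also include the minimum and maximum would need those two values appended separately.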
train linear SVM classifier on deciles as feature vector
- train on company/personal website dataset (baseline_sites.md)
- out-of-distribution test on personal/company/other website dataset
- perfect performance in every combination
- comparing with and without the beforeGPT filter: perfect performance except when trained on company deciles/quartiles with beforeGPT and tested on non-beforeGPT (98.3% accuracy)
⇒ aggregate Binoculars score analysis performs well regardless of noise in the data (e.g., boilerplate pages)
- generalizes across different kinds of websites
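
The training step can be sketched as follows, assuming scikit-learn and using `SVC(kernel="linear")` as the linear SVM; the notes do not name the library or estimator used in classifying/site_svm.py. The decile values, labels, and the `make_site` helper are invented for illustration and are not results from the experiments.

```python
from sklearn.svm import SVC


def make_site(low: float, step: float = 0.02) -> list[float]:
    """Build an invented, increasing 9-decile vector starting at `low`."""
    return [round(low + step * i, 4) for i in range(9)]


# Invented training data: one row per website, 9 deciles per row.
# Label 1 sites have lower scores than label 0 sites (hypothetical labels).
X_train = [make_site(0.60), make_site(0.65), make_site(0.70),
           make_site(1.00), make_site(1.05), make_site(1.10)]
y_train = [1, 1, 1, 0, 0, 0]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

# Held-out websites, also invented, standing in for an out-of-distribution set.
X_test = [make_site(0.62), make_site(1.08)]
predictions = clf.predict(X_test).tolist()
```

Because each website is reduced to a fixed-length vector, the same classifier can be applied to any website with enough crawled pages, independent of how many pages it has.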

full SVM model: train SVM on all baseline websites, w/ 9 deciles (classifying/full_site_svm.py)

browser/bing_search.py

Sample 1000 WikiHow article how-to questions as queries by SHA256.
Search Bing API for 20 results for each query.
For each result link, extract subdomain.
For each subdomain, fetch the sitemap.
- If no sitemap, fetch last 2000 pages from Wayback Machine CDX.
Randomly sample pages from the sitemap to crawl until 20 pass the filters.
- respect robots.txt
- give up after 3 connection errors, e.g., DNS resolution failure, connection timeout
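
The offline parts of this pipeline can be sketched with the standard library. Several points are assumptions rather than facts from browser/bing_search.py: "by SHA256" is read as taking the questions with the smallest SHA256 digests, the "subdomain" is taken to be the full host of the link, the 3 errors are counted in total rather than consecutively, and all function names plus the `fetch` and `is_article` callables are hypothetical.

```python
import hashlib
import random
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def sample_by_sha256(questions: list[str], k: int) -> list[str]:
    """Deterministically pick the k questions with the smallest SHA256 digests."""
    def digest(q: str) -> str:
        return hashlib.sha256(q.encode("utf-8")).hexdigest()
    return sorted(questions, key=digest)[:k]


def subdomain_of(url: str) -> str:
    """Return the lowercased host of a result link, e.g. 'blog.example.com'."""
    return (urlsplit(url).hostname or "").lower()


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def crawl_site(urls, fetch, is_article, target=20, max_errors=3, seed=0):
    """Visit URLs in random order until `target` pages pass the filter.

    `fetch(url)` returns the page and raises OSError on connection problems
    (DNS failures and timeouts are OSError subclasses in Python).
    """
    order = list(urls)
    random.Random(seed).shuffle(order)
    kept, errors = [], 0
    for url in order:
        if len(kept) >= target or errors >= max_errors:
            break
        try:
            page = fetch(url)
        except OSError:
            errors += 1  # counted in total, not consecutively (assumption)
            continue
        if is_article(page):
            kept.append(url)
    return kept
```

Hash-based sampling keeps the query set reproducible across runs without storing a random seed, which is one plausible reason for choosing it.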