Preliminary Binoculars Evaluation

  • dump English HTTP responses from a random Common Crawl WARC file
  • extract the main text; if longer than 2,000 characters, feed it into Binoculars
  • split long pages to fit the Falcon-7B context window (2,048 tokens)
    • currently, simply truncate
    • average chunk scores weighted by token count
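The chunking and token-weighted averaging above can be sketched as follows; `chunk_by_tokens` and the `score_fn` callback are illustrative stand-ins, not the actual Binoculars or tokenizer API:

```python
def chunk_by_tokens(tokens, max_len=2048):
    """Split a token sequence into consecutive chunks of at most max_len tokens
    (the Falcon-7B context window)."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def weighted_score(chunks, score_fn):
    """Average per-chunk scores, weighted by chunk token count, so a short
    trailing chunk does not count as much as a full-length one."""
    total = sum(len(c) for c in chunks)
    return sum(score_fn(c) * len(c) for c in chunks) / total
```

With truncation (the current behavior), only the first chunk would be scored; the weighted average is the intended treatment once splitting is enabled.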

script: degentweb.common_crawl.classify_english

results dumped to data/common_crawl/prelim_test/; 5,961 pages; took 1h 20m

Manual inspection of low-score pages among 1,010 Common Crawl pages

low-score pages:

  • ~19 indeed seem to be generated articles, mostly blog & product pages
  • 3 are simple table-like listings
  • 7 are short listing/interaction pages (scoreboards, links; only 2 are > 2,000 characters)
  • boilerplate text: 7 cookie banners (< 2,000 characters); 1 legal notice
  • 1 seems to be a false positive: https://catalyticconvertersolutions.com/writer/nicolas-will/

lower-score pages (above the FPR threshold but around the F1 threshold):

  • many also seem generated, though some do not:
    • 1 page is an error message (MySQL)
    • boilerplate text (legal)

Case studies

Problem

  • listing/interaction pages should not be in this study
    • cannot simply filter by markup ratio (links, buttons, etc.) bc some attach large description blocks
    • ❓ train an ML model to distinguish articles
      • could be based on BERT or BART
      • cannot ask an LLM bc unreliable (tested)
    • <meta property="og:type" content="article">, but not every article has this; tested, unreliable
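The og:type check considered above can be sketched with the stdlib HTML parser (illustrative only; as noted, the signal turned out to be unreliable because many articles omit the tag):

```python
from html.parser import HTMLParser

class OgTypeParser(HTMLParser):
    """Collect the content of <meta property="og:type" content="..."> if present."""
    def __init__(self):
        super().__init__()
        self.og_type = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if d.get("property") == "og:type":
                self.og_type = d.get("content")

def is_og_article(html):
    """True iff the page declares itself an article via Open Graph metadata."""
    parser = OgTypeParser()
    parser.feed(html)
    return parser.og_type == "article"
```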

Google & Bing search: 500 WikiHow articles

scripts: degentweb.browser.bing_search, degentweb.classifying.google_prelim, degentweb.classifying.prelim_data_analysis

Site crawling

script: degentweb.browser.visit_subdomains

  1. for each subdomain shown in the search results, if < 20 pages crawled, try finding a sitemap w/ ultimate-sitemap-parser
  2. if no sitemap, query the Wayback Machine (WM) CDX (Content Index) API for the last 2000 OK HTML responses from the last 4 years
  3. if any pages are found, crawl (20 − #already_crawled) pages
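Steps 2 and 3 above can be sketched as follows. The sitemap lookup (step 1) is omitted; ultimate-sitemap-parser's documented entry point is `usp.tree.sitemap_tree_for_homepage`. The CDX parameter names follow the public Wayback CDX API, but the exact query the script issues is an assumption:

```python
from datetime import date
from urllib.parse import urlencode

CRAWL_TARGET = 20  # desired pages per subdomain

def pages_to_crawl(already_crawled):
    """Step 3: crawl only up to the remaining per-subdomain budget."""
    return max(0, CRAWL_TARGET - already_crawled)

def cdx_query_url(domain, limit=2000, years=4):
    """Step 2: build a CDX query for the last `limit` OK HTML captures
    within the last `years` years. A negative limit asks the CDX API
    for the most recent matches rather than the oldest."""
    params = [
        ("url", f"{domain}/*"),
        ("output", "json"),
        ("filter", "statuscode:200"),
        ("filter", "mimetype:text/html"),
        ("from", str(date.today().year - years)),
        ("limit", str(-limit)),
    ]
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)
```

Passing the parameters as a list of tuples (rather than a dict) keeps the two repeated `filter` keys, which the CDX API combines with AND.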