Filter out Non-Articles

Binoculars, and presumably other generated-text detectors, can only perform well when run on articles such as blog posts. Non-articles, such as lists of links on a homepage, login pages, dashboards of sports scores, and product spec charts, may cause high false positive rates.

Current method to filter out non-articles (visit_subdomains.py):

  • use “how-to” search results; most results are articles and forum threads, with only the occasional product sales page
  • discard pages w/ < 200 tokens
  • discard pages w/ > 20% text in link/code
  • discard pages w/o any “block” > 250 characters long
    • a block is the text of a paragraph, list item, or other non-container HTML element
    • a page with no long block likely does not consist of paragraphs of prose
  • discard pages w/ < 20% of text in “large blocks”
    • a large block is ≥ 100 characters
  • discard pages w/ > 40% text in list/table
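The thresholds above can be sketched as a single predicate. This is an illustrative reconstruction, not the actual code of visit_subdomains.py: the block representation, the whitespace-split token count, and the function name are assumptions.

```python
# Hypothetical sketch of the non-article filters above.
# A block is (text, kind), where kind is the HTML element name
# ('p', 'li', 'td', ...). link_chars/code_chars are the number of
# characters that appear inside link or code elements.

def is_article(blocks, link_chars=0, code_chars=0):
    total = sum(len(text) for text, _ in blocks)
    if total == 0:
        return False
    # crude token count: whitespace-split words
    tokens = sum(len(text.split()) for text, _ in blocks)
    if tokens < 200:                                   # too little text
        return False
    if (link_chars + code_chars) / total > 0.20:       # link/code heavy
        return False
    if max(len(text) for text, _ in blocks) < 250:     # no long block
        return False
    large = sum(len(text) for text, _ in blocks if len(text) >= 100)
    if large / total < 0.20:                           # few large blocks
        return False
    list_table = sum(len(text) for text, kind in blocks
                     if kind in ("li", "td", "th"))
    if list_table / total > 0.40:                      # list/table heavy
        return False
    return True
```

A page of long paragraphs passes; a page of fifty short link texts fails the token and block-length checks.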

Ideas for filtering:

  • discard pages based on HTML structure
    • but HTML structure is extremely flexible and renders differently depending on CSS
    • perhaps study how Trafilatura filters out non-main-body text
  • feed extracted text to classifier
    • BERT-based classifier for article/non-article

Unreliable ideas:

  • filter by og:type = article: unreliable because not every article has this tag, and a content farm may set it to “website” even on article pages
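For reference, reading the tag itself is trivial; the unreliability is entirely in how sites populate it. A stdlib-only sketch (class and function names are illustrative):

```python
# Extract the og:type value from an HTML document, if present.
from html.parser import HTMLParser

class OGTypeParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.og_type = None

    def handle_starttag(self, tag, attrs):
        # <meta property="og:type" content="..."> is a void element,
        # so it arrives via handle_starttag
        if tag == "meta":
            d = dict(attrs)
            if d.get("property") == "og:type":
                self.og_type = d.get("content")

def og_type(html: str):
    p = OGTypeParser()
    p.feed(html)
    return p.og_type  # None when the page lacks the tag
```

Returning None for pages without the tag is exactly the failure mode noted above: many real articles would be discarded by an og:type filter.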

Deduplication

  • Finding Similar Files in a Large File System, Udi Manber, USENIX Winter, 1994
    • use a Rabin fingerprint (polynomial fingerprint) to pseudo-randomly pick “anchors” for chunking
      • polynomial fingerprints because the hash of the next n-byte window can be computed efficiently after shifting by 1 byte
    • find identical chunks
  • Simhash
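Manber's anchoring idea can be sketched with a rolling hash. For brevity this uses a simple base/modulus polynomial rolling hash rather than Rabin's irreducible-polynomial fingerprint; the window size and anchor mask are arbitrary illustrative choices:

```python
# Content-defined chunking in the spirit of Manber (1994).
# A rolling hash over a sliding n-byte window marks "anchors"
# (positions whose hash matches a bit pattern), so identical content
# produces identical chunk boundaries even after insertions elsewhere.

def chunk(data: bytes, window: int = 16, mask: int = 0xFF):
    B, M = 257, (1 << 31) - 1       # hash base and modulus
    Bw = pow(B, window, M)          # weight of the byte leaving the window
    h, start, chunks = 0, 0, []
    for i, byte in enumerate(data):
        h = (h * B + byte) % M      # shift window right by one byte
        if i >= window:
            h = (h - data[i - window] * Bw) % M
        # anchor: low bits of the window hash match the mask
        if i >= window - 1 and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Identical chunks across files can then be found by hashing each chunk and comparing the hash sets; Simhash instead produces one near-duplicate-tolerant signature per document.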

Segmentation for generalization

Instead of only filtering out what we cannot handle (non-articles), we can also try to generalize our method to all webpages.

If we could split a webpage into sections, we could apply Binoculars to each section. For example, we could segment a homepage full of links into the individual links it contains, then filter those out by length; similarly, we could segment a forum page into individual comments.

Importantly, we need to segment after main text extraction with Trafilatura.
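A minimal sketch of that pipeline stage, assuming Trafilatura's plain-text output separates blocks with blank lines (the `detect` callable stands in for a Binoculars score and the length threshold is illustrative):

```python
# Score each section of extracted main text separately,
# rather than the page as a whole.

def segment(text: str, min_chars: int = 250):
    """Split extracted text on blank lines; keep sections long
    enough for the detector to score reliably."""
    sections = [s.strip() for s in text.split("\n\n")]
    return [s for s in sections if len(s) >= min_chars]

def score_page(text: str, detect):
    # detect: callable mapping a text section to a detector score
    return [(section, detect(section)) for section in segment(text)]
```

Short sections (navigation links, one-line comments) fall below the threshold and are dropped, which is the same length-based filtering as above, applied per section instead of per page.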

Webpage segmentation (WPS) tool