Filter out Non-Article
Binoculars, and presumably other generated text detectors, can only perform well when detecting articles such as blog posts. Non-articles, include lists of links on a homepage, login pages, dashboards of sports scores, and spec charts of products, may cause high false positive rates.
Current method to filter out non-articles (visit_subdomains.py
):
- use how-to search results. most of them are articles and forum threads, except for occasional product sales page
- discard pages w/ < 200 tokens
- discard pages w/ > 20% text in link/code
- Trafilatura sometimes extract wrong thing
- extract many link following main text, e.g. https://entrance-exam.net/annamalai-university-pg-diploma-in-automobile-maintenance-auto-garage-management-papers/
- extract only code block, e.g. for https://plaid.com/institutions/mos/
- ⇒ try Trafilatura w/
favor_precision
instead offavor_recall
if too much text in link/code
- Trafilatura sometimes extract wrong thing
- discard pages w/o any “block” > 250 characters long
- a block is a paragraph, list item, or other non-container HTML element text
- likely not consist of paragraph
- discard pages w/ < 20% text in “large blocks”
- large block need ≥100 characters
- discard pages w/ > 40% text in list/table
Ideas for filtering:
- discard pages based on HTML structure
- but HTML structure is extremely flexible and behave differently based on CSS
- perhaps study how Trafilatura filter out non-main-body text
- feed extracted text to classifier
- BERT-based classifier for article/non-article
Unreliable ideas:
- filter by
og:type
=article
. unreliable because not every article has this tag and content farm may have this tag bewebsite
Deduplication
- Finding Similar Files in a Large File System, Udi Manber, USENIX Winter, 1994
- use Rabin fingerprint (polynomial fingerprint) to randomly determine “anchor” for chunking
- polynomial fingerprint bc can efficiently compute next n-byte hash after shifting 1 byte
- find identical chunks
- use Rabin fingerprint (polynomial fingerprint) to randomly determine “anchor” for chunking
- Simhash
Segmentation for generalization
Instead of only filtering out what we cannot handle (non-article), we can also try to generalize our method to all webpages.
If we could split a webpage into sections, we could apply Binoculars to each section. For example, we could segment a homepage full of links into each of the links it contains, and subsequently filter them out based on length; we could segment a forum page into each comment.
Importantly, we need to segment after main text extraction with Trafilatura.
Webpage segmentation (WPS) tool
- Web page segmentation with structured prediction and its application in web page classification, Lidong Bing, Rui Guo, Wai Lam, Zheng-Yu Niu, Haifeng Wang, SIGIR, 2014
- too old
- webpage class: index, image, forum, product, research result, blog, download, news, video, other
- Box clustering segmentation: A new method for vision-based web page preprocessing, Jan Zeleny, Radek Burget, Jaroslav Zendulka, Information Processing Management, Elsevier, 2017
- vision-based
- no code
- Multimodal Web Page Segmentation Using Self-organized Multi-objective Clustering, Srivatsa Ramesh Jayashree, Gaël Dias, Judith Jeyafreeda Andrew, Sriparna Saha, Fabrice Maurel, Stéphane Ferrari, TOIS, 2022
- MCS: automatically choose #segment using K-means
- no code?
- Web Page Segmentation Revisited: Evaluation Framework and Dataset, Johannes Kiesel, Florian Kneist, Lars Meyer, Kristof Komlossy, Benno Stein, Martin Potthast, CIKM, 2020
- ⭐ Webis-WebSeg-20 dataset for WPS benchmark
- WebSAM-Adapter: Adapting Segment Anything Model for Web Page Segmentation, Bowen Ren, Zefeng Qian, Yuchen Sun, Chao Gao, Chongyang Zhang, Advances in Information Retrieval, Springer ECIR, 2024
- segment any model vision-based SoTA webpage segmentation
- use Webis-WebSeg-20 dataset for eval
- did not cite Multimodal Web Page Segmentation
- A DOM-structural cohesion analysis approach for segmentation of modern web pages, Hieu Huynh, Quoc-Tri Le, Vu Nguyen, Tien Nguyen, Springer World Wide Web, 2025
- use DOM structure analysis
- use Webis-WebSeg-20 dataset for eval
- did not cite WebSAM-Adapter