Ad Extraction
extracting&counting ad may help identify content farm
Options for methods
- (currently implemented) apply EasyList CSS selectors on HTML (browser, JS enabled) using
lxml
or similar - Huanchen idea: turn Ad blocker on/off in browser & diff visual element
- use uBlock Origin logger when running browser
- use uBlock-Origin-compatible rule parser (e.g., adblock-rust) & somehow
Browser operation
to load all ad, scroll to the bottom and top, 20 time combined, each time wait for networkidle
(in degentweb.browser.save_page
)
- some ad load slowly
- some ad load after a timeout
- some ad load after user interaction
re-crawled all 0-ad page after adjusting browser interaction w/ degentweb.browser.recrawl_no_ads
Ad classification
see Analyzing the (In)Accessibility of Online Advertisements
most should be Google Ads