Web Crawling

The Prevalence of Single Sign-On on the Web: Towards the Next Generation of Web Content Measurement, Calvin Ardi, Matt Calder, IMC, 2023
- automate SSO login w/ Google/Facebook, etc. to access gated content
- 58% of the top 10K websites support SSO
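
A minimal sketch of how a crawler might spot SSO options on a login page before attempting an automated login. This is not the paper's method: the provider list, the button-text pattern, and the names `IDP_HOSTS`, `BUTTON_TEXT`, `SSOFinder`, and `find_sso_providers` are all assumptions made for illustration, and real pages often render SSO buttons with JavaScript that a static parse would miss.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed identity-provider hosts; a real crawler would need a longer list.
IDP_HOSTS = {
    "accounts.google.com": "Google",
    "www.facebook.com": "Facebook",
    "appleid.apple.com": "Apple",
}

# Assumed button wording, e.g. "Sign in with Google" or "Continue with Facebook".
BUTTON_TEXT = re.compile(
    r"(?:sign\s*in|log\s*in|continue)\s+with\s+(google|facebook|apple)",
    re.IGNORECASE,
)


class SSOFinder(HTMLParser):
    """Collects SSO providers hinted at by link targets or visible text."""

    def __init__(self):
        super().__init__()
        self.providers = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parsed = urlparse(href)
        # Require "auth" in the path so plain social-media links are ignored.
        if parsed.hostname in IDP_HOSTS and "auth" in parsed.path.lower():
            self.providers.add(IDP_HOSTS[parsed.hostname])

    def handle_data(self, data):
        match = BUTTON_TEXT.search(data)
        if match:
            self.providers.add(match.group(1).capitalize())


def find_sso_providers(html: str) -> set:
    """Return the set of provider names detected in static HTML."""
    finder = SSOFinder()
    finder.feed(html)
    finder.close()
    return finder.providers
```

A heuristic like this only flags candidate sites; completing the login still needs a real browser session with provider credentials.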
On Landing and Internal Web Pages: The Strange Case of Jekyll and Hyde in Web Performance Measurement, Waqar Aqeel, Balakrishnan Chandrasekaran, Anja Feldmann, Bruce M. Maggs, IMC, 2020
- landing pages differ from internal pages
- proposes a method to find representative internal pages
System to Identify and Elide Superfluous JavaScript Code for Faster Webpage Loads, Utkarsh Goel, Moritz Steiner, arXiv, 2020
- passive real user monitoring (RUM) system in a CDN proxy
- on the median page, 31% of JS code is superfluous and can be removed for a 5% speedup
Browser-based Crawling of News Websites Behind Paywalls, IFLA News Media Section & IIPC Workshop on Archiving News Media, 2025
- Danish, Luxembourgish, and Finnish crawls rely on legal deposit law
- ask the site owner for login credentials; legal deposit law obliges the company to comply
  - ask for IP-based paywall bypass; better than login
  - send a cookie to bypass the cache when using the IP-based paywall bypass
- Heritrix job queue for non-browser crawling
- Browsertrix browser automation
  - show preview; designed for non-technical user
  - pre-crawl validation of browser “profile” to ensure logged in
  - Selenium + Sikuli + weekly manual profile check
  - fork to crawl news frequently and dedup
    - hash page after removing dynamic element w/ regex 🤣
- maintain a shared document of paywall info
- “student helper” checks periodically & submits tickets
- https://github.com/iipc/awesome-web-archiving
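
The dedup step noted above (hash the page after removing dynamic elements with regex) could look roughly like this. It is a sketch under assumptions, not the workshop's actual implementation: the patterns in `DYNAMIC_PATTERNS` and the helper names `content_hash` and `is_duplicate` are hypothetical, and a real news site needs patterns tuned to its own markup.

```python
import hashlib
import re

# Hypothetical patterns for page parts that change on every fetch.
DYNAMIC_PATTERNS = [
    re.compile(r"<script\b.*?</script>", re.DOTALL | re.IGNORECASE),
    re.compile(r"<!--.*?-->", re.DOTALL),
    re.compile(r'\b(?:nonce|csrf[-_]?token)="[^"]*"', re.IGNORECASE),
    re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?"),
]


def content_hash(html: str) -> str:
    """Strip dynamic fragments, normalize whitespace, and hash the rest."""
    for pattern in DYNAMIC_PATTERNS:
        html = pattern.sub("", html)
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_duplicate(html: str, seen: set) -> bool:
    """Return True if an equivalent page was already seen; else record it."""
    digest = content_hash(html)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

With frequent recrawls, `seen` would hold the hashes of already-archived captures so that only changed articles are stored. Regex stripping is brittle against markup changes, which is presumably what the 🤣 is about.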

Steven Hé (Sīchàng)
