Common Crawl Dataset

Common Crawl index

Formats

path file e.g.: https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-18/cc-index.paths.gz

index file e.g.: https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2025-21/indexes/cdx-00002.gz

line in index file e.g. Languages are ,-separated, English is eng:

ar,com,fusionbikes)/producto/luz-knog-blinder-mini-niner 20250512054001 {"url": "https://fusionbikes.com.ar/producto/luz-knog-blinder-mini-niner/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "ZA6MPPJEFGZ53KWFSLWMF7NDO7RKOWR4", "length": "78501", "offset": "243658810", "filename": "crawl-data/CC-MAIN-2025-21/segments/1746990412231.65/warc/CC-MAIN-20250512042640-20250512072640-00410.warc.gz", "charset": "UTF-8", "languages": "spa"}

line in index file with redirect:

ar,com,futbolinterior)/component/mailto?link=b2449dac9dd9dbc27fe2d9f8c9dfc250a10eb664&template=shaper_helix3&tmpl=component 20250516140105 {"url": "http://futbolinterior.com.ar/component/mailto/?tmpl=component&template=shaper_helix3&link=b2449dac9dd9dbc27fe2d9f8c9dfc250a10eb664", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "OQ3OBNFR7DBCFQ4ANGEQW5P2FOXZZJRA", "length": "1103", "offset": "391196", "filename": "crawl-data/CC-MAIN-2025-21/segments/1746990412530.66/crawldiagnostics/CC-MAIN-20250516130253-20250516160253-00626.warc.gz", "redirect": "https://futbolinterior.com.ar/component/mailto/?tmpl=component&template=shaper_helix3&link=b2449dac9dd9dbc27fe2d9f8c9dfc250a10eb664"}

filtering: mime is text/html, status is 200, languages contains eng.

ideal compression (468B → ~80B for this file, ~17%):

subdomain,path,length,offset,filename
fusionbikes.com.ar,producto/luz-knog-blinder-mini-niner/,20250512054001,78501,243658810,CC-MAIN-2025-21/segments/1746990412231.65/warc/CC-MAIN-20250512042640-20250512072640-00410.warc.gz
# after storing subdomain and filename in other tables, 68 bytes for this entry
4B, producto/luz-knog-blinder-mini-niner/,20250512054001,4B, 4B,  4B

Size and storage estimate

  • 44 path files from 2020 to May 2025 (very small)
  • ~301 index files per path, ~783MB per file, ~5.8GB uncompressed
  • ~230GB of index files under each path file, ~10.1TB total (too much)
  • assume 5% entries left after filtering, 17% compression rate after index
    • ⇒ ~14.9GB per path file, ~657GB total (managable)
  • only keep 2000 entries per subdomain, ⇒ est. <10% size, <66GB (very fine)
    • throw away existing entries w/ largest BLAKE3 64bit hash number