Baseline Websites

curl/seed_db.py

Human-written

  • credible source

  • personal website

    • (30x) IndieWeb Wiki: registry of personal websites
      • after filtering for sites with a sitemap, most are tech blogs
    • WordPress directory?
  • Tranco low-rank blogs: mostly dead or non-English, unsuitable

  • company website

    • EDGAR database → company name → search for website: many of them do not have a website
    • US Business Database? no website link
    • LinkedIn? forbids crawling
    • (30x) Russell 2000
      • many do not have a blog; many postdate ChatGPT; many have no sitemap
      • after filtering for sites with a sitemap, most are tech companies
      • content: mostly blog posts/company statements (news); some service/product descriptions; a few functional pages, e.g., forms
  • find .*/blog/.* URL in CommonCrawl?
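
A minimal sketch of the two filters mentioned above (keeping only sites that have a sitemap, and matching .*/blog/.* URLs). It is not the contents of curl/seed_db.py; the function names are hypothetical, and it works on already-downloaded robots.txt and sitemap text so no particular crawler is assumed. The blog pattern is applied to the URL path only, which is an assumption.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Same idea as the .*/blog/.* pattern in the notes, applied to the path only.
BLOG_PATH = re.compile(r"/blog/")


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return sitemap URLs declared via 'Sitemap:' lines in robots.txt."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Extract every <loc> entry, ignoring the XML namespace prefix."""
    root = ET.fromstring(sitemap_xml)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def is_blog_url(url: str) -> bool:
    """True if the URL path contains a /blog/ segment."""
    return BLOG_PATH.search(urlparse(url).path) is not None
```

A site would pass the sitemap filter if sitemaps_from_robots returns at least one URL (or a conventional /sitemap.xml exists); is_blog_url can then narrow the sitemap entries to blog posts.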

Machine-generated

  • (2x) self-claimed generated
  • generated using Wix and B12, manually or w/ Browser-Use
    • prompt generated by ChatGPT/Gemini (curl/suggestions.py)
      • summarize given URL as website description; suggest name for similar site
      • suggest 50 name+description pairs for blog posts
    • input into Wix/B12 to generate home page + boilerplate + 20 blog posts
    • Browser-Use only usable w/ the best models, e.g., Gemini 2.5 Pro
      • expensive: spent ~$8/site just for LLM API
    • 30x Wix sites corresponding to Russell 2000 company sites
    • 30x B12 sites corresponding to IndieWeb sites
    • 4x + 4x other
  • (not included in dataset) clear cue, e.g., “as an AI”
  • additional AI website generators beyond Wix/B12
    • artificial set-level sample from benchmark datasets?
  • more baseline negatives from before ChatGPT
  • document prompt engineering effort needed for generator outputs
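
A rough sketch of the prompt step described above (summarize a URL, suggest a similar site name, suggest blog post ideas). The real curl/suggestions.py is not shown in these notes, so the function names and prompt wording here are hypothetical; the sketch only builds prompt strings and does not call ChatGPT/Gemini.

```python
def describe_prompt(url: str) -> str:
    """Prompt asking for a site description and a name for a similar site."""
    return (
        f"Summarize the website at {url} as a short website description. "
        "Then suggest a name for a similar site."
    )


def blog_ideas_prompt(site_name: str, description: str, n: int = 50) -> str:
    """Prompt asking for n blog post ideas as name+description pairs."""
    return (
        f"The website '{site_name}' is described as: {description}\n"
        f"Suggest {n} blog posts for it, each as a title plus a "
        "one-sentence description."
    )
```

The answers would then be pasted (or entered via Browser-Use) into Wix/B12, keeping 20 of the suggested posts per site.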

Training/test dataset 144x

  • company website dataset 60x: 30x Wix vs. 30x Russell 2000
  • personal website dataset 60x: 30x B12 vs. 30x IndieWeb Wiki
  • other website dataset 24x: 4x Wix + 4x B12 + 2x self-claimed generated vs. 8x Reddit-recommended blogs + 6x top blogs
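
As a sanity check on the counts above, a small tally (the dictionary layout and function names are just for illustration):

```python
from collections import Counter

# Counts copied from the three subset bullets above.
DATASET = {
    "company": {
        "generated": {"Wix": 30},
        "human": {"Russell 2000": 30},
    },
    "personal": {
        "generated": {"B12": 30},
        "human": {"IndieWeb Wiki": 30},
    },
    "other": {
        "generated": {"Wix": 4, "B12": 4, "self-claimed generated": 2},
        "human": {"Reddit-recommended blog": 8, "top blog": 6},
    },
}


def subset_sizes(dataset: dict) -> dict[str, int]:
    """Number of sites per subset (company/personal/other)."""
    return {
        name: sum(sum(sources.values()) for sources in labels.values())
        for name, labels in dataset.items()
    }


def label_totals(dataset: dict) -> Counter:
    """Number of sites per label (generated/human) across all subsets."""
    totals = Counter()
    for labels in dataset.values():
        for label, sources in labels.items():
            totals[label] += sum(sources.values())
    return totals
```

This gives 60 + 60 + 24 = 144 sites overall, split into 70 generated and 74 human-written.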

note:

  • human sites have blogs, statements, service descriptions, etc., while generated sites are mostly blogs
  • generated sites have boilerplate pages, e.g., policy, added by the website generator, causing occasional high Binoculars scores
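
One possible mitigation for the boilerplate issue, sketched under assumptions: drop pages whose URL path looks like a policy page before scoring. The notes do not say this was done, and the keyword list below is a guess rather than something taken from the Wix/B12 output.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list for generator-supplied boilerplate pages.
BOILERPLATE = re.compile(
    r"(privacy|terms|cookie|policy|disclaimer|accessibility)", re.IGNORECASE
)


def is_boilerplate(url: str) -> bool:
    """True if the URL path suggests a policy-style boilerplate page."""
    return BOILERPLATE.search(urlparse(url).path) is not None


def content_pages(urls: list[str]) -> list[str]:
    """Keep only pages that do not look like boilerplate."""
    return [u for u in urls if not is_boilerplate(u)]
```

Matching on the path only avoids dropping a whole site whose domain happens to contain one of the keywords.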

AI website generator

  • 10Web claims to generate & host a website from name & description
    • landing page & 1-paragraph sample article
    • claims to have generated 1.5M+ websites
    • from $13/month; need $28/month “pro” plan to edit & run multiple sites; uses WordPress, Cloudflare CDN
      • ❌ need $49/month for each additional website
    • ❌ extremely slow when generating, e.g., >10 min/page
  • Wix AI Website Builder
    • landing page & short/long blog article on demand
    • allows multiple sites; sells domains & services rather than the generator
    • for arbitrary pages, “Generate Full Page Text” produces poor results
      • ❌ only generates 1 very short text block & no layout
  • ContentBot.ai automates AI-driven content creation
    • claims to be used by ABCNews, Contagious, PR Week, etc.
    • no free trial; from $0.5/1000 words, $29/month for full plan
  • Copy.ai: go-to-market AI for marketing, sales, etc.
    • claims to be used by SIEMENS, Rubrik, etc.
    • no free trial; $49/month for starter individual plan; mainly targets businesses
  • WebWave AI
    • landing page & manually written blog
    • ❌ very slow; had a bug where the blog was not published
    • from $3.5/month; $5/month for blog&SEO
  • B12
    • landing page, medium-length blog, service/project description, team member pages, on demand
      • or any page given name+description
    • from $42/month
    • very fast generation
  • Contentful AI Content Generator uses the OpenAI API to write content
  • HubSpot AI Website Generator optimizes an existing company website
    • only generates landing pages
  • Relume only generates mockups/HTML
  • Webflow only generates layouts
  • GoDaddy Airo focuses on marketing & selling
    • ❌ needs a GoDaddy domain
  • Dorik AI
    • ❌ need $39/month for unlimited pages, else limited to 5 (free) or 25 ($18/month) pages per site
  • Vzy
    • $10/month/site for 100 pages
  • Wegic, Tilda, Shopify Magic?

Provided example generated sites

  • hand-picked; probably not purely generated
  • some are not text-heavy (mainly images, etc.)
  • commonly business sites w/ /blog, unlike most content farms found

each generator:

Website generator capability

What category to cover

  • can only cover what AI website generators can generate

covering:

  • personal/company/organization blog

want:

  • personal/team project description
  • news
  • products