Manuel Porras

April 16, 2026

Building a Production Web Crawler for SEO Analysis

TypeScript · Crawlee · SEO · Web Crawling · Playwright

Tutorial crawlers work on tutorial websites. Production crawlers need to survive the real web.

Every web scraping tutorial follows the same script: fetch a page, parse some HTML, extract a few links, done. It works beautifully on the instructor's example site. Then you point it at a real client website — one with JavaScript-rendered content, chained redirects, malformed sitemaps, and pages that take 45 seconds to respond — and everything falls apart.

I'm Manuel Porras, an AI Automation Engineer based in Berlin. Over the past couple of years, I've built and maintained a self-hosted web crawler called A-Crawler that powers SEO analysis for 30+ client websites at a digital marketing agency. It processes sites with 10,000+ pages, feeds data into an AI-powered analysis pipeline, and generates client-ready reports. This is the story of what it took to make it actually work.


Why Build Your Own Crawler?

The honest answer: I didn't want to.

I evaluated Screaming Frog, Sitebulb, and various SaaS crawlers. They're excellent tools. But none of them fit into the pipeline I needed. The requirements were specific:

  • Crawl data had to flow directly into a PostgreSQL database (Supabase) for downstream analysis
  • Every crawl needed to be reproducible — same configuration, same scope, queryable history
  • The crawler had to be one stage in a larger automated pipeline, not a standalone tool with a GUI
  • Cost at scale mattered — crawling 30+ client sites monthly with a SaaS tool adds up fast

I needed a crawler that was a component, not a product. So I built one.

    The Architecture

    A-Crawler is built in TypeScript on top of the Crawlee framework by Apify. The high-level flow looks like this:

    Sitemap.xml / Seed URLs
            │
            ▼
       A-Crawler (Crawlee + Playwright)
            │
            ├── Parse HTML / JS-rendered content
            ├── Extract: titles, metas, headings, canonicals, status codes
            ├── Follow redirect chains (301 → 302 → meta refresh → final)
            ├── Respect robots.txt
            │
            ▼
       MySQL (crawl state + raw data)
            │
            ▼
       SEO Processor Worker
            │
            ├── Normalize + deduplicate into Supabase (PostgreSQL)
            ├── Run AI content analysis (Gemini)
            ├── Score pages, detect issues, flag opportunities
            │
            ▼
       Client-Ready SEO Reports

    The separation between crawler and analyzer was a deliberate design decision. The crawler's job is to faithfully capture what's on the web. The analyzer's job is to make sense of it. Mixing those concerns is how you end up with a system that's impossible to debug.

    The Hard Parts

    Here's what actually breaks in production. None of this shows up in tutorials.

    Redirect Chains

    A clean redirect is 301 → final URL. The real web gives you:

    HTTP 301 → HTTP 302 → HTTPS 301 → meta refresh → JavaScript redirect → final URL

    Some chains are 6-7 hops deep. Some are circular. Some alternate between HTTP and HTTPS in ways that suggest the server configuration was written by committee.

    A-Crawler tracks the full redirect chain for every URL — every hop, every status code, every intermediate URL. This data is critical for SEO analysis because redirect chains bleed PageRank and each unnecessary hop is a problem the client needs to fix.

    The implementation tracks each response in sequence before Playwright settles on the final page:

    import type { Page } from 'playwright';

    interface RedirectHop {
      url: string;
      statusCode: number;
      location: string | null;
    }

    // Collect every 3xx response the page emits on its way to the final URL
    function trackRedirects(page: Page): RedirectHop[] {
      const redirectChain: RedirectHop[] = [];
      page.on('response', (response) => {
        const status = response.status();
        if (status >= 300 && status < 400) {
          redirectChain.push({
            url: response.url(),
            statusCode: status,
            location: response.headers()['location'] ?? null,
          });
        }
      });
      return redirectChain;
    }

    You also need a hard cap on redirect depth. I use 10. If a chain goes deeper than that, something is fundamentally broken and you log it as an error rather than following it into infinity.
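
A minimal sketch of that cap, with a hypothetical `resolveChain` helper that also catches circular chains:

```typescript
const MAX_REDIRECT_DEPTH = 10;

// Walk a chain of redirect URLs; return the final URL if the chain is
// sane, or null if it is too deep or revisits a URL (a circular chain).
function resolveChain(chain: string[]): string | null {
  if (chain.length === 0 || chain.length > MAX_REDIRECT_DEPTH) return null;
  const seen = new Set<string>();
  for (const url of chain) {
    if (seen.has(url)) return null; // circular: log as an error, stop following
    seen.add(url);
  }
  return chain[chain.length - 1];
}
```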

    Robots.txt Compliance

    Respecting robots.txt sounds simple until you encounter:

  • Sites with no robots.txt (the server returns an HTML 404 page instead)
  • robots.txt files that are actually HTML error pages with a 200 status code
  • Wildcard patterns that technically match everything
  • Multiple User-Agent blocks with conflicting rules

    Crawlee handles basic robots.txt parsing, but I added validation on top. If the response Content-Type isn't text/plain, it's not a real robots.txt — treat it as absent.
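
A sketch of that check, assuming the status code and Content-Type header come from the crawler's HTTP client (`isRealRobotsTxt` is a hypothetical name):

```typescript
// Validity check for a fetched robots.txt response. A 200 status with a
// text/plain body is the only combination treated as a real robots.txt.
function isRealRobotsTxt(status: number, contentType: string | null): boolean {
  if (status !== 200 || !contentType) return false; // 404s, missing headers, etc.
  // "text/plain; charset=utf-8" passes; "text/html" (an error page) does not
  return contentType.split(';')[0].trim().toLowerCase() === 'text/plain';
}
```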

    Sitemap.xml Parsing

    Sitemaps are supposed to be simple XML files. In practice:

  • Sitemap index files point to other sitemaps, sometimes nested 3 levels deep
  • Gzipped sitemaps that the server doesn't flag with proper Content-Encoding
  • URLs in sitemaps that return 404/500 — the sitemap says they exist, the server disagrees
  • 50,000+ URLs in a single sitemap that need to be streamed, not loaded into memory
  • Sitemaps with namespace issues that break standard XML parsers

    I built a recursive sitemap resolver that handles index files, validates URLs in batches, and streams large sitemaps rather than loading them whole.
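
The traversal logic can be sketched like this, with a hypothetical injected `fetchXml` fetcher so it stays testable; a production version should stream and use a real XML parser rather than a regex:

```typescript
type FetchXml = (url: string) => Promise<string>;

// Recursively resolve a sitemap URL into page URLs. Index files contain
// <loc> entries that point at further sitemaps, so recurse into those.
async function resolveSitemap(
  url: string,
  fetchXml: FetchXml,
  depth = 0,
): Promise<string[]> {
  if (depth > 3) return []; // cap nesting of index files
  const xml = await fetchXml(url);
  const locRe = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  const locs: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = locRe.exec(xml)) !== null) locs.push(m[1]);
  if (!xml.includes('<sitemapindex')) return locs; // leaf sitemap: page URLs
  const nested = await Promise.all(
    locs.map((child) => resolveSitemap(child, fetchXml, depth + 1)),
  );
  return ([] as string[]).concat(...nested);
}
```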

    Timeouts and Broken Pages

    Some pages take 30+ seconds to load. Some never finish loading because they're waiting for a third-party script that's down. Some return partial HTML. Some return valid HTML that causes Playwright to consume 2GB of memory because of an embedded data visualization.

    The timeout strategy is layered. Sketched as Crawlee crawler options:

    import { PlaywrightCrawler } from 'crawlee';

    const crawler = new PlaywrightCrawler({
      // Navigation timeout — how long to wait for the page to load
      navigationTimeoutSecs: 30,

      // Request timeout — how long for the entire request lifecycle
      requestHandlerTimeoutSecs: 60,

      // Max retries with exponential backoff
      maxRequestRetries: 3,
    });

    But timeouts alone aren't enough. You need to handle the cases where a page loads "successfully" but the content is garbage — empty body tags, error messages rendered as content, soft 404s that return status 200.
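
A heuristic along those lines might look like this; the length thresholds and phrase list are illustrative assumptions, not the exact rules A-Crawler uses:

```typescript
// Detect pages that load "successfully" but carry no real content:
// empty bodies, or error text rendered with a 200 status (soft 404s).
function isGarbagePage(status: number, bodyText: string): boolean {
  if (status !== 200) return false; // hard errors are handled elsewhere
  const text = bodyText.trim().toLowerCase();
  if (text.length < 50) return true; // effectively empty body
  // Short pages whose text reads like an error message: likely a soft 404
  const softErrorPhrases = ['page not found', 'not exist', 'error occurred'];
  return text.length < 500 && softErrorPhrases.some((p) => text.includes(p));
}
```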

    Why Crawlee

    I chose Crawlee over alternatives like Scrapy (Python) for a specific reason: the rest of my stack is Node.js/TypeScript. The SEO Processor Worker, the API layer, the report generators — all TypeScript. Having the crawler in the same language means shared types, shared utilities, and one mental model.

    What Crawlee gives you out of the box:

  • Request queue management with deduplication and persistence
  • Auto-scaling concurrency based on system resources
  • Built-in Playwright integration for JavaScript-rendered pages
  • Session rotation and retry logic
  • Pluggable storage (I use MySQL for crawl state)

    What I built on top:

  • Full redirect chain tracking
  • Canonical URL detection and validation
  • Structured data extraction (titles, metas, headings, all of it)
  • Integration layer to push results into the SEO analysis pipeline

    Crawlee handles the plumbing. I handle the SEO-specific logic.
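
As one example of the SEO-specific layer, canonical validation can be sketched like this (regex extraction for brevity; the regex assumes rel comes before href in the tag, and a real parser such as cheerio is more robust):

```typescript
// Pull the canonical URL out of a page's HTML, if one is declared
function extractCanonical(html: string): string | null {
  const m = html.match(/<link[^>]+rel=["']canonical["'][^>]*href=["']([^"']+)["']/i);
  return m ? m[1] : null;
}

// A canonical pointing somewhere other than the page itself is worth
// flagging for SEO review
function canonicalMismatch(pageUrl: string, html: string): boolean {
  const canonical = extractCanonical(html);
  if (canonical === null) return false; // a missing canonical is a separate issue
  return new URL(canonical).href !== new URL(pageUrl).href;
}
```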

    Feeding Into AI Analysis

    Raw crawl data is necessary but not sufficient. A page's title tag and meta description tell you what the page claims to be about. The actual content tells you what it is about. Evaluating content quality at scale requires AI.

    The SEO Processor Worker takes crawl data from MySQL, normalizes it into Supabase, and then runs content through Google's Gemini for analysis. For each page, Gemini evaluates:

  • Content relevance to the target keyword
  • Content depth compared to competing pages
  • Structural quality — heading hierarchy, internal linking, readability
  • Opportunities — what's missing, what could be expanded

    This runs across 600+ pages per client analysis. The results feed into scoring algorithms that prioritize which pages need attention first — a client doesn't want a list of 600 problems, they want the 20 fixes that will move the needle.
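
The prioritization step can be sketched as a simple sort over scored issues (the `impact` field and the top-20 cutoff are illustrative):

```typescript
// Each detected issue carries an impact estimate from the scoring pass;
// the report surfaces only the highest-impact fixes.
interface PageIssue {
  url: string;
  issue: string;
  impact: number; // 0 to 100
}

function topFixes(issues: PageIssue[], limit = 20): PageIssue[] {
  return [...issues].sort((a, b) => b.impact - a.impact).slice(0, limit);
}
```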

    Scaling to 30+ Clients

    The scaling challenges aren't where you'd expect. CPU and memory are manageable. The real issues are:

    Crawl scheduling. You can't crawl 30 sites simultaneously without overwhelming your outbound bandwidth and getting rate-limited everywhere. Crawls are queued and run sequentially, with configurable concurrency per site.
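
The scheduling model, sketched with a hypothetical `runCrawl` stand-in for launching one site's crawler:

```typescript
interface SiteJob {
  domain: string;
  maxConcurrency: number; // per-site cap, configured per client
}

// Sites are crawled strictly one after another; concurrency only applies
// within a single site's crawl.
async function runQueue(
  jobs: SiteJob[],
  runCrawl: (job: SiteJob) => Promise<void>,
): Promise<void> {
  for (const job of jobs) {
    await runCrawl(job);
  }
}
```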

    Data isolation. Every crawl gets a unique crawl_id. All downstream data — analysis results, scores, reports — trace back to that ID. When a client asks "what changed since last month?" you can diff two crawl IDs.
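
A diff between two crawls, keyed by URL, can be sketched like this (`PageRecord` stands in for whatever the crawl tables actually store):

```typescript
interface PageRecord {
  url: string;
  title: string;
  statusCode: number;
}

interface CrawlDiff {
  added: string[];   // URLs only in the new crawl
  removed: string[]; // URLs only in the old crawl
  changed: string[]; // URLs in both, with a different title or status
}

// Compare the pages from two crawl IDs and report what changed
function diffCrawls(oldPages: PageRecord[], newPages: PageRecord[]): CrawlDiff {
  const oldByUrl = new Map(oldPages.map((p): [string, PageRecord] => [p.url, p]));
  const diff: CrawlDiff = { added: [], removed: [], changed: [] };
  const seen = new Set<string>();
  for (const p of newPages) {
    seen.add(p.url);
    const prev = oldByUrl.get(p.url);
    if (!prev) diff.added.push(p.url);
    else if (prev.title !== p.title || prev.statusCode !== p.statusCode)
      diff.changed.push(p.url);
  }
  for (const p of oldPages) {
    if (!seen.has(p.url)) diff.removed.push(p.url);
  }
  return diff;
}
```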

    Failure recovery. A crawler that runs for 4 hours on a 10,000-page site and crashes at page 9,500 needs to resume, not restart. Crawlee's persistent request queue handles this, but you need to be disciplined about checkpointing your own derived data too.

    What I'd Do Differently

    Start with streaming from day one. Early versions loaded too much into memory. Sitemaps, crawl results, analysis batches — everything should stream. I refactored this later, but it would have saved time to design for it upfront.

    Invest in observability earlier. When a crawl fails at 3 AM, you need to know exactly where and why. I added structured logging and a health monitor that alerts on stale messages, but I wish I'd had that from month one.

    Separate the Playwright dependency. Not every page needs JavaScript rendering. Most don't. Running Playwright for every page is expensive. A smarter approach: try a lightweight HTTP fetch first, detect if JS rendering is needed, and only spin up Playwright when necessary.
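
A sketch of the detection heuristic: fetch the HTML cheaply first, and only fall back to Playwright when the page looks like an empty client-side shell. The 200-character threshold is an illustrative assumption:

```typescript
// Does this HTML look like an empty client-side shell that needs a real
// browser to render?
function needsJsRendering(html: string): boolean {
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  if (!bodyMatch) return true; // no parsable body, let Playwright handle it
  // Strip scripts and tags, then see how much visible text remains
  const visibleText = bodyMatch[1]
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim();
  return visibleText.length < 200; // near-empty shell: render with Playwright
}
```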

    By the Numbers

  • 30+ client websites crawled regularly
  • 10,000+ pages on the largest sites
  • 600+ pages analyzed per client report
  • 7-hop redirect chains (longest encountered and survived)
  • 1 self-hosted infrastructure — no SaaS crawling costs

    Wrapping Up

    Building a production crawler taught me that the gap between "it works" and "it works reliably on the messy, broken, unpredictable real web" is enormous. Every edge case you skip in development will show up as a 3 AM alert in production.

    If you're building SEO tooling, marketing automation, or any system that needs to understand what's actually on a website, I'd encourage you to think about your crawler as infrastructure, not as a script. Invest in reliability, observability, and clean separation of concerns. Your future self — the one debugging why a crawl stalled on a client's site at 2 AM — will thank you.

    If you're working on similar challenges — web crawling, SEO automation, AI-powered content analysis — I'm always happy to talk shop. You can find me on LinkedIn or reach out at breakthrough3x.com.


    Manuel Porras is an AI Automation Engineer based in Berlin, building tools at the intersection of web crawling, SEO analysis, and AI-powered content intelligence.


    Need a reliable data pipeline? Let's talk.