April 16, 2026
Running AI Agents Across Multiple Machines
Last month, I watched three AI agents—running on three different machines in two countries—coordinate to ship a complete feature branch without me touching a single file. One agent wrote the backend on a dedicated server in Germany. Another handled the frontend from my laptop in Berlin. A third orchestrated the whole thing from a MacBook, deciding what to build next, checking dependencies, and routing tasks based on which machine was free. It worked. Then it broke. Then I fixed it. Here's how.
The Problem With One Machine
If you've used Claude Code or a similar AI coding agent, you know the pattern: you give it a task, it works for a while, and you wait. Your machine is pegged. You can't do much else. And if the task is complex—say, a multi-phase project with database migrations, API endpoints, frontend components, and tests—you're looking at a long sequential pipeline on a single box.
I wanted parallelism. Real parallelism, not "open two terminal tabs" parallelism. I wanted to say "build this project" and have multiple agents work on independent pieces simultaneously, on different hardware, with automatic coordination.
So I built what I call the Super-Agent: an orchestrator that dispatches work to AI agent workers running on separate physical machines.
Three Machines, Three Roles
Here's the setup:
| Machine | Alias | Location | Role |
|---------|-------|----------|------|
| MacBook (lipo-360) | — | — | Orchestrator: reads specs, builds the dependency graph, dispatches tasks |
| Hetzner dedicated server | @remote | Germany | Worker: backend phases |
| WSL2 laptop | @local | Berlin | Worker: frontend phases |
The MacBook never writes code itself. It's the brain. It reads project specs, builds a dependency graph of phases, and sends individual phase tasks to whichever worker is available. Each worker runs Claude Code autonomously—it receives a task description, implements it, runs tests, commits to a branch, and reports back.
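The worker's receive-implement-report cycle can be sketched as a single processing step. This is a hypothetical illustration with an in-memory queue; the names `processNext` and `runPhase` are assumptions, not the actual queue-helper API:

```javascript
// One worker step: take the oldest pending task, run it, record the response.
// In the real system the queue lives in queue.db and runPhase is an
// autonomous Claude Code invocation that implements, tests, and commits.
function processNext(queue, runPhase) {
  const msg = queue.pending.shift();        // oldest pending task, if any
  if (!msg) return null;                    // nothing to do
  const result = runPhase(msg.content);     // agent does the actual work
  queue.processed.push({ id: msg.id, response: result });
  return result;
}
```

The worker loop is just this step repeated on a poll interval.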
The Messaging Layer: SSH + SQLite
I evaluated a few options for inter-machine communication. RabbitMQ? Overkill for three machines. Redis pub/sub? Another service to maintain. HTTP webhooks? Fragile if a machine is behind NAT.
I landed on the simplest thing that could work: SSH + SQLite queues.
Each worker has a SQLite database (queue.db) that acts as its task inbox. The orchestrator sends messages by SSH-ing into the worker and inserting a row:
```
ssh [email protected] \
  "cd ~/awsc-new/awesome/slack-app && node queue-helper.js enqueue 'Implement Phase 3: API endpoints for /api/projects CRUD'"
```

The worker polls its local queue, picks up pending messages, and processes them. When done, it marks the message as processed and writes back a response:
```
node queue-helper.js respond <message-id> 'Phase 3 complete. 4 endpoints implemented, 12 tests passing. Branch: feat/phase-3-api'
```

The orchestrator polls for responses via SSH. Simple. No ports to open, no services to run, no firewall rules to manage. SSH is already there and already secured.
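The post doesn't show the queue.db schema, but a minimal version could look like the sketch below. Column names and the `respond` transition are assumptions about what queue-helper.js does:

```javascript
// Hypothetical inbox schema: each row is one task, moving through a
// two-state lifecycle (pending -> processed) exactly once.
const SCHEMA = `
  CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    content    TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',  -- 'pending' | 'processed'
    response   TEXT,
    created_at TEXT DEFAULT (datetime('now'))
  );`;

// The state transition the respond command performs, as a pure function.
function respond(msg, responseText) {
  if (msg.status !== 'pending') throw new Error('already processed');
  return { ...msg, status: 'processed', response: responseText };
}
```

The single-table, single-owner design is what makes "conflicts impossible": only the local worker ever flips a row to processed.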
For the WSL2 machine, which sits behind a Cloudflare tunnel, I added a local JSON file queue (message-queue-wsl2.json) as an alternative path. A small webhook-notifier.js process watches the file and triggers the agent to check its queue. Two queue sources, one worker—it checks both every time.
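The "two queue sources, one worker" merge can be sketched as one pure function: combine pending rows from the SQLite inbox with unprocessed entries from the JSON file queue, oldest first. Field names here are assumptions:

```javascript
// Merge both inboxes into one ordered work list. The SQLite rows use a
// status column; the JSON file queue is assumed to use a processed flag.
function pendingFromBoth(sqliteRows, jsonQueue) {
  const fromDb = sqliteRows.filter(m => m.status === 'pending');
  const fromFile = jsonQueue.filter(m => !m.processed);
  return [...fromDb, ...fromFile]
    .sort((a, b) => a.created_at.localeCompare(b.created_at)); // oldest first
}
```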
Why SQLite?
Because it's a single file, it's transactional, and it doesn't need a server process. Each worker owns its own database. No distributed consensus needed. The orchestrator writes to it over SSH; the worker reads from it locally. Ownership is clear. Conflicts are impossible by design.
The Orchestrator: Phases, DAGs, and Specs
The orchestrator on lipo-360 is where the real logic lives. When I say "build project X," it does the following:
1. Reads the project spec and splits it into discrete phases.
2. Builds a dependency graph (DAG) of the phases.
3. Checks which workers are idle.
4. Dispatches every ready phase: if @remote is idle and Phase 3 has no unmet dependencies, send it.

I wrote 16 shell scripts for the orchestrator: project scaffolding, spec validation, status dashboards, branch management, worker health checks. It's not a framework. It's a collection of sharp tools that work together.
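The dispatch step reduces to a small scheduling function: a phase is ready when all of its dependencies are done, and ready phases are matched to idle workers. The phase and worker shapes below are assumptions based on the description, not the orchestrator's actual data model:

```javascript
// A phase is ready when it isn't done yet and every dependency is done.
function readyPhases(phases, done) {
  return phases.filter(p => !done.has(p.id) && p.deps.every(d => done.has(d)));
}

// Pair ready phases with idle workers, first come first served.
function assign(phases, done, idleWorkers) {
  return readyPhases(phases, done)
    .slice(0, idleWorkers.length)
    .map((p, i) => ({ phase: p.id, worker: idleWorkers[i] }));
}
```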
The spec validation is worth mentioning. Before dispatching anything, the orchestrator checks that every phase has a clear deliverable, that the dependency graph has no cycles, and that worker assignments make sense (don't send a task requiring local Windows file access to the Hetzner server). Catching bad specs early saves a lot of wasted compute.
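The cycle check in spec validation is a standard depth-first walk over the dependency graph. A minimal sketch, assuming the graph is a plain `{ phaseId: [depIds] }` map:

```javascript
// Three-color DFS: GRAY means "on the current path", so reaching a GRAY
// node again means we followed a back edge, i.e. the spec has a cycle.
function hasCycle(graph) {
  const GRAY = 1, BLACK = 2;
  const color = {};
  const visit = (node) => {
    color[node] = GRAY;
    for (const dep of graph[node] || []) {
      if (color[dep] === GRAY) return true;       // back edge: cycle found
      if (!color[dep] && visit(dep)) return true; // unvisited: recurse
    }
    color[node] = BLACK;                          // fully explored, safe
    return false;
  };
  return Object.keys(graph).some(n => !color[n] && visit(n));
}
```

Running this before dispatch means a spec with `Phase 3 -> Phase 4 -> Phase 3` is rejected before any worker burns compute on it.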
What Broke
Everything interesting I learned came from things breaking.
Node Version Mismatches
The Hetzner server was running Node 18. My WSL2 machine had Node 20. The MacBook had Node 21. A task would succeed on one machine and fail on another because of subtle differences in fetch behavior, fs.cp availability, or ES module resolution.
Fix: I standardized on Node 20 across all machines and added a version check to the worker startup script. If the version doesn't match, it refuses to process tasks. Fifteen minutes of setup saved hours of debugging phantom failures.
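The version gate can be a few lines at worker startup. A sketch, with 20 as the pinned major from the post (the helper name is mine, not from the actual startup script):

```javascript
// Refuse to process tasks unless the Node major version matches the fleet.
const REQUIRED_MAJOR = 20;

function versionOk(version, required = REQUIRED_MAJOR) {
  const major = Number(version.replace(/^v/, '').split('.')[0]);
  return major === required;
}

// At startup:
//   if (!versionOk(process.version)) process.exit(1);
```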
SQLite Concurrent Writes
This one bit me when I added Slack integration. The queue-agent.js process (listening for Slack messages) and the orchestrator's SSH commands were both writing to the same SQLite database simultaneously. SQLite handles concurrent reads fine, but concurrent writes can throw SQLITE_BUSY errors.
Fix: I wrapped all database operations in a helper (queue-db.js) with WAL mode enabled and automatic retry logic:
```
db.pragma('journal_mode = WAL');
db.pragma('busy_timeout = 5000');
```

WAL (Write-Ahead Logging) lets readers and writers coexist without blocking. The 5-second busy timeout handles the rare case where two writes collide. Since adding this, I've had zero SQLITE_BUSY errors across thousands of messages.
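The retry half of the fix can be a small wrapper around each write. The real queue-db.js isn't shown, so this is an assumption about its shape:

```javascript
// Retry a database operation when it fails with lock contention
// (SQLITE_BUSY); rethrow anything else immediately.
function withRetry(fn, { attempts = 3, isBusy = e => e.code === 'SQLITE_BUSY' } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn();
    } catch (err) {
      if (!isBusy(err)) throw err; // only retry lock contention
      lastErr = err;
    }
  }
  throw lastErr; // still busy after all attempts
}
```

Usage would look like `withRetry(() => insertStmt.run(content))`: the busy timeout absorbs short collisions at the SQLite level, and the retry loop covers the rest at the application level.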
Message Deduplication
Network hiccups caused the orchestrator to occasionally send the same task twice. Two identical Phase 3 implementations running simultaneously on the same worker, stepping on each other's branches.
Fix: Every message gets a hash of its content. The worker checks for duplicates before processing:
```
const hash = crypto.createHash('md5').update(message.content).digest('hex');
if (recentHashes.has(hash)) {
  markProcessed(message.id, 'slack', 'Duplicate — skipped');
  return;
}
```

Cross-Worker State Persistence
The hardest problem was state. When the orchestrator dispatches Phase 3 to @remote and Phase 4 to @local, both workers need to know about the repository state. Phase 4 might depend on branches or files that Phase 3 created.
Fix: Git is the state layer. Every phase commits to a well-named branch (feat/phase-3-api, feat/phase-4-frontend). Before starting work, each worker pulls the latest from the remote repository. The orchestrator doesn't dispatch a dependent phase until the prerequisite's branch is pushed. Git was already solving distributed state synchronization—I just had to lean into it.
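The "don't dispatch until the prerequisite's branch is pushed" gate can be checked against the output of `git ls-remote --heads origin`. Parsing that output as a pure function keeps the gate testable without a repository; this is a sketch, not the orchestrator's actual implementation:

```javascript
// ls-remote prints one "<sha>\trefs/heads/<branch>" line per branch.
// A phase's prerequisite is satisfied once its branch appears there.
function branchPushed(lsRemoteOutput, branch) {
  return lsRemoteOutput
    .split('\n')
    .some(line => line.trim().endsWith(`refs/heads/${branch}`));
}
```

The orchestrator would run `git ls-remote --heads origin`, feed the output to this check for `feat/phase-3-api`, and only then enqueue Phase 4.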
What I Learned
Start with the messaging layer, not the orchestrator. I initially tried to build the orchestrator first and bolt messaging on later. Wrong order. Once reliable messaging was in place, the orchestrator logic was straightforward. Without it, everything was fragile.
SQLite is absurdly underrated for this kind of work. No server, no config, no ports, ACID-compliant, handles thousands of messages without breaking a sweat. For single-writer, multi-reader workloads—which is exactly what a task queue is—it's close to perfect.
AI agents need guardrails, not micromanagement. I don't tell Claude Code how to implement a phase. I tell it what to deliver and what tests to pass. The spec is the contract. The agent figures out the rest. This only works if your specs are precise enough, which is why spec validation matters.
SSH is the most underappreciated tool in distributed systems. Encrypted, authenticated, already deployed on every server, supports tunneling, port forwarding, and remote command execution. I didn't need Kubernetes. I needed ssh and a clear architecture.
Limitations
The system isn't perfect. It doesn't handle machine failures gracefully yet—if a worker goes offline mid-task, the orchestrator just waits. I'm working on heartbeat-based health monitoring (there's a health-monitor.js that alerts on messages pending for more than 10 minutes, but automatic reassignment isn't there yet).
Where This Is Going
The architecture scales horizontally by design. Adding a fourth machine means adding one SSH key and one queue database. The orchestrator already handles dynamic worker assignment. The real bottleneck is spec quality—the better the project spec, the more autonomous the agents can be.
I think multi-agent orchestration across physical machines is going to become a standard pattern as AI coding agents mature. The tools are all there: SSH for transport, SQLite for state, Git for synchronization, and increasingly capable agents that can take a spec and ship working code.
If you're building something similar, or thinking about how AI automation could fit into your engineering workflow, I'd be happy to compare notes. You can reach me at /contact.