TWITTER_ARTICLE

OpenClaw migrations from older clawdbot installs can fail because both…

2026-02-13 · 00:10 UTC · kloss_xyz · 21 min read

Brief

kloss_xyz’s February 2026 guide is a field report on making OpenClaw multi-agent systems stable after more than a week of continuous use, with the central claim that real deployments are messy and mostly about hardening infrastructure rather than merely writing prompts. The post catalogs concrete failure modes: bot migrations colliding on port 18789 because legacy clawdbot services remain active, silent hangs that require a watchdog polling the health endpoint every 15 minutes, plugin installs that can crash the gateway, and delivery pipelines that break when Telegram bots have never received an initial direct message. Security advice is similarly practical: keep the gateway bound to loopback, avoid exposing ports, and use Cloudflare Tunnel or Tailscale instead.

The architecture that emerges is opinionated. One top-level agent handles all external communications, while specialized internal agents with separate SOUL, AGENTS, and IDENTITY files do focused work and can spawn one level of subagents for atomic tasks. The guide emphasizes strict definitions of task completion, queue-based message handling, symlinked shared state, startup checks via BOOT.md, and crash recovery via memory/active-tasks.md. It also dives into cost and context management: fallback chains should stay within one provider family, stronger models should process untrusted external content for prompt-injection resilience, and bloated MEMORY.md or HEARTBEAT.md files waste tokens or silently truncate context. Overall, the post is less a beginner setup guide than an operations manual for keeping OpenClaw reliable under continuous, multi-agent load.

Why it matters

OpenClaw migrations from older clawdbot installs can fail because both `clawdbot-gateway.service` and `openclaw-gateway.service` may compete for port 18789; the author recommends disabling the old service, uninstalling `clawdbot`, and checking leftover files in `/usr/local/bin/clawdbot` and `/usr/local/lib/node_modules/clawdbot` before reinstalling.

Key details

For 24/7 reliability, the guide suggests a watchdog that pings the gateway health endpoint every 15 minutes and restarts the stack on failure, plus using `openclaw doctor` or `openclaw doctor --fix` to repair common problems such as missing directories, legacy config keys, permission errors, and outdated service paths.
The recommended multi-agent architecture keeps one external-facing 'CEO' agent for Telegram and Discord while specialist agents like CTO, CMO, and COO stay internal; top-level session concurrency defaults to `maxConcurrent: 4`, subagent concurrency to `subagents.maxConcurrent: 8`, and subagents are intentionally limited to one delegation level to prevent runaway token burn.
The post argues for disciplined model configuration: keep fallback chains to 2-3 models from the same provider family, configure them in files rather than the TUI/GUI, use cheaper defaults for subagents, and reserve stronger models such as Opus for processing untrusted external content because weaker models are more susceptible to prompt injection from emails, webpages, and social posts.
OpenClaw’s memory system has practical limits: oversized workspace files are truncated with a 70/20/10 split across head, tail, and omitted middle, `MEMORY.md` should stay lean and act as an index, and detailed state should move into `memory/` files such as `active-tasks.md`, `projects.md`, and `lessons.md`; QMD can add fully local BM25, vector search, and reranking but requires a separate `qmd` binary and downloads roughly 300MB+ of models on first use.
Several operational gotchas are highly specific: Telegram output can fail until the user sends the bot an initial DM, heartbeat files should stay under 20 lines because they run every 30 minutes by default, cron jobs should use full model IDs like `anthropic/claude-haiku-4-5` instead of aliases, and debugging should start with `gateway.err.log` because plugin failures, channel permission issues, and announcement queue timeouts often surface there first.

Cleaned source text

title: @kloss_xyz: I've been running multi agent orchestration with my OpenClaw for over a week. It...

author: kloss_xyz

content_type: twitter_article

published: 2026-02-13T00:10:23+00:00

source_url: https://x.com/kloss_xyz/status/2022101005064974600

word_count: 4637

I've been running multi-agent orchestration with my OpenClaw for over a week. It's getting much, much closer to where I want it to be. The issue, though, is that everyone's telling you theirs is perfect and that they built it in a day.

They're lying. I've been building mine for a week plus.

Every single thing that could break when I built mine, literally broke.

What follows is every annoying ass problem I hit, every issue power users have reported, and every config that actually made this thing work reliably. Whether you're running one agent or ten with subagents, whether you're technical or just getting started, this is the reference guide I wish existed when I began. This isn't a setup guide. It's an optimization one.

This isn't something you skim and forget. Please bookmark it and save it. Come back to it every time something goes sideways with OpenClaw.

Let's get into it.

1. Upgrading Your OpenClaw

If you're coming from an older clawdbot install, the good news is that the installer actually handles the migration pretty well. It moves your .clawdbot directory to .openclaw and creates a symlink back so nothing breaks, and your config, soul files, memory, and workspace all carry over. The clawdbot package also stays as a compatibility shim.

The real problem is what it doesn't clean up. The old clawdbot-gateway.service can keep running alongside the new openclaw-gateway.service in some cases, and when both try to grab port 18789 you'll get a restart loop where the new gateway fails over and over with "another gateway instance is already listening." It looks like OpenClaw is broken, but it's actually just fighting its own predecessor for the port.

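Before you install OpenClaw, stop and disable the old clawdbot-gateway.service completely (on systemd that means systemctl stop followed by systemctl disable, with --user if it was installed as a user unit).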

Then uninstall the old npm package (npm uninstall -g clawdbot) and check for leftover files in /usr/local/bin/clawdbot or /usr/local/lib/node_modules/clawdbot since residual packages can silently interfere with the new install. After that, install OpenClaw fresh and your existing workspace files will be picked up automatically.
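A quick way to confirm the old gateway is really gone before reinstalling is to check whether anything still answers on port 18789. The sketch below is a minimal Python check, assuming the gateway listens on loopback; the function name port_in_use is made up for illustration and is not part of OpenClaw.

```python
import socket

GATEWAY_PORT = 18789  # port named in the guide


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(GATEWAY_PORT):
        print(f"Port {GATEWAY_PORT} is still in use; an old gateway may be running.")
    else:
        print(f"Port {GATEWAY_PORT} is free.")
```

If the port is still busy after you disabled the old service, something else is holding it and the new gateway will likely hit the same restart loop.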

2. Agent Stability (Hangs, Crashes, and Silent Deaths)

Your agent will hang, it will crash, and it will go completely silent for minutes with no explanation. This is normal when you're running agents 24/7, and the fix is building around it.

Write a simple watchdog script that pings the gateway health endpoint every 15 minutes. If it doesn't respond, auto-restart the whole thing. You shouldn't be babysitting this manually.
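The watchdog described above could look roughly like this in Python. Two things here are assumptions and not taken from the guide: the health URL path (HEALTH_URL) and the restart command (RESTART_CMD, which assumes a systemd user unit). Swap in whatever your install actually uses.

```python
import http.client
import subprocess
import time
import urllib.request
from typing import Callable

# Assumptions: neither the URL path nor the restart command comes from the guide.
HEALTH_URL = "http://127.0.0.1:18789/health"  # hypothetical path
RESTART_CMD = ["systemctl", "--user", "restart", "openclaw-gateway.service"]
INTERVAL_SECONDS = 15 * 60  # the guide's 15-minute cadence


def gateway_healthy(url: str = HEALTH_URL, timeout: float = 10.0) -> bool:
    """True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def restart_gateway() -> None:
    """Fire the (assumed) restart command; failures are not raised."""
    subprocess.run(RESTART_CMD, check=False)


def check_once(is_healthy: Callable[[], bool], restart: Callable[[], None]) -> bool:
    """Run one probe and restart on failure. Returns True if a restart fired."""
    if is_healthy():
        return False
    restart()
    return True


def run_forever() -> None:
    """Probe every INTERVAL_SECONDS until the process is stopped."""
    while True:
        check_once(gateway_healthy, restart_gateway)
        time.sleep(INTERVAL_SECONDS)
```

Calling run_forever() from a small service or a cron-launched script gives you the "ping every 15 minutes, restart on silence" loop. Keeping check_once separate from the probe makes it easy to test without a live gateway.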

OpenClaw has a built-in diagnostic command called openclaw doctor that checks your config, gateway, channels, workspace, and permissions in one shot. Run it with the --fix flag (openclaw doctor --fix) and it will auto-repair common issues like permission problems, missing directories, legacy config keys, and outdated service paths. It backs up your config before making changes and won't touch your API keys or credentials, so it's safe to run whenever something feels off or after an upgrade. You can also run it without the --fix flag, and some users have found better results that way.

3. Security (Lock This Down Immediately)

If your server is internet-facing, assume someone is already trying to brute force their way in. This isn't paranoia, it's what happens to every exposed server within hours of going live.

At minimum:

But honestly, the better move is to not expose it at all. Use Cloudflare Tunnel or Tailscale to access your server without opening ports to the internet. In your OpenClaw config, set gateway.bind: "loopback" so the gateway only listens locally. No exposed ports means no attack surface for nefarious actors.

4. Plugins Breaking Your Gateway

Plugins are powerful but they can take your entire gateway down. If something dies right after you install a plugin, that plugin is almost certainly the cause.

The fix is simple:

Check gateway.err.log for the actual error

Disable the plugin: openclaw plugins disable, followed by the plugin's identifier

Restart

The habit to build here is verifying your gateway restarts cleanly after every single plugin install. Don't install three plugins at once and then wonder which one broke everything, just go one at a time, verify, and move on.

5. The Autonomy Problem

This is the issue I see more than any other, where the agent doesn't listen, leaves tasks unfinished, or says "done" when the work is clearly broken.

Here's the thing: the agent is exactly as autonomous as your instructions make it. If your instructions are vague, the agent's behavior will be vague. If you don't define what "done" means, the agent will decide for itself, and its definition will be generous.

Put explicit rules in your AGENTS.md file:

Every time the agent claims "done," it must include the repo name, branch, and commit hash. It must verify its work with actual commands, not just say "I checked." Design heartbeat loops that catch incomplete tasks before they sit there rotting for hours.
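If you want to enforce the "done" rule outside the prompt as well, a small checker can reject reports that lack the required fields. This is only a sketch: the DoneReport shape is hypothetical, and the commit pattern assumes abbreviated or full SHA-1 hashes (7 to 40 hex characters).

```python
import re
from dataclasses import dataclass

# Assumption: commits are SHA-1 hashes, abbreviated to at least 7 characters.
COMMIT_RE = re.compile(r"^[0-9a-f]{7,40}$")


@dataclass
class DoneReport:
    """Hypothetical structure for an agent's completion claim."""
    repo: str
    branch: str
    commit: str


def missing_fields(report: DoneReport) -> list[str]:
    """Return the names of fields that are absent or malformed."""
    problems = []
    if not report.repo.strip():
        problems.append("repo")
    if not report.branch.strip():
        problems.append("branch")
    if not COMMIT_RE.match(report.commit.strip().lower()):
        problems.append("commit")
    return problems
```

A heartbeat loop could run a check like this over open tasks and bounce anything with a non-empty result back to the agent.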

The agent isn't being lazy. It's following the looseness in your instructions. Tighten those up and the behavior changes immediately.

6. Model Configuration

Having too many models in your fallback list creates unpredictable behavior. The agent switches between different reasoning styles mid-task and the output quality becomes inconsistent.

Keep your model list to 2 or 3 maximum and stay within the same family, either all Anthropic or all OpenAI. Don't mix providers in the same fallback chain.

Configure your models in the actual config file, not through the TUI or GUI. Those interfaces sometimes don't persist settings correctly, and you'll wonder why your changes disappeared after a restart.

If you're using free models, put them last in the fallback chain and never as the primary model for anything critical since they're a safety net, not the foundation.

Here's how a solid failover config looks:

With a chain like that in place, the gateway auto-switches on failures so your agent never goes dark.
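The chain rules in this section (2 or 3 models, one provider family, free models last) are easy to check mechanically. Here is a minimal Python sketch, assuming model IDs are written as provider/model. The function names and the placeholder IDs in the usage example are illustrative only and do not reflect OpenClaw's actual config schema.

```python
MAX_CHAIN_LENGTH = 3


def provider_of(model_id: str) -> str:
    """Provider prefix of an ID written as 'provider/model'."""
    return model_id.split("/", 1)[0]


def chain_problems(chain: list[str], free_models: frozenset = frozenset()) -> list[str]:
    """Return human-readable rule violations for a fallback chain."""
    problems = []
    if len(chain) > MAX_CHAIN_LENGTH:
        problems.append(
            f"{len(chain)} models listed; keep it to {MAX_CHAIN_LENGTH} or fewer"
        )
    providers = {provider_of(m) for m in chain}
    if len(providers) > 1:
        problems.append("mixed providers: " + ", ".join(sorted(providers)))
    # Free models belong at the end: once one appears, no paid model may follow.
    seen_free = False
    for model in chain:
        if model in free_models:
            seen_free = True
        elif seen_free:
            problems.append(f"paid model {model} listed after a free model")
            break
    return problems
```

For example, chain_problems(["anthropic/a", "openai/b"]) reports the mixed-provider problem, while a two-model chain from a single provider comes back clean.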

7. TUI Shows "(no output)"

This one drove me crazy before I figured it out. The TUI shows "(no output)" for every reply and nothing seems to work.

If you configured Telegram delivery with the --deliver flag but you've never actually sent a direct message to your bot on Telegram, the delivery failure kills the entire output pipeline, not just Telegram delivery but everything.

The fix is absurdly simple: open Telegram, send one message to your bot, and the pipeline unblocks and everything starts flowing.

8. Messages Getting Dropped

When your agent is busy processing a request and new messages come in, they can get silently dropped. You'll never know they existed and the sender thinks they were heard, but they weren't.

Enable queue mode:

Every message gets queued and processed in order with nothing lost, even if 5 messages pile up while the agent is in the middle of a long tool call.
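To see why a queue fixes the dropped-message problem, here is a toy Python model of the idea. It is not OpenClaw's implementation; handle, worker, and demo are illustrative names. A single worker drains an asyncio.Queue in arrival order, so messages that land while the agent is busy wait their turn instead of disappearing.

```python
import asyncio


async def handle(message: str, processed: list) -> None:
    """Stand-in for a slow agent turn, such as a long tool call."""
    await asyncio.sleep(0.01)
    processed.append(message)


async def worker(inbox: asyncio.Queue, processed: list) -> None:
    """Single consumer: pulls messages strictly in arrival order."""
    while True:
        message = await inbox.get()
        try:
            await handle(message, processed)
        finally:
            inbox.task_done()


async def demo(messages: list) -> list:
    inbox: asyncio.Queue = asyncio.Queue()
    processed: list = []
    task = asyncio.create_task(worker(inbox, processed))
    for message in messages:  # all of these arrive before the worker finishes one
        inbox.put_nowait(message)
    await inbox.join()  # block until every queued message was handled
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return processed
```

Running asyncio.run(demo(["m1", "m2", "m3", "m4", "m5"])) returns all five messages in the order they were sent.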

9. Local Memory with QMD

If you want to avoid paying for embedding APIs, OpenClaw's QMD backend does BM25 keyword search, vector search, and reranking entirely on your local machine. It requires the qmd binary (install via bun install -g github.com/tobi/qmd) and runs local GGUF models for embeddings and reranking.

The default builtin backend already does hybrid BM25 + vector search, but QMD adds a local reranker on top and can index multiple external folders beyond your workspace. The tradeoff is more moving parts and heavier CPU/disk usage, so if your memory set is small and mostly workspace Markdown, the builtin hybrid search is already solid without QMD. First search will be slow since QMD downloads local models (~300MB+) on the first query.

10. Making Responses Feel Human

Instant replies on Telegram and Discord feel robotic. Real people don't respond in 200 milliseconds. It immediately signals "this is a bot" to anyone paying attention.

With a short randomized delay configured, responses arrive with natural timing, somewhere between 0.8 and 2.5 seconds. It feels like a person typing instead of a machine firing back instantly, and that small detail makes a big difference in how people interact with your agent.
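The timing idea itself is simple enough to sketch in Python. This is a generic illustration of a randomized reply delay in the 0.8 to 2.5 second range, not OpenClaw's own setting; send_with_delay and its parameters are hypothetical names.

```python
import random
import time
from typing import Callable, Optional

MIN_DELAY_S = 0.8
MAX_DELAY_S = 2.5


def human_delay(rng: Optional[random.Random] = None) -> float:
    """Pick a delay uniformly between the two bounds from the guide."""
    source = rng if rng is not None else random
    return source.uniform(MIN_DELAY_S, MAX_DELAY_S)


def send_with_delay(
    send: Callable[[str], None],
    text: str,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> float:
    """Wait a human-looking interval, then send. Returns the delay used."""
    delay = human_delay(rng)
    sleep(delay)
    send(text)
    return delay
```

Passing sleep in as a parameter keeps the function testable without actually waiting.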

11. Controlling Who Can Spawn What

First, understand the difference between agents and subagents because most people conflate them.

Agents are your team, where each one is a distinct personality with its own workspace, its own SOUL.md, its own model, and its own role. Think of them as employees: your CEO agent handles strategy and external communication, your CTO agent handles technical decisions, and your CMO agent handles content and marketing. They're all top-level, persistent, and always available.

Subagents are temporary background workers that any agent on your team can spawn to handle a task without blocking its main conversation. The subagent runs in an isolated session, does its work, and reports back when finished before getting archived. They're one-off workers, not permanent team members.

The important distinction: by default, subagents cannot spawn other subagents. This is intentional. It prevents runaway delegation chains that burn tokens exponentially. There's a feature request to make this configurable, but for now, the nesting stops at one level per spawn.

To control which agents your team members can delegate to, set up allowlists:

Your CEO can spawn subagents under the CTO, CMO, or CRO agent identities. Your CMO can't spawn work under the CTO's identity unless you explicitly allow it. This gives you a clean organizational hierarchy without runaway cross-delegation.
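The hierarchy described above boils down to a lookup table. The Python sketch below models it in memory; the lowercase agent IDs and the SPAWN_ALLOWLIST structure are assumptions for illustration and do not mirror OpenClaw's real config keys.

```python
# Hypothetical in-memory model of per-agent spawn allowlists.
SPAWN_ALLOWLIST: dict = {
    "ceo": {"cto", "cmo", "cro"},  # CEO may delegate under these identities
    "cmo": set(),                  # CMO has no cross-delegation rights
}


def can_spawn_as(parent: str, target: str) -> bool:
    """True only if target is explicitly listed for parent."""
    return target in SPAWN_ALLOWLIST.get(parent, set())
```

Anything not listed is denied, including agents with no entry at all, which matches the "unless you explicitly allow it" rule.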

12. Different Models for Different Agents

Not every agent on your team needs the most expensive model. Your CEO needs strong reasoning for orchestration and decision-making. Your CTO agent needs a code-optimized model. Your COO running operational tasks can use something lighter and cheaper.

The pattern:

Your CEO runs on Opus for complex reasoning and strategic decisions. Your CTO runs on Codex for code generation and technical work. Your COO runs on Haiku for quick operational tasks and routine coordination.

Set global defaults in your config and then override per agent. This is how you control costs without sacrificing quality where it matters. A subagent doing a quick research task doesn't need Opus. Set a cheaper model as the subagent default via agents.defaults.subagents.model and keep your top-level agents on the higher-quality models.
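The "global default, per-agent override, cheaper subagent default" logic can be sketched as a small resolver in Python. The CONFIG layout is a simplification invented for this example, and the short names opus, codex, and haiku stand in for real model IDs; in an actual config you would use full IDs and the agents.defaults.subagents.model key mentioned above.

```python
# Simplified, hypothetical config shape; values are stand-ins for full model IDs.
CONFIG = {
    "defaults": {"model": "opus", "subagent_model": "haiku"},
    "agents": {
        "ceo": {"model": "opus"},
        "cto": {"model": "codex"},
        "coo": {"model": "haiku"},
    },
}


def resolve_model(agent_id: str, is_subagent: bool = False) -> str:
    """Subagents take the subagent default; agents take their override or the global default."""
    if is_subagent:
        return CONFIG["defaults"]["subagent_model"]
    override = CONFIG["agents"].get(agent_id, {}).get("model")
    return override or CONFIG["defaults"]["model"]
```

An agent with no override, for example a CMO missing from the table, falls back to the global default.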

13. Concurrency Settings

OpenClaw defaults are conservative on purpose. Two settings control parallel processing:

maxConcurrent: 4 controls how many top-level sessions can run at the same time. subagents.maxConcurrent: 8 controls how many subagent sessions can run in parallel.

If your hardware can handle it, crank these up. More concurrency means your multi-agent system processes work faster instead of queuing everything behind a bottleneck.
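What a concurrency cap does in practice is easy to demonstrate with a semaphore. This Python sketch is a generic illustration, not OpenClaw's scheduler: it runs dummy sessions under a limit and reports the peak number that ran at once.

```python
import asyncio


async def run_limited(n_jobs: int, max_concurrent: int) -> int:
    """Run n_jobs dummy sessions; return the peak number running at once."""
    gate = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def session() -> None:
        nonlocal running, peak
        async with gate:  # waits here when max_concurrent sessions are active
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for real work
            running -= 1

    await asyncio.gather(*(session() for _ in range(n_jobs)))
    return peak
```

With ten jobs and a limit of 4, the peak never exceeds 4 and the remaining jobs queue behind the gate. Raising the limit shortens the queue only if your hardware and provider rate limits can keep up.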

14. Nesting and Delegation Depth

By default, subagents cannot spawn other subagents. OpenClaw hardcodes this restriction to prevent runaway fan-out where delegation chains spiral and burn tokens exponentially.

The architecture that works is having your top-level agents (CEO, CTO, CMO) each spawn subagents for background work, and those subagents complete their tasks in a single shot and report back. No deeper nesting and no subagent-spawning-subagent chains.

There's a configurable override being discussed (via allowNesting: true or adding sessions_spawn to subagent tool allowlists), but the default behavior exists for a good reason. Keep your delegation flat where the top-level agent breaks work into atomic units that a subagent completes and returns.
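The flat-delegation rule amounts to a depth check at spawn time. Here is a minimal Python sketch of that guard; Session, spawn, and NestingNotAllowed are hypothetical names, and this does not reproduce how OpenClaw enforces the restriction internally.

```python
from dataclasses import dataclass

MAX_DEPTH = 1  # top-level agents sit at depth 0, their subagents at depth 1


class NestingNotAllowed(RuntimeError):
    """Raised when a subagent tries to spawn another subagent."""


@dataclass(frozen=True)
class Session:
    name: str
    depth: int = 0


def spawn(parent: Session, name: str) -> Session:
    """Create a child session one level below parent, or refuse if too deep."""
    if parent.depth >= MAX_DEPTH:
        raise NestingNotAllowed(
            f"{parent.name} is a subagent and cannot spawn {name}"
        )
    return Session(name=name, depth=parent.depth + 1)
```

A top-level agent can spawn a worker, but that worker calling spawn again raises NestingNotAllowed, which is the one-level limit the guide recommends keeping.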