10 · Subagents

§1 · TL;DR

Subagents (a subagent is a thread- or process-level child session that the current agent spawns for a separate task) look simple on paper but turn into a minefield in production. Anyone who has shipped subagents has seen the failure modes: a subagent calls itself recursively and burns tokens; a subagent ends up with tools the parent did not have (privilege escalation); a subagent runs its own approval flow and confuses the user with parallel popups; a subagent posts replies cross-platform and creates duplicates; a subagent writes to the parent’s long-term memory and poisons it. So the real question this chapter answers is: in the face of those failure modes, what hard boundaries must the framework enforce? The four systems give very different answers.

Codex takes the “type every subagent use case” route: every spawn must carry a `SubAgentSource` enum value (a Rust algebraic data type: the caller picks one concrete variant and cannot omit the field) declaring its purpose (`Review` for user-triggered code review, `Compact` for context compression, `ThreadSpawn` for user-configured templates, `MemoryConsolidation` for periodic memory promotion, `Other` for plugin / experimental escape hatches), and each source routes to a different prompt / permission / telemetry path. The critical engineering discipline is that subagents inherit 7 core services from the parent session (`exec_policy`, `plugins`, `mcp_manager`, `skills_manager`, `environment_manager`, `agent_control`, …) and can never be more permissive than the parent: there is no “spawn a subagent to bypass execpolicy” escape route. Approvals are forced back to the parent session, and subagents never own UI.

Claude Code goes the opposite way and unifies subagents with every other async execution: a generic `Task` abstraction covers everything that “runs in the background and returns later”, with 7 `TaskType` variants (`local_bash` for shell commands, `local_agent` for local subagents, `remote_agent` for remote endpoints, `in_process_teammate` for same-process teammates, `local_workflow`, `monitor_mcp` for long-running MCP monitors, `dream` for background prefetch) and a clean `TaskStatus` state machine (`pending` / `running` / `completed` / `failed` / `killed`). Every tool self-reports `isConcurrencySafe()` (a boolean method on the tool instance) so the dispatcher can decide whether to parallelise it with the rest of the turn’s tool calls.

OpenClaw built the most complete subagent platform: one subsystem split across 6 files (`spawn`, `registry`, `depth` for cross-process depth persistence, `announce` for completion push, `attachments` for file passing, `lifecycle-events`), with a `SpawnSubagentParams` type that carries 13 configuration fields (task, label, agentId, model, timeout, mode, cleanup, sandbox, attachments, …). The most distinctive bit: the anti-pattern warning “do not poll, wait for the push event” is literally written into the spawn tool’s return-value text so the model reads it as it decides what to do next, an engineering fix for the “subagent mode burns tokens polling” failure mode.

Hermes takes the most restrained route: a single `delegate_task` function (parameters: `goal`, `context`, `toolset` whitelist, `max_iterations`), with the critical safety constraints baked into constants. `MAX_DEPTH=2` is a hard-coded ceiling (parent 0 → child 1 → grandchild rejected); 5 tools are hard-blocked for every child (`delegate_task` / `clarify` / `memory` / `send_message` / `execute_code` are removed from any subagent’s toolset); the parent uses `ThreadPoolExecutor` (the Python stdlib thread pool) to block synchronously on all children; a heartbeat fires every 30 seconds so the UI knows the run is alive.

Building an orchestration framework? Borrow from OpenClaw. Building an agent product where subagent use cases are clear? Borrow from Codex. Unifying async execution? Borrow from Claude Code. Want the simplest reliable delegation? Borrow from Hermes.

§2 · Reference architecture

Four subagent models: typed inheritance vs polymorphic TaskType vs full platform vs minimal delegate — One product need (spawn a child agent to do work) lands at four different abstraction layers.

How the four systems cover the five concurrency-critical concerns:

Dimension	Codex	Claude Code	OpenClaw	Hermes
Subagent abstraction	`SubAgentSource` enum, 5 variants (Review / Compact / ThreadSpawn{depth, agent_role} / MemoryConsolidation / Other)	`TaskType` 7 variants (local_bash / local_agent / remote_agent / in_process_teammate / local_workflow / monitor_mcp / dream)	`SpawnSubagentParams` 13 fields (task / label / agentId / model / thinking / mode / sandbox / cleanup / attachments...)	`delegate_task(goal, context, toolset, max_iterations)` function call
Depth control	`ThreadSpawn.depth` recorded; rejects past configured limit	No explicit depth limit (user manages it)	`DEFAULT_SUBAGENT_MAX_SPAWN_DEPTH` configurable; session store persists spawnDepth	`MAX_DEPTH = 2` (parent 0 -> child 1 -> grandchild rejected)
Tool propagation	Child inherits parent `exec_policy` / `plugins` / `mcp_manager`; review explicitly disables some features	Each task type carries its own spec	inherit / require sandbox toggle; workspace inheritable or overridden	Five tools hard-blocked: delegate / clarify / memory / send_message / execute_code
Concurrency model	Interactive uses receiver / sender; one-shot awaits; callers use `tokio::join!` for parallel	`isConcurrencySafe()` per tool tells dispatcher when to parallelize	Per-session active-run limit; push-based completion + auto-announce so callers do not poll	ThreadPoolExecutor, default 3 workers, configurable; parent blocks until all done
Completion signal	Receiver pushes `EventMsg::TurnComplete` events back to parent	Task FSM: pending -> running -> completed / failed / killed	subagent-announce pushes events to parent; falls back to polling on timeout	Synchronous `as_completed` over futures

How much sophistication each system brings to subagent spawning

§3 · How each system does it

Codex · Split “spawn subagent” into 5 typed sources, each routing to a different permission and prompt path

Codex’s core judgement on subagents is this: “spawn a subagent” should not be an undifferentiated function call but a set of semantically distinct behaviours. Even though “user manually triggers /review for code review”, “context exceeded so trigger summarisation”, “user invokes their own ThreadSpawn template”, and “system periodically consolidates memory” are all ‘spawn a subagent’, they use different prompts, allow different tools, expect different output shapes, may or may not show in the UI, and emit different telemetry. If they all funnel through one generic spawn function, every caller has to manually configure these differences themselves, and the configuration will drift as call sites multiply.

So Codex forces every spawn to carry a SubAgentSource enum value that explicitly tells the system what kind of subagent this is:

Codex codex/codex-rs/protocol/src/protocol.rs:2558-2576 — SubAgentSource 5 variants; each routes to a different prompt and permission profile

#[derive(Serialize, Deserialize, Clone, Debug, PartialEq, Eq, JsonSchema, TS)]
#[serde(rename_all = "snake_case")]
#[ts(rename_all = "snake_case")]
pub enum SubAgentSource {
    Review,
    Compact,
    ThreadSpawn {
        parent_thread_id: ThreadId,
        depth: i32,
        #[serde(default)]
        agent_path: Option<AgentPath>,
        #[serde(default)]
        agent_nickname: Option<String>,
        #[serde(default, alias = "agent_type")]
        agent_role: Option<String>,
    },
    MemoryConsolidation,
    Other(String),
}

These 5 sources carry very different semantics and constraints. Review corresponds to the user typing /review for code review — the subagent runs the review-specific system prompt and disables a set of features (see chapter 09), output is a structured code review report. Compact corresponds to system-triggered context compression when token limit is hit — the subagent receives the entire parent conversation history and outputs a “semantically preserved but shorter summary” (see chapter 11 on session lifecycle). ThreadSpawn is the user-configured subagent template, freely composable via agent_role and agent_nickname (e.g. “define a testwriter agent specifically for adding tests to my code”); critically, this variant carries a depth field so the framework can track “how deep is the current call chain” and reject further spawns past a configured limit. MemoryConsolidation is periodic memory consolidation — the subagent runs in background to distill long-term-worthy insights from conversation history into memory entries (see chapter 16). Other(String) is an escape hatch for plugins and experimental scenarios, but the string must be explicit — “anonymous spawn” is forbidden.

The benefits of this typing are immediate. First, dispatch can branch on source: Review goes through review prompt, Compact goes through summarise prompt, MemoryConsolidation runs silently in background, ThreadSpawn shows up in the user’s UI, Other gets the strictest default constraints. Without this typing, dispatch would have to heuristically infer from prompt content — fragile and unmaintainable. Second, telemetry is consistent — every spawn event automatically carries a source label, and analysis can ask “how many Reviews last week, average duration, failure rate?” rather than manually mining a pile of spawn logs. Third, the permission path is explicit — each source maps to its own feature constraints (Review automatically disables web_search because reviewing code doesn’t need network, while ThreadSpawn allows it by default), and dispatch just looks up the table by source.
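The “look up the table by source” idea can be illustrated with a minimal Python sketch. This is not Codex’s code: `Source`, `Profile`, `PROFILES`, `profile_for` and `spawn_event` are hypothetical names, and the table values are illustrative except where the text above states them (Review disables web_search, ThreadSpawn allows it and is UI-visible, MemoryConsolidation runs silently, Other gets the strictest defaults).

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    """Python stand-in for the five SubAgentSource variants."""
    REVIEW = "review"
    COMPACT = "compact"
    THREAD_SPAWN = "thread_spawn"
    MEMORY_CONSOLIDATION = "memory_consolidation"
    OTHER = "other"


@dataclass(frozen=True)
class Profile:
    prompt: str          # hypothetical prompt identifier
    web_search: bool     # may the child use web search?
    visible_in_ui: bool  # does the spawn show up in the user's UI?


# One table replaces per-call-site configuration. UI visibility for
# Review and Compact is an assumption made for this sketch.
PROFILES = {
    Source.REVIEW: Profile("review", web_search=False, visible_in_ui=True),
    Source.COMPACT: Profile("summarise", web_search=False, visible_in_ui=False),
    Source.THREAD_SPAWN: Profile("user_template", web_search=True, visible_in_ui=True),
    Source.MEMORY_CONSOLIDATION: Profile("consolidate", web_search=False, visible_in_ui=False),
    Source.OTHER: Profile("default", web_search=False, visible_in_ui=False),
}


def profile_for(source: Source) -> Profile:
    """Dispatch branches on the typed source, never on prompt content."""
    return PROFILES[source]


def spawn_event(source: Source, duration_s: float) -> dict:
    """Every telemetry event carries the source label automatically."""
    return {"event": "subagent_spawn", "source": source.value, "duration_s": duration_s}
```

Because the table is keyed by the enum, adding a sixth source without a profile is easy to detect by comparing the table’s keys against the enum’s members.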

Spawning itself goes through one entry point Codex::spawn(CodexSpawnArgs { ... }), but the args carry many services inherited from the parent session:

Codex codex/codex-rs/core/src/codex_delegate.rs:74-105 — Child agent inherits 7 service categories from parent session, avoiding cold-start overhead

pub(crate) async fn run_codex_thread_interactive(
    config: Config,
    auth_manager: Arc<AuthManager>,
    models_manager: SharedModelsManager,
    parent_session: Arc<Session>,
    parent_ctx: Arc<TurnContext>,
    cancel_token: CancellationToken,
    subagent_source: SubAgentSource,
    initial_history: Option<InitialHistory>,
) -> Result<Codex, CodexErr> {
    let (tx_sub, rx_sub) = async_channel::bounded(SUBMISSION_CHANNEL_CAPACITY);
    let (tx_ops, rx_ops) = async_channel::bounded(SUBMISSION_CHANNEL_CAPACITY);
    let CodexSpawnOk { codex, .. } = Box::pin(Codex::spawn(CodexSpawnArgs {
        config,
        installation_id: parent_session.installation_id.clone(),
        auth_manager,
        models_manager,
        environment_manager: Arc::clone(&parent_session.services.environment_manager),
        skills_manager: Arc::clone(&parent_session.services.skills_manager),
        plugins_manager: Arc::clone(&parent_session.services.plugins_manager),
        mcp_manager: Arc::clone(&parent_session.services.mcp_manager),
        conversation_history: initial_history.unwrap_or(InitialHistory::New),
        session_source: SessionSource::SubAgent(subagent_source.clone()),
        thread_source: Some(ThreadSource::Subagent),
        agent_control: parent_session.services.agent_control.clone(),
        dynamic_tools: Vec::new(),
        persist_extended_history: false,
        metrics_service_name: None,
        // ...
        inherited_exec_policy: Some(Arc::clone(&parent_session.services.exec_policy)),
        // ...

Look at inherited_exec_policy: this is the most critical safety boundary in Codex’s subagent design. A child cannot be more permissive than its parent. A binary the parent forbids is also forbidden for the child, so there is no “spawn a subagent to bypass execpolicy” escape route. This single constraint shuts down a particularly nasty attack pattern: if parent and child could have separate permissions, a malicious prompt could induce the agent to spawn a subagent “to help me run this tool” and bypass the parent’s restrictions, which amounts to a privilege-escalation attack surface. Codex closes that path directly. The same logic applies to the other services: in the excerpt above, environment_manager, skills_manager, plugins_manager and mcp_manager are Arc clones of the parent session’s services, agent_control is cloned from the parent, and auth_manager and models_manager are handed through by the caller. The child sees exactly the parent’s toolset (minus any explicitly disabled features), with no path to gaining capabilities the parent lacks.
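A minimal Python sketch of the “child can never be more permissive than the parent” rule, under the assumption that a policy is just a set of forbidden binaries. `ExecPolicy`, `Session` and `spawn_child` are hypothetical names for illustration, not Codex’s types. The key property is structural: the spawn function has a parameter for removing tools but none for adding them, and the child shares the parent’s policy object (the analogue of `Arc::clone`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecPolicy:
    forbidden_binaries: frozenset

    def allows(self, binary: str) -> bool:
        return binary not in self.forbidden_binaries


@dataclass(frozen=True)
class Session:
    exec_policy: ExecPolicy
    tools: frozenset


def spawn_child(parent: Session, disabled_tools=frozenset()) -> Session:
    """Create a child session that can only narrow the parent's capabilities.

    There is deliberately no argument for granting extra tools or for
    supplying a different policy.
    """
    return Session(
        exec_policy=parent.exec_policy,  # same object, not a copy
        tools=parent.tools - frozenset(disabled_tools),
    )


parent = Session(
    exec_policy=ExecPolicy(frozenset({"curl"})),
    tools=frozenset({"read_file", "shell", "web_search"}),
)
child = spawn_child(parent, disabled_tools={"web_search"})
```

With this shape, a prompt that talks the parent into spawning a helper gains nothing: the helper is bound by the identical policy object.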

Approval handling follows the same logic — when a subagent wants to run a risky command, it does not pop its own UI dialog for user approval (if it did, the user would see a flurry of popups from different subagents with no way to make sense of them); instead, the approval event is pushed back to the parent session, which decides whether to surface a prompt. So from the user’s perspective, every approval comes from the parent agent; subagents are “an extension of the parent’s hand”, UI consistency stays intact. This is the same idea as the review sub-agent’s AskForApproval::Never in chapter 09 applied generally — subagents never own the UI, all user interaction goes through the parent.
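The approval routing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Codex’s implementation: `ParentSession`, `SubAgent`, `request_approval` and `run_risky` are hypothetical names, and the user’s decision is modelled as a plain callback.

```python
from typing import Callable, List


class ParentSession:
    """Owns the only channel to the user; children never prompt directly."""

    def __init__(self, ask_user: Callable[[str], bool]) -> None:
        self._ask_user = ask_user
        self.prompts_shown: List[str] = []

    def request_approval(self, child_label: str, command: str) -> bool:
        # Every prompt is attributed and surfaced by the parent, so the
        # user sees a single, consistent source of approval requests.
        prompt = f"{child_label} wants to run: {command}"
        self.prompts_shown.append(prompt)
        return self._ask_user(prompt)


class SubAgent:
    def __init__(self, label: str, parent: ParentSession) -> None:
        self.label = label
        self.parent = parent

    def run_risky(self, command: str) -> str:
        # The child has no UI of its own: it forwards the decision upward.
        approved = self.parent.request_approval(self.label, command)
        return "executed" if approved else "denied"
```

A parent could also answer some requests itself (for example, auto-deny in a non-interactive run) without the child needing to know, which is the practical benefit of funnelling everything through one place.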

Claude Code · Unify subagents with every other async execution into one Task abstraction

Claude Code thinks about subagents very differently from Codex. It observes a fact: “spawn a subagent to do work” is really a special case of “asynchronously execute something that takes time”, and the IDE scenario has many more async-execution cases than just subagents — running a local bash command is async, calling a remote agent service is async, collaborating with another teammate agent in the same process is async, running a workflow is async. Writing separate lifecycle managers, state machines, concurrency controls, and UI displays for each async execution type incurs high long-term maintenance cost, and behaviour drifts apart (a bash-task cancel mechanism diverging from a subagent cancel mechanism, etc.).

So Claude Code abstracts all async executions into one unified Task concept, with TaskType being an enum of 7 variants:

Claude Code claude-code/src/Task.ts:6-30 — TaskType 7 variants; subagent is just one of them

export type TaskType =
  | 'local_bash'
  | 'local_agent'
  | 'remote_agent'
  | 'in_process_teammate'
  | 'local_workflow'
  | 'monitor_mcp'
  | 'dream'

export type TaskStatus =
  | 'pending'
  | 'running'
  | 'completed'
  | 'failed'
  | 'killed'

/**
 * True when a task is in a terminal state and will not transition further.
 * Used to guard against injecting messages into dead teammates, evicting
 * finished tasks from AppState, and orphan-cleanup paths.
 */
export function isTerminalTaskStatus(status: TaskStatus): boolean {
  return status === 'completed' || status === 'failed' || status === 'killed'
}

The 7 TaskTypes each cover a specific async-execution scenario. local_bash is “run a shell command locally” (cd to a directory, run tests, run lint) — shares the same lifecycle management as spawning a subagent but does not involve LLM calls. local_agent is “run a subagent locally” (Codex-style subagent), primarily for letting the model split a big task off to a focused subagent. remote_agent is “call a remote agent service” (e.g. an Anthropic-deployed endpoint specialised for one task). in_process_teammate is “collaborate with another teammate agent in the same process” — typical use is an IDE with multiple agent profiles each handling a slice (one writes code, one writes docs, one does review), spawning each other when needed. local_workflow is “run a predefined workflow” (e.g. a “release process” or “review process” composed of multiple standardised steps). monitor_mcp is “run a long-running MCP monitor task” (e.g. monitoring GitHub PR status changes). dream is the most distinctive new one introduced in Claude Code 2.1 — “background prefetch subagent” — while the user is having their main conversation, a dream task in the background prepares resources that might be needed (e.g. pre-greps relevant project files, pre-loads relevant docs); this is the only TaskType not directly user-triggered, the reason it exists is “make the agent feel faster” (resources are already prepared by the time the user asks).

TaskStatus is a clean state machine: pending (waiting to be scheduled), running (executing), completed (succeeded), failed (errored), killed (cancelled). isTerminalTaskStatus(status) is a simple helper that answers “has this task reached a terminal state”. Per its doc comment, it guards against injecting messages into dead teammates and is used when evicting finished tasks from AppState and in orphan-cleanup paths. The terminal-state abstraction is basic but critical in async execution systems: without it, each of those code paths has to re-derive “is this task still alive” on its own, and races between completion and late messages follow.
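To make the terminal-state guard concrete, here is a small Python sketch. It mirrors the five statuses from the excerpt, but the `Task` class, its `transition` and `inject_message` methods, and the rule “no transition out of a terminal state” are assumptions for illustration, not Claude Code’s actual API.

```python
from enum import Enum


class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    KILLED = "killed"


TERMINAL = frozenset({TaskStatus.COMPLETED, TaskStatus.FAILED, TaskStatus.KILLED})


def is_terminal(status: TaskStatus) -> bool:
    """True when the task will not transition any further."""
    return status in TERMINAL


class Task:
    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.status = TaskStatus.PENDING
        self.inbox: list = []

    def transition(self, new_status: TaskStatus) -> None:
        # A finished task stays finished; a late "running" update is a bug.
        if is_terminal(self.status):
            raise ValueError(f"{self.task_id} is already {self.status.value}")
        self.status = new_status

    def inject_message(self, message: str) -> bool:
        # Guard against delivering messages to a dead task.
        if is_terminal(self.status):
            return False
        self.inbox.append(message)
        return True
```

The single predicate is what keeps message delivery, eviction, and cleanup consistent with each other: all three ask the same question in the same way.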

Every tool also self-reports whether it is “concurrency-safe”: each tool implements an isConcurrencySafe() method returning true or false, and the dispatcher (chapter 04 on the tool system) reads this value to decide whether to run the tool in parallel with the rest of the turn’s tool calls. For instance, reading a file is concurrency-safe (parallel reads have no side effects), writing to a file is not (two parallel writes to the same file can clobber each other), spawning a subagent is usually safe (each subagent carries independent state), and running a bash command is usually not (commands may depend on each other’s side effects). This self-declared capability frees the dispatcher from hard-coding concurrency rules per tool.
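One way a dispatcher could consume such a flag is sketched below in Python. The batching strategy (run consecutive safe calls together, run every unsafe call alone, preserve order) is an assumption chosen for this example; the source does not show how Claude Code’s dispatcher actually groups calls. `ToolCall`, `partition` and `dispatch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolCall:
    name: str
    run: Callable[[], str]
    concurrency_safe: bool  # the tool's self-declared flag


def partition(calls: List[ToolCall]) -> List[List[ToolCall]]:
    """Group consecutive safe calls; each unsafe call gets its own batch."""
    batches: List[List[ToolCall]] = []
    for call in calls:
        if call.concurrency_safe and batches and batches[-1][0].concurrency_safe:
            batches[-1].append(call)
        else:
            batches.append([call])
    return batches


def dispatch(calls: List[ToolCall]) -> List[str]:
    """Run batches in order; calls inside a multi-call batch run in parallel."""
    results: List[str] = []
    for batch in partition(calls):
        if len(batch) == 1:
            results.append(batch[0].run())
        else:
            with ThreadPoolExecutor(max_workers=len(batch)) as pool:
                # map() yields results in submission order.
                results.extend(pool.map(lambda c: c.run(), batch))
    return results
```

The dispatcher never inspects tool names: a new tool participates in parallel execution simply by declaring its own flag.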

Claude Code also provides a set of tools the model uses to manage the task list itself: TaskCreateTool creates a new task, TaskListTool lists currently active tasks, TaskUpdateTool injects new info into a task, TaskGetTool retrieves task results, TaskStopTool terminates a task. These 5 tools form a model-facing API, so the model can plan “I will spawn 3 subagents to do these three things in parallel, then wait for all of them to complete before consolidating”. Every task writes its output to disk via outputFile, which keeps the main process from running out of memory when many subagents produce output: a very pragmatic engineering detail.

OpenClaw · Build subagents as a complete subsystem: 6 modules + 13 fields + anti-polling warning

OpenClaw does the most on subagents, building it as an independent subsystem rather than just a tool call. Open the source and you see a coordinated set of files: subagent-spawn.ts handles creation, subagent-registry.ts handles tracking active subagents, subagent-depth.ts handles cross-process depth persistence, subagent-announce.ts handles completion push, subagent-attachments.ts handles file attachments to subagents, subagent-lifecycle-events.ts handles lifecycle events (start / complete / fail / cancel). This “6-file coordination” decomposition shows OpenClaw does not treat subagents as a lightweight tool but as a first-class runtime entity peer to the main agent.

The most direct illustration: SpawnSubagentParams has 13 configuration fields:

OpenClaw openclaw/src/agents/subagent-spawn.ts:46-80 — SpawnSubagentParams 13 fields; finer-grained config than either Codex or Claude Code

export type SpawnSubagentParams = {
  task: string;
  label?: string;
  agentId?: string;
  model?: string;
  thinking?: string;
  runTimeoutSeconds?: number;
  thread?: boolean;
  mode?: SpawnSubagentMode;          // "run" | "session"
  cleanup?: "delete" | "keep";
  sandbox?: SpawnSubagentSandboxMode; // "inherit" | "require"
  expectsCompletionMessage?: boolean;
  attachments?: Array<{
    name: string;
    content: string;
    encoding?: "utf8" | "base64";
    mimeType?: string;
  }>;
  attachMountPath?: string;
};

Each of these 13 fields solves a specific scenario. task is the description of what the subagent should do. label is the human-readable name (so the parent can say in status updates “review-agent still running, test-writer completed”). agentId chooses which agent profile to use (OpenClaw supports multiple agent templates). model lets the subagent use a different model from the parent (e.g. parent uses sonnet but the child uses opus for harder reasoning). thinking specifies reasoning depth (low / medium / high). runTimeoutSeconds is the subagent’s maximum execution time (it gets killed past this). thread decides whether to start a new thread or extend the parent’s. mode is “run” or “session” — run ends after completion, session stays alive for follow-up. cleanup decides whether to delete or keep the subagent’s session data after completion (keep for debugging, delete to save disk). sandbox decides whether to inherit the parent’s sandbox or require an independent one. expectsCompletionMessage decides whether the parent waits synchronously or fire-and-forget. attachments is a list of file attachments to pass to the subagent (each with name / content / encoding / mimeType). attachMountPath is where the attachments mount inside the subagent’s sandbox. This “13 dimensions configurable” makes OpenClaw able to serve very diverse agent orchestration scenarios, but it also means new users will trip over mode / cleanup / sandbox enum fields — sensible defaults and clear docs are critical.

OpenClaw’s most distinctive design is push-based completion + writing the anti-polling warning literally into the spawn tool’s return value text:

OpenClaw openclaw/src/agents/subagent-spawn.ts:81-84 — Hard warning to the model: do not poll, wait for the push event

export const SUBAGENT_SPAWN_ACCEPTED_NOTE =
  "Auto-announce is push-based. After spawning children, do NOT call sessions_list, sessions_history, exec sleep, or any polling tool. Wait for completion events to arrive as user messages, track expected child session keys, and only send your final answer after ALL expected completions arrive. If a child completion event arrives AFTER your final answer, reply ONLY with NO_REPLY.";

That string is literally the spawn tool’s return value. After spawning a subagent, the model reads this and is explicitly told four things: “do not poll” (do not call sessions_list, sessions_history, exec sleep, or any other tool that tries to actively query subagent state), “wait for push events” (the subagent’s completion arrives as a user message automatically delivered into the parent’s conversation), “track the expected child session keys” (remember the session keys you got at spawn time; only give a final answer once all expected keys have completed), “if a completion arrives after your final answer, reply NO_REPLY” (do not respond again, avoid duplicating content).

This “write engineering discipline directly into tool output text” looks simple-minded but is extremely effective — the model reads the spawn return value while deciding what to do next, and telling it “do not poll” in the spawn output is far more visible and actionable than burying it in the system prompt. This is an engineering fix for the “subagent mode burns tokens polling” anti-pattern, reflecting the OpenClaw team having observed the same issue recurring in real use and deciding to use the tool return text as an in-context warning.
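The bookkeeping the note asks of the model (track expected child keys, answer once, reply NO_REPLY to anything late) can be sketched as a small state holder in Python. `CompletionTracker` and its method are hypothetical names for illustration; in OpenClaw this discipline is carried out by the model following the note, not by a class like this.

```python
from typing import Dict, Iterable, Optional, Set


class CompletionTracker:
    """Tracks pushed completion events for a set of spawned children."""

    def __init__(self, expected_keys: Iterable[str]) -> None:
        self.pending: Set[str] = set(expected_keys)
        self.results: Dict[str, str] = {}
        self.final_sent = False

    def on_completion(self, session_key: str, result: str) -> Optional[str]:
        """Handle one pushed event.

        Returns None while still waiting, the final answer when the last
        expected child reports, and "NO_REPLY" for anything arriving after
        the final answer was already sent.
        """
        if self.final_sent:
            return "NO_REPLY"
        if session_key not in self.pending:
            return None  # unknown or duplicate key: keep waiting
        self.pending.discard(session_key)
        self.results[session_key] = result
        if self.pending:
            return None
        self.final_sent = True
        return "; ".join(f"{k}={v}" for k, v in sorted(self.results.items()))
```

Nothing in this flow queries child state: the parent only reacts to events that arrive, which is the whole point of the push-based design.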

Depth information is persisted to the session store rather than kept in memory:

OpenClaw openclaw/src/agents/subagent-depth.ts:1-48 — spawnDepth persisted to session store, survives process restart

import fs from "node:fs";
import JSON5 from "json5";
import type { OpenClawConfig } from "../config/config.js";
import { resolveStorePath } from "../config/sessions/paths.js";
import { getSubagentDepth, parseAgentSessionKey } from "../sessions/session-key-utils.js";
import { resolveDefaultAgentId } from "./agent-scope.js";

type SessionDepthEntry = {
  sessionId?: unknown;
  spawnDepth?: unknown;
  spawnedBy?: unknown;
};

function normalizeSpawnDepth(value: unknown): number | undefined {
  if (typeof value === "number") {
    return Number.isInteger(value) && value >= 0 ? value : undefined;
  }
  if (typeof value === "string") {
    const trimmed = value.trim();
    if (!trimmed) {
      return undefined;
    }
    const numeric = Number(trimmed);
    return Number.isInteger(numeric) && numeric >= 0 ? numeric : undefined;
  }
  return undefined;
}

This matches OpenClaw’s overall positioning: a long-lived agent platform where sessions resume across processes, so subagent depth has to be persistent state, not an in-memory variable scoped to one process. Imagine the scenario: the parent agent spawns a subagent, then the IDE restarts the process; afterwards the parent session resumes and the subagent resumes too. If depth lived only in memory, both would come back at depth=0, breaking the “max N levels of recursion” guarantee (in principle, each restart would allow another N levels of spawning). Persisting depth to the session store closes that corner case.
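The restart scenario can be reproduced with a minimal Python sketch that keeps depth in a JSON file. Everything here is an assumption for illustration: `DepthStore`, `MAX_SPAWN_DEPTH`, the file layout, and the rule “reject when the child’s depth would exceed the limit” are not taken from OpenClaw’s code; only the `spawnDepth` / `spawnedBy` field names echo the excerpt.

```python
import json
import tempfile
from pathlib import Path

MAX_SPAWN_DEPTH = 2  # hypothetical limit for this sketch


class DepthStore:
    """Persists each session's spawn depth so it survives a restart."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> dict:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def depth_of(self, session_key: str) -> int:
        # Sessions with no entry are treated as root sessions (depth 0).
        return int(self._load().get(session_key, {}).get("spawnDepth", 0))

    def register_child(self, parent_key: str, child_key: str) -> int:
        depth = self.depth_of(parent_key) + 1
        if depth > MAX_SPAWN_DEPTH:
            raise RuntimeError(f"spawn depth {depth} exceeds {MAX_SPAWN_DEPTH}")
        data = self._load()
        data[child_key] = {"spawnDepth": depth, "spawnedBy": parent_key}
        self.path.write_text(json.dumps(data), encoding="utf-8")
        return depth


def demo_restart() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sessions.json"
        first = DepthStore(path).register_child("root", "child")
        # Simulate a process restart: a brand-new store object, same file.
        second = DepthStore(path).register_child("child", "grandchild")
        try:
            DepthStore(path).register_child("grandchild", "great-grandchild")
            third = "accepted"
        except RuntimeError:
            third = "rejected"
        return [first, second, third]
```

Because every `register_child` call re-reads the file, a fresh process sees the same depths as the one that wrote them, so the limit holds across restarts.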

Hermes · The most restrained subagent model: one function + 5 hard-blocked tools + MAX_DEPTH baked in

Hermes’s subagent design is the simplest in this chapter: just one delegate_task(goal, context, toolset, max_iterations) function. The minimalism can look like the feature was never built, but it is deliberate restraint: do the most important things and decline the rest. The reason is that Hermes is a multi-platform chat agent, not an agent orchestration framework, so its subagent needs come down to one thing, “give a clearly defined sub-task to a temporary agent, get the result”, and it does not need OpenClaw’s level of lifecycle management.

But “simple” is not “undesigned”. Hermes’s critical safety constraints are all baked into constants so no caller can bypass them:

Hermes hermes-agent/tools/delegate_tool.py:31-54 — Hard-block 5 tools to prevent recursion / cross-platform side effects / shared-state pollution; MAX_DEPTH baked in

# Tools that children must never have access to
DELEGATE_BLOCKED_TOOLS = frozenset([
    "delegate_task",   # no recursive delegation
    "clarify",         # no user interaction
    "memory",          # no writes to shared MEMORY.md
    "send_message",    # no cross-platform side effects
    "execute_code",    # children should reason step-by-step, not write scripts
])

# Build a description fragment listing toolsets available for subagents.
# Excludes toolsets where ALL tools are blocked, composite/platform toolsets
# (hermes-* prefixed), and scenario toolsets.
_EXCLUDED_TOOLSET_NAMES = frozenset({"debugging", "safe", "delegation", "moa", "rl"})
_SUBAGENT_TOOLSETS = sorted(
    name for name, defn in TOOLSETS.items()
    if name not in _EXCLUDED_TOOLSET_NAMES
    and not name.startswith("hermes-")
    and not all(t in DELEGATE_BLOCKED_TOOLS for t in defn.get("tools", []))
)
_TOOLSET_LIST_STR = ", ".join(f"'{n}'" for n in _SUBAGENT_TOOLSETS)

_DEFAULT_MAX_CONCURRENT_CHILDREN = 3
MAX_DEPTH = 2  # parent (0) -> child (1) -> grandchild rejected (2)

Each of these 5 hard-blocked tools corresponds to a real disaster Hermes actually hit in production. If delegate_task were not blocked, a subagent could call delegate_task itself and spawn further subagents, which quickly turns into runaway nesting: every level multiplies token usage, and within a few levels the context window blows up and the bill goes wild. clarify is Hermes’s tool for asking the user for clarification; subagents should not have it because the user is in conversation with the parent agent, and a subagent unilaterally questioning the user creates UI chaos where nobody can tell who is asking what. memory is Hermes’s tool for writing to the long-term MEMORY.md; letting a subagent write would pollute the parent’s long-term preferences, and since subagents are temporary, what they learn should not affect the parent’s permanent state. send_message is the tool for sending to Telegram / Slack / Discord; letting a subagent use it would cause duplicate cross-platform messages (the same Telegram message sent by both parent and child). execute_code is the tool for running code in a Python sandbox; subagents should not use it because they are scoped to “use existing tools to solve one focused problem” and, as the source comment puts it, should reason step by step rather than write scripts, and another sandbox makes the reasoning chain hard to trace.

These 5 restrictions share one property — they are not “subagents might abuse them so block”, they are “this role conceptually should not have this capability”. That is a design principle worth learning: restrictions are not about distrusting the model but about cleanly defining “this role does not have this capability”.

The meaning of MAX_DEPTH = 2 is: the root agent (the one the user is talking to) is depth 0 and can spawn subagents at depth 1; a depth-1 subagent that tries to spawn further is rejected (and in practice cannot even try, because it does not have the delegate_task tool). Why 2 and not 3 or 5? The Hermes team’s experience is that practically useful subagent tasks almost always fit in one layer (the parent focuses on one thing and delegates a helper task); going to a second layer typically means the model has fallen into a useless decomposition loop (every sub-task gets further decomposed into grandchildren that are not actually needed). In the overwhelming majority of scenarios, allowing deeper nesting wastes tokens rather than serving a real need for more depth.
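The two guards combine into a few lines of Python. The constants below are copied from the excerpt; `child_tools` and `check_child_depth` are hypothetical helpers written for this sketch (the excerpt does not show how Hermes applies the constants), and the example tool names `web_search` and `terminal` are placeholders.

```python
DELEGATE_BLOCKED_TOOLS = frozenset([
    "delegate_task",
    "clarify",
    "memory",
    "send_message",
    "execute_code",
])
MAX_DEPTH = 2  # parent (0) -> child (1) -> grandchild rejected (2)


def child_tools(requested: list) -> list:
    """Strip blocked tools from whatever toolset the caller asked for."""
    return [t for t in requested if t not in DELEGATE_BLOCKED_TOOLS]


def check_child_depth(parent_depth: int) -> int:
    """Return the child's depth, or raise if it would reach MAX_DEPTH."""
    child_depth = parent_depth + 1
    if child_depth >= MAX_DEPTH:
        raise RuntimeError(
            f"delegation depth {child_depth} not allowed (MAX_DEPTH={MAX_DEPTH})"
        )
    return child_depth
```

The two checks are redundant on purpose: even if a blocked tool somehow reached a child, the depth check would still refuse the grandchild.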

Concurrency goes through ThreadPoolExecutor:

Hermes hermes-agent/tools/delegate_tool.py:56-84 — ThreadPoolExecutor with configurable concurrency; parent blocks until all children complete

def _get_max_concurrent_children() -> int:
    """Read delegation.max_concurrent_children from config, falling back to
    DELEGATION_MAX_CONCURRENT_CHILDREN env var, then the default (3).

    Uses the same ``_load_config()`` path that the rest of ``delegate_task``
    uses, keeping config priority consistent (config.yaml > env > default).
    """
    cfg = _load_config()
    val = cfg.get("max_concurrent_children")
    if val is not None:
        try:
            return max(1, int(val))
        except (TypeError, ValueError):
            logger.warning(
                "delegation.max_concurrent_children=%r is not a valid integer; "
                "using default %d", val, _DEFAULT_MAX_CONCURRENT_CHILDREN,
            )
    env_val = os.getenv("DELEGATION_MAX_CONCURRENT_CHILDREN")
    if env_val:
        try:
            return max(1, int(env_val))
        except (TypeError, ValueError):
            pass
    return _DEFAULT_MAX_CONCURRENT_CHILDREN


DEFAULT_MAX_ITERATIONS = 50
_HEARTBEAT_INTERVAL = 30  # seconds between parent activity heartbeats during delegation

The parent synchronously blocks on as_completed() until all children complete; during execution a heartbeat fires every 30 seconds so the UI layer above knows the parent is still waiting (not deadlocked). This is the classic synchronous concurrency model, the opposite of OpenClaw’s push-based async. The upside of synchronous blocking is that the parent’s logic is dead simple (spawn all children, wait for all, consolidate); the downside is that the parent cannot do anything else while waiting, which hurts UI responsiveness for long tasks. Hermes chooses synchronous because its subagent tasks are typically short (completing in 1-2 minutes), so the UX cost of a synchronous wait is small and it buys extreme code simplicity.
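The blocking-wait-plus-heartbeat pattern can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Hermes’s code: `run_children`, `on_heartbeat` and `slow_upper` are hypothetical names, and the heartbeat is implemented here with a helper thread and `threading.Event`, which the excerpt does not confirm.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_children(
    goals: List[str],
    worker: Callable[[str], str],
    max_workers: int = 3,
    heartbeat_interval: float = 30.0,
    on_heartbeat: Callable[[], None] = lambda: None,
) -> Dict[str, str]:
    """Run one worker per goal and block until every child has finished."""
    stop = threading.Event()

    def beat() -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(heartbeat_interval):
            on_heartbeat()

    heartbeat = threading.Thread(target=beat, daemon=True)
    heartbeat.start()
    results: Dict[str, str] = {}
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(worker, goal): goal for goal in goals}
            for future in as_completed(futures):
                goal = futures[future]
                try:
                    results[goal] = future.result()
                except Exception as exc:  # one failed child must not sink the rest
                    results[goal] = f"error: {exc}"
    finally:
        stop.set()
        heartbeat.join()
    return results


def slow_upper(goal: str) -> str:
    """Stand-in for a child agent: takes a moment, then returns a result."""
    time.sleep(0.1)
    return goal.upper()
```

The parent’s control flow stays linear: submit everything, iterate over `as_completed`, return. The heartbeat is the only concurrent piece the parent itself has to manage.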

§4 · What the four systems agree on at the subagent layer

Despite wildly different implementation depth, four things are universally agreed — these consensus points reflect inescapable hard constraints of subagent engineering.

The first is that depth must be controlled: unbounded recursion is not allowed. Codex records depth in ThreadSpawn.depth and rejects spawns past a configured limit, OpenClaw persists spawnDepth to the session store so it survives process restarts, and Hermes hard-codes MAX_DEPTH = 2 as a source constant. Claude Code is the partial exception: it sets no explicit depth limit and instead relies on subagent toolsets not including the spawn tool by default. None of the four is designed to permit unbounded recursive spawning. The reason is simple: without a depth limit, a misconfigured subagent prompt can let the model spawn itself in ever deeper nesting, multiplying token usage with every layer; within a few layers the context window blows up and the bill explodes. Depth limits are not an optional optimisation; they are required hard safety boundaries.

The second is that recursive spawning must be blocked — a subagent cannot spawn the same kind of subagent. Hermes uses the most direct route — putting delegate_task into the hard-blocked tool list so subagents simply do not get the tool. Codex / Claude Code / OpenClaw go through feature flags or toolset restrictions — by default subagent toolsets don’t include the spawn tool. Different implementations, same goal: subagents are “temporary units doing work”, not “another scheduler”; scheduling is reserved for the root agent.

The third is that subagent tool capabilities are a strict subset of the parent’s — children may not gain tools the parent lacks. This goes deeper than the first two — it defends not against “subagent acts maliciously” but against “prompt injection induces a subagent to gain tools the parent doesn’t have”. If a child could have tools its parent lacked, a malicious prompt could let the parent first spawn a more-privileged child, then have the child do what the parent was forbidden — fundamentally a privilege escalation attack. All four systems treat “inherit the parent’s permission boundary” as default behaviour, and none of them allow “enable in child what is disabled in parent”.
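The subset rule reduces to a set intersection applied at spawn time. The sketch below assumes tools are identified by plain string names; `child_toolset` is a hypothetical helper, and the blocked set reuses the five Hermes tool names listed elsewhere in this chapter.

```python
# Tools a child never receives, even if the parent has them.
BLOCKED_FOR_CHILDREN = frozenset(
    {"delegate_task", "clarify", "memory", "send_message", "execute_code"}
)


def child_toolset(parent_tools, requested_tools):
    """Return the tools a child may use, in the order they were requested.

    A tool survives only if the parent itself has it AND it is not on the
    hard-block list, so the result is always a subset of the parent's tools.
    """
    parent = set(parent_tools)
    allowed = [
        tool
        for tool in requested_tools
        if tool in parent and tool not in BLOCKED_FOR_CHILDREN
    ]
    return list(dict.fromkeys(allowed))  # drop duplicates, keep order
```

Because the filter runs in framework code rather than in the prompt, a prompt-injected request for an extra tool is silently dropped instead of being negotiated.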

The fourth is that subagent completion notification must avoid polling: use either push events or awaited futures. Codex pushes EventMsg::TurnComplete to the parent via receiver/sender, OpenClaw actively pushes notifications to the parent via subagent-announce with an explicit “do not poll” warning, Hermes uses as_completed() to await futures, and Claude Code triggers callbacks via TaskStatus state machine transitions. None uses “parent loops querying child status” polling. The reason is practical: under polling, the model repeatedly calls sessions_list / sessions_history and wastes tokens asking “is it done yet”, whereas with push the model simply waits, which is far more token-efficient.

§5 · Where the four diverge most sharply on subagents

Four subagent models plotted on abstraction depth × concurrency freedom — Hermes function-style is the lightest; Codex typed enum sits in the middle; Claude Code Task gives the most concurrency freedom; OpenClaw is the heaviest, full-platform abstraction.

While the agreements above are the floor, the divergence on “how heavy should the subagent abstraction be” is what really decides which approach suits which product. Viewed through “what kind of agent are you building”, the four trade-offs each map to one best-fit scenario.

If you are building an agent orchestration framework hoping users write all sorts of subagent workflows, then OpenClaw’s full subagent platform is the right choice. In this scenario, capabilities like “cleanly managing dozens of concurrent subagents”, “persisting subagent state across processes”, “passing attachments to subagents” are all required, and OpenClaw’s 6-module split + 13-field config + push-based completion + depth persistence covers these needs. The cost is high new-user ramp-up (every one of 13 fields must be understood, every one of 6 submodules must be maintained), but for a framework-positioned product this is unavoidable complexity.

If you are building an agent product where subagent uses have converged to a few clear scenarios (e.g. code review, context compression, memory consolidation), then Codex’s SubAgentSource typing is the right choice. In this scenario, each subagent use has its own prompt and constraints; typing gives dispatch / telemetry / permission paths a single source of truth, vastly more maintainable than asking every caller to manually configure parameters for a generic spawn function. The cost is that adding a new use requires modifying the protocol definition (can’t be added by external plugin), reducing flexibility — but this kind of constraint is actually a good thing for mature products.

If you are building an IDE-integrated agent that needs to manage subagents together with other async executions (bash commands, remote agent calls, workflow batches, teammate collaboration), then Claude Code’s Task abstraction is the right choice. In this scenario, having all async executions share one lifecycle / state machine / UI / cancellation mechanism brings huge consistency gains, and the model can use the same set of tools (TaskCreate / List / Update / Get / Stop) to manage all these async behaviours. The cost is the long-term maintenance of 7 TaskType boundaries, plus the absence of an explicit depth limit, which leaves depth control to user discipline.

If you are building a simple conversational agent and only need the “delegate a task to a throwaway subagent” capability, then Hermes’s delegate_task function + 5 hard-blocked tools is the right choice. In this scenario, the entire subagent module is one function plus a few constants, a new engineer can fully understand it in 5 minutes, and every restriction has a clear failure case it prevents. The cost is missing advanced features like attachments / mode / cleanup, and the synchronous blocking model freezes the UI on long tasks — but for simple chat agents none of these matter.

§6 · My takeaway

System	Score	Strengths	Risks
Codex	★★★★★	5 SubAgentSource variants name every usage; child inherits 7 service categories (including exec_policy); approvals route back to parent; depth lives in ThreadSpawn. Extremely consistent engineering	Adding a sixth variant requires a protocol change; inherited_exec_policy makes the child strictly weaker than the parent, limiting flexibility
Claude Code	★★★★	Task abstraction unifies 7 async execution kinds; isConcurrencySafe lets the dispatcher decide parallelism; clear FSM; outputFile keeps memory bounded	Long-term maintenance of 7 variants; no explicit depth limit, users manage it themselves; "dream" background tasks add operational complexity
OpenClaw	★★★★★	Complete subagent platform: spawn / registry / depth / announce / attachments / lifecycle all built in; push-based plus an anti-polling warning literally inside the tool return value; depth persisted across process restarts	13-field config has a learning curve; six submodules to maintain; new users get tripped up by the mode/cleanup/sandbox enums
Hermes	★★★★	One function plus five blocked tools plus MAX_DEPTH=2. The minimum viable subagent design, each restriction tied to a concrete failure mode	Synchronous blocking model freezes the parent UI when children run long; missing advanced features like attachments; the 7-backend compatibility surface is user-managed

Scoring criteria: abstraction consistency + safety-boundary clarity + concurrency-model fit

§7 · Build recipe

Below is the recipe distilled from the four systems for writing your own sub-agent. Start functional, then add production-grade features, finally avoid five common dead ends.


Minimum viable

Start functional (borrow from Hermes' delegate_task): args goal (task description) / context (only the necessary context, not the parent's full history) / toolset (restricted tool set) / max_iterations (independent budget). Simple, with every argument explicit.
Hard-block 5 tools to prevent recursion and cross-platform side effects: blocked_tools = [delegate, clarify, memory, send_message, execute_code]. delegate prevents recursion, clarify stops the child from prompting the user, memory prevents shared-memory pollution, send_message prevents out-of-bounds messaging, and execute_code stops the child from spawning its own code sandbox.
Hard-code MAX_DEPTH (borrow Hermes' 2 levels): parent(0) → child(1) → grandchild rejected. Two levels cover 95% of sub-agent scenarios; anything deeper needs a special audit.
Default to 3 concurrent children, configurable; the parent blocks on a ThreadPoolExecutor until all children finish. Three is the sweet spot for most scenarios (more tends to hit API rate limits), and making it configurable lets users tune it to their quota.

Advanced

Type the source enum (borrow from Codex's SubAgentSource): Review / Compact / ThreadSpawn / MemoryConsolidation / Other. Sub-agents from different sources need different trace / monitoring strategies, and typing lets you query the event stream by class.
Child inherits the parent's core services (exec_policy / plugins / mcp / skills) and is never more permissive than the parent. This prevents a child from bypassing the parent's safety constraints, which many attack patterns rely on.
Approvals route back to the parent (child has AskForApproval::Never); the child never owns UI. The child runs silently and approval events bubble up to the parent agent, avoiding the UX disaster of multiple sub-agents prompting at once.
Push-based completion (borrow from OpenClaw's subagent-announce): the spawn return value tells the model "don't poll, just wait for notification". Otherwise the model wastes tokens repeatedly calling a tool such as wait_for_subagent.
Persist depth to the session store (borrow from OpenClaw's subagent-depth.ts): depth information survives a process restart, so recovery knows which layer it is on. Depth held only in memory cannot be resumed.
Unify subagents with other async execution (borrow from Claude Code's Task + TaskStatus): 7 TaskTypes managed under one abstraction (local_bash / local_agent / remote_agent / in_process_teammate / local_workflow / monitor_mcp / dream), so UI and monitoring see a Task without distinguishing types.
Spill outputs to disk (borrow from Claude Code's outputFile): the combined output of a fleet of sub-agents can exhaust the main process's memory. Once output is on disk, the main process holds only the path. This is a memory optimisation for long-running scenarios.
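The last item, spilling outputs to disk, can be sketched as below. `TaskRecord`, `record_output`, and `read_output` are hypothetical names; the sketch only illustrates the idea of keeping a path and a short preview in memory, and does not reflect how Claude Code's outputFile is actually implemented.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskRecord:
    task_id: str
    output_file: Path  # full output lives here, on disk
    preview: str       # short prefix kept in memory for UI and logs


def record_output(task_id: str, output: str, directory, preview_chars: int = 200) -> TaskRecord:
    """Write a child's full output to disk and keep only a small record."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{task_id}.txt"
    path.write_text(output, encoding="utf-8")
    return TaskRecord(task_id=task_id, output_file=path, preview=output[:preview_chars])


def read_output(record: TaskRecord) -> str:
    """Load the full output on demand, for example when the parent asks for it."""
    return record.output_file.read_text(encoding="utf-8")
```

The in-memory cost per child is then bounded by `preview_chars`, regardless of how much the child produced.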

Don't do at first

Don't share the parent's conversation history wholesale with the child: inherited context biases the child toward the direction already taken (cognitive anchoring) and costs it independent judgement. Pass only the task description plus the necessary context.
Don't let children spawn same-kind sub-agents: recursive nesting blows up tokens (one level 100K, three levels 1M), and debug chains become unmanageable beyond depth 2. MAX_DEPTH is a required hard limit.
Don't poll for child completion: a parent agent constantly calling wait_for_subagent burns tokens for nothing. Use push-based completion plus a hard warning (borrow from OpenClaw).
Don't let children run send_message or write shared memory: the first causes cross-platform side effects (a child suddenly messages the user, who is confused), the second pollutes shared state (multiple children writing memory simultaneously create conflicts).
Don't ignore attachment size: base64-encoded images in prompts easily blow the context (a 1 MB image becomes roughly 1.4 million characters of base64 text, far too much for a prompt). Attachments need a size limit plus automatic compression.
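The attachment-size point is easy to check with arithmetic: base64 turns every 3 bytes into 4 characters. A minimal sketch follows, in which `MAX_ATTACHMENT_BYTES` and both helper names are assumptions chosen for the example rather than values from any of the four systems.

```python
import base64

MAX_ATTACHMENT_BYTES = 256 * 1024  # hypothetical cap; tune to your context budget


def base64_length(raw_size: int) -> int:
    """Number of characters base64 produces for `raw_size` bytes (with padding)."""
    return 4 * ((raw_size + 2) // 3)


def encode_attachment(data: bytes, max_bytes: int = MAX_ATTACHMENT_BYTES) -> str:
    """Base64-encode an attachment, refusing anything over the size cap."""
    if len(data) > max_bytes:
        raise ValueError(
            f"attachment is {len(data)} bytes, limit is {max_bytes}; "
            "compress or pass a file path instead"
        )
    return base64.b64encode(data).decode("ascii")
```

For a 1 MiB file (1,048,576 bytes), `base64_length` gives 1,398,104 characters, which is why the check belongs before encoding rather than after.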

§8 · Four subagent flows side by side

Lined up, the abstraction-depth spectrum is visible at a glance: function (Hermes) -> typed enum (Codex) -> task abstraction (Claude Code) -> full platform (OpenClaw).

§9 · Further reading / source entry points

§10 · Exercises

Easy: write a delegate_task function with args goal / context / toolset / max_iterations. The child toolset hard-blocks recursive delegate.
Medium: add a depth limit. Record spawn_depth in the session and reject spawns beyond 2. Verify: a three-level spawn fails at level 3 with an error containing the depth.
Medium: push-based completion. The parent must not poll for child status. The child’s completion arrives as a user message in the parent’s next turn. Verify: after spawning, the parent can only wait for the message, not call sessions_list.
Hard: type the source. Define a SubAgentSource enum with at least Review / Compact / ThreadSpawn. Each source uses a different system prompt and a different feature constraint. Verify: Review disables web_search automatically; ThreadSpawn allows web_search.

§11 · Interview drill: 10 questions with worked answers

Q1 · Concept: Why does Codex type the subagent source as a SubAgentSource enum? Wouldn’t a single spawn(prompt) function do?

Typing solves three concrete problems:

1. Dispatch paths diverge.

The five sources take genuinely different code paths. Review loads the review prompt and disables a batch of features; Compact switches to the summarization prompt while allowing some tools; MemoryConsolidation goes through the memory subsystem and cannot use file edit; ThreadSpawn is a user template whose feature set is config-driven. With a generic spawn(prompt), all that logic moves to the caller; every new type forces every caller to be updated. The enum keeps dispatch centralized.

2. Telemetry can aggregate.

Codex cares about “average review subagent duration”, “compact failure rate”, and “memory consolidation trigger frequency”. These metrics only make sense grouped by source. If source were a free-form string, each caller would invent its own (review / Review / “code review” / “/review”) and the metrics would be unusable. The enum forces alignment.

3. Permission / sandbox configuration lives in one place.

Each source corresponds to a reasonable “tightening”: Review wants AskForApproval::Never; Compact wants web_search=Disabled; MemoryConsolidation does not need BashTool. The enum puts these definitions in one file (subagent_source.rs) instead of scattering them across five callers.

Why not a trait + dynamic dispatch?

Rust traits would also work. But the enum serializes more directly (it already implements Serialize + Deserialize + JsonSchema + TS), which matters because the protocol must travel to the TUI, the log, and the subagent itself. Enum + serde gives one consistent shape.

Counter-example: Claude Code’s TaskType is a TypeScript string-literal union. Exhaustiveness checking there is opt-in (for example a default branch that assigns the value to never), so a switch without it can silently miss a newly added variant. In Rust, every match on the enum that lacks a wildcard arm stops compiling until the new variant is handled.

Source: codex/codex-rs/protocol/src/protocol.rs:2558-2576 plus the source-specific branches in codex_delegate.rs.

Follow-up: “Why does Other(String) exist?” To cover plugin / experimental flows. The four named variants cover 95% of production usage; new types live in Other("plugin-name") until they stabilize and graduate to a real variant. Classic enum escape hatch.

Q2 · Architecture: Codex makes the child inherit the parent’s exec_policy / plugins / mcp_manager. When does this “parent-child shared service” cause problems?

Inheritance is correct by default for three reasons:

Child is never broader than parent: there is no “spawn a child to bypass limits” escape hatch. Safe.
Avoid cold start: plugins / mcp_manager are expensive to boot (subprocesses, connections). Sharing one saves 100-500ms.
State stays in sync: if the parent just disabled a plugin, the child should also see it disabled. Sharing prevents drift.

But three concrete scenarios bite back:

Scenario 1 · Concurrent mutation of mcp_manager state.

mcp_manager keeps an “active MCP servers” list. Parent enables git server at turn 1; child also enables it mid-turn (refcount = 2). Parent disables at turn 2 (refcount = 1, server stays up). Child finishes without an explicit disable (it inherited from parent). Now the git server lives forever — process leak.

Codex’s fix: mcp_manager is shared via Arc, but enable/disable is parent-exclusive; the child can only read. The dispatch layer in codex_delegate enforces this.

Scenario 2 · exec_policy updated by parent while child is running.

Rust Arc::clone(&parent.exec_policy) gives the child a shared handle to the same policy value; data behind an Arc is immutable unless it uses interior mutability. But if exec_policy internally holds a Mutex<HashMap<...>> and the parent edits it mid-flight, should the child see the new policy?

Codex picks “snapshot at spawn” — the child uses the policy as it was at spawn time; later changes do not affect it. Rationale: the child’s prompt was decided at spawn time; runtime policy changes make child behavior unpredictable, and “user edits config and expects it to apply mid-task” is unintuitive.

Trade-off: in-flight subagent runs cannot pick up policy changes. Codex prefers “predictable behavior” over “live config”.
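The “snapshot at spawn” semantics can be illustrated with a Python analogue. This is not Codex's Rust code; `snapshot_policy` is a hypothetical helper, and the policy is assumed to be a plain dictionary for the sake of the example.

```python
import copy
from types import MappingProxyType


def snapshot_policy(policy: dict) -> MappingProxyType:
    """Return a read-only copy of `policy` taken at spawn time.

    The deep copy detaches the child from later parent edits; the mapping
    proxy stops the child from adding or replacing top-level keys. Nested
    values are copies, so mutating them cannot affect the parent either.
    """
    return MappingProxyType(copy.deepcopy(policy))


# Usage: the parent edits its policy after spawning; the child is unaffected.
parent_policy = {"allowed_commands": ["ls", "cat"]}
child_policy = snapshot_policy(parent_policy)
parent_policy["allowed_commands"].append("rm")
```

After the last line, `child_policy["allowed_commands"]` still holds only `ls` and `cat`, which matches the behaviour described above: an in-flight child keeps the policy it was spawned with.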

Scenario 3 · skills_manager load order.

skills_manager keeps a “loaded skills” list. Parent loads skill A at turn 0; child loads skill B at turn 0.5; should the parent see skill B at turn 1?

Codex’s choice: shared manager, both parent and child see B. Rationale: skills are “loadable capabilities”, not “behavioral state”. Loading more does no harm; whoever needs it next gets it ready.

Engineering lesson: the shared-service boundary follows semantics. Read-only loadable capabilities are sharable; capabilities with state machines need cloning; cross-child mutation of parent state needs strict enforcement.

Source: codex/codex-rs/core/src/codex_delegate.rs:74-105 (the seven Arc::clone lines).

Follow-up: “What about cross-process agents?” One service tree per process; cross-process subagents get a serialized snapshot, not an Arc. Cross-process subagents therefore lose live state (e.g., plugin updates) but gain isolation. Trade-off you make at the protocol layer.

Q3 · Concept: Hermes blocks five tools for children (delegate_task / clarify / memory / send_message / execute_code). What real failure does each correspond to?

Each of the five maps to a real incident, not dogma:

delegate_task · Self-recursive nesting

Children are also LLM-driven. If a child has delegate_task, it will think “this task is too big, let me spawn my own child”. Parent → child → grandchild → great-grandchild — every level burns tokens. Hermes once hit a four-level recursive spawn that drained 60K of context. After that, MAX_DEPTH = 2 was added as a fallback, plus the tool-level block as belt-and-suspenders.

clarify · Child cannot talk to the user

The user is talking to the parent. A child popping a clarify("which algorithm do you want?") confuses the routing: either the UI shows multiple stacked clarify dialogs, or the message vanishes silently. Hermes hit a deadlock when an analysis subagent fired clarify and the TUI did not know where to route it. From then on, children must never ask; if they do not know something they put it in the result and let the parent decide.

memory · Shared MEMORY.md pollution

MEMORY.md is Hermes’s long-term memory file, user-scoped. If a child can write:

Child A: “user prefers concise answers”
Child B: “user prefers detailed explanations” (because A and B saw different tasks)

After both write, MEMORY.md contradicts itself, and the parent’s next answer becomes inconsistent. From then on only the parent writes; children emit short-term facts in the result and the parent decides whether to promote them.

send_message · Cross-platform duplicate notifications

send_message is Hermes’s Telegram / Slack / Email tool. If children can also send, a “summarize and notify” task ends up:

Parent asks a child to summarize.
Child finishes and decides “let me notify them too” (autonomous decision).
Parent gets the result and also notifies.

User pinged twice. Hermes had this happen — a user got two identical “Build broken” messages on Slack. From then on, external side effects are parent-exclusive.

execute_code · Child should reason with tools, not spawn another Python

execute_code is Hermes’s Python sandbox. Children given access tend to “write a Python script”. But child tasks are usually “analyze this log” or “summarize this file” — they should reason and use read_file / grep, not spawn another Python process. Once children also use execute_code:

Extra process per child, resource use 2×.
Python errors don’t propagate cleanly back to the parent.
Child stdout goes to the child LLM but the parent wants it too; needs explicit return.

Engineering lesson: a child is a “resource-constrained reasoning LLM”, not “a mini agent that picks its own stack”. Sharing the parent’s tools is fine; “spawn a new sandbox” is not.

Source: hermes-agent/tools/delegate_tool.py:31-54 (DELEGATE_BLOCKED_TOOLS).

Follow-up: “What if the business genuinely needs the child to write memory?” Make the child output proposed_memory_update: ["fact 1", "fact 2"]. The parent then applies rule / human review before touching MEMORY.md. Keep the decision with the parent.
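The proposal pattern in this follow-up can be sketched as below. Only the field name `proposed_memory_update` comes from the text; `ChildResult`, `apply_memory_proposals`, and the `approve` callback (standing in for a rule check or human review) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ChildResult:
    summary: str
    # The child never writes memory itself; it only proposes facts.
    proposed_memory_update: list[str] = field(default_factory=list)


def apply_memory_proposals(
    memory: list[str],
    result: ChildResult,
    approve: Callable[[str], bool],
) -> list[str]:
    """Parent-side gate: append only approved, non-duplicate facts to memory.

    Returns the facts that were actually accepted.
    """
    accepted = []
    for fact in result.proposed_memory_update:
        if fact not in memory and approve(fact):
            memory.append(fact)
            accepted.append(fact)
    return accepted
```

The write path to long-term memory stays in one place, so conflicting proposals from two children are resolved by the parent instead of racing each other.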

Q4 · Concept: OpenClaw uses push-based completion and stuffs an anti-polling warning into the tool return value. What problem does that solve?

It solves a classic LLM-as-orchestrator failure: the model will not wait; it polls.

The failure

Parent spawns a child and the model thinks “I should check on this task”. It starts calling sessions_list, sessions_history, exec sleep, etc. Every poll is another LLM call plus tool call, burning tokens, while the UI looks like the parent is “thinking”.

GPT-4 / Claude 3 / Sonnet have all been observed doing this. Even if the system prompt says “do not poll”, the model forgets after a few turns.

OpenClaw’s fix

SUBAGENT_SPAWN_ACCEPTED_NOTE is literally inside the tool result field:

Auto-announce is push-based. After spawning children, do NOT call
sessions_list, sessions_history, exec sleep, or any polling tool.
Wait for completion events to arrive as user messages, track expected
child session keys, and only send your final answer after ALL expected
completions arrive. If a child completion event arrives AFTER your
final answer, reply ONLY with NO_REPLY.

Putting it in the tool result instead of the system prompt is intentional, for four reasons:

Adjacent to behavior: the model has just spawned and reads “do not poll” immediately. More salient than a system prompt at the top.
Repeats per call: every spawn re-injects the warning. The model gets a reminder each time, even if it forgot the system prompt.
Points at specific anti-patterns: “do not call sessions_list / sessions_history / exec sleep” is actionable, unlike a vague “do not poll”.
NO_REPLY protocol: teaches the model how to gracefully handle the “child completed after final answer” case.

Required supporting machinery

The warning alone is not enough:

subagent-announce push event: the child sends a user message into the parent’s conversation when it finishes; the parent wakes up without polling.
Timeout fallback: if push exceeds N seconds, fall back to a single poll as a safety net.
Session-key tracking: parent records the child session keys at spawn time; final answer waits until every key has announced.
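Session-key tracking and the NO_REPLY rule can be sketched as a small state holder. `CompletionTracker` and its return strings other than `NO_REPLY` are hypothetical; the sketch shows the bookkeeping only, not OpenClaw's implementation.

```python
class CompletionTracker:
    """Tracks which child sessions the parent is still waiting for."""

    def __init__(self):
        self.expected = set()    # child session keys recorded at spawn time
        self.results = {}        # key -> result, filled in by announce events
        self.final_sent = False

    def spawned(self, key: str) -> None:
        self.expected.add(key)

    def mark_final_sent(self) -> None:
        """Call when the final answer goes out early, e.g. after a timeout."""
        self.final_sent = True

    def on_announce(self, key: str, result: str) -> str:
        """Handle one pushed completion event and say what the parent should do."""
        if self.final_sent:
            return "NO_REPLY"    # late completion: do not answer again
        if key not in self.expected:
            return "UNKNOWN"     # not a child this parent spawned
        self.results[key] = result
        if self.expected.issubset(self.results):
            self.final_sent = True
            return "FINAL"       # every expected child has announced
        return "WAIT"
```

The parent never asks a child for status; it only reacts to events, and the tracker tells it whether to keep waiting, answer, or stay silent.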

This is “prompt engineering as architecture”

OpenClaw did not change the model, did not fine-tune, did not retrieve. It put the behavioral rule at the point where the behavior happens. A textbook case of prompt engineering becoming system design.

Contrast with Codex

Codex children also push (EventMsg::TurnComplete), but the parent is Rust code awaiting recv(), not an LLM deciding when to query. Codex does not depend on the model “not polling”.

Which is better?

Codex is more reliable (code enforces it) but limits flexibility (the parent must be code). OpenClaw is more flexible (LLM can do complex decisions) but needs prompt enforcement. Both are valid; pick based on whether your orchestrator is code or model.

Source: openclaw/src/agents/subagent-spawn.ts:81-84 + subagent-announce.ts.

Follow-up: “Why not just block sessions_list?” Because sessions_list is legitimate in other flows (user asks for the session list). A tool cannot be globally banned just because subagent mode misuses it. So the enforcement moves to the prompt.

Q5 · Concept: Claude Code’s Task abstraction unifies seven async execution kinds. What are the benefits and costs of putting subagent and bash under one abstraction?

Benefit 1 · UI consistency

The task list UI needs one renderer, not four (bash / subagent / remote / teammate). A unified TaskListUI displays status / progress / result. Low cognitive cost for users.

Benefit 2 · Concurrency dispatch lives in one place

isConcurrencySafe() is an interface method that each task implements to self-report. The dispatcher never branches on type (if task_type === 'bash' ... else if task_type === 'subagent' ...). Adding a task type just means implementing the interface; the dispatcher stays untouched. This is the open-closed principle.
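A Python rendering of the same idea is sketched below. The batching policy (consecutive safe calls run together, an unsafe call runs alone) is an assumption made for the example; `ToolCall` and `plan_batches` are hypothetical names and do not mirror Claude Code's TypeScript.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    concurrency_safe: bool

    def is_concurrency_safe(self) -> bool:
        # Each call self-reports; the dispatcher never inspects `name`.
        return self.concurrency_safe


def plan_batches(calls):
    """Group calls into batches that can each be run in parallel.

    Consecutive concurrency-safe calls share a batch; any unsafe call gets
    a batch of its own, which also preserves the original ordering.
    """
    batches = []
    for call in calls:
        previous_is_safe = bool(batches) and batches[-1][0].is_concurrency_safe()
        if call.is_concurrency_safe() and previous_is_safe:
            batches[-1].append(call)
        else:
            batches.append([call])
    return batches
```

Adding a new kind of call only requires that it answer `is_concurrency_safe()`; `plan_batches` does not change.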

Benefit 3 · State machine consistency

pending → running → completed/failed/killed applies to all task types. Logging, monitoring, error handling stay uniform. isTerminalTaskStatus() is a single function used everywhere.

Benefit 4 · Persistence consistency

Every task writes outputFile to disk. Bash output, subagent reply, remote result — same persistence strategy. Crash recovery and cross-session queries use one mechanism.

Cost 1 · Abstraction loses detail

The types genuinely differ, for example:

bash: blocking / streaming output / signal handling
subagent: async / token cost / approval routing
remote_agent: network latency / auth / quota
teammate: long-lived / message injection
workflow: multi-step / state persistence

Each type-specific concern either bloats the Task interface or leaks into type-specific config. Claude Code picks the latter — TaskCreateTool has 13+ params, many only relevant to one type.

Cost 2 · New task types are not free

Even with no dispatcher change, you must implement isConcurrencySafe() / outputFile / status transitions / signal handling, etc. When dream was added, a pile of corner cases appeared (what if the user switches conversations while dream runs? Who sees the dream result?).

Cost 3 · Debugging is harder

Bash failure and subagent failure look very different, but the log says Task xxx failed. You have to drill into type-specific errors. Per-type loggers make logs more direct but lose uniformity.

Conclusion

The Task abstraction fits “async execution is varied but scheduling is similar” products:

Claude Code is an agent IDE; from the user’s view, running bash / calling a subagent / dispatching a teammate is “wait for a result”. Abstracting helps users.
Hermes is a chat agent; subagents are a minority operation (the dominant tools are read / write / bash). Unifying the abstraction costs more than it saves. Hermes uses a standalone delegate_task function.

Engineering lesson: abstraction is not unconditionally good. It depends on whether these things are actually homogeneous in the user’s view. If yes, abstract. If not, keep them separate.

Source: claude-code/src/Task.ts:6-30 + src/tools/TaskCreateTool/.

Follow-up: “How did dream fit into the 7 types?” dream is the “background prefetch” added in 2.1; the user did not trigger it. To fit Task, the dispatcher gained an auto_triggered flag, and the UI hides dream by default (debug view shows it). Concrete example of abstraction leakage — accommodating dream added a property to the abstraction.

Q6 · Practical: You are adding subagent capability to your own agent from scratch. What does the path from MVP to production look like?

Five stages: functional MVP → safety boundary → parallelism → typed sources → platformization.

Stage 1 (Day 1) · Functional MVP

Borrow from Hermes:

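def delegate_task(goal: str, context: str, toolset: list[str], max_iterations: int = 50):
    # run_subagent_sync is your own agent loop (not shown here): it runs the
    # child to completion with the given tools and budget, then returns its result.
    return run_subagent_sync(goal, context, toolset, max_iterations)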

Day-one ready. Parent uses it like:

parent: I want to analyze these 3 log files
parent: delegate_task("analyze log 1", ..., ["read_file", "grep"])
... wait for result
parent: delegate_task("analyze log 2", ..., ["read_file", "grep"])
... wait for result

Stage 2 (Day 2-3) · Safety boundary (do not skip)

Add four things that cannot be skipped:

MAX_DEPTH = 2: parent → child → grandchild rejected. Simple counter passed in spawn args.
blocked_tools: hard-block 5 (recursive delegate / clarify / memory / send_message / execute_code). Filter even if the toolset listed them.
Concurrent limit: default 3, configurable. ThreadPoolExecutor / asyncio.Semaphore.
Timeout: per-child max_iterations + total timeout (e.g. 5 min). On timeout, kill the child; parent gets TimeoutError.

This stage is the boundary between “minimum viable” and “blows up in production”.
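The concurrency limit and per-child timeout from this stage can be sketched with asyncio. The function names and constants are illustrative assumptions; children are modelled as zero-argument coroutine functions, and the timeout covers a child's run time only, not the time it spends queued behind the semaphore.

```python
import asyncio

MAX_CONCURRENT_CHILDREN = 3    # default from the text above
CHILD_TIMEOUT_SECONDS = 300.0  # the "e.g. 5 min" total timeout


async def run_child_with_limits(child, semaphore, timeout):
    """Run one child under the shared concurrency cap and its own timeout."""
    async with semaphore:
        # wait_for cancels the child coroutine when the timeout expires.
        return await asyncio.wait_for(child(), timeout=timeout)


async def run_children(children, limit=MAX_CONCURRENT_CHILDREN, timeout=CHILD_TIMEOUT_SECONDS):
    """Run all children; a timeout or error is returned in place of that child's result."""
    semaphore = asyncio.Semaphore(limit)
    return await asyncio.gather(
        *(run_child_with_limits(child, semaphore, timeout) for child in children),
        return_exceptions=True,
    )
```

With `return_exceptions=True` one slow child shows up as an `asyncio.TimeoutError` entry in the results instead of discarding the work of its siblings.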

Stage 3 (Week 2) · Parallelism

Let the parent run multiple children at once:

import asyncio

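async def parent_turn():
    # delegate_task_async is the async counterpart of delegate_task
    # (your own coroutine, not shown here).
    tasks = [
        delegate_task_async("analyze log 1", ...),
        delegate_task_async("analyze log 2", ...),
        delegate_task_async("analyze log 3", ...),
    ]
    # gather runs the children concurrently and keeps results in task order.
    results = await asyncio.gather(*tasks)
    return results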

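Each child is an independent LLM call, so three independent children finish in roughly the time of the slowest one (up to about 3× faster wall-clock; token cost is unchanged).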

UI consideration: three children running simultaneously means three progress indicators. Give each child a child_id and render per-child progress.

Stage 4 (Week 3-4) · Typed sources (medium term)

When subagent kinds exceed 3 (e.g., review / summarize / search / analyze), switch to enum:

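from enum import Enum

class SubAgentSource(Enum):
    Review = "review"
    Summarize = "summarize"
    Search = "search"
    Analyze = "analyze"

# PROMPTS and FEATURE_CONSTRAINTS are dicts keyed by SubAgentSource, defined elsewhere.
def delegate_task(source: SubAgentSource, goal: str, context: str):
    prompt = PROMPTS[source]
    feature_constraints = FEATURE_CONSTRAINTS[source]
    # Build the child from prompt + feature_constraints, then run it (omitted here).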

Benefits:

Centralized dispatch
Metrics aggregated by source
Feature constraints determined by source (Review auto-disables web_search)

Stage 5 (Month 2-3) · Platformization (long term)

If you are building an agent framework (other developers write agents on top), borrow OpenClaw:

subagent-registry (visibility into all active runs)
subagent-depth (persisted across processes)
subagent-announce (push completion)
subagent-attachments (file passing)
subagent-lifecycle-events (spawn / progress / complete event stream)

These are “framework features” you do not need in a single-user agent.

Key takeaways:

MAX_DEPTH on day one (recursive nesting and token blow-ups are the most common incident)
blocked_tools on day one (shared-state pollution)
Timeout + heartbeat by end of week 1 (UI freezing)
Typing / platformization is later (data-driven, only when there is real demand)

Source composition: Hermes tools/delegate_tool.py (MVP) + Codex protocol.rs SubAgentSource (typing) + OpenClaw agents/subagent-*.ts (platform). Stitch the three together and you have subagent framework v0.1.

Follow-up: “How does the parent handle a child failure?” Three strategies: (1) retry once (transient error); (2) mark failed and let the parent decide (business logic error); (3) abort the parent task (fatal, e.g. quota exhausted). Most cases pick (2), keep the decision with the parent.

Q7 · Architecture: Codex’s inherited_exec_policy ensures the child is never more permissive than the parent. Is that always right? What if a business genuinely needs “child can do more than parent”?

Default-tight is the safe default

The child’s prompt is decided by the parent. The parent may be prompt-injected, deceived, or maliciously controlled. If the child could exceed the parent’s limits, an attacker could “have the parent spawn a more powerful child” to bypass the sandbox.

Classic attack pattern: user input “write a script that deletes /etc/passwd” → parent’s exec_policy blocks rm -rf → but the child does not inherit → child runs → boom.

Codex’s design is therefore correct: by default, the child’s permissions are a subset of the parent’s, never broader.

But “child does more than parent” is sometimes a real need

Three typical scenarios:

Scenario 1 · Parent is the restricted main conversation, child is a background analysis task.

User opens Claude Code in --restricted mode (read-only analysis), no file writes. But user triggers /review and the review subagent needs to run gh pr diff and call the GitHub API. The parent disabled network; the child needs it.

How to handle it?

Do not let the child override the parent. Instead, the user explicitly grants the child extra capability at the parent layer: “review subagent is allowed to use network”. The parent grants the privilege at spawn time, knowing it is for review, and may attach further constraints (only gh commands, only GitHub domains).

Codex actually does this: when subagent_source = Review, the spawn explicitly sets web_access = AllowedFor(["github.com"]), bypassing the parent’s web_access = Disabled. But only for Review; other sources do not benefit.

Scenario 2 · Parent is IDE mode (no subprocess spawning); child is a build task.

A build must spawn cargo / npm. The parent disables subprocesses to keep the IDE process clean. The child’s build is an explicit user intent.

How to handle it?

Mark the spawn “this is a build task; subprocess is allowed”. The child’s args carry purpose: BuildTask, and the spawn dispatcher removes that single restriction inherited from the parent.

Scenario 3 · Parent is read-only review; the child needs to fix.

/review is read-only. The user sees the review and says “fix it”, which would spawn a fix subagent that writes files.

How to handle it?

Treat the fix subagent as a fresh user request whose permissions are re-evaluated, not inherited from the parent. The “child” is nominally a child but the permission slate is reset. Or, route the fix subagent through the parent’s entry (user re-prompts).

General pattern

inherited_exec_policy strict-by-default + explicit grants for exceptions. Grants must:

Happen at spawn time (not during child runtime)
Scope to a specific source (only certain SubAgentSource benefit)
Carry additional constraints (whitelist of domains / commands / resource caps)
Be audit-logged (every grant is logged for post-incident review)
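
The four rules above can be sketched as one spawn-time function. This is an illustration under stated assumptions, not Codex's implementation: SpawnSource, Policy, Grant, GRANTS, and resolveChildPolicy are hypothetical names, and the single grant shown mirrors the Review example from Scenario 1.

```typescript
// Hypothetical names throughout; this is not Codex's actual API.
type SpawnSource = "review" | "compact" | "other";

interface Policy {
  allowedDomains: string[]; // empty list = network disabled
}

interface Grant {
  source: SpawnSource;    // scoped to exactly one source
  extraDomains: string[]; // the grant carries its own whitelist
}

interface AuditEntry {
  source: SpawnSource;
  extraDomains: string[];
}

// Fixed before any child runs, so a child cannot talk its way into a grant.
const GRANTS: Grant[] = [{ source: "review", extraDomains: ["github.com"] }];

function resolveChildPolicy(
  parent: Policy,
  source: SpawnSource,
  audit: AuditEntry[],
): Policy {
  // Strict by default: start from a copy of the parent's whitelist.
  const domains = parent.allowedDomains.slice();
  for (const grant of GRANTS) {
    if (grant.source !== source) continue;
    for (const d of grant.extraDomains) {
      if (domains.indexOf(d) === -1) domains.push(d);
    }
    // Every applied grant leaves an audit record.
    audit.push({ source, extraDomains: grant.extraDomains.slice() });
  }
  return { allowedDomains: domains };
}
```

The decision is a table lookup at spawn time. Nothing in the child's prompt or runtime output can influence it, which is the property the anti-pattern below lacks.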

Anti-pattern

Letting the child request more permissions from the parent during runtime. The child says “I need rm permission” in its prompt; the parent LLM decides whether to grant. That outsources the permission decision to the LLM, which is too large an attack surface. Attackers can write a prompt that convinces the parent “this is fine” and break the security model.

Source: codex/codex-rs/core/src/codex_delegate.rs:117 (inherited_exec_policy = Some(Arc::clone(...))) + Review-specific overrides in tasks/review.rs.

Follow-up: “Will Anthropic / OpenAI fine-grained permissions solve this?” Similar in spirit. Anthropic’s Computer Use introduces per-action permissions; each action is decided separately, not “child inherits parent”. But the engineering still defaults the child to be strict and only grants on demand; the granularity is finer.

Q8 · Practical: A user reports “after my agent spawns 10 subagents, the parent UI freezes”. Systematic triage.

Four layers: concurrency model → resources → communication → model.

Layer 1 · Concurrency model (10 min)

Check whether subagents are sync or async:

Sync blocking (Hermes-style as_completed): the parent is stuck in executor.shutdown(wait=True), which waits for all 10 children to finish before unblocking. Check whether every child has actually completed.
Async push (OpenClaw-style): the parent should not block. If it does, push events did not arrive.

Investigation commands:

py-spy dump --pid <parent_pid>     # Python
rust-gdb -p <parent_pid>             # Rust

Look for the parent stuck in wait() / recv() / gather(). If it is, the parent is blocked on its children; go to the next layer to find out why they are slow.

Layer 2 · Resource exhaustion (20 min)

Ten children can blow up four resources:

CPU: the LLM call itself costs little local CPU. But a child’s internal work (bash / build) can pin a full core, so 10 children oversubscribe an 8-core machine.
Memory: each child has its own conversation history + tool output buffer. The raw text is small (10 × 50K tokens × 4 bytes/token ≈ 2 MB), but per-message object overhead, serialized copies, and unbounded tool-output buffers can push the total to hundreds of MB. Node.js / Python processes can OOM.
File descriptors: each child opens N (log / temp / socket). 10 × 50 = 500, which can exceed a low ulimit -n.
API rate limit: 10 children making concurrent LLM calls can exceed the RPM quota. Subsequent calls retry, and the slowdown cascades.

Investigation:

top -p <parent_pid> -H              # per-thread CPU / memory
lsof -p <parent_pid> | wc -l        # FD count
cat /proc/<parent_pid>/limits        # ulimit
grep -r "429" logs/                  # rate-limit errors (-r: logs/ is a directory)

Layer 3 · Communication bottleneck (30 min)

For push-based:

Event queue backlog: 10 children push completion at once; the main loop cannot keep up. Check queue size / drain rate.
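Token buffer blown: each child result is a few K characters; 10 results returned together add up to roughly 50K characters of input to the parent LLM, which can overflow the context window once the existing history is counted.
UI render lag: each push triggers a re-render. 10 events in 1 second is not many, but if each re-render takes longer than the 16 ms frame budget, the UI cannot hold 60fps.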

Investigation:

curl localhost:9090/metrics | grep subagent_queue
grep "input_tokens" logs/parent.log
# UI: record a profile in the Chrome DevTools Performance tab

Layer 4 · Model behavior (1 hour)

If everything else looks fine:

Parent is polling: even with OpenClaw’s warning, the parent starts calling sessions_list. Each poll is an LLM call (10s each); the UI says “thinking” while the parent does meaningless lookups.
Parent waits for all completions: the parent was told “wait until all children are done”; if one child hangs, the parent hangs with it. This needs a timeout safety net.
Children spawn more children: even with MAX_DEPTH = 2, a child can suggest “I need to spawn another”; the parent recursively spawns more.

Investigation: look at the parent LLM trace (the messages of each call); find polling or spawn loops.
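
Scanning the trace by hand gets tedious once the parent has made dozens of calls. A minimal sketch of automating it, assuming the trace can be exported as an ordered list of tool names: the ToolCall shape and both helper functions are hypothetical, while the tool names come from the warning text and blocklist quoted elsewhere in this chapter.

```typescript
// Hypothetical trace shape: one entry per tool call the parent made, in order.
interface ToolCall {
  tool: string;
}

// Polling tools named in OpenClaw's warning text; extend for your own framework.
const POLLING_TOOLS = ["sessions_list", "sessions_history"];

// Length of the longest run of consecutive polling calls in the trace.
function longestPollingRun(trace: ToolCall[]): number {
  let longest = 0;
  let current = 0;
  for (const call of trace) {
    if (POLLING_TOOLS.indexOf(call.tool) !== -1) {
      current += 1;
      if (current > longest) longest = current;
    } else {
      current = 0; // any non-polling call breaks the run
    }
  }
  return longest;
}

// Number of spawn calls; a count above the intended fan-out hints at a spawn loop.
function countSpawns(trace: ToolCall[], spawnTool = "spawn_subagent"): number {
  return trace.filter((c) => c.tool === spawnTool).length;
}
```

A long polling run right after a spawn, or a spawn count well above what the user asked for, points to a Layer 4 problem rather than a resource problem.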

Common root causes + fixes

By frequency:

Rate limit hit (30%): limit concurrency to 3-5, add exponential backoff.
Sync blocking dragged down by a slow child (25%): switch to async / push-based + per-child timeout.
Resource exhaustion (20%): monitoring + concurrency limit + cleanup.
Parent model polling (15%): borrow OpenClaw’s anti-poll warning.
UI render lag (10%): debounce + batch events.
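
The exponential backoff mentioned for the rate-limit fix can be sketched as a pure function. The base delay, cap, and "full jitter" strategy here are assumptions chosen for illustration, not values taken from any of the four systems.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped, then jittered.
// baseMs and capMs are illustrative defaults, not taken from any of the four systems.
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 30000,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt));
  // Full jitter: pick uniformly in [0, exponential) so children do not retry in lockstep.
  return Math.floor(random() * exponential);
}
```

Injecting `random` keeps the function deterministic under test. The jitter matters here because all children typically hit the 429 at the same moment and would otherwise retry at the same moment too.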

Prevention

Make these defaults so users do not step on the rake:

max concurrent = 3
single-child timeout = 5 min
queue size limit + drop policy
API rate-limit-aware retry
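
The "queue size limit + drop policy" default can be sketched as a small bounded queue. The class name, the drop-oldest policy, and the queue capacity are assumptions made for this sketch; the concurrency and timeout values repeat the defaults listed above.

```typescript
// Bounded FIFO that drops the oldest entry when full (one possible drop policy).
class BoundedQueue<T> {
  private items: T[] = [];
  private readonly capacity: number;
  dropped = 0; // how many entries were discarded, for monitoring

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be >= 1");
    this.capacity = capacity;
  }

  push(item: T): void {
    if (this.items.length >= this.capacity) {
      this.items.shift(); // drop-oldest: keep the most recent events
      this.dropped += 1;
    }
    this.items.push(item);
  }

  // Remove and return everything queued so far, oldest first.
  drain(): T[] {
    const out = this.items;
    this.items = [];
    return out;
  }

  get size(): number {
    return this.items.length;
  }
}

// queueCapacity is an assumed value; the other two come from the list above.
const SAFE_DEFAULTS = {
  maxConcurrent: 3,
  childTimeoutMs: 5 * 60 * 1000,
  queueCapacity: 100,
};
```

Drop-oldest suits progress events, where only the latest state matters. Completion events carry results, so they probably belong in a separate queue that applies backpressure instead of dropping.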

Source reference: Hermes’s _DEFAULT_MAX_CONCURRENT_CHILDREN = 3, _HEARTBEAT_INTERVAL = 30, and DEFAULT_MAX_ITERATIONS = 50. Three defaults distilled from real incidents.

Follow-up: “What if the business genuinely needs 10 concurrent without freezing?” Use specialized parallelism (worker pool / connection pool / API tier upgrade) instead of brute-forcing it through a generic subagent framework. A subagent is “one agent delegates one task to a throwaway agent”, not “batch process 10000 tasks”. The latter needs a data pipeline, not an agent.

Q9 · Engineering: Claude Code writes task outputFile to disk; OpenClaw persists spawnDepth to session store. Which kind of persistence matters more?

Both matter, for different reasons

outputFile (Claude Code)

Purpose: avoid in-memory blow-up + cross-session recovery.

Scenario: user spawns a bash task running find / for minutes producing hundreds of thousands of lines. An in-memory buffer cannot hold it; outputFile to disk lets the parent hold only a path reference and read chunks on demand.

Bonus: session recovery. User kills Claude Code and reopens; outputFile is still on disk and task history is intact.

spawnDepth (OpenClaw)

Purpose: safety constraints survive across processes.

Scenario: parent spawns child, sets child.spawnDepth = 1. Parent process crashes and restarts. If spawnDepth was in-memory only, the reloaded child thinks it is root again and can spawn more children — MAX_DEPTH is defeated.

Persisting spawnDepth keeps the safety constraint intact across crashes, restarts, and cross-process moves.
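
A minimal sketch of the idea, assuming a key-value store that outlives the process: DepthStore, MapDepthStore, and spawnChild are hypothetical names, not OpenClaw's API. The Map-backed store is only a stand-in so the example runs on its own; a real store would write to disk.

```typescript
// Hypothetical durable store interface; a real implementation would write to disk.
interface DepthStore {
  get(sessionId: string): number | undefined;
  set(sessionId: string, depth: number): void;
}

// Stand-in for the demo only: imagine this is file-backed and outlives the process.
class MapDepthStore implements DepthStore {
  private data = new Map<string, number>();
  get(sessionId: string): number | undefined {
    return this.data.get(sessionId);
  }
  set(sessionId: string, depth: number): void {
    this.data.set(sessionId, depth);
  }
}

const MAX_DEPTH = 2;

// Stateless on purpose: depth is read from the store, never from process memory,
// so a restarted process reaches the same decision as the original one.
function spawnChild(store: DepthStore, parentId: string, childId: string): number {
  const parentDepth = store.get(parentId) ?? 0; // no record = root session
  if (parentDepth >= MAX_DEPTH) {
    throw new Error(`depth ${parentDepth} reached MAX_DEPTH ${MAX_DEPTH}`);
  }
  const childDepth = parentDepth + 1;
  store.set(childId, childDepth); // persist before the child starts running
  return childDepth;
}
```

Note the fallback to 0 when no record exists: a lost record turns a child back into a root, which is exactly the failure described above when depth lives only in memory.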

Which matters more?

If forced to pick:

Single-process, short session (CLI tool): outputFile matters more (memory blow-up is a daily issue).
Multi-process, long session (agent platform): spawnDepth matters more (safety must survive restarts).

Claude Code is the former (CLI / IDE); OpenClaw is the latter (long-lived agent platform). Their priorities follow from that.

Best practice: do both

Any serious subagent system should:

Persist large outputs (outputFile / stdout redirect / chunked storage)
Persist safety constraints (depth / quota / rate-limit state)
Persist state machines (task FSM across restarts)

Hermes does neither because:

Sync blocking, short sessions, small outputs: outputFile not urgent.
Single process: spawnDepth in memory is fine.

But Hermes will need both when it becomes a long-lived agent.

Implementation details

outputFile:

Path: ~/.claude/tasks/<task_id>/output.log
Write: append-only, child stdout redirected directly
Read: parent reads on demand (no full load)
Cleanup: N days after completion, or user delete

spawnDepth:

Path: ~/.openclaw/sessions/<session_id>/depth.json
Write: on every spawn
Read: at parent startup
Cleanup: archive at session end

Anti-patterns

In-memory only: crash loses everything; safety constraints defeated.
SQL DB: subagent spawn/complete is high frequency; per-event DB write is too slow. File-based is enough.
Shared file with concurrent writes: race conditions. One directory per session.

Follow-up: “How to share outputFile across users?” Tasks are user-scoped; outputFile lives in the user’s home dir, not shared. For cross-user analytics, build a separate telemetry pipeline; do not mix it with outputFile.

Q10 · Open-ended: Combine the best of all four into a “universal subagent framework”. Provide a minimum API + implementation outline.

Layered design; enable layers on demand:

Layer 1 · Functional interface (required)

interface SpawnArgs {
  goal: string;
  context?: string;
  toolset?: string[];
  max_iterations?: number;
  source?: SubAgentSource;
}

async function spawnSubagent(args: SpawnArgs): Promise<SubAgentResult> {
  // borrow Hermes delegate_task
}

Minimum viable, 30 lines.

Layer 2 · Typing (recommended)

enum SubAgentSource {
  Review = 'review',
  Summarize = 'summarize',
  Search = 'search',
  Analyze = 'analyze',
  Custom = 'custom',
}

// Partial: only Review is shown here; a full Record would need an entry for every source.
const SOURCE_CONFIG: Partial<Record<SubAgentSource, SourceConfig>> = {
  [SubAgentSource.Review]: {
    system_prompt: REVIEW_PROMPT,
    feature_constraints: { web_search: 'disabled', spawn_subagent: 'disabled' },
    approval_policy: 'never',
  },
};

Borrow Codex SubAgentSource. Each source routes to its own config.

Layer 3 · Safety constraints (required, do not skip)

const HARD_LIMITS = {
  MAX_DEPTH: 2,
  MAX_CONCURRENT: 3,
  SINGLE_TIMEOUT_MS: 5 * 60 * 1000,
  BLOCKED_TOOLS: new Set([
    'spawn_subagent',
    'clarify',
    'memory_write',
    'send_message',
    'execute_code',
  ]),
};

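class DepthExceeded extends Error {}

// Mutates args.toolset in place: blocked tools are stripped silently, not rejected.
// NOTE: an omitted toolset skips the filter, so the default toolset must exclude BLOCKED_TOOLS too.
function validateSpawn(parent: Session, args: SpawnArgs) {
  if (parent.spawnDepth >= HARD_LIMITS.MAX_DEPTH) {
    throw new DepthExceeded();
  }
  args.toolset = args.toolset?.filter(t => !HARD_LIMITS.BLOCKED_TOOLS.has(t));
}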

Borrow Hermes’s five blocks + MAX_DEPTH. Absolutely required.

Layer 4 · Service inheritance (recommended)

interface ParentServices {
  execPolicy: ExecPolicy;
  plugins: PluginManager;
  mcpManager: McpManager;
  skills: SkillsManager;
}

function inheritServices(parent: ParentServices): ParentServices {
  return {
    execPolicy: cloneStrict(parent.execPolicy), // child ≤ parent always
    plugins: shareRef(parent.plugins),
    mcpManager: shareRef(parent.mcpManager),
    skills: shareRef(parent.skills),
  };
}

Borrow Codex inherited_exec_policy.
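
The cloneStrict helper is not defined in this outline. One possible reading, sketched here under that assumption, is an intersection: the child gets only what both the parent holds and the spawn request asks for. ExecPolicySketch and narrowPolicy are hypothetical names, and the policy fields are invented for illustration.

```typescript
// Hypothetical policy shape; the real ExecPolicy is not shown in this chapter.
interface ExecPolicySketch {
  allowedCommands: string[];
  allowNetwork: boolean;
  allowFileWrites: boolean;
}

// Child policy = intersection of what the parent holds and what the child requests.
// The result can be narrower than the parent, never wider.
function narrowPolicy(
  parent: ExecPolicySketch,
  requested: ExecPolicySketch,
): ExecPolicySketch {
  return {
    allowedCommands: requested.allowedCommands.filter(
      (c) => parent.allowedCommands.indexOf(c) !== -1,
    ),
    allowNetwork: parent.allowNetwork && requested.allowNetwork,
    allowFileWrites: parent.allowFileWrites && requested.allowFileWrites,
  };
}
```

Because every field is combined with the parent's value, a child asking for more than the parent holds simply does not get it. No separate check for "is this request an escalation" is needed.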

Layer 5 · Concurrency (optional, pick one)

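import pLimit from 'p-limit';

// Sync blocking (Hermes style): the parent awaits every child, at most MAX_CONCURRENT at a time.
// Promise.all rejects on the first failed child; use Promise.allSettled to keep partial results.
async function runSync(args: SpawnArgs[]): Promise<SubAgentResult[]> {
  const limit = pLimit(HARD_LIMITS.MAX_CONCURRENT);
  return Promise.all(args.map(a => limit(() => spawnSubagent(a))));
}

// Async push (OpenClaw style): returns immediately; the result arrives through the callback.
function spawnAsync(
  args: SpawnArgs,
  callback: (r: SubAgentResult) => void,
  onError: (e: unknown) => void = console.error,
): void {
  spawnSubagent(args).then(callback, onError); // never leave a rejection unhandled
}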

Pick one based on business; expose both for complex products.

Layer 6 · Persistence (production required)

interface SubAgentStore {
  saveDepth(sessionId: string, depth: number): Promise<void>;
  loadDepth(sessionId: string): Promise<number>;
  saveOutput(taskId: string, output: string): Promise<void>;
  loadOutput(taskId: string): Promise<string>;
}

class FileSystemStore implements SubAgentStore {
  // outputFile → ~/.youragent/tasks/<task_id>/output.log
  // spawnDepth → ~/.youragent/sessions/<session_id>/depth.json
}

Borrow Claude Code outputFile + OpenClaw spawnDepth.

Layer 7 · Completion notification (optional)

type CompletionEvent =
  | { kind: 'started'; taskId: string }
  | { kind: 'progress'; taskId: string; output: string }
  | { kind: 'completed'; taskId: string; result: SubAgentResult }
  | { kind: 'failed'; taskId: string; error: Error };

interface SubAgentBroadcaster {
  subscribe(callback: (e: CompletionEvent) => void): void;
}

Borrow OpenClaw subagent-announce. Push to UI / logger / parent agent.
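
A minimal in-memory sketch of such a broadcaster. The names SubagentEvent and InMemoryBroadcaster are hypothetical, the event payloads are simplified to strings, and returning an unsubscribe function from subscribe is an addition that the interface above does not have.

```typescript
// Simplified version of the event union above: payloads reduced to plain strings.
type SubagentEvent =
  | { kind: "started"; taskId: string }
  | { kind: "progress"; taskId: string; output: string }
  | { kind: "completed"; taskId: string; summary: string }
  | { kind: "failed"; taskId: string; error: string };

type SubagentListener = (e: SubagentEvent) => void;

class InMemoryBroadcaster {
  private listeners: SubagentListener[] = [];

  // Returns an unsubscribe function so UI components can detach cleanly.
  subscribe(listener: SubagentListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  publish(event: SubagentEvent): void {
    for (const listener of this.listeners) {
      try {
        listener(event);
      } catch (err) {
        // A failing listener must not block delivery to the others.
        console.error("subagent listener failed", err);
      }
    }
  }
}
```

The UI, the logger, and the parent agent each subscribe once. The spawn layer calls publish and does not need to know who is listening.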

Layer 8 · Anti-polling warning (strongly recommended)

const ANTI_POLL_WARNING = `
Auto-announce is push-based. After spawning children, do NOT call
sessions_list, sessions_history, exec sleep, or any polling tool.
Wait for completion events to arrive as user messages.
`;

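// SpawnArgs has no taskId, so the id assigned at spawn time is passed in directly.
function spawnTool(taskId: string): ToolResult {
  return {
    success: true,
    // The warning rides in the tool result so the model reads it before its next step.
    message: `Spawned ${taskId}. ${ANTI_POLL_WARNING}`,
  };
}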

Borrow OpenClaw SUBAGENT_SPAWN_ACCEPTED_NOTE. Embed in tool result.

Layer 9 · Task abstraction (optional, only for agent platforms)

type TaskType = 'subagent' | 'bash' | 'remote_agent' | 'workflow'; // ... extend with further kinds as needed
type TaskStatus = 'pending' | 'running' | 'completed' | 'failed' | 'killed';

interface Task {
  id: string;
  type: TaskType;
  status: TaskStatus;
  isConcurrencySafe(): boolean;
}

Borrow Claude Code Task abstraction. Only relevant when the product supports multiple async execution kinds.
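
The five statuses only form a state machine once the legal transitions are written down. The table below is an assumed one (terminal states have no outgoing edges) and is not taken from Claude Code's source; SketchTaskStatus, TRANSITIONS, transition, and isTerminal are hypothetical names.

```typescript
type SketchTaskStatus = "pending" | "running" | "completed" | "failed" | "killed";

// Assumed transition table: a task can be killed before or during its run,
// and completed / failed / killed are terminal.
const TRANSITIONS: Record<SketchTaskStatus, SketchTaskStatus[]> = {
  pending: ["running", "killed"],
  running: ["completed", "failed", "killed"],
  completed: [],
  failed: [],
  killed: [],
};

// Returns the new status, or throws if the move is not in the table.
function transition(from: SketchTaskStatus, to: SketchTaskStatus): SketchTaskStatus {
  if (TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}

function isTerminal(status: SketchTaskStatus): boolean {
  return TRANSITIONS[status].length === 0;
}
```

Routing every status change through one function means a late "completed" event for an already killed task surfaces as an error instead of silently reviving it.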

Final API

import { spawnSubagent } from '@your-org/subagent';

const result = await spawnSubagent({
  goal: 'analyze these 3 logs',
  toolset: ['read_file', 'grep'],
  source: SubAgentSource.Analyze,
  max_iterations: 30,
});

console.log(result.summary);

A single call. Enable layers as scenarios get richer.

What each of the four systems contributes:

Codex contributes SubAgentSource + service inheritance.
Claude Code contributes outputFile + Task abstraction (optional layer).
OpenClaw contributes anti-polling warning + spawnDepth persistence.
Hermes contributes the five tool blocks + ThreadPoolExecutor.

Implementation effort:

Layers 1-3: 1 week with tests
Layers 4-6: 2-3 weeks
Layers 7-9: 1-2 months with UI integration

Critical decisions:

Core API is JSON-friendly (cross-language schema)
Configuration is file-based (YAML / JSON describing source behavior)
Reject polling: anti-polling warning baked in on day one

Follow-up: “How do you test such a framework?” Three layers: (1) unit-test spawn / depth / blocked_tools; (2) integration-test push event flow; (3) chaos-test 100 child spawns + kill half + verify resource cleanup. The first two run in CI; the third runs weekly.