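Advanced Workflow, Execution Control, and Autonomy Patterns

Deep patterns for workflow orchestration, state machines, incident response, deployment pipelines, monitoring, planning, and autonomous agent behavior: production-grade architectures found across 500+ plugins that extend the basic execution patterns (Patterns 5-9).

Source research: Extracted from an analysis of 500+ production AI agent plugins across the DevOps, security, migration, and incident response domains.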

Pattern 70: State File as Sole Continuity Mechanism

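Frequency: about 2% of plugins. Related patterns: Configuration Persistence, Persistent Team with Message Board

Definition: Every phase of a multi-agent pipeline reads and writes one shared Markdown/JSON state file, because agents run in isolated context windows with no shared memory. The state file is the only way information flows between phases.

Positive example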

Every phase reads the state file at start and writes updated state at end.
The state file is the SOLE continuity mechanism between phases.

State file structure:
- Phase status (pending/in-progress/completed/failed)
- Phase artifacts (file paths, PR URLs, work item IDs)
- Accumulated decisions and context
- Error state for recovery

Recovery: If a phase fails, the next invocation reads the state file
and resumes from the last completed phase, not from the beginning.

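Why it works: Agents have no shared memory; the state file bridges context windows. Structured phase status lets a run resume from the point of failure without re-running completed phases. Accumulated decisions prevent context loss.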
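The read-at-start, write-at-end discipline can be sketched in a few lines of Python. This is a minimal illustration, not the implementation any particular plugin uses: the phase names, the JSON layout, and the function names are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Optional

PHASES = ["analyze", "implement", "test"]  # hypothetical phase names


def load_state(path: Path) -> dict:
    """Read the shared state file, or start fresh if it does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {
        "phases": {name: "pending" for name in PHASES},
        "artifacts": {},
        "decisions": [],
        "error": None,
    }


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file and rename it, so a crash cannot leave a half-written file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def next_phase(state: dict) -> Optional[str]:
    """Return the first phase that is not completed, i.e. the resume point."""
    for name in PHASES:
        if state["phases"][name] != "completed":
            return name
    return None


def run_phase(path: Path, name: str, work) -> None:
    """Run one phase in isolation: read state at start, write state at end."""
    state = load_state(path)
    state["phases"][name] = "in-progress"
    save_state(path, state)
    try:
        state["artifacts"][name] = work(state)
        state["phases"][name] = "completed"
        state["error"] = None
    except Exception as exc:  # record the failure so the next invocation can recover
        state["phases"][name] = "failed"
        state["error"] = {"phase": name, "message": str(exc)}
    save_state(path, state)
```

A fresh invocation only needs `next_phase(load_state(path))` to find where to resume; nothing from the previous session's memory is required.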

Negative example

Each phase passes its results to the next phase via the return value.
If a phase fails, restart the entire pipeline from the beginning.

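Why it fails: Return values exist only inside a single context window, so they are gone when the next agent starts in a new session. Restarting from scratch wastes all the work of the completed phases and, when the failing phase is nondeterministic, can create an endless retry loop.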

Pattern 71: Zero-Questions Triage (Maximum Autonomy)

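Frequency: <1% of plugins. Related patterns: Interactive Flow Control, Confirmation Gates

Definition: A fixed protocol with zero user interaction, a hard time budget, and staggered burst queries that respect external rate limits. The agent must complete the entire analysis autonomously within a strict time constraint.

Positive example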

Total time budget: 95 seconds
- Phase 1 (Data Gathering): 30s — staggered burst queries (max 3 concurrent)
- Phase 2 (Correlation): 25s — cross-reference all signals
- Phase 3 (Impact Assessment): 20s — quantified customer impact
- Phase 4 (Triage Decision): 20s — severity, routing, immediate actions

Zero user interaction during execution.
Respect external rate limits: stagger MCP calls, max 3 concurrent.
If a data source times out, proceed with available data — note gap.

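Why it works: The time budget prevents unbounded analysis. Zero interaction means the agent can be triggered automatically (for example, by an incident alert). Respecting rate limits prevents cascading failures in the tools the agent depends on.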
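For a harness that enforces the Phase 1 budget in code rather than in the prompt, one possible shape is sketched below with `asyncio`. The limit of 3 concurrent calls and the 30-second budget come from the example above; the `sources` mapping and the function name are hypothetical stand-ins for real MCP calls.

```python
import asyncio

MAX_CONCURRENT = 3        # from the prompt: max 3 concurrent calls
PHASE1_BUDGET_S = 30.0    # from the prompt: 30s for data gathering


async def gather_with_budget(sources, budget_s=PHASE1_BUDGET_S, limit=MAX_CONCURRENT):
    """Query every source under one shared deadline and return (results, gaps).

    `sources` maps a name to a zero-argument coroutine function
    (a hypothetical stand-in for a real MCP call).
    """
    if not sources:
        return {}, []
    sem = asyncio.Semaphore(limit)
    results = {}

    async def fetch(name, call):
        async with sem:  # at most `limit` calls in flight
            try:
                results[name] = await call()
            except Exception:
                pass  # a failed source becomes a gap, not a crash

    tasks = [asyncio.create_task(fetch(n, c)) for n, c in sources.items()]
    _, pending = await asyncio.wait(tasks, timeout=budget_s)
    for task in pending:  # budget exhausted: stop waiting for stragglers
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    gaps = sorted(set(sources) - set(results))  # sources that timed out or failed
    return results, gaps
```

The returned `gaps` list is what lets the later phases "proceed with available data" while still noting what was missing.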

Negative example

Gather all available data before making any triage decision.
Ask the oncall engineer for clarification if any signal is ambiguous.
Query all data sources in parallel for maximum speed.

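Why it fails: "All available data" has no time bound, so the agent may spend minutes chasing marginal signals while the incident worsens. Needing human clarification blocks autonomous execution. Unthrottled parallel queries trip rate limits and cause cascading failures across MCP tools.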

Pattern 72: Pull-Based Kanban Orchestration

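Frequency: <1% of plugins. Related patterns: Multi-Agent Orchestration, Complexity-Tiered Dispatch

Definition: A Kanban-style work system in which agents pull tasks by affinity (matching their expertise) instead of having an orchestrator push tasks to them. It includes a two-tier Ready Gate, a fork protocol for dynamic expansion, and scope escalation rules.

Positive example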

Two-Tier Ready Gate:
- Tier 1: Task has clear acceptance criteria AND no blockers
- Tier 2: Task is scoped to ≤ 2 hours of work for one agent

Pull Assignment by Affinity:
- Each agent has skill tags (frontend, backend, infra, test)
- Agents pull tasks matching their affinity from the Ready queue
- If no affinity match, any available agent can pull

Fork Protocol:
- If task exceeds 4 hours estimate, fork into sub-tasks
- Sub-tasks inherit parent's priority but get independent Ready Gate checks
- Original task moves to "Waiting for subtasks" column

Scope Escalation:
- If agent discovers task is larger than estimated, escalate to coordinator
- Coordinator decides: re-scope, fork, or reassign

8 Prompt-Enforced Invariants with Detection/Recovery:
1. No task may bypass Ready Gate
2. No agent may hold > 2 tasks simultaneously
3. Every task must have an owner before leaving Ready
...

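Why it works: Pull-based assignment naturally matches each task to the best-suited agent. The two-tier Ready Gate keeps poorly defined tasks from consuming agent time. The fork protocol handles scope creep dynamically. Invariants with detection and recovery prevent corruption of system state.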
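The Ready Gate and affinity pull can be modelled as a small data structure, which may help when checking the invariants outside the prompt. The sketch below is illustrative only: the `Task` and `Agent` classes and their field names are assumptions, while the 2-hour and 2-task limits are taken from the example.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_TASK_HOURS = 2    # Tier 2 of the Ready Gate
MAX_HELD_TASKS = 2    # invariant 2: no agent may hold more than 2 tasks


@dataclass
class Task:
    name: str
    tag: str
    estimate_hours: float
    has_acceptance_criteria: bool = True
    blocked: bool = False
    owner: Optional[str] = None


@dataclass
class Agent:
    name: str
    tags: set
    held: list = field(default_factory=list)


def is_ready(task: Task) -> bool:
    """Both tiers of the Ready Gate must pass."""
    tier1 = task.has_acceptance_criteria and not task.blocked
    tier2 = task.estimate_hours <= MAX_TASK_HOURS
    return tier1 and tier2


def pull(agent: Agent, board: list) -> Optional[Task]:
    """Agent pulls one Ready task, preferring tasks that match its skill tags."""
    if len(agent.held) >= MAX_HELD_TASKS:
        return None
    ready = [t for t in board if t.owner is None and is_ready(t)]
    if not ready:
        return None
    matches = [t for t in ready if t.tag in agent.tags]
    task = (matches or ready)[0]   # fall back to any Ready task if no affinity match
    task.owner = agent.name        # invariant 3: owner is set before leaving Ready
    agent.held.append(task)
    return task
```

Because ownership is assigned inside `pull`, a task cannot leave the Ready queue without an owner, and a task that fails either gate tier is never offered to any agent.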

Negative example

The orchestrator assigns tasks to agents round-robin.
If a task is too large, the agent should try its best to complete it.
Agents can pick up as many tasks as needed to stay busy.

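Why it fails: Round-robin ignores agent expertise, so a frontend task can land on an infrastructure agent. Without a fork protocol, an oversized task blocks an agent indefinitely. With no cap on concurrent tasks, one agent can hoard work while others sit idle, and quality drops when an agent juggles too much context at once.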

Pattern 73: Deployment State Machine (Stateless/Re-entrant/Idempotent)

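Frequency: about 1% of plugins. Related patterns: Hub-and-Spoke State Machine, Error Handling

Definition: A detailed state machine for deployment workflows in which handlers run concurrently and the system is designed to be stateless, re-entrant, and idempotent: any handler can crash and restart without corrupting state.

Positive example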

States: QUEUED → VALIDATING → BUILDING → TESTING → STAGING → CANARY → PRODUCTION → COMPLETED
         ↘ FAILED (from any state)
         ↘ ROLLING_BACK (from CANARY or PRODUCTION)

Design principles:
- Stateless: Handler reads current state from external store, processes, writes new state
- Re-entrant: If handler crashes mid-execution, restart reads same state and re-processes
- Idempotent: Running the same handler twice on the same state produces the same result

Concurrent handlers:
- Multiple handlers can process independent deployment units simultaneously
- Handlers claim work via optimistic locking (read version, write new version)
- Version conflict → re-read state and retry

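Why it works: Stateless, re-entrant, and idempotent handlers mean crashes recover without human intervention. Optimistic locking provides concurrency without distributed locks. The state machine makes every transition explicit and auditable.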
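The optimistic-locking claim can be made concrete with a compare-and-set write. The sketch below uses an in-memory `StateStore` as a stand-in for the external store; that class, `VersionConflict`, and `advance` are hypothetical names, and only the forward transitions from the diagram are modelled (FAILED and ROLLING_BACK are left out for brevity).

```python
import threading

# Forward transitions taken from the state diagram above.
NEXT = {
    "QUEUED": "VALIDATING", "VALIDATING": "BUILDING", "BUILDING": "TESTING",
    "TESTING": "STAGING", "STAGING": "CANARY", "CANARY": "PRODUCTION",
    "PRODUCTION": "COMPLETED",
}


class VersionConflict(Exception):
    """Raised when the stored version no longer matches the version that was read."""


class StateStore:
    """In-memory stand-in for an external store with compare-and-set writes."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()  # models the store's own atomicity

    def read(self, unit):
        with self._lock:
            return self._rows.get(unit, ("QUEUED", 0))

    def write(self, unit, state, expected_version):
        with self._lock:
            _, current = self._rows.get(unit, ("QUEUED", 0))
            if current != expected_version:
                raise VersionConflict(unit)
            self._rows[unit] = (state, current + 1)


def advance(store, unit, from_state, max_retries=5):
    """Move `unit` from `from_state` to its successor; a no-op if it already moved."""
    for _ in range(max_retries):
        state, version = store.read(unit)
        if state != from_state:
            return state  # idempotent: another handler (or an earlier run) did it
        try:
            store.write(unit, NEXT[from_state], version)
            return NEXT[from_state]
        except VersionConflict:
            continue  # someone else wrote first: re-read and retry
    raise RuntimeError(f"could not advance {unit} after {max_retries} attempts")
```

The handler keeps nothing between calls, so a crashed handler can simply be restarted with the same arguments; replaying `advance` on a unit that has already moved leaves the stored state unchanged.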

Negative example

Each handler keeps deployment progress in local variables.
Handlers coordinate via shared in-memory state.
If a handler fails, an operator manually restarts the deployment.

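Why it fails: Local variables are lost on a crash, so the deployment cannot resume without human intervention. Shared in-memory state is corrupted when concurrent handlers read-modify-write without locking. Requiring a manual restart defeats the purpose of an automated deployment pipeline.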

Pattern 74: Autonomous PR Feedback Resolution

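Frequency: <1% of plugins. Related patterns: Self-Critique, Confirmation Gates

Definition: The agent autonomously reads PR review comments and decides, for each comment, whether to implement the feedback or push back with reasoning, with no human intervention.

Positive example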

For each PR review comment:
1. Classify: bug-fix, style-nit, architecture-concern, question, praise
2. Decide:
   - bug-fix → implement immediately
   - style-nit → implement if < 5 minutes
   - architecture-concern → push back with reasoning if disagree, implement if agree
   - question → respond with explanation
   - praise → acknowledge
3. After all comments resolved: re-request review

Decision log: Record every decision (implement/pushback) with reasoning
for audit trail.

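Why it works: Classification prevents treating all comments alike. The "push back with reasoning" option means the agent does not blindly implement every suggestion. The decision log provides accountability.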
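The classify-then-decide rules map naturally onto a small dispatch function plus a log. In the sketch below the `Comment` fields are hypothetical, and one branch is an assumption: the example does not say what happens to a style nit that takes 5 minutes or more, so the code labels it `defer`.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    id: int
    category: str               # bug-fix, style-nit, architecture-concern, question, praise
    est_minutes: int = 0        # only consulted for style-nit
    agent_agrees: bool = True   # only consulted for architecture-concern


def decide(c: Comment) -> tuple:
    """Return (action, reasoning) following the decision rules above."""
    if c.category == "bug-fix":
        return "implement", "bug fixes are implemented immediately"
    if c.category == "style-nit":
        if c.est_minutes < 5:
            return "implement", "style nit under 5 minutes"
        # ASSUMPTION: the source does not define this case; 'defer' is illustrative.
        return "defer", "style nit estimated at 5 minutes or more"
    if c.category == "architecture-concern":
        if c.agent_agrees:
            return "implement", "agent agrees with the architectural concern"
        return "pushback", "agent disagrees and replies with its reasoning"
    if c.category == "question":
        return "respond", "questions get an explanation, not a code change"
    if c.category == "praise":
        return "acknowledge", "no change required"
    raise ValueError(f"unknown category: {c.category}")


def resolve_all(comments) -> list:
    """Decide every comment and keep an audit trail of each decision."""
    log = []
    for c in comments:
        action, reasoning = decide(c)
        log.append({"comment": c.id, "category": c.category,
                    "action": action, "reasoning": reasoning})
    return log
```

Every entry in the returned log carries its reasoning, which is the property the audit trail depends on.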

Negative example

For each PR review comment:
1. Implement the requested change
2. Mark the comment as resolved
3. After all comments addressed, re-request review

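Why it fails: Blindly implementing every comment means the agent gives equal weight to contradictory suggestions, style bikeshedding, and incorrect feedback. Without classification, a trivial formatting request consumes as much effort as a critical bug fix. Without a decision log, reviewers cannot tell why the changes were made.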

Pattern 75: 11-Phase Autonomous Development Flow

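Frequency: <1% of plugins. Related patterns: Phased Execution, Skill Composition

Definition: A complete autonomous development workflow, from task understanding through deployment, that chains 11 phases, including adversarial code review and CI monitoring.

Positive example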

Phase 1: Task Understanding (read work item, clarify requirements)
Phase 2: Codebase Analysis (repo structure, conventions, dependencies)
Phase 3: Design (architecture, data model, API design)
Phase 4: Implementation (write code following discovered conventions)
Phase 5: Self-Review (adversarial review of own code)
Phase 6: Testing (write and run tests, fix failures)
Phase 7: CI Integration (push, monitor CI, fix build breaks)
Phase 8: Code Review (read reviewer comments, implement/pushback)
Phase 9: Merge (after approval, merge and monitor)
Phase 10: Deployment Monitoring (watch for regression signals)
Phase 11: Work Item Update (close task, write completion notes)

Guardrails:
- Hard stop after 3 consecutive build failures
- Escalate to human after 2 review cycles without approval
- Never force-push, never merge without CI green

Negative example

Phases:
1. Read the task
2. Write the code
3. Push to main
4. Fix any issues that come up in production

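Why it fails: Skipping the design, self-review, and testing phases means bugs are found in production rather than during development. Pushing straight to main without CI or code review bypasses every quality gate. "Fix issues in production" treats production as the test environment.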

Pattern 76: Staggered Burst Query with Rate Limit Respect

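Frequency: about 2% of plugins. Related patterns: Tool Routing Tables, Error Handling

Definition: When issuing multiple MCP/API calls, stagger them with controlled concurrency and explicit awareness of rate limits instead of firing them all in parallel.

Positive example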

Staggered burst queries:
- Max 3 concurrent MCP calls
- Group queries by target service (all Kusto together, all ADO together)
- Wait for each group to complete before starting the next
- If any call returns 429 (rate limited): back off 2s, retry once, then skip

Cross-server parallelism rule:
NEVER run MCP calls to different servers in the same parallel batch —
one 403 cancels ALL parallel calls in the batch.

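Why it works: The rule against cross-server parallelism prevents a single authentication failure from cancelling unrelated queries. Grouping by service maximizes throughput within each service's rate limit. Skipping a call after one retry prevents endless retry loops.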
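Where the calls are issued by a harness rather than by the model itself, the same policy can be sketched with `asyncio`. The grouping, the limit of 3, the 2-second backoff, and the retry-once-then-skip rule come from the example; the `RateLimited` exception and the function names are assumptions about how a client might surface a 429.

```python
import asyncio

MAX_CONCURRENT = 3   # max 3 concurrent calls per group
BACKOFF_S = 2.0      # back off 2s after a 429


class RateLimited(Exception):
    """Raised by a call that received HTTP 429 (hypothetical client behaviour)."""


async def call_with_backoff(call, backoff_s):
    """Run one call; on 429 back off and retry once, then skip (return None)."""
    for attempt in range(2):
        try:
            return await call()
        except RateLimited:
            if attempt == 0:
                await asyncio.sleep(backoff_s)
    return None


async def _guarded(call, sem, backoff_s):
    async with sem:  # at most MAX_CONCURRENT calls in flight within a group
        return await call_with_backoff(call, backoff_s)


async def run_groups(groups, backoff_s=BACKOFF_S, limit=MAX_CONCURRENT):
    """`groups` maps a service name to a list of zero-argument coroutine functions.

    Groups run strictly one after another, so calls to different services
    never share a parallel batch.
    """
    results = {}
    for service, calls in groups.items():
        sem = asyncio.Semaphore(limit)
        results[service] = await asyncio.gather(
            *(_guarded(c, sem, backoff_s) for c in calls)
        )
    return results
```

A skipped call shows up as `None` in its group's result list, which keeps the gap visible to whatever consumes the results.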

Negative example

Fire all MCP queries in parallel for maximum speed.
If a query fails, retry it immediately up to 10 times.
Mix Kusto, ADO, and Graph queries in the same parallel batch.

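Why it fails: Mixing MCP servers in one parallel batch means a single 403 from one server cancels every in-flight query to the others. Unbounded parallel queries exceed rate limits and trigger 429s across the board. Retrying 10 times without backoff amplifies the rate-limit problem and can block the agent for minutes.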

Pattern 77: Time-Boxed Investigation with Partial Results

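Frequency: about 2% of incident response plugins. Related patterns: Error Handling, Progress Feedback

Definition: A hard time budget for each investigation phase; when time runs out, the agent must report partial results rather than continue indefinitely.

Positive example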

Investigation time budget: 5 minutes per hypothesis
- If hypothesis not confirmed in 5 minutes, mark as INCONCLUSIVE
- Move to next hypothesis
- Report all investigated hypotheses (confirmed, refuted, inconclusive)
- Never spend > 15 minutes total on initial investigation

Partial result format:
| Hypothesis | Status | Evidence | Time Spent |
| DB connection pool exhaustion | CONFIRMED | Connection count = MAX | 2m 30s |
| Memory leak in service X | REFUTED | Memory stable over 24h | 3m 15s |
| Network partition | INCONCLUSIVE | Insufficient data | 5m 00s (timeout) |

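Why it works: Time budgets keep the agent from going down dead ends when speed matters during an incident. The hypothesis table shows everything that was investigated, dead ends included, which is critical for handoff.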
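A time-boxed loop over hypotheses might look like the sketch below. The 5-minute and 15-minute budgets come from the example; the `(name, step)` representation of a hypothesis and the injectable `clock` are assumptions made so the logic can be tested without waiting.

```python
import time

PER_HYPOTHESIS_S = 5 * 60   # 5 minutes per hypothesis
TOTAL_S = 15 * 60           # never more than 15 minutes in total


def investigate(hypotheses, clock=time.monotonic,
                per_s=PER_HYPOTHESIS_S, total_s=TOTAL_S):
    """Work through hypotheses under both budgets and report every one touched.

    `hypotheses` is a list of (name, step) pairs. Each call to `step()` does one
    unit of investigation and returns "CONFIRMED", "REFUTED", or None if the
    hypothesis is still undecided.
    """
    start = clock()
    rows = []
    for name, step in hypotheses:
        if clock() - start >= total_s:
            break  # total budget spent: do not start another hypothesis
        h_start = clock()
        status = None
        while status is None:
            now = clock()
            if now - h_start >= per_s or now - start >= total_s:
                status = "INCONCLUSIVE"  # time is up: record it and move on
                break
            status = step()
        rows.append({"hypothesis": name, "status": status,
                     "seconds": clock() - h_start})
    return rows
```

The budget is only checked between steps, so a single long-running step can overrun it; keeping each step short is part of the contract in this sketch.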

Negative example

Investigate the root cause thoroughly before reporting any findings.
Do not move to the next hypothesis until the current one is fully resolved.
Only report confirmed findings — do not include inconclusive results.

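Why it fails: "Thoroughly" has no time bound, so the agent may spend 30 minutes on one hypothesis while the incident keeps escalating. Requiring full resolution before moving on means a dead-end hypothesis blocks all progress. Hiding inconclusive results discards the investigative context the next responder needs at handoff.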

Pattern 78: Deployment Override Knowledge Encoding

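Frequency: about 1% of plugins. Related patterns: Domain Knowledge Embedding

Definition: Encode the complete taxonomy of deployment override types, their effects, and common KQL filter patterns directly in the prompt, enabling precise queries of deployment state.

Positive example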

Blocking Override Types (checked by deployment gate system DeploymentBlockRule):
| Type | Effect |
| BlockAll | Blocks all deployment to matching machines |
| Halt | Halts deployment of specific version range |
| HaltAndStop | Halts and stops any in-progress deployment |
| Purge | Rolls back to previous version |

Common KQL Filters:
| Filter | KQL |
| Active only | where IsDeleted == false |
| Blocking types only | where DeploymentConfigurationItemType in ("BlockAll",...) |
| By ring | where TargetFilterExpression has "global" |

Negative example

Query the deployment override API for current status.
Filter out any overrides that seem irrelevant.
Summarize the results for the user.

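Why it fails: Without the override taxonomy encoded, the agent cannot tell a blocking override (Halt) from an informational one. "Seem irrelevant" is subjective, and the agent lacks the domain knowledge to filter correctly. Without KQL patterns the agent must guess query syntax on every request, producing inconsistent and often incorrect results.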

Pattern 79: Incident Escalation Decision Matrix

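Frequency: about 2% of incident response plugins. Related patterns: Confirmation Gates, Blast Radius Formula

Definition: A decision matrix that determines the escalation path from quantified impact dimensions, with explicit thresholds for when to page, open a bridge call, or escalate to management.

Positive example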

| Customer Impact | Duration | Escalation |
| < 100 users | < 30 min | Sev 3 — assign to oncall |
| 100-10K users | < 1 hour | Sev 2 — page secondary oncall |
| 10K+ users | Any | Sev 1 — bridge call, page management |
| Data loss | Any | Sev 0 — all-hands, exec notification |

Auto-escalation triggers:
- Sev 2 unacknowledged for 15 minutes → escalate to Sev 1
- Any severity with "security breach" signal → Sev 0 immediately
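Written as code, the matrix becomes a pure function, which also exposes the cells it leaves undefined. In the sketch below two branches are assumptions, marked in the comments: the matrix does not say what to do with fewer than 100 users after 30 minutes, or with 100-10K users after an hour, so the code bumps each to the next more severe level. Treating exactly 10,000 users as "10K+" is likewise an assumption, and the function names are hypothetical.

```python
def classify(users: int, minutes: float,
             data_loss: bool = False, security_breach: bool = False) -> int:
    """Return the severity (0 is most severe) for the given impact."""
    if data_loss or security_breach:
        return 0                       # Sev 0: data loss, or a security breach signal
    if users >= 10_000:
        return 1                       # Sev 1: 10K+ users, any duration
    if users >= 100:
        # ASSUMPTION: 100-10K users for an hour or more is not in the matrix;
        # this sketch treats it as Sev 1.
        return 2 if minutes < 60 else 1
    # ASSUMPTION: fewer than 100 users for 30 minutes or more is not in the
    # matrix; this sketch treats it as Sev 2.
    return 3 if minutes < 30 else 2


def auto_escalate(severity: int, minutes_unacknowledged: float) -> int:
    """Apply the auto-escalation trigger for an unacknowledged Sev 2."""
    if severity == 2 and minutes_unacknowledged >= 15:
        return 1
    return severity
```

Given the same inputs, two runs always return the same severity, which is the reproducibility the matrix is meant to provide.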

Negative example

Assess the severity of the incident based on your best judgment.
Escalate if the situation seems serious.
Page the oncall team if needed.

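Why it fails: "Best judgment" and "seems serious" are subjective, so different agent runs assign inconsistent severities to the same incident. Without quantified thresholds (user count, duration) there is no reproducible boundary between Sev 2 and Sev 1. Without auto-escalation rules, an unacknowledged Sev 2 can sit for hours without being escalated.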

Pattern 80: Scope Estimation and Re-estimation Checkpoints

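Frequency: about 2% of planning plugins. Related patterns: Complexity-Tiered Dispatch, Phased Execution

Definition: Require the agent to estimate task scope before starting, then re-estimate at defined checkpoints during execution. A significant increase in scope triggers escalation.

Positive example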

Before starting:
- Estimate: hours of work, number of files, risk level
- If estimate > 8 hours: recommend decomposition

Checkpoint re-estimation (at 25%, 50%, 75%):
- Compare actual progress to estimate
- If actual/estimate > 1.5x: flag scope creep
- If actual/estimate > 2x: stop and escalate to user

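Why it works: The upfront estimate catches unreasonable tasks before any work is committed. Re-estimation checkpoints catch scope creep while the work is in progress. The 1.5x/2x thresholds are concrete rather than subjective.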
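The checkpoint arithmetic can be spelled out as follows. The example does not define "actual/estimate" precisely, so this sketch makes an assumption: at each checkpoint it projects the total effort as hours spent divided by the fraction of work done, then compares that projection with the original estimate. The function names and return strings are hypothetical.

```python
def initial_check(estimate_hours: float) -> str:
    """Before starting: estimates above 8 hours should be decomposed."""
    return "recommend_decomposition" if estimate_hours > 8 else "proceed"


def checkpoint(estimate_hours: float, hours_spent: float, fraction_done: float) -> str:
    """Re-estimate at a checkpoint (for example at 25%, 50%, or 75% done).

    ASSUMPTION: 'actual' is read as the projected total effort,
    hours_spent / fraction_done; the source does not spell this out.
    """
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    projected_total = hours_spent / fraction_done
    ratio = projected_total / estimate_hours
    if ratio > 2.0:
        return "stop_and_escalate"
    if ratio > 1.5:
        return "flag_scope_creep"
    return "on_track"
```

For instance, a 4-hour task that has consumed 4.5 hours at the 50% checkpoint projects to 9 hours, a ratio of 2.25, so the agent stops and escalates instead of carrying on.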

Negative example

Start working on the task immediately.
If it takes longer than expected, let the user know when you're done.

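Why it fails: Without an upfront estimate, the agent commits to tasks that may take days, burning tokens and time before anyone realizes the scope was wrong. Without checkpoints, scope creep is invisible until the task completes (or fails). "Let the user know when you're done" gives no early warning signal for intervention.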

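Pattern 150: Continuous Execution Mandate

Frequency: single source (superpowers/subagent-driven-development), but highly novel. Related patterns: Zero-Questions Triage, Iron-Law Inviolable Rule Framing

Definition: An explicit ban on checking in between tasks. Once the plan is approved, the agent completes every task in the plan back to back and no longer asks the user "should I continue?" between tasks. This differs from Pattern 71 (Zero-Questions Triage), which covers a single autonomous analysis; this pattern governs execution of a multi-task plan.

Positive example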

## Continuous Execution Mandate

Once the plan is approved, execute ALL tasks in order. Do NOT:
- Ask "should I continue?" between tasks
- Pause for confirmation after each task completes
- Summarize what you just did and wait for approval to do the next thing

Mid-plan check-ins are forbidden. The check-in was the plan approval.

You MAY pause mid-plan only if:
1. A task fails in a way that invalidates remaining tasks
2. You encounter unexpected state that would change the plan
3. A task has a Confirmation Gate explicitly written into it

Otherwise: complete the next task immediately when the previous one ends.
Report progress in-line, not as a question.

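Why it works: Models trained on conversational data have a strong prior to check in between tasks, which kills throughput in autonomous execution. Naming the prohibition explicitly removes the ambiguity. The three explicit exceptions keep the rule from becoming a license for reckless execution.

Negative example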

Work through the plan and check in with the user as needed.

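Why it fails: "As needed" defaults to "check in after every task", because that is what the conversational prior produces. A 12-task plan becomes 12 round trips with the user, which defeats the purpose of autonomous execution.

Pattern 154: Self-Looping Stop Hook (Ralph Loop)

Frequency: a standalone orchestration model in the ralph-loop plugin; generalizable. Related patterns: Hook-Driven Automation, Loop Prevention with Max Iterations

Definition: A Stop event hook that re-enters the same session with a "keep working" prompt, forming an autonomous loop that needs no manual re-invocation. Prompt-side discipline requires that the agent never emit a fake "I'm done" signal to escape the loop.

Positive example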

// .claude/settings.json
{
  "hooks": {
    "Stop": [{
      "hooks": [{
        "type": "command",
        "command": "ralph-loop check-and-resume"
      }]
    }]
  }
}

## Loop Discipline

A Stop hook re-enters this session automatically. To exit the loop, write
exactly:
  DONE: <one-sentence summary of what was completed>
to the file `./ralph-state.md`.

Do NOT:
- Claim "done" to escape the loop while work remains
- Produce a fake DONE marker because you're stuck
- Hide an unresolved task to make the loop terminate

If you're stuck, write:
  BLOCKED: <one-sentence description of what you need>
to `./ralph-state.md`. The loop will exit and surface the block to the user.

Honesty inside the loop is the load-bearing constraint — the loop's value
depends on it.

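Why it works: The Stop hook closes the loop without human intervention. The prompt-side honesty constraint targets the failure mode in which the model tells a convenient lie to get itself out of the loop. Separating DONE from BLOCKED keeps the model from conflating "I finished" with "I gave up".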
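The source does not show what the hook command does internally. As a guess at the decision it has to make, the sketch below reads the state file and distinguishes the two exit markers from the resume case; the function name and the returned labels are hypothetical.

```python
from pathlib import Path


def check_state(path: Path) -> tuple:
    """Return (decision, detail) based on the markers in the state file.

    decision is "exit_done", "exit_blocked", or "resume".
    """
    if not path.exists():
        return "resume", ""
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("DONE:"):
            return "exit_done", line[len("DONE:"):].strip()
        if line.startswith("BLOCKED:"):
            # Exit the loop, but surface the block to the user instead of reporting success.
            return "exit_blocked", line[len("BLOCKED:"):].strip()
    return "resume", ""  # no marker yet: re-enter the session and keep working
```

Keeping the two exit decisions distinct in code mirrors the prompt-side rule: a blocked run leaves the loop, but it is never reported as a completed one.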

Negative example

{ "hooks": { "Stop": [{ "command": "claude --continue" }] } }

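(And no prompt-side discipline.)

Why it fails: Without the honesty constraint, the model declares "I'm done" as soon as it hits a hard problem, and the loop terminates with work unfinished. Without the DONE/BLOCKED distinction, the user cannot tell "finished" from "gave up".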