Pattern: Decision Trees Beat Prose in AI Prompts
When you tell an AI what to do using natural language paragraphs, it works — most of the time. But when the logic has branches (“if X then Y, otherwise Z”), prose prompts produce inconsistent outputs across runs — even when the input is identical.
Decision trees fix this. Replace prose branching logic with a visual tree structure, and the model's outputs become near-deterministic.
The Problem
Consider an incident response system. The AI must triage incidents, assign responders, choose mitigation actions, and produce a structured plan. The rules are complex: 4 severity levels, 6 root cause categories, tier-based responder assignment, escalation conditions, communication deadlines.
We wrote two prompts for the same task — one as prose paragraphs (~140 lines), one as decision trees (~260 lines) — and ran each 20 times with identical input using claude -p.
Results (20 Runs Each)
Primary Mitigation Action
| Prose | Decision Tree |
|---|---|
| 15 unique phrasings out of 20 runs | 1 phrasing out of 20 runs |
Prose outputs looked like this — every run different:
Run 1: "Enable graceful degradation on Cosmos DB (cached responses, reduced functionality) while investigating root cause"
Run 2: "Enable graceful degradation on Cosmos DB (cached responses, reduced functionality for non-critical read paths)"
Run 3: "Enable graceful degradation mode on Cosmos DB East US 2"
Run 4: "Enable graceful degradation on Cosmos DB East US 2 to serve cached/reduced-functionality responses while..."
...15 unique variants total
Decision tree — every run identical:
Run 1-20: "enable_graceful_degradation"
Secondary Action
| Prose | Decision Tree |
|---|---|
| 20 unique phrasings (zero repeats) | 2 variants (19× hotfix, 1× failover_to_secondary) |
Prose achieved 0% reproducibility on secondary actions. Every single run produced a unique sentence.
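The reproducibility numbers above are just a tally of distinct field values across runs. A minimal sketch of that check, assuming each run's output has already been parsed into the nested JSON shape used later in this post (`count_phrasings` and the `tree_runs` fixture are hypothetical, not part of the repo's `analyze.py`):

```python
from collections import Counter

def count_phrasings(runs, field="secondary_action"):
    """Tally distinct phrasings of one output field across parsed runs."""
    return Counter(run["mitigation_plan"][field]["action"] for run in runs)

# Decision-tree output collapses to a small closed set of variants,
# mirroring the 19x hotfix / 1x failover_to_secondary result above:
tree_runs = 19 * [{"mitigation_plan": {"secondary_action": {"action": "hotfix"}}}]
tree_runs += [{"mitigation_plan": {"secondary_action": {"action": "failover_to_secondary"}}}]
print(count_phrasings(tree_runs))
```

For the prose prompt, the same tally returns 20 entries with count 1 each, which is what "zero repeats" means operationally.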
Full Comparison Table
| Dimension | Prose (20 runs) | Decision Tree (20 runs) |
|---|---|---|
| Severity classification | 20/20 SEV2 | 20/20 SEV2 |
| Primary action (unique values) | 15 | 1 |
| Secondary action (unique values) | 20 | 2 |
| Responder role sets | 2 variants (capitalization: on-call vs On-Call) | 1 (consistent) |
| Actions-to-avoid count | inconsistent (16× two items, 4× three items) | consistent (20× two items) |
| Escalation triggers | 3/3 consistent | 3/3 consistent |
What This Means in Practice
Both approaches made the same correct decisions — SEV2, enable graceful degradation, assign 4 responders. The difference is output determinism.
If your downstream code does:
if response["mitigation_plan"]["primary_action"]["action"] == "enable_graceful_degradation":
    execute_graceful_degradation()
- Decision tree prompt: works 20/20 times
- Prose prompt: works 0/20 times (action is a free-form sentence, never matches)
This isn’t a theoretical problem. Any system that parses AI output — automation pipelines, agent orchestration, tool calling — breaks when the output format is unpredictable.
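To make the failure mode concrete, here is a sketch of the kind of downstream dispatcher such pipelines use. Everything in it is hypothetical (the handler names and `HANDLERS` table are illustrative, not from the repo): the point is that exact-string actions hit the table, while free-form prose never does.

```python
def execute_graceful_degradation():
    print("degradation enabled")

def execute_rollback():
    print("rolled back")

# Exact leaf strings from the prompt map directly to handlers.
HANDLERS = {
    "enable_graceful_degradation": execute_graceful_degradation,
    "rollback": execute_rollback,
}

def dispatch(response):
    action = response["mitigation_plan"]["primary_action"]["action"]
    handler = HANDLERS.get(action)
    if handler is None:
        # Prose output ("Enable graceful degradation on Cosmos DB...")
        # lands here every run: no exact match, nothing executes.
        raise ValueError(f"unrecognized action: {action!r}")
    handler()

dispatch({"mitigation_plan": {"primary_action": {"action": "enable_graceful_degradation"}}})
```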
The Pattern
Here’s the core idea. Replace this:
If the root cause is a bad deployment and the service supports instant
rollback, always prefer rollback over other mitigations because it restores
the last known good state with minimal risk. If the root cause is a bad
deployment but rollback is not available, then consider feature flag
disablement if the change is behind a feature flag. If no feature flag
exists, proceed with hotfix.
With this:
Root cause category?
├─ Bad deployment
│ └─ Service supports instant rollback?
│ ├─ YES → action: "rollback"
│ └─ NO
│ └─ Change behind feature flag?
│ ├─ YES → action: "disable_feature_flag"
│ └─ NO → action: "hotfix"
Same logic. The tree version gives the model an exact string to output at each leaf node, and the indentation encodes the decision path spatially.
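One way to see why the leaf strings are parseable: the tree above transcribes directly into branching code, with each leaf yielding an exact string. A sketch (the function name and boolean parameters are illustrative; only the bad-deployment branch from the example is covered):

```python
def choose_action(root_cause, supports_rollback, behind_feature_flag):
    """The example tree, transcribed: each leaf returns an exact string."""
    if root_cause == "bad_deployment":
        if supports_rollback:
            return "rollback"
        if behind_feature_flag:
            return "disable_feature_flag"
        return "hotfix"
    raise NotImplementedError("other root-cause branches omitted")

print(choose_action("bad_deployment", False, True))
```

The prose version of the rules admits many faithful paraphrases; this structure admits exactly one output per path, which is what the model ends up copying.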
Why It Works
- Narrows attention at each step. The model evaluates one condition at a time instead of holding all rules in attention simultaneously.
- Provides exact output text. Leaf nodes contain the literal action string — the model copies it rather than composing a new sentence.
- Encodes hierarchy spatially. Indentation tokens (whitespace) encode parent-child relationships. LLMs learned this pattern from millions of code files, YAML configs, and directory trees during training.
When to Use / When Not To
Use decision trees when:
- Your prompt has branching logic (if/else, switch/case)
- Output is parsed by code (JSON fields, action names, status values)
- You need consistency across multiple runs
- You’re building agent orchestration or automation pipelines
Skip decision trees when:
- The task is open-ended creative work (variation is desirable)
- There’s no branching logic
- Output is read by humans only (phrasing variation doesn’t matter)
Try It Yourself
Clone the repo and run:
cd eval/decision-tree-ab
bash run.sh 20 # 20 runs, default model
bash run.sh 10 haiku # 10 runs, haiku model
python3 analyze.py # analyze results
Three scenarios are included: simple (deployment controller), ambiguous (unknown server status), and complex (incident response with 200+ line prompts).