> ## Documentation Index
> Fetch the complete documentation index at: https://docs.adenhq.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Outcome-Driven Development

> From intent to observed results - a precise framework for building agents that deliver measurable value

## What is Outcome-Driven Development?

Outcome-driven development is a methodology where you define what success looks like, generate tests as hypotheses, and observe real outcomes at every stage. Unlike traditional approaches that focus on implementation details, ODD keeps you anchored to measurable results.

The key insight: **tests produce outcomes, but they are experimental outcomes whose validity is bounded by the test context.**

## The Progression: From Intent to Outcomes

```mermaid theme={null}
flowchart TD
    GOAL["Goal<br/><small>Intent & Direction</small>"] --> CRITERIA["Criteria<br/><small>Success + Constraints</small>"]
    CRITERIA --> TESTS["Tests<br/><small>Hypotheses about outcomes</small>"]
    TESTS --> TEST_OUTCOMES["Test Outcomes<br/><small>Controlled environment</small>"]
    TEST_OUTCOMES --> OPS_OUTCOMES["Operational Outcomes<br/><small>Staging & pilot</small>"]
    OPS_OUTCOMES --> REAL_OUTCOMES["Real-World Outcomes<br/><small>Production & business value</small>"]
    REAL_OUTCOMES -.->|"Feedback loop"| GOAL
```

### 1. Start with a Goal

A goal is **intent, direction, and preference**. No outcomes yet - only what you want to achieve.

```
"Build an agent that resolves Tier-1 support tickets reliably."
```

At this stage, you have expressed direction but haven't defined what success means or observed any results.

### 2. Define Criteria

Criteria translate intent into **what winning would look like**. They set expectations but still aren't outcomes. In our framework, there are two types of criteria:

* **Success Criteria:** Measurable conditions that define success.
* **Constraints:** Conditions that define failure.

#### Success Criteria

Defined in `goal.py`, each `SuccessCriterion` represents a measurable condition:

| Field    | Purpose                                                                   |
| -------- | ------------------------------------------------------------------------- |
| `metric` | How to measure: `output_contains`, `output_equals`, `llm_judge`, `custom` |
| `target` | The expected value or condition                                           |
| `weight` | Relative importance (0-1)                                                 |

```python theme={null}
SuccessCriterion(
    id="sc-kb-grounded",
    description="Answers are grounded in knowledge base content",
    metric="llm_judge",
    target="answers reference KB sources",
    weight=0.3,
)
```

#### Constraints

Also in `goal.py`, `Constraint` defines boundaries the agent must respect:

| Constraint Type | Behavior                      |
| --------------- | ----------------------------- |
| **Hard**        | Violation = immediate failure |
| **Soft**        | Discouraged but allowed       |

| Category  | Example                             |
| --------- | ----------------------------------- |
| `time`    | Response within 30 seconds          |
| `cost`    | Less than \$0.20 per request        |
| `safety`  | No PII in responses                 |
| `scope`   | Only answer questions about billing |
| `quality` | Hallucination rate below 5%         |

Criteria are expectations - measurable thresholds that define success and failure.

### 3. Generate Tests

Tests are **hypotheses about outcomes**. Each test implicitly says:

> "If the system behaves correctly, outcome X should occur under condition Y."

Tests are measurement instruments, not outcomes themselves. They encode your assumptions about what correct behavior looks like.

```python theme={null}
# This is a hypothesis, not yet an outcome
def test_common_question_resolution():
    """Agent should resolve common billing questions without escalation."""
    result = agent.run({"question": "How do I update my payment method?"})
    assert result.resolved == True
    assert result.escalated == False
```

### 4. Run Tests → Observe Outcomes

When you run tests, you finally have **outcomes**. But be precise about what kind:

These are **observed outcomes in a controlled environment**.

| What You Observe | Example                    |
| ---------------- | -------------------------- |
| Pass/fail rates  | 47/50 tests passing        |
| Error modes      | Timeout on complex queries |
| Latency          | P95 at 2.3 seconds         |
| Tool misuse      | Called wrong API 3 times   |
| Hallucinations   | 2 cases from edge prompts  |

**Why these are outcomes:** Something actually happened. The system interacted with inputs. Effects were observed, not imagined.

**Why they're limited:** They don't yet tell you whether users trust the agent, whether it reduces real workload, or whether edge cases in the wild dominate behavior.

## Outcome Structure

Defined in `decision.py`, the `Outcome` class captures everything about an execution result:

```python theme={null}
class Outcome(BaseModel):
    success: bool           # Did it succeed?
    result: Any             # Actual output
    error: str | None       # Error if failed
    state_changes: dict     # Side effects
    tokens_used: int        # LLM tokens consumed
    latency_ms: int         # Execution time
    summary: str            # Human-readable narrative
```

The framework tracks **what should happen** (criteria/constraints) and **what actually happened** (outcomes), then evaluates the gap.

## The Feedback Loop

Outcome-driven development depends on closing the loop:

```mermaid theme={null}
flowchart LR
    subgraph DEFINE["Define"]
        G["Goal"] --> C["Criteria"]
    end

    subgraph MEASURE["Measure"]
        C --> T["Tests"]
        T --> O["Outcomes"]
    end

    subgraph ADAPT["Adapt"]
        O --> U["Update"]
    end

    U -.->|"Refine system"| T
    U -.->|"Adjust criteria"| C
    U -.->|"Revisit goal (Only with Developer Intervention)"| G
```

Outcomes inform updates at every level:

* **Update the system** - Fix bugs, improve prompts, add guardrails
* **Adjust criteria** - Thresholds may be too strict or too lenient
* **Revisit the goal** - Sometimes the original intent was not accurate or not up to date. Developer need to revisit and update the goal accoding to the outcomes. However, the framework should not be able to change the original goal to ensure the goal is consistent and respected.

## The Outcome Taxonomy

All results are outcomes - they differ by **context and validity**, not by kind.

| Stage               | What You Observe                                | Outcome Type         |
| ------------------- | ----------------------------------------------- | -------------------- |
| **Tests**           | Pass/fail, errors, timings                      | Test outcomes        |
| **Staging / Pilot** | Task completion, escalations, user feedback     | Operational outcomes |
| **Production**      | Behavior change, value created, business impact | Real-world outcomes  |

### Test Outcomes

Controlled environment. Known inputs. Repeatable conditions.

* High internal validity
* Limited external validity
* Fast feedback loop
* Low cost to observe

### Operational Outcomes

Real users in limited deployment. Shadow mode or pilot groups.

* Moderate internal validity
* Growing external validity
* Reveals integration issues
* Surface unexpected edge cases

### Real-World Outcomes

Full production. Business metrics. User trust.

* Full external validity
* Highest signal quality
* Slowest feedback loop
* Highest cost to observe

## Example: Support Agent Development

### Phase 1: Define

**Goal:**

```
Build an agent that handles Tier-1 support tickets reliably.
```

**Success Criteria:**

```python theme={null}
[
    SuccessCriterion(
        id="sc-resolution",
        description="Resolves tickets without escalation",
        metric="custom",
        target="resolution_rate >= 0.6",
        weight=0.4,
    ),
    SuccessCriterion(
        id="sc-accuracy",
        description="Responses are factually correct",
        metric="llm_judge",
        target="no hallucinations or misinformation",
        weight=0.3,
    ),
    SuccessCriterion(
        id="sc-escalation",
        description="Escalates appropriately when needed",
        metric="custom",
        target="escalation_accuracy >= 0.9",
        weight=0.3,
    ),
]
```

**Constraints:**

```python theme={null}
[
    Constraint(type="hard", category="cost", value="cost_per_ticket < 0.20"),
    Constraint(type="hard", category="safety", value="no_pii_in_responses"),
    Constraint(type="soft", category="time", value="response_time < 30s"),
]
```

### Phase 2: Test Outcomes

Run automated tests against synthetic tickets. Each test produces an `Outcome`:

```python theme={null}
Outcome(
    success=True,
    result={"answer": "To reset your password, go to Settings > Security..."},
    error=None,
    state_changes={"ticket_status": "resolved"},
    tokens_used=847,
    latency_ms=1230,
    summary="Resolved password reset question correctly"
)
```

**Aggregate results across test suite:**

| Criterion             | Result | Status                       | Judgment |
| --------------------- | ------ | ---------------------------- | -------- |
| Resolution rate       | 52%    | Below 60% target             | `REPLAN` |
| Accuracy (llm\_judge) | 97%    | Passing                      | `ACCEPT` |
| Cost per ticket       | \$0.18 | Below \$0.20 hard constraint | `ACCEPT` |
| Escalation accuracy   | 87%    | Below 90% target             | `RETRY`  |

**Weighted score:** 78% (below 90% threshold)

**Insight:** Resolution and escalation need work. The `HybridJudge` recommends `REPLAN` - adjust prompts, add examples.

### Phase 3: Operational Outcomes

Deploy to 5% of traffic in shadow mode:

| Metric             | Result   | Insight             |
| ------------------ | -------- | ------------------- |
| Resolution rate    | 58%      | Closer to target    |
| User satisfaction  | 3.8/5    | New signal          |
| Edge case failures | 12 types | Unexpected patterns |

**Insight:** Real traffic reveals edge cases tests missed. Update test suite.

### Phase 4: Real-World Outcomes

Full production deployment:

| Metric                 | Result  | Business Impact |
| ---------------------- | ------- | --------------- |
| Resolution rate        | 63%     | Target exceeded |
| Support cost reduction | 34%     | Direct savings  |
| Customer satisfaction  | +12 NPS | Trust building  |

## Best Practices

<CardGroup cols={2}>
  <Card title="Weight Criteria Carefully" icon="balance-scale">
    Higher weights on critical criteria ensure the 90% threshold reflects true success
  </Card>

  <Card title="Use Hard Constraints for Safety" icon="shield">
    Cost limits, PII protection, and scope boundaries should be hard constraints
  </Card>

  <Card title="Let llm_judge Handle Nuance" icon="brain">
    Use `llm_judge` metric for quality criteria that can't be measured programmatically
  </Card>

  <Card title="Trust the Judgment Actions" icon="gavel">
    ACCEPT, RETRY, REPLAN, ESCALATE - let the HybridJudge guide the response
  </Card>
</CardGroup>

## Next Steps

<CardGroup cols={2}>
  <Card title="Goals and Outcomes" icon="bullseye" href="/building-agent/concepts/goals-outcomes">
    See the canonical goal model used by Hive runtime and evaluation
  </Card>

  <Card title="Evolution" icon="arrows-rotate" href="/building-agent/concepts/evolution">
    Understand how outcome gaps drive iterative agent improvement
  </Card>

  <Card title="Agent Architecture" icon="sitemap" href="/building-agent/concepts/agent-architecture">
    Learn about nodes, edges, and the agent graph structure
  </Card>

  <Card title="Build Your First Agent" icon="hammer" href="/building/first-agent">
    Create your first outcome-driven agent
  </Card>
</CardGroup>
