
What is Outcome-Driven Development?

Outcome-driven development (ODD) is a methodology where you define what success looks like, generate tests as hypotheses, and observe real outcomes at every stage. Unlike traditional approaches that focus on implementation details, ODD keeps you anchored to measurable results. The key insight: tests produce outcomes, but they are experimental outcomes whose validity is bounded by the test context.

The Progression: From Intent to Outcomes

1. Start with a Goal

A goal is intent, direction, and preference. No outcomes yet - only what you want to achieve.
"Build an agent that resolves Tier-1 support tickets reliably."
At this stage, you have expressed direction but haven’t defined what success means or observed any results.

2. Define Criteria

Criteria translate intent into what winning would look like. They set expectations but still aren’t outcomes. In our framework, there are two types of criteria:
  • Success Criteria: Measurable conditions that define success.
  • Constraints: Conditions that define failure.

Success Criteria

Defined in goal.py, each SuccessCriterion represents a measurable condition:
Field     Purpose
metric    How to measure: output_contains, output_equals, llm_judge, custom
target    The expected value or condition
weight    Relative importance (0-1)
SuccessCriterion(
    id="sc-kb-grounded",
    description="Answers are grounded in knowledge base content",
    metric="llm_judge",
    target="answers reference KB sources",
    weight=0.3,
)

Constraints

Also in goal.py, Constraint defines boundaries the agent must respect:
Constraint Type   Behavior
Hard              Violation = immediate failure
Soft              Discouraged but allowed

Category   Example
time       Response within 30 seconds
cost       Less than $0.20 per request
safety     No PII in responses
scope      Only answer questions about billing
quality    Hallucination rate below 5%
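For instance, a hard safety constraint and a soft latency constraint (using the same fields that appear in the full example later on this page) might look like:
Constraint(type="hard", category="safety", value="no_pii_in_responses")
Constraint(type="soft", category="time", value="response_time < 30s")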
Criteria are expectations - measurable thresholds that define success and failure.

3. Generate Tests

Tests are hypotheses about outcomes. Each test implicitly says:
“If the system behaves correctly, outcome X should occur under condition Y.”
Tests are measurement instruments, not outcomes themselves. They encode your assumptions about what correct behavior looks like.
# This is a hypothesis, not yet an outcome
def test_common_question_resolution():
    """Agent should resolve common billing questions without escalation."""
    result = agent.run({"question": "How do I update my payment method?"})
    assert result.resolved
    assert not result.escalated
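A second hypothesis can target escalation behavior, reusing the same agent fixture as the test above (the out-of-scope question below is purely illustrative):
# Also a hypothesis: out-of-scope questions should be escalated, not answered
def test_out_of_scope_escalation():
    """Agent should escalate questions outside the billing scope."""
    result = agent.run({"question": "Can you review this legal contract for me?"})
    assert result.escalated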

4. Run Tests → Observe Outcomes

When you run tests, you finally have outcomes. But be precise about what kind: these are observed outcomes in a controlled environment.
What You Observe   Example
Pass/fail rates    47/50 tests passing
Error modes        Timeout on complex queries
Latency            P95 at 2.3 seconds
Tool misuse        Called wrong API 3 times
Hallucinations     2 cases from edge prompts
Why these are outcomes: Something actually happened. The system interacted with inputs. Effects were observed, not imagined.
Why they’re limited: They don’t yet tell you whether users trust the agent, whether it reduces real workload, or whether edge cases in the wild dominate behavior.

Outcome Structure

Defined in decision.py, the Outcome class captures everything about an execution result:
class Outcome(BaseModel):
    success: bool           # Did it succeed?
    result: Any             # Actual output
    error: str | None       # Error if failed
    state_changes: dict     # Side effects
    tokens_used: int        # LLM tokens consumed
    latency_ms: int         # Execution time
    summary: str            # Human-readable narrative
The framework tracks what should happen (criteria/constraints) and what actually happened (outcomes), then evaluates the gap.
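As a rough sketch of that gap evaluation (illustrative only: meets_target is a hypothetical helper, not the framework's judge, and it only understands simple "name >= value" targets):
# Illustrative only: compare observed values against simple "name >= value" targets
def meets_target(target: str, observed: float) -> bool:
    """Hypothetical helper that parses targets like 'resolution_rate >= 0.6'."""
    _name, _op, threshold = target.split()
    return observed >= float(threshold)

expected = {"resolution_rate >= 0.6": 0.52, "escalation_accuracy >= 0.9": 0.87}
gaps = {t: v for t, v in expected.items() if not meets_target(t, v)}
# gaps now holds the criteria whose observed outcomes fell short of expectations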

The Feedback Loop

Outcome-driven development depends on closing the loop. Outcomes inform updates at every level, as sketched after this list:
  • Update the system - Fix bugs, improve prompts, add guardrails
  • Adjust criteria - Thresholds may be too strict or too lenient
  • Revisit the goal - Sometimes the original intent was inaccurate or out of date. The developer should revisit and update the goal according to the outcomes; the framework itself, however, must never change the original goal, so that the goal stays consistent and respected.
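A schematic of that loop, with hypothetical helper names (run_suite, judge, improve) standing in for your own test runner and judge - a sketch of the process, not the framework's API:
# Process sketch only: run_suite, judge, and improve are hypothetical stand-ins
def development_loop(agent, criteria, max_iterations=5):
    for _ in range(max_iterations):
        outcomes = run_suite(agent)           # observe test outcomes
        action = judge(outcomes, criteria)    # e.g. ACCEPT / RETRY / REPLAN / ESCALATE
        if action == "ACCEPT":
            return agent                      # criteria met; promote to staging
        if action in ("RETRY", "REPLAN"):
            agent = improve(agent, outcomes)  # fix bugs, prompts, add guardrails
        else:
            break                             # ESCALATE: a human revisits criteria or the goal
    return agent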

The Outcome Taxonomy

All results are outcomes - they differ by context and validity, not by kind.
Stage             What You Observe                                   Outcome Type
Tests             Pass/fail, errors, timings                         Test outcomes
Staging / Pilot   Task completion, escalations, user feedback        Operational outcomes
Production        Behavior change, value created, business impact    Real-world outcomes

Test Outcomes

Controlled environment. Known inputs. Repeatable conditions.
  • High internal validity
  • Limited external validity
  • Fast feedback loop
  • Low cost to observe

Operational Outcomes

Real users in limited deployment. Shadow mode or pilot groups.
  • Moderate internal validity
  • Growing external validity
  • Reveals integration issues
  • Surfaces unexpected edge cases

Real-World Outcomes

Full production. Business metrics. User trust.
  • Full external validity
  • Highest signal quality
  • Slowest feedback loop
  • Highest cost to observe

Example: Support Agent Development

Phase 1: Define

Goal:
Build an agent that handles Tier-1 support tickets reliably.
Success Criteria:
[
    SuccessCriterion(
        id="sc-resolution",
        description="Resolves tickets without escalation",
        metric="custom",
        target="resolution_rate >= 0.6",
        weight=0.4,
    ),
    SuccessCriterion(
        id="sc-accuracy",
        description="Responses are factually correct",
        metric="llm_judge",
        target="no hallucinations or misinformation",
        weight=0.3,
    ),
    SuccessCriterion(
        id="sc-escalation",
        description="Escalates appropriately when needed",
        metric="custom",
        target="escalation_accuracy >= 0.9",
        weight=0.3,
    ),
]
Constraints:
[
    Constraint(type="hard", category="cost", value="cost_per_ticket < 0.20"),
    Constraint(type="hard", category="safety", value="no_pii_in_responses"),
    Constraint(type="soft", category="time", value="response_time < 30s"),
]

Phase 2: Test Outcomes

Run automated tests against synthetic tickets. Each test produces an Outcome:
Outcome(
    success=True,
    result={"answer": "To reset your password, go to Settings > Security..."},
    error=None,
    state_changes={"ticket_status": "resolved"},
    tokens_used=847,
    latency_ms=1230,
    summary="Resolved password reset question correctly"
)
Aggregate results across the test suite:
Criterion              Result   Status                        Judgment
Resolution rate        52%      Below 60% target              REPLAN
Accuracy (llm_judge)   97%      Passing                       ACCEPT
Cost per ticket        $0.18    Below $0.20 hard constraint   ACCEPT
Escalation accuracy    87%      Below 90% target              RETRY
Weighted score: 78% (below the 90% threshold)
Insight: Resolution and escalation need work. The HybridJudge recommends REPLAN - adjust prompts and add examples.

Phase 3: Operational Outcomes

Deploy to 5% of traffic in shadow mode:
Metric               Result     Insight
Resolution rate      58%        Closer to target
User satisfaction    3.8/5      New signal
Edge case failures   12 types   Unexpected patterns
Insight: Real traffic reveals edge cases the tests missed. Update the test suite.

Phase 4: Real-World Outcomes

Full production deployment:
Metric                   Result    Business Impact
Resolution rate          63%       Target exceeded
Support cost reduction   34%       Direct savings
Customer satisfaction    +12 NPS   Trust building

Best Practices

Weight Criteria Carefully

Higher weights on critical criteria ensure the 90% threshold reflects true success
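For example, if factual accuracy matters more to true success than raw resolution rate, you might shift weight toward it (the weight below is illustrative, not a recommendation):
SuccessCriterion(
    id="sc-accuracy",
    description="Responses are factually correct",
    metric="llm_judge",
    target="no hallucinations or misinformation",
    weight=0.5,  # illustrative: weighted more heavily than resolution rate
)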

Use Hard Constraints for Safety

Cost limits, PII protection, and scope boundaries should be hard constraints

Let llm_judge Handle Nuance

Use llm_judge metric for quality criteria that can’t be measured programmatically

Trust the Judgment Actions

ACCEPT, RETRY, REPLAN, ESCALATE - let the HybridJudge guide the response

Next Steps