
What is Outcome-Driven Development?

Outcome-driven development (ODD) is a methodology where you define what success looks like, generate tests as hypotheses, and observe real outcomes at every stage. Unlike traditional approaches that focus on implementation details, ODD keeps you anchored to measurable results. The key insight: tests produce outcomes, but they are experimental outcomes whose validity is bounded by the test context.

The Progression: From Intent to Outcomes

1. Start with a Goal

A goal is intent, direction, and preference. No outcomes yet - only what you want to achieve.
"Build an agent that resolves Tier-1 support tickets reliably."
At this stage, you have expressed direction but haven’t defined what success means or observed any results.

2. Define Criteria

Criteria translate intent into what winning would look like. They set expectations but still aren’t outcomes. In our framework, there are two types of criteria:
  • Success Criteria: Measurable conditions that define success.
  • Constraints: Conditions that define failure.

Success Criteria

Defined in goal.py, each SuccessCriterion represents a measurable condition:
Field     Purpose
metric    How to measure: output_contains, output_equals, llm_judge, custom
target    The expected value or condition
weight    Relative importance (0-1)
SuccessCriterion(
    id="sc-kb-grounded",
    description="Answers are grounded in knowledge base content",
    metric="llm_judge",
    target="answers reference KB sources",
    weight=0.3,
)

Constraints

Also in goal.py, Constraint defines boundaries the agent must respect:
Constraint Type   Behavior
Hard              Violation = immediate failure
Soft              Discouraged but allowed

Category   Example
time       Response within 30 seconds
cost       Less than $0.20 per request
safety     No PII in responses
scope      Only answer questions about billing
quality    Hallucination rate below 5%
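For instance, a hard safety constraint and a soft latency constraint (using the same fields that appear in the full example later on this page) might look like:
Constraint(type="hard", category="safety", value="no_pii_in_responses")
Constraint(type="soft", category="time", value="response_time < 30s")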
Criteria are expectations - measurable thresholds that define success and failure.

3. Generate Tests

Tests are hypotheses about outcomes. Each test implicitly says:
“If the system behaves correctly, outcome X should occur under condition Y.”
Tests are measurement instruments, not outcomes themselves. They encode your assumptions about what correct behavior looks like.
# This is a hypothesis, not yet an outcome
def test_common_question_resolution():
    """Agent should resolve common billing questions without escalation."""
    result = agent.run({"question": "How do I update my payment method?"})
    assert result.resolved
    assert not result.escalated
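A second hypothesis can target escalation behavior, reusing the same agent fixture as the test above (the out-of-scope question below is purely illustrative):
# Also a hypothesis: out-of-scope questions should be escalated, not answered
def test_out_of_scope_escalation():
    """Agent should escalate questions outside the billing scope."""
    result = agent.run({"question": "Can you review this legal contract for me?"})
    assert result.escalated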

4. Run Tests → Observe Outcomes

When you run tests, you finally have outcomes. But be precise about what kind: these are observed outcomes in a controlled environment.
What You Observe   Example
Pass/fail rates    47/50 tests passing
Error modes        Timeout on complex queries
Latency            P95 at 2.3 seconds
Tool misuse        Called wrong API 3 times
Hallucinations     2 cases from edge prompts
Why these are outcomes: Something actually happened. The system interacted with inputs. Effects were observed, not imagined.
Why they’re limited: They don’t yet tell you whether users trust the agent, whether it reduces real workload, or whether edge cases in the wild dominate behavior.

Outcome Structure

Defined in decision.py, the Outcome class captures everything about an execution result:
class Outcome(BaseModel):
    success: bool           # Did it succeed?
    result: Any             # Actual output
    error: str | None       # Error if failed
    state_changes: dict     # Side effects
    tokens_used: int        # LLM tokens consumed
    latency_ms: int         # Execution time
    summary: str            # Human-readable narrative
The framework tracks what should happen (criteria/constraints) and what actually happened (outcomes), then evaluates the gap.
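As a rough sketch of that gap evaluation (illustrative only: meets_target is a hypothetical helper, not the framework's judge, and it only understands simple "name >= value" targets):
# Illustrative only: compare observed values against simple "name >= value" targets
def meets_target(target: str, observed: float) -> bool:
    """Hypothetical helper that parses targets like 'resolution_rate >= 0.6'."""
    _name, _op, threshold = target.split()
    return observed >= float(threshold)

expected = {"resolution_rate >= 0.6": 0.52, "escalation_accuracy >= 0.9": 0.87}
gaps = {t: v for t, v in expected.items() if not meets_target(t, v)}
# gaps now holds the criteria whose observed outcomes fell short of expectations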

The Feedback Loop

Outcome-driven development depends on closing the loop. Outcomes inform updates at every level, as sketched after this list:
  • Update the system - Fix bugs, improve prompts, add guardrails
  • Adjust criteria - Thresholds may be too strict or too lenient
  • Revisit the goal - Sometimes the original intent was inaccurate or out of date. The developer should revisit and update the goal according to the outcomes; the framework itself, however, must never change the original goal, so that the goal stays consistent and respected.
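A schematic of that loop, with hypothetical helper names (run_suite, judge, improve) standing in for your own test runner and judge - a sketch of the process, not the framework's API:
# Process sketch only: run_suite, judge, and improve are hypothetical stand-ins
def development_loop(agent, criteria, max_iterations=5):
    for _ in range(max_iterations):
        outcomes = run_suite(agent)           # observe test outcomes
        action = judge(outcomes, criteria)    # e.g. ACCEPT / RETRY / REPLAN / ESCALATE
        if action == "ACCEPT":
            return agent                      # criteria met; promote to staging
        if action in ("RETRY", "REPLAN"):
            agent = improve(agent, outcomes)  # fix bugs, prompts, add guardrails
        else:
            break                             # ESCALATE: a human revisits criteria or the goal
    return agent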

The Outcome Taxonomy

All results are outcomes - they differ by context and validity, not by kind.
Stage             What You Observe                                   Outcome Type
Tests             Pass/fail, errors, timings                         Test outcomes
Staging / Pilot   Task completion, escalations, user feedback        Operational outcomes
Production        Behavior change, value created, business impact    Real-world outcomes

Test Outcomes

Controlled environment. Known inputs. Repeatable conditions.
  • High internal validity
  • Limited external validity
  • Fast feedback loop
  • Low cost to observe

Operational Outcomes

Real users in limited deployment. Shadow mode or pilot groups.
  • Moderate internal validity
  • Growing external validity
  • Reveals integration issues
  • Surfaces unexpected edge cases

Real-World Outcomes

Full production. Business metrics. User trust.
  • Full external validity
  • Highest signal quality
  • Slowest feedback loop
  • Highest cost to observe

Example: Support Agent Development

Phase 1: Define

Goal:
Build an agent that handles Tier-1 support tickets reliably.
Success Criteria:
[
    SuccessCriterion(
        id="sc-resolution",
        description="Resolves tickets without escalation",
        metric="custom",
        target="resolution_rate >= 0.6",
        weight=0.4,
    ),
    SuccessCriterion(
        id="sc-accuracy",
        description="Responses are factually correct",
        metric="llm_judge",
        target="no hallucinations or misinformation",
        weight=0.3,
    ),
    SuccessCriterion(
        id="sc-escalation",
        description="Escalates appropriately when needed",
        metric="custom",
        target="escalation_accuracy >= 0.9",
        weight=0.3,
    ),
]
Constraints:
[
    Constraint(type="hard", category="cost", value="cost_per_ticket < 0.20"),
    Constraint(type="hard", category="safety", value="no_pii_in_responses"),
    Constraint(type="soft", category="time", value="response_time < 30s"),
]

Phase 2: Test Outcomes

Run automated tests against synthetic tickets. Each test produces an Outcome:
Outcome(
    success=True,
    result={"answer": "To reset your password, go to Settings > Security..."},
    error=None,
    state_changes={"ticket_status": "resolved"},
    tokens_used=847,
    latency_ms=1230,
    summary="Resolved password reset question correctly"
)
Aggregate results across the test suite:
Criterion              Result   Status                        Judgment
Resolution rate        52%      Below 60% target              REPLAN
Accuracy (llm_judge)   97%      Passing                       ACCEPT
Cost per ticket        $0.18    Below $0.20 hard constraint   ACCEPT
Escalation accuracy    87%      Below 90% target              RETRY
Weighted score: 78% (below the 90% threshold)
Insight: Resolution and escalation need work. The HybridJudge recommends REPLAN - adjust prompts and add examples.

Phase 3: Operational Outcomes

Deploy to 5% of traffic in shadow mode:
Metric               Result     Insight
Resolution rate      58%        Closer to target
User satisfaction    3.8/5      New signal
Edge case failures   12 types   Unexpected patterns
Insight: Real traffic reveals edge cases the tests missed. Update the test suite.

Phase 4: Real-World Outcomes

Full production deployment:
Metric                   Result    Business Impact
Resolution rate          63%       Target exceeded
Support cost reduction   34%       Direct savings
Customer satisfaction    +12 NPS   Trust building

Best Practices

Weight Criteria Carefully

Higher weights on critical criteria ensure the 90% threshold reflects true success
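For example, if factual accuracy matters more to true success than raw resolution rate, you might shift weight toward it (the weight below is illustrative, not a recommendation):
SuccessCriterion(
    id="sc-accuracy",
    description="Responses are factually correct",
    metric="llm_judge",
    target="no hallucinations or misinformation",
    weight=0.5,  # illustrative: weighted more heavily than resolution rate
)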

Use Hard Constraints for Safety

Cost limits, PII protection, and scope boundaries should be hard constraints

Let llm_judge Handle Nuance

Use llm_judge metric for quality criteria that can’t be measured programmatically

Trust the Judgment Actions

ACCEPT, RETRY, REPLAN, ESCALATE - let the HybridJudge guide the response

Next Steps