What is Outcome-Driven Development?
Outcome-driven development is a methodology where you define what success looks like, generate tests as hypotheses, and observe real outcomes at every stage. Unlike traditional approaches that focus on implementation details, ODD keeps you anchored to measurable results. The key insight: tests produce outcomes, but they are experimental outcomes whose validity is bounded by the test context.

The Progression: From Intent to Outcomes
1. Start with a Goal
A goal is intent, direction, and preference. No outcomes yet - only what you want to achieve.

2. Define Criteria
Criteria translate intent into what winning would look like. They set expectations but still aren’t outcomes. In our framework, there are two types of criteria:

- Success Criteria: Measurable conditions that define success.
- Constraints: Conditions that define failure.
Success Criteria
Defined in goal.py, each SuccessCriterion represents a measurable condition:
| Field | Purpose |
|---|---|
| metric | How to measure: output_contains, output_equals, llm_judge, custom |
| target | The expected value or condition |
| weight | Relative importance (0-1) |
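For example, a criterion that checks the response mentions the refund policy might look like the sketch below (the keyword arguments mirror the field table above and are assumptions, not the exact goal.py constructor signature):

```python
# Hypothetical sketch - argument names mirror the field table above;
# the actual SuccessCriterion constructor in goal.py may differ.
from goal import SuccessCriterion  # assumed import path

mentions_refund_policy = SuccessCriterion(
    metric="output_contains",   # how to measure
    target="refund policy",     # expected substring in the output
    weight=0.3,                 # relative importance (0-1)
)
```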
Constraints
Also in goal.py, Constraint defines boundaries the agent must respect:
| Constraint Type | Behavior |
|---|---|
| Hard | Violation = immediate failure |
| Soft | Discouraged but allowed |
Common constraint categories:

| Category | Example |
|---|---|
| time | Response within 30 seconds |
| cost | Less than $0.20 per request |
| safety | No PII in responses |
| scope | Only answer questions about billing |
| quality | Hallucination rate below 5% |
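A hard cost constraint and a soft time constraint from the table above might be declared like this (a sketch; the category, description, and hard fields are assumptions based on the tables, not the verified goal.py API):

```python
# Hypothetical sketch - field names are inferred from the tables above;
# check goal.py for the real Constraint definition.
from goal import Constraint  # assumed import path

cost_limit = Constraint(
    category="cost",
    description="Less than $0.20 per request",
    hard=True,    # hard: violation = immediate failure
)

fast_response = Constraint(
    category="time",
    description="Response within 30 seconds",
    hard=False,   # soft: discouraged but allowed
)
```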
3. Generate Tests
Tests are hypotheses about outcomes. Each test implicitly says: “If the system behaves correctly, outcome X should occur under condition Y.” Tests are measurement instruments, not outcomes themselves. They encode your assumptions about what correct behavior looks like.
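A single hypothesis can be encoded as a plain test function, as in this sketch (run_agent is a hypothetical entry point used only for illustration, not part of the framework):

```python
# Hypothesis: "If asked about a duplicate charge (condition Y),
# the response should mention the refund policy (outcome X)."
# run_agent is a placeholder for however the system under test is invoked.
def test_duplicate_charge_mentions_refund_policy():
    response = run_agent("I was charged twice for my subscription this month.")
    assert "refund" in response.lower()
```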
4. Run Tests → Observe Outcomes
When you run tests, you finally have outcomes. But be precise about what kind: these are observed outcomes in a controlled environment.

| What You Observe | Example |
|---|---|
| Pass/fail rates | 47/50 tests passing |
| Error modes | Timeout on complex queries |
| Latency | P95 at 2.3 seconds |
| Tool misuse | Called wrong API 3 times |
| Hallucinations | 2 cases from edge prompts |
Outcome Structure
Defined in decision.py, the Outcome class captures everything about an execution result:
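A rough sketch of the kind of structure this implies (field names here are inferred from the observations and tables above, not the actual decision.py definition):

```python
# Hypothetical sketch - not the real decision.py class.
# Fields reflect what the docs say an outcome records: per-criterion
# results, errors, latency, cost, and the resulting judgment.
from dataclasses import dataclass, field

@dataclass
class Outcome:
    criterion_results: dict[str, float] = field(default_factory=dict)  # e.g. {"resolution_rate": 0.52}
    passed: bool = False                                # did the run clear the success threshold?
    errors: list[str] = field(default_factory=list)     # e.g. ["timeout on complex query"]
    latency_seconds: float = 0.0                        # e.g. observed P95 latency
    cost_usd: float = 0.0                               # e.g. cost per request
    judgment: str = "RETRY"                             # ACCEPT, RETRY, REPLAN, or ESCALATE
```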
The Feedback Loop
Outcome-driven development depends on closing the loop: outcomes inform updates at every level (a sketch of the loop follows this list):

- Update the system - Fix bugs, improve prompts, add guardrails
- Adjust criteria - Thresholds may be too strict or too lenient
- Revisit the goal - Sometimes the original intent was inaccurate or out of date. Developers need to revisit and update the goal according to the outcomes. However, the framework itself must not be able to change the original goal, so that the goal stays consistent and respected.
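A minimal sketch of what closing the loop can look like in code (run_tests, judge, and update_system are placeholder callables supplied by the developer, not framework APIs; note that the loop never rewrites the goal itself):

```python
# Hypothetical sketch of the feedback loop - all callables are placeholders.
# The framework judges outcomes; only a human revises the goal.
def development_loop(goal, system, run_tests, judge, update_system, max_iterations=5):
    for _ in range(max_iterations):
        outcome = run_tests(system, goal)   # observe test outcomes
        action = judge(outcome, goal)       # ACCEPT / RETRY / REPLAN / ESCALATE
        if action == "ACCEPT":
            return system                   # criteria met within constraints
        if action == "REPLAN":
            system = update_system(system, outcome)  # fix bugs, improve prompts, add guardrails
        elif action == "ESCALATE":
            break                           # hand off to a human, who may revisit the goal
        # RETRY: fall through and run the tests again
    return system
```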
The Outcome Taxonomy
All results are outcomes - they differ by context and validity, not by kind.

| Stage | What You Observe | Outcome Type |
|---|---|---|
| Tests | Pass/fail, errors, timings | Test outcomes |
| Staging / Pilot | Task completion, escalations, user feedback | Operational outcomes |
| Production | Behavior change, value created, business impact | Real-world outcomes |
Test Outcomes
Controlled environment. Known inputs. Repeatable conditions.

- High internal validity
- Limited external validity
- Fast feedback loop
- Low cost to observe
Operational Outcomes
Real users in limited deployment. Shadow mode or pilot groups.

- Moderate internal validity
- Growing external validity
- Reveals integration issues
- Surfaces unexpected edge cases
Real-World Outcomes
Full production. Business metrics. User trust.

- Full external validity
- Highest signal quality
- Slowest feedback loop
- Highest cost to observe
Example: Support Agent Development
Phase 1: Define
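Goal: a support agent that resolves tickets, answers accurately, stays under a per-ticket cost limit, and escalates the right cases. A minimal sketch of how the criteria behind the Phase 2 table might be declared (constructor arguments follow the field tables above and are assumptions, not the exact goal.py API):

```python
# Hypothetical sketch - argument names follow the field tables above;
# the real goal.py constructors may differ.
from goal import SuccessCriterion, Constraint  # assumed import path

criteria = [
    SuccessCriterion(metric="custom",    target="resolution_rate >= 0.60",        weight=0.4),
    SuccessCriterion(metric="llm_judge", target="answers are factually accurate", weight=0.3),
    SuccessCriterion(metric="custom",    target="escalation_accuracy >= 0.90",    weight=0.3),
]

constraints = [
    Constraint(category="cost", description="Less than $0.20 per ticket", hard=True),
]
```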
Phase 2: Test Outcomes
Run automated tests against synthetic tickets. Each test produces an Outcome:
| Criterion | Result | Status | Judgment |
|---|---|---|---|
| Resolution rate | 52% | Below 60% target | REPLAN |
| Accuracy (llm_judge) | 97% | Passing | ACCEPT |
| Cost per ticket | $0.18 | Below $0.20 hard constraint | ACCEPT |
| Escalation accuracy | 87% | Below 90% target | RETRY |
HybridJudge recommends REPLAN - adjust prompts, add examples.
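How that judgment might be obtained, as a sketch (the evaluate call and the decision fields are assumptions about the HybridJudge interface, shown only to illustrate the flow):

```python
# Hypothetical sketch - method and field names are assumptions, not the
# verified HybridJudge interface.
judge = HybridJudge(goal)             # goal carries the criteria and constraints
decision = judge.evaluate(outcome)    # outcome produced by the test run above

print(decision.action)  # e.g. "REPLAN"
print(decision.reason)  # e.g. "resolution rate 52% is below the 60% target"
```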
Phase 3: Operational Outcomes
Deploy to 5% of traffic in shadow mode:

| Metric | Result | Insight |
|---|---|---|
| Resolution rate | 58% | Closer to target |
| User satisfaction | 3.8/5 | New signal |
| Edge case failures | 12 types | Unexpected patterns |
Phase 4: Real-World Outcomes
Full production deployment:

| Metric | Result | Business Impact |
|---|---|---|
| Resolution rate | 63% | Target exceeded |
| Support cost reduction | 34% | Direct savings |
| Customer satisfaction | +12 NPS | Trust building |
Best Practices
Weight Criteria Carefully
Higher weights on critical criteria ensure the 90% threshold reflects true success
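One way to see why weights matter, assuming the overall score is a weighted average of per-criterion scores (the aggregation formula here is an assumption, not necessarily how the framework scores outcomes):

```python
# Hypothetical sketch - illustrates weighting, not the framework's scoring code.
def weighted_score(criteria):
    """criteria: list of (weight, score) pairs with scores in [0, 1]."""
    total = sum(w for w, _ in criteria)
    return sum(w * s for w, s in criteria) / total

# Accuracy is failing badly (0.5); tone and formatting are perfect (1.0).
low_weight  = [(0.10, 0.5), (0.45, 1.0), (0.45, 1.0)]   # accuracy under-weighted
high_weight = [(0.60, 0.5), (0.20, 1.0), (0.20, 1.0)]   # accuracy weighted as critical

print(weighted_score(low_weight))   # 0.95 -> clears a 90% threshold despite poor accuracy
print(weighted_score(high_weight))  # 0.70 -> correctly falls below the threshold
```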
Use Hard Constraints for Safety
Cost limits, PII protection, and scope boundaries should be hard constraints
Let llm_judge Handle Nuance
Use the llm_judge metric for quality criteria that can’t be measured programmatically

Trust the Judgment Actions
ACCEPT, RETRY, REPLAN, ESCALATE - let the HybridJudge guide the response