# Outcome-Driven Development

> From intent to observed results - a precise framework for building agents that deliver measurable value

## What is Outcome-Driven Development?

Outcome-driven development (ODD) is a methodology where you define what success looks like, generate tests as hypotheses, and observe real outcomes at every stage. Unlike traditional approaches that focus on implementation details, ODD keeps you anchored to measurable results.

The key insight: **tests produce outcomes, but they are experimental outcomes whose validity is bounded by the test context.**

## The Progression: From Intent to Outcomes

```mermaid theme={null}
flowchart TD
    GOAL["Goal<br/>Intent & Direction"] --> CRITERIA["Criteria<br/>Success + Constraints"]
    CRITERIA --> TESTS["Tests<br/>Hypotheses about outcomes"]
    TESTS --> TEST_OUTCOMES["Test Outcomes<br/>Controlled environment"]
    TEST_OUTCOMES --> OPS_OUTCOMES["Operational Outcomes<br/>Staging & pilot"]
    OPS_OUTCOMES --> REAL_OUTCOMES["Real-World Outcomes<br/>Production & business value"]
    REAL_OUTCOMES -.->|"Feedback loop"| GOAL
```

### 1. Start with a Goal

A goal is **intent, direction, and preference**. No outcomes yet - only what you want to achieve.

```
"Build an agent that resolves Tier-1 support tickets reliably."
```

At this stage, you have expressed direction but haven't defined what success means or observed any results.

### 2. Define Criteria

Criteria translate intent into **what winning would look like**. They set expectations but still aren't outcomes. In our framework, there are two types of criteria:

* **Success Criteria:** Measurable conditions that define success.
* **Constraints:** Boundaries the agent must respect; violating a hard constraint means failure.

#### Success Criteria

Defined in `goal.py`, each `SuccessCriterion` represents a measurable condition:

| Field    | Purpose                                                                   |
| -------- | ------------------------------------------------------------------------- |
| `metric` | How to measure: `output_contains`, `output_equals`, `llm_judge`, `custom` |
| `target` | The expected value or condition                                           |
| `weight` | Relative importance (0-1)                                                 |

```python theme={null}
SuccessCriterion(
    id="sc-kb-grounded",
    description="Answers are grounded in knowledge base content",
    metric="llm_judge",
    target="answers reference KB sources",
    weight=0.3,
)
```

#### Constraints

Also in `goal.py`, `Constraint` defines boundaries the agent must respect:

| Constraint Type | Behavior                      |
| --------------- | ----------------------------- |
| **Hard**        | Violation = immediate failure |
| **Soft**        | Discouraged but allowed       |

| Category  | Example                             |
| --------- | ----------------------------------- |
| `time`    | Response within 30 seconds          |
| `cost`    | Less than \$0.20 per request        |
| `safety`  | No PII in responses                 |
| `scope`   | Only answer questions about billing |
| `quality` | Hallucination rate below 5%         |

Criteria are expectations - measurable thresholds that define success and failure.

### 3. Generate Tests

Tests are **hypotheses about outcomes**.
Each test implicitly says:

> "If the system behaves correctly, outcome X should occur under condition Y."

Tests are measurement instruments, not outcomes themselves. They encode your assumptions about what correct behavior looks like.

```python theme={null}
# This is a hypothesis, not yet an outcome.
# `agent` is assumed to be the agent under test, constructed elsewhere.
def test_common_question_resolution():
    """Agent should resolve common billing questions without escalation."""
    result = agent.run({"question": "How do I update my payment method?"})
    assert result.resolved
    assert not result.escalated
```

### 4. Run Tests → Observe Outcomes

When you run tests, you finally have **outcomes**. But be precise about what kind: these are **observed outcomes in a controlled environment**.

| What You Observe | Example                    |
| ---------------- | -------------------------- |
| Pass/fail rates  | 47/50 tests passing        |
| Error modes      | Timeout on complex queries |
| Latency          | P95 at 2.3 seconds         |
| Tool misuse      | Called wrong API 3 times   |
| Hallucinations   | 2 cases from edge prompts  |

**Why these are outcomes:** Something actually happened. The system interacted with inputs. Effects were observed, not imagined.

**Why they're limited:** They don't yet tell you whether users trust the agent, whether it reduces real workload, or whether edge cases in the wild dominate behavior.

## Outcome Structure

Defined in `decision.py`, the `Outcome` class captures the details of an execution result:

```python theme={null}
class Outcome(BaseModel):
    success: bool        # Did it succeed?
    result: Any          # Actual output
    error: str | None    # Error if failed
    state_changes: dict  # Side effects
    tokens_used: int     # LLM tokens consumed
    latency_ms: int      # Execution time
    summary: str         # Human-readable narrative
```

The framework tracks **what should happen** (criteria/constraints) and **what actually happened** (outcomes), then evaluates the gap.
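To make "evaluates the gap" concrete, here is a minimal sketch of comparing criteria against a single outcome. It is not the framework's implementation: `CriterionSketch`, `OutcomeSketch`, `criterion_met`, and `find_gaps` are hypothetical names invented for illustration, and only the `output_contains` and `output_equals` metrics are handled, since `llm_judge` and `custom` would need an external evaluator.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CriterionSketch:
    """Hypothetical stand-in for SuccessCriterion (not the framework class)."""
    id: str
    metric: str
    target: str
    weight: float


@dataclass
class OutcomeSketch:
    """Hypothetical stand-in carrying two of the Outcome fields."""
    success: bool
    result: Any


def criterion_met(criterion: CriterionSketch, outcome: OutcomeSketch) -> Optional[bool]:
    """Return True/False for simple metrics, None when this sketch cannot judge."""
    if not outcome.success:
        return False
    text = str(outcome.result)
    if criterion.metric == "output_contains":
        return criterion.target in text
    if criterion.metric == "output_equals":
        return text == criterion.target
    # llm_judge and custom need an external evaluator; left unevaluated here.
    return None


def find_gaps(criteria, outcome):
    """List the ids of criteria that the outcome demonstrably failed to meet."""
    return [c.id for c in criteria if criterion_met(c, outcome) is False]


criteria = [
    CriterionSketch("sc-mentions-settings", "output_contains", "Settings", 0.5),
    CriterionSketch("sc-exact", "output_equals", "Done", 0.2),
    CriterionSketch("sc-kb-grounded", "llm_judge", "answers reference KB sources", 0.3),
]
outcome = OutcomeSketch(success=True, result="Go to Settings > Security to reset it.")
gaps = find_gaps(criteria, outcome)
print(gaps)
```

Returning `None` for metrics the sketch cannot evaluate keeps "not yet judged" distinct from "failed", so an unevaluated criterion is never reported as a gap.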
## The Feedback Loop

Outcome-driven development depends on closing the loop:

```mermaid theme={null}
flowchart LR
    subgraph DEFINE["Define"]
        G["Goal"] --> C["Criteria"]
    end
    subgraph MEASURE["Measure"]
        C --> T["Tests"]
        T --> O["Outcomes"]
    end
    subgraph ADAPT["Adapt"]
        O --> U["Update"]
    end
    U -.->|"Refine system"| T
    U -.->|"Adjust criteria"| C
    U -.->|"Revisit goal (Only with Developer Intervention)"| G
```

Outcomes inform updates at every level:

* **Update the system** - Fix bugs, improve prompts, add guardrails
* **Adjust criteria** - Thresholds may be too strict or too lenient
* **Revisit the goal** - Sometimes the original intent was inaccurate or is out of date. Developers need to revisit and update the goal according to the outcomes. However, the framework itself should not be able to change the original goal, so that the goal stays consistent and respected.

## The Outcome Taxonomy

All results are outcomes - they differ by **context and validity**, not by kind.

| Stage               | What You Observe                                | Outcome Type         |
| ------------------- | ----------------------------------------------- | -------------------- |
| **Tests**           | Pass/fail, errors, timings                      | Test outcomes        |
| **Staging / Pilot** | Task completion, escalations, user feedback     | Operational outcomes |
| **Production**      | Behavior change, value created, business impact | Real-world outcomes  |

### Test Outcomes

Controlled environment. Known inputs. Repeatable conditions.

* High internal validity
* Limited external validity
* Fast feedback loop
* Low cost to observe

### Operational Outcomes

Real users in limited deployment. Shadow mode or pilot groups.

* Moderate internal validity
* Growing external validity
* Reveals integration issues
* Surfaces unexpected edge cases

### Real-World Outcomes

Full production. Business metrics. User trust.

* Full external validity
* Highest signal quality
* Slowest feedback loop
* Highest cost to observe

## Example: Support Agent Development

### Phase 1: Define

**Goal:**

```
Build an agent that handles Tier-1 support tickets reliably.
```

**Success Criteria:**

```python theme={null}
[
    SuccessCriterion(
        id="sc-resolution",
        description="Resolves tickets without escalation",
        metric="custom",
        target="resolution_rate >= 0.6",
        weight=0.4,
    ),
    SuccessCriterion(
        id="sc-accuracy",
        description="Responses are factually correct",
        metric="llm_judge",
        target="no hallucinations or misinformation",
        weight=0.3,
    ),
    SuccessCriterion(
        id="sc-escalation",
        description="Escalates appropriately when needed",
        metric="custom",
        target="escalation_accuracy >= 0.9",
        weight=0.3,
    ),
]
```

**Constraints:**

```python theme={null}
[
    Constraint(type="hard", category="cost", value="cost_per_ticket < 0.20"),
    Constraint(type="hard", category="safety", value="no_pii_in_responses"),
    Constraint(type="soft", category="time", value="response_time < 30s"),
]
```

### Phase 2: Test Outcomes

Run automated tests against synthetic tickets. Each test produces an `Outcome`:

```python theme={null}
Outcome(
    success=True,
    result={"answer": "To reset your password, go to Settings > Security..."},
    error=None,
    state_changes={"ticket_status": "resolved"},
    tokens_used=847,
    latency_ms=1230,
    summary="Resolved password reset question correctly",
)
```

**Aggregate results across test suite:**

| Criterion             | Result | Status                        | Judgment |
| --------------------- | ------ | ----------------------------- | -------- |
| Resolution rate       | 52%    | Below 60% target              | `REPLAN` |
| Accuracy (llm\_judge) | 97%    | Passing                       | `ACCEPT` |
| Cost per ticket       | \$0.18 | Within \$0.20 hard constraint | `ACCEPT` |
| Escalation accuracy   | 87%    | Below 90% target              | `RETRY`  |

**Weighted score:** 78% (below 90% threshold)

**Insight:** Resolution and escalation need work. The `HybridJudge` recommends `REPLAN` - adjust prompts, add examples.
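As an illustration of how weights, a pass threshold, and hard constraints could combine into one verdict, here is a minimal sketch. It is an assumption, not the `HybridJudge` logic: `ConstraintCheck`, `weighted_score`, and `evaluate` are hypothetical names, and the score is a plain weighted mean of the Phase 2 percentages. The exact aggregation formula is not given in this document, so the number this sketch produces need not match the 78% reported above.

```python
from dataclasses import dataclass


@dataclass
class ConstraintCheck:
    """Hypothetical record of one constraint evaluated against measurements."""
    category: str
    kind: str       # "hard" or "soft"
    violated: bool


def weighted_score(scored):
    """Weighted mean of (weight, score) pairs, with each score in [0, 1]."""
    total = sum(weight for weight, _ in scored)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weight * score for weight, score in scored) / total


def evaluate(scored, checks, threshold=0.90):
    """Hard violations fail outright; otherwise compare the score to the threshold."""
    hard = [c.category for c in checks if c.kind == "hard" and c.violated]
    soft = [c.category for c in checks if c.kind == "soft" and c.violated]
    score = weighted_score(scored)
    return {
        "score": score,
        "passed": not hard and score >= threshold,
        "hard_violations": hard,
        "soft_warnings": soft,
    }


# Phase 2 figures from the table above, expressed as fractions.
scored = [(0.4, 0.52), (0.3, 0.97), (0.3, 0.87)]
checks = [
    ConstraintCheck("cost", "hard", violated=0.18 >= 0.20),
    ConstraintCheck("safety", "hard", violated=False),  # assumed: no PII found
    ConstraintCheck("time", "soft", violated=False),    # assumed: under 30s
]
report = evaluate(scored, checks)
print(report)
```

Keeping hard-constraint violations separate from the weighted score means a high score can never mask a safety or cost breach, while soft violations are reported as warnings without blocking a pass.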
### Phase 3: Operational Outcomes

Deploy to 5% of traffic in shadow mode:

| Metric             | Result   | Insight             |
| ------------------ | -------- | ------------------- |
| Resolution rate    | 58%      | Closer to target    |
| User satisfaction  | 3.8/5    | New signal          |
| Edge case failures | 12 types | Unexpected patterns |

**Insight:** Real traffic reveals edge cases tests missed. Update test suite.

### Phase 4: Real-World Outcomes

Full production deployment:

| Metric                 | Result  | Business Impact |
| ---------------------- | ------- | --------------- |
| Resolution rate        | 63%     | Target exceeded |
| Support cost reduction | 34%     | Direct savings  |
| Customer satisfaction  | +12 NPS | Trust building  |

## Best Practices

* Higher weights on critical criteria ensure the 90% threshold reflects true success
* Cost limits, PII protection, and scope boundaries should be hard constraints
* Use the `llm_judge` metric for quality criteria that can't be measured programmatically
* ACCEPT, RETRY, REPLAN, ESCALATE - let the `HybridJudge` guide the response

## Next Steps

* See the canonical goal model used by Hive runtime and evaluation
* Understand how outcome gaps drive iterative agent improvement
* Learn about nodes, edges, and the agent graph structure
* Create your first outcome-driven agent