Entropy, Engineered

Feb 9, 2026 ~ 9 min read

Requirements Before Implementation: The Research Flow


The agent writes tests. They pass. Code ships. Then you realize: the tests verified what the code does, not what it should do.

TDD assumes the developer knows what success looks like. Humans carry that knowledge implicitly — years of domain experience, context from conversations, intuition about edge cases. Agents don’t. They have only what’s in the prompt and codebase, plus some vibes from their training. They can vibe scenarios, but there’s no guarantee of completeness.

This post covers how to make requirements explicit before implementation begins.


The Agentic TDD Gap

TDD is elegant. Red-green-refactor. Tests define behavior. Implementation follows specification. For humans, it mostly works — though let’s be honest, writing tests is the boring part. The structure forces you to think about how your code will behave, but tests that verify implementation instead of requirements happen to humans too. Just less severely, because we have general understanding to fall back on.

Here’s the thing: for humans, documentation and requirements are a huge cognitive burden. It’s a different type of work that doesn’t feel like you’re moving the needle. Writing code feels productive. Writing specs feels like overhead. So humans skip it, or do it poorly, or do it after the fact.

Agents don’t have this problem. Documentation, requirements, test plans — the cognitive cost is basically zero. They don’t get bored. They don’t feel like they’re wasting time. They just need a system to follow. Not “be a TDD expert” — that’s a persona. A procedure: research requirements, document acceptance criteria, map tests to ACs, then implement.

Same documentation work, different cognitive cost: humans feel buried under requirements overhead while agents just process procedures.
Same documentation work, different cognitive cost: humans feel buried under requirements overhead while agents just process procedures.

Humans bring years of domain knowledge to TDD. They’ve had the meetings, seen the prior projects, developed intuition about edge cases. When something’s unclear, they ask. Agents have none of this. They see only what’s in the prompt and codebase — no intuition, just pattern matching. They don’t even know what questions to ask.

So agents write tests that verify implementation details (HOW), not requirements (WHAT).

# Implementation-driven (BAD)
def test_uses_threadpool():
    """Verify ThreadPoolExecutor is used."""
    # Tests HOW, not WHAT

# Requirement-driven (GOOD)
def test_function_works_in_async_context():
    """Verify function works when called from async context."""
    # Tests WHAT the requirement specifies

We already solved the instruction problem. In Claude Code Plugins, I covered why “procedures, not personas” matters — “You are a TDD expert” is useless, but “Run tests before implementing, verify failures before fixes” is actionable. For the full methodology on building skills that encode procedures, see The Agent Skills series.

But skills assume requirements exist. Even with perfect procedural skills, if the agent doesn’t know what success looks like, it produces meaningless tests. Skills cover HOW. They don’t cover WHAT.


The Decision Checklist

Before writing a single test, answer four questions:

  • I know EXACTLY what success looks like for this task
  • I can write specific test assertions from the acceptance criteria
  • I know how to verify each acceptance criterion
  • All edge cases are documented

If any answer is NO → Stop. Get clarity first.

The decision gate: four questions that must all pass before implementation begins.
The decision gate: four questions that must all pass before implementation begins.

Example — unclear vs clear:

Unclear: “Add model validation”

  • What models? Which validation rules? What happens on failure?
  • Agent will guess. Guesses become implementation details. Tests verify guesses.

Clear:

  • AC1: Validate model names against provider’s model list
  • AC2: Return helpful error with suggestions on invalid model
  • AC3: Cache model list for 24 hours
  • AC4: --refresh-models flag on the main CLI command to force cache refresh

Now the agent can write meaningful tests because the requirements are explicit.


The Research Command

The pattern: separate research from implementation.

When implementation starts, the agent is already thinking about code. Requirements get discovered mid-implementation. Changes cascade. Tests get adjusted to match code. That’s a TDD violation.

The fix: a research phase that happens BEFORE implementation claims the task.

I built a /research command for this:

/research imagefactory-v2-7no2

What it does:

  1. Loads task from beads
  2. Validates requirements (checks PRODUCT_SPEC, design docs)
  3. Researches the codebase — looks for implementation patterns, similar code, files to modify
  4. Answers the decision checklist
  5. Creates research artifact with:
    • Requirements summary (acceptance criteria)
    • Test plan (TDD-ready)
    • Architecture notes (files to modify, patterns to follow)
    • Questions/unknowns
    • Implementation checklist
  6. Labels task: research:complete or research:blocked

When questions arise during codebase research, they get documented too. By the time an implementation agent picks up the task, it’s already researched — just take it and do it.

Research artifact structure:

# Research: Add model validation (imagefactory-v2-7no2)

**Status**: ✅ Ready for implementation
**Requirements Source**: PRODUCT_SPEC.md Feature 11

## Requirements Summary
- AC1: ModelProvider ABC with discover_models(), validate_model()
- AC2: Provider implementations for Anthropic, Google, fal.ai
- AC3: File-based cache with 24h TTL
...

## Test Plan (TDD-Ready)
1. test_model_provider_interface - Verify ABC enforcement (AC1)
2. test_provider_registry_caching - Cache TTL behavior (AC3)
...

## Questions/Unknowns
✅ Q1: Which providers in Phase 1? → Anthropic, Google, fal.ai (confirmed)
❓ Q2: Cache location? → ASK USER

## Decision Checklist
- [x] I know EXACTLY what success looks like
- [x] I can write specific test assertions
- [x] I know how to verify each acceptance criterion
- [x] All edge cases documented

**Verdict**: ✅ READY FOR IMPLEMENTATION

Research is READ-ONLY. It doesn’t modify code. It produces an artifact that makes implementation mechanical.

For a working example, see the research workflow in the demo project.


Three Paths

When an implementation agent picks up a task, it checks for the research label:

bd show <task-id> | grep "Labels:"

Path A: research:complete

  • Research artifact exists with “READY FOR IMPLEMENTATION”
  • Agent reads artifact, skips requirements research
  • Goes straight to TDD (tests are already mapped to ACs)

Path B: research:blocked ⚠️

  • Research artifact exists but has unresolved questions
  • Agent reads questions, asks user
  • User answers, updates requirements docs
  • Agent updates artifact, changes label to complete
  • Proceeds to implementation

Path C: No label 📝

  • No research done
  • Agent does requirements research inline
  • Takes full time, questions emerge mid-implementation, risk of rework
Three paths based on research status: complete goes straight to code, blocked needs answers first, no label means inline research with rework risk.
Three paths based on research status: complete goes straight to code, blocked needs answers first, no label means inline research with rework risk.

Research is optional but worth it. Without it, implementation takes longer and requirements emerge mid-code.


Parallel Research

In my projects, tasks are often sequential — one builds on another, so I can’t run multiple implementations in parallel using git worktrees. Implementation becomes the bottleneck. This isn’t true for every project, but even when tasks are independent, splitting research from implementation is valuable because they’re different modes. Implementation is “go execute.” Research is interactive — you research something, we talk about the path forward.

Research can be parallelized. It’s read-only — doesn’t modify code. Multiple research sessions can run simultaneously, each producing an artifact for a different task.

The workflow:

# Session 1: Research upcoming tasks
/research imagefactory-v2-7no2   # → research:complete
/research imagefactory-v2-5czx   # → research:blocked (has questions)
/research imagefactory-v2-a91s   # → research:complete

# Session 2: Implementation (separate conversation)
bd show imagefactory-v2-7no2     # Has research:complete
# Read artifact, go straight to coding

Even blocked research saves time. Without it, you discover unknowns mid-implementation, have to stop, ask the user, wait. With blocked research, questions are already documented — get answers immediately.

In my experience, no research means maybe an hour including discovery. Pre-identified questions cut that roughly in half. Complete research? Five minutes to read the artifact, then you’re just executing.

Research runs in parallel, implementation is sequential. Each task arrives with its research data attached — never blocked on discovery.
Research runs in parallel, implementation is sequential. Each task arrives with its research data attached — never blocked on discovery.

Research as Task Data

My current approach: research artifacts in .claude/research/task-id.md

The problem is scattered files that aren’t atomic with task commits. You have to know to look there. They get out of sync with the task.

Better approach: keep research IN the beads task.

Beads has fields for this:

  • --notes: Implementation notes, research summary
  • --design: Architecture decisions, test plan

Now it’s atomic — commit task, commit research. Single source of truth. bd show <id> shows everything.

# Instead of creating .claude/research/task-id.md
# Put research summary in beads task

bd update imagefactory-v2-7no2 --design="$(cat <<'EOF'
## Requirements (PRODUCT_SPEC Feature 11)
- AC1: ModelProvider ABC
- AC2: Provider implementations
- AC3: File-based cache, 24h TTL

## Test Plan
1. test_model_provider_interface (AC1)
2. test_provider_registry_caching (AC3)

## Decision Checklist: All YES
EOF
)"

bd label add imagefactory-v2-7no2 research:complete

Task data stays with the task. Research is task data.


The Bigger Picture

This series covered context through MCP, skills for repeatable work, verification patterns, task tracking with beads, and sub-agents for parallel exploration. Requirements sit underneath all of it. Without them, the other pillars don’t work — skills tell the agent how to do things but not what to do, quality gates verify implementation but not correctness.

My current hypothesis: BDD, domain-driven design, and formal use case analysis become powerful leverage for AI agents. Structured requirements modeling — scenarios, invariants, constraints, behavior specs — is almost free to create and maintain when AI handles the documentation cost. If I can model system behavior in a structured way that anyone can read and review, that specification becomes the reference point for the agent when writing code, running tests, validating behavior.

The flow I’m testing: design use cases → define scenarios → define invariants and schema constraints → write behavior tests that verify business rules → agent implements against those tests. At any point, the agent can run behavior tests and know if it’s on track.

I don’t have a strong preference for BDD specifically. What I want is a systematic way to surface requirements during analysis — a thinking pattern that doesn’t miss scenarios, constraints, edge cases. I’ll share results as I get them. If you’re working on similar problems, I’d like to hear about it.


Try It

Before your next task, answer the four questions. If any answer is NO, don’t start. Research first.

The time “lost” to research is time saved debugging unclear requirements later. I’ve watched agents spend 90 minutes implementing something that took 30 minutes to redo once requirements were clear.

The workflow: /research <task-id> → artifact → research:complete → implementation as execution.


Final post in the Agentic Development Workflows series. Previously: Stop Using Claude Code Like a Copilot, MCP Setup, Plugins, Verification Patterns, Beads for Multi-Session Tasks, and Sub-Agents. For building skills that encode expertise, see The Agent Skills series.