A Development Harness for Building Software with AI Agents

Published

29 June 2026

For the last couple of years the conversation about AI and software has been mostly about models. Which one writes better React, which one hallucinates less, which one you should pay for. That matters, but it is only half the picture.

A language model on its own is not an agent. It answers one prompt and stops. Wrap it in a loop, give it tools, load instruction files, and you have something people are starting to call an agent harness: the runtime around the model that makes it useful for real work. Cursor, Claude Code, Windsurf, and the rest are harnesses. Addy Osmani and others frame it simply as Agent = Model + Harness. The model supplies reasoning. The harness supplies state, tool execution, context management, and guardrails.

That gets you a capable assistant in your editor. It does not, by itself, get you shipped software.

What was missing

Most of us have felt the gap. You ask an agent to build a feature. It writes a pile of code that looks plausible, maybe even runs, and then you spend the afternoon untangling assumptions it never surfaced, tests it never wrote, and architecture it invented on the fly.

That is vibe coding. The model is doing its job. The process is not.

What senior engineers actually do on a non-trivial feature looks more like this:

Talk the idea through until the scope is honest
Write down what you are building and what you are not
Break the work into small, verifiable steps
Build with tests, not hope
Prove the running app does what the requirements say
Ship with a clean branch and a reviewable PR

AI agents can do those steps. They just need to be told how, consistently, every time. That is what I have been building in levi-putna/skills: a chain of agent skills that sit on top of whatever harness your editor already provides.

I call the full chain a development harness. It is the engineering workflow layer: not the loop and the tools, but the spec-driven process that turns an idea into verified, mergeable code and keeps doing that as the application grows.

The goal is a clean split of responsibility. You explore ideas, own docs/technical/, approve impact reports, and tune locked decisions. The harness drafts architecture, writes tests first, executes surgical deltas, and ships when everything is green. Documentation and requirements stay in your hands. Implementation speed stays with the agent.

What others call this

The naming in this space is still settling. A few patterns keep showing up:

Spec-driven development (SDD) is probably the most common umbrella term. Mariano Aguero's spec-driven-development skill describes a gated pipeline: constitution, specify, plan, tasks, implement, validate, with drift detection at the end. Spec-Driven Develop pushes the same idea as a platform-agnostic Markdown workflow with GitHub Issue tracking. Anton Arhipov's talks on the topic land in the same place: requirements, plan, tasks, then guided execution one step at a time.

Agent skills as a packaging format is gaining traction separately. Addy Osmani's agent-skills repo ships lifecycle skills (interview, spec, plan, build, review, ship) with slash commands that activate the right skill at the right time. Anthropic's Skills system and Cursor's skill files follow the same idea: reusable Markdown instructions the agent loads when a task matches.

Harness and scaffolding get used in overlapping ways. The Hugging Face agent glossary draws a useful line: the harness is the execution loop; scaffolding is what the model works from (prompts, tools, skills, directory structure). Firecrawl's write-up on agent harnesses emphasises harness engineering: treat agent failures as system problems to fix permanently, not prompts to retry.

Harness-as-a-Service (HaaS) is newer framing from people building agent SDKs: you configure prompts, tools, and policies; the platform runs the loop.

My development harness sits in the scaffolding layer of that stack. Cursor or Claude Code is the harness. The skills are the opinionated engineering process loaded into it.

The pipeline

Seventeen skills chain across design, change control, plan, build, and ship. Each one writes a reviewable artifact and hands off to the next. Five skills focus on UI and component design, ensuring consistent, reusable, and tested UI elements. Two skills form the verification loops inside the build phase. Two utility skills sit outside the chain for copy and diagrams.

Development pipeline

Seventeen workflow skills across design, change control, plan, build, and ship, with verification loops inside build.

Install the lot with Agent Kit:

npx @levi-putna/agent-kit@latest add levi-putna/skills --global

Or pick skills individually. The skills documentation on this site covers each one.

Pick your path

New project? Start here:

New project

Help me brainstorm a [feature]. Then write technical docs and build it.

Skills invoked: brainstorming → technical-documentation → planning → project-setup → executing-plans → shipping

Existing app, changed the spec? Start here:

Brownfield change

I updated requirements.md. Analyse the impact and implement the delta.

Skills invoked: refining-docs → conformance-check → reconciling-changes → planning → executing-plans → shipping

Plan already written? Start here:

Execute plan

Implement the plan at docs/plans/2026-06-29-feature.md with checkpoints.

Skills invoked: executing-plans (which drives test-driven-development and end-to-end-testing)

Design: shape the idea, write the docs

brainstorming is the colleague conversation before anyone opens an IDE. One question at a time. Honest trade-offs. A pre-mortem before you commit. Output lands in docs/designs/ as a dated spec: the what and why at decision time.

technical-documentation expands that into the how. Architecture, data model, API contracts, non-functional requirements, engineering guidelines. Requirements get stable REQ-IDs (REQ-042, REQ-042-AC1) that trace through plans, tests, and audits. Every technical choice gets an owner tag: human-locked, agent-discretion, or needs-decision. Nothing ambiguous should survive into the build. Living docs live in docs/technical/ with no dates in the filenames. architecture.md is always the current architecture.

refining-docs is how you maintain that living truth. Gap analysis, cross-document consistency, Mermaid diagrams, tightening prose, changelog entries. You run it when the first draft is on disk, when decisions change, or when you are about to kick off a brownfield delta. It diagnoses before editing and never silently changes human-locked decisions. Doc changes that affect behaviour append to docs/technical/CHANGELOG.md at the REQ level.

Change control: safe brownfield work

Greenfield is the easy story. The harder problem is what happens six months in, when you edit requirements.md and need the agent to change only what the spec says, not rewrite half the app because the prompt felt ambitious.

That is what the change control phase is for.

conformance-check audits docs/technical/ against the codebase in both directions. Missing tests for requirements. Orphan tests with no REQ-ID. Locked stack violations. Contract drift. It writes a conformance report to docs/plans/ and hands off to refining-docs if the docs are wrong, or reconciling-changes if the code is behind.

reconciling-changes analyses what changed in the living docs since a baseline: last shipped plan, git tag, or changelog entry. Each change gets classified: ADD, MODIFY, REMOVE, or CLARIFY. The skill maps affected REQ-IDs to files, modules, and tests, then defines frozen scope: everything that must not change in the upcoming delta. You approve the impact report before any code gets touched.

Review gates sit between these hand-offs. The agent stops and waits for your approval unless you explicitly say to run through.

Plan: tasks an agent can execute without guessing

planning reads the approved design and the technical doc set, maps files to create or modify, and breaks work into small TDD tasks with checkbox steps. Each task should end in something independently testable. Plans save to docs/plans/ with dated filenames. They are history, not living truth.

Two modes matter:

Greenfield: new project or subsystem. All requirements in scope.
Delta: brownfield, after an approved impact report. Only ADD/MODIFY/REMOVE items. Frozen scope is explicit. Touch nothing else.

The planning skill assumes the implementer is capable but has zero context. That is the right default for agents.

Build: scaffold, loop, verify

project-setup runs once on greenfield work. It reads tech-stack.md, honours every locked decision, wires up node:test, linting, CI, and seeds AGENTS.md or CLAUDE.md from your engineering guidelines. You get a repo that builds and tests green on zero tests. Then you build features.

executing-plans is the orchestrator. Load the plan, sanity-check it, work in batches of three tasks, checkpoint with you after each batch. On delta work it requires a green test baseline before any edits, reports out-of-plan file touches, and runs a post-build conformance check. It does not freestyle. It follows the plan and drives the loops underneath.

test-driven-development is the inner loop. Failing test first. Watch it fail for the right reason. Minimal code to green. Refactor. Tests are named with REQ-IDs for traceability: test('REQ-042-AC1: includes header row', …). No weakening tests to force a pass. Unit tests beside source, functional tests in test/. All on the built-in node:test runner unless your stack doc locks something else.

end-to-end-testing is the outer gate. Unit green is not enough. This skill maps acceptance criteria to real scenarios through HTTP, CLI, or Playwright-driven browser flows. If the running app fails, it files fix tasks and sends control back into the TDD loop. It only hands off to shipping when the app actually works.

That inner/outer loop distinction matters. TDD proves the pieces. E2E proves the product.

UI and component design: component-first development

Five skills handle UI and component work, preventing the common AI pitfall of generating one-off, untested UI elements scattered across pages.

design-system runs once per project. It defines semantic design tokens (colours, typography, spacing), component creation standards, and a decision framework for when to extend vs. create components. This becomes the single source of truth for all UI decisions. Brand changes update in one place, not hundreds of hardcoded values.

ui-ux-best-practices reviews composed views for clean, professional design. It checks hierarchy, whitespace, responsiveness, accessibility, and catches generic AI-generated patterns. This runs before building pages, after composing components into screens, and before shipping UI work. The goal: screens that solve specific user tasks, not generic SaaS sameness.

component-library sets up the showcase infrastructure, typically Storybook 8 or an in-app gallery. Components are built in isolation with stories, then integrated into the application. This runs once per project before the first UI component.

component-development builds UI elements following component-driven development (CDD) best practices. Before creating any component, it audits existing components to check for reuse opportunities. The skill decides whether to extend an existing component with variants or create a new one based on design system rules. Components get TypeScript types, stories for all states, and full documentation. This integrates directly into executing-plans. When a task requires UI elements, work pauses, the component is built in isolation, verified, then integrated.

component-testing adds interaction tests, visual regression, accessibility audits, and unit tests. Components are verified in isolation before integration. When extending components, backward compatibility is tested. This runs automatically after component-development, before the component is used in the application.

The rule: never create UI inline. Components are always built in isolation, tested, documented, and showcased before integration. This prevents component proliferation, ensures design system compliance, and maintains a library of reusable, tested UI elements.

Ship: nothing outward-facing without a green build

shipping runs pre-flight: full test suite, e2e, lint, working tree review. Branch hygiene, conventional commits, changelog if the project keeps one. It asks before pushing or opening a PR. Merge only when you say so.

The skill is deliberately cautious. Shipping unverified work is worse than not shipping.

Utility skills

general-writing polishes prose anywhere you need it: clear, natural language without AI clichés. Not part of the build pipeline. Use for README files, blog posts, documentation, or any copy that needs to sound human.

mermaid-diagrams makes architecture and flow visuals look professional: themed styling, semantic colours, layout rules, captions. Complements technical-documentation and refining-docs when the diagram needs to look polished, not like a default black-line export.

Three workflows

Greenfield: idea to shipped software

text

1. brainstorming          →  docs/designs/2026-06-29-feature.md2. technical-documentation →  docs/technical/ (requirements, architecture, stack, …)3. planning (greenfield)  →  docs/plans/2026-06-29-feature.md      └─ design-system (if UI needed)          →  Design tokens, component standards      └─ component-library (if UI needed)      →  Storybook infrastructure4. project-setup          →  scaffolded repo5. executing-plans        →  builds in batches      ├─ test-driven-development                per task      ├─ ui-ux-best-practices                   review composed views      ├─ component-development                  when UI elements needed      │    └─ component-testing                 verify before integration      └─ end-to-end-testing                     when tasks complete6. shipping               →  PR

Brownfield: doc change to deployed delta

This is the day-to-day loop once an app exists. You edit docs; the agent changes only what the impact report allows.

text

1. refining-docs          →  update docs/technical/ + CHANGELOG.md2. conformance-check      →  know existing drift before planning3. reconciling-changes    →  impact report + frozen scope  ✋ approve4. planning (delta)       →  tasks for changed REQ-IDs only  ✋ approve5. executing-plans        →  baseline tests green first; surgical edits      ├─ test-driven-development      ├─ component-development                  when new UI elements needed      │    └─ component-testing                 verify before integration      ├─ end-to-end-testing      └─ conformance-check                      post-build audit6. shipping               →  PR

UI-focused: building components

When working primarily on UI elements:

text

1. design-system          →  define tokens, standards, component rules2. ui-ux-best-practices   →  review target flow for hierarchy and clarity3. component-library      →  set up showcase infrastructure (once)4. component-development  →  audit → extend or create UserCard with stories5. component-testing      →  interaction + visual + a11y tests6. component-development  →  audit → extend or create ProductCard with stories7. component-testing      →  interaction + visual + a11y tests8. ui-ux-best-practices   →  review composed page/view9. (integrate into app)   →  use <UserCard /> and <ProductCard /> in pages10. end-to-end-testing    →  verify in context11. shipping              →  PR

Rule: Components are always created in isolation before integration. Never build UI inline in pages. Always audit existing components before creating new ones.

Living docs vs dated history

Not every artifact in the pipeline stays current. The table below maps each docs folder to the skill that writes it, how long the content stays authoritative, and what a typical filename looks like.

Folder	Skill	Lifecycle	Filename
`docs/designs/`	`brainstorming`	Point-in-time decision record	`2026-06-29-checkout.md`
`docs/technical/`	`technical-documentation`, `refining-docs`	Living source of truth	`requirements.md`, `CHANGELOG.md`
`docs/plans/`	`planning`, `reconciling-changes`, `conformance-check`	Point-in-time tasks, impact, audits	`2026-06-29-checkout.md`, `2026-07-01-export-impact.md`

docs/technical/ is what you own and what planning reads before it writes tasks. refining-docs keeps it accurate as decisions land. Designs and plans are archaeology: useful context, not current state.

The traceability spine

REQ-IDs link the whole pipeline:

text

requirements.md          REQ-042, REQ-042-AC1       ↓CHANGELOG.md             MODIFY REQ-042       ↓impact report            MODIFY REQ-042 → src/export.ts, tests       ↓delta plan               Task 3: REQ-042-AC1       ↓test name                test('REQ-042-AC1: includes header row', …)       ↓conformance report       REQ-042 ✅ covered

Same ID from doc to plan to test to audit. Never rename a REQ-ID; mark deprecated or removed instead.

Controls against full rewrites

The change-control skills exist because agents will happily refactor everything if you let them. These controls keep brownfield work surgical:

Control	Mechanism
Stable REQ-IDs	Same ID from doc → plan → test → audit
Impact report	`reconciling-changes` defines scope before code
Frozen scope	Explicit "do not touch" in delta plans
Delta-only tasks	`planning` (delta) covers changed REQs only
Green baseline	`executing-plans` requires green suite before delta edits
REQ-ID test names	`test-driven-development`: traceable, auditable
Design system	Single source of truth for UI decisions; prevents component proliferation
Component audit	`component-development` checks existing components before creating new
Component-first	No inline UI; elements built in isolation before integration
Backward compatibility	Extensions tested against existing usage
UX review gate	`ui-ux-best-practices` prevents generic AI-generated patterns
Batch scope check	`executing-plans` reports out-of-plan file touches
E2e gate	`end-to-end-testing` verifies real behaviour
Post-build audit	`conformance-check` catches drift
Agent rules	`engineering-guidelines.md` → `AGENTS.md` non-negotiables

You manage the spec. The harness enforces it.

Why skills and not one giant prompt?

A single mega-prompt with "always do TDD and write specs first" degrades fast. Context fills up. The agent forgets steps. You end up re-explaining the process every session.

Skills load on demand. The agent reads SKILL.md when the task matches the description in the frontmatter. Each skill is a focused workflow with checkpoints, anti-rationalisation rules, and explicit handoffs. That mirrors how good harness design works elsewhere: specialised subagents or skills for spec, database design, and implementation, orchestrated by a main loop.

Chaining skills is cheaper than chaining conversations. The artifacts persist on disk. You review the spec, tune the architecture doc, approve the plan, then let the build run. You are not re-deriving context from chat history.

How this fits the modern stack

This harness is opinionated in deliberate ways:

Node and node:test by default. No Jest dependency unless your tech stack doc locks one. Fewer moving parts.
Human checkpoints at design and plan boundaries. The agent proposes; you approve before code.
Verification is non-negotiable. TDD inside the build. E2E before ship. Red builds do not get rationalised away.
Platform agnostic. Skills are Markdown. Agent Kit installs them into Cursor, Claude Code, Windsurf, Copilot, and the rest.

It is not a replacement for your judgement. It is a way to encode the judgement you already apply so the agent applies it too.

Where to go from here

Browse the full catalogue: /skills
Install with Agent Kit: /agent-kit
Source and install commands: github.com/levi-putna/skills

If you are already using AI coding agents daily, you have the harness. The question is whether you have the process. A development harness is my answer: you own the documentation and requirements; the agent builds faster than you ever could within those constraints. Spec-driven design, change control for brownfield work, test-driven build, end-to-end verification, and a shipping gate. Each step small enough to review, chained tightly enough to run end to end.

The model will keep improving. The harness around it is where durable engineering advantage lives.

Share this article:

What was missing

What others call this

The pipeline

Pick your path

Design: shape the idea, write the docs

Change control: safe brownfield work

Plan: tasks an agent can execute without guessing

Build: scaffold, loop, verify

UI and component design: component-first development

Ship: nothing outward-facing without a green build

Utility skills

Three workflows

Greenfield: idea to shipped software

Brownfield: doc change to deployed delta

UI-focused: building components

Living docs vs dated history

The traceability spine

Controls against full rewrites

Why skills and not one giant prompt?

How this fits the modern stack

Where to go from here

Similar articles