Soft Harness Doesn't Work

OpenAI published a post in February called “Harness Engineering.” Their team built a million-line product with zero manually-written code. One of their core principles: “mechanical enforcement over documentation.”

I agree. But they assert it. They don’t show how they know.

I’ve been running a multi-agent orchestration system for three months — most of the code written by AI agents I can’t directly supervise. I learned the same lesson, but I learned it by measuring. The measurements tell a more specific story than “enforcement beats documentation.”

There are two kinds of harness, and one of them is mostly decoration.

Hard and Soft

A harness is everything in the development environment that constrains agent behavior — tests, linters, hooks, documentation, skill files, architecture conventions. OpenAI uses the term to describe their entire system of guardrails and guidance.

But not all harness is equal. Some of it is deterministic: a build fails or it doesn’t, a pre-commit hook blocks or it doesn’t. I call this hard harness. The rest — prose instructions, behavioral constraints, CLAUDE.md conventions, skill documents — is probabilistic. I call it soft harness. Soft harness works by hoping agents read it and comply.

The question is: does the soft harness actually work?

I Tested It

I built a contrastive testing framework. Take a scenario and run it three ways: bare Claude (no skill loaded), Claude with the knowledge content only, and Claude with the full skill document. Compare outputs across multiple runs. In total: 265 trials across 7 skill documents.

The results broke into three categories.

Knowledge transfers. Routing tables, vocabulary, templates — factual content agents wouldn’t otherwise have. Consistent +5 point lift. When you tell an agent “here’s how the system works,” it uses that information. This is the least surprising finding.

Attention primers sometimes transfer. Stance items that change what agents notice — not what they do, but where they look — produced significant lift on specific scenarios. “Look for implicit assumptions between sources” took one scenario from 0% to 83% detection. But this only worked on problems that hide between information sources. Single-source problems got no benefit. And the mechanism is specific: “look for X” works, “do X” doesn’t.

Behavioral constraints don’t transfer at scale. This was the big finding. My orchestrator skill had 87 behavioral constraints — MUST do this, NEVER do that, always follow this procedure. On 5 of 7 test scenarios, agents with these 87 constraints performed identically to bare Claude with no skill loaded.

The dilution curve is steep. At 5 co-resident behavioral constraints, compliance starts dropping. At 10+, constraints become inert. The agents aren’t refusing to comply. They just can’t hold that many prohibitions against the current of the system prompt, which spends hundreds of words promoting the very tools and patterns you’re trying to restrict in your thirty words of “don’t do that.”

Eighty-three of my 87 behavioral constraints were non-functional. I’d been maintaining a 2,368-line skill document that performed the same as no document at all on most tasks.

I Also Learned It the Hard Way

Before I measured, I had three months of painful evidence I wasn’t reading correctly.

My CLAUDE.md file said “files over 1,500 lines require extraction before feature additions.” This is soft harness — a convention documented in prose. daemon.go grew from 667 lines to 1,559 lines over 60 days, across 30 individually correct agent commits. Each commit added a reasonable feature: stuck detection, health checks, auto-complete, orphan recovery. Each agent read the convention. None of them stopped.

The system also had three entropy spirals — feedback loops where agents degrading the system reported success. 1,625 commits lost across three rollbacks. After each spiral, I wrote a post-mortem. Each post-mortem identified the same root causes. Each recommended the same mitigations. Between the three post-mortems: zero mitigations implemented as hard gates.

The mitigations were documented. Documentation is soft harness. Between “we should implement a circuit breaker” and a circuit breaker that actually runs, there’s a gap that gets wider under pressure.

What Actually Works

Two things reliably changed agent behavior.

Hard gates. A pre-commit hook that warns when a commit adds 30+ net lines to an 800+ line file. A spawn gate that blocks implementation work on files over 1,500 lines without proof of prior architectural review. A build command that passes or fails. These are deterministic — agents can’t drift from them because there’s no compliance gradient. The gate fires or it doesn’t.
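The file-growth gate reduces to a short check over `git diff --cached --numstat` output. A sketch — the thresholds match the gate above; the function names and the wiring are illustrative assumptions, with the check kept pure so it's testable without a repo:

```python
NET_LINES_LIMIT = 30   # net lines a single commit may add to a big file
BIG_FILE_LINES = 800   # what counts as a "big file"

def growing_big_files(numstat: str, file_sizes: dict[str, int]) -> list[str]:
    """Flag files gaining NET_LINES_LIMIT+ net lines while already big.

    `numstat` is the tab-separated output of `git diff --cached --numstat`;
    `file_sizes` maps path -> current line count (e.g. from `wc -l`)."""
    flagged = []
    for line in numstat.strip().splitlines():
        added, deleted, path = line.split("\t")
        if added == "-":  # numstat reports "-" for binary files; skip
            continue
        net = int(added) - int(deleted)
        if net >= NET_LINES_LIMIT and file_sizes.get(path, 0) >= BIG_FILE_LINES:
            flagged.append(path)
    return flagged

# Synthetic example: a 1,559-line daemon.go gaining 40 net lines is flagged.
print(growing_big_files("45\t5\tdaemon.go\n3\t1\tsmall.go",
                        {"daemon.go": 1559, "small.go": 120}))
# → ['daemon.go']
```

A pre-commit hook would feed this the staged diff and print a warning (or exit nonzero, for a hard block) whenever the list is non-empty. No agent decides whether to comply; the gate fires or it doesn't.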

Structural attractors. When I created pkg/spawn/backends/, spawn-related code started landing there instead of in the monolithic spawn command file. spawn_cmd.go shrank by 840 lines. Not because I told agents to put code there — because the package name primed their attention. It’s an always-visible signal that doesn’t compete with the system prompt.

The structural attractor is interesting because it’s not really hard or soft. It’s architecture doing the work of instruction. Nobody has to read a rule. The package exists, so code goes there.

The Punchline

After measuring all of this, I stripped my orchestrator skill from 2,368 lines to 422. Removed 83 behavioral constraints. Kept the knowledge and a few attention primers. The simplified version scores the same or better on every test scenario.

I’d been maintaining a document the size of a short novel that was functionally equivalent to nothing.

OpenAI’s team spent every Friday — 20% of their engineering week — cleaning up what they call “AI slop” before they built automated cleanup. That’s the same entropy I saw, measured from a different angle. They solved it with periodic garbage collection agents. I solved it with gates that prevent the accumulation in the first place.

We converged on the same principle from opposite directions. They designed their harness before their code — greenfield advantage. I discovered my harness was mostly decoration after three entropy spirals and 1,625 lost commits. The expensive way to learn the same thing.

But I have one thing they don’t have in their post: the receipts. Not “documentation doesn’t work” as a principle. Documentation doesn’t work as a measured fact, with trial counts and dilution curves and a before-and-after that shows the 2,368 lines were dead weight.

If you’re writing CLAUDE.md files or AGENTS.md files full of MUST and NEVER and ALWAYS — go count them. If you have more than four behavioral constraints, the fifth one is probably already inert. And the only way to know is to test.
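Counting takes a minute. A rough sketch — the regex is an approximation and will miss constraints phrased without the capitalized keywords:

```python
import re

# Lines carrying an imperative behavioral constraint, approximated by
# the capitalized keywords above.
CONSTRAINT = re.compile(r"\b(MUST|NEVER|ALWAYS)\b")

def count_constraints(text: str) -> int:
    """Count lines containing at least one constraint keyword."""
    return sum(1 for line in text.splitlines() if CONSTRAINT.search(line))

sample = """\
You MUST run the linter before committing.
NEVER edit generated files.
Context: the build takes about two minutes.
ALWAYS prefer small diffs over large refactors.
"""
print(count_constraints(sample))  # → 3
```

Run it over your own CLAUDE.md. If the count comes back above four, the dilution data above says most of those constraints are probably doing nothing.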