A Codex-driven autonomous loop that critiques, improves, and ships our site — until it would survive a senior DeepMind review. Direct push to production. Two gates, eight axes, ten iterations, no babysitting.
You run one command on your laptop. The harness opens skills.tenten.co in a headless browser, takes screenshots across three viewports, runs Lighthouse and axe, and sends the whole package to Claude to score against an eight-axis design rubric. If the scores and the metrics both clear their thresholds, it stops.
Otherwise it writes a focused directive — "these three axes are weak, here is the doctrine, fix this" — and hands it to OpenAI Codex CLI, which edits the repo, commits, and pushes to main. Vercel deploys. The harness waits, runs a smoke test, and starts the next round. Up to ten times.
Production reflects every successful iteration in real time. If a smoke test catches broken output, the iteration is reverted before the loop continues. When the harness exits — passed, capped, or aborted — it writes a final report with the diff history, scores per round, and any reverts.
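The control flow described above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: `run_loop`, `Report`, `RoundResult`, and the four injected callables are hypothetical names, and the "aborted" exit path is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ITERATIONS = 10  # "Up to ten times"


@dataclass
class RoundResult:
    iteration: int
    reverted: bool


@dataclass
class Report:
    outcome: str = "capped"  # "passed" or "capped" in this sketch
    rounds: List[RoundResult] = field(default_factory=list)


def run_loop(
    evaluate: Callable[[], bool],          # capture + score; True when both gates pass
    improve_and_push: Callable[[], None],  # write directive, run Codex, wait for deploy
    smoke_ok: Callable[[], bool],          # smoke test against the live URL
    revert: Callable[[], None],            # undo the iteration's commit
    max_iterations: int = MAX_ITERATIONS,
) -> Report:
    report = Report()
    for iteration in range(1, max_iterations + 1):
        if evaluate():
            report.outcome = "passed"
            return report
        improve_and_push()
        reverted = not smoke_ok()
        if reverted:
            revert()  # broken output is rolled back before the next round
        report.rounds.append(RoundResult(iteration, reverted))
    # One last evaluation after the final iteration (an assumption of this sketch).
    report.outcome = "passed" if evaluate() else "capped"
    return report
```

Passing the steps in as callables keeps the loop itself testable without a browser, a model call, or a deploy.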
These were locked together in a four-round design session and shouldn't be re-litigated mid-loop. If any of them feels wrong on review, we change the spec — not the running agent.
Both gates must pass. Rubric AND metrics.
Single-target by design. Vercel-hosted.
User-initiated session, no daemon. Sleep-safe.
Editorial structure, Tenten dress. See §03.
New pages, features, interactions allowed. Documented risk.
Production reflects every pass. Fire-and-forget.
Equal weight across dimensions. No single axis dominates.
Auto-revert on failure. No diff-size cap.
DeepMind gives us structural discipline — monochrome canvas, scientific restraint, editorial flow. Tenten Editorial gives us the visual dress — warm paper, Instrument Serif paired with DM Sans, sodium-vapor amber used surgically, grain. The skeleton is Mountain View. The skin is Banqiao.
The skeleton: how the page thinks before it dresses.
The skin: how the structure speaks in our voice.
Claude scores each axis 0–10 from multi-viewport screenshots of every key page. The rubric gate passes only when the average is ≥ 8.0 and every individual axis is ≥ 6.0. An 8.5 average with a 4 in accessibility does not pass.
Visual hierarchy clarity, section sequencing, scanability of the page.
Correct Instrument Serif & DM Sans use. Sizing scale, leading, no in-between sizes.
Whitespace generosity, vertical rhythm, ≥ 96px section gaps on desktop.
Paper background dominance. Amber used surgically. Dark sections with radial glow.
Staggered fadeUp present and tuned. Grain on hero. Motion restrained.
WCAG AA contrast, semantic HTML, keyboard nav, focus states, alt text.
Reads as Tenten Editorial — not generic editorial. Cross-checked against tenten.co.
IA logic, concrete copy, CTAs visible, no marketing fluff.
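The rubric gate rule stated above (average ≥ 8.0 and every axis ≥ 6.0) can be expressed as a small check. The function name and the axis keys below are illustrative assumptions; only the two thresholds come from the text.

```python
from statistics import mean
from typing import Mapping

MIN_AVERAGE = 8.0  # average across all eight axes
MIN_AXIS = 6.0     # floor for every individual axis


def rubric_gate(scores: Mapping[str, float]) -> bool:
    """Pass only when the equal-weight average and the weakest axis both clear."""
    values = list(scores.values())
    return mean(values) >= MIN_AVERAGE and min(values) >= MIN_AXIS


# Worked example from the text: an 8.5 average with a 4 in accessibility.
# Axis keys are hypothetical labels, not the rubric's official names.
scores = {
    "hierarchy": 9, "typography": 9, "spacing": 9, "color": 9,
    "motion": 9, "brand_fidelity": 9, "content": 10, "accessibility": 4,
}
print(mean(scores.values()))  # 8.5
print(rubric_gate(scores))    # False
```

The per-axis floor is what stops one strong average from hiding a single weak dimension.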
Rubric-only would let a beautiful page ship broken accessibility. Metric-only would let a Lighthouse-perfect Bootstrap shell pass. Requiring both is the entire point of the hybrid stop condition.
Claude scores screenshots against the doctrine. Both conditions must hold.
Tooling measures the live URL after deploy. All rows, all viewports.
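A rough sketch of the hybrid stop condition follows. The metric names and threshold values are placeholders, since the actual metric rows are not reproduced in this text; the sketch also assumes every row is higher-is-better. What it shows is the shape of the rule: every row must clear on every viewport, and the loop stops only when both gates hold.

```python
from typing import Dict

# Hypothetical thresholds for illustration only.
THRESHOLDS: Dict[str, float] = {
    "lighthouse_performance": 90,
    "lighthouse_accessibility": 95,
}


def metric_gate(
    measured: Dict[str, Dict[str, float]],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> bool:
    """measured maps viewport name -> metric name -> value.

    A missing metric counts as a failure; an empty measurement set never passes.
    """
    return bool(measured) and all(
        viewport_metrics.get(name, float("-inf")) >= minimum
        for viewport_metrics in measured.values()
        for name, minimum in thresholds.items()
    )


def should_stop(rubric_passed: bool, metrics_passed: bool) -> bool:
    """Both gates must pass: rubric AND metrics."""
    return rubric_passed and metrics_passed
```

For example, a desktop accessibility score of 90 against a threshold of 95 fails the metric gate even if every mobile row clears.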
Full latitude plus direct-to-main is the most aggressive configuration the harness supports. We picked it deliberately. These are the trade-offs, written down so we don't pretend they don't exist later.
With full latitude, an iteration could ship a testimonials section, a pricing page, or a chatbot widget if the model believes it serves the doctrine.
Mitigation: Read the iteration log after every run. Treat each iteration as a draft; revert any single commit with normal git tooling.
Direct push to main means every iteration is live the moment Vercel finishes building, including ones that don't render.
Mitigation: Smoke test runs after every deploy. Non-200s, missing selectors, or page errors trigger an automatic git revert before the next iteration begins.
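The revert decision in this mitigation might look like the sketch below. `SmokeResult`, `smoke_failed`, and `revert_last_commit` are hypothetical names, and the sketch assumes each iteration lands as a single commit on `main` with a remote called `origin`.

```python
import subprocess
from dataclasses import dataclass, field
from typing import List


@dataclass
class SmokeResult:
    status_codes: List[int]
    missing_selectors: List[str] = field(default_factory=list)
    page_errors: List[str] = field(default_factory=list)


def smoke_failed(result: SmokeResult) -> bool:
    """Any non-200, missing selector, or page error counts as a failure."""
    return (
        any(code != 200 for code in result.status_codes)
        or bool(result.missing_selectors)
        or bool(result.page_errors)
    )


def revert_last_commit(repo_dir: str) -> None:
    """Create a revert commit for HEAD and push it, so the host redeploys.

    Assumption: the iteration is exactly one commit; a multi-commit
    iteration would need a commit range instead of HEAD.
    """
    subprocess.run(["git", "revert", "--no-edit", "HEAD"], cwd=repo_dir, check=True)
    subprocess.run(["git", "push", "origin", "main"], cwd=repo_dir, check=True)
```

Using `git revert` rather than a force-push keeps the failed iteration in history, which matches the final report's list of reverts.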
Codex may drift toward generic Vercel-template editorial vibes — Inter, Space Grotesk, off-white, navy accent — that match the doctrine word-for-word but lose our specific signature.
Mitigation: Brand Fidelity is one of eight rubric axes, scored against tenten.co as the reference. v1.1 will pass a calibration screenshot of tenten.co into the judge prompt explicitly.
We chose not to cap diff size per iteration. Codex's reasoning is most useful when it can refactor freely; a cap forces it into small, conservative edits.
Mitigation: Smoke test plus iteration cap plus post-run review. If a single iteration is destructive but smoke-clean, we revert manually.
If iteration 7 drops accessibility from 98 → 64, the loop doesn't abort — it just won't pass the metric gate. The next iteration may fix it or not.
Mitigation: The 10-round cap is the backstop. v1.1 may add metric-regression abort if this fails in practice.
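If the metric-regression abort is ever added, one possible shape is a per-metric drop check between consecutive rounds. This is speculative: the function name and the 10-point tolerance are assumptions, not part of v1.0 or any stated v1.1 design.

```python
from typing import Dict, List


def regressed_metrics(
    previous: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 10.0,  # hypothetical tolerance, in score points
) -> List[str]:
    """Return the metrics whose score fell by more than max_drop since last round."""
    return sorted(
        name
        for name, before in previous.items()
        if name in current and before - current[name] > max_drop
    )


# The example from the text: accessibility falls from 98 to 64.
print(regressed_metrics({"accessibility": 98}, {"accessibility": 64}))
# ['accessibility']
```

A non-empty result could then abort the loop or trigger a revert, depending on how aggressive the policy should be.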
Three items to verify before pulling the trigger. Roughly thirty minutes, in two passes — one now, one after the baseline run.
Re-read the twelve principles above. If any one of them is wrong — wrong direction, wrong threshold, wrong vibe for what skills.tenten.co should feel like — fix it in the SKILL.md before the first real run.
Before any pushing happens, the harness runs once in baseline mode — capture and score, no Codex. You get an eight-axis scoreboard plus screenshots. If any axis looks mis-scored (judge too harsh, too generous, or confused), tighten the judge prompt before the real run.
Fire-and-forget on production is a real decision. If the trade-offs read differently after one more pass — push to a long-lived branch first, cap diff size, add metric-regression abort — change the config in the SKILL.md now. v1.0 is locked but not yet shipped.
Three checks in §07, then a baseline capture, then the first real iteration. Production goes live the moment the first push lands.