Internal Brief · Design Harness · v1.0 · skills.tenten.co

Teaching skills.tenten.co to design itself.

A Codex-driven autonomous loop that critiques, improves, and ships our site — until it would survive a senior DeepMind review. Direct push to production. Two gates, eight axes, ten iterations, no babysitting.

Target: skills.tenten.co
Doctrine: DeepMind × Tenten Editorial
Locked: 13 May 2026 · v1.0.0
Pre-run checks: Three, see §07
§01

What the harness actually does.

You run one command on your laptop. The harness opens skills.tenten.co in a headless browser, takes screenshots across three viewports, runs Lighthouse and axe, and sends the whole package to Claude to score against an eight-axis design rubric. If the scores and the metrics both clear their thresholds, it stops.

Otherwise it writes a focused directive — "these three axes are weak, here is the doctrine, fix this" — and hands it to OpenAI Codex CLI, which edits the repo, commits, and pushes to main. Vercel deploys. The harness waits, runs a smoke test, and starts the next round. Up to ten times.
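As a rough illustration of the directive step, the sketch below picks the lowest-scoring axes and assembles a plain-text directive. The function names, the axis keys, and the directive wording are all hypothetical; the brief does not specify the actual format handed to Codex.

```python
from typing import Dict, List


def weakest_axes(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k lowest-scoring axis names, weakest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [name for name, _ in ranked[:k]]


def build_directive(scores: Dict[str, float], doctrine: str, k: int = 3) -> str:
    """Assemble a focused directive: weak axes, then the doctrine text.

    The layout of this text is an assumption made for illustration only.
    """
    weak = weakest_axes(scores, k)
    weak_line = ", ".join(f"{axis} ({scores[axis]:.1f})" for axis in weak)
    return "\n".join([
        f"Weak axes: {weak_line}",
        "",
        "Doctrine:",
        doctrine.strip(),
        "",
        "Improve the weak axes without regressing the others.",
    ])
```

Keeping the directive to a handful of axes matches the "focused" intent described above: the executor gets a short list of priorities rather than the full scoreboard.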

Production reflects every successful iteration in real time. If a smoke test catches broken output, the iteration is reverted before the loop continues. When the harness exits — passed, capped, or aborted — it writes a final report with the diff history, scores per round, and any reverts.

[Diagram: CAPTURE (Playwright, Lighthouse, axe) → SCORE (Claude as judge, 8 axes, 0–10) → GATE (both pass?) → CODEX CLI (full latitude; directive in, diff out) → PUSH · DEPLOY · SMOKE (git push origin main, Vercel, auto-revert). Loop until pass, or stop after 10 rounds.]
The loop, in one diagram. Five stages, one gate, one exit.
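The control flow in the diagram can be sketched as a small driver. This is a minimal sketch under stated assumptions, not the harness itself: the four callables are hypothetical stand-ins for the capture/score/gate step, the Codex run, the smoke test, and the revert, and the final re-check after the last round is an assumption the brief does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ROUNDS = 10  # iteration cap stated in the brief


@dataclass
class RunReport:
    outcome: str = "capped"   # "passed" or "capped"
    rounds: int = 0           # Codex iterations actually attempted
    reverts: List[int] = field(default_factory=list)  # rounds that were reverted


def run_loop(
    gates_pass: Callable[[], bool],    # hypothetical: capture, score, check both gates
    run_codex: Callable[[int], None],  # hypothetical: directive in, commit and push out
    smoke_ok: Callable[[], bool],      # hypothetical: smoke test after deploy
    revert: Callable[[], None],        # hypothetical: undo the last pushed iteration
    max_rounds: int = MAX_ROUNDS,
) -> RunReport:
    report = RunReport()
    for round_no in range(1, max_rounds + 1):
        if gates_pass():
            report.outcome = "passed"
            return report
        run_codex(round_no)
        report.rounds = round_no
        if not smoke_ok():
            # Broken output is reverted before the loop continues.
            revert()
            report.reverts.append(round_no)
    # Assumption: the last iteration is scored once more before reporting.
    if gates_pass():
        report.outcome = "passed"
    return report
```

Passing the stages in as callables keeps the loop itself testable without a browser, a repo, or a deploy target.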
§02

Eight decisions, locked.

These were locked together in a four-round design session and shouldn't be re-litigated mid-loop. If any of them feels wrong on review, we change the spec — not the running agent.

01 · STOP

Hybrid stop condition

Both gates must pass. Rubric AND metrics.

02 · TARGET

skills.tenten.co

Single-target by design. Vercel-hosted.

03 · EXECUTOR

OpenAI Codex CLI, local

User-initiated session, no daemon. Sleep-safe.

04 · DOCTRINE

DeepMind × Tenten Editorial

Editorial structure, Tenten dress. See §03.

05 · SCOPE

Full latitude

New pages, features, interactions allowed. Documented risk.

06 · DEPLOY

Direct push to main

Production reflects every pass. Fire-and-forget.

07 · RUBRIC

Balanced, 8 axes

Equal weight across dimensions. No single axis dominates.

08 · GUARDRAILS

Smoke test + 10-round cap

Auto-revert on failure. No diff-size cap.

§03

The doctrine, in two parts.

DeepMind gives us structural discipline — monochrome canvas, scientific restraint, editorial flow. Tenten Editorial gives us the visual dress — warm paper, Instrument Serif paired with DM Sans, sodium-vapor amber used surgically, grain. The skeleton is Mountain View. The skin is Banqiao.

Part one · The structure

DeepMind editorial.

The skeleton: how the page thinks before it dresses.

  • i. Monochrome canvas first. Color enters only at meaningful moments. The default ground is neutral, never decorative.
  • ii. Generous whitespace. Section padding ≥ 96px desktop, ≥ 64px mobile. Prose sits in the 680–820px band.
  • iii. Strong typographic hierarchy. Five clear steps: display, headline, subhead, body, caption. No font sizes in between.
  • iv. Sectioned editorial flow. Each section answers one question and hands off cleanly to the next. No infinite-scroll soup.
  • v. Scientific restraint. No decorative gradients, drop shadows, or glassmorphism unless they earn their place via meaning.
  • vi. Information density bias. Prefer one dense, well-organized page over five thin ones.
Part two · The dress

Tenten Editorial.

The skin: how the structure speaks in our voice.

  • i. Instrument Serif + DM Sans. Serif for display, sans for body and UI. Noto Serif TC swap for zh-Hant.
  • ii. Warm paper #f6f4f0. The dominant ground. Not white, not cream: paper.
  • iii. Dark moments earn a glow. Hero / contrast sections use near-black with a soft radial amber glow. Never flat black.
  • iv. Sodium-vapor amber. Reserved for one element per viewport: a CTA, a status pill, or an inline emphasis. Never body text.
  • v. Staggered fadeUp. ~80ms stagger between siblings on first paint. No scroll-triggered choreography.
  • vi. Grain over hero blocks. Low-opacity SVG noise overlay. Texture without ornament.
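The numeric parts of the doctrine could be captured as a small token table, sketched below. The dictionary keys and both helper functions are hypothetical names chosen for illustration; only the values (96px, 64px, 680–820px, #f6f4f0, ~80ms, the five type steps) come from the principles above.

```python
from typing import List

# Values taken from the twelve principles; key names are assumptions.
DOCTRINE = {
    "section_padding_min_px": {"desktop": 96, "mobile": 64},
    "prose_width_px": (680, 820),
    "paper": "#f6f4f0",
    "stagger_ms": 80,
    "type_steps": ("display", "headline", "subhead", "body", "caption"),
}


def layout_violations(viewport: str, section_padding_px: int, prose_width_px: int) -> List[str]:
    """List the layout thresholds a measured section misses.

    viewport must be "desktop" or "mobile", matching the padding table.
    """
    problems = []
    minimum = DOCTRINE["section_padding_min_px"][viewport]
    if section_padding_px < minimum:
        problems.append(f"section padding {section_padding_px}px < {minimum}px")
    low, high = DOCTRINE["prose_width_px"]
    if not low <= prose_width_px <= high:
        problems.append(f"prose width {prose_width_px}px outside {low}-{high}px")
    return problems


def stagger_delays(sibling_count: int) -> List[int]:
    """fadeUp delay in milliseconds for each sibling on first paint."""
    return [index * DOCTRINE["stagger_ms"] for index in range(sibling_count)]
```

A table like this would only cover the measurable half of the doctrine; principles such as "scientific restraint" still need the judge.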
§04

Eight axes, equal weight.

Claude scores each axis 0–10 from multi-viewport screenshots of every key page. The rubric gate passes only when the average is ≥ 8.0 and every individual axis is ≥ 6.0. An 8.5 average with a 4 in accessibility does not pass.
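The two rubric conditions reduce to a few lines of arithmetic. The sketch below is illustrative only; the function name and the axis keys in the usage note are assumptions, while the 8.0 and 6.0 thresholds come from the paragraph above.

```python
from statistics import mean
from typing import Dict

AVERAGE_MIN = 8.0  # required average across all eight axes
AXIS_MIN = 6.0     # required floor for every individual axis


def rubric_gate(scores: Dict[str, float]) -> bool:
    """Pass only if the average is >= 8.0 AND no single axis is below 6.0."""
    if len(scores) != 8:
        raise ValueError("expected a score for each of the eight axes")
    values = list(scores.values())
    return mean(values) >= AVERAGE_MIN and min(values) >= AXIS_MIN
```

For instance, six axes at 9, one at 10, and accessibility at 4 average exactly 8.5, yet `rubric_gate` returns `False` because the accessibility score sits below the 6.0 floor.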

i

Hierarchy & Structure

Visual hierarchy clarity, section sequencing, scanability of the page.

ii

Typography

Correct Instrument Serif & DM Sans use. Sizing scale, leading, no in-between sizes.

iii

Spacing & Rhythm

Whitespace generosity, vertical rhythm, ≥ 96px section gaps on desktop.

iv

Color & Material

Paper bg dominance. Amber used surgically. Dark sections with radial glow.

v

Motion & Feel

Staggered fadeUp present and tuned. Grain on hero. Motion restrained.

vi

Accessibility

WCAG AA contrast, semantic HTML, keyboard nav, focus states, alt text.

vii

Brand Fidelity

Reads as Tenten Editorial — not generic editorial. Cross-checked against tenten.co.

viii

Information Clarity

IA logic, concrete copy, CTAs visible, no marketing fluff.

§05

Two gates. Both must pass.

Rubric-only would let a beautiful page ship broken accessibility. Metric-only would let a Lighthouse-perfect Bootstrap shell pass. Requiring both is the entire point of the hybrid stop condition.

Gate 01 · Subjective

The rubric gate.

Claude scores screenshots against the doctrine. Both conditions must hold.

Average across 8 axes: ≥ 8.0
Minimum on any single axis: ≥ 6.0
Viewports judged: mobile · tablet · desktop
Pages judged: all KEY_URLS
A glossy page with poor accessibility cannot pass this gate.
Gate 02 · Objective

The metric gate.

Tooling measures the live URL after deploy. All rows, all viewports.

Lighthouse Performance: ≥ 90
Lighthouse Accessibility: ≥ 95
Lighthouse Best Practices: ≥ 95
Lighthouse SEO: ≥ 90
axe-core serious violations: = 0
Console errors on key pages: = 0
A WCAG-perfect Bootstrap shell cannot pass the other gate.
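The metric gate is a plain threshold check over one measurement. In this sketch the function names and category keys are hypothetical, and the scores are assumed to arrive on the 0–100 scale used in the table; a caller would run it once per page and viewport and require every run to pass.

```python
from typing import Dict, List

# Minimum Lighthouse scores from the table, on a 0-100 scale.
LIGHTHOUSE_MIN: Dict[str, int] = {
    "performance": 90,
    "accessibility": 95,
    "best-practices": 95,
    "seo": 90,
}


def metric_failures(lighthouse: Dict[str, float], axe_serious: int, console_errors: int) -> List[str]:
    """Return one message per missed threshold; an empty list means the gate passes."""
    failures = []
    for category, minimum in LIGHTHOUSE_MIN.items():
        score = lighthouse.get(category, 0.0)  # a missing category counts as 0
        if score < minimum:
            failures.append(f"lighthouse {category}: {score:g} < {minimum}")
    if axe_serious != 0:
        failures.append(f"axe serious violations: {axe_serious}")
    if console_errors != 0:
        failures.append(f"console errors: {console_errors}")
    return failures


def metric_gate(lighthouse: Dict[str, float], axe_serious: int, console_errors: int) -> bool:
    return not metric_failures(lighthouse, axe_serious, console_errors)
```

Returning the list of failures rather than a bare boolean gives the final report something concrete to log for each round.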
§06

What we're accepting on purpose.

Full latitude plus direct-to-main is the most aggressive configuration the harness supports. We picked it deliberately. These are the trade-offs, written down so we don't pretend they don't exist later.

Quality · HIGH

Codex may add features nobody asked for.

With full latitude, an iteration could ship a testimonials section, a pricing page, or a chatbot widget if the model believes it serves the doctrine.

Mitigation: Read the iteration log after every run. Treat each iteration as a draft; revert any single commit with normal git tooling.

Operational · MEDIUM

Production breaks if Codex ships broken code.

Direct push to main means every iteration is live the moment Vercel finishes building, including ones that don't render.

Mitigation: Smoke test runs after every deploy. Non-200s, missing selectors, or page errors trigger an automatic git revert before the next iteration begins.
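A minimal sketch of the smoke check and the revert it triggers follows. The `SmokeResult` shape and both function names are hypothetical, and the revert assumes each iteration lands as a single commit at the tip of main, which the brief does not confirm.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class SmokeResult:
    url: str
    status: int                                                 # HTTP status of the page
    missing_selectors: List[str] = field(default_factory=list)  # expected elements not found
    page_errors: List[str] = field(default_factory=list)        # uncaught errors on the page


def smoke_failed(results: Iterable[SmokeResult]) -> bool:
    """True if any page returned a non-200, lacks a selector, or raised a page error."""
    return any(
        r.status != 200 or r.missing_selectors or r.page_errors
        for r in results
    )


def revert_last_iteration(run: Callable[..., object] = subprocess.run) -> None:
    """Revert the most recent commit and push, so production redeploys the prior state.

    Assumption: one iteration equals exactly one commit on main.
    """
    run(["git", "revert", "--no-edit", "HEAD"], check=True)
    run(["git", "push", "origin", "main"], check=True)
```

Using `git revert` rather than a reset keeps the broken iteration in history, which fits the final report's promise to list every revert.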

Brand drift · MEDIUM

"Editorial-feeling" is not Tenten Editorial.

Codex may drift toward generic Vercel-template editorial vibes — Inter, Space Grotesk, off-white, navy accent — that match the doctrine word-for-word but lose our specific signature.

Mitigation: Brand Fidelity is one of eight rubric axes, scored against tenten.co as the reference. v1.1 will pass a calibration screenshot of tenten.co into the judge prompt explicitly.

No diff cap · ACCEPTED

A single iteration could rewrite half the site.

We chose not to cap diff size per iteration. Codex's reasoning is most useful when it can refactor freely; a cap forces it into small, conservative edits.

Mitigation: Smoke test plus iteration cap plus post-run review. If a single iteration is destructive but smoke-clean, we revert manually.

No metric abort · ACCEPTED

Lighthouse can tank mid-run without stopping the loop.

If iteration 7 drops accessibility from 98 → 64, the loop doesn't abort — it just won't pass the metric gate. The next iteration may fix it or not.

Mitigation: The 10-round cap is the backstop. v1.1 may add metric-regression abort if this fails in practice.

§07

Three checks before the first run.

Three items to verify before pulling the trigger. Roughly thirty minutes, in two passes — one now, one after the baseline run.

Confirm the doctrine in §03.

Re-read the twelve principles above. If any one of them is wrong — wrong direction, wrong threshold, wrong vibe for what skills.tenten.co should feel like — fix it in the SKILL.md before the first real run.

When: before the first real run · Output: locked or redlined doctrine prose.

Sanity-check the baseline rubric scores.

Before any pushing happens, the harness runs once in baseline mode — capture and score, no Codex. You get an eight-axis scoreboard plus screenshots. If any axis looks mis-scored (judge too harsh, too generous, or confused), tighten the judge prompt before the real run.

When: after the Day -2 baseline run · Output: any judge-prompt adjustments.

Re-read §06 with fresh eyes.

Fire-and-forget on production is a real decision. If the trade-offs read differently after one more pass — push to a long-lived branch first, cap diff size, add metric-regression abort — change the config in the SKILL.md now. v1.0 is locked but not yet shipped.

When: before the first real run · Output: acknowledged, or one specific config change.

Next step

Ready to run.

Three checks in §07, then a baseline capture, then the first real iteration. Production goes live the moment the first push lands.

Skill spec: skills-tenten-design-harness.skill.md
Version: v1.0.0, locked 13 May 2026
Owner: Tenten