March 30, 20265 min

Writing Prompts That Survive Production

Most prompt guides optimize for demos. Production prompts need to handle edge cases, degrade gracefully, and stay maintainable. The difference matters.

Prompt EngineeringAIProduction

Demo prompts work great in demos. "Summarize this article" returns a clean summary. "Extract the key entities" returns a nice list. Ship that to production and watch it break on the first malformed input.

The gap between demo prompts and production prompts is the same gap between a script and a system. One handles the happy path. The other handles everything.

The Three Failure Modes

Production prompts fail in predictable ways. Once you know the patterns, you can design against them.

Drift: the model's interpretation of your prompt shifts as context accumulates. A prompt that works perfectly in message 1 starts hallucinating by message 15 because earlier responses have polluted the context. Fix: restate critical constraints at decision points, also not only at the top.

Edge collapse: the model encounters an input it wasn't designed for and produces confidently wrong output instead of signaling uncertainty. The classic: a sentiment classifier that labels gibberish as "positive" because it always picks something. Fix: give the model an explicit "I can't classify this" option and define when to use it.

Format rot: the model returns valid content in the wrong structure. You asked for JSON, it returns JSON with markdown wrapping. You asked for a list, it returns a paragraph with embedded list items. Fix: provide a concrete output example, also not only a format description.

Structural Patterns That Work

After writing hundreds of production prompts across classification, extraction, generation, and analysis tasks, a few structural patterns consistently outperform.

The constraint sandwich: state the task, list constraints, restate the most critical constraint. Models weight the end of the prompt more heavily. If "never include PII" is your most important constraint, say it last.

Explicit refusal criteria: tell the model exactly when to say "I don't know" or "this input doesn't match." Without this, models will always produce something, even when the right answer is nothing.

Output scaffolding: provide the exact structure you expect, with placeholders. Not "return JSON with the fields name, score, and reasoning" but:

{
  "name": "...",
  "score": 0-10,
  "reasoning": "One sentence explaining the score"
}

The model mirrors structure more reliably than it follows structural descriptions.

Temperature as a design parameter: temperature 0 for extraction and classification. Temperature 0.3-0.7 for generation where variety matters. Temperature 1.0+ only for brainstorming where you want surprise. Most production tasks should be at 0.

Testing Prompts Like Code

The mistake most teams make: they test prompts manually, with their own inputs, against their own expectations. This is like testing a function by calling it once with the example from the README.

Production prompt testing needs:

Edge cases as fixtures: empty input, extremely long input, input in the wrong language, input with injection attempts, input that contradicts the prompt's assumptions. Build a test suite of these and run every prompt revision against all of them.

Regression tracking: when you improve a prompt for one case, you need to know if other cases degraded. An A/B comparison template that runs both versions against the full test suite and diffs the outputs.

Scoring rubrics: not "did it work?" but "did it score 8+ on accuracy, 7+ on format compliance, and 6+ on reasoning quality?" Structured scoring catches subtle degradation that pass/fail misses.

The Prompt Testing Framework includes templates for all three of these patterns, pre-built for Claude, GPT, and Gemini.

The Maintenance Angle

Prompts are code. They need versioning, review, and documentation.

Every prompt in our system includes:

A version number
A one-line purpose statement
The last date it was tested against the full edge case suite
The model and temperature it was designed for

When a model updates (GPT-4 to GPT-4o, Claude 3 to Claude 4), every prompt gets retested. Model updates change prompt behavior in subtle ways; a prompt that worked perfectly on Claude 3.5 might need adjustment on Claude 4 because the model's default behavior shifted.

The Checklist

Before shipping a prompt to production:

Have you tested with empty/null input?
Have you tested with adversarial input?
Does the model have an explicit "I can't do this" path?
Is the output format specified by example, not description?
Are critical constraints restated at the end of the prompt?
Is the temperature appropriate for the task type?
Have you run a regression against the previous prompt version?

If any answer is "no," the prompt isn't production-ready. It's a demo.

The Prompt Engineering OS covers 30 chapters of patterns like these, with 100+ templates you can adapt. The Quick Reference Cards distill the most critical patterns into printable cheat sheets.