Improving Reliability

Make prompts dependable over time — debug failures methodically, version your prompts, have the model check its own work, and calibrate how much you trust its confidence.

Prompt debugging — change one thing at a time

Why: when a prompt misbehaves, changing several things at once hides which fix worked. When: whenever a prompt fails on an input you can reproduce. How: keep the input fixed and alter one variable per run (the wording, an example, or the temperature), then note what each single change does.

Debugging checklist for a misbehaving prompt:

  1. Lock the input and set temperature to 0 (minimize randomness).
  2. Change ONE thing, re-run, note the effect in your log.
  3. Adding a missing example fixed it?  -> format problem
  4. Adding an explicit rule fixed it?   -> the model didn't know a constraint
  5. Simplifying the wording fixed it?   -> instruction was ambiguous
Repeat until it passes your test inputs.
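
The checklist above can be sketched as a small harness. This is a minimal illustration, not a reference implementation: BASELINE, build_prompt, run_variants, and the call_model callable (which stands in for whatever model client you use) are all hypothetical names introduced here.

```python
from typing import Any, Callable, Dict, List

# Hypothetical baseline configuration; the field names are illustrative only.
BASELINE: Dict[str, Any] = {
    "instruction": "Classify the email as billing, technical, or other.",
    "example": "",
    "temperature": 0.0,  # step 1: minimize randomness
}


def build_prompt(config: Dict[str, Any], fixed_input: str) -> str:
    """Join the non-empty prompt parts with the locked input."""
    parts = (config["instruction"], config["example"], fixed_input)
    return "\n\n".join(part for part in parts if part)


def run_variants(
    call_model: Callable[[str, float], str],
    fixed_input: str,
    variants: Dict[str, Dict[str, Any]],
    passes: Callable[[str], bool],
) -> List[Dict[str, Any]]:
    """Run the baseline, then each variant that differs from it in one field."""
    log: List[Dict[str, Any]] = []
    for name, change in {"baseline": {}, **variants}.items():
        # Step 2: refuse any variant that changes more than one thing.
        if len(change) > 1:
            raise ValueError(f"{name!r} changes {len(change)} fields; change one")
        unknown = set(change) - set(BASELINE)
        if unknown:
            raise ValueError(f"{name!r} has unknown fields: {sorted(unknown)}")
        config = {**BASELINE, **change}
        output = call_model(build_prompt(config, fixed_input), config["temperature"])
        log.append(
            {
                "variant": name,
                "changed": list(change),
                "output": output,
                "passed": passes(output),
            }
        )
    return log
```

Each log entry records which single field changed and whether the output passed, so the variant that flips a failure to a pass points at the cause (a missing example, a missing rule, or ambiguous wording).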

Prompt versioning

Why: prompts are code — a small wording change can break behaviour, so you need to track versions and be able to roll back. When: version every prompt that runs in a product. How: store prompts in files with version numbers and a note on what changed and why.

summarize_email.v3.txt

  # v3 (2026-06-29): added "under 40 words" after v2 ran long
  # v2: added the billing/technical/other categories
  # v1: first draft

  You are a support triage assistant...

Store these in git alongside your code, not buried in a UI.
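
One way to read such files at runtime is sketched below. It assumes the `name.vN.txt` naming shown above and treats the leading `#` lines as the changelog; load_prompt and VERSION_RE are hypothetical names, not part of any library.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Assumed naming scheme: <name>.v<integer>.txt, as in summarize_email.v3.txt
VERSION_RE = re.compile(r"^(?P<name>.+)\.v(?P<version>\d+)\.txt$")


def load_prompt(
    directory: Path, name: str, version: Optional[int] = None
) -> Tuple[int, List[str], str]:
    """Return (version, changelog lines, prompt body).

    Loads the highest version unless a specific one is requested,
    which is how a rollback would be expressed.
    """
    found = {}
    for path in directory.glob(f"{name}.v*.txt"):
        match = VERSION_RE.match(path.name)
        if match and match.group("name") == name:
            found[int(match.group("version"))] = path
    if not found:
        raise FileNotFoundError(f"no versions of {name!r} in {directory}")
    chosen = max(found) if version is None else version
    if chosen not in found:
        raise FileNotFoundError(f"{name!r} has no version {chosen}")

    lines = found[chosen].read_text(encoding="utf-8").splitlines()
    changelog: List[str] = []
    index = 0
    # Only the leading comment block counts as changelog; later '#' lines stay in the body.
    while index < len(lines) and (
        not lines[index].strip() or lines[index].lstrip().startswith("#")
    ):
        if lines[index].strip():
            changelog.append(lines[index].strip().lstrip("#").strip())
        index += 1
    body = "\n".join(lines[index:]).strip()
    return chosen, changelog, body
```

Calling `load_prompt(Path("prompts"), "summarize_email")` would pick the newest file, while passing `version=2` pins an older one; the `prompts` directory is an assumed location.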

LLM self-evaluation

Why: a second pass where the model checks its own answer against the rules catches mistakes a single pass misses. When: use it for tasks with checkable criteria (format, factual grounding, policy compliance). How: feed the answer back with the rubric and ask for a verdict plus a fix.

Here is a draft answer and the rules it must follow.

Rules: under 50 words, no medical advice, cites the policy.
Answer: """{{answer}}"""

Does the answer satisfy every rule? Reply PASS or FAIL with the
reason. If FAIL, output a corrected version that passes.
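
A rough sketch of wiring that prompt into code follows. It assumes the model's reply begins with the literal word PASS or FAIL, which the prompt asks for but does not guarantee; call_model is a hypothetical stand-in for your model client, and Verdict, parse_verdict, and self_evaluate are illustrative names.

```python
from typing import Callable, NamedTuple

EVAL_TEMPLATE = '''Here is a draft answer and the rules it must follow.

Rules: {rules}
Answer: """{answer}"""

Does the answer satisfy every rule? Reply PASS or FAIL with the
reason. If FAIL, output a corrected version that passes.'''


class Verdict(NamedTuple):
    passed: bool
    detail: str  # the reason, plus any corrected version the model supplied


def parse_verdict(reply: str) -> Verdict:
    """Split a reply into the PASS/FAIL flag and the remaining text."""
    text = reply.strip()
    head = text[:4].upper()
    if head not in ("PASS", "FAIL"):
        # Assumption: an unparseable reply is an error, not a silent pass.
        raise ValueError("reply did not start with PASS or FAIL")
    return Verdict(head == "PASS", text[4:].lstrip(" :.-\n"))


def self_evaluate(
    call_model: Callable[[str], str], answer: str, rules: str
) -> Verdict:
    """Second pass: ask the model to check the draft against the rules."""
    prompt = EVAL_TEMPLATE.format(rules=rules, answer=answer)
    return parse_verdict(call_model(prompt))
```

Raising on an unparseable reply is a deliberate choice in this sketch: a verdict that cannot be read should not be counted as a pass.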

Calibrating confidence

Why: models often state wrong answers as confidently as right ones, so a plain answer hides its own uncertainty. When: on anything factual, ask for a confidence level and give the model an abstention option. How: let the model say "I don't know" instead of forcing a guess, then route low-confidence answers to a human.

Answer the question. Then on a new line rate your confidence as
High, Medium, or Low, and if it is Low, say what you are unsure of.
If you genuinely do not know, answer exactly "I don't know."

Question: What was the population of Lyon in the year 1500?
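
The routing step could look something like the sketch below. It assumes the reply puts the rating on its own line, either bare ("Low") or prefixed ("Confidence: Low"); that format is an assumption about the model's output, and Routed, route, and CONFIDENCE_RE are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Assumed rating line: "Low", "Confidence: Low", "Low. Unsure about ...", etc.
CONFIDENCE_RE = re.compile(r"^(?:confidence\s*:\s*)?(high|medium|low)\b", re.IGNORECASE)


class Routed(NamedTuple):
    answer: str
    confidence: Optional[str]  # "High", "Medium", "Low", or None
    needs_human: bool


def route(reply: str) -> Routed:
    """Decide whether a reply can be used directly or should go to a human."""
    text = reply.strip().replace("\u2019", "'")  # normalize curly apostrophes
    # Explicit abstention always goes to a human.
    if text.rstrip(".").lower() == "i don't know":
        return Routed(text, None, True)

    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # The first line is the answer, so look for the rating from the second line on.
    for index, line in enumerate(lines[1:], start=1):
        match = CONFIDENCE_RE.match(line)
        if match:
            level = match.group(1).capitalize()
            answer = " ".join(lines[:index])
            return Routed(answer, level, level == "Low")

    # No rating found: treat as unverified rather than trusting it.
    return Routed(text, None, True)
```

Replies with no readable rating are sent to a human here on the view that a missing rating is not evidence of high confidence; a different default may suit lower-stakes uses.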
