Prompt Engineering · LLM Evaluation · Classification

Refining the Prompt Made It Worse. Reframing the Output Made It Better.


TL;DR: A sentiment classifier was stuck at 81.9% accuracy. I tightened the three-class prompt — accuracy dropped to 79.6%. The fix wasn't a better prompt. It was asking the model for a 1–5 star rating instead of a three-class label, then collapsing to three classes algebraically. Same model, same data, +2.66pp accuracy. Sometimes the bottleneck isn't the prompt — it's the output frame.

The Setup

I was building a sentiment classifier for a multi-agent product analytics pipeline — 4,742 customer reviews from Amazon, Yelp, and the App Store, classified into positive, negative, or neutral against star-derived ground truth. The target was 85% accuracy.

My v1 prompt was reasonable: explicit class definitions in the docstring, a calibration scale, conservatism clauses for ambiguous text. I ran it on the full corpus.

Baseline (v1) — Below Target

| Metric | Value | Target |
|---|---|---|
| Overall accuracy | 81.9% | >85% |
| Wilson 95% CI | [80.7%, 82.9%] | |
| Dominant failure | Neutral class | |

The CI didn't even contain the 85% target. The neutral class was the worst performer — 3-star reviews are inherently ambiguous (mixed sentiment, sarcasm, faint praise), and the model conflated them with adjacent classes.
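
For reference, the Wilson interval used throughout is easy to compute directly. Here's a minimal sketch in plain Python; the correct-count in the example is an illustrative assumption, not the exact figure from the run.

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - margin, center + margin

# Illustrative: ~81.9% accuracy over 4,742 reviews gives an interval of roughly this width
lo, hi = wilson_ci(correct=3884, total=4742)
print(f"[{lo:.1%}, {hi:.1%}]")  # ~[80.8%, 83.0%]
```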

The Obvious Fix: Refine the Neutral Definition

Standard prompt-engineering move: identify the failing class, tighten its definition, add boundary clauses. I rewrote the prompt with a sharper neutral definition and explicit guidance ("3-star reviews lean neutral; reserve positive/negative for clear cases").

It got worse.

Refined Prompt (v2) — Per-class Breakdown

| Class | v1 accuracy | v2 accuracy | Delta |
|---|---|---|---|
| Neutral | 40.9% | 65.9% | +25.0pp |
| Positive | 92.1% | 86.2% | −5.9pp |
| Negative | 89.0% | 80.5% | −8.5pp |
| Overall | 81.9% | 79.6% | −2.25pp |

The targeted class lifted by 25 points. The other two cratered. Net loss: 2.25pp. The errors didn't disappear — they migrated.

The Pattern: Errors Redistribute, They Don't Reduce

Here's what was actually happening. The v1 prompt already carried clean class definitions, calibration bands, and conservatism clauses. The class boundaries were as crisp as I could make them at the prompt level.

When I sharpened the neutral class definition further, I didn't add new information to the prompt — I shifted where the boundaries landed. Reviews that had been (correctly) classified positive or negative now drifted into neutral. The model wasn't getting smarter about ambiguous cases. It was getting more cautious about confident ones.

The pattern: When a prompt already carries a clear signal for the target class, additional specification at the same level redistributes errors rather than reduces them. You're not adding information — you're moving the goalposts.

This isn't unique to this project. It's a general property of three-class hard-label problems where the ground truth is derived from a finer-grained signal (5-star ratings collapsed to 3 buckets). The "neutral" bucket lumps together sarcasm, mixed sentiment, faint praise, and genuine ambivalence — categories that are inherently uncertain even to humans. No prompt rewrite at the three-class level fixes that.

The Reframe: Predict Five, Collapse to Three

Here's the move that worked. The ground truth is derived from 1–5 star ratings. The model has a strong pretraining prior on (text → star rating) — it's seen millions of reviews paired with star ratings during training. That signal is much richer than any three-class prompt I could write.

So I stopped asking the model for a three-class label. I asked it for a 1–5 star prediction directly. Then I collapsed the prediction to three classes algebraically, using the same star-to-sentiment mapping I was using for ground truth (1–2 → negative, 3 → neutral, 4–5 → positive).

The model never sees the three-class boundary. It does what it's good at — predicting a star rating. The boundary collapse happens in deterministic Python.
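
A minimal sketch of that split, assuming an OpenAI-style chat completions call; the prompt wording and the `predict_stars` helper are illustrative, not the production prompt from the pipeline.

```python
from openai import OpenAI

client = OpenAI()

STAR_PROMPT = (
    "Rate the customer review below from 1 to 5 stars, where 1 is extremely "
    "negative and 5 is extremely positive. Reply with only the number."
)

def predict_stars(review: str) -> int:
    # Ask the model for the fine-grained signal it has strong pretraining priors on.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": STAR_PROMPT},
            {"role": "user", "content": review},
        ],
    )
    return int(resp.choices[0].message.content.strip())  # naive parse; harden in practice

def stars_to_sentiment(stars: int) -> str:
    # Deterministic collapse -- the same mapping used to derive ground truth.
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

label = stars_to_sentiment(predict_stars("Great battery life, but the app crashes constantly."))
```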

| Approach | Three-class accuracy | vs baseline |
|---|---|---|
| Refined three-class prompt (v2) | 79.6% | −2.25pp |
| 5-class reframe (v3) | 84.5% | +2.66pp |

5-class Reframe (v3) — Final Numbers

| Metric | Value | Note |
|---|---|---|
| Three-class accuracy | 84.5% | +2.66pp vs baseline |
| Wilson 95% CI | [83.5%, 85.5%] | contains the 85% target |
| Exact 5-star match | 65.6% | |
| Within ±1 star | 95.5% | the model has strong star-level priors |

Same model (gpt-4o-mini), same corpus, same evaluation. Only the output frame changed.

Why This Worked

The 95.5% within-±1-star number is the load-bearing diagnostic. The model isn't perfect at picking the exact star rating, but its distribution of predictions is tightly clustered around the true value. When I collapse two adjacent stars into one sentiment class (4–5 → positive), most of the noisy predictions land on the right side of the boundary.
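
If you want the same diagnostic on your own data, it's a few lines. A sketch assuming you already have paired lists of true and predicted star ratings:

```python
def star_diagnostics(true_stars: list[int], pred_stars: list[int]) -> dict[str, float]:
    """Exact-match and within-±1 rates: how tightly predictions cluster around the truth."""
    n = len(true_stars)
    exact = sum(t == p for t, p in zip(true_stars, pred_stars)) / n
    within_one = sum(abs(t - p) <= 1 for t, p in zip(true_stars, pred_stars)) / n
    return {"exact_match": exact, "within_one_star": within_one}
```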

In the three-class setup, the same noise lands directly on the boundary. A review the model scores 3.4 has nowhere to go in three-class land — it's either neutral or positive, and the prompt has to decide. In five-class land, it's a 3 (or maybe a 4), and either way the algebraic collapse handles it cleanly.

The insight: When ground truth is derived from a finer-grained signal than your output schema, ask the model for the finer-grained signal. Let the collapse happen in code, where it's deterministic. The model gets to use its full signal; you get to keep the cleaner output.

Where This Generalizes

This isn't a sentiment-specific trick. It applies any time the ground truth is a coarsening of a richer signal:

  • Star-rated reviews → sentiment buckets: predict stars, collapse to sentiment.
  • Likert-scale survey responses → satisfied/not: predict the Likert value, threshold downstream.
  • Severity scores (1–10) → minor/major: predict the score, bucket downstream.
  • Date predictions → before/after a cutoff: predict the date, threshold downstream.

In each case, the richer signal is closer to what the model actually saw during pretraining. Asking for the coarse output forces the model to reason about your boundary. Asking for the rich output lets the model do what it's good at, and lets you keep boundary logic in code where it's testable.
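
The downstream collapse is the same few lines in every case. A generic sketch of the bucketing step (the edge values are illustrative):

```python
from bisect import bisect_left

def coarsen(value: float, upper_edges: list[float], labels: list[str]) -> str:
    """Map a fine-grained value to a coarse label.

    labels[i] covers values up to and including upper_edges[i]; the last
    label covers everything above the final edge.
    """
    return labels[bisect_left(upper_edges, value)]

coarsen(4, upper_edges=[2, 3], labels=["negative", "neutral", "positive"])  # "positive"
coarsen(8, upper_edges=[7], labels=["minor", "major"])  # "major" (illustrative cutoff)
```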

When Prompt Refinement Stops Working

Before reaching for another prompt rewrite, check whether you're in this regime:

  1. Does the prompt already carry clear class definitions, calibration, and conservatism guidance? If yes, you're past the point where more specification adds information. You're just shifting boundaries.
  2. Does refining one class hurt others by roughly the same amount? That's the redistribution signature. The errors aren't being eliminated — they're moving. (A per-class comparison like the sketch after this list makes it visible.)
  3. Is your ground truth derived from a finer-grained signal than your output? If yes, you have a reframe option. Predict the underlying signal and collapse downstream.
  4. Did the model ever see the finer-grained signal during pretraining? Stars, ratings, scores, dates — these show up everywhere in training data. The model has priors. Use them.
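
A sketch of the check in item 2, assuming you have gold labels plus predictions from both prompt versions:

```python
from collections import defaultdict

def per_class_accuracy(gold: list[str], pred: list[str]) -> dict[str, float]:
    """Accuracy within each gold class."""
    correct, total = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        total[g] += 1
        correct[g] += int(g == p)
    return {c: correct[c] / total[c] for c in total}

def redistribution_report(gold, pred_v1, pred_v2) -> dict[str, float]:
    """Per-class accuracy deltas between two prompt versions.

    Large gains in one class offset by losses in the others are the
    redistribution signature.
    """
    a1, a2 = per_class_accuracy(gold, pred_v1), per_class_accuracy(gold, pred_v2)
    return {c: a2[c] - a1[c] for c in a1}
```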

The Takeaway

Prompt engineering has a ceiling. When your prompt is already specifying everything it can specify, more words don't add information — they shuffle errors. The next move isn't a better prompt. It's a better question.

Look at your output schema. If it's a coarsening of a richer underlying signal, and the model has seen that richer signal during pretraining, ask for the richer signal directly. Do the coarsening in code, where the boundaries are deterministic and testable.

Two prompt iterations failed. One reframe worked. The lesson wasn't "write better prompts." It was "stop solving the problem the prompt was framing."

The most surprising thing about this finding wasn't the +2.66pp gain — it was how stuck I'd been on the three-class framing. The spec said "classify as positive, negative, or neutral," and I assumed the prompt had to produce that directly. It didn't. The deliverable was three-class output; the model's job was a different question entirely. Once the deliverable and the model's task were separated, the ceiling moved.