Pipeline Design · Spec Interpretation · ML Engineering

I Built the Pipeline. The Math Worked. The Question Didn't.

7 min read

TL;DR: A spec asked me to ingest customer reviews from three different platforms AND align them against a single product roadmap. I built it. The math worked. Two legitimate formula re-tunes confirmed the same outcome: zero high-priority gaps. That wasn't a tuning failure — it was the corpus telling me the spec's two requirements were quietly antagonistic. The win wasn't a number. It was refusing to lower a threshold to fabricate one.

The Setup

The spec was crisp: build a four-agent pipeline that ingests customer reviews from Amazon, Yelp, and the App Store, scores sentiment and pain intensity, discovers themes via LLM clustering, aligns those themes against a product roadmap with semantic embeddings, and surfaces priority-ranked gaps for a PM team. Two requirements stood out:

  • Multi-source ingestion. All three platforms, normalised into one feedback model.
  • Roadmap alignment. Discovered themes mapped against a single product roadmap, with priority-ranked gaps as the deliverable.

I read these as compatible. They aren't.

The Baseline: Spec Defaults Produced Nothing

I ran the pipeline at spec defaults: cosine similarity threshold 0.75 for theme-to-roadmap alignment, frequency saturation at min(count/20, 1.0) in the priority formula, high-priority threshold at 0.6.
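
For concreteness, here is that scoring skeleton as a minimal sketch. The constants (the 0.75 alignment gate, the count/20 saturation, the 0.6 high-priority cut) are the spec defaults just described; the 50/50 blend between the frequency factor and the pain factor is an assumption for illustration, since the spec's exact weighting isn't reproduced here.

```python
# Spec-default constants, as described above. The blend between frequency and
# pain (freq_weight=0.5) is an illustrative assumption, not the spec's exact formula.
ALIGNMENT_THRESHOLD = 0.75      # cosine similarity gate for theme-to-roadmap alignment
FREQ_SATURATION = 20            # review count at which the frequency factor saturates
HIGH_PRIORITY_THRESHOLD = 0.6   # priority score at or above which a gap is flagged

def priority_score(review_count: int, pain: float, freq_weight: float = 0.5) -> float:
    """Combine a saturating frequency factor with a pain factor in [0, 1]."""
    frequency_factor = min(review_count / FREQ_SATURATION, 1.0)
    return freq_weight * frequency_factor + (1.0 - freq_weight) * pain

def is_high_priority(score: float) -> bool:
    return score >= HIGH_PRIORITY_THRESHOLD
```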

First Run (#M1, #G1) — Spec Defaults

Stage | Result | Spec target
Theme–roadmap alignment | 0 of 100 pairs above threshold | ≥80% precision on aligned pairs
Top alignment score | 0.507 | ≥0.75
High-priority gaps | 0 of 10 | "All gaps ≥0.6 flagged"

Looks like a tuning failure. Threshold too high; saturation point misconfigured. Standard fix: re-tune the constants against the actual data distribution.

The Conventional Fix: Re-tune the Constants

I did the work. The 0.75 cosine threshold was calibrated against an embedding model that wasn't text-embedding-3-small — same-topic pairs in this model land in the 0.40–0.65 band, not 0.75+. Lowered to 0.45 against the actual distribution. The frequency saturation point of 20 reviews was authored for smaller corpora — every theme in mine had 90+ reviews, so the factor saturated to 1.0 universally and contributed zero discrimination. Raised to 500, matching the corpus's theme-size distribution.
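
What the re-tuned alignment gate looks like in practice, as a minimal sketch: embed theme labels and roadmap items with text-embedding-3-small via the standard OpenAI embeddings call, then gate pairs on the 0.45 cosine threshold. The helper names and roadmap strings are illustrative, not the project's actual code.

```python
# Re-tuned alignment step: embed theme labels and roadmap items, then gate
# candidate pairs on the re-tuned cosine threshold. Helper names are illustrative.
import numpy as np
from openai import OpenAI

ALIGNMENT_THRESHOLD = 0.45   # re-tuned from 0.75: same-topic pairs in this model land around 0.40-0.65

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def aligned_pairs(themes: list[str], roadmap: list[str]) -> list[tuple[str, str, float]]:
    t, r = embed(themes), embed(roadmap)
    t /= np.linalg.norm(t, axis=1, keepdims=True)   # normalise so the dot product is cosine similarity
    r /= np.linalg.norm(r, axis=1, keepdims=True)
    sims = t @ r.T
    return [
        (theme, item, float(sims[i, j]))
        for i, theme in enumerate(themes)
        for j, item in enumerate(roadmap)
        if sims[i, j] >= ALIGNMENT_THRESHOLD
    ]
```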

These weren't goalpost moves. Each re-tune was grounded in the empirical distribution of the data. The math now worked.

The deliverable was still empty.

After Two Formula Re-tunes (#M2, #G2)

Stage | Result | Change vs spec defaults
Theme–roadmap alignment | 7 of 100 pairs aligned | +7 — math now passes
Top alignment score | 0.613 | well above re-tuned threshold
High-priority gaps | 0 of 10 | same answer

The alignment step was now finding sensible same-topic pairs. The frequency factor was now discriminating across themes. Every spec gate passed. And the priority bucket stayed empty.

I ran a third change to confirm. Swapped the pain aggregator from mean pain to share-of-high-pain (a more conservative formula for severe-pain themes). Same result: zero high-priority gaps. Top score dropped from 0.470 to 0.395.
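
The aggregator swap itself is a one-function change. A minimal sketch, assuming per-review pain scores in [0, 1]; the 0.7 cutoff that defines a "high-pain" review is an assumed value for illustration, not the project's actual cutoff.

```python
# Third, confirming change: theme-level pain aggregation, mean vs share-of-high-pain.
HIGH_PAIN_CUTOFF = 0.7  # assumed cutoff for a "high-pain" review; illustrative only

def mean_pain(pain_scores: list[float]) -> float:
    """Original aggregator: average pain across the theme's reviews."""
    return sum(pain_scores) / len(pain_scores)

def share_of_high_pain(pain_scores: list[float]) -> float:
    """Swapped-in aggregator: fraction of the theme's reviews above the high-pain cutoff."""
    return sum(s >= HIGH_PAIN_CUTOFF for s in pain_scores) / len(pain_scores)
```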

The pattern: When two independent legitimate formula changes produce the same empty result, the empty result isn't a tuning artefact. It's a property of the data. Tuning harder won't change it.

Why the Bucket Stayed Empty

Here's where the meta-frame clicked. The spec asked for two things:

Multi-source ingestion means the corpus mixes domains. Yelp restaurants, Amazon consumer products, and Android apps are three unrelated worlds. When you run LLM-driven theme extraction on 4,742 reviews spanning food, skincare, and mobile apps, the strongest signal is the domain, not the feature. Themes correctly cluster as "Food Quality and Service" (1,682 reviews), "Skin and Hair Care" (791), "App Performance and Usability" (709). Domain-level abstractions, not feature-level ones.

Roadmap alignment assumes a single product owner with a feature-level roadmap — items like "improve QA process", "reduce app launch time", "reformulate skincare line". Feature-level concepts. The math of cosine similarity expects both sides to live at the same granularity.

Domain-level themes don't map cleanly to feature-level roadmap items. They map partially, at moderate similarity, with a lot of noise. And they don't combine high volume with concentrated severe pain: the largest themes are positive-to-neutral consumer chatter, while the highest-pain themes are small clusters of crashes and defects. The priority formula was designed for a corpus shape that combined both, and this corpus didn't.

The two requirements aren't compatible if you take them both literally. The data wins.

The insight: When a spec asks for two things the data can only answer if you treat one of them loosely, the engineer's job is to surface the trade-off in the deliverable, not silently pick a side and pretend the other constraint also held.

The Move I Almost Made

The temptation was real. The spec's high-priority threshold of 0.6 felt arbitrary — and it was, sort of. Lowering it to 0.5 would have relabelled my top theme (priority 0.470) as "high-priority" and given me a non-empty bucket to ship. The submission checkboxes implied the result should be non-empty. The cleanest path through the spec was to drop the threshold and move on.

That would have been Goodhart's law in slow motion. Lowering a threshold to populate an empty category is the same move as lowering an exam pass-mark to make more students pass. The students don't get smarter. The category just stops meaning what it claimed to mean.

Re-tuning a formula constant against the data distribution is different. The threshold defines what a category means ("≥0.6 = high-priority"); changing it relabels reality. A formula constant defines how signals combine; changing it improves the measurement of reality. The two moves look identical at first glance — both lower a number until the result looks more publishable — but only one is honest.
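
In code, the two moves are nearly indistinguishable as diffs, which is exactly why the distinction matters. The constant names mirror the sketches above; the values are the ones discussed in this post.

```python
# Two one-line edits that look the same in a diff but mean different things.

# Formula constant: defines how signals combine. Re-tuning it against the
# corpus's actual distribution (every theme has 90+ reviews) improves the measurement.
FREQ_SATURATION = 500          # was 20, which saturated to 1.0 for every theme

# Category threshold: defines what the label means. Lowering it until the
# bucket fills relabels reality instead of measuring it better.
HIGH_PRIORITY_THRESHOLD = 0.5  # was 0.6; the move I didn't make
```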

What Shipped Instead

The deliverable surfaces the empty bucket explicitly. The README leads with a section called "A Note on the Corpus" that names the antagonism between multi-source ingestion and single-PM roadmap alignment. The Results Snapshot table marks "0 of 10 high-priority gaps" as PASS, not FAIL — because the empty result is the right answer for this corpus, confirmed across two independent formula changes. The three uncovered themes that almost cleared the threshold (Customer Service, Fragrance, Value Perception) are still surfaced in the Gap Analysis section with their priority scores and PM-actionable recommendations.

Tempting deliverable: 3 high-priority gaps. Threshold dropped to 0.5; a fabricated result.

Honest deliverable: 0 high-priority gaps. Robust to two independent re-tunes; surfaced as a finding.

Where This Generalises

Dual-requirement specs are everywhere. Watch for these shapes:

  • Multi-tenant analytics + per-tenant roadmaps. Aggregated insights across tenants don't decompose cleanly into tenant-specific actions.
  • Cross-organisation benchmarks + organisation-specific scorecards. Benchmark categories rarely match scorecard categories at the same granularity.
  • Multi-product RAG corpus + product-specific Q&A. A single retrieval index over multiple products will surface generic answers; per-product Q&A demands fine-grained sources.
  • Cross-team eval suite + team-specific KPIs. A unified eval suite that scores everyone fairly will rarely align with the specific KPIs each team is graded against.
  • Generic foundation model + domain-specific fine-tuning. The cousin lesson: keeping the model "general" and "specialised" simultaneously is the same antagonism in a different costume.

A Diagnostic Checklist for Dual-Requirement Specs

Before accepting a spec at face value, ask:

  1. Does the data come from one source or many? If many, the strongest signal in the data will be the source, not the feature. Themes will cluster around domains.
  2. Is the downstream artefact assumed to be coherent at one granularity? A "single product roadmap" implies feature-level concepts. A "multi-domain corpus" produces domain-level themes. Cosine similarity between mismatched granularities is noisy.
  3. What would the ideal deployment look like? If the answer is "one corpus per product line", the spec is encoding an antagonism the deployment shape resolves. Surface that.
  4. Would re-tuning constants ever produce a non-empty result? If two independent legitimate changes give the same outcome, you're past the tuning ceiling. The data is speaking.
  5. What's the honest deliverable if the result is empty? "Here's the empty bucket and why it's empty" is a stronger artefact than a fabricated populated bucket. Reviewers who know the domain will respect the honesty.

The Takeaway

The spec's checkboxes implied a populated high-priority bucket. The data refused to produce one. The temptation was to hack a threshold and ship something that satisfied the checkbox. The right move was to surface the structural mismatch — confirmed across two independent legitimate formula changes — and make the empty bucket itself the finding.

The engineer's job isn't always producing a flattering result. Sometimes it's recognising when the spec's data and goal are misaligned and saying so on the page. That's a portfolio-grade move that tuning ten more constants will never replicate.

"Math passes" is not the same as "question makes sense."

The hardest call on this project wasn't picking a similarity threshold or weighting the priority formula. It was deciding not to lower a threshold I knew was technically arbitrary, when lowering it would have made the deliverable look more complete. The spec rewarded a populated bucket; the data didn't have one. Refusing to fabricate one was harder than any of the formula tuning, because the cost was visible — fewer green checkboxes — and the gain was invisible until I wrote the README and realised the empty bucket was the most defensible thing in it.