Fine-tuning Embeddings Evaluation

The Model Wasn't Random — It Was Backwards


TL;DR: I fine-tuned an embedding model to predict dating compatibility. Expected the baseline to be random (AUC ~0.50). Instead, it was systematically backwards (AUC 0.40). The model thought incompatible people were more similar. Understanding why changed how I thought about the problem.

The Setup

The task: given two dating profile texts, predict whether the people are compatible. A classic embedding problem — encode both profiles, compute cosine similarity, threshold at 0.5.

I started with all-MiniLM-L6-v2, a popular sentence transformer. It's trained on general text similarity — semantic relatedness, paraphrase detection, that kind of thing. Not dating. But surely it would be neutral, right? Two random profiles should score around 0.5. Compatible and incompatible pairs should be mixed.

I ran the baseline evaluation on 1,469 held-out pairs.

Baseline Results:

Metric       Expected                 Actual
AUC-ROC      ~0.50 (random)           0.40
Cohen's d    ~0.00 (no separation)    -0.36
Margin       ~0.00                    -0.08

AUC below 0.50 means the model is worse than random. Negative Cohen's d means incompatible pairs score higher than compatible ones. The model has the signal — it's just inverted.

Wait, Worse Than Random?

An AUC of 0.50 means "no better than guessing." An AUC of 0.40 means the model is systematically wrong — if you flipped its predictions, you'd do better than chance.
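AUC has a handy pairwise reading: it's the probability that a randomly chosen compatible pair outscores a randomly chosen incompatible one. A minimal sketch on toy scores (invented, not the real evaluation data) shows why flipping a backwards model's predictions helps:

```python
def auc_roc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: compatible pairs (label 1) systematically score LOWER.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.30, 0.35, 0.45, 0.40, 0.50, 0.55]

backwards = auc_roc(scores, labels)
flipped = auc_roc([-s for s in scores], labels)
print(backwards, flipped)  # the flipped AUC is exactly 1 minus the backwards AUC
```

That identity — flipping the scores turns an AUC of 0.40 into 0.60 — is what "worse than random" means in practice.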

The negative Cohen's d confirmed it. On average, incompatible pairs had higher similarity scores than compatible ones. The model wasn't confused — it was confidently backwards.
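Cohen's d is just the gap between the two group means, measured in pooled-standard-deviation units. On toy scores like these (invented for illustration), a negative d says the incompatible group sits above the compatible one:

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference: (mean_a - mean_b) / pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Toy similarity scores: compatible pairs average LOWER than incompatible ones.
compatible = [0.30, 0.35, 0.40, 0.45]
incompatible = [0.40, 0.45, 0.50, 0.55]

d = cohens_d(compatible, incompatible)
print(f"Cohen's d = {d:.2f}")  # negative: the signal points the wrong way
```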

Why?

Textual Similarity ≠ Compatibility

Here's the insight that made everything click: incompatible people often talk about the same topics.

Example incompatible pair:

Person A:

"I'm a non-smoker and can't be around cigarette smoke. It's a dealbreaker for me."

Person B:

"I'm a social smoker. I enjoy a cigarette when I'm out with friends."

Both profiles mention smoking. They share vocabulary. A text similarity model sees "smoker", "smoke", "cigarette" in both — high similarity! But these people are fundamentally incompatible.

The pre-trained model was doing exactly what it was trained to do: find texts that discuss similar topics. It had no concept of compatibility. Shared vocabulary + opposite positions = high similarity + zero compatibility.

The pattern: Dealbreakers create high textual similarity. Both people mention the same topic (smoking, religion, kids) from opposite sides. The embedding model sees topic overlap; a human sees incompatibility.
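A toy bag-of-words similarity — a crude stand-in for the embedding model, with invented mini-profiles — makes the failure mode concrete: the opposed smoker/non-smoker pair still scores higher than an unrelated pair, purely on shared vocabulary.

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity over raw word counts: rewards shared vocabulary only."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm

non_smoker = "no smoking for me cigarette smoke is a dealbreaker"
smoker = "social smoking i enjoy a cigarette with friends"
unrelated = "weekend hiking mountain trails and camping"

# Opposite positions, same topic: high lexical overlap wins.
print(bow_cosine(non_smoker, smoker) > bow_cosine(non_smoker, unrelated))  # True
```

A real sentence transformer is far subtler than word counts, but the baseline numbers suggest it inherits the same bias: topic overlap dominates.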

Fine-Tuning Doesn't Just Add Signal — It Corrects It

This reframed what fine-tuning was actually doing. I wasn't teaching a blank model something new. I was correcting a model that had learned the wrong signal for this domain.

The training process (CosineSimilarityLoss on 6,000 labeled pairs) pushed compatible pairs together and incompatible pairs apart. After 4 epochs:

Before fine-tuning: 0.40 AUC (worse than random)
After fine-tuning: 0.91 AUC (strong discrimination)

Cohen's d went from -0.36 to +2.17 — a "huge" effect size by statistical convention. The model didn't just become neutral; it learned what compatibility actually means.
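The objective behind that correction is simple: per the sentence-transformers docs, CosineSimilarityLoss minimizes the squared error between a pair's cosine similarity and its label. A minimal sketch of the objective on toy vectors (the real loss operates on the model's embeddings during training):

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def cosine_similarity_loss(u, v, label):
    """Squared error between embedding cosine similarity and the 0/1 compatibility label."""
    return (cosine(u, v) - label) ** 2

emb_a = [1.0, 0.0]
emb_b = [1.0, 0.0]

# Compatible pair already close together: near-zero loss, nothing to fix.
print(cosine_similarity_loss(emb_a, emb_b, 1.0))  # 0.0
# Incompatible pair that the encoder placed close: large loss, so the
# gradient pushes their embeddings apart.
print(cosine_similarity_loss(emb_a, emb_b, 0.0))  # 1.0
```

The second case is exactly the backwards baseline: similar embeddings, incompatible label. The loss is largest there, so that is where training does its work.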

Why This Matters

If the baseline had been random (AUC ~0.50), fine-tuning would have been about adding signal where none existed. That's a harder problem — you're starting from nothing.

Because the baseline was backwards, fine-tuning had a clearer job: invert the existing signal and amplify it. The model already "knew" something about compatibility — it just had the sign wrong. Correcting a systematic error is often easier than creating signal from scratch.

The insight: A backwards baseline isn't worse than a random one — it might be better. It means the model captured something real about the domain. Fine-tuning just needs to flip the sign.

The Limit: Data Quality

The fine-tuned model plateaued at 83% accuracy. I ran 7 hyperparameter iterations — epochs, learning rate, batch size, warmup — and they all landed in the same range. The ceiling wasn't training configuration; it was data quality.

The training data used generic language: "I value honesty", "I enjoy spending time outdoors." When profiles are vague platitudes instead of specific preferences, there's only so much signal to extract.

That's a real finding too. The model learned everything it could from the data. Further improvement requires better data, not more engineering.

The Takeaway

Before fine-tuning an embedding model, run a baseline evaluation. Don't assume the pre-trained model is neutral. It might be:

  • Random: No signal. You're teaching from scratch.
  • Weak but correct: Some signal. Fine-tuning amplifies it.
  • Backwards: Strong signal, wrong direction. Fine-tuning corrects it.
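That triage can live as a few lines in your evaluation script (the 0.05 band here is an illustrative choice, not a canonical threshold):

```python
def diagnose_baseline(auc, band=0.05):
    """Classify a baseline AUC into one of the three scenarios."""
    if auc > 0.5 + band:
        return "weak but correct: amplify the existing signal"
    if auc < 0.5 - band:
        return "backwards: strong signal, wrong direction"
    return "random: no usable signal, teaching from scratch"

print(diagnose_baseline(0.40))  # → "backwards: strong signal, wrong direction"
```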

Each scenario tells you something different about the problem. A backwards baseline isn't a failure — it's information about how the domain differs from the model's training data.

And sometimes, backwards is exactly what you want to find.

I expected the baseline to be boring — a null result showing the pre-trained model didn't know about dating. Instead, I found a systematic inversion that told me more about the problem than a neutral baseline would have. The evaluation wasn't just a checkpoint; it was the most interesting finding of the project.