Findings from building AI systems. Evidence-based, not opinion-driven.
Pipeline Design · Spec Interpretation · ML Engineering
I Built the Pipeline. The Math Worked. The Question Didn't.
I tuned the formula constants until every spec gate passed. The pipeline still produced an empty high-priority bucket. The win wasn't a metric; it was recognizing that the spec's two requirements were quietly at odds with each other.
7 min read
Prompt Engineering · LLM Evaluation · Classification
Refining the Prompt Made It Worse. Reframing the Output Made It Better.
I refined a sentiment classification prompt twice trying to clear an 85% accuracy target. Both attempts regressed. The win came from changing what I asked the model for, not how I asked for it.
6 min read
Fine-tuning · Embeddings · Evaluation
The Model Wasn't Random. It Was Backwards.
I expected a pre-trained embedding model to be neutral on dating compatibility. Instead, it was systematically wrong, scoring incompatible pairs higher than compatible ones. Here's why that made fine-tuning more interesting.
5 min read
RAG · Reranking · Evaluation
Reranking Isn't Always Better: A Tale of Two RAG Systems
I built two RAG pipelines. In one, adding a reranker dropped accuracy by 7%. In the other, it improved accuracy by 7%. Same technique, opposite results.
6 min read