Blog

Findings from building AI systems. Evidence-based, not opinion-driven.

LLM Tutoring Prompting Education

Building a Socratic AI Tutor from Scratch: How LLMs Learn to Ask Better Questions Than They Answer

Prompting a model to 'ask questions instead of answering' sounds like it should produce a Socratic tutor. Two recent preprints show why that assumption doesn't hold up, and what actually works instead.

6 min read

Adaptive Tutoring A/B Testing Education

OATutor Was Built to Make Experiments Cheap to Rerun. Here's What Happened When Researchers Did.

OATutor is an open-source adaptive tutor built around Bayesian Knowledge Tracing, with A/B testing built in. Two studies run on it, about fifteen months apart, show why that infrastructure matters. They also show how a small pilot's tidy result stopped holding up at scale.

7 min read

Pipeline Design Spec Interpretation ML Engineering

I Built the Pipeline. The Math Worked. The Question Didn't.

I tuned the formula constants until every spec gate passed. The pipeline still produced an empty high-priority bucket. The win wasn't a metric; it was recognizing that the spec's two requirements were quietly antagonistic.

7 min read

Prompt Engineering LLM Evaluation Classification

Refining the Prompt Made It Worse. Reframing the Output Made It Better.

I refined a sentiment classification prompt twice, trying to clear an 85% accuracy target. Both attempts regressed. The win came from changing what I asked the model for, not how I asked for it.

6 min read

Fine-tuning Embeddings Evaluation

The Model Wasn't Random. It Was Backwards.

I expected a pre-trained embedding model to be neutral on dating compatibility. Instead, it was systematically wrong, scoring incompatible pairs higher than compatible ones. Here's why that made fine-tuning more interesting.

5 min read

RAG Reranking Evaluation

Reranking Isn't Always Better: A Tale of Two RAG Systems

I built two RAG pipelines. In one, adding a reranker dropped accuracy by 7%. In the other, it improved accuracy by 7%. Same technique, opposite results.

6 min read