Projects

Systems I've built and evaluated. Each project includes methodology, metrics, and honest limitations.

Multi-Agent · CrewAI · DSPy · Embeddings

Customer Sentiment & PM Intelligence

Multi-Agent Review Analysis & Roadmap Alignment

Four-agent pipeline that ingests cross-platform customer reviews, scores sentiment and pain intensity, discovers themes via LLM map-reduce, aligns them to a product roadmap with semantic embeddings, and surfaces priority-ranked gaps.

Key findings

  • Reframing the sentiment output (predict 5-star, collapse to 3-class) beat two rounds of prompt refinement
  • The high-priority bucket stayed empty across two independent formula re-tunes, which points to a corpus property rather than a tuning failure
  • The spec's 0.75 cosine threshold was too high for text-embedding-3-small; recalibrated to 0.45 against the observed similarity distribution
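
One way a cosine threshold can be calibrated against the observed distribution is sketched below. It is a minimal illustration, not the project's actual procedure: the function names and the idea of picking the cutoff at a percentile of best-match scores are assumptions, and real vectors would come from the embedding model rather than toy lists.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match_scores(theme_vecs, roadmap_vecs):
    """For each theme, the similarity to its closest roadmap item."""
    return [max(cosine(t, r) for r in roadmap_vecs) for t in theme_vecs]


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def calibrate_threshold(theme_vecs, roadmap_vecs, q=50):
    """Assumed rule: place the cutoff at the q-th percentile of best-match scores."""
    return percentile(best_match_scores(theme_vecs, roadmap_vecs), q)
```

Inspecting the best-match scores before fixing a cutoff makes it visible when a threshold taken from a spec sits above almost everything the model actually produces.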

Results

  • Sentiment accuracy: 84.5%
  • Reviews analysed: 4,742
  • Within ±1 star: 95.5%
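
The two sentiment metrics can be computed from predicted and actual star ratings roughly as follows. This is a hedged sketch: the class boundaries (1-2 negative, 3 neutral, 4-5 positive) and the function names are assumptions, since the exact mapping is not stated here.

```python
def collapse(stars):
    """Map a 1-5 star rating to a 3-class label (assumed boundaries)."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def evaluate(predicted, actual):
    """Return 3-class accuracy and the share of predictions within one star."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(predicted, actual))
    three_class = sum(collapse(p) == collapse(a) for p, a in pairs) / len(pairs)
    within_one = sum(abs(p - a) <= 1 for p, a in pairs) / len(pairs)
    return {"three_class_accuracy": three_class, "within_one_star": within_one}
```

Predicting on the finer 5-star scale and collapsing afterwards lets a near miss (4 predicted, 5 actual) still count as a correct positive.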

RAG · Hybrid Retrieval · Cross-Encoder · Evaluation

PaperSearch

Academic Paper Research Assistant

RAG system that retrieves relevant passages from 1,000 academic papers and generates cited answers. Validated against the Open RAG Benchmark with 3,045 human-authored queries.

Key findings

  • Hybrid retrieval (dense + BM25) dominated all top configurations
  • MiniLM matched mpnet quality at 5× the speed
  • Reranking improved MRR by 7.6% (unlike the financial system)
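
A common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF), sketched below. The fusion method used in PaperSearch is not stated here, so RRF, the constant k=60, and the function name are all assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    ties are broken by document id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query
dense_hits = ["a", "b", "c"]
bm25_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.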

Results

  • MRR: 0.789
  • NDCG@5: 0.797
  • Recall@5: 0.89
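
For reference, the three retrieval metrics can be computed per query as in the sketch below and then averaged over the query set. It assumes binary relevance (a passage is either relevant or not); the benchmark's own scoring may differ, and the function names are illustrative.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k=5):
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked, relevant, k=5):
    """Binary-relevance NDCG: DCG of the top k divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean(values):
    """Average of per-query scores; MRR is mean(reciprocal_rank(...))."""
    return sum(values) / len(values) if values else 0.0
```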

LLM · Structured Output · FastAPI · Evaluation

Synthetic Data Pipeline

Resume-Job Match Review System

FastAPI service that reviews resume-job pairs for compatibility. Rules-based pre-filtering plus LLM-as-judge scoring with structured outputs.

Key findings

  • Rules-based filtering caught 40% of mismatches without LLM calls
  • Structured outputs (Instructor + Pydantic) achieved 0% parse failures
  • Latency benchmarking identified optimal batch sizes
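
A rules-based pre-filter of this kind might look like the sketch below. The fields, rule set, and the 0.5 overlap cutoff are hypothetical, chosen only to show how obvious mismatches can be rejected before any LLM call is made.

```python
from dataclasses import dataclass


@dataclass
class Resume:
    skills: set[str]
    years_experience: float


@dataclass
class Job:
    required_skills: set[str]
    min_years: float


def prefilter(resume, job, min_skill_overlap=0.5):
    """Return a rejection reason, or None if the pair should go to the LLM judge."""
    if resume.years_experience < job.min_years:
        return "insufficient_experience"
    if job.required_skills:
        matched = len(resume.skills & job.required_skills)
        if matched / len(job.required_skills) < min_skill_overlap:
            return "skill_overlap_below_threshold"
    return None
```

Returning a reason string rather than a bare boolean keeps the cheap rejections auditable alongside the LLM-scored pairs.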

Results

  • Parse failures: 0%
  • Pre-filter rate: 40%

Synthetic Data · LLM · Structured Output

Synthetic Data Generator

DIY Repair Q&A Dataset

Pipeline that generates realistic Q&A pairs for DIY home repair, using the Instructor library for structured outputs, LLM-as-judge validation, and quality metrics.

Key findings

  • Structured output constraints eliminated formatting failures
  • Diversity gap exists at dataset level, not individual item level
  • LLM-as-judge enabled automated quality filtering
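
A dataset-level diversity check differs from per-item scoring in that it compares items with each other. The sketch below uses mean pairwise Jaccard similarity over question tokens as one possible measure; the metric and the whitespace tokenisation are assumptions, not necessarily what this pipeline measured.

```python
from itertools import combinations


def tokens(text):
    """Lowercased whitespace tokens as a set (deliberately simple)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection over size of the union of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_similarity(questions):
    """Average Jaccard similarity over all question pairs (higher = less diverse)."""
    token_sets = [tokens(q) for q in questions]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Every question can look fine to a per-item judge while this score stays high, which is how a gap can exist at the dataset level without showing up on any single item.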

Results

  • Format failures: 0%
  • Quality score: 4.2/5