Projects

Systems I've built and evaluated. Each project includes methodology, metrics, and honest limitations.

Multi-Agent · CrewAI · DSPy · Embeddings

Customer Sentiment & PM Intelligence

Multi-Agent Review Analysis & Roadmap Alignment

Four-agent pipeline that ingests cross-platform customer reviews, scores sentiment and pain intensity, discovers themes via LLM map-reduce, aligns them to a product roadmap with semantic embeddings, and surfaces priority-ranked gaps.
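The roadmap-alignment step can be illustrated with a minimal sketch. It is not the project's code: the function names, the toy 2-D vectors, and the "best match above a threshold" rule are all assumptions made for illustration. Real embeddings would come from an embedding model, and the threshold has to be calibrated per model.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align_themes(
    theme_vecs: Dict[str, List[float]],
    roadmap_vecs: Dict[str, List[float]],
    threshold: float,
) -> Dict[str, Optional[str]]:
    """Map each theme to its most similar roadmap item.

    A theme whose best similarity falls below `threshold` maps to None,
    i.e. it is treated as a gap in the roadmap (hypothetical rule).
    """
    aligned: Dict[str, Optional[str]] = {}
    for theme, theme_vec in theme_vecs.items():
        best_item, best_score = None, -1.0
        for item, item_vec in roadmap_vecs.items():
            score = cosine(theme_vec, item_vec)
            if score > best_score:
                best_item, best_score = item, score
        aligned[theme] = best_item if best_score >= threshold else None
    return aligned


# Toy vectors standing in for real embeddings.
themes = {"slow sync": [1.0, 0.0], "dark mode": [0.0, 1.0]}
roadmap = {"sync rewrite": [0.9, 0.1]}
gaps = [t for t, item in align_themes(themes, roadmap, 0.45).items() if item is None]
```

In this toy run, "dark mode" has no roadmap item above the threshold, so it ends up in `gaps`.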

Key findings

→ Reframing the sentiment output (predict 5-star, collapse to 3-class) beat two rounds of prompt refinement
→ The high-priority bucket stayed empty through two independent formula re-tunes: a corpus property, not a tuning failure
→ The spec's 0.75 cosine threshold was too high for text-embedding-3-small; recalibrated to 0.45 against the observed similarity distribution
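The "predict 5-star, collapse to 3-class" reframing can be sketched as follows. The star-to-class mapping (1-2 negative, 3 neutral, 4-5 positive) is an assumption, since the exact cut points are not stated here, and the sample ratings are made up.

```python
from typing import List


def stars_to_class(stars: int) -> str:
    """Collapse a 1-5 star rating into three sentiment classes (assumed cut points)."""
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars}")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


def three_class_accuracy(predicted: List[int], actual: List[int]) -> float:
    """Share of reviews whose collapsed class matches the ground truth."""
    hits = sum(stars_to_class(p) == stars_to_class(a) for p, a in zip(predicted, actual))
    return hits / len(actual)


def within_one_star(predicted: List[int], actual: List[int]) -> float:
    """Share of reviews where the predicted rating is at most one star off."""
    close = sum(abs(p - a) <= 1 for p, a in zip(predicted, actual))
    return close / len(actual)


# Made-up ratings, for illustration only.
predicted = [5, 4, 1, 3, 2]
actual = [4, 5, 2, 3, 4]
accuracy = three_class_accuracy(predicted, actual)
tolerance = within_one_star(predicted, actual)
```

Predicting the finer-grained rating first lets a near miss (5 vs 4 stars) still land in the correct class after collapsing.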

Results

Sentiment accuracy

84.5%

Reviews analysed

4,742

Within ±1 star

95.5%

RAG · Hybrid Retrieval · Cross-Encoder · Evaluation

PaperSearch

Academic Paper Research Assistant

RAG system that retrieves relevant passages from 1,000 academic papers and generates cited answers. Validated against the Open RAG Benchmark with 3,045 human-authored queries.

Key findings

→ Hybrid retrieval (dense + BM25) dominated all top configurations
→ MiniLM matched mpnet quality at 5× the speed
→ Reranking improved MRR by 7.6% (unlike in the Financial Report Search project, where reranking hurt)
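One common way to combine a dense ranking with a BM25 ranking is reciprocal rank fusion (RRF). The sketch below is an assumption about how such a hybrid could be built, not a description of the fusion method actually used in PaperSearch; the document IDs are placeholders and `k=60` is just the conventional default.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Placeholder rankings from a dense retriever and a BM25 retriever.
dense_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
```

RRF only needs ranks, not raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales.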

Results

MRR

0.789

NDCG@5

0.797

Recall@5

0.89
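For reference, the three retrieval metrics above can be computed as in this sketch. It assumes binary relevance (a passage is either relevant or not) and uses a made-up query; the benchmark's own scoring script may differ in detail.

```python
import math
from typing import List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def mean(values: List[float]) -> float:
    """Average per-query scores, e.g. reciprocal ranks into MRR."""
    return sum(values) / len(values) if values else 0.0


# One made-up query: relevant passages "a" and "b" come back at ranks 2 and 4.
ranked = ["x", "a", "y", "b"]
relevant = {"a", "b"}
mrr = mean([reciprocal_rank(ranked, relevant)])
```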

RAG · Chunking · Evaluation · Financial

Financial Report Search

RAG Pipeline for PDF Documents

Retrieval pipeline for a 160-page annual report. Systematic evaluation across 11 configurations to find what actually works for financial documents.

Key findings

→ Sentence-based chunking outperformed fixed-size and semantic
→ Chunk overlap caused a 12× explosion in chunk count and destroyed accuracy
→ Reranking hurt: strong upstream chunking didn't need downstream correction
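Sentence-based chunking without overlap can be sketched roughly like this. The regex splitter, the `max_chars` budget, and the sample text are illustrative assumptions, not the configuration that produced the results here.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace.

    Decimals such as 4.5 survive (no whitespace after the dot), but
    abbreviations like "Inc. " would be split; a real pipeline would
    likely use a proper sentence tokenizer.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars, with no overlap.

    A single sentence longer than max_chars becomes its own oversized chunk
    rather than being cut mid-sentence.
    """
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Revenue rose 4%. Costs fell. Margins improved!"
chunks = sentence_chunks(sample, max_chars=30)
```

Because every sentence lands in exactly one chunk, the chunk count grows linearly with document length instead of multiplying the way overlapping windows do.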

Results

MRR

0.833

Recall@5

0.967

Configs tested

11

LLM · Structured Output · FastAPI · Evaluation

Synthetic Data Pipeline

Resume-Job Match Review System

FastAPI service that reviews resume-job pairs for compatibility. Rules-based pre-filtering plus LLM-as-judge scoring with structured outputs.

Key findings

→ Rules-based filtering caught 40% of mismatches without LLM calls
→ Structured outputs (Instructor + Pydantic) achieved 0% parse failures
→ Latency benchmarking identified optimal batch sizes
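A rules-based pre-filter of this kind might look like the sketch below. The `Resume` and `Job` fields, the two rules, and the 0.5 overlap cutoff are all hypothetical; the service's real rules are not listed here. Pairs that pass every rule would go on to the LLM judge.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Resume:
    years_experience: float
    skills: Set[str] = field(default_factory=set)


@dataclass
class Job:
    min_years: float
    required_skills: Set[str] = field(default_factory=set)


def prefilter(resume: Resume, job: Job, min_skill_overlap: float = 0.5) -> Optional[str]:
    """Return a rejection reason for an obvious mismatch, or None.

    None means no cheap rule fired, so the pair should be scored by the
    (more expensive) LLM judge. Rules and cutoff are illustrative only.
    """
    if resume.years_experience < job.min_years:
        return "insufficient_experience"
    if job.required_skills:
        overlap = len(resume.skills & job.required_skills) / len(job.required_skills)
        if overlap < min_skill_overlap:
            return "missing_required_skills"
    return None


job = Job(min_years=3, required_skills={"python", "sql", "fastapi", "docker"})
junior = Resume(years_experience=1, skills={"python", "sql", "fastapi", "docker"})
off_stack = Resume(years_experience=6, skills={"java"})
plausible = Resume(years_experience=5, skills={"python", "sql", "aws"})
needs_llm = [r for r in (junior, off_stack, plausible) if prefilter(r, job) is None]
```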

Results

Parse failures

0%

Pre-filter rate

40%

Synthetic Data · LLM · Structured Output

Synthetic Data Generator

DIY Repair Q&A Dataset

Pipeline to generate realistic Q&A pairs for DIY home repair. Instructor library for structured outputs, LLM-as-judge validation, quality metrics.

Key findings

→ Structured output constraints eliminated formatting failures
→ Diversity gap exists at dataset level, not individual item level
→ LLM-as-judge enabled automated quality filtering
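The gap between item-level and dataset-level diversity can be made concrete with a distinct-n ratio (unique n-grams divided by total n-grams). This metric is an assumption chosen for illustration, not necessarily the one used in the pipeline, and the two questions are made up.

```python
from typing import List, Set, Tuple


def distinct_n(token_lists: List[List[str]], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams, pooled over all given texts."""
    total = 0
    unique: Set[Tuple[str, ...]] = set()
    for tokens in token_lists:
        # zip over n shifted views of the token list yields its n-grams.
        grams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


questions = [
    "how do i fix a leaky faucet",
    "how do i fix a squeaky door",
]
tokenized = [q.split() for q in questions]

# Each question on its own repeats nothing...
per_item = [distinct_n([tokens]) for tokens in tokenized]
# ...but pooled together, the shared "how do i fix a" template shows up.
pooled = distinct_n(tokenized)
```

Each item scores 1.0 in isolation while the pooled score drops, which is why a per-item quality check cannot catch template repetition across the dataset.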

Results

Format failures

0

Quality score

4.2/5