Systems I've built and evaluated. Each project includes methodology,
metrics, and honest limitations.
Tags: RAG · Hybrid Retrieval · Cross-Encoder · Evaluation
PaperSearch
Academic Paper Research Assistant
RAG system that retrieves relevant passages from 1,000 academic papers and generates cited answers. Validated against the Open RAG Benchmark with 3,045 human-authored queries.
Key findings
- Hybrid retrieval (dense + BM25) dominated; every top configuration used it (sketch below)
- MiniLM matched mpnet quality at 5× the speed
- Reranking improved MRR by 7.6% (unlike in the financial-report system below)
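A minimal sketch of the hybrid pattern behind the first finding, assuming sentence-transformers for the dense side, rank-bm25 for the sparse side, and reciprocal rank fusion (RRF) to combine them; the model name and parameters are illustrative, not the project's exact configuration:

```python
# Minimal sketch: hybrid retrieval (dense + BM25) fused with
# reciprocal rank fusion. Model name and parameters are illustrative.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

passages = ["...passage one...", "...passage two...", "...passage three..."]

model = SentenceTransformer("all-MiniLM-L6-v2")  # MiniLM, per the speed finding
passage_embs = model.encode(passages, convert_to_tensor=True)
bm25 = BM25Okapi([p.split() for p in passages])

def hybrid_search(query: str, k: int = 10, rrf_k: int = 60) -> list[int]:
    # Dense ranking: cosine similarity against precomputed embeddings.
    q_emb = model.encode(query, convert_to_tensor=True)
    dense_rank = util.cos_sim(q_emb, passage_embs)[0].argsort(descending=True).tolist()

    # Sparse ranking: BM25 scores over whitespace-tokenized passages.
    scores = bm25.get_scores(query.split())
    sparse_rank = sorted(range(len(passages)), key=lambda i: -scores[i])

    # Reciprocal rank fusion: sum 1/(rrf_k + rank) across both rankings.
    fused: dict[int, float] = {}
    for ranking in (dense_rank, sparse_rank):
        for rank, idx in enumerate(ranking):
            fused[idx] = fused.get(idx, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

RRF combines ranks rather than raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.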
Tags: RAG · Chunking · Evaluation · Financial
Financial Report Search
RAG Pipeline for PDF Documents
Retrieval pipeline for a 160-page annual report. Systematic evaluation across 11 configurations to find what actually works for financial documents.
Key findings
- Sentence-based chunking outperformed fixed-size and semantic chunking (sketch below)
- Chunk overlap caused a 12× explosion in chunk count and destroyed accuracy
- Reranking hurt: strong upstream chunking needed no downstream correction
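A minimal sketch of no-overlap sentence-based chunking, the winning strategy above; the word-count budget and naive splitter are illustrative assumptions:

```python
# Minimal sketch: sentence-based chunking with no overlap.
# The max_tokens budget is an illustrative assumption.
import re

def sentence_chunks(text: str, max_tokens: int = 200) -> list[str]:
    # Naive split on terminal punctuation; a production pipeline would
    # use a proper sentence tokenizer (NLTK, spaCy, etc.).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude word count as a token proxy
        if current and count + n > max_tokens:
            # Budget exceeded: close the chunk at a sentence boundary,
            # carrying nothing over (no overlap between chunks).
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Packing whole sentences means no chunk cuts a sentence mid-thought, and dropping overlap keeps the chunk count, and therefore the index size, linear in document length.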
Tags: LLM · Structured Output · FastAPI · Evaluation
Synthetic Data Pipeline
Resume-Job Match Review System
FastAPI service that reviews resume-job pairs for compatibility. Rules-based pre-filtering plus LLM-as-judge scoring with structured outputs.
Key findings
- Rules-based filtering caught 40% of mismatches without any LLM calls
- Structured outputs (Instructor + Pydantic) achieved 0% parse failures (sketch below)
- Latency benchmarking identified optimal batch sizes
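A minimal sketch of the LLM-as-judge scoring step using the Instructor + Pydantic pattern named above; the schema fields, prompt, and model name are hypothetical:

```python
# Minimal sketch: structured LLM-as-judge scoring via Instructor + Pydantic.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

# Hypothetical review schema; the real service's fields may differ.
class MatchReview(BaseModel):
    score: int = Field(ge=1, le=10, description="Compatibility score")
    missing_skills: list[str]
    rationale: str

client = instructor.from_openai(OpenAI())

def review_pair(resume: str, job: str) -> MatchReview:
    # response_model parses the reply straight into MatchReview and
    # validates it, eliminating hand-rolled JSON parsing.
    return client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_model=MatchReview,
        messages=[{
            "role": "user",
            "content": (
                "Rate this resume against the job posting.\n\n"
                f"Resume:\n{resume}\n\nJob posting:\n{job}"
            ),
        }],
    )
```

Because validation happens at the type level, a malformed reply surfaces as a Pydantic error rather than a silently bad record.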
Tags: Synthetic Data · LLM · Structured Output
Synthetic Data Generator
DIY Repair Q&A Dataset
Pipeline that generates realistic Q&A pairs for DIY home repair, using the Instructor library for structured outputs, LLM-as-judge validation, and quality metrics.
Key findings
- Structured output constraints eliminated formatting failures
- The diversity gap exists at the dataset level, not the individual-item level (sketch below)
- LLM-as-judge enabled automated quality filtering
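A minimal sketch of a dataset-level diversity check consistent with the second finding: each item can look fine in isolation while the set as a whole is near-duplicated, so the metric has to look across items. The embedding model and mean-pairwise-similarity choice are illustrative assumptions:

```python
# Minimal sketch: dataset-level diversity via mean pairwise cosine
# similarity of question embeddings. Model choice is illustrative.
from sentence_transformers import SentenceTransformer, util

def mean_pairwise_similarity(questions: list[str]) -> float:
    # High mean similarity across the set signals low diversity, even
    # when every individual item passes its own quality checks.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model
    embs = model.encode(questions, convert_to_tensor=True)
    sims = util.cos_sim(embs, embs)
    n = len(questions)
    # Average the off-diagonal entries only (each diagonal entry is 1.0).
    return (sims.sum().item() - n) / (n * (n - 1))
```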