RAG systems
I design and tune retrieval pipelines end-to-end: chunking, embeddings, hybrid search, and reranking.
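To make the hybrid-search step concrete, here's a minimal sketch of Reciprocal Rank Fusion over a lexical ranking and a dense ranking. The doc IDs and retriever outputs are invented for illustration; the k=60 constant comes from the original RRF paper.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs; k=60 is the constant from the RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-4 results from each retriever for one query.
bm25_hits  = ["doc3", "doc1", "doc7", "doc2"]   # lexical / keyword match
dense_hits = ["doc1", "doc5", "doc3", "doc9"]   # embedding similarity

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # doc1 and doc3 rise to the top; pass the head to a reranker
```

RRF is a deliberate choice here: it needs no score normalization across retrievers, only ranks, which makes lexical and dense results directly fusable.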
I build, measure what matters, and share what I learn. No "thought-leadership speak" — just findings with numbers.
I build measurement workflows with retrieval metrics like MRR, NDCG, and Recall@K to prove what actually improved.
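For reference, minimal single-query implementations of those three metrics, assuming binary relevance (MRR proper averages the reciprocal rank across a query set):

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant hit; MRR is the mean of this over queries."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Discounted gain of the actual ranking vs. the ideal ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

print(recall_at_k(["a", "b", "c"], {"b", "d"}, k=3))   # 0.5
print(reciprocal_rank(["a", "b", "c"], {"b", "d"}))    # 0.5
```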
I ship LLM-powered features with structured outputs, tool use, and practical reliability constraints.
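One example of the structured-outputs pattern: validate the model's JSON against a schema before anything downstream consumes it. The schema and raw string below are hypothetical, and pydantic is one common choice rather than the only one.

```python
from pydantic import BaseModel, ValidationError

class TicketTriage(BaseModel):
    category: str
    severity: int        # e.g. 1 (low) to 5 (critical)
    needs_human: bool

raw = '{"category": "billing", "severity": 2, "needs_human": false}'  # model output

try:
    triage = TicketTriage.model_validate_json(raw)
except ValidationError as err:
    print(f"reject and retry: {err}")  # fail closed; don't pass junk downstream
else:
    print(triage.needs_human)
```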
I spent 25 years in enterprise consulting — MarkLogic, Oracle, RightNow — leading technical delivery for banks, telcos, and government agencies across APAC and North America. Now I'm applying that lens to AI engineering: evaluation pipelines, RAG systems, and the infrastructure that tells you whether an LLM actually works before it ships.
Background: Technical leadership and delivery — where things need to work reliably, at scale, for real users.
Focus: Measurable outcomes over demos. Retrieval metrics, failure-mode analysis, production-minded AI.
What metric matters? What's the target? What does "good enough" look like?
Before optimizing, know where you are. You can't claim improvement without a starting point.
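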
Each experiment answers a question. The data tells you what to try next.
RAG system for 1,000 academic papers. Hybrid retrieval, cross-encoder reranking, validated against 3,045 human-authored queries.
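A sketch of the reranking stage, assuming the sentence-transformers CrossEncoder API and a public MS MARCO checkpoint; the query and passages are illustrative, not from the paper corpus.

```python
from sentence_transformers import CrossEncoder

# Public MS MARCO cross-encoder; swap in whichever checkpoint you evaluate.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What regularization did the authors use?"  # illustrative
passages = [
    "We apply dropout of 0.1 to all transformer layers.",
    "The dataset contains 1,000 peer-reviewed papers.",
    "Weight decay was set to 0.01 during pre-training.",
]

# The cross-encoder scores each (query, passage) pair jointly: slower than
# bi-encoder retrieval but sharper, so it runs only on the candidate set
# that hybrid retrieval already narrowed down.
scores = reranker.predict([(query, p) for p in passages])
for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {passage}")
```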
Retrieval pipeline for a 160-page annual report. Sentence-based chunking, 11 configurations tested, hypothesis-driven iteration.
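A sketch of sentence-based chunking under a character budget. The regex splitter is a stand-in for a proper sentence tokenizer, and the 500-character default is an assumption, not one of the tested configurations.

```python
import re

def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Split on sentence boundaries, then pack sentences up to a size budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

report = "Revenue grew 12%. Margins held steady. Risk factors follow."
for chunk in sentence_chunks(report, max_chars=40):
    print(repr(chunk))
```

Packing whole sentences means a chunk never ends mid-thought, which keeps embeddings coherent at the cost of slightly uneven chunk sizes.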
I expected the model to be neutral. Instead, it scored incompatible pairs higher than compatible ones. Understanding why made fine-tuning more meaningful.
Baseline AUC: 0.40
Fine-tuned AUC: 0.91
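Why an AUC of 0.40 is a finding rather than noise: anything below 0.5 means the ranking is systematically inverted. A sketch with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

labels = [1, 1, 1, 0, 0, 0]                    # 1 = compatible pair
scores = [0.20, 0.35, 0.30, 0.80, 0.70, 0.75]  # negatives consistently score higher

print(roc_auc_score(labels, scores))  # 0.0: a fully inverted ranking
# Flipping the score direction turns AUC x into 1 - x, so a 0.40 baseline
# signals a systematic inversion to diagnose, not random behaviour.
```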