Notes
Lessons learned, written down.
Shorter than courses. No prerequisites, no order. Mistakes, realizations, frameworks, and things I didn't want to forget.
LLM-as-Judge: Making Models Evaluate Models
How I built a rubric-based evaluation framework at Amazon, calibrated its scoring against human audits, and learned what actually makes LLM evaluation trustworthy.
Multimodal Evaluation Pipelines
Ingesting images and HTML, extracting structured signals, and measuring quality across proprietary KPIs — what I learned building this at Amazon scale.
Semantic Search at Scale: Brand Standardization
Using FAISS and embeddings to map 300K noisy brand strings to a canonical taxonomy — the decisions that mattered and the ones that didn't.
Prompt Engineering That Holds in Production
What actually works when you're scoring 32K+ products, not just in demo notebooks. The patterns that survived and the ones that fell apart.
Benchmarking GenAI: Beyond Vibes
Designing evaluation systems that give you launch confidence — not just high scores. The hard lessons from building this inside Amazon.
Context Engineering Is the New Prompt Engineering
The shift from 'write better prompts' to 'design better context' — and why this reframe changes everything about how you build with LLMs.
Reasoning Models Aren't Always Worth the Cost
I ran the numbers on when o3 actually beats Claude Sonnet, and the answer surprised me. Spoiler: not on most tasks.
Stay in the loop
New notes, straight to your inbox.
No cadence, no noise. Just a note when something is worth writing down.