Structure, Uncertainty, and the Flywheel: An LLM Judge for Recommendation Quality at Scale

By Wizard Engineering: Karthik Shivaram, Dave Kale

What if your AI shopping assistant could explain why a product truly matches your needs — and know when it might be wrong?

In this deep technical breakdown, the AI team at Wizard reveals how they built an “LLM-as-Judge” system capable of evaluating product recommendations with near-human accuracy. Instead of relying on brittle rules, their system breaks shopping queries into structured requirements (“waterproof,” “wide fit,” “under $150”), evaluates each independently, and routes uncertain cases to humans only when needed.
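
The decompose, judge, and route flow described above can be sketched roughly as follows. This is a minimal illustration, not Wizard's implementation: the `Verdict` type, the `judge_requirement` stand-in (a toy attribute lookup where a real system would call an LLM), the `evaluate` helper, and the 0.8 confidence threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    requirement: str
    satisfied: bool
    confidence: float  # hypothetical judge confidence in [0, 1]


def judge_requirement(requirement: str, product: dict) -> Verdict:
    """Toy stand-in for an LLM judge call: look the requirement up
    in the product's attributes."""
    value = product.get(requirement)
    if value is None:
        # Attribute unknown: report "not satisfied" with low confidence.
        return Verdict(requirement, False, 0.5)
    return Verdict(requirement, bool(value), 0.95)


def evaluate(requirements, product, threshold=0.8):
    """Judge each requirement on its own, then decide whether a human
    should look at the case."""
    verdicts = [judge_requirement(r, product) for r in requirements]
    matches = all(v.satisfied for v in verdicts)
    # Route to a human if any single verdict is below the threshold.
    needs_human = any(v.confidence < threshold for v in verdicts)
    return verdicts, matches, needs_human


requirements = ["waterproof", "wide fit", "under $150"]
complete = {"waterproof": True, "wide fit": True, "under $150": True}
partial = {"waterproof": True, "under $150": True}  # "wide fit" unknown

_, complete_matches, complete_needs_human = evaluate(requirements, complete)
_, partial_matches, partial_needs_human = evaluate(requirements, partial)
```

Keeping one verdict per requirement means a reviewer sees exactly which constraint was uncertain, rather than a single opaque score for the whole recommendation.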

The result: a self-improving recommendation engine in which human reviewers catch up to 69% of the AI's mistakes while reviewing just 25% of cases.
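
One way to read a figure like this is as an error capture rate at a fixed review budget: send the most uncertain fraction of cases to humans and count how many of the judge's errors fall inside that fraction. The sketch below shows that calculation under assumptions. The `capture_rate` function is hypothetical, the ranking rule (review the highest-uncertainty cases first) is an assumed routing policy, and the scores and error labels are synthetic toy data, not results from the article.

```python
def capture_rate(uncertainty, is_error, budget=0.25):
    """Fraction of all errors that land in the reviewed set when the
    top `budget` share of cases, ranked by uncertainty, is reviewed."""
    n_review = int(len(uncertainty) * budget)
    ranked = sorted(
        range(len(uncertainty)),
        key=lambda i: uncertainty[i],
        reverse=True,
    )
    reviewed = set(ranked[:n_review])
    total_errors = sum(is_error)
    if total_errors == 0:
        return 0.0
    caught = sum(1 for i in reviewed if is_error[i])
    return caught / total_errors


# Synthetic example: 8 cases, 3 of which the judge got wrong.
uncertainty = [0.9, 0.1, 0.8, 0.2, 0.3, 0.05, 0.4, 0.6]
is_error = [True, False, False, False, False, False, True, True]

rate_at_quarter = capture_rate(uncertainty, is_error, budget=0.25)
rate_at_half = capture_rate(uncertainty, is_error, budget=0.5)
```

A capture rate well above the review budget indicates that the uncertainty signal concentrates errors in the reviewed slice; random review would be expected to catch only about the budget's share of them.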

The article goes beyond benchmarks and dives into the architecture decisions that made it work at production scale — from holistic multi-constraint reasoning to uncertainty-aware routing and fine-tuning open-source models to rival top commercial systems. Even more compelling, every human-reviewed mistake becomes new training data, creating a feedback flywheel that continuously improves search quality, embeddings, rerankers, and recommendation accuracy across the entire platform.

It’s a rare look inside how modern AI systems move from demos to reliable, scalable infrastructure.
