Wizard is an AI-powered shopping agent that helps users discover products by analyzing reviews, editorials, and conversations to deliver personalized recommendations – from initial search through checkout.
Powering these recommendations is a query understanding and scoring stack that serves as the backbone of the platform: intent classification, structured extraction, cross-encoder reranking, dense embeddings, and named entity recognition. Every millisecond of latency, every failure under load, every scaling ceiling – users feel it directly.
Over four months, we optimized the inference platform that serves these models. Per-GPU throughput on our 0.6B cross-encoder reranker went from 349 to 2,195 req/s – a 6x improvement. Individual model deployments went from degrading at 500 concurrent users to handling 5,000 with zero failures. GPU utilization went from 35% to 93%.