How We Got 6x Throughput on Our ML Serving Stack

By 
Wizard Engineering: Vamsi Mocherla
Share this

Wizard is an AI-powered shopping agent that helps users discover products by analyzing reviews, editorials, and conversations to deliver personalized recommendations – from initial search through checkout.

Powering these recommendations is a query understanding and scoring stack that serves as the backbone of the platform: intent classification, structured extraction, cross-encoder reranking, dense embeddings, and named entity recognition. Every millisecond of latency, every failure under load, every scaling ceiling – users feel it directly.

Over four months, we optimized the inference platform that serves these models. Per-GPU throughput on our 0.6B cross-encoder reranker went from 349 to 2,195 req/s – a 6x improvement. Individual model deployments went from degrading at 500 concurrent users to handling 5,000 with zero failures. GPU utilization went from 35% to 93%.

Read the full article →