How We Scaled BERT to Serve 1+ Billion Daily Requests on CPU

Authors

Quoc Le and Kip Kaehler

Venue

Data + AI Summit 2021

Abstract

Machine learning is a key part of our ability to scale important services to our massive community. In this talk, we share our journey of scaling our deep learning text classifiers to process 50k+ requests per second at latencies under 20 ms. We will share how we made BERT not only fast enough for our users, but also economical enough to run in production on CPU at a manageable cost.
