Data Science
Artificial Intelligence
How We Scaled BERT to Serve 1+ Billion Daily Requests on CPU
Author
Venue
Data + AI Summit 2021
Abstract
Machine learning is a key part of our ability to scale important services to our massive community. In this talk, we share our journey of scaling our deep learning text classifiers to process 50k+ requests per second at latencies under 20 ms. We will share how we made BERT not only fast enough for our users, but also economical enough to run in production on CPU at a manageable cost.