Serverless Transformer NLP Inference

Jo Kristian Bergum
8 min read · Sep 25, 2021


Lately, I have been learning about serverless functions and their cost model. Having worked my entire career with provisioned, hot-and-ready back-end systems, it has been fun to learn how serverless functions scale and are priced without any provisioned resources.

This blog post summarizes what I have learned and demonstrates how to build and deploy a serverless function that performs Transformer ML inference at scale. In addition, I cover the cost model of AWS Lambda and how to tune the Lambda runtime configuration for the serverless inference task.

Choosing an NLP Model

BERT and other Transformer models have revolutionized NLP, beating the previous state of the art by large margins on several benchmarks. The downside is that these models come with a high computational cost: the inference complexity is quadratic in the input sequence length, and the maximum input length is limited to 512 subword ids.

Since the introduction of BERT in 2018, researchers have worked on reducing the computational complexity of the all-to-all attention mechanism. In my experience, however, researchers and industry practitioners still turn to the vanilla BERT-base model with 110M trainable parameters, even though smaller, distilled models can match BERT’s accuracy on many NLP tasks at a fraction of the inference (and training) cost.

My go-to BERT-based model is the MiniLM model from Microsoft. It rivals its bigger brother BERT-base on accuracy but is roughly 3x faster. There is rarely a free lunch, but replacing vanilla BERT-base with MiniLM is a no-brainer.

In this blog post, I will use a 6-layer MiniLM version with 22.7M parameters. With quantization, the model uses int8 weights instead of float32, which accelerates CPU inference significantly (another 3–4x speedup over float32) without much accuracy loss. As always, when making model changes, one should check the accuracy impact on the specific task. This blog post has more on quantization, and this one is also great. Quantization is a game-changer for CPU serving of Transformer models. At work, I have demonstrated that small quantized MiniLM models can even rival large models like T5 with 3B parameters; see, for example, this blog post.
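As a minimal sketch of how such a quantized model can be produced with the ONNX-Runtime tooling covered in the next section (the file names are placeholders, and you should verify the accuracy of the quantized model on your task):

```python
# Dynamic int8 quantization of an exported ONNX model.
# "model.onnx" / "model-quantized.onnx" are placeholder file names.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    "model.onnx",             # the float32 model exported from your framework
    "model-quantized.onnx",   # output model with int8 weights
    weight_type=QuantType.QInt8,
)
```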

ONNX and ONNX-Runtime

The ONNX (Open Neural Network Exchange) format allows you to train a model in your preferred framework without worrying about downstream inferencing implications. One can export models to ONNX from Hugging Face, TensorFlow, PyTorch, and a wide range of other frameworks. Even GBDT models trained with frameworks like XGBoost and LightGBM can be exported to ONNX. ONNX-Runtime is an engine for running inference with ONNX models. It is very efficient and makes use of hardware-accelerated instruction sets (e.g., AVX2 and AVX-512). It has several language bindings, including Python, which I will be using for my Lambda function. At work, we also use ONNX-Runtime in Vespa.ai.
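For illustration, exporting a Hugging Face sequence-classification model to ONNX could look roughly like the sketch below. The checkpoint name is a placeholder for whatever fine-tuned MiniLM model you use, and the input/output tensor names are my own choice; they just need to match what the inference code later expects.

```python
# Sketch: export a Hugging Face classification model to ONNX with dynamic shapes.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "your-org/your-finetuned-minilm"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torchscript=True makes the model return plain tuples, which simplifies the export
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)
model.eval()

# A dummy input to trace the graph; dynamic_axes keep batch size and sequence length flexible
dummy = tokenizer("a dummy input text", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)
```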

AWS Lambda

You probably know more about AWS Lambda than I do, but you can read more here if you have not heard about it. The basic summary is that AWS Lambda allows you to deploy a function (code and dependencies) where you are only charged for the duration your function runs; you do not pay for idle time or the bootstrap time of the Lambda function. In addition, there are no instances to provision or manage.

There are some crucial limitations that I will mention:

  • Concurrency limit. In a given AWS region, one AWS account can only have 1,000 Lambda function executions running simultaneously. One thousand concurrent executions equal up to 6,000 vCPUs and 10 TB of memory, so that is a lot of computing power without having to manage a single instance.
  • Request rate limit. One can have 10,000 invocation requests per region per second. If the Lambda function latency is less than 100 ms, one will likely hit this request rate limit before the concurrency limit.
  • Memory. A Lambda function execution can use 128 MB to 10,240 MB of RAM, which is the only configuration knob one can tune. Increased memory also comes with more CPU, from 1 vCPU at 128 MB to 6 vCPUs at 10,240 MB.
  • Deployment size. The size of the function (a deployment zip file including the function code and all its runtime dependencies) can only be 50 MB compressed, or 250 MB uncompressed in the runtime. There is an option to package a Docker image instead, which lifts the size limit to 10 GB and should be plenty for most use cases.

Lambda Cost Model

You can read about the pricing model here, but the basic summary is that you pay a fixed cost per Lambda function invocation (1M invocations cost $0.20) plus a cost for the duration the Lambda runs, measured at millisecond resolution. The per-millisecond duration price depends on how much memory you have configured the Lambda to be allowed to use. More on this later in the post.
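To make this concrete, here is a rough back-of-the-envelope estimator. The $0.20 per 1M requests comes from the summary above; the GB-second price is an assumption that you should verify against the AWS pricing page for your region.

```python
# Rough Lambda cost estimate: per-request charge plus GB-seconds of duration.
def lambda_cost(invocations, duration_ms, memory_mb,
                price_per_million_requests=0.20,
                price_per_gb_second=0.0000166667):  # approximate x86 price; verify on the pricing page
    request_cost = invocations / 1_000_000 * price_per_million_requests
    gb_seconds = invocations * (duration_ms / 1000.0) * (memory_mb / 1024.0)
    return request_cost + gb_seconds * price_per_gb_second

# Hypothetical example: 1M inferences at 512 MB with a 30 ms billed duration
print(round(lambda_cost(1_000_000, 30, 512), 2))
```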

Transformer NLP Inference Lambda Function

In my example, I prototype a classification model that classifies input text passed in through an event trigger. One can trigger Lambda functions in many ways, including through a managed HTTP gateway, which is what the code below expects. Such a model could be used for sentiment analysis, hate speech detection, text ranking, and more.

My inference Lambda function has three runtime dependencies: tokenizers, onnxruntime, and NumPy. Tokenizers is a slim package containing the fast, native Rust-based tokenizers used in the popular Hugging Face transformers library. Due to the Lambda size limitations, I use tokenizers directly, as it is tiny compared to the bloated transformers package.
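Roughly, the handler could look like the sketch below. This is not the exact code from the post; the input tensor names, file names, and event format are assumptions that must match your exported model and trigger.

```python
# Sketch of a Lambda handler doing Transformer inference with tokenizers + onnxruntime.
import json
import numpy as np
import onnxruntime as ort
from tokenizers import BertWordPieceTokenizer

MAX_LENGTH = 128

# Loaded once per execution environment, outside the handler, so warm invocations
# skip tokenizer and model initialization.
tokenizer = BertWordPieceTokenizer("vocab.txt", lowercase=True)
tokenizer.enable_truncation(max_length=MAX_LENGTH)
session = ort.InferenceSession("model-quantized.onnx")

def handler(event, context):
    text = event.get("text", "")
    encoding = tokenizer.encode(text)
    inputs = {
        "input_ids": np.array([encoding.ids], dtype=np.int64),
        "attention_mask": np.array([encoding.attention_mask], dtype=np.int64),
    }
    logits = session.run(None, inputs)[0]
    prediction = int(np.argmax(logits, axis=1)[0])
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```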

Note that I use a batch size of one in the function. On CPU, batching sequences does not improve overall throughput; on GPU it does. One could still use batching to work around the Lambda request rate limits mentioned above for large batch jobs (e.g., producing vector representations of text), since each request would carry more data than a single data point. However, the *inference* itself would not be any faster or produce higher throughput. See the figure below, which compares sequences/sec for different batch sizes (this is on CPU; on GPU the story is different, and one does get a nice throughput increase from batching).

Image from https://cloudblogs.microsoft.com/opensource/2021/03/01/optimizing-bert-model-for-intel-cpu-cores-using-onnx-runtime-default-execution-provider/. For the same sequence length, increasing batch size does not improve sequences/s (throughput).

To build the deployment zip file, one needs to bundle the runtime dependencies. Since both onnxruntime and tokenizers ship native libraries, they must be installed in an environment that matches the Lambda execution environment. This is easiest done using the Amazon Linux Docker image: pull the image, run the container, and install the dependencies into the same directory as the Lambda function file (use pip install --target to specify the directory). Using the Docker image also avoids messing up your primary Python environment. One also needs the ONNX model and the BERT vocabulary file, which can be downloaded from the Hugging Face hub. With all the dependencies and the function in place, one can zip the files into a single Lambda deployment file. The inference function with all dependencies and the model is about 43 MB, well below the 50 MB limit.

Deployment

Once zipped up, one can deploy the function in the AWS Lambda console or use tools like the AWS CLI.
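If you prefer scripting the deployment from Python, the same can be done with boto3. A sketch; the function name, role ARN, handler path, and file name below are placeholders.

```python
# Sketch: create the Lambda function from the deployment zip with boto3.
import boto3

client = boto3.client("lambda")
with open("deployment.zip", "rb") as f:
    client.create_function(
        FunctionName="nlp-inference",                                   # placeholder name
        Runtime="python3.8",
        Role="arn:aws:iam::123456789012:role/lambda-execution-role",    # placeholder role ARN
        Handler="lambda_function.handler",                              # module.function of the handler
        Code={"ZipFile": f.read()},
        MemorySize=1024,
        Timeout=30,
    )
```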

Tuning the Lambda Function

Once the Lambda function is deployed, it can be tested from the console. The nice thing is that it is possible to alter the configuration (the big MB knob) and instantly test the performance by submitting a test sequence.

First, let us see how the model handles various input sequence lengths. For this experiment, I use a fixed memory setting of 512 MB, run the test with different input sequence lengths, and capture a rough average latency from several runs.

Inference Latency versus input sequence length (Subword ids)

The above pretty much illustrates why it’s pointless to talk about Transformer inference latency without mentioning what sequence length was used.

For the rest of the experiments, I fixed the input sequence length to 128, which is a reasonable passage-length text.
The following explores changing the allocated Lambda memory setting and measuring how it impacts inference latency:
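The post does this tuning interactively from the console, but for repeatability one could script the same memory sweep with boto3, roughly like this (the function name and payload are placeholders; latencies measured this way include the network round trip, so they are an upper bound on the billed duration):

```python
# Sketch: sweep the Lambda memory setting and time invocations from the client side.
import json
import time
import boto3

client = boto3.client("lambda")
payload = json.dumps({"text": "a test passage of roughly 128 subword tokens ..."})

for memory_mb in [512, 1024, 2048, 4096, 8192, 10240]:
    client.update_function_configuration(FunctionName="nlp-inference", MemorySize=memory_mb)
    client.get_waiter("function_updated").wait(FunctionName="nlp-inference")
    latencies = []
    for _ in range(10):
        start = time.time()
        client.invoke(FunctionName="nlp-inference", Payload=payload)
        latencies.append((time.time() - start) * 1000)
    median_ms = sorted(latencies)[len(latencies) // 2]
    print(memory_mb, "MB ->", round(median_ms, 1), "ms (median, incl. network)")
```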

With increasing memory, one also gets more CPUs, and possibly more powerful CPUs with better support for acceleration instruction sets like AVX2/AVX-512. Lambda functions are guaranteed to run on CPUs with AVX2 support.

Why is the latency improving when throwing more resources at it, you might ask? The code looks to be single-threaded, right?

ONNX-Runtime uses the available CPU threads to run inference, so as the allocated memory (and thereby the CPU count) increases, the inference is parallelized across more threads. An 18 ms inference time for an input sequence length of 128 is pretty good and likely good enough for most serving use cases.
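By default, ONNX-Runtime picks up the cores available in the Lambda environment, but the thread counts can also be pinned explicitly through SessionOptions if you want to experiment. A small sketch:

```python
# Sketch: control ONNX-Runtime threading explicitly via SessionOptions.
import onnxruntime as ort

options = ort.SessionOptions()
options.intra_op_num_threads = 2   # threads used to parallelize a single inference
options.inter_op_num_threads = 1   # threads used across independent graph nodes
session = ort.InferenceSession("model-quantized.onnx", sess_options=options)
```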

Lambda Function Cost Modelling

I have included the allocated Lambda memory and the associated duration price from the AWS Lambda pricing above. Note that this cost model does not include network charges and other fun stuff that AWS might throw in, like charging for storing logs, so the total cost will certainly be higher.

Given the observed inference latency for sequence length 128, it is possible to calculate the total billed price for making 1B inferences (1,000 million inferences). If you care deeply about inference serving latency, for example when using this Lambda function synchronously to power a real-time serving scenario like query embedding for search, you need to determine how low is low enough (the latency SLA). For batch processing, latency is less important, but as can be seen above, the cheapest option (using 1024 MB) gives better latency than using 512 MB. Note that the exact sweet spot (threads versus latency) might depend significantly on sequence length; smaller sequence lengths like 32 might not get much latency speedup from using more threads during model inference.

The latency SLA decision should also take network latency into account, which can easily be several times the latency of the inference call alone. The cheapest option is using 1024 MB with 55 ms latency, where making 1B inferences will cost less than $1,000.

Lowering the latency to 18 ms is excellent for a synchronous real-time serving use case. However, by reducing latency to 18–20 ms, the serving cost grows by 3x compared with the cheapest option. So there is a cost versus latency tradeoff to be made, as always:

Price for 1B inferences using Lambda Stateless Function for NLP Inference

Conclusion

In this blog post, I have tried to summarize my learnings from playing with Lambda functions for NLP inference. I have also touched on how to speed up inference using more CPU threads and quantified cost versus inference latency.
