Migrate your Machine Learning Workload to Serverless Architecture with AWS Lambda

Neurons Lab
Dec 27, 2022


Today, serverless computing offers many benefits to developers like scalability, faster time-to-market, and lower expenses. Machine learning projects have different complexities and migrating them to the cloud has inherent challenges.

In this post, we will talk about migrating your machine learning workload to serverless architecture with AWS Lambda.

What is AWS Lambda?

AWS Lambda is an event-driven, serverless computing platform that is part of Amazon Web Services. This means you can stop worrying about which AWS resources to launch (provisioning) or how to manage them. You only need to upload your code to Lambda to run your application.

Lambda's pay-per-request billing, ease of use, and automatic scaling make it an economical and popular tool. It helps data scientists turn models into cost-effective, scalable API endpoints with minimal coding. Lambda also supports container image packaging, functions with up to 10 GB of memory, and Advanced Vector Extensions 2 (AVX2). These capabilities allow the deployment of larger and more powerful models with better performance.

With Lambda's support for container images as a packaging format, introduced in 2020, the deployment package size limit grew from 250 MB (for an unzipped .zip archive) to 10 GB. Previously, deploying ML frameworks like TensorFlow or PyTorch to Lambda was challenging due to package size constraints. The increased limit made ML a feasible workload for Lambda deployment. The Lambda service recorded ML inference as the fastest-growing workload type in 2021.

Deploying the machine learning inference code as a Lambda function

An AWS Cloud Development Kit (AWS CDK) script can automatically provision container image-based Lambda functions that perform ML inference with pre-trained models. You can also attach Amazon Elastic File System (EFS) storage to the Lambda functions to cache pre-trained models and reduce inference latency.

AWS architecture

Here are the steps to get an inference prediction.

1. A user sends a request to Amazon API Gateway requesting a machine learning inference.

2. The API Gateway gets the request and triggers the Lambda function.

3. Lambda loads the container image from Amazon ECR. The image contains the inference code and business logic for running the ML model. However, the model itself is not stored in the image unless you use the container-hosted option, explained in step 6 below.

4. Model storage option S3: when triggered, the Lambda function downloads the model files dynamically from S3 and then performs the inference.

5. Model storage option EFS: when triggered, the Lambda function accesses the models through the local mount path configured for the EFS file system and performs the inference.

6. Model storage option container-hosted: the model is packaged into the container image in Amazon ECR together with the Lambda function code from step 3. In this case, the model is chosen at build time instead of at run time.

7. Lambda returns the inference prediction via the API Gateway to the user.
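
The request flow above can be sketched as a minimal Lambda handler in Python. This is an illustration under assumptions, not the implementation used in the architecture: the event shape follows the API Gateway proxy integration (a JSON string in `body`), the `features` field is invented for the example, and `predict` is a hypothetical stand-in for a real model call.

```python
import json


def predict(features):
    # Hypothetical stand-in for real model inference (assumption for illustration).
    return {"score": sum(features) / len(features)}


def handler(event, context):
    """Entry point invoked through API Gateway (proxy integration assumed)."""
    try:
        # API Gateway passes the request body as a JSON string (or None).
        payload = json.loads(event.get("body") or "{}")
        features = payload["features"]
        is_numeric_list = isinstance(features, list) and all(
            isinstance(x, (int, float)) for x in features
        )
        if not features or not is_numeric_list:
            raise ValueError("features must be a non-empty list of numbers")
    except (KeyError, ValueError, TypeError) as exc:
        # Malformed input is reported to the caller as a client error.
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}

    # Step 7: return the prediction to the user via API Gateway.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(predict(features)),
    }
```

Calling `handler({"body": '{"features": [1, 2, 3]}'}, None)` locally exercises the same code path that API Gateway would trigger.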

Depending on your storage choice for hosting models through Amazon S3, Amazon EFS, or Amazon ECR via Lambda OCI deployment, there is an impact on the cost of the infrastructure, inference latency, and DevOps deployment strategies.
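
To make the three storage options concrete, here is a rough sketch of how a function might resolve the local path of a model for each one. Everything named here is an assumption for illustration: the directory paths, the `source` values, and the `download` callable, which stands in for an S3 download (for example a thin wrapper around an S3 client) and is passed in so the sketch runs without AWS access.

```python
from pathlib import Path

EFS_MOUNT = Path("/mnt/ml/models")   # hypothetical EFS mount path
IMAGE_DIR = Path("/opt/ml/models")   # hypothetical path baked into the image
TMP_DIR = Path("/tmp/models")        # Lambda's writable scratch space


def resolve_model_path(source, model_name, download=None, tmp_dir=TMP_DIR):
    """Return the local path of a model for the chosen storage option."""
    if source == "efs":
        # EFS: the model is already visible through the mount path.
        return EFS_MOUNT / model_name
    if source == "container":
        # Container-hosted: the model was packaged at build time.
        return IMAGE_DIR / model_name
    if source == "s3":
        if download is None:
            raise ValueError("the s3 option needs a download callable")
        local = Path(tmp_dir) / model_name
        if not local.exists():
            # S3: fetch the model at run time, once per execution environment.
            local.parent.mkdir(parents=True, exist_ok=True)
            download(model_name, str(local))
        return local
    raise ValueError(f"unknown model source: {source!r}")
```

Only the S3 branch performs network I/O at request time, which is where the latency difference between the options comes from.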

Comparison: Single and Multi-model inference architectures

ML inference architectures come in two types: single-model and multi-model. In a single-model architecture, one ML model handles the inference for all incoming requests. The model is stored in ECR (via OCI deployment with Lambda), S3, or EFS and is then used by the Lambda compute service.

What sets the single-model architecture apart is that each model has its own compute resource. Every Lambda function is associated with a single model in a one-to-one relationship.

Multi-model inference architecture, on the other hand, involves deploying multiple models. The model that performs the inference is selected dynamically, depending on the request type. For example, you may have five or six different models for a single application, and the Lambda function chooses the appropriate one at invocation time. Many models map to one function in a many-to-one relationship.

Whether you use a single-model or a multi-model architecture, the models need to be stored in S3, EFS, or ECR via Lambda OCI deployments.
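
As a small illustration of the many-to-one relationship, the sketch below picks a model file based on a field in the request. The `model` request field, the registry contents, and the file names are all hypothetical.

```python
# Hypothetical registry: request type -> model file served by one function.
MODEL_REGISTRY = {
    "sentiment": "sentiment-v1.bin",
    "language-id": "language-id-v3.bin",
}


def select_model(request):
    """Return the model file for this request (many models, one function)."""
    name = request.get("model")
    if name not in MODEL_REGISTRY:
        known = sorted(MODEL_REGISTRY)
        raise KeyError(f"unknown model {name!r}; expected one of {known}")
    return MODEL_REGISTRY[name]
```

In a single-model architecture this lookup disappears, because each function is tied to exactly one model.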

Should you load a model outside the Lambda handler or inside?

Lambda best practices suggest loading models, and anything else that takes a long time to initialize, outside the Lambda handler. Loading a third-party package dependency is one example. Code outside the handler runs once per execution environment, during the cold start, and later invocations reuse the result.

If you are running multi-model inference, it is often better to load inside the handler so that you can choose the model dynamically. For example, you could store 100 models in EFS and decide which one to load when the Lambda function is invoked. In these scenarios, loading the model inside the handler is the practical choice, at the cost of a longer processing time, because the model is loaded at request time.
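
A minimal sketch of loading outside the handler is shown here. The `load_model` function is a hypothetical stand-in for deserializing real weights, and `LOAD_COUNT` exists only to make the effect visible.

```python
LOAD_COUNT = 0  # demonstration only: counts how often the model is loaded


def load_model():
    """Hypothetical loader; a real one would read model weights from disk."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda x: 2 * x  # stand-in for a trained model


# Module scope: executed once per execution environment, not per request.
MODEL = load_model()


def handler(event, context):
    # Every invocation in a warm environment reuses the loaded model.
    return {"prediction": MODEL(event["x"])}
```

However many times `handler` runs in the same environment, `LOAD_COUNT` stays at 1.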

Recommendations for model storage

For single-model architectures, load the ML model outside the Lambda handler where possible. This improves performance on subsequent invocations after the initial cold start.

For multi-model architectures, also load the models outside the Lambda handler if possible. If there are too many models to load in advance, load the required model inside the Lambda handler. In that case a model is loaded on every invocation, which increases the duration of the Lambda function.

When should you choose model hosting on S3, EFS, or container-hosted models?

Choosing S3 option

Opt for S3 if you need a low-cost option to store models. S3 is also a good choice when you cannot predict your application's inference traffic volume. Another advantage is that if you need to retrain the model, you can upload the retrained model to S3 without redeploying the Lambda function.
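
To show how a function might pick up a retrained model without a redeploy, the sketch below reloads a model only when its version identifier changes. This is one possible approach, not something the architecture prescribes. `get_version` and `download` are hypothetical callables standing in for S3 requests (for instance, reading an object's ETag and fetching its contents), injected so the example runs without AWS access.

```python
class S3ModelCache:
    """Cache models in memory and reload one when its stored version changes."""

    def __init__(self, get_version, download):
        self._get_version = get_version  # hypothetical: key -> version string
        self._download = download        # hypothetical: key -> model bytes
        self._entries = {}               # key -> (version, model)

    def get(self, key):
        version = self._get_version(key)
        cached = self._entries.get(key)
        if cached is None or cached[0] != version:
            # First use, or a retrained model was uploaded: fetch it again.
            self._entries[key] = (version, self._download(key))
        return self._entries[key][1]
```

The version check adds one extra request per lookup, so a real function might only check at intervals.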

Choosing model hosting on EFS

EFS is the storage of choice if you have a latency-sensitive inference workload, or if you already use EFS for other machine learning activities and have it set up.

Using EFS requires you to VPC-enable the Lambda function so that it can mount the EFS file system, which calls for additional configuration.

Throughput testing is essential with both the EFS bursting and provisioned throughput modes. If bursting mode cannot keep up with the volume of inference request traffic, provision throughput for EFS.
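
A very simple starting point for such a test is to time how long it takes to read a model file from the mount path. The helper below is a sketch and not a full benchmark; the function name and chunk size are our own choices.

```python
import time


def measure_read_throughput(path, chunk_size=1024 * 1024):
    """Read a file end to end and return (bytes_read, MiB_per_second)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large models do not need to fit in memory.
        while chunk := f.read(chunk_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mib = total / (1024 * 1024)
    rate = mib / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

Repeated reads of the same file may be served from a cache, so the first read after a cold start is usually the most representative figure.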

Choosing container-hosted models

The container-hosted approach is the simplest, as all models are present in the container image uploaded to Lambda. Latency is also the lowest, as no models are downloaded from external storage.

All models need to be packaged into the container image. If the models cannot fit within the 10 GB available to a container image, a container-hosted solution is not feasible.

Another drawback is that any model change requires repackaging the models with the inference Lambda function code.

This approach is suitable only if your models can fit within the 10 GB limit for container images and you have no need to retrain models frequently.
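
A quick pre-build check along these lines can flag model sets that will not fit. It is a rough sketch: `reserved_bytes` is a hypothetical allowance for the base image, runtime, and code, and summing file sizes only approximates the final image size.

```python
from pathlib import Path

IMAGE_LIMIT_BYTES = 10 * 1024**3  # the 10 GB container image limit


def models_fit_in_image(model_dir, reserved_bytes=0, limit=IMAGE_LIMIT_BYTES):
    """Return (total_model_bytes, fits) for all files under model_dir."""
    total = sum(
        p.stat().st_size for p in Path(model_dir).rglob("*") if p.is_file()
    )
    # reserved_bytes: hypothetical space set aside for everything but models.
    return total, total + reserved_bytes <= limit
```

Running this in a build pipeline before the image is built gives an early warning that S3 or EFS may be the better option.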


By now, you have a fair idea of what is involved with migrating your machine learning workload to serverless architecture with AWS Lambda. Go on and choose the right storage option based on your requirements for model hosting.

Neurons Lab is an AWS Lambda Delivery Partner and helps customers to build or migrate solutions to a microservices architecture running on serverless computing.

Drop us a line for a consultation


