Amazon SageMaker inference launches faster auto scaling for generative AI models

Introduction

Today, we are excited to announce a new capability in Amazon SageMaker inference that reduces the time it takes for your generative artificial intelligence (AI) models to scale automatically. With this enhancement, you can improve the responsiveness of your generative AI applications as demand fluctuates. The rise of foundation models (FMs) and large language models (LLMs) has brought new challenges to generative AI inference deployment. These advanced models often take seconds to process each request, and each instance can typically serve only a limited number of concurrent requests. This creates a critical need for rapid detection of load changes and fast auto scaling to maintain business continuity.

Faster Auto Scaling Metrics

To optimize real-time inference workloads, SageMaker employs Application Auto Scaling. This feature dynamically adjusts the number of instances in use and the quantity of model copies deployed, responding to real-time changes in demand. When in-flight requests surpass a predefined threshold, auto scaling increases the available instances and deploys additional model copies to meet the heightened demand. Similarly, as the number of in-flight requests decreases, the system automatically removes unnecessary instances and model copies, effectively reducing costs.
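
To see the load signal that drives these decisions, you can query the concurrency metrics directly from Amazon CloudWatch. The following is a minimal sketch using boto3; the endpoint name and variant name are placeholders, and it assumes the ConcurrentRequestsPerModel metric is published under the AWS/SageMaker namespace with EndpointName and VariantName dimensions, as described in the SageMaker CloudWatch documentation.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Retrieve the last 15 minutes of concurrent-request load for a
# (placeholder) endpoint variant, at 1-minute granularity.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ConcurrentRequestsPerModel",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},  # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},    # placeholder
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=15),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])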

Components of Auto Scaling

The following figure illustrates how a SageMaker real-time inference endpoint scales out to handle an increase in concurrent requests, demonstrating the automated and responsive nature of scaling in SageMaker. In this example, we walk through the key steps that occur when inference traffic to the endpoint starts to increase and concurrency to the model deployed on each instance goes up.

Target Tracking and Step Scaling Policies

By using the new metrics, auto scaling can now detect the need to scale much earlier, because it observes the concurrency on each model copy directly rather than inferring load from invocation counts. The first step is to register your deployment as a scalable target with Application Auto Scaling, specifying the capacity bounds within which it may scale (here, aas_client is a boto3 client for the application-autoscaling service, and resource_id identifies the endpoint variant to scale):

aas_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,  # "endpoint/<endpoint-name>/variant/<variant-name>"
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=as_min_capacity,  # Replace with your desired minimum instances
    MaxCapacity=as_max_capacity,  # Replace with your desired maximum instances
)
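
Once the scalable target is registered, you attach a target tracking scaling policy to it. The following is a minimal sketch, assuming the aas_client and resource_id from the previous step; the policy name and target value are illustrative, and the predefined metric type shown is the high-resolution concurrency metric name as we understand it from the Application Auto Scaling documentation, so verify it for your deployment.

aas_client.put_scaling_policy(
    PolicyName="concurrency-target-tracking",  # hypothetical policy name
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            # High-resolution concurrency metric (name assumed; verify
            # against the current Application Auto Scaling documentation)
            "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution",
        },
        "TargetValue": 5.0,  # illustrative: ~5 concurrent requests per model copy
        "ScaleInCooldown": 180,
        "ScaleOutCooldown": 60,
    },
)

A step scaling policy can be layered on top of this for sharp traffic spikes: it is driven by a CloudWatch alarm on the same concurrency metric and adds capacity in larger, predefined increments instead of tracking a single target value.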

Sample Runs and Results

With the new metrics, we observed significant reductions in the time required to trigger scale-out events. To test the effectiveness of this solution, we completed sample runs with Meta Llama models (Llama 2 7B and Llama 3 8B). Prior to this feature, detecting the need for auto scaling could take over 6 minutes; with the new metrics, we were able to reduce that time to less than 45 seconds.

Conclusion

In this post, we detailed how the ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy metrics work, explained why you should use them, and walked you through the process of implementing them for your workloads. We encourage you to try out these new metrics and evaluate whether they improve your FM and LLM workloads on SageMaker endpoints.

Frequently Asked Questions

Question 1: What is the purpose of the new metrics?

The new metrics, ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy, provide a more direct and accurate representation of the load on the system, enabling faster auto scaling and improved responsiveness of generative AI applications.

Question 2: How do the new metrics improve auto scaling?

The new metrics allow for more rapid detection of changes in demand, enabling auto scaling to respond more quickly and accurately to fluctuations in inference traffic.

Question 3: Can I use the new metrics with existing invocation-based target tracking policies?

Yes, you can use the new metrics in tandem with existing invocation-based target tracking policies to achieve a more efficient and adaptive scaling behavior.
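
As an illustration, the sketch below adds an invocation-based target tracking policy to the same scalable target used earlier (aas_client and resource_id are carried over; the policy name and target value are illustrative). With multiple target tracking policies attached, Application Auto Scaling scales out if any policy calls for it and scales in only when all policies allow it, so the concurrency-based policy can react quickly to spikes while the invocation-based policy continues to govern steady-state load.

aas_client.put_scaling_policy(
    PolicyName="invocations-target-tracking",  # hypothetical policy name
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            # Long-standing invocation-based predefined metric
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "TargetValue": 100.0,  # illustrative: invocations per instance per minute
    },
)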

Question 4: How do I implement the new metrics in my workload?

To implement the new metrics, you can follow the steps outlined in this post, including creating a scalable target and defining a target tracking scaling policy.

Question 5: What are the benefits of using the new metrics?

The benefits of using the new metrics include faster auto scaling, improved responsiveness, and reduced costs.