Amazon SageMaker Inference: Faster Auto Scaling for Generative AI Models
Introduction
Today, we are excited to announce a new capability in Amazon SageMaker inference that can help you reduce the time it takes for your generative artificial intelligence (AI) models to scale automatically. With this enhancement, you can improve the responsiveness of your generative AI applications as demand fluctuates. The rise of foundation models (FMs) and large language models (LLMs) has brought new challenges to generative AI inference deployment. These advanced models often take seconds to process a single request, while sometimes handling only a limited number of concurrent requests. This creates a critical need for rapid detection and auto scaling to maintain business continuity.
Faster Auto Scaling Metrics
To optimize real-time inference workloads, SageMaker employs Application Auto Scaling. This feature dynamically adjusts the number of instances in use and the quantity of model copies deployed, responding to real-time changes in demand. When in-flight requests surpass a predefined threshold, auto scaling increases the available instances and deploys additional model copies to meet the heightened demand. Similarly, as the number of in-flight requests decreases, the system automatically removes unnecessary instances and model copies, effectively reducing costs.
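The scale-out and scale-in behavior described above can be sketched as a simple target-tracking calculation. This is an illustrative model only, not SageMaker's internal algorithm; the function name and parameters are hypothetical:

```python
import math


def desired_copies(in_flight: int, target_per_copy: int,
                   min_copies: int = 1, max_copies: int = 10) -> int:
    """Target-tracking heuristic: keep roughly target_per_copy in-flight
    requests on each model copy, clamped to the registered capacity bounds."""
    needed = math.ceil(in_flight / target_per_copy)
    return max(min_copies, min(max_copies, needed))
```

With a target of 5 concurrent requests per copy, 25 in-flight requests would call for 5 copies, while a drop to 2 in-flight requests would let the system scale back in to the minimum, reducing costs.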
Components of Auto Scaling
The following figure illustrates a typical scenario of how a SageMaker real-time inference endpoint scales out to handle an increase in concurrent requests. This demonstrates the automated and responsive nature of scaling in SageMaker. In this example, we walk through the key steps that occur when the inference traffic to a SageMaker real-time endpoint starts to increase and concurrency to the model deployed on every instance goes up.
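The scale-out sequence in this scenario can be simulated with a short sketch: as in-flight requests ramp up, each time per-copy concurrency crosses a threshold, a scale-out event fires and more copies are added. The names and threshold value here are illustrative assumptions, not SageMaker internals:

```python
def scale_events(traffic, threshold=5, start_copies=1):
    """Return (second, in_flight, new_copies) for each scale-out event.

    traffic is a list of in-flight request counts sampled once per second.
    Only scale-out is modeled here, matching the scenario in the figure.
    """
    copies, events = start_copies, []
    for t, in_flight in enumerate(traffic):
        if in_flight / copies > threshold:
            copies = -(-in_flight // threshold)  # ceiling division
            events.append((t, in_flight, copies))
    return events
```

For a ramp such as `[1, 3, 6, 9, 14, 14]`, scale-out fires at the samples where per-copy concurrency exceeds the threshold, first adding a second copy and later a third.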
Target Tracking and Step Scaling Policies
By using the new metrics, auto scaling can now react directly to the number of concurrent requests per model copy rather than relying on invocation counts. To enable this, you first register a scalable target with Application Auto Scaling. In the snippet below, `client` is a boto3 `application-autoscaling` client and `resource_id` identifies your inference component:

```python
client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=as_min_capacity,  # Replace with your desired minimum model copies
    MaxCapacity=as_max_capacity,  # Replace with your desired maximum model copies
)
```
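After a scalable target is registered, a target tracking policy can point at the new concurrency metric. The following is a hedged sketch of the policy configuration; the predefined metric type shown is the high-resolution per-copy concurrency metric for inference components, but you should verify the exact name and values against the SageMaker documentation for your setup:

```python
# Sketch of a TargetTrackingScalingPolicyConfiguration; values are examples.
policy_config = {
    "TargetValue": 5.0,  # desired concurrent requests per model copy
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": (
            "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution"
        ),
    },
    "ScaleInCooldown": 300,  # seconds to wait before scaling in again
    "ScaleOutCooldown": 60,  # seconds to wait before scaling out again
}

# This dict would be passed as TargetTrackingScalingPolicyConfiguration to
# the application-autoscaling put_scaling_policy call, alongside
# ServiceNamespace="sagemaker" and the inference component's ResourceId.
```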
Sample Runs and Results
With the new metrics, we have observed improvements in the time required to trigger scale-out events. To test the effectiveness of this solution, we completed some sample runs with Meta Llama models (Llama 2 7B and Llama 3 8B). Prior to this feature, detecting the need for auto scaling could take over 6 minutes, but with this new feature, we were able to reduce that time to less than 45 seconds.
Conclusion
In this post, we detailed how the ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy metrics work, explained why you should use them, and walked you through the process of implementing them for your workloads. We encourage you to try out these new metrics and evaluate whether they improve your FM and LLM workloads on SageMaker endpoints.
Frequently Asked Questions
Question 1: What is the purpose of the new metrics?
The new metrics, ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy, provide a more direct and accurate representation of the load on the system, enabling faster auto scaling and improved responsiveness of generative AI applications.
Question 2: How do the new metrics improve auto scaling?
The new metrics allow for more rapid detection of changes in demand, enabling auto scaling to respond more quickly and accurately to fluctuations in inference traffic.
Question 3: Can I use the new metrics with existing invocation-based target tracking policies?
Yes, you can use the new metrics in tandem with existing invocation-based target tracking policies to achieve a more efficient and adaptive scaling behavior.
Question 4: How do I implement the new metrics in my workload?
To implement the new metrics, you can follow the steps outlined in this post, including creating a scalable target and defining a target tracking scaling policy.
Question 5: What are the benefits of using the new metrics?
The benefits of using the new metrics include faster auto scaling, improved responsiveness, and reduced costs.