How Patsnap used GPT-2 inference on Amazon SageMaker with low latency and cost

Introduction

In this blog post, co-authored by Zilong Bai, a senior natural language processing engineer at Patsnap, we explore a feature built in collaboration with the AWS Generative AI Innovation Center: automatically suggesting search keywords to improve the user experience on the Patsnap platform. Patsnap is a global one-stop platform for patent search, analysis, and management that uses big data and machine learning to provide powerful patent tools. However, the inference latency and queries per second (QPS) of the PyTorch-based GPT-2 model used for search query autofill did not meet the required thresholds.

To address this challenge, scientists at the AWS Generative AI Innovation Center optimized the GPT-2 model's inference using NVIDIA TensorRT, reducing average latency by 50% and improving QPS by 200%. In this post, we walk through the technical details of the optimization, compare the latency of the PyTorch and TensorRT models, and show how to deploy the TensorRT-based model on SageMaker with a custom container. Read on to learn how we achieved low latency on GPU instances and improved the user experience on the Patsnap platform.

Improving Search Experiences with Auto-Suggested Patent Search Keywords

Search engines have become an integral part of our daily lives, and we are all familiar with the autocomplete feature that suggests queries as we type. It makes finding what we are looking for easier and faster. Now the same concept is being applied to patent search.

A collaboration between the AWS Generative AI Innovation Center and Patsnap has produced a feature that automatically suggests search keywords for patent searches. Patsnap is a global platform for patent search, analysis, and management that provides powerful yet user-friendly patent tools. By applying big data and machine learning, Patsnap aims to enhance user experiences on its platform.

The Challenge of Latency and Queries per Second

Patsnap trained a customized version of the GPT-2 model, a large transformer-based language model, to generate search queries. However, during experiments, they encountered challenges with the model’s latency and queries per second (QPS). The PyTorch-based GPT-2 model was unable to meet their desired performance thresholds.
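To ground the discussion, here is a minimal sketch of the baseline PyTorch inference path. It uses the public Hugging Face GPT-2 as a stand-in for Patsnap's customized model, whose weights, tokenizer, and decoding settings are not public, so treat every detail below (prompt, beam settings, token budget) as illustrative.

```python
# Minimal sketch: baseline PyTorch GPT-2 generating query suggestions.
# The public "gpt2" checkpoint stands in for Patsnap's customized model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

def suggest_queries(prefix: str, num_suggestions: int = 5, max_new_tokens: int = 8):
    """Generate candidate completions for a partially typed search query."""
    inputs = tokenizer(prefix, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            num_beams=num_suggestions,
            num_return_sequences=num_suggestions,
            pad_token_id=tokenizer.eos_token_id,
        )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

print(suggest_queries("lithium battery electrode"))
```

Each keystroke on an autocomplete UI can trigger a call like this, which is why per-request latency and QPS matter so much for this workload.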

Optimizing GPT-2 Inference Performance with TensorRT

To tackle this challenge, the scientists at the AWS Generative AI Innovation Center explored various solutions to optimize the GPT-2 model's inference performance. They achieved remarkable results by leveraging NVIDIA TensorRT, an SDK for high-performance deep learning inference on NVIDIA GPUs.

By deploying a TensorRT-based model on Amazon SageMaker, the average latency of the GPT-2 model was reduced by 50% and the QPS improved by 200%. These improvements significantly enhanced the user experience on the Patsnap platform.

Converting the GPT-2 Model with TensorRT

Converting the PyTorch-based GPT-2 model to a TensorRT-based model involves several steps: analyzing the GPT-2 model, installing the required Python packages, and converting the model with the official tool provided by NVIDIA. The conversion caused no noticeable degradation in model accuracy.
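Patsnap used NVIDIA's official conversion tool, and the exact steps depend on its version. As an illustration of a common conversion path, the following sketch exports the model to ONNX with PyTorch and then builds a serialized TensorRT engine with the TensorRT 8.x Python API. The file names, input shapes, and the FP16 flag are assumptions for this sketch, not Patsnap's actual settings.

```python
# Condensed sketch of a PyTorch -> ONNX -> TensorRT conversion path.
import torch
import tensorrt as trt
from transformers import GPT2LMHeadModel

# Step 1: export to ONNX. A thin wrapper returns plain tensors, which
# torch.onnx.export requires instead of Hugging Face output objects.
class LogitsOnly(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model
    def forward(self, input_ids):
        return self.model(input_ids, return_dict=False)[0]

model = LogitsOnly(GPT2LMHeadModel.from_pretrained("gpt2")).eval()
dummy = torch.ones(1, 16, dtype=torch.long)
torch.onnx.export(
    model, (dummy,), "gpt2.onnx",
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "logits": {0: "batch", 1: "seq"}},
    opset_version=13,
)

# Step 2: parse the ONNX graph and build a serialized TensorRT engine.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("gpt2.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # FP16 on V100 GPUs (assumption)
profile = builder.create_optimization_profile()
# min / opt / max (batch, sequence length) shapes -- illustrative values
profile.set_shape("input_ids", (1, 1), (1, 32), (8, 128))
config.add_optimization_profile(profile)

with open("gpt2.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```

The optimization profile is the key design decision here: TensorRT specializes kernels for the declared shape ranges, so the min/opt/max shapes should bracket the query lengths and batch sizes the service actually sees.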

Performance Comparison: PyTorch vs. TensorRT

To evaluate the performance improvements, Apache JMeter, a load-testing tool, was used to benchmark the original PyTorch-based GPT-2 model and the converted TensorRT-based model on an AWS p3.2xlarge GPU instance.
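JMeter test plans are XML and verbose, so as an illustrative stand-in (not the benchmark actually used), here is a minimal Python load test that reports average latency and QPS for an HTTP inference endpoint. The URL and payload are placeholders.

```python
# Minimal load-test sketch: average latency and QPS at varying concurrency.
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/invocations"  # hypothetical local endpoint
PAYLOAD = {"inputs": "lithium battery electrode"}  # placeholder request body

def one_request() -> float:
    """Send one inference request and return its wall-clock latency."""
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=10)
    return time.perf_counter() - start

def load_test(concurrency: int, total_requests: int = 200) -> None:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: one_request(), range(total_requests)))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency} "
          f"avg latency={statistics.mean(latencies) * 1000:.1f} ms "
          f"QPS={total_requests / elapsed:.1f}")

for c in (1, 2, 4, 8):
    load_test(c)
```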

The results demonstrated significant latency reductions with the TensorRT-based model. At a request concurrency of 1, average latency dropped by a factor of 2.9, and QPS increased from 2.4 to 7. The improvements held as concurrency increased, providing lower costs at acceptable latency.

Deploying TensorRT-based GPT-2 with SageMaker

To deploy the TensorRT-based GPT-2 model, a custom container was built using SageMaker's bring your own container (BYOC) mode, which provides the flexibility to deploy the model in a customized Docker environment. The deployed model was then tested through the SageMaker endpoint API.
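As a minimal sketch of that flow, the following boto3 calls register a BYOC image as a SageMaker model, create a GPU-backed real-time endpoint, and invoke it. The image URI, role ARN, and request format are placeholders; a BYOC image must serve /invocations and /ping on port 8080 per the SageMaker hosting contract.

```python
# Sketch: deploy a BYOC inference image as a SageMaker endpoint and invoke it.
import boto3

sm = boto3.client("sagemaker")
runtime = boto3.client("sagemaker-runtime")

# Register the custom container image as a SageMaker model.
sm.create_model(
    ModelName="gpt2-tensorrt",
    PrimaryContainer={"Image": "<account>.dkr.ecr.<region>.amazonaws.com/gpt2-trt:latest"},
    ExecutionRoleArn="arn:aws:iam::<account>:role/<sagemaker-role>",
)

# Host it on a GPU instance (the instance type used in the benchmark).
sm.create_endpoint_config(
    EndpointConfigName="gpt2-tensorrt-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "gpt2-tensorrt",
        "InstanceType": "ml.p3.2xlarge",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(EndpointName="gpt2-tensorrt",
                   EndpointConfigName="gpt2-tensorrt-config")

# Block until the endpoint is InService, then send a test request.
sm.get_waiter("endpoint_in_service").wait(EndpointName="gpt2-tensorrt")
response = runtime.invoke_endpoint(
    EndpointName="gpt2-tensorrt",
    ContentType="application/json",
    Body='{"inputs": "lithium battery electrode"}',  # placeholder payload
)
print(response["Body"].read().decode())
```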

Conclusion

The collaboration between the AWS Generative AI Innovation Center and Patsnap has resulted in an auto-suggested search keyword feature for patent search. Optimizing the GPT-2 model's inference with TensorRT cut latency by 50% and improved QPS by 200%, translating into better search experiences for users on the Patsnap platform. This solution showcases how machine learning can enhance user experiences and drive business value.

Summary

This blog post discussed the collaboration between the AWS Generative AI Innovation Center and Patsnap to improve the user experience on Patsnap's patent search platform with a feature that automatically suggests search keywords using state-of-the-art text generation models. The PyTorch-based GPT-2 model behind the feature initially fell short on latency and queries per second (QPS). Optimizing and deploying the model with NVIDIA TensorRT significantly reduced latency and improved QPS. The post covered the technical details of the conversion, a performance comparison with the original model, and instructions for deploying the TensorRT-based model on SageMaker with a custom container.