Etsy Engineering | Scaling Etsy Payments with Vitess: Part 3 – Reducing Cutover Risk

Enhancing Etsy Payments with Vitess: Part 3 – Mitigating Cutover Risks for Seamless Scalability

Introduction:

Between Dec 2020 and May 2022, Etsy successfully moved 23 tables with over 40 billion rows from four unsharded payments databases into a single sharded environment managed by Vitess. This marked their first usage of vindexes for sharding data. In this final part of their series on Sharding Payments with Vitess, they discuss the classes of errors that can occur when cutting over traffic from an unsharded keyspace to a sharded keyspace. They cover transaction mode errors, reverse VReplication breaking, and scatter queries. The Etsy team emphasizes the steps they took to minimize these risks and ensure a smooth production cutover.

Moving and Sharding Data on Etsy’s Payments Platform: Challenges and Solutions

Between December 2020 and May 2022, Etsy’s Payments Platform, Database Reliability Engineering, and Data Access Platform teams undertook a significant project to move 23 tables, totaling over 40 billion rows, from four unsharded payments databases into a single sharded environment managed by Vitess. In this article, we will explore the challenges faced by the teams during this process and the different classes of errors that can occur when cutting over traffic from an unsharded keyspace to a sharded keyspace.

Introduction

This article is part three of our series on Sharding Payments with Vitess. In the first post, we discussed the challenges related to the application and data model. In part two, we focused on the challenges of cutting over a high traffic system. Now, we will delve into the various types of errors that can arise during the cutover process.

The Role of the Data Access Platform Team

The Data Access Platform team at Etsy plays a crucial role in bridging the gap between product engineering and infrastructure. They are responsible for developing and maintaining Etsy’s in-house ORM (Object-Relational Mapping) and have expertise in software engineering and database software. Their involvement in the sharding project ensured that the queries generated by the ORM were compatible with vindexes and that Vitess was configured correctly to meet application expectations.

Transaction Mode Errors

When sharding data with Vitess, the choice of transaction_mode determines which atomicity guarantees a transaction keeps. The default transaction_mode in Vitess is multi, which allows a transaction to span multiple shards. However, partial commits are possible in this mode: if a transaction fails, its writes may still be persisted on one or more shards. To restore the usual atomicity guarantees of database transactions across shards, Vitess offers twopc (two-phase commit) mode, but it is still considered an experimental feature.

To address these limitations, Etsy’s team opted for single transaction mode. Single mode maintains all transactional guarantees, but any attempt to query more than one shard within a transaction results in an error. The team made this choice because the behavior is easy to communicate to product developers and the guarantees it provides are more useful.

Preventing Transaction Mode Errors

To minimize the risk of transaction mode-related errors during the cutover, Etsy’s team conducted a thorough audit of their codebase. They logged all SQL statements executed and manually analyzed them to determine which shards each transaction involved. The process was painstaking, but it paid off: no transaction mode-related errors were observed during the production cutover.
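
The audit described above can be illustrated with a minimal sketch. It assumes a hypothetical log format of (transaction_id, sharding_key, sql) tuples and uses a simple modulo as a stand-in for a real vindex; neither detail comes from Etsy's tooling. The function reports every transaction whose statements would land on more than one shard, which is exactly the case that single transaction mode rejects.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count for this sketch


def shard_for(sharding_key: int) -> int:
    """Map a sharding key to a shard index (modulo stands in for a real vindex)."""
    return sharding_key % NUM_SHARDS


def cross_shard_transactions(log):
    """Return {transaction_id: sorted shard list} for transactions touching >1 shard.

    `log` is an iterable of (transaction_id, sharding_key, sql) tuples.
    """
    shards_by_txn = defaultdict(set)
    for txn_id, sharding_key, _sql in log:
        shards_by_txn[txn_id].add(shard_for(sharding_key))
    return {
        txn_id: sorted(shards)
        for txn_id, shards in shards_by_txn.items()
        if len(shards) > 1
    }


if __name__ == "__main__":
    example_log = [
        ("t1", 8, "UPDATE payments SET ... WHERE shop_id = 8"),
        ("t1", 8, "INSERT INTO ledger ..."),
        ("t2", 1, "UPDATE payments SET ... WHERE shop_id = 1"),
        ("t2", 2, "UPDATE payments SET ... WHERE shop_id = 2"),
    ]
    # t2 touches two shards and would fail under single transaction mode.
    print(cross_shard_transactions(example_log))
```

Any transaction this kind of check flags has to be restructured before cutover, for example by splitting it into per-shard transactions.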

Reverse VReplication Breaking

Reverse VReplication plays a critical role in keeping the original unsharded keyspace in sync with any writes sent to the sharded keyspace. If reverse VReplication breaks, it may no longer be possible to reverse the cutover. One common cause of breakage is MySQL unique key enforcement. A unique key in an unsharded keyspace enforces global uniqueness, but in a sharded keyspace a unique key can only ensure per-shard uniqueness. Two rows with the same unique-key value can therefore both be written successfully to different shards, and when the second row is replicated back to the unsharded keyspace it violates the unique-key constraint there and stops the stream.

There are two potential solutions to this issue. The first is to delete, from the unsharded keyspace, the row created by the earlier successful write, which allows subsequent rows to reverse vreplicate without violating the unique-key constraint. The second is to manually update the Pos column in Vitess’s internal _vt.vreplication table to skip the problematic row.
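
The failure mode is easier to see in a toy model. The sketch below is not Vitess code: the Table class, the email column, and modulo-based shard routing are all illustrative assumptions. Each table enforces uniqueness only on its own rows, so two writes with the same value succeed on different shards but collide when replayed into a single unsharded table.

```python
class UniqueKeyViolation(Exception):
    """Raised when an insert would duplicate a unique-key value."""


class Table:
    """A table with a unique key on `email`, enforced within this table only."""

    def __init__(self):
        self.rows = {}  # row_id -> email

    def insert(self, row_id, email):
        if email in self.rows.values():
            raise UniqueKeyViolation(email)
        self.rows[row_id] = email


def write_sharded(shards, row_id, email):
    """Route a write to one shard by row_id (stand-in for a vindex lookup)."""
    shards[row_id % len(shards)].insert(row_id, email)


def reverse_replicate(writes, unsharded):
    """Replay writes in order; return the first write that fails, else None."""
    for row_id, email in writes:
        try:
            unsharded.insert(row_id, email)
        except UniqueKeyViolation:
            return (row_id, email)
    return None


if __name__ == "__main__":
    shards = [Table(), Table()]
    writes = [(1, "a@example.com"), (2, "a@example.com")]
    for row_id, email in writes:
        write_sharded(shards, row_id, email)  # both succeed: different shards
    # The second write breaks replication into the unsharded table.
    print(reverse_replicate(writes, Table()))
```

In this model, the first solution corresponds to deleting row 1 from the unsharded table before replaying row 2, and the second corresponds to skipping row 2 entirely.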

Ensuring Robust Reverse VReplication

To ensure a robust reverse VReplication process, Etsy’s team implemented a series of measures. They created alerts to notify them if reverse VReplication broke and developed a runbook so that any breakage could be addressed promptly. Thankfully, reverse VReplication never broke in production. The breakages encountered in the development environment were specific to that workflow.
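
As a rough illustration of what such an alert might check, the sketch below filters rows that are assumed to have already been read from the _vt.vreplication table into dictionaries. The column names used here (workflow, state, message) and the state values treated as unhealthy are assumptions for this example, and the source does not describe how Etsy's alerts were actually implemented.

```python
# States treated as "broken" in this sketch; an assumption, not Etsy's rule.
UNHEALTHY_STATES = {"Error", "Stopped"}


def broken_streams(rows):
    """Return (workflow, message) for every stream in an unhealthy state.

    `rows` is a list of dicts, one per replication stream, with assumed
    keys "workflow", "state", and optionally "message".
    """
    return [
        (row["workflow"], row.get("message", ""))
        for row in rows
        if row["state"] in UNHEALTHY_STATES
    ]


if __name__ == "__main__":
    sample = [
        {"workflow": "payments_reverse", "state": "Running", "message": ""},
        {"workflow": "ledger_reverse", "state": "Error",
         "message": "Duplicate entry for unique key"},
    ]
    for workflow, message in broken_streams(sample):
        print(f"ALERT: {workflow} is not replicating: {message}")
```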

Scatter Queries

In a sharded keyspace, if a query’s WHERE clause includes neither the sharding key nor another vindexed column, Vitess cannot tell which shard holds the data and defaults to sending the query to all shards. This is known as a scatter query. Because each scatter query runs once per shard, queries that omit the sharding key before cutover can produce a significantly higher volume of queries after cutover.

Identifying and Resolving Scatter Queries

To prevent scatter queries from causing disruptions during cutover, Etsy’s team implemented measures to identify and address them beforehand. They learned from a previous cutover attempt where scatter queries caused issues and focused on identifying and adding the necessary sharding keys to queries.
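
One simple way to look for candidate scatter queries in a query log is a text heuristic like the sketch below. It only checks whether the WHERE clause contains an equality or IN predicate on the sharding column, and the table and column names are hypothetical. It is not how Vitess plans queries and says nothing about the tooling Etsy used; a real audit would need to inspect the actual query plans.

```python
import re


def is_probable_scatter(sql: str, sharding_column: str) -> bool:
    """Heuristic: True if no `col =` or `col IN (` predicate follows WHERE."""
    match = re.search(r"\bwhere\b(.*)", sql, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return True  # no WHERE clause at all
    where_clause = match.group(1)
    predicate = re.compile(
        rf"\b{re.escape(sharding_column)}\b\s*(=|in\s*\()", re.IGNORECASE
    )
    return predicate.search(where_clause) is None


if __name__ == "__main__":
    queries = [
        "SELECT * FROM payments WHERE shop_id = 42 AND status = 'paid'",
        "SELECT * FROM payments WHERE status = 'paid'",
        "SELECT COUNT(*) FROM payments",
    ]
    for sql in queries:
        label = "SCATTER?" if is_probable_scatter(sql, "shop_id") else "ok"
        print(f"{label:9} {sql}")
```

The check is meant for SELECT, UPDATE, and DELETE statements. An IN predicate with several values may still reach more than one shard, but it does not fan out to every shard the way a query with no sharding key does.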

Conclusion

Moving and sharding data on Etsy’s Payments Platform was a complex undertaking, requiring careful planning and execution. The teams involved tackled challenges related to transaction modes, reverse VReplication, and scatter queries. Through thorough auditing, proactive alerting, and diligent communication, they completed the cutover without significant errors or disruptions. The migration to a sharded environment is intended to benefit Etsy’s Payments Platform in terms of scalability and performance.

Summary: Enhancing Etsy Payments with Vitess: Part 3 – Mitigating Cutover Risks for Seamless Scalability

Between Dec 2020 and May 2022, Etsy successfully moved billions of rows of data from multiple databases into a single sharded environment using Vitess. In this final part of the series on Sharding Payments with Vitess, the author discusses potential errors that may occur when cutting over to a sharded keyspace and how to mitigate them. These errors include transaction mode errors, issues with reverse VReplication, and scatter queries. The author emphasizes the importance of thorough auditing, creating alerts, and using Vitess tools to ensure a smooth and successful production cutover.
