
Building Zonal Resiliency for Etsy’s Kafka Cluster: Part 1 – A Guide by Etsy Engineering

Introduction:

In 2018, Etsy decided to migrate to the cloud, with Google Cloud Platform as its provider. As part of that move, they undertook a major redesign that put their Kafka brokers on Google Kubernetes Engine (GKE) to benefit from features like on-demand capacity scaling and multi-zone/region resilience. As Etsy’s Kafka cluster grew in importance, the limitations of the initial single-availability-zone architecture became clear: a more resilient design was needed to prevent outages and data loss. This post discusses how Etsy migrated their Kafka cluster to a multi-zone design with zero downtime, achieving zonal resilience, and shares their plans to optimize the resulting increase in inter-zone network costs.

Full Article: Building Zonal Resiliency for Etsy’s Kafka Cluster: Part 1 – A Guide by Etsy Engineering

Etsy Successfully Migrates Kafka Cluster to Google Cloud’s Multizone Architecture for Improved Resilience and Cost Optimization

In 2018, Etsy decided to migrate to the cloud, choosing Google Cloud Platform as its provider. The migration involved a major redesign, including hosting their Kafka brokers and clients on Google Kubernetes Engine (GKE). Over time, Etsy recognized the need for greater resilience and began moving their Kafka cluster to a multizone architecture.

The Challenges of a Single-Zone Architecture

Initially, Etsy ran their Kafka cluster in a single availability zone to save on costs. This left the cluster exposed: an outage in that zone could take Kafka down entirely, resulting in stale search results and negatively impacting buyers, sellers, and Etsy’s revenue. To address these concerns, Etsy reevaluated their architecture.

Designing a Multizone Architecture for Kafka

After thorough research and experimentation, Etsy developed a plan to make their Kafka cluster resilient to zonal failures, with the goal of zero downtime during the migration itself. The new design ran Kafka brokers in three different zones within the GKE cluster, using Kubernetes Pod Topology Spread Constraints to distribute the broker Pods evenly across zones. In addition, each topic partition’s replicas were spread across zones according to the zone in which each broker was running.
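
Pod placement is handled by the topology spread constraints, but it can still be useful to check that partition replicas actually end up spread across zones. The following is a minimal sketch of such a check, assuming a hypothetical broker-to-zone mapping and partition assignment (the names and IDs are illustrative, not Etsy’s):

```python
# Hypothetical sketch: verify that each partition's replicas span distinct zones.
# The broker-to-zone map and assignments below are made-up examples, not Etsy's data.

from collections import Counter

# Which zone each Kafka broker Pod is running in (hypothetical values).
broker_zone = {
    1: "us-central1-a", 2: "us-central1-b", 3: "us-central1-c",
    4: "us-central1-a", 5: "us-central1-b", 6: "us-central1-c",
}

# (topic, partition) -> list of broker IDs holding its replicas (hypothetical assignment).
assignments = {
    ("search-events", 0): [1, 2, 3],
    ("search-events", 1): [4, 5, 6],
    ("search-events", 2): [2, 3, 4],
}

def zone_skew(replicas):
    """Return the largest number of replicas that share a single zone."""
    zones = Counter(broker_zone[b] for b in replicas)
    return max(zones.values())

for (topic, partition), replicas in assignments.items():
    skew = zone_skew(replicas)
    status = "OK" if skew == 1 else "UNBALANCED"
    print(f"{topic}-{partition}: replicas={replicas} max-per-zone={skew} {status}")
```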


Ensuring Zero Downtime During Migration

One of the main challenges during the migration was moving the broker Pods to their correct zones without causing downtime or data loss. The brokers’ disks and PersistentVolumeClaims (PVCs) were zonal resources, meaning they could only be attached from within their own zone. To work around this, Etsy used Google’s disk snapshotting feature. For each broker in turn, the process was:
– Create a base snapshot of the broker’s disk while the broker was still running.
– Halt the broker and take a final snapshot of its disk.
– Create a new disk from the final snapshot in the correct zone.
– Delete the original disk.
– Recreate the StatefulSet and wait for cluster health to return to normal.
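
As a rough illustration of the disk move for a single broker, the sketch below drives the gcloud disk-snapshot commands from Python. The project, zone, and disk names are hypothetical placeholders, and the Kafka-specific steps (halting the broker, recreating the StatefulSet) are left as comments, since those details depend on Etsy’s own Kubernetes setup:

```python
# Hypothetical sketch of the disk move for one broker, driven through the gcloud CLI.
# Names (project, zones, disk, snapshots) are placeholders; this is not Etsy's tooling.

import subprocess

PROJECT = "example-project"
OLD_ZONE = "us-central1-a"
NEW_ZONE = "us-central1-b"
DISK = "kafka-broker-3-data"

def gcloud(*args):
    """Run a gcloud command and fail loudly if it errors."""
    subprocess.run(["gcloud", *args, f"--project={PROJECT}"], check=True)

# 1. Base snapshot while the broker is still running (captures the bulk of the data).
gcloud("compute", "disks", "snapshot", DISK,
       f"--zone={OLD_ZONE}", "--snapshot-names=kafka-broker-3-base")

# 2. (Halt the broker here, e.g. by scaling down its StatefulSet.)

# 3. Final snapshot to capture writes made since the base snapshot.
gcloud("compute", "disks", "snapshot", DISK,
       f"--zone={OLD_ZONE}", "--snapshot-names=kafka-broker-3-final")

# 4. Recreate the disk from the final snapshot in the target zone.
gcloud("compute", "disks", "create", DISK,
       f"--zone={NEW_ZONE}", "--source-snapshot=kafka-broker-3-final")

# 5. Delete the original disk in the old zone.
gcloud("compute", "disks", "delete", DISK, f"--zone={OLD_ZONE}", "--quiet")

# 6. (Recreate the StatefulSet / PV / PVC pointing at the new disk and wait
#    for cluster health, e.g. under-replicated partitions, to return to normal.)
```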

Relocating Topic Partitions

Relocating the brokers was only part of the work: Kafka does not relocate topic partitions automatically, so Etsy had to reassign partition replicas manually to achieve an even distribution across all zones. This involved generating a list of partitions that needed to move, producing a new partition assignment plan in JSON form, and applying that plan with the Kafka CLI reassignment tool. The data migration was throttled so that the replication traffic would not overwhelm the cluster.
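
As an illustration of what such a plan looks like, the sketch below writes a reassignment file in the JSON format accepted by Kafka’s kafka-reassign-partitions.sh tool. The topic name, partition list, and broker IDs are hypothetical:

```python
# Hypothetical sketch: build a partition reassignment plan in the JSON format
# expected by Kafka's kafka-reassign-partitions.sh tool. Topic name and broker
# IDs are placeholders, not Etsy's actual layout.

import json

# Desired replica placement: one replica per zone for each partition
# (e.g. brokers 1/4 in zone a, 2/5 in zone b, 3/6 in zone c).
desired = [
    {"topic": "search-events", "partition": 0, "replicas": [1, 2, 3]},
    {"topic": "search-events", "partition": 1, "replicas": [4, 5, 6]},
    {"topic": "search-events", "partition": 2, "replicas": [2, 6, 1]},
]

plan = {"version": 1, "partitions": desired}

with open("reassignment.json", "w") as f:
    json.dump(plan, f, indent=2)

# The plan would then be applied, with replication throttled, via the Kafka CLI, e.g.:
#   kafka-reassign-partitions.sh --bootstrap-server <broker>:9092 \
#       --reassignment-json-file reassignment.json \
#       --throttle 50000000 --execute
```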

Testing and Validation in Production

In 2021, Etsy had the opportunity to test and validate their multizone Kafka design during a company-wide initiative to understand zonal resilience. They brought down an entire zone, a third of the Kafka cluster, in a production environment. The impact was minimal and temporary, as client requests automatically switched to still-available brokers.

Cost Optimization and Conclusion

Etsy expected the cost increase from the multizone architecture to be modest. By forgoing regional disks and relying instead on Kafka’s own inter-zone replication, they kept the added cost down, although the increased inter-zone network traffic remains a target for further optimization. The successful migration of their Kafka cluster to a multizone architecture has given Etsy greater resilience and reduced the risk of downtime and data loss.


Overall, Etsy’s experience with migrating their Kafka cluster to Google Cloud’s multizone architecture showcases the importance of resilience, cost optimization, and careful planning in ensuring a smooth and successful migration to the cloud.

Summary: Building Zonal Resiliency for Etsy’s Kafka Cluster: Part 1 – A Guide by Etsy Engineering

In 2018, Etsy decided to migrate to the Google Cloud Platform, which required a major redesign to host their Kafka brokers. As their Kafka cluster grew in importance and the limitations of manual zone evacuation became apparent, they developed a design to make the cluster resilient to zonal failures. The new design runs Kafka brokers in three different zones and distributes topic partition replicas evenly across them. Etsy migrated the cluster without data loss or downtime using Google’s disk snapshotting feature, and the design proved resilient when put to the test during a company-wide initiative. Although there was a cost increase, it was minimal compared to the benefits gained.

Frequently Asked Questions:

Q1: What is machine learning?

A1: Machine learning is a subset of artificial intelligence that empowers computers to learn and make predictions or decisions without being explicitly programmed. It involves the development of algorithms and models that allow machines to analyze large amounts of data, identify patterns, and make accurate predictions or take actions based on the available information.

Q2: How does machine learning work?

A2: In basic terms, machine learning involves three key components: input data, a model, and an algorithm. Initially, the model is trained using a large dataset that contains both input and output data. The algorithm analyzes this data and iteratively adjusts the model’s parameters to minimize errors. Once the training is complete, the model can be used to infer outputs or make predictions for new, unseen data.
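
As a concrete, framework-free illustration of that training loop, the minimal sketch below fits a one-variable linear model by gradient descent, repeatedly adjusting the parameters to reduce the squared error on a tiny made-up training set and then predicting on a new input:

```python
# Minimal illustration of the "adjust parameters to minimize error" loop:
# fit y = w * x + b to a tiny training set with gradient descent.

# Training data: inputs and the outputs we want the model to learn (here y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0          # model parameters, starting from a blank slate
learning_rate = 0.01

for step in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Nudge the parameters in the direction that reduces the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# After training, the model can make predictions for new, unseen inputs.
print(f"learned w={w:.2f}, b={b:.2f}, prediction for x=10: {w * 10 + b:.2f}")
```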


Q3: What are the main types of machine learning?

A3: The main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning; a short code sketch contrasting the first two follows the list below.
– Supervised learning involves training a model using labeled data, where the input and corresponding output are known. It is used for classification and regression tasks.
– Unsupervised learning aims to find patterns or structures in unlabeled data. Clustering and dimensionality reduction are common applications of unsupervised learning.
– Reinforcement learning involves training models to make decisions based on trial and error. The model receives feedback in the form of rewards or penalties, enabling it to learn optimal decision-making strategies over time.
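
As a small illustration of the first two categories, the sketch below uses scikit-learn as an example library on made-up toy data (reinforcement learning is omitted, since it requires an interactive environment rather than a fixed dataset):

```python
# Toy contrast between supervised and unsupervised learning using scikit-learn.
# The data and labels are made up purely for illustration.

from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Supervised: inputs X with known labels y (two well-separated groups of points).
X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression().fit(X, y)                 # learn from labeled examples
print("supervised prediction:", clf.predict([[4.9, 5.0]]))   # expected: [1]

# Unsupervised: same inputs, no labels; the algorithm finds the two groups itself.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("unsupervised cluster labels:", clusters.labels_)
```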

Q4: What are some real-life applications of machine learning?

A4: Machine learning is widely used across various industries and domains. Some examples of its applications include:
– Fraud detection and cybersecurity
– Recommendation systems in e-commerce and streaming platforms
– Predictive maintenance in manufacturing
– Medical diagnoses and personalized treatments
– Natural language processing and chatbots
– Autonomous driving and robotics
– Financial market analysis and trading algorithms

Q5: What are the challenges of implementing machine learning?

A5: While machine learning offers numerous benefits, there are challenges to consider when implementing it. Some key challenges include:
– Data quality and availability: Machine learning heavily relies on high-quality, representative data. Collecting and preparing the right data can be time-consuming and resource-intensive.
– Interpretability and transparency: Some machine learning models, such as deep neural networks, can be complex and challenging to interpret, leading to concerns around transparency and trust.
– Bias and fairness: Machine learning models can unintentionally reflect biases present in the training data, leading to biased predictions or decisions that may disproportionately affect certain groups.
– Scalability and compute requirements: Training machine learning models, especially with large datasets, can require significant computational resources and time.
– Privacy and security: The use of sensitive data in machine learning raises concerns about data privacy and protection from potential attacks or breaches.
