Increasing Magic Pocket write throughput by removing our SSD cache disks

Introduction:

When Magic Pocket adopted SMR drives in 2017, the team decided to use SSDs as a write-back cache for live writes, compensating for the slower random-write performance of SMR disks compared to their PMR (perpendicular magnetic recording) counterparts. As data density increased, however, maximum write throughput per host became constrained because all live writes had to pass through a single SSD. To remove this bottleneck, the team re-architected the storage layer in 2021 to bypass the SSD cache disks entirely and write directly to SMR disks. The change increased write throughput, improved reliability, reduced storage costs, and eliminated a single point of failure.

Magic Pocket Enhances Storage Infrastructure by Removing SSD Cache Disks

Magic Pocket, Dropbox's in-house storage system, made a significant change to its storage architecture in 2021: SSD cache disks were removed from the live write path, and writes now go directly to SMR (Shingled Magnetic Recording) disks. The change was driven by the need to remove a write-throughput bottleneck and to improve the reliability and cost-efficiency of the storage infrastructure.

Improving Write Throughput and Reliability

Previously, Magic Pocket used SSDs as a write-back cache for live writes before flushing them to SMR disks. As data density increased, maximum write throughput per host became limited by this single SSD, and even moving to NVMe-based SSDs did not fully resolve the issue. The team therefore decided to bypass the SSD cache disks and write directly to the SMR disks; a simplified sketch of the two write paths follows.
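To make the difference concrete, here is a minimal Go sketch of the two write paths. The names and interfaces (Disk, flushQueue) are assumptions for illustration, not Magic Pocket's actual code: a write-back cache acknowledges a write once it is durable on the SSD and flushes it to SMR later, while the direct path appends straight to an SMR extent.

```go
// Illustrative sketch only; the types and function names are assumptions,
// not Magic Pocket's real write path.
package storage

// Disk abstracts an append-only device (an SSD or an SMR zone).
type Disk interface {
	Append(p []byte) (offset int64, err error)
}

// writeThroughSSDCache acknowledges the write once it is durable on the SSD;
// a background flusher later migrates it to an SMR disk. Every live write on
// the host funnels through this single cache device, capping throughput.
func writeThroughSSDCache(ssd Disk, flushQueue chan<- []byte, block []byte) (int64, error) {
	off, err := ssd.Append(block)
	if err != nil {
		return 0, err
	}
	flushQueue <- block // flushed to SMR asynchronously
	return off, nil
}

// writeDirectToSMR appends straight to an SMR extent, spreading live writes
// across all data disks instead of a single SSD.
func writeDirectToSMR(smr Disk, block []byte) (int64, error) {
	return smr.Append(block)
}
```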


With the new approach, Magic Pocket increased write throughput by around 15-20%, improved the reliability of its storage infrastructure, and reduced overall storage costs.

Addressing SSD Failures and Overhead

In early 2020, Magic Pocket experienced a significant issue when a large number of SSD disks failed within a short period. These failures, caused by SSDs reaching their maximum write endurance, posed a risk to durability. Although all data was eventually repaired, the incident highlighted the need to mitigate the risk of future SSD failures.

Removing SSDs from the storage architecture eliminated a single point of failure and reduced the complexity and operational overhead associated with sourcing and maintaining SSDs in Magic Pocket’s fleet. Teams responsible for hardware engineering, capacity planning, and supply chain management no longer needed to qualify SSDs for reliability and durability or project future demands.

Changes in Storage Engine Design

Magic Pocket’s storage engine organizes data into extents, which are containers of blocks typically 1-2 GB in size. Previously, block metadata was stored on SSD cache disks, while raw block data was stored on SMR disks. To eliminate the SSD cache, a new extent format was introduced to store both metadata and raw data inline on the same SMR disk.
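As a rough illustration, here is a Go sketch of what an inline record might look like, assuming a simple [block ID | length | CRC | data] layout. The actual on-disk encoding, checksumming, and padding used by Magic Pocket are not described in the article; the names below are illustrative only.

```go
// Sketch of appending a block with its metadata stored inline in the extent.
package storage

import (
	"encoding/binary"
	"hash/crc32"
	"io"
	"os"
)

// appendBlock writes [blockID | length | crc32 | data] sequentially to the
// extent file, matching the append-only pattern that SMR zones require.
func appendBlock(extent *os.File, blockID uint64, data []byte) (offset int64, err error) {
	// Records are only ever appended; SMR zones do not allow random rewrites.
	offset, err = extent.Seek(0, io.SeekEnd)
	if err != nil {
		return 0, err
	}
	hdr := make([]byte, 16)
	binary.BigEndian.PutUint64(hdr[0:8], blockID)
	binary.BigEndian.PutUint32(hdr[8:12], uint32(len(data)))
	binary.BigEndian.PutUint32(hdr[12:16], crc32.ChecksumIEEE(data))
	if _, err = extent.Write(hdr); err != nil {
		return 0, err
	}
	if _, err = extent.Write(data); err != nil {
		return 0, err
	}
	return offset, nil
}
```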

The storage engine now parses data on extents during startup, builds an in-memory index, and directly commits new block writes to the SMR disks. Control plane workflows related to disk repair, allocation, and deallocation were also updated to accommodate the new architecture.
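Continuing the same assumed record layout, the following sketch shows how a startup scan could rebuild the in-memory index that maps block IDs to their offsets within an extent. Again, this is an illustration under assumed names, not Dropbox's implementation.

```go
// Sketch of rebuilding the in-memory index by scanning an extent at startup.
package storage

import (
	"encoding/binary"
	"io"
	"os"
)

// buildIndex walks the extent record by record and maps each block ID to the
// offset of its record, so reads can locate blocks without SSD-resident metadata.
func buildIndex(extent *os.File) (map[uint64]int64, error) {
	index := make(map[uint64]int64)
	hdr := make([]byte, 16)
	var offset int64
	for {
		if _, err := extent.ReadAt(hdr, offset); err == io.EOF {
			return index, nil // reached the end of the written region
		} else if err != nil {
			return nil, err
		}
		blockID := binary.BigEndian.Uint64(hdr[0:8])
		length := binary.BigEndian.Uint32(hdr[8:12])
		index[blockID] = offset
		offset += int64(len(hdr)) + int64(length) // skip to the next record
	}
}
```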


Performance and Latency Considerations

Load tests showed promising results, with write throughput increasing by 2-2.5x and p95 latencies improving by 15-20%. Under average load, however, the p95 of last-mile disk write latencies increased by 10-15%, since writing to an SMR disk is slower than writing to an SSD. Because the last-mile disk write is only one stage of the full write path, this translated into just a 2% increase in Magic Pocket's overall end-to-end write latency.

Removing the SSDs also affected certain read operations: the cache of recently written blocks was no longer available, so reads of those blocks saw higher latency. The team deemed this tradeoff acceptable given the benefits gained.

Conclusion

By removing SSD cache disks and adopting a direct write approach to SMR disks, Magic Pocket was able to improve write throughput, enhance reliability, and reduce storage costs. Although there were some tradeoffs in terms of latency for certain operations, the overall performance and efficiency gains made this architectural change worthwhile. Magic Pocket continues to innovate its storage infrastructure to meet the growing demands of its users while maintaining durability and availability guarantees.

Summary

Magic Pocket, Dropbox's storage system, adopted SMR drives in 2017 with SSDs serving as a write-back cache for live writes. As data density increased, the SSD cache became a bottleneck for maximum write throughput, so the architecture was reworked to write directly to SMR disks, which increased write throughput, improved reliability, and reduced storage costs. The storage engine was updated to store block metadata and raw block data inline on the same SMR disk. While latency increased slightly for certain read operations, the advantages outweighed the tradeoffs, and the project's success led to the feature being prepared for production.

