Eliminating Storage Failures in the Cloud

March 5, 2018


Since the advent of disk mirroring more than 35 years ago, data redundancy has been the basic strategy against data loss. That redundancy was extended in the replicated state machine (RSM) clusters popularized by cloud vendors in the early aughts, which are widely used today in scale-out systems of all types.

The idea behind RSM is that running the same state machine on many servers, with the same initial state and the same sequence of inputs, will produce the same outputs. That output will always be correct and available as long as a majority of the servers are functional. A consensus algorithm, such as Paxos, ensures that the state machine logs are kept in sync.
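The determinism property can be illustrated with a minimal Python sketch. It is not taken from the paper: the names `KVStateMachine`, `replay`, and `has_quorum` are hypothetical, and no consensus protocol such as Paxos is implemented. The log is simply assumed to be identical on every replica.

```python
class KVStateMachine:
    """A deterministic key-value state machine (hypothetical example)."""

    def __init__(self):
        # Every replica starts from the same initial state.
        self.state = {}

    def apply(self, command):
        """Apply one (op, key, value) command and return the key's new value."""
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "incr":
            self.state[key] = self.state.get(key, 0) + value
        else:
            raise ValueError(f"unknown op: {op}")
        return self.state[key]


def replay(log, n_replicas=3):
    """Feed the same log, in the same order, to n_replicas fresh replicas."""
    replicas = [KVStateMachine() for _ in range(n_replicas)]
    outputs = [[replica.apply(cmd) for cmd in log] for replica in replicas]
    return replicas, outputs


def has_quorum(functional, total):
    """True when a strict majority of the servers is functional."""
    return functional > total // 2


# Assumption: a consensus layer has already agreed on this log order.
log = [("set", "x", 1), ("incr", "x", 2), ("set", "y", 10)]
replicas, outputs = replay(log)

# Same initial state + same inputs => same outputs on every replica.
print(all(out == outputs[0] for out in outputs))
# A 3-node cluster tolerates one failed server, but not two.
print(has_quorum(2, 3), has_quorum(1, 3))
```

In a real RSM the hard part is what this sketch assumes away: keeping the log identical across servers when messages are lost or, as in the storage faults discussed here, when a replica's on-disk log is corrupted.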

At the USENIX FAST ’18 conference, Ramnatthan Alagappan et al. presented the paper Protocol-Aware Recovery for Consensus-Based Storage, which introduced a new approach to correctly recovering from RSM storage faults. They call it corruption-tolerant replication, or CTRL.
