When Solid State Drives are Not That Solid

43

It looked just like another page in the middle of the night. One of the servers of our search API stopped processing the indexing jobs for an unknown reason. Since we build systems in Algolia for high availability and resiliency, nothing bad was happening. The new API calls were correctly redirected to the rest of the healthy machines in the cluster and the only impact on the service was one woken-up engineer. It was time to find out what was going on.

The NGINX daemon serving all the HTTP(S) communication of our API was up and ready to serve the search queries but the indexing process crashed. Since the indexing process is guarded by supervise, crashing in a loop would have been understandable but a complete crash was not. As it turned out the filesystem was in a read-only mode. All right, let’s assume it was a cosmic ray the filesystem got fixed, files were restored from another healthy server and everything looked fine again.

Read more at Algolia Blog.