Front was unavailable for 10 minutes

Incident Report for Front

Postmortem

On May 14th at 14:10 UTC, our engineering team deployed a configuration change to our infrastructure. The change contained an error that caused some processes to restart continuously, resulting in high CPU usage. The error had not manifested during an earlier partial rollout.

The team was notified immediately and reverted the configuration change. However, because of the high CPU usage, some servers were slow to respond, and the rollback took 11 minutes to execute instead of the expected 1 minute.

Our team has since added a new sanity check to our global configuration to prevent this situation from happening again. We are also reviewing our deployment system to make it faster in degraded conditions.

We are very sorry about this incident. We understand that even relatively brief incidents are very disruptive for our customers.

Posted May 15, 2018 - 14:43 UTC

Resolved

This incident has been resolved.

Posted May 14, 2018 - 14:49 UTC

Monitoring

Front was unavailable for 10 minutes, between 7:10am and 7:20am PDT.
We are monitoring the situation, but everything should be back to normal.

Posted May 14, 2018 - 14:27 UTC