On May 14th at 16:10 UTC our engineering team deployed a configuration change to our infrastructure. The change contained an error that caused some processes to restart continuously, resulting in high CPU usage. The error had not manifested during an earlier partial rollout.
The team was immediately notified and reverted the configuration change. However, because of the high CPU usage, some servers did not respond immediately, and the rollback took 11 minutes to execute instead of the expected 1 minute.
Our team has since added a new sanity check to our global configuration pipeline to prevent this situation from happening again. We are also reviewing our deployment system to make it faster under degraded conditions.
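The post does not describe the sanity check itself, so as a rough illustration only, a pre-deployment check of this kind can be sketched as a validator that rejects a configuration before it ships. Every key name, value range, and the restart-loop rule below are hypothetical, not our actual schema:

```python
def validate_config(config: dict) -> list[str]:
    """Return a list of validation errors; an empty list means safe to roll out.

    All checks below are illustrative placeholders, not our real schema.
    """
    errors = []
    # Hypothetical required keys.
    for key in ("service_name", "restart_policy", "max_workers"):
        if key not in config:
            errors.append(f"missing required key: {key}")
    # Hypothetical sane range for worker count.
    workers = config.get("max_workers")
    if isinstance(workers, int) and not (1 <= workers <= 512):
        errors.append(f"max_workers out of range: {workers}")
    # Guard against restart loops: an "always" restart policy
    # without backoff is exactly the kind of setting that can
    # cause continuous restarts and high CPU usage.
    policy = config.get("restart_policy", {})
    if policy.get("mode") == "always" and "backoff_seconds" not in policy:
        errors.append("restart_policy 'always' requires backoff_seconds")
    return errors


if __name__ == "__main__":
    risky = {
        "service_name": "api",
        "restart_policy": {"mode": "always"},
        "max_workers": 4,
    }
    for err in validate_config(risky):
        print(err)
```

In practice a check like this would run in the deployment pipeline and block the rollout whenever the returned list is non-empty, so a bad configuration never reaches the fleet.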
We are very sorry about this incident. We understand that even relatively brief incidents are very disruptive for our customers.