On Friday, Dec 19, at 14:50 UTC (6:50am PST), customers based in our US-West-2 data center experienced dramatically increased API latency, causing the website to fail to load and messages to queue in the backend. This continued until 18:10 UTC (10:10am PST). No messages were lost during this time, though messages may have appeared in customer inboxes with significant delay. All queued messages were delivered by 21:00 UTC (1:00pm PST).
Customers based in Front’s EU-West-1 and US-West-1 data centers may have experienced some delays during this time, as some systems are interdependent, but this impact was intermittent and uncommon.
The root cause of this issue was the failure of a caching system. Several database systems support the Front application, and a caching layer sits in front of them to improve performance. A recent change increased the size of some objects stored in that cache layer. This was not inherently wrong and had no immediate impact. On Friday the 19th, however, the caching layer in US-West-2 crossed a data-volume threshold that triggered a large number of evictions, including of data that is necessary for most application activity. There was simply no longer enough room in the cache for all the data we needed to store there, so evicted data was repeatedly re-fetched from the databases and re-evicted. Beyond the additional load this put on the databases, the resulting thrashing significantly increased latency for all systems.
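To make that failure mode concrete, here is a minimal sketch of eviction thrashing in a byte-capacity LRU cache. It is illustrative only and assumes an LRU eviction policy; it is not Front's actual cache implementation, and all names in it are hypothetical. The key behavior: once stored objects grow, each insert can evict several hot entries, so subsequent reads miss and fall through to the database.

```python
from collections import OrderedDict

class LRUCache:
    """Byte-capacity LRU cache (hypothetical sketch, not Front's code)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.store = OrderedDict()  # key -> (value, size_bytes)

    def get(self, key):
        if key not in self.store:
            return None  # cache miss: caller falls through to the database
        self.store.move_to_end(key)  # mark as recently used
        return self.store[key][0]

    def put(self, key, value, size):
        if key in self.store:
            self.used -= self.store.pop(key)[1]
        # Evict least-recently-used entries until the new value fits.
        # When object sizes grow, a single insert can evict several
        # hot keys, collapsing the hit rate and shifting load to the
        # databases -- the thrashing described above.
        while self.store and self.used + size > self.capacity:
            _, (_, evicted_size) = self.store.popitem(last=False)
            self.used -= evicted_size
        if size <= self.capacity:
            self.store[key] = (value, size)
            self.used += size
```

With fixed capacity, the number of entries the cache can hold is inversely proportional to average object size: doubling object sizes roughly halves the working set that fits, which is why a gradual size increase can cross a threshold and suddenly evict data the application needs on nearly every request.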