2025-04-29 Application unavailable for some customers

Incident Report for Front

Postmortem

On Tuesday, April 29, Front experienced an application outage lasting roughly four hours for a subset of US-based customers, about 3% of users. The cause was a human error during a planned database maintenance operation, which left one database shard unrecoverable. Front restored this shard from an up-to-date backup and then caught up on any messages missed during the outage, so no data was lost.

Front customers are divided into database shards so that an issue with one shard does not affect the others. Earlier this week, Front upgraded all database shards as a routine procedure. The upgrade itself succeeded with no customer impact, but an error in the cleanup step caused one shard to be irrevocably invalidated. This occurred at 4:16 pm PDT (23:16 UTC).

Front maintains up-to-date database replicas for recovering from this kind of failure, and we initiated a restoration from them. Restoration involves copying the entire database, and therefore took nearly four hours to complete. Once the database was restored at 8:07 pm PDT (03:07 UTC Wednesday), the application became available again, and Front began the next task of applying all changes made since the incident began, such as syncing in emails from Gmail and O365.
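
As a cross-check on the times quoted above, the following minimal Python sketch converts the two PDT timestamps to UTC and computes the time between them. The variable and function names are illustrative only and are not part of any Front tooling; PDT is modeled as a fixed UTC-7 offset.

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset avoids depending on a time zone database.
PDT = timezone(timedelta(hours=-7), "PDT")

# Timestamps taken from this report (names are illustrative).
shard_invalidated = datetime(2025, 4, 29, 16, 16, tzinfo=PDT)
shard_restored = datetime(2025, 4, 29, 20, 7, tzinfo=PDT)


def to_utc_label(moment):
    """Format a timezone-aware datetime as 'HH:MM UTC Weekday'."""
    return moment.astimezone(timezone.utc).strftime("%H:%M UTC %A")


# Elapsed time between invalidation and restoration, in whole minutes.
outage = shard_restored - shard_invalidated
outage_minutes = int(outage.total_seconds() // 60)

print(to_utc_label(shard_invalidated))  # 23:16 UTC Tuesday
print(to_utc_label(shard_restored))     # 03:07 UTC Wednesday
print(f"{outage_minutes // 60} h {outage_minutes % 60} min")  # 3 h 51 min
```

The result, 3 hours 51 minutes, matches the "nearly four hours" stated above.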

Front apologizes to the affected customers for the inconvenience caused by this extended outage. We strive to be always available to customers and have made great investments over the last several years to minimize the possibility and the impact of this kind of error. We can always do more, and are taking the following steps based on lessons from this incident.

We have modified our database upgrade procedures and the permissions granted to the operators to significantly reduce the likelihood of this same error. At the same time, we are accelerating our upgrade frequency, which means we’ll have more practice and more automation with this procedure, further reducing risk. More upgrades also mean faster adoption of performance and security improvements.

Front is also investigating opportunities to improve the recovery time from the backup. Most of the time during this outage was spent waiting for data to be replicated from one storage region to another, and there may be ways to cut that down.

Once again, Front would like to apologize for any disruption caused by this incident and emphasize our commitment to high availability and transparency.

Posted Apr 30, 2025 - 23:11 UTC

Resolved

This incident has been resolved.
Posted Apr 30, 2025 - 05:43 UTC

Monitoring

As of 8:11 pm PDT (03:11 UTC Wednesday), the application is functioning for all customers. Front began applying all changes since the incident started, like syncing messages from email channels, which are now caught up. Customers may experience delays for a few more minutes as caches clear.

Front is now monitoring the application for further issues. All functionality should be restored; please contact support if you continue to see application issues.

[us-west-1] [by_company_00005]
Posted Apr 30, 2025 - 04:09 UTC

Update

The database recovery has completed and application functionality is now restored. Messages from the last few hours since the incident started are now being re-synced. This will take some time to catch up, but we anticipate no data loss as a result of this issue.

[us-west-1] [by_company_00005]
Posted Apr 30, 2025 - 03:13 UTC

Update

The restoration process is still underway, but nearing completion. We are targeting less than 60 minutes to restore limited application functionality.

[us-west-1] [by_company_00005]
Posted Apr 30, 2025 - 01:41 UTC

Identified

At 4:16 pm PDT (23:16 UTC), during a routine maintenance operation on one of Front’s customer databases, the database was inadvertently disabled and made unrecoverable. Front immediately began the restoration procedure from our up-to-date backup, but because it has to copy the entire database, this procedure takes time to complete. We expect this to take roughly two hours, during which time the application will be unavailable for customers on this database.

This issue affects a fraction of Front's customers, but for those customers the application will be unavailable until the restoration is complete.

[us-west-1] [by_company_00005]
Posted Apr 29, 2025 - 23:54 UTC

Investigating

Front is investigating a major database outage affecting one shard of customers.
[us-west-1] [by_company_00005]
Posted Apr 29, 2025 - 23:34 UTC
This incident affected: App.