On September 25, at 11:16 UTC, database replication problems started a chain of events leading to a service incident for Cxense Video customers. Both master databases failed, and the replication processes' inability to synchronize unique item IDs led to duplicated IDs.
11:16 UTC - Two alerts on database replication arrive simultaneously.
13:26 UTC - First report from customers showing that items uploaded to Video Console are stuck in Uninitialized state. Additional reports from customers that audio processing has failed for several sites.
16:16 UTC - Cxense R&D has solved the root cause of the database replication problem and has started processing incoming videos. Due to the database inconsistency, 3000 items were lost. Most of the lost items were reprocessed from RSS feeds, but a portion needed manual work.
16:50 UTC - Indexer agent processes started failing due to bad SQL connections. The video search index was no longer updated. The agents were restarted manually.
18:10 UTC - Customers report that they see items being processed, but that items which should have been processed since the service incident started are missing.
19:01 UTC - The manually restarted indexer agents failed again and were restarted manually once more.
2017-09-26 - Cxense R&D built solutions to resolve the problem with duplicate IDs and to reprocess items lost due to the databases falling out of sync.
Customers experienced a complete stop in processing of video and audio items uploaded between 11:16 and 16:16 UTC.
The service was impacted by frequent failures when indexing new content for an additional 2-3 hours.
During the service incident, we also had reports of items from other customers showing up in Video Console.
Customers also experienced that content uploaded during the outage, or that was in the processing queue when it started, had to be manually reprocessed by Cxense. This took several days.
Root cause and preventive steps
The main reason this database problem turned into a service incident of this magnitude was that the system kept using two database instances after they had fallen out of sync. This caused duplicate IDs, and all systems depending on the database entries received inconsistent data.
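The failure mode described above can be illustrated with a minimal sketch. It is not the Cxense implementation: the `Master` class, its counter-based ID assignment, and the `conflicting_ids` helper are hypothetical, and the sketch only assumes that each master hands out the next sequential ID it knows about.

```python
class Master:
    """Hypothetical master database that assigns sequential item IDs."""

    def __init__(self, name):
        self.name = name
        self.rows = {}      # item ID -> item description
        self.next_id = 1

    def insert(self, item, peer=None):
        """Store an item; replicate it to `peer` when replication is healthy."""
        item_id = self.next_id
        self.rows[item_id] = item
        self.next_id += 1
        if peer is not None:
            # Healthy replication: the peer sees the row and advances its counter.
            peer.rows[item_id] = item
            peer.next_id = max(peer.next_id, item_id + 1)
        return item_id


def conflicting_ids(a, b):
    """Return IDs that exist on both masters but point at different items."""
    shared = a.rows.keys() & b.rows.keys()
    return sorted(i for i in shared if a.rows[i] != b.rows[i])


a, b = Master("a"), Master("b")
a.insert("customer-1 video", peer=b)   # replicated
b.insert("customer-2 audio", peer=a)   # replicated

# Replication stops, but both masters keep accepting writes.
a.insert("customer-1 clip")
b.insert("customer-2 clip")

conflicts = conflicting_ids(a, b)
```

In this sketch both masters assign the same ID to different items once replication stops, so any system that looks up an item by ID gets a different answer depending on which master it queries.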
Cxense R&D is investigating a more robust scheme for assigning IDs and is implementing data consistency between all members of master-master replication. While this work is ongoing, the master-master scheme is disabled (not in use). In addition, we have changed the operational procedure for switching between master databases.
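One common way to make ID assignment robust against a replication outage is to give each master a disjoint range of IDs. The sketch below shows interleaved sequences as an example of such a scheme; it is an assumption for illustration only, not the scheme Cxense R&D selected, and `interleaved_ids` is a hypothetical helper.

```python
import itertools


def interleaved_ids(node_index, node_count):
    """Yield IDs reserved for one master: node_index + 1, then steps of node_count.

    With two masters, node 0 yields 1, 3, 5, ... and node 1 yields 2, 4, 6, ...
    so the sequences never overlap, even if the masters cannot communicate.
    """
    if not 0 <= node_index < node_count:
        raise ValueError("node_index must be in range(node_count)")
    return itertools.count(node_index + 1, node_count)


ids_a = list(itertools.islice(interleaved_ids(0, 2), 5))
ids_b = list(itertools.islice(interleaved_ids(1, 2), 5))
```

Because each master draws from its own sequence, IDs stay unique during a replication outage. The rows themselves still have to be reconciled once replication is restored.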