
2016.02.25 - 17:35 UTC - Service Incident Post Mortem Report

Incident report from Cxense

On February 25, at 17:35 UTC our Domain Name Service for the domains emediate.se and emediate.eu failed.

Incident summary

On February 25, at 17:35 UTC    Monitoring systems indicated high error rates for a DNS server.

On February 25, at 17:45 UTC    Work on recovering the backup DNS server started.

On February 25, at 18:40 UTC    The emediate.se and emediate.eu domains started working locally. However, two booking systems remained offline due to data quality issues.

On February 25, at 20:00 UTC    Service was restored for all booking systems.

Weekend February 26-27          All emediate domain name services were migrated to Amazon Cloud.

Impact

Ad delivery was almost completely unavailable to customers during the periods shown in this table:

Booking system    Start and end (UTC)    Outage duration

Booking 1         18:15-20:45            2h 30 min
Booking 2         18:05-20:20            2h 15 min
Booking 3         18:30-20:05            1h 35 min
Booking 4         18:20-20:05            1h 45 min
Booking 5         18:05-20:15            2h 10 min
Booking 6         18:20-21:40            3h 20 min
Booking 8         18:10-21:45            3h 35 min
Booking 21        18:15-20:05            1h 50 min
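
The outage durations in the table follow directly from the start and end times. As a minimal sketch (not part of the original report), the Python snippet below recomputes them, with times written as HH:MM and assuming every window falls within February 25 UTC. The names `OUTAGES`, `outage_minutes` and `format_duration` are illustrative only.

```python
from datetime import datetime

# Start and end times (UTC, February 25) copied from the impact table.
OUTAGES = {
    "Booking 1": ("18:15", "20:45"),
    "Booking 2": ("18:05", "20:20"),
    "Booking 3": ("18:30", "20:05"),
    "Booking 4": ("18:20", "20:05"),
    "Booking 5": ("18:05", "20:15"),
    "Booking 6": ("18:20", "21:40"),
    "Booking 8": ("18:10", "21:45"),
    "Booking 21": ("18:15", "20:05"),
}


def outage_minutes(start, end):
    """Return the length of a same-day outage window in minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def format_duration(minutes):
    """Format minutes in the table's style, e.g. 150 -> '2h 30 min'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins} min"


# Duration per booking system, in minutes.
durations = {name: outage_minutes(*window) for name, window in OUTAGES.items()}

for name, minutes in durations.items():
    print(f"{name}: {format_duration(minutes)}")

# The booking system with the longest outage window.
longest = max(durations, key=durations.get)
print(f"Longest outage: {longest} ({format_duration(durations[longest])})")
```

Running the snippet prints one line per booking system in the same format as the table's last column, followed by the system with the longest outage.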

 

Root cause

Some time prior to the outage, a master server not directly involved in serving DNS for Cxense Display failed. This left the service in a fragile, but not critical, state. Work started on an orderly migration of DNS for Cxense Display to the external vendor used for other Cxense services. At the time of the incident, another failure caused the domain name service for emediate.se and emediate.eu to fail. Service was restored by correcting the direct cause of the failure, and the planned move to an external provider was expedited to address the fragility issue. Two of the booking servers experienced longer downtime due to data quality issues with the initially restored data.

Preventive measures

The DNS for all our domains related to Emediate/Cxense Display has now been migrated to Cxense's preferred DNS vendor. The domains will be fully served from the cloud provider as the change propagates throughout the Internet, and the legacy system that failed is no longer in service.

Priority 1 incident reporting

Cxense Display customers will be given an email address for escalating tickets created outside business hours for Severity 1 incidents. This escalation email is routed to a phone backend, which will allow engineers to start working on tickets during outages.
