2017-07-20 - Incident report from Cxense

 

Dear Customer,

Cxense Video services were unavailable due to a service incident at one of our data center providers. The impacted services were unavailable or had increased latency between 04:00 UTC and 10:32 UTC on July 20th. Traffic collection data and the API for Cxense Insight, Search and DMP were not affected. Cxense R&D has reviewed the situation, checked the actions taken during the service incident, and assessed what we could have done differently.

Service Incident Timeline

Time (UTC)   Description
04:00        SoftLayer initiated “emergency maintenance” for the network in the affected region
04:33        Cxense's private network became unreachable for 17 of our servers in SoftLayer
05:14        Analytics API traffic routed to EMEA and Japan data centers
05:18        First customer complaint, from a video customer in Japan
05:29        Adserver traffic routed away from the US data center
...          Cxense was unable to reconfigure Video services to route traffic around the parts of the network that were unreachable
07:00        Root cause fixed by SoftLayer, connectivity restored
07:06        Video services restored, except Video Search
07:32        Analytics API and Adserver traffic routed back to the US data center
10:32        Video Search restored after restart of all FAST clients

 

Impact of the incident

  • Video services down, both read and write
  • Analytics Cube clusters partly unreachable in the US data center
  • API and Adserver traffic was re-routed to DE & TK (the EMEA and Japan data centers) so that Analytics requests were served with increased latency instead of timing out

 

Root cause and remediation

SoftLayer's network problems took out the private network interconnecting 17 of our servers. The service incident affected our Video services the most, as almost all the servers that lost network connectivity were used by Cxense Video.

Cxense R&D is assessing whether we should ask SoftLayer to partition the servers they provide in a way that makes our setup more robust against network outages. In this case, the SoftLayer service incident did not span their entire data center, but it still affected most of Cxense's servers there.

Cxense R&D is also changing the way we monitor the availability of the servers hosted by SoftLayer.
