Tuesday, May 09, 2006

Running around in circles.

With locations in the US and Europe, a customer had chronic speed problems in Europe. E-Mail was slow, adding attachments resulted in time-outs, and queries could take half an hour, often timing out as a result.

Technical stuff (skip ahead if you're not interested):
No Web proxy.
The US and Germany have their own IPs; the UK connects through them.
Available bandwidth: the US has a dedicated T1, Germany has a 2MB DSL line, and the UK has 3MB of "Internet Access".

Germany's Trace Route:
2 * * * Zeitberschreitung der Anforderung.
3 * * * Zeitberschreitung der Anforderung.
4 * * * Zeitberschreitung der Anforderung.

etc.

Germany's Ping:
Reply from x.x.x.132: bytes=128 time=205ms TTL=226
Reply from x.x.x.132: bytes=128 time=201ms TTL=226
Request timed out.
Request timed out.
...
Request timed out.
Request timed out.
Reply from x.x.x.132: bytes=128 time=202ms TTL=226
Reply from x.x.x.132: bytes=128 time=173ms TTL=235
Request timed out.
...
Request timed out.
Request timed out.
Reply from x.x.x.132: bytes=128 time=404ms TTL=234
Reply from x.x.x.132: bytes=128 time=398ms TTL=234
Reply from x.x.x.132: bytes=128 time=404ms TTL=234
Reply from x.x.x.132: bytes=128 time=398ms TTL=234
...
Reply from x.x.x.250: bytes=56 (sent 128) time=3570ms TTL=228
Reply from x.x.x.250: bytes=56 (sent 128) time=191ms TTL=228
Reply from x.x.x.250: bytes=56 (sent 128) time=192ms TTL=228
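You can already see the pattern just by eyeballing that: bursts of total loss, then replies at wildly swinging round-trip times. For what it's worth, here's a rough sketch of how you might tally loss and RTT from a saved ping log like the one above; the filename and the summary format are mine, not anything the customer actually ran:

    import re
    import sys

    # Tally replies and time-outs from saved Windows-style ping output, e.g.
    #   "Reply from x.x.x.132: bytes=128 time=205ms TTL=226"
    #   "Request timed out."
    rtts = []
    timeouts = 0
    with open(sys.argv[1]) as log:          # e.g. ping-germany.txt (hypothetical)
        for line in log:
            m = re.search(r"time[=<](\d+)ms", line)
            if m:
                rtts.append(int(m.group(1)))
            elif "timed out" in line:
                timeouts += 1

    total = len(rtts) + timeouts
    if total:
        print("%d packets, %d lost (%.0f%% loss)" % (total, timeouts, 100.0 * timeouts / total))
    if rtts:
        print("RTT min/avg/max = %d/%d/%d ms" % (min(rtts), sum(rtts) // len(rtts), max(rtts)))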


That's some serious delay going on, and it got worse as soon as the US came on-line. Whatever the Europeans wanted to get done, they needed to be finished with it by around 2:00 p.m.

We got a network diagram and information on their settings, and everything looked kosher. There were no 404s, just time-outs and really slow response times. The packet loss was so bad, in fact, that they'd often get "Cannot find server or DNS Error" messages. Basic look-ups (many of which you'd expect to have been cached) were failing.

After two weeks spent ruling out hardware errors, we started inspecting the application servers. More people were brought in. We set logging to the highest level possible and painfully checked through megs and megs of information. Surprisingly enough, despite all the extra logging (normally not done during production hours), the system got no worse.

There was also no answer in sight.

We turned to the database. Even more people were brought in. The generated SQL and database activity were examined, and everything looked fine, at least for those requests which actually made it to the database. Many didn't.

At least, not until around 4:30 p.m. Eastern, when the system would slowly come back from the dead as people started going home.

After more than a month with no results, we tore into the data being sent over the network. Snort didn't detect any DoS attacks, and Ethereal showed that the highest-volume hits were to music-related sites. After being informed of this, the customer installed WebTrends to get a better picture of what was going on. Still, even if everyone had been streaming music, that could have slowed the network down, but it couldn't have caused the problems we saw.

So we logged every single packet of data. The Ethereal logs spanned gigabytes.

Meanwhile, the network was getting a little better. Management sent out a note that WebTrends had seen some things they didn't like and that anyone caught downloading music would be fired. An awful lot of their employees went straight to Add/Remove Programs that afternoon. But the problems continued.

This company had barely been able to work for the past six weeks, and we were running out of options.

More traffic analysis showed that a full 5% of all traffic was going from one machine to www.deviantart.com. That was second only to their mail server (with 6.5%). It got stranger: 30% of all incoming connections were coming from adelphia.net. They'd stopped downloading and started streaming. Clever lusers.
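If you're wondering how you get percentages like that out of gigabytes of capture files, it's basically a per-destination byte tally. Here's a minimal sketch using Python and scapy; the capture filename is made up, and this is my own approximation of the analysis, not the exact Ethereal workflow we clicked through:

    from collections import defaultdict

    from scapy.all import IP, PcapReader

    # Stream a (potentially huge) capture and tally bytes by destination IP,
    # then print each destination's share of total traffic.
    bytes_by_dst = defaultdict(int)
    total_bytes = 0

    for pkt in PcapReader("capture.pcap"):      # hypothetical capture file
        if IP in pkt:
            size = len(pkt)
            bytes_by_dst[pkt[IP].dst] += size
            total_bytes += size

    top = sorted(bytes_by_dst.items(), key=lambda kv: kv[1], reverse=True)[:10]
    for dst, count in top:
        print("%-15s %12d bytes  %5.1f%%" % (dst, count, 100.0 * count / total_bytes))

Run the same tally keyed on source addresses instead and you get the inbound picture, which is where the adelphia.net connections showed up.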

Our application accounted for only 3% of network traffic. Adelphia and a few other sites were firewalled, and things picked up a bit. There were still constant delay problems, but the time-outs had mostly disappeared.

Work had been going on for a full two months, non-stop, and it didn't seem like we were ever going to figure this out. And then we got an E-Mail:
You can close this case. Thanks to $EtherealLogReaderGuy's help we discovered that they had some jokers who were on partypoker.com most of the day. They've been fired and US performance is OK.
Nine fucking weeks. Twenty-eight gigabytes of logs. More than a dozen of our people. All because two utter fucktards were playing poker on some central servers.

Root Cause: 17-Fuckwit


In compliance with $MegaCorp's general policies as well as my desire to
continue living under a roof and not the sky or a bus shelter, I add this:

DISCLAIMER:
The views expressed on this blog are my own and
do not necessarily reflect the views of $MegaCorp, even if every
single one of my cow-orkers who has discovered this blog agrees with me
and would also like to see the implementation of Root Cause: 17-Fuckwit.