Wednesday, September 19, 2007

Similar Symptoms

We're a big company. The software support section of the $OurBigApp division is some 450 strong, spread around the world. My functional team comprises some 50 poor saps who each take, on average, more than 30 tickets a month. The chances of the same person even seeing your ticket in the queue are slim. The chances of the same person getting your new ticket are generally below 10%.

Ticket numbers are 17 alphanumerics long, not the sort of thing you commit to memory unless you're a savant doing the talk show rounds. Idiots we may be, but most of the monkeys have yet to demonstrate their savant. Submitting a ticket which reads "RE: Ticket A-40001H-4F09KC2Y. We need further solution" isn't necessarily the method to employ if you want an answer before Thanksgiving.

Someone at $CaliDesign employed this method and was shocked that no answer was forthcoming within ten minutes of submission. "Files are disappearing! Records aren't being stored! The sky is falling! ESCALATE!!1!11"

I had the misfortune of being the Duty Monkey when the escalation came in, so I was forced to track down the old ticket, read all the way through it and summarise the problem. The monkey who handled that ticket has moved on to bigger paychecks and less stress, doing basically the same job but with less than a quarter of the workload. I could do the same if I were willing to move to Indianapolis. There's a reason even the residents call it "India-no-place".

So back to $CaliDesign. The problem they're having has become a sort of specialty of mine, so I ended up taking the damned ticket, agreeing to do so after waiting for management to demand I quickly find a monkey. My "willingness to assist the team" pleased Vera greatly. I'm learning.

It being a known problem, I sent a standard response: Do $A, check $B, is $OSfolder missing?, test $C and send me results.

"This is a Big Problem! $OSfolder is there but files keep failing!!"

Huh? The OS folder isn't being deleted? We haven't seen that before. Time for actual diagnostics. "OK, Sweetie. Drop FileMon on the server, then check the free RAM and try to add a file that's at least twice that size. Try $foo and $bar while we're at it. Oh, and send me some logs."

"We are trying to reproduce the issue in Test env. Also, we will test the work arounds and find out whether that address our issue. I will post further updates after testing these scenarios. Please hold our ticket open while we test."

A month went by, but every time I sent a request that she update the ticket, I got the same reply back. Finally an update came:

"We tried to use 800 MB file and still could not reproduce the issue. We are making further efforts to reproduce the issue. Pls wait for update"

Another two weeks passed and she wrote, "We have uploaded 800 MB file with "Add File" botton and could not reproduce the issue. The file got uploaded and succesfully. Our business owners are pressing on this to know the root cause and a fix! Please suggest next steps!

"P.S This is the 2nd occurance in production with in an year."

What the...? I tried a different tack: "Is this happening only on certain machines? Only for certain user groups?"

During the week I waited for an answer to this oh-so-urgent matter which had since been re-escalated, I'd occasionally see the ticket in my queue and cock my head slightly. There was something there but I wasn't sure what. Then it hit me...

"Are you able to reproduce this problem? If not it would appear to be an anomaly which could have been caused by any number of reasons, from a disk glitch to a network interruption to a session or activity time-out. It's next to impossible to pinpoint such one-off errors.

"If you can reproduce it then we need the server logs as well as the Event Viewer Application and System logs for the servers and client."

Two weeks later she let me know she'd upload the logs. Six weeks later she did so. Fifteen megs, uncompressed. Thank fuck for grep.

I found the errors I was looking for and explained each one of them. To each of them she said, "No, it's not that. We got shown an error message for that one."

I used the phone to get the damned name of the file which actually failed and only spent 20 minutes on hold. It wasn't referenced in the logs. This is a head-scratcher.

This morning I received an update:

"it occured twice in two years period. I can say it as intermittent. We have not seen the occurance since Jun 07. This is occcuring on different users. Not specific to one user. I have a question though Can a network glitch at the time of file upload be responsible for the disappearance of file?"

This Huge Problem deserving Much Monkey Attention and repeated escalation has occurred exactly twice in two years, both times when there was a "network glitch". And they're absolutely certain it's our fault. Makes perfect sense.

Armed with this new style of logic, the next time I visit my doctor with stomach pain and nausea I'll be sure to demand she first check me for anisakiasis, chikungunya and Ebola and ignore the question if she asks me how many weeks last night's shrimp cocktail had been sitting in the fridge.

Seventeen. Thanks for jacking up my average time to close, Sweetie. The clueless fucksticks here at $MegaCorp have no idea about statistics and refuse to throw out datapoints which lie some 14 sigma outside the fucking norm.



In compliance with $MegaCorp's general policies as well as my desire to
continue living under a roof and not the sky or a bus shelter, I add this:

The views expressed on this blog are my own and
do not necessarily reflect the views of $MegaCorp, even if every
single one of my cow-orkers who has discovered this blog agrees with me
and would also like to see the implementation of Root Cause: 17-Fuckwit.