Luckily it was caught in user acceptance testing on a copy of the working database.
$CompCorp were seeing corruptions all over the place. Where Arabic letters should have been were, instead, lots of screwy strings like the title of this entry which meant Unicode was involved. I asked for screenshots, dumps and exports. They'd escalated this internally to their VP before bothering to submit the ticket so that as soon as I took it, Mr VeePee escalated it here with us. Despite sending answers within hours of receiving their updates I still got jumped on by their people and ours insisting I wasn't working fast enough and that my solutions sucked.
I got very lucky with additional info they sent me: the display in $OurBigApp had multiple lines in Arabic and English and only some of these were corrupt. I finally had my good and bad pieces. Better still, I knew the primary keys for these rows. "Please send me the results of 'SELECT PriKey, Desc FROM T1.xyz WHERE PriKey IN ($nn-foo, $nn-bar);'." It took them a day to get back to me but that didn't stop their managers howling Monday afternoon and screaming for my head.
When the results came back Tuesday I was at first confused. Where text appeared corrupt in the application, it looked perfect in the SQL client, but where it was fine in the application, it came out as garbage in the SQL client. The latter is normal since the client isn't Unicode-compliant, and sure enough, changing their codepage got the data to display correctly in SQL dumps.
I talked about the problem with the other two guys in I18N who might know but all we could think of was "fonts", though this couldn't be it since I had both good and bad displaying in all circumstances reproducibly. The my-head-shaped dent in my desk grew ever so slightly.
Then it finally clicked. I called and asked if some of the data had been imported. They bitched about how long it was taking before finally saying that much of it had been. And no, they hadn't noticed that only imported data appeared corrupt. I got the specs on their database and didn't know whether to laugh or cry.
Their admin had been clever. Very clever. We only supported codepage 1252 or ISO8859P1 but these are only for Western European characters, not Arabic which uses either 1256 or ISO8859P6. Nevertheless the admin managed to get the $CompCorp system running with Arabic text thanks to Windows' "helpfulness". That's fine as long as you're isolated and insulated. Moving to a Unicode database was a good move but broke the insulation.
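For anyone who wants to see why the "insulated" setup held together, here's a rough Python sketch. It assumes the clients ran Windows codepage 1256 and the database was declared as ISO 8859-1; the function names are mine, made up for illustration, and have nothing to do with $OurBigApp's actual code.

```python
# Sketch only: CLIENT_CODEPAGE and DB_CHARSET are assumptions about the setup,
# and the helper names are hypothetical.
CLIENT_CODEPAGE = "cp1256"   # Arabic Windows clients
DB_CHARSET = "latin-1"       # what the database claimed to hold (ISO 8859-1)

def client_write(text: str) -> bytes:
    """The client turns what the user typed into bytes using its own codepage."""
    return text.encode(CLIENT_CODEPAGE)

def client_read(stored: bytes) -> str:
    """The client interprets stored bytes with the same codepage it wrote with."""
    return stored.decode(CLIENT_CODEPAGE)

def database_view(stored: bytes) -> str:
    """What the database believes those same bytes mean."""
    return stored.decode(DB_CHARSET)

thal = "\u0630"                      # ARABIC LETTER THAL
stored = client_write(thal)          # a single byte, 0xD0

# Every client agrees on the round trip, so users see Arabic...
round_trip = client_read(stored)
# ...while the database is convinced it is holding a Western European letter.
db_opinion = database_view(stored)
```

As long as every reader and writer shares the same wrong assumption, the bytes survive untouched and nobody notices. The trouble only starts when something trusts the database's label.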
I sent a copy of the solution around to our I18N people and responses were along the lines of "Holy shit." That their database isn't completely corrupted due to Windows' internal use of Unicode amazes us all.
When you typed the letter "thal" (character 0xD0) what was saved was actually an Icelandic "eth" (Ð). You then moved this data into a Unicode database. Translation was done during the move from Western European -- not Arabic -- so that the raw data for the word "green" (spelled: seen, beh, zain) was seen by the system as 0xD3, 0xC8, 0xD2. In the 8859P1 code page these three characters are "ÓÈÒ". This is how the corruption took place.
The reason you saw the data "correctly" was that Windows converted the characters to Arabic based on the codepage you were using on the clients, ignoring the database and the indication that these were Western European characters.
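The explanation above can be played back in a few lines of Python. This is a sketch under the same assumptions (codepage 1256 clients, a database labelled ISO 8859-1), with hypothetical function names. The `repair` step is my own illustration of how such data could in principle be undone, not a description of what $CompCorp actually ran.

```python
# Sketch only: helper names are hypothetical, and repair() is an illustration,
# not the fix that was actually applied.

def migrate_to_unicode(stored: bytes) -> str:
    """The move to a Unicode database trusts the Western European label."""
    return stored.decode("latin-1")

def repair(corrupted: str) -> str:
    """Undo the mislabelling: recover the original bytes, then read them as Arabic."""
    return corrupted.encode("latin-1").decode("cp1256")

word = "\u0633\u0628\u0632"            # seen, beh, zain
stored = word.encode("cp1256")         # the bytes 0xD3, 0xC8, 0xD2

corrupted = migrate_to_unicode(stored) # three accented Western European capitals
as_utf8 = corrupted.encode("utf-8")    # now stored "correctly" as the wrong text

fixed = repair(corrupted)              # back to the Arabic the user typed
```

The catch is that `repair` only works when the wrong label really was ISO 8859-1, where every byte maps to some character. If the database was declared as codepage 1252 instead, a handful of byte values have no mapping there, so a real cleanup would need more care than this.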
Two days later and still no update. No thank-you. No confirmation. Nothing. I expect they'll also slam me on the survey for having taken too long to solve a problem which by their own admission, my cow-orkers never would've seen.
The title of this entry is corrupted just like $CompCorp's data. It should read هذا منيك (Hetha Mnäyik): "This is bullshit!"
x-posted from HuSi, sans poll