Help! Getting some Chinese characters back from remaining code

Once upon a time, when some changes were made to the server hosting this blog, I lost the ability to display certain characters, including Chinese characters.

Hundreds of entries had words that were turned into gobbledygook.

I can’t tell you how irritating that was.

Do any of you who know Chinese know how to change the code left behind back into Chinese characters?

Help!

About Fr. John Zuhlsdorf

Fr. Z is the guy who runs this blog. o{]:¬)
This entry was posted in SESSIUNCULA.

29 Comments

  1. Dr. Eric says:

    Dui bu qi. No I don’t. :-(

  2. MenTaLguY says:

    I imagine it’s an encoding issue. The encoding that pages on the blog are being served with is currently UTF-8; I don’t know what it might have been before.

    In Firefox, at least, you should be able to go to one of the damaged pages, go to View > Character Encoding from the menu, and try different encodings until you find the right one (though you will need to know what one or two of the characters are supposed to be to make sure you’ve got the right one).

    At that point you should be able to copy the text out and paste it into an editing form in a separate tab/window that is still set to UTF-8 in order to convert the text.

  3. MenTaLguY says:

    Hm, unfortunately it looks like the text of entries may actually be corrupted in the database and not simply transmitted with the wrong encoding, though it wouldn’t hurt to get a second opinion.

  4. Timbot says:

    If the pages in question still do not display correctly with either UTF-8 or GB2312 then the flat files themselves have been corrupted and you are lost.

  5. Let me try to type some Chinese here….

    Father Z

    Z ??

  6. O! It didn’t work! I typed “Father” in Chinese but “??” appeared instead.

  7. Fr. Ho: Yes… I know. I will see what can be done at the level of the server.

  8. David C says:

    Here’s the problem: the original UTF-8 Chinese text was misinterpreted as though it were in Windows encoding (CP1256) and then “converted” into UTF-8. This happened *twice*. To recover the Chinese text, apply the reverse transformation. I only know how to do this under Unix using uconv (part of the ICU package from IBM):

    cat garbled-file | uconv -f utf8 -t cp1252 | uconv -f utf8 -t cp1252
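For readers without uconv, the same two-pass reversal can be sketched in Python. This is only an illustration, assuming the damage really is two rounds of UTF-8 being misread as cp1252 (see the correction in comment 12). The function names `mangle` and `unmangle` are made up for this example, and Python's cp1252 codec rejects the five byte values that Windows-1252 leaves undefined, so results may differ from uconv for some characters.

```python
def mangle(text: str, rounds: int = 2) -> str:
    """Reproduce the damage: UTF-8 bytes misread as cp1252, `rounds` times."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp1252")
    return text


def unmangle(text: str, rounds: int = 2) -> str:
    """Reverse the damage: re-encode as cp1252, then read the bytes as UTF-8."""
    for _ in range(rounds):
        text = text.encode("cp1252").decode("utf-8")
    return text


if __name__ == "__main__":
    # Sample word ("Father", as in priest), picked for this sketch because
    # all of its bytes are defined in Python's cp1252 codec.
    original = "\u795e\u7236"
    garbled = mangle(original)
    print(garbled)            # the gobbledygook as it would appear on the page
    print(unmangle(garbled))  # the recovered Chinese text
```

Each pass of `unmangle` corresponds to one `uconv -f utf8 -t cp1252` stage in the pipeline: the garbled characters are turned back into the bytes they were wrongly read from, and those bytes are then read as UTF-8.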

  9. David C: I am not sure I entirely understand that, but I can pass that along.

  10. David C says:

    Father, it’s as if someone found a scrap of paper that said “sex dies”, mistakenly thought it was in English, and translated it into Latin “sexus moritur”.

  11. wsxyz says:

    This must be why your Korean text has also been mangled for some time.

    Hooray for Windows!

  12. David C says:

    Apologies, I wrote “CP1256” above when I should have written “CP1252.” Father, if the recipe above doesn’t work for your admin, I’d be happy to ungarble the files if they can be sent to me conveniently.

  13. MenTaLguY says:

    I can confirm that David’s recipe works (with the substitution of cp1252), even for sequences of Chinese text copy-and-pasted directly from the blog. However, it doesn’t always appear to work when there are runs of English text interspersed; it looks like the runs of Chinese text may need to be converted individually, and a handful of them appear to have been irreversibly mangled.
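One way to convert the runs individually, sketched in Python. This is a guess at a workable approach, not the script anyone in this thread actually used: it assumes every garbled run consists only of non-ASCII characters, that the damage is two rounds of UTF-8 misread as cp1252, and that a run which fails to convert should be left alone. The names `repair_run` and `repair_text` are hypothetical.

```python
import re

# A "run" here is any stretch of consecutive non-ASCII characters.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def repair_run(run: str, rounds: int = 2) -> str:
    """Undo `rounds` of UTF-8-read-as-cp1252 on one run, or return it unchanged."""
    fixed = run
    try:
        for _ in range(rounds):
            fixed = fixed.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Not reversible this way (e.g. a genuine smart quote, or a run
        # that lost bytes along the way): leave it for a human to inspect.
        return run
    return fixed


def repair_text(text: str) -> str:
    """Repair each run of non-ASCII characters on its own; ASCII is untouched."""
    return NON_ASCII_RUN.sub(lambda match: repair_run(match.group(0)), text)
```

Because each run is tried separately, one unconvertible character no longer makes the whole entry fail. A run that converts without error can still be wrong, so the output needs checking by someone who reads the language.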

  14. MenTaLguY says:

    Here is the recovered text (using David’s method) from Father’s post of a couple years ago:

    ???? !
    Gongxi facai!
    Happy New Year!

    Today I am heading off to what I am sure will be a marvelous celebration hosted by the Chinese Ambassador to the Holy See. I will give you a report later.
    ?????

    (We’ll see if this actually shows up correctly in the comments section — if it does, Father Ho’s problem may be that his browser is using an encoding other than UTF-8 for posting.)

  15. MenTaLguY says:

    Nope. Evidently some things are still(?) misconfigured on the server side with respect to character encodings. Oh well.

  16. Willebrord says:

    Just out of curiosity: how often exactly do you post in Chinese, Father?

  17. Peter says:

    Catholic terms in Chinese: Catholic – ???; Father – ??; Nun – ??; Sacrament of Confession – ????.

  18. Peter says:

    Chinese characters do not work. Now I will try Korean. Catholic words in Korean: Catholic – ???; Father – ??; Sister – ??; Sacrament of Confession – ????

  19. MenTaLguY says:

    Hm, let’s try accented European characters: áéíóúýñ àèìòù?ñ

  20. MenTaLguY says:

    It looks like, for posted comments, any character not covered in cp1252 is getting smashed into a ?. If “smart quotes” work, that pretty much confirms it. The really maddening thing is that all these characters show up fine in previews.

  21. MenTaLguY says:

    Yep. New comments are being shoehorned into cp1252 encoding, losing most non-European characters (and some European ones).
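The behaviour described in the last few comments can be imitated in Python. This is a sketch of the symptom only, assuming the comment system does something equivalent to a cp1252 conversion with "?" as the replacement character; what the server actually runs is unknown. The function name `squash_to_cp1252` and the sample strings are invented for the illustration.

```python
def squash_to_cp1252(text: str) -> str:
    """Round-trip text through cp1252, replacing unrepresentable characters with '?'."""
    return text.encode("cp1252", errors="replace").decode("cp1252")


if __name__ == "__main__":
    samples = [
        "\u5929\u4e3b\u6559",                        # a three-character Chinese word
        "\u00e1\u00e9\u00ed\u00f3\u00fa\u00fd\u00f1",  # accented Latin letters
        "\u201csmart quotes\u201d",                  # curly quotes, which cp1252 includes
    ]
    for sample in samples:
        print(sample, "->", squash_to_cp1252(sample))
```

Each character missing from cp1252 becomes exactly one "?", which is consistent with the number of question marks in the test comments above, while smart quotes and most Western European accents pass through unchanged.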

  22. David C says:

    MenTaLguY: but the recovered characters displayed correctly before going into the comments system, yes?

  23. David C says:

    MenTaLguY: sorry, I missed one of your comments which answers my question. Can you point me to some posts that don’t convert correctly? Thanks.

  24. teresa says:

    Dear Father,

    I am Chinese and I hope I can help you.

    But where are the passages? Perhaps you can send them to me by email, and I will try to get them back into Chinese characters. If they are not corrupted, they can be converted again.

    If you can tell me the English meaning of the passages, I can translate them back into Chinese, type the text, and send it to you by email.

    And I will be happy to help you with these kinds of issues later too.

    Just write me an email.
    I am sending you an email to tell you my email address.

    Greetings.

  25. Roland de Chanson says:

    Just as a test:

    ???? ???, ??? ??? ?? ???????. (Russian)

    ????? ????, ? ?? ???? ????????. (Greek)

    O?e naš, koji si na nebesima. (Serbian, Latin characters)

    Ojcze nasz, który? jest w niebe. (Polish)

    Pater noster, qui es in caelis. (Latin)

    All of them appear correctly in the preview. Now for the post …

  26. MenTaLguY says:

    David: yes, the recovered comments show up correctly via preview. The post I linked to from which I recovered text actually doesn’t convert completely (try the title), at least partly due to the presence of some 8-bit characters (i.e. smart quotes). Also try some of the comments on that same post. In some cases it looks like the transformation is not reversible.

    (Because it appears that the transformation is lossy, I’d definitely recommend going over any recovered text with a Chinese speaker to make sure it hasn’t been mangled to say something weird…)

  27. MenTaLguY says:

    Roland: to save yourself some trouble, it appears that any character not on this chart won’t work.

    In any case (this is a _guess_; I’m not privy to the technical details of the site), it looks like one of the things which will need to be done is to change the database encoding from Microsoft cp1252, which can’t represent most international characters, to UTF-8. This will involve converting all of the data there from cp1252 to UTF-8. At that point it’s just a matter of making sure that WordPress is storing things with the correct encoding for the database (which it should do automatically, I think, but it wouldn’t hurt to check). That should take care of addressing problems for new posts and comments.

    As far as the older posts, it would probably be a good idea to save the unconverted database, as converting the database to support international characters might mangle the existing mangled data further.

    I suspect the reason that things used to work is that the software was simply passing things through without paying attention to character encodings, so things went in and out unconverted and usually happened to work. Then at some point updates to the infrastructure meant that parts of the system started paying proper attention to character encodings, which unfortunately meant that (since different parts of the system disagreed about character encodings, and data had already been stored in mismatched encodings) things started getting mangled.

  28. This is useful for me, folks. Thanks.

  29. David C says:

    Father, I’ve sent you an email with a hopefully robust solution, a script that attempts to circumvent the problems MenTaLguY has pointed out. I absolutely agree that you should save a backup copy!

Comments are closed.