Help! Getting some Chinese characters back from remaining code

Posted on 20 February 2009 by Fr. John Zuhlsdorf

Once upon a time, when some changes were made to the server hosting this blog, I lost the ability to display certain characters, including Chinese characters.

Hundreds of entries had words that were turned into gobbledygook.

I can’t tell you how irritating that was.

Do any of you who know Chinese know how to change the code left behind back into Chinese characters?

Help!


29 Comments

Dr. Eric says:

20 February 2009 at 12:32 PM

Dui bu qi. No I don’t. :-(
MenTaLguY says:

20 February 2009 at 12:37 PM

I imagine it’s an encoding issue. The encoding that pages on the blog are being served with is currently UTF-8; I don’t know what it might have been before.

In Firefox, at least, you should be able to go to one of the damaged pages, go to View > Character Encoding from the menu, and try different encodings until you find the right one (though you will need to know what one or two of the characters are supposed to be to make sure you’ve got the right one).

At that point you should be able to copy the text out and paste it into an editing form in a separate tab/window that is still set to UTF-8 in order to convert the text.
MenTaLguY says:

20 February 2009 at 1:06 PM

Hm, unfortunately it looks like the text of entries may actually be corrupted in the database and not simply transmitted with the wrong encoding, though it wouldn’t hurt to get a second opinion.
Timbot says:

20 February 2009 at 1:08 PM

If the pages in question still do not display correctly with either UTF-8 or GB2312 then the flat files themselves have been corrupted and you are lost.
Father Anthony Ho says:

20 February 2009 at 1:31 PM

Let me try to type some Chinese here….

Father Z

Z ??
Father Anthony Ho says:

20 February 2009 at 1:32 PM

O! It didn’t work! I typed “Father” in Chinese but “??” appeared instead.
Fr. John Zuhlsdorf says:

20 February 2009 at 1:40 PM

Fr. Ho: Yes… I know. I will see what can be done at the level of the server.
David C says:

20 February 2009 at 1:58 PM

Here’s the problem: the original UTF-8 Chinese text was misinterpreted as though it were in Windows encoding (CP1256) and then “converted” into UTF-8. This happened *twice*. To recover the Chinese text, apply the reverse transformation. I only know how to do this under Unix using uconv (part of the ICU package from IBM):

cat garbled-file | uconv -f utf8 -t cp1252 | uconv -f utf8 -t cp1252
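
For anyone without uconv at hand, the same two-pass reversal can be sketched in Python. This is only an illustration under the assumption stated above (UTF-8 bytes misread as cp1252, twice); the names `garble` and `ungarble` and the sample word are made up for the example.

```python
def garble(text: str, rounds: int = 2) -> str:
    """Simulate the damage: UTF-8 bytes misread as cp1252, `rounds` times."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp1252")
    return text


def ungarble(text: str, rounds: int = 2) -> str:
    """Reverse the damage: write the text out as cp1252, read the bytes as UTF-8."""
    for _ in range(rounds):
        text = text.encode("cp1252").decode("utf-8")
    return text


# Hypothetical sample: the two characters for "Father" (U+795E U+7236).
sample = "\u795e\u7236"
damaged = garble(sample)          # 12 accented Latin characters and symbols
assert ungarble(damaged) == sample
```

Plain ASCII passes through both functions unchanged, since its bytes are the same in both encodings. Python's strict cp1252 codec raises an error on any character or byte it cannot map, so text that does not fit the assumption fails loudly instead of being silently altered.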
Fr. John Zuhlsdorf says:

20 February 2009 at 2:17 PM

David C: I am not sure I entirely understand that, but I can pass that along.
David C says:

20 February 2009 at 2:46 PM

Father, it’s as if someone found a scrap of paper that said “sex dies”, mistakenly thought it was in English, and translated it into Latin “sexus moritur”.
wsxyz says:

20 February 2009 at 3:16 PM

This must be why your Korean text has also been mangled for some time.

Hooray for Windows!
David C says:

20 February 2009 at 3:44 PM

Apologies, I wrote “CP1256” above when I should have written “CP1252.” Father, if the recipe above doesn’t work for your admin, I’d be happy to ungarble the files if they can be sent to me conveniently.
MenTaLguY says:

20 February 2009 at 8:46 PM

I can confirm that David’s recipe works (with the substitution of cp1252), even for sequences of Chinese text copy-and-pasted directly from the blog. However, it doesn’t always appear to work when there are runs of English text interspersed; it looks like the runs of Chinese text may need to be converted individually, and a handful of them appear to have been irreversibly mangled.
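
One way to convert the runs individually can be sketched in Python. This is a rough sketch, not the method anyone in this thread actually used: it assumes the damaged text shows up as runs of non-ASCII characters, tries the two-pass cp1252/UTF-8 reversal on each run, and leaves a run untouched if the reversal fails. The names `ungarble_run` and `ungarble_mixed` are hypothetical.

```python
import re

# A run of one or more non-ASCII characters.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def ungarble_run(run: str, rounds: int = 2) -> str:
    """Undo `rounds` passes of 'UTF-8 read as cp1252'; return the run unchanged on failure."""
    fixed = run
    try:
        for _ in range(rounds):
            fixed = fixed.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return run  # not reversible this way, so keep the original characters
    return fixed


def ungarble_mixed(text: str, rounds: int = 2) -> str:
    """Apply ungarble_run to every non-ASCII run, leaving ASCII text alone."""
    return NON_ASCII_RUN.sub(lambda m: ungarble_run(m.group(0), rounds), text)
```

A genuine smart quote or accented letter sitting on its own fails the reversal and is kept as it is. A run that mixes such a character with damaged text fails as a whole, so it would still need attention by hand. Python's cp1252 codec also has no mapping for bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D, so any character whose UTF-8 form contains one of them cannot be recovered by this sketch.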
MenTaLguY says:

20 February 2009 at 8:55 PM

Here is the recovered text (using David’s method) from Father’s post of a couple years ago:

???? !
Gongxi facai!
Happy New Year!
Today I am heading off to what I am sure will be a marvelous celebration hosted by the Chinese Ambassador to the Holy See. I will give you a report later.
?????

(We’ll see if this actually shows up correctly in the comments section — if it does, Father Ho’s problem may be that his browser is using an encoding other than UTF-8 for posting.)
MenTaLguY says:

20 February 2009 at 8:55 PM

Nope. Evidently some things are still(?) misconfigured on the server side with respect to character encodings. Oh well.
Willebrord says:

20 February 2009 at 9:05 PM

Just out of curiosity: how often exactly do you post in Chinese, Father?
Peter says:

20 February 2009 at 10:12 PM

Catholic terms in Chinese: Catholic – ???; Father – ??; Nun – ??; Sacrament of Confession – ????.
Peter says:

20 February 2009 at 10:15 PM

Chinese characters do not work. Now I will try Korean. Catholic words in Korean: Catholic – ???; Father – ??; Sister – ??; Sacrament of Confession – ????
MenTaLguY says:

21 February 2009 at 12:40 AM

Hm, let’s try accented European characters: áéíóúýñ àèìòù?ñ
MenTaLguY says:

21 February 2009 at 12:47 AM

It looks like, for posted comments, any character not covered in cp1252 is getting smashed into a ?. If “smart quotes” work, that pretty much confirms it. The really maddening thing is that all these characters show up fine in previews.
MenTaLguY says:

21 February 2009 at 12:49 AM

Yep. New comments are being shoehorned into cp1252 encoding, losing most non-European characters (and some European ones).
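
The effect can be imitated in Python. This is a guess at the observable behaviour only, not at what the server software actually does, and `store_as_cp1252` is a made-up name.

```python
def store_as_cp1252(comment: str) -> str:
    """Round-trip a comment through cp1252, turning unencodable characters into '?'."""
    return comment.encode("cp1252", errors="replace").decode("cp1252")


# Chinese characters are not in cp1252, so each one becomes "?".
assert store_as_cp1252("\u795e\u7236 Z") == "?? Z"
# Smart quotes and common accented letters are in cp1252 and survive.
assert store_as_cp1252("\u201c\u00e1\u00e9\u00f1\u201d") == "\u201c\u00e1\u00e9\u00f1\u201d"
```

With `errors="replace"`, Python substitutes a question mark for each character the target encoding cannot represent, which matches the "??" seen in the comments above.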
David C says:

21 February 2009 at 2:20 AM

MenTaLguY: but the recovered characters displayed correctly before going into the comments system, yes?
David C says:

21 February 2009 at 2:21 AM

MenTaLguY: sorry, I missed one of your comments which answers my question. Can you point me to some posts that don’t convert correctly? Thanks.
teresa says:

21 February 2009 at 5:52 AM

Dear Father,

I am Chinese and I hope I can help you.

But where are the passages? Perhaps you can send them to me by email, and I will try to get them back into Chinese characters. If they are not corrupted, they can be converted again.

If you can tell me the English meaning of the passages, I can translate them back into Chinese, type the text, and send it to you by email.

And I will be happy to help you with these kinds of issues later too.

Just write me an email.
I am sending you an email to tell you my email address.

Greetings.
Roland de Chanson says:

21 February 2009 at 8:34 AM

Just as a test:

???? ???, ??? ??? ?? ???????. (Russian)

????? ????, ? ?? ???? ????????. (Greek)

O?e naš, koji si na nebesima. (Serbian, Latin characters)

Ojcze nasz, który? jest w niebe. (Polish)

Pater noster, qui es in caelis. (Latin)

All of them appear correctly in the preview. Now for the post …
MenTaLguY says:

21 February 2009 at 11:37 AM

David: yes, the recovered comments show up correctly via preview. The post I linked to from which I recovered text actually doesn’t convert completely (try the title), at least partly due to the presence of some 8-bit characters (i.e. smart quotes). Also try some of the comments on that same post. In some cases it looks like the transformation is not reversible.

(Because it appears that the transformation is lossy, I’d definitely recommend going over any recovered text with a Chinese speaker to make sure it hasn’t been mangled to say something weird…)
MenTaLguY says:

21 February 2009 at 11:52 AM

Roland: to save yourself some trouble, it appears that any character not on this chart won’t work.

In any case (this is a _guess_; I’m not privy to the technical details of the site), it looks like one of the things which will need to be done is to change the database encoding from Microsoft cp1252, which can’t represent most international characters, to UTF-8. This will involve converting all of the data there from cp1252 to UTF-8. At that point it’s just a matter of making sure that WordPress is storing things with the correct encoding for the database (which it should do automatically, I think, but it wouldn’t hurt to check). That should take care of addressing problems for new posts and comments.
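
At the level of raw bytes, that conversion step amounts to something like the Python sketch below. It says nothing about how the database itself would be converted, which depends on the database software, and `cp1252_to_utf8` is a hypothetical name.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Reinterpret cp1252-encoded bytes as text and write them back out as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


# Bytes that really are cp1252 (smart quotes 0x93 and 0x94) convert cleanly.
assert cp1252_to_utf8(b"\x93Deo gratias\x94") == "\u201cDeo gratias\u201d".encode("utf-8")

# Bytes that are already UTF-8 ("e" with acute accent) get garbled one more time.
already_utf8 = "\u00e9".encode("utf-8")
assert cp1252_to_utf8(already_utf8) == "\u00c3\u00a9".encode("utf-8")
```

The second assertion is the same misreading that produced the gobbledygook in the first place: running the conversion over data that is already UTF-8 adds another layer of damage.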

As far as the older posts, it would probably be a good idea to save the unconverted database, as converting the database to support international characters might mangle the existing mangled data further.

I suspect the reason that things used to work is that the software was simply passing things through without paying attention to character encodings, so things went in and out unconverted and usually happened to work. Then at some point updates to the infrastructure meant that parts of the system started paying proper attention to character encodings, which unfortunately meant that (since different parts of the system disagreed about character encodings, and data had already been stored in mismatched encodings) things started getting mangled.
Fr. John Zuhlsdorf says:

21 February 2009 at 12:58 PM

This is useful for me, folks. Thanks.
David C says:

21 February 2009 at 3:04 PM

Father, I’ve sent you an email with a hopefully robust solution, a script that attempts to circumvent the problems MenTaLguY has pointed out. I absolutely agree that you should save a backup copy!
