Unicode
I’ve been working on a new web project in my 5-9 time that will hopefully make its debut this week. Across the board, I’ve been trying to do this one in “best practices” mode. One of the problems I have run into consistently is handling non-ASCII characters.
This isn’t exactly anything new to web developers, but now that Unicode has crept into each layer of the stack (good charset/collation support in MySQL 4.1, Unicode-native languages like Ruby, browser support, growth of XML and XHTML), it’s theoretically possible to do Unicode everywhere, from request to output.
So, the data is the same everywhere. Good. The new problem is that each layer of the stack has its own native character set and/or biases for character handling, and the application code needs to respect that to get it all to work.
My app stack for this project is: PHP 5, MySQL 4.1, Smarty 2.6.10, PEAR::DB 1.7.6. If you combine all that with a proper controller class, it’s possible to make this work painlessly. Oh, I’m also using Gordon Luk’s Freetag, which depends on ADOdb.
Here are my comments and tips on each layer of this stack:
- MySQL: All your text-bearing columns should use a Unicode encoding/collation. utf8_general_ci does the trick for me. If most of your content is in a language other than English, try one of the other collations (utf8_unicode_ci or a language-specific one) for better sorting.
- PEAR::DB: Works out of the box.
- Browser: Serve your pages with a header like Content-Type: text/html; charset=utf-8 or XML as appropriate. You could also theoretically do application/xhtml+xml (see Ian Hickson’s Sending XHTML as text/html Considered Harmful), but it’s the charset here that’s important. Stick it in a meta tag too if you want, but the header alone works for me.
- PHP and MySQL: Once all that is done, you need to encode/decode strings going to/from DB operations. The results of SELECT statements need to be sent through utf8_encode() (which converts ISO-8859-1 to UTF-8) before text-processing/entity encoding/etc., and the values for INSERT/UPDATE queries need to be sent through utf8_decode() (which converts back) before you query. Note that this is unrelated to escaping for MySQL to prevent injection attacks and unterminated strings. If you’re using PEAR::DB, just pass your UTF8-decoded strings in the array for the various prepare/execute methods and it will work correctly.
- Smarty: (update: this is no longer necessary in Smarty 2.6.11+; see comment below.) Smarty’s builtin modifiers are charset-ignorant. To fix this, go to smarty/libs/plugins/modifier.escape.php and add in the charset arguments to htmlspecialchars and htmlentities like this:
    // lines 24-28:
    case 'html':
        return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
    case 'htmlall':
        return htmlentities($string, ENT_QUOTES, 'UTF-8');
However, this is of limited use for XML output, because PHP’s htmlentities emits named character entities (such as &eacute;), which may invalidate any XML documents without DTDs that you’re generating (named references other than the five predefined ones are not built into XML; they come from the DTD). htmlspecialchars is safer, since it only emits &amp;, &lt;, &gt;, &quot; and a numeric reference for the single quote. For everything else, always use numeric entities. As long as you’re outputting Unicode, you don’t really need to encode anything beyond the special characters (< > & " ') anyway, so just handle those with your own texturizer, and throw in SmartyPants while you’re at it.
- Freetag/ADOdb: Haven’t looked at the code for this combo, but it works for me if I treat the Freetag functions as MySQL queries, in that data to/from accessors needs to be decoded/encoded. My guess is that Freetag is charset agnostic, and so is ADOdb (like PEAR::DB).
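To illustrate the "own texturizer" idea from the Smarty item above, here is a minimal sketch of an escaper that only touches the five XML-special characters and always emits numeric references. The function name xml_escape_numeric is hypothetical (it is not part of Smarty or PHP), and the sketch assumes the input string is already valid UTF-8.

```php
<?php
// Hypothetical helper, not part of Smarty or PHP: escapes the five
// XML-special characters as numeric character references, so the
// output does not depend on any DTD. Assumes $string is valid UTF-8;
// all other characters (including non-ASCII ones) pass through as-is.
function xml_escape_numeric($string)
{
    // strtr() with an array scans the input once and never re-replaces
    // its own output, so the '&' inside '&#60;' is not escaped again.
    return strtr($string, array(
        '&' => '&#38;',
        '<' => '&#60;',
        '>' => '&#62;',
        '"' => '&#34;',
        "'" => '&#39;',
    ));
}

echo xml_escape_numeric('Café <b>"Zoë" & co</b>'), "\n";
// prints: Café &#60;b&#62;&#34;Zoë&#34; &#38; co&#60;/b&#62;
```

Something like this could be registered as a custom Smarty modifier in place of the built-in escape, though how you wire it in depends on your setup.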
Those are the steps needed to actually make it work. Now, what about painless? Putting utf8_encode()/utf8_decode() calls everywhere sucks. The solution is to do it at the framework level.
- Encoding: If you’re using Smarty (or another template class) for everything, then you can override the assign() function. Here’s my code (note, I only assign scalars and associative arrays, but you could extend this to work for objects as well):
    class Smarty_MySite extends Smarty {
        // ...

        // Encode values to UTF-8 on assignment unless $encode is false.
        function assign($tpl_var, $value = null, $encode = true) {
            if ($encode && $value !== null) {
                $value = $this->utf8_encode_array($value);
            }
            parent::assign($tpl_var, $value);
        }

        // Recursively utf8_encode() every string in a scalar or (nested) array.
        function utf8_encode_array($var) {
            if (is_array($var)) {
                foreach ($var as $key => $value) {
                    $var[$key] = $this->utf8_encode_array($value);
                }
            } elseif (is_string($var)) {
                $var = utf8_encode($var);
            }
            return $var;
        }
    }
- Decoding: Since all my INSERT/UPDATEs are the result of POST requests, I just put some code in my controller to do the decodes if it’s a POST. Use a recursive function like the one above for encoding.
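The decoding step could look something like the sketch below. It mirrors utf8_encode_array() above; the function name utf8_decode_array and the placement of the POST check are my assumptions, not code from the actual controller. Note also that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2; they were current for the PHP 5 stack described here.

```php
<?php
// Hypothetical counterpart to utf8_encode_array(): walks a value
// recursively and converts every string from UTF-8 to ISO-8859-1.
// Non-string scalars (ints, bools, null) are returned unchanged.
function utf8_decode_array($var)
{
    if (is_array($var)) {
        foreach ($var as $key => $value) {
            $var[$key] = utf8_decode_array($value);
        }
    } elseif (is_string($var)) {
        $var = utf8_decode($var);
    }
    return $var;
}

// In the controller: decode form input only for POST requests.
// The isset() guard keeps this safe when run from the command line.
if (isset($_SERVER['REQUEST_METHOD']) && $_SERVER['REQUEST_METHOD'] === 'POST') {
    $_POST = utf8_decode_array($_POST);
}
```

Doing this once, before any query code runs, means the rest of the application can pass $_POST values straight to the prepare/execute methods without further conversion.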
As you can see, the stack running this site isn’t even handling character encoding well (see the && in the above code). The support is out there, but you have to do a little work at the application level to make it all work.
December 6th, 2005 at 5:17 am
This is a really helpful article you’ve written. I am having a bit of trouble implementing the Smarty class extension for objects though.
PHP 5 is supposed to let you iterate through objects using foreach. In the second line of function utf8_encode_array I changed it to this:
    if (is_array($var) || is_object($var)) {
but then I got the fatal error:
Cannot use object of type (myObjectName) as array in blah blah blah.
Looking at the smarty code, I don’t really understand how they implement the objects either in the assign function.
Any thoughts on all this?
December 6th, 2005 at 12:27 pm
I haven’t tried this, but I’m guessing you’re almost there. It’s probably just that you can’t use array syntax to access the object’s members in the foreach. So you need to do something like this instead:
    function utf8_encode_anything($var) {
        if (is_array($var)) {
            foreach ($var as $key => $value) {
                $var[$key] = $this->utf8_encode_anything($value);
            }
        } elseif (is_object($var)) {
            // Object members are accessed with property syntax, not array syntax.
            foreach ($var as $key => $value) {
                $var->$key = $this->utf8_encode_anything($value);
            }
        } else {
            $var = utf8_encode($var);
        }
        return $var;
    }
Hope this helps!
January 18th, 2006 at 2:09 pm
[...] Susan’s photojournalism site, launches today. We’ve been working on it for a few months. Although it’s no great shakes design-wise, thisall the Unicode investigations, as well as a host of other best-practices magic. Try the GMap. I’m also proud to announce that Susan has released the web versions of her photos under a Creative Commons license. [...]
January 23rd, 2006 at 2:34 pm
Update: it looks like a charset argument has been added to the escape modifiers in Smarty 2.6.11, so the original hack is no longer necessary. Instead, you might want to change the default argument for the escape function (line 22) from 'ISO-8859-1' to 'UTF-8' so you don’t have to pass it every time.