Friday, February 13, 2009

Quick guide to UTF-8 in PHP regular expressions

Unicode support raises a lot of issues in any application, mainly because things get much more complicated than with a single-byte encoding such as ASCII. UTF-8 is the most common encoding used for Unicode. Many in-depth documents have been written about this, and I encourage you to read them if you have to do anything more than modify a couple of lines of code. To adapt regular expressions in PHP to UTF-8 characters, here are a few steps that can get you off the ground quickly:

* Use the preg functions. The ereg functions do not support Unicode; they were deprecated in PHP 5.3 and removed in PHP 7.
* To enable Unicode for a regex, add the /u flag, e.g.

if (preg_match('/[^[:space:]a-zA-Z0-9]{1,}/u', $string)) {

* Find a good table of Unicode characters and their hexadecimal code points. Look up the character you need; its code point is written in the format U+NNNN, where NNNN is a hexadecimal number of variable length. For example, the euro symbol, €, is U+20AC. To use it in a regular expression, use the \x{NNNN} format, e.g. \x{20AC}.

* For example, to add € to the allowed characters in the above expression, use the following:

if (preg_match('/[^[:space:]a-zA-Z0-9\x{20AC}]{1,}/u', $string)) {
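
Putting the steps above together, here is a minimal sketch of how the check might be wrapped up. The helper name has_disallowed_chars is hypothetical, and the sketch assumes PHP 7 or later (for the type declarations and the "\u{...}" string escape). It also shows what the /u flag changes: without it, PCRE works on bytes, so a single € (three bytes in UTF-8) does not match a single ".".

```php
<?php
// Hypothetical helper, not from any library: returns true when $string
// contains a character outside the allowed set (whitespace, ASCII letters,
// digits and the euro sign).
function has_disallowed_chars(string $string): bool
{
    // Single quotes pass \x{20AC} through to PCRE untouched.
    // The /u flag makes PCRE treat the pattern and the subject as UTF-8.
    $result = preg_match('/[^[:space:]a-zA-Z0-9\x{20AC}]/u', $string);

    // preg_match() returns false on error, for example when $string is not
    // valid UTF-8. This sketch chooses to treat that case as disallowed input.
    return $result !== 0;
}

var_dump(has_disallowed_chars("Price 20 \u{20AC}")); // bool(false): euro sign is allowed
var_dump(has_disallowed_chars("Price 20 \u{00A3}")); // bool(true): pound sign is not

// Effect of /u on a single multi-byte character:
var_dump(preg_match('/^.$/', "\u{20AC}"));  // int(0): "." matches one byte, the string has three
var_dump(preg_match('/^.$/u', "\u{20AC}")); // int(1): "." matches one UTF-8 character
```

Treating a preg_match() failure as "disallowed" is a design choice for this sketch; depending on the application, you may prefer to report invalid UTF-8 input as a separate error.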
