Sorting UTF-8 strings in PHP

May. 28th, 2009 | 03:23 pm

When working with Unicode text, in this case encoded as the popular UTF-8, you sometimes need to convert characters to ASCII to get things done in PHP. For sorting Unicode, there are the existing solutions of collator_sort() in PHP 5 (via the intl extension) and strcoll() since PHP 4. However, both depend on a locale. A locale-agnostic hack is to just "normalize" Unicode characters to ASCII.

This is far from complete, but seems to do the right thing.

    <?php

    /**
     * Normalize international characters for purposes like sorting and
     * searching, using a heuristic that maps them to ASCII (English
     * alphabet ordering). A multilingual solution with no locale setting.
     */
    header("Content-type: text/plain; charset=utf-8");

    /**
     * Iñtërnâtiônàlizætiøn
     *
     * Example from Sam Ruby
     * http://intertwingly.net/stories/2004/04/14/i18n.html
     * 
     * By way of WACT team
     * http://www.phpwact.org/php/i18n/charsets
     */
    $internationalization = array(
                                  "I", // I
                                  "\xC3\xB1", // ñ
                                  "t", // t
                                  "\xC3\xAB", // ë
                                  "r", // r
                                  "n", // n
                                  "\xC3\xA2", // â
                                  "t", // t
                                  "i", // i
                                  "\xC3\xB4", // ô
                                  "n", // n
                                  "\xC3\xA0", // à
                                  "l", // l
                                  "i", // i
                                  "z", // z
                                  "\xC3\xA6", // æ
                                  "t", // t
                                  "i", // i
                                  "\xC3\xB8", // ø
                                  "n"); // n
    
    /** 
     * Use strtr() with this dictionary to convert to ASCII.
     * This data structure is not comprehensive.
     */
    $utf8_dict = array("\xC3\x80" => "A", // À
                       "\xC3\x81" => "A", // Á
                       "\xC3\x82" => "A", // Â
                       "\xC3\x83" => "A", // Ã
                       "\xC3\x84" => "A", // Ä
                       "\xC3\x85" => "A", // Å
                       "\xC3\x86" => "A", // Æ
                       "\xC3\x9E" => "B", // Þ
                       "\xC3\x87" => "C", // Ç
                       "\xC4\x86" => "C", // Ć
                       "\xC4\x8C" => "C", // Č
                       "\xC4\x90" => "Dj", // Đ
                       "\xC3\x88" => "E", // È
                       "\xC3\x89" => "E", // É
                       "\xC3\x8A" => "E", // Ê
                       "\xC3\x8B" => "E", // Ë
                       "\xC4\x9E" => "G", // Ğ
                       "\xC3\x8C" => "I", // Ì
                       "\xC3\x8D" => "I", // Í
                       "\xC3\x8E" => "I", // Î
                       "\xC3\x8F" => "I", // Ï
                       "\xC4\xB0" => "I", // İ
                       "\xC3\x91" => "N", // Ñ
                       "\xC3\x92" => "O", // Ò
                       "\xC3\x93" => "O", // Ó
                       "\xC3\x94" => "O", // Ô
                       "\xC3\x95" => "O", // Õ
                       "\xC3\x96" => "O", // Ö
                       "\xC3\x98" => "O", // Ø
                       "\xC3\x9F" => "Ss", // ß
                       "\xC3\x99" => "U", // Ù
                       "\xC3\x9A" => "U", // Ú
                       "\xC3\x9B" => "U", // Û
                       "\xC3\x9C" => "U", // Ü
                       "\xC3\x9D" => "Y", // Ý
                       "\xC3\xA0" => "a", // à
                       "\xC3\xA1" => "a", // á
                       "\xC3\xA2" => "a", // â
                       "\xC3\xA3" => "a", // ã
                       "\xC3\xA4" => "a", // ä
                       "\xC3\xA5" => "a", // å
                       "\xC3\xA6" => "a", // æ
                       "\xC3\xBE" => "b", // þ
                       "\xC3\xA7" => "c", // ç
                       "\xC4\x87" => "c", // ć
                       "\xC4\x8D" => "c", // č
                       "\xC4\x91" => "dj", // đ
                       "\xC3\xA8" => "e", // è
                       "\xC3\xA9" => "e", // é
                       "\xC3\xAA" => "e", // ê
                       "\xC3\xAB" => "e", // ë
                       "\xC3\xAC" => "i", // ì
                       "\xC3\xAD" => "i", // í
                       "\xC3\xAE" => "i", // î
                       "\xC3\xAF" => "i", // ï
                       "\xC3\xB0" => "o", // ð
                       "\xC3\xB1" => "n", // ñ
                       "\xC3\xB2" => "o", // ò
                       "\xC3\xB3" => "o", // ó
                       "\xC3\xB4" => "o", // ô
                       "\xC3\xB5" => "o", // õ
                       "\xC3\xB6" => "o", // ö
                       "\xC3\xB8" => "o", // ø
                       "\xC5\x94" => "R", // Ŕ
                       "\xC5\x95" => "r", // ŕ
                       "\xC5\xA0" => "S", // Š
                       "\xC5\x9E" => "S", // Ş
                       "\xC5\xA1" => "s", // š
                       "\xC3\xB9" => "u", // ù
                       "\xC3\xBA" => "u", // ú
                       "\xC3\xBB" => "u", // û
                       "\xC3\xBC" => "u", // ü
                       "\xC3\xBD" => "y", // ý
                       "\xC3\xBF" => "y", // ÿ
                       "\xC5\xBD" => "Z", // Ž
                       "\xC5\xBE" => "z"); // ž
    
    $i18n = join("", $internationalization);
    print $i18n . "\n";

    /**
     * UTF-8 regular expression from
     * http://php.net/manual/en/function.utf8-decode.php (comment 57069)
     */
    $utf8_re = "/^([\\x00-\\x7f]|"
      . "[\\xc2-\\xdf][\\x80-\\xbf]|"
      . "\\xe0[\\xa0-\\xbf][\\x80-\\xbf]|"
      . "[\\xe1-\\xec][\\x80-\\xbf]{2}|"
      . "\\xed[\\x80-\\x9f][\\x80-\\xbf]|"
      . "\\xef[\\x80-\\xbf][\\x80-\\xbc]|"
      . "\\xee[\\x80-\\xbf]{2}|"
      . "\\xf0[\\x90-\\xbf][\\x80-\\xbf]{2}|"
      . "[\\xf1-\\xf3][\\x80-\\xbf]{3}|"
      . "\\xf4[\\x80-\\x8f][\\x80-\\xbf]{2})*$/";

    print "Valid UTF-8?: " . (preg_match($utf8_re, $i18n) > 0
                              ? "true" : "false") . "\n";

    print strtr($i18n, $utf8_dict) . "\n";

    // Doesn't work in PHP4?
    $sorted = preg_split("//u", $i18n, -1, PREG_SPLIT_NO_EMPTY);
    // So, just use the original array, instead.
    $sorted = $internationalization;

    function compare($s1, $s2)
    {
      global $utf8_dict;
      return strcasecmp(strtr($s1, $utf8_dict),
                        strtr($s2, $utf8_dict));
    }

    usort($sorted, "compare");
    print join("", $sorted) . "\n";

    /**
     * Results:
     * 
     * Iñtërnâtiônàlizætiøn
     * Valid UTF-8?: true
     * Internationalization
     * àæâëIiiilñnnnøôrtttz
     */
    ?>
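
The comparison above sorts single characters and reaches the dictionary through a global. A minimal sketch of how the same strtr() idea might be applied to whole words follows. The make_comparator() helper and the tiny $dict are hypothetical stand-ins for illustration (the full $utf8_dict would be used in practice), and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Hypothetical, deliberately small dictionary for the example.
$dict = array("\xC3\xA9" => "e",  // é
              "\xC3\x89" => "E",  // É
              "\xC3\xB1" => "n",  // ñ
              "\xC3\xB6" => "o"); // ö

/**
 * Build a usort() callback that compares the ASCII-folded forms of
 * two strings, ignoring case. The dictionary is captured by the
 * closure, so no global is needed.
 */
function make_comparator(array $dict)
{
    return function ($s1, $s2) use ($dict) {
        return strcasecmp(strtr($s1, $dict), strtr($s2, $dict));
    };
}

$words = array("zebra", "\xC3\xA9clair", "apple", "\xC3\x89cole", "ma\xC3\xB1ana");
usort($words, make_comparator($dict));
print join(", ", $words) . "\n";
// prints: apple, éclair, École, mañana, zebra
```

Only the sort keys are folded to ASCII; the strings in the array keep their original characters.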

I tried the I18N_UnicodeNormalizer from the PHP PEAR project, and it didn't do what I wanted.

    <?php

    require_once('I18N/UnicodeNormalizer.php');

    print I18N_UnicodeNormalizer::toNFD($i18n) . "\n";
    print I18N_UnicodeNormalizer::toNFC($i18n) . "\n";
    ?>

There's a good chance I don't know what I'm doing there with the PEAR library, however.

Comments {6}

Aaron S. Hawley

"Collation"

from: aaronhawley
date: May. 28th, 2009 08:22 pm (UTC)

Apparently, in the context of sorting, this is called "collation".

http://userguide.icu-project.org/collation
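
For comparison with the ASCII hack, here is a rough sketch of locale-based collation through PHP's intl extension, which wraps ICU. The collate_words() helper is a hypothetical name, the 'en_US' locale is just an assumption for the example, and the function returns null when intl is not loaded.

```php
<?php
/**
 * Sort words with ICU collation rules for the given locale.
 * Returns null when the intl extension (which provides the
 * Collator class) is not available.
 */
function collate_words(array $words, $locale)
{
    if (!class_exists('Collator')) {
        return null;
    }
    $collator = new Collator($locale);
    $collator->sort($words); // sorts the array in place
    return $words;
}

$result = collate_words(array("zebra", "\xC3\xA9clair", "apple"), 'en_US');
print ($result === null ? "intl not available" : join(", ", $result)) . "\n";
```

Unlike the dictionary approach, this needs a locale to be chosen up front, which is the assumption the original post was trying to avoid.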

Better ways

from: http://claimid.com/ieure
date: May. 29th, 2009 05:36 am (UTC)

There are better ways to handle the decomposition in PHP. Here’s what I’d suggest instead of strtr():

$str = preg_replace('/[^a-z0-9 ]/i', '', iconv('utf-8', 'ascii//TRANSLIT', $str));

The iconv() call takes a UTF-8 string and transliterates it to ASCII. Your “Iñtërnâtiônàlizætiøn” example becomes “I~nt"ern^ati^on`alizaetion”. Then we can strip out everything that is not a letter, digit, or space, and use the result for the comparison.

You could also take a hybrid approach, where you cache the decomposed character in a map so you can save the iconv() call on subsequent compare() invocations, but this seems a little micro-optimizey to me.
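
The hybrid approach might look something like the sketch below. The translit_key() and compare_translit() names are hypothetical, and the exact transliteration can differ between iconv implementations, so the code falls back to the raw input when iconv() fails and lets the preg_replace() strip whatever is left.

```php
<?php
/**
 * Return an ASCII-only sort key for a UTF-8 string, caching the
 * result so repeated comparisons skip the iconv() call.
 */
function translit_key($str)
{
    static $cache = array();
    if (!isset($cache[$str])) {
        // @ silences notices from implementations that cannot
        // transliterate; iconv() then returns false.
        $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $str);
        if ($ascii === false) {
            $ascii = $str;
        }
        // Without the /u modifier this works byte by byte, so any
        // remaining non-ASCII bytes are removed too.
        $cache[$str] = preg_replace('/[^a-z0-9 ]/i', '', $ascii);
    }
    return $cache[$str];
}

function compare_translit($s1, $s2)
{
    return strcasecmp(translit_key($s1), translit_key($s2));
}

$words = array("zebra", "\xC3\xA9clair", "apple");
usort($words, 'compare_translit');
print join(", ", $words) . "\n";
```

The cache is keyed on the original string, so it only pays off when the same values are compared many times, as they are during a sort.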

Aaron S. Hawley

Re: Better ways

from: aaronhawley
date: May. 29th, 2009 06:12 am (UTC)

That looks like a pretty complete conversion to ASCII. Cool. I actually thought of a similar approach except using the PEAR module. Though, iconv does seem better at this. I wasn't aware.

Re: Better ways

from: http://claimid.com/ieure
date: May. 29th, 2009 04:09 pm (UTC)

For iconv() to work, you need the "//TRANSLIT" magic, which is extremely poorly documented.
It also has "//IGNORE", which will just drop characters it can't represent in the target encoding.

Alex Schröder

NFC and NFD

from: kensanata
date: Jun. 5th, 2009 08:33 am (UTC)

What was the code example using Unicode normalization supposed to do? It seems to be unrelated to converting Unicode to ASCII. Well... perhaps. The difference between NFC and NFD is that one encodes Ä as a single precomposed character and the other encodes it as A followed by a combining ¨. To quote http://www.cl.cam.ac.uk/~mgk25/unicode.html , NFD uses U+0041 U+0308 (LATIN CAPITAL LETTER A, COMBINING DIAERESIS) instead of U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS). Thus, I guess you could convert the string to NFD and strip any character that is not an ASCII character. Might be good enough?
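
A minimal sketch of that idea, assuming PHP's intl extension and its Normalizer class rather than the PEAR package from the post. The ascii_via_nfd() name is hypothetical, and the function returns null when intl is not loaded.

```php
<?php
/**
 * Decompose a UTF-8 string to NFD, then drop every non-ASCII byte,
 * which removes the combining marks left behind by decomposition.
 * Returns null when the Normalizer class (intl) is unavailable.
 */
function ascii_via_nfd($str)
{
    if (!class_exists('Normalizer')) {
        return null;
    }
    $nfd = Normalizer::normalize($str, Normalizer::FORM_D);
    if ($nfd === false) {
        return null; // input was not valid UTF-8
    }
    // No /u modifier: the pattern matches single bytes >= 0x80.
    return preg_replace('/[^\x00-\x7F]/', '', $nfd);
}

$i18n = "I\xC3\xB1t\xC3\xABrn\xC3\xA2ti\xC3\xB4n\xC3\xA0liz\xC3\xA6ti\xC3\xB8n";
$ascii = ascii_via_nfd($i18n);
print ($ascii === null ? "intl not available" : $ascii) . "\n";
// prints "Internationaliztin" when intl is available
```

Note that æ and ø have no decomposition, so they disappear entirely instead of becoming "a" and "o". A dictionary or transliteration step would still be needed for letters like those.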

Aaron S. Hawley

Re: NFC and NFD

from: aaronhawley
date: Jun. 8th, 2009 03:01 pm (UTC)

I think converting to NFD and stripping out anything non-ASCII is exactly the approach that Ian Eure suggests with PHP's iconv functions.
