Aaron S. Hawley (aaronhawley) wrote,

Sorting UTF-8 strings in PHP

When working with Unicode text, in this case the popular UTF-8 encoding, you sometimes need to convert characters to ASCII to get things done in PHP. For sorting, there are existing solutions: collator_sort() in PHP 5 (via the intl extension) and strcoll(), available since PHP 4. However, both depend on a locale. A locale-agnostic hack is to "normalize" the Unicode characters to ASCII before comparing them.

The script below is far from complete, but it seems to do the right thing.

    <?php

    /**
     * Normalize international characters for purposes like sorting and
     * searching by using a heuristic that just uses ASCII (the English
     * alphabet ordering) for a multilingual solution with no locale setting.
     */
    header("Content-type: text/plain; charset=utf-8");

    /**
     * Iñtërnâtiônàlizætiøn
     *
     * Example from Sam Ruby
     * http://intertwingly.net/stories/2004/04/14/i18n.html
     * 
     * By way of WACT team
     * http://www.phpwact.org/php/i18n/charsets
     */
    $internationalization = array(
                                  "I", // I
                                  "\xC3\xB1", // ñ
                                  "t", // t
                                  "\xC3\xAB", // ë
                                  "r", // r
                                  "n", // n
                                  "\xC3\xA2", // â
                                  "t", // t
                                  "i", // i
                                  "\xC3\xB4", // ô
                                  "n", // n
                                  "\xC3\xA0", // à
                                  "l", // l
                                  "i", // i
                                  "z", // z
                                  "\xC3\xA6", // æ
                                  "t", // t
                                  "i", // i
                                  "\xC3\xB8", // ø
                                  "n"); // n
    
    /** 
     * Use strtr() with this dictionary to convert to ASCII.
     * This data structure is not comprehensive.
     */
    $utf8_dict = array("\xC3\x80" => "A", // À
                       "\xC3\x81" => "A", // Á
                       "\xC3\x82" => "A", // Â
                       "\xC3\x83" => "A", // Ã
                       "\xC3\x84" => "A", // Ä
                       "\xC3\x85" => "A", // Å
                       "\xC3\x86" => "A", // Æ
                       "\xC3\x9E" => "B", // Þ
                       "\xC3\x87" => "C", // Ç
                       "\xC4\x86" => "C", // Ć
                       "\xC4\x8C" => "C", // Č
                       "\xC4\x90" => "Dj", // Đ
                       "\xC3\x88" => "E", // È
                       "\xC3\x89" => "E", // É
                       "\xC3\x8A" => "E", // Ê
                       "\xC3\x8B" => "E", // Ë
                       "\xC4\x9E" => "G", // Ğ
                       "\xC3\x8C" => "I", // Ì
                       "\xC3\x8D" => "I", // Í
                       "\xC3\x8E" => "I", // Î
                       "\xC3\x8F" => "I", // Ï
                       "\xC4\xB0" => "I", // İ
                       "\xC3\x91" => "N", // Ñ
                       "\xC3\x92" => "O", // Ò
                       "\xC3\x93" => "O", // Ó
                       "\xC3\x94" => "O", // Ô
                       "\xC3\x95" => "O", // Õ
                       "\xC3\x96" => "O", // Ö
                       "\xC3\x98" => "O", // Ø
                       "\xC3\x9F" => "Ss", // ß
                       "\xC3\x99" => "U", // Ù
                       "\xC3\x9A" => "U", // Ú
                       "\xC3\x9B" => "U", // Û
                       "\xC3\x9C" => "U", // Ü
                       "\xC3\x9D" => "Y", // Ý
                       "\xC3\xA0" => "a", // à
                       "\xC3\xA1" => "a", // á
                       "\xC3\xA2" => "a", // â
                       "\xC3\xA3" => "a", // ã
                       "\xC3\xA4" => "a", // ä
                       "\xC3\xA5" => "a", // å
                       "\xC3\xA6" => "a", // æ
                       "\xC3\xBE" => "b", // þ
                       "\xC3\xA7" => "c", // ç
                       "\xC4\x87" => "c", // ć
                       "\xC4\x8D" => "c", // č
                       "\xC4\x91" => "dj", // đ
                       "\xC3\xA8" => "e", // è
                       "\xC3\xA9" => "e", // é
                       "\xC3\xAA" => "e", // ê
                       "\xC3\xAB" => "e", // ë
                       "\xC3\xAC" => "i", // ì
                       "\xC3\xAD" => "i", // í
                       "\xC3\xAE" => "i", // î
                       "\xC3\xAF" => "i", // ï
                       "\xC3\xB0" => "o", // ð
                       "\xC3\xB1" => "n", // ñ
                       "\xC3\xB2" => "o", // ò
                       "\xC3\xB3" => "o", // ó
                       "\xC3\xB4" => "o", // ô
                       "\xC3\xB5" => "o", // õ
                       "\xC3\xB6" => "o", // ö
                       "\xC3\xB8" => "o", // ø
                       "\xC5\x94" => "R", // Ŕ
                       "\xC5\x95" => "r", // ŕ
                       "\xC5\xA0" => "S", // Š
                       "\xC5\x9E" => "S", // Ş
                       "\xC5\xA1" => "s", // š
                       "\xC3\xB9" => "u", // ù
                       "\xC3\xBA" => "u", // ú
                       "\xC3\xBB" => "u", // û
                       "\xC3\xBC" => "u", // ü
                       "\xC3\xBD" => "y", // ý
                       "\xC3\xBF" => "y", // ÿ
                       "\xC5\xBD" => "Z", // Ž
                       "\xC5\xBE" => "z"); // ž
    
    $i18n = join("", $internationalization);
    print $i18n . "\n";

    /**
     * UTF-8 regular expression from
     * http://php.net/manual/en/function.utf8-decode.php (comment 57069)
     */
    $utf8_re = "/^([\\x00-\\x7f]|"
      . "[\\xc2-\\xdf][\\x80-\\xbf]|"
      . "\\xe0[\\xa0-\\xbf][\\x80-\\xbf]|"
      . "[\\xe1-\\xec][\\x80-\\xbf]{2}|"
      . "\\xed[\\x80-\\x9f][\\x80-\\xbf]|"
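      // Adjusted: the cited pattern rejected valid characters such as
      // U+FEFF; this still excludes the noncharacters U+FFFE and U+FFFF.
      . "\\xef[\\x80-\\xbe][\\x80-\\xbf]|"
      . "\\xef\\xbf[\\x80-\\xbd]|"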
      . "\\xee[\\x80-\\xbf]{2}|"
      . "\\xf0[\\x90-\\xbf][\\x80-\\xbf]{2}|"
      . "[\\xf1-\\xf3][\\x80-\\xbf]{3}|"
      . "\\xf4[\\x80-\\x8f][\\x80-\\xbf]{2})*$/";

    print "Valid UTF-8?: "
      . (preg_match($utf8_re, $i18n) > 0 ? "true" : "false") . "\n";

    print strtr($i18n, $utf8_dict) . "\n";

    // Doesn't work in PHP4?
    $sorted = preg_split("//u", $i18n, -1, PREG_SPLIT_NO_EMPTY);
    // So, just use the original array, instead.
    $sorted = $internationalization;

    function compare($s1, $s2)
    {
      global $utf8_dict;
      return strcasecmp(strtr($s1, $utf8_dict),
                        strtr($s2, $utf8_dict));
    }

    usort($sorted, "compare");
    print join("", $sorted) . "\n";

    /**
     * Results (the order among letters that compare equal may vary):
     * 
     * Iñtërnâtiônàlizætiøn
     * Valid UTF-8?: true
     * Internationalization
     * àæâëIiiilñnnnøôrtttz
     */
    ?>
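
The comparison function above leans on a global variable and re-runs strtr() on both strings for every comparison. Here is a minimal sketch of the same idea as a self-contained function. The name sort_by_ascii_key() and the three-entry $dict are made up for illustration; in practice you would pass the full $utf8_dict. It computes each ASCII key once and breaks ties by original position, since usort() was not a stable sort before PHP 8.0. It assumes PHP 5.3 or later for the closure.

```php
<?php

/**
 * Sort UTF-8 strings by an ASCII "key" built with strtr().
 * Hypothetical helper; $dict maps UTF-8 byte sequences to ASCII.
 */
function sort_by_ascii_key(array $strings, array $dict)
{
    // Decorate: compute the lowercase ASCII key once per string.
    $rows = array();
    foreach (array_values($strings) as $pos => $s) {
        $rows[] = array(strtolower(strtr($s, $dict)), $pos, $s);
    }

    // Sort by key; fall back to the original position on ties.
    usort($rows, function ($a, $b) {
        $c = strcmp($a[0], $b[0]);
        return $c !== 0 ? $c : $a[1] - $b[1];
    });

    // Undecorate: return only the original strings.
    $sorted = array();
    foreach ($rows as $row) {
        $sorted[] = $row[2];
    }
    return $sorted;
}

// Small illustrative dictionary (a subset of $utf8_dict above).
$dict = array("\xC3\xA9" => "e",  // é
              "\xC3\xB1" => "n",  // ñ
              "\xC3\x85" => "A"); // Å

$words = array("zebra", "\xC3\xA9clair", "\xC3\x85ngstrom",
               "\xC3\xB1andu", "apple");
print join(", ", sort_by_ascii_key($words, $dict)) . "\n";
```

This prints "Ångstrom, apple, éclair, ñandu, zebra". The original strings come back untouched; only the sort keys are reduced to ASCII.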

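I also tried I18N_UnicodeNormalizer from the PHP PEAR project, but it didn't do what I wanted: the NFD and NFC forms only decompose or recompose accented characters and don't reduce them to ASCII.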

    <?php

    require_once('I18N/UnicodeNormalizer.php');

    print I18N_UnicodeNormalizer::toNFD($i18n) . "\n";
    print I18N_UnicodeNormalizer::toNFC($i18n) . "\n";
    ?>

There's a good chance I don't know what I'm doing there with the PEAR library, however.
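
For comparison, here is a minimal sketch of how decomposition could be turned into accent stripping. It does not use the PEAR package. It assumes the Normalizer class from PHP's intl extension is available, and the function name strip_marks() is made up for illustration. The idea is to decompose with NFD and then delete the combining marks with a Unicode-aware regular expression.

```php
<?php

/**
 * Strip accents by canonical decomposition (NFD) followed by removal
 * of combining marks.  Hypothetical helper; returns null when the
 * intl extension is not loaded or the input cannot be normalized.
 */
function strip_marks($s)
{
    if (!class_exists('Normalizer')) {
        return null;
    }
    // NFD turns "ñ" (U+00F1) into "n" followed by U+0303.
    $nfd = Normalizer::normalize($s, Normalizer::FORM_D);
    if ($nfd === false) {
        return null;
    }
    // \p{Mn} matches non-spacing marks such as U+0303.
    return preg_replace('/\p{Mn}+/u', '', $nfd);
}

// Iñtërnâtiônàlizætiøn
$s = "I\xC3\xB1t\xC3\xABrn\xC3\xA2ti\xC3\xB4n\xC3\xA0liz\xC3\xA6ti\xC3\xB8n";
$ascii = strip_marks($s);
print ($ascii === null ? "intl extension not available" : $ascii) . "\n";
```

Letters such as æ and ø have no Unicode decomposition, so they pass through unchanged and would still need a dictionary like $utf8_dict.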

Tags: howto, php, programming