encoding - Converting Unicode reference to UTF-8 character in PHP with mbstring

Question:

I have a set of data inside a database which was input with Unicode characters, but they were stored as literal escape strings. That is, where there should be an apostrophe I've actually got \u2019.

So I now need to convert this into its character representation, which is ’. Firstly, it is quite easy to change the string into its entity version, &#x2019;; then I need to turn it into the correct UTF-8 multibyte string.

I have attempted to do this in a number of ways; on my local server I can extract the characters with a preg_match function and then pass each to the following function:

mb_convert_encoding($string, "UTF-8", "HTML-ENTITIES");

Sounds quite sensible, and works without issue. Turning off the UTF-8 charset in the browser shows that this has actually converted into â€™ when read by the browser default encoding.

However, the exact same code, when run in my production environment, produces the dreaded "missing symbol" box when rendered as UTF-8. With UTF-8 turned off, it has produced whatever byte stream renders as ò°‘£. It appears to be outputting 4 bytes rather than 3; I don't know if that is relevant, as I'm not well read on character encoding.
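
Browser rendering is an indirect way to count bytes, so it may help to dump the raw bytes on both servers and compare them. The following is a minimal sketch, not the code from the question: it decodes the entity with html_entity_decode (a swapped-in decoder, used here only as a reference point) and prints the result as hex. A correct UTF-8 encoding of U+2019 is the three bytes E2 80 99.

```php
<?php
// Decode the numeric entity with html_entity_decode instead of mbstring,
// to get a reference result that does not depend on mbstring settings.
$char = html_entity_decode('&#x2019;', ENT_QUOTES, 'UTF-8');

// Print the byte count and the raw bytes as hex.
echo strlen($char), ' bytes: ', bin2hex($char), "\n"; // 3 bytes: e28099
```

Running the same dump on the output of the mb_convert_encoding call in each environment would show whether production really emits a different byte sequence.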

I assume that the issue is with my mbstring settings. Here are the mbstring settings from my local server:

Multibyte Support                       enabled
Multibyte string engine                 libmbfl
HTTP input encoding translation         disabled
Multibyte (japanese) regex support      enabled
Multibyte regex (oniguruma) version     4.7.1

Directive                               Local Value                         Master Value
mbstring.detect_order                   no value                            no value
mbstring.encoding_translation           Off                                 Off
mbstring.func_overload                  0                                   0
mbstring.http_input                     auto                                auto
mbstring.http_output                    UTF-8                               UTF-8
mbstring.http_output_conv_mimetypes     ^(text/|application/xhtml\+xml)     ^(text/|application/xhtml\+xml)
mbstring.internal_encoding              UTF-8                               UTF-8
mbstring.language                       neutral                             neutral
mbstring.strict_detection               Off                                 Off
mbstring.substitute_character           no value                            no value

There are a few differences on my production environment:

Multibyte Support                       enabled
Multibyte string engine                 libmbfl
Multibyte (japanese) regex support      enabled
Multibyte regex (oniguruma) version     3.7.1

Directive                               Local Value     Master Value
mbstring.detect_order                   no value        no value
mbstring.encoding_translation           Off             Off
mbstring.func_overload                  0               0
mbstring.http_input                     auto            auto
mbstring.http_output                    UTF-8           UTF-8
mbstring.internal_encoding              UTF-8           UTF-8
mbstring.language                       neutral         neutral
mbstring.strict_detection               Off             Off
mbstring.substitute_character           no value        no value

Anyone see what I'm doing wrong?

Answer:

See if this can help you: hex2ascii and ascii2hex

ADDED on 09-19-2012:

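function ascii2hex($ascii)
{
    // Render each byte as two uppercase hex digits followed by a space.
    $hex = '';
    for ($i = 0; $i < strlen($ascii); $i++)
    {
        // Square brackets: the $ascii{$i} offset syntax was removed in PHP 8.
        $byte = strtoupper(dechex(ord($ascii[$i])));
        $byte = str_repeat('0', 2 - strlen($byte)).$byte;
        $hex .= $byte." ";
    }
    return $hex;
}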

function hex2ascii($hex)
{
    // Drop the separating spaces, then turn each pair of hex digits into a byte.
    $ascii = '';
    $hex = str_replace(" ", "", $hex);
    for ($i = 0; $i < strlen($hex); $i += 2)
    {
        $ascii .= chr(hexdec(substr($hex, $i, 2)));
    }

    return $ascii;
}
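
For comparison, PHP's built-in bin2hex and hex2bin (the latter available since PHP 5.4) cover much the same ground as the two helpers. The sketch below is only an illustration: the sample bytes are the UTF-8 encoding of U+2019 from the question, and the space-separated uppercase format is an assumption modelled on ascii2hex, without its trailing space.

```php
<?php
// The UTF-8 bytes of U+2019 (right single quotation mark).
$bytes = "\xE2\x80\x99";

// bin2hex gives "e28099"; split into byte pairs and uppercase for readability.
$hex = strtoupper(implode(' ', str_split(bin2hex($bytes), 2)));
echo $hex, "\n"; // E2 80 99

// hex2bin reverses the process once the spaces are removed.
$back = hex2bin(str_replace(' ', '', $hex));
var_dump($back === $bytes); // bool(true)
```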

Answer:

I guess what you're looking for are multibyte versions of ord and chr.

I wrote the following polyfill for that:

if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        // Getter when called without an argument, setter otherwise.
        return ($encoding === NULL)
            ? iconv_get_encoding('internal_encoding')
            : iconv_set_encoding('internal_encoding', $encoding);
    }
}

if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}

if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}

if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}

Demo

echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));

echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));

echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));

echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));

Output:

Get string from numeric DEC value
string(3) "我"
string(3) "好"

Get string from numeric HEX value
string(3) "我"
string(3) "好"

Get numeric value of character as DEC int
int(25105)
int(22909)

Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"
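
Applied to the original question, mb_chr can turn each literal \uXXXX escape straight into a UTF-8 character, skipping the HTML entity step. This is a minimal sketch under stated assumptions: mb_chr is available (natively in PHP 7.2+, or through the polyfill above), the sample string is made up, and each escape is a single code point rather than half of a surrogate pair.

```php
<?php
// Hypothetical sample: the escape is stored as six literal characters.
$text = 'It\u2019s';

$fixed = preg_replace_callback(
    // Match a literal backslash, "u", and exactly four hex digits.
    '/\\\\u([0-9a-fA-F]{4})/',
    function (array $m) {
        $ch = mb_chr(hexdec($m[1]), 'UTF-8');
        // Leave the escape untouched if the code point cannot be encoded.
        return ($ch === false) ? $m[0] : $ch;
    },
    $text
);

echo bin2hex($fixed), "\n"; // 4974e2809973
```

Escapes outside the Basic Multilingual Plane are written as two \uXXXX surrogates and would need to be combined before conversion; this sketch does not handle them.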