# Character Encoding

Set the default character set that PHP announces to the browser in php.ini:

```
default_charset = "UTF-8"
```

Set the encoding for a single page:

```
header("Content-Type: text/html; charset=UTF-8");
```

Encoding precedence, from highest to lowest:

1. The charset parameter in the Content-Type field of the HTTP response header.
2. A meta element declaration with http-equiv set to Content-Type.
3. The charset attribute set on individual elements.

# **Character Set Conversion**

# mb\_detect\_encoding: detect the character encoding

# mb\_detect\_order: set the order in which encodings are tried

```
// Use only names that mbstring recognises; labels such as "ANSI" or
// "Unicode" are not mbstring encodings. UTF-16 and UTF-32 are left out
// because detection always fails for them (see below).
mb_detect_order('ASCII,UTF-8,EUC-CN,CP936,GB18030,BIG-5');

// Show the current detection order
echo implode(", ", mb_detect_order());
```

This setting affects [mb\_detect\_encoding()](http://www.php.net/manual/zh/function.mb-detect-encoding.php) and [mb\_send\_mail()](http://www.php.net/manual/zh/function.mb-send-mail.php).

mbstring implements detection filters for the following encodings, and detection fails for them when the input contains an invalid byte sequence: *UTF-8*, *UTF-7*, *ASCII*, *EUC-JP*, *SJIS*, *eucJP-win*, *SJIS-win*, *JIS*, *ISO-2022-JP*.

For *UTF-16*, *UTF-32*, *UCS2* and *UCS4*, encoding detection always fails.

For *ISO-8859-\**, *mbstring* always detects the input as *ISO-8859-\**.

# [mb\_internal\_encoding()](http://www.php.net/manual/zh/function.mb-internal-encoding.php) - set/get the internal character encoding

# [mb\_http\_input()](http://www.php.net/manual/zh/function.mb-http-input.php) - detect the HTTP input character encoding

# [mb\_http\_output()](http://www.php.net/manual/zh/function.mb-http-output.php) - set/get the HTTP output character encoding

# [mb\_send\_mail()](http://www.php.net/manual/zh/function.mb-send-mail.php) - send encoded mail

Two detection orders that are useless in practice:

Always detected as ISO-8859-1: detect\_order = ISO-8859-1, UTF-8

Always detected as UTF-8, because ASCII/UTF-7 values are valid UTF-8: detect\_order = UTF-8, ASCII, UTF-7

Example:

```
// Detect the encoding of $data using the configured detection order.
// Strict mode is on; the function returns false when nothing matches.
function getEncoding($data)
{
    return mb_detect_encoding($data, mb_detect_order(), true);
}

$str = 'username';
$encoding = getEncoding($str);

// Convert only when detection succeeded and the text is not UTF-8 already.
if ($encoding !== false && $encoding !== 'UTF-8') {
    $str = iconv($encoding, 'UTF-8', $str);
}
```

**Chinese character encoding standards**

**GB2312**, **CP936**, **GBK**, **GB18030**, **GB13000**

In terms of encoding, each standard is a proper superset of the one before it: ASCII ⊂ GB2312 ⊂ GBK ⊂ GB18030.

Character set: assigns every character a unique ID (formally a code point).

Encoding scheme: the rule that turns a code point into a byte sequence. Encoding and decoding can loosely be compared to encrypting and decrypting, although nothing is kept secret.

Unicode in the broad sense is a standard that defines one character set plus a family of encoding schemes, that is, the Unicode character set together with UTF-8, UTF-16, UTF-32 and so on.

The Unicode character set assigns a code point to every character. For example, the code point of 「知」 is 30693, written U+77E5 (30693 in hexadecimal is 0x77E5).

| Encoding | Description | Windows code page (alias) | Standard owner |
| --- | --- | --- | --- |
| **ASCII** | Single-byte encoding. One byte is 8 bits, but ASCII uses only 7 of them and defines 128 characters: upper- and lower-case English letters, digits, and special symbols such as ! @ #. The capital letter `U`, for instance, is `01010101`. | --- | International |
| GB2312 | Two-byte encoding. China issued GB2312-80 in 1980. It contains **7,445 characters**: 6,763 simplified Chinese characters (no traditional ones) and 682 full-width characters, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters. GB2312-80 is usually shortened to GB2312. | CP936 in earlier versions of Windows | Mainland China |
| Unicode 1.1 | The international standard **Unicode 1.1** was released in 1993. It covers the Han characters used in the character sets of mainland China, Taiwan, Japan and Korea, **20,902 characters** in total. | --- | International |
| GB13000 | Mainland China issued "GB 13000.1-93", equivalent to Unicode 1.1 and shortened to GB13000. It contains everything in GB2312 plus many characters that GB2312 lacks (**20,902** in total), such as characters simplified only after GB 2312-80 was published (for example "啰"), characters used in personal names (for example "镕" in the name of former premier 朱镕基), traditional characters used in Taiwan and Hong Kong, and Japanese and Korean Han characters. | --- (uses the UTF family of encodings) | Mainland China |
| GBK | Microsoft's extension of GB2312. It uses the code space that GB 2312-80 left unassigned to hold all Han characters of GB 13000.1-93 and Unicode 1.1. GBK contains **21,886 characters**, split into a Han character area (21,003 characters) and a graphic symbol area. GBK is compatible with GB2312. Its byte encoding differs from GB13000, so the two are not compatible, although their character repertoires are roughly the same. | CP936 in current versions of Windows | Microsoft |
| GB18030 | GBK was never a national standard, and the original GB13000 was never adopted by the industry. In 2000 China published GB18030-2000, shortened to GB18030. It is technically compatible with GBK rather than GB13000, replaced GBK 1.0, and became the official national standard. It contains 27,484 Han characters plus characters of other ethnic minority scripts. | CP54936 | Mainland China |
| **Unicode** | Created to end the incompatibility between national encodings: the International Organization for Standardization (ISO) and a group of multilingual software vendors worked together on a single encoding for all languages. It is mainly used for in-memory processing. Early Unicode (UCS-2) used **two bytes** per character, which covers the characters in mainstream use; the code space has since grown to U+10FFFF, so two bytes are no longer enough for every character. The matching ISO/IEC 10646 standard is named Universal Multiple-Octet Coded Character Set, or UCS. | --- | --- |
| UTF-8 | A variable-length encoding derived from Unicode code points by a fixed set of rules. Each character takes 1 to 4 bytes. UTF-8 usually shortens the encoded text, especially when most characters are English, and is used for transmission and storage. For example, `U` is `01010101` in ASCII but `0000000001010101` in two-byte Unicode, so the two-byte form needs twice the space for the same English character. Converting to UTF-8 **usually** saves space when English characters dominate. | | |
| ISO | European languages | | |

An example: It's 知乎日报

Unicode: each character corresponds to one hexadecimal number.

```
I 0049
t 0074
' 0027
s 0073
  0020
知 77e5
乎 4e4e
日 65e5
报 62a5
```

Stored strictly as two-byte Unicode (UCS-2), the text looks like this:

```
I 00000000 01001001
t 00000000 01110100
' 00000000 00100111
s 00000000 01110011
  00000000 00100000
知 01110111 11100101
乎 01001110 01001110
日 01100101 11100101
报 01100010 10100101
```

For the English characters the first 9 bits are all 0, which is wasted space, so we switch to UTF-8:

```
I 01001001
t 01110100
' 00100111
s 01110011
  00100000
知 11100111 10011111 10100101
乎 11100100 10111001 10001110
日 11100110 10010111 10100101
报 11100110 10001010 10100101
```

For the detailed conversion steps see <https://www.zhihu.com/question/23374078>

Because UTF-8 usually shortens the encoded text, it is widely used for storage and transmission. It has a drawback too: Chinese and English characters have different byte lengths under UTF-8, which makes in-memory manipulation more complicated. For that reason strings in memory are generally handled as Unicode code points, as Python does. The figure below shows the relationship between Unicode and UTF-8.

![](https://img.kancloud.cn/4e/8e/4e8eabf2de63ea3db230cc212cc6e8eb_1152x648.png)

[**Code Page**](https://www.crifan.com/chinese_character_encoding_standard__unicode__code_page/)

What is a code page? A code page is a mapping table between a national character encoding and Unicode.

Several vendors exist today, and each has defined its own code pages for the character sets it supports.

Unicode contains the GB13000 repertoire.

**Table 3: Code pages for DBCS character sets**

| **Code Page** | **Character Set** |
| --- | --- |
| 932 | Japanese |
| 936 | GBK (Simplified Chinese) |
| 949 | Korean |
| 950 | BIG5 (Traditional Chinese) |

**Table 4: Microsoft code pages for other character encodings**

| **Code Page** | **Character Set** |
| --- | --- |
| 1200 | UTF-16LE Unicode little-endian |
| 1201 | UTF-16BE Unicode big-endian |
| 65000 | UTF-7 Unicode |
| 65001 | UTF-8 Unicode |
| 10000 | Macintosh Roman encoding (followed by several other Mac character sets) |
| 10007 | Macintosh Cyrillic encoding |
| 10029 | Macintosh Central European encoding |
| 20127 | US-ASCII, the classic US 7-bit character set with no character larger than 127 |
| 28591 | ISO-8859-1 (followed by ISO-8859-2 to ISO-8859-15) |

Microsoft defined its own series of code pages, known as ANSI code pages.

CP1252 was originally based on an ANSI draft that later became ISO 8859-1. CP1252 can therefore be regarded as based on ISO 8859-1, except that it reuses the C1 control code range of ISO 8859-1 (0x80 to 0x9F) for additional printable characters.

| **Code Page** | **Character Set** |
| --- | --- |
| 1250 | Central and East European Latin |
| 1251 | Cyrillic |
| 1252 | West European Latin |
| 1253 | Greek |
| 1254 | Turkish |
| 1255 | Hebrew |
| 1256 | Arabic |
| 1257 | Baltic |
| 1258 | Vietnamese |
| 874 | Thai |

The code pages met most often are CP936 for Simplified Chinese, CP950 for Traditional Chinese, and 65001 for UTF-8.

<https://my.oschina.net/junn/blog/282160>

## **Converting between character encodings:**

`iconv('GB2312', 'UTF-8//IGNORE', $str);` converts a string from GB2312 to UTF-8. These notes describe it as implemented in C and fast.

Converting from UTF-8 to GB2312 can stop at a character that has no GB2312 equivalent and raise an error; the dash character U+2014 is one example. Appending `//IGNORE` or `//TRANSLIT` to the target encoding works around this:

`//IGNORE` drops characters that cannot be converted.

`//TRANSLIT` substitutes a similar-looking character when the target encoding has no match for the source character.

`mb_convert_encoding($string, $to_encoding, $from_encoding)` requires the mbstring extension to be enabled, and these notes describe it as slower. When a character cannot be converted, for example from UTF-8 to GBK, it does not fail; the character is replaced with the substitute character configured through `mb_substitute_character()`, which is `?` by default.

Use iconv in general. Fall back to mb\_convert\_encoding only when the source encoding cannot be determined or when the output of iconv does not display correctly.

[Character sets supported by iconv](http://www.gnu.org/software/libiconv/)

The mbstring PHP extension (the functions named mb_xxx) supports the following character encodings:

```
UCS-4*
UCS-4BE
UCS-4LE*
UCS-2
UCS-2BE
UCS-2LE
UTF-32*
UTF-32BE*
UTF-32LE*
UTF-16*
UTF-16BE*
UTF-16LE*
UTF-7
UTF7-IMAP
UTF-8*
ASCII*
EUC-JP*
SJIS*
eucJP-win*
SJIS-win*
ISO-2022-JP
ISO-2022-JP-MS
CP932
CP51932
SJIS-mac** (alias: MacJapanese)
SJIS-Mobile#DOCOMO** (alias: SJIS-DOCOMO)
SJIS-Mobile#KDDI** (alias: SJIS-KDDI)
SJIS-Mobile#SOFTBANK** (alias: SJIS-SOFTBANK)
UTF-8-Mobile#DOCOMO** (alias: UTF-8-DOCOMO)
UTF-8-Mobile#KDDI-A**
UTF-8-Mobile#KDDI-B** (alias: UTF-8-KDDI)
UTF-8-Mobile#SOFTBANK** (alias: UTF-8-SOFTBANK)
ISO-2022-JP-MOBILE#KDDI** (alias: ISO-2022-JP-KDDI)
JIS
JIS-ms
CP50220
CP50220raw
CP50221
CP50222
ISO-8859-1*
ISO-8859-2*
ISO-8859-3*
ISO-8859-4*
ISO-8859-5*
ISO-8859-6*
ISO-8859-7*
ISO-8859-8*
ISO-8859-9*
ISO-8859-10*
ISO-8859-13*
ISO-8859-14*
ISO-8859-15*
ISO-8859-16*
byte2be
byte2le
byte4be
byte4le
BASE64
HTML-ENTITIES
7bit
8bit
EUC-CN*
CP936
GB18030**
HZ
EUC-TW*
CP950
BIG-5*
EUC-KR*
UHC (CP949)
ISO-2022-KR
Windows-1251 (CP1251)
Windows-1252 (CP1252)
CP866 (IBM866)
KOI8-R*
KOI8-U*
ArmSCII-8 (ArmSCII8)

*  this encoding can also be used in regular expressions.
** this encoding is available since PHP 5.4.0.
```
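The conversion workflow in this section (work out the source encoding, then convert to UTF-8) can be sketched as follows. This is only an illustration under stated assumptions: `toUtf8()` and its default candidate list are hypothetical names chosen for this example, the mbstring extension must be enabled, and the script file itself must be saved as UTF-8. Instead of guessing with `mb_detect_encoding()`, the sketch validates the bytes against each candidate with `mb_check_encoding()`.

```php
<?php
/**
 * Hypothetical helper (not part of PHP or of the notes above):
 * return $data as UTF-8, trying each candidate encoding in order.
 * Returns null when the bytes are valid in none of the candidates.
 */
function toUtf8(string $data, array $candidates = ['GB18030', 'BIG-5']): ?string
{
    // Valid UTF-8 (which includes plain ASCII) is returned unchanged.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }

    foreach ($candidates as $encoding) {
        // Check the byte sequence against one specific encoding.
        if (mb_check_encoding($data, $encoding)) {
            return mb_convert_encoding($data, 'UTF-8', $encoding);
        }
    }

    return null; // no candidate matched
}

// Usage: build GB18030 bytes from a UTF-8 literal, then convert them back.
$gbBytes = mb_convert_encoding('知乎日报', 'GB18030', 'UTF-8');
$utf8    = toUtf8($gbBytes);

echo bin2hex($gbBytes), PHP_EOL; // raw GB18030 bytes, shown as hex
echo $utf8, PHP_EOL;             // the text again, now as UTF-8
```

The order of the candidate list matters: many byte sequences are valid in more than one legacy encoding, so the first match wins. Put the encoding you consider most likely for your data first.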