分割gbk中文出现乱码的问题解决
近日遇到一个神奇的字“弢(tao)”。
具体的过程是这样的:
<span style="color: #008080;">1</span> <span style="color: #800080;">$list</span> = <span style="color: #008080;">explode</span>('|', 'abc弢|bc'<span style="color: #000000;">);</span><span style="color: #008080;">2</span> <span style="color: #008080;">var_dump</span>(<span style="color: #800080;">$list</span>);
取得这个分割的结果。
和想象不同,结果居然是这样:
<span style="color: #0000ff;">array</span>(3<span style="color: #000000;">) { [</span>0]=> <span style="color: #0000ff;">string</span>(4) "<span style="color: #000000;">abc? [1]=> string(0) </span>""<span style="color: #000000;"> [2]=> string(2) </span>"bc"<span style="color: #000000;">}</span>
出现了乱码,而且莫名其妙的出现了一个空元素。
究其原因,原来这个字“弢”的gbk编码是8f7c,而|的ASCII是7c,这样explode就把弢的第二ASCII作为|切割了。
既然是双字节的问题,我们用mbstring解决好了。
可惜,php并没有mb_explode这种函数,找了找,找到一个mb_split。
<span style="color: #0000ff;">array</span> mb_split ( <span style="color: #0000ff;">string</span> <span style="color: #800080;">$pattern</span> , <span style="color: #0000ff;">string</span> <span style="color: #800080;">$string</span> [, int <span style="color: #800080;">$limit</span> = -1 ] )
没有声明编码的地方。仔细一看,他是通过mb_regex_encoding声明编码的。
于是写出以下的代码:
<span style="color: #008080;">1</span> mb_regex_encoding('gbk'<span style="color: #000000;">);</span><span style="color: #008080;">2</span> <span style="color: #800080;">$list</span> = mb_split('\|', 'abc弢|bc'<span style="color: #000000;">);</span><span style="color: #008080;">3</span> <span style="color: #008080;">var_dump</span>(<span style="color: #800080;">$list</span>);
结果php报错,mb_regex_encoding不认识gbk,囧。
那就使用它认识的:
<span style="color: #008080;">1</span> mb_regex_encoding('gb2312'<span style="color: #000000;">);</span><span style="color: #008080;">2</span> <span style="color: #800080;">$list</span> = mb_split('\|', 'abc弢|bc'<span style="color: #000000;">);</span><span style="color: #008080;">3</span> <span style="color: #008080;">var_dump</span>(<span style="color: #800080;">$list</span>);
结果:
<span style="color: #0000ff;">array</span>(3<span style="color: #000000;">) { [</span>0]=> <span style="color: #0000ff;">string</span>(4) "<span style="color: #000000;">abc? [1]=> string(0) </span>""<span style="color: #000000;"> [2]=> string(2) </span>"bc"<span style="color: #000000;">}</span>
发现,这种方法并没有什么用处。、
至于原因?“弢”这个字居然不在GB2312的编码集里面!!!!!但是有这个字的编码集(GBK, GB18030)这个函数都不支持!!!!!
既然这个不好用,也许万能的正则表达式是ok的。于是得到以下代码:
<span style="color: #008080;">1</span> <span style="color: #008080;">var_dump</span>(<span style="color: #008080;">preg_match_all</span>('/([^\|])*/', 'abc弢|bc', <span style="color: #800080;">$matches</span><span style="color: #000000;">));</span><span style="color: #008080;">2</span> <span style="color: #008080;">var_dump</span>(<span style="color: #800080;">$matches</span>);
结果:
int(2<span style="color: #000000;">)</span><span style="color: #0000ff;">array</span>(2<span style="color: #000000;">) { [</span>0]=> <span style="color: #0000ff;">array</span>(2<span style="color: #000000;">) { [</span>0]=> <span style="color: #0000ff;">string</span>(4) "<span style="color: #000000;">abc? [1]=> string(2) </span>"bc"<span style="color: #000000;"> } [1]=> array(2) { [0]=> string(1) </span>"?<span style="color: #000000;"> [</span>1]=> <span style="color: #0000ff;">string</span>(1) "c"<span style="color: #000000;"> }}</span>
好吧,我想多了。
现在研究一下,如何用正则描述这个场景。
参考一下,鸟哥大神的博客:分割GBK中文遭遇乱码的解决。遗憾的是,正则能力比较low的我,还是想不出来合适的正则表达式(如果有想出这个正则表达式的大神们,希望可以告诉我)。
没办法,思来想去,只好用substr了:
<span style="color: #008080;"> 1</span> <span style="color: #0000ff;">function</span> mb_explode(<span style="color: #800080;">$delimiter</span>, <span style="color: #800080;">$string</span>, <span style="color: #800080;">$encoding</span> = <span style="color: #0000ff;">null</span><span style="color: #000000;">){</span><span style="color: #008080;"> 2</span> <span style="color: #800080;">$list</span> = <span style="color: #0000ff;">array</span><span style="color: #000000;">();</span><span style="color: #008080;"> 3</span> <span style="color: #008080;">is_null</span>(<span style="color: #800080;">$encoding</span>) && <span style="color: #800080;">$encoding</span> =<span style="color: #000000;"> mb_internal_encoding();</span><span style="color: #008080;"> 4</span> <span style="color: #800080;">$len</span> = mb_strlen(<span style="color: #800080;">$delimiter</span>, <span style="color: #800080;">$encoding</span><span style="color: #000000;">);</span><span style="color: #008080;"> 5</span> <span style="color: #0000ff;">while</span>(<span style="color: #0000ff;">false</span> !== (<span style="color: #800080;">$idx</span> = mb_strpos(<span style="color: #800080;">$string</span>, <span style="color: #800080;">$delimiter</span>, 0, <span style="color: #800080;">$encoding</span><span style="color: #000000;">))){</span><span style="color: #008080;"> 6</span> <span style="color: #800080;">$list</span>[] = mb_substr(<span style="color: #800080;">$string</span>, 0, <span style="color: #800080;">$idx</span>, <span style="color: #800080;">$encoding</span><span style="color: #000000;">);</span><span style="color: #008080;"> 7</span> <span style="color: #800080;">$string</span> = mb_substr(<span style="color: #800080;">$string</span>, <span style="color: #800080;">$idx</span> + <span style="color: #800080;">$len</span>, <span style="color: #0000ff;">null</span>, <span style="color: #800080;">$encoding</span><span style="color: #000000;">);</span><span style="color: #008080;"> 8</span> <span style="color: #000000;"> } </span><span style="color: #008080;"> 9</span> <span style="color: #800080;">$list</span>[] = <span style="color: #800080;">$string</span><span style="color: #000000;">;</span><span style="color: #008080;">10</span> <span style="color: #0000ff;">return</span> <span style="color: #800080;">$list</span><span style="color: #000000;">; </span><span style="color: #008080;">11</span> }
测试代码:
<span style="color: #008080;">1</span> <span style="color: #800080;">$a</span> = 'abc弢|bc'<span style="color: #000000;">;</span><span style="color: #008080;">2</span> <span style="color: #008080;">3</span> <span style="color: #008080;">var_dump</span>(mb_explode('|', <span style="color: #800080;">$a</span>, 'gbk'<span style="color: #000000;">));</span><span style="color: #008080;">4</span> <span style="color: #008080;">var_dump</span>(mb_explode('bc', <span style="color: #800080;">$a</span>, 'gbk'<span style="color: #000000;">));</span><span style="color: #008080;">5</span> <span style="color: #008080;">var_dump</span>(mb_explode('弢', <span style="color: #800080;">$a</span>, 'gbk'));
结果:
<span style="color: #0000ff;">array</span>(2<span style="color: #000000;">) { [</span>0]=> <span style="color: #0000ff;">string</span>(5) "abc弢"<span style="color: #000000;"> [</span>1]=> <span style="color: #0000ff;">string</span>(2) "bc"<span style="color: #000000;">}</span><span style="color: #0000ff;">array</span>(3<span style="color: #000000;">) { [</span>0]=> <span style="color: #0000ff;">string</span>(1) "a"<span style="color: #000000;"> [</span>1]=> <span style="color: #0000ff;">string</span>(3) "弢|"<span style="color: #000000;"> [</span>2]=> <span style="color: #0000ff;">string</span>(0) ""<span style="color: #000000;">}</span><span style="color: #0000ff;">array</span>(2<span style="color: #000000;">) { [</span>0]=> <span style="color: #0000ff;">string</span>(3) "abc"<span style="color: #000000;"> [</span>1]=> <span style="color: #0000ff;">string</span>(3) "|bc"<span style="color: #000000;">}</span>
这样就可以得到正确的结果了。

使用Java的String.valueOf()函数将基本数据类型转换为字符串在Java开发中,当我们需要将基本数据类型转换为字符串时,一种常见的方法是使用String类的valueOf()函数。这个函数可以接受基本数据类型的参数,并返回对应的字符串表示。在本文中,我们将探讨如何使用String.valueOf()函数进行基本数据类型转换,并提供一些代码示例来

char数组转string的方法:可以通过赋值来实现,使用{char a[]=" abc d\0efg ";string s=a;}语法,让char数组对string直接赋值,执行代码即可完成转换。

使用Java的String.replace()函数替换字符串中的字符(串)在Java中,字符串是不可变的对象,这意味着一旦创建了一个字符串对象,就无法修改它的值。但是,你可能会遇到需要替换字符串中的某些字符或者字符串的情况。这时候,我们可以使用Java的String类中的replace()方法来实现字符串的替换。String类的replace()方法有两种重

大家好,今天给大家分享java基础知识之String。String类的重要性就不必说了,可以说是我们后端开发用的最多的类,所以,很有必要好好来聊聊它。

一、认识String1.JDK中的String首先我们看看JDK中的String类源码,它实现了很多接口,可以看到String类被final修饰了,这就说明String类不可以被继承,String不存在子类,这样所有使用JDK的人,用到的String类都是同一个,如果String允许被继承,每个人都可以对String进行扩展,每个人使用的String都不是同一个版本,两个不同的人使用相同的方法,表现出不同的结果,这就导致代码没办法进行开发了继承和方法覆写在带来灵活性的同时,也会带来很多子类行为不

使用Java的String.length()函数获取字符串的长度在Java编程中,字符串是一种非常常见的数据类型,我们经常需要获取字符串的长度,即字符串中字符的个数。在Java中,我们可以使用String类的length()函数来获取字符串的长度。下面是一个简单的示例代码:publicclassStringLengthExample{publ

String中split方法使用String的split()方法用于按传入的字符或字符串对String进行拆分,返回拆分之后的数组。1、一般用法用一般的字符,例如@或,等符号做分隔符时:Stringaddress="上海@上海市@闵行区@吴中路";String[]splitAddr=address.split("@");System.out.println(splitAddr[0]+splitAddr[1]+splitAddr[2]+splitAddr[3

在Golang编程中,byte、rune和string类型是非常基础、常见的数据类型。它们在处理字符串、文件流等数据操作时发挥着重要作用。而在进行这些数据操作时,我们通常需要对它们进行相互的转换,这就需要掌握一些转换技巧。本文将介绍Golang函数的byte、rune和string类型转换技巧,旨在帮助读者更好地理解这些数据类型,并能够熟练地在编程实践中应用


Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Atom editor mac version download
The most popular open source editor

Dreamweaver Mac version
Visual web development tools

Safe Exam Browser
Safe Exam Browser is a secure browser environment for taking online exams securely. This software turns any computer into a secure workstation. It controls access to any utility and prevents students from using unauthorized resources.

DVWA
Damn Vulnerable Web App (DVWA) is a PHP/MySQL web application that is very vulnerable. Its main goals are to be an aid for security professionals to test their skills and tools in a legal environment, to help web developers better understand the process of securing web applications, and to help teachers/students teach/learn in a classroom environment Web application security. The goal of DVWA is to practice some of the most common web vulnerabilities through a simple and straightforward interface, with varying degrees of difficulty. Please note that this software

mPDF
mPDF is a PHP library that can generate PDF files from UTF-8 encoded HTML. The original author, Ian Back, wrote mPDF to output PDF files "on the fly" from his website and handle different languages. It is slower than original scripts like HTML2FPDF and produces larger files when using Unicode fonts, but supports CSS styles etc. and has a lot of enhancements. Supports almost all languages, including RTL (Arabic and Hebrew) and CJK (Chinese, Japanese and Korean). Supports nested block-level elements (such as P, DIV),