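Could someone who really knows this explain, in detail but in plain terms,
how unicode, utf-8, decode, and encode relate to each other in Python 2?
I don't feel my understanding of this is clear enough yet. Any help would be appreciated, thanks!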
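ASCII and Unicode are character sets; UTF-8 is a way of encoding a character set.
Specifically, UTF-8 is one encoding of the Unicode character set. In Python 2, a `str` holds encoded bytes and a `unicode` holds code points: `decode` turns bytes into unicode, and `encode` turns unicode back into bytes.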
<code class="language-python">In [1]: a = '你好'
In [2]: a
Out[2]: '\xe4\xbd\xa0\xe5\xa5\xbd'
In [3]: b = a.decode('utf-8')
In [4]: b
Out[4]: u'\u4f60\u597d'
In [5]: type(b)
Out[5]: unicode
In [6]: type(a)
Out[6]: str
In [7]: c = b.encode('utf-8')
In [8]: c
Out[8]: '\xe4\xbd\xa0\xe5\xa5\xbd'
In [9]: c == a
Out[9]: True
</code>
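To make the character-set-versus-encoding distinction a bit more concrete, here is a minimal sketch that encodes the same two code points with three different encodings. It is written so it should run on both Python 2.7 and Python 3; the variable names are just for illustration, and picking UTF-16-LE and GBK as the comparison encodings is my own choice, not something from the answer above.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: one Unicode string, several byte encodings.
# Uses \u escapes so the result does not depend on the source file's encoding.

text = u'\u4f60\u597d'  # the two code points for "你好"

utf8_bytes = text.encode('utf-8')       # 3 bytes per character here
utf16_bytes = text.encode('utf-16-le')  # 2 bytes per character here
gbk_bytes = text.encode('gbk')          # 2 bytes per character here

print(len(text))         # 2 code points
print(len(utf8_bytes))   # 6 bytes
print(len(utf16_bytes))  # 4 bytes
print(len(gbk_bytes))    # 4 bytes

# Decoding with the matching codec always gets the same code points back.
print(utf8_bytes.decode('utf-8') == text)       # True
print(utf16_bytes.decode('utf-16-le') == text)  # True
print(gbk_bytes.decode('gbk') == text)          # True
```

The code points stay the same throughout; only the bytes differ, which is why you have to know which encoding a byte string uses before you decode it.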
Get good at searching. See Liao Xuefeng's (廖雪峰) blog post "Strings and Encodings" (字符串和编码).
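You can find plenty about this online. Unicode is about code points: an abstract \uXXXX that stands for one character. UTF-8 is one encoding of Unicode that uses a variable number of bytes (one to four) to represent each abstract code point. So UTF-8 is an actual byte string, while Unicode is the abstraction. You can encode the abstract Unicode into UTF-8, and you can decode the actual UTF-8 bytes back into Unicode. I've said all this, and it probably still won't help much...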
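As a rough sketch of the "x bytes per code point" idea: the snippet below encodes four sample characters to UTF-8 and counts the bytes, then shows that decoding UTF-8 bytes with the wrong codec fails. The sample characters and variable names are my own picks for illustration, and the code is meant to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: UTF-8 uses 1 to 4 bytes depending on the code point.

samples = [
    u'A',           # U+0041, ASCII range
    u'\u00e9',      # U+00E9, Latin small e with acute
    u'\u4f60',      # U+4F60, the character "你"
    u'\U0001F600',  # U+1F600, outside the Basic Multilingual Plane
]

byte_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(byte_lengths)  # [1, 2, 3, 4]

# The same bytes only make sense with the right codec.
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'  # UTF-8 bytes for "你好"
try:
    raw.decode('ascii')
    decoded_as_ascii = True
except UnicodeDecodeError:
    # Bytes >= 0x80 are not valid ASCII, so this branch is taken.
    decoded_as_ascii = False

print(decoded_as_ascii)                        # False
print(raw.decode('utf-8') == u'\u4f60\u597d')  # True
```

This is also the usual source of garbled Chinese text in Python 2: bytes get decoded, explicitly or implicitly, with a codec other than the one that produced them.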
Try searching for "将python2中汉字会出现乱码的事一次性说清楚" (roughly: "Explaining garbled Chinese characters in Python 2 once and for all").