Handling HTML escape characters in Python

高洛峰, original, 2017-03-01 13:27:57, 2098 views

This article uses examples to show how to handle HTML escape characters in Python. It is shared here for reference, as follows:

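Recently, while processing web page data with Python, I kept running into HTML escape characters (also called HTML character entities), such as &nbsp;, &lt; and &gt;. Character entities generally represent reserved characters in a web page. For example, > is written as &gt; so that the browser does not mistake it for part of a tag; see the w3school page on HTML character entities for details. Useful as they are, they seriously interfere with parsing web page data. There are the following ways to handle these escape characters: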

1. Using HTMLParser

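# Python 2 only: the HTMLParser module and its unescape() method
import HTMLParser
html_cont = "&nbsp;asdfg&gt;123&lt;"
html_parser = HTMLParser.HTMLParser()
new_cont = html_parser.unescape(html_cont)
print(new_cont)  # new_cont = u"\xa0asdfg>123<" (&nbsp; becomes U+00A0)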
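
On Python 3 the HTMLParser module was renamed to html.parser, and its unescape() method was removed in Python 3.9. The standard library function html.unescape (available since Python 3.4) handles the same case. A minimal sketch, assuming Python 3.4 or later:

```python
import html

html_cont = "&nbsp;asdfg&gt;123&lt;"

# html.unescape converts named and numeric character references
# into the corresponding Unicode characters.
new_cont = html.unescape(html_cont)

# repr() makes the non-breaking space (U+00A0) visible.
print(repr(new_cont))
```

Note that &nbsp; decodes to a non-breaking space (U+00A0), not an ordinary space, which matters if the result is later split or stripped.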

Converting back (only the space is not converted back, because cgi.escape escapes just &, < and >):

import cgi
new_cont = cgi.escape(new_cont)  # escapes &, < and > only
print(new_cont)  # new_cont = u"\xa0asdfg&gt;123&lt;"
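
cgi.escape was deprecated in Python 3.2 and removed in Python 3.8; html.escape is its standard library replacement. The sketch below, assuming Python 3, shows the same round trip and one possible way to put the non-breaking space back by hand. The final replace() step is an addition for illustration, not part of the original method.

```python
import html

text = "\xa0asdfg>123<"  # decoded text; \xa0 is the non-breaking space

# quote=False mirrors cgi.escape's default: only &, < and > are escaped.
escaped = html.escape(text, quote=False)

# html.escape leaves U+00A0 alone, so restore &nbsp; explicitly if needed.
restored = escaped.replace("\xa0", "&nbsp;")

print(repr(escaped))
print(restored)
```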

2. Replacing entities one by one

html_cont = "&nbsp;asdfg&gt;123&lt;"
new_cont = html_cont.replace('&nbsp;', ' ')
print(new_cont)  # new_cont = " asdfg&gt;123&lt;"
new_cont = new_cont.replace('&gt;', '>')
print(new_cont)  # new_cont = " asdfg>123&lt;"
new_cont = new_cont.replace('&lt;', '<')
print(new_cont)  # new_cont = " asdfg>123<"
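
The one-by-one replacement can also be driven by a table, which keeps the entity list in one place. This is only a sketch: replace_entities and REPLACEMENTS are hypothetical names, and the table covers just the entities used in this article plus &amp;. The order matters: &amp; has to be replaced last, otherwise text such as &amp;lt; would be decoded twice.

```python
# Hypothetical helper; the table is illustrative, not a complete entity list.
REPLACEMENTS = [
    ("&nbsp;", " "),
    ("&gt;", ">"),
    ("&lt;", "<"),
    ("&amp;", "&"),  # last, so "&amp;lt;" ends up as "&lt;", not "<"
]


def replace_entities(text):
    """Replace each known entity in order and return the new string."""
    for entity, char in REPLACEMENTS:
        text = text.replace(entity, char)
    return text


print(replace_entities("&nbsp;asdfg&gt;123&lt;"))
print(replace_entities("&amp;lt;"))
```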

I don't know whether there is a better way.

In addition, Stack Overflow has an answer on handling escape characters in XML: python - What's the best way to handle &nbsp;-like entities in XML documents with lxml? - Stack Overflow.
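
For context, XML predefines only five entities (&lt;, &gt;, &amp;, &quot; and &apos;), so an XML parser rejects &nbsp; unless a DTD declares it. The sketch below is not the approach from the linked answer and does not use lxml. It assumes Python 3 and the standard library's xml.etree.ElementTree, and rewrites HTML named entities as numeric character references before parsing. resolve_html_entities is a hypothetical helper name, and the regular expression would also rewrite entity-like text inside CDATA sections.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already understands.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def resolve_html_entities(xml_text):
    """Rewrite HTML named entities (e.g. &nbsp;) as numeric references."""
    def to_char_ref(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own and unknown names alone
        return "&#{};".format(name2codepoint[name])

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", to_char_ref, xml_text)


doc = "<p>a&nbsp;b &lt; c</p>"
root = ET.fromstring(resolve_html_entities(doc))
print(repr(root.text))
```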
