Question


Python crawler: how do I convert encodings like &#x5c0f;&#x8bf4; into Chinese?


 <dt>学科主题:</dt>
                <dd><a href="openlink.php?keyword=%E9%95%BF%E7%AF%87%E5%B0%8F%E8%AF%B4">长篇小说</a>-中国-当代</dd>
            </dl>
                        <dl class="booklist">
                <dt>中图法分类号:</dt>
                <dd><a href="openlink.php?coden=I247.5">I247.5</a></dd>
            </dl>
                        <dl class="booklist">
                <dt>提要文摘附注:</dt>
                <dd>小说中的主人公，正是因为当年盗墓的爷爷人赘杭州而身在杭州，开了一家小的古董铺子，守护着那群长沙土夫子从古墓不知名怪物捭中拼命抢出的战国帛书……</dd>
            </dl>

How can I solve this?

python encoding python-crawler

10 years, 8 months ago

死神的歌谣


Answer 1

0

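This is a character reference (charref). Any HTML parsing library handles it correctly, so there is no need to process it by hand.
The Python standard library has HTMLParser (html.parser in Python 3).
For a third-party library, BeautifulSoup is recommended.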
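
As a rough illustration of the standard-library route, the sketch below assumes Python 3.4 or later. `html.unescape` decodes character references in a plain string, and `HTMLParser` with `convert_charrefs=True` hands already-decoded text to `handle_data`. The names `TextCollector` and `extract_text` are hypothetical helpers made up for this example, not part of any library.

```python
import html
from html.parser import HTMLParser

# html.unescape (Python 3.4+) decodes character references in a plain string.
print(html.unescape('&#x5c0f;&#x8bf4;'))


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers the text nodes of parsed HTML."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references
        # before passing text to handle_data.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.chunks)


sample = '<dd><a href="#">&#x957f;&#x7bc7;&#x5c0f;&#x8bf4;</a>-&#x4e2d;&#x56fd;</dd>'
print(extract_text(sample))
```

If only the decoded string is needed and the markup should stay intact, `html.unescape` on its own is probably enough; the parser subclass is only useful when the tags should be dropped as well.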

answered 10 years, 8 months ago

5870085


Answer 2

0


 

 # tested under python3.4

def convert(s):
    s = s.strip('&#x;')  # turn '&#x957f;' into '957f'
    s = bytes(r'\u' + s, 'ascii')  # turn '957f' into b'\\u957f'
    # Decode the bytes with the unicode_escape codec, which turns the escape
    # sequence b'\\u957f' into the Unicode character '长'. See the codecs docs.
    return s.decode('unicode_escape')

print(convert('&#x957f;'))  # => '长'

Replacing every occurrence in the text:


 

 import re

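# ss holds the page source as a str (it is not defined in this snippet)
print(re.sub(r'&#x[0-9a-fA-F]{4};',  # exactly four hex digits, which is all convert() handles
             lambda match: convert(match.group()),
             ss))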

Result after replacing across the whole text:


 <dt>学科主题:</dt>
            <dd><a href="openlink.php?keyword=%E9%95%BF%E7%AF%87%E5%B0%8F%E8%AF%B4">长篇小说</a>-中国-当代</dd>
        </dl>
                    <dl class="booklist">
            <dt>中图法分类号:</dt>
            <dd><a href="openlink.php?coden=I247.5">I247.5</a></dd>
        </dl>
                    <dl class="booklist">
            <dt>提要文摘附注:</dt>
            <dd>小说中的主人公，正是因为当年盗墓的爷爷人赘杭州而身在杭州，开了一家小的古董铺子，守护着那群长沙土夫子从古墓不知名怪物捭中拼命抢出的战国帛书……</dd>
        </dl>


 # for python2.7

def convert(s):
    return ''.join([r'\u', s.strip('&#x;')]).decode('unicode_escape')

ss = unicode(ss, 'gbk') # convert gbk-encoded byte-string ss to unicode string

import re
print re.sub(r'&#x[0-9a-fA-F]{4};', lambda match: convert(match.group()), ss)
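
The pattern above only covers references with exactly four hex digits. A possible generalisation for Python 3 is sketched below: it accepts hex references of any length as well as decimal ones, and converts the number with `int` and `chr` instead of going through `unicode_escape`. The names `CHARREF` and `decode_charrefs` are invented for this sketch, and named entities such as `&amp;` are deliberately left untouched.

```python
import re

# Matches hex references (&#x5c0f;) and decimal references (&#23567;).
CHARREF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def decode_charrefs(text):
    """Replace numeric character references in text with their characters."""

    def repl(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference as it was.
            return match.group(0)

    return CHARREF.sub(repl, text)


print(decode_charrefs('&#x5c0f;&#x8bf4;'))
```

For real pages, `html.unescape` or an HTML parser is likely the safer choice, since they also cover named entities and the edge cases defined by the HTML specification.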

answered 10 years, 8 months ago

Chero

