Question


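Python crawler: why does matching Chinese text always fail?

I'd like to ask everyone a question. I'm new to Python and want to extract "DJ00123987" and "号: DJ00123987", but matching the Chinese text always fails. Why is that? Also, how should Chinese characters and spaces be matched in a regular expression? Thanks! Everything is encoded as UTF-8.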


import re
html = '<span>微信号：DJ00123987</span>'
print html
a = re.search(u'<span>微信号: (.*?)</span>', html, re.S).group(1)
b = re.search(u'<span>微信(.*?)</span>', html, re.S).group(1)
print a,b

python regular-expressions python-crawler

9 years, 7 months ago

抽风西红柿


Answer 1


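Your regular expression is a unicode string, but your html is a plain str (a byte string). In Python 2, Chinese text in a str is usually UTF-8 encoded bytes, so matching a unicode pattern against a UTF-8 byte string naturally fails.

I suggest converting html to unicode.

That is, once you have the UTF-8 encoded html, decode it: content = html.decode('utf-8')

Then run the regular expression against content.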
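
A minimal sketch of this decode-then-match approach, assuming the page arrives as UTF-8 encoded bytes (simulated here with `.encode('utf-8')`). The names `raw`, `content`, and `tail` are illustrative, and the snippet is written so that it should run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
import re

# Simulate the UTF-8 encoded bytes a crawler would receive (assumption).
raw = u'<span>微信号：DJ00123987</span>'.encode('utf-8')

# Decode the bytes into a unicode string before matching.
content = raw.decode('utf-8')

# Pattern and subject are now both unicode, so the Chinese text can match.
match = re.search(u'<span>微信(.*?)</span>', content, re.S)

# Guard against a failed match instead of calling group(1) on None.
tail = match.group(1) if match else None
```

Decoding once, right after the download, keeps every later string operation in unicode, so patterns and subjects never mix types.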

answered 9 years, 7 months ago

Mr丶十六夜


Answer 2


You could use Beautiful Soup.

answered 9 years, 7 months ago

Kisai


Answer 3


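The string types differ. As @DDTDDT said, your html is missing the u prefix that marks a unicode string, while the regular expression does use unicode.
Also, the colon after 微信号 is half-width in one place and full-width in the other.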
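
One way to make the pattern tolerant of both colon widths and of optional whitespace is a character class followed by `\s*`. This is only a sketch: the sample strings and the names `pattern`, `samples`, and `ids` are made up for illustration.

```python
# -*- coding: utf-8 -*-
import re

# [:：] accepts either the half-width or the full-width colon.
# \s* allows optional whitespace after it; with re.U it also covers
# Unicode whitespace when run on Python 2.
pattern = re.compile(u'<span>微信号[:：]\\s*(.*?)</span>', re.S | re.U)

# Hypothetical inputs covering the variants discussed above.
samples = [
    u'<span>微信号：DJ00123987</span>',   # full-width colon, no space
    u'<span>微信号: DJ00123987</span>',   # half-width colon plus a space
    u'<span>微信号:DJ00123987</span>',    # half-width colon, no space
]

ids = [pattern.search(s).group(1) for s in samples]
```

Chinese characters need no escaping inside the pattern; they match literally as long as both the pattern and the subject are unicode strings.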

answered 9 years, 7 months ago

@五更琉璃@


Answer 4


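Is html missing the u prefix?
Also check which encoding your source file is saved in. Even with the u prefix, you may still hit unexpected problems if the file is saved as GBK.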
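
To illustrate why the saved encoding matters, the sketch below encodes the same text as UTF-8 and as GBK and then decodes the GBK bytes with the wrong codec. It only simulates the mismatch in memory (no file is actually written), and the variable names are illustrative:

```python
# -*- coding: utf-8 -*-
text = u'微信号'

# The same characters produce different byte sequences per encoding.
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')
same_bytes = (utf8_bytes == gbk_bytes)

# Reading GBK bytes as if they were UTF-8 fails for this text.
try:
    gbk_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding with the codec that matches the bytes round-trips cleanly.
roundtrip_ok = (gbk_bytes.decode('gbk') == text)
```

The same mismatch occurs when a file saved as GBK carries a UTF-8 encoding declaration, which is why the declaration and the actual file encoding should agree.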

answered 9 years, 7 months ago

多多良小傘


Answer 5


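Complete test code (Python 2):


# -*- encoding: utf8 -*-
import re

# The u prefix makes html a unicode string, like the patterns below.
html = u'<span>微信号：DJ00123987</span>'

print html

# The colon in the pattern is full-width, the same as in html.
a = re.search(u'<span>微信号：(.*?)</span>', html, re.S).group(1)
b = re.search(u'<span>微信(.*?)</span>', html, re.S).group(1)

print a, b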

Output:
[Screenshots of the output on Linux and Windows are not preserved.]

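Notes:

Save the file as UTF-8.
Add the encoding declaration # -*- encoding: utf8 -*- at the top of the file.
When assigning html, put the u prefix in front of the string literal.
In your assignment to a, the colon in the regular expression is half-width, unlike the full-width one in the original string, so the match fails. A failed match returns None, and calling group(1) on None raises an AttributeError.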
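
The last note can be handled by checking the match object before calling `group(1)`. The helper below, `first_group`, is a hypothetical name introduced only for this sketch; it is not part of the `re` module:

```python
# -*- coding: utf-8 -*-
import re


def first_group(pattern, text):
    """Return group(1) of the first match, or None when nothing matches."""
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else None


html = u'<span>微信号：DJ00123987</span>'

# Half-width colon plus a space: not present in html, so there is no match.
missing = first_group(u'<span>微信号: (.*?)</span>', html)

# Full-width colon: identical to the text in html, so the ID is captured.
found = first_group(u'<span>微信号：(.*?)</span>', html)
```

Returning None lets the caller decide how to handle a page that lacks the expected markup, rather than crashing the crawler with an AttributeError.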

answered 9 years, 7 months ago

大大大的肉棒

