Question

Suppose I want to match HTML like this:


 <div class="content">
    xxxxxxxxxxx
</div>

and extract the xxxxx content inside it.
This is what I did:


import re

# raw_data holds the HTML that was read in
pattern = re.compile(r'<div class="content">(.*?)</div>$')
items = re.findall(pattern, raw_data)

items comes back empty. What is wrong with my pattern?

cnnerv 9 years, 9 months ago

Answer 1

If you really must use a regex, you can write it like this:


 r'<div class="content">\n\s+(\S+)\s+</div>'

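Note: \s matches a whitespace character and \S matches a non-whitespace character. + means "one or more" and is greedy by default; it is +? that makes the match non-greedy. Because \S+ stops at the first whitespace, this pattern only works when the content itself contains no spaces.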
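To make the note concrete, here is a minimal sketch that runs the pattern above against the snippet from the question and compares + with +?. The raw_data string is an assumption reconstructed from the question, not the asker's real input.

```python
import re

# Assumed input: the snippet from the question, written as one string.
raw_data = '<div class="content">\n    xxxxxxxxxxx\n</div>\n'

# Pattern from the answer: skip the surrounding whitespace and
# capture one run of non-whitespace characters.
pattern = re.compile(r'<div class="content">\n\s+(\S+)\s+</div>')
items = pattern.findall(raw_data)
print(items)

# '+' is greedy (takes as much as possible); '+?' is the non-greedy form.
greedy = re.match(r'x+', 'xxxx').group()
lazy = re.match(r'x+?', 'xxxx').group()
print(greedy, lazy)
```

If the content can contain spaces, a capture group such as (.*?) combined with re.DOTALL is probably a better fit than (\S+).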

瞬发大火球 answered 9 years, 9 months ago

Answer 2

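. (the dot) matches any character except the newline "\n".
The HTML you are running the regex on contains line breaks,
so the pattern has to allow for newlines as well: (.|\n)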
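A short sketch of the point above, assuming raw_data looks like the snippet in the question. It compares the plain dot, the (.|\n) alternation, and the re.DOTALL flag, which is the standard-library way to let the dot match newlines too.

```python
import re

# Assumed input: the snippet from the question, written as one string.
raw_data = '<div class="content">\n    xxxxxxxxxxx\n</div>\n'

# Plain dot: cannot cross the newline after the opening tag, so nothing matches.
no_dotall = re.findall(r'<div class="content">(.*?)</div>', raw_data)

# Alternation from the answer: either any non-newline character or a newline.
alternation = re.findall(r'<div class="content">((?:.|\n)*?)</div>', raw_data)

# re.DOTALL makes '.' match newlines as well.
dotall = re.findall(r'<div class="content">(.*?)</div>', raw_data, re.DOTALL)

# The captured text still carries the surrounding whitespace; strip it.
cleaned = [text.strip() for text in dotall]
print(no_dotall, alternation, cleaned)
```

Both variants capture the same text here; re.DOTALL tends to be easier to read than the alternation.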

传说神泪丶 answered 9 years, 9 months ago

Answer 3

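On reflection, I would still recommend that the asker use XPath to parse HTML or XML.
Example: http://outofmemory.cn/code-snippet/11036/python-xpath-minidom-parse-xm...
When crawling you are likely to run into more complex structures, and XPath handles those far more comfortably.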
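As a rough illustration of the XPath route (not the code from the linked page), the sketch below uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath and needs well-formed markup. The wrapping html/body tags are an assumption added for the example; real pages are often not well-formed XML, in which case a dedicated HTML parser would be needed.

```python
import xml.etree.ElementTree as ET

# Assumed input: the question's snippet wrapped in a well-formed document.
raw_data = (
    '<html><body>'
    '<div class="content">\n    xxxxxxxxxxx\n</div>'
    '</body></html>'
)

root = ET.fromstring(raw_data)

# Limited XPath: every <div> descendant whose class attribute is exactly "content".
divs = root.findall(".//div[@class='content']")

# .text is None for an empty element, so fall back to an empty string.
items = [(div.text or '').strip() for div in divs]
print(items)
```

The attribute test is an exact string comparison, so an element with class="content main" would not be selected by this expression.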

lesbian answered 9 years, 9 months ago

Answer 4

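Remove the $. It anchors the match to the end of the string (or just before a final newline), so the pattern can only match when </div> is the very last thing in raw_data.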
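A small sketch of what the $ anchor does, using made-up input strings (one_line and multi_line are assumptions for illustration). It also suggests that, for the multi-line snippet in the question, dropping $ may not be enough on its own, because the dot still does not match newlines.

```python
import re

# Hypothetical single-line input where </div> is followed by more markup.
one_line = '<div class="content">xxxxxxxxxxx</div><p>more</p>'

# With '$' the closing tag must sit at the end of the string: no match.
with_anchor = re.findall(r'<div class="content">(.*?)</div>$', one_line)

# Without '$' the same pattern matches.
without_anchor = re.findall(r'<div class="content">(.*?)</div>', one_line)

# Hypothetical multi-line input shaped like the question's snippet.
multi_line = '<div class="content">\n    xxxxxxxxxxx\n</div>\n'

# Dropping '$' alone still fails here, because '.' stops at the newline.
still_empty = re.findall(r'<div class="content">(.*?)</div>', multi_line)

# Dropping '$' and adding re.DOTALL handles the multi-line case.
fixed = [s.strip() for s in
         re.findall(r'<div class="content">(.*?)</div>', multi_line, re.DOTALL)]
print(with_anchor, without_anchor, still_empty, fixed)
```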

Bao725 answered 9 years, 9 months ago