Question

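Reading XML with the start event returns incomplete elements

I want to read child nodes such as review_id and summary from each item node in an XML file. A sample looks like this: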


<item>
<review_id>0095693</review_id>
<summary>书本内容</summary>
<polarity>P</polarity>
<text>书本的内容很好，对我很有帮助，就是字体的颜色是紫色的，看就了会觉得不清晰。</text>
<category>book</category>
</item>

The complete example can be downloaded here. The environment is Mac OS X 10.9.2 with Python 2.7.6, and the code is as follows:


import sys
import os
from xml.etree.ElementTree import iterparse, tostring

def count_pos_neg(itemfile):
    pos_count = 0
    neg_count = 0
    try:
        for event, elem in iterparse(itemfile, events=["start",]):
            if elem.tag == "item":
                try:
                    if processItem(elem)['polarity'] == "P":
                        pos_count += 1
                    else:
                        neg_count += 1
                except Exception, e:
                    print >> sys.stderr, "Ignoring item: %s" % e
                elem.clear()
    except SyntaxError, se:
        print >> sys.stderr, se
    return pos_count, neg_count


def processItem(item):
    """ Process a review.
    Implement custom code here. Use 'item.find('tagname').text' to access the properties of a review. 
    """
    category = item.find("category").text
    polarity = item.find("polarity").text
    text = item.find("text").text
    summary = item.find("summary").text
    return {'polarity':polarity,
            'summary':summary,
            'text':text,
            'category':category }

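if __name__ == "__main__":
    # Assumption: the path to the XML file is passed as the first command-line argument
    itemfile = sys.argv[1]
    pc, nc = count_pos_neg(itemfile)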

The problem is that at every 55th item node an AttributeError is raised, with the following message:


 Ignoring item: 'NoneType' object has no attribute 'text'

When I parse with events=('end',) instead, no error occurs. Does this mean the earlier error is related to parsing on start events?

xml python

11 years, 1 month ago

Caiych

Answer 1

The documentation says:

Note

iterparse() only guarantees that it has seen the “>” character of a starting tag when it emits a “start” event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.

If you need a fully populated element, look for “end” events instead.
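The note can be reproduced in isolation. The following is a minimal sketch, assuming Python 3.4 or later (XMLPullParser does not exist in the Python 2.7 used in the question) and a made-up two-chunk document. It feeds the parser by hand so that the chunk boundary falls right after the <item> start tag. iterparse also reads its input in blocks, which would plausibly explain why only an occasional item fails: the one that straddles a block boundary.

```python
from xml.etree.ElementTree import XMLPullParser

parser = XMLPullParser(events=("start", "end"))

# First chunk (hypothetical data): ends right after the <item> start tag.
parser.feed("<root><item>")
seen_at_start = "not visited"
for event, elem in parser.read_events():
    if event == "start" and elem.tag == "item":
        # The <polarity> child has not been parsed yet, so find() returns None.
        seen_at_start = elem.find("polarity")

# Second chunk: delivers the child element and the closing tags.
parser.feed("<polarity>P</polarity></item></root>")
seen_at_end = None
for event, elem in parser.read_events():
    if event == "end" and elem.tag == "item":
        # On the "end" event the element is fully populated.
        seen_at_end = elem.find("polarity").text
parser.close()

print(seen_at_start, seen_at_end)
```

Calling .text on the None returned at the start event is what produces the 'NoneType' object has no attribute 'text' message from the question.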

When the start event fires, the element's children may not have been parsed yet, so you should use the end event instead.
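As a rough sketch of that change, here is the counting function from the question rewritten to listen for end events. It is written for Python 3 rather than the asker's Python 2.7, and the in-memory sample and the choice to skip items without a polarity child are assumptions made for illustration, not part of the original code.

```python
import io
from xml.etree.ElementTree import iterparse


def count_pos_neg(itemfile):
    """Count positive and negative reviews, reading each <item> on its "end" event."""
    pos_count = 0
    neg_count = 0
    for event, elem in iterparse(itemfile, events=("end",)):
        if elem.tag != "item":
            continue
        polarity = elem.find("polarity")
        if polarity is None:
            # Assumption: items without a <polarity> child are skipped.
            elem.clear()
            continue
        if polarity.text == "P":
            pos_count += 1
        else:
            neg_count += 1
        # The item is complete and already counted, so its children can be freed.
        elem.clear()
    return pos_count, neg_count


# Hypothetical in-memory sample standing in for the real review file.
sample = io.BytesIO(
    b"<reviews>"
    b"<item><polarity>P</polarity></item>"
    b"<item><polarity>N</polarity></item>"
    b"</reviews>"
)
print(count_pos_neg(sample))
```

Because elem.clear() now runs only after the closing </item> tag has been seen, it no longer risks discarding children before they are read.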

answered 11 years, 1 month ago

siemen
