Encoding problem when parsing a Chinese web page with BeautifulSoup
For the same page, nearly identical code parses and runs fine under Python 3 on Windows 8. After porting the code to Python 2.7 on Ubuntu, however, the fetched page can no longer be parsed by BeautifulSoup: find_all('table') returns an empty result.
Part of the problematic code (it runs without errors):
```python
#coding:utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import urllib2
from bs4 import BeautifulSoup

postdata = "T1=&T2=1&T3=&T4=&T5=&APPDate=&T7=&T8=&T9=&PRDate=&T11=&SQDate=&JDDate=&T14=&T15=&T16=&T17=&SDDate=&T19=&T20=&T21=&D1=%B8%B4%C9%F3&D2=jdr&D3=%C9%FD%D0%F2&C1=fm&C2=&C3=&page=70"
postdata = postdata.encode('utf-8')
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
           'Referer': 'http://app.sipo-reexam.gov.cn/reexam_out/searchdoc/searchfs.jsp'}
req = urllib2.Request(
    url="http://app.sipo-reexam.gov.cn/reexam_out/searchdoc/searchfs.jsp",
    headers=headers,
    data=postdata)
fp = urllib2.urlopen(req)
mybytes = fp.read().decode('gbk').encode('utf-8')
soup = BeautifulSoup(mybytes, from_coding="uft-8")
print soup.original_encoding
print soup.prettify()
```
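For reference, below is a minimal variant I am considering, assuming the page really is GBK-encoded (the working code decodes it as GBK). It hands BeautifulSoup the raw bytes together with the keyword bs4 actually accepts, from_encoding, instead of the from_coding="uft-8" used above, and names a parser explicitly so both machines build the tree the same way. The choice of "html.parser" here is only an assumption; it is just a sketch, not verified against this site.

```python
#coding:utf-8
# Sketch only: pass raw bytes plus from_encoding, and pin the parser,
# instead of re-encoding the page and using the misspelled keyword.
import urllib2
from bs4 import BeautifulSoup

url = "http://app.sipo-reexam.gov.cn/reexam_out/searchdoc/searchfs.jsp"
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
           'Referer': url}
postdata = "T1=&T2=1&T3=&T4=&T5=&APPDate=&T7=&T8=&T9=&PRDate=&T11=&SQDate=&JDDate=&T14=&T15=&T16=&T17=&SDDate=&T19=&T20=&T21=&D1=%B8%B4%C9%F3&D2=jdr&D3=%C9%FD%D0%F2&C1=fm&C2=&C3=&page=70"

req = urllib2.Request(url=url, headers=headers, data=postdata)
raw = urllib2.urlopen(req).read()

# Option 1: let BeautifulSoup decode the bytes, telling it the page is GBK.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

# Option 2: decode to unicode first and pass the unicode string directly.
# soup = BeautifulSoup(raw.decode('gbk', 'ignore'), "html.parser")

print soup.original_encoding
print len(soup.find_all('table'))
```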
Any pointers would be appreciated.
beautifulsoup python python-web-scraping character-encoding Ubuntu
一粒蛋怒疯
11 years, 9 months ago