Question

0 0

How do I remove the tags from HTML source?

I'm trying to scrape Qiushibaike (糗事百科).


    import urllib.request

    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request).read()
    raw_data = response.decode('utf-8')

At this point the HTML source comes back correctly.


    from bs4 import BeautifulSoup

    soup = BeautifulSoup(raw_data, 'html.parser')
    content = soup.find_all('div', {'class': 'content'})

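This extracts content like

    <div class="content">
    xxxxx
    </div>

If I want to strip the wrapper

    <div class="content">
    </div>

and keep only the text inside, what should I do? I tried a method I found online, but it fails:

    content = [s.extract() for s in content('div')]

The error: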


    TypeError: 'ResultSet' object is not callable
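
The error appears because `find_all` returns a `ResultSet`, a list-like collection of tags, and a list cannot be called the way a single tag can. Here is a minimal sketch of one way around it, assuming the goal is only the text inside each `div.content`. The function name `extract_contents` and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_contents(raw_html):
    """Return the text of every <div class="content">, without the tags."""
    soup = BeautifulSoup(raw_html, "html.parser")
    divs = soup.find_all("div", {"class": "content"})
    # find_all returns a ResultSet (a list of Tag objects), so iterate over
    # it and ask each tag for its text instead of calling the ResultSet.
    return [div.get_text(strip=True) for div in divs]


if __name__ == "__main__":
    # Hypothetical sample standing in for the downloaded page.
    sample = '<div class="content">\nxxxxx\n</div><div class="author">someone</div>'
    print(extract_contents(sample))
```

`get_text()` drops all nested markup as well, so this only fits when none of the inner tags need to be kept.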

python3.x beautifulsoup python-crawler

9 years, 9 months ago

shikii


Answer 1

0

After extracting everything, use string.replace to strip the tags. That should work too, though it is a clumsy solution.
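
A rough sketch of what that could look like in Python, assuming the wrapper tag text is known exactly. The helper name `strip_wrapper` and its default arguments are illustrative only.

```python
def strip_wrapper(fragment, open_tag='<div class="content">', close_tag="</div>"):
    """Remove one known opening/closing tag pair with plain str.replace."""
    # str.replace matches literally, so the tag text must match exactly
    # (same attribute order, quoting, and spacing), which is why this
    # approach is fragile.
    text = fragment.replace(open_tag, "").replace(close_tag, "")
    return text.strip()


if __name__ == "__main__":
    print(strip_wrapper('<div class="content">\nxxxxx\n</div>'))
```

Any variation in the markup, such as an extra attribute or a stray space, leaves the opening tag in place.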

answered 9 years, 9 months ago

东方大叔爱


Answer 2

0

Use a regular expression to extract the content inside the tags.
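
One possible shape for such a regular expression, sketched with Python's `re` module. It assumes the `class` attribute is written as `class="content"` and that no `<div>` is nested inside the content block. The names `CONTENT_RE` and `extract_with_regex` are invented for this example.

```python
import re

# Non-greedy match of whatever sits between <div ... class="content" ...>
# and the next </div>. Assumes no nested <div> inside the content block.
CONTENT_RE = re.compile(
    r'<div[^>]*\bclass="content"[^>]*>(.*?)</div>',
    re.DOTALL | re.IGNORECASE,
)


def extract_with_regex(raw_html):
    """Return the stripped inner text of each matching div."""
    return [inner.strip() for inner in CONTENT_RE.findall(raw_html)]


if __name__ == "__main__":
    sample = '<div class="content">\nxxxxx\n</div><div class="x">y</div>'
    print(extract_with_regex(sample))
```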

answered 9 years, 9 months ago

幻耀D雅蠛蝶


Answer 3

0


    // Get the plain text of a web page (strips CSS, HTML tags, JavaScript, etc.)
    public string GetText(string strTemp, int lengthlimit)
    {
        // Remove <script>...</script> blocks, then <style>...</style> blocks.
        strTemp = System.Text.RegularExpressions.Regex.Replace(strTemp, "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?script[\\s]*?>", "");
        string str2Temp = System.Text.RegularExpressions.Regex.Replace(strTemp, "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?style[\\s]*?>", "");
        // Remove every remaining tag.
        string str3Temp = System.Text.RegularExpressions.Regex.Replace(str2Temp, "<[^>]+>", "");
        // Truncate to lengthlimit characters; 0 means no limit.
        if (lengthlimit != 0)
        {
            if (str3Temp.Length >= lengthlimit)
            {
                str3Temp = str3Temp.Substring(0, lengthlimit);
            }
        }
        return str3Temp;
    }
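
The snippet above is C#, while the question is about Python. A rough Python equivalent of the same three-pass idea might look like the sketch below. The function name `get_text`, the `length_limit` parameter name, and the case-insensitive flag are choices made for this sketch and are not taken from the original code.

```python
import re

# Same three passes as the C# version: script blocks, style blocks, then
# every remaining tag. IGNORECASE is an addition made for this sketch.
SCRIPT_RE = re.compile(r"<\s*script[^>]*>.*?<\s*/\s*script\s*>", re.DOTALL | re.IGNORECASE)
STYLE_RE = re.compile(r"<\s*style[^>]*>.*?<\s*/\s*style\s*>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def get_text(html, length_limit=0):
    """Strip scripts, styles, and tags; optionally truncate the result."""
    text = SCRIPT_RE.sub("", html)
    text = STYLE_RE.sub("", text)
    text = TAG_RE.sub("", text)
    # A limit of 0 means "no limit", mirroring the C# code.
    if length_limit and len(text) >= length_limit:
        text = text[:length_limit]
    return text


if __name__ == "__main__":
    page = "<style>p{}</style><p>Hello <b>world</b></p><script>var a = 1;</script>"
    print(get_text(page))
    print(get_text(page, 5))
```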

answered 9 years, 9 months ago

pathua


Answer 4

0

The best approach is still a regular expression.

answered 9 years, 9 months ago

鲤鱼king


Answer 5

0

http://segmentfault.com/q/1010000002448667

Take a look at this; it's a question I asked earlier.

answered 9 years, 9 months ago

DZC死人

