Question

Python requests: HTTPConnectionPool "Max retries exceeded" exception when scraping with multiple threads

The main code is as follows:


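import threading
import time  # missing in the original snippet; time.sleep() is used below

import requests


def get_info():
    try:
        # Poll in a loop instead of recursing, so a long run cannot hit the recursion limit.
        while True:
            res = requests.get('http://www.xxx.com/test/json')
            if res.status_code == 200 and res.text != '':
                print(res.text)
            else:
                print(res.status_code)
            time.sleep(10)
    except Exception as e:
        # Any request error ends this thread's loop, as in the original.
        print(e)


def start():
    threads = []
    for i in range(40):
        threads.append(threading.Thread(target=get_info, args=()))
    for t in threads:
        time.sleep(0.3)  # stagger thread start-up by 0.3 s
        t.start()
    for t in threads:
        t.join()


if __name__ == '__main__':
    start()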

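I typed this up quickly, so it may contain small mistakes, but the idea is this:
40 threads are started 0.3 seconds apart, and each one sends requests. Everything works at first, but after two rounds roughly 80% to 90% of the requests raise this exception:
HTTPConnectionPool(host='http://www.xxx.com/', port=80): Max retries exceeded with url: /test/json (Caused by <class 'socket.error'>: [Errno 10060])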

What is going wrong here?

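Thanks for your answers.
The target really is a small site.
My thinking was that if the server were temporarily banning me, I would get error 10054 instead.
Yet it does look like a server-side ban: the first few rounds of requests are all fine, so why do more and more exceptions appear the longer it runs?
I have tried retrying a few times, but it does not seem to help much: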


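def get_info(retries=3):
    res = requests.get('http://www.xxx.com/test/json')
    if res.status_code == 200:
        ...  # handle the response (elided in the original post)
    else:
        if retries > 0:
            time.sleep(5)  # wait before trying again
            get_info(retries - 1)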

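I am new to Python and am using it to write a crawler. This problem has actually been bothering me for a long time. I imagine it is very common in crawler projects, so how should I go about optimizing it? (A small number of exceptions is acceptable.)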

python requests python-crawler web-crawler

11 years, 8 months ago

MЯ.悲劇

Answer 1

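The problem is probably in the network connection between your server and the target site. You could retry a few more times when accessing the target site.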
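
The retry suggestion could be sketched roughly as follows. This is only an illustration under assumptions, not code from the thread: `fetch_with_retries` and its parameters (`fetch`, `retries`, `base_delay`, `retry_on`, `sleep`) are hypothetical names. Unlike a check on the status code alone, it retries when the call raises an exception, which is how a connection timeout surfaces, and it waits longer after each failure.

```python
import random
import time


def fetch_with_retries(fetch, retries=3, base_delay=1.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or the retry budget is used up.

    All names here are hypothetical; fetch is any zero-argument callable.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except retry_on:
            if attempt >= retries:
                raise  # budget exhausted: let the caller see the last error
            # Exponential backoff plus a little random jitter, so that many
            # threads do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
            attempt += 1
```

With requests, `fetch` might be `lambda: requests.get(url, timeout=10)` and `retry_on` might be `(requests.exceptions.RequestException,)`. Passing an explicit `timeout` matters because requests does not apply one by default, so a stalled connection can otherwise block a thread indefinitely.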

answered 11 years, 8 months ago

咸鱼型杏子茶
