Python Asynchronous Crawler (aiohttp Edition)

2022-12-06 18:00:16

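If you are not familiar with asynchronous coroutines, see my previous post: https://www.cnblogs.com/Red-Sun/p/16934843.html
PS: This blog is just a place to share personal notes. There is no need to scan a QR code, join a group, or follow anything. (If a mirror site asks you to join a group or follow someone, go straight to my homepage instead.)
Welcome ヾ(≧▽≦*)o to my blog homepage: https://www.cnblogs.com/Red-Sun/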

1. requests-style requests

# -*- coding: utf-8 -*-
# @Time    : 2022/12/6 16:03
# @Author  : 紅後
# @Email   : [email protected]
# @blog    : https://www.cnblogs.com/Red-Sun
# @File    : 範例1.py
# @Software: PyCharm
import aiohttp, asyncio


async def aiohttp_requests(url):  # requests-style helper built on aiohttp
    async with aiohttp.request("GET", url=url) as response:
        return await response.text(encoding='UTF-8')


async def main():  # entry-point coroutine that starts the async call
    url = 'https://www.baidu.com'
    html = await aiohttp_requests(url)  # await the coroutine to get its result
    print(html)


if __name__ == '__main__':
    loop = asyncio.new_event_loop()  # asyncio.get_event_loop() with no running loop is deprecated in newer Python versions
    loop.run_until_complete(main())

2. Session requests

GET:

# -*- coding: utf-8 -*-
# @Time    : 2022/12/6 16:33
# @Author  : 紅後
# @Email   : [email protected]
# @blog    : https://www.cnblogs.com/Red-Sun
# @File    : 範例2.py
# @Software: PyCharm
import aiohttp, asyncio


async def aiohttp_requests(url):  # requests-style helper built on aiohttp
    async with aiohttp.ClientSession() as session:  # ClientSession is an asynchronous context manager
        async with session.get(url) as response:
            return await response.text(encoding='UTF-8')


async def main():  # entry-point coroutine that starts the async call
    url = 'https://www.baidu.com'
    html = await aiohttp_requests(url)  # await the coroutine to get its result
    print(html)


if __name__ == '__main__':
    loop = asyncio.new_event_loop()  # asyncio.get_event_loop() with no running loop is deprecated in newer Python versions
    loop.run_until_complete(main())


aiohttp also provides post, put, delete, and the other request methods. (PS: in general you only need to create one session and then use that session for all requests.)
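
To make the "one session for all requests" advice concrete, here is a minimal sketch. The names fetch and fetch_all are made up for this illustration and are not part of aiohttp; the only assumption is that every URL can share the same session settings.

```python
import asyncio
import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # Reuse the caller's session instead of opening a new one per request.
    async with session.get(url) as response:
        return await response.text()


async def fetch_all(urls: list[str]) -> list[str]:
    # One session (and one connection pool) for the whole batch.
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        # gather() runs the requests concurrently and keeps the input order.
        return await asyncio.gather(*tasks)
```

It can be started the same way as main() in the examples above, for instance loop.run_until_complete(fetch_all(['https://www.baidu.com'])).
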
POST: passing parameters

async def aiohttp_requests(url):  # requests-style helper built on aiohttp
    async with aiohttp.ClientSession() as session:
        data = {'key': 'value'}
        async with session.post(url=url, data=data) as response:
            return await response.text(encoding='UTF-8')

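PS: a dict passed this way is form-encoded before it is sent. If you do not want that, submit a string directly instead, e.g. data=json.dumps(data), or use json=data to send JSON. Note that data=str(data) sends the Python repr of the dict (single quotes), which most servers will not parse as JSON.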
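
A small standard-library sketch of what the three encodings look like. It does not call aiohttp at all; urlencode is used here only as an approximation of the form encoding that a dict passed as data= goes through.

```python
import json
from urllib.parse import urlencode

data = {'key': 'value'}

form_body = urlencode(data)   # roughly what a dict passed as data= is sent as
json_body = json.dumps(data)  # valid JSON, can be passed as data=json_body
repr_body = str(data)         # Python repr with single quotes, not valid JSON

print(form_body)  # key=value
print(json_body)  # {"key": "value"}
print(repr_body)  # {'key': 'value'}
```
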

Appendix: customizing the data sent with session requests

1.cookies

Custom cookies are best set on ClientSession, so that every request made through the session sends them, rather than on each session.get() call.

async def aiohttp_requests(url):  # requests-style helper built on aiohttp
    cookies = {'session_id': 'example-value'}  # placeholder cookie name and value
    async with aiohttp.ClientSession(cookies=cookies) as session:
        async with session.get(url) as response:
            return await response.text(encoding='UTF-8')
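
To check that cookies passed to ClientSession really end up in the session's cookie jar, something like the following sketch can be used. The function name list_session_cookies and the cookie itself are placeholders for illustration; no request is sent.

```python
import asyncio
import aiohttp


async def list_session_cookies():
    # 'session_id' and its value are placeholders, not a real cookie.
    cookies = {'session_id': 'example-value'}
    async with aiohttp.ClientSession(cookies=cookies) as session:
        # Iterating the jar yields http.cookies.Morsel objects.
        return {cookie.key: cookie.value for cookie in session.cookie_jar}
```
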

2.headers

Custom headers go in session.get(), just as with the regular requests library.

async def aiohttp_requests(url):  # requests-style helper built on aiohttp
    async with aiohttp.ClientSession() as session:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}
        async with session.get(url=url, headers=headers) as response:
            return await response.text(encoding='UTF-8')

3.timeout

The default total timeout is 5 minutes. It can be overridden with timeout, which is passed to session.get().

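async def aiohttp_requests(url):  # requests-style helper built on aiohttp
    async with aiohttp.ClientSession() as session:
        timeout = aiohttp.ClientTimeout(total=60)  # 60-second cap on the whole request
        async with session.get(url=url, timeout=timeout) as response: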
            return await response.text(encoding='UTF-8')
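
As a variation, the timeout can also be set once for the whole session. This is only a sketch: the 60-second and 10-second values are arbitrary, and connect is just one of the finer-grained fields that aiohttp.ClientTimeout offers.

```python
import aiohttp

# total: cap on the whole request; connect: cap on acquiring a connection.
# Both numbers are placeholders chosen for illustration.
timeout = aiohttp.ClientTimeout(total=60, connect=10)


async def aiohttp_requests(url):
    # Session-wide default; a timeout= passed to session.get() overrides it.
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            return await response.text(encoding='UTF-8')
```
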

4.proxy

Proxies are supported too; set one in session.get().

async def aiohttp_requests(url):  # requests-style helper built on aiohttp
    async with aiohttp.ClientSession() as session:
        async with session.get(url=url, proxy="http://some.proxy.com") as response:
            return await response.text(encoding='UTF-8')

A proxy that requires authentication:

async def aiohttp_requests(url):  # requests-style helper built on aiohttp
    async with aiohttp.ClientSession() as session:
        proxy_auth = aiohttp.BasicAuth('user', 'pass')  # username, password
        async with session.get(url=url, proxy="http://some.proxy.com", proxy_auth=proxy_auth) as response:
            return await response.text(encoding='UTF-8')

Or:

async def aiohttp_requests(url):  # requests-style helper built on aiohttp
    async with aiohttp.ClientSession() as session:
        async with session.get(url=url, proxy='http://user:pass@some.proxy.com') as response:
            return await response.text(encoding='UTF-8')
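
The two ways of passing proxy credentials carry the same information. The sketch below only shows how the URL form maps onto the explicit BasicAuth form; it does not contact any proxy, and some.proxy.com is the same placeholder host used above.

```python
import aiohttp
from urllib.parse import urlsplit

proxy_url = 'http://user:pass@some.proxy.com'
parts = urlsplit(proxy_url)

# Rebuild the explicit form from the credentials embedded in the URL.
proxy_auth = aiohttp.BasicAuth(parts.username, parts.password)
bare_proxy = f'{parts.scheme}://{parts.hostname}'

print(bare_proxy)           # http://some.proxy.com
print(proxy_auth.encode())  # Basic dXNlcjpwYXNz
```

The value printed by encode() is the Basic authorization string built from user:pass.
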

Error handling

Error: RuntimeError: Event loop is closed


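The error appears when the program is run with asyncio.run(main()).
According to another developer's write-up, asyncio.run() closes the event loop automatically, and _ProactorBasePipeTransport.__del__ then raises when it runs against the closed loop, whereas loop.run_until_complete() leaves the loop open. (_ProactorBasePipeTransport belongs to the proactor event loop, which is the default on Windows.)
First fix: run the program with the following code instead: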

loop = asyncio.new_event_loop()  # asyncio.get_event_loop() with no running loop is deprecated in newer Python versions
loop.run_until_complete(main())

Second fix: wrap the method so that asyncio.run() can still be used:

from functools import wraps

from asyncio.proactor_events import _ProactorBasePipeTransport

def silence_event_loop_closed(func):
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except RuntimeError as e:
            if str(e) != 'Event loop is closed':
                raise
    return wrapper

_ProactorBasePipeTransport.__del__ = silence_event_loop_closed(_ProactorBasePipeTransport.__del__)
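
To see what the wrapper does without needing Windows, here is a small sketch that applies it to a made-up class. FakeTransport and its two methods are hypothetical stand-ins for _ProactorBasePipeTransport and exist only for this demonstration; the decorator is repeated so the snippet runs on its own.

```python
from functools import wraps


def silence_event_loop_closed(func):
    # Same wrapper as above: swallow only the 'Event loop is closed' error.
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except RuntimeError as e:
            if str(e) != 'Event loop is closed':
                raise
    return wrapper


class FakeTransport:
    # Hypothetical stand-in for _ProactorBasePipeTransport.
    def close_quietly(self):
        raise RuntimeError('Event loop is closed')

    def close_loudly(self):
        raise RuntimeError('something else went wrong')


FakeTransport.close_quietly = silence_event_loop_closed(FakeTransport.close_quietly)
FakeTransport.close_loudly = silence_event_loop_closed(FakeTransport.close_loudly)

transport = FakeTransport()
result = transport.close_quietly()  # the matching error is swallowed

try:
    transport.close_loudly()
    reraised = False
except RuntimeError:
    reraised = True  # any other RuntimeError still propagates
```

Only the exact message 'Event loop is closed' is silenced, so unrelated RuntimeErrors are not hidden.
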