A Python image crawler package

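Recently, while working on some image classification tasks, I needed to crawl extra images from search engines to expand our training set. Doing AI is really hard 😭, you even have to know how to write crawlers. There are plenty of crawler tools written in Python online, but after trying a number of them I found that they either can't crawl multiple keywords, simply don't work, or get detected and blocked by the site after a short while. Eventually I found a really handy Python image crawling library, icrawler. It installs directly with pip and takes only a few lines of code to use, which is just great.
Without further ado, here is the install command: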

pip install icrawler

Here is my crawler code:

import os

from icrawler.builtin import BaiduImageCrawler
from icrawler.builtin import BingImageCrawler
from icrawler.builtin import GoogleImageCrawler

# Keywords to crawl (pedestrians smoking, answering or making calls, using a phone)
list_word = ['抽菸 行人','吸菸 行人','接電話 行人','打電話 行人', '玩手機 行人']
for word in list_word:
    # Bing crawler
    # Storage path: one folder per keyword under bing/ (os.path.join keeps it portable)
    bing_storage = {'root_dir': os.path.join('bing', word)}
    # Arguments, top to bottom: parser threads, downloader threads, and the storage path set above
    bing_crawler = BingImageCrawler(parser_threads=2,
                                    downloader_threads=4,
                                    storage=bing_storage)
    # Start crawling: keyword plus the maximum number of images
    bing_crawler.crawl(keyword=word,
                       max_num=2000)

    # Baidu crawler
    # baidu_storage = {'root_dir': os.path.join('baidu', word)}
    # baidu_crawler = BaiduImageCrawler(parser_threads=2,
    #                                   downloader_threads=4,
    #                                   storage=baidu_storage)
    # baidu_crawler.crawl(keyword=word,
    #                     max_num=2000)


    # Google crawler
    # google_storage = {'root_dir': os.path.join('google', word)}
    # google_crawler = GoogleImageCrawler(parser_threads=4,
    #                                    downloader_threads=4,
    #                                    storage=google_storage)
    # google_crawler.crawl(keyword=word,
    #                      max_num=2000)

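This library supports multi-threaded crawling across several search engines (Baidu, Bing, Google); note that the Google crawler needs a proxy (VPN) to reach Google. The code shown here runs the Bing crawler. The Baidu and Google versions are included above as well, I just commented them out, and you can of course turn on all three at once! A Python crawler library like this is a real pleasure to use.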
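If you do want all three engines running together, the loop can be folded into a small helper. This is only a rough sketch, not something taken from the icrawler docs: storage_dir, crawl_all and the ENGINES table are names I made up for illustration, and the only icrawler pieces used are the three crawler classes and the arguments already shown in the code above.

```python
import os

# Engine folder name -> crawler class name in icrawler.builtin.
# These are the same three classes used earlier in this post.
ENGINES = {
    'bing': 'BingImageCrawler',
    'baidu': 'BaiduImageCrawler',
    'google': 'GoogleImageCrawler',
}


def storage_dir(engine, keyword, base_dir='.'):
    """Hypothetical helper: build the output folder <base_dir>/<engine>/<keyword>."""
    return os.path.join(base_dir, engine, keyword)


def crawl_all(keywords, engines=('bing', 'baidu', 'google'), max_num=2000):
    """Hypothetical helper: run every selected engine for every keyword."""
    # Imported lazily so storage_dir can be used without icrawler installed.
    import icrawler.builtin as builtin

    for word in keywords:
        for engine in engines:
            # Look up the crawler class by name, e.g. builtin.BingImageCrawler.
            crawler_cls = getattr(builtin, ENGINES[engine])
            crawler = crawler_cls(parser_threads=2,
                                  downloader_threads=4,
                                  storage={'root_dir': storage_dir(engine, word)})
            crawler.crawl(keyword=word, max_num=max_num)
```

Calling crawl_all(list_word) would then go through the keywords one by one and save the results under bing/, baidu/ and google/. Passing engines=('bing', 'baidu') skips Google if you have no proxy available.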
If you liked this, remember to give me a like 😋
