Learning Python Web Scraping from Scratch (Part 2)

The Three Main Scraping Libraries

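The previous post gave a first look at what libraries are. This post takes a closer look at the three main libraries used for scraping.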

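Installing and using libraries
1. Installing a library
  A third-party library can usually be installed directly from the command prompt by typing pip install packagename, where packagename is the name of the library to install. When the installation finishes, pip prints a confirmation:
  Successfully installed packagename
2. Using a library
  Once a library is installed, it can be loaded with an import statement of the form: import packagename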
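To check whether a library is already installed before importing it, something like the following sketch can help. It uses only the standard library; the helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# "json" ships with Python, so this should print True.
print(is_installed("json"))

# The install name and the import name can differ: the package installed
# with "pip install beautifulsoup4" is imported as "bs4".
for name in ("requests", "bs4", "lxml"):
    status = "installed" if is_installed(name) else "missing, install it with pip"
    print(f"{name}: {status}")
```

If a name is reported as missing, run pip install for the corresponding package and try again.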
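Requests library
The official documentation describes requests as "HTTP for Humans". As attentive readers will have noticed, its job is to send requests and fetch web page data.
Using request headers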

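import requests  # import the module (also called a library)

# headers holds part of the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 Edg/84.0.522.52'}
url = 'https://blog.csdn.net/PXXPY/article/details/107594925'  # URL to request
response = requests.get(url=url, headers=headers)  # send the request
print(response)  # prints the response object, e.g. <Response [200]>; the code alone is response.status_code
print(response.text)  # print the page source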

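Besides get(), the requests library also has a post() method. post() submits form data and is used to scrape content that is only available after logging in; it will be covered in a later post. For now, get() is enough to scrape most sites.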
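A slightly more defensive way to use get() might look like the sketch below. The URL, the User-Agent string, and the helper name fetch_html are placeholders chosen for this illustration, not values from the post. The request is only built and inspected here, so nothing is sent over the network.

```python
import requests

# Placeholder values for illustration only.
URL = "https://example.com/"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-sketch)"}


def fetch_html(url: str, headers: dict, timeout: float = 10.0) -> str:
    """Send a GET request and return the page source, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return response.text


# Build the request without sending it, to inspect what would go out.
prepared = requests.Request("GET", URL, headers=HEADERS).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])

# Actually fetching the page needs network access:
# html = fetch_html(URL, HEADERS)
```

Setting a timeout and calling raise_for_status() keeps a scraper from hanging on a dead server or silently parsing an error page.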
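BeautifulSoup library
BeautifulSoup is a very popular Python module. It makes it easy to parse the data fetched with the Requests library, turning the page source into a Soup document (a parsed tree) that can be filtered to extract data.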


from bs4 import BeautifulSoup
import requests

 
url='http://bj.58.com/pingbandiannao/24604629984324x.shtml'
 
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')

Besides the HTML parser in the Python standard library, BeautifulSoup also supports third-party parsers.

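Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(response.text, 'html.parser')	Moderate speed, tolerant of bad markup	Poor tolerance of bad markup in versions before Python 3.2.2
lxml HTML parser	BeautifulSoup(response.text, 'lxml')	Fast, tolerant of bad markup	Requires a C library to be installed
lxml XML parser	BeautifulSoup(response.text, 'xml')	Fast, the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(response.text, 'html5lib')	Most tolerant; parses the document the way a browser does and produces an HTML5 document	Slow (does not depend on external extensions)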

The BeautifulSoup documentation recommends lxml as the parser because it is more efficient.
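Since lxml has to be installed separately, one possible approach is to use it when it is available and fall back to the built-in parser otherwise. This is only a sketch: the HTML string and the PARSER name are invented for the example.

```python
import importlib.util

from bs4 import BeautifulSoup

# Use lxml when it is installed, otherwise fall back to the built-in parser.
PARSER = "lxml" if importlib.util.find_spec("lxml") else "html.parser"

html = "<html><body><p class='msg'>hello</p></body></html>"
soup = BeautifulSoup(html, PARSER)

print(PARSER)             # which parser was picked on this machine
print(soup.p.get_text())  # text of the first <p> tag
```

For well-formed markup like this, both parsers give the same result; differences mostly show up on broken HTML.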
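Elements in the parsed Soup document can be located with the find() and find_all() methods and with the select() method. find() and find_all() are used in much the same way. The BeautifulSoup documentation defines them as:
find_all(name, attrs, recursive, string, limit, **kwargs)
find(name, attrs, recursive, string, **kwargs)
The first two parameters (the tag name and its attributes) are the ones used most often.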

find_all() method

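soup.find_all('div', "item")  # find div tags with class="item"
soup.find_all('div', class_="item")  # class is a reserved word in Python, so the keyword is class_
soup.find_all('div', attrs={"class": "item"})  # attrs takes a dict to match tags with specific attributes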

find() method
find() works like find_all(), except that find() returns only the first matching element, while find_all() returns all matching elements as a list.
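The difference between the two is easiest to see on a small document. The HTML below is invented for this sketch, so it runs without a network connection.

```python
from bs4 import BeautifulSoup

html = """
<div class="item">first</div>
<div class="item">second</div>
<div class="other">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("div", class_="item")  # every match, as a list
first_item = soup.find("div", class_="item")     # only the first match
missing = soup.find("div", class_="absent")      # no match gives None

print([d.get_text() for d in all_items])
print(first_item.get_text())
print(missing)
```

Because find() returns None when nothing matches, it is worth checking the result before calling get_text() on it.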
select() method

soup.select('body > div.m-body > div:nth-child(3) > div > div > ul > li:nth-child(6) > a')  # the selector in parentheses is copied from the browser

A selector like this narrows the search level by level, much like China > Sichuan > Chengdu > …
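The same idea can be tried on a small, made-up document. The HTML and the selectors below are assumptions for illustration, not the page from the post.

```python
from bs4 import BeautifulSoup

html = """
<body>
  <div class="m-body">
    <ul>
      <li><a href="/a">A</a></li>
      <li><a href="/b">B</a></li>
    </ul>
  </div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# ">" means "direct child", so each step goes exactly one level down.
links = soup.select("body > div.m-body > ul > li > a")
print([a["href"] for a in links])

# select_one() returns only the first match instead of a list.
second = soup.select_one("div.m-body li:nth-of-type(2) > a")
print(second.get_text())
```

Selectors copied from the browser tend to be long and fragile; a shorter one that relies on a class name usually survives small page changes better.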
Lxml library
Lxml is a Python wrapper around libxml2, an XML parsing library written in C, and it parses faster than BeautifulSoup.
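As a first taste, the sketch below parses a small made-up HTML string with lxml and queries it with XPath. XPath is one common way to query an lxml tree; the post does not cover it at this point, so treat the expressions here as an assumption about typical usage.

```python
from lxml import etree

html = (
    "<html><body><ul>"
    "<li><a href='/a'>A</a></li>"
    "<li><a href='/b'>B</a></li>"
    "</ul></body></html>"
)

tree = etree.HTML(html)  # parse the string into an element tree

texts = tree.xpath("//li/a/text()")  # text of every <a> inside an <li>
hrefs = tree.xpath("//li/a/@href")   # href attribute of the same links

print(texts)
print(hrefs)
```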

All of these are third-party libraries, of course, so they need to be installed with pip.