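Using Python and cookies to fetch information from a chip website

Everyone is familiar with chips. During the current pandemic, falling chip output for graphics cards and in-car systems has frustrated many shoppers (you can't buy one anyway) and has given many people a fresh look at the semiconductor industry. With some spare time on our hands, let's fetch the chip inventory and chip details from website T.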

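1. Analyzing the list page request

Open the page and the information we need is right there.

However, before the page finishes loading, something looks slightly off: the different parts of the page load at different speeds.

So the data we need most likely does not come from the plain GET request for the page itself, and we have to look further. In the Fetch/XHR tab of the developer tools there is one request whose response is exactly the content we need.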

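This single link returns all of the data with no pagination, so we can request it directly.

2. Requesting the list page

Send a GET request to the link above and parse the JSON. Here is the code: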

def getItemList():
    # List endpoint found in the Fetch/XHR tab; it returns every item at once
    url = "https://www.xx.com.cn/selectiontool/paramdata/family/3658/results?lang=cn&output=json"
    headers = {
        'authority': 'www.xx.com.cn',
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    }
    res = getRes(url, headers, '', '', 'GET')  # getRes is our own request helper
    if not res:
        print('list request failed')
        return
    nodes = res.json()['ParametricResults']
    for node in nodes:
        data = {}
        data["itemName"] = node["o3"]  # product name
        data["inventory"] = node["p3318"]  # inventory
        data["price"] = node["p1130"]['multipair1']['l']  # price
        data["infoUrl"] = f"https://www.xx.com.cn/product/cn/{node['o1']}"  # detail URL
        print(data)

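From the JSON above we can see that o3 is the product name, p3318 is the inventory, p1130 contains a price with its unit, and o1 is the part number, which can be used to build the detail link.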
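As a rough illustration of this mapping, the sketch below turns one entry of ParametricResults into a flat record. The helper name parse_node and the sample values are made up for the example; only the key names (o1, o3, p3318, p1130) come from the response described above.

```python
def parse_node(node, base_url="https://www.xx.com.cn/product/cn/"):
    """Map one entry of 'ParametricResults' to a flat record."""
    # The price sits two levels down; fall back to None if any level is missing.
    price = node.get("p1130", {}).get("multipair1", {}).get("l")
    return {
        "itemName": node.get("o3"),      # product name
        "inventory": node.get("p3318"),  # inventory
        "price": price,                  # price string including its unit
        "infoUrl": base_url + node["o1"],  # detail link built from the part number
    }


# Invented sample entry: the values are placeholders, not real site data.
sample = {
    "o1": "THS4541-DIE",
    "o3": "THS4541-DIE",
    "p3318": 100,
    "p1130": {"multipair1": {"l": "1.23 | 1ku"}},
}
record = parse_node(sample)
print(record)
```

Using .get for the optional fields means an entry without a price still produces a record instead of raising KeyError.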

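3. Analyzing the detail page

Now that we have the detail page links, it is time to fetch the rest of the content.

Open the developer tools: there are no extra requests, just a single GET request that contains the content.

So we simply request it directly. Here is the code: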

def getItemInfo(url):
    logger.info(f'Requesting detail url-{url}')
    headers = {
        'authority': 'www.xx.com.cn',
        'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        'referer': 'https://www.xx.com.cn/product/cn/THS4541-DIE',
    }
    res = getRes(url, headers, '', '', 'GET')  # getRes is our own request helper
    content = res.content.decode('utf-8')

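But the detail page we get back does not look like the preview in the developer tools. Why?

My first reaction was: uh-oh, this needs a cookie.

To dig further, clear the developer tools log, clear the cookies, and visit the detail link again. In the All tab we can see the following.

The detail page link, which we assumed was requested once, is actually requested twice, with an XHR request in between.

Previewing the first request shows content almost identical to what our local request returned.

So the difference lies in the second request. Let's look closer.

Compared with the first GET request, the second one carries a pile of extra cookies.

After trimming the cookies down, the key one turns out to be ak_bmsc, and its value comes from the set-cookie response header of the preceding XHR request.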
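One way to pull a single cookie out of a set-cookie header is a small regular-expression helper. This is only a sketch: extract_cookie and the sample header are hypothetical, and the header value is invented to show the shape of the data, not copied from the site.

```python
import re


def extract_cookie(set_cookie_header, name="ak_bmsc"):
    """Return 'name=value' from a Set-Cookie header string, or None if absent."""
    # The lookbehind stops the pattern from matching inside a longer cookie name.
    pattern = r"(?<![\w-])" + re.escape(name) + r"=([^;]*)"
    match = re.search(pattern, set_cookie_header or "")
    if match is None:
        return None
    return f"{name}={match.group(1)}"


# Invented header: several cookies folded into one string, separated by ", ".
sample_header = (
    "bm_sv=xyz; Path=/, "
    "ak_bmsc=ABC~123==; Domain=.xx.com.cn; Path=/; HttpOnly"
)
print(extract_cookie(sample_header))
```

Returning None for a missing cookie lets the caller decide how to react, instead of failing with an IndexError on an empty findall result.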

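Now analyze this XHR request. Its URL is https://www.xx.com.cn/_sec/verify?provider=interstitial, as used in the code below.

It is a POST request, so start with the payload parameters.

Does the bm-verify parameter look familiar? It is exactly what the first GET request returned. There is also a pow parameter.

The payload contains "pow": j. The variable j is defined just above it in the script: the script declares i, concatenates two digit strings, converts the result to an integer, and adds it to i.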

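To make the pow arithmetic concrete, here is a minimal sketch that rebuilds the POST payload from a page. SAMPLE_HTML is an invented stand-in shaped after the patterns described above (var i = ...; and Number("..." + "...");), and build_verify_payload is a hypothetical helper name; the real page may differ in its details.

```python
import json
import re

# Invented stand-in for the first GET response; only the patterns matter.
SAMPLE_HTML = '''
<script>
var i = 1700000000;
var j = i + Number("1234" + "56789");
xhr.send(JSON.stringify({"bm-verify": "TOKEN_FROM_FIRST_GET", "pow": j}));
</script>
'''


def build_verify_payload(html):
    """Return the JSON body for the verification POST, or None if parsing fails."""
    token = re.search(r'"bm-verify":\s*"(.*?)"', html)
    base = re.search(r"var i = (\d+);", html)
    parts = re.search(r'Number\("(.*?)"\);', html)
    if not (token and base and parts):
        return None
    # '1234" + "56789' -> '123456789': drop the quote-plus-quote joins.
    digits = re.sub(r'"\s*\+\s*"', "", parts.group(1))
    pow_value = int(base.group(1)) + int(digits)
    return json.dumps({"bm-verify": token.group(1), "pow": pow_value})


payload = build_verify_payload(SAMPLE_HTML)
print(payload)
```

With the sample above, pow is 1700000000 + 123456789 = 1823456789. The two strings are joined as text first and only then converted to a number.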
This chain of requests yields the cookie needed by the final GET request. That was a lot of detail, so here is the code:

# The detail page needs a cookie
def getVerify(url):
    infourl = url
    headers = {
        'authority': 'www.xx.com.cn',
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    }
    proxies = getApiIp()  # get a proxy
    if proxies:
        # Visit the detail page without cookies to get bm-verify and pow
        res = getRes(infourl, headers, proxies, '', 'GET')
        if res:
            # ak_bmsc set by the first request
            cookies = re.findall(r"ak_bmsc=.*?;", res.headers.get('set-cookie', ''))
            # bm-verify token
            verifys = re.findall(r'"bm-verify": "(.*?)"', res.text)
            # pow = i + int(the concatenated digit strings)
            a = re.findall(r'var i = (\d+);', res.text)
            b = re.findall(r'Number\("(.*?)"\);', res.text)
            if cookies and verifys and a and b:
                pow_value = int(a[0]) + int(b[0].replace('" + "', ''))
                # Serialize the payload as JSON
                post_data = json.dumps({'bm-verify': verifys[0], 'pow': pow_value})
                logger.info('First-step parameters obtained')
                return post_data, proxies, cookies[0]
            else:
                print('Failed to extract verify parameters')
        else:
            print('verify request failed')
    return None

# Second request: POST the parameters to the verification URL
def getCookie(url):
    post_headers = {
        "authority": "www.xx.com.cn",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
        "accept": "*/*",
        "content-type": "application/json",
        "origin": "https://www.xx.com.cn",
        "referer": url,
    }
    verify = getVerify(url)
    if not verify:
        return None
    post_data, proxies, c_cookie = verify
    post_headers['Cookie'] = c_cookie
    posturl = "https://www.xx.com.cn/_sec/verify?provider=interstitial"
    check = getRes(posturl, post_headers, proxies, post_data, 'POST')
    if check:
        # Take the ak_bmsc cookie from the response headers
        cookies = re.findall(r"ak_bmsc=.*?;", check.headers.get('Set-Cookie', ''))
        if cookies:
            logger.info('Cookie obtained')
            return cookies[0], proxies
        else:
            print('Failed to extract cookie')
    else:
        print('cookie request failed')
    return None

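To summarize the request flow for the detail page:

First request: obtain the bm-verify, pow, and cookie values needed by the following POST request (the getVerify method).

Second request: send the POST request with those values and take ak_bmsc from the response's Set-Cookie header (getCookie).

Remember: across the three requests involved in obtaining the cookie, the second and third must each carry the ak_bmsc from the previous request as their cookie. The second request needs the ak_bmsc from the first, and the final request needs the ak_bmsc from the second.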
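The three steps can also be sketched with a cookie-carrying session, which is a different technique from the manual header handling in this article. With requests.Session, cookies set by one response are sent automatically on later requests to the same site, so ak_bmsc does not have to be copied by hand. The names fetch_detail and solve_challenge are hypothetical: solve_challenge is assumed to be a function that parses the first page and returns the POST payload as a dict, or None when there is nothing to solve.

```python
def fetch_detail(session, url, verify_url, solve_challenge):
    """Run GET -> POST -> GET on one session and return the final page text."""
    first = session.get(url)                 # request 1: returns the challenge page
    payload = solve_challenge(first.text)    # bm-verify and pow parsed from that page
    if payload is None:
        return first.text                    # no challenge found: treat as the real page
    session.post(verify_url, json=payload)   # request 2: the session stores the new cookie
    return session.get(url).text             # request 3: sent with the stored cookie


# Usage sketch (needs the requests package and network access):
#   import requests
#   html = fetch_detail(
#       requests.Session(),
#       "https://www.xx.com.cn/product/cn/THS4541-DIE",
#       "https://www.xx.com.cn/_sec/verify?provider=interstitial",
#       my_solver,  # hypothetical: your own parser for the challenge page
#   )
```

Passing the session in as a parameter keeps the function easy to try out with a stand-in object before pointing it at the real site.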

4. Requesting the detail page

def getItemInfo(url):
    logger.info(f'Requesting detail url-{url}')
    result = getCookie(url)
    if not result:
        return None
    cookie, proxies = result
    headers = {
        'authority': 'www.xx.com.cn',
        'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        'referer': 'https://www.xx.com.cn/product/cn/THS4541-DIE',
        'cookie': cookie
    }
    res = getRes(url, headers, proxies, '', 'GET')
    if not res:
        return None
    content = res.content.decode('utf-8')
    sel = Selector(text=content)
    # The tab titles are matched exactly as they appear in the page markup
    Parameters = sel.xpath('//ti-tab-panel[@tab-title="引數"]/ti-view-more/div').extract_first()
    Features = sel.xpath('//ti-tab-panel[@tab-title="特性"]/ti-view-more/div').extract_first()
    Description = sel.xpath('//ti-tab-panel[@tab-title="描述"]/ti-view-more').extract_first()
    if Parameters and Features and Description:
        return Parameters, Features, Description
    return None

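With the cookie obtained in the previous step, visit the detail link again carrying that cookie. The content now comes back correctly and can be parsed with XPath to extract what we need.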
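To try the XPath expressions without any network access, the sketch below runs them against a tiny hand-written snippet. It swaps in the standard library's xml.etree.ElementTree (which supports only a subset of XPath and needs well-formed markup) in place of the scrapy Selector used in this article; SAMPLE and panel_text are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for the detail page; only two panels are present.
SAMPLE = """
<html><body>
  <ti-tab-panel tab-title="特性"><ti-view-more><div>feature text</div></ti-view-more></ti-tab-panel>
  <ti-tab-panel tab-title="描述"><ti-view-more>description text</ti-view-more></ti-tab-panel>
</body></html>
""".strip()


def panel_text(root, title, path="ti-view-more"):
    """Return the text under the panel with the given tab-title, or None."""
    node = root.find(f'.//ti-tab-panel[@tab-title="{title}"]/{path}')
    if node is None:
        return None
    return "".join(node.itertext()).strip()


root = ET.fromstring(SAMPLE)
features = panel_text(root, "特性", "ti-view-more/div")
description = panel_text(root, "描述")
missing = panel_text(root, "引數", "ti-view-more/div")  # this panel is not in SAMPLE
print(features, description, missing)
```

Checking each panel separately shows which one is missing, which is useful because getItemInfo only returns a result when all three are present.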

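5. Proxy setup

Fetching the detail pages of website T takes more than 100 cookie-carrying requests. If they all come from the same local IP, that IP may get blocked and the data can no longer be fetched normally. For efficient requests, a stable, good-quality proxy IP is essential. I used the ipidea proxy service for website T, and the data came through quickly.

Address: http://www.ipidea.net/?utm-source=csdn&utm-keyword=?wb . First-time users can get some free traffic. This time I fetched proxies through the API; the code is as follows: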

# Get a proxy IP from the API
def getApiIp():
    # Fetch one and only one IP
    api_url = 'http://tiqu.ipidea.io:81/abroad?num=1&type=2&lb=1&sb=0&flow=1&regions=&port=1'
    try:
        res = requests.get(api_url, timeout=5)
        if res.status_code == 200:
            api_data = res.json()['data'][0]
            proxy = 'http://{}:{}'.format(api_data['ip'], api_data['port'])
            proxies = {'http': proxy, 'https': proxy}
            print(proxies)
            return proxies
        else:
            print('Failed to get a proxy')
    except Exception:
        print('Failed to get a proxy')
    return None
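The proxies dictionary built in getApiIp can be factored into a small pure function, which makes the response handling easy to check without calling the API. A minimal sketch, assuming the response body has the shape {'data': [{'ip': ..., 'port': ...}]} that the code above reads; build_proxies and the sample address (taken from the documentation range 203.0.113.0/24) are made up for illustration.

```python
def build_proxies(api_json):
    """Turn the proxy API's JSON body into a requests-style proxies dict, or None."""
    try:
        entry = api_json["data"][0]
        address = "http://{}:{}".format(entry["ip"], entry["port"])
    except (KeyError, IndexError, TypeError):
        # Missing keys, an empty list, or a non-dict body all count as "no proxy".
        return None
    # The same HTTP proxy endpoint is used for both http and https traffic.
    return {"http": address, "https": address}


# Invented sample body; the address is a placeholder, not a working proxy.
sample_body = {"data": [{"ip": "203.0.113.7", "port": 8080}]}
print(build_proxies(sample_body))
```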

6. Full code

# coding=utf-8
import requests
from scrapy import Selector
import re
import json
from loguru import logger

# Get a proxy IP from the API
def getApiIp():
    # Fetch one and only one IP
    api_url = '獲取代理地址'  # placeholder: put your proxy API address here
    try:
        res = requests.get(api_url, timeout=5)
        if res.status_code == 200:
            api_data = res.json()['data'][0]
            proxy = 'http://{}:{}'.format(api_data['ip'], api_data['port'])
            proxies = {'http': proxy, 'https': proxy}
            print(proxies)
            return proxies
        else:
            print('Failed to get a proxy')
    except Exception:
        print('Failed to get a proxy')
    return None

def getItemList():
    url = "https://www.xx.com.cn/selectiontool/paramdata/family/3658/results?lang=cn&output=json"
    headers = {
        'authority': 'www.xx.com.cn',
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    }
    proxies = getApiIp()
    if proxies:
        res = getRes(url, headers, proxies, '', 'GET')
        if not res:
            print('list request failed')
            return
        nodes = res.json()['ParametricResults']
        for node in nodes:
            data = {}
            data["itemName"] = node["o3"]  # product name
            data["inventory"] = node["p3318"]  # inventory
            data["price"] = node["p1130"]['multipair1']['l']  # price
            data["infoUrl"] = f"https://www.xx.com.cn/product/cn/{node['o1']}"  # detail URL
            info = getItemInfo(data["infoUrl"])
            if not info:
                continue  # skip items whose detail page could not be fetched or parsed
            data['Parameters'], data['Features'], data['Description'] = info
            print(data)

# The detail page needs a cookie
def getVerify(url):
    infourl = url
    headers = {
        'authority': 'www.xx.com.cn',
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    }
    proxies = getApiIp()
    if proxies:
        # Visit the detail page to get bm-verify and pow
        res = getRes(infourl, headers, proxies, '', 'GET')
        if res:
            # ak_bmsc set by the first request
            cookies = re.findall(r"ak_bmsc=.*?;", res.headers.get('set-cookie', ''))
            # bm-verify token
            verifys = re.findall(r'"bm-verify": "(.*?)"', res.text)
            # pow = i + int(the concatenated digit strings)
            a = re.findall(r'var i = (\d+);', res.text)
            b = re.findall(r'Number\("(.*?)"\);', res.text)
            if cookies and verifys and a and b:
                pow_value = int(a[0]) + int(b[0].replace('" + "', ''))
                # Serialize the payload as JSON
                post_data = json.dumps({'bm-verify': verifys[0], 'pow': pow_value})
                logger.info('First-step parameters obtained')
                return post_data, proxies, cookies[0]
            else:
                print('Failed to extract verify parameters')
        else:
            print('verify request failed')
    return None

# Second request: POST the parameters to the verification URL
def getCookie(url):
    post_headers = {
        "authority": "www.xx.com.cn",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
        "accept": "*/*",
        "content-type": "application/json",
        "origin": "https://www.xx.com.cn",
        "referer": url,
    }
    verify = getVerify(url)
    if not verify:
        return None
    post_data, proxies, c_cookie = verify
    post_headers['Cookie'] = c_cookie
    posturl = "https://www.xx.com.cn/_sec/verify?provider=interstitial"
    check = getRes(posturl, post_headers, proxies, post_data, 'POST')
    if check:
        # Take the ak_bmsc cookie from the response headers
        cookies = re.findall(r"ak_bmsc=.*?;", check.headers.get('Set-Cookie', ''))
        if cookies:
            logger.info('Cookie obtained')
            return cookies[0], proxies
        else:
            print('Failed to extract cookie')
    else:
        print('cookie request failed')
    return None

def getItemInfo(url):
    logger.info(f'Requesting detail url-{url}')
    result = getCookie(url)
    if not result:
        return None
    cookie, proxies = result
    headers = {
        'authority': 'www.xx.com.cn',
        'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        'referer': 'https://www.xx.com.cn/product/cn/THS4541-DIE',
        'cookie': cookie
    }
    res = getRes(url, headers, proxies, '', 'GET')
    if not res:
        return None
    content = res.content.decode('utf-8')
    sel = Selector(text=content)
    # The tab titles are matched exactly as they appear in the page markup
    Parameters = sel.xpath('//ti-tab-panel[@tab-title="引數"]/ti-view-more/div').extract_first()
    Features = sel.xpath('//ti-tab-panel[@tab-title="特性"]/ti-view-more/div').extract_first()
    Description = sel.xpath('//ti-tab-panel[@tab-title="描述"]/ti-view-more').extract_first()
    if Parameters and Features and Description:
        return Parameters, Features, Description
    return None

# Shared request helper: try up to three times, return None if every attempt fails
def getRes(url, headers, proxies, post_data, method):
    for i in range(3):
        # If the caller passed no proxy, fetch a fresh one for each attempt
        current = proxies or getApiIp()
        try:
            if method == 'POST':
                res = requests.post(url, headers=headers, data=post_data, proxies=current)
            else:
                res = requests.get(url, headers=headers, proxies=current)
            if res:
                return res
        except Exception:
            pass
        print(f'Request attempt {i + 1} failed')
    return None

if __name__ == '__main__':
    getItemList()

With the steps above, we can now fetch the content we need.
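The retry idea behind getRes (three attempts, then give up) can be looked at in isolation. The sketch below is a generic version under stated assumptions: with_retries, send, and next_proxy are hypothetical names, send stands for any function that performs one request through the given proxy, and next_proxy stands for something like getApiIp.

```python
def with_retries(send, next_proxy, attempts=3):
    """Call send(proxies) up to `attempts` times, using a fresh proxy each time."""
    for _ in range(attempts):
        proxies = next_proxy()  # a new proxy for every attempt
        try:
            response = send(proxies)
        except Exception:
            # In real code, narrow this to the request library's own error type.
            continue
        if response is not None:
            return response
    return None  # every attempt failed or returned nothing


# Small offline demonstration with a function that fails twice, then succeeds.
calls = []


def flaky(proxies):
    calls.append(proxies)
    if len(calls) < 3:
        raise ConnectionError("simulated failure")
    return "ok"


proxy_ids = iter(["p1", "p2", "p3", "p4"])
result = with_retries(flaky, lambda: next(proxy_ids))
print(result, calls)
```

Requesting a new proxy on each attempt means one bad proxy costs only a single try.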

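Summary

The hard part of fetching data from website T is the cookie for the detail page (it is not really that hard; the cookie is just long and tiring to read). Once the request flow is sorted out, the rest is just sending the requests. A stable, efficient IP proxy makes the job much easier, and rotating proxies obtained through an API are less likely to be banned by the site, so the data can be fetched more reliably. When trimming the cookies, a suitable request tool such as postman or burp makes things easier.

That is the whole process. It was a bit long-winded; if you spot any mistakes or know a better approach, corrections are welcome!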
