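Data Import and Preprocessing, Lab 2: JSON File Format Conversion

I. Lab Overview
[Objectives]

Gain basic familiarity with data collection methods;
Gain basic familiarity with scraping web data using a crawler;
Learn how to convert between different data formats.

[Environment] (materials, equipment, and software used) Linux or Windows operating system, MySQL database, Python or another high-level language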

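II. Lab Content
Task 1: Scraping Web Data
[Requirements]

Scrape the song title, artist, and duration of the top 500 songs on the Kugou Music chart (https://www.kugou.com/).
Save the scraped data as a JSON file.
Read data back from the JSON file to check that the file format is correct.

[Procedure] (steps, records, data, programs, etc.)
Provide the steps taken and screenshots as evidence.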

from bs4 import BeautifulSoup
import requests
import time
import re
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

nameList = []
singerList = []
timeList = []
song = []
total = []
keys = ['songName','singer','time']

def get_info(url, file):
    res = requests.get(url, headers=headers)
    res.encoding = file.encoding  # decode the page with the same encoding used for the output file
    soup = BeautifulSoup(res.text, 'lxml')
    titles = soup.select('a.pc_temp_songname')
    durations = soup.select('span.pc_temp_time')
    # The loop variable is named `duration` so it does not shadow the `time` module
    for title, duration in zip(titles, durations):
        # Each title reads "singer - song name"; split on the first separator only
        singer, songName = title.get_text().strip().split(' - ', 1)
        nameList.append(songName)
        singerList.append(singer)
        timeList.append(duration.get_text().strip())

def output(url, file):
    # Pair each song name with its singer and duration, keyed by the names in `keys`
    records = [dict(zip(keys, item)) for item in zip(nameList, singerList, timeList)]
    # Dump one top-level object in a single call so commas and brackets are always well-formed
    json.dump({'songInfo': records}, file, ensure_ascii=False, indent=4)
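def get_website_encoding(url):  # a site's pages normally share one encoding, so checking the home page once is enough
    res = requests.get(url, headers=headers)
    charset = re.search("charset=(.*?)>", res.text)
    if charset is not None:
        blocked = ['\'', ' ', '\"', '/']
        cleaned = [c for c in charset.group(1) if c not in blocked]  # renamed from `filter`, which shadows a builtin
        return ''.join(cleaned)  # the encoding declared by the page itself, used to avoid garbled text
    else:
        return res.encoding  # no charset declaration found; fall back to the default encoding of res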

if __name__ == '__main__':
    encoding = get_website_encoding('http://www.kugou.com')
    urls = ['http://www.kugou.com/yy/rank/home/{}-8888.html?from=rank'.format(i) for i in range(1, 23)]
    # A plain relative path works on both Linux and Windows
    with open('kugou_500.json', 'w', encoding=encoding) as f:
        for url in urls:
            get_info(url, f)
            time.sleep(1)  # pause one second between requests to avoid hitting the site too often
        output(url, f)
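The title-splitting step in get_info can be looked at in isolation. The unpacking there raises ValueError when a title contains no ' - ' separator, so a slightly more defensive variant may be useful. The sketch below is only an illustration: the helper name split_title and the sample titles are hypothetical, and it assumes chart titles follow the "singer - song name" pattern that get_info relies on.

```python
def split_title(title):
    """Split a 'singer - song name' chart title into (singer, song_name)."""
    # partition splits on the first ' - ' only, so a song name that itself
    # contains ' - ' stays intact.
    singer, sep, song_name = title.partition(' - ')
    if not sep:
        # No separator found: treat the whole text as the song name.
        return '', title.strip()
    return singer.strip(), song_name.strip()


# Hypothetical sample titles, not taken from the real chart.
print(split_title('Singer A - Song B'))
print(split_title('Singer A - Song B - Live'))
print(split_title('Untitled'))
```

Returning an empty singer rather than raising keeps one malformed entry from aborting the whole crawl; whether that is the right trade-off depends on how strictly the data should be checked.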

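The script produces the JSON file kugou_500.json.
Open the file with json.load; if the content loads and prints without error, the file format is correct.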

import json

with open("kugou_500.json",'r',encoding='UTF-8') as f:
    new_dict = json.load(f)
    print(new_dict)
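The same check can be wrapped in a function that reports failure instead of stopping with a traceback. This is a minimal sketch, not part of the original lab code: the function name is_valid_json_file and the two temporary sample files are assumptions made for the demonstration.

```python
import json
import os
import tempfile


def is_valid_json_file(path, encoding='utf-8'):
    """Return True if the file at `path` parses as JSON, else False."""
    try:
        with open(path, 'r', encoding=encoding) as f:
            json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError):
        # Malformed JSON, or bytes that do not match the given encoding.
        return False
    return True


# Demonstration on two temporary files holding hypothetical sample data.
tmp_dir = tempfile.mkdtemp()
good_path = os.path.join(tmp_dir, 'good.json')
bad_path = os.path.join(tmp_dir, 'bad.json')
with open(good_path, 'w', encoding='utf-8') as f:
    f.write('{"songInfo": [{"songName": "Song B", "singer": "Singer A", "time": "3:45"}]}')
with open(bad_path, 'w', encoding='utf-8') as f:
    f.write('{"songInfo": [{"songName": "Song B"},]}')  # trailing comma is not valid JSON

print(is_valid_json_file(good_path))
print(is_valid_json_file(bad_path))
```

Catching UnicodeDecodeError as well covers the case where the file was written in a different encoding from the one used to read it.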


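Task 2: Generating a CSV File and Converting It to JSON
[Requirements]

Write a program that generates a CSV file with the following content (columns: name, gender, native place, department):
姓名，性別，籍貫，系別
張迪，男，重慶，計算機系
蘭博，男，江蘇，通訊工程系
黃飛，男，四川，物聯網系
鄧玉春，女，陝西，計算機系
周麗，女，天津，藝術系
李雲，女，上海，外語系
Convert the CSV file above to JSON format, then look up the records of all female students in the file.

[Procedure] (steps, records, data, programs, etc.)
Provide the steps taken and screenshots as evidence.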

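import csv
# Open the output file; the with block closes it so every row is flushed to disk,
# and newline="" lets the csv module control line endings
with open("question02.csv","w",encoding="utf-8",newline="") as f:
    # Create the CSV writer
    csv_writer = csv.writer(f)
    # Write the header row
    csv_writer.writerow(["姓名","性別","籍貫","系別"])
    # Write the data rows (the 鄧玉春 row listed in the requirements is not written here)
    csv_writer.writerow(["張迪","男","重慶","計算機系"])
    csv_writer.writerow(["蘭博","男","江蘇","通訊工程系"])
    csv_writer.writerow(["黃飛","男","四川","物聯網系"])
    csv_writer.writerow(["周麗","女","天津","藝術系"])
    csv_writer.writerow(["李雲","女","上海","外語系"])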

Convert the CSV file to JSON format:

import csv
import json

# Read every CSV row as a dict keyed by the header line
with open("question02.csv","r",encoding="utf-8",newline="") as csvFile:
    rows = list(csv.DictReader(csvFile))

for row in rows:
    print(row)

# Wrap the rows in a top-level "personInfo" list; dumping them in one call
# avoids hard-coding the number of rows when placing commas
with open("question02.json","w",encoding="utf-8") as jsonFile:
    json.dump({"personInfo": rows}, jsonFile, ensure_ascii=False, indent=4)


import json
with open("question02.json","r",encoding="utf-8") as f:
    data = json.load(f)
    # Iterate over the records directly instead of assuming a fixed count
    for person in data['personInfo']:
        if person['性別'] == '女':
            print(person)
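The whole CSV to JSON to query chain can also be tried in memory, without touching any files. The sketch below is an illustration under stated assumptions: csv_text is a shortened two-row stand-in for question02.csv, and the variable names are chosen for this example only.

```python
import csv
import io
import json

# A shortened stand-in for question02.csv with the same column layout.
csv_text = "姓名,性別,籍貫,系別\n張迪,男,重慶,計算機系\n周麗,女,天津,藝術系\n"

# csv.DictReader accepts any iterable of lines, so a StringIO works like a file.
rows = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps({"personInfo": rows}, ensure_ascii=False, indent=4)

# Parse the JSON text back and keep only the records whose gender is female.
data = json.loads(json_text)
females = [p for p in data["personInfo"] if p["性別"] == "女"]
print(females)
```

A list comprehension keeps the filter independent of how many rows the file holds.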


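Task 3: Converting between XML and JSON
[Content and Requirements]
(1) Read an XML file with the following content:
<?xml version="1.0" encoding="gb2312"?>
<圖書>
<書名>紅樓夢</書名>
<作者>曹雪芹</作者>
<主要內容>描述賈寶玉和林黛玉的愛情故事</主要內容>
<出版社>人民文學出版社</出版社>
</圖書>
(2) Convert the XML file above to JSON format.

[Procedure] (steps, records, data, programs, etc.)
Provide the corresponding code and screenshots of the program running.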

First, create the XML file question_03.xml with the content given in the requirements.

import json
import xmltodict

# Read the XML file as text (this assumes the file itself is saved as UTF-8)
with open("question_03.xml","r",encoding="utf-8") as file:
    xmlStr = file.read()

# xmltodict.parse returns a dict that mirrors the XML element tree
xmlDict = xmltodict.parse(xmlStr)

# Serialize the dict as JSON, keeping the Chinese characters readable
with open("question03JSON.json","w",encoding="utf-8") as f:
    f.write(json.dumps(xmlDict,ensure_ascii=False,indent=4,separators=(',', ': ')))
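Where xmltodict is not installed, a flat record like this one can also be converted with the standard library alone. This is a sketch of an alternative technique, not the method used above: it swaps in xml.etree.ElementTree, parses an inline string that omits the XML declaration, and only handles one level of child elements.

```python
import json
import xml.etree.ElementTree as ET

# Inline copy of the book record, without the XML declaration.
xml_text = """<圖書>
    <書名>紅樓夢</書名>
    <作者>曹雪芹</作者>
    <主要內容>描述賈寶玉和林黛玉的愛情故事</主要內容>
    <出版社>人民文學出版社</出版社>
</圖書>"""

root = ET.fromstring(xml_text)
# Map each child tag to its text; nested elements and attributes are ignored.
book = {root.tag: {child.tag: child.text for child in root}}
print(json.dumps(book, ensure_ascii=False, indent=4))
```

For documents with nesting, repeated tags, or attributes, a recursive conversion or a library such as xmltodict is the more practical choice.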
