Data Collection and Fusion Technology - Lab Assignment 2

Published: 2023-10-18 18:40:21  Author: tiantianmimi

Assignment ①:

(1) Weather Forecast Experiment

Requirement: Crawl the 7-day weather forecast for a given set of cities from the China Weather site (http://www.weather.com.cn) and save it to a database.
– Output:
Gitee folder link
https://gitee.com/tiantianmi/crawl_project/blob/master/%E4%BD%9C%E4%B8%9A2/%E5%A4%A9%E6%B0%94

(2) Experiment Code

from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
import urllib.request
import sqlite3

url = "http://www.weather.com.cn/weather/101280601.shtml"

# Fetch the page and decode it, trying UTF-8 first and falling back to GBK
req = urllib.request.Request(url)
data = urllib.request.urlopen(req).read()
dammit = UnicodeDammit(data, ["utf-8", "gbk"])
data = dammit.unicode_markup
soup = BeautifulSoup(data, "lxml")

# Each <li> under the 7-day forecast list holds one day's data
lis = soup.select("ul[class='t clearfix'] li")

conn = sqlite3.connect('tt.db')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS tt
             (date TEXT,
              weather TEXT,
              temp TEXT)''')
for li in lis:
    try:
        date = li.select('h1')[0].text
        weather = li.select('p[class="wea"]')[0].text
        temp = li.select('p[class="tem"] span')[0].text + "/" + li.select('p[class="tem"] i')[0].text
        record = (date, weather, temp)
        c.execute('''INSERT INTO tt (date, weather, temp)
                     VALUES (?, ?, ?)''', record)
        conn.commit()
    except Exception as err:
        print(err)

c.execute("SELECT * FROM tt")
rows = c.fetchall()
for row in rows:
    print(row[0], row[1], row[2])
conn.close()
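
The script above crawls a single city (page code 101280601 in the URL), while the requirement asks for a given set of cities. Below is a minimal sketch of looping the same logic over several cities; the city-to-code mapping and the weather_multi table name are assumptions of mine, not part of the original, so verify each code on weather.com.cn before use.

from bs4 import BeautifulSoup, UnicodeDammit
import urllib.request
import sqlite3

# Hypothetical city-to-code mapping; verify each code on weather.com.cn
cities = {"Beijing": "101010100", "Shanghai": "101020100", "Shenzhen": "101280601"}

conn = sqlite3.connect('tt.db')
c = conn.cursor()
# A separate table so the schema does not clash with the tt table above
c.execute('''CREATE TABLE IF NOT EXISTS weather_multi
             (city TEXT, date TEXT, weather TEXT, temp TEXT)''')

for city, code in cities.items():
    page = urllib.request.urlopen("http://www.weather.com.cn/weather/%s.shtml" % code).read()
    soup = BeautifulSoup(UnicodeDammit(page, ["utf-8", "gbk"]).unicode_markup, "lxml")
    for li in soup.select("ul[class='t clearfix'] li"):
        try:
            date = li.select('h1')[0].text
            weather = li.select('p[class="wea"]')[0].text
            temp = li.select('p[class="tem"] span')[0].text + "/" + li.select('p[class="tem"] i')[0].text
            c.execute("INSERT INTO weather_multi (city, date, weather, temp) VALUES (?, ?, ?, ?)",
                      (city, date, weather, temp))
        except Exception as err:
            print(err)

conn.commit()
conn.close()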

(3) Results

(4) Reflections

In this experiment I used a try-except block to catch exceptions that may occur while parsing each day's entry, so that a single malformed item does not stop the whole crawl.

Assignment ②

(1) Stock Crawling Experiment

Requirement: Use the requests and BeautifulSoup libraries to crawl stock information and store it in a database.
– Candidate sites: Eastmoney: https://www.eastmoney.com/
Sina Stocks: http://finance.sina.com.cn/stock/
– Tip: Open the F12 developer tools in Chrome to capture packets, find the URL that loads the stock list, and analyze the values the API returns. The request parameters (f1, f2, and so on) select which fields come back, so they can be adjusted or trimmed as needed; a trimmed-request sketch follows the Gitee link below.
Reference: https://zhuanlan.zhihu.com/p/50099084
– Output:
Gitee folder link
https://gitee.com/tiantianmi/crawl_project/blob/master/%E4%BD%9C%E4%B8%9A2/%E8%82%A1%E7%A5%A8
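
As an illustration of the tip above, here is a minimal sketch of a trimmed request. The parameter set is my assumption, pared down from the full URL used in the code below, and dropping the cb callback parameter should make the endpoint return plain JSON instead of JSONP; verify both in the F12 Network tab before relying on them.

import requests

# Keep only the parameters we actually use (an assumption to verify in F12);
# without the 'cb' callback the endpoint should return plain JSON, not JSONP
params = {
    "pn": 1, "pz": 20, "po": 1, "np": 1, "fltt": 2, "invt": 2,
    "fid": "f3",
    "fs": "m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048",
    "fields": "f2,f3,f4,f5,f6,f7,f12,f14,f15,f16,f17,f18",
}
r = requests.get("http://23.push2.eastmoney.com/api/qt/clist/get", params=params)
stocks = r.json()["data"]["diff"]
print(stocks[0]["f12"], stocks[0]["f14"])  # code and name of the first stock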

(2) Experiment Code

import requests
import json
import sqlite3


# Fetch one page of the stock list from the Eastmoney API and return the
# list of stock records. The response is JSONP, i.e. callback({...});, so
# slice out the JSON between the first '(' and the last ')'
def getData(url):
    r = requests.get(url=url)
    text = r.text
    data = json.loads(text[text.find("(") + 1: text.rfind(")")])
    return data['data']['diff']


def main():
    print("Enter the number of pages to crawl:")
    number = int(input())
    # Two templates: one for the header row, one for data rows;
    # chr(12288) is a full-width space used to keep CJK text aligned
    tpll = "{0:^5}{1:^5}{2:^9}{3:^8}{4:^7}{5:^7}{6:^7}{7:^15}{8:^7}{9:^7}{10:^7}{11:^8}{12:^8}{13:^8}"
    tplt = "{0:^5}{1:^8}{2:^9}{3:^9}{4:^9}{5:^9}{6:^9}{7:^15}{8:^9}{9:^9}{10:^9}{11:^9}{12:^9}{13:^9}"
    print(tpll.format("No.", "Code", "Name", "Latest", "Change%", "Change", "Volume", "Turnover", "Amplitude", "High", "Low", "Open", "PrevClose", chr(12288)))
    conn = sqlite3.connect('tt.db')
    cursor = conn.cursor()
    # A dedicated stock table, so it does not collide with other tables in tt.db
    cursor.execute('''CREATE TABLE IF NOT EXISTS stock
                      (serial_no INTEGER, code TEXT, name TEXT, latest_price REAL,
                       change_percent REAL, change_amount REAL, volume INTEGER, amount REAL,
                       amplitude REAL, highest REAL, lowest REAL, today_open REAL, yesterday_close REAL)''')
    for j in range(number):
        page = j + 1
        url = 'http://23.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124039403436320005647_1696661497357&pn=' + str(page) + '&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1696661497358'
        data = getData(url)
        for i in range(len(data)):
            # Field mapping: f12 code, f14 name, f2 latest price, f3 change %,
            # f4 change amount, f5 volume, f6 turnover, f7 amplitude,
            # f15 high, f16 low, f17 open, f18 previous close
            d = (20 * j + i + 1, data[i]['f12'], data[i]['f14'], data[i]['f2'], data[i]['f3'], data[i]['f4'], data[i]['f5'], data[i]['f6'], data[i]['f7'], data[i]['f15'], data[i]['f16'], data[i]['f17'], data[i]['f18'])
            cursor.execute("INSERT INTO stock VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", d)
            conn.commit()
    cursor.execute("SELECT * FROM stock")
    rows = cursor.fetchall()
    for row in rows:
        print(tplt.format(row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7], row[8], row[9], row[10], row[11], row[12], chr(12288)))
    conn.close()


if __name__ == "__main__":
    main()

(3) Results

(4) Reflections

This experiment was my first time extracting information from JSON. At first it seemed difficult, but as I analyzed the response step by step I found it quite interesting.

Assignment ③

(1) University Ranking Crawling Experiment

Requirement: Crawl all university information from the 2021 main ranking of Chinese universities (https://www.shanghairanking.cn/rankings/bcur/2021), store it in a database, and include a GIF recording of the browser F12 debugging and analysis process in the blog.
– Tip: Analyze the site's network traffic to find the API that returns the data; a sketch of that approach follows the Gitee link below.
– Output:
Gitee folder link
https://gitee.com/tiantianmi/crawl_project/blob/master/%E4%BD%9C%E4%B8%9A2/%E5%A4%A7%E5%AD%A6
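
For the tip above, here is a minimal sketch of pulling the ranking straight from the site's API. The endpoint URL and the field names (rankings, ranking, univNameCn, province, univCategory, score) are assumptions taken from F12 inspection of the page, not a documented API, so confirm them in the Network tab before relying on them.

import requests

# Assumed endpoint and field names; confirm both in the F12 Network tab
url = "https://www.shanghairanking.cn/api/pub/v1/bcur?bcur_type=11&year=2021"
resp = requests.get(url).json()
for item in resp["data"]["rankings"][:5]:
    print(item["ranking"], item["univNameCn"], item["province"],
          item["univCategory"], item["score"])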

(2) Experiment Code

from bs4 import BeautifulSoup
import urllib.request
import sqlite3

u = []  # flat list of every <td> text; each row contributes six cells
v = []  # school names taken from the <a> tag in each row

html = urllib.request.urlopen(url="http://www.shanghairanking.cn/rankings/bcur/2021")
soup = BeautifulSoup(html, "html.parser")
tplt = "{0:^13}{1:^16}{2:^16}{3:^10}{4:^13}"
print(tplt.format("Rank", "School", "Province", "Category", "Score", chr(12288)))

# find_all('tr') yields only the row tags, skipping the whitespace
# text nodes that iterating over .children would also produce
for tr in soup.find('tbody').find_all('tr'):
    tds = tr.find_all("td")
    a = tr.find("a")  # the <a> tag holds just the Chinese school name
    for td in tds:
        u.append(td.text.replace(" ", "").replace("\n", ""))
    v.append(a.text.replace(" ", "").replace("\n", ""))

conn = sqlite3.connect('tt.db')

# Create the table
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS ranking
             (rank INTEGER,
              school TEXT,
              province TEXT,
              category TEXT,
              score REAL)''')
for i in range(len(v)):
    j = 6 * i  # offset of row i's first cell in the flat list u
    data = (u[j], v[i], u[j + 2], u[j + 3], u[j + 4])
    c.execute('''INSERT INTO ranking (rank, school, province, category, score)
                 VALUES (?, ?, ?, ?, ?)''', data)
    conn.commit()

c.execute("SELECT * FROM ranking")
rows = c.fetchall()
for row in rows:
    print(tplt.format(row[0], row[1], row[2], row[3], row[4], chr(12288)))
conn.close()

(3) Results

(4) Reflections

In this experiment the school-name cell contains two parts (the Chinese name and the English name), so I used two lists to store and output them separately. Storing the crawled data in the database felt like writing SQL, which was quite interesting.