Data Collection: Practical Assignment 4

Published 2023-11-15 23:27:27 · Author: xunfeng2310

Assignment 1

Requirements:

  • Become proficient with Selenium: locating HTML elements, scraping Ajax-loaded pages, and waiting for HTML elements.
  • Use the Selenium framework + MySQL database storage to crawl stock data from the three boards "沪深A股", "上证A股", and "深证A股".
  • Candidate site: Eastmoney (东方财富网): http://quote.eastmoney.com/center/gridlist.html#hs_a_board
  • Output: MySQL database storage and output format; column headers should be named in English, e.g. id for the serial number, bStockNo for the stock code, etc., with the table headers designed by each student (a minimal schema sketch follows this list).
  • Gitee folder link
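
For reference, here is a minimal sketch of one possible English-named table, created with pymysql; only id and bStockNo come from the assignment text, while the remaining column names, the table name stocks_en, and the local connection settings are my own assumptions.

import pymysql

# Sketch only: column names other than id and bStockNo are hypothetical choices,
# and the connection settings assume a local MySQL instance.
conn = pymysql.connect(host="localhost", port=3306, user="root",
                       password="123456", database="stock", charset="utf8")
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS stocks_en(
                   id INT,                    -- serial number
                   bStockNo VARCHAR(16),      -- stock code
                   bStockName VARCHAR(64),    -- stock name
                   bNewPrice VARCHAR(16),     -- latest price
                   bUpDownRate VARCHAR(16),   -- change rate
                   bUpDownAmount VARCHAR(16), -- change amount
                   bVolume VARCHAR(32),       -- volume
                   bTurnover VARCHAR(32),     -- turnover
                   bAmplitude VARCHAR(16),    -- amplitude
                   bHigh VARCHAR(16),         -- high
                   bLow VARCHAR(16),          -- low
                   bOpen VARCHAR(16),         -- open
                   bPrevClose VARCHAR(16)     -- previous close
               )""")
conn.commit()
conn.close()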

Partial code:

from selenium import webdriver
from selenium.webdriver.common.by import By
import urllib.parse


def getUrl(j):
    # Scroll down step by step so the lazily rendered table rows are loaded
    for i in range(70):
        js = 'window.scrollTo(0,%s)' % (i * 100)  # JS scroll script
        driver.execute_script(js)
    tbody = driver.find_element(By.XPATH, "//tbody")
    rows = tbody.find_elements(By.XPATH, "./tr")
    columns = {0: "代码", 1: "名称", 2: "最新价格", 3: "涨跌额", 4: "涨跌幅", 5: "成交量", 6: "成交额", 7: "振幅",
               8: "最高", 9: "最低", 10: "今开", 11: "昨收"}
    print(columns)
    for row in rows:
        column = row.find_elements(By.XPATH, "td")
        no = column[0].text              # serial number
        code = column[1].text            # stock code
        name = column[2].text            # stock name
        zuixin = column[4].text          # latest price
        er = column[5].text              # change amount
        fu = column[6].text              # change rate
        chengjiaoliang = column[7].text  # volume
        chengjiaoer = column[8].text     # turnover
        zhenfu = column[9].text          # amplitude
        zuigao = column[10].text         # high
        zuidi = column[11].text          # low
        jinkai = column[12].text         # open
        zuoshou = column[13].text        # previous close
        print("\t" + no + "\t" + code + "\t" + name + "\t" + zuixin + "\t" + er + "\t" + fu, end='')
        print("\t" + chengjiaoliang + "\t" + chengjiaoer + "\t" + zhenfu + "\t" + zuigao + "\t" + zuidi + "\t" + jinkai + "\t" + zuoshou)
if __name__ == '__main__':
    driver = webdriver.Chrome()  # create a single Chrome browser instance
    key = ["hs", "sh", "sz"]     # the three boards: 沪深A股 / 上证A股 / 深证A股
    for i in range(len(key)):
        keyword = urllib.parse.quote(key[i])
        url = f"http://quote.eastmoney.com/center/gridlist.html#{keyword}_a_board"
        driver.get(url)
        getUrl(i)
    driver.quit()


import pymysql


class Mysql:
    # Pipeline that writes each scraped item into the MySQL "stock" database
    def process_item(self, item, spider):
        self.mydb = pymysql.connect(
            host="192.168.149.1",
            port=3306,
            user='root',
            password='123456',
            database="stock",
            charset='utf8'
        )
        self.cursor = self.mydb.cursor()
        # Create the target table if it does not exist yet
        self.cursor.execute('''CREATE TABLE IF NOT EXISTS stocks(
                                                  stockname VARCHAR(256),
                                                  name VARCHAR(256),
                                                  newprice VARCHAR(256),
                                                  zhangdiefu VARCHAR(256),
                                                  zhangdieer VARCHAR(256),
                                                  chengjiaoliang VARCHAR(256),
                                                  chengjiaoer VARCHAR(256),
                                                  zhenfu VARCHAR(256),
                                                  zuigao VARCHAR(256),
                                                  zuidi VARCHAR(256),
                                                  jinkai VARCHAR(256),
                                                  zuoshou VARCHAR(256)
                                                   )''')
        self.mydb.commit()
        # Insert one row per item, using parameterized SQL
        sql = "insert into stocks values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        self.cursor.execute(sql, (item.get("stockname"), item.get("name"), item.get("newprice"),
                                  item.get("zhangdiefu"), item.get("zhangdieer"), item.get("chengjiaoliang"),
                                  item.get("chengjiaoer"), item.get("zhenfu"), item.get("zuigao"),
                                  item.get("zuidi"), item.get("jinkai"), item.get("zuoshou")))
        self.mydb.commit()
        return item

    def close_spider(self, spider):
        self.mydb.close()
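
As a usage note (my own sketch, not part of the submitted code), the values collected in getUrl's row loop could be handed to this pipeline directly, assuming the field names that process_item reads; the spider argument is unused, so None is passed.

# Sketch: call the pipeline directly with one scraped row
def save_row(code, name, zuixin, fu, er, chengjiaoliang, chengjiaoer,
             zhenfu, zuigao, zuidi, jinkai, zuoshou):
    item = {
        "stockname": code, "name": name, "newprice": zuixin,
        "zhangdiefu": fu, "zhangdieer": er,
        "chengjiaoliang": chengjiaoliang, "chengjiaoer": chengjiaoer,
        "zhenfu": zhenfu, "zuigao": zuigao, "zuidi": zuidi,
        "jinkai": jinkai, "zuoshou": zuoshou,
    }
    pipeline = Mysql()
    pipeline.process_item(item, None)
    pipeline.close_spider(None)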

Results:


Gitee folder link: 林梓源/数据采集与融合技术 - Gitee.com

Reflections

This assignment gave me a deeper understanding of Selenium, in particular locating elements by XPath and scrolling the page to trigger loading of Ajax-rendered rows.

Assignment 2:

Requirements:

  • Become proficient with Selenium: locating HTML elements, simulating user login, scraping Ajax-loaded pages, and waiting for HTML elements.
  • Use the Selenium framework + MySQL to crawl course information from China University MOOC (course number, course name, school, lead teacher, team members, enrollment count, course schedule, course description).
  • Candidate site: China University MOOC (中国大学MOOC): https://www.icourse163.org
  • Output: MySQL database storage and output format (a minimal schema sketch follows this list).
  • Gitee folder link
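
A minimal sketch of the required MySQL table, created with pymysql; the host, credentials, and database name are my own placeholders, and the column names mirror the SQLite schema used in the code below.

import pymysql

# Sketch only: connection settings and database name are assumptions
conn = pymysql.connect(host="localhost", port=3306, user="root",
                       password="123456", database="mooc", charset="utf8")
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS mooc(
                   id VARCHAR(16),        -- course number
                   cCourse VARCHAR(256),  -- course name
                   cCollege VARCHAR(256), -- school
                   cTeacher VARCHAR(256), -- lead teacher
                   cTeam VARCHAR(256),    -- team members
                   cCount VARCHAR(64),    -- enrollment count
                   cProcess VARCHAR(128), -- course schedule
                   cBrief TEXT            -- course description
               )""")
conn.commit()
conn.close()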

Partial code:

from selenium import webdriver
from selenium.webdriver.edge.options import Options
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.edge.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
import sqlite3


def login():
    # Open the login panel from the top navigation bar
    next_button = driver.find_element(By.XPATH, '//*[@id="j-topnav"]/div')
    webdriver.ActionChains(driver).move_to_element(next_button).click(next_button).perform()
    # Choose the third-party (QQ) login entry in the login dialog
    next_button = driver.find_element(By.XPATH,
                                      '/html/body/div[12]/div[2]/div/div/div/div/div/div[1]/div/div[2]/div[1]/a[1]')
    webdriver.ActionChains(driver).move_to_element(next_button).click(next_button).perform()
    time.sleep(2)
    # Switch into the QQ login iframe and click the remembered account avatar
    driver.switch_to.frame('ptlogin_iframe')
    next_button = driver.find_element(By.CSS_SELECTOR, '#img_out_539943419')
    webdriver.ActionChains(driver).move_to_element(next_button).click(next_button).perform()
    time.sleep(15)


def spider():
    # Navigate to one course page and wait for it to render
    driver.execute_script("window.open('https://www.icourse163.org/course/ZJU-199001','_self');")
    time.sleep(5)
    name = driver.find_element(By.XPATH, '/html/body/div[4]/div[2]/div[1]/div/div[3]/div/div[1]/div[1]/span[1]').text
    teacher = driver.find_element(By.XPATH,
                                  '/html/body/div[4]/div[2]/div[2]/div[2]/div[2]/div[2]/div[2]/div/div/div[2]/div/div/div/div/div/h3').text
    school = driver.find_element(By.XPATH,
                                 "/html/body/div[4]/div[2]/div[2]/div[2]/div[2]/div[2]/div[2]/div/a/img").get_attribute("alt")
    count = driver.find_element(By.XPATH,
                                '/html/body/div[4]/div[2]/div[1]/div/div[3]/div/div[2]/div/div[1]/div[4]/span[2]').text
    time1 = driver.find_element(By.XPATH, '//*[@id="course-enroll-info"]/div/div[1]/div[2]/div/span[2]').text
    brief = driver.find_element(By.XPATH, '//*[@id="j-rectxt2"]').text
    print(name, teacher, school, count, time1, brief)
    item = [str(1), name, school, teacher, teacher, count, time1, brief]
    # Parameterized insert avoids hand-building the SQL string
    db.execute("INSERT INTO mooc VALUES (?,?,?,?,?,?,?,?)", item)
    db.commit()


db = sqlite3.connect('w1.db')
# Create the result table (IF NOT EXISTS so the script can be rerun)
sql_text = '''CREATE TABLE IF NOT EXISTS mooc
                (id TEXT,
                    cCourse TEXT,
                    cCollege TEXT,
                    cTeacher TEXT,
                    cTeam TEXT,
                    cCount TEXT,
                    cProcess TEXT,
                    cBrief TEXT);'''
db.execute(sql_text)
db.commit()

edge_options = Options()
driver = webdriver.Edge(options=edge_options)
res = []

driver.get("https://www.icourse163.org/home.htm?userId=1482874151#/home/spocCourse")
try:
    login()
except Exception as e:
    print("------------------------------------error---------------------------------------")
    print(e)

try:
    spider()
except Exception as e:
    print("------------------------------------error---------------------------------------")
    print(e)
time.sleep(999)  # keep the browser window open for manual inspection before exiting

db.close()
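
The WebDriverWait and expected_conditions imports above are never actually used; as a sketch, the fixed time.sleep calls could be replaced with an explicit wait like the one below, which reuses the course-name XPath from spider() and an arbitrary 10-second timeout.

# Sketch: wait for the course-name element instead of sleeping a fixed time
wait = WebDriverWait(driver, 10)
name_elem = wait.until(expected_conditions.presence_of_element_located(
    (By.XPATH, '/html/body/div[4]/div[2]/div[1]/div/div[3]/div/div[1]/div[1]/span[1]')))
print(name_elem.text)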

Results:

Reflections:

This assignment deepened my understanding of Selenium, in particular simulating user login (switching into the login iframe) and ActionChains-based interaction.
Gitee folder link: 林梓源/数据采集与融合技术 - Gitee.com

Assignment 3:

Requirements:

  • Get to know the big-data related cloud services and become familiar with using Xshell.
  • Complete the tasks in the document 华为云_大数据实时分析处理实验手册-Flume日志采集实验(部分)v2.docx, i.e. the five tasks below; see the document for the detailed steps.
  • Environment setup:
  • Task 1: Enable the MapReduce Service (MRS)
  • Real-time analytics development practice:
  • Task 1: Generate test data with a Python script (a small sketch follows below)
  • Task 2: Configure Kafka
  • Task 3: Install the Flume client
  • Task 4: Configure Flume to collect data

Output: screenshots of the key steps or results of the experiment.
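
A small sketch of what such a test-data script might look like, assuming records are appended to a plain-text log file that Flume tails; the output path, field layout, and interval are my own assumptions rather than the script from the lab manual.

import random
import time
from datetime import datetime

LOG_FILE = "/tmp/flume_test.log"   # hypothetical path watched by the Flume source


def generate(lines=100, interval=0.5):
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        for i in range(lines):
            record = "{},{},user_{},{}".format(
                datetime.now().strftime("%Y-%m-%d %H:%M:%S"),   # event time
                i,                                              # sequence number
                random.randint(1, 1000),                        # random user id
                random.choice(["click", "view", "buy"]))        # random action
            f.write(record + "\n")
            f.flush()
            time.sleep(interval)  # emit gradually so Flume picks records up in real time


if __name__ == "__main__":
    generate()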

Selected screenshots:






Reflections:

This assignment made me familiar with using Xshell to work on the cloud environment while going through the Flume and Kafka tasks.