Comprehensive Design: Integrated Practice in Multi-Source Heterogeneous Data Collection and Fusion

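Course this project belongs to	2023 Data Collection and Fusion Technology
Group name and project overview	Group name: 普雷蒙奇; project requirement: multimodal sentiment analysis; project goal: produce a comprehensive sentiment analysis from a keyword searched on the web page; technical route: front end, Python, Huawei Cloud platform
Team member student IDs	102102112, 102102115, 102102116, 102102118, 102102119, 102102120, 102102156, 102102159
Project goal	Analyze the sentiment of text, images, video, or audio uploaded through the web page
Other references	[1] 梁爱华, 王雪峤. 多模态学习数据采集与融合; [2] 陈燕, 赖宇斌. 基于CLIP和交叉注意力的多模态情感分析模型; [3] 武星, 殷浩宇. 面向视频数据的多模态情感分析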

Gitee folder link:

https://gitee.com/w-jking/crawl_project/blob/master/大作业/datacrawl.7z

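Project overview:

Project name:
Sentiment Analysis of Domestic Mobile Phones

Project background:
In recent years a new wave of enthusiasm for domestic products has emerged, and more than 90% of the supply chain of Huawei's Mate60 series is domestic. Genuine consumer feedback is critical to a phone brand's reputation and market performance. Collecting and analyzing consumer feedback on domestic phones gives users a basis for choosing a brand and gives the brands useful suggestions and directions for improvement.

Project goal:
By collecting and mining data in different modalities (text, images, audio) and applying different sentiment analysis models, build a system that performs multimodal analysis of various aspects of domestic phones. The system produces a comprehensive sentiment analysis for each domestic phone brand and shows directly how the public feels about domestic phones, which makes it easier to judge which of them deserve to be called the "pride of domestic products" (国货之光).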

Project flowchart:

Technical route:

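Front-end development:
- Use HTML, CSS, and JavaScript to design the front-end interface. The user enters a keyword, the number of videos, and the number of danmaku, and receives a comprehensive analysis.
- Improve the user experience with animation and transition effects, which make the page more interactive and appealing.
Back-end development:
- The back end is written in Python.
- The Django framework receives the information submitted by the front end and returns the results produced by the back end.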
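The hand-off from the form to the back end can be sketched as a small validation helper. This is only an illustration, not the project's actual view code: the field names `keyword`, `num`, and `desired_danmu_count` come from the form on the page, while `SearchRequest` and `parse_search_form` are hypothetical names introduced here. The helper is framework-free so it can run on its own.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class SearchRequest:
    """Hypothetical container for the three values submitted by the form."""
    keyword: str
    num: int                  # number of videos to search
    desired_danmu_count: int  # number of danmaku to collect


def parse_search_form(post: Mapping[str, str]) -> SearchRequest:
    """Validate the form fields and convert the two counts to integers."""
    keyword = post.get("keyword", "").strip()
    if not keyword:
        raise ValueError("keyword must not be empty")

    def positive_int(name: str) -> int:
        raw = post.get(name, "").strip()
        if not raw.isdigit() or int(raw) == 0:
            raise ValueError(f"{name} must be a positive integer")
        return int(raw)

    return SearchRequest(
        keyword=keyword,
        num=positive_int("num"),
        desired_danmu_count=positive_int("desired_danmu_count"),
    )
```

Inside a Django view, a helper like this could presumably be called with `request.POST`, which offers the same `.get()` interface as a dictionary.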
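Data processing and analysis:
- Text crawling:
  - Crawl Bilibili danmaku (bullet comments) and JD.com reviews. In the final stage of the project the JD.com reviews could no longer be fetched, so only the danmaku crawler was kept.
  - Fetch the danmaku for a given cid with the requests library, then extract the danmaku text with a regular expression via re.findall().
- Image crawling:
  - Crawl images from Dangdang.
  - Fetch the page with the requests library, then use re.findall() with a regular expression to collect all matching image links.
  - Download the images with multiple threads.
- Audio/video crawling:
  - Crawl relevant Bilibili videos.
  - Fetch the response with the requests library, then extract the BV ids from the JSON with re.findall() and a regular expression.
  - Use regular expressions and the json library to obtain the video and audio URLs.
  - Download the video and audio files with the requests library.
- Text analysis: we first considered the ERNIE-UIE (Wenxin) model, but could not get it configured, so it produced no results. We then considered iFLYTEK's sentiment analysis model, which only analyzes one sentence at a time and did not fit our needs. We finally turned to the Baidu Cloud API.
- Video and audio analysis:
  - Crawl relevant Bilibili videos to obtain the video and audio.
  - Convert the audio to text with Whisper.
  - Perform feature extraction and emotion recognition on the uploaded audio files.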
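The regular-expression step used in the crawlers above can be sketched as follows. The two patterns are assumptions about the response format (a danmaku entry shaped like `<d p="...">text</d>` and a JSON field shaped like `"bvid":"BV..."`), not patterns taken from the project code, and the function names are hypothetical. The network request is left out so the sketch stays self-contained.

```python
import re
from typing import List, Optional

# Assumed shape of one danmaku entry in the XML response: <d p="attributes">text</d>
DANMAKU_PATTERN = re.compile(r'<d p="[^"]*">(.*?)</d>')
# Assumed shape of a BV id inside the JSON response: "bvid":"BV..."
BVID_PATTERN = re.compile(r'"bvid"\s*:\s*"(BV[0-9A-Za-z]+)"')


def extract_danmaku(xml_text: str, limit: Optional[int] = None) -> List[str]:
    """Return the danmaku texts, optionally truncated to the first `limit` entries."""
    texts = DANMAKU_PATTERN.findall(xml_text)
    return texts if limit is None else texts[:limit]


def extract_bvids(json_text: str) -> List[str]:
    """Return the BV ids in order of first appearance, without duplicates."""
    return list(dict.fromkeys(BVID_PATTERN.findall(json_text)))
```

The `limit` argument corresponds to the number of danmaku the user asks for in the form. Deduplicating the BV ids avoids downloading the same video twice when it appears more than once in a response.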
Result output and display: the analysis results are shown in the front-end interface.

Result output and display:

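Individual contribution:

In this data collection course group project, I was mainly responsible for crawling images of domestic mobile phones and for polishing the front-end page design.

Selenium is used so that entering a domestic phone brand automatically crawls images of that brand.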

spider

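from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from urllib.parse import quote
import os
import requests


def spider(driver, wait):
    """Collect the product image URLs from the current JD.com search results page."""
    # Wait until the product list items are present in the page
    goods = wait.until(
        EC.presence_of_all_elements_located((By.XPATH, '//div[@id="J_goodsList"]/ul/li'))
    )
    pictures = []
    for good in goods:
        img = good.find_element(By.XPATH, './div/div[1]/a/img')
        # Image URL
        img_url = img.get_attribute('src')
        # Lazy-loaded images have no src yet; their URL is stored in data-lazy-img
        if img_url is None:
            lazy_url = img.get_attribute('data-lazy-img')
            if lazy_url is None:
                continue  # no usable URL on this item
            img_url = 'https:' + lazy_url
        pictures.append(img_url)
        print(img_url)
    print(f"Found {len(pictures)} images for this brand; saving them, please wait~")
    return pictures


if __name__ == '__main__':
    keyword = input("Enter the domestic brand to search for: ")
    # URL-encode the keyword so that non-ASCII brand names are passed correctly
    encoded = quote(keyword)
    url = ("https://search.jd.com/Search?keyword=" + encoded + "&enc=utf-8&wq=" + encoded
           + "&pvid=59ad49f2c4a1413e91dd6f2779ebe563")
    driver = webdriver.Edge()
    try:
        driver.get(url)
        # Maximize the window
        driver.maximize_window()
        wait = WebDriverWait(driver, 30)
        pictures = spider(driver, wait)
    finally:
        # Always close the browser, even if the crawl fails
        driver.quit()

    # Create the folder used to store the images
    path = './pic'
    os.makedirs(path, exist_ok=True)
    # Save the images
    for i, img_url in enumerate(pictures, start=1):
        response = requests.get(img_url, timeout=10)
        # os.path.join avoids hard-coded backslashes in the file path
        with open(os.path.join(path, f"{keyword}{i}.jpg"), "wb") as f:
            f.write(response.content)
    print("ok!")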
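The technical route mentions downloading images with multiple threads, while the script above saves them one at a time. A minimal sketch of the threaded variant is shown below, assuming a thread pool from the standard library. `download_all` and its `fetch` parameter are hypothetical names introduced here, and the project's own multi-threading code may look different.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Tuple


def download_all(urls: List[str], folder: str, prefix: str,
                 fetch: Callable[[str], bytes], workers: int = 8) -> List[Path]:
    """Fetch every URL in parallel and save it as <prefix><index>.jpg in `folder`."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)

    def save(job: Tuple[int, str]) -> Path:
        index, url = job
        path = target / f"{prefix}{index}.jpg"
        path.write_bytes(fetch(url))  # fetch() performs the actual download
        return path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the paths line up with `urls`
        return list(pool.map(save, enumerate(urls, start=1)))
```

With requests, the call could look like `download_all(pictures, "./pic", keyword, lambda u: requests.get(u, timeout=10).content)`. Passing the download function as a parameter keeps the threading logic separate from the network code, so it can be tried out without network access.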

Examples of crawled images (partial):

The front-end page was restyled to give users a better experience with the interface.

html

<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>index</title>

    <script>  
      function toggleContent() {  
        var result = document.getElementById("result");  
        if (result.style.display === "none") {  
          result.style.display = "block";  
        }
      }  
    </script>

    <style>
      body {
        font-family: Arial, sans-serif;
        background-repeat: no-repeat;
        background-size: cover;
        background-image: linear-gradient(
          25deg,
          #d6235d,
          #ea7a79,
          #f8bb97,
          #fff9b6
        );
        margin: 0;
        padding: 0;
        height: auto;
      }
  
      .container {
        border-radius: 3%;
        max-width: 800px;
        margin: 100px auto;
        padding: 20px;
        background-color: rgba(255, 255, 255, 0.75);
        box-shadow: 0 2px 5px rgba(0, 0, 0, 0.3);
      }
  
      form {
        display: flex;
        justify-content: center;
        align-items: center;
      }
      .input-container {
        display: flex;
        justify-content: left;
        align-items: center;
        margin-bottom: 20px;
      }
  
      .input-container input {
        padding: 10px;
        font-size: 16px;
        border: 1.5px solid #000000;
        border-radius: 4px;
        width: 200px;
        margin-left: 2%;
      }
  
      button {
        width: 180px;
        height: auto;
        padding: 10px 20px;
        font-size: 16px;
        border: none;
        background-color: rgb(250, 169, 169);
        color: #fff;
        cursor: pointer;
        border-radius: 4px;
        margin-left: 3%;
      }
  
      .input-container .btn:hover {
        background-color: rgb(233, 140, 140);
      }
  
      .result-table {
        display: flex;
        align-items: center;
        justify-content: center;
      }
      
      .output-container{
        display: block;
      }

      .output-container table {
        margin-top: 1%;
        margin-bottom: 3%;
        width: 90%;
        height: auto;
        padding: 10px;
        font-size: 16px;
        border: none;
        border-radius: 5px;
        resize: vertical;
        background: linear-gradient(
          to left,
          rgba(227, 185, 181, 0.6),
          rgba(240, 197, 197, 0.4),
          rgba(235, 177, 177, 0.2)
        );
      }
  
      @keyframes blurOut {
        0% {
          opacity: 1;
          filter: blur(0px);
          transform: scale(1);
        }
        30% {
          opacity: 0;
          filter: blur(20px);
          transform: scale(2);
        }
        60%,
        100% {
          opacity: 1;
          filter: blur(0px);
          transform: scale(1);
        }
      }
  
      .mask {
        height: 100px;
        color: rgb(250, 169, 169);
        margin-right: 10px;
        display: flex;
        align-items: center;
        justify-content: center;
        font-size: 32px;
        margin-bottom: 5%;
      }
  
      .mask span {
        animation: blurOut 5s ease-in-out infinite;
        animation-delay: calc(0.2s * var(--i));
      }
      
      td {
        text-align: center;
      }
    </style>
  </head>
  <body>
    <div class="container">
      <div class="mask">
        <span style="--i: 1">J</span>
        <span style="--i: 2">K</span>
        <span style="--i: 3">I</span>
        <span style="--i: 4">N</span>
        <span style="--i: 5">G</span>
        <span style="--i: 6">国</span>
        <span style="--i: 7">货</span>
        <span style="--i: 8">之</span>
        <span style="--i: 9">光</span>
      </div>

      <div class="input-container">
        <form method="post" action="./index.html" >
          <!-- {% csrf_token %} -->
          <input type="text" name="keyword" placeholder="Keyword" />
          <input type="text" name="num" placeholder="Number of videos to search" />
          <input type="text" name="desired_danmu_count" placeholder="Number of danmaku wanted"/>
          <button type="submit" class="btn" onclick="toggleContent()">Submit</button>
        </form>
      </div>

      <div class="output-container" id="result">
        <div class="result-table">
          {% if my_list %}
          <table>
            <thead>
              <tr>
                <th>Word</th>
                <th>Count</th>
              </tr>
            </thead>
            <tbody>
              {% for item in my_list %}
              <tr>
                <td>{{ item.0 }}</td>
                <td>{{ item.1 }}</td>
              </tr>
              {% endfor %}
            </tbody>
          </table>
          {% endif %}
        </div>

        <div class="result-table">
          {% if items %}
          <table>
            <thead>
              <tr>
                <th colspan="4">Danmaku Sentiment Analysis</th>
              </tr>
              <tr>
                <th>Confidence</th>
                <th>Negative Probability</th>
                <th>Positive Probability</th>
                <th>Sentiment</th>
              </tr>
            </thead>
            <tbody>
              {% for item in items %}
              <tr>
                <td>{{ item.confidence }}</td>
                <td>{{ item.negative_prob }}</td>
                <td>{{ item.positive_prob }}</td>
                <td>{{ item.sentiment }}</td>
              </tr>
              {% endfor %}
            </tbody>
          </table>
          {% endif %}
        </div>
        
        <div class="result-table">
          {% if items2 %}
          <table>
            <thead>
              <tr>
                <th colspan="4">Video Sentiment Analysis</th>
              </tr>
              <tr>
                <th>Confidence</th>
                <th>Negative Probability</th>
                <th>Positive Probability</th>
                <th>Sentiment</th>
              </tr>
            </thead>
            <tbody>
              {% for item in items2 %}
              <tr>
                <td>{{ item.confidence }}</td>
                <td>{{ item.negative_prob }}</td>
                <td>{{ item.positive_prob }}</td>
                <td>{{ item.sentiment }}</td>
              </tr>
              {% endfor %}
            </tbody>
          </table>
          {% endif %}
        </div>
      </div>
    </div>

  </body>
</html>
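The template renders one table row per analyzed item, with the fields `confidence`, `negative_prob`, `positive_prob`, and `sentiment`. One possible way to condense such rows into a single overall score is a confidence-weighted average, sketched below. The weighting scheme and the name `summarize_sentiment` are assumptions made for illustration. The source does not state how the project combines the individual results.

```python
from typing import Dict, Iterable


def summarize_sentiment(items: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Return the confidence-weighted mean of positive_prob and negative_prob."""
    total_weight = 0.0
    positive = 0.0
    negative = 0.0
    for item in items:
        weight = item["confidence"]
        total_weight += weight
        positive += weight * item["positive_prob"]
        negative += weight * item["negative_prob"]
    if total_weight == 0:
        raise ValueError("no items with non-zero confidence")
    return {
        "positive_prob": positive / total_weight,
        "negative_prob": negative / total_weight,
    }
```

Weighting by confidence lets rows the model is unsure about contribute less to the overall picture than a plain average would.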

Page (initial version):

Page (final version):

Reflections

102102116 吕艺能

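This group project let me revisit data collection and mining, data processing, and data visualization. Data collection stood out in particular: I used Selenium to crawl photos of domestic phones from JD.com and came to understand Selenium better. JD.com's anti-crawling measures also left a deep impression. For example, the first 10 images on a results page share one tag structure, while the tag structure of the later images is different, so the HTML has to be inspected more carefully and this case handled explicitly in order to crawl the images beyond the first 10. JD.com was also genuinely hard to crawl. A login page would pop up, so I first added login handling to the crawler (locate the account and password inputs, fill them in, then click the login button), but then a slider CAPTCHA appeared. That was a real headache, and with my limited skills I could not solve it well, so in the end other group members switched to crawling images from Dangdang.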
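The difference between the first 10 images and the later ones can be handled with a small fallback, sketched here under the assumption (taken from the spider code) that later images carry their URL in a `data-lazy-img` attribute instead of `src`, and that this URL is protocol-relative. `resolve_image_url` is a hypothetical helper name.

```python
from typing import Optional


def resolve_image_url(src: Optional[str], lazy_src: Optional[str]) -> Optional[str]:
    """Pick the usable image URL: prefer src, fall back to the lazy-load attribute."""
    url = src or lazy_src
    if not url:
        return None  # neither attribute is set on this element
    if url.startswith("//"):
        # Protocol-relative URL: add the scheme so it can be requested directly
        return "https:" + url
    return url
```

In the spider, the two arguments would be the values returned by `get_attribute('src')` and `get_attribute('data-lazy-img')` for the same `img` element.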
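The project also deepened my understanding of HTML structure in front-end work, especially tuning CSS styles. Building animations and refining styles exercised my front-end skills and made the page look more designed and attractive. Front-end debugging is tedious and stressful, but seeing the page put on a "new outfit" that suits it and connect successfully to the back end was very rewarding!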