Spam Classification in Python with Naive Bayes - 526互联

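This was my final project for a Python elective. The code is very simple, but the instructor wanted the course report dressed up as elaborately as possible. Me: ?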

Background theory:

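Bayes' theorem gives the probability of an event given that some related event is known to have occurred:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$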

The conditional independence assumption: given the class, all features are assumed to be mutually independent.

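The naive Bayes formula is built on Bayes' theorem and the conditional independence assumption:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\,\prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$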

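Here $y$ is the class variable (in this example, whether a given email is spam or not), and $x_1$ through $x_n$ are the feature variables, i.e. the spam features we extract.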

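Let the set of classes be $C = \{y_1, y_2, \dots, y_m\}$ and the set of features be $I = \{x_1, x_2, \dots, x_n\}$, where $y_1, y_2, \dots, y_m$ are the distinct classes and each of $x_1, x_2, \dots, x_n$ is a feature attribute in $I$. By Bayes' theorem, $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$, we have

$$P(y_1 \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n \mid y_1)\,P(y_1)}{P(x_1, \dots, x_n)}$$

Since the features are assumed to be conditionally independent,

$$P(x_1, \dots, x_n \mid y_1) = \prod_{i=1}^{n} P(x_i \mid y_1)$$

Let $P(y_1 \mid M)$ be the probability that the email $M$ to be classified is spam, and $P(y_2 \mid M)$ the probability that it is not spam. Because $P(x_1, \dots, x_n)$ is the same constant for every class, it is enough to compare the numerators. If

$$P(y_1)\prod_{i=1}^{n} P(x_i \mid y_1) < P(y_2)\prod_{i=1}^{n} P(x_i \mid y_2)$$

then email $M$ is more likely to be non-spam.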
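To make the comparison above concrete, here is a minimal from-scratch sketch in plain Python. The four training messages are made up for illustration and are not taken from the SMS dataset. Two things in it go beyond the derivation: add-one (Laplace) smoothing, so that a word unseen in one class does not force the product to zero, and summing logarithms instead of multiplying probabilities, which avoids underflow on long messages.

```python
import math
from collections import Counter

# Toy training data (hypothetical, for illustration only).
train = [
    ("spam", "win free prize now"),
    ("spam", "free cash win"),
    ("ham", "see you at lunch"),
    ("ham", "lunch at noon now"),
]

# Count messages per class and word occurrences per class.
class_counts = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in class_counts}
for label, text in train:
    word_counts[label].update(text.split())

# Vocabulary: every word seen in any training message.
vocab = {word for counts in word_counts.values() for word in counts}


def log_score(label, text, alpha=1.0):
    """Return log P(y) + sum_i log P(x_i | y), using add-alpha smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        p = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(p)
    return score


def classify(text):
    """Pick the class with the larger unnormalised log posterior."""
    return max(class_counts, key=lambda label: log_score(label, text))


print(classify("free prize"))
print(classify("lunch at noon"))
```

The denominator $P(x_1, \dots, x_n)$ never appears in the code, because it is the same for both classes and does not affect which score is larger.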

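The "emails" in the code are just runs of consecutive words picked at random from the dataset, with a fixed word count. At the time I didn't know how to generate emails, so I simply went with this.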

Dataset URL:

https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
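The loading code in this post reads the file as tab-separated text with two columns (label, message) and no header row. Assuming that layout, the sketch below parses two made-up lines and counts the labels, which is a quick way to check the class balance before training. The sample lines are hypothetical, not copied from the dataset.

```python
from collections import Counter

# Two made-up lines in the assumed "label<TAB>message" layout.
sample = "ham\tSee you at lunch\nspam\tWIN a FREE prize now\n"

# Split each line once on the first tab: the left part is the label, the rest is the message.
rows = [line.split("\t", 1) for line in sample.splitlines()]

# Count how many messages carry each label.
label_counts = Counter(label for label, _ in rows)
print(label_counts)
```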

Interface preview:

The interface is built with Tkinter.

Here is the code:

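import pandas as pd

# Dataset: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
# Raw string so the backslashes in the Windows path are not treated as escape sequences.
df = pd.read_table(r'D:\Desktop\python_filter_spam\sms+spam+collection\SMSSpamCollection', sep='\t', names=['label', 'sms_message'])
# Encode the labels as integers: ham (legitimate) -> 0, spam -> 1
df['label'] = df.label.map({'ham': 0, 'spam': 1})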



# From here on this is no longer the example code.
# Split the data into training and test sets.
# train_test_split lives in sklearn.model_selection; the older
# sklearn.cross_validation module was deprecated and later removed.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)



from sklearn.feature_extraction.text import CountVectorizer

# First, fit CountVectorizer on the training data (X_train) and return the count matrix.
# Then, transform the test data (X_test) into a count matrix with the same vocabulary.
count_vector = CountVectorizer()
# Learn the vocabulary from the training data and return its count matrix
training_data = count_vector.fit_transform(X_train)
# Transform the test data only; the vectorizer is not refit on it
testing_data = count_vector.transform(X_test)



# Naive Bayes with scikit-learn
from sklearn.naive_bayes import MultinomialNB

# Train on the training data, then predict on the test data
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)

# Evaluate the model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print('F1 score: ', f1_score(y_test, predictions))


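import random

# Flatten every message into one word list once, rather than on every call.
all_words = [word for message in df['sms_message'] for word in message.split()]

def generate_email_from_dataset(num_words=15):
    # Pick a random run of num_words consecutive words from the dataset.
    # The start index is bounded by the total word count so any run can be chosen.
    random_start = random.randint(0, len(all_words) - num_words)
    return ' '.join(all_words[random_start:random_start + num_words])

def generate_and_display_email():
    global spam_count
    email_content = generate_email_from_dataset()
    # The model predicts 1 for spam and 0 for a legitimate message
    is_spam = naive_bayes.predict(count_vector.transform([email_content]))[0] == 1
    verdict = "spam" if is_spam else "not spam"

    email_display.config(text=f"Email content: {email_content}\nVerdict: {verdict}")
    if is_spam:
        spam_count += 1
        update_spam_count_label()

# Refresh the label that shows how many spam emails have been seen
def update_spam_count_label():
    spam_count_label.config(text=f"Spam count: {spam_count}")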




# Build the interface
import tkinter as tk
import ttkbootstrap as ttk
from ttkbootstrap.constants import *

root = tk.Tk()
root.title("Random Email Classifier")
spam_count = 0

email_display = ttk.Label(root, text="", wraplength=600)
email_display.pack()
b1 = ttk.Button(root, text="Generate and classify", bootstyle=SUCCESS, command=generate_and_display_email)
b1.pack(padx=5, pady=10)

# Label that shows the number of spam emails
spam_count_label = tk.Label(root, text=f"Spam count: {spam_count}")
spam_count_label.pack(padx=20, pady=10)

# Generate and classify a new email automatically every 10 seconds
def repeat_generation():
    generate_and_display_email()
    root.after(10000, repeat_generation)

root.after(10000, repeat_generation)  # start the timer

root.mainloop()
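For readers who want to see what the four printed scores measure, here is a small sketch that computes them by hand from true and predicted labels, treating spam (1) as the positive class. The label lists are made up for illustration, and `binary_metrics` is a hypothetical helper, not part of scikit-learn. Returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical labels: 1 = spam, 0 = ham.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Because ham messages greatly outnumber spam in most SMS corpora, accuracy alone can look good even when much spam is missed, which is why precision and recall are worth reading alongside it.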

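Thoughts: the instructor taught us nothing and then handed out this report requirement, so it really did take me two days to get this working. Overall it is still academic junk. I'm writing it up to backfill my blog post for 2024.1.7, and to create some embarrassing history for my future self to look back on. As for Python, my impression is that version compatibility problems really stand out: most of the bugs I ran into came from version mismatches. But it is also genuinely convenient and easy to use. So far I have only studied the machine learning side of it.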