Support Vector Machine (SVM)

Published: 2023-06-11 00:45:58 Author: 找回那所有、

Model highlights

  1. Initial test-set score: 0.56; score after parameter tuning: 0.85
  2. Appropriate data cleaning

-------------------------------------------Implementation details follow-------------------------------------------

Step1. Load the data

import pandas as pd
df=pd.read_csv('bankpep.csv',index_col='id')
df.head()

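Step2. Clean the data

Why clean the data, and how?

  1. A classifier needs numeric input, so categorical and ordinal columns must be converted to numbers (strictly speaking, this dataset has no truly ordinal columns)
  2. Two-valued categorical columns (FEMALE/MALE, YES/NO) are encoded as 0/1
  3. Multi-valued columns (here region and children, which the author treats as ordinal) are one-hot encoded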
# 1) Sex: FEMALE -> 1, MALE -> 0
df.loc[df['sex']=='FEMALE','sex']=1
df.loc[df['sex']=='MALE','sex']=0
col=['married','car','save_act','current_act','mortgage','pep']
# 2) Married, car, savings account, current account, mortgage, PEP accepted: YES -> 1, NO -> 0
for ele in col:
    df.loc[df[ele]=='YES',ele]=1
    df.loc[df[ele]=='NO',ele]=0
# 3) Region, children: convert to one-hot columns
df_region=pd.get_dummies(df['region'],prefix='REGION') # get_dummies adds the '_' separator itself
df_children=pd.get_dummies(df['children'],prefix='CHILDREN')
df.drop(['region','children'],axis=1,inplace=True)
df=df.join([df_region,df_children],how='outer')
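
As a minimal sketch of the two encodings used above, the following applies them to a few hypothetical rows (the toy frame and its values are made up for illustration and are not taken from bankpep.csv). It uses Series.map rather than df.loc assignment, which is a swapped-in alternative and not the author's method; map returns integer columns directly, so no later type cast is needed for them.

```python
import pandas as pd

# Hypothetical rows that mimic a few columns of the dataset
toy = pd.DataFrame({
    "sex": ["FEMALE", "MALE", "FEMALE"],
    "married": ["YES", "NO", "YES"],
    "region": ["TOWN", "RURAL", "TOWN"],
})

# Two-valued columns: map each label to 0/1
toy["sex"] = toy["sex"].map({"FEMALE": 1, "MALE": 0})
toy["married"] = toy["married"].map({"YES": 1, "NO": 0})

# Multi-valued column: one indicator column per category
toy = pd.get_dummies(toy, columns=["region"], prefix="REGION", dtype=int)

print(toy)
```

Each region value becomes its own 0/1 column, so the classifier does not read an artificial ordering into the region labels.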

Step3. Split into training and test sets

from sklearn.model_selection import train_test_split
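x=df.drop(['pep'],axis=1).astype('float') # Note: cast to a numeric type (int or float), otherwise the SVM model cannot be fitted
y=df['pep'].astype('float')
def split(x):
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1) # Note the order of the returned values; y is the global label column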
    return x_train,x_test,y_train,y_test
x_train,x_test,y_train,y_test=split(x)
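
To make the return order concrete, here is a small sketch on hypothetical data (X_demo and y_demo are invented arrays, not the bank data). It also passes stratify, which the code above does not use; it is shown only as an optional extra that keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, 30 positive labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# Return order: features (train, test) first, then labels (train, test)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (70, 2) (30, 2)
print(int(y_tr.sum()), int(y_te.sum()))  # positives in each part
```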

Step4. Train the SVM

from sklearn.svm import SVC
model=SVC(kernel='rbf',gamma=0.7,C=1)
model.fit(x_train,y_train)

Step5. Evaluate the model

What does model.score measure?

  • Classification: model.score returns accuracy (same as accuracy_score)
  • Regression: model.score returns R² (same as r2_score)
print("Training-set score:",round(model.score(x_train,y_train),2))
def test_score(model,x_test,y_test):
    print("Test-set score:",round(model.score(x_test,y_test),2))
test_score(model,x_test,y_test)
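
The claim that a classifier's score is its accuracy can be checked with a short sketch. The data here is synthetic (make_classification), so the numbers say nothing about the bank dataset; only the equality of the two values matters.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Synthetic two-class data, used only for this check
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel="rbf").fit(X_demo, y_demo)

# score() on a classifier vs. accuracy_score on its predictions
score_via_method = clf.score(X_demo, y_demo)
score_via_metric = accuracy_score(y_demo, clf.predict(X_demo))

print(score_via_method == score_via_metric)
```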

Step6. Tune the parameters

How to tune the parameters?

  1. Standardize the features
  2. Grid search or randomized search (pick one)
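# 1. Standardization
from sklearn.preprocessing import scale
x_scale=scale(x) # zero mean, unit variance per feature; note the statistics come from all rows, test rows included
x_train,x_test,y_train,y_test=split(x_scale) # custom train/test split function from Step3
model.fit(x_train,y_train)
print("-----Standardized-----")
test_score(model,x_test,y_test) # custom test-set scoring function from Step5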
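# 2.1 Grid search
from sklearn.model_selection import GridSearchCV
params={'gamma':[0.1,0.3,0.5,0.7,0.9],'C':[0.1,0.5,1]}
model=GridSearchCV(model,params,cv=10)
model.fit(x_train,y_train)
print("-----Standardized, grid search-----")
print("Best parameters:",model.best_params_)
print("Best cross-validation score:",round(model.best_score_,2)) # best_score_ is the mean CV score on the training set, not a test-set score
print("Test-set score:",round(model.score(x_test,y_test),2))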
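# 2.2 Randomized search
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
gamma_arr=np.linspace(0.01,1,100) # 0.01, 0.02, ..., 1.00; built-in range() only takes integer steps, and gamma=0 is left out
gamma_lis=list(gamma_arr)
C_lis=[pow(10,n) for n in range(-3,2)]
params={'kernel':['rbf','poly','sigmoid','linear'],'gamma':gamma_lis,'C':C_lis}
# Search over a fresh SVC: if 2.1 was run, model is a GridSearchCV object, which has no kernel/gamma/C parameters
model=RandomizedSearchCV(SVC(),params,n_iter=2000,cv=10) # the space has 4*100*5=2000 combinations, so n_iter=2000 covers all of them; lower n_iter to sample
model.fit(x_train,y_train)
print("-----Standardized, randomized search-----")
print("Best parameters:",model.best_params_)
print("Best cross-validation score:",round(model.best_score_,2)) # mean CV score on the training set
print("Test-set score:",round(model.score(x_test,y_test),2))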
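
One possible refinement, sketched here on synthetic data and not part of the original workflow: putting the scaler and the SVC in a single Pipeline lets the search fit the scaler on the training folds only, and keeps the cross-validation score separate from the held-out test score. The names pipe, param_grid and search are illustrative, and StandardScaler is used in place of the scale function.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the bank data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# make_pipeline names each step after its lowercased class name ("svc")
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__gamma": [0.1, 0.5, 0.9], "svc__C": [0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

cv_score = search.best_score_                # mean CV accuracy on training folds
test_score_value = search.score(X_te, y_te)  # accuracy on the held-out test set

print("Best parameters:", search.best_params_)
print("CV score:", round(cv_score, 2), "Test score:", round(test_score_value, 2))
```

The two printed scores answer different questions: the first guides the choice of parameters, the second estimates how the chosen model does on unseen rows.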

Step7. Save the model for later reuse

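import joblib # sklearn.externals.joblib was removed from scikit-learn; use the standalone joblib package
joblib.dump(model,r'd:\svm.pkl') # raw string so the backslash is kept literally
new_model=joblib.load(r'd:\svm.pkl')
print("Predictions on the test set:\n",new_model.predict(x_test))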
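
A quick way to confirm that a saved model behaves like the original is a round trip through a temporary file. This is a sketch on synthetic data; the file name svm_demo.pkl and the temporary directory are illustrative choices, not the path used above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data and a small fitted model to persist
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
clf = SVC(kernel="rbf").fit(X_demo, y_demo)

# Save to a temporary location and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svm_demo.pkl")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should give identical predictions
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```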

-END