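Model highlights
- The initial test-set score is 0.56; after tuning, the reported score is 0.85
- The data is cleaned appropriately
-------------------------------------------Model implementation follows-------------------------------------------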
Step 1. Load the data
import pandas as pd

df = pd.read_csv('bankpep.csv', index_col='id')
df.head()
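Step 2. Clean the data
Why clean the data, and how?
- A classifier needs numeric input, so categorical data and ordinal data (strictly speaking, nothing here is truly ordinal) must both be converted to numbers
- Binary categorical data: encode as 0/1
- Ordinal data (here: region and children): encode as a one-hot matrix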
# 1) Sex: FEMALE -> 1, MALE -> 0
df['sex'] = df['sex'].map({'FEMALE': 1, 'MALE': 0})

# 2) married, car, save_act, current_act, mortgage, pep (whether the offer was accepted): YES -> 1, NO -> 0
col = ['married', 'car', 'save_act', 'current_act', 'mortgage', 'pep']
for ele in col:
    df[ele] = df[ele].map({'YES': 1, 'NO': 0})

# 3) Region and children: convert to one-hot columns
# (get_dummies already inserts '_' after the prefix, so the prefix has no trailing underscore)
df_region = pd.get_dummies(df['region'], prefix='REGION')
df_children = pd.get_dummies(df['children'], prefix='CHILDREN')
df.drop(['region', 'children'], axis=1, inplace=True)
df = df.join([df_region, df_children], how='outer')
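To see what the two encodings produce, here is a minimal sketch on a made-up three-row frame. The rows and the name `toy` are illustrative assumptions, not values taken from bankpep.csv.

```python
import pandas as pd

# Toy data: invented rows, only to show the shape of the result
toy = pd.DataFrame({
    'sex': ['FEMALE', 'MALE', 'FEMALE'],
    'region': ['TOWN', 'RURAL', 'TOWN'],
})

# Binary column: map each label to 0/1
toy['sex'] = toy['sex'].map({'FEMALE': 1, 'MALE': 0})

# Multi-valued column: one indicator column per distinct value
region_dummies = pd.get_dummies(toy['region'], prefix='REGION')
toy = toy.drop(columns='region').join(region_dummies)

print(list(toy.columns))
```

Each distinct region value becomes its own indicator column, so no artificial ordering between regions is introduced.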
Step 3. Split into training and test sets
from sklearn.model_selection import train_test_split

# Note: the data must be cast to int or float, otherwise the SVM model cannot be fitted
x = df.drop(['pep'], axis=1).astype('float')
y = df['pep'].astype('float')

def split(x):
    # Note the order of the returned values; y is read from the enclosing scope
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = split(x)
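The order of the values returned by train_test_split is easy to get wrong, so a small sketch on synthetic arrays may help. The names `features` and `labels` are placeholders for illustration and stand in for x and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features each
labels = np.array([0, 1] * 5)

# Returned order: train and test parts of the first array,
# then train and test parts of the second array
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=1)

print(f_train.shape, f_test.shape, l_train.shape, l_test.shape)
```

With test_size=0.3, three of the ten samples go to the test part and seven stay in the training part.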
Step 4. Train the support vector machine
from sklearn.svm import SVC

model = SVC(kernel='rbf', gamma=0.7, C=1)
model.fit(x_train, y_train)
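Step 5. Evaluate the model
What does model evaluation measure?
- Classification: model.score returns accuracy (the same value as accuracy_score)
- Regression: model.score returns R² (the same value as r2_score)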
print("Training-set score:", round(model.score(x_train, y_train), 2))

def test_score(model, x_test, y_test):
    print("Test-set score:", round(model.score(x_test, y_test), 2))

test_score(model, x_test, y_test)
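The link between model.score and accuracy_score for a classifier can be checked directly. The sketch below uses synthetic data from make_classification as a stand-in for the bank dataset, and the names `X_demo`, `y_demo`, and `clf` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# Synthetic stand-in data, not the bank dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel='rbf', gamma=0.7, C=1).fit(X_demo, y_demo)

# For a classifier, score() computes accuracy on the given data
score_via_model = clf.score(X_demo, y_demo)
score_via_metric = accuracy_score(y_demo, clf.predict(X_demo))

print(score_via_model == score_via_metric)
```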
Step 6. Tune the parameters
How to tune the parameters?
- Standardize the features
- Run a grid search or a randomized search (pick one)
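# 1. Standardization
from sklearn.preprocessing import scale

x_scale = scale(x)  # zero mean and unit variance per column
x_train, x_test, y_train, y_test = split(x_scale)  # custom split function from Step 3
model.fit(x_train, y_train)
print("-----Standardized-----")
test_score(model, x_test, y_test)  # custom test-set scoring function from Step 5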
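Calling scale(x) before the split computes the mean and variance from all rows, including those that later land in the test set. One possible alternative, sketched here on synthetic data, is to put a StandardScaler and the SVC into a pipeline so that the scaling statistics come from the training rows only. This is a swapped-in technique rather than the approach used above, and `X_demo`, `y_demo`, and `scaled_svm` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, not the bank dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

# fit() learns the scaling statistics from X_tr only;
# score() reuses them to transform X_te before predicting
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma=0.7, C=1))
scaled_svm.fit(X_tr, y_tr)

print("test accuracy:", round(scaled_svm.score(X_te, y_te), 2))
```

A pipeline like this can also be passed to a search object, in which case the scaler is refitted inside every cross-validation fold.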
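# 2.1 Grid search
from sklearn.model_selection import GridSearchCV

params = {'gamma': [0.1, 0.3, 0.5, 0.7, 0.9], 'C': [0.1, 0.5, 1]}
# Wrap a fresh SVC so that the search never wraps an earlier search object
model = GridSearchCV(SVC(kernel='rbf'), params, cv=10)
model.fit(x_train, y_train)
print("-----Standardized, grid search-----")
print("Best parameter combination:", model.best_params_)
# best_score_ is the mean cross-validated score on the training data, not a test-set score
print("Best cross-validation score:", round(model.best_score_, 2))
test_score(model, x_test, y_test)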
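# 2.2 Randomized search
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Note: the built-in range only accepts integer steps, hence NumPy; gamma must be greater than 0
gamma_lis = list(np.linspace(0.01, 1, 100))
C_lis = [pow(10, n) for n in range(-3, 2)]
params = {'kernel': ['rbf', 'poly', 'sigmoid', 'linear'], 'gamma': gamma_lis, 'C': C_lis}
# kernel, gamma and C are SVC parameters, so the search must wrap an SVC, not another search object
# n_iter=2000 covers all 4 x 100 x 5 combinations; lower it for a faster, truly random search
model = RandomizedSearchCV(SVC(), params, n_iter=2000, cv=10)
model.fit(x_train, y_train)
print("-----Standardized, randomized search-----")
print("Best parameter combination:", model.best_params_)
# best_score_ is the mean cross-validated score on the training data, not a test-set score
print("Best cross-validation score:", round(model.best_score_, 2))
test_score(model, x_test, y_test)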
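A randomized search can also draw from continuous distributions instead of fixed lists, which tends to suit parameters such as gamma and C that span several orders of magnitude. The sketch below assumes scipy.stats.loguniform is available and uses synthetic data; the ranges, n_iter=20, and cv=5 are arbitrary choices for illustration, not tuned values.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data, not the bank dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Each candidate draws gamma and C from a log-uniform distribution
param_distributions = {
    'gamma': loguniform(1e-3, 1e0),
    'C': loguniform(1e-3, 1e1),
}
search = RandomizedSearchCV(SVC(kernel='rbf'), param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("cross-validation score:", round(search.best_score_, 2))
```

Here n_iter sets the number of sampled candidates directly, so the cost of the search does not grow with the resolution of the parameter ranges.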
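Step 7. Save the model, and the work is done
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone joblib package

joblib.dump(model, r'd:\svm.pkl')  # raw string keeps the backslash literal
new_model = joblib.load(r'd:\svm.pkl')
print("Predictions on the test set:\n", new_model.predict(x_test))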
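A quick way to confirm that a saved model survives the round trip is to compare predictions before and after reloading. This is a minimal sketch on synthetic data that writes to a temporary directory instead of a fixed drive path; the file name and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data, not the bank dataset
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
clf = SVC(kernel='rbf', gamma=0.7, C=1).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'svm.pkl')  # hypothetical file name
    joblib.dump(clf, model_path)
    restored = joblib.load(model_path)

# The restored model should predict exactly what the original predicts
same_predictions = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same_predictions)
```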
-END