In machine learning modeling: "feature selection" first, or "split the dataset" first?

Published: 2023-03-28 23:06:40 Author: 温小皮

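You should split the dataset first and then perform feature selection, fitting the selector on the training set only. This avoids data leakage: if features are selected using all of the data, the labels of the test samples influence which features are kept, and the test score becomes overly optimistic.
The test set should be treated as "unseen data" that is used only once, at the very end. Handle every preprocessing step according to this principle.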

Code example:

# -*- coding: utf-8 -*-


import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


#===================Wrong approach: feature selection first, then split the dataset============================

# #---The wrong approach gives seemingly good results----
# # random data:
# X = np.random.randn(500, 10000)
# y = np.random.choice(2, size=500)

# selector = SelectKBest(k=25)
# # first select features
# X_selected = selector.fit_transform(X,y)
# # then split
# X_selected_train, X_selected_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.25, random_state=42)

# # fit a simple logistic regression
# lr = LogisticRegression()
# lr.fit(X_selected_train,y_train)

# # predict on the test set and get the test accuracy:
# y_pred = lr.predict(X_selected_test)
# acc = accuracy_score(y_test, y_pred)
# print(acc)
# # Results over several runs: 0.712, 0.688, 0.792, 0.776, 0.648

# #---Check the wrong approach on fresh data: generalization is poor----
# X_new = np.random.randn(500, 10000)
# y_new = np.random.choice(2, size=500)
# # select the same features in the new data
# X_new_selected = selector.transform(X_new)
# # predict and get the accuracy:
# y_new_pred = lr.predict(X_new_selected)
# acc_new = accuracy_score(y_new, y_new_pred)
# print(acc_new)
# # Results over several runs: 0.498, 0.504, 0.492, 0.538


#=============Correct approach: split the dataset first, then select features===============================
# random data:
X = np.random.randn(500, 10000)
y = np.random.choice(2, size=500)

# split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# then select features using the training set only
selector = SelectKBest(k=25)
X_train_selected = selector.fit_transform(X_train,y_train)

# fit again a simple logistic regression
lr = LogisticRegression()
lr.fit(X_train_selected,y_train)
# select the same features on the test set, predict, and get the test accuracy:
X_test_selected = selector.transform(X_test)
y_pred = lr.predict(X_test_selected)
acc = accuracy_score(y_test, y_pred)
print(acc)
# Results over several runs: 0.48, 0.472, 0.52 (chance level, as expected for random labels)
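
The same principle applies when cross-validation is used instead of a single split. The following is a minimal sketch, not part of the original post, that wraps the selector and the classifier in a scikit-learn `Pipeline` so the selector is refit on the training folds only. The smaller data size, the fixed seed, and the names `pipe`, `leaky_scores`, and `clean_scores` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: the labels have no real relationship to the features.
# (Sizes and seed are illustrative assumptions, smaller than in the post.)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2000))
y = rng.integers(0, 2, size=200)

# Leaky: features are selected using ALL samples before cross-validation,
# so every validation fold has already influenced the selection.
X_leaky = SelectKBest(k=25).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(), X_leaky, y, cv=5)

# Leak-free: the pipeline refits SelectKBest inside each training fold,
# so the validation fold stays unseen until scoring.
pipe = make_pipeline(SelectKBest(k=25), LogisticRegression())
clean_scores = cross_val_score(pipe, X, y, cv=5)

print("leaky mean accuracy:", leaky_scores.mean())
print("clean mean accuracy:", clean_scores.mean())
```

On random labels the leak-free estimate should hover around chance, while the leaky one is typically well above it. The gap between the two is the optimism introduced by selecting features before splitting.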

Reference: On machine learning: should feature selection be performed before or after the train-test split?