Machine Learning (III)

Posted 2023-09-04 09:14:52 · Author: Arcticus

Let's work through a few introductory problems. Most of what follows assumes the Kaggle environment.

Q1 Titanic

https://www.kaggle.com/competitions/titanic

import
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
1. Load the data

CSV tables are usually read with pandas.

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

This is an important step: it confirms the data was loaded correctly.
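A couple of quick checks (a minimal sketch) make that concrete; the expected Titanic shapes are well known:

print(train_data.shape)  # expect (891, 12)
print(test_data.shape)   # expect (418, 11) -- the test set has no "Survived" column
train_data.info()        # dtypes and non-null counts per column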

2. Explore the data

In the given dataset, it is easy to see that women survived at a far higher rate than men (other attributes matter as well); a comparison with the male rate follows the snippet below.

women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)
3. Random forest

Here we use scikit-learn's random forest to classify passengers from four input features (a quick cross-validation sketch follows the submission code).

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Q2 Spaceship Titanic

https://www.kaggle.com/competitions/spaceship-titanic

If all you need is a simple random forest model, scikit-learn is enough; for more complex tasks, TensorFlow Decision Forests is recommended.
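The Kaggle image usually ships with it preinstalled; elsewhere it can be installed from PyPI under the package name tensorflow_decision_forests, e.g. in a notebook cell:

!pip install tensorflow_decision_forests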

import
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
1. Load the data
dataset_df = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
print("Full train dataset shape is {}".format(dataset_df.shape))
dataset_df.head(5)
2. Explore the data

First take a broad look at the whole dataset:

dataset_df.describe()
dataset_df.info()

Draw a bar chart of the output label (a bar chart has a discrete x-axis):

plot_df = dataset_df.Transported.value_counts()
plot_df.plot(kind="bar")

Also draw histograms of the numeric input features (a histogram has a continuous x-axis):

fig, ax = plt.subplots(5, 1, figsize=(10, 10))
plt.subplots_adjust(top=2)

sns.histplot(dataset_df['Age'], color='b', bins=50, ax=ax[0]);
sns.histplot(dataset_df['FoodCourt'], color='b', bins=50, ax=ax[1]);
sns.histplot(dataset_df['ShoppingMall'], color='b', bins=50, ax=ax[2]);
sns.histplot(dataset_df['Spa'], color='b', bins=50, ax=ax[3]);
sns.histplot(dataset_df['VRDeck'], color='b', bins=50, ax=ax[4]);
3. Preprocess the dataset

The raw dataset is fairly messy: it mixes numbers, letters, and symbols, and contains many missing values.

Some columns have no bearing on the outcome, so drop them first:

dataset_df = dataset_df.drop(['PassengerId', 'Name'], axis=1)
dataset_df.head(5)

The following line checks the number of missing values in each column:

dataset_df.isnull().sum().sort_values(ascending=False)

TensorFlow Decision Forests can cope with missing values, but it cannot consume boolean (True/False) columns. Fill the missing boolean values with False (0 here), then cast everything to int:

cols = ['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
dataset_df[cols] = dataset_df[cols].fillna(value=0)
dataset_df.isnull().sum().sort_values(ascending=False)
label = "Transported"
dataset_df[label] = dataset_df[label].astype(int)
dataset_df['VIP'] = dataset_df['VIP'].astype(int)
dataset_df['CryoSleep'] = dataset_df['CryoSleep'].astype(int)

Cabin is a composite attribute in Deck/Cabin_num/Side form, so it is best split apart:

dataset_df[["Deck", "Cabin_num", "Side"]] = dataset_df["Cabin"].str.split("/", expand=True)
dataset_df = dataset_df.drop('Cabin', axis=1)

Take a look at the processed dataset:

dataset_df.head(5)

Split it into training and validation sets:

def split_dataset(dataset, test_ratio=0.20):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, valid_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in testing.".format(
    len(train_ds_pd), len(valid_ds_pd)))
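Note that split_dataset draws from np.random.rand without a fixed seed, so the split changes on every run. If reproducibility matters, one alternative (a sketch, not what the original uses) is scikit-learn's train_test_split with a fixed random_state:

from sklearn.model_selection import train_test_split

# Deterministic 80/20 split
train_ds_pd, valid_ds_pd = train_test_split(dataset_df, test_size=0.20, random_state=42)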

Convert from pandas DataFrames to TensorFlow datasets:

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label)
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label)
4. Train the model
rf = tfdf.keras.RandomForestModel()
rf.compile(metrics=["accuracy"])  # so evaluate() below reports accuracy, not just loss
rf.fit(x=train_ds)
5. Evaluate

Visualize one tree of the random forest:

tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)

Out-of-bag (OOB) evaluation:

logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.accuracy for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("Accuracy (out-of-bag)")
plt.show()

inspector = rf.make_inspector()
inspector.evaluation()

Evaluate on the validation set:

evaluation = rf.evaluate(x=valid_ds, return_dict=True)

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")

Variable importances (here, how often a feature is used as a tree root):

inspector.variable_importances()["NUM_AS_ROOT"]
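NUM_AS_ROOT is only one of several importance measures TF-DF computes; which ones a given model exposes can be listed directly (a quick check, output depends on the model):

# List every available variable-importance measure
print(inspector.variable_importances().keys())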
6. Predict on the test set

Don't forget that the test set needs the same preprocessing as the training set:

# Load the test dataset
test_df = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')
submission_id = test_df.PassengerId

# Replace NaN values with zero
test_df[['VIP', 'CryoSleep']] = test_df[['VIP', 'CryoSleep']].fillna(value=0)

# Creating New Features - Deck, Cabin_num and Side from the column Cabin and remove Cabin
test_df[["Deck", "Cabin_num", "Side"]] = test_df["Cabin"].str.split("/", expand=True)
test_df = test_df.drop('Cabin', axis=1)

# Convert boolean to 1's and 0's
test_df['VIP'] = test_df['VIP'].astype(int)
test_df['CryoSleep'] = test_df['CryoSleep'].astype(int)

# Convert pd dataframe to tf dataset
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df)

# Get the predictions for testdata
predictions = rf.predict(test_ds)
n_predictions = (predictions > 0.5).astype(bool)
output = pd.DataFrame({'PassengerId': submission_id,
                       'Transported': n_predictions.squeeze()})

output.head()

Generate the CSV:

sample_submission_df = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')
sample_submission_df['Transported'] = n_predictions.squeeze()  # flatten the (n, 1) array before assignment
sample_submission_df.to_csv('/kaggle/working/submission.csv', index=False)
sample_submission_df.head()

By the way, the above is the standard solution; what I actually submitted is the one below:

Neural network solution
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from tensorflow import keras
from tensorflow.keras import layers

# Loading data
train_data = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
train_data = train_data.drop(['PassengerId', 'Name'], axis=1)
train_data[["Deck", "Cabin_num", "Side"]] = train_data["Cabin"].str.split("/", expand=True)
train_data = train_data.drop('Cabin', axis=1)
train_data['Transported'] = train_data['Transported'].fillna(value=0).astype(int)

test_data = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
test_data = test_data.drop(['PassengerId', 'Name'], axis=1)
test_data[["Deck", "Cabin_num", "Side"]] = test_data["Cabin"].str.split("/", expand=True)
test_data = test_data.drop('Cabin', axis=1)

# Preprocessor
features_num = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Cabin_num"]
features_cat = ["HomePlanet", "Destination", "Deck", "Side"]
features_bool = ["VIP", "CryoSleep"]

transformer_num = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler()
)

transformer_cat = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown='ignore')
)

# Booleans: fill missing values with 0 (False), then scale like the numeric columns
transformer_bool = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler()
)

preprocessor = make_column_transformer(
    (transformer_num, features_num),
    (transformer_cat, features_cat),
    (transformer_bool, features_bool)
)

# Splitting the data
X = train_data.drop('Transported', axis=1)
y = train_data['Transported']

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)

# Applying the preprocessor
X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)

input_shape = [X_train.shape[1]]

model = keras.Sequential([
    layers.BatchNormalization(input_shape=input_shape),
    
    layers.Dense(1024, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),

    layers.Dense(256, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    
    layers.Dense(128, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(rate=0.4),
    
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.0003),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

early_stopping = keras.callbacks.EarlyStopping(patience=20, min_delta=0.001, restore_best_weights=True)
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=6, verbose=1)
checkpoint = keras.callbacks.ModelCheckpoint('best_model.h5', save_best_only=True)

# Training
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=100,
    callbacks=[early_stopping, lr_scheduler, checkpoint]
)

# Plotting
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot(title="Cross-entropy")
history_df.loc[:, ['accuracy', 'val_accuracy']].plot(title="Accuracy")

# Validating
val_loss, val_accuracy = model.evaluate(X_valid, y_valid, verbose=0)

print(f"Validation Loss: {val_loss:.4f}")
print(f"Validation Accuracy: {val_accuracy*100:.2f}%")

# Prediction
test_data_processed = preprocessor.transform(test_data)
predictions = model.predict(test_data_processed)
predicted_labels = (predictions > 0.5).astype(bool).flatten()

submission_df = pd.DataFrame({
    "PassengerId": pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")["PassengerId"],  
    "Transported": predicted_labels
})

submission_df.to_csv("submission.csv", index=False)
Structuring the preprocessor this way makes it highly reusable. For example, try to write it along these lines:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
import pandas as pd

numeric_features = ['num_feature1', 'num_feature2']
categorical_features = ['cat_feature1', 'cat_feature2']
boolean_features = ['bool_feature1']
date_features = ['date_feature1']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

boolean_transformer = FunctionTransformer(lambda x: x.fillna(False).astype(int), validate=False)  # fill NaNs first: astype(int) fails on missing values

date_transformer = FunctionTransformer(lambda x: x.apply(lambda col: pd.to_datetime(col).dt.year), validate=False)  # ColumnTransformer passes a DataFrame, so convert column-wise

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('bool', boolean_transformer, boolean_features),
        ('date', date_transformer, date_features)
    ])
# In principle, almost every preprocessing step can be encapsulated in the preprocessor
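For example, the preprocessor can then be dropped into a full Pipeline together with an estimator, so fit and predict reuse exactly the same transformations. A sketch (X_train, y_train, and X_test are placeholder DataFrames containing the feature columns above):

from sklearn.ensemble import RandomForestClassifier

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=1))
])
clf.fit(X_train, y_train)          # transformers are fit on the training data only
predictions = clf.predict(X_test)  # and reapplied consistently at predict time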