The Playground has deadlines too? _Binary Classification with a Kidney Stone Prediction Dataset_

Published 2023-04-14 00:36:35 Author: furiyo

Dataset Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Kidney Stone Prediction based on Urine Analysis dataset. The feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore the differences and to see whether incorporating the original data into training improves model performance.

Files

  • train.csv - the training dataset; target is the likelihood of a kidney stone being present
  • test.csv - the test dataset; your objective is to predict the probability of target
  • sample_submission.csv - a sample submission file in the correct format

OUR RECENT SUBMISSION

My rank: 725

Submitted 21 hours ago

Score: 0.79000

Model used: WeightedEnsemble

This is an ensemble learning method that combines the predictions of several sub-models via a weighted average in order to improve overall performance and stability.

WeightedEnsemble_L2 first trains several independent sub-models; each sub-model can use a different algorithm, different hyperparameters or different training data so that it captures a different feature representation. At prediction time, every sub-model in the ensemble makes a prediction on the input, and the ensemble's output is the weighted average of all sub-model outputs.
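Below is a minimal sketch of that weighted-average idea, not the actual WeightedEnsemble_L2 implementation; the synthetic data, the two base models and the weights are all illustrative assumptions (in practice the weights are tuned on held-out or out-of-fold validation predictions).

# Minimal sketch of weighted averaging of sub-model probabilities (assumptions noted above)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# train independent sub-models
sub_models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]
probs = []
for m in sub_models:
    m.fit(X_tr, y_tr)
    probs.append(m.predict_proba(X_val)[:, 1])

# ensemble prediction = weighted average of the sub-model probabilities
weights = np.array([0.4, 0.6])  # assumed weights
ensemble_prob = np.average(np.column_stack(probs), axis=1, weights=weights)
print("Ensemble ROC AUC:", roc_auc_score(y_val, ensemble_prob))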


Example notebook from GANESH JAINARAIN

Task: develop an ML/DL model to predict the occurrence of kidney stones

Predict the presence of kidney stones based on urine analysis. What does that actually mean? We ultimately want to predict some target (y) from a set of inputs (X). In this case we want to predict the presence of kidney stones, where "1 = we see it present" and "0 = we see it absent", so this is a "binary classification" problem: something is either present or it is not.

Data Understanding I

This dataset can be used to predict the presence of kidney stones based on urine analysis.

79 urine specimens were analysed in an effort to determine whether certain physical characteristics of the urine might be related to the formation of calcium oxalate crystals.

The six physical characteristics of the urine are:

  • (1) specific gravity, the density of the urine relative to water;
  • (2) pH, the negative logarithm of the hydrogen ion;
  • (3) osmolarity (mOsm), a unit used in biology and medicine but not in physical chemistry; osmolarity is proportional to the concentration of molecules in solution;
  • (4) conductivity (mMho, milliMho); one Mho is one reciprocal Ohm; conductivity is proportional to the concentration of charged ions in solution;
  • (5) urea concentration in millimoles per litre;
  • (6) calcium concentration (CALC) in millimoles per litre.

    These data were obtained from "Physical Characteristics of Urines With and Without Crystals", a chapter in the Springer Series in Statistics.

Data Understanding II

There are two datasets, the original dataset and the generated dataset, and we will use both to compare and contrast the features, etc.

Files from the generated dataset:

  • train.csv - the training dataset; the target is the likelihood of a kidney stone being present
  • test.csv - the test dataset; your objective is to predict the probability of target
  • sample_submission.csv - a sample submission file in the correct format

Files from the original dataset:

  • kidney_stone_urine_analysis.csv

Data Understanding III, Understanding our Features in Depth

  • Specific gravity: urine specific gravity is a laboratory test that shows the concentration of all chemical particles in the urine. The normal range of urine specific gravity is 1.005 to 1.030. USG is the ratio of the density of urine (mass per unit volume) to the density of a reference substance (water, the same unit volume). USG values range between 1.000 and 1.040 g/mL; a USG below 1.008 g/mL is considered dilute and a USG above 1.020 g/mL is considered concentrated. Patients who form stones have a higher USG than patients who do not.
  • pH: when the pH of urine drops below 5.5, the urine becomes saturated with uric acid crystals, a condition known as hypercalciuria. When there is too much uric acid in the urine, stones can form. Uric acid stones are more common in people who consume large amounts of protein, such as that found in red meat or poultry. Source: https://www.hopkinsmedicine.org/health/conditions-and-diseases/kidney-stones
  • Osmolarity (mOsm): osmolarity refers to the number of solute particles per 1 L of solvent.

Importing Libraries & Data

#load packages
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}". format(sys.version))

import pandas as pd #collection of functions for data processing and analysis modeled after R dataframes with SQL like features
print("pandas version: {}". format(pd.__version__))

import matplotlib #collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}". format(matplotlib.__version__))

import numpy as np #foundational package for scientific computing
print("NumPy version: {}". format(np.__version__))

import scipy as sp #collection of functions for scientific computing and advance mathematics
print("SciPy version: {}". format(sp.__version__)) 

import IPython
from IPython import display #pretty printing of dataframes in Jupyter notebook
from IPython.display import HTML, display
print("IPython version: {}". format(IPython.__version__)) 

import sklearn #collection of machine learning algorithms
print("scikit-learn version: {}". format(sklearn.__version__))

#misc libraries
import random
import time

#ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)

Load Data Modelling Libraries

We will use the popular scikit-learn library to develop our machine learning algorithms. In sklearn, algorithms are called Estimators and are implemented in their own classes. For data visualization we will use the matplotlib and seaborn libraries. Below are the common classes to load.

#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process

#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics

#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
sns.set_style('darkgrid')
pylab.rcParams['figure.figsize'] = 12,8

Making our datasets available in our coding environment

Reading in our csv files and putting them into a dataframe object (omitted)

Rough implementation:

# ORIGINAL_FILENAME / TRAIN_FILENAME / TEST_FILENAME are placeholder path constants
# pointing at the original and competition CSV files described above
original_data = pd.read_csv(ORIGINAL_FILENAME)
train_data = pd.read_csv(TRAIN_FILENAME)
test_data = pd.read_csv(TEST_FILENAME)

Inspect the variables (omitted)

Summary of our Datasets:

Original dataset:

  • 79 data entries
  • 7 columns
  • no 'id' column
  • two data types among the features: float64(4), int64(3)
  • gravity: float64; ph: float64; osmo: int64; cond: float64; urea: int64; calc: float64; target: int64

Training dataset:

  • 414 data entries
  • 8 columns in total
  • an 'id' column is present
  • two data types among the features: float64(4), int64(4)
  • id: int64; gravity: float64; ph: float64; osmo: int64; cond: float64; urea: int64; calc: float64; target: int64

Test dataset:

  • 276 data entries
  • 7 columns
  • no 'target' column, since this is the test dataset
  • an 'id' column is present
  • two data types among the features: float64(4), int64(3)
  • id: int64; gravity: float64; ph: float64; osmo: int64; cond: float64; urea: int64; calc: float64


Observing if there are any Null-Values

We can see that none of our datasets contain any null values.

Checking for duplicated values: there are no duplicates in the datasets.
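A rough sketch of these (omitted) checks, assuming the three dataframes loaded above (original_data, train_data, test_data):

# Count null values and duplicated rows in each dataframe
for name, df in [("original", original_data), ("train", train_data), ("test", test_data)]:
    print(name, "null values per column:")
    print(df.isnull().sum())
    print(name, "duplicated rows:", df.duplicated().sum())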


Data Cleaning for 'id'

In [23]:

#delete the ID column
drop_column = ['id']
train_data.drop(drop_column, axis=1, inplace = True)
print(train_data.isnull().sum())
print("-"*10)
gravity    0
ph         0
osmo       0
cond       0
urea       0
calc       0
target     0
dtype: int64
----------

Exploratory Data Analysis (EDA)

Correlation Heatmap of Dataset

In [24]:

#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = "RdYlGn",
        square=True, 
        cbar_kws={'shrink':.5 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='black',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features for Training Dataset', y=1.05, size=15)

correlation_heatmap(train_data)


Summary of Correlation Heatmap of Train Dataset

When we talk about correlation between features in a dataset what do we really mean?

Data Correlation is a way to understand the relationship between multiple variables and attributes in your dataset. Using Correlation, you can get some insights such as: One or multiple attributes depend on another attribute or a cause for another attribute.

The possible range of values for the correlation coefficient is -1.0 to 1.0. The values cannot exceed 1.0 or be less than -1.0. A correlation of -1.0 indicates a perfect negative correlation, and a correlation of 1.0 indicates a perfect positive correlation.

Correlation coefficients whose magnitude are between 0.7 and 0.9 indicate variables which can be considered highly correlated.

Correlation coefficients whose magnitude are between 0.5 and 0.7 indicate variables which can be considered moderately correlated.

A negative correlation indicates two variables that tend to move in opposite directions. A correlation coefficient of -0.8 or lower indicates a strong negative relationship, while a coefficient between -0.3 and 0 indicates a very weak one.

  • Exactly -1: a perfect negative (downward sloping) linear relationship
  • -0.70: a strong negative (downward sloping) linear relationship
  • -0.50: a moderate negative (downward sloping) relationship
  • -0.30: a weak negative (downward sloping) linear relationship
  • 0: no linear relationship
  • +0.30: a weak positive (upward sloping) linear relationship
  • +0.50: a moderate positive (upward sloping) linear relationship
  • +0.70: a strong positive (upward sloping) linear relationship
  • Exactly +1: a perfect positive (upward sloping) linear relationship


Here in the Train Dataset we see some correlated features, by using the Pearson Correlation metric we can see various values
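As a quick aside (not from the original notebook), the individual coefficients quoted in the observations below can be read straight off the correlation matrix:

# Reading individual Pearson coefficients from the training data
corr = train_data.corr()
print(corr.loc["urea", "osmo"])   # ~ +0.81, see Observation 1
print(corr.loc["cond", "osmo"])   # ~ +0.71, see Observation 2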

Observation 1:

If we observe the features urea and osmo we see a positive correlation of (+0.81)

It means that when the value of the urea variable increases then the value of the other variable(s) osmo also increases.

A strong positive (upward sloping) linear relationship

Observation 2:

If we observe the features cond and osmo we see a positive correlation of (+0.71)

It means that when the value of the cond variable increases then the value of the other variable(s) osmo also increases.

A strong positive (upward sloping) linear relationship

Observation 3:

If we observe the features gravity and osmo we see a positive correlation of (+0.69)

It means that when the value of the gravity variable increases then the value of the other variable(s) osmo also increases.

A strong positive (upward sloping) linear relationship, just below the +0.70 threshold

Observation 4:

If we observe the features gravity and urea we see a positive correlation of (+0.63)

It means that when the value of the gravity variable increases then the value of the other variable(s) urea also increases.

A moderately strong positive (upward sloping) linear relationship

Conclusion:

These are either highly correlated features or moderately correlated features.

It is recommended to avoid having correlated features in your dataset. A group of highly correlated features will not bring additional information, but will increase the complexity of the algorithm, thus increasing the risk of errors.

Decision:

We will decide whether to drop these feature(s) or keep them in the Feature Engineering phase of the project
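One possible way to act on that decision later is sketched below, under the assumption that we drop features whose absolute pairwise correlation exceeds 0.8 (the 0.8 threshold is my choice, not the notebook's):

# Flag candidate columns to drop based on pairwise correlation
import numpy as np

corr = train_data.drop(columns=["target"]).corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
print("Candidate columns to drop (|r| > 0.8):", to_drop)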

In [25]:

#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = "RdYlGn",
        square=True, 
        cbar_kws={'shrink':.5 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='black',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features for Test Dataset', y=1.05, size=15)

correlation_heatmap(test_data)


Pair Plots

What do Pair Plots really show us?

Pair plots, also known as scatterplot matrices, are a type of data visualization that display pairwise relationships between multiple variables in a dataset.

In machine learning, pair plots can be used to:

  • Identify patterns:
    Pair plots can reveal patterns or trends in the data that may not be apparent from individual scatterplots.
  • Detect outliers:
    Outliers can be easily spotted in pair plots, allowing for further investigation.
  • Assess correlation:
    Pair plots can help us identify the strength and direction of linear relationships between variables. This can be useful in feature selection and model building.
  • Explore data distributions:
    Pair plots can help us understand the distribution of each variable in the dataset, and identify any potential issues such as skewness or outliers.

From the output below, we can observe the variations in each plot. The plots are in matrix format where the row name represents x axis and column name represents the y axis. The main-diagonal subplots are the univariate histograms (distributions) for each attribute.

In [26]:

import matplotlib.pyplot as plt
import seaborn as sns

# Note: sns.pairplot creates its own figure, so the plt.figure call below only
# produces the empty "<Figure size 1200x1000 with 0 Axes>" seen in the output,
# and plt.title / plt.legend attach to that empty figure rather than to the grid.
plt.figure(figsize=(12,10))
sns.pairplot(train_data,hue="target")
plt.title("Looking for Insights in Data")
plt.legend("target")
plt.tight_layout()
plt.plot()
plt.show()
<Figure size 1200x1000 with 0 Axes>


Distribution of Data

Interpreting our Histograms

Looking for Skewness


Data is skewed when its distribution curve is asymmetrical and skewness is the measure of the asymmetry.

The skewness for a normal distribution is 0. There are 2 different types of skews in data, left(negative), and right(positive).

Effects of skewed data: Degrades the models’ ability to describe typical cases as it has to deal with rare cases on extreme values. Right skewed data will predict better on data points with lower values as compared to those with higher values. Skewed data doesn’t work well with many statistical methods.

To ensure that the machine learning model capabilities are not affected, skewed data has to be transformed to approximate a normal distribution. The method used to transform the skewed data depends on the characteristics of the data.

  1. Log Transformation
  2. Remove Outliers
  3. Normalization (Min-Max)
  4. Square root: applied only to positive values
  5. Square: applied on left skew

We can use this line of code to observe skewness:

train_data.skew().sort_values(ascending=False)
  • If the skewness is between -0.5 and 0.5, the data are fairly symmetrical
  • If the skewness is between -1 and -0.5 or between 0.5 and 1, the data are moderately skewed
  • If the skewness is less than -1 or greater than 1, the data are highly skewed

Using this information to reach conclusions about the skewness of the data

  • calc 1.118533: highly/positively skewed; the distribution plot shows the bulk of the values towards the left of the graph with a long right tail
  • ph 0.971308: moderately (borderline highly) positively skewed; the bulk of the values again sit towards the left of the graph
  • urea 0.329107: (Fairly symmetrical)
  • gravity 0.291010: (Fairly symmetrical)
  • target 0.224421: (Fairly symmetrical)
  • osmo 0.147395: (Fairly symmetrical)
  • cond -0.212009: (Fairly symmetrical)

Decision:

We will keep all the features as-is for now, but should apply some transformation to the skewed features during feature engineering.
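A minimal sketch of the log transformation mentioned above, applied only to the two positively skewed columns; this is illustrative and not necessarily the transformation applied later in the notebook:

# Reduce positive skew with a log(1 + x) transform
import numpy as np

skewed_cols = ["calc", "ph"]          # the positively skewed features identified above
train_log = train_data.copy()
train_log[skewed_cols] = np.log1p(train_log[skewed_cols])  # log1p keeps zero/small values safe
print(train_log[skewed_cols].skew())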

In [27]:

train_data.skew().sort_values(ascending=False)

Out[27]:

calc       1.118533
ph         0.971308
urea       0.329107
gravity    0.291010
target     0.224421
osmo       0.147395
cond      -0.212009
dtype: float64

In [28]:
plt.figure(figsize=(14,10))
plt.title("Distribution of feature Data in the Train Dataset")
for i,col in enumerate(train_data.columns,1):
    plt.subplot(4,3,i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(train_data[col],kde=True)
    plt.tight_layout()
    plt.plot()
plt.show()

(distribution plots for each feature omitted)

Train/Test Split

Train Test Split Using Sklearn

The train_test_split() method is used to split our data into train and test sets.

First, we need to divide our data into features (X) and labels (y).

The dataframe gets divided into:

  • X_train
  • X_test
  • y_train
  • y_test

X_train and y_train sets are used for training and fitting the model.

The X_test and y_test sets are used for testing the model if it’s predicting the right outputs/labels. We can explicitly test the size of the train and test sets. It is suggested to keep our train sets larger than the test sets.

Evaluating an XGBoost classifier model with the train and test sets

Note 1: we have implemented only a very basic XGBClassifier model here, to get a feel for splitting the data into train and test sets and running a classifier on it; hyperparameter tuning, feature engineering and data scaling will follow later if necessary.

Note 2: MinMaxScaler() in scikit-learn is used for data normalization (also called feature scaling). Normalization is not necessary for decision trees, and since XGBoost is based on decision trees, it does not require its inputs to be normalized. However, although decision trees are naturally resistant to outliers, boosted trees are susceptible to them, because each new tree is built on the residuals. Normalization, or even just a log transform, offers better protection against outliers. We will therefore implement both versions, but for now I have implemented the XGBClassifier without normalization because it is quick and easy to get a feel for the data.

XGBClassifier Default Implementation #1

As you can see, the default implementation gives us an accuracy of 69%, which is poor, but keep in mind that this is only the default implementation.

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# split data into X and y

X = train_data.drop(['target'], axis=1)
y = train_data['target'] 

# split data into train and test sets
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.24, random_state=7)

# fit the model on the training data
model = XGBClassifier()
model.fit(X_train, y_train)

# make predictions for test data

y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 69.00%

XGBClassifier Monitor Performance and Early Stopping Implementation #2

The XGBoost model can evaluate and report its performance on a test set during training.

It supports this by letting us specify, when calling model.fit() while training, a test dataset and an evaluation metric, along with verbose output (verbose=True).

For example, we can report the binary classification error rate (error) on a standalone test set (eval_set) while training an XGBoost model.

We can use this evaluation to stop training once the model is no longer improving.

We do this by setting the early_stopping_rounds argument in the call to model.fit() to the number of iterations with no improvement on the validation dataset before training is stopped.

# split data into X and y

X = train_data.drop(['target'], axis=1)
y = train_data['target'] 

# split data into train and test sets
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=0.33, random_state=7)


model = XGBClassifier()
eval_set = [(X_test, y_test)]
# Note: in recent xgboost releases, early_stopping_rounds and eval_metric are passed
# to the XGBClassifier constructor rather than to fit(); the call below uses the older API.
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Feature Engineering Intro I

Feature engineering is a machine learning technique that uses the data to create new variables that are not in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also improving model accuracy.

Feature creation: creating features means constructing new variables that are most helpful to our model. This can involve adding or removing features.

Transformation: feature transformation is simply a function that converts features from one representation to another. The goal here is to plot and visualize the data; if something does not line up with the new features, we can reduce the number of features used, speed up training, or improve the accuracy of a given model.

Feature extraction: feature extraction is the process of extracting features from a dataset to identify useful information. Without distorting the original relationships or important information, it compresses the amount of data into a manageable quantity for the algorithm to process.

Let's see whether we can do some feature engineering to get more accurate results from our ML models.
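As a small illustrative example of feature creation (the calc_urea_ratio column below is a hypothetical feature of mine and is not used elsewhere in this notebook):

# Create a ratio feature from two existing measurements
train_fe = train_data.copy()
# the small constant avoids division by zero
train_fe["calc_urea_ratio"] = train_fe["calc"] / (train_fe["urea"] + 1e-6)
print(train_fe[["calc", "urea", "calc_urea_ratio"]].head())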

Asking ChatGPT about Feature Engineering for the features

I asked ChatGPT to give me some help with ideas for feature engineering; here is what it had to say:

Feature scaling: you can use min-max scaling to scale all the features to a range of 0 to 1, which ensures that no single feature dominates the others.

Feature encoding: none of the features you provided are categorical variables, so no feature encoding is needed.

Feature selection: you can use correlation-based feature selection to identify the features most strongly related to the presence of kidney stones. For example, you can compute the correlation between each feature and the presence of kidney stones and select the features with the highest correlation.

Feature extraction: you can extract additional features from the ones provided, such as ratios of certain measurements or the presence of specific compounds in the urine. For example, you could compute a calcium oxalate supersaturation index, the ratio of the calcium and oxalate concentrations, or an ammonium urate crystallisation index, the ratio of the ammonium and urate concentrations.

Feature engineering using domain knowledge: you can engineer features based on your understanding of kidney stone formation and the role urine composition plays in it. For example, you could engineer features related to the concentration of certain minerals or the acidity of the urine. You could also engineer features related to urine volume, since volume affects the concentration of minerals and other compounds in the urine.

In summary: remember that the goal of feature engineering is to create a set of features that help your machine learning algorithm make accurate predictions. It is important to understand the problem domain and the data you are working with, and to experiment with different feature engineering techniques to find what works best.


from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Separate the features and target variable
X = train_data.drop('target', axis=1)
y = train_data['target']

# Split the train_data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess the numerical features by scaling them and adding polynomial features
scaler = StandardScaler()
poly = PolynomialFeatures(degree=2)
X_train = scaler.fit_transform(X_train)
X_train = poly.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_test = poly.transform(X_test)

# Define the hyperparameter grid for logistic regression
param_grid = {'C': [0.1, 0.5, 1, 5, 10],
              'penalty': ['l1', 'l2'],
              'solver': ['liblinear', 'saga'],
              'max_iter': [100, 500, 1000]}

# Perform grid search to find the best hyperparameters
grid_search = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Train a logistic regression model with the best hyperparameters
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)

# Predict on the testing set
y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:,1]

# Evaluate the model performance
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f'Best hyperparameters: {grid_search.best_params_}')
print(f'Accuracy: {accuracy}')
print(f'Confusion matrix: \n{confusion}')
print(f'ROC AUC score: {roc_auc}')
Best hyperparameters: {'C': 0.1, 'max_iter': 100, 'penalty': 'l1', 'solver': 'saga'}
Accuracy: 0.7831325301204819
Confusion matrix: 
[[35 10]
 [ 8 30]]
ROC AUC score: 0.8578947368421053

In [34]:

# Preprocess the numerical features in the test dataset
X_test = test_data.drop('id', axis=1)
X_test = scaler.transform(X_test)
X_test = poly.transform(X_test)

# Make predictions on the test set and create a submission file for Kaggle
test_predictions = best_model.predict_proba(X_test)[:,1]
submission = pd.DataFrame({'id': test_data['id'], 'target': test_predictions})
submission.to_csv('submission.csv', index=False)

We expanded the hyperparameter grid for logistic regression to include the solver and max_iter parameters. We also added a random_state parameter to make the results reproducible.

We run a grid search to find the best hyperparameters for logistic regression and then train a logistic regression model with them. We evaluate the model's performance on the test set and report the accuracy, the confusion matrix and the ROC AUC score.

Min-Max Scaling/ Pre-Processing

Normalization rescales data from its original range so that all values lie within a new range of 0 to 1.

Normalization requires that you know, or can accurately estimate, the minimum and maximum observable values. It ensures that no single feature dominates the others.

A value is normalized as follows:

y = (x - min) / (max - min)

where the minimum and maximum values pertain to the value x being normalized.

We will also combine our training and original datasets.

In [35]:

print(train_data.shape)

train = pd.concat([train_data, original_data])
(414, 7)

In [36]:

print(train.shape)
print(train.duplicated(keep=False))

duplicate = train[train.duplicated()]
 
print("Duplicate Rows in the New Dataframe:")
 
# Print the resultant Dataframe
print(duplicate)
(493, 7)
0     False
1     False
2     False
3     False
4     False
      ...  
74    False
75    False
76    False
77    False
78    False
Length: 493, dtype: bool
Duplicate Rows in the New Dataframe:
Empty DataFrame
Columns: [gravity, ph, osmo, cond, urea, calc, target]
Index: []

In [37]:

# Scale the values of the features to a range of 0 to 1 using min-max scaling
train_scaled = (train - train.min()) / (train.max() - train.min())

In [38]:

train_scaled.head()

Out[38]:

gravity ph osmo cond urea calc target
0 0.228571 0.449686 0.244042 0.294833 0.186885 0.090332 0.0
1 0.571429 0.201258 0.491897 0.562310 0.629508 0.282992 0.0
2 0.114286 0.430818 0.175405 0.589666 0.244262 0.625970 0.0
3 0.457143 0.047170 0.243089 0.477204 0.636066 0.455893 1.0
4 0.457143 0.242138 0.654909 0.386018 0.614754 0.143966 1.0

Removing Outliers

There are several reasons why we remove outliers from a dataset:

Outliers affect the accuracy of statistical analysis: statistics such as the mean, variance and correlation are sensitive to outliers. Outliers can skew the distribution of the data and lead to inaccurate results. Removing them can help improve the accuracy of statistical analysis.

Outliers affect the performance of machine learning models: machine learning models are trained on the data, and outliers can affect their performance. Outliers can lead to overfitting, where the model fits the training data too closely and performs poorly on new data. Removing them can help improve model performance.

Outliers may be the result of data collection or measurement errors: outliers can be caused by errors in data collection or measurement. Removing them can improve the quality of the dataset and reduce the impact of those errors on the analysis.

Outliers may be the result of rare events: in some cases, outliers come from rare events and do not represent the typical behaviour of the system being studied. Removing them can improve the accuracy of the analysis by focusing on the most representative data.

Overall, removing outliers can improve the accuracy and reliability of data analysis and machine learning models. However, outliers must be removed with care, because they may also contain important information and should not be discarded without careful consideration.

Some methods for removing outliers:

Z-score method: in this method we compute a z-score for every data point in the dataset. If the z-score exceeds a threshold (usually 3 or -3), the data point is considered an outlier. Once the outliers are identified, we can remove them from the dataset.

Interquartile range (IQR) method: in this method we compute the IQR for each feature in the dataset. Any data point that falls more than 1.5 times the IQR below the first quartile or above the third quartile is considered an outlier. These outliers can be removed from the dataset (a sketch of this rule is shown below).

Visualization: visualization techniques such as box plots and scatter plots can help identify outliers in the data. Points that fall outside the whiskers of a box plot, or far from the other clusters of points in a scatter plot, may be considered outliers.
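A sketch of the IQR rule described above, assuming it is applied to the combined (unscaled) train dataframe with the usual 1.5 * IQR fences:

# Drop rows where any feature falls outside the 1.5 * IQR fences
feature_cols = ["gravity", "ph", "osmo", "cond", "urea", "calc"]
q1 = train[feature_cols].quantile(0.25)
q3 = train[feature_cols].quantile(0.75)
iqr = q3 - q1
# a row is an outlier if any of its feature values falls outside the fences
outlier_mask = ((train[feature_cols] < (q1 - 1.5 * iqr)) | (train[feature_cols] > (q3 + 1.5 * iqr))).any(axis=1)
train_no_outliers = train[~outlier_mask]
print(train.shape, "->", train_no_outliers.shape)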

....
After scaling and removing the outliers, let's see whether we can get a better score.

Accuracy: 0.755, ROC AUC: 0.741. Slightly better, but let's see if we can do better.

Accuracy: 0.7872340425531915, ROC AUC: 0.7726851851851851. Slightly better again, after we selected some better hyperparameters with GridSearchCV.

Summary of the Hyperparameter Tuning Implementation

We added several extra hyperparameters to the params dictionary to improve the model's accuracy, including bootstrap, criterion and class_weight.

We also changed the value ranges of some existing hyperparameters to include more potential values that might improve the model's performance.

We then create an instance of the RandomForestClassifier class and a GridSearchCV instance with the updated hyperparameters. We run the grid search on the training data and print the best hyperparameters found.

We use the best model returned by the grid search to make predictions on the test data and compute that model's accuracy and ROC AUC.
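The post does not include the final grid-search code, so here is a sketch of what it might look like; the grid values below are assumptions, and X_train / y_train are assumed to be the scaled, outlier-filtered training split from earlier.

# Sketch of a RandomForestClassifier grid search with bootstrap, criterion and class_weight
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

params = {
    "n_estimators": [200, 400],
    "max_depth": [None, 6, 10],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
    "class_weight": [None, "balanced"],
}
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf, params, cv=5, scoring="roc_auc", n_jobs=-1)
grid_search.fit(X_train, y_train)   # X_train / y_train: assumed training split
print("Best hyperparameters:", grid_search.best_params_)
print("Best CV ROC AUC:", grid_search.best_score_)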