Machine Learning 【note_02】

Published 2023-06-02 13:30:48 · Author: —青_木—

note_02

Keywords: Classification, Logistic Regression, Overfitting, Regularization


1 Motivation

image-20230601161422949

Classification:

  • "binary classification": \(y\) can only be one of two values
  • class / category

Trying to use linear regression for this classification problem:

image-20230601161906847

It seems to work.

However, when another sample point is added:

image-20230601162035119

It shifts the \(x\) value corresponding to the threshold 0.5 to the right, which is worse because some points are now misclassified.

Logistic regression

  • Can be used to handle this situation.
  • Despite its name, it is actually used for binary classification problems.

2 Logistic Regression

2.1 Conception

sigmoid / logistic function: \(g(z)=\frac{1}{1+e^{-z}}\), \(0<g(z)<1\)

image-20230601163926622

Logistic regression:

\[f_{\vec{w},b}(\vec{x})=g(\vec{w}\cdot\vec{x}+b)=\frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}} \]

image-20230601165226566

Interpreting logistic regression: the output is the probability that the class or label \(y\) equals \(1\) given a certain input \(x\).

image-20230601165753969
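As a minimal sketch of the model above (my own illustration, not part of the original notes; the names `sigmoid` and `predict_proba` are arbitrary), this could be written with NumPy as:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Logistic regression model f_{w,b}(x) = g(w . x + b).

    X: (m, n) matrix of m examples with n features
    w: (n,) weight vector, b: scalar bias
    Returns the estimated probability that y = 1 for each example.
    """
    return sigmoid(X @ w + b)
```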

2.2 Decision Boundary

2.2.1 Threshold

image-20230601175043152

2.2.2 Linear

image-20230601175233889

2.2.3 Non-linear

image-20230601175531961 image-20230601175717607
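A short sketch of the thresholding step (assuming the `predict_proba` helper from the sketch above): since \(g(z)\ge 0.5\) exactly when \(z=\vec{w}\cdot\vec{x}+b\ge 0\), the decision boundary is the set of points where \(\vec{w}\cdot\vec{x}+b=0\); feeding in polynomial features gives the non-linear boundaries shown above.

```python
def predict(X, w, b, threshold=0.5):
    """Hard classification: y_hat = 1 where f_{w,b}(x) >= threshold.

    With threshold = 0.5 this is equivalent to checking w . x + b >= 0.
    """
    return (predict_proba(X, w, b) >= threshold).astype(int)
```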

2.3 Cost Function

2.3.1 Squared Error Cost

How to choose \(\vec{w}\) and \(b\) for logistic regression:

image-20230601180240713

The squared error cost function is not a good choice:

image-20230601180512304

Because it has many local minima and is not convex, unlike the smooth "soup bowl" cost surface from linear regression:

image-20230601183520684

2.3.2 Logistic Loss

Instead, we define the cost in terms of the logistic loss function:

\[J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^mL(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) \\ L(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) = \left\{ \begin{array}{ll} -\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right), & y^{(i)}=1\\ -\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right), & y^{(i)}=0 \end{array} \right. \]

To understand it intuitively:

image-20230601180951772 image-20230601181054174

We can see that the resulting cost curve is much better behaved, with a single global minimum:

image-20230601183849389
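As an illustrative sketch of the piecewise loss (my own code, assuming NumPy; the name `logistic_loss` is arbitrary):

```python
import numpy as np

def logistic_loss(f_i, y_i):
    """Piecewise logistic loss for a single example.

    f_i: model output f_{w,b}(x^(i)), a value in (0, 1)
    y_i: true label, 0 or 1
    """
    if y_i == 1:
        return -np.log(f_i)        # penalizes predictions far below 1
    return -np.log(1.0 - f_i)      # penalizes predictions far above 0
```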

2.3.3 Simplified Cost Function

We can write loss function as:

\[L(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) = - y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) - (1-y^{(i)})\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right) \]

The cost function could be simplified as:

\[\begin{aligned} J(\vec{w},b) &= \frac{1}{m}\sum_{i=1}^mL(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) \\ &= -\frac{1}{m}\sum_{i=1}^m\left[ y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + (1-y^{(i)})\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right) \right] \end{aligned} \]

image-20230601190301470
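A minimal vectorized sketch of this simplified cost (assuming the `sigmoid` helper from the earlier sketch; not taken from the original notes):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Logistic regression cost J(w, b) using the simplified loss."""
    f = sigmoid(X @ w + b)                              # predicted probabilities, shape (m,)
    losses = -y * np.log(f) - (1 - y) * np.log(1 - f)   # per-example loss
    return losses.mean()                                # average over the m examples
```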

2.4 Gradient Descent

Find \(\vec{w}\) and \(b\) to minimize the cost function.

Given a new \(\vec{x}\), output

\[P(y=1|\vec{x};\vec{w},b)=f_{\vec{w},b}(\vec{x})=\frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}} \]

2.4.1 Implementation

Remarkably, the partial derivatives of the cost function have the same form as in linear regression; only the definition of \(f_{\vec{w},b}\) differs.

image-20230601193517144
Derivation

Here is the derivation:

\[\begin{aligned} \frac{\partial}{\partial{w_j}}J(\vec{w},b) &= \frac{\partial}{\partial{w_j}} \left\{-\frac{1}{m}\sum_{i=1}^m\left[ y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + (1-y^{(i)})\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right) \right]\right\}\\ &= -\frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)}}{f_{\vec{w},b}(\vec{x}^{(i)})} - \frac{1-y^{(i)}}{1-f_{\vec{w},b}(\vec{x}^{(i)})} \right] \frac{\partial f_{\vec{w},b}(\vec{x}^{(i)})}{\partial{w_j}}\\ &= -\frac{1}{m}\sum_{i=1}^m \frac{y^{(i)}-f_{\vec{w},b}(\vec{x}^{(i)})}{f_{\vec{w},b}(\vec{x}^{(i)})\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right)} \cdot \frac{e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}}{\left(1+e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}\right)^2}\, x_j^{(i)}\\ &= -\frac{1}{m}\sum_{i=1}^m \frac{y^{(i)}-f_{\vec{w},b}(\vec{x}^{(i)})}{f_{\vec{w},b}(\vec{x}^{(i)})\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right)} \cdot f_{\vec{w},b}(\vec{x}^{(i)})\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right) x_j^{(i)}\\ &= \frac{1}{m}\sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)} \end{aligned} \]

The key step from the third to the fourth line uses \(\frac{e^{-z}}{(1+e^{-z})^2}=g(z)\bigl(1-g(z)\bigr)\), i.e. \(\frac{\partial f_{\vec{w},b}}{\partial w_j}=f_{\vec{w},b}\,(1-f_{\vec{w},b})\,x_j^{(i)}\), after which the \(f_{\vec{w},b}(1-f_{\vec{w},b})\) factors cancel.

The derivative with respect to \(b\) is obtained in the same way:

image-20230601201808439
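One possible implementation of these updates as batch gradient descent (a sketch assuming the `sigmoid` helper above; the learning rate and iteration count are illustrative defaults):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for logistic regression.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    Returns the learned weight vector w (n,) and bias b.
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        f = sigmoid(X @ w + b)        # f_{w,b}(x^(i)) for all i, shape (m,)
        err = f - y                   # f_{w,b}(x^(i)) - y^(i)
        dj_dw = (X.T @ err) / m       # partial derivatives w.r.t. each w_j
        dj_db = err.mean()            # partial derivative w.r.t. b
        w = w - alpha * dj_dw         # simultaneous update
        b = b - alpha * dj_db
    return w, b
```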

3 Overfitting

3.1 Conception

underfitting

  • doesn't fit the training set well
  • high bias

just right

  • fits the training set pretty well
  • generalizes well to new examples

overfitting

  • fits the training set extremely well
  • high variance
image-20230601232411887 image-20230601233320171

3.2 Addressing Overfitting

Method 1: Collect more training examples

image-20230602090306951

Method 2: Feature Selection - Select features to include / exclude

image-20230602090443033

Method 3: Regularization

  • a gentler approach
  • keeps all the features, but prevents any single feature from having an overly large effect
image-20230602090753020

Whether or not \(b\) is regularized makes essentially no difference in practice.

4 Regularization

4.1 Conception

Modify the cost function by adding a penalty on the weights. Intuitively, the modified cost can only become small when the \(w_j\) themselves are small.

image-20230602101436430

So, set the cost function as:

\[J(\vec{w},b)=\frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2 \]

  • \(\lambda\): regularization parameter, \(\lambda>0\)
  • \(n\): number of features (the penalty sums over the weights, not the examples)
  • \(b\) can be included in or excluded from the penalty
image-20230602102756757

Here you can see that if \(\lambda\) is too small (say \(0\)), the penalty disappears and the model overfits; and if \(\lambda\) is too large (say \(10^{10}\)), the weights are driven toward zero and the model underfits.

image-20230602103118714
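An illustrative sketch of the regularized squared-error cost (my own code; `lambda_` stands for \(\lambda\), and the bias \(b\) is left unregularized, matching the note above):

```python
def compute_cost_linear_reg(X, y, w, b, lambda_=1.0):
    """Regularized cost for linear regression: squared error plus (lambda / 2m) * sum(w_j^2)."""
    m = X.shape[0]
    err = X @ w + b - y                     # f_{w,b}(x^(i)) - y^(i)
    cost = (err @ err) / (2 * m)            # (1 / 2m) * sum of squared errors
    reg = (lambda_ / (2 * m)) * (w @ w)     # penalty on the weights only, not on b
    return cost + reg
```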

4.2 Regularized Linear Regression

Gradient descent

repeat {

\[\begin{aligned} w_j &:= w_j-\alpha\frac{\partial}{\partial{w_j}}J(\vec{w},b) = w_j-\alpha\left[ \frac{1}{m} \sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}w_j\right] \\ b &:= b-\alpha\frac{\partial}{\partial{b}}J(\vec{w},b) = b-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)\right] \end{aligned} \]

}

image-20230602123241498

Here, rewrite the \(w_j\) update:

\[\begin{aligned} w_j &= w_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}w_j\right] \\ &= w_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)} \end{aligned} \]

We can see that the second term is the usual gradient-descent update, while the factor \(\left(1-\alpha\frac{\lambda}{m}\right)\) in the first term shrinks \(w_j\) slightly on every iteration.
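A sketch of one regularized gradient-descent step for linear regression, following the update above (my own code; \(b\) is not regularized here):

```python
def gradient_step_linear_reg(X, y, w, b, alpha, lambda_):
    """One regularized gradient-descent step for linear regression."""
    m = X.shape[0]
    err = X @ w + b - y                          # f_{w,b}(x^(i)) - y^(i)
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w  # gradient w.r.t. w, with shrinkage term
    dj_db = err.mean()                           # gradient w.r.t. b (no regularization)
    return w - alpha * dj_dw, b - alpha * dj_db
```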

4.3 Regularized Logistic Regression

image-20230602125949866 image-20230602130147126
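The regularized logistic regression updates have exactly the same form; only \(f_{\vec{w},b}\) changes to the sigmoid of \(\vec{w}\cdot\vec{x}+b\). A minimal sketch, reusing the `sigmoid` helper from the earlier sketch:

```python
def gradient_step_logistic_reg(X, y, w, b, alpha, lambda_):
    """One regularized gradient-descent step for logistic regression."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y                 # f_{w,b}(x^(i)) - y^(i)
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w  # same form as linear regression
    dj_db = err.mean()
    return w - alpha * dj_dw, b - alpha * dj_db
```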