Machine Learning 【note_02】

Published 2023-06-02 13:30:48 · Author: —青_木—

note_02

Keywords: Classification, Logistic Regression, Overfitting, Regularization


1 Motivation

image-20230601161422949

Classification:

  • "binary classification": \(y\) can only be one of two values
  • class / category

Trying to use linear regression for this classification problem:

image-20230601161906847

It seems to work.

However, when another sample point is added:

image-20230601162035119

It shifts the \(x\) value corresponding to the threshold 0.5 to the right, which is worse because some points are now misclassified.

Logistic regression

  • Can be used to handle this situation.
  • Despite its name, it is actually used for binary classification problems.

2 Logistic Regression

2.1 Conception

sigmoid / logistic function: \(g(z)=\frac{1}{1+e^{-z}}\), \(0<g(z)<1\)

image-20230601163926622

Logistic regression:

\[f_{\vec{w},b}(\vec{x})=g(\vec{w}\cdot\vec{x}+b)=\frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}} \]

image-20230601165226566

Interpreting logistic regression: the output is the probability that the class or label \(y\) equals \(1\) given a certain input \(x\).

image-20230601165753969
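As a minimal sketch of the model above (my own illustration, not part of the original notes; the names `sigmoid` and `predict_proba` are arbitrary), this could be written with NumPy as:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)); output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Logistic regression model f_{w,b}(x) = g(w . x + b).

    X: (m, n) matrix of m examples with n features
    w: (n,) weight vector, b: scalar bias
    Returns the estimated probability that y = 1 for each example.
    """
    return sigmoid(X @ w + b)
```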

2.2 Decision Boundary

2.2.1 Threshold

image-20230601175043152

2.2.2 Linear

image-20230601175233889

2.2.3 Non-linear

image-20230601175531961 image-20230601175717607
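A short sketch of the thresholding step (assuming the `predict_proba` helper from the sketch above): since \(g(z)\ge 0.5\) exactly when \(z=\vec{w}\cdot\vec{x}+b\ge 0\), the decision boundary is the set of points where \(\vec{w}\cdot\vec{x}+b=0\); feeding in polynomial features gives the non-linear boundaries shown above.

```python
def predict(X, w, b, threshold=0.5):
    """Hard classification: y_hat = 1 where f_{w,b}(x) >= threshold.

    With threshold = 0.5 this is equivalent to checking w . x + b >= 0.
    """
    return (predict_proba(X, w, b) >= threshold).astype(int)
```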

2.3 Cost Function

2.3.1 Squared Error Cost

How to choose \(\vec{w}\) and \(b\) for logistic regression:

image-20230601180240713

The squared error cost function is not a good choice:

image-20230601180512304

Because it has many local minima and is not convex, unlike the smooth "soup bowl" cost surface from linear regression:

image-20230601183520684

2.3.2 Logistic Loss

Instead, we define the cost in terms of the logistic loss function:

\[J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^mL(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) \\ L(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) = \left\{ \begin{array}{ll} -\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right), & y^{(i)}=1\\ -\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right), & y^{(i)}=0 \end{array} \right. \]

To understand it intuitively:

image-20230601180951772 image-20230601181054174

We can see that the resulting cost curve is much better behaved, with a single global minimum:

image-20230601183849389
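As an illustrative sketch of the piecewise loss (my own code, assuming NumPy; the name `logistic_loss` is arbitrary):

```python
import numpy as np

def logistic_loss(f_i, y_i):
    """Piecewise logistic loss for a single example.

    f_i: model output f_{w,b}(x^(i)), a value in (0, 1)
    y_i: true label, 0 or 1
    """
    if y_i == 1:
        return -np.log(f_i)        # penalizes predictions far below 1
    return -np.log(1.0 - f_i)      # penalizes predictions far above 0
```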

2.3.3 Simplified Cost Function

We can write loss function as:

\[L(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) = - y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) - (1-y^{(i)})\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right) \]

The cost function could be simplified as:

\[\begin{aligned} J(\vec{w},b) &= \frac{1}{m}\sum_{i=1}^mL(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}) \\ &= -\frac{1}{m}\sum_{i=1}^m\left[ y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + (1-y^{(i)})\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right) \right] \end{aligned} \]

image-20230601190301470
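A minimal vectorized sketch of this simplified cost (assuming the `sigmoid` helper from the earlier sketch; not taken from the original notes):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Logistic regression cost J(w, b) using the simplified loss."""
    f = sigmoid(X @ w + b)                              # predicted probabilities, shape (m,)
    losses = -y * np.log(f) - (1 - y) * np.log(1 - f)   # per-example loss
    return losses.mean()                                # average over the m examples
```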

2.4 Gradient Descent

Find \(\vec{w}\) and \(b\) to minimize the cost function.

Given a new \(\vec{x}\), output

\[P(y=1|\vec{x};\vec{w},b)=f_{\vec{w},b}(\vec{x})=\frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}} \]

2.4.1 Implementation

Remarkably, the partial derivatives of the cost function have the same form as in linear regression; only the definition of \(f_{\vec{w},b}\) differs.

image-20230601193517144
Derivation

Here is the derivation:

\[\begin{aligned} \frac{\partial}{\partial{w_j}}J(\vec{w},b) &= \frac{\partial}{\partial{w_j}} \left\{-\frac{1}{m}\sum_{i=1}^m\left[ y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + (1-y^{(i)})\log\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right) \right]\right\}\\ &= -\frac{1}{m}\sum_{i=1}^m \left[ \frac{y^{(i)}}{f_{\vec{w},b}(\vec{x}^{(i)})} - \frac{1-y^{(i)}}{1-f_{\vec{w},b}(\vec{x}^{(i)})} \right] \frac{\partial f_{\vec{w},b}(\vec{x}^{(i)})}{\partial{w_j}}\\ &= -\frac{1}{m}\sum_{i=1}^m \frac{y^{(i)}-f_{\vec{w},b}(\vec{x}^{(i)})}{f_{\vec{w},b}(\vec{x}^{(i)})\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right)} \cdot \frac{e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}}{\left(1+e^{-(\vec{w}\cdot\vec{x}^{(i)}+b)}\right)^2}\, x_j^{(i)}\\ &= -\frac{1}{m}\sum_{i=1}^m \frac{y^{(i)}-f_{\vec{w},b}(\vec{x}^{(i)})}{f_{\vec{w},b}(\vec{x}^{(i)})\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right)} \cdot f_{\vec{w},b}(\vec{x}^{(i)})\left(1-f_{\vec{w},b}(\vec{x}^{(i)})\right) x_j^{(i)}\\ &= \frac{1}{m}\sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)} \end{aligned} \]

The key step from the third to the fourth line uses \(\frac{e^{-z}}{(1+e^{-z})^2}=g(z)\bigl(1-g(z)\bigr)\), i.e. \(\frac{\partial f_{\vec{w},b}}{\partial w_j}=f_{\vec{w},b}\,(1-f_{\vec{w},b})\,x_j^{(i)}\), after which the \(f_{\vec{w},b}(1-f_{\vec{w},b})\) factors cancel.

The derivative with respect to \(b\) is obtained in the same way:

image-20230601201808439
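One possible implementation of these updates as batch gradient descent (a sketch assuming the `sigmoid` helper above; the learning rate and iteration count are illustrative defaults):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for logistic regression.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    Returns the learned weight vector w (n,) and bias b.
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        f = sigmoid(X @ w + b)        # f_{w,b}(x^(i)) for all i, shape (m,)
        err = f - y                   # f_{w,b}(x^(i)) - y^(i)
        dj_dw = (X.T @ err) / m       # partial derivatives w.r.t. each w_j
        dj_db = err.mean()            # partial derivative w.r.t. b
        w = w - alpha * dj_dw         # simultaneous update
        b = b - alpha * dj_db
    return w, b
```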

3 Overfitting

3.1 Conception

underfitting

  • doesn't fit the training set well
  • high bias

just right

  • fits the training set pretty well
  • generalizes well to new examples

overfitting

  • fits the training set extremely well
  • high variance
image-20230601232411887 image-20230601233320171

3.2 Addressing Overfitting

Method 1: Collect more training examples

image-20230602090306951

Method 2: Feature Selection - Select features to include / exclude

image-20230602090443033

Method 3: Regularization

  • a gentler approach
  • keeps all the features, but prevents any single feature from having an overly large effect
image-20230602090753020

Whether or not \(b\) is regularized makes essentially no difference in practice.

4 Regularization

4.1 Conception

Modify the cost function by adding a penalty on the weights. Intuitively, the modified cost can only become small when the \(w_j\) themselves are small.

image-20230602101436430

So, set the cost function as:

\[J(\vec{w},b)=\frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}w_j^2 \]

  • \(\lambda\): regularization parameter, \(\lambda>0\)
  • \(n\): number of features (the penalty sums over the weights, not the examples)
  • \(b\) can be included in or excluded from the penalty
image-20230602102756757

Here you can see that if \(\lambda\) is too small (say \(0\)), the penalty disappears and the model overfits; and if \(\lambda\) is too large (say \(10^{10}\)), the weights are driven toward zero and the model underfits.

image-20230602103118714
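An illustrative sketch of the regularized squared-error cost (my own code; `lambda_` stands for \(\lambda\), and the bias \(b\) is left unregularized, matching the note above):

```python
def compute_cost_linear_reg(X, y, w, b, lambda_=1.0):
    """Regularized cost for linear regression: squared error plus (lambda / 2m) * sum(w_j^2)."""
    m = X.shape[0]
    err = X @ w + b - y                     # f_{w,b}(x^(i)) - y^(i)
    cost = (err @ err) / (2 * m)            # (1 / 2m) * sum of squared errors
    reg = (lambda_ / (2 * m)) * (w @ w)     # penalty on the weights only, not on b
    return cost + reg
```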

4.2 Regularized Linear Regression

Gradient descent

repeat {

\[\begin{aligned} w_j &:= w_j-\alpha\frac{\partial}{\partial{w_j}}J(\vec{w},b) = w_j-\alpha\left[ \frac{1}{m} \sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}w_j\right] \\ b &:= b-\alpha\frac{\partial}{\partial{b}}J(\vec{w},b) = b-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)\right] \end{aligned} \]

}

image-20230602123241498

Here, rewrite the \(w_j\) update:

\[\begin{aligned} w_j &= w_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}w_j\right] \\ &= w_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\right)x_j^{(i)} \end{aligned} \]

We can see that the second term is the usual gradient-descent update, while the factor \(\left(1-\alpha\frac{\lambda}{m}\right)\) in the first term shrinks \(w_j\) slightly on every iteration.
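A sketch of one regularized gradient-descent step for linear regression, following the update above (my own code; \(b\) is not regularized here):

```python
def gradient_step_linear_reg(X, y, w, b, alpha, lambda_):
    """One regularized gradient-descent step for linear regression."""
    m = X.shape[0]
    err = X @ w + b - y                          # f_{w,b}(x^(i)) - y^(i)
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w  # gradient w.r.t. w, with shrinkage term
    dj_db = err.mean()                           # gradient w.r.t. b (no regularization)
    return w - alpha * dj_dw, b - alpha * dj_db
```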

4.3 Regularized Logistic Regression

image-20230602125949866 image-20230602130147126
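The regularized logistic regression updates have exactly the same form; only \(f_{\vec{w},b}\) changes to the sigmoid of \(\vec{w}\cdot\vec{x}+b\). A minimal sketch, reusing the `sigmoid` helper from the earlier sketch:

```python
def gradient_step_logistic_reg(X, y, w, b, alpha, lambda_):
    """One regularized gradient-descent step for logistic regression."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y                 # f_{w,b}(x^(i)) - y^(i)
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w  # same form as linear regression
    dj_db = err.mean()
    return w - alpha * dj_dw, b - alpha * dj_db
```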