Machine Learning 【note_01】

Published: 2023-06-02 10:03:47  Author: —青_木—

Declaration (2023/06/02):

This note is the first note of a series of machine learning notes.

At present, the main learning resource is Andrew Ng's 2022 machine learning course from DeepLearning.AI, from which most of the knowledge and some of the pictures in these notes come.

For convenience, the video resource comes from 【啥都会一点的研究生】 on Bilibili, and the course materials are pulled from GitHub.

Here I would recommend watching the original video.


Linear Regression


1 Basic knowledge

1.1 machine learning

  • supervised learning

    • used most in real-world applications

    • most rapid advancement and innovation

  • unsupervised learning

  • reinforcement learning

  • practical advice for applying learning algorithms

1.2 supervised learning

  • Given pairs of input x and output y, learn the mapping from x to y, then compute y for a new x.
  • Learns from being given "right answers".
  • two kinds:
    1. regression:
      • predict a number
      • infinitely many possible outputs
    2. classification:
      • predict categories
      • small number of possible outputs

1.3 unsupervised learning

  • Find something interesting in unlabeled data.

  • The input data x comes without labels y; the algorithm has to find structure in the data on its own.

  • three kinds:

    1. Clustering: group similar data points together.
      • Examples: Google News, DNA micro-arrays, grouping customers
    2. Anomaly detection: find unusual data points.
    3. Dimensionality reduction: compress data using fewer numbers while losing as little information as possible.

2 Linear regression

2.1 Definition

Supervised learning: the model learns from data that has "right answers".

Regression model predicts numbers.

2.2 An example

The running example in the course is predicting house prices from house size: a straight line is fitted to the training data and used to estimate the price of a new house.

2.3 Terminology

  • Training set: Data used to train the model.
    • \(x\): "input" variable/feature
    • \(y\): "output/target" variable/feature
    • \(m\): number of training examples
    • \((x,y)\): single training example
    • \((x^{(i)},y^{(i)})\): \(i^{th}\) training example
  • Model
    • \(f\): model/function, takes a new input x and outputs the prediction for y
    • \(\hat{y}\): estimate/prediction for y
    • \(w\) and \(b\): parameters of the model, coefficients / weights and intercept
graph TD
A(training set)-->B(learning algorithm)-->C(function)
D(x)-->C-->E(y-hat)

This model is linear regression with one variable, also called univariate linear regression.
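As a minimal sketch of what \(f\) looks like in code (the parameter values and the input below are made up purely for illustration):

# A minimal sketch of the univariate model; the parameter values are made up.
w = 200.0
b = 100.0

def predict(x, w, b):
    """Prediction of univariate linear regression: y_hat = w*x + b."""
    return w * x + b

x_new = 1.2                    # a new input feature value
y_hat = predict(x_new, w, b)   # the model's estimate of the target
print(f"y_hat = {y_hat}")      # prints: y_hat = 340.0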


3 Cost Function

3.1 Definition

Cost function: the squared error cost function, defined below.

By convention, the cost function \(J(w,b)\) carries an extra factor of \(1/2\), which makes the derivative a little cleaner:

\[J(w,b)=\frac{1}{2m}\sum_{i=1}^m{({\hat{y}}^{(i)}-{y}^{(i)})^2} \]

The purpose of defining the cost function is to minimize \(J(w,b)\).
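A minimal NumPy sketch of this cost function (the training data and parameter values are toy numbers, not taken from the course labs):

import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = (1/2m) * sum((w*x + b - y)^2)."""
    m = x.shape[0]
    y_hat = w * x + b                      # predictions for all m examples
    return np.sum((y_hat - y) ** 2) / (2 * m)

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0: a perfect fit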

3.2 Visualization

To better understand the cost function, we can visualize it as follows:

3.2.1 J(w)

Set \(b\) to zero and plot \(J(w)\) against \(w\): for the squared error cost this gives a bowl-shaped curve with a single minimum.

3.2.2 J(w,b)

Plotting \(J(w,b)\) over both parameters gives a 3D bowl-shaped surface, which is often shown instead as a contour plot; each contour connects points with equal cost.

4 Gradient Descent

4.1 Concept

goal: minimize \(J(w_1,w_2,...,w_n,b)\)

outline:

  • start with some \(w\) and \(b\)
  • keep changing \(w\), \(b\) to reduce \(J(w_1,w_2,...,w_n,b)\)
  • until it settles at or near a minimum

For a general cost function there may be more than one possible minimum: depending on where it starts, gradient descent can settle into different local minima.

4.2 Implementation

gradient descent algorithm

​ repeat until convergence {

\[w:=w-\alpha\frac{\partial}{\partial{w}}J(w,b) \\ b:=b-\alpha\frac{\partial}{\partial{b}}J(w,b) \]

}

Here, \(\alpha\) is called the learning rate, and \(\frac{\partial}{\partial{w}}J(w,b)\) is the partial derivative of the cost with respect to \(w\).

Simultaneously update \(w\) and \(b\).
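A sketch of one simultaneous update step; the two gradient values below are placeholders standing in for the partial derivative terms evaluated at the current \((w,b)\):

w, b, alpha = 0.0, 0.0, 0.01
dj_dw, dj_db = -650.0, -190.0   # illustrative values of dJ/dw and dJ/db

tmp_w = w - alpha * dj_dw       # both derivatives use the *old* w and b
tmp_b = b - alpha * dj_db
w, b = tmp_w, tmp_b             # only now are the parameters overwritten
print(w, b)                     # 6.5 1.9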


4.3 Understanding

When the gradient descent algorithm runs, \(w\) moves downhill: if the derivative is positive, \(w\) decreases, and if it is negative, \(w\) increases.

If \(\alpha\) is too small, gradient descent may be slow.

If \(\alpha\) is too large, gradient descent may:

  • overshoot, never reach minimum
  • fail to converge, diverge

Gradient descent can reach a local minimum even with a fixed learning rate, because the derivative, and therefore the step size, shrinks as it approaches the minimum.

When it has already reached a minimum, gradient descent leaves \(w\) unchanged, since the derivative there is zero.

4.4 For linear regression

Linear regression model: \(f_{w,b}(x)=wx+b\)

Cost function: \(J(w,b)=\frac{1}{2m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})^2}\)

gradient descent algorithm

​ repeat until convergence {

\[w:=w-\alpha\frac{\partial}{\partial{w}}J(w,b) \\ b:=b-\alpha\frac{\partial}{\partial{b}}J(w,b) \]

}

Here,

\[\frac{\partial}{\partial{w}}J(w,b)=\frac{1}{m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})x^{(i)}} \\ \frac{\partial}{\partial{b}}J(w,b)=\frac{1}{m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})} \]
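Putting the pieces together, here is a minimal batch gradient descent sketch for univariate linear regression (the function name, toy data, and hyperparameters are my own illustration, not the course lab code):

import numpy as np

def compute_gradient(x, y, w, b):
    """Return dJ/dw and dJ/db of the squared error cost."""
    err = (w * x + b) - y           # f_{w,b}(x^(i)) - y^(i) for all examples
    return np.mean(err * x), np.mean(err)

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([250.0, 480.0, 730.0, 990.0])

w, b, alpha = 0.0, 0.0, 0.05
for _ in range(10000):                  # "repeat until convergence"
    dj_dw, dj_db = compute_gradient(x_train, y_train, w, b)
    w = w - alpha * dj_dw               # simultaneous update: both gradients
    b = b - alpha * dj_db               # were computed before either changed
print(f"w = {w:.2f}, b = {b:.2f}")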


Linear regression's squared error cost function is convex (bowl-shaped), so it always has exactly one global minimum.

4.5 Running

See how gradient descent runs on an example: with each step, \((w,b)\) moves closer to the minimum of the cost and the fitted line matches the training data better.

"Batch" gradient descent

  • "Batch": each step of gradient descent uses all the training examples (instead of just a subset of the training data).

5 Multiple Linear Regression

5.1 Concept


multiple features (variables)

  • \(x_j\): \(j^{th}\) feature
  • \(n\): number of features
  • \({\vec{x}}^{(i)}\): features of \(i^{th}\) training example
  • \({x_j}^{(i)}\): value of feature \(j\) in \(i^{th}\) training example

Model: \(f_{w,b}(x)=w_1x_1+w_2x_2+...+w_nx_n+b\)

  • \(\vec{w}=[w_1,w_2,...,w_n]\)
  • \(b\)
  • \(\vec{x}=[x_1,x_2,...,x_n]\)

So it can be written as:

\[f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b \]

5.2 Vectorization

In Python, we use NumPy to compute with vectorized parameters and features:

  • Runs faster
  • Shorter code
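For a single example, the whole model then becomes one dot product (a sketch with made-up numbers):

import numpy as np

w = np.array([0.1, 4.0, 10.0, -2.0])    # illustrative parameter vector
b = 80.0
x = np.array([1416.0, 3.0, 2.0, 40.0])  # one training example's features

f_wb = np.dot(w, x) + b                 # f_{w,b}(x) = w . x + b
print(f_wb)                             # approximately 173.6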

Compare running with and without vectorization:

  • The hardware can operate on many values in parallel, which is far more efficient for large-scale datasets.

The same applies in gradient descent: the updates for all \(n\) weights can be performed as a single vector operation instead of a loop over \(w_1,\dots,w_n\).

Try in Python code:

import time
import numpy as np

def my_dot(a, b):
    """Dot product computed with an explicit Python loop, for comparison."""
    result = 0.0
    for i in range(a.shape[0]):
        result += a[i] * b[i]
    return result

np.random.seed(1)
a = np.random.rand(10000000)  # very large arrays
b = np.random.rand(10000000)

tic = time.time()  # capture start time
c = np.dot(a, b)
toc = time.time()  # capture end time

print(f"np.dot(a, b) =  {c:.4f}")
print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms ")

tic = time.time()  # capture start time
c = my_dot(a, b)
toc = time.time()  # capture end time

print(f"my_dot(a, b) =  {c:.4f}")
print(f"loop version duration: {1000*(toc-tic):.4f} ms ")

del(a); del(b)  # remove these big arrays from memory

Output:

np.dot(a, b) =  2501072.5817
Vectorized version duration: 10.9415 ms 
my_dot(a, b) =  2501072.5817
loop version duration: 2575.1283 ms 

5.3 Gradient Descent

Model: \(f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b\)

Cost function: \(J(\vec{w},b)\)


To calculate the partial derivative terms for each parameter:
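Written out, the partial derivative terms take the same form as in the one-variable case, with one term per weight \(w_j\):

\[\frac{\partial}{\partial{w_j}}J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^m{(f_{\vec{w},b}({\vec{x}}^{(i)})-{y}^{(i)}){x_j}^{(i)}} \\ \frac{\partial}{\partial{b}}J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^m{(f_{\vec{w},b}({\vec{x}}^{(i)})-{y}^{(i)})} \]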

There's an alternative to gradient descent, which is called Normal Equation:

  • Only for linear regression.

  • Solve for \(w\), \(b\) without iterations.

  • Disadvantages:

    • Does not generalize to other learning algorithms (like Logistic Regression or other subsequent machine learning methods).
    • Slow when number of features is large (>10,000).
  • It may be used in ML libraries that implement linear regression; however, gradient descent is the recommended method for finding parameters \(w\), \(b\).
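As a rough sketch of what the normal equation computes (generic NumPy on toy data; appending a column of ones so that \(b\) is solved together with \(\vec{w}\) is my own illustration, not a particular library's implementation):

import numpy as np

X = np.array([[1.0, 2.0],               # m = 4 examples, n = 2 features
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([250.0, 300.0, 480.0, 700.0])

X_b = np.c_[X, np.ones(X.shape[0])]     # append a column of ones so b is included
theta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y   # closed-form least squares
w, b = theta[:-1], theta[-1]
print("w =", w, " b =", b)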

5.4 Important things

5.4.1 Feature Scaling

5.4.1.1 Why

An example: two features with very different ranges of values, such as a house's size in square feet versus its number of bedrooms.

When the scales of different features vary greatly, the contours of the cost function become elongated, and gradient descent can bounce back and forth and converge slowly.

So we need to rescale the features so that they take on comparable ranges of values.
5.4.1.2 How

Simply divide by the maximum value: \(x_{scaled}=\frac{x}{x_{max}}\)

Mean normalization: \(x_{scaled}=\frac{x-\mu}{x_{max}-x_{min}}\)

Z-score normalization: \(x_{scaled}=\frac{x-\mu}{\sigma}\), where \(\mu\) is the feature's mean and \(\sigma\) its standard deviation.

Rescale a feature when its values are too large or too small.
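A minimal z-score normalization sketch (toy feature matrix; in practice the same \(\mu\) and \(\sigma\) computed on the training set are reused to scale any new inputs):

import numpy as np

X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [ 852.0, 2.0]])        # two features with very different scales

mu = X.mean(axis=0)                  # per-feature mean
sigma = X.std(axis=0)                # per-feature standard deviation
X_norm = (X - mu) / sigma            # each column now has mean 0 and std 1
print(X_norm)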

5.4.2 Convergence check

Check by the curve:

  • The cost should decrease after every iteration.
    • If the cost curve rises, it means the learning rate \(\alpha\) chosen is too big.
  • The cost likely converged by 400 iterations.
    • It is difficult to know in advance the number of iterations required.
image-20230601145155098

Or, we can use an automatic convergence test.

  • Set a threshold \(\epsilon\) ("epsilon"), for example \(10^{-3}\).
  • If the cost decreases by \(\leq\epsilon\) in one iteration, declare convergence.
  • It is hard to choose a proper \(\epsilon\), so it is usually better to judge convergence from the learning curve than to rely only on this test (a sketch of the test follows).
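A sketch of that automatic test, assuming J_history holds the cost recorded at every iteration (the numbers are illustrative):

epsilon = 1e-3
J_history = [10.5, 4.2, 2.1, 1.4, 1.1, 1.0995]   # illustrative recorded costs

for i in range(1, len(J_history)):
    if J_history[i - 1] - J_history[i] <= epsilon:
        print(f"Convergence declared at iteration {i}")
        break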

5.4.3 Choosing the Learning Rate

Make sure that the cost decreases after every iteration; if it ever increases, \(\alpha\) is too large.

There's a strategy:

  • First choose a very small \(\alpha\) and check that the cost decreases after every iteration.
  • Then gradually increase \(\alpha\) and watch how the cost curve changes (see the sweep sketched below).
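A self-contained sketch of that sweep on toy data (the candidate values of \(\alpha\) and the data are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([250.0, 480.0, 730.0, 990.0])

for alpha in [0.001, 0.01, 0.1, 1.0]:           # candidate learning rates
    w, b = 0.0, 0.0
    costs = []
    for _ in range(100):
        err = (w * x + b) - y
        costs.append(np.mean(err ** 2) / 2)     # J(w,b) before this update
        w = w - alpha * np.mean(err * x)
        b = b - alpha * np.mean(err)
    rising = any(c2 > c1 for c1, c2 in zip(costs, costs[1:]))
    print(f"alpha = {alpha}: {'cost rises -> too large' if rising else 'cost keeps decreasing'}")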

5.4.4 Feature Engineering

Use intuition to design new features by transforming or combining the original features, as in the sketch below.
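For instance, if two of the original features were a lot's frontage and depth, an area feature could be engineered from their product (a sketch with toy values):

import numpy as np

frontage = np.array([60.0, 40.0, 80.0])   # x1: lot width
depth    = np.array([30.0, 50.0, 25.0])   # x2: lot depth
area = frontage * depth                   # engineered feature x3 = x1 * x2

X = np.c_[frontage, depth, area]          # the model can now use all three
print(X)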

6 Polynomial Regression

With polynomial features such as \(x\), \(x^2\), and \(x^3\), feature scaling is quite important, since the powers of \(x\) take on very different ranges of values.

How should we choose between different features and different models?

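A sketch of building polynomial features by hand before fitting a linear model; the raised powers are exactly why feature scaling matters so much here (the data and names are illustrative):

import numpy as np

x = np.arange(1.0, 21.0)                  # original single feature
X_poly = np.c_[x, x**2, x**3]             # engineered features x, x^2, x^3

mu, sigma = X_poly.mean(axis=0), X_poly.std(axis=0)
X_scaled = (X_poly - mu) / sigma          # z-score scale each column

print(X_poly.max(axis=0) - X_poly.min(axis=0))      # ranges differ enormously
print(X_scaled.max(axis=0) - X_scaled.min(axis=0))  # comparable after scaling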