Machine Learning 【note_01】

Published: 2023-06-02 10:03:47  Author: —青_木—

Declaration (2023/06/02):

This note is the first note of a series of machine learning notes.

At present, the main learning resource is Andrew Ng's 2022 machine learning course from DeepLearning.AI, from which most of the knowledge and some of the pictures in these notes come.

For convenience, the video resource comes from 【啥都会一点的研究生】 on Bilibili, and the course materials are pulled from GitHub.

Here I would recommend watching the original video.


Linear Regression


1 Basic knowledge

1.1 machine learning

  • supervised learning

    • used most in real-world applications

    • most rapid advancement and innovation

  • unsupervised learning

  • reinforcement learning

  • practical advice for applying learning algorithms

1.2 supervised learning

  • Given pairs of input x and output y, learn the mapping from x to y, then compute y for a new x.
  • Learns from being given "right answers".
  • two kinds:
    1. regression:
      • predict a number
      • infinitely many possible outputs
    2. classification:
      • predict categories
      • small number of possible outputs

1.3 unsupervised learning

  • Find something interesting in unlabeled data.

  • The input data x comes without labels y; the algorithm has to find structure in the data on its own.

  • three kinds:

    1. Clustering: group similar data points together.
      • Examples: Google News, DNA micro-arrays, grouping customers
    2. Anomaly detection: find unusual data points.
    3. Dimensionality reduction: compress data using fewer numbers while losing as little information as possible.

2 Linear regression

2.1 Definition

Supervised learning: the model learns from data that has "right answers".

Regression model predicts numbers.

2.2 An example

The running example in the course is predicting house prices from house size: a straight line is fitted to the training data and used to estimate the price of a new house.

2.3 Terminology

  • Training set: Data used to train the model.
    • \(x\): "input" variable/feature
    • \(y\): "output/target" variable/feature
    • \(m\): number of training examples
    • \((x,y)\): single training example
    • \((x^{(i)},y^{(i)})\): \(i^{th}\) training example
  • Model
    • \(f\): model/function, takes a new input x and outputs the prediction for y
    • \(\hat{y}\): estimate/prediction for y
    • \(w\) and \(b\): parameters of the model, coefficients / weights and intercept
graph TD
A(training set)-->B(learning algorithm)-->C(function)
D(x)-->C-->E(y-hat)

This model is linear regression with one variable, also called univariate linear regression.
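As a minimal sketch of what \(f\) looks like in code (the parameter values and the input below are made up purely for illustration):

# A minimal sketch of the univariate model; the parameter values are made up.
w = 200.0
b = 100.0

def predict(x, w, b):
    """Prediction of univariate linear regression: y_hat = w*x + b."""
    return w * x + b

x_new = 1.2                    # a new input feature value
y_hat = predict(x_new, w, b)   # the model's estimate of the target
print(f"y_hat = {y_hat}")      # prints: y_hat = 340.0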


3 Cost Function

3.1 Definition

Cost function: the squared error cost function, defined below.

By convention, the cost function \(J(w,b)\) carries an extra factor of \(1/2\), which makes the derivative a little cleaner:

\[J(w,b)=\frac{1}{2m}\sum_{i=1}^m{({\hat{y}}^{(i)}-{y}^{(i)})^2} \]

The purpose of defining the cost function is to minimize \(J(w,b)\).
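A minimal NumPy sketch of this cost function (the training data and parameter values are toy numbers, not taken from the course labs):

import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = (1/2m) * sum((w*x + b - y)^2)."""
    m = x.shape[0]
    y_hat = w * x + b                      # predictions for all m examples
    return np.sum((y_hat - y) ** 2) / (2 * m)

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([300.0, 500.0, 700.0])
print(compute_cost(x_train, y_train, w=200.0, b=100.0))  # 0.0: a perfect fit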

3.2 Visualization

To better understand the cost function, we can visualize it as follows:

3.2.1 J(w)

Set \(b\) to zero and plot \(J(w)\) against \(w\): for the squared error cost this gives a bowl-shaped curve with a single minimum.

3.2.2 J(w,b)

Plotting \(J(w,b)\) over both parameters gives a 3D bowl-shaped surface, which is often shown instead as a contour plot; each contour connects points with equal cost.

4 Gradient Descent

4.1 Concept

goal: minimize \(J(w_1,w_2,...,w_n,b)\)

outline:

  • start with some \(w\) and \(b\)
  • keep changing \(w\), \(b\) to reduce \(J(w_1,w_2,...,w_n,b)\)
  • until it settles at or near a minimum

For a general cost function there may be more than one possible minimum: depending on where it starts, gradient descent can settle into different local minima.

4.2 Implementation

gradient descent algorithm

​ repeat until convergence {

\[w:=w-\alpha\frac{\partial}{\partial{w}}J(w,b) \\ b:=b-\alpha\frac{\partial}{\partial{b}}J(w,b) \]

}

Here, \(\alpha\) is called the learning rate, and \(\frac{\partial}{\partial{w}}J(w,b)\) is the partial derivative of the cost with respect to \(w\).

Simultaneously update \(w\) and \(b\).
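A sketch of one simultaneous update step; the two gradient values below are placeholders standing in for the partial derivative terms evaluated at the current \((w,b)\):

w, b, alpha = 0.0, 0.0, 0.01
dj_dw, dj_db = -650.0, -190.0   # illustrative values of dJ/dw and dJ/db

tmp_w = w - alpha * dj_dw       # both derivatives use the *old* w and b
tmp_b = b - alpha * dj_db
w, b = tmp_w, tmp_b             # only now are the parameters overwritten
print(w, b)                     # 6.5 1.9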


4.3 Understanding

When the gradient descent algorithm runs, \(w\) moves downhill: if the derivative is positive, \(w\) decreases, and if it is negative, \(w\) increases.

If \(\alpha\) is too small, gradient descent may be slow.

If \(\alpha\) is too large, gradient descent may:

  • overshoot, never reach minimum
  • fail to converge, diverge

Gradient descent can reach a local minimum even with a fixed learning rate, because the derivative, and therefore the step size, shrinks as it approaches the minimum.

When it has already reached a minimum, gradient descent leaves \(w\) unchanged, since the derivative there is zero.

4.4 For linear regression

Linear regression model: \(f_{w,b}(x)=wx+b\)

Cost function: \(J(w,b)=\frac{1}{2m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})^2}\)

gradient descent algorithm

​ repeat until convergence {

\[w:=w-\alpha\frac{\partial}{\partial{w}}J(w,b) \\ b:=b-\alpha\frac{\partial}{\partial{b}}J(w,b) \]

}

Here,

\[\frac{\partial}{\partial{w}}J(w,b)=\frac{1}{m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})x^{(i)}} \\ \frac{\partial}{\partial{b}}J(w,b)=\frac{1}{m}\sum_{i=1}^m{(f_{w,b}(x^{(i)})-{y}^{(i)})} \]
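Putting the pieces together, here is a minimal batch gradient descent sketch for univariate linear regression (the function name, toy data, and hyperparameters are my own illustration, not the course lab code):

import numpy as np

def compute_gradient(x, y, w, b):
    """Return dJ/dw and dJ/db of the squared error cost."""
    err = (w * x + b) - y           # f_{w,b}(x^(i)) - y^(i) for all examples
    return np.mean(err * x), np.mean(err)

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([250.0, 480.0, 730.0, 990.0])

w, b, alpha = 0.0, 0.0, 0.05
for _ in range(10000):                  # "repeat until convergence"
    dj_dw, dj_db = compute_gradient(x_train, y_train, w, b)
    w = w - alpha * dj_dw               # simultaneous update: both gradients
    b = b - alpha * dj_db               # were computed before either changed
print(f"w = {w:.2f}, b = {b:.2f}")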


Linear regression's squared error cost function is convex (bowl-shaped), so it always has exactly one global minimum.

4.5 Running

See how gradient descent runs on an example: with each step, \((w,b)\) moves closer to the minimum of the cost and the fitted line matches the training data better.

"Batch" gradient descent

  • "Batch": each step of gradient descent uses all the training examples (instead of just a subset of the training data).

5 Multiple Linear Regression

5.1 Concept


multiple features (variables)

  • \(x_j\): \(j^{th}\) feature
  • \(n\): number of features
  • \({\vec{x}}^{(i)}\): features of \(i^{th}\) training example
  • \({x_j}^{(i)}\): value of feature \(j\) in \(i^{th}\) training example

Model: \(f_{w,b}(x)=w_1x_1+w_2x_2+...+w_nx_n+b\)

  • \(\vec{w}=[w_1,w_2,...,w_n]\)
  • \(b\)
  • \(\vec{x}=[x_1,x_2,...,x_n]\)

So it can be written as:

\[f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b \]

5.2 Vectorization

In Python, we use NumPy to compute with vectorized parameters and features:

  • Runs faster
  • Shorter code
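For a single example, the whole model then becomes one dot product (a sketch with made-up numbers):

import numpy as np

w = np.array([0.1, 4.0, 10.0, -2.0])    # illustrative parameter vector
b = 80.0
x = np.array([1416.0, 3.0, 2.0, 40.0])  # one training example's features

f_wb = np.dot(w, x) + b                 # f_{w,b}(x) = w . x + b
print(f_wb)                             # approximately 173.6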

Compare running with and without vectorization:

  • The hardware can operate on many values in parallel, which is far more efficient for large-scale datasets.

The same applies in gradient descent: the updates for all \(n\) weights can be performed as a single vector operation instead of a loop over \(w_1,\dots,w_n\).

Try in Python code:

import time
import numpy as np

def my_dot(a, b):
    """Dot product computed with an explicit Python loop, for comparison."""
    result = 0.0
    for i in range(a.shape[0]):
        result += a[i] * b[i]
    return result

np.random.seed(1)
a = np.random.rand(10000000)  # very large arrays
b = np.random.rand(10000000)

tic = time.time()  # capture start time
c = np.dot(a, b)
toc = time.time()  # capture end time

print(f"np.dot(a, b) =  {c:.4f}")
print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms ")

tic = time.time()  # capture start time
c = my_dot(a, b)
toc = time.time()  # capture end time

print(f"my_dot(a, b) =  {c:.4f}")
print(f"loop version duration: {1000*(toc-tic):.4f} ms ")

del(a); del(b)  # remove these big arrays from memory

Output:

np.dot(a, b) =  2501072.5817
Vectorized version duration: 10.9415 ms 
my_dot(a, b) =  2501072.5817
loop version duration: 2575.1283 ms 

5.3 Gradient Descent

Model: \(f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b\)

Cost function: \(J(\vec{w},b)\)


To calculate the partial derivative terms for each parameter:
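Written out, the partial derivative terms take the same form as in the one-variable case, with one term per weight \(w_j\):

\[\frac{\partial}{\partial{w_j}}J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^m{(f_{\vec{w},b}({\vec{x}}^{(i)})-{y}^{(i)}){x_j}^{(i)}} \\ \frac{\partial}{\partial{b}}J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^m{(f_{\vec{w},b}({\vec{x}}^{(i)})-{y}^{(i)})} \]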

There's an alternative to gradient descent, which is called Normal Equation:

  • Only for linear regression.

  • Solve for \(w\), \(b\) without iterations.

  • Disadvantages:

    • Does not generalize to other learning algorithms (like Logistic Regression or other subsequent machine learning methods).
    • Slow when number of features is large (>10,000).
  • It may be used in ML libraries that implement linear regression; however, gradient descent is the recommended method for finding parameters \(w\), \(b\).
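As a rough sketch of what the normal equation computes (generic NumPy on toy data; appending a column of ones so that \(b\) is solved together with \(\vec{w}\) is my own illustration, not a particular library's implementation):

import numpy as np

X = np.array([[1.0, 2.0],               # m = 4 examples, n = 2 features
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([250.0, 300.0, 480.0, 700.0])

X_b = np.c_[X, np.ones(X.shape[0])]     # append a column of ones so b is included
theta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y   # closed-form least squares
w, b = theta[:-1], theta[-1]
print("w =", w, " b =", b)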

5.4 Important things

5.4.1 Feature Scaling

5.4.1.1 Why

An example: two features with very different ranges of values, such as a house's size in square feet versus its number of bedrooms.

When the scales of different features vary greatly, the contours of the cost function become elongated, and gradient descent can bounce back and forth and converge slowly.

So we need to rescale the features so that they take on comparable ranges of values.
5.4.1.2 How

Simply divide by the maximum value: \(x_{scaled}=\frac{x}{x_{max}}\)

Mean normalization: \(x_{scaled}=\frac{x-\mu}{x_{max}-x_{min}}\)

Z-score normalization: \(x_{scaled}=\frac{x-\mu}{\sigma}\), where \(\mu\) is the feature's mean and \(\sigma\) its standard deviation.

Rescale a feature when its values are too large or too small.
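A minimal z-score normalization sketch (toy feature matrix; in practice the same \(\mu\) and \(\sigma\) computed on the training set are reused to scale any new inputs):

import numpy as np

X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [ 852.0, 2.0]])        # two features with very different scales

mu = X.mean(axis=0)                  # per-feature mean
sigma = X.std(axis=0)                # per-feature standard deviation
X_norm = (X - mu) / sigma            # each column now has mean 0 and std 1
print(X_norm)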

5.4.2 Convergence check

Check by the curve:

  • The cost should decrease after every iteration.
    • If the cost curve rises, it means the learning rate \(\alpha\) chosen is too big.
  • The cost likely converged by 400 iterations.
    • It is difficult to know in advance the number of iterations required.
image-20230601145155098

Or, we can use an automatic convergence test.

  • Set a threshold \(\epsilon\) ("epsilon"), for example \(10^{-3}\).
  • If the cost decreases by \(\leq\epsilon\) in one iteration, declare convergence.
  • It is hard to choose a proper \(\epsilon\), so it is usually better to judge convergence from the learning curve than to rely only on this test (a sketch of the test follows).
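A sketch of that automatic test, assuming J_history holds the cost recorded at every iteration (the numbers are illustrative):

epsilon = 1e-3
J_history = [10.5, 4.2, 2.1, 1.4, 1.1, 1.0995]   # illustrative recorded costs

for i in range(1, len(J_history)):
    if J_history[i - 1] - J_history[i] <= epsilon:
        print(f"Convergence declared at iteration {i}")
        break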

5.4.3 Choosing the Learning Rate

Make sure that the cost decreases after every iteration; if it ever increases, \(\alpha\) is too large.

There's a strategy:

  • First choose a very small \(\alpha\) and check that the cost decreases after every iteration.
  • Then gradually increase \(\alpha\) and watch how the cost curve changes (see the sweep sketched below).
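A self-contained sketch of that sweep on toy data (the candidate values of \(\alpha\) and the data are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([250.0, 480.0, 730.0, 990.0])

for alpha in [0.001, 0.01, 0.1, 1.0]:           # candidate learning rates
    w, b = 0.0, 0.0
    costs = []
    for _ in range(100):
        err = (w * x + b) - y
        costs.append(np.mean(err ** 2) / 2)     # J(w,b) before this update
        w = w - alpha * np.mean(err * x)
        b = b - alpha * np.mean(err)
    rising = any(c2 > c1 for c1, c2 in zip(costs, costs[1:]))
    print(f"alpha = {alpha}: {'cost rises -> too large' if rising else 'cost keeps decreasing'}")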

5.4.4 Feature Engineering

Use intuition to design new features by transforming or combining the original features, as in the sketch below.
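For instance, if two of the original features were a lot's frontage and depth, an area feature could be engineered from their product (a sketch with toy values):

import numpy as np

frontage = np.array([60.0, 40.0, 80.0])   # x1: lot width
depth    = np.array([30.0, 50.0, 25.0])   # x2: lot depth
area = frontage * depth                   # engineered feature x3 = x1 * x2

X = np.c_[frontage, depth, area]          # the model can now use all three
print(X)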

6 Polynomial Regression

With polynomial features such as \(x\), \(x^2\), and \(x^3\), feature scaling is quite important, since the powers of \(x\) take on very different ranges of values.

How should we choose between different features and different models?

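A sketch of building polynomial features by hand before fitting a linear model; the raised powers are exactly why feature scaling matters so much here (the data and names are illustrative):

import numpy as np

x = np.arange(1.0, 21.0)                  # original single feature
X_poly = np.c_[x, x**2, x**3]             # engineered features x, x^2, x^3

mu, sigma = X_poly.mean(axis=0), X_poly.std(axis=0)
X_scaled = (X_poly - mu) / sigma          # z-score scale each column

print(X_poly.max(axis=0) - X_poly.min(axis=0))      # ranges differ enormously
print(X_scaled.max(axis=0) - X_scaled.min(axis=0))  # comparable after scaling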