activation functions: summary and comparison

foreword

Any nonlinear function with well-behaved derivatives can, in principle, serve as an activation function. Here we compare only a few classic ones.

summary

name	formula	properties	advantages	disadvantages	usage
sigmoid	1 / (1 + e^(-x))	output in (0, 1); gradient in (0, 0.25]	smooth and easy to differentiate; no abrupt jumps in output	vanishing gradients; costly to compute, slow	binary classification
tanh	(e^x - e^(-x)) / (e^x + e^(-x))	output in (-1, 1); gradient in (0, 1]	zero-centered output (mean 0); easy to compute; smooth	vanishing gradients; costly to compute, slow	RNN
softmax	e^(x_i) / sum_j e^(x_j)	each output in (0, 1); Jacobian entries in [-0.25, 0.25]	converts outputs into a probability distribution (all classes sum to 1); smooth	vanishing gradients; costly to compute, slow	multi-class classification
ReLU	max(0, x)	output in [0, +inf); gradient is 0 or 1	mitigates vanishing gradients; fast convergence	dead ReLU problem	AlexNet, VGG, modern LeNet-5 reimplementations
ReLU6	min(max(0, x), 6)	output limited to [0, 6]	high efficiency	limited output range reduces expressiveness	mobile devices
SwiGLU	Swish(xW) * (xV), elementwise product	(not given)	improves performance over Swish, GLU, and similar variants, including in vision tasks; dynamic gating mechanism	unknown	PaLM (Google), Llama 2 (Meta)
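
The formulas and ranges in the table can be checked with a small pure-Python sketch. It is a minimal illustration, not a reference implementation: all function names are made up for this example, it works on plain floats instead of tensors, and the SwiGLU version uses hypothetical scalar weights w and v (with Swish at beta = 1) where a real model would use weight matrices.

```python
import math


def sigmoid(x):
    # Numerically stable form: never calls exp() on a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x):
    # d/dx sigmoid(x) = s * (1 - s); the maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; the maximum is 1 at x = 0.
    t = math.tanh(x)
    return 1.0 - t * t


def softmax(xs):
    # Subtract the max before exponentiating to avoid overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 otherwise (0 chosen at x = 0).
    return 1.0 if x > 0 else 0.0


def relu6(x):
    # ReLU clipped at 6.
    return min(max(0.0, x), 6.0)


def swish(x):
    # Swish with beta = 1 (also known as SiLU).
    return x * sigmoid(x)


def swiglu(x, w, v):
    # Scalar stand-in for Swish(xW) * (xV); w and v are hypothetical weights.
    return swish(x * w) * (x * v)


if __name__ == "__main__":
    for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
        print(x, sigmoid_grad(x), tanh_grad(x), relu_grad(x), relu6(x))
    print(softmax([1.0, 2.0, 3.0]))
    print(swiglu(1.5, 0.8, -0.3))
```

Running the loop shows the vanishing-gradient behavior listed in the table: the sigmoid and tanh gradients shrink toward 0 as |x| grows, while the ReLU gradient stays at 1 for every positive input.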