From Coursera, Visual Perception for Self-Driving Cars by University of Toronto

https://www.coursera.org/learn/visual-perception-self-driving-cars

## Feedforward Neural Networks

### Feed Forward Neural Networks

A Feedforward Neural Network defines a mapping from input $x$ to output $y$ as $y=f(x;\theta)$, where $f$ is a composition of functions $f^{(1)}, \dots, f^{(N)}$

- We define:
- $x$ is called the input layer
- The final function $f^{(N)}$ is called the output layer
- The functions $f^{(1)}$ to $f^{(N-1)}$ are called the hidden layers

- Functions to estimate:
- Object Classification: Image → Label
- Object Detection: Image → Label+Location
- Depth Estimation: Image → Depth for every pixel
- Semantic Segmentation: Image → Label for every pixel

- Mode of Action of Neural Networks
- Training: Give the neural network examples of $f^*(x)$ for a wide variation of the input $x$, then optimize its parameters $\theta$ so that $f(x;\theta) \approx f^\ast(x)$
- Pairs of $x$ and $f^*(x)$ are called training data
- Only the output is specified by the training data! The network is free to do anything with its hidden layers

- Hidden Units: $h_n = g(W^Th_{n-1} + b)$
- Activation function $g$
- Input $h_{n-1}$
- Weight matrix $W$
- Bias $b$
- Parameters $\theta$ are the weights and biases of all the layers of the network
- The transformed input $W^Th_{n-1} + b$ is passed through the activation function $g$

- The Rectified Linear Unit (ReLU)
- The ReLU is currently the default choice of activation function for Feedforward Neural Networks: $g(z)=\max(0, z)$
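The hidden-unit computation $h_n = g(W^T h_{n-1} + b)$ with a ReLU activation can be sketched in NumPy; the weights, bias, and input below are arbitrary illustrative values:

```python
import numpy as np

def relu(z):
    # ReLU activation: g(z) = max(0, z), applied element-wise
    return np.maximum(0.0, z)

def hidden_layer(h_prev, W, b):
    # One hidden layer: h_n = g(W^T h_{n-1} + b)
    return relu(W.T @ h_prev + b)

# Toy example: 3 inputs -> 2 hidden units (values chosen arbitrarily)
W = np.array([[ 1.0, -1.0],
              [ 0.5,  0.5],
              [-2.0,  1.0]])   # shape (3, 2)
b = np.array([0.0, -1.0])
h0 = np.array([1.0, 2.0, 0.5])

h1 = hidden_layer(h0, W, b)    # second unit is zeroed by the ReLU
```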

### Output Layers and Loss Functions

**General process of designing machine learning algorithm**

- Inference:
- a feed-forward neural network takes an input $x$, passes it through a sequence of hidden layers, then passes the output of the hidden layers through an output layer.

- Training:
- pass the predicted output through the loss function, then use an optimization procedure to produce a new set of parameters $\theta$ that provides a lower value of the loss function.

- Tasks in self-driving: Classification and Regression
- Classification: Given input x map it to one of k classes or categories.
- Image classification, semantic segmentation

- Regression: Given input x map it to a real number:
- Depth prediction, bounding box estimation


**Loss functions in different tasks**

- Classification: Softmax Output Layers
- Softmax output layers are most often used as the output of a classifier, to represent the probability distribution over K different classes
- The Softmax output layer is comprised of:
- A linear transformation: $z=W^Th+b$

- Followed by the Softmax function: $$\text{Softmax}(z_i)= \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$
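A minimal NumPy sketch of the softmax function; subtracting the maximum logit is a standard numerical-stability trick not mentioned above, but it leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    # Subtract max(z) before exponentiating: softmax is invariant to
    # adding a constant to all logits, and this avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # a valid probability distribution
```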

- Classification: Cross-Entropy Loss Function
- By considering the output of the softmax output layer as a probability distribution, the Cross-Entropy Loss function is derived using maximum likelihood as: $$L(\theta) = -\log(\text{softmax}(z_i)) = -z_i + \log\sum_j \exp(z_j)$$
- The Cross-Entropy Loss has two terms to control how close the output of the network is to the true probability:
- $z_i$ is the output of the linear transformation corresponding to the true class, before being passed through the softmax function. It is usually called the class logit, a term that comes from the field of logistic regression. The first term, $-z_i$, encourages the network to output a large value for the logit of the correct class.
- The second term, on the other hand, encourages the outputs of the affine transformation to be small
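The cross-entropy expression $-z_i + \log\sum_j \exp(z_j)$ can be computed directly from the logits; the log-sum-exp stabilization below is an implementation detail, not part of the notes:

```python
import numpy as np

def cross_entropy(z, true_class):
    # L = -z_i + log(sum_j exp(z_j)), with the usual log-sum-exp
    # shift by max(z) to avoid overflow
    m = np.max(z)
    log_sum_exp = m + np.log(np.sum(np.exp(z - m)))
    return -z[true_class] + log_sum_exp

loss = cross_entropy(np.array([3.0, 1.0, 0.2]), 0)
```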

- Regression: Linear Output Layers
- Linear Output Units are based only on an affine transformation with no non-linearity: $z=W^Th+b$
- Linear Output Units are usually used with the Mean Squared Error loss function to model the mean of a probability distribution: $$L(\theta) = \sum_i (z_i - f^*(x_i))^2$$
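A small sketch of a linear output layer paired with the squared-error loss above (toy values, NumPy assumed; the loss is written as a sum over outputs, matching the formula in the notes):

```python
import numpy as np

def linear_output(h, W, b):
    # Affine transformation only: z = W^T h + b (no non-linearity)
    return W.T @ h + b

def mse_loss(z, targets):
    # Sum of squared errors, as written in the notes
    return np.sum((z - targets) ** 2)

W = np.array([[1.0], [2.0]])   # maps 2 hidden units to 1 output
b = np.array([0.5])
h = np.array([1.0, 1.0])
z = linear_output(h, W, b)
```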

### Neural Network Training with Gradient Descent

Neural Network Loss Functions

- Thousands of training example pairs $[x,f^*(x)]$
- The Loss function computed over all $N$ training examples is termed the Training Loss and can be written as: $J(\theta) = \frac 1N \sum^N_{i=1} L[f(x_i,\theta), f^*(x_i)]$
- The gradient of the training loss with respect to the parameters $\theta$ can be written as: $$\nabla_\theta J(\theta) = \nabla_\theta [\frac 1N \sum^N_{i=1} L[f(x_i, \theta), f^*(x_i)]] = \frac 1N \sum^N_{i=1} \nabla_\theta L[f(x_i, \theta), f^\ast(x_i)] $$

Batch Gradient Descent:

- Batch Gradient Descent is an iterative first order optimization procedure
- Iterative means that it starts from an initial guess of parameters theta and improves on these parameters iteratively.
- First order means that the algorithm only uses the first order derivative to improve the parameters theta.
- Batch Gradient Descent Algorithm:
- Initialize parameters $\theta$
- While Stopping Condition is Not Met:
- Compute gradient of loss function over all training examples using the above training loss
- Update parameters according to: $\theta \leftarrow \theta - \epsilon \nabla_\theta J(\theta)$
- $\epsilon$ is called the learning rate; it controls how much we adjust the parameters in the direction of the negative gradient at every iteration.
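The batch gradient descent loop above can be sketched on a toy one-parameter least-squares problem (the data, learning rate, and iteration-count stopping condition are illustrative choices, not from the notes):

```python
import numpy as np

# Toy stand-in for a network: fit theta so that theta * x ~ y.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # generated by theta* = 2

def grad_J(theta):
    # Gradient of J(theta) = (1/N) * sum_i (theta * x_i - y_i)^2
    return (2.0 / len(x)) * np.sum((theta * x - y) * x)

theta = 0.0             # initialize parameters
epsilon = 0.05          # learning rate
for _ in range(200):    # stopping condition: fixed number of iterations
    theta = theta - epsilon * grad_J(theta)   # theta <- theta - eps * grad
```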

- Backpropagation, used to compute $\nabla_{\theta}J(\theta)$, is very expensive over the whole training dataset.
- Luckily, the loss function as well as its gradient are means over the training dataset
- Standard error of the mean estimated from N samples is $\frac {\sigma}{\sqrt N}$, where $\sigma$ is the standard deviation of the value of the samples
- Using all samples to estimate the gradient results in less than linear return in accuracy of this estimate
- Use a small subsample (Minibatch) of the training data to estimate the gradient

- Stochastic (minibatch) Gradient Descent differs only in the sampling step: the gradient is estimated from a randomly sampled minibatch of the training data rather than from the full dataset.
- Choice of Minibatch Size:
- GPUs work better with power-of-two batch sizes
- Large batch sizes (> 256):
- More accurate estimate of the gradient, but with less than linear returns

- Small batch sizes (< 64):
- Hardware can be underutilized with very small batch sizes
- Small batches can offer a regularizing effect; the best generalization error is often achieved with a batch size of 1
- Small batch sizes allow for faster convergence, as the algorithm can compute parameter updates rapidly

- As a result of these trade-offs, typical power of two mini batch sizes range from 32 to 256, with smaller sizes sometimes being attempted for large models or to improve generalization.
- Always make sure dataset is shuffled before sampling minibatch
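Minibatch sampling with a prior shuffle might look like this sketch (NumPy assumed; the batch size of 32 follows the power-of-two guideline above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size = 1000, 32      # batch size: a power of two in [32, 256]

data = np.arange(N)           # stand-in for (x, f*(x)) training pairs
perm = rng.permutation(N)     # shuffle the dataset before sampling

# Slice the shuffled indices into consecutive minibatches
batches = [data[perm[i:i + batch_size]] for i in range(0, N, batch_size)]
```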

Parameter Initialization and Stopping Conditions

- Parameter Initialization:
- Weights: initialized by randomly sampling from a standard normal distribution
- Biases: initialized to 0
- Other heuristics exist

- Stopping Conditions:
- Number of iterations: How many training iterations the neural network has performed
- Change in $\theta$: Stop if $|\theta_{new} - \theta_{old}|$ < Threshold
- Change in $J(\theta)$: Stop if $|J(\theta_{new})-J(\theta_{old})|$ < Threshold

SGD Variations

- Many variations of SGD exist
- Momentum SGD, Nesterov Momentum SGD
- Ada-Grad, RMS-Prop
- ADAM (Adaptive Moment Estimation)

- Which one to use?
- ADAM: Implemented in most deep neural network libraries, fairly robust to the choice of the learning rate and other hyperparameters

### Data Splits and Neural Network Performance Evaluation

- Data splits:
- training: used to minimize the Loss Function
- validation: used to choose best hyperparameters, such as the learning rate, number of layers, etc.
- testing: the neural network never observes this set. The developer never uses this set in the design process

- Percentage of split:
- total sample size is ~10000: training 60%, validation 20%, testing 20%
- total sample size is ~1000000: training 98%, validation 1%, testing 1%
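A sketch of the 60/20/20 split for a ~10,000-sample dataset (the function name and seed are illustrative):

```python
import numpy as np

def split_data(n_samples, rng_seed=0):
    # 60/20/20 train/validation/test split, as suggested in the notes
    # for ~10,000 samples; larger datasets shift toward 98/1/1.
    rng = np.random.default_rng(rng_seed)
    idx = rng.permutation(n_samples)      # shuffle before splitting
    n_train = int(0.6 * n_samples)
    n_val = int(0.2 * n_samples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]          # never used during design
    return train, val, test
```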

- Behavior of Split Specific Loss Functions:
- overfitting, underfitting
- The gap between training and validation loss is called the generalization gap

- Reducing the Effect of Underfitting/Overfitting
- Underfitting: (Training loss is high)
- Train longer
- More layers or more parameters per layer
- Change architecture

- Overfitting: (Generalization gap is large)
- More training data
- Regularization
- Change architecture


### Neural Network Regularization

Remedy overfitting through various regularization strategies:

- Parameter norm penalties
- $J(\theta)_{reg} = J(\theta) + \alpha \Omega(\theta)$
- limits the capacity of the model by adding the penalty $\Omega(\theta)$ to the objective function.
- $\alpha$ is a hyperparameter that weights the relative contribution of the norm penalty to the value of the loss function
- $\Omega(\theta)$ is a measure of how large $\theta$’s value is, usually an $L_p$ Norm.
- We usually only constrain the size of weights and not biases: $J(\theta)_{reg} = J(\theta) + \alpha \Omega(W)$
- The most common norm penalty used in neural networks is the L2-norm penalty: $\Omega(W) = \frac 12 W^TW = \frac 12 \|W\|^2_2$
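The L2 penalty and the regularized objective can be sketched as follows (note that only the weights, not the biases, are penalized):

```python
import numpy as np

def l2_penalty(W):
    # Omega(W) = 1/2 * ||W||_2^2, summed over all weight entries
    return 0.5 * np.sum(W ** 2)

def regularized_loss(loss, W, alpha):
    # J_reg(theta) = J(theta) + alpha * Omega(W); biases are not penalized
    return loss + alpha * l2_penalty(W)
```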

- Dropout
- The first step of dropout is to choose a probability which we’ll call $P_{keep}$.
- At every training iteration, this probability is used to choose a subset of the network nodes to keep in the network. These nodes can be either hidden units, output units, or input units.
- We then proceed to evaluate the output $y$ after cutting all the connections coming out of the dropped units.
- Since we remove units in proportion to the keep probability $P_{keep}$, we multiply the final weights by $P_{keep}$ at the end of training. This is essential to avoid incorrectly scaling the outputs when we switch to inference on the full network.
- Computationally inexpensive but powerful regularization method
- Does not significantly limit the type of model or training procedure that can be used. Works well with nearly any model that uses a distributed, overparameterized representation and that can be trained with stochastic gradient descent
- Dropout layers are practically implemented in all neural network libraries.
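A sketch of dropout as described above: units are masked at random during training, and activations are scaled by $P_{keep}$ for inference on the full network (scaling the activations here is equivalent to scaling the outgoing weights at the end of training):

```python
import numpy as np

rng = np.random.default_rng(42)
p_keep = 0.8   # probability of keeping each unit

def dropout_train(h, p_keep):
    # Training: keep each unit with probability p_keep, zero the rest,
    # which cuts all connections coming out of the dropped units
    mask = rng.random(h.shape) < p_keep
    return h * mask

def dropout_inference(h, p_keep):
    # Inference: keep all units but scale by p_keep, so the expected
    # magnitude of the activations matches training
    return h * p_keep
```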

- Early Stopping
- Early stopping ends training when the validation loss has kept increasing for a preset number of iterations or epochs.
- Early stopping should not be used as a first choice for regularization, as it also limits training time, which may interfere with overall network performance.
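The early-stopping rule can be sketched as a function of the per-epoch validation losses (the `patience` parameter is the preset number of epochs mentioned above):

```python
def early_stopping(val_losses, patience=3):
    # Stop once the validation loss has not improved for `patience` epochs
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch          # epoch at which training stops
    return len(val_losses) - 1        # never triggered
```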

### Convolutional Neural Networks

**ConvNets**

- Used for processing data defined on a grid
- 1D time-series data, 2D images, 3D videos
- Two major types of layers:
- Convolution Layers
- Pooling Layers

- Cross-correlation:
- The vertical and horizontal shifts are usually the same value, referred to as the stride of the convolutional layer.

- Output Volume Shape
- Filters are of size $m \times m$
- Number of filters = $K$
- Stride = $S$, Padding = $P$
- the expression for width: $W_{out} = \frac{W_{in}-m+2P}{S} +1$
- the expression for height: $H_{out} = \frac{H_{in}-m+2P}{S} +1$
- the expression for depth: $D_{out} = K$
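The output-shape formulas can be checked with a small helper (the function name is illustrative; integer division assumes the filter, stride, and padding tile the input exactly):

```python
def conv_output_shape(w_in, h_in, m, k, stride, padding):
    # W_out = (W_in - m + 2P)/S + 1, same for height; depth = K filters
    w_out = (w_in - m + 2 * padding) // stride + 1
    h_out = (h_in - m + 2 * padding) // stride + 1
    return w_out, h_out, k
```

For example, a 3x3 filter with stride 1 and padding 1 preserves spatial size, a common "same-convolution" configuration.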

**Pooling**

- A pooling layer uses pooling functions to replace the output of the previous layer with a summary statistic of the nearby outputs.
- Pooling helps make the representations become invariant to small translations of the input.
- Max pooling:
- Max pooling summarizes output volume patches with the max function.
- Output Volume Shape
- Pool size $n \times n$
- Stride = S
- $W_{out} = \frac{W_{in}-n}{S} +1$
- $H_{out} = \frac{H_{in}-n}{S} +1$
- $D_{out} = D_{in}$
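Max pooling over $n \times n$ patches can be sketched as follows (NumPy assumed; the loops are kept explicit for clarity, operating on a single channel since depth is unchanged):

```python
import numpy as np

def max_pool(x, n, stride):
    # Replace each n x n patch with its maximum; matches the
    # output-shape formulas W_out = (W_in - n)/S + 1 above
    w_out = (x.shape[0] - n) // stride + 1
    h_out = (x.shape[1] - n) // stride + 1
    out = np.empty((w_out, h_out))
    for i in range(w_out):
        for j in range(h_out):
            patch = x[i*stride:i*stride + n, j*stride:j*stride + n]
            out[i, j] = patch.max()
    return out

x = np.array([[ 1.0,  2.0,  5.0,  6.0],
              [ 3.0,  4.0,  7.0,  8.0],
              [ 9.0, 10.0, 13.0, 14.0],
              [11.0, 12.0, 15.0, 16.0]])
pooled = max_pool(x, 2, 2)   # 4x4 input -> 2x2 summary
```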

- Advantages of ConvNets
- Convolutional neural networks are, by design, a natural choice for processing images
- Convolutional layers have *fewer parameters* than fully connected layers, reducing the chances of overfitting
- Convolutional layers use the same parameters to process every block of the image. Along with pooling layers, this leads to *translation invariance*, which is particularly important for image understanding