From Coursera, Visual Perception for Self-Driving Cars by University of Toronto

https://www.coursera.org/learn/visual-perception-self-driving-cars

## 2D Object Detection

### The Object Detection Problem

**2D object detection task problem**

- Object detection can be defined as a function estimation problem
- Given an input image $x$, we want to find the function $f(x;\theta)$ that produces an output vector containing the coordinates of the top-left corner of the box, the coordinates of the lower-right corner of the box, and a class score: $f(x;\theta) = [x_{min}, y_{min}, x_{max}, y_{max}, S_{class_1},\dots,S_{class_k}]$

**Evaluating performance measures**

- The first step of the evaluation process is to compare the detector localization output to the ground truth boxes via the Intersection-Over-Union metric (IOU).
- IOU is the area of intersection of the predicted box with a ground truth box, divided by the area of their union
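The IOU computation follows directly from this definition. A minimal illustrative implementation, using the `[x_min, y_min, x_max, y_max]` box format from the detector output above:

```python
def iou(box_a, box_b):
    """Intersection-Over-Union of two boxes given as [x_min, y_min, x_max, y_max]."""
    # Corners of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)  # zero if the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two identical boxes give an IOU of 1; disjoint boxes give 0.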

- To account for class scores, we define true positives (TP), false positives (FP), and false negatives (FN).
- TP: Object class score > score threshold, and IOU > IOU threshold
- FP: Object class score > score threshold, and IOU < IOU threshold
- FN: Number of ground truth objects not detected by the algorithm
- Precision: TP/(TP+FP)
- Recall: TP/(TP+FN)
- Precision Recall Curve (PR-Curve): Use multiple object class score thresholds to compute precision and recall. Plot the values with precision on y-axis, and recall on x-axis
- Average Precision (AP): Area under PR-Curve for a single class. Usually approximated using 11 recall points
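The 11-point approximation of AP can be sketched as follows (an illustrative sketch: at each recall level, the interpolated precision is the maximum precision over all operating points with recall at least that level):

```python
def average_precision_11pt(precisions, recalls):
    """11-point interpolated AP for a single class: average the
    interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    ap = 0.0
    for r in [i / 10 for i in range(11)]:
        # Interpolated precision: best precision at any recall >= r
        candidates = [p for p, rc in zip(precisions, recalls) if rc >= r]
        ap += max(candidates) if candidates else 0.0
    return ap / 11
```

A perfect detector (precision 1 at recall 1) scores an AP of 1.0.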

### 2D Object detection with Convolutional Neural Networks

**The Feature Extractor**

- Feature extractors are the most computationally expensive component of the 2D object detector
- The output of a feature extractor usually has a much lower width and height than the input image, but a much greater depth
- Feature extraction is a very active area of research, with new extractors proposed on a regular basis
- The most common extractors are VGG, ResNet, and Inception

**VGG Feature Extractor**

- Alternating convolutional and pooling layers
- All convolutional layers are of size 3x3xK, with stride 1 and zero-padding of 1
- All pooling layers use the max function, and are of size 2x2, with stride 2 and no padding
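These layer choices make the size bookkeeping easy to verify with the standard output-size formulas. An illustrative sketch (each VGG stage is collapsed to a single convolution here, which does not change the result, since 3x3/stride-1/pad-1 convolutions never alter the spatial size):

```python
def conv_out(size, k=3, stride=1, pad=1):
    """Spatial output size of a convolutional layer."""
    return (size + 2 * pad - k) // stride + 1

def pool_out(size, k=2, stride=2, pad=0):
    """Spatial output size of a pooling layer."""
    return (size + 2 * pad - k) // stride + 1

size = 224                   # input image width/height
for _ in range(5):           # five conv+pool stages, as in VGG-16
    size = conv_out(size)    # 3x3 conv, stride 1, pad 1: size unchanged
    size = pool_out(size)    # 2x2 max pool, stride 2: size halved
# After five stages the 224x224 input has shrunk to 7x7 spatially, while
# the depth K grows: lower width/height but greater depth, as noted above.
```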

**Prior boxes/anchor boxes**

- To generate 2D bounding boxes, we usually do not start from scratch and estimate the corners of the bounding box without any prior.
- We assume that we do have a prior on where the boxes are in image space and how large these boxes should be. These priors are called anchor boxes and are manually defined over the whole image usually on an equally-spaced grid.
- During training, the network learns to take each of these anchors and move it as close as possible to the ground truth bounding box, in both the centroid location and the box dimensions. This is termed *residual learning*, and it takes advantage of the notion that it is easier to nudge an existing box a small amount to improve it than to search the entire image for possible object locations.
- Residual learning has proven to provide much better results than attempting to directly estimate bounding boxes without any prior.

**Faster R-CNN**

- For every pixel in the feature map, we associate k anchor boxes.
- We then perform a 3x3xD* convolution operation on that pixel's neighborhood. This results in a 1x1xD* feature vector for that pixel.
- We use this 1x1xD* feature vector as the feature vector of every one of the k anchors associated with that pixel.
- We then feed the extracted feature vector to the output layers of the neural network.

The output layers of a 2D object detector usually comprise a regression head and a classification head.

- The regression head usually includes multiple fully-connected hidden layers with a linear output layer. The regressed output is typically a vector of residuals that need to be added to the anchor at hand to get the ground truth bounding box.
- Another method to update the dimensions of the anchors is to regress a residual from the center of the anchor to the center of the ground truth bounding box, in addition to two scale factors that recover the ground truth bounding box width and height when multiplied with the anchor's width and height.
- The classification head is also composed of multiple fully-connected hidden layers, but with a final softmax output layer. The softmax output is a vector with a single score per class; the highest score usually defines the class of the anchor at hand.
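The second parameterization above (a center residual plus scale factors) can be sketched as an encode/decode pair. The log-scale form used here is one common choice (as in Faster R-CNN), not necessarily the exact one from the course:

```python
import math

def encode(anchor, gt):
    """Regression targets from an anchor to a ground truth box.
    Boxes are [center_x, center_y, width, height]."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    # Center residuals normalized by anchor size, log scale factors for w/h
    return [(gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah)]

def decode(anchor, residuals):
    """Apply regressed residuals to an anchor to recover the estimated box."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = residuals
    return [ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th)]
```

`decode(anchor, encode(anchor, gt))` recovers `gt` exactly, which is what the regression head is trained to approximate.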

### Training vs. Inference

**Minibatch selection**

- Negative anchors target:
  - Classification: Background
  - Regression: None

- Positive anchors target:
  - Classification: Category of the ground truth bounding box
  - Regression: Align box parameters with the highest-IOU ground truth bounding box

- Problem: The majority of anchors are negative, which results in a neural network that labels all detections as background
- Solution: Sample a minibatch of a chosen size, with a 3:1 ratio of negative to positive anchors, to eliminate bias towards the negative class
- Choose the negatives with the highest classification loss (online hard negative mining) to be included in the minibatch
- Classification loss: $L_{cls} = \frac 1{N_{total}} \sum_i crossentropy(s^*_i, s_i)$
- $N_{total}$ is the size of the minibatch
- $s_i$ is the output of the neural network
- $s^*_i$ is the anchor classification target:
- Background if anchor is negative
- Ground truth box class if anchor is positive

- Regression Loss: $L_{reg} = \frac 1{N_p} \sum_i p_iL_2(b^*_i, b_i)$
- $p_i$ is 0 if the anchor is negative and 1 if the anchor is positive
- $N_p$ is the number of positive anchors in the minibatch
- $b^*_i$ is the ground truth bounding box
- $b_i$ is the estimated bounding box, obtained by applying the regressed residuals to the anchor box parameters
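The minibatch selection with online hard negative mining can be sketched as follows (illustrative only: the name `sample_minibatch` and the 256-anchor default are assumptions, and real implementations typically sample the positives randomly rather than taking the first few):

```python
def sample_minibatch(losses, labels, batch_size=256, neg_pos_ratio=3):
    """Pick anchor indices for a minibatch with a fixed negative:positive
    ratio, choosing the negatives with the highest classification loss
    (online hard negative mining).
    labels[i] is True for positive anchors; losses[i] is the per-anchor
    classification loss."""
    positives = [i for i, pos in enumerate(labels) if pos]
    negatives = [i for i, pos in enumerate(labels) if not pos]
    n_pos = min(len(positives), batch_size // (neg_pos_ratio + 1))
    n_neg = min(len(negatives), neg_pos_ratio * n_pos)
    # Hard negative mining: keep the negatives the network gets most wrong
    hard_negatives = sorted(negatives, key=lambda i: losses[i], reverse=True)[:n_neg]
    return positives[:n_pos] + hard_negatives
```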

**Non-maximum suppression (NMS)**

- An extremely powerful approach to improving inference output for anchor-based neural networks.
- Non-max suppression takes as input a list of predicted bounding boxes $B$, where each bounding box comprises the regressed coordinates and the class output score.
- It also needs as input a predefined IOU threshold, which we'll call $\eta$.
- The algorithm goes as follows:
- First, sort the bounding boxes in the list $B$ according to their output score, giving the sorted list $\bar{B}$. We also initialize an empty set $D$ to hold the output bounding boxes.
- Then iterate over all elements in the sorted box list $\bar{B}$. Inside the loop, we first determine the box $B_{max}$ with the highest score in $\bar{B}$, which should be its first element.
- We then remove this bounding box from $\bar{B}$ and add it to the output set $D$.
- Next, find all boxes remaining in $\bar{B}$ that have an IOU greater than $\eta$ with $B_{max}$. These boxes significantly overlap with the current maximum box, and any box that satisfies this condition gets removed from $\bar{B}$. We keep iterating through $\bar{B}$ until it is empty, and then return $D$.
- $D$ now contains a single bounding box per object.
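The NMS procedure maps to a short greedy loop (an illustrative sketch; the IOU helper is repeated here so the snippet is self-contained):

```python
def iou(a, b):
    """IOU of two [x_min, y_min, x_max, y_max] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, eta=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    # Sort box indices by descending score (the sorted list "B bar")
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []                               # the output set D
    while order:
        best, order = order[0], order[1:]   # B_max: highest remaining score
        keep.append(best)
        # Drop every remaining box that overlaps B_max by more than eta
        order = [i for i in order if iou(boxes[best], boxes[i]) <= eta]
    return keep
```

Here two heavily overlapping detections of the same object collapse to the single higher-scoring one, while distant boxes survive.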

### Using 2D Object Detectors for Self-Driving Cars

**3D Object Detection**

- Estimating the:
- Category Classification: Car, pedestrian, cyclist
- Position of the centroid in 3D: $[x,y,z]$
- Extent in 3D: $[l,w,h]$
- Orientation in 3D: $[\phi,\psi,\theta]$

- The most common and successful way to extend 2D object detection results to 3D is to use LiDAR point clouds.
- Given a 2D bounding box in an image space and a 3D LiDAR point cloud, we can use the inverse of the camera projection matrix to project the corners of the bounding box as rays into the 3D space.
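For a simple pinhole camera with no skew, the inverse projection of a bounding box corner reduces to the following sketch (the intrinsics `fx, fy, cx, cy` and the corner values are assumed for illustration; the depth along each ray stays unknown until the LiDAR points are brought in):

```python
def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Back-project pixel (u, v) to a 3D ray direction in the camera frame.
    Every point [s*x, s*y, s] along the ray (s > 0) projects back onto (u, v)."""
    return [(u - cx) / fx, (v - cy) / fy, 1.0]

# The four corners of a 2D bounding box become four rays bounding a frustum;
# LiDAR points falling inside that frustum are candidates for the 3D object.
corners = [(100, 80), (300, 80), (100, 220), (300, 220)]  # hypothetical box
frustum_rays = [pixel_to_ray(u, v, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
                for u, v in corners]
```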

- Advantages:
- Allows exploitation of mature 2D object detectors, with high precision and recall
- Class already determined from 2D detection. There is no need to use LiDAR data or pass 3D information into the network to determine whether we are looking at a car or a post.
- Does not require prior scene knowledge, such as the ground plane location

- Disadvantages:
- The performance of the 3D estimator is bounded by the performance of the 2D detector
- Occlusion and truncation are hard to handle from 2D only
- The 3D estimator needs to wait for the 2D detector, inducing latency in our system

**2D object tracking**

- Detection: We detect the object independently in each frame and can record its position over time
- Tracking: We use image measurements to estimate position of object, but also incorporate position predicted by dynamics, i.e., our expectation of object’s motion pattern
- Tracking Assumptions:
- The camera is not moving instantly to a new viewpoint
- Objects do not disappear and reappear in different places in the scene
- If the camera is moving, there is a gradual change in pose between the camera and the scene

- Object Tracking: Prediction
- Each object will have a predefined motion model in image space, e.g. $p_k = p_{k-1} + v_k\Delta t + \mathcal{N}(0,\Sigma)$

- Object Tracking: Correlation
- Get Measurement Bounding Boxes from 2D detector.
- Correlate prediction with the highest IOU measurement

- Object Tracking: Update
- The prediction and measurement are fused as part of the Kalman Filter Framework

- For each frame, we start a new track if a measurement has no correlated prediction
- We also terminate inconsistent tracks, if a predicted object does not correlate with a measurement for a preset number of frames
- The same methodology can be used to track objects in 3D!
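The predict/correlate/update loop with track management can be sketched as follows (illustrative only: a constant-velocity model on the box parameters and a direct measurement overwrite stand in for the Kalman filter fusion, and the IOU helper is repeated to keep the snippet self-contained):

```python
def iou(a, b):
    """IOU of two [x_min, y_min, x_max, y_max] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

class Track:
    def __init__(self, box):
        self.box = box                  # last known box parameters
        self.velocity = [0.0] * 4       # per-parameter velocity estimate
        self.misses = 0                 # consecutive frames without a match

    def predict(self):
        # Constant-velocity motion model (process noise omitted)
        return [b + v for b, v in zip(self.box, self.velocity)]

def step(tracks, detections, eta=0.3, max_misses=3):
    """One frame of IOU-based tracking: predict, correlate, update, manage."""
    unmatched = list(range(len(detections)))
    for t in tracks:
        pred = t.predict()
        # Correlate the prediction with the highest-IOU measurement above eta
        best, best_iou = None, eta
        for j in unmatched:
            v = iou(pred, detections[j])
            if v > best_iou:
                best, best_iou = j, v
        if best is None:
            t.misses += 1
        else:
            meas = detections[best]
            t.velocity = [m - b for m, b in zip(meas, t.box)]
            t.box = meas                # stand-in for the Kalman filter fusion
            t.misses = 0
            unmatched.remove(best)
    # Terminate stale tracks; start a new track per uncorrelated measurement
    tracks = [t for t in tracks if t.misses < max_misses]
    return tracks + [Track(detections[j]) for j in unmatched]
```

Feeding one detection per frame starts a track, follows it while its predicted box keeps overlapping a measurement, and spawns a second track when a detection appears elsewhere.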

**Traffic signs and signals detection**

- Traffic signs and signals appear smaller in size compared to cars, two-wheelers, and pedestrians.
- Traffic signs are highly variable with many classes to be trained on.
- Traffic signals have different states that are required to be detected.
- In addition, traffic signals change state as the car drives.
- 2D object detectors can be used to perform traffic sign and traffic signal detection without any modifications.
- However, multi-stage hierarchical models have been shown to outperform standard single-stage object detectors.