Some of the material in this lecture comes from the online courses of Charles Ollion and Olivier Grisel (Master Datascience Paris Saclay).
CC-BY 4.0 license
Predict coordinates of a bounding box (x, y, w, h)
Evaluate via Intersection over Union (IoU)
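As a minimal sketch of the IoU metric, the function below computes the ratio of the overlap area to the union area of two boxes. It assumes that $(x, y)$ is the top-left corner of the box and that $w$ and $h$ are positive; the slides do not fix this convention, so treat it as an assumption.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, w, h).

    Assumption: (x, y) is the top-left corner; w and h are positive.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle (0 if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the area counted twice.
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give an IoU of 1, disjoint boxes give 0, and partial overlaps fall in between.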
$C$ classes, $4$ output dimensions ($1$ box)
Predict exactly $N$ objects: predict $(N \times 4)$ coordinates and $(N \times C)$ class scores
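To make the output sizes concrete, here is a small sketch that splits a flat network output into per-object boxes and class scores. The function name `split_predictions` and the memory layout (all box coordinates first, then all class scores) are hypothetical choices made for illustration only.

```python
def split_predictions(output, n_objects, n_classes):
    """Split a flat output vector into per-object boxes and class scores.

    Assumed (hypothetical) layout: the first n_objects * 4 values are the box
    coordinates, followed by n_objects * n_classes class scores.
    """
    n_box_values = n_objects * 4
    expected = n_box_values + n_objects * n_classes
    if len(output) != expected:
        raise ValueError(f"expected {expected} values, got {len(output)}")
    boxes = [tuple(output[4 * i:4 * i + 4]) for i in range(n_objects)]
    scores = [
        list(output[n_box_values + i * n_classes:n_box_values + (i + 1) * n_classes])
        for i in range(n_objects)
    ]
    # Predicted label of each object: index of its highest class score.
    labels = [max(range(n_classes), key=s.__getitem__) for s in scores]
    return boxes, scores, labels
```

With two objects and three classes, the output vector holds $2 \times 4 + 2 \times 3 = 14$ values.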
The number of objects in the image is not known in advance, so object detection relies on object proposal followed by object classification
Object proposal: find regions of interest (RoIs) in the image
Object classification: classify the object in these regions
For each cell of the $S \times S$ grid, predict:
YOLO features:
Instead of having a predefined set of box proposals, find them on the image:
Crop-and-resize operator (RoI-Pooling):
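As an illustration of such an operator, the sketch below crops a region from a single-channel feature map and max-pools it into a fixed-size grid. It is a simplified version under stated assumptions: the RoI is given as integer `(x, y, w, h)` in feature-map cells, lies fully inside the map, and the name `roi_max_pool` is hypothetical.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool a region of a 2D feature map into an out_h x out_w grid.

    feature_map: H x W nested lists (single channel).
    roi: (x, y, w, h) in integer feature-map cells, assumed inside the map.
    """
    x, y, w, h = roi
    pooled = []
    for i in range(out_h):
        # Row range of bin i; every bin covers at least one row.
        r0 = y + (i * h) // out_h
        r1 = max(r0 + 1, y + ((i + 1) * h) // out_h)
        row = []
        for j in range(out_w):
            # Column range of bin j; every bin covers at least one column.
            c0 = x + (j * w) // out_w
            c1 = max(c0 + 1, x + ((j + 1) * w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the size of the region, the result always has the same shape, which is what lets a fixed-size classifier run on arbitrary proposals.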
Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015
Replace Selective Search with a Region Proposal Network (RPN) and train it jointly with the detector
Unlike YOLO, region proposal is translation invariant
Output a class map: one class per pixel (here: dog vs. background)
Instance segmentation: specify each object instance as well (two dogs have different instances)
This can be done through object detection + segmentation
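A naive sketch of this combination: given a per-pixel class map and a list of detected boxes, each pixel of the target class is assigned the id of the first box that contains it. The function name `instances_from_boxes`, the integer `(x, y, w, h)` box format and the first-box-wins rule are all assumptions made for illustration; they are not the method of any specific paper.

```python
def instances_from_boxes(class_map, boxes, target_class):
    """Turn a semantic class map plus detected boxes into an instance map.

    class_map: H x W nested lists of class ids.
    boxes: list of integer (x, y, w, h) boxes, (x, y) being the top-left corner.
    Returns an H x W map where 0 means "no instance" and k >= 1 is the k-th box.
    """
    h, w = len(class_map), len(class_map[0])
    instance_map = [[0] * w for _ in range(h)]
    for instance_id, (x, y, bw, bh) in enumerate(boxes, start=1):
        # Clip the box to the image before iterating over its pixels.
        for row in range(max(0, y), min(h, y + bh)):
            for col in range(max(0, x), min(w, x + bw)):
                is_target = class_map[row][col] == target_class
                # First box wins when boxes overlap (a simplifying assumption).
                if is_target and instance_map[row][col] == 0:
                    instance_map[row][col] = instance_id
    return instance_map
```

Two dogs with the same class id then receive different instance ids, one per detected box.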
Long, Jonathan, et al. "Fully convolutional networks for semantic segmentation." CVPR 2015
Slide the network with an input of (224, 224) over a larger image. Output of varying spatial size
Convolutionize: change the Dense (4096, 1000) layer to a $1 \times 1$ convolution with 4096 input channels and 1000 output channels
Gives a coarse segmentation (no extra supervision)
Predict / backpropagate for every output pixel
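One common way to realise this is a softmax cross-entropy averaged over all output pixels, so that every pixel contributes a gradient. The sketch below assumes that choice of loss (the slide does not name one) and uses nested lists for the logits and labels.

```python
import math


def pixelwise_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over every pixel.

    logits: H x W x C nested lists of raw class scores.
    labels: H x W nested lists of integer class ids in [0, C).
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for pixel_logits, label in zip(logit_row, label_row):
            # Numerically stable log-sum-exp of the class scores.
            m = max(pixel_logits)
            log_sum = m + math.log(sum(math.exp(v - m) for v in pixel_logits))
            # Negative log-probability of the true class at this pixel.
            total += log_sum - pixel_logits[label]
            count += 1
    return total / count
```

With uniform scores over two classes, the loss at each pixel is $\log 2$, and it tends to 0 as the score of the correct class dominates.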
Aggregate maps from several convolutions at different scales for more robust results
Noh, Hyeonwoo, et al. "Learning deconvolution network for semantic segmentation." ICCV 2015
skip connections between corresponding convolution and deconvolution layers
sharper masks by using precise spatial information (early layers)
better object detection by using semantic information (late layers)
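A skip connection of this kind can be sketched as upsampling the coarse, semantic feature map and merging it with the finer, earlier one. The example below makes two simplifying assumptions: nearest-neighbour upsampling stands in for a learned deconvolution, and the merge is a channel concatenation (one common choice, as in U-Net). The function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of an H x W x C feature map."""
    out = []
    for row in fmap:
        # Duplicate every pixel horizontally, then the whole row vertically.
        wide = [list(pixel) for pixel in row for _ in range(2)]
        out.append(wide)
        out.append([list(pixel) for pixel in wide])
    return out


def skip_concat(decoder_fmap, encoder_fmap):
    """Concatenate the channels of two maps that share the same spatial size."""
    return [
        [d + e for d, e in zip(d_row, e_row)]
        for d_row, e_row in zip(decoder_fmap, encoder_fmap)
    ]


coarse = [[[9]]]                       # 1 x 1 x 1, late (semantic) features
fine = [[[1], [2]], [[3], [4]]]        # 2 x 2 x 1, early (spatial) features
merged = skip_concat(upsample2x(coarse), fine)
```

Each output pixel now carries both the coarse semantic channel and the precise early-layer channel.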
Newell, Alejandro, et al. "Stacked Hourglass Networks for Human Pose Estimation." ECCV 2016
U-Net like architectures repeated sequentially
Each block refines the segmentation and passes it to the following block
Each block has a segmentation loss
He, Kaiming, et al. "Mask R-CNN." ICCV 2017
Faster R-CNN architecture with a third head that predicts a binary mask