Convolutional Neural Networks for Image Segmentation¶

Some of the material in this lecture comes from the online courses of Charles Ollion and Olivier Grisel - Master Datascience Paris Saclay.
CC-BY 4.0 license

CNNs for computer vision¶

Beyond Image Classification¶

CNNs¶

Previous lecture: image classification

Limitations¶

Mostly on centered images
Only a single object per image
Not enough for many real world vision tasks

Beyond Image Classification¶

Outline¶

Simple Localisation as regression¶

Detection Algorithms¶

Fully convolutional Networks¶

Semantic & Instance Segmentation¶

Localisation¶

Single object per image
Predict coordinates of a bounding box (x, y, w, h)
Evaluate via Intersection over Union (IoU)
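
The IoU between a predicted box and a ground-truth box can be sketched as below. This is a minimal illustration, assuming that $(x, y)$ is the top-left corner of the box and that $w$ and $h$ are positive; the function name `iou` is ours, not taken from any library.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, w, h).

    Assumption: (x, y) is the top-left corner; w and h are positive.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the area counted twice.
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping on a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 2, 2)))
```

An IoU of 1 means a perfect match and 0 means no overlap; a prediction is usually counted as correct when its IoU exceeds a chosen threshold.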

Localisation as regression¶

Classification + Localisation¶

Use a CNN pre-trained on ImageNet (e.g. ResNet)
The "localisation head" is trained separately with regression
At test time, use both heads

$C$ classes, $4$ output dimensions ($1$ box)

Predict exactly $N$ objects: predict $(N \times 4)$ coordinates and $(N \times C)$ class scores
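
The two-head setup can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the feature vector stands in for the output of a pre-trained CNN, the sizes `D` and `C` and the random weights are hypothetical, and we assume cross-entropy for the class head and squared error for the box head.

```python
import math
import random


def linear(weights, bias, features):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
D, C = 8, 3  # hypothetical feature size and number of classes
features = [random.random() for _ in range(D)]  # stand-in for CNN features

# Two heads sharing the same features (random weights, for illustration).
W_cls = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(C)]
W_box = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(4)]

class_probs = softmax(linear(W_cls, [0.0] * C, features))  # C class scores
box = linear(W_box, [0.0] * 4, features)                   # (x, y, w, h)

# Assumed losses: cross-entropy (classification), squared error (regression).
true_class, true_box = 1, [0.2, 0.3, 0.5, 0.4]
cls_loss = -math.log(class_probs[true_class])
box_loss = sum((p - t) ** 2 for p, t in zip(box, true_box)) / 4
print(len(class_probs), len(box))
```

At test time both heads are evaluated on the same features, giving one class distribution and one box per image.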

Object detection¶

We don't know in advance the number of objects in the image. Object detection relies on object proposal and object classification

Object proposal: find regions of interest (RoIs) in the image

Object classification: classify the object in these regions

Two main families:¶

Single-Stage: A grid in the image where each cell is a proposal (SSD, YOLO, RetinaNet)
Two-Stage: Region proposal then classification (Faster-RCNN)

YOLO (You Only Look Once)¶

For each cell of the $S \times S$ grid, predict:

$B$ boxes, each with a confidence score $C_j$ ($5 \times B$ values), plus class probabilities $\text{prob}(c)$
Final detections: $C_j \cdot \text{prob}(c) > \text{threshold}$
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." CVPR (2016)
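
The size of the YOLO output and the thresholding rule for final detections can be sketched as below. The helper names (`yolo_output_size`, `cell_detections`) and the example numbers are hypothetical; the values $S = 7$, $B = 2$ and 20 classes are those used in the original paper for PASCAL VOC.

```python
def yolo_output_size(S, B, num_classes):
    """Number of values predicted per image: S*S cells, each with
    B boxes of 5 values (x, y, w, h, confidence) plus class probabilities."""
    return S * S * (5 * B + num_classes)


def cell_detections(boxes, class_probs, threshold):
    """Keep (class, score, box) triples whose score exceeds the threshold.

    boxes: list of (x, y, w, h, confidence) predicted by one cell.
    class_probs: class probabilities predicted by the same cell.
    """
    detections = []
    for (x, y, w, h, conf) in boxes:
        for cls, p in enumerate(class_probs):
            score = conf * p  # box confidence times class probability
            if score > threshold:
                detections.append((cls, score, (x, y, w, h)))
    return detections


print(yolo_output_size(S=7, B=2, num_classes=20))  # 7 * 7 * 30 = 1470

# One cell with B = 2 boxes and 3 classes (made-up numbers).
boxes = [(0.5, 0.5, 0.2, 0.3, 0.9), (0.4, 0.6, 0.8, 0.8, 0.1)]
class_probs = [0.1, 0.8, 0.1]
print(cell_detections(boxes, class_probs, threshold=0.25))
```

Only the first box combined with class 1 passes the threshold here (score about 0.72); all other combinations are discarded.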

YOLO (You Only Look Once)¶

YOLO features:

Computationally very fast, can be used in real time
Processes the entire image globally, in a single pass through a single CNN

RetinaNet¶

Lin, Tsung-Yi, et al. "Focal loss for dense object detection." ICCV 2017.

Single stage detector with:

Multiple scales through a Feature Pyramid Network
More than 100K boxes proposed
Focal loss to manage imbalance between background and real objects
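
The focal loss from the paper cited above is $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, where $p_t$ is the probability assigned to the true class. A minimal sketch for the binary case (object vs background) follows; the function name is ours, and the defaults $\gamma = 2$, $\alpha = 0.25$ are the values reported as working best in the paper.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the "object" class, strictly in (0, 1).
    y: 1 for a real object, 0 for background.
    """
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_background = focal_loss(p=0.01, y=0)  # confident and correct
hard_background = focal_loss(p=0.90, y=0)  # confident and wrong
print(easy_background, hard_background)
```

With $\gamma = 0$ the expression reduces to an $\alpha$-weighted cross-entropy; with $\gamma > 0$ the many easy background boxes contribute very little to the total loss.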

RCNN¶

SPPNet¶

Box Proposals¶

Instead of having a predefined set of box proposals, find them on the image:

Selective Search: from pixels (not learnt)
Faster-RCNN: Region Proposal Network (RPN), learnt

Crop-and-resize operator (RoI-Pooling):

Input: convolutional map + $N$ regions of interest
Output: tensor of shape $N \times 7 \times 7 \times \text{depth}$ (one fixed-size feature map per box)
Propagates gradients only through the regions of interest, and keeps computation efficient
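
The crop-and-resize idea can be sketched as a max-pooling version of RoI-Pooling on a single-channel map. This is a simplified illustration, not the exact operator of any library: we assume regions are given in feature-map cells as `(x0, y0, x1, y1)` with exclusive upper bounds, and the helper name `roi_max_pool` is hypothetical.

```python
import math


def roi_max_pool(feature_map, roi, out_size):
    """Max-pool one region of interest into an out_size x out_size grid.

    feature_map: 2D list (H x W), a single channel.
    roi: (x0, y0, x1, y1) in feature-map cells, upper bounds exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        # Rows covered by output bin i (bins may overlap by one cell).
        r_start = y0 + math.floor(i * roi_h / out_size)
        r_end = y0 + math.ceil((i + 1) * roi_h / out_size)
        row = []
        for j in range(out_size):
            c_start = x0 + math.floor(j * roi_w / out_size)
            c_end = x0 + math.ceil((j + 1) * roi_w / out_size)
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 feature map whose value at (row, col) is row * 6 + col.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
print(roi_max_pool(fmap, roi=(1, 0, 5, 4), out_size=2))  # [[8, 10], [20, 22]]
print(roi_max_pool(fmap, roi=(0, 0, 6, 3), out_size=2))  # same 2 x 2 shape
```

Regions of different sizes are all mapped to the same output shape, which is what lets a fixed-size classification head run on every proposal.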

Fast RCNN¶

Faster-RCNN¶

Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region proposal networks." NIPS 2015

Replace Selective Search with RPN, train jointly
Unlike YOLO, the region proposal is translation invariant

Segmentation¶

Output a class label for each pixel, i.e. a class map (here: dog vs background)
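
As a small illustration, a class map can be obtained from per-pixel class scores by taking the highest-scoring class at each pixel. The scores below are made up, and `class_map` is a hypothetical helper name.

```python
def class_map(scores):
    """scores: H x W x num_classes nested lists -> H x W map of class indices."""
    return [[max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
            for row in scores]


# 2 x 2 image, 2 classes: index 0 = background, index 1 = dog.
scores = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.6, 0.4], [0.3, 0.7]],
]
print(class_map(scores))  # [[0, 1], [0, 1]]
```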

Instance segmentation: specify each object instance as well (two dogs have different instances)
This can be done through object detection + segmentation

Convolutionize¶

Long, Jonathan, et al. "Fully convolutional networks for semantic segmentation." CVPR 2015

Slide the network, which takes a (224, 224) input, over a larger image. The output then has a varying spatial size
Convolutionize: change Dense (4096, 1000) to a $1 \times 1$ Convolution with 4096 input channels and 1000 output channels
Gives a coarse segmentation (no extra supervision)
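
The equivalence behind "convolutionizing" can be sketched with toy sizes: a $1 \times 1$ convolution applies the same dense weights independently at every spatial position. The sizes below (3 input channels, 2 output channels) are stand-ins for the 4096 and 1000 of the slide, and the helper names are ours.

```python
def dense(weights, bias, x):
    """Dense layer on one feature vector: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def conv1x1(weights, bias, feature_map):
    """1x1 convolution on an H x W x C_in map (nested lists):
    the same dense weights are applied at every spatial position."""
    return [[dense(weights, bias, pixel) for pixel in row]
            for row in feature_map]


weights = [[1, 0, -1], [0.5, 0.5, 0.5]]  # 2 output channels, 3 input channels
bias = [0, 1]

# On a 1 x 1 map the convolution is exactly the dense layer.
print(dense(weights, bias, [1, 2, 3]))        # [-2, 4.0]
print(conv1x1(weights, bias, [[[1, 2, 3]]]))  # [[[-2, 4.0]]]

# On a larger map the same weights give one score vector per position.
larger = [[[1, 2, 3], [0, 0, 0], [3, 2, 1]],
          [[1, 1, 1], [2, 2, 2], [0, 1, 0]]]
out = conv1x1(weights, bias, larger)
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 2
```

This is why a classifier trained on fixed-size inputs produces a coarse map of class scores when it is run on a larger image.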

Fully Convolutional Network¶

Long, Jonathan, et al. "Fully convolutional networks for semantic segmentation." CVPR 2015

Predict / backpropagate for every output pixel
Aggregate maps from several convolutions at different scales for more robust results

Deconvolution¶

Noh, Hyeonwoo, et al. "Learning deconvolution network for semantic segmentation." ICCV 2015

"Deconvolution": transposed convolutions
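
A transposed convolution can be illustrated in one dimension: each input value is multiplied by the kernel and added into the output at a position determined by the stride, so the output is longer than the input. This is a minimal sketch with no padding; the function name is hypothetical.

```python
def conv_transpose_1d(x, kernel, stride):
    """1D transposed convolution without padding.

    Output length: (len(x) - 1) * stride + len(kernel).
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for k, w in enumerate(kernel):
            # Each input value "paints" a scaled copy of the kernel.
            out[i * stride + k] += v * w
    return out


# Kernel [1, 1] with stride 2 duplicates each value (nearest-neighbour upsampling).
print(conv_transpose_1d([1, 2, 3], kernel=[1, 1], stride=2))
# -> [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]

# Overlapping kernel copies are summed where they meet.
print(conv_transpose_1d([1, 2], kernel=[1, 2, 1], stride=2))
# -> [1.0, 2.0, 3.0, 4.0, 2.0]
```

In a segmentation network the kernel weights are learnt, so the upsampling itself is trained rather than fixed.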

Deconvolution¶

Skip connections between corresponding convolution and deconvolution layers
Sharper masks by using precise spatial information (early layers)
Better object detection by using semantic information (late layers)

Hourglass network¶

Newell, Alejandro, et al. "Stacked Hourglass Networks for Human Pose Estimation." ECCV 2016

U-Net like architectures repeated sequentially
Each block refines the segmentation passed on to the following block
Each block has a segmentation loss

Mask-RCNN¶

He, Kaiming, et al. "Mask R-CNN." International Conference on Computer Vision (ICCV), 2017.

Faster-RCNN architecture with a third, binary mask head
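
The loss of the binary mask head can be sketched as a per-pixel sigmoid followed by an average binary cross-entropy, which is how the Mask R-CNN paper defines the mask loss for the mask of the ground-truth class. The toy mask size and the helper names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_loss(mask_logits, target_mask):
    """Average per-pixel binary cross-entropy between predicted mask
    logits and a binary target mask (both H x W nested lists)."""
    total, count = 0.0, 0
    for logit_row, target_row in zip(mask_logits, target_mask):
        for z, t in zip(logit_row, target_row):
            p = sigmoid(z)  # independent probability for each pixel
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


target = [[1, 0], [1, 1]]
uninformative = [[0.0, 0.0], [0.0, 0.0]]  # p = 0.5 everywhere
good = [[4.0, -4.0], [4.0, 4.0]]          # logits agree with the target
print(mask_loss(uninformative, target))   # log(2), about 0.693
print(mask_loss(good, target))
```

Because each pixel uses its own sigmoid, mask prediction is decoupled from the classification done by the other heads.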

Results¶

He, Kaiming, et al. "Mask R-CNN." International Conference on Computer Vision (ICCV), 2017.

Mask results are still coarse (low mask resolution)
Excellent instance generalization

Results¶

He, Kaiming, et al. "Mask R-CNN." International Conference on Computer Vision (ICCV), 2017.