Convolutional Neural Networks for Image Classification

Some of the material in this lecture comes from the online courses of Charles Ollion and Olivier Grisel - Master Datascience Paris Saclay.
CC-By 4.0 license

CNNs for computer vision

CNN for image classification

CNN = Convolutional Neural Network (or ConvNet)

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). "Gradient-based learning applied to document recognition." (the LeNet paper)

Outline of the lecture

  • Convolutions
  • Convolutions in Neural Networks
    • Motivations
    • Layers
  • Architectures
    • Classic CNN Architecture
    • AlexNet
    • VGG16
    • ResNet

Convolution

  • A mathematical operation that combines two functions to form a third function.
  • The feature map (or input data) and the kernel are combined to form a transformed feature map.
  • Often interpreted as a filter: the kernel filters the feature map for certain information (edges, etc.)
Figure 1: Convolving an image with an edge detector kernel.

The convolution of two functions $f$ and $x$, evaluated at $t$, is defined as:

$y(t) = (f \otimes x)(t) = \int_{-\infty}^{\infty} f(k) \cdot x(t-k) \, \mathrm{d}k$

where the symbol ⊗ denotes convolution.

https://developer.nvidia.com/discover/convolution
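As a quick illustration (not from the original slides), the discrete 1-D analogue of this integral can be computed with NumPy, whose np.convolve implements $y[t] = \sum_k f[k] \cdot x[t-k]$:

import numpy as np

f = np.array([1.0, 2.0, 1.0])                        # kernel
x = np.array([0, 0, 1, 0, 0, 1, 1, 0], dtype=float)  # input signal
y = np.convolve(x, f)                                # y[t] = sum_k f[k] * x[t-k]
print(y)  # each spike in x is spread out by the kernel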

Convolution

Convolutional filters can be interpreted as feature detectors:

  • The input feature map is filtered for a certain feature, encoded by the kernel.
  • The output is large if the feature is detected in the image.
The kernel thus acts as a feature detector: a detected feature produces a large output (white), and the absence of the feature produces a small output (black).
In [1]:
from IPython.display import IFrame
IFrame('https://setosa.io/ev/image-kernels/', width="100%", height=800)
Out[1]:
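The same effect in code: a minimal sketch (using scipy.signal.convolve2d; the image and kernel are illustrative) that convolves a step-edge image with a horizontal-gradient kernel, producing large responses exactly at the edge:

import numpy as np
from scipy.signal import convolve2d

image = np.hstack([np.zeros((5, 3)), np.ones((5, 3))])  # dark half, bright half
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)            # gradient (edge) kernel
response = convolve2d(image, kernel, mode='valid')
print(response)  # large magnitude at the vertical edge, zero elsewhere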

Convolution in a neural network

  • $x$ is a $3 \times 3$ chunk (yellow area) of the image (green array)

  • Each output neuron is parametrized with the $3 \times 3$ weight matrix $\mathbf{w}$ (small numbers)

The activation map is obtained by sliding the $3 \times 3$ window over the image and computing:

$z(x) = \mathrm{relu}(\mathbf{w}^T x + b)$
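A literal NumPy sketch of this computation (toy sizes; the function name is ours):

import numpy as np

def conv2d_relu(image, w, b=0.0):
    # Slide a KxK window over the image and compute z = relu(w^T x + b)
    K = w.shape[0]
    H, W = image.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            x = image[i:i + K, j:j + K]              # KxK chunk (the yellow area)
            out[i, j] = max(0.0, np.sum(w * x) + b)  # relu(w . x + b)
    return out

image = np.random.rand(5, 5)        # the "green array"
w = np.ones((3, 3)) / 9.0           # a 3x3 weight matrix
print(conv2d_relu(image, w).shape)  # (3, 3)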

Motivations

Standard Dense Layer for an image input:

from tensorflow.keras.layers import Input, Flatten, Dense

x = Input((640, 480, 3), dtype='float32')
# shape of x is: (None, 640, 480, 3)
y = Flatten()(x)
# shape of y is: (None, 640 x 480 x 3)
z = Dense(1000)(y)

$640 \times 480 \times 3 \times 1000 + 1000 \approx 922\mathrm{M}$ parameters

No spatial organization of the input.

Dense layers are never used directly on large images. The standard solution is to use convolutional layers.
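The 922M figure is a direct count of the Dense layer's weights and biases:

print(640 * 480 * 3 * 1000 + 1000)  # 921,601,000 ~ 922M parameters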

Motivations

Local connectivity

  • A neuron depends only on a few local input neurons
  • Translation invariance

Comparison to Fully connected

  • Parameter sharing: reduces overfitting
  • Makes use of spatial structure: a strong prior for vision!

Animal Vision Analogy

  • Hubel & Wiesel, "Receptive fields of single neurones in the cat's striate cortex" (1959)

Channels

Color image = tensor of shape (height, width, channels)

Convolutions are usually computed for each channel, and summed:

$(k \star im^{color}) = \sum\limits_{c=0}^2 k^c \star im^c $
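A sketch of this per-channel sum (scipy.signal.convolve2d and the random data are illustrative assumptions):

import numpy as np
from scipy.signal import convolve2d

im = np.random.rand(8, 8, 3)  # (height, width, channels)
k = np.random.rand(3, 3, 3)   # one 3x3 kernel slice per channel
out = sum(convolve2d(im[:, :, c], k[:, :, c], mode='valid') for c in range(3))
print(out.shape)  # (6, 6): the three per-channel maps are summed into one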

Multiple convolutions

  • Kernel size aka receptive field (usually 1, 3, 5, 7, 11)
  • Output dimension: length - kernel_size + 1

Strides

  • Stride: step size with which the convolution window slides over the input
  • Reduces the size of the output map

Example with kernel size $3 \times 3$ and a stride of $2$ (image in blue)


Convolution visualization by V. Dumoulin https://github.com/vdumoulin/conv_arithmetic

Padding

  • Padding: artificially fill the borders of the image
  • Useful to keep spatial dimension constant across filters
  • Useful with strides and large receptive fields
  • Usually: fill with 0s

Shapes of convolution layers


Kernel or Filter shape $(F, F, C^i, C^o)$
  • $F \times F$ kernel size,
  • $C^i$ input channels,
  • $C^o$ output channels

Number of parameters: $(F \times F \times C^i + 1) \times C^o$
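This count can be verified in Keras (a minimal sketch; the layer sizes are arbitrary):

from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

inp = Input(shape=(32, 32, 3))                # C^i = 3
out = Conv2D(filters=16, kernel_size=5)(inp)  # F = 5, C^o = 16
Model(inp, out).summary()                     # params: (5*5*3 + 1) * 16 = 1,216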

Shapes of convolution layers

Activations or Feature maps shape:

  • Input: $\left(W^i, H^i, C^i\right)$

  • Output: $\left(W^o, H^o, C^o\right)$

$W^o = (W^i - F + 2P) / S + 1$

In [2]:
from IPython.display import IFrame # loading animation from https://cs231n.github.io
IFrame('https://cs231n.github.io/assets/conv-demo/index.html', width="100%", height=700) 
Out[2]:

$W^o = (W^i - F + 2P) / S + 1$
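The same formula as a small helper function (names are illustrative):

def conv_output_size(w_in, f, p=0, s=1):
    # W^o = (W^i - F + 2P) / S + 1
    return (w_in - f + 2 * p) // s + 1

print(conv_output_size(5, 3))          # 3: no padding, stride 1
print(conv_output_size(5, 3, p=1))     # 5: padding keeps the spatial size
print(conv_output_size(227, 11, s=4))  # 55: AlexNet's first conv layer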

Pooling

  • Spatial dimension reduction
  • Local invariance
  • No parameters: max or average over 2x2 units (a NumPy sketch follows)
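A minimal NumPy sketch of 2x2 max pooling (the helper name is ours):

import numpy as np

def max_pool_2x2(fmap):
    # 2x2 max pooling with stride 2: no learned parameters
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))  # [[ 5.  7.]
                           #  [13. 15.]]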




In Keras

Fully Connected Network: Multilayer Perceptron

from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model

input_image = Input(shape=(28, 28, 1))
x = Flatten()(input_image)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
mlp = Model(inputs=input_image, outputs=x)

In Keras

Convolutional Network

from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model

input_image = Input(shape=(28, 28, 1))
x = Conv2D(filters=32, kernel_size=5, padding='same', activation='relu')(input_image)
x = MaxPooling2D(2, strides=2)(x)
x = Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x)
x = MaxPooling2D(2, strides=2)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
convnet = Model(inputs=input_image, outputs=x)

The 2D spatial organization of the features is preserved until the Flatten layer.

Feature visualization


DeepDream




Architectures

Classic ConvNet Architecture

Input

Conv blocks

  • Convolution + activation (relu)
  • Convolution + activation (relu)
  • ...
  • Maxpooling 2x2

Output

  • Fully connected layers
  • Softmax

AlexNet

Simplified version of Krizhevsky, A., Sutskever, I., and Hinton, G. "ImageNet classification with deep convolutional neural networks." NIPS 2012

Input: 227x227x3 image

First conv layer: 11x11x3x96 kernel, stride 4

  • Kernel shape: (11,11,3,96)
  • Output shape: (55,55,96)
  • Number of parameters: 34,944
  • Equivalent MLP parameters: $\approx 43.7 \times 10^9$
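These figures can be reproduced directly; note that the 43.7e9 value appears to assume a 224x224x3 input (an inference on our part, not stated on the slide):

F, C_i, C_o, S = 11, 3, 96, 4
print((F * F * C_i + 1) * C_o)  # 34,944 parameters (weights + biases)
print((227 - F) // S + 1)       # 55 -> output shape (55, 55, 96)
# A dense layer connecting every input pixel to every output unit:
print(224 * 224 * 3 * 55 * 55 * 96)  # ~43.7e9 weights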

AlexNet

INPUT:     [227x227x3]
CONV1:     [55x55x96]   96 11x11 filters at stride 4, pad 0
MAX POOL1: [27x27x96]      3x3   filters at stride 2
CONV2:     [27x27x256] 256 5x5   filters at stride 1, pad 2
MAX POOL2: [13x13x256]     3x3   filters at stride 2
CONV3:     [13x13x384] 384 3x3   filters at stride 1, pad 1
CONV4:     [13x13x384] 384 3x3   filters at stride 1, pad 1
CONV5:     [13x13x256] 256 3x3   filters at stride 1, pad 1
MAX POOL3: [6x6x256]       3x3   filters at stride 2
FC6:       [4096]      4096 neurons
FC7:       [4096]      4096 neurons
FC8:       [1000]      1000 neurons (softmax logits)

Total params: 28,054,497
Trainable params: 28,054,497
Non-trainable params: 0

VGG16

Simonyan, K., and Zisserman, A. "Very deep convolutional networks for large-scale image recognition." (2014)

VGG in Keras

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

    model = Sequential()
    # 'same' padding keeps the spatial size constant within each conv block
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same',
                     input_shape=(224, 224, 3)))
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))

Memory and Parameters

           Activation maps          Parameters
INPUT:     [224x224x3]   = 150K     0
CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
POOL2:     [7x7x512]     =  25K     0
FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
FC:        [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000

TOTAL activations: 24M x 4 bytes ~=  93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
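A quick arithmetic check of these totals (biases ignored, as in the table):

convs = [(3, 64), (64, 64), (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
conv_params = sum(3 * 3 * c_in * c_out for c_in, c_out in convs)
fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000
total = conv_params + fc_params
print(total)            # 138,344,128 ~ 138M parameters
print(total * 4 / 1e6)  # ~553 MB at 4 bytes per float32 parameter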

ResNet

Even deeper models: 34, 50, 101, 152 layers.

He, Kaiming, et al. "Deep residual learning for image recognition." CVPR 2016.

ResNet

A block learns the residual with respect to the identity mapping.

He, Kaiming, et al. "Deep residual learning for image recognition." CVPR 2016.

  • Good optimization properties
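A minimal sketch of an identity residual block in Keras (this is the two-convolution "basic" block; ResNet-50 itself uses a three-layer bottleneck variant):

from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add
from tensorflow.keras.models import Model

def residual_block(x, filters):
    # output = relu(x + F(x)): the block only has to learn the residual F
    y = Conv2D(filters, 3, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, 3, padding='same')(y)
    y = BatchNormalization()(y)
    y = Add()([x, y])  # identity shortcut
    return Activation('relu')(y)

inp = Input(shape=(56, 56, 64))
model = Model(inp, residual_block(inp, 64))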

ResNet

ResNet50 compared to VGG16:

  • Superior accuracy in all vision tasks
    • 5.25% top-5 error vs 7.1%
  • Fewer parameters
    • 25M vs 138M
  • Lower computational complexity
    • 3.8B FLOPs vs 15.3B FLOPs
  • Fully convolutional until the last layer

Benchmarks

Top-1 accuracy vs. number of images processed per second (with batch size 1) using the Titan Xp (S. Bianco, R. Cadene, L. Celona, and P. Napoletano, “Benchmark analysis of representative deep neural network architectures,” IEEE Access, vol. 6, 2018.)