Convolutional Neural Networks for Image Classification

Some of the material in this lecture comes from the online courses of Charles Ollion and Olivier Grisel - Master Datascience Paris Saclay.
CC-By 4.0 license

CNNs for computer vision

CNN for image classification

CNN = Convolutional Neural Network (or ConvNet)

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). "Gradient-based learning applied to document recognition." (the LeNet paper)

Outline of the lecture

  • Convolutions
  • Convolutions in Neural Networks
    • Motivations
    • Layers
  • Architectures
    • Classic CNN Architecture
    • AlexNet
    • VGG16
    • ResNet

Convolution

  • A mathematical operation that combines two functions to form a third function.
  • The feature map (or input data) and the kernel are combined to form a transformed feature map.
  • Often interpreted as a filter: the kernel filters the feature map for certain information (edges, etc.)
Figure 1: Convolving an image with an edge detector kernel.

The convolution of two functions $f$ and $x$, evaluated at $t$, is defined as:

$y(t) = (f \otimes x)(t) = \int_{-\infty}^{\infty} f(k) \cdot x(t-k) \, \mathrm{d}k$

where the symbol ⊗ denotes convolution.

https://developer.nvidia.com/discover/convolution
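As a quick illustration (not from the original slides), the discrete 1-D analogue of this integral can be computed with NumPy, whose np.convolve implements $y[t] = \sum_k f[k] \cdot x[t-k]$:

import numpy as np

f = np.array([1.0, 2.0, 1.0])                        # kernel
x = np.array([0, 0, 1, 0, 0, 1, 1, 0], dtype=float)  # input signal
y = np.convolve(x, f)                                # y[t] = sum_k f[k] * x[t-k]
print(y)  # each spike in x is spread out by the kernel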

Convolution

Convolutional filters can be interpreted as feature detectors:

  • The input feature map is filtered for a certain feature, encoded by the kernel.
  • The output is large if the feature is detected in the image.
The kernel thus acts as a feature detector: a detected feature produces a large output (white), and the absence of the feature produces a small output (black).
In [1]:
from IPython.display import IFrame
IFrame('https://setosa.io/ev/image-kernels/', width="100%", height=800)
Out[1]:
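The same effect in code: a minimal sketch (using scipy.signal.convolve2d; the image and kernel are illustrative) that convolves a step-edge image with a horizontal-gradient kernel, producing large responses exactly at the edge:

import numpy as np
from scipy.signal import convolve2d

image = np.hstack([np.zeros((5, 3)), np.ones((5, 3))])  # dark half, bright half
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)            # gradient (edge) kernel
response = convolve2d(image, kernel, mode='valid')
print(response)  # large magnitude at the vertical edge, zero elsewhere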

Convolution in a neural network

  • $x$ is a $3 \times 3$ chunk (yellow area) of the image (green array)

  • Each output neuron is parametrized with the $3 \times 3$ weight matrix $\mathbf{w}$ (small numbers)

The activation map is obtained by sliding the $3 \times 3$ window over the image and computing:

$z(x) = \mathrm{relu}(\mathbf{w}^T x + b)$
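A literal NumPy sketch of this computation (toy sizes; the function name is ours):

import numpy as np

def conv2d_relu(image, w, b=0.0):
    # Slide a KxK window over the image and compute z = relu(w^T x + b)
    K = w.shape[0]
    H, W = image.shape
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            x = image[i:i + K, j:j + K]              # KxK chunk (the yellow area)
            out[i, j] = max(0.0, np.sum(w * x) + b)  # relu(w . x + b)
    return out

image = np.random.rand(5, 5)        # the "green array"
w = np.ones((3, 3)) / 9.0           # a 3x3 weight matrix
print(conv2d_relu(image, w).shape)  # (3, 3)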

Motivations

Standard Dense Layer for an image input:

from tensorflow.keras.layers import Input, Flatten, Dense

x = Input((640, 480, 3), dtype='float32')
# shape of x is: (None, 640, 480, 3)
y = Flatten()(x)
# shape of y is: (None, 640 x 480 x 3)
z = Dense(1000)(y)

$640 \times 480 \times 3 \times 1000 + 1000 \approx 922\mathrm{M}$ parameters

No spatial organization of the input.

Dense layers are never used directly on large images. The standard solution is to use convolutional layers.
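The 922M figure is a direct count of the Dense layer's weights and biases:

print(640 * 480 * 3 * 1000 + 1000)  # 921,601,000 ~ 922M parameters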

Motivations

Local connectivity

  • A neuron depends only on a few local input neurons
  • Translation invariance

Comparison to Fully connected

  • Parameter sharing: reduces overfitting
  • Makes use of spatial structure: a strong prior for vision!

Animal Vision Analogy

  • Hubel & Wiesel, "Receptive fields of single neurones in the cat's striate cortex" (1959)

Channels

Color image = tensor of shape (height, width, channels)

Convolutions are usually computed for each channel, and summed:

$(k \star im^{color}) = \sum\limits_{c=0}^2 k^c \star im^c $
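A sketch of this per-channel sum (scipy.signal.convolve2d and the random data are illustrative assumptions):

import numpy as np
from scipy.signal import convolve2d

im = np.random.rand(8, 8, 3)  # (height, width, channels)
k = np.random.rand(3, 3, 3)   # one 3x3 kernel slice per channel
out = sum(convolve2d(im[:, :, c], k[:, :, c], mode='valid') for c in range(3))
print(out.shape)  # (6, 6): the three per-channel maps are summed into one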

Multiple convolutions

  • Kernel size aka receptive field (usually 1, 3, 5, 7, 11)
  • Output dimension: length - kernel_size + 1

Strides

  • Stride: step size with which the convolution window slides over the input
  • Reduces the size of the output map

Example with kernel size $3 \times 3$ and a stride of $2$ (image in blue)


Convolution visualization by V. Dumoulin https://github.com/vdumoulin/conv_arithmetic

Padding

  • Padding: artificially fill the borders of the image
  • Useful to keep spatial dimension constant across filters
  • Useful with strides and large receptive fields
  • Usually: fill with 0s

Shapes of convolution layers


Kernel or Filter shape $(F, F, C^i, C^o)$
  • $F \times F$ kernel size,
  • $C^i$ input channels,
  • $C^o$ output channels

Number of parameters: $(F \times F \times C^i + 1) \times C^o$
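This count can be verified in Keras (a minimal sketch; the layer sizes are arbitrary):

from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

inp = Input(shape=(32, 32, 3))                # C^i = 3
out = Conv2D(filters=16, kernel_size=5)(inp)  # F = 5, C^o = 16
Model(inp, out).summary()                     # params: (5*5*3 + 1) * 16 = 1,216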

Shapes of convolution layers

Activations or Feature maps shape:

  • Input: $\left(W^i, H^i, C^i\right)$

  • Output: $\left(W^o, H^o, C^o\right)$

$W^o = (W^i - F + 2P) / S + 1$

In [2]:
from IPython.display import IFrame # loading animation from https://cs231n.github.io
IFrame('https://cs231n.github.io/assets/conv-demo/index.html', width="100%", height=700) 
Out[2]:

$W^o = (W^i - F + 2P) / S + 1$
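The same formula as a small helper function (names are illustrative):

def conv_output_size(w_in, f, p=0, s=1):
    # W^o = (W^i - F + 2P) / S + 1
    return (w_in - f + 2 * p) // s + 1

print(conv_output_size(5, 3))          # 3: no padding, stride 1
print(conv_output_size(5, 3, p=1))     # 5: padding keeps the spatial size
print(conv_output_size(227, 11, s=4))  # 55: AlexNet's first conv layer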

Pooling

  • Spatial dimension reduction
  • Local invariance
  • No parameters: max or average over 2x2 units (a NumPy sketch follows)
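A minimal NumPy sketch of 2x2 max pooling (the helper name is ours):

import numpy as np

def max_pool_2x2(fmap):
    # 2x2 max pooling with stride 2: no learned parameters
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))  # [[ 5.  7.]
                           #  [13. 15.]]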




In Keras

Fully Connected Network: Multilayer Perceptron

from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model

input_image = Input(shape=(28, 28, 1))
x = Flatten()(input_image)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
mlp = Model(inputs=input_image, outputs=x)

In Keras

Convolutional Network

from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model

input_image = Input(shape=(28, 28, 1))
x = Conv2D(filters=32, kernel_size=5, padding='same', activation='relu')(input_image)
x = MaxPooling2D(2, strides=2)(x)
x = Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x)
x = MaxPooling2D(2, strides=2)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
convnet = Model(inputs=input_image, outputs=x)

The 2D spatial organization of the features is preserved until the Flatten layer.

Feature visualization


DeepDream




Architectures

Classic ConvNet Architecture

Input

Conv blocks

  • Convolution + activation (relu)
  • Convolution + activation (relu)
  • ...
  • Maxpooling 2x2

Output

  • Fully connected layers
  • Softmax

AlexNet

Simplified version of Krizhevsky, A., Sutskever, I., and Hinton, G. "ImageNet classification with deep convolutional neural networks." NIPS 2012

Input: 227x227x3 image

First conv layer: 11x11x3x96 kernel, stride 4

  • Kernel shape: (11,11,3,96)
  • Output shape: (55,55,96)
  • Number of parameters: 34,944
  • Equivalent MLP parameters: $\approx 43.7 \times 10^9$
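These figures can be reproduced directly; note that the 43.7e9 value appears to assume a 224x224x3 input (an inference on our part, not stated on the slide):

F, C_i, C_o, S = 11, 3, 96, 4
print((F * F * C_i + 1) * C_o)  # 34,944 parameters (weights + biases)
print((227 - F) // S + 1)       # 55 -> output shape (55, 55, 96)
# A dense layer connecting every input pixel to every output unit:
print(224 * 224 * 3 * 55 * 55 * 96)  # ~43.7e9 weights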

AlexNet

INPUT:     [227x227x3]
CONV1:     [55x55x96]   96 11x11 filters at stride 4, pad 0
MAX POOL1: [27x27x96]      3x3   filters at stride 2
CONV2:     [27x27x256] 256 5x5   filters at stride 1, pad 2
MAX POOL2: [13x13x256]     3x3   filters at stride 2
CONV3:     [13x13x384] 384 3x3   filters at stride 1, pad 1
CONV4:     [13x13x384] 384 3x3   filters at stride 1, pad 1
CONV5:     [13x13x256] 256 3x3   filters at stride 1, pad 1
MAX POOL3: [6x6x256]       3x3   filters at stride 2
FC6:       [4096]      4096 neurons
FC7:       [4096]      4096 neurons
FC8:       [1000]      1000 neurons (softmax logits)

Total params: 28,054,497
Trainable params: 28,054,497
Non-trainable params: 0

VGG16

Simonyan, K., and Zisserman, A. "Very deep convolutional networks for large-scale image recognition." (2014)

VGG in Keras

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

    model = Sequential()
    # 'same' padding keeps the spatial size constant within each conv block
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same',
                     input_shape=(224, 224, 3)))
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D((2, 2), strides=(2, 2)))

    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))

Memory and Parameters

           Activation maps          Parameters
INPUT:     [224x224x3]   = 150K     0
CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
POOL2:     [7x7x512]     =  25K     0
FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
FC:        [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000

TOTAL activations: 24M x 4 bytes ~=  93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
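A quick arithmetic check of these totals (biases ignored, as in the table):

convs = [(3, 64), (64, 64), (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
conv_params = sum(3 * 3 * c_in * c_out for c_in, c_out in convs)
fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000
total = conv_params + fc_params
print(total)            # 138,344,128 ~ 138M parameters
print(total * 4 / 1e6)  # ~553 MB at 4 bytes per float32 parameter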

ResNet

Even deeper models: 34, 50, 101, 152 layers.

He, Kaiming, et al. "Deep residual learning for image recognition." CVPR 2016.

ResNet

A block learns the residual with respect to the identity mapping.

He, Kaiming, et al. "Deep residual learning for image recognition." CVPR 2016.

  • Good optimization properties
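A minimal sketch of an identity residual block in Keras (this is the two-convolution "basic" block; ResNet-50 itself uses a three-layer bottleneck variant):

from tensorflow.keras.layers import Input, Conv2D, BatchNormalization, Activation, Add
from tensorflow.keras.models import Model

def residual_block(x, filters):
    # output = relu(x + F(x)): the block only has to learn the residual F
    y = Conv2D(filters, 3, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, 3, padding='same')(y)
    y = BatchNormalization()(y)
    y = Add()([x, y])  # identity shortcut
    return Activation('relu')(y)

inp = Input(shape=(56, 56, 64))
model = Model(inp, residual_block(inp, 64))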

ResNet

ResNet50 compared to VGG16:

  • Superior accuracy in all vision tasks
    • 5.25% top-5 error vs 7.1%
  • Fewer parameters
    • 25M vs 138M
  • Lower computational complexity
    • 3.8B FLOPs vs 15.3B FLOPs
  • Fully convolutional until the last layer

Benchmarks

Top-1 accuracy vs. number of images processed per second (with batch size 1) using the Titan Xp (S. Bianco, R. Cadene, L. Celona, and P. Napoletano, “Benchmark analysis of representative deep neural network architectures,” IEEE Access, vol. 6, 2018.)