Some of the material from this lecture comes from online courses of Charles Ollion and Olivier Grisel - Master Datascience Paris Saclay.
CC-By 4.0 license
The mathematical definition of convolution of two functions f and x over a range t is:
where the symbol ⊗ denotes convolution.
https://developer.nvidia.com/discover/convolution
Convolutional filters can be interpreted as feature detectors:
from IPython.display import IFrame
IFrame('https://setosa.io/ev/image-kernels/', width="100%", height=800)
$x$ is a $3 \times 3$ chunk (yellow area) of the image (green array)
Each output neuron is parametrized with the $3 \times 3$ weight matrix $\mathbf{w}$ (small numbers)
The activation obtained by sliding the $3 \times 3$ window and computing:
Standard Dense Layer for an image input:
x = Input((640, 480, 3), dtype='float32')
# shape of x is: (None, 640, 480, 3)
y = Flatten()(x)
# shape of y is: (None, 640 x 480 x 3)
z = Dense(1000)(y)
$640 \times 480 \times 3 \times 1000 + 1000 = 922M$
No spatial organization of the input
Dense layers are never used directly on large images. Most standard solution is to use convolution layers.
Colored image = tensor of shape (height, width, channels)
Convolutions are usually computed for each channel, and summed:
Number of parameters: $(F \times F \times C^i + 1) \times C^o$
Activations or Feature maps shape:
Input: $\left(W^i, H^i, C^i\right)$
Output: $\left(W^o, H^o, C^o\right)$
$W^o = (W^i - F + 2P) / S + 1$
from IPython.display import IFrame # loading animation from https://cs231n.github.io
IFrame('https://cs231n.github.io/assets/conv-demo/index.html', width="100%", height=700) 
$W^o = (W^i - F + 2P) / S + 1$
input_image = Input(shape=(28, 28, 1))
x = Conv2D(filters=32, kernel_size=5, padding='same', activation='relu')(input_image)
x = MaxPooling2D(2, strides=2)(x)
x = Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x)
x = MaxPooling2D(2, strides=2)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
convnet = Model(inputs=input_image, outputs=x)
2D spatial organization of features preserved untill Flatten.
Simplified version of Krizhevsky, Alex, Sutskever, and Hinton. "Imagenet classification with deep convolutional neural networks." NIPS 2012
Input: 227x227x3 image First conv layer: kernel 11x11x3x96 stride 4
(11,11,3,96)(55,55,96)34,94443.7 x 1e9INPUT:     [227x227x3]
CONV1:     [55x55x96]   96 11x11 filters at stride 4, pad 0
MAX POOL1: [27x27x96]      3x3   filters at stride 2
CONV2:     [27x27x256] 256 5x5   filters at stride 1, pad 2
MAX POOL2: [13x13x256]     3x3   filters at stride 2
CONV3:     [13x13x384] 384 3x3   filters at stride 1, pad 1
CONV4:     [13x13x384] 384 3x3   filters at stride 1, pad 1
CONV5:     [13x13x256] 256 3x3   filters at stride 1, pad 1
MAX POOL3: [6x6x256]       3x3   filters at stride 2
FC6:       [4096]      4096 neurons
FC7:       [4096]      4096 neurons
FC8:       [1000]      1000 neurons (softmax logits)
Total params: 28,054,497
Trainable params: 28,054,497
Non-trainable params: 0Simonyan, Karen, and Zisserman. "Very deep convolutional networks for large-scale image recognition." (2014)
    model.add(Convolution2D(64, 3, 3, activation='relu',input_shape=(3,224,224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(Convolution2D(256, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(Convolution2D(512, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    model.add(Flatten())
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1000, activation='softmax'))           Activation maps          Parameters
INPUT:     [224x224x3]   = 150K     0
CONV3-64:  [224x224x64]  = 3.2M     (3x3x3)x64    =       1,728
CONV3-64:  [224x224x64]  = 3.2M     (3x3x64)x64   =      36,864
POOL2:     [112x112x64]  = 800K     0
CONV3-128: [112x112x128] = 1.6M     (3x3x64)x128  =      73,728
CONV3-128: [112x112x128] = 1.6M     (3x3x128)x128 =     147,456
POOL2:     [56x56x128]   = 400K     0
CONV3-256: [56x56x256]   = 800K     (3x3x128)x256 =     294,912
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
CONV3-256: [56x56x256]   = 800K     (3x3x256)x256 =     589,824
POOL2:     [28x28x256]   = 200K     0
CONV3-512: [28x28x512]   = 400K     (3x3x256)x512 =   1,179,648
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
CONV3-512: [28x28x512]   = 400K     (3x3x512)x512 =   2,359,296
POOL2:     [14x14x512]   = 100K     0
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
CONV3-512: [14x14x512]   = 100K     (3x3x512)x512 =   2,359,296
POOL2:     [7x7x512]     =  25K     0
FC:        [1x1x4096]    = 4096     7x7x512x4096  = 102,760,448
FC:        [1x1x4096]    = 4096     4096x4096     =  16,777,216
FC:        [1x1x1000]    = 1000     4096x1000     =   4,096,000
TOTAL activations: 24M x 4 bytes ~=  93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)Superior accuracy in all vision tasks
Less parameters
Computational complexity
Fully Convolutional until the last layer