Some of the material in this lecture comes from the online course of Charles Ollion and Olivier Grisel (Master Datascience, Paris-Saclay), released under a CC-BY 4.0 license.
The mathematical definition of the convolution of two functions $f$ and $x$ evaluated at $t$ is:

$$(f \otimes x)(t) = \int_{-\infty}^{+\infty} f(\tau)\, x(t - \tau)\, d\tau$$

where the symbol $\otimes$ denotes convolution.
https://developer.nvidia.com/discover/convolution
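For discrete signals the integral becomes a sum. A minimal NumPy illustration of this definition (the filter and signal values below are made up):

import numpy as np

f = np.array([1.0, 2.0, 3.0])       # filter
x = np.array([0.0, 1.0, 0.5, 2.0])  # signal
# y[t] = sum over tau of f[tau] * x[t - tau]
y = np.convolve(f, x)
print(y)  # [0.  1.  2.5 6.  5.5 6. ]
# Note: np.convolve flips the kernel, as in the definition above; the
# "convolutions" in CNN layers are actually unflipped cross-correlations,
# a distinction that does not matter for learned filters.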
Convolutional filters can be interpreted as feature detectors:
from IPython.display import IFrame
IFrame('https://setosa.io/ev/image-kernels/', width="100%", height=800)
$x$ is a $3 \times 3$ chunk (yellow area) of the image (green array)
Each output neuron is parametrized with the $3 \times 3$ weight matrix $\mathbf{w}$ (small numbers)
The activation is obtained by sliding the $3 \times 3$ window over the image and computing:

$$z(x) = \mathrm{relu}\left(\sum_{i,j} w_{i,j} \, x_{i,j} + b\right)$$
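A minimal NumPy sketch of this sliding-window computation (the image, kernel, and bias values below are made up for illustration):

import numpy as np

image = np.zeros((5, 5), dtype=np.float32)
image[:, 2:] = 1.0                            # dark left half, bright right half
w = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]], dtype=np.float32)  # responds to vertical edges
b = 0.0

out = np.zeros((3, 3), dtype=np.float32)
for i in range(3):                            # stride 1, valid positions only
    for j in range(3):
        chunk = image[i:i + 3, j:j + 3]       # the 3x3 chunk of the image
        out[i, j] = max(0.0, float(np.sum(w * chunk) + b))  # relu(w.x + b)
print(out)  # strongest response where the window straddles the edge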
Standard Dense Layer for an image input:
from tensorflow.keras.layers import Input, Flatten, Dense

x = Input((640, 480, 3), dtype='float32')
# shape of x is: (None, 640, 480, 3)
y = Flatten()(x)
# shape of y is: (None, 921600), i.e. 640 x 480 x 3
z = Dense(1000)(y)
Number of parameters: $640 \times 480 \times 3 \times 1000 + 1000 \approx 922\mathrm{M}$
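A quick check of this count (assuming the TensorFlow 2.x tensorflow.keras API):

from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model

x = Input((640, 480, 3), dtype='float32')
z = Dense(1000)(Flatten()(x))
print(f"{Model(x, z).count_params():,}")  # 921,601,000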
No spatial organization of the input
Dense layers are therefore never used directly on large images; the standard solution is to use convolution layers.
A color image is a tensor of shape (height, width, channels).

Convolutions are usually computed for each channel and summed:

$$z = \sum_{c=1}^{C^i} \mathbf{w}_c \star \mathbf{x}_c + b$$
Number of parameters: $(F \times F \times C^i + 1) \times C^o$, where $F$ is the kernel size and $C^i$, $C^o$ are the numbers of input and output channels; the $+1$ is the bias of each output filter.
Activation (feature map) shapes:

Input: $\left(W^i, H^i, C^i\right)$

Output: $\left(W^o, H^o, C^o\right)$

$W^o = (W^i - F + 2P) / S + 1$, where $P$ is the padding and $S$ the stride (similarly for $H^o$)
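Both formulas are easy to check in code; a small helper sketch (the example layer below is hypothetical):

def conv_output_size(W_i, F, P, S):
    # W_o = (W_i - F + 2P) / S + 1
    return (W_i - F + 2 * P) // S + 1

def conv_params(F, C_i, C_o):
    # (F x F x C_i + 1) x C_o, one bias per output channel
    return (F * F * C_i + 1) * C_o

# Hypothetical layer: 5x5 kernels, 3 input channels, 32 output channels,
# stride 1, padding 2 (i.e. 'same' padding for F=5)
print(conv_output_size(224, F=5, P=2, S=1))  # 224 -> width preserved
print(conv_params(F=5, C_i=3, C_o=32))       # 2,432 parameters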
from IPython.display import IFrame # loading animation from https://cs231n.github.io
IFrame('https://cs231n.github.io/assets/conv-demo/index.html', width="100%", height=700)
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model

input_image = Input(shape=(28, 28, 1))
x = Conv2D(filters=32, kernel_size=5, padding='same', activation='relu')(input_image)
x = MaxPooling2D(2, strides=2)(x)
x = Conv2D(filters=64, kernel_size=3, padding='same', activation='relu')(x)
x = MaxPooling2D(2, strides=2)(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dense(10, activation='softmax')(x)
convnet = Model(inputs=input_image, outputs=x)
The 2D spatial organization of the features is preserved until the Flatten layer.
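A quick sanity check of the shapes and parameter counts (output abridged; exact layer names depend on the Keras version):

convnet.summary()
# conv2d:        (None, 28, 28, 32)   (5*5*1 + 1) * 32  =     832 params
# max_pooling2d: (None, 14, 14, 32)
# conv2d_1:      (None, 14, 14, 64)   (3*3*32 + 1) * 64 =  18,496 params
# max_pooling2d: (None, 7, 7, 64)
# flatten:       (None, 3136)
# dense:         (None, 256)          3136*256 + 256    = 803,072 params
# dense_1:       (None, 10)           256*10 + 10       =   2,570 params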
Simplified version of Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." NIPS 2012
Input: a 227x227x3 image. First conv layer: 96 kernels of size 11x11x3, stride 4, no padding.

- Kernel tensor shape: (11, 11, 3, 96)
- Feature map shape: (55, 55, 96), since $(227 - 11)/4 + 1 = 55$
- Number of parameters: $(11 \times 11 \times 3 + 1) \times 96 = 34{,}944$
- An equivalent fully connected layer would need $\approx 43.7 \times 10^9$ weights
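The feature map size and parameter count follow directly from the formulas above:

W_i, F, P, S = 227, 11, 0, 4
print((W_i - F + 2 * P) // S + 1)  # 55 -> feature maps of shape (55, 55, 96)
print((F * F * 3 + 1) * 96)        # 34,944 parameters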
INPUT: [227x227x3]
CONV1: [55x55x96] 96 11x11 filters at stride 4, pad 0
MAX POOL1: [27x27x96] 3x3 filters at stride 2
CONV2: [27x27x256] 256 5x5 filters at stride 1, pad 2
MAX POOL2: [13x13x256] 3x3 filters at stride 2
CONV3: [13x13x384] 384 3x3 filters at stride 1, pad 1
CONV4: [13x13x384] 384 3x3 filters at stride 1, pad 1
CONV5: [13x13x256] 256 3x3 filters at stride 1, pad 1
MAX POOL3: [6x6x256] 3x3 filters at stride 2
FC6: [4096] 4096 neurons
FC7: [4096] 4096 neurons
FC8: [1000] 1000 neurons (softmax class scores)
Total params: 28,054,497
Trainable params: 28,054,497
Non-trainable params: 0

(This count is for the simplified version used here; the original AlexNet has about 60M parameters.)
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." (2014)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# VGG16, channels-last; padding='same' keeps the spatial size between poolings
model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(256, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(512, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))
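As a sanity check (assuming the TensorFlow 2.x Keras API), the parameter total reported by the model should match the table below:

print(f"{model.count_params():,}")  # 138,357,544 -> the ~138M total below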
| Layer | Activation map | #Activations | Parameters |
|---|---|---|---|
| INPUT | 224x224x3 | 150K | 0 |
| CONV3-64 | 224x224x64 | 3.2M | (3x3x3)x64 = 1,728 |
| CONV3-64 | 224x224x64 | 3.2M | (3x3x64)x64 = 36,864 |
| POOL2 | 112x112x64 | 800K | 0 |
| CONV3-128 | 112x112x128 | 1.6M | (3x3x64)x128 = 73,728 |
| CONV3-128 | 112x112x128 | 1.6M | (3x3x128)x128 = 147,456 |
| POOL2 | 56x56x128 | 400K | 0 |
| CONV3-256 | 56x56x256 | 800K | (3x3x128)x256 = 294,912 |
| CONV3-256 | 56x56x256 | 800K | (3x3x256)x256 = 589,824 |
| CONV3-256 | 56x56x256 | 800K | (3x3x256)x256 = 589,824 |
| POOL2 | 28x28x256 | 200K | 0 |
| CONV3-512 | 28x28x512 | 400K | (3x3x256)x512 = 1,179,648 |
| CONV3-512 | 28x28x512 | 400K | (3x3x512)x512 = 2,359,296 |
| CONV3-512 | 28x28x512 | 400K | (3x3x512)x512 = 2,359,296 |
| POOL2 | 14x14x512 | 100K | 0 |
| CONV3-512 | 14x14x512 | 100K | (3x3x512)x512 = 2,359,296 |
| CONV3-512 | 14x14x512 | 100K | (3x3x512)x512 = 2,359,296 |
| CONV3-512 | 14x14x512 | 100K | (3x3x512)x512 = 2,359,296 |
| POOL2 | 7x7x512 | 25K | 0 |
| FC6 | 1x1x4096 | 4096 | 7x7x512x4096 = 102,760,448 |
| FC7 | 1x1x4096 | 4096 | 4096x4096 = 16,777,216 |
| FC8 | 1x1x1000 | 1000 | 4096x1000 = 4,096,000 |
TOTAL activations: 24M x 4 bytes ~= 93MB / image (x2 for backward)
TOTAL parameters: 138M x 4 bytes ~= 552MB (x2 for plain SGD, x4 for Adam)
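Back-of-the-envelope arithmetic behind these totals (float32, 4 bytes per value):

print(24e6 * 4 / 1e6)   # 96.0  -> ~MB of activations per image (x2 for backward)
print(138e6 * 4 / 1e6)  # 552.0 -> ~MB of parameters (x2 for SGD, x4 for Adam)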
- Superior accuracy in all vision tasks
- Fewer parameters
- Lower computational complexity
- Fully convolutional until the last layer