Intro to Keras, TensorFlow and advanced NN, part 2

Summary of first part

Terminology

  • A dataset in supervised learning is made of a number of (features, label) pairs
  • Example: a dataset of diabetic patients is made of:
    • Features: information describing each patient (weight, height, blood pressure...)
    • Labels: whether each patient is diabetic or not (glucose levels higher or lower than...)
  • Each (features, label) pair is also called a sample or example. Basically a data point
  • Features are also sometimes called inputs when referring to something you feed to a NN
  • Labels are compared to the NN's outputs to see how well the network is doing compared to the truth

https://keras.io

  • Keras is a high-level neural networks API (front-end), written in Python
  • Capable of running on top of TensorFlow, CNTK, or Theano (backends)
  • Built to simplify access to more complex backend libraries

https://tensorflow.org

Use TensorFlow if you want a finer level of control:

  • Build your own NN layers
  • Custom cost functions
  • More complex architectures than those available on Keras

We will be mostly writing Python code using Keras libraries, but "under the hood" Keras is using TensorFlow libraries.

The documentation is at keras.io.

Here's what a NN layer looks like in TensorFlow:

  • 7 samples in batch
  • 784 inputs
  • 500 outputs
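
A minimal sketch of such a layer in raw TensorFlow, using the shapes above (the variable names and the ReLU activation are our own choices, not from the slide):

import tensorflow as tf

x = tf.random.normal((7, 784))                  # a batch of 7 samples, 784 features each
W = tf.Variable(tf.random.normal((784, 500)))   # weight matrix: 784 inputs -> 500 outputs
b = tf.Variable(tf.zeros(500))                  # one bias per output
y = tf.nn.relu(tf.matmul(x, W) + b)             # layer output, shape (7, 500)
print(y.shape)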

A neural network in Keras is called a Model

The simplest kind of model is of the Sequential kind:

In [6]:
from tensorflow.keras.models import Sequential

model = Sequential()

This is an "empty" model, with no layers, no inputs or outputs are defined either.

Adding layers is easy:

In [7]:
from tensorflow.keras.layers import Dense

model.add(Dense(units=3, activation='relu', input_dim=3))
model.add(Dense(units=2, activation='softmax'))

A "Dense" layer is a fully connected layer as the ones we have seen in Multi-layer Perceptrons. The above is equal to having this network:

If we want to see the layers in the Model this far, we can just call:

In [ ]:
model.summary()

Using "model.add()" keeps stacking layers on top of what we have:

In [ ]:
model.add(Dense(units=2, activation=None))
model.summary()

Part 2, more Keras layers (https://keras.io/api/layers/)

Common layers (we will cover most of these!)

  • Trainable

    • Dense (fully connected/MLP)
    • Conv1D (2D/3D)
    • Recurrent: LSTM/GRU/Bidirectional
    • Embedding
    • Lambda (apply your own function)
  • Non-trainable

    • Dropout
    • Flatten
    • BatchNormalization
    • MaxPooling1D (2D/3D)
    • Merge (add/subtract/concatenate)
    • Activation (Softmax/ReLU/Sigmoid/...)

Dropout is a regularization layer

  • It's applied to a previous layer's output
  • Takes those outputs and randomly sets them to 0 with probability p
  • The remaining outputs are scaled up by 1/(1 - p) so that the expected sum of the inputs remains unchanged
  • if p = 0.5: model.add(Dropout(0.5))
In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Dropout
from tensorflow.keras import backend as K

tf.random.set_seed(1)
drop = Dropout(0.5, input_shape=(4,))            # drop each value with probability p = 0.5
data = tf.reshape(tf.range(1.0, 13.0), (3, 4))   # 3 samples with 4 features each

print("Before:", data, sep="\n")
output = drop(data, training=True)               # dropout is only active when training=True
print("After:", K.eval(output), sep="\n")
Before:
tf.Tensor(
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 9. 10. 11. 12.]], shape=(3, 4), dtype=float32)
After:
[[ 0.  4.  6.  0.]
 [ 0. 12. 14.  0.]
 [18. 20. 22. 24.]]

Dropout is a regularization layer

  • Applying the same input twice will give different results
  • Means that it is harder for the network to memorize patterns
  • Helps curb overfitting
  • Especially used with Dense() layers, which are prone to overfitting
  • Active only at training time
In [2]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.layers import Dropout
from tensorflow.keras import backend as K

# tf.random.set_seed() is intentionally not called here: the dropout mask changes at every call
drop = Dropout(0.5, input_shape=(4,))
data = tf.reshape(tf.range(1.0, 13.0), (3, 4))

print("Before:", data, sep="\n")
output = drop(data, training=True)
print("After:", K.eval(output), sep="\n")
Before:
tf.Tensor(
[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]
 [ 9. 10. 11. 12.]], shape=(3, 4), dtype=float32)
After:
[[ 2.  0.  0.  8.]
 [10.  0.  0. 16.]
 [ 0. 20.  0. 24.]]

Lambda layers

  • Work like regular lambda functions
  • Inputs and outputs are tensors; the function inside must use Keras/TensorFlow operations
  • The function has to be differentiable
In [3]:
from tensorflow.keras.layers import Lambda
from tensorflow.keras import backend as K

def sum_two_tensors(inputs):

    x, y = inputs
    sum_of_tensors = x + y

    return sum_of_tensors

input_tensor_1 = tf.range(0, 9)
input_tensor_2 = tf.range(1, 10)
print(input_tensor_1)
print(input_tensor_2)
#lambda_out = Lambda(sum_two_tensors)([input_tensor_1, input_tensor_2])
lambda_layer = Lambda(sum_two_tensors)
lambda_out = lambda_layer([input_tensor_1, input_tensor_2])
K.eval(lambda_out)

#model.add(Lambda(sum_two_tensors))
tf.Tensor([0 1 2 3 4 5 6 7 8], shape=(9,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7 8 9], shape=(9,), dtype=int32)
Out[3]:
array([ 1,  3,  5,  7,  9, 11, 13, 15, 17], dtype=int32)

Keras activations (https://keras.io/api/layers/activations/)

Activation functions for regression or inner layers:

  • Sigmoid
  • Tanh
  • ReLU
  • LeakyReLU
  • Linear (None)

THE activation function for classification (output layer only):

  • Softmax (outputs probabilities for each class)

Softmax

It's an activation function applied to an output vector z with K elements (one per class), and it outputs a probability distribution over the classes:
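
Written out explicitly (same $z$ and $K$ as above), the softmax output for class $i$ is:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K$$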

What makes softmax your favorite activation:

  • The K outputs sum to 1
  • The K probabilities are proportional to the exponentials of the inputs
  • No negative outputs
  • Monotonically increasing output with increasing input

Softmax is usually only used to activate the last layer of a NN

ReLU vs. old-school logistic functions

  • Historically, sigmoid and tanh were the most used activation functions
  • Easy derivative
  • Bounded outputs (e.g. from 0 to 1)
  • They look like this:

ReLU vs. old-school logistic functions

  • Problems arise when we are at large $|x|$
  • The derivative in that area becomes small (saturation)
  • Remember what the chain rule said?

ReLU vs. old-school logistic functions

  • When we have $n$ layers, we go through $n$ activation functions
  • At layer $n$ the derivative is proportional to: $$\begin{eqnarray} \frac{\partial L(w,b|x)}{\partial w_{l_n}} & \propto & \frac{\partial a_{l_n}}{\partial z_{l_n}} \end{eqnarray}$$
  • At layer 1 the derivative is proportional to: $$\begin{eqnarray} \frac{\partial L(w,b|x)}{\partial w_{l_1}} & \propto & \frac{\partial a_{l_n}}{\partial z_{l_n}} \times \frac{\partial a_{l_{n-1}}}{\partial z_{l_{n-1}}} \times \frac{\partial a_{l_{n-2}}}{\partial z_{l_{n-2}}} \times \ldots \times \frac{\partial a_{l_1}}{\partial z_{l_1}} \end{eqnarray}$$
  • It is the product of many numbers $< 1$
  • Gradient becomes smaller and smaller for the initial layers
  • Gradient vanishing problem
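
As a rough order-of-magnitude check (our own numbers, not from the slide): the sigmoid's derivative is at most $0.25$, so across 10 sigmoid layers the product above is at most $0.25^{10} \approx 10^{-6}$, and the first layers receive almost no gradient signal.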

ReLU is the first activation to address the issue

Used in "internal" layers, usually not at last layer

Pros:

  • Easy derivative (1 for x > 0, 0 elsewhere)
  • Derivative doesn't saturate for x > 0: alleviates gradient vanishing
  • Non-linear

Cons:

  • Not differentiable at 0
  • Dead neurons if x << 0 for all data instances
  • Potential gradient explosion
  • Let's try this on TensorFlow Playground: http://playground.tensorflow.org

Other ReLU-like activations

LeakyReLU/PReLU

  • $y = \alpha x$ for $x < 0$
  • In PReLU $\alpha$ is learned

Other ReLU-like activations

ELU

  • Differentiable at 0
  • Non-zero for $x < 0$
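
A short sketch (assuming the standard tf.keras layer names LeakyReLU, PReLU and ELU; the unit counts are our own) of how these variants can be used as separate activation layers:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU, PReLU, ELU

model = Sequential()
model.add(Dense(units=16, input_dim=4))
model.add(LeakyReLU(alpha=0.1))   # fixed slope alpha for x < 0
model.add(Dense(units=16))
model.add(PReLU())                # the slope alpha is a trainable parameter
model.add(Dense(units=16))
model.add(ELU(alpha=1.0))         # smooth and differentiable at 0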
In [4]:
from IPython.display import IFrame 
IFrame('https://polarisation.github.io/tfjs-activation-functions/', width=860, height=470) 
Out[4]:

Setting activations in Keras

We can add activations as string parameters, or as functions:

In [8]:
model = Sequential() 
model.add(Dense(units=2, activation='sigmoid'))
model.add(Dense(units=2, activation='relu'))
model.add(Dense(units=2, activation=tf.keras.activations.relu))
model.add(Dense(units=2, activation='softmax'))

But we can also add them as separate layers:

In [9]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense

model = Sequential() 
model.add(Dense(units=2))
model.add(Activation('sigmoid'))
model.add(Dense(units=2))
model.add(Activation('relu'))
model.add(Dense(units=2))
model.add(Activation(tf.keras.activations.relu))
model.add(Dense(units=2))
model.add(Activation('softmax'))

Passing classes as parameters

  • Some parameters can be set by passing a string (optimizer='rmsprop')
  • We need to explicitly import the object if we want better control (optimizer=RMSprop())
In [10]:
from tensorflow.keras.optimizers import RMSprop
model.compile(optimizer=RMSprop(),                    #adaptive learning rate method
              loss='sparse_categorical_crossentropy', #loss function for classification problems with integer labels
              metrics=['accuracy'])                   #the metric doesn't influence the training

model.optimizer.get_config()
Out[10]:
{'name': 'RMSprop',
 'learning_rate': 0.001,
 'decay': 0.0,
 'rho': 0.9,
 'momentum': 0.0,
 'epsilon': 1e-07,
 'centered': False}

Passing classes as parameters

  • Some parameters can be set by passing a string (optimizer='rmsprop')
  • We need to explicitly import the object if we want better control (optimizer=RMSprop())
In [11]:
from tensorflow.keras.optimizers import RMSprop
model.compile(optimizer=RMSprop(learning_rate=1.0),   #adaptive learning rate method
              loss='sparse_categorical_crossentropy', #loss function for classification problems with integer labels
              metrics=['accuracy'])                   #the metric doesn't influence the training

model.optimizer.get_config()
Out[11]:
{'name': 'RMSprop',
 'learning_rate': 1.0,
 'decay': 0.0,
 'rho': 0.9,
 'momentum': 0.0,
 'epsilon': 1e-07,
 'centered': False}

There are multiple ways to pass data to fit()

  • You can load all of the data in memory and assign it to:
    • A numpy array or a list of arrays (if you have multiple inputs/outputs)
    • TensorFlow tensors
    • A dictionary mapping input names to arrays/tensors
data = np.genfromtxt('path/to/dataset.csv', delimiter=',')

X_train = data[:, 0:10]   # first 10 columns: features
y_train = data[:, 10]     # last column: labels

model.fit(X_train, y_train, ...)

There are multiple ways to pass data to fit()

  • Or you can pass it an object/function that generates data for you:
    • A generator() function
    • A keras.utils.Sequence object
    • A tensorflow.data.Dataset object

Here is a quick example of a generator that loads data from a list of files (images, pickle objects, csv files...) on the filesystem:

def generator(input_list):
    input_list_file = open(input_list, 'r')
    while True:
        for next_file in input_list_file:

            # load one file at a time instead of the whole dataset
            data = np.genfromtxt(next_file.strip(), delimiter=',')
            X = data[:, 0:10]
            y = data[:, 10]

            yield X, y
        input_list_file.seek(0)  # start again from the first file at the next epoch

model.fit(generator(train_data_list),...)
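
A tensorflow.data.Dataset can play the same role as the generator above; this minimal sketch assumes X_train and y_train are already in memory (the batch and buffer sizes are our own choices):

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
dataset = dataset.shuffle(buffer_size=1024).batch(32)   # shuffle and batch on the fly

model.fit(dataset, epochs=10)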

Even more Keras layers

  • Dense is the classic FFNN layer where all nodes in consecutive layers are connected
  • Most of the other layers seen today are not trainable
  • What other layers are trainable then?

Convolutional layers

  • Used where the spatial relationship between inputs is significant
  • Classic example: imaging
  • Different types: 1D, 2D, 3D
from tensorflow.keras.layers import Conv2D

model.add(Conv2D(filters, kernel_size, strides=(1, 1), padding="valid"))
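
A hypothetical concrete example (the filter count, kernel size and input shape are our own choices): 32 filters of size 3x3 applied to 28x28 grayscale images.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu',
                 input_shape=(28, 28, 1)))     # 28x28 images, 1 channel
model.add(MaxPooling2D(pool_size=(2, 2)))      # downsample the feature maps
model.add(Flatten())                           # flatten before the Dense classifier
model.add(Dense(units=10, activation='softmax'))
model.summary()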

Convolutional layers


Recurrent layers

  • Used when the temporal relationship between inputs is significant
  • Examples: audio, text
  • Different types: LSTM, GRU...
from tensorflow.keras.layers import LSTM
model.add(LSTM(units, activation="tanh", recurrent_activation="sigmoid"))
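
A small example along the same lines (the sequence length, feature count and unit count are our own choices): an LSTM reading sequences of 100 timesteps with 8 features each, followed by a 2-class softmax.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(units=32, input_shape=(100, 8)))   # outputs the last hidden state
model.add(Dense(units=2, activation='softmax'))
model.summary()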

Recurrent layers

Embedding layers

  • Used to transform a discrete input into a vector
  • Example: text input is made of words; how do we translate those into NN inputs?
  • "cat" -> [0.1, 0.003, 1.2 ..., 0]
from tensorflow.keras.layers import Embedding
model.add(Embedding(input_dim, output_dim))
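
A hedged concrete example (the vocabulary size and vector dimension are our own choices): map integer word indices from a 10000-word vocabulary to 64-dimensional vectors.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=64))
# input:  (batch_size, sequence_length) integer word indices
# output: (batch_size, sequence_length, 64) dense vectors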

Embedding layers

  • Example: map amino acid names to 2D space
  • Which amino acids are most similar to tryptophan (W)?

The functional API in Keras

  • https://keras.io/guides/functional_api
  • Sequential() is quite simple, but limited
  • What if we want to have multiple input/output layers?
  • What if we want a model that is not just a linear sequence of layers?
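
A minimal functional-API sketch (our own toy architecture, not from the slides) with two inputs merged by concatenation, something Sequential() cannot express:

from tensorflow.keras.layers import Input, Dense, concatenate
from tensorflow.keras.models import Model

input_a = Input(shape=(10,))                      # first input: 10 features
input_b = Input(shape=(5,))                       # second input: 5 features
x = Dense(8, activation='relu')(input_a)
y = Dense(8, activation='relu')(input_b)
merged = concatenate([x, y])                      # merge the two branches
output = Dense(2, activation='softmax')(merged)

model = Model(inputs=[input_a, input_b], outputs=output)
model.summary()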

Exercise 2/3 (reprise)

  • Remember the XOR classifier? Or the Boston housing dataset?
  • Can you apply some of the things we have learned today on the models from yesterday?
  • Do they help?

Exercise 4 (optional)

Classifying IMDB reviews into positive or negative.

Check the exercises notebook!