Good practices of NN/DL project design¶

What to do and - more importantly perhaps - not to do¶

Is my project right for Neural Networks?¶

The thought process should not be: “I have some data, why don’t we try neural networks”
But it should be: “Given the problem, does it make sense to use neural networks?”
- Do I really need non-linear modelling?
- What literature is out there for similar problems?
- How much data will I be able to gather or put my hands on?
- Are there datasets out there that I can re-use before I collect my data?

Do I really need non-linear modelling?¶

Sometimes linear methods perform just as well if not better
Less risk of catastrophic overfitting
Faster to code, optimize, run, debug
Use linear modelling as a baseline before you move to non-linear methods
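A linear baseline can be sketched in a few lines. The example below is a minimal, dependency-free logistic regression trained with batch gradient descent on invented toy data (two Gaussian blobs); the function names, learning rate and number of epochs are assumptions made for illustration. In a real project a library implementation such as scikit-learn's `LogisticRegression` would be the more practical choice.

```python
import math
import random


def train_logistic_regression(X, y, lr=0.1, epochs=200):
    """Fit weights and bias with plain batch gradient descent."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def accuracy(w, b, X, y):
    """Fraction of samples on the correct side of the linear boundary."""
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


# Toy data invented for illustration: two well-separated 2-D blobs
rng = random.Random(0)


def make_blob(center, n):
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(n)]


X_train, y_train = make_blob(-2, 100) + make_blob(2, 100), [0] * 100 + [1] * 100
X_test, y_test = make_blob(-2, 50) + make_blob(2, 50), [0] * 50 + [1] * 50

w, b = train_logistic_regression(X_train, y_train)
baseline_accuracy = accuracy(w, b, X_test, y_test)
print(f"Linear baseline accuracy: {baseline_accuracy:.2f}")
```

Any neural network tried afterwards has to beat `baseline_accuracy` on the same held-out data to justify its extra complexity.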

Real-life example¶

Drop-in question: "I tried deep learning on my data and it didn't perform better than this other simpler method"

Classifying gene expression samples
Thousands of features
1000 samples
2 classes
NN looked like this:

from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Dense(1000, input_dim=5000))
model.add(Dense(500))
model.add(Dense(2, activation="softmax"))

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_4 (Dense)              (None, 1000)              5001000   
_________________________________________________________________
dense_5 (Dense)              (None, 500)               500500    
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 1002      
=================================================================
Total params: 5,502,502
Trainable params: 5,502,502
Non-trainable params: 0
_________________________________________________________________

Parameters (weights) vs. samples¶

If the number of parameters is many times higher than the number of samples, a NN will most likely memorize the training data instead of generalizing
Ideally, we are looking for the inverse: way more samples than parameters
Some rules of thumb out there:
- Definitely bad if number of weights > number of samples
- 10x as many labelled samples as there are weights
- A few thousand samples per class
- Just try it and downscale/regularize until you're not overfitting anymore (or until you have a linear model)
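The first rule of thumb is easy to check before training anything. The sketch below counts the weights and biases of a fully connected network from its layer widths and compares the total to the number of samples, using the sizes from the gene expression example above (5000 inputs, 1000 samples). The helper name `dense_param_count` is made up for illustration.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the input width followed by the width of each Dense layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


n_samples = 1000
n_params = dense_param_count([5000, 1000, 500, 2])
params_per_sample = n_params / n_samples

print(f"{n_params:,} parameters for {n_samples:,} samples")
print(f"roughly {params_per_sample:.0f} parameters per sample")
```

The total matches the `Total params` line of the model summary (5,502,502), which is several thousand parameters for every available sample.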

And even if I have enough data for a NN...¶

... is Deep Learning the right choice?

The tasks where Deep Learning shines are those that require feature extraction:
- Imaging -> edge/object detection
- Audio/text -> sound/word/sentence detection
- Protein structure prediction -> mutation patterns/local structure/global structure
Deep Learning makes feature extraction automatic and seems to work best when there is a hierarchy to these features
Is your data made that way?
- Does it have an order (spatial/temporal)?
- Are smaller patterns going to form higher-order patterns?
All these different types of layers need to be there for a reason

source: datarobot

And even when both these conditions have been met¶

... you need a few more things:

Domain knowledge is not enough
Sometimes people with NN/DL knowledge and no domain knowledge end up being the right ones for the job (see AlphaFold)
You also need lots of patience and time: these things rarely work out of the box

A few more things to keep in mind¶

You need extensive knowledge of your data:
- Split the data in a rigorous way to avoid introducing biases
- Check for information leakage before you get overly optimistic results
- Make sure that there are no errors in your data

And therein lies the main issue:

Some think that DL is about having a model magically fixing your data
Instead, DL is mostly about knowing your data

1) Neural Nets are very good at detecting patterns and they will use this against you¶

(a.k.a. target leakage)¶

Target leakage¶

Making a predictor when you know the answers is not as easy as it seems
You need to remove any revealing info you would not have access to in a real scenario
Classic example: predict yearly salary of employee
- But one of the features is "monthly income"
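One cheap first check for this kind of leakage is to look for features that are almost perfectly correlated with the target. The sketch below does that on invented toy data mirroring the salary example; the function names and the 0.95 threshold are assumptions, and a high correlation is only a hint that a feature deserves a closer look, not proof of leakage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def suspicious_features(features, target, threshold=0.95):
    """Return names of features whose |correlation| with the target exceeds threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) > threshold
    ]


# Toy data, invented for illustration
yearly_salary = [36000, 48000, 60000, 30000, 84000]
features = {
    "age": [25, 52, 31, 47, 38],
    "monthly_income": [s / 12 for s in yearly_salary],  # leaks the target
}

leaky = suspicious_features(features, yearly_salary)
print(leaky)
```

Leakage that is non-linear or spread across several features will not show up in a check like this, so it complements rather than replaces inspecting the data.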

Example: detecting COVID-19 from chest scans¶

(https://www.datarobot.com/blog/identifying-leakage-in-computer-vision-on-medical-images/)

COVIDx dataset
Training set: chest X-rays of 66 positive COVID results, 120 random non-COVID examples
2-class classifier based on ResNet50 Featurizer
Perfect validation results! Great!

Example: detecting COVID-19 from chest scans¶

Inspecting dataset with image embeddings tells another story: can anyone tell what's wrong?

(source)

Example: detecting COVID-19 from chest scans¶

Let's look at the activation maps in more detail

Get final layer's output after activation (ReLU) and plot figure

(source)

Example: normalizing inputs on train/validation/test data¶

If you compute the normalization statistics on the validation or test data as well, you are using information you wouldn't have in a real scenario
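A minimal sketch of the safe order of operations: compute the mean and standard deviation on the training split only, then reuse those same numbers for every other split. The data and helper names are invented for illustration; with scikit-learn the equivalent would be calling `StandardScaler.fit` on the training data and `transform` on the rest.

```python
import statistics


def fit_standardizer(values):
    """Return (mean, std) computed from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Apply an already fitted (mean, std) to any split."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
validation = [10.0, 12.0]

# Statistics come from the training split and nothing else
mean, std = fit_standardizer(train)

train_scaled = standardize(train, mean, std)
validation_scaled = standardize(validation, mean, std)
```

The validation values end up far from zero here, which is the honest outcome: the model will see the same shift when it meets new data in production.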

Lab 1: looking for target leakage in a text dataset (~1 h.)¶

Jupyter notebook (download from the Canvas module page)

Visualize the layers of a NN for Natural Language Processing:

Can you tell if there is target leakage of some kind?
Propose solutions to curb the issue

2) Know your train/validation/test sets¶

A train set is a set of samples used to tune the NN weights
A validation set is a set used to tune the NN hyperparameters:
- Type of model (maybe not even a NN)
- Number of layers
- Number of neurons per layer
- Type of layers
- Optimizer
Validation set results are NOT the ones that get published, and cross-validating does not change that
A test set is a secluded set of samples that are used only once to test the final model
- Gives an idea of how well the model generalizes to unseen data (results go on paper)
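The three roles can be made concrete with a simple split. The sketch below shuffles once with a fixed seed and cuts the data into disjoint train, validation and test lists; the fractions and the function name are assumptions, and a purely random split like this is only appropriate when the samples are independent of each other.

```python
import random


def three_way_split(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once and cut into disjoint train, validation and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))
```

The test list is then set aside and only touched once, after all hyperparameter choices have been made on the validation list.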

Beware of similar samples across sets¶

(2F08 “Fear of Flying”)

Knowing what each set does is half the battle¶

Train, validation and test sets cannot be too similar to each other, or you will not be able to tell if the network is generalizing or just memorizing

How different they should be depends on what you're trying to achieve
Come up with a similarity measure
At the very least remove duplicate samples
You would be surprised how often scientists mess this up
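As a rough sketch of the two steps above, the code below removes exact duplicates between a training set and a held-out set, then flags near duplicates with a similarity measure. Jaccard similarity on word sets and the 0.7 threshold are assumptions chosen for this toy text data; the right measure depends on the data type and on what the model is supposed to generalize to.

```python
def remove_cross_set_duplicates(train, held_out):
    """Drop repeated train samples and any train sample that also appears in held_out."""
    blocked = set(held_out)
    seen = set()
    cleaned = []
    for sample in train:
        if sample in blocked or sample in seen:
            continue
        seen.add(sample)
        cleaned.append(sample)
    return cleaned


def jaccard(a, b):
    """Similarity between two token collections: shared tokens / all tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Toy data, invented for illustration
train = ["the cat sat", "a dog ran", "the cat sat", "birds fly south"]
test = ["a dog ran", "the cat sat down"]

clean_train = remove_cross_set_duplicates(train, test)

# Pairs that are not identical but still suspiciously close
near_duplicates = [
    (tr, te)
    for tr in clean_train
    for te in test
    if jaccard(tr.split(), te.split()) >= 0.7
]
print(clean_train)
print(near_duplicates)
```

Exact matching misses the "the cat sat" / "the cat sat down" pair, which is why a similarity measure is needed on top of deduplication.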

Sad ending :(¶

Another example, protein structure prediction¶

For some reason most researchers try to split train/validation/test by sequence similarity
If two proteins have <25% identical amino acids, they are deemed different enough
But protein families/superfamilies contain many proteins that share no detectable sequence similarity
Sequence similarity is not the right metric!
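One alternative is to split by group, so that all members of a family land on the same side of the split. The sketch below assumes each protein already carries a family or superfamily label from some external classification; the protein names, family labels and the function `split_by_group` are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_group(sample_to_group, test_fraction=0.2, seed=0):
    """Assign whole groups to either train or test so no group straddles both."""
    groups = defaultdict(list)
    for sample, group in sample_to_group.items():
        groups[group].append(sample)

    group_names = sorted(groups)
    random.Random(seed).shuffle(group_names)

    n_total = len(sample_to_group)
    train, test = [], []
    for name in group_names:
        # Fill the test set first, then send the remaining groups to train
        target = test if len(test) < test_fraction * n_total else train
        target.extend(groups[name])
    return train, test


# Hypothetical proteins and family labels, invented for illustration
proteins = {
    "prot01": "famA", "prot02": "famA", "prot03": "famA",
    "prot04": "famB", "prot05": "famB", "prot06": "famB",
    "prot07": "famC", "prot08": "famC",
    "prot09": "famD", "prot10": "famD",
}

train, test = split_by_group(proteins)
print(sorted({proteins[p] for p in test}), "held out for testing")
```

Because groups are moved as a whole, the test set size is only approximately `test_fraction` of the data.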

Lab 2: splitting a protein sequence dataset (~1 h.)¶

Jupyter notebook:

session_goodPracticesDatasetDesign/lab_validation/rigorous_train_validation_splitting.ipynb

Two different strategies will be tested:

Random split
Split by alignment score

Which works best? Different groups test different networks on each strategy
