The most important talk you never heard
Think about it…
Has anyone ever talked to you about the underlying “machinery”?
Has anyone ever presented you with a project organisation plan?
How were the results you are presented with actually produced?
I know papers are supposed to have a materials and methods section, but have you ever been tasked with deciphering and repeating such a section?
“More than 70% of researchers have tried and failed to reproduce another scientist’s experiments”
“More than half have failed to reproduce their own experiments”
Granted, some of the reasons for the reproducibility crisis are beyond our control
However, a step in the right direction is to think about organising and documenting your research
I have often seen people revisit old projects only to find that they cannot figure the project out, let alone reproduce it, or that understanding it is so time-consuming that repeating it from scratch is faster
Why does this happen?
Raw data should always be pulled from a central source, never from an Excel sheet someone sent you
You are not allowed to touch or alter the original raw data
Make sure that every step from the raw data to the data you use for analysis can be repeated
Save the cleaned data and proceed from that
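A minimal sketch of these rules, written in Python purely for illustration. The function names, the file layout and the cleaning rule (drop incomplete rows, trim whitespace) are assumptions, not part of the talk. The point is only that the raw file is read and never written, and that the cleaned copy is saved somewhere else.

```python
import csv
from pathlib import Path


def clean_rows(rows):
    """Strip whitespace from every value and drop rows with an empty field."""
    cleaned = []
    for row in rows:
        # Missing or malformed fields are treated as empty (hypothetical rule).
        stripped = {
            key: value.strip() if isinstance(value, str) else ""
            for key, value in row.items()
        }
        if all(stripped.values()):
            cleaned.append(stripped)
    return cleaned


def raw_to_clean(raw_path, clean_path):
    """Read the raw file (never written to) and save a cleaned copy elsewhere."""
    raw_path, clean_path = Path(raw_path), Path(clean_path)
    if raw_path.resolve() == clean_path.resolve():
        raise ValueError("Refusing to overwrite the raw data")
    with raw_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = clean_rows(reader)
    clean_path.parent.mkdir(parents=True, exist_ok=True)
    with clean_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Because the cleaning lives in a function rather than in manual edits, rerunning `raw_to_clean` on the same raw file reproduces the same cleaned file every time.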
This is where your manuscript lives
Notes, presentations, PDFs and the like pertaining to the project
Basically anything “doc”-like…
This is where your analysis scripts are placed
All scripts must be able to run from start-to-end with no manual intervention (We’ll get back to that)
Load-clean-func-do philosophy
The first script takes your data from raw to analysis-ready
Project specific functions are put in a separate file
A single do file is defined, capable of running the ENTIRE project and producing ALL results
Collect the results in a markdown file
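The load-clean-func-do idea can be sketched as follows. This is an illustration under assumptions, not the talk's actual scripts: the four stages are collapsed into one Python file so the example is self-contained, the data are hard-coded, and every name (`load`, `clean`, `summarise`, `do`) is hypothetical. In a real project each stage would live in its own file.

```python
from pathlib import Path
from statistics import mean


# --- func: project-specific helper functions (a separate file in a real project)
def summarise(values):
    """Return the count and mean of a list of numbers."""
    return {"n": len(values), "mean": mean(values)}


# --- load: read the raw data (hard-coded here so the sketch is self-contained)
def load():
    return ["4.0", "5.5", "", "6.5"]


# --- clean: take the data from raw to analysis-ready
def clean(raw):
    return [float(value) for value in raw if value != ""]


# --- do: run the whole project from start to end and collect the results
def do(report_path):
    results = summarise(clean(load()))
    lines = [
        "# Results",
        "",
        "| n | mean |",
        "|---|------|",
        f"| {results['n']} | {results['mean']:.2f} |",
    ]
    # All results end up in a single markdown file.
    Path(report_path).write_text("\n".join(lines) + "\n")
    return results
```

Calling `do("results.md")` is then the only step needed to regenerate every result, with no manual intervention in between.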
Use GitHub for (code) collaboration/sharing, version control and backup
Once analysis has converged, a technical report should be created using markdown
Once the paper is published, the project directory should be frozen as read-only (and be on GitHub)
The directory should contain everything needed to recreate all the exact figures and tables in the paper
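One possible way to freeze a directory, sketched in Python and assuming POSIX-style file permissions (on Windows `chmod` only toggles the read-only flag). The function name `freeze` is hypothetical, and this only removes write permission locally; it does not replace having the project on GitHub.

```python
import os
import stat
from pathlib import Path

# Write permission for owner, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def freeze(project_dir):
    """Remove write permission from project_dir and everything below it.

    Returns the number of files and directories that were changed.
    """
    top = Path(project_dir)
    frozen = 0
    for root, dirs, files in os.walk(top):
        for name in files + dirs:
            path = Path(root) / name
            if path.is_symlink():
                continue  # leave symlinks and their targets alone
            path.chmod(path.stat().st_mode & ~WRITE_BITS)
            frozen += 1
    top.chmod(top.stat().st_mode & ~WRITE_BITS)
    return frozen + 1
```

Read and execute permissions are left untouched, so the frozen project can still be browsed and rerun, just not modified by accident.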
This is not the absolute truth
This is my (current) take on how to do data science
Structure takes time in order to save time
Am I always adhering 110% to this? No, but…
I strongly believe that striving for structure is better than abandoning it
A picture speaks a thousand words - Let’s try to visualise it!
Think about readability of your code. Every project you work on is fundamentally collaborative. Even if you are not working with any other person, you are always working with future you and you really do not want to be in a situation where future you has no idea what past you was thinking, because past you will not respond to any emails! [Hadley Wickham]
R for Bio Data Science