QTM 350 - Data Science Computing

Lecture 08 - Practice

Danilo Freire

Emory University

And here we are! 😊

Time to practice! 🧑🏻‍💻👩🏼‍💻

Hands-on practice

  • Now, you will practice setting up a new data science project

    • Structuring it, using Git for version control, and connecting it to a remote GitHub repository
  • You should execute these commands in your terminal and are encouraged to document their main commands in a README.md file as they go.

  • Estimated Time: 40-50 minutes

  • Feel free to use any of resources you have available, but please avoid using AI tools
  • Why? Because this is a risk-less activity to practice and learn 🤓
  • If you get stuck, please ask for help! Searching the web is also a good option 😉

Instructions

  • You can download a Jupyter Notebook with the instructions for this activity in this link
  • You can also find the same instructions in this presentation
  • You can use any terminal you want, but we recommend using the terminal in JupyterLab
    • Use the ! command to run shell commands in Jupyter, it should work the same way
    • e.g., !mkdir my_ds_project_revision
  • You can also use the terminal in RStudio, VSCode, or any other terminal you prefer

Part 1: Project & GitHub Setup

Throughout this revision, please execute the commands in your terminal. You should create a project_log.md file in your project’s root directory and add the commands you use as you go.

    1. Create a Project Directory:
    • On your local machine, create a new directory for this project. Let’s call it my_ds_project_revision.
mkdir my_ds_project_revision
  • Navigate into this new directory.
cd my_ds_project_revision
  • (Self-check: Use pwd to confirm you’re in the right place.)
pwd

Part 1: Project & GitHub Setup

    1. Create a New Repository on GitHub:
  • Go to GitHub and create a new, empty public repository.

  • Name it my-ds-project-revision (or similar).

  • Important: Do not initialise it with a README, .gitignore, or license yet.

  • Once created, copy the HTTPS or SSH URL for your new repository.

    1. Initialise Local Git Repository:
  • Back in your my_ds_project_revision directory in the terminal:

git init

Part 1: Project & GitHub Setup

    1. Initial Commit:
  • Create a simple README.md file. Add content using echo:
echo "# My Data Science Project Revision" > README.md
echo "This project is a hands-on revision of setting up a data science workflow with Git and GitHub." >> README.md
  • Stage the README.md file:
git add README.md
  • Commit the staged file:
git commit -m "Initial commit: Add README"

Part 1: Project & GitHub Setup

    1. Connect to GitHub and Push:
  • Add the GitHub repository as the remote origin (replace YOUR_GITHUB_REPO_URL with the URL you copied):
git remote add origin YOUR_GITHUB_REPO_URL
  • Rename your default branch to main (if it’s not already named that, e.g., if it’s master):
git branch -M main
  • Push your initial commit to the main branch on GitHub:
git push -u origin main
  • (Self-check: Refresh your GitHub repository page to see the README.md file.)

Part 2: Structuring Your Data Science Project

    1. Create Standard Directories:
  • Create the following directories using brace expansion:
mkdir -p data/{raw,processed} notebooks scripts results docs
  • (Self-check: Use ls or tree if available.)
ls -R

Part 2: Structuring Your Data Science Project

    1. Add Initial Project Files:
  • Create empty files:
touch scripts/01_data_preprocessing.py scripts/02_analysis.py notebooks/exploratory_data_analysis.ipynb data/raw/placeholder.txt docs/project_plan.md
  • Add placeholder text to docs/project_plan.md:
echo "## Project Plan" > docs/project_plan.md
echo "1. Setup project structure." >> docs/project_plan.md
echo "2. Implement data preprocessing." >> docs/project_plan.md
echo "3. Perform analysis." >> docs/project_plan.md

Part 2: Structuring Your Data Science Project

    1. Create a .gitignore file:
  • Create the file:
touch .gitignore
  • Add lines to .gitignore (you can use echo ... >> .gitignore for each line or open it in a text editor):
echo "# Python" >> .gitignore
echo "__pycache__/" >> .gitignore
echo "*.pyc" >> .gitignore
echo "*.pyo" >> .gitignore
echo "*.pyd" >> .gitignore
echo "" >> .gitignore
echo "# Jupyter Notebook" >> .gitignore
echo ".ipynb_checkpoints/" >> .gitignore
echo "" >> .gitignore
echo "# Data files" >> .gitignore
echo "data/raw/*" >> .gitignore
echo "data/processed/*" >> .gitignore
echo "!data/raw/placeholder.txt" >> .gitignore
echo "" >> .gitignore
echo "# Results" >> .gitignore
echo "results/*" >> .gitignore
echo "" >> .gitignore
echo "# Environment" >> .gitignore
echo ".env" >> .gitignore
echo "venv/" >> .gitignore
echo "env/" >> .gitignore
  • (Self-check: View the contents of .gitignore using cat .gitignore)

Part 2: Structuring Your Data Science Project

    1. Commit Project Structure:
  • Stage all new directories/files:
git add .
  • Commit the changes:
git commit -m "Feat: Setup project directory structure and gitignore"
  • Push to GitHub:
git push origin main

Part 3: Simulating a Workflow with Branching

    1. Create a Feature Branch:
git branch feature/add-initial-script-logic
git checkout feature/add-initial-script-logic
  • (Or in one command: git checkout -b feature/add-initial-script-logic)*

    1. “Develop” a Script:
  • Append lines to scripts/01_data_preprocessing.py:

echo "# scripts/01_data_preprocessing.py" > scripts/01_data_preprocessing.py
echo "import pandas as pd" >> scripts/01_data_preprocessing.py
echo "" >> scripts/01_data_preprocessing.py
echo "def load_data(filepath):" >> scripts/01_data_preprocessing.py
echo "    print(f\"Loading data from {filepath}...\")" >> scripts/01_data_preprocessing.py
echo "    # df = pd.read_csv(filepath)" >> scripts/01_data_preprocessing.py
echo "    # print(\"Data loaded successfully.\")" >> scripts/01_data_preprocessing.py
echo "    # return df" >> scripts/01_data_preprocessing.py
echo "" >> scripts/01_data_preprocessing.py
echo "print(\"Data preprocessing script initialized.\")" >> scripts/01_data_preprocessing.py
  • (Self-check: cat scripts/01_data_preprocessing.py)

Part 3: Simulating a Workflow with Branching

    1. Simulate Data Processing:
cp data/raw/placeholder.txt data/processed/processed_placeholder.txt
    1. Commit Feature Development:
  • Stage the changes:
git add scripts/01_data_preprocessing.py data/processed/processed_placeholder.txt
  • Commit:
git commit -m "Feat: Add initial logic to preprocessing script and simulate output"

Part 3: Simulating a Workflow with Branching

    1. Merge Feature Branch:
  • Switch to main:
git checkout main
  • Merge:
git merge feature/add-initial-script-logic
  • (Optional) Delete the feature branch locally:
git branch -d feature/add-initial-script-logic

Part 3: Simulating a Workflow with Branching

    1. Push Final Changes:
git push origin main
  • (Self-check: Verify on GitHub.)

    1. Document Your Work:
  • Create project_log.md. It’s recommended you do this as you go.

touch project_log.md
  • Add all the shell and Git commands you used to this file using Markdown.

  • Stage, commit, and push project_log.md.

git add project_log.md
git commit -m "Docs: Add project log with commands"
git push origin main

End of Revision Lecture Activity

  • Congratulations! 🥳
  • You’ve set up a data science project with version control and connected it to GitHub 🐙
  • If you have any questions, please feel free to ask!
  • Happy coding! 💻

And that’s all for today! 🎉