DATASCI 350 - Data Science Computing

Lecture 25 - Course Revision

Danilo Freire

Department of Data and Decision Sciences
Emory University

Hello, everyone! 😊

Course recap

  • We covered a lot of content this semester! 🤓
  • Three key areas for data science:
    • Reliability: consistent results every time you run your code
    • Reproducibility: others (and your future self) can obtain the same results with your data and code
    • Robustness: analyses that handle data variation and scale to larger problems
  • Together, these support more credible and collaborative science 🤓

Core tools overview

  • Tools and techniques we covered:
    • Command Line (Bash): system control, automation, remote servers, containers
    • Git & GitHub: version control to track changes, revert mistakes, and collaborate
    • Quarto: literate programming, combining text, code, and results in one document
    • Cloud Computing (AWS): on-demand servers, storage, and services for scalability
    • SQL & Relational Databases: structured data storage and querying
    • AI Tools: programming assistants that generate, explain, and debug code
    • Dask: parallel and distributed computing beyond a single machine
    • Docker: containerisation, packaging apps and dependencies to run anywhere

A quick favour 🙏

Course evaluations are open!

  • Course evaluations are now open, and I would love to hear your thoughts on the course!
  • Just log on to Canvas and go to Account → Profile → Course Evaluations
  • Works on any device: laptop, tablet, or phone
  • It only takes a few minutes!
  • Thank you so much for taking the time. I really appreciate it! 😊

Computational literacy 💻

From human computers to the silicon age

  • The meaning of ‘computing’ has changed a lot over time
  • Originally, ‘computers’ were people doing calculations, often with tools like the abacus
  • Mechanical calculators followed (e.g., Leibniz’s machine for basic arithmetic)
  • The transistor, integrated circuits, and microprocessors of the silicon age gave us electronic computers
  • Most modern computers use the von Neumann architecture: program instructions and data share the same memory, accessed by a CPU

Representing data: binary, hex, and decimal

  • Computers store all information in binary (base 2): only 0s and 1s
  • These map to the on/off states of transistors
  • One binary digit = a bit; eight bits = a byte
  • Hexadecimal (base 16): digits 0-9 and A-F, a compact shorthand for binary. Each hex digit = four bits (e.g., FF = 11111111 = 255)
  • Hex is commonly used for colours (#FF0000) and memory addresses
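These base conversions are easy to check for yourself in Python using only built-ins:

```python
# Convert between decimal, binary, and hexadecimal with Python built-ins
n = 255

print(bin(n))              # '0b11111111': binary representation
print(hex(n))              # '0xff': hexadecimal representation
print(int("FF", 16))       # 255: parse a hex string back to decimal
print(int("11111111", 2))  # 255: parse a binary string back to decimal
```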

Conversion table

Encoding information: text and images

  • Computers encode information so they can process it
  • Abstraction: mapping complex data types to numbers
  • Digital images are grids of pixels
  • Each pixel’s colour uses the RGB model: Red, Green, Blue intensities (0-255)
  • Text encoding assigns a number to each character:
    • ASCII: early standard, enough for basic English but limited (7-8 bits)
    • Unicode (UTF-8): modern universal standard, supports nearly all characters (1-4 bytes)
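You can see both encodings at work in Python: ord() gives a character's code point, and encode() shows how many bytes UTF-8 uses for it:

```python
# ord() returns the code point; encode() returns the UTF-8 bytes
print(ord("A"))                   # 65, the same code in ASCII and Unicode
print("A".encode("utf-8"))        # b'A': one byte, identical to ASCII
print("é".encode("utf-8"))        # b'\xc3\xa9': two bytes in UTF-8
print(len("🤓".encode("utf-8")))  # 4: emoji need four bytes
```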

ASCII table

Programming languages: low-level vs high-level ⌨️

Languages bridge human instructions and machine execution:

  • Machine Code: raw binary the CPU runs directly
  • Assembly Language: low-level mnemonics for machine instructions. CPU-specific, fine control, poor portability
  • High-Level Languages (Python, R, Java): human-readable syntax, abstracts hardware. More portable, but needs translation
  • Translation Methods:
    • Compiled (e.g., C++): source translated to machine code before running; faster execution
    • Interpreted (e.g., Python): code run line-by-line at runtime by an interpreter; easier development
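The compiled/interpreted divide is blurrier than it sounds: CPython first compiles source code to bytecode, which its interpreter then executes. A quick sketch with the built-in compile() and the standard dis module:

```python
import dis

# CPython compiles source into a bytecode object before interpreting it
code = compile("x = 1 + 2", "<example>", "exec")

# The compiler even folds the constant expression 1 + 2 into 3
print(3 in code.co_consts)  # True

# dis shows the bytecode instructions the interpreter will execute
dis.dis(code)
```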

High and low-level languages

The command line 🖥️

Understanding the shell

  • The Operating System (OS) sits between hardware and software
  • Its core, the Kernel, manages hardware access and resources
  • Applications run in User Space, separate from the kernel for stability
  • A Shell (Bash, Zsh) is a command-line interpreter for interacting with the OS
  • You access the shell through a Terminal application

Core commands:

  • pwd: Print Working Directory (shows current location)
  • ls: List directory contents (ls -l for details, ls -a for hidden files)
  • cd <directory>: Change Directory (cd .. up, cd ~ home)
  • mkdir <name>: Make Directory
  • touch <name>: Create empty file / update timestamp
  • cp <source> <dest>: Copy file/directory (-r for directories)
  • mv <source> <dest>: Move or Rename file/directory
  • rm <file>: Remove file (permanently!)
  • rmdir <empty_dir>: Remove empty directory
  • rm -r <dir>: Remove directory recursively (DANGEROUS! No undo)

Working with text & finding files

Inspecting and searching text:

  • cat <file>: display entire file
  • head <file> / tail <file>: first/last lines (-n #)
  • wc <file>: count lines, words, characters
  • grep <pattern> <file>: search with regular expressions
    • -i (ignore case), -r (recursive), -n (line numbers), -v (invert)
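grep's patterns are regular expressions, the same idea Python exposes through its standard re module. A sketch of the equivalent of grep -i over a list of lines:

```python
import re

lines = ["Error: disk full", "All good", "error: retry failed"]

# Equivalent of `grep -i "error"`: case-insensitive pattern matching
matches = [line for line in lines if re.search("error", line, re.IGNORECASE)]
print(matches)  # ['Error: disk full', 'error: retry failed']
```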

Finding files:

  • find <path> [options]: search tool
    • -name <pattern>: by name (quote patterns with *)
    • -iname <pattern>: case-insensitive
    • -type d / -type f: directories / files
    • -size +1M: files larger than 1 MB
  • Wildcards: * (any characters), ? (single character)
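Shell wildcards follow the same matching rules as Python's standard fnmatch module, which is a handy way to test a pattern before handing it to find or rm:

```python
import fnmatch

files = ["report.pdf", "data.csv", "notes.txt", "data_backup.csv"]

# `*` matches any characters, `?` matches exactly one, just as in the shell
print(fnmatch.filter(files, "*.csv"))             # ['data.csv', 'data_backup.csv']
print(fnmatch.fnmatch("notes.txt", "notes.???"))  # True: ??? matches 'txt'
```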

Pipes, redirects & scripting

  • Redirection: control where output goes
    • >: write output to file (overwrites)
    • >>: append output to file
    • <: read input from file
  • Pipes |: chain commands; left output becomes right input (e.g., ls -l | grep "Jan")
  • Shell Scripting: save commands in a .sh file for automation
    • Start with #!/bin/bash (shebang)
    • Make executable: chmod +x script.sh
    • Run: ./script.sh

A simple script that moves .png files to the Desktop

Git & GitHub 💾

Version control basics

  • Problem: managing project changes manually is chaotic 😂
  • Solution: Version Control Systems (VCS) like Git track every change
  • Benefits: revert to previous states, understand history, collaborate, keep backups
  • Core Workflow:
    • Edit files in the Working Directory
    • Stage changes with git add <file> → Staging Area
    • Save a snapshot with git commit -m "message" → Repository
  • Key commands: git status, git add, git checkout, git commit, git diff, git log, git pull, git push

Source: Atlassian

GitHub: collaboration platform 🤝

  • GitHub: a central remote repository (origin) for sharing code

  • Key commands:

    • git commit: record changes locally
    • git push: upload local commits to the remote
    • git pull: download and merge remote changes
  • .gitignore: tells Git which files to skip

  • Other GitHub features:

    • Pull Requests (code review)
    • Wikis
    • GitHub Actions (automation)
    • GitHub Pages (web hosting)

Branching & merging 🌿

  • Version control also means managing different lines of development
  • Branches allow independent work without affecting main
  • Workflow: create a branch (git checkout -b feature-x), commit, then merge back (git checkout main, git merge feature-x)
  • Merge Conflicts: happen when branches change the same lines. Git asks you to resolve them manually
  • Pull Requests (PRs): GitHub’s workflow for proposing merges with code review and discussion

Read more: Git Branching

Quarto 📖

Literate programming & reproducibility 💡

  • Reproducible research: others can consistently reproduce your results
  • Literate programming combines code, results, and narrative in one place
  • Quarto: modern, open-source system built on Pandoc
  • Language-agnostic: supports Python, R, Julia, Observable
  • One source file (.qmd or .ipynb) → HTML, PDF, Word, slides, websites, books
  • More information: Quarto website

Quarto documents

Structure

A typical Quarto (.qmd) document includes:

  • YAML Header: between --- markers; sets metadata (title, author, date) and output format
  • Markdown Content: narrative text with headings, lists, links, maths ($...$, $$...$$)
  • Code Chunks: executable blocks (e.g., ```{python}) with #| options for execution, visibility, figures, etc.
  • Render with: quarto render mydoc.qmd

Advanced features

You can also use Quarto for:

  • Citations ([@citekey], needs .bib)
  • Cross-references (@fig-label)
  • Callouts, Tabsets
  • Interactive components
  • Websites, Books, Presentations

AI-assisted programming 🤖

LLMs & code generation

  • LLMs (Large Language Models) like GPT-4 and Claude: trained on vast text/code data to generate code, explain concepts, translate languages
  • AI-Assisted Programming: using LLMs as coding companions (GitHub Copilot, ChatGPT)
  • Benefits: faster development, less repetitive coding, help with debugging and learning
  • Risks: hallucinations (incorrect/insecure code), training data biases, IP concerns
  • Good prompt engineering matters: clear instructions, context, examples
  • Agents: LLMs that act autonomously (web browsing, API calls)

WebUI GitHub repository: https://github.com/browser-use/web-ui

Cloud computing ☁️

What is cloud computing? 🤔

  • IaaS (Infrastructure as a Service): virtual machines (EC2), networks. You manage the rest
  • PaaS (Platform as a Service): develop and run apps without managing servers or OS
  • SaaS (Software as a Service): complete applications delivered over the internet
  • FaaS (Function as a Service / Serverless): run code on events, no server management (AWS Lambda, Google Cloud Functions)
  • EC2 (Elastic Compute Cloud): scalable virtual servers (instances)
    • Pick an AMI (Amazon Machine Image) to launch
    • Instance types: different CPU, RAM, storage configs
    • Choose OS, instance type, storage (EBS). Connect via SSH
  • S3 (Simple Storage Service): scalable object storage in buckets
  • RDS (Relational Database Service): managed relational databases (PostgreSQL, MySQL, etc.)
  • SageMaker: build, train, and deploy ML models
  • Cost Management: set up billing alarms!

Interacting with EC2 instances

  • Launching: in the AWS Console, pick AMI, instance type, storage, security groups, and an SSH key pair (.pem file)
  • Connecting (SSH): use ssh with your .pem key (set chmod 400 key.pem first)
    • ssh -i key.pem ubuntu@<public-ip-or-dns>
  • Managing Software (Ubuntu): apt package manager
    • sudo apt update → refresh package lists
    • sudo apt upgrade → install updates
    • sudo apt install <package-name> → install software
  • Transferring Files (scp): copy files to/from EC2
    • scp -i key.pem local_file ubuntu@ip:/remote/path (local → remote)
    • scp -i key.pem ubuntu@ip:/remote/file ./local/path (remote → local)
  • Stopping/Terminating: Stop = pause (still pay for storage); Terminate = delete permanently. Manage costs!

Python and SQL for data science 🐍 🗄️

Relational databases & SQL

  • Relational Databases: structured tables with defined relationships (PostgreSQL, MySQL, SQLite)
  • SQL: standard language for querying and managing these databases
  • SQLite: file-based, serverless database engine
  • Keys:
    • Primary Key (PK): uniquely identifies each row
    • Foreign Key (FK): references a PK in another table, creating links

Core commands:

  • CREATE TABLE: Define table schema (columns, data types like INTEGER, TEXT, REAL)
  • INSERT INTO: Add new rows
  • SELECT: Retrieve data (SELECT cols FROM table WHERE condition ORDER BY col)
  • UPDATE: Modify existing rows (UPDATE table SET col = val WHERE condition)
  • DELETE FROM: Remove rows (DELETE FROM table WHERE condition)
  • ALTER TABLE: Modify table structure
  • DROP TABLE: Delete a table
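The core commands above can all be run from Python with the standard sqlite3 module. A minimal sketch using an in-memory database and a hypothetical students table:

```python
import sqlite3

# In-memory SQLite database: nothing is written to disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: define the schema, with a primary key
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade REAL)")

# INSERT INTO: add rows (use ? placeholders, never string formatting)
cur.executemany("INSERT INTO students (name, grade) VALUES (?, ?)",
                [("Ana", 91.5), ("Ben", 78.0), ("Carla", 85.0)])
conn.commit()

# SELECT with WHERE and ORDER BY
cur.execute("SELECT name FROM students WHERE grade > 80 ORDER BY grade DESC")
print(cur.fetchall())  # [('Ana',), ('Carla',)]

# UPDATE and DELETE FROM work the same way
cur.execute("UPDATE students SET grade = 80.5 WHERE name = 'Ben'")
cur.execute("DELETE FROM students WHERE grade < 80")
conn.close()
```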

Advanced SQL

Going beyond basic queries:

  • Filtering (WHERE): operators (=, >, LIKE, IN, BETWEEN, IS NULL) + logic (AND, OR)
  • Aggregation (GROUP BY): group rows, apply functions (COUNT, SUM, AVG); filter with HAVING
  • Joins: combine tables (INNER JOIN, LEFT JOIN) on related columns
  • Subqueries: nested SELECT statements
  • Window Functions: calculations across related rows (RANK() OVER (...), AVG(...) OVER (PARTITION BY ...)), keeping individual rows
  • String Functions: SUBSTR, LENGTH, REPLACE, || (concatenation)
  • Conditional Logic: CASE WHEN condition THEN result ... ELSE default END
  • Python’s sqlite3 module: connect, cursor, execute, commit, fetch, close
  • Pandas makes it simpler:
    • pd.read_sql(query, conn): query → DataFrame
    • df.to_sql('table', conn, ...): DataFrame → database table
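Putting the two pandas helpers together, here is a small sketch (with a hypothetical staff table) that writes a DataFrame to SQLite and reads an aggregation back:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# to_sql: write a DataFrame into a database table
df = pd.DataFrame({"dept": ["stats", "stats", "cs"],
                   "name": ["Ana", "Ben", "Carla"],
                   "salary": [70, 80, 90]})
df.to_sql("staff", conn, index=False)

# read_sql: run a query (here with GROUP BY) and get a DataFrame back
avg = pd.read_sql("SELECT dept, AVG(salary) AS avg_salary "
                  "FROM staff GROUP BY dept ORDER BY dept", conn)
print(avg)
conn.close()
```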

Parallel computing ⚡

Serial vs parallel

  • Serial: tasks run one after another; slow for large workloads
  • Parallel: tasks run concurrently on multiple cores/machines
  • Best for “embarrassingly parallel” problems (independent computations)
  • Python Libraries:
    • joblib: simple single-machine parallelism (Parallel, delayed)
    • dask: scalable parallel/distributed computing
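joblib's Parallel/delayed pattern has a close analogue in the standard library's concurrent.futures. Since joblib may not be installed everywhere, this sketch uses only the stdlib to run an embarrassingly parallel map:

```python
from concurrent.futures import ProcessPoolExecutor

def slow_square(x):
    # Stand-in for an expensive, independent computation
    return x * x

if __name__ == "__main__":
    # Each input is processed independently, so the work parallelises trivially
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(slow_square, range(5)))
    print(results)  # [0, 1, 4, 9, 16]
```

With joblib the same pattern reads Parallel(n_jobs=-1)(delayed(slow_square)(x) for x in range(5)).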

Dask:

  • Dask Array: parallel NumPy arrays
  • Dask DataFrame: parallel Pandas DataFrames
  • Dask Delayed: parallelise custom functions
  • Dask Distributed: cluster management (multi-process/multi-machine)

Containers & Docker 🐳

Dependency management & reproducibility

Virtual environments

  • Challenge: the “it works on my machine” problem, where code runs locally but fails elsewhere because of different dependencies
  • Solution: explicitly manage dependencies for reproducibility
  • Virtual Environments (venv, conda): isolate project packages, prevent conflicts
    • Define with requirements.txt (pip) or environment.yml (conda)
  • Limitation: virtual environments only isolate Python packages, not system libraries

Containers

  • Containers: package an app with all its dependencies (system libs, tools, code, runtime) in one portable unit
  • Runs the same way on any machine with Docker
  • Docker Concepts:
    • Image: immutable template built from a Dockerfile
    • Container: running instance of an image
    • Dockerfile: instructions (FROM, RUN, COPY, etc.) to build the image
    • Registry (e.g., Docker Hub): stores and shares images

Using Docker

Basic workflow:

  • Write Dockerfile: define base OS, packages, code, run command
  • Build Image: docker build -t myimage:latest .
  • Run Container: docker run -p 8888:8888 -v $(pwd):/app myimage:latest
    • -p: port mapping (host:container)
    • -v: volume mount (host:container) for persistent data
    • -it: interactive terminal
  • Share: docker push / docker pull (via Docker Hub or other registry)
FROM ubuntu:24.04

# Install Python and pip, then clean the apt caches to keep the image small
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Copy the project into the image and set the working directory
COPY . /app
WORKDIR /app

# Install Python dependencies
# (on Ubuntu 24.04, pip may require --break-system-packages outside a venv)
RUN pip3 install --break-system-packages -r requirements.txt

# Default command when the container starts
CMD ["python3", "your_script.py"]

Example Dockerfile

Conclusion 🎓

What we learnt

  • Computational literacy, reproducibility, and solid data science practices
  • Command line, version control (Git & GitHub), reproducible reports (Quarto)
  • Data manipulation with Python/Pandas and database queries with SQL
  • AI-assisted programming, parallel computing (Dask), and containers (Docker)

Questions? 🤔

Thank you very much! 🙏