QTM 350 - Data Science Computing

Lecture 25 - Course Revision

Danilo Freire

Emory University

Hello, everyone! 😊

Course recap

  • We have seen a lot of content in this course! 🤓
  • This course centred on three key areas crucial for effective data science:
    • Reliability focuses on ensuring your results are consistent every time you run your code
    • Reproducibility is about enabling others (and your future self) to obtain the same results using your data and code
    • Robustness deals with making your analyses resilient to variations in data and scalable to larger problems
  • These principles lead to more credible, collaborative, and durable scientific work 🤓

Core tools overview

  • We have explored a variety of tools and techniques to support these principles:
    • Command Line (Bash): System control, automation, and working with remote servers or containers
    • Git & GitHub: The standard for version control, allowing you to track changes, revert mistakes, and collaborate on code effectively
    • Quarto: Literate programming, creating documents that combine text, code, and results
    • Cloud Computing (AWS): On-demand computing resources (servers, storage, services) for scalability and flexibility
  • Pandas & SQL: Pandas for data manipulation and analysis, and SQLite for interacting with databases
  • AI Tools: Modern programming assistants that help generate, explain, and debug code
  • Dask: Parallel and distributed computing, allowing you to scale computations beyond a single machine’s limits
  • Docker: Containerisation, packaging applications and their dependencies to ensure they run consistently anywhere

Computational literacy 💻

From human computers to the silicon age

  • The concept of ‘computing’ evolved significantly over time
  • Initially, ‘computers’ were people performing calculations, often aided by tools like the abacus
  • Mechanical calculators emerged, like Leibniz’s machine capable of basic arithmetic
  • The invention of the transistor, integrated circuits, and microprocessors during the silicon age led to the electronic computers we use today
  • Most modern computers follow the Von Neumann architecture, which stores both program instructions and data in the same memory space, accessed via a central processing unit (CPU)

Representing data: Binary, hex, and decimal

  • At their core, computers represent all information using the binary system (base 2), consisting of only 0s and 1s
  • These correspond to the physical on/off states of transistors
  • A single binary digit is a bit
  • Eight bits form a byte, a common unit for storing data like characters
  • Hexadecimal (base 16) uses digits 0-9 and A-F as a shorthand for binary, where each hex digit represents four bits (e.g., FF is 11111111 in binary, or 255 in decimal)
  • Often used for representing colours (#FF0000) or memory addresses concisely
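
For reference, Python's built-in bin(), hex(), and int() functions convert between these bases (a quick sketch; the numbers are just examples):

# Decimal 255 in binary and hexadecimal
bin(255)             # '0b11111111'
hex(255)             # '0xff'

# Parse binary and hex strings back into decimal integers
int('11111111', 2)   # 255
int('FF', 16)        # 255

# A byte holds 2**8 = 256 distinct values (0-255)
2 ** 8               # 256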

Conversion table

Encoding information: Text and images

  • Computers need to encode information in a way they can process
  • Abstraction allows us to map complex data types to numerical representations
  • Digital images are grids of pixels. Each pixel’s colour is typically defined using the RGB model, specifying intensities (0-255) for Red, Green, and Blue light
  • Text is encoded by assigning a unique number to each character:
    • ASCII was an early standard, sufficient for basic English but limited (7 or 8 bits)
    • Unicode (UTF-8) is the modern, universal standard supporting virtually all characters and symbols, using variable byte lengths (1-4 bytes typically)
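
In Python, for instance, a pixel can be represented as an RGB tuple, and UTF-8’s variable width is easy to see (a small illustrative sketch; the characters are arbitrary examples):

# A pure red pixel in the RGB model: (Red, Green, Blue), each 0-255
pixel = (255, 0, 0)

# ASCII characters map to small integers...
ord('A')                      # 65
chr(65)                       # 'A'

# ...and UTF-8 uses a variable number of bytes per character
len('A'.encode('utf-8'))      # 1 byte
len('é'.encode('utf-8'))      # 2 bytes
len('😊'.encode('utf-8'))     # 4 bytes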

ASCII table

Programming languages: Low-level vs high-level ⌨️

Languages bridge the gap between human instructions and machine execution:

  • Machine Code: Raw binary instructions the CPU executes directly
  • Assembly Language: A low-level language using mnemonics for machine instructions; specific to a CPU architecture, offering fine control but poor portability
  • High-Level Languages (e.g., Python, R, Java): Use more human-readable syntax, abstracting hardware details. They are more portable but require translation (compilation or interpretation)
  • Translation Methods:
    • Compiled (e.g., C++): Entire source code is translated to machine code before running, often giving faster execution
    • Interpreted (e.g., Python): Code is executed line-by-line at runtime by an interpreter, often making development easier

High and low-level languages

The command line 🖥️

Understanding the shell

  • The Operating System (OS) acts as an intermediary between hardware and software
  • Its core is the Kernel, managing hardware access and resources
  • Applications run in the User Space, separate from the kernel for stability and security
  • A Shell (like Bash or Zsh) is a command-line interpreter program that allows you to interact with the OS by typing commands
  • You access the shell via a Terminal application, which handles input and output

Core commands:

  • pwd: Print Working Directory (shows current location)
  • ls: List directory contents (ls -l for details, ls -a for hidden files)
  • cd <directory>: Change Directory (cd .. up, cd ~ home)
  • mkdir <name>: Make Directory
  • touch <name>: Create empty file / update timestamp
  • cp <source> <dest>: Copy file/directory (-r for directories)
  • mv <source> <dest>: Move or Rename file/directory
  • rm <file>: Remove file (permanently!)
  • rmdir <empty_dir>: Remove empty directory
  • rm -r <dir>: Remove directory recursively (DANGEROUS! No undo)

Working with text & finding files

Commands for inspecting and searching text files:

  • cat <file>: Display entire file content
  • head <file> / tail <file>: Show first/last lines (-n #)
  • wc <file>: Word Count (lines, words, characters)
  • grep <pattern> <file>: Search for text patterns using regular expressions
    • Options: -i (ignore case), -r (recursive), -n (line numbers), -v (invert match)

Finding files and directories:

  • find <path> [options]: Powerful search tool
    • -name <pattern>: Find by name (use quotes for patterns with *)
    • -iname <pattern>: Case-insensitive search
    • -type d / -type f: Find directories / files
    • -size +1M: Find files larger than 1 Megabyte
  • Wildcards: * (any characters), ? (single character)

Pipes, redirects & scripting

  • Redirection: Control command input/output
    • >: Redirect output to file (overwrites)
    • >>: Append output to file
    • <: Redirect input from file (less common)
  • Pipes |: Chain commands; output of the left command becomes input for the right command (e.g., history | grep cd)
  • Shell Scripting: Write command sequences in a .sh file for automation
    • Start with #!/bin/bash (shebang)
    • Make executable: chmod +x script.sh
    • Run: ./script.sh

A simple script that moves .png files to the Desktop

Git & GitHub 💾

Version control fundamentals

  • Problem: Managing project evolution manually is chaotic 😂
  • Solution: Version Control Systems (VCS) like Git, where every change is tracked
  • Benefits: Allows reverting to previous states, understanding project history, collaborating effectively, and providing backups
  • Core Workflow:
    • Modify files in the Working Directory
    • Select changes for the next snapshot using git add <file> (moves changes to the Staging Area)
    • Record the snapshot into the project history using git commit -m "message" (creates a commit in the local Repository)
  • Key commands: git status, git add, git commit, git log, git diff

Source: Atlassian

GitHub: Collaboration platform 🤝

  • GitHub provides a central location (remote repository, often called origin) for sharing code and tracking project progress

  • Interactions:

    • git push: Uploads local commits to the remote repository
    • git commit: Records changes in the local repository
    • git pull: Downloads remote changes and merges them locally
  • .gitignore tells Git which files/directories should not be tracked

  • Other GitHub features:

    • Code review via Pull Requests
    • Wikis
    • GitHub Actions (automation)
    • GitHub Pages (web hosting)

Branching & merging 🌿

  • Version control is not just about tracking changes; it’s also about managing different lines of development
  • Branches are a core Git concept allowing independent lines of development
  • Work on new features or bug fixes in isolation without affecting the main branch (main or master)
  • Workflow: Create a branch (git checkout -b feature-x), make commits, then merge back into main (git checkout main, git merge feature-x)
  • Merge Conflicts: Occur when changes on different branches affect the same lines; Git requires manual resolution before the merge can complete
  • Pull Requests (PRs): The standard GitHub workflow for proposing merges, enabling code review and discussion before integrating changes

Read more: Git Branching

Quarto 📖

Literate programming & reproducibility 💡

  • Reproducible research is the practice of ensuring that results can be consistently reproduced by others
  • Literate programming solves this by combining code, results, and narrative
  • Quarto is a modern, open-source system based on Pandoc that excels at this
  • It’s language-agnostic, supporting Python, R, Julia, and Observable code execution within documents
  • Can render a single source file (.qmd or .ipynb) into multiple formats like HTML, PDF (via LaTeX), MS Word, presentations, websites, and books
  • More information: Quarto website

Quarto documents

Structure

A typical Quarto (.qmd) document includes:

  • YAML Header: Delimited by ---, defines document metadata (title, author, date) and output format options (format: html, format: pdf, format: revealjs)
  • Markdown Content: Narrative text using Markdown for formatting (headings, lists, links, emphasis, math $...$, $$...$$)
  • Code Chunks: Executable code blocks (e.g., ```{python}) with options (prefixed with #|) controlling execution (eval), visibility (echo), figure generation (fig-cap), etc.
  • Rendering command: quarto render mydoc.qmd
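
Putting these pieces together, a minimal .qmd file might look like this (illustrative only; the title, author, and pandas chunk contents are invented):

---
title: "My Analysis"
author: "Jane Doe"
date: today
format: html
---

## Results

Some narrative text with inline math, $y = \beta_0 + \beta_1 x$.

```{python}
#| echo: true
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
df.describe()
```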

Advanced features

You can also use Quarto for:

  • Citations ([@citekey], needs .bib)
  • Cross-references (@fig-label)
  • Callouts, Tabsets
  • Interactive components
  • Websites, Books, Presentations

AI-assisted programming 🤖

LLMs & code generation

  • LLMs (Large Language Models) like GPT-4 and Claude are trained on vast text/code datasets, enabling them to generate code, explain concepts, and translate languages
  • AI-Assisted Programming leverages these models as coding companions (e.g., GitHub Copilot, ChatGPT)
  • Benefits: Can accelerate development, reduce repetitive coding, aid debugging, and facilitate learning
  • Risks: Models can hallucinate (generate incorrect/insecure code), reflect biases from training data, and raise IP concerns
  • Effective use requires good prompt engineering (clear instructions, context, examples)
  • Agents are LLMs that can perform tasks autonomously, like web browsing or API calls

WebUI GitHub repository: https://github.com/browser-use/web-ui

Cloud computing ☁️

What is cloud computing? 🤔

  • IaaS (Infrastructure as a Service): Building blocks like virtual machines (EC2 in AWS), storage, and networks. You manage everything else
  • PaaS (Platform as a Service): Platform for developing, running, and managing applications without managing the underlying infrastructure (OS, servers)
  • SaaS (Software as a Service): Delivers complete software applications over the internet
  • FaaS (Function as a Service / Serverless): Run code in response to events without managing any servers. Examples: AWS Lambda, Google Cloud Functions
  • EC2 (Elastic Compute Cloud): Scalable virtual servers (instances) in the cloud. You choose the OS and configuration
    • Choose an AMI (Amazon Machine Image, an OS image) to launch an instance
    • Instance types: Different configurations of CPU, RAM, storage
    • Security groups: Virtual firewalls to control inbound/outbound traffic
    • Choose OS (Linux/Windows), instance type (CPU/RAM), and storage (EBS); connect via SSH
  • S3 (Simple Storage Service): Durable and scalable object storage. Store files in buckets
  • RDS (Relational Database Service): Service for relational databases (PostgreSQL, MySQL, etc)
  • SageMaker: Platform for building, training, and deploying Machine Learning models
  • Cost Management: Remember to set up billing alarms!

Interacting with EC2 instances

  • Launching: Use the AWS Management Console to select an AMI (OS image), instance type, configure storage, security groups (firewall rules), and create/select an SSH key pair (.pem file)
  • Connecting (SSH): Use the ssh command from your local terminal with your private key file (.pem). Requires correct permissions (chmod 400 key.pem)
    • ssh -i key.pem ubuntu@<public-ip-or-dns>
  • Managing Software (Ubuntu): Use apt package manager
    • sudo apt update: Refresh package lists
    • sudo apt upgrade: Install updates
    • sudo apt install <package-name>: Install software
  • Transferring Files (scp): Securely copy files between local machine and EC2 instance
    • scp -i key.pem local_file ubuntu@ip:/remote/path (local to remote)
    • scp -i key.pem ubuntu@ip:/remote/file ./local/path (remote to local)
  • Stopping/Terminating: Stop instances via the console to pause (you still pay for storage); terminate to delete permanently (data lost). Manage costs carefully!

Python and SQL for data science 🐍 🗄️

Data wrangling with Pandas

  • Pandas is the workhorse library in Python for data manipulation, built around the DataFrame (2D table) and Series (1D array)
  • Emphasises working with tidy data (variables as columns, observations as rows) for easier analysis
  • Reshaping:
    • melt(): Converts wide data to long format
    • pivot()/pivot_table(): Converts long data to wide format; pivot_table handles aggregation if needed
  • Combining:
    • pd.concat(): Stacks DataFrames row/column-wise
    • pd.merge(): Performs database-style joins (inner, left, right, outer)
  • Grouping: df.groupby('column') splits data for group-wise operations
  • Aggregation: Apply summary functions (mean, sum, size, count, std, min, max, custom functions) to groups, often using .agg() for flexibility
  • Applying Functions: Use .apply() (row/column-wise), .applymap() (element-wise DF), .map() (element-wise Series)
  • Missing Data (NaN): Pandas represents missing values as NaN
    • Detect: isnull(), notnull(), info()
    • Remove: dropna()
    • Impute: fillna() (with value, mean, median, method='ffill', etc)
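
A compact sketch tying these operations together (the DataFrame, column names, and values are invented for illustration):

import pandas as pd
import numpy as np

# A small wide-format table: one row per student, one column per exam
wide = pd.DataFrame({
    "student": ["Ana", "Ben", "Cai"],
    "exam1": [90, 85, np.nan],
    "exam2": [88, 92, 75],
})

# Reshape wide -> long with melt()
long = wide.melt(id_vars="student", var_name="exam", value_name="score")

# Handle missing scores: impute NaN with the column mean
long["score"] = long["score"].fillna(long["score"].mean())

# Group-wise aggregation with groupby() + agg()
summary = long.groupby("exam").agg(mean_score=("score", "mean"),
                                   n=("score", "size"))

# Database-style join with merge()
majors = pd.DataFrame({"student": ["Ana", "Ben", "Cai"],
                       "major": ["QSS", "Econ", "Bio"]})
merged = long.merge(majors, on="student", how="left")

print(summary)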

Relational databases & SQL

  • Relational Databases store data in structured tables with defined relationships, ensuring data integrity (e.g., PostgreSQL, MySQL, SQLite)
  • SQL (Structured Query Language) is the standard language for querying and managing these databases
  • SQLite is a simple, file-based, serverless database engine, great for many applications
  • Keys:
    • Primary Key (PK): Uniquely identifies each row in a table
    • Foreign Key (FK): Column(s) referencing a PK in another table, establishing links

Core commands:

  • CREATE TABLE: Define table schema (columns, data types like INTEGER, TEXT, REAL)
  • INSERT INTO: Add new rows
  • SELECT: Retrieve data (SELECT cols FROM table WHERE condition ORDER BY col)
  • UPDATE: Modify existing rows (UPDATE table SET col = val WHERE condition)
  • DELETE FROM: Remove rows (DELETE FROM table WHERE condition)
  • ALTER TABLE: Modify table structure
  • DROP TABLE: Delete a table
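
These statements can also be issued from Python via the sqlite3 module covered below; a minimal sketch (the table, columns, and values are invented):

import sqlite3

# An in-memory database, so the example leaves no file behind
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: define the schema
cur.execute("""
    CREATE TABLE courses (
        id      INTEGER PRIMARY KEY,
        title   TEXT,
        credits REAL
    )
""")

# INSERT INTO: add rows (parameter placeholders avoid SQL injection)
cur.executemany("INSERT INTO courses (title, credits) VALUES (?, ?)",
                [("Data Science Computing", 3.0), ("Regression", 4.0)])
conn.commit()

# SELECT ... WHERE ... ORDER BY: retrieve data
cur.execute("SELECT title, credits FROM courses WHERE credits >= 3 ORDER BY title")
print(cur.fetchall())

# UPDATE and DELETE work the same way through execute()
cur.execute("UPDATE courses SET credits = 3.5 WHERE title = 'Regression'")
cur.execute("DELETE FROM courses WHERE credits < 3")
conn.commit()
conn.close()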

Advanced SQL

Going beyond basic queries:

  • Filtering (WHERE): Use operators (=, >, LIKE, IN, BETWEEN, IS NULL) and logic (AND, OR)
  • Aggregation (GROUP BY): Group rows and apply functions (COUNT, SUM, AVG, etc); filter groups with HAVING
  • Joins: Combine tables (INNER JOIN, LEFT JOIN) based on related columns
  • Subqueries: Nest SELECT statements
  • Window Functions: Calculations across related rows (RANK() OVER (...), AVG(...) OVER (PARTITION BY ...)), preserving individual rows
  • String Functions: Manipulate text (SUBSTR, LENGTH, REPLACE, || for concatenation in SQLite)
  • Conditional Logic: CASE WHEN condition THEN result ... ELSE default END
  • Python’s built-in sqlite3 module provides direct access to SQLite databases (connect, cursor, execute, commit, fetch, close)
  • Pandas simplifies interaction:
    • pd.read_sql(query, conn): Query database and load results into a DataFrame
    • df.to_sql('table', conn, ...): Write DataFrame to a database table
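
For instance (a sketch; the database file, table, and columns are hypothetical), a grouped query with a CASE expression can be loaded straight into a DataFrame:

import sqlite3
import pandas as pd

conn = sqlite3.connect("school.db")   # hypothetical database file

query = """
    SELECT major,
           COUNT(*) AS n_students,
           AVG(gpa) AS avg_gpa,
           CASE WHEN AVG(gpa) >= 3.5 THEN 'high' ELSE 'standard' END AS tier
    FROM students
    GROUP BY major
    HAVING COUNT(*) > 5
    ORDER BY avg_gpa DESC
"""

# Run the query and load the result into a DataFrame
df = pd.read_sql(query, conn)

# Write a DataFrame back to the database as a new table
df.to_sql("major_summary", conn, if_exists="replace", index=False)
conn.close()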

Parallel computing ⚡

Serial vs parallel

  • Serial: Tasks run one after another; slow for large workloads
  • Parallel: Tasks run concurrently on multiple cores/machines; speeds up suitable problems
  • Best suited for “embarrassingly parallel” tasks (independent computations)
  • Python Libraries:
    • joblib: Simple single-machine parallelism (Parallel, delayed)
    • dask: Scalable library for parallel/distributed computing
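
A minimal joblib sketch (the function and inputs are invented stand-ins for an expensive, independent computation):

from joblib import Parallel, delayed

def slow_square(x):
    # Stand-in for an expensive, independent computation
    return x ** 2

# Run the calls across all available CPU cores
results = Parallel(n_jobs=-1)(delayed(slow_square)(i) for i in range(10))
print(results)   # [0, 1, 4, 9, ...]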

Dask:

  • Dask Array: Parallel NumPy arrays
  • Dask DataFrame: Parallel Pandas DataFrames
  • Dask Delayed: Parallelise custom Python functions
  • Dask Distributed: Cluster management for multi-process/multi-machine execution
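
A short sketch of these interfaces (illustrative; the array sizes and functions are arbitrary):

import dask.array as da
from dask import delayed

# Dask Array: a large array split into chunks, computed lazily in parallel
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())          # .compute() triggers the computation

# Dask Delayed: build a task graph from ordinary Python functions
@delayed
def square(n):
    return n ** 2

@delayed
def total(values):
    return sum(values)

graph = total([square(i) for i in range(5)])
print(graph.compute())             # 30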

Containers & Docker 🐳

Dependency management & reproducibility

Virtual environments

  • Challenge: Code failing on different machines due to varying software environments (dependencies) - the “it works on my machine” problem
  • Solution: Explicitly manage dependencies for reproducibility
  • Virtual Environments (Python venv, conda): Isolate project package dependencies, preventing conflicts between projects
    • Use requirements.txt (pip) or environment.yml (conda) to define and recreate these environments easily
  • Limitations: Virtual environments only isolate Python packages, not system libraries or other dependencies

Containers

  • Containers offer superior isolation by packaging an application with all its dependencies (system libraries, tools, code, runtime) into a single, portable unit
  • Ensures consistent execution across any machine running Docker
  • Docker Concepts:
    • Image: Immutable template built from a Dockerfile
    • Container: Running instance of an image
    • Dockerfile: Text file with instructions (FROM, RUN, COPY, etc) to build the image
    • Registry (e.g., Docker Hub): Stores and shares images

Using Docker

Basic workflow:

  • Write Dockerfile: Define environment setup (base OS, package installs, code copying, run command)
  • Build Image: docker build -t myimage:latest . (creates the image locally)
  • Run Container: docker run -p 8888:8888 -v $(pwd):/app myimage:latest (starts the container)
    • -p: Port mapping (host:container)
    • -v: Volume mounting (host:container) for data persistence/code access
    • -it: Interactive terminal
  • Share: docker push, docker pull (via a registry like Docker Hub)
FROM ubuntu:24.04

# Install Python and pip, then clean up to keep the image small
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Sanity check that the interpreter and pip are available
RUN python3 --version && pip3 --version

# Copy the project into the image and install its dependencies
# (--break-system-packages is needed because Ubuntu 24.04 marks the
# system Python as externally managed)
COPY . /app
WORKDIR /app
RUN pip3 install --break-system-packages -r requirements.txt

# Default command run when the container starts (only the final CMD applies)
CMD ["python3", "your_script.py"]

Example Dockerfile

Conclusion 🎓

What we learned

A summary of the key skills and concepts from QTM350:

  • Foundations: The importance of computational literacy, reproducibility, and robust practices in data science
  • Core Workflow: Effective use of the command line, version control with Git & GitHub, and reproducible reporting with Quarto
  • Data Handling: Manipulating data with Python/Pandas and querying databases with SQL
  • Modern Techniques: Enhancing productivity with AI-assisted programming, scaling analyses using Parallel Computing (Dask), and ensuring reliable deployment via Containerisation (Docker)

Questions? 🤔

Thank you very much! 🙏