QTM 350 - Data Science Computing

Lecture 25 - Course Revision

Danilo Freire

Emory University

Hello, everyone! 😊

Course recap

  • We have seen a lot of content in this course! 🤓
  • This course centred on three key areas crucial for effective data science:
    • Reliability focuses on ensuring your results are consistent every time you run your code
    • Reproducibility is about enabling others (and your future self) to obtain the same results using your data and code
    • Robustness deals with making your analyses resilient to variations in data and scalable to larger problems
  • These principles lead to more credible, collaborative, and durable scientific work 🤓

Core tools overview

  • We have explored a variety of tools and techniques to support these principles:
    • Command Line (Bash): System control, automation, and working with remote servers or containers
    • Git & GitHub: The standard for version control, allowing you to track changes, revert mistakes, and collaborate on code effectively
    • Quarto: Literate programming, creating documents that combine text, code, and results
    • Cloud Computing (AWS): On-demand computing resources (servers, storage, services) for scalability and flexibility
  • Pandas & SQL: Pandas for data manipulation and analysis, and SQLite for interacting with databases
  • AI Tools: Modern programming assistants that help generate, explain, and debug code
  • Dask: Parallel and distributed computing, allowing you to scale computations beyond a single machine’s limits
  • Docker: Containerisation, packaging applications and their dependencies to ensure they run consistently anywhere

Computational literacy 💻

From human computers to the silicon age

  • The concept of ‘computing’ evolved significantly over time
  • Initially, ‘computers’ were people performing calculations, often aided by tools like the abacus
  • Mechanical calculators emerged, like Leibniz’s machine capable of basic arithmetic
  • The invention of the transistor, integrated circuits, and microprocessors during the silicon age led to the electronic computers we use today
  • Most modern computers follow the Von Neumann architecture, which stores both program instructions and data in the same memory space, accessed via a central processing unit (CPU)

Representing data: Binary, hex, and decimal

  • At their core, computers represent all information using the binary system (base 2), consisting of only 0s and 1s
  • These correspond to the physical on/off states of transistors
  • A single binary digit is a bit
  • Eight bits form a byte, a common unit for storing data like characters
  • Hexadecimal (base 16) uses digits 0-9 and A-F as a shorthand for binary, where each hex digit represents four bits (e.g., FF is 11111111 in binary, or 255 in decimal)
  • Often used for representing colours (#FF0000) or memory addresses concisely
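
For reference, Python's built-in bin(), hex(), and int() functions convert between these bases (a quick sketch; the numbers are just examples):

# Decimal 255 in binary and hexadecimal
bin(255)             # '0b11111111'
hex(255)             # '0xff'

# Parse binary and hex strings back into decimal integers
int('11111111', 2)   # 255
int('FF', 16)        # 255

# A byte holds 2**8 = 256 distinct values (0-255)
2 ** 8               # 256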

Conversion table

Encoding information: Text and images

  • Computers need to encode information in a way they can process
  • Abstraction allows us to map complex data types to numerical representations
  • Digital images are grids of pixels. Each pixel’s colour is typically defined using the RGB model, specifying intensities (0-255) for Red, Green, and Blue light
  • Text is encoded by assigning a unique number to each character:
    • ASCII was an early standard, sufficient for basic English but limited (7 or 8 bits)
    • Unicode (UTF-8) is the modern, universal standard supporting virtually all characters and symbols, using variable byte lengths (1-4 bytes typically)
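
In Python, for instance, a pixel can be represented as an RGB tuple, and UTF-8’s variable width is easy to see (a small illustrative sketch; the characters are arbitrary examples):

# A pure red pixel in the RGB model: (Red, Green, Blue), each 0-255
pixel = (255, 0, 0)

# ASCII characters map to small integers...
ord('A')                      # 65
chr(65)                       # 'A'

# ...and UTF-8 uses a variable number of bytes per character
len('A'.encode('utf-8'))      # 1 byte
len('é'.encode('utf-8'))      # 2 bytes
len('😊'.encode('utf-8'))     # 4 bytes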

ASCII table

Programming languages: Low-level vs high-level ⌨️

Languages bridge the gap between human instructions and machine execution:

  • Machine Code: Raw binary instructions the CPU executes directly
  • Assembly Language: A low-level language using mnemonics for machine instructions; specific to a CPU architecture, offering fine control but poor portability
  • High-Level Languages (e.g., Python, R, Java): Use more human-readable syntax, abstracting hardware details. They are more portable but require translation (compilation or interpretation)
  • Translation Methods:
    • Compiled (e.g., C++): Entire source code is translated to machine code before running, often giving faster execution
    • Interpreted (e.g., Python): Code is executed line-by-line at runtime by an interpreter, often making development easier

High and low-level languages

The command line 🖥️

Understanding the shell

  • The Operating System (OS) acts as an intermediary between hardware and software
  • Its core is the Kernel, managing hardware access and resources
  • Applications run in the User Space, separate from the kernel for stability and security
  • A Shell (like Bash or Zsh) is a command-line interpreter program that allows you to interact with the OS by typing commands
  • You access the shell via a Terminal application, which handles input and output

Core commands:

  • pwd: Print Working Directory (shows current location)
  • ls: List directory contents (ls -l for details, ls -a for hidden files)
  • cd <directory>: Change Directory (cd .. up, cd ~ home)
  • mkdir <name>: Make Directory
  • touch <name>: Create empty file / update timestamp
  • cp <source> <dest>: Copy file/directory (-r for directories)
  • mv <source> <dest>: Move or Rename file/directory
  • rm <file>: Remove file (permanently!)
  • rmdir <empty_dir>: Remove empty directory
  • rm -r <dir>: Remove directory recursively (DANGEROUS! No undo)

Working with text & finding files

Commands for inspecting and searching text files:

  • cat <file>: Display entire file content
  • head <file> / tail <file>: Show first/last lines (-n #)
  • wc <file>: Word Count (lines, words, characters)
  • grep <pattern> <file>: Search for text patterns using regular expressions
    • Options: -i (ignore case), -r (recursive), -n (line numbers), -v (invert match)

Finding files and directories:

  • find <path> [options]: Powerful search tool
    • -name <pattern>: Find by name (use quotes for patterns with *)
    • -iname <pattern>: Case-insensitive search
    • -type d / -type f: Find directories / files
    • -size +1M: Find files larger than 1 Megabyte
  • Wildcards: * (any characters), ? (single character)

Pipes, redirects & scripting

  • Redirection: Control command input/output
    • >: Redirect output to file (overwrites)
    • >>: Append output to file
    • <: Redirect input from file (less common)
  • Pipes |: Chain commands; output of the left command becomes input for the right command (e.g., history | grep cd)
  • Shell Scripting: Write command sequences in a .sh file for automation
    • Start with #!/bin/bash (shebang)
    • Make executable: chmod +x script.sh
    • Run: ./script.sh

A simple script that moves .png files to the Desktop

Git & GitHub 💾

Version control fundamentals

  • Problem: Managing project evolution manually is chaotic 😂
  • Solution: Version Control Systems (VCS) like Git, where every change is tracked
  • Benefits: Allows reverting to previous states, understanding project history, collaborating effectively, and providing backups
  • Core Workflow:
    • Modify files in the Working Directory
    • Select changes for the next snapshot using git add <file> (moves changes to the Staging Area)
    • Record the snapshot into the project history using git commit -m "message" (creates a commit in the local Repository)
  • Key commands: git status, git add, git commit, git log, git diff

Source: Atlassian

GitHub: Collaboration platform 🤝

  • GitHub provides a central location (remote repository, often called origin) for sharing code and tracking project progress

  • Interactions:

    • git push: Uploads local commits to the remote repository
    • git commit: Records changes in the local repository
    • git pull: Downloads remote changes and merges them locally
  • .gitignore tells Git which files/directories should not be tracked

  • Other GitHub features:

    • Code review via Pull Requests
    • Wikis
    • GitHub Actions (automation)
    • GitHub Pages (web hosting)

Branching & merging 🌿

  • Version control is not just about tracking changes; it’s also about managing different lines of development
  • Branches are a core Git concept allowing independent lines of development
  • Work on new features or bug fixes in isolation without affecting the main branch (main or master)
  • Workflow: Create a branch (git checkout -b feature-x), make commits, then merge back into main (git checkout main, git merge feature-x)
  • Merge Conflicts: Occur when changes on different branches affect the same lines; Git requires manual resolution before the merge can complete
  • Pull Requests (PRs): The standard GitHub workflow for proposing merges, enabling code review and discussion before integrating changes

Read more: Git Branching

Quarto 📖

Literate programming & reproducibility 💡

  • Reproducible research is the practice of ensuring that results can be consistently reproduced by others
  • Literate programming solves this by combining code, results, and narrative
  • Quarto is a modern, open-source system based on Pandoc that excels at this
  • It’s language-agnostic, supporting Python, R, Julia, and Observable code execution within documents
  • Can render a single source file (.qmd or .ipynb) into multiple formats like HTML, PDF (via LaTeX), MS Word, presentations, websites, and books
  • More information: Quarto website

Quarto documents

Structure

A typical Quarto (.qmd) document includes:

  • YAML Header: Delimited by ---, defines document metadata (title, author, date) and output format options (format: html, format: pdf, format: revealjs)
  • Markdown Content: Narrative text using Markdown for formatting (headings, lists, links, emphasis, math $...$, $$...$$)
  • Code Chunks: Executable code blocks (e.g., ```{python}) with options (prefixed with #|) controlling execution (eval), visibility (echo), figure generation (fig-cap), etc.
  • Rendering command: quarto render mydoc.qmd
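
Putting these pieces together, a minimal .qmd file might look like this (illustrative only; the title, author, and pandas chunk contents are invented):

---
title: "My Analysis"
author: "Jane Doe"
date: today
format: html
---

## Results

Some narrative text with inline math, $y = \beta_0 + \beta_1 x$.

```{python}
#| echo: true
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
df.describe()
```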

Advanced features

You can also use Quarto for:

  • Citations ([@citekey], needs .bib)
  • Cross-references (@fig-label)
  • Callouts, Tabsets
  • Interactive components
  • Websites, Books, Presentations

AI-assisted programming 🤖

LLMs & code generation

  • LLMs (Large Language Models) like GPT-4 and Claude are trained on vast text/code datasets, enabling them to generate code, explain concepts, and translate languages
  • AI-Assisted Programming leverages these models as coding companions (e.g., GitHub Copilot, ChatGPT)
  • Benefits: Can accelerate development, reduce repetitive coding, aid debugging, and facilitate learning
  • Risks: Models can hallucinate (generate incorrect/insecure code), reflect biases from training data, and raise IP concerns
  • Effective use requires good prompt engineering (clear instructions, context, examples)
  • Agents are LLMs that can perform tasks autonomously, like web browsing or API calls

WebUI GitHub repository: https://github.com/browser-use/web-ui

Cloud computing ☁️

What is cloud computing? 🤔

  • IaaS (Infrastructure as a Service): Building blocks like virtual machines (EC2 in AWS), storage, and networks. You manage everything else
  • PaaS (Platform as a Service): Platform for developing, running, and managing applications without managing the underlying infrastructure (OS, servers)
  • SaaS (Software as a Service): Delivers complete software applications over the internet
  • FaaS (Function as a Service / Serverless): Run code in response to events without managing any servers. Examples: AWS Lambda, Google Cloud Functions
  • EC2 (Elastic Compute Cloud): Scalable virtual servers (instances) in the cloud. You choose the OS and configuration
    • Choose an AMI (Amazon Machine Image, an OS image) to launch an instance
    • Instance types: Different configurations of CPU, RAM, storage
    • Security groups: Virtual firewalls to control inbound/outbound traffic
    • Choose OS (Linux/Windows), instance type (CPU/RAM), and storage (EBS); connect via SSH
  • S3 (Simple Storage Service): Durable and scalable object storage. Store files in buckets
  • RDS (Relational Database Service): Service for relational databases (PostgreSQL, MySQL, etc)
  • SageMaker: Platform for building, training, and deploying Machine Learning models
  • Cost Management: Remember to set up billing alarms!

Interacting with EC2 instances

  • Launching: Use the AWS Management Console to select an AMI (OS image), instance type, configure storage, security groups (firewall rules), and create/select an SSH key pair (.pem file)
  • Connecting (SSH): Use the ssh command from your local terminal with your private key file (.pem). Requires correct permissions (chmod 400 key.pem)
    • ssh -i key.pem ubuntu@<public-ip-or-dns>
  • Managing Software (Ubuntu): Use apt package manager
    • sudo apt update: Refresh package lists
    • sudo apt upgrade: Install updates
    • sudo apt install <package-name>: Install software
  • Transferring Files (scp): Securely copy files between local machine and EC2 instance
    • scp -i key.pem local_file ubuntu@ip:/remote/path (local to remote)
    • scp -i key.pem ubuntu@ip:/remote/file ./local/path (remote to local)
  • Stopping/Terminating: Stop instances via the console to pause (you still pay for storage); terminate to delete permanently (data lost). Manage costs carefully!

Python and SQL for data science 🐍 🗄️

Data wrangling with Pandas

  • Pandas is the workhorse library in Python for data manipulation, built around the DataFrame (2D table) and Series (1D array)
  • Emphasises working with tidy data (variables as columns, observations as rows) for easier analysis
  • Reshaping:
    • melt(): Converts wide data to long format
    • pivot()/pivot_table(): Converts long data to wide format; pivot_table handles aggregation if needed
  • Combining:
    • pd.concat(): Stacks DataFrames row/column-wise
    • pd.merge(): Performs database-style joins (inner, left, right, outer)
  • Grouping: df.groupby('column') splits data for group-wise operations
  • Aggregation: Apply summary functions (mean, sum, size, count, std, min, max, custom functions) to groups, often using .agg() for flexibility
  • Applying Functions: Use .apply() (row/column-wise), .applymap() (element-wise DF), .map() (element-wise Series)
  • Missing Data (NaN): Pandas represents missing values as NaN
    • Detect: isnull(), notnull(), info()
    • Remove: dropna()
    • Impute: fillna() (with value, mean, median, method='ffill', etc)
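
A compact sketch tying these operations together (the DataFrame, column names, and values are invented for illustration):

import pandas as pd
import numpy as np

# A small wide-format table: one row per student, one column per exam
wide = pd.DataFrame({
    "student": ["Ana", "Ben", "Cai"],
    "exam1": [90, 85, np.nan],
    "exam2": [88, 92, 75],
})

# Reshape wide -> long with melt()
long = wide.melt(id_vars="student", var_name="exam", value_name="score")

# Handle missing scores: impute NaN with the column mean
long["score"] = long["score"].fillna(long["score"].mean())

# Group-wise aggregation with groupby() + agg()
summary = long.groupby("exam").agg(mean_score=("score", "mean"),
                                   n=("score", "size"))

# Database-style join with merge()
majors = pd.DataFrame({"student": ["Ana", "Ben", "Cai"],
                       "major": ["QSS", "Econ", "Bio"]})
merged = long.merge(majors, on="student", how="left")

print(summary)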

Relational databases & SQL

  • Relational Databases store data in structured tables with defined relationships, ensuring data integrity (e.g., PostgreSQL, MySQL, SQLite)
  • SQL (Structured Query Language) is the standard language for querying and managing these databases
  • SQLite is a simple, file-based, serverless database engine, great for many applications
  • Keys:
    • Primary Key (PK): Uniquely identifies each row in a table
    • Foreign Key (FK): Column(s) referencing a PK in another table, establishing links

Core commands:

  • CREATE TABLE: Define table schema (columns, data types like INTEGER, TEXT, REAL)
  • INSERT INTO: Add new rows
  • SELECT: Retrieve data (SELECT cols FROM table WHERE condition ORDER BY col)
  • UPDATE: Modify existing rows (UPDATE table SET col = val WHERE condition)
  • DELETE FROM: Remove rows (DELETE FROM table WHERE condition)
  • ALTER TABLE: Modify table structure
  • DROP TABLE: Delete a table
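
These statements can also be issued from Python via the sqlite3 module covered below; a minimal sketch (the table, columns, and values are invented):

import sqlite3

# An in-memory database, so the example leaves no file behind
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: define the schema
cur.execute("""
    CREATE TABLE courses (
        id      INTEGER PRIMARY KEY,
        title   TEXT,
        credits REAL
    )
""")

# INSERT INTO: add rows (parameter placeholders avoid SQL injection)
cur.executemany("INSERT INTO courses (title, credits) VALUES (?, ?)",
                [("Data Science Computing", 3.0), ("Regression", 4.0)])
conn.commit()

# SELECT ... WHERE ... ORDER BY: retrieve data
cur.execute("SELECT title, credits FROM courses WHERE credits >= 3 ORDER BY title")
print(cur.fetchall())

# UPDATE and DELETE work the same way through execute()
cur.execute("UPDATE courses SET credits = 3.5 WHERE title = 'Regression'")
cur.execute("DELETE FROM courses WHERE credits < 3")
conn.commit()
conn.close()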

Advanced SQL

Going beyond basic queries:

  • Filtering (WHERE): Use operators (=, >, LIKE, IN, BETWEEN, IS NULL) and logic (AND, OR)
  • Aggregation (GROUP BY): Group rows and apply functions (COUNT, SUM, AVG, etc); filter groups with HAVING
  • Joins: Combine tables (INNER JOIN, LEFT JOIN) based on related columns
  • Subqueries: Nest SELECT statements
  • Window Functions: Calculations across related rows (RANK() OVER (...), AVG(...) OVER (PARTITION BY ...)), preserving individual rows
  • String Functions: Manipulate text (SUBSTR, LENGTH, REPLACE, || for concatenation in SQLite)
  • Conditional Logic: CASE WHEN condition THEN result ... ELSE default END
  • Python’s built-in sqlite3 module provides direct access to SQLite databases (connect, cursor, execute, commit, fetch, close)
  • Pandas simplifies interaction:
    • pd.read_sql(query, conn): Query database and load results into a DataFrame
    • df.to_sql('table', conn, ...): Write DataFrame to a database table
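
For instance (a sketch; the database file, table, and columns are hypothetical), a grouped query with a CASE expression can be loaded straight into a DataFrame:

import sqlite3
import pandas as pd

conn = sqlite3.connect("school.db")   # hypothetical database file

query = """
    SELECT major,
           COUNT(*) AS n_students,
           AVG(gpa) AS avg_gpa,
           CASE WHEN AVG(gpa) >= 3.5 THEN 'high' ELSE 'standard' END AS tier
    FROM students
    GROUP BY major
    HAVING COUNT(*) > 5
    ORDER BY avg_gpa DESC
"""

# Run the query and load the result into a DataFrame
df = pd.read_sql(query, conn)

# Write a DataFrame back to the database as a new table
df.to_sql("major_summary", conn, if_exists="replace", index=False)
conn.close()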

Parallel computing ⚡

Serial vs parallel

  • Serial: Tasks run one after another; slow for large workloads
  • Parallel: Tasks run concurrently on multiple cores/machines; speeds up suitable problems
  • Best suited for “embarrassingly parallel” tasks (independent computations)
  • Python Libraries:
    • joblib: Simple single-machine parallelism (Parallel, delayed)
    • dask: Scalable library for parallel/distributed computing
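
A minimal joblib sketch (the function and inputs are invented stand-ins for an expensive, independent computation):

from joblib import Parallel, delayed

def slow_square(x):
    # Stand-in for an expensive, independent computation
    return x ** 2

# Run the calls across all available CPU cores
results = Parallel(n_jobs=-1)(delayed(slow_square)(i) for i in range(10))
print(results)   # [0, 1, 4, 9, ...]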

Dask:

  • Dask Array: Parallel NumPy arrays
  • Dask DataFrame: Parallel Pandas DataFrames
  • Dask Delayed: Parallelise custom Python functions
  • Dask Distributed: Cluster management for multi-process/multi-machine execution
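
A short sketch of these interfaces (illustrative; the array sizes and functions are arbitrary):

import dask.array as da
from dask import delayed

# Dask Array: a large array split into chunks, computed lazily in parallel
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())          # .compute() triggers the computation

# Dask Delayed: build a task graph from ordinary Python functions
@delayed
def square(n):
    return n ** 2

@delayed
def total(values):
    return sum(values)

graph = total([square(i) for i in range(5)])
print(graph.compute())             # 30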

Containers & Docker 🐳

Dependency management & reproducibility

Virtual environments

  • Challenge: Code failing on different machines due to varying software environments (dependencies) - the “it works on my machine” problem
  • Solution: Explicitly manage dependencies for reproducibility
  • Virtual Environments (Python venv, conda): Isolate project package dependencies, preventing conflicts between projects
    • Use requirements.txt (pip) or environment.yml (conda) to define and recreate these environments easily
  • Limitations: Virtual environments only isolate Python packages, not system libraries or other dependencies

Containers

  • Containers offer superior isolation by packaging an application with all its dependencies (system libraries, tools, code, runtime) into a single, portable unit
  • Ensures consistent execution across any machine running Docker
  • Docker Concepts:
    • Image: Immutable template built from a Dockerfile
    • Container: Running instance of an image
    • Dockerfile: Text file with instructions (FROM, RUN, COPY, etc) to build the image
    • Registry (e.g., Docker Hub): Stores and shares images

Using Docker

Basic workflow:

  • Write Dockerfile: Define environment setup (base OS, package installs, code copying, run command)
  • Build Image: docker build -t myimage:latest . (creates the image locally)
  • Run Container: docker run -p 8888:8888 -v $(pwd):/app myimage:latest (starts the container)
    • -p: Port mapping (host:container)
    • -v: Volume mounting (host:container) for data persistence/code access
    • -it: Interactive terminal
  • Share: docker push, docker pull (via a registry like Docker Hub)
FROM ubuntu:24.04

# Install Python and pip, then clean up to keep the image small
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Sanity check that the interpreter and pip are available
RUN python3 --version && pip3 --version

# Copy the project into the image and install its dependencies
# (--break-system-packages is needed because Ubuntu 24.04 marks the
# system Python as externally managed)
COPY . /app
WORKDIR /app
RUN pip3 install --break-system-packages -r requirements.txt

# Default command run when the container starts (only the final CMD applies)
CMD ["python3", "your_script.py"]

Example Dockerfile

Conclusion 🎓

What we learned

A summary of the key skills and concepts from QTM350:

  • Foundations: The importance of computational literacy, reproducibility, and robust practices in data science
  • Core Workflow: Effective use of the command line, version control with Git & GitHub, and reproducible reporting with Quarto
  • Data Handling: Manipulating data with Python/Pandas and querying databases with SQL
  • Modern Techniques: Enhancing productivity with AI-assisted programming, scaling analyses using Parallel Computing (Dask), and ensuring reliable deployment via Containerisation (Docker)

Questions? 🤔

Thank you very much! 🙏