DATASCI 350 - Data Science Computing

Lecture 25 - Course Revision

Danilo Freire

Department of Data and Decision Sciences
Emory University

Hello, everyone! 😊

Course recap

  • We covered a lot of content this semester! 🤓
  • Three key areas for data science:
    • Reliability: consistent results every time you run your code
    • Reproducibility: others (and your future self) can obtain the same results with your data and code
    • Robustness: analyses that handle data variation and scale to larger problems
  • Together, these support more credible and collaborative science 🤓

Core tools overview

  • Tools and techniques we covered:
    • Command Line (Bash): system control, automation, remote servers, containers
    • Git & GitHub: version control to track changes, revert mistakes, and collaborate
    • Quarto: literate programming, combining text, code, and results in one document
    • Cloud Computing (AWS): on-demand servers, storage, and services for scalability
    • SQL & Relational Databases: structured data storage and querying
    • AI Tools: programming assistants that generate, explain, and debug code
    • Dask: parallel and distributed computing beyond a single machine
    • Docker: containerisation, packaging apps and dependencies to run anywhere

A quick favour 🙏

Course evaluations are open!

  • Course evaluations are now open, and I would love to hear your thoughts on the course!
  • Just log on to Canvas and go to Account → Profile → Course Evaluations
  • Works on any device: laptop, tablet, or phone
  • It only takes a few minutes!
  • Thank you so much for taking the time. I really appreciate it! 😊

Computational literacy 💻

From human computers to the silicon age

  • The meaning of ‘computing’ has changed a lot over time
  • Originally, ‘computers’ were people doing calculations, often with tools like the abacus
  • Mechanical calculators followed (e.g., Leibniz’s machine for basic arithmetic)
  • The transistor, integrated circuits, and microprocessors of the silicon age gave us electronic computers
  • Most modern computers use the von Neumann architecture: program instructions and data share the same memory, accessed by a CPU

Representing data: binary, hex, and decimal

  • Computers store all information in binary (base 2): only 0s and 1s
  • These map to the on/off states of transistors
  • One binary digit = a bit; eight bits = a byte
  • Hexadecimal (base 16): digits 0-9 and A-F, a compact shorthand for binary. Each hex digit = four bits (e.g., FF = 11111111 = 255)
  • Hex is commonly used for colours (#FF0000) and memory addresses
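These base conversions are easy to check for yourself in Python using only built-ins:

```python
# Convert between decimal, binary, and hexadecimal with Python built-ins
n = 255

print(bin(n))              # '0b11111111': binary representation
print(hex(n))              # '0xff': hexadecimal representation
print(int("FF", 16))       # 255: parse a hex string back to decimal
print(int("11111111", 2))  # 255: parse a binary string back to decimal
```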

Conversion table

Encoding information: text and images

  • Computers encode information so they can process it
  • Abstraction: mapping complex data types to numbers
  • Digital images are grids of pixels
  • Each pixel’s colour uses the RGB model: Red, Green, Blue intensities (0-255)
  • Text encoding assigns a number to each character:
    • ASCII: early standard, enough for basic English but limited (7-8 bits)
    • Unicode (UTF-8): modern universal standard, supports nearly all characters (1-4 bytes)
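You can see both encodings at work in Python: ord() gives a character's code point, and encode() shows how many bytes UTF-8 uses for it:

```python
# ord() returns the code point; encode() returns the UTF-8 bytes
print(ord("A"))                   # 65, the same code in ASCII and Unicode
print("A".encode("utf-8"))        # b'A': one byte, identical to ASCII
print("é".encode("utf-8"))        # b'\xc3\xa9': two bytes in UTF-8
print(len("🤓".encode("utf-8")))  # 4: emoji need four bytes
```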

ASCII table

Programming languages: low-level vs high-level ⌨️

Languages bridge human instructions and machine execution:

  • Machine Code: raw binary the CPU runs directly
  • Assembly Language: low-level mnemonics for machine instructions. CPU-specific, fine control, poor portability
  • High-Level Languages (Python, R, Java): human-readable syntax, abstracts hardware. More portable, but needs translation
  • Translation Methods:
    • Compiled (e.g., C++): source translated to machine code before running; faster execution
    • Interpreted (e.g., Python): code run line-by-line at runtime by an interpreter; easier development
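The compiled/interpreted divide is blurrier than it sounds: CPython first compiles source code to bytecode, which its interpreter then executes. A quick sketch with the built-in compile() and the standard dis module:

```python
import dis

# CPython compiles source into a bytecode object before interpreting it
code = compile("x = 1 + 2", "<example>", "exec")

# The compiler even folds the constant expression 1 + 2 into 3
print(3 in code.co_consts)  # True

# dis shows the bytecode instructions the interpreter will execute
dis.dis(code)
```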

High and low-level languages

The command line 🖥️

Understanding the shell

  • The Operating System (OS) sits between hardware and software
  • Its core, the Kernel, manages hardware access and resources
  • Applications run in User Space, separate from the kernel for stability
  • A Shell (Bash, Zsh) is a command-line interpreter for interacting with the OS
  • You access the shell through a Terminal application

Core commands:

  • pwd: Print Working Directory (shows current location)
  • ls: List directory contents (ls -l for details, ls -a for hidden files)
  • cd <directory>: Change Directory (cd .. up, cd ~ home)
  • mkdir <name>: Make Directory
  • touch <name>: Create empty file / update timestamp
  • cp <source> <dest>: Copy file/directory (-r for directories)
  • mv <source> <dest>: Move or Rename file/directory
  • rm <file>: Remove file (permanently!)
  • rmdir <empty_dir>: Remove empty directory
  • rm -r <dir>: Remove directory recursively (DANGEROUS! No undo)

Working with text & finding files

Inspecting and searching text:

  • cat <file>: display entire file
  • head <file> / tail <file>: first/last lines (-n #)
  • wc <file>: count lines, words, characters
  • grep <pattern> <file>: search with regular expressions
    • -i (ignore case), -r (recursive), -n (line numbers), -v (invert)
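grep's patterns are regular expressions, the same idea Python exposes through its standard re module. A sketch of the equivalent of grep -i over a list of lines:

```python
import re

lines = ["Error: disk full", "All good", "error: retry failed"]

# Equivalent of `grep -i "error"`: case-insensitive pattern matching
matches = [line for line in lines if re.search("error", line, re.IGNORECASE)]
print(matches)  # ['Error: disk full', 'error: retry failed']
```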

Finding files:

  • find <path> [options]: search tool
    • -name <pattern>: by name (quote patterns with *)
    • -iname <pattern>: case-insensitive
    • -type d / -type f: directories / files
    • -size +1M: files larger than 1 MB
  • Wildcards: * (any characters), ? (single character)
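Shell wildcards follow the same matching rules as Python's standard fnmatch module, which is a handy way to test a pattern before handing it to find or rm:

```python
import fnmatch

files = ["report.pdf", "data.csv", "notes.txt", "data_backup.csv"]

# `*` matches any characters, `?` matches exactly one, just as in the shell
print(fnmatch.filter(files, "*.csv"))             # ['data.csv', 'data_backup.csv']
print(fnmatch.fnmatch("notes.txt", "notes.???"))  # True: ??? matches 'txt'
```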

Pipes, redirects & scripting

  • Redirection: control where output goes
    • >: write output to file (overwrites)
    • >>: append output to file
    • <: read input from file
  • Pipes |: chain commands; left output becomes right input (e.g., ls -l | grep "Jan")
  • Shell Scripting: save commands in a .sh file for automation
    • Start with #!/bin/bash (shebang)
    • Make executable: chmod +x script.sh
    • Run: ./script.sh

A simple script that moves .png files to the Desktop

Git & GitHub 💾

Version control basics

  • Problem: managing project changes manually is chaotic 😂
  • Solution: Version Control Systems (VCS) like Git track every change
  • Benefits: revert to previous states, understand history, collaborate, keep backups
  • Core Workflow:
    • Edit files in the Working Directory
    • Stage changes with git add <file> → Staging Area
    • Save a snapshot with git commit -m "message" → Repository
  • Key commands: git status, git add, git checkout, git commit, git diff, git log, git pull, git push

Source: Atlassian

GitHub: collaboration platform 🤝

  • GitHub: a central remote repository (origin) for sharing code

  • Key commands:

    • git commit: record changes locally
    • git push: upload local commits to the remote
    • git pull: download and merge remote changes
  • .gitignore: tells Git which files to skip

  • Other GitHub features:

    • Pull Requests (code review)
    • Wikis
    • GitHub Actions (automation)
    • GitHub Pages (web hosting)

Branching & merging 🌿

  • Version control also means managing different lines of development
  • Branches allow independent work without affecting main
  • Workflow: create a branch (git checkout -b feature-x), commit, then merge back (git checkout main, git merge feature-x)
  • Merge Conflicts: happen when branches change the same lines. Git asks you to resolve them manually
  • Pull Requests (PRs): GitHub’s workflow for proposing merges with code review and discussion

Read more: Git Branching

Quarto 📖

Literate programming & reproducibility 💡

  • Reproducible research: others can consistently reproduce your results
  • Literate programming combines code, results, and narrative in one place
  • Quarto: modern, open-source system built on Pandoc
  • Language-agnostic: supports Python, R, Julia, Observable
  • One source file (.qmd or .ipynb) → HTML, PDF, Word, slides, websites, books
  • More information: Quarto website

Quarto documents

Structure

A typical Quarto (.qmd) document includes:

  • YAML Header: between --- markers; sets metadata (title, author, date) and output format
  • Markdown Content: narrative text with headings, lists, links, maths ($...$, $$...$$)
  • Code Chunks: executable blocks (e.g., ```{python}) with #| options for execution, visibility, figures, etc.
  • Render with: quarto render mydoc.qmd

Advanced features

You can also use Quarto for:

  • Citations ([@citekey], needs .bib)
  • Cross-references (@fig-label)
  • Callouts, Tabsets
  • Interactive components
  • Websites, Books, Presentations

AI-assisted programming 🤖

LLMs & code generation

  • LLMs (Large Language Models) like GPT-4 and Claude: trained on vast text/code data to generate code, explain concepts, translate languages
  • AI-Assisted Programming: using LLMs as coding companions (GitHub Copilot, ChatGPT)
  • Benefits: faster development, less repetitive coding, help with debugging and learning
  • Risks: hallucinations (incorrect/insecure code), training data biases, IP concerns
  • Good prompt engineering matters: clear instructions, context, examples
  • Agents: LLMs that act autonomously (web browsing, API calls)

WebUI GitHub repository: https://github.com/browser-use/web-ui

Cloud computing ☁️

What is cloud computing? 🤔

  • IaaS (Infrastructure as a Service): virtual machines (EC2), networks. You manage the rest
  • PaaS (Platform as a Service): develop and run apps without managing servers or OS
  • SaaS (Software as a Service): complete applications delivered over the internet
  • FaaS (Function as a Service / Serverless): run code on events, no server management (AWS Lambda, Google Cloud Functions)
  • EC2 (Elastic Compute Cloud): scalable virtual servers (instances)
    • Pick an AMI (Amazon Machine Image) to launch
    • Instance types: different CPU, RAM, storage configs
    • Choose OS, instance type, storage (EBS). Connect via SSH
  • S3 (Simple Storage Service): scalable object storage in buckets
  • RDS (Relational Database Service): managed relational databases (PostgreSQL, MySQL, etc.)
  • SageMaker: build, train, and deploy ML models
  • Cost Management: set up billing alarms!

Interacting with EC2 instances

  • Launching: in the AWS Console, pick AMI, instance type, storage, security groups, and an SSH key pair (.pem file)
  • Connecting (SSH): use ssh with your .pem key (set chmod 400 key.pem first)
    • ssh -i key.pem ubuntu@<public-ip-or-dns>
  • Managing Software (Ubuntu): apt package manager
    • sudo apt update → refresh package lists
    • sudo apt upgrade → install updates
    • sudo apt install <package-name> → install software
  • Transferring Files (scp): copy files to/from EC2
    • scp -i key.pem local_file ubuntu@ip:/remote/path (local → remote)
    • scp -i key.pem ubuntu@ip:/remote/file ./local/path (remote → local)
  • Stopping/Terminating: Stop = pause (still pay for storage); Terminate = delete permanently. Manage costs!

Python and SQL for data science 🐍 🗄️

Relational databases & SQL

  • Relational Databases: structured tables with defined relationships (PostgreSQL, MySQL, SQLite)
  • SQL: standard language for querying and managing these databases
  • SQLite: file-based, serverless database engine
  • Keys:
    • Primary Key (PK): uniquely identifies each row
    • Foreign Key (FK): references a PK in another table, creating links

Core commands:

  • CREATE TABLE: Define table schema (columns, data types like INTEGER, TEXT, REAL)
  • INSERT INTO: Add new rows
  • SELECT: Retrieve data (SELECT cols FROM table WHERE condition ORDER BY col)
  • UPDATE: Modify existing rows (UPDATE table SET col = val WHERE condition)
  • DELETE FROM: Remove rows (DELETE FROM table WHERE condition)
  • ALTER TABLE: Modify table structure
  • DROP TABLE: Delete a table
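The core commands above can all be run from Python with the standard sqlite3 module. A minimal sketch using an in-memory database and a hypothetical students table:

```python
import sqlite3

# In-memory SQLite database: nothing is written to disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE: define the schema, with a primary key
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade REAL)")

# INSERT INTO: add rows (use ? placeholders, never string formatting)
cur.executemany("INSERT INTO students (name, grade) VALUES (?, ?)",
                [("Ana", 91.5), ("Ben", 78.0), ("Carla", 85.0)])
conn.commit()

# SELECT with WHERE and ORDER BY
cur.execute("SELECT name FROM students WHERE grade > 80 ORDER BY grade DESC")
print(cur.fetchall())  # [('Ana',), ('Carla',)]

# UPDATE and DELETE FROM work the same way
cur.execute("UPDATE students SET grade = 80.5 WHERE name = 'Ben'")
cur.execute("DELETE FROM students WHERE grade < 80")
conn.close()
```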

Advanced SQL

Going beyond basic queries:

  • Filtering (WHERE): operators (=, >, LIKE, IN, BETWEEN, IS NULL) + logic (AND, OR)
  • Aggregation (GROUP BY): group rows, apply functions (COUNT, SUM, AVG); filter with HAVING
  • Joins: combine tables (INNER JOIN, LEFT JOIN) on related columns
  • Subqueries: nested SELECT statements
  • Window Functions: calculations across related rows (RANK() OVER (...), AVG(...) OVER (PARTITION BY ...)), keeping individual rows
  • String Functions: SUBSTR, LENGTH, REPLACE, || (concatenation)
  • Conditional Logic: CASE WHEN condition THEN result ... ELSE default END
  • Python’s sqlite3 module: connect, cursor, execute, commit, fetch, close
  • Pandas makes it simpler:
    • pd.read_sql(query, conn): query → DataFrame
    • df.to_sql('table', conn, ...): DataFrame → database table
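Putting the two pandas helpers together, here is a small sketch (with a hypothetical staff table) that writes a DataFrame to SQLite and reads an aggregation back:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# to_sql: write a DataFrame into a database table
df = pd.DataFrame({"dept": ["stats", "stats", "cs"],
                   "name": ["Ana", "Ben", "Carla"],
                   "salary": [70, 80, 90]})
df.to_sql("staff", conn, index=False)

# read_sql: run a query (here with GROUP BY) and get a DataFrame back
avg = pd.read_sql("SELECT dept, AVG(salary) AS avg_salary "
                  "FROM staff GROUP BY dept ORDER BY dept", conn)
print(avg)
conn.close()
```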

Parallel computing ⚡

Serial vs parallel

  • Serial: tasks run one after another; slow for large workloads
  • Parallel: tasks run concurrently on multiple cores/machines
  • Best for “embarrassingly parallel” problems (independent computations)
  • Python Libraries:
    • joblib: simple single-machine parallelism (Parallel, delayed)
    • dask: scalable parallel/distributed computing
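joblib's Parallel/delayed pattern has a close analogue in the standard library's concurrent.futures. Since joblib may not be installed everywhere, this sketch uses only the stdlib to run an embarrassingly parallel map:

```python
from concurrent.futures import ProcessPoolExecutor

def slow_square(x):
    # Stand-in for an expensive, independent computation
    return x * x

if __name__ == "__main__":
    # Each input is processed independently, so the work parallelises trivially
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(slow_square, range(5)))
    print(results)  # [0, 1, 4, 9, 16]
```

With joblib the same pattern reads Parallel(n_jobs=-1)(delayed(slow_square)(x) for x in range(5)).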

Dask:

  • Dask Array: parallel NumPy arrays
  • Dask DataFrame: parallel Pandas DataFrames
  • Dask Delayed: parallelise custom functions
  • Dask Distributed: cluster management (multi-process/multi-machine)

Containers & Docker 🐳

Dependency management & reproducibility

Virtual environments

  • Challenge: the “it works on my machine” problem, where code runs locally but fails elsewhere because of different dependencies
  • Solution: explicitly manage dependencies for reproducibility
  • Virtual Environments (venv, conda): isolate project packages, prevent conflicts
    • Define with requirements.txt (pip) or environment.yml (conda)
  • Limitation: virtual environments only isolate Python packages, not system libraries

Containers

  • Containers: package an app with all its dependencies (system libs, tools, code, runtime) in one portable unit
  • Runs the same way on any machine with Docker
  • Docker Concepts:
    • Image: immutable template built from a Dockerfile
    • Container: running instance of an image
    • Dockerfile: instructions (FROM, RUN, COPY, etc.) to build the image
    • Registry (e.g., Docker Hub): stores and shares images

Using Docker

Basic workflow:

  • Write Dockerfile: define base OS, packages, code, run command
  • Build Image: docker build -t myimage:latest .
  • Run Container: docker run -p 8888:8888 -v $(pwd):/app myimage:latest
    • -p: port mapping (host:container)
    • -v: volume mount (host:container) for persistent data
    • -it: interactive terminal
  • Share: docker push / docker pull (via Docker Hub or other registry)
FROM ubuntu:24.04

# Install Python and pip, then clean the apt caches to keep the image small
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Copy the project into the image and set the working directory
COPY . /app
WORKDIR /app

# Install Python dependencies
# (on Ubuntu 24.04, pip may require --break-system-packages outside a venv)
RUN pip3 install --break-system-packages -r requirements.txt

# Default command when the container starts
CMD ["python3", "your_script.py"]

Example Dockerfile

Conclusion 🎓

What we learnt

  • Computational literacy, reproducibility, and solid data science practices
  • Command line, version control (Git & GitHub), reproducible reports (Quarto)
  • Data manipulation with Python/Pandas and database queries with SQL
  • AI-assisted programming, parallel computing (Dask), and containers (Docker)

Questions? 🤔

Thank you very much! 🙏