QTM 350 - Data Science Computing

Lecture 03 - Encoding Information & Introduction to Programming

Danilo Freire

Department of Quantitative Theory and Methods
Emory University

Brief recap 📚

Brief recap

Learning objectives

By the end of this lecture, you will be able to:

  • Explain how text is represented as characters and encoded with ASCII and Unicode (UTF-8)
  • Describe the origins of programming, from Zuse's machines to assembly language
  • Distinguish low-level from high-level languages, and compiled from interpreted languages
  • Describe how the operating system, kernel, shell, and terminal relate to one another

Representing Text 🔤

Represent text as individual characters

Characters and glyphs

  • Next, how do we represent text?
  • As with images, we first break text down into smaller parts: in this case, individual characters
  • A character is the smallest component of text, like A, B, or /.
  • A glyph is the graphical representation of a character.
  • In programming, the display of glyphs is typically handled by GUI (Graphical User Interface) toolkits or font renderers

Represent text as individual characters

Lookup tables

  • For example, the text “Homer Simpson” becomes H, o, m, e, r, space, S, i, m, p, s, o, n
  • Unlike colours, characters do not have a logical connection to numbers
  • To represent characters as numbers, we use a lookup table called ASCII
  • ASCII stands for American Standard Code for Information Interchange
  • As long as every computer uses the same lookup table, computers can always translate a set of numbers into the same set of characters

ASCII is nothing but a simple lookup table

Yes, really!

For basic characters, we can use the encoding system called ASCII. Standard ASCII maps the numbers 0 to 127 to characters, and extended ASCII maps 0 to 255; either way, one character is represented by one byte

Check it out here: ASCII Table

ASCII is nothing but a simple lookup table

Translation

“Hello World” =

01001000 (H) 01100101 (e) 01101100 (l) 01101100 (l)
01101111 (o) 00100000 (space) 01010111 (W) 01101111 (o)
01110010 (r) 01101100 (l) 01100100 (d)
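As a quick check, here is a minimal Python sketch (Python is one of this course's languages) that performs the same translation using the built-in ord() and chr() functions:

text = "Hello World"

# Encode: each character -> its ASCII number -> an 8-bit binary string
binary = [format(ord(ch), "08b") for ch in text]
print(" ".join(binary))   # 01001000 01100101 01101100 01101100 01101111 ...

# Decode: each 8-bit binary string -> a number -> a character
print("".join(chr(int(b, 2)) for b in binary))   # Hello World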

Your turn!

Practice Exercise 03

  • Translate the following binary into ASCII text:

01011001 01100001 01111001

ASCII Limitations

  • ASCII uses 7 bits (standard) or 8 bits (extended) to represent characters. This means it can only represent \(2^7=128\) or \(2^8=256\) unique characters.
  • This is sufficient for American English (unaccented letters, numbers, common punctuation).
  • However, it lacks characters for many other languages (e.g., with accents like ‘é’, ‘ü’, ‘ñ’, or entirely different scripts like Cyrillic, Greek, Arabic, Chinese, Japanese, Korean).
  • Even English text needs characters like ‘é’ for words such as ‘café’ or ‘résumé’.
  • To address this, Unicode was developed.

Unicode: A Universal Character Set

  • Unicode is a global standard designed to represent every character from (almost) every writing system used in the world, plus many symbols and emojis
  • It aims to be a superset of ASCII; the first 128 Unicode characters are identical to ASCII
  • Unicode assigns a unique number (called a “code point”) to each character
  • The Unicode standard defines over 149,000 characters!

Unicode: A Universal Character Set

  • UTF-8 (Unicode Transformation Format - 8-bit):
    • This is the most common way Unicode characters are encoded into bytes for storage and transmission.
    • It’s a variable-width encoding:
      • Uses 1 byte for standard ASCII characters (making it backward compatible).
      • Uses 2, 3, or 4 bytes for other characters.
    • This makes UTF-8 efficient for documents that are mostly English/ASCII but can still represent any Unicode character.
  • Find all the Unicode characters here: https://symbl.cc/en/unicode-table/
    • “Danilo” in Unicode code points: U+0044 U+0061 U+006E U+0069 U+006C U+006F
    • “QTM 350” in Unicode code points: U+0051 U+0054 U+004D U+0020 U+0033 U+0035 U+0030
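To see this variable width in action, here is a minimal Python sketch showing how many bytes UTF-8 needs for different characters:

# UTF-8 uses more bytes as code points grow larger
for ch in ["A", "é", "猫", "😃"]:
    encoded = ch.encode("utf-8")
    print(ch, hex(ord(ch)), len(encoded), "byte(s)")
# A  0x41     1 byte(s)
# é  0xe9     2 byte(s)
# 猫 0x732b   3 byte(s)
# 😃 0x1f603  4 byte(s)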

Examples of Unicode Characters

Unicode allows us to represent a vast range of characters beyond basic ASCII:

  • Accented Latin Characters:
    • ‘é’ (as in café) - Code Point: U+00E9
    • ‘ü’ (as in über) - Code Point: U+00FC
    • ‘ñ’ (as in piñata) - Code Point: U+00F1
  • Non-Latin Characters:
    • Greek: ‘Σ’ (Sigma) - Code Point: U+03A3
    • Cyrillic: ‘Д’ (De) - Code Point: U+0414
    • Japanese: ‘猫’ (Neko - cat) - Code Point: U+732B
    • Arabic: ‘ب’ (Ba) - Code Point: U+0628
  • Symbols and Emojis:
    • ‘€’ (Euro sign) - Code Point: U+20AC
    • ‘→’ (Rightwards arrow) - Code Point: U+2192
    • 😃 (Smiling Face Emoji) - Code Point: U+1F603
    • ❤️ (Red Heart Emoji) - Code Point: U+2764, typically followed by U+FE0F (a variation selector that requests emoji presentation)
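In Python, chr() and ord() work with any Unicode code point, so these examples are easy to verify; a minimal sketch:

print(chr(0x00E9))    # é
print(chr(0x03A3))    # Σ
print(chr(0x1F603))   # 😃
print(hex(ord("€")))  # 0x20ac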

The Genesis of Programming 🌟

The genesis of programming

Zuse’s computers

  • Konrad Zuse was a German engineer and computer pioneer
  • He created the first programmable computer, the Z3, in 1941
  • The Z3 was the first computer to use binary arithmetic and read binary instructions from punch tape
  • For a sense of scale, the later Z4 had just 512 bytes of memory
  • Zuse also created the first high-level programming language, Plankalkül (“a formal system of planning”), though it was not widely implemented at the time

What is Assembly language?

  • Assembly language is a low-level programming language that allows writing machine code in human-readable text (mnemonics)
  • Each assembly instruction typically corresponds to a single machine code instruction that the computer’s processor can execute directly
  • It’s a bridge between human understanding and the raw binary the CPU understands
  • The first assemblers were human! Programmers wrote assembly code, which secretaries transcribed to binary for machine processing. Later, “assembler” programs automated this translation

Some curious facts about Assembly!

Margaret Hamilton and the Apollo 11 code
  • The Apollo 11 mission to the Moon was programmed in assembly language for the Apollo Guidance Computer (AGC). Margaret Hamilton led the MIT team that wrote its on-board flight software.

  • The code is available here: https://github.com/chrislgarry/Apollo-11 (good luck reading it! 😅)

  • One of the files is the BURN_BABY_BURN--MASTER_IGNITION_ROUTINE.agc 🔥 🚀

  • Assembly is very fast and efficient because it’s close to the hardware.

  • But if Assembly is so fast and efficient, why don’t we use it all the time?

Low-level vs high-level languages

  • Low-level languages (e.g., Machine Code, Assembly)
    • Closer to the hardware, providing direct control over the processor and memory
    • Harder to read, write, and debug for humans
    • Code is specific to a particular computer architecture (not easily portable)
    • Very fast and memory efficient
  • High-level languages (e.g., Python, R, Java, C++)
    • Abstract away hardware details, using more human-readable syntax (closer to natural language)
    • Easier to learn, write, and maintain
    • Generally portable across different computer systems (with a compiler/interpreter)
    • May be less performant than optimised low-level code but offer faster development

Compiled vs. Interpreted Languages

High-level languages can be broadly categorised by how their code is executed:

  • Compiled Languages:
    • The human-readable source code is translated entirely into machine code (binary instructions) by a program called a compiler before execution.
    • This machine code is then run directly by the computer’s processor.
    • Examples: C, C++, Fortran, Go, Swift, Rust.
  • Interpreted Languages:
    • The source code is read and executed line-by-line (or statement-by-statement) by another program called an interpreter during runtime.
    • The interpreter translates and executes the code on the fly.
    • Examples: Python, R, JavaScript (traditionally, though modern JS engines often use JIT compilation), Ruby, PHP, Shell scripts.

Some languages can have both compiled and interpreted implementations, or use a hybrid approach (e.g., Java compiles to bytecode, which is then interpreted by the Java Virtual Machine - JVM).
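CPython, the standard Python implementation, is itself a good example of this hybrid approach: it compiles source code to bytecode, which its virtual machine then interprets. A minimal sketch using the standard dis module makes the compiled bytecode visible:

import dis

def greet():
    return "Hello, World!"

# Show the bytecode CPython compiled this function into
dis.dis(greet)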

Pros & Cons: Compiled vs. Interpreted

Compiled Languages

Pros:

  • Generally faster execution speed once compiled, as the translation to machine code is done beforehand
  • The compiler can perform extensive optimisations
  • Many errors are caught at compile-time, before the program is run
  • The resulting executable can be distributed without the source code

Cons:

  • Compilation step can be time-consuming, especially for large projects
  • Less platform-independent: typically need to recompile for different operating systems/architectures
  • Debugging can sometimes be more complex as you’re debugging the compiled code’s behaviour

Pros & Cons: Compiled vs. Interpreted

Interpreted Languages

Pros:

  • Faster development cycle and easier debugging (edit-and-run, errors often point directly to source lines)
  • Greater platform independence (source code can run on any system with the interpreter)
  • Often more dynamic and flexible (e.g., modifying code at runtime)

Cons:

  • Generally slower execution speed as code is translated during runtime
  • Runtime errors can occur if not thoroughly tested
  • Usually requires the interpreter to be installed on the target machine

Low-level vs high-level languages

Code that is worth a thousand words

  • “Hello, World!” in machine code (Hex representation):
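As an illustration, here is one possible hex encoding of the x86 instruction sequence shown in the assembly version on the next slide; the string's address is left as XX XX XX XX since it is only fixed when the program is linked and loaded:

B8 04 00 00 00    ; mov eax, 4    (sys_write)
BB 01 00 00 00    ; mov ebx, 1    (stdout)
B9 XX XX XX XX    ; mov ecx, message
BA 0E 00 00 00    ; mov edx, 14   (length)
CD 80             ; int 0x80
B8 01 00 00 00    ; mov eax, 1    (sys_exit)
31 DB             ; xor ebx, ebx
CD 80             ; int 0x80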

Low-level vs high-level languages

Code that is worth a thousand words

  • “Hello, World!” in Assembly (x86 Assembly for Linux)
section .data
    message db 'Hello, World!', 10    ; 10 is the ASCII code for newline

section .text
    global _start

_start:
    ; write(stdout, message, length)
    mov eax, 4          ; system call number for write (sys_write)
    mov ebx, 1          ; file descriptor 1 is stdout
    mov ecx, message    ; address of string to output
    mov edx, 14         ; number of bytes (length of "Hello, World!\n")
    int 0x80            ; call kernel to perform the write

    ; exit(0)
    mov eax, 1          ; system call number for exit (sys_exit)
    xor ebx, ebx        ; exit status 0 (clear ebx register)
    int 0x80            ; call kernel to exit

Low-level vs high-level languages

Code that is worth a thousand words

  • “Hello, World!” in Python or R:
print("Hello, World!")

Question:
Is Natural Language Programming the Future of High-Level Languages? 🤖

The Operating System (OS) 🖥

A computer in a nutshell

Operating system

[Diagram credit: Dave Kerr]

  • The operating system (OS) is system software that interfaces with (and manages access to) a computer’s hardware. It also provides software resources for applications
  • Key functions: Process management, memory management, file system management, device management, security, user interface
  • The OS is broadly divided into the kernel and user space

A computer in a nutshell

Operating system

[Diagram credit: Dave Kerr]

  • The kernel is the core of the OS. It’s responsible for interfacing directly with hardware (via drivers), managing system resources (CPU, memory), and providing essential services. Running software in the kernel is extremely sensitive! That’s why users and most applications are kept away from it!
  • Curiosity: You can see the Linux kernel source code on GitHub.
  • The user space is where user applications run. It provides an interface for users to interact with the system. Hardware access for programs in user space is managed and controlled by the kernel. Programs in user space are essentially in sandboxes, which limits the potential damage they can cause.

A computer in a nutshell

Kernels and shells

  • The shell is a general name for any user space program that allows users to access and interact with the operating system’s services and resources.
  • It acts as an interface between the user and the kernel.
  • Shells come in many different flavours but are generally provided to aid a human operator in accessing the system. This could be:
    • Interactively, by typing commands at a terminal.
    • Via scripts, which are files containing a sequence of commands to be executed.
  • Modern computers often use graphical user interfaces (GUIs) (like Windows Explorer or macOS Finder) as a type of shell. However, the term “shell” in computing often refers to command-line shells.
  • Why “kernel” and “shell”? The kernel is the soft, edible part of a nut or seed (the core functionality), which is surrounded by a hard shell (the user interface) to protect it and provide access. Useful metaphor, isn’t it? 😉

Interacting with the shell

Terminals

[Diagram credit: Dave Kerr]

  • Things are still a bit more complicated
  • We’re not directly interacting with the “shell” but using a terminal
  • A terminal is just a program that reads input from the keyboard, passes that input to another program, and displays the results on the screen
  • A shell program on its own does not do this; it requires a terminal as an interface
  • Why “terminal”? Back in the old days (before computer screens existed), terminal machines (hardware!) were used to let humans interface with large machines (“mainframes”). Often many terminals were connected to a single machine
  • When you want to work with a computer in a data center (or remotely in cloud computing), you’ll still do pretty much the same
  • But this is a topic for our next lecture! 😄

Summary 💡

Summary

  • Text is represented using ASCII and Unicode, with UTF-8 encoding for diverse characters
  • Programming languages evolved from early machines to high-level languages
  • Assembly language provides a human-readable way to write machine instructions
  • Low-level languages are close to hardware, while high-level languages are more abstract
  • Compiled languages are translated to machine code before runtime, while interpreted languages are translated during runtime
  • The operating system manages hardware and software resources, divided into the kernel (core) and user space (applications)
  • The shell is a user interface program that interacts with the OS
  • The terminal is a program that reads input from the keyboard, passes it to the shell, and displays results on the screen… which is what we will be using in the next lecture and beyond! 🤓

Questions? 🤔

Thank you very much! 😊 🙏

Solutions to Practice Exercises

Solution - Practice Exercise 03

  • Task: Translate the following binary into ASCII text: 01011001 01100001 01111001

  • Step 1: Convert each binary string to its decimal equivalent:

    • \(01011001_2 = (0 \cdot 128) + (1 \cdot 64) + (0 \cdot 32) + (1 \cdot 16) + (1 \cdot 8) + (0 \cdot 4) + (0 \cdot 2) + (1 \cdot 1) = 64 + 16 + 8 + 1 = 89_{10}\)
    • \(01100001_2 = (0 \cdot 128) + (1 \cdot 64) + (1 \cdot 32) + (0 \cdot 16) + (0 \cdot 8) + (0 \cdot 4) + (0 \cdot 2) + (1 \cdot 1) = 64 + 32 + 1 = 97_{10}\)
    • \(01111001_2 = (0 \cdot 128) + (1 \cdot 64) + (1 \cdot 32) + (1 \cdot 16) + (1 \cdot 8) + (0 \cdot 4) + (0 \cdot 2) + (1 \cdot 1) = 64 + 32 + 16 + 8 + 1 = 121_{10}\)
  • Step 2: Map each decimal value to its corresponding ASCII character (refer to an ASCII table):

    • 89 = ‘Y’
    • 97 = ‘a’
    • 121 = ‘y’
  • Step 3: Combine the ASCII characters to form the final text:

    • Result: Yay
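The whole exercise can also be checked with a one-line Python sketch:

# Each 8-bit group -> decimal -> ASCII character
print("".join(chr(int(b, 2)) for b in "01011001 01100001 01111001".split()))   # Yay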

Thank you very much! 😊 🙏