QTM 350 - Data Science Computing

Lecture 03 - Encoding Information & Introduction to Programming

Danilo Freire

Department of Quantitative Theory and Methods
Emory University

Brief recap 📚

Brief recap

Learning objectives

By the end of this lecture, you will be able to:

  • Explain how text is represented as characters and encoded with ASCII and Unicode (UTF-8)
  • Describe the origins of programming, from Zuse's machines to assembly language
  • Distinguish low-level from high-level languages, and compiled from interpreted languages
  • Describe how the operating system, kernel, shell, and terminal relate to one another

Representing Text 🔤

Represent text as individual characters

Characters and glyphs

  • Next, how do we represent text?
  • As with images, we first break text down into smaller parts: in this case, individual characters
  • A character is the smallest component of text, like A, B, or /.
  • A glyph is the graphical representation of a character.
  • In programming, the display of glyphs is typically handled by GUI (Graphical User Interface) toolkits or font renderers

Represent text as individual characters

Lookup tables

  • For example, the text “Homer Simpson” becomes H, o, m, e, r, space, S, i, m, p, s, o, n
  • Unlike colours, characters do not have a logical connection to numbers
  • To represent characters as numbers, we use a lookup table called ASCII
  • ASCII stands for American Standard Code for Information Interchange
  • As long as every computer uses the same lookup table, computers can always translate a set of numbers into the same set of characters

ASCII is nothing but a simple lookup table

Yes, really!

For basic characters, we can use the encoding system called ASCII. Standard ASCII maps the numbers 0 to 127 to characters, and extended ASCII maps 0 to 255; either way, one character is represented by one byte

Check it out here: ASCII Table

ASCII is nothing but a simple lookup table

Translation

“Hello World” =

01001000 (H) 01100101 (e) 01101100 (l) 01101100 (l)
01101111 (o) 00100000 (space) 01010111 (W) 01101111 (o)
01110010 (r) 01101100 (l) 01100100 (d)
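As a quick check, here is a minimal Python sketch (Python is one of this course's languages) that performs the same translation using the built-in ord() and chr() functions:

text = "Hello World"

# Encode: each character -> its ASCII number -> an 8-bit binary string
binary = [format(ord(ch), "08b") for ch in text]
print(" ".join(binary))   # 01001000 01100101 01101100 01101100 01101111 ...

# Decode: each 8-bit binary string -> a number -> a character
print("".join(chr(int(b, 2)) for b in binary))   # Hello World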

Your turn!

Practice Exercise 03

  • Translate the following binary into ASCII text:

01011001 01100001 01111001

ASCII Limitations

  • ASCII uses 7 bits (standard) or 8 bits (extended) to represent characters. This means it can only represent \(2^7=128\) or \(2^8=256\) unique characters.
  • This is sufficient for American English (unaccented letters, numbers, common punctuation).
  • However, it lacks characters for many other languages (e.g., with accents like ‘é’, ‘ü’, ‘ñ’, or entirely different scripts like Cyrillic, Greek, Arabic, Chinese, Japanese, Korean).
  • Even English text needs characters like ‘é’ for words such as ‘café’ or ‘résumé’.
  • To address this, Unicode was developed.

Unicode: A Universal Character Set

  • Unicode is a global standard designed to represent every character from (almost) every writing system used in the world, plus many symbols and emojis
  • It aims to be a superset of ASCII; the first 128 Unicode characters are identical to ASCII
  • Unicode assigns a unique number (called a “code point”) to each character
  • The Unicode standard defines over 149,000 characters!

Unicode: A Universal Character Set

  • UTF-8 (Unicode Transformation Format - 8-bit):
    • This is the most common way Unicode characters are encoded into bytes for storage and transmission.
    • It’s a variable-width encoding:
      • Uses 1 byte for standard ASCII characters (making it backward compatible).
      • Uses 2, 3, or 4 bytes for other characters.
    • This makes UTF-8 efficient for documents that are mostly English/ASCII but can still represent any Unicode character.
  • Find all the Unicode characters here: https://symbl.cc/en/unicode-table/
    • “Danilo” in Unicode code points: U+0044 U+0061 U+006E U+0069 U+006C U+006F
    • “QTM 350” in Unicode code points: U+0051 U+0054 U+004D U+0020 U+0033 U+0035 U+0030
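To see this variable width in action, here is a minimal Python sketch showing how many bytes UTF-8 needs for different characters:

# UTF-8 uses more bytes as code points grow larger
for ch in ["A", "é", "猫", "😃"]:
    encoded = ch.encode("utf-8")
    print(ch, hex(ord(ch)), len(encoded), "byte(s)")
# A  0x41     1 byte(s)
# é  0xe9     2 byte(s)
# 猫 0x732b   3 byte(s)
# 😃 0x1f603  4 byte(s)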

Examples of Unicode Characters

Unicode allows us to represent a vast range of characters beyond basic ASCII:

  • Accented Latin Characters:
    • ‘é’ (as in café) - Code Point: U+00E9
    • ‘ü’ (as in über) - Code Point: U+00FC
    • ‘ñ’ (as in piñata) - Code Point: U+00F1
  • Non-Latin Characters:
    • Greek: ‘Σ’ (Sigma) - Code Point: U+03A3
    • Cyrillic: ‘Д’ (De) - Code Point: U+0414
    • Japanese: ‘猫’ (Neko - cat) - Code Point: U+732B
    • Arabic: ‘ب’ (Ba) - Code Point: U+0628
  • Symbols and Emojis:
    • ‘€’ (Euro sign) - Code Point: U+20AC
    • ‘→’ (Rightwards arrow) - Code Point: U+2192
    • 😃 (Smiling Face Emoji) - Code Point: U+1F603
    • ❤️ (Red Heart Emoji) - Code Point: U+2764, typically followed by U+FE0F (a variation selector that requests emoji presentation)
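In Python, chr() and ord() work with any Unicode code point, so these examples are easy to verify; a minimal sketch:

print(chr(0x00E9))    # é
print(chr(0x03A3))    # Σ
print(chr(0x1F603))   # 😃
print(hex(ord("€")))  # 0x20ac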

The Genesis of Programming 🌟

The genesis of programming

Zuse’s computers

  • Konrad Zuse was a German engineer and computer pioneer
  • He created the first programmable computer, the Z3, in 1941
  • The Z3 was the first computer to use binary arithmetic and read binary instructions from punch tape
  • For a sense of scale, the later Z4 had just 512 bytes of memory
  • Zuse also created the first high-level programming language, Plankalkül (“a formal system of planning”), though it was not widely implemented at the time

What is Assembly language?

  • Assembly language is a low-level programming language that allows writing machine code in human-readable text (mnemonics)
  • Each assembly instruction typically corresponds to a single machine code instruction that the computer’s processor can execute directly
  • It’s a bridge between human understanding and the raw binary the CPU understands
  • The first assemblers were human! Programmers wrote assembly code, which secretaries transcribed to binary for machine processing. Later, “assembler” programs automated this translation

Some curious facts about Assembly!

Margaret Hamilton and the Apollo 11 code
  • The Apollo 11 mission to the Moon was programmed in assembly language for the Apollo Guidance Computer (AGC). Margaret Hamilton led the MIT team that wrote its on-board flight software.

  • The code is available here: https://github.com/chrislgarry/Apollo-11 (good luck reading it! 😅)

  • One of the files is the BURN_BABY_BURN--MASTER_IGNITION_ROUTINE.agc 🔥 🚀

  • Assembly is very fast and efficient because it’s close to the hardware.

  • But if Assembly is so fast and efficient, why don’t we use it all the time?

Low-level vs high-level languages

  • Low-level languages (e.g., Machine Code, Assembly)
    • Closer to the hardware, providing direct control over the processor and memory
    • Harder to read, write, and debug for humans
    • Code is specific to a particular computer architecture (not easily portable)
    • Very fast and memory efficient
  • High-level languages (e.g., Python, R, Java, C++)
    • Abstract away hardware details, using more human-readable syntax (closer to natural language)
    • Easier to learn, write, and maintain
    • Generally portable across different computer systems (with a compiler/interpreter)
    • May be less performant than optimised low-level code but offer faster development

Compiled vs. Interpreted Languages

High-level languages can be broadly categorised by how their code is executed:

  • Compiled Languages:
    • The human-readable source code is translated entirely into machine code (binary instructions) by a program called a compiler before execution.
    • This machine code is then run directly by the computer’s processor.
    • Examples: C, C++, Fortran, Go, Swift, Rust.
  • Interpreted Languages:
    • The source code is read and executed line-by-line (or statement-by-statement) by another program called an interpreter during runtime.
    • The interpreter translates and executes the code on the fly.
    • Examples: Python, R, JavaScript (traditionally, though modern JS engines often use JIT compilation), Ruby, PHP, Shell scripts.

Some languages can have both compiled and interpreted implementations, or use a hybrid approach (e.g., Java compiles to bytecode, which is then interpreted by the Java Virtual Machine - JVM).
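CPython, the standard Python implementation, is itself a good example of this hybrid approach: it compiles source code to bytecode, which its virtual machine then interprets. A minimal sketch using the standard dis module makes the compiled bytecode visible:

import dis

def greet():
    return "Hello, World!"

# Show the bytecode CPython compiled this function into
dis.dis(greet)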

Pros & Cons: Compiled vs. Interpreted

Compiled Languages

Pros:

  • Generally faster execution speed once compiled, as the translation to machine code is done beforehand
  • The compiler can perform extensive optimisations
  • Many errors are caught at compile-time, before the program is run
  • The resulting executable can be distributed without the source code

Cons:

  • Compilation step can be time-consuming, especially for large projects
  • Less platform-independent: typically need to recompile for different operating systems/architectures
  • Debugging can sometimes be more complex as you’re debugging the compiled code’s behaviour

Pros & Cons: Compiled vs. Interpreted

Interpreted Languages

Pros:

  • Faster development cycle and easier debugging (edit-and-run, errors often point directly to source lines)
  • Greater platform independence (source code can run on any system with the interpreter)
  • Often more dynamic and flexible (e.g., modifying code at runtime)

Cons:

  • Generally slower execution speed as code is translated during runtime
  • Runtime errors can occur if not thoroughly tested
  • Usually requires the interpreter to be installed on the target machine

Low-level vs high-level languages

Code that is worth a thousand words

  • “Hello, World!” in machine code (Hex representation):
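As an illustration, here is one possible hex encoding of the x86 instruction sequence shown in the assembly version on the next slide; the string's address is left as XX XX XX XX since it is only fixed when the program is linked and loaded:

B8 04 00 00 00    ; mov eax, 4    (sys_write)
BB 01 00 00 00    ; mov ebx, 1    (stdout)
B9 XX XX XX XX    ; mov ecx, message
BA 0E 00 00 00    ; mov edx, 14   (length)
CD 80             ; int 0x80
B8 01 00 00 00    ; mov eax, 1    (sys_exit)
31 DB             ; xor ebx, ebx
CD 80             ; int 0x80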

Low-level vs high-level languages

Code that is worth a thousand words

  • “Hello, World!” in Assembly (x86 Assembly for Linux)
section .data
    message db 'Hello, World!', 10    ; 10 is the ASCII code for newline

section .text
    global _start

_start:
    ; write(stdout, message, length)
    mov eax, 4          ; system call number for write (sys_write)
    mov ebx, 1          ; file descriptor 1 is stdout
    mov ecx, message    ; address of string to output
    mov edx, 14         ; number of bytes (length of "Hello, World!\n")
    int 0x80            ; call kernel to perform the write

    ; exit(0)
    mov eax, 1          ; system call number for exit (sys_exit)
    xor ebx, ebx        ; exit status 0 (clear ebx register)
    int 0x80            ; call kernel to exit

Low-level vs high-level languages

Code that is worth a thousand words

  • “Hello, World!” in Python or R:
print("Hello, World!")

Question:
Is Natural Language Programming the Future of High-Level Languages? 🤖

The Operating System (OS) 🖥

A computer in a nutshell

Operating system

[Diagram credit: Dave Kerr]

  • The operating system (OS) is system software that interfaces with (and manages access to) a computer’s hardware. It also provides software resources for applications
  • Key functions: Process management, memory management, file system management, device management, security, user interface
  • The OS is broadly divided into the kernel and user space

A computer in a nutshell

Operating system

[Diagram credit: Dave Kerr]

  • The kernel is the core of the OS. It’s responsible for interfacing directly with hardware (via drivers), managing system resources (CPU, memory), and providing essential services. Running software in the kernel is extremely sensitive! That’s why users and most applications are kept away from it!
  • Curiosity: You can see the Linux kernel source code on GitHub.
  • The user space is where user applications run. It provides an interface for users to interact with the system. Hardware access for programs in user space is managed and controlled by the kernel. Programs in user space are essentially in sandboxes, which limits the potential damage they can cause.

A computer in a nutshell

Kernels and shells

  • The shell is a general name for any user space program that allows users to access and interact with the operating system’s services and resources.
  • It acts as an interface between the user and the kernel.
  • Shells come in many different flavours but are generally provided to aid a human operator in accessing the system. This could be:
    • Interactively, by typing commands at a terminal.
    • Via scripts, which are files containing a sequence of commands to be executed.
  • Modern computers often use graphical user interfaces (GUIs) (like Windows Explorer or macOS Finder) as a type of shell. However, the term “shell” in computing often refers to command-line shells.
  • Why “kernel” and “shell”? The kernel is the soft, edible part of a nut or seed (the core functionality), which is surrounded by a hard shell (the user interface) to protect it and provide access. Useful metaphor, isn’t it? 😉

Interacting with the shell

Terminals

[Diagram credit: Dave Kerr]

  • Things are still a bit more complicated
  • We’re not directly interacting with the “shell” but using a terminal
  • A terminal is just a program that reads input from the keyboard, passes that input to another program, and displays the results on the screen
  • A shell program on its own does not do this; it requires a terminal as an interface
  • Why “terminal”? Back in the old days (before computer screens existed), terminal machines (hardware!) were used to let humans interface with large machines (“mainframes”). Often many terminals were connected to a single machine
  • When you want to work with a computer in a data center (or remotely in cloud computing), you’ll still do pretty much the same
  • But this is a topic for our next lecture! 😄

Summary 💡

Summary

  • Text is represented using ASCII and Unicode, with UTF-8 encoding for diverse characters
  • Programming languages evolved from early machines to high-level languages
  • Assembly language provides a human-readable way to write machine instructions
  • Low-level languages are close to hardware, while high-level languages are more abstract
  • Compiled languages are translated to machine code before runtime, while interpreted languages are translated during runtime
  • The operating system manages hardware and software resources, divided into the kernel (core) and user space (applications)
  • The shell is a user interface program that interacts with the OS
  • The terminal is a program that reads input from the keyboard, passes it to the shell, and displays results on the screen… which is what we will be using in the next lecture and beyond! 🤓

Questions? 🤔

Thank you very much! 😊 🙏

Solutions to Practice Exercises

Solution - Practice Exercise 03

  • Task: Translate the following binary into ASCII text: 01011001 01100001 01111001

  • Step 1: Convert each binary string to its decimal equivalent:

    • \(01011001_2 = (0 \cdot 128) + (1 \cdot 64) + (0 \cdot 32) + (1 \cdot 16) + (1 \cdot 8) + (0 \cdot 4) + (0 \cdot 2) + (1 \cdot 1) = 64 + 16 + 8 + 1 = 89_{10}\)
    • \(01100001_2 = (0 \cdot 128) + (1 \cdot 64) + (1 \cdot 32) + (0 \cdot 16) + (0 \cdot 8) + (0 \cdot 4) + (0 \cdot 2) + (1 \cdot 1) = 64 + 32 + 1 = 97_{10}\)
    • \(01111001_2 = (0 \cdot 128) + (1 \cdot 64) + (1 \cdot 32) + (1 \cdot 16) + (1 \cdot 8) + (0 \cdot 4) + (0 \cdot 2) + (1 \cdot 1) = 64 + 32 + 16 + 8 + 1 = 121_{10}\)
  • Step 2: Map each decimal value to its corresponding ASCII character (refer to an ASCII table):

    • 89 = ‘Y’
    • 97 = ‘a’
    • 121 = ‘y’
  • Step 3: Combine the ASCII characters to form the final text:

    • Result: Yay
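The whole exercise can also be checked with a one-line Python sketch:

# Each 8-bit group -> decimal -> ASCII character
print("".join(chr(int(b, 2)) for b in "01011001 01100001 01111001".split()))   # Yay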

Thank you very much! 😊 🙏