Welcome to Data Science Computing!

Lecture 01: Introduction

Danilo Freire

Department of Quantitative Theory and Methods
Emory University

15 January, 2025

Welcome to QTM350
Data Science Computing! 🎉

Lecture overview

Today’s agenda

  • Introduction
  • Motivation
  • Class logistics
  • Computer set up

Course materials

Course repository: https://github.com/danilofreire/qtm350

Course website: https://danilofreire.github.io/qtm350

All course materials, including lectures, code, discussions, assignments, and project guidelines, are available on our GitHub repository

We will use Canvas for course administration, including submitting assignments, accessing grades, and receiving announcements

Please take some time to get to know both platforms, and reach out if you have any questions.

Note

Please remember to check the course repository regularly for updates and announcements!

Nice to meet you! 😊

Instructor

A bit about me

Visiting Assistant Professor in the Department of Quantitative Theory and Methods (QTM)

MA from the Graduate Institute Geneva, PhD from King’s College London, Postdoc at Brown University, Senior Lecturer at the University of Lincoln, UK

Research interests: computational social science, experimental methods, policy evaluation, political violence, organised crime

What about you? (time permitting!)

Now it’s your turn! 😉

Please introduce yourself! 👋

Tell us your name, your major, one thing you really like, and something we don’t know about your city or country! 🌍

My teaching philosophy

  • I love teaching and aim to make learning fun
  • Classes where students participate are the best
  • Hands-on activities help you learn better
  • I am always available to help and answer questions. And I mean it
  • Your feedback helps me improve my teaching. Please let me know what is working and what is not

Teaching assistants

  • The teaching assistants for this course are Alix Morales and Harris Wang

  • Alix will be answering questions during our lectures and holding office hours. You can reach out to him via email at

  • Harris will also be grading your assignments and quizzes (with my oversight). His email address is

  • We are all here to help you! So feel free to ask questions during class, office hours, or via email 😃

Office hours

What for and what not for

  • What office hours are meant for:
    • Applying tools in practice
    • Discussion of issues related to the assignments
    • Boosting your knowledge of data science
  • What these sessions are not meant for:
    • Solving the assignments for you
    • Replacing the practice you need to develop your own coding skills

Class etiquette

  • Coding can be tough and push you out of your comfort zone. If the course pace is too fast, let us know. I expect your commitment, but I do not want anyone to fail
  • You are all keen on data science, but your backgrounds vary. That is great! Some sessions might be more engaging than others. If you are bored, help others or explore new data science areas
  • Always be respectful to each other
  • Ask questions whenever you need to!

Motivation:
What is data science? 👨🏻‍💻👩🏼‍💻

An old classic


Our focus!

Rise of the digital information age

Social media data

New data formats

Survey data

Cheap computing power

And what can we do with all these data? 🤔

Scraping the web for social research

Tackling social problems

Reducing hate speech

Monitoring the effects of climate change on health

Calling bullsh*t when you see it

Learn not to be fooled by

  • big data
  • garbage data
  • garbage models
  • weird samples
  • claims of generality
  • implausibly large effect sizes
  • overfitted models

And much more…


  • Abundance of data available for research and for governments to make better decisions
    • Opportunities for novel research questions
    • New methods to answer longstanding research questions
  • New technologies also have social implications and can raise important policy issues
    • Ethical concerns
    • Use of technology by malicious actors
    • Government use of technology to censor or monitor citizens

Course overview and logistics 📖 📚 💻

Course objectives


  • Use data science tools for project collaboration and version control
  • Apply advanced techniques for data storage, manipulation, and querying
  • Create clear data visualisations and write well-documented code
  • Use AI tools to help with programming tasks
  • Understand the basics of containerisation and parallel computing

Key focus areas

Why reliability, reproducibility, and robustness matter

  • This course centres around three key areas of the modern data science workflow: reliability, reproducibility, and robustness
  • Reliability:
    • Ensures consistency in results across multiple runs
    • Minimises errors in data processing and analysis
    • Supports accurate interpretation of findings
  • Reproducibility:
    • Allows others to verify and build upon your work
    • Enhances the credibility of research outcomes
    • Facilitates long-term preservation of scientific knowledge
  • Robustness:
    • Enables analyses to handle unexpected data variations
    • Improves the stability of results under different conditions
    • Supports the scalability of methods to larger datasets
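As a small illustration of the reliability point about consistent results across multiple runs, the sketch below fixes the seed of a pseudo-random number generator so that a simulation returns the same value every time it is run. The function name `simulate_mean` and the seed value 350 are hypothetical choices made for this example, not part of the course materials.

```python
import random


def simulate_mean(seed, n=1000):
    """Draw n values from a seeded generator and return their mean."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    draws = [rng.gauss(0, 1) for _ in range(n)]
    return sum(draws) / n


# Hypothetical seed: any fixed integer works, as long as it is recorded.
first_run = simulate_mean(seed=350)
second_run = simulate_mean(seed=350)

# Same seed, same draws, so the two results are identical.
print(first_run == second_run)
```

Recording the seed alongside the code is one simple way to let others reproduce a stochastic result exactly.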

Key tools


  • Docker for consistent computational environments
  • Dask for scalable and parallel computing

Logistics

Course information

  • Syllabus: Available on our course repository and website. The course is designed to be self-contained. The syllabus includes links to the slides and Jupyter Notebooks we will use in class, along with recommended readings and problem sets. I will upload slides throughout the term as we progress.

  • Schedule: Lectures are on Mondays and Wednesdays from 2:30 to 3:45 pm

  • Office hours: I am happy to meet with you by appointment. Please reach out in advance so we can schedule a time

  • Materials:

Assignments

How you will be graded

  • Problem sets: Ten of them, due on Wednesdays at 11:59 pm (50%)
  • In-class quizzes: Five of them (30%)
  • Final project: Due on the last day of class (20%)
  • Late policy: 10% off per day late
  • Collaboration: You can discuss assignments with your classmates, but you must write your own code and submit your own work. AI is allowed.
  • Academic integrity: Please refer to the syllabus for the university’s policy on academic integrity
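The grading weights above can be combined as a simple weighted sum. The sketch below is only an illustration: the function names and example scores are hypothetical, and the late penalty is implemented under one assumed reading of the policy (10 points out of 100 subtracted per day late, never going below zero). The syllabus remains the authoritative source.

```python
# Weights taken from the slide: problem sets 50%, quizzes 30%, final project 20%.
WEIGHTS = {"problem_sets": 0.50, "quizzes": 0.30, "final_project": 0.20}


def apply_late_penalty(score, days_late, penalty_per_day=10.0):
    """Assumed reading of the late policy: lose 10 points per day, floor at 0."""
    return max(0.0, score - penalty_per_day * days_late)


def final_grade(averages):
    """Weighted sum of component averages, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * averages[name] for name in WEIGHTS)


# Made-up component averages, for illustration only.
example = {"problem_sets": 90.0, "quizzes": 80.0, "final_project": 85.0}
grade = final_grade(example)
late_score = apply_late_penalty(90.0, days_late=2)

print(round(grade, 2), late_score)
```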

Set up 💻 🛠

Software

  • Git: Version control system. Download it here. Instructions for installation here. Feel free to configure it if you wish (instructions here), but we are going to talk about it in class.

  • GitHub: Online platform for hosting code repositories. You will use it a lot, and not only for this class. Create an account on GitHub and register for a student/educator discount.

  • There is a series of tutorials available on our course website on how to set up Git and GitHub: https://danilofreire.github.io/qtm350/tutorials/tutorials.html

OS extras

Other tools

  • We will have time to install other tools during the course. But if you want to get ahead, you can install the following:
  • VS Code: Code editor. Download it here
  • Anaconda: Python distribution. Download it here
  • Docker: Containerisation tool. Download it here
  • SQLite: You can use the sqlite3 Python package or install the SQLite database. The package has everything you need to get started
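For those who want to try the `sqlite3` package from the last bullet, here is a minimal sketch using only the Python standard library. The table name, column names, and rows are made up for illustration, and the in-memory database is an assumption chosen so that nothing is written to disk.

```python
import sqlite3

# ":memory:" creates a temporary database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")

# Hypothetical table for the example.
conn.execute("CREATE TABLE students (name TEXT, major TEXT)")

# Parameterised inserts: the ? placeholders keep data separate from the SQL text.
conn.executemany(
    "INSERT INTO students (name, major) VALUES (?, ?)",
    [("Ada", "QSS"), ("Grace", "Economics")],  # made-up rows
)
conn.commit()

# Query the table back as a list of tuples.
rows = conn.execute("SELECT name, major FROM students ORDER BY name").fetchall()
print(rows)

conn.close()
```

Replacing `":memory:"` with a file name would store the database on disk instead.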

Next class

  • We will cover computational literacy, including binary and hexadecimal numbers, and character encoding systems like ASCII and Unicode
  • We will also discuss the early days of computing, focusing on Konrad Zuse’s work with digital computers and binary arithmetic
  • We will talk about the evolution of programming languages, from assembly to modern high-level languages like Python, and the differences between compiled and interpreted languages
  • There will be time for questions about installing the terminal. You do not need it for next week, but consider installing it soon, as it will be necessary in two weeks. Please create a GitHub educational account if you do not have one 😉
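As a preview of the number systems and character encodings mentioned in the first bullet, the short sketch below uses only Python built-ins. The number 350 and the sample characters are arbitrary choices for illustration.

```python
number = 350

# Convert an integer to its binary and hexadecimal text representations.
binary_text = bin(number)  # prefixed with "0b"
hex_text = hex(number)     # prefixed with "0x"

# Convert back from text to an integer by stating the base.
from_binary = int("101011110", 2)
from_hex = int("15e", 16)

# ASCII/Unicode: every character has a numeric code point.
code_point = ord("A")
character = chr(65)

# UTF-8 stores ASCII characters in one byte; many others need more.
ascii_bytes = "A".encode("utf-8")
accented_bytes = "é".encode("utf-8")

print(binary_text, hex_text, from_binary, from_hex)
print(code_point, character, len(ascii_bytes), len(accented_bytes))
```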

And that’s all for today! 🎉

Questions?

Thank you very much for your attention! 🙏🏻

Have a great day! 😊