Lecture 14 - Introduction to Pandas
21 October, 2024
In Jupyter you can also run `R` and `JavaScript` code, as well as write LaTeX and Markdown documents.
Install NumPy with `conda install numpy` or `pip install numpy`, then load it with the `import` command: `import numpy as np`.
NumPy arrays (`ndarrays`) are homogeneous, which means that all items in an array should be of the same type. `ndarrays` are also compatible with NumPy's vast collection of built-in functions! Unlike a list, an array can only hold a single type (usually numbers); mix a number with a string and everything is coerced, so the object still has type `numpy.ndarray`, but `1` is stored as `'1'`.

`ndarrays` are typically created using two main methods:

- `np.array()`, applied to a Python list (use nested lists `[[ ]]` for two-dimensional arrays)
- Built-in constructor functions such as `np.zeros()`, `np.ones()`, `np.arange()`, `np.linspace()`, `np.random.normal()`, etc.
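A quick sketch of these constructors (the values in the comments are easy to verify):

```python
import numpy as np

a = np.array([1, 2, 3])        # from a Python list
m = np.array([[1, 2], [3, 4]]) # nested lists -> 2-D array
z = np.zeros(3)                # [0., 0., 0.]
o = np.ones((2, 2))            # 2x2 array of ones
r = np.arange(0, 10, 2)        # [0, 2, 4, 6, 8]
l = np.linspace(0, 1, 5)       # [0., 0.25, 0.5, 0.75, 1.]
n = np.random.normal(0, 1, 4)  # 4 draws from a standard normal
```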
NumPy operations are applied element-wise, which lets you avoid explicit `for` loops in many cases! 😅 For example:

- Mathematical functions: `np.sqrt()`, `np.exp()`, `np.log()`, `np.sin()`, `np.cos()`, `np.tan()`, etc. (please check the documentation for more information)
- Comparison operators: `==`, `!=`, `>`, `<`, `>=`, `<=`, etc.
- Aggregation methods: `a.sum()`, `a.mean()`, `a.max()`, `a.min()`, etc.
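A small sketch of all three kinds of vectorised operations on one array:

```python
import numpy as np

a = np.array([1, 4, 9, 16])

sq = np.sqrt(a)   # element-wise math: [1. 2. 3. 4.]
big = a > 5       # element-wise comparison: [False False True True]
total = a.sum()   # aggregation: 30
avg = a.mean()    # 7.5
```

No `for` loop in sight, and NumPy runs these in compiled code, which is also much faster than looping in Python.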
```
Pie sales (#):
[[2 3 1]
 [6 3 3]
 [5 3 5]]
```
You could use the `np.repeat()` function to do this. Square brackets `[]` are used to index arrays, and colons `:` are used to slice them:

```
array([[8, 3, 2, 4, 0, 4],
       [0, 2, 0, 4, 7, 9],
       [9, 4, 7, 2, 6, 3],
       [3, 8, 2, 5, 3, 6]])
```
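A sketch of indexing and slicing, using the array printed above:

```python
import numpy as np

x = np.array([[8, 3, 2, 4, 0, 4],
              [0, 2, 0, 4, 7, 9],
              [9, 4, 7, 2, 6, 3],
              [3, 8, 2, 5, 3, 6]])

elem = x[0, 1]     # single element (row 0, column 1) -> 3
row = x[1]         # the whole second row
col = x[:, 0]      # the first column -> [8, 0, 9, 3]
sub = x[1:3, 2:4]  # 2x2 sub-array (rows 1-2, columns 2-3)
```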
Think of Pandas as Python's equivalent of `dplyr` + `tibble` in R.
Pandas has two main data structures: `Series` and `DataFrame`.

A `Series` is a one-dimensional array with an index, pretty much like a `np.array`, but with a label for each element. You can create a `Series` from a list, a NumPy array, a dictionary, or a scalar value using `pd.Series()` (note the capital "S"):

```python
import pandas as pd
```
```
a    -5.0
b     1.3
c    21.0
d     6.0
e     3.0
dtype: float64
```
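The output above can be reproduced with, for example (values copied from the printout):

```python
import pandas as pd

# a labelled 1-D array: values plus an index of labels
s = pd.Series([-5.0, 1.3, 21.0, 6.0, 3.0],
              index=['a', 'b', 'c', 'd', 'e'])
print(s)
```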
You can index a `Series` like a NumPy array: elements are accessed using square brackets `[ ]` and sliced using colon `:` notation:

```
A    0
B    1
C    2
D    3
E    4
dtype: int64
```
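A sketch using the Series printed above; note that label-based slices include the endpoint, unlike ordinary Python slices:

```python
import pandas as pd

s = pd.Series(range(5), index=['A', 'B', 'C', 'D', 'E'])

first = s['A']    # single element by label -> 0
mid = s['B':'D']  # label slice is INCLUSIVE of the endpoint: B, C, D
head = s[0:2]     # integer slice is exclusive, as usual: A, B
```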
Unlike `ndarrays`, operations between `Series` align values based on their LABELS (not their position in the structure):

```
B    10
C    11
D    12
E    13
dtype: int64
```
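A minimal sketch of label alignment (my own toy values, not from the slides): labels present in only one operand come back as `NaN`.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
s2 = pd.Series([10, 20, 30], index=['B', 'C', 'D'])

result = s1 + s2
# A     NaN   <- 'A' exists only in s1
# B    12.0   <- 2 + 10, matched by LABEL, not by position
# C    23.0   <- 3 + 20
# D     NaN   <- 'D' exists only in s2
```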
`DataFrames` are created with `pd.DataFrame()` (note the capital "D" and "F"). You can pass the `index` and `columns` arguments to give rows and columns labels. You can also create a `DataFrame` from a CSV file using `pd.read_csv()`, or from an Excel file using `pd.read_excel()`.
| Create DataFrame from | Code |
|---|---|
| Lists of lists | `pd.DataFrame([['Danilo', 7], ['Maria', 15], ['Lucas', 3]])` |
| ndarray | `pd.DataFrame(np.array([['Danilo', 7], ['Maria', 15], ['Lucas', 3]]))` |
| Dictionary | `pd.DataFrame({"Name": ['Danilo', 'Maria', 'Lucas'], "Number": [7, 15, 3]})` |
| List of tuples | `pd.DataFrame(zip(['Danilo', 'Maria', 'Lucas'], [7, 15, 3]))` |
| Series | `pd.DataFrame({"Name": pd.Series(['Danilo', 'Maria', 'Lucas']), "Number": pd.Series([7, 15, 3])})` |
See the Pandas documentation for more
There are several ways to index a `DataFrame`: `[ ]`, `.loc[]`, `.iloc[]`, Boolean indices, and `.query()`.
```python
df = pd.DataFrame({"Name": ["Danilo", "Maria", "Lucas"],
                   "Language": ["Python", "Python", "R"],
                   "Courses": [5, 4, 7]})
df
```
| | Name | Language | Courses |
|---|---|---|---|
| 0 | Danilo | Python | 5 |
| 1 | Maria | Python | 4 |
| 2 | Lucas | R | 7 |
You can select columns using square brackets `[ ]` or dot notation (`.`). To select by position, use `.iloc[]`, which accepts integers as references to rows/columns; `.loc[]` accepts labels as references to rows/columns.

The `.query()` method allows you to select data using a string expression. `.query()` accepts a string expression to evaluate, and it "knows" the names of the columns in your dataframe. Note the use of single quotes AND double quotes in such expressions; lucky we have both in Python!
You can also use the `@` symbol to reference variables in the environment.
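Using the `df` defined above, both features look like this (`py_experts` and `busy` are just illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Danilo", "Maria", "Lucas"],
                   "Language": ["Python", "Python", "R"],
                   "Courses": [5, 4, 7]})

# single quotes inside the double-quoted expression string
py_experts = df.query("Language == 'Python' and Courses > 4")

# @ pulls a variable from the surrounding Python environment
min_courses = 5
busy = df.query("Courses >= @min_courses")
```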
| Method | Syntax | Output |
|---|---|---|
| Select column | `df[col_label]` | Series |
| Select row slice | `df[row_1_int:row_2_int]` | DataFrame |
| Select row/column by label | `df.loc[row_label(s), col_label(s)]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select row/column by integer | `df.iloc[row_int(s), col_int(s)]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select by row integer & column label | `df.loc[df.index[row_int], col_label]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select by row label & column integer | `df.loc[row_label, df.columns[col_int]]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select by boolean | `df[bool_vec]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select by boolean expression | `df.query("expression")` | Object for single selection, Series for one row/column, otherwise DataFrame |
`.csv` files are the most common data format (for good reason!), and Pandas provides the `pd.read_csv()` function for this. It has many arguments to help you read a `.csv` file in an efficient and appropriate manner; feel free to check them out by using `shift + tab` in Jupyter, or typing `help(pd.read_csv)`.
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
You can save a `DataFrame` to a `.csv` file using `df.to_csv()`. `pd.read_csv()` can also read a file directly from a URL:
```python
url = "https://github.com/danilofreire/qtm350/raw/refs/heads/main/lectures/lecture-14/data/iris.csv"
df = pd.read_csv(url)
df.head()
```
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Pandas has many useful methods for exploring a dataframe, such as `.min()`, `.idxmin()`, `.sort_values()`, etc.

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
| 2 | 11 Sep 2019, 00:23:50 | Afternoon Ride | Ride | 1863 | 12.52 | Wet road but nice weather |
| 3 | 11 Sep 2019, 14:06:19 | Morning Ride | Ride | 2192 | 12.84 | Stopped for photo of sunrise |
| 4 | 12 Sep 2019, 00:28:05 | Afternoon Ride | Ride | 1891 | 12.48 | Tired by the end of the week |
| 5 | 16 Sep 2019, 13:57:48 | Morning Ride | Ride | 2272 | 12.45 | Rested after the weekend! |
| 6 | 17 Sep 2019, 00:15:47 | Afternoon Ride | Ride | 1973 | 12.45 | Legs feeling strong! |
```
Date                         1 Oct 2019, 00:15:07
Name                               Afternoon Ride
Type                                         Ride
Time                                         1712
Distance                                    11.79
Comments    A little tired today but good weather
dtype: object
```
`.sort_values()` is used to sort a `DataFrame` by one or more columns. By default, `ascending=True`:
| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 32 | 11 Oct 2019, 00:16:57 | Afternoon Ride | Ride | 1843 | 11.79 | Bike feeling tight, needs an oil and pump |
| 16 | 25 Sep 2019, 00:07:21 | Afternoon Ride | Ride | 1775 | 12.10 | Feeling really tired |
| 5 | 16 Sep 2019, 13:57:48 | Morning Ride | Ride | 2272 | 12.45 | Rested after the weekend! |
| 6 | 17 Sep 2019, 00:15:47 | Afternoon Ride | Ride | 1973 | 12.45 | Legs feeling strong! |
| 20 | 27 Sep 2019, 01:00:18 | Afternoon Ride | Ride | 1712 | 12.47 | Tired by the end of the week |
And in descending order, with `ascending=False`:

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 8 | 18 Sep 2019, 13:49:53 | Morning Ride | Ride | 2903 | 14.57 | Raining today |
| 25 | 2 Oct 2019, 13:46:06 | Morning Ride | Ride | 2134 | 13.06 | Bit tired today but good weather |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
| 19 | 26 Sep 2019, 13:42:43 | Morning Ride | Ride | 2350 | 12.91 | Detour around trucks at Jericho |
You can also combine `.query()` with `.sort_values()`:
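A minimal sketch of such chaining, using three rows copied from the cycling table above (methods apply left to right):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Afternoon Ride", "Morning Ride", "Afternoon Ride"],
                   "Time": [2084, 2531, 1863],
                   "Distance": [12.62, 13.03, 12.52]})

# filter the afternoon rides first, then sort what remains
result = (df.query("Name == 'Afternoon Ride'")
            .sort_values(by="Distance", ascending=False))
```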
Use `.shape` to get the dimensions of the `DataFrame`. `.info()` prints a summary of the `DataFrame`:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Date      33 non-null     object
 1   Name      33 non-null     object
 2   Type      33 non-null     object
 3   Time      33 non-null     int64
 4   Distance  31 non-null     float64
 5   Comments  33 non-null     object
dtypes: float64(1), int64(1), object(4)
memory usage: 1.7+ KB
```
`.columns` returns the column names. `.describe()` prints the descriptive stats of the numerical columns:

| | Time | Distance |
|---|---|---|
| count | 33.000000 | 31.000000 |
| mean | 3512.787879 | 12.667419 |
| std | 8003.309233 | 0.428618 |
| min | 1712.000000 | 11.790000 |
| 25% | 1863.000000 | 12.480000 |
| 50% | 2118.000000 | 12.620000 |
| 75% | 2285.000000 | 12.750000 |
| max | 48062.000000 | 14.570000 |
`.dtypes` returns the data types of the columns.

There are two main ways to rename columns: `.rename()` (to selectively change column names, usually recommended) or the `.columns` attribute (to change all column names at once).

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |

Most functions/methods let you modify an object in place with `inplace=True`, e.g., `df.rename(..., inplace=True)`, but reassigning the result with `df = df.rename(...)` is what Pandas recommends:

| | Datetime | Name | Type | Time | Distance | Notes |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
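A sketch of the `.rename()` call that produces the column names shown above (dataframe contents abbreviated to two columns):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["10 Sep 2019, 00:13:04"],
                   "Comments": ["Rain"]})

# map old names to new names; columns not listed are left untouched
df = df.rename(columns={"Date": "Datetime", "Comments": "Notes"})
```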
You can also rename all columns at once by assigning a new list to the `.columns` attribute:

| | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
| 2 | 11 Sep 2019, 00:23:50 | Afternoon Ride | Ride | 1863 | 12.52 | Wet road but nice weather |
| 3 | 11 Sep 2019, 14:06:19 | Morning Ride | Ride | 2192 | 12.84 | Stopped for photo of sunrise |
| 4 | 12 Sep 2019, 00:28:05 | Afternoon Ride | Ride | 1891 | 12.48 | Tired by the end of the week |
I recommend using `.rename()`, because with the `.columns` attribute you can easily mess up the order of the columns 😅
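A sketch of the `.columns` approach on a toy dataframe; the new list must have exactly one name per column, in order:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])

# replaces ALL column names at once; length must match the column count
df.columns = [f"Column {i}" for i in range(1, 4)]
```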
You can add/remove columns in two ways: `[]` to add columns and `.drop()` to drop columns.

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
```python
df['Rider'] = 'Danilo Freire'
df['Avg Speed'] = df['Distance'] * 1000 / df['Time']  # avg. speed in m/s
df.head(2)
```
| | Date | Name | Type | Time | Distance | Comments | Rider | Avg Speed |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain | Danilo Freire | 6.055662 |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain | Danilo Freire | 5.148163 |
After dropping the new columns with `.drop()`, we are back to the original layout:

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
| 2 | 11 Sep 2019, 00:23:50 | Afternoon Ride | Ride | 1863 | 12.52 | Wet road but nice weather |
| 3 | 11 Sep 2019, 14:06:19 | Morning Ride | Ride | 2192 | 12.84 | Stopped for photo of sunrise |
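A sketch of both operations on a tiny dataframe (column names taken from the table above):

```python
import pandas as pd

df = pd.DataFrame({"Distance": [12.62], "Time": [2084]})

df["Rider"] = "Danilo Freire"      # [] adds a new column
df = df.drop(columns=["Rider"])    # .drop() removes it again
```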
You won’t often be adding rows to a dataframe manually (you’ll usually add rows through joining). You can add/remove rows of a dataframe in two ways: `pd.concat()` to add rows and `.drop()` to drop rows. Let’s add a new row to the dataframe:
```python
another_row = pd.DataFrame([["12 Oct 2019, 00:10:57", "Morning Ride", "Ride",
                             2331, 12.67, "Washed and oiled bike last night"]],
                           columns=df.columns,
                           index=[33])
df = pd.concat([df, another_row])
df.tail(2)
```
| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 32 | 11 Oct 2019, 00:16:57 | Afternoon Ride | Ride | 1843 | 11.79 | Bike feeling tight, needs an oil and pump |
| 33 | 12 Oct 2019, 00:10:57 | Morning Ride | Ride | 2331 | 12.67 | Washed and oiled bike last night |
At this point, you might be asking why we need all these different data structures
Well, they all serve different purposes and are suited to different tasks. For example:
My advice: use the simplest data structure that fulfills your needs!
Finally, we’ve seen how to go from `ndarray` (`np.array()`) → `Series` (`pd.Series()`) → `DataFrame` (`pd.DataFrame()`). Remember that we can also go the other way: `DataFrame`/`Series` → `ndarray` using `df.to_numpy()`.
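For instance, on a toy dataframe (the labels are dropped in the conversion):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
arr = df.to_numpy()  # DataFrame -> plain ndarray; index/column labels are lost
```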
But you will probably use DataFrames most of the time 😉
Next, we’ll see how to visualise data with `.plot()`, and more! 😊