QTM 350 - Data Science Computing

Lecture 14 - Introduction to Pandas

Danilo Freire

Emory University

21 October, 2024

Hello, everyone! 👋

I’ve made something!
Let me know if it works 🐍

Jupyter Notebooks online 🌐

  • It already comes with all the packages we need for this class, such as NumPy, Pandas, Matplotlib, and Seaborn
  • You can install many other packages too! 📦
  • Not all Python packages work, but many do. Install them with
%pip install package-name
  • You can also use it to run R and JavaScript code, as well as write LaTeX and Markdown documents
  • Please download your files with the right-click menu before closing the browser!
  • Let me know if you find any bugs! 🐞

Brief recap of the last lecture

Introduction to Python 🐍

  • In the last lecture, we had a brief introduction to Python
  • We covered the main concepts of the language, such as variables, operators, control structures, and functions
  • We also saw how to install Python and Jupyter Notebook
  • We briefly discussed the various data types in Python, such as integers, floats, strings, lists, tuples, and dictionaries
  • We finished with for loops, if statements, and functions
  • Today we will learn more about NumPy and, more importantly, Pandas

NumPy 🧮

NumPy arrays

What is NumPy?

  • NumPy stands for “Numerical Python” and it is the standard Python library used for working with arrays (i.e., vectors & matrices), linear algebra, and other numerical computations
  • NumPy is written in C, making NumPy arrays faster and more memory efficient than Python lists
  • If you have Anaconda installed, you already have NumPy installed too. But if you don’t, you can install it using conda install numpy or pip install numpy
  • In Python, we import packages with the import command
  • It is also common to use aliases to make the code shorter and more readable. NumPy’s is np
import numpy as np 

What is an array?

  • Arrays are “n-dimensional” data structures that can contain all the basic Python data types, e.g., floats, integers, strings, etc.
  • However, they work best with numeric data
  • NumPy arrays (ndarrays) are homogeneous, which means that items in the array should be of the same type.
  • ndarrays are also compatible with numpy’s vast collection of in-built functions!

Creating arrays

  • A numpy array is sort of like a list, but with more functionality
my_list = [1, 2, 3, 4, 5]
my_list
[1, 2, 3, 4, 5]
my_array = np.array([1, 2, 3, 4, 5])
my_array
array([1, 2, 3, 4, 5])
  • But it has the type numpy.ndarray

  • Unlike a list, arrays can only hold a single type (usually numbers). Check this out

my_list = [1, "hi"]
my_list
[1, 'hi']
my_array = np.array([1, "hi"])
my_array
array(['1', 'hi'], dtype='<U21')
  • Above: NumPy converted the integer 1 into the string '1'!
  • ndarrays are typically created using two main methods:
    • From existing data using np.array()
    • Using built-in functions like np.zeros(), np.ones(), np.arange(), np.linspace(), np.random.normal(), etc.
  • Let’s see some examples
my_list = [1, 2, 3]
np.array(my_list)
array([1, 2, 3])
np.arange(1, 10, 2)  # from 1 inclusive to 10 exclusive, step 2
array([1, 3, 5, 7, 9])
np.linspace(0, 10, 5) # from 0 to 10, 5 numbers
array([ 0. ,  2.5,  5. ,  7.5, 10. ])
  • You can have multi-dimensional arrays (indicated by double square brackets [[ ]]):
np.array([[1, 2, 3], [4, 5, 6]])
array([[1, 2, 3],
       [4, 5, 6]])
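The other built-in constructors listed above (np.zeros(), np.ones(), and random draws) can be sketched like this; note this uses the modern default_rng interface for random numbers, which the slides don’t show, and the seed value is arbitrary:

```python
import numpy as np

# Arrays of zeros and ones (the shape can be an int or a tuple)
zeros = np.zeros((2, 3))   # 2x3 array filled with 0.0
ones = np.ones(4)          # length-4 array filled with 1.0

# Reproducible random draws from a normal distribution
rng = np.random.default_rng(seed=350)
draws = rng.normal(loc=0, scale=1, size=5)

print(zeros.shape)  # (2, 3)
print(ones.sum())   # 4.0
print(draws.shape)  # (5,)
```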

Array operations

  • Arrays can be used in arithmetic operations, such as addition, subtraction, multiplication, and division
  • These operations are performed element-wise, meaning that the operation is applied to each element in the array
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
a + b
array([5, 7, 9])
a * b
array([ 4, 10, 18])
  • You can also apply functions to arrays, such as np.sqrt(), np.exp(), np.log(), np.sin(), np.cos(), np.tan(), etc. Please check the documentation for more information
np.sqrt(a)
array([1.        , 1.41421356, 1.73205081])
np.exp(a)
array([ 2.71828183,  7.3890561 , 20.08553692])
  • You can also apply logical operations to arrays, such as ==, !=, >, <, >=, <=, etc.
  • These operations return boolean arrays
a == b
array([False, False, False])
a < b
array([ True,  True,  True])
  • You can call (most of) these functions on the array itself using the dot notation a.sum(), a.mean(), a.max(), a.min(), etc.
    • A dot “.” basically means “in here”
a.sum()
6
a.mean()
2.0
a.max()
3
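One handy consequence of combining boolean arrays with aggregation methods, sketched here as a small addition to the examples above:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# True counts as 1 in arithmetic, so summing a boolean
# array counts how many elements satisfy the condition
mask = a > 2
print(mask.sum())      # 3 elements are greater than 2
print(a[mask].mean())  # mean of just those elements: 4.0
```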

Broadcasting

  • Broadcasting allows NumPy to work with arrays of different shapes when performing arithmetic operations
  • The smaller array is “broadcast” across the larger array so that they have compatible shapes
  • So it substitutes for explicit loops in many cases! 😅
  • One example is adding a scalar to an array
cost = np.array([20, 15, 25])
print("Pie cost:")
print(cost)
Pie cost:
[20 15 25]
sales = np.array([[2, 3, 1], [6, 3, 3], [5, 3, 5]])
print("\nPie sales (#):")
print(sales)

Pie sales (#):
[[2 3 1]
 [6 3 3]
 [5 3 5]]
  • How do we make them the same size?

  • We can expand the cost array to match the shape of the sales array
  • One explicit way to do this is with the np.repeat() function
cost = np.repeat(cost, 3).reshape((3, 3))
cost
array([[20, 20, 20],
       [15, 15, 15],
       [25, 25, 25]])
  • Now we can calculate the total sales
total_sales = cost * sales
total_sales
array([[ 40,  60,  20],
       [ 90,  45,  45],
       [125,  75, 125]])
  • Wohoo! 🥳 Much easier than creating a loop
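As a side note (not on the slides), broadcasting can produce the same result without np.repeat() at all, as long as we give cost a compatible shape:

```python
import numpy as np

cost = np.array([20, 15, 25])
sales = np.array([[2, 3, 1], [6, 3, 3], [5, 3, 5]])

# Adding a length-1 axis makes cost a (3, 1) column vector;
# NumPy then stretches it across the columns of sales automatically
total_sales = cost[:, np.newaxis] * sales
print(total_sales)
```

This gives the same 3x3 result as the np.repeat() version above.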

Indexing and slicing

Numeric indexing

  • Indexing arrays is similar to indexing lists, just with more dimensions
  • We use square brackets [] to index arrays
  • Colons : are used to slice arrays
x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
x[3]
3
x[2:5]
array([2, 3, 4])
x[2:]
array([2, 3, 4, 5, 6, 7, 8, 9])
x[-1]
9
x[5:0:-1] # reverse
array([5, 4, 3, 2, 1])

Multi-dimensional arrays

  • For multi-dimensional arrays, we use commas to separate the indices
  • The first index refers to the row, the second to the column, and so on
x = np.random.randint(10, size=(4, 6))
x
array([[8, 3, 2, 4, 0, 4],
       [0, 2, 0, 4, 7, 9],
       [9, 4, 7, 2, 6, 3],
       [3, 8, 2, 5, 3, 6]])
x[3, 4]
3
x[2, :]
array([9, 4, 7, 2, 6, 3])
x[:, 3]
array([4, 4, 2, 5])
x[2:, :3]
array([[9, 4, 7],
       [3, 8, 2]])
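Masks and index lists also work on multi-dimensional arrays; a small sketch with a fixed array (rather than a random one) so the outputs are reproducible:

```python
import numpy as np

x = np.array([[8, 3, 2],
              [0, 2, 9],
              [9, 4, 7]])

# A boolean mask on a 2-D array returns the matching
# elements as a flattened 1-D array (row-major order)
print(x[x > 5])  # [8 9 9 7]

# "Fancy" indexing: paired lists of row and column indices
print(x[[0, 2], [1, 2]])  # elements (0,1) and (2,2): [3 7]
```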

Pandas! 🐼

Pandas is pretty cool! 🐼

Pandas Series

  • Pandas is the most popular Python library for tabular data structures
  • Think of Pandas as an extremely powerful version of Excel or dplyr + tibble in R
  • Pandas is built on top of NumPy, so it is fast and memory efficient
  • Pandas has two main data structures: Series and DataFrame
  • A Series is a one-dimensional array with an index, pretty much like a np.array, but with a label for each element
    • They are strictly one-dimensional
  • You can create a Series from a list, a NumPy array, a dictionary, or a scalar value using pd.Series() (note the capital “S”)
  • You import pandas with import pandas as pd

Creating Series

import pandas as pd

pd.Series(data = [-5, 1.3, 21, 6, 3])
0    -5.0
1     1.3
2    21.0
3     6.0
4     3.0
dtype: float64
  • The left column is the index, and the right column is the data
  • If you don’t specify an index, Pandas will create one for you
  • But you can add a custom index:
pd.Series(data = [-5, 1.3, 21, 6, 3],
          index = ['a', 'b', 'c', 'd', 'e'])
a    -5.0
b     1.3
c    21.0
d     6.0
e     3.0
dtype: float64
  • You can create a Series from a dictionary:
pd.Series(data = {'a': 10, 'b': 20, 'c': 30})
a    10
b    20
c    30
dtype: int64

Indexing and slicing Series

  • You can index and slice a Series like a NumPy array
  • In fact, Series can be passed to most NumPy functions!
  • They can be indexed using square brackets [ ] and sliced using colon : notation:
s = pd.Series(data = range(5),
              index = ['A', 'B', 'C', 'D', 'E'])
s
A    0
B    1
C    2
D    3
E    4
dtype: int64
s[0]
0
s['A']
0
s[["B", "D", "C"]]
B    1
D    3
C    2
dtype: int64
  • Note above how array-based indexing and slicing also return the Series index
  • Note that in recent Pandas versions, positional access like s[0] on a Series with non-integer labels is deprecated; prefer s.iloc[0]
  • Finally, we can also do boolean indexing with series
s[s >= 1]
B    1
C    2
D    3
E    4
dtype: int64
s[s > s.mean()]
D    3
E    4
dtype: int64
(s != 1)
A     True
B    False
C     True
D     True
E     True
dtype: bool
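One subtlety worth adding to the slicing examples above: slicing by labels behaves differently from slicing by position. A small sketch:

```python
import pandas as pd

s = pd.Series(range(5), index=["A", "B", "C", "D", "E"])

# Label-based slices INCLUDE the end point,
# unlike position-based slices
print(s["B":"D"].tolist())  # [1, 2, 3]  ('D' is included)
print(s[1:3].tolist())      # [1, 2]     (position 3 is excluded)
```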

Series operations

  • Series can be used in arithmetic operations, such as addition, subtraction, multiplication, and division
  • Unlike ndarrays, operations between Series align values based on their LABELS (not their position in the structure)
  • The resulting index will be the sorted union of the two indexes
s1 = pd.Series(data = range(4),
               index = ["A", "B", "C", "D"])
s1
A    0
B    1
C    2
D    3
dtype: int64
s2 = pd.Series(data = range(10, 14),
               index = ["B", "C", "D", "E"])
s2
B    10
C    11
D    12
E    13
dtype: int64
s1 + s2
A     NaN
B    11.0
C    13.0
D    15.0
E     NaN
dtype: float64

  • Indices that don’t match will appear in the result, but with NaN values
  • NumPy also accepts Series as arguments to most functions, because Series are built on NumPy arrays
np.exp(s1)
A     1.000000
B     2.718282
C     7.389056
D    20.085537
dtype: float64
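If the NaNs produced by non-matching labels are unwanted, the Series arithmetic methods accept a fill_value; a sketch using the same two series:

```python
import pandas as pd

s1 = pd.Series(range(4), index=["A", "B", "C", "D"])
s2 = pd.Series(range(10, 14), index=["B", "C", "D", "E"])

# .add() with fill_value treats missing labels as 0
# instead of producing NaN
result = s1.add(s2, fill_value=0)
print(result)
```

Now "A" and "E" get their single value instead of NaN.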

Pandas DataFrames 🐼

Pandas DataFrames

  • Pandas DataFrames are your new best friend 😂
  • DataFrames are really just Series stuck together!
  • Think of a DataFrame as a dictionary of series, with the “keys” being the column labels and the “values” being the series data

  • DataFrames can be created using pd.DataFrame() (note the capital “D” and “F”)
pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
  • We can use the index and columns arguments to give them labels:
pd.DataFrame([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]],
             index = ["R1", "R2", "R3"],
             columns = ["C1", "C2", "C3"])
C1 C2 C3
R1 1 2 3
R2 4 5 6
R3 7 8 9

Creating DataFrames

  • DataFrames can be created from dictionaries, lists, NumPy arrays, and Series
  • It is common to create DataFrames from dictionaries, where the keys are the column names and the values are the data
pd.DataFrame({"C1": [1, 2, 3],
              "C2": ['A', 'B', 'C']},
             index=["R1", "R2", "R3"])
C1 C2
R1 1 A
R2 2 B
R3 3 C
  • Usually, you will create a DataFrame from a CSV file using pd.read_csv()
  • You can also create a DataFrame from an Excel file using pd.read_excel()
  • Pandas can read from many other sources, such as SQL databases (as we will see in this course), JSON files, and even HTML tables
df = pd.read_csv("data/iris.csv")
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Creating DataFrames - Cheat Sheet

  • Lists of lists: pd.DataFrame([['Danilo', 7], ['Maria', 15], ['Lucas', 3]])
  • ndarray: pd.DataFrame(np.array([['Danilo', 7], ['Maria', 15], ['Lucas', 3]]))
  • Dictionary: pd.DataFrame({"Name": ['Danilo', 'Maria', 'Lucas'], "Number": [7, 15, 3]})
  • List of tuples: pd.DataFrame(zip(['Danilo', 'Maria', 'Lucas'], [7, 15, 3]))
  • Series: pd.DataFrame({"Name": pd.Series(['Danilo', 'Maria', 'Lucas']), "Number": pd.Series([7, 15, 3])})

See the Pandas documentation for more

Indexing and slicing DataFrames

  • There are several ways to select data from a DataFrame:
    • Using square brackets [ ], .loc[], .iloc[], Boolean indices, and .query()
df = pd.DataFrame({"Name": ["Danilo", "Maria", "Lucas"],
                   "Language": ["Python", "Python", "R"],
                   "Courses": [5, 4, 7]})
df
Name Language Courses
0 Danilo Python 5
1 Maria Python 4
2 Lucas R 7
  • You can select a column using square brackets [ ] or dot notation .
df["Name"]
0    Danilo
1     Maria
2     Lucas
Name: Name, dtype: object
df.Name
0    Danilo
1     Maria
2     Lucas
Name: Name, dtype: object
  • You can select multiple columns by passing a list of column names
df[["Name", "Courses"]]
Name Courses
0 Danilo 5
1 Maria 4
2 Lucas 7
  • You can select rows using .iloc[], which accepts integers as references to rows/columns
df.iloc[0]  # returns a series
Name        Danilo
Language    Python
Courses          5
Name: 0, dtype: object
df.iloc[0:2]  # returns a dataframe
Name Language Courses
0 Danilo Python 5
1 Maria Python 4

Indexing and slicing DataFrames

  • Now let’s look at .loc, which accepts labels as references to rows/columns:
df.loc[:, 'Name']
0    Danilo
1     Maria
2     Lucas
Name: Name, dtype: object
df.loc[:, 'Name':'Language']
Name Language
0 Danilo Python
1 Maria Python
2 Lucas R
df.loc[[0, 2], ['Language']]
Language
0 Python
2 R
  • Boolean indexing is also possible
df[df["Courses"] > 5]
Name Language Courses
2 Lucas R 7
df[df['Name'] == "Danilo"]
Name Language Courses
0 Danilo Python 5

Indexing and slicing with .query()

  • The .query() method allows you to select data using a string expression
  • It is my favourite method because it is more readable and less error-prone
  • .query() accepts a string expression to evaluate and it “knows” the names of the columns in your dataframe
df.query('Courses > 5')
Name Language Courses
2 Lucas R 7
df.query('Name == "Danilo"')
Name Language Courses
0 Danilo Python 5
df.query("Courses > 4 & Language == 'Python'")
Name Language Courses
0 Danilo Python 5
  • Note the use of single quotes AND double quotes above; luckily we have both in Python!

  • You can also use the @ symbol to reference variables in the environment

min_courses = 5
df.query("Courses > @min_courses")
Name Language Courses
2 Lucas R 7
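query() strings support a bit more than plain comparisons: `in` tests and the keywords and/or also work. A small sketch with the same dataframe (the 'Julia' entry is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Danilo", "Maria", "Lucas"],
                   "Language": ["Python", "Python", "R"],
                   "Courses": [5, 4, 7]})

# `in` membership tests and `and`/`or` keywords
# can be used inside the query string
subset = df.query("Language in ['Python', 'Julia'] and Courses >= 5")
print(subset)
```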

Indexing cheatsheet

Method: Syntax → Output
  • Select column: df[col_label] → Series
  • Select row slice: df[row_1_int:row_2_int] → DataFrame
  • Select row/column by label: df.loc[row_label(s), col_label(s)] → object for single selection, Series for one row/column, otherwise DataFrame
  • Select row/column by integer: df.iloc[row_int(s), col_int(s)] → object for single selection, Series for one row/column, otherwise DataFrame
  • Select by row integer & column label: df.loc[df.index[row_int], col_label] → object for single selection, Series for one row/column, otherwise DataFrame
  • Select by row label & column integer: df.loc[row_label, df.columns[col_int]] → object for single selection, Series for one row/column, otherwise DataFrame
  • Select by boolean: df[bool_vec] → DataFrame
  • Select by boolean expression: df.query("expression") → DataFrame

Reading/Writing Data From External Sources 📂

Reading data from external sources

.csv files

  • Pandas can read data from many sources, such as CSV, Excel, SQL databases, JSON, HTML, and more
  • As mentioned above, .csv files are the most common data format (for good reason!)
  • You can use the pd.read_csv() function for this
  • There are so many arguments that can be used to help read in your .csv file in an efficient and appropriate manner, feel free to check them out by using shift + tab in Jupyter, or typing help(pd.read_csv)
df = pd.read_csv("data/iris.csv")
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
  • You can save a DataFrame to a .csv file using df.to_csv()
df.to_csv("data/iris_copy.csv", index=False)
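Two of the many pd.read_csv() arguments mentioned above, sketched with an in-memory CSV (via StringIO) so the example is self-contained:

```python
import pandas as pd
from io import StringIO

# An in-memory CSV stands in for a file on disk
csv_text = (
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,Iris-setosa\n"
    "4.9,3.0,Iris-setosa\n"
    "7.0,3.2,Iris-versicolor\n"
)

# usecols keeps only some columns; nrows limits how many rows are read
df = pd.read_csv(StringIO(csv_text),
                 usecols=["sepal_length", "species"],
                 nrows=2)
print(df.shape)  # (2, 2)
```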

URLs

  • You can also read data from a URL using pd.read_csv()
  • This is useful when you want to read data from a website without downloading it first
url = "https://github.com/danilofreire/qtm350/raw/refs/heads/main/lectures/lecture-14/data/iris.csv"
df = pd.read_csv(url)
df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Common DataFrame Operations 🐼

Common DataFrame operations

  • DataFrames have built-in methods for performing most common operations, e.g., .min(), .idxmin(), .sort_values(), etc.
  • They’re all documented in the Pandas documentation, but I’ll demonstrate a few below
df = pd.read_csv('data/cycling_data.csv')
df.head(7)
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 Wet road but nice weather
3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 Stopped for photo of sunrise
4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 Tired by the end of the week
5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 Rested after the weekend!
6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 Legs feeling strong!
df.min()
Date                         1 Oct 2019, 00:15:07
Name                               Afternoon Ride
Type                                         Ride
Time                                         1712
Distance                                    11.79
Comments    A little tired today but good weather
dtype: object
df['Time'].min()
1712
df['Time'].idxmin() # index of the minimum value
20
df.iloc[20]
Date               27 Sep 2019, 01:00:18
Name                      Afternoon Ride
Type                                Ride
Time                                1712
Distance                           12.47
Comments    Tired by the end of the week
Name: 20, dtype: object

Sorting DataFrames

  • .sort_values() sorts a DataFrame by one or more columns
  • For each column, you can specify the order (ascending or descending)
  • The default is ascending=True
df.sort_values(by='Distance').head()
Date Name Type Time Distance Comments
32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 Bike feeling tight, needs an oil and pump
16 25 Sep 2019, 00:07:21 Afternoon Ride Ride 1775 12.10 Feeling really tired
5 16 Sep 2019, 13:57:48 Morning Ride Ride 2272 12.45 Rested after the weekend!
6 17 Sep 2019, 00:15:47 Afternoon Ride Ride 1973 12.45 Legs feeling strong!
20 27 Sep 2019, 01:00:18 Afternoon Ride Ride 1712 12.47 Tired by the end of the week
df.sort_values(by='Distance', ascending=False).head(4)
Date Name Type Time Distance Comments
8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 Raining today
25 2 Oct 2019, 13:46:06 Morning Ride Ride 2134 13.06 Bit tired today but good weather
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
19 26 Sep 2019, 13:42:43 Morning Ride Ride 2350 12.91 Detour around trucks at Jericho
  • You can also combine .query() with .sort_values()
df.query('Distance > 13').sort_values(by='Time')
Date Name Type Time Distance Comments
25 2 Oct 2019, 13:46:06 Morning Ride Ride 2134 13.06 Bit tired today but good weather
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
8 18 Sep 2019, 13:49:53 Morning Ride Ride 2903 14.57 Raining today
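.sort_values() also takes lists, sorting by several columns with a per-column direction; a small sketch on a toy version of the cycling data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Afternoon Ride", "Morning Ride", "Afternoon Ride"],
                   "Time": [2084, 2531, 1863]})

# Sort by several columns, with a direction for each:
# Name ascending, then Time descending within each name
out = df.sort_values(by=["Name", "Time"], ascending=[True, False])
print(out["Time"].tolist())  # [2084, 1863, 2531]
```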

DataFrame Characteristics 🔎

DataFrame characteristics

  • DataFrames have many attributes that can be used to understand the data
  • For example, you can use .shape to get the dimensions of the DataFrame
df.shape
(33, 6)
  • .info() prints a summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Date      33 non-null     object 
 1   Name      33 non-null     object 
 2   Type      33 non-null     object 
 3   Time      33 non-null     int64  
 4   Distance  31 non-null     float64
 5   Comments  33 non-null     object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.7+ KB
  • .columns returns the column names
df.columns
Index(['Date', 'Name', 'Type', 'Time', 'Distance', 'Comments'], dtype='object')
  • .describe() prints the descriptive stats of the numerical columns
df.describe()
Time Distance
count 33.000000 31.000000
mean 3512.787879 12.667419
std 8003.309233 0.428618
min 1712.000000 11.790000
25% 1863.000000 12.480000
50% 2118.000000 12.620000
75% 2285.000000 12.750000
max 48062.000000 14.570000
  • .dtypes returns the data types of the columns
df.dtypes
Date         object
Name         object
Type         object
Time          int64
Distance    float64
Comments     object
dtype: object

Rename columns

  • We can rename columns in two ways:
  • Using .rename() (to selectively change column names, usually recommended)
  • By setting the .columns attribute (to change all column names at once)
df.head(2)
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
df.rename(columns={"Date": "Datetime",
                   "Comments": "Notes"})
df.head(2)
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
  • Wait? What happened? Nothing changed? 🤔
  • We did actually rename the columns of our dataframe, but we didn’t modify the dataframe in place; we made a copy of it
  • There are generally two options for making permanent dataframe changes:
    • Use the argument inplace=True, e.g., df.rename(..., inplace=True), available in most functions/methods
    • Re-assign, e.g., df = df.rename(...) (recommended by Pandas)
df = df.rename(columns={"Date": "Datetime",
                        "Comments": "Notes"})
df.head(2)
Datetime Name Type Time Distance Notes
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
  • Now it works fine! 🥳

Rename columns

  • If you wish to change all of the columns of a dataframe, you can do so by setting the .columns attribute:
df.columns = [f"Column {_}" for _ in range(1, 7)]
df.head(5)
Column 1 Column 2 Column 3 Column 4 Column 5 Column 6
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 Wet road but nice weather
3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 Stopped for photo of sunrise
4 12 Sep 2019, 00:28:05 Afternoon Ride Ride 1891 12.48 Tired by the end of the week
  • This is a bit more dangerous than using .rename() because you can easily mess up the order of the columns 😅
  • So be careful when using this method

Adding and removing columns

  • There are two main ways to add/remove columns of a dataframe:
    • Use [] to add columns
    • Use .drop() to drop columns
  • Let’s re-read in a fresh copy of the cycling dataset
df = pd.read_csv('data/cycling_data.csv')
df.head(2)
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
  • Let’s add new columns to the dataframe
df['Rider'] = 'Danilo Freire'
df['Avg Speed'] = df['Distance'] * 1000 / df['Time']  # avg. speed in m/s
df.head(2)
Date Name Type Time Distance Comments Rider Avg Speed
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain Danilo Freire 6.055662
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain Danilo Freire 5.148163
  • Now let’s remove the columns we just added
df = df.drop(columns=['Rider', 'Avg Speed'])
df.head(4)
Date Name Type Time Distance Comments
0 10 Sep 2019, 00:13:04 Afternoon Ride Ride 2084 12.62 Rain
1 10 Sep 2019, 13:52:18 Morning Ride Ride 2531 13.03 rain
2 11 Sep 2019, 00:23:50 Afternoon Ride Ride 1863 12.52 Wet road but nice weather
3 11 Sep 2019, 14:06:19 Morning Ride Ride 2192 12.84 Stopped for photo of sunrise
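An alternative to bracket assignment worth knowing is .assign(), which does not modify the dataframe in place; a minimal sketch on a toy version of the cycling data:

```python
import pandas as pd

df = pd.DataFrame({"Time": [2084, 2531],
                   "Distance": [12.62, 13.03]})

# .assign() adds columns by returning a NEW dataframe,
# which keeps the original untouched (useful in method chains)
df2 = df.assign(avg_speed=lambda d: d["Distance"] * 1000 / d["Time"])
print(df2.columns.tolist())
print("avg_speed" in df.columns)  # original unchanged: False
```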

Adding and removing rows

  • You won’t often be adding rows to a dataframe manually (you’ll usually add rows through joining)

  • You can add/remove rows of a dataframe in two ways: pd.concat() to add rows and .drop() to drop rows

  • Let’s add a new row to the dataframe

another_row = pd.DataFrame([["12 Oct 2019, 00:10:57", "Morning Ride", "Ride",
                             2331, 12.67, "Washed and oiled bike last night"]],
                           columns = df.columns,
                           index = [33])
df = pd.concat([df, another_row])
df.tail(2)
Date Name Type Time Distance Comments
32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 Bike feeling tight, needs an oil and pump
33 12 Oct 2019, 00:10:57 Morning Ride Ride 2331 12.67 Washed and oiled bike last night
  • Finally, let’s remove the row we just added
df = df.drop(index=33)
df.tail(2)
Date Name Type Time Distance Comments
31 10 Oct 2019, 13:47:14 Morning Ride Ride 2463 12.79 Stopped for photo of sunrise
32 11 Oct 2019, 00:16:57 Afternoon Ride Ride 1843 11.79 Bike feeling tight, needs an oil and pump

Why ndarrays and Series and DataFrames?

  • At this point, you might be asking why we need all these different data structures

  • Well, they all serve different purposes and are suited to different tasks. For example:

    • NumPy is typically faster and uses less memory than Pandas
    • Not all Python packages are compatible with NumPy & Pandas
    • The ability to add labels to data can be useful (e.g., for time series)
    • NumPy and Pandas have different built-in functions available
  • My advice: use the simplest data structure that fulfills your needs!

  • Finally, we’ve seen how to go from: ndarray (np.array()) -> Series (pd.Series()) -> DataFrame (pd.DataFrame())

  • Remember that we can also go the other way: dataframe/series -> ndarray using df.to_numpy()

  • But you will probably use DataFrames most of the time 😉
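The round trip between the three structures, sketched end to end:

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# ndarray -> DataFrame (labels added) -> ndarray again
df = pd.DataFrame(arr, columns=["a", "b"])
back = df.to_numpy()

print(type(back).__name__)   # ndarray
print((back == arr).all())   # True
```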

Conclusion

  • Today we learned about NumPy and Pandas, two of the most important Python libraries for data manipulation
  • We saw how to create arrays and series, and how to perform operations on them
  • We also saw how to create and manipulate DataFrames, and how to read/write data from/to external sources
  • We learned how to index and slice arrays and DataFrames
  • We also learned how to rename columns, add/remove columns/rows, and how to sort DataFrames
  • Finally, we discussed why we need different data structures
  • Next time, we will learn about DataFrame reshaping, joining data, applying custom functions, visualising DataFrames with .plot(), and more! 😊

And that’s a wrap! 🎬

Thank you very much!
See you next time! 😊🙏🏽