Lecture 14 - Introduction to Pandas
21 October, 2024
In Jupyter you can also run `R` and `JavaScript` code, as well as write LaTeX and Markdown documents.
Install NumPy with `conda install numpy` or `pip install numpy`, then load it with the `import` command: `import numpy as np`.
NumPy arrays (`ndarrays`) are homogeneous, which means that all items in an array should be of the same type. `ndarrays` are also compatible with NumPy's vast collection of built-in functions! Unlike a list, an array can only hold a single type (usually numbers); mix a number with a string and everything is coerced, so the object still has type `numpy.ndarray`, but `1` is stored as `'1'`.

`ndarrays` are typically created using two main methods:

- `np.array()`, applied to a Python list (use nested lists `[[ ]]` for two-dimensional arrays)
- Built-in constructor functions such as `np.zeros()`, `np.ones()`, `np.arange()`, `np.linspace()`, `np.random.normal()`, etc.
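A quick sketch of these constructors (the values in the comments are easy to verify):

```python
import numpy as np

a = np.array([1, 2, 3])        # from a Python list
m = np.array([[1, 2], [3, 4]]) # nested lists -> 2-D array
z = np.zeros(3)                # [0., 0., 0.]
o = np.ones((2, 2))            # 2x2 array of ones
r = np.arange(0, 10, 2)        # [0, 2, 4, 6, 8]
l = np.linspace(0, 1, 5)       # [0., 0.25, 0.5, 0.75, 1.]
n = np.random.normal(0, 1, 4)  # 4 draws from a standard normal
```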
NumPy operations are applied element-wise, which lets you avoid explicit `for` loops in many cases! 😅 For example:

- Mathematical functions: `np.sqrt()`, `np.exp()`, `np.log()`, `np.sin()`, `np.cos()`, `np.tan()`, etc. (please check the documentation for more information)
- Comparison operators: `==`, `!=`, `>`, `<`, `>=`, `<=`, etc.
- Aggregation methods: `a.sum()`, `a.mean()`, `a.max()`, `a.min()`, etc.
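A small sketch of all three kinds of vectorised operations on one array:

```python
import numpy as np

a = np.array([1, 4, 9, 16])

sq = np.sqrt(a)   # element-wise math: [1. 2. 3. 4.]
big = a > 5       # element-wise comparison: [False False True True]
total = a.sum()   # aggregation: 30
avg = a.mean()    # 7.5
```

No `for` loop in sight, and NumPy runs these in compiled code, which is also much faster than looping in Python.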
```
Pie sales (#):
[[2 3 1]
 [6 3 3]
 [5 3 5]]
```
You could use the `np.repeat()` function to do this. Square brackets `[]` are used to index arrays, and colons `:` are used to slice them:

```
array([[8, 3, 2, 4, 0, 4],
       [0, 2, 0, 4, 7, 9],
       [9, 4, 7, 2, 6, 3],
       [3, 8, 2, 5, 3, 6]])
```
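A sketch of indexing and slicing, using the array printed above:

```python
import numpy as np

x = np.array([[8, 3, 2, 4, 0, 4],
              [0, 2, 0, 4, 7, 9],
              [9, 4, 7, 2, 6, 3],
              [3, 8, 2, 5, 3, 6]])

elem = x[0, 1]     # single element (row 0, column 1) -> 3
row = x[1]         # the whole second row
col = x[:, 0]      # the first column -> [8, 0, 9, 3]
sub = x[1:3, 2:4]  # 2x2 sub-array (rows 1-2, columns 2-3)
```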
Think of Pandas as Python's equivalent of `dplyr` + `tibble` in R.
Pandas has two main data structures: `Series` and `DataFrame`.

A `Series` is a one-dimensional array with an index, pretty much like a `np.array`, but with a label for each element. You can create a `Series` from a list, a NumPy array, a dictionary, or a scalar value using `pd.Series()` (note the capital "S"):

```python
import pandas as pd
```
```
a    -5.0
b     1.3
c    21.0
d     6.0
e     3.0
dtype: float64
```
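The output above can be reproduced with, for example (values copied from the printout):

```python
import pandas as pd

# a labelled 1-D array: values plus an index of labels
s = pd.Series([-5.0, 1.3, 21.0, 6.0, 3.0],
              index=['a', 'b', 'c', 'd', 'e'])
print(s)
```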
You can index a `Series` like a NumPy array: elements are accessed using square brackets `[ ]` and sliced using colon `:` notation:

```
A    0
B    1
C    2
D    3
E    4
dtype: int64
```
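A sketch using the Series printed above; note that label-based slices include the endpoint, unlike ordinary Python slices:

```python
import pandas as pd

s = pd.Series(range(5), index=['A', 'B', 'C', 'D', 'E'])

first = s['A']    # single element by label -> 0
mid = s['B':'D']  # label slice is INCLUSIVE of the endpoint: B, C, D
head = s[0:2]     # integer slice is exclusive, as usual: A, B
```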
Unlike `ndarrays`, operations between `Series` align values based on their LABELS (not their position in the structure):

```
B    10
C    11
D    12
E    13
dtype: int64
```
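A minimal sketch of label alignment (my own toy values, not from the slides): labels present in only one operand come back as `NaN`.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
s2 = pd.Series([10, 20, 30], index=['B', 'C', 'D'])

result = s1 + s2
# A     NaN   <- 'A' exists only in s1
# B    12.0   <- 2 + 10, matched by LABEL, not by position
# C    23.0   <- 3 + 20
# D     NaN   <- 'D' exists only in s2
```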
`DataFrames` are created with `pd.DataFrame()` (note the capital "D" and "F"). You can pass the `index` and `columns` arguments to give rows and columns labels. You can also create a `DataFrame` from a CSV file using `pd.read_csv()`, or from an Excel file using `pd.read_excel()`.
| Create DataFrame from | Code |
|---|---|
| Lists of lists | `pd.DataFrame([['Danilo', 7], ['Maria', 15], ['Lucas', 3]])` |
| ndarray | `pd.DataFrame(np.array([['Danilo', 7], ['Maria', 15], ['Lucas', 3]]))` |
| Dictionary | `pd.DataFrame({"Name": ['Danilo', 'Maria', 'Lucas'], "Number": [7, 15, 3]})` |
| List of tuples | `pd.DataFrame(zip(['Danilo', 'Maria', 'Lucas'], [7, 15, 3]))` |
| Series | `pd.DataFrame({"Name": pd.Series(['Danilo', 'Maria', 'Lucas']), "Number": pd.Series([7, 15, 3])})` |
See the Pandas documentation for more
There are several ways to index a `DataFrame`: `[ ]`, `.loc[]`, `.iloc[]`, Boolean indices, and `.query()`.
```python
df = pd.DataFrame({"Name": ["Danilo", "Maria", "Lucas"],
                   "Language": ["Python", "Python", "R"],
                   "Courses": [5, 4, 7]})
df
```
| | Name | Language | Courses |
|---|---|---|---|
| 0 | Danilo | Python | 5 |
| 1 | Maria | Python | 4 |
| 2 | Lucas | R | 7 |
You can select columns using square brackets `[ ]` or dot notation (`.`). To select by position, use `.iloc[]`, which accepts integers as references to rows/columns; `.loc[]` accepts labels as references to rows/columns.

The `.query()` method allows you to select data using a string expression. `.query()` accepts a string expression to evaluate, and it "knows" the names of the columns in your dataframe. Note the use of single quotes AND double quotes in such expressions; lucky we have both in Python!
You can also use the `@` symbol to reference variables in the environment.
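Using the `df` defined above, both features look like this (`py_experts` and `busy` are just illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Danilo", "Maria", "Lucas"],
                   "Language": ["Python", "Python", "R"],
                   "Courses": [5, 4, 7]})

# single quotes inside the double-quoted expression string
py_experts = df.query("Language == 'Python' and Courses > 4")

# @ pulls a variable from the surrounding Python environment
min_courses = 5
busy = df.query("Courses >= @min_courses")
```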
| Method | Syntax | Output |
|---|---|---|
| Select column | `df[col_label]` | Series |
| Select row slice | `df[row_1_int:row_2_int]` | DataFrame |
| Select row/column by label | `df.loc[row_label(s), col_label(s)]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select row/column by integer | `df.iloc[row_int(s), col_int(s)]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select by row integer & column label | `df.loc[df.index[row_int], col_label]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select by row label & column integer | `df.loc[row_label, df.columns[col_int]]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select by boolean | `df[bool_vec]` | Object for single selection, Series for one row/column, otherwise DataFrame |
| Select by boolean expression | `df.query("expression")` | Object for single selection, Series for one row/column, otherwise DataFrame |
`.csv` files are the most common data format (for good reason!), and Pandas provides the `pd.read_csv()` function for this. It has many arguments to help you read a `.csv` file in an efficient and appropriate manner; feel free to check them out by using `shift + tab` in Jupyter, or typing `help(pd.read_csv)`.
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
You can save a `DataFrame` to a `.csv` file using `df.to_csv()`. `pd.read_csv()` can also read a file directly from a URL:
```python
url = "https://github.com/danilofreire/qtm350/raw/refs/heads/main/lectures/lecture-14/data/iris.csv"
df = pd.read_csv(url)
df.head()
```
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Pandas has many useful methods for exploring a dataframe, such as `.min()`, `.idxmin()`, `.sort_values()`, etc.

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
| 2 | 11 Sep 2019, 00:23:50 | Afternoon Ride | Ride | 1863 | 12.52 | Wet road but nice weather |
| 3 | 11 Sep 2019, 14:06:19 | Morning Ride | Ride | 2192 | 12.84 | Stopped for photo of sunrise |
| 4 | 12 Sep 2019, 00:28:05 | Afternoon Ride | Ride | 1891 | 12.48 | Tired by the end of the week |
| 5 | 16 Sep 2019, 13:57:48 | Morning Ride | Ride | 2272 | 12.45 | Rested after the weekend! |
| 6 | 17 Sep 2019, 00:15:47 | Afternoon Ride | Ride | 1973 | 12.45 | Legs feeling strong! |
```
Date                         1 Oct 2019, 00:15:07
Name                               Afternoon Ride
Type                                         Ride
Time                                         1712
Distance                                    11.79
Comments    A little tired today but good weather
dtype: object
```
`.sort_values()` is used to sort a `DataFrame` by one or more columns. By default, `ascending=True`:
| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 32 | 11 Oct 2019, 00:16:57 | Afternoon Ride | Ride | 1843 | 11.79 | Bike feeling tight, needs an oil and pump |
| 16 | 25 Sep 2019, 00:07:21 | Afternoon Ride | Ride | 1775 | 12.10 | Feeling really tired |
| 5 | 16 Sep 2019, 13:57:48 | Morning Ride | Ride | 2272 | 12.45 | Rested after the weekend! |
| 6 | 17 Sep 2019, 00:15:47 | Afternoon Ride | Ride | 1973 | 12.45 | Legs feeling strong! |
| 20 | 27 Sep 2019, 01:00:18 | Afternoon Ride | Ride | 1712 | 12.47 | Tired by the end of the week |
And in descending order, with `ascending=False`:

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 8 | 18 Sep 2019, 13:49:53 | Morning Ride | Ride | 2903 | 14.57 | Raining today |
| 25 | 2 Oct 2019, 13:46:06 | Morning Ride | Ride | 2134 | 13.06 | Bit tired today but good weather |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
| 19 | 26 Sep 2019, 13:42:43 | Morning Ride | Ride | 2350 | 12.91 | Detour around trucks at Jericho |
You can also combine `.query()` with `.sort_values()`:
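A minimal sketch of such chaining, using three rows copied from the cycling table above (methods apply left to right):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Afternoon Ride", "Morning Ride", "Afternoon Ride"],
                   "Time": [2084, 2531, 1863],
                   "Distance": [12.62, 13.03, 12.52]})

# filter the afternoon rides first, then sort what remains
result = (df.query("Name == 'Afternoon Ride'")
            .sort_values(by="Distance", ascending=False))
```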
Use `.shape` to get the dimensions of the `DataFrame`. `.info()` prints a summary of the `DataFrame`:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Date      33 non-null     object
 1   Name      33 non-null     object
 2   Type      33 non-null     object
 3   Time      33 non-null     int64
 4   Distance  31 non-null     float64
 5   Comments  33 non-null     object
dtypes: float64(1), int64(1), object(4)
memory usage: 1.7+ KB
```
`.columns` returns the column names. `.describe()` prints the descriptive stats of the numerical columns:

| | Time | Distance |
|---|---|---|
| count | 33.000000 | 31.000000 |
| mean | 3512.787879 | 12.667419 |
| std | 8003.309233 | 0.428618 |
| min | 1712.000000 | 11.790000 |
| 25% | 1863.000000 | 12.480000 |
| 50% | 2118.000000 | 12.620000 |
| 75% | 2285.000000 | 12.750000 |
| max | 48062.000000 | 14.570000 |
`.dtypes` returns the data types of the columns.

There are two main ways to rename columns: `.rename()` (to selectively change column names, usually recommended) or the `.columns` attribute (to change all column names at once).

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |

Most functions/methods let you modify an object in place with `inplace=True`, e.g., `df.rename(..., inplace=True)`, but reassigning the result with `df = df.rename(...)` is what Pandas recommends:

| | Datetime | Name | Type | Time | Distance | Notes |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
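A sketch of the `.rename()` call that produces the column names shown above (dataframe contents abbreviated to two columns):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["10 Sep 2019, 00:13:04"],
                   "Comments": ["Rain"]})

# map old names to new names; columns not listed are left untouched
df = df.rename(columns={"Date": "Datetime", "Comments": "Notes"})
```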
You can also rename all columns at once by assigning a new list to the `.columns` attribute:

| | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
| 2 | 11 Sep 2019, 00:23:50 | Afternoon Ride | Ride | 1863 | 12.52 | Wet road but nice weather |
| 3 | 11 Sep 2019, 14:06:19 | Morning Ride | Ride | 2192 | 12.84 | Stopped for photo of sunrise |
| 4 | 12 Sep 2019, 00:28:05 | Afternoon Ride | Ride | 1891 | 12.48 | Tired by the end of the week |
I recommend using `.rename()`, because with the `.columns` attribute you can easily mess up the order of the columns 😅
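A sketch of the `.columns` approach on a toy dataframe; the new list must have exactly one name per column, in order:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])

# replaces ALL column names at once; length must match the column count
df.columns = [f"Column {i}" for i in range(1, 4)]
```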
You can add/remove columns in two ways: `[]` to add columns and `.drop()` to drop columns.

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
```python
df['Rider'] = 'Danilo Freire'
df['Avg Speed'] = df['Distance'] * 1000 / df['Time']  # avg. speed in m/s
df.head(2)
```
| | Date | Name | Type | Time | Distance | Comments | Rider | Avg Speed |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain | Danilo Freire | 6.055662 |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain | Danilo Freire | 5.148163 |
After dropping the new columns with `.drop()`, we are back to the original layout:

| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 0 | 10 Sep 2019, 00:13:04 | Afternoon Ride | Ride | 2084 | 12.62 | Rain |
| 1 | 10 Sep 2019, 13:52:18 | Morning Ride | Ride | 2531 | 13.03 | rain |
| 2 | 11 Sep 2019, 00:23:50 | Afternoon Ride | Ride | 1863 | 12.52 | Wet road but nice weather |
| 3 | 11 Sep 2019, 14:06:19 | Morning Ride | Ride | 2192 | 12.84 | Stopped for photo of sunrise |
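A sketch of both operations on a tiny dataframe (column names taken from the table above):

```python
import pandas as pd

df = pd.DataFrame({"Distance": [12.62], "Time": [2084]})

df["Rider"] = "Danilo Freire"      # [] adds a new column
df = df.drop(columns=["Rider"])    # .drop() removes it again
```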
You won’t often be adding rows to a dataframe manually (you’ll usually add rows through joining). You can add/remove rows of a dataframe in two ways: `pd.concat()` to add rows and `.drop()` to drop rows. Let’s add a new row to the dataframe:
```python
another_row = pd.DataFrame([["12 Oct 2019, 00:10:57", "Morning Ride", "Ride",
                             2331, 12.67, "Washed and oiled bike last night"]],
                           columns=df.columns,
                           index=[33])
df = pd.concat([df, another_row])
df.tail(2)
```
| | Date | Name | Type | Time | Distance | Comments |
|---|---|---|---|---|---|---|
| 32 | 11 Oct 2019, 00:16:57 | Afternoon Ride | Ride | 1843 | 11.79 | Bike feeling tight, needs an oil and pump |
| 33 | 12 Oct 2019, 00:10:57 | Morning Ride | Ride | 2331 | 12.67 | Washed and oiled bike last night |
At this point, you might be asking why we need all these different data structures
Well, they all serve different purposes and are suited to different tasks. For example:
My advice: use the simplest data structure that fulfills your needs!
Finally, we’ve seen how to go from `ndarray` (`np.array()`) → `Series` (`pd.Series()`) → `DataFrame` (`pd.DataFrame()`). Remember that we can also go the other way: `DataFrame`/`Series` → `ndarray` using `df.to_numpy()`.
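For instance, on a toy dataframe (the labels are dropped in the conversion):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
arr = df.to_numpy()  # DataFrame -> plain ndarray; index/column labels are lost
```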
But you will probably use DataFrames most of the time 😉
Next, we’ll see how to visualise data with `.plot()`, and more! 😊