It allows you to see the data in a table format and gives you some basic statistics
You can also filter the data, sort it, fill missing values, and more
This new extension is super convenient when you are working with data you are not familiar with
It also has the advantage of using a sandboxed environment, so you can interact with the data without changing the original file (unless you explicitly save your changes)
Install the extension by going to the Extensions view (Ctrl+Shift+X) and searching for Data Wrangler
The extension is automatically integrated with the Jupyter extension, so you can open a .csv file and start wrangling the data
Or you can open a Jupyter notebook, load your dataset with pandas, and start wrangling the data from there
Subsetting data with Data Wrangler
Let’s see a quick example!
The quickest way to open the Data Wrangler is by right-clicking on a .csv file and selecting Open in Data Wrangler
Let’s visualise the features.csv dataset. It is located in the data_raw folder
The dataset is about cars and has the following variables:
mpg: Miles per gallon - Fuel efficiency of the car
cylinders: Number of cylinders
displacement: Volume of all the cylinders in the engine
horsepower: Power of the engine in horsepower
weight: Weight in pounds
acceleration: Acceleration from 0 to 60 mph
vehicle_id: Unique identifier for each car
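If you would rather start from a notebook, a minimal sketch of loading the file with pandas could look like the following. It assumes the notebook runs from the folder that contains data_raw, and it reuses the carfeatures name from the plotting examples. The fallback rows are made up purely so the snippet runs without the file; they are not the real data.

```python
from pathlib import Path

import pandas as pd

# Assumed location, relative to the folder the notebook runs from
csv_path = Path("data_raw") / "features.csv"

if csv_path.exists():
    carfeatures = pd.read_csv(csv_path)
else:
    # Made-up stand-in rows (NOT the real data) so the sketch runs without the file
    carfeatures = pd.DataFrame(
        {
            "mpg": [18.0, 31.0, 22.0],
            "cylinders": [8, 4, 6],
            "displacement": [307.0, 97.0, 198.0],
            "horsepower": [130, 88, 95],
            "weight": [3500, 2100, 2900],
            "acceleration": [12.0, 16.5, 15.0],
            "vehicle_id": ["id-1", "id-2", "id-3"],
        }
    )

print(carfeatures.head())      # first rows, similar to the table view
print(carfeatures.describe())  # basic summary statistics for numeric columns
```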
Any questions so far? 🤔
When you click on Filter, you will see a menu like this:
Select the column you want to filter, and click on Add Filter
Then you can select the condition you want to filter by and click on Apply
You can also sort the data by clicking on Sort and selecting the column you want to sort by
Data Wrangler will also show the Python code that corresponds to the operations you are doing!
This is a great way to learn how to use pandas! 🐼
You can then Export to notebook and continue working on your data in a Jupyter notebook, Export as file, or Copy all code and paste it in your Python script
Let’s practice!
Please open the features.csv dataset in the Data Wrangler
We will filter the data to show only cars with 6 or more cylinders
And sort the data by mpg in descending order
Finally, we will export the code to a Jupyter notebook
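For comparison, here is a rough pandas sketch of the same two steps. It is not the exact code Data Wrangler generates, only a hand-written equivalent, and the sample rows are made up so the snippet runs on its own.

```python
import pandas as pd

# Made-up sample rows (NOT the real features.csv) so the sketch is self-contained
carfeatures = pd.DataFrame(
    {
        "mpg": [18.0, 31.0, 22.0, 15.0, 20.0],
        "cylinders": [8, 4, 6, 8, 5],
    }
)

# Keep only cars with 6 or more cylinders
subset = carfeatures[carfeatures["cylinders"] >= 6]

# Sort by mpg, highest first
subset = subset.sort_values("mpg", ascending=False)

print(subset)
```

On the real dataset you would replace the sample DataFrame with the one loaded from features.csv.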
matplotlib can also be used to plot subsets of data
The syntax is similar to what we have seen before
For example, you can just add other plt.scatter() or plt.hist() commands to the same cell
Or you can create a for loop to plot multiple subsets
Let’s see an example using cylinders
First, we need to import matplotlib and use pd.unique() to extract a list with the unique elements in that column
import matplotlib.pyplot as plt
list_unique_cylinders = pd.unique(carfeatures["cylinders"])
print(list_unique_cylinders)
[8 4 6 3 5]
Plotting Subsets
# If we call plt.scatter() twice, it will display both plots on the same graph
# We also include plt.show() at the very end.
df_8 = carfeatures.query("cylinders == 8")
df_4 = carfeatures.query("cylinders == 4")
plt.scatter(x = df_8["weight"], y = df_8["acceleration"])
plt.scatter(x = df_4["weight"], y = df_4["acceleration"])
plt.legend(labels = ["8","4"], title = "Cylinders")
plt.show()
# Note: If we put plt.show() in between the plots, then the results will
# be shown on separate graphs instead.
Using a for loop to plot multiple subsets
# Extract the unique categories
list_unique_cylinders = pd.unique(carfeatures["cylinders"])
# Use a for loop to plot a scatter plot between "weight" and "acceleration"
# for each category. Each plot will have a different color
for category in list_unique_cylinders:
    df = carfeatures.query("cylinders == @category")
    plt.scatter(x = df["weight"], y = df["acceleration"])
# Add labels and a legend
plt.xlabel("Weight")
plt.ylabel("Acceleration")
plt.legend(labels = list_unique_cylinders, title = "Cylinders")
plt.show()
Try it yourself! 🚗
Compute a histogram of “mpg” by cylinder count
Make the histograms transparent by adjusting alpha in plt.hist(x = ..., alpha = 0.5)
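One possible way to approach this, sketched under a few assumptions: the sample rows are made up so the snippet runs on its own, the Agg backend line is only there so it also works outside a notebook, and each histogram gets its legend entry through the label argument of plt.hist() rather than a list passed to plt.legend().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample rows (NOT the real features.csv) so the sketch is self-contained
carfeatures = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 16.0, 31.0, 28.0, 33.0, 22.0, 20.0, 21.0],
        "cylinders": [8, 8, 8, 4, 4, 4, 6, 6, 6],
    }
)

list_unique_cylinders = pd.unique(carfeatures["cylinders"])

# One semi-transparent histogram of "mpg" per cylinder count, all on the same graph
for category in list_unique_cylinders:
    df = carfeatures.query("cylinders == @category")
    plt.hist(x = df["mpg"], alpha = 0.5, label = str(category))

plt.xlabel("Miles per gallon")
plt.ylabel("Frequency")
plt.legend(title = "Cylinders")
plt.show()
```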