Data visualization is a powerful means of understanding complex data and communicating insights effectively. It reveals hidden patterns and trends that would not be obvious in tabular data, turning raw data into information.

The tools we introduce today serve exactly this purpose, extending the possibilities offered by static visualization packages such as ggplot2.

Which tools allow us to do so?

Plotly and Dash! And we will introduce both!

Both plotly and dash are interactive: the user can explore the data directly through the graph, for example by hovering over points or zooming in, which supports a deeper understanding of the data.

Why interactive graphs?

  • Engagement: Makes data come alive and is more engaging for users. This increases understanding and retention.

  • Depth: Allows users to drill down and explore nuances, as interaction makes it easier to build narratives. Users can ask ‘why’ and get answers instantly by exploring the chart or graph.

  • Decision-making: Facilitates more informed decisions.

Both plotly and dash are interactive, but the level of interaction differs: a plotly graph looks very similar to a static plot, except that you can hover your cursor over it to access further data. dash, on the other hand, creates proper dashboards, namely HTML apps, allowing even more interaction.


Plotly

Main distinctive features

Plotly’s main feature is interactivity! A plot is built by passing a few arguments to the plot_ly function:

  • data: a data frame (what a surprise!)
  • x and y: the variables for the x and y axes, respectively
  • type: the type of plot you want to create (e.g. “scatter”, “bar”, “box”, etc)
  • mode: determines whether the scatter plot should show “markers”, “lines”, “text”, or a combination thereof
  • color: specifies the color of data points or lines. This can be used to distinguish between different groups or represent a continuous variable
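
As a minimal sketch of how these arguments fit together, the call below draws a scatter plot from mtcars, a dataset that ships with R. The dataset and the object name p_basic are chosen only for illustration and are not part of the workshop data.

```r
library(plotly)

# mtcars ships with R: wt = weight, mpg = miles per gallon, cyl = cylinders
p_basic <- plot_ly(
  data = mtcars,
  x = ~wt,
  y = ~mpg,
  type = "scatter",
  mode = "markers",
  color = ~factor(cyl)  # treat cylinder count as a group, not a number
)

p_basic  # printing the object renders the plot
```

Wrapping cyl in factor() asks plotly to use discrete group colours; passing the bare numeric column would produce a continuous colour scale instead.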

Nothing really interactive until now…


… What enables interaction:

  • Hovertext & Tooltip:
    • Hovertext: This is the text that appears when a user hovers over a specific data point or element in a Plotly plot. It’s essentially a string or array of strings associated with each data point.
    • Tooltip: This is the box or interface that displays the hovertext. While “hovertext” refers specifically to the text content, “tooltip” encompasses the overall appearance and context of the hover information.

These can be highly customized: more info here for plot_ly and here for ggplot2, using ggplotly.
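
A small sketch of such a customization with plot_ly, again using mtcars as stand-in data (the model column and the object name p_hover are introduced here for illustration): hovertext supplies the text, hoverinfo = "text" restricts the tooltip to that text, and hoverlabel in layout() styles the tooltip box.

```r
library(plotly)

# Stand-in data: copy mtcars and keep the car names as a regular column
cars <- mtcars
cars$model <- rownames(mtcars)

p_hover <- plot_ly(
  data = cars,
  x = ~wt,
  y = ~mpg,
  type = "scatter",
  mode = "markers",
  # Hovertext: one string per point; <br> starts a new line in the tooltip
  hovertext = ~paste0(model, "<br>Cylinders: ", cyl),
  # Show only the hovertext, not the default x/y values
  hoverinfo = "text"
) |>
  # Tooltip appearance: background colour and font size of the hover box
  layout(hoverlabel = list(bgcolor = "white", font = list(size = 12)))

p_hover
```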


Examples and exercises

Many people are surprised that one is required to pay for using the public toilets in Berlin. Let’s find out how far you would need to go to use a free toilet after a couple of beers with your friends.


First: a nice barplot!

Let’s use this dataset for the public toilets in Berlin: Berlin Public Toilet Data database.


Q.1. Find out the district-wise distribution of public toilets across Berlin.

# load the necessary packages
pacman::p_load(tidyverse, readr, ggplot2, plotly, dash, GGally, hrbrthemes)
#1. Prepare data
toilets <- read.csv("./data/berliner-toiletten-standorte.csv")

str(toilets) # read.csv() returns a data.frame, not a tibble
## 'data.frame':    440 obs. of  18 variables:
##  $ LavatoryID             : chr  "Wall_101003" "Wall_112911" "Wall_115753" "Wall_116738" ...
##  $ Description            : chr  "Toilette Ottmachauer Steig (oberhalb Badestelle Ostufer)" "Toilette Uferweg " "Toilette Am Kiesteich 50" "Wall CT, Hubertusdamm" ...
##  $ City                   : chr  "Berlin" "Berlin" "Berlin" "Berlin" ...
##  $ Street                 : chr  "Krumme Lanke, Quermatenweg   (0-24 Uhr)" "Schlachtensee , Am Schlachtensee ggü. 145  (0-24 Uhr)" "Spektepark   (0-24 Uhr)" "Hubertusdamm ggü. 7 (0-24 Uhr)" ...
##  $ PostalCode             : chr  "14109" "14129" "13589" "13125" ...
##  $ Country                : chr  "Deutschland" "Deutschland" "Deutschland" "Deutschland" ...
##  $ Longitude              : num  13.2 13.2 13.2 13.5 13.2 ...
##  $ Latitude               : num  52.5 52.4 52.5 52.6 52.5 ...
##  $ isOwnedByWall          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ isHandicappedAccessible: int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Price                  : num  0 0 0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ canBePayedWithCoins    : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ canBePayedInApp        : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ canBePayedWithNFC      : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ hasChangingTable       : int  0 0 0 1 1 1 0 0 1 1 ...
##  $ LabelID                : int  5 5 5 1 1 1 1 1 1 1 ...
##  $ hasUrinal              : int  0 0 0 1 0 0 0 1 1 1 ...
##  $ FID                    : int  269 NA NA 868 738 790 716 674 671 684 ...
toilets <- toilets |>
        mutate(Free = if_else(Price == 0, "free", "paid"))

#Filter the districts based on postal codes:

#Sources
#1. Wikipedia - https://simple.wikipedia.org/wiki/Postal_codes_in_Germany
#2. Cybo - https://postal-codes.cybo.com/germany/berlin/

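# PostalCode is read in as character, so the comparisons below are string-based;
# they behave like numeric comparisons as long as every code has five digits.
toilets <- toilets |>
        mutate(District = case_when(
          (PostalCode >= 10115 & PostalCode <= 10179) | 
            (PostalCode >= 13347 & PostalCode <= 13359) ~ "Mitte",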
          
          (PostalCode >= 10179 & PostalCode <= 10249) |
            (PostalCode >= 10961 & PostalCode <= 10999) ~ "Friedrichshain-Kreuzberg",
          
          (PostalCode >= 10367 & PostalCode <= 10407) | 
            (PostalCode >= 10318 & PostalCode <= 10319) ~ "Lichtenberg",
          
          (PostalCode >= 10405 & PostalCode <= 10439) | 
            (PostalCode >= 13086 & PostalCode <= 13129) ~ "Pankow",
          
          (PostalCode >= 10585 & PostalCode <= 10779) | 
            (PostalCode >= 12159 & PostalCode <= 12163) | 
            (PostalCode >= 13629 & PostalCode <= 14053) ~ "Charlottenburg-Wilmersdorf",
          
          (PostalCode >= 12099 & PostalCode <= 12109) | 
            (PostalCode >= 10777 & PostalCode <= 10783) | 
            (PostalCode >= 10823 & PostalCode <= 10829) | 
            (PostalCode >= 12305 & PostalCode <= 12309) | 
            (PostalCode >= 12277 & PostalCode <= 12279) ~ "Tempelhof-Schöneberg",
          
          # Neukölln must come before Treptow-Köpenick: its range lies inside the
          # 10999-12059 range below, and case_when() keeps the first match
          (PostalCode >= 12043 & PostalCode <= 12059) ~ "Neukölln",
          
          (PostalCode >= 10999 & PostalCode <= 12059) | 
            (PostalCode >= 12555 & PostalCode <= 12559) | 
            (PostalCode >= 12487 & PostalCode <= 12489) ~ "Treptow-Köpenick",
          
          (PostalCode >= 12163 & PostalCode <= 12169) | 
            (PostalCode >= 14163 & PostalCode <= 14169) | 
            (PostalCode >= 12203 & PostalCode <= 12209) | 
            (PostalCode >= 12247 & PostalCode <= 12249) | 
            (PostalCode == 14109) ~ "Steglitz-Zehlendorf",
          
          (PostalCode >= 13407 & PostalCode <= 13439) | 
            (PostalCode >= 13505 & PostalCode <= 13509) ~ "Reinickendorf",
          
          (PostalCode >= 12679 & PostalCode <= 12689) ~ "Marzahn-Hellersdorf",
          
          (PostalCode >= 13599 & PostalCode <= 13629) ~ "Spandau"
      ))
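
Two properties of case_when() matter for a mapping like this one: when ranges overlap, the first matching branch wins, and values that match no branch become NA. The toy example below sketches both; the four postal codes are picked by hand for illustration (99999 is not a real Berlin code), and the helper column code is introduced only here.

```r
library(dplyr)

# Hand-picked codes for illustration; 99999 is deliberately outside all ranges
codes <- tibble(PostalCode = c("10115", "10179", "13125", "99999"))

labelled <- codes |>
  mutate(
    code = as.integer(PostalCode),  # compare as numbers rather than strings
    District = case_when(
      between(code, 10115, 10179) ~ "Mitte",
      between(code, 10179, 10249) ~ "Friedrichshain-Kreuzberg",
      between(code, 13086, 13129) ~ "Pankow"
      # no branch matches 99999, so its District is NA
    )
  )

labelled$District
# 10179 sits in both of the first two ranges and is labelled "Mitte",
# because that branch comes first
```

Counting the NA values afterwards, for example with sum(is.na(labelled$District)), is a quick way to see how many rows the ranges failed to cover.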

Put your answer here:

#Plot the distribution using plotly:
toilets |> filter(!is.na(District)) |>  # drop toilets whose postal code matched no district
  plot_ly(x = ~District, color = ~Free, type = "histogram") |>
  layout(
    title = "Distribution of Public Toilets in Berlin",
    xaxis = list(title = "Districts of Berlin"),
    yaxis = list(title = "No. of Public Toilets")
  )

Q.2. Can you comment on the number of free toilets vs. paid public toilets across Berlin? Based on this dataset, which district in Berlin has the highest and lowest number of free public toilets?

Second: ggplotly!

Q.3. Using the ‘beers’ dataset, use ggplot to generate a parallel coordinates plot for the first 20 beers. Then transform the ggplot into an interactive plot using ggplotly. (Hint: Use the package “GGally” to generate a parallel coordinates plot. Beware of NAs!)

# Prepare the data
beers_original <- read.csv("./data/beers.csv")

beers_test <- beers_original |>
  filter(!is.na(ibu)) |>  # remove beers with a missing IBU value
  slice(1:20)

Put your answer here:

p <- ggparcoord(beers_test,
                columns = 3:4, groupColumn = 6,
                scale = "uniminmax",
                showPoints = TRUE, 
                title = "Beers by ABV and IBU",
                alphaLines = 0.3
) +
  theme_ipsum()+
  theme(
    legend.position = "right",  # "Default" is not a valid position; "right" is ggplot2's default
    plot.title = element_text(size=13)
  ) +
  xlab("")

ggplotly(p)
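
ggplotly() also accepts a tooltip argument that selects which aesthetics appear in the hover box. The sketch below uses mtcars as stand-in data and an extra text aesthetic, which ggplot2 itself does not draw but ggplotly() can display; the object names g and p_gg are illustrative.

```r
library(ggplot2)
library(plotly)

# "text" is not drawn by ggplot2; it is only carried along for the tooltip
g <- ggplot(mtcars, aes(x = wt, y = mpg,
                        text = paste("Model:", rownames(mtcars)))) +
  geom_point()

# Show the custom text first, then the x and y values
p_gg <- ggplotly(g, tooltip = c("text", "x", "y"))
p_gg
```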

Third: A bubble chart!

Try to code a bubble chart according to these instructions:

  • The x-axis represents the abv (alcohol by volume).
  • The y-axis represents the average_ibu (average bitterness level) for each combination of style_group and abv.
  • The size of each bubble is determined by the number of beers (count) that fall into each combination of style_group and abv.
  • The color of the bubbles is determined by the style_group.
  • Use the plot_ly function to create the bubble chart with data from agg_data_2.
  • Try to determine the size of the bubbles by the count column of agg_data_2, with sizes ranging from 10 to 800.
  • Determine the color of the bubbles by the style_group column, with colors taken from the “Set1” color palette.
  • Hover text for each bubble using the paste function, showing details about style_group, abv, and average_ibu.
  • Use the layout function to set the chart title, axis titles, and other visualization properties.
  • Add the number of beers for each combination as a hovertext for each bubble.

Use the following data:

# Prepare the data
beers <- beers_original |>
  mutate(style_group = case_when(
    (str_detect(style, "IPA") ~ "IPA"),
    (str_detect(style, "Ale") ~ "Ale"),
    (str_detect(style, "Lager") | style == "Altbier" ~ "Lager"),
    (str_detect(style, "Pilsner") | str_detect(style, "Pilsener") ~ "Pilsner"),
    (str_detect(style, "Stout") ~ "Stout"),
    (str_detect(style, "Barleywine") ~ "Barleywine"),
    (str_detect(style, "Porter") ~ "Porter"),
    (str_detect(style, "Radler") | str_detect(style, "Fruit") ~ "Radler/Fruit"),
    .default = "Other"
    )
  )

agg_data_2 <- beers %>%
  group_by(style_group, abv) %>%
  summarise(
    average_ibu = mean(ibu, na.rm = TRUE),
    count = n(),
    .groups = "drop"  # return an ungrouped tibble and silence the grouping message
    )

Put your answer here:

# Create the bubble chart 
plot <- plot_ly(data = agg_data_2, x = ~abv, y = ~average_ibu, 
                text = ~paste("Style:", style_group, 
                              "<br>ABV:", abv, 
                              "<br>Avg IBU:", round(average_ibu, 1),
                              "<br>No. of Beers:", count), # Enhanced hover text
                size = ~count, sizes = c(10, 800),
                type = "scatter", mode = "markers", # set the trace type explicitly
                marker = list(opacity = 0.5),
                color = ~style_group, colors = "Set1") %>%
  layout(title = "Bubble Chart of Average IBU by ABV and Style Group",
         xaxis = list(title = "ABV"),
         yaxis = list(title = "Average IBU"),
         hovermode = "closest",
         showlegend = TRUE)

plot


Dash

What is Dash?

Dash is a web application framework for R (as well as Python and other programming languages) for building interactive, web-based data dashboards, turning your interactive plots into full-blown web applications.

Key features

  • HTML layout: Design your dashboard layout using simple HTML components, without having to write any actual HTML. Dash structures the dashboard components using a set of pre-defined HTML tags within R.

  • Interactivity through callbacks: callbacks define interactive behavior by specifying how input changes should modify output components. Essentially, when the user changes a dropdown or moves a slider, a callback updates a chart, text, or any other component in real time.

  • Integration with Plotly visualizations: Plotly charts can be easily included in Dash, making them part of an
    interactive dashboard where each component can talk to others.
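
The three features above can be sketched in a few lines. The example assumes the same version of the dash package used in this workshop, one that provides dash_app(), set_layout() and add_callback(); the component ids, the object name app_demo and the use of mtcars are made up for illustration.

```r
library(dash)
library(plotly)

app_demo <- dash_app()

# HTML layout: a heading, a dropdown (input) and an empty graph (output)
app_demo |> set_layout(
  h1("Minimal Dash sketch"),
  dccDropdown(
    id = "cyl-dropdown",
    options = list(
      list(label = "4 cylinders", value = 4),
      list(label = "6 cylinders", value = 6),
      list(label = "8 cylinders", value = 8)
    ),
    value = 4
  ),
  dccGraph(id = "mpg-plot")
)

# Callback: whenever the dropdown value changes, rebuild the figure
app_demo |> add_callback(
  output("mpg-plot", "figure"),
  list(input("cyl-dropdown", "value")),
  function(cyl_value) {
    subset_cars <- mtcars[mtcars$cyl == cyl_value, ]
    plot_ly(subset_cars, x = ~wt, y = ~mpg,
            type = "scatter", mode = "markers")
  }
)

# app_demo |> run_app()  # uncomment to serve the app locally
```

The run_app() call is commented out so that the script can be sourced without blocking the R session.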

A Nice Example

We will use the same toilets dataset.

app <- dash_app()

We mutate the table to add the columns for any specific filters that we require.

toilets <- toilets |>
  # In the dataset, 1 means the feature is present and 0 means it is absent
  mutate(Accessibility = if_else(isHandicappedAccessible == 1, "Handicapped Accessible", "Not Handicapped Accessible")) |>
  mutate(BabyFacilities = if_else(hasChangingTable == 1, "Changing table", "No changing table"))

We would like to see the distribution of the toilets across Berlin, based on whether they are paid or not.

#We assign different colours to different districts.
color_options <- c(unique(toilets$District))

#We filter the toilets based on whether they are free or paid.
price_options <- c(unique(toilets$Free))
price_filter <- dccChecklist(
  id = 'price-filter',
  options = price_options,
  value = price_options #default, choose all
)

#We can also add additional filters like accessibility for the handicapped and child facilities:

handicap_options <- c(unique(toilets$Accessibility))
handicap_filter <- dccChecklist(
  id = 'handicap-filter',
  options = handicap_options,
  value = handicap_options #default, choose all
)

baby_options <- c(unique(toilets$BabyFacilities))
baby_filter <- dccChecklist(
  id = 'baby-filter',
  options = baby_options,
  value = baby_options #default, choose all
)

We assign the graph to a new variable and give instructions for callback based on the input(s) we provided.

#Output component(s)
toilets_map <- dccGraph(id = 'toilets-map')

#Callback(s)
app |> add_callback(
  output('toilets-map', 'figure'),
  list(
    #input('district-dropdown', 'value'),
    input('price-filter', 'value'),
    input('handicap-filter', 'value'),
    input('baby-filter', 'value')
  ),
  function(price_input, handicap_input, baby_input) {
    # Keep only the rows that match the ticked checklist values
    filtered <- toilets |>
      filter(Free %in% price_input) |>
      filter(Accessibility %in% handicap_input) |>
      filter(BabyFacilities %in% baby_input)
    fig <- filtered |>
      filter(!is.na(District)) |>
      plot_ly(lat = ~Latitude, lon = ~Longitude, type = "scattermapbox",
              split = ~District, size = 5,
              hoverinfo = "text+name", hovertext = ~Street) |>
      layout(
        title = "Berlin public toilets",
        mapbox = list(
          style = 'open-street-map',
          zoom = 10,
          center = list(lat = ~median(Latitude), lon = ~median(Longitude)))
      )
    return(fig) #not needed but just to be explicit
  }
)
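
The filtering step of such a callback can be tried outside Dash, which makes it easier to check before wiring it into the app. The sketch below wraps the three filters in a hypothetical helper, filter_toilets(), and runs it on four made-up rows; neither the helper nor the rows are part of the workshop code or the Berlin dataset.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from the Berlin dataset
toy <- tibble(
  Street = c("A", "B", "C", "D"),
  Free = c("free", "paid", "paid", "free"),
  Accessibility = c("Handicapped Accessible", "Handicapped Accessible",
                    "Not Handicapped Accessible", "Not Handicapped Accessible"),
  BabyFacilities = c("Changing table", "Changing table",
                     "No changing table", "Changing table")
)

# Hypothetical helper mirroring the three filter() calls in the callback
filter_toilets <- function(data, price_input, handicap_input, baby_input) {
  data |>
    filter(Free %in% price_input,
           Accessibility %in% handicap_input,
           BabyFacilities %in% baby_input)
}

# Paid, accessible toilets with a changing table
paid_accessible <- filter_toilets(
  toy, "paid", "Handicapped Accessible", "Changing table"
)
paid_accessible$Street

# An empty selection keeps no rows, since nothing is %in% an empty vector
none_selected <- filter_toilets(
  toy, character(0), "Handicapped Accessible", "Changing table"
)
nrow(none_selected)
```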

We also need to define the aesthetics of the dashboard:

#3 Front-end layout
layout_elements <- list(
  h1("IDS Workshop - Plotly-Dash test"),
  div("Using the Berlin Public Toilets dataset to test Scatter Map functionality..."),
  br(),
  div("Type:"),
  price_filter,
  br(),
  div("Accessibility:"),
  handicap_filter,
  br(),
  div("Baby facilities:"),
  baby_filter,
  br(),
  toilets_map
)

Finally, we run the app:

#4 Main app
app |> set_layout(layout_elements)
app |> run_app()

We can also add additional filters to the data, as and when required.

Q.5. Find the distribution of paid public toilets in Berlin that are also handicapped accessible and have a changing table.

Q.6. You have just submitted the materials for the IDS workshop on Monday and it is officially the end of the mid-term exam week. You go out with your friends to grab a few beers, and the weather is really cold. Are there any free toilets in Mitte that you could use after grabbing some drinks with your friends?

Thank you!