Advanced ggplotting

I am using the summer movies dataset, a compilation of movies from IMDb with “summer” in the title. I started by loading the data and extracting only the first listed genre for each movie, since some movies list several.

library(ggplot2)
library(dplyr)
library(waffle)
library(ggbeeswarm)
library(ggridges)
library(ggmosaic)

tuesdata <- tidytuesdayR::tt_load('2024-07-30')
## ---- Compiling #TidyTuesday Information for 2024-07-30 ----
## --- There are 2 files available ---
## ── Downloading files ─────────
##   1 of 2: "summer_movie_genres.csv"
##   2 of 2: "summer_movies.csv"
summer_movie_genres <- tuesdata$summer_movie_genres
summer_movies <- tuesdata$summer_movies

# extracting only the first listed genre for each movie
primary_genre <- sub(",.*", "", summer_movies$genres)
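
To see what the `sub()` call does, here is a minimal sketch on made-up genre strings (the `example_genres` vector is hypothetical and not taken from the dataset). The pattern `,.*` matches the first comma and everything after it, so replacing the match with an empty string leaves only the first listed genre. Entries without a comma, and missing values, pass through unchanged.

```r
# hypothetical genre strings in the same comma-separated format
example_genres <- c("Comedy,Romance", "Drama", "Action,Adventure,Comedy", NA)

# drop the first comma and everything after it
first_genre <- sub(",.*", "", example_genres)

print(first_genre)
# [1] "Comedy" "Drama"  "Action" NA
```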

To get a rough idea of how often each genre appears in the dataset, I created a waffle plot. To keep it readable, I only included genres with more than five observations. The plot shows that comedies and dramas are by far the most frequent primary genres.

# creating a waffle plot to see the relative frequencies of different genres of movies with "summer" in the title (excluding those with five or fewer observations)
tabled_data <- as.data.frame(table(class=primary_genre)) %>%
  filter(Freq > 5)

ggplot(data = tabled_data) +
  aes(fill = class, values = Freq) +
  geom_waffle(n_rows = 15, size = 0.33, colour = "white") +
  coord_equal() +
  theme_void() +
  theme(
    legend.position = "right",
    legend.key.size = unit(0.2, "cm"),
    legend.text = element_text(size = 8)
  )
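
The counting step that feeds the waffle plot can be tried in isolation. This is a small sketch on a made-up vector (`toy_genres` is hypothetical); it uses base R subsetting in place of `dplyr::filter()` and a threshold of 1 instead of 5 so that the tiny input still produces a result.

```r
# hypothetical primary genres, one entry per movie
toy_genres <- c("Comedy", "Drama", "Comedy", "Horror", "Drama", "Comedy")

# table() counts each genre; as.data.frame() turns the counts into
# a data frame with the columns `class` and `Freq`
toy_counts <- as.data.frame(table(class = toy_genres))

# keep only genres that occur more than once (the notebook uses Freq > 5)
toy_kept <- toy_counts[toy_counts$Freq > 1, ]

print(toy_kept)
```

Naming the argument inside `table()` is what gives the first column the name `class`, which the plot then maps to `fill`.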

Next, I compared the average IMDb ratings of comedies, romances, and dramas using a beeswarm plot. This plot type works well here because it draws every movie as a point, so it also shows that there are fewer romances than comedies or dramas. Comedies seem to have a wider spread of ratings, whereas dramas are more concentrated.

# creating a beeswarm plot to compare average ratings of comedies, romances, and dramas with "summer" in the title
summer_movies %>%
  mutate(primary_genre = sub(",.*", "", genres)) %>% # extract only the first listed genre for each movie
  filter(primary_genre %in% c("Comedy", "Romance", "Drama")) %>% # filter genre for comedy, drama, and romance
  ggplot(aes(x = primary_genre, y = average_rating, color = primary_genre)) +
  geom_beeswarm() +
  theme_minimal() +
  labs(title = "IMDb Ratings for Comedies, Romances, and Dramas",
    x = "Genre",
    y = "Average Rating") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  scale_color_brewer(palette = "Set2")

I also compared runtimes for documentaries, action movies, and comedies using a ridgeline plot. The plot shows that action movies have a much wider spread of runtimes, whereas comedy runtimes are tightly concentrated.

# creating a ridgeline plot to compare runtimes for documentary, action, comedy movies
summer_movies %>%
  mutate(primary_genre = sub(",.*", "", genres)) %>%
  filter(primary_genre %in% c("Documentary", "Action", "Comedy")) %>%
  ggplot(aes(x = runtime_minutes, y = primary_genre, fill = primary_genre)) +
  geom_density_ridges(alpha = 0.8, scale = 1.2) +
  theme_minimal() +
  labs(title = "Distribution of Runtime by Genre",
    x = "Runtime (minutes)",
    y = "Genre") +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")
## Picking joint bandwidth of 7.37
## Warning: Removed 20 rows containing non-finite outside the scale range
## (`stat_density_ridges()`).
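
The warning from the ridgeline plot means that 20 of the filtered rows had a runtime that was not a finite number, so they were left out of the density estimate. A minimal sketch of how such rows could be counted and dropped before plotting is below. The `toy_movies` data frame is hypothetical, and the guess that the affected values are `NA` is an assumption, since the warning does not say.

```r
# hypothetical data with two missing runtimes
toy_movies <- data.frame(
  title = c("A", "B", "C", "D"),
  runtime_minutes = c(95, NA, 120, NA)
)

# is.finite() is FALSE for NA, NaN, Inf, and -Inf
n_non_finite <- sum(!is.finite(toy_movies$runtime_minutes))

# keep only rows with a usable runtime
toy_clean <- toy_movies[is.finite(toy_movies$runtime_minutes), ]

print(n_non_finite)
# [1] 2
```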

Finally, I created a mosaic plot to compare the relative frequencies of different genres across decades from 1950 to 2020. As in the waffle plot, I kept only genres with more than five movies.

# using a mosaic plot to compare relative frequencies of different genres across decades from 1950-2020
summer_movies <- summer_movies %>%
  mutate(
    primary_genre = sub(",.*", "", genres),
    decade = floor(year / 10) * 10
  )

genre_counts <- summer_movies %>%
  count(primary_genre) %>%
  filter(n > 5)

summer_movies_shortened <- summer_movies %>%
  filter(primary_genre %in% genre_counts$primary_genre) %>%
  filter(decade > 1940)


ggplot(summer_movies_shortened) +
  geom_mosaic(aes(x = product(decade), fill = primary_genre), na.rm = TRUE) +
  theme_minimal() +
  labs(
    title = "Genre Frequencies by Decade",
    x = "Decade",
    y = "Proportion",
    fill = "Genre"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.key.size = unit(0.5, "cm"),
    legend.text = element_text(size = 8)
  )
## Warning: The `scale_name` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `unite()` instead.
## ℹ The deprecated feature was likely used in the ggmosaic package.
##   Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
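
The `decade` column used for the mosaic plot is computed by dividing the year by 10, rounding down, and multiplying by 10 again. Here is a small sketch of that arithmetic and of the `decade > 1940` cutoff on made-up years (`example_years` is hypothetical). It uses base R subsetting with `which()`, which drops missing values much as `dplyr::filter()` does.

```r
# hypothetical release years, including one missing value
example_years <- c(1949, 1950, 1959, 1987, 2020, NA)

# round each year down to the start of its decade
example_decades <- floor(example_years / 10) * 10

print(example_decades)
# [1] 1940 1950 1950 1980 2020   NA

# keep the 1950s onward; which() skips the NA
kept_years <- example_years[which(example_decades > 1940)]

print(kept_years)
# [1] 1950 1959 1987 2020
```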