ggplot2 is a data visualization package written by Hadley Wickham that uses the “grammar of graphics.” The grammar of graphics provides a consistent way to describe the components of a graph, allowing us to move beyond specific types of plots (e.g., boxplot, scatterplot, etc.) to the different elements that compose the plot. As the name implies, the grammar of graphics is a language we can use to describe and build visualizations.
Today we’ll be using data on US breweries (yay beer!) to explore some of ggplot2’s capabilities. First, we will install the packages we will need.
# install.packages("devtools")
# devtools::install_github("hadley/ggplot2")
# devtools::install_github("hadley/dplyr")
# devtools::install_github("hadley/purrr")
# devtools::install_github("hadley/tidyr")
# devtools::install_github("hadley/forcats")
# devtools::install_github("hadley/readr")
# devtools::install_github("dgrtwo/gganimate")
# install.packages("maps")
library(ggplot2)
library(dplyr)
library(purrr)
library(tidyr)
library(forcats)
library(readr)
library(gganimate)
library(maps)
The dataset contains information on breweries across the United States, scraped from BeerAdvocate. For each brewery it includes the brewery name, the brewery rating, the number of reviews, the average rating of its beers, the number of beers it serves, and location information.
all_breweries <- read_csv("all_breweries.csv", col_types = "cnnnnccccnncnn")
all_breweries
#> # A tibble: 5,686 × 14
#> brewery_name brewery_rating num_reviews beer_avg
#> <chr> <dbl> <dbl> <dbl>
#> 1 603 Brewery NA NA 3.75
#> 2 7th Settlement Brewery 4.23 24 3.61
#> 3 Agner & Wolf Brewery Corp. NA NA 3.61
#> 4 Ashuelot Brewing Company NA NA NA
#> 5 Bad Lab Beer Co. NA NA NA
#> 6 Beara Irish Brewing Co. 4.12 5 3.60
#> 7 Belgian Mare Brewery NA NA 3.39
#> 8 Big Water Brewery NA NA 3.17
#> 9 Blackstone Brewing Company NA NA NA
#> 10 Border Brew Supply 3.85 10 3.39
#> # ... with 5,676 more rows, and 10 more variables: num_beers <dbl>,
#> # address <chr>, type <chr>, city <chr>, state <chr>, lon <dbl>,
#> # lat <dbl>, full_city <chr>, city_lon <dbl>, city_lat <dbl>
Let’s look at the relationship between the brewery’s overall rating and the number of beers they serve.
brew_plot <- all_breweries %>%
filter(!is.na(brewery_rating), !is.na(num_beers),
state %in% c("Kansas", "Oklahoma", "Missouri", "Iowa",
"Nebraska", "Colorado"),
num_beers < 400)
ggplot(data = brew_plot) +
geom_point(mapping = aes(x = num_beers, y = brewery_rating))
ggplot() initializes a blank plot, and then layers (geoms) are added to complete the plot. For example, geom_point() adds points to create a scatterplot. In the geom call, the user specifies which variables map to the x- and y-axes. We can write a general form for all ggplot2 graphics:
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
ggplot(data = brew_plot) +
geom_point(mapping = aes(x = num_beers, y = brewery_rating, color = type))
ggplot(data = brew_plot) +
geom_point(mapping = aes(x = num_beers, y = brewery_rating, shape = type))
ggplot(data = brew_plot) +
geom_point(mapping = aes(x = num_beers, y = brewery_rating, size = type))
#> Warning: Using size for a discrete variable is not advised.
ggplot(data = brew_plot) +
geom_point(mapping = aes(x = num_beers, y = brewery_rating), color = "blue")
Inside the aes() call, ggplot2 maps the aesthetic to a variable in your dataset and creates a legend. Outside the aes() call, an aesthetic can be fixed to a specific value. For a list of the aesthetics that each geom can use, see its help page (e.g., ?geom_point).
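A common slip is to put a constant such as "blue" inside aes(). The sketch below contrasts the three cases. It uses the built-in mtcars data rather than the brewery data, so the variable names (wt, mpg, cyl) are stand-ins chosen only for illustration.

```r
library(ggplot2)

# Mapped: colour varies with a variable, so ggplot2 builds a legend.
p_mapped <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg, color = factor(cyl)))

# Fixed: colour is a constant set outside aes(), so every point is blue
# and no legend is drawn.
p_fixed <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg), color = "blue")

# Pitfall: inside aes(), "blue" is treated as a one-level variable.
# The points get the default palette colour and a legend labelled "blue".
p_pitfall <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg, color = "blue"))

# Inspect the colours ggplot2 actually computed for each layer.
fixed_colours <- unique(ggplot_build(p_fixed)$data[[1]]$colour)
pitfall_colours <- unique(ggplot_build(p_pitfall)$data[[1]]$colour)
```

Comparing fixed_colours with pitfall_colours shows that only the version with the constant outside aes() actually draws blue points.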
ggplot(data = brew_plot) +
geom_bar(mapping = aes(x = state))
ggplot(data = brew_plot) +
geom_bar(mapping = aes(x = state, fill = type))
ggplot(data = brew_plot) +
geom_bar(mapping = aes(x = state, fill = type), position = position_dodge())
ggplot(data = brew_plot) +
geom_density(mapping = aes(x = num_beers))
ggplot(data = brew_plot) +
geom_histogram(mapping = aes(x = num_beers), binwidth = 10)
ggplot(data = brew_plot) +
geom_boxplot(mapping = aes(x = state, y = num_beers))
ggplot(data = brew_plot) +
geom_violin(mapping = aes(x = state, y = brewery_rating))
ggplot(data = brew_plot) +
geom_point(mapping = aes(x = num_beers, y = brewery_rating))
ggplot(data = brew_plot) +
geom_point(mapping = aes(x = num_beers, y = brewery_rating)) +
geom_smooth(mapping = aes(x = num_beers, y = brewery_rating))
#> `geom_smooth()` using method = 'loess'
ggplot(data = brew_plot) +
geom_histogram(mapping = aes(x = num_beers), binwidth = 10) +
geom_density(mapping = aes(x = num_beers), alpha = 0.4, fill = "red")
Not all geoms are on the same scale! When using geom_bar() or geom_histogram(), ggplot2 automatically calculates the count (i.e., the number of occurrences in each bin) and plots that on the y-axis, whereas geom_density() plots a density. We can override this default and use a different calculation for the y-axis.
ggplot(data = brew_plot) +
geom_histogram(mapping = aes(x = num_beers, y = ..density..), binwidth = 10) +
geom_density(mapping = aes(x = num_beers), alpha = 0.4, fill = "red")
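To see what the density calculation does, we can pull the computed values out of a built plot. This is a minimal sketch on the built-in mtcars data (not the brewery data). It keeps the ..density.. spelling used in this post; newer releases of ggplot2 (3.3.0 and later) also accept after_stat(density) and may print a deprecation warning for the older form.

```r
library(ggplot2)

bin_width <- 5

p <- ggplot(data = mtcars) +
  geom_histogram(mapping = aes(x = mpg, y = ..density..),
                 binwidth = bin_width)

# ggplot_build() exposes the per-bin values computed by the stat.
hist_data <- ggplot_build(p)$data[[1]]

# Counts are still available alongside the density.
total_count <- sum(hist_data$count)

# Density is count / (total count * bin width), so the bar areas sum to 1.
total_area <- sum(hist_data$density * bin_width)
```

Because the bar areas sum to one, the histogram shares a y-axis scale with geom_density(), which is why the two layers line up once the override is in place.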
Repeating the mapping in every geom gets redundant, so we can instead set global mappings in ggplot(), which every layer inherits.
ggplot(data = brew_plot, mapping = aes(x = num_beers, y = brewery_rating)) +
geom_point() +
geom_smooth()
#> `geom_smooth()` using method = 'loess'
You can also set local mappings that only apply to a specific layer.
ggplot(data = brew_plot, mapping = aes(x = num_beers)) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 10) +
geom_density(alpha = 0.4, fill = "red")
ggplot(data = brew_plot, mapping = aes(x = num_beers, y = brewery_rating)) +
geom_point(mapping = aes(color = type)) +
geom_smooth()
#> `geom_smooth()` using method = 'loess'
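Where a mapping lives changes what other layers do with it. The sketch below, again on mtcars as a stand-in dataset, compares a local colour mapping with a global one. It uses method = "lm" only to keep the example quick; the plots above use the default smoother.

```r
library(ggplot2)

# Colour mapped locally in geom_point(): geom_smooth() does not see it,
# so a single trend line is fitted to all points.
p_local <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(mapping = aes(color = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE)

# Colour mapped globally: geom_smooth() inherits it and fits
# one trend line per group.
p_global <- ggplot(data = mtcars,
                   mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Count the groups the smoothing layer (layer 2) was fitted to.
n_groups_local <- length(unique(ggplot_build(p_local)$data[[2]]$group))
n_groups_global <- length(unique(ggplot_build(p_global)$data[[2]]$group))
```

Keeping color local to geom_point(), as in the brewery plot, is what gives one overall smooth rather than one per brewery type.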
We can similarly define which data should be used for a single geom.
ggplot(data = brew_plot, mapping = aes(x = num_beers, y = brewery_rating)) +
geom_point(mapping = aes(color = type)) +
geom_smooth(data = filter(brew_plot, type == "Brewery, Eatery"))
#> `geom_smooth()` using method = 'loess'
We can also show each subset individually using facet_wrap().
ggplot(data = brew_plot, mapping = aes(x = num_beers, y = brewery_rating)) +
geom_point(mapping = aes(color = type)) +
geom_smooth(se = FALSE) +
facet_wrap(~ type)
#> `geom_smooth()` using method = 'loess'
Anything you see in a ggplot2 visualization can be altered. Most changes are made through the theme() function, while labels are set in the scale_*() functions or in labs(). There is extensive documentation on everything that can be altered. And chances are that if there is something you want to change, someone has already had the same question and asked it online.
ggplot(brew_plot) +
geom_violin(mapping = aes(x = state, y = brewery_rating, fill = state)) +
scale_x_discrete(labels = c("CO", "IA", "KS", "MO", "NE", "OK")) +
scale_fill_brewer(type = "qual", palette = "Set3") +
labs(x = "State", y = "Brewery Rating",
title = "How are breweries rated in each state?",
subtitle = "Most breweries are around 4 stars",
caption = "Data from beeradvocate.com") +
theme_bw() +
theme(plot.title = element_text(size = 12, face = "bold"),
plot.subtitle = element_text(size = 10, face = "italic"),
plot.caption = element_text(size = 6),
axis.text = element_text(size = 8),
axis.title = element_text(size = 10),
axis.title.x = element_text(margin = margin(5, 0, 5, 0)),
axis.title.y = element_text(margin = margin(0, 5, 0, 0)),
legend.position = "none")
We can also save plots using the ggsave() function. By default, this saves the last plot you created.
ggsave("Saved Images/Violin_Plot.png")
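ggsave() also accepts explicit dimensions and a resolution, which makes the output reproducible regardless of the size of the plotting window. The following is a small sketch, not the code used for the images in this post; the plot, the file name, and the sizes are illustrative assumptions, and the file is written to a temporary directory.

```r
library(ggplot2)

p_example <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Write to a temporary directory so the sketch leaves the project untouched.
out_file <- file.path(tempdir(), "scatter_example.png")

# Passing `plot` explicitly avoids relying on "the last plot created";
# width and height are in inches, dpi sets the resolution.
ggsave(out_file, plot = p_example,
       width = 6, height = 4, units = "in", dpi = 300)

saved_ok <- file.exists(out_file)
```

The file type is inferred from the extension, so changing ".png" to ".pdf" would write a PDF instead.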
However, you can also assign a plot to a variable, just like any other value, and save it later.
p <- ggplot(data = brew_plot, mapping = aes(x = num_beers)) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 10,
alpha = 0.6, color = "black", fill = "black") +
geom_density(fill = "red", color = "black", alpha = 0.3) +
labs(x = "Number of Beers", y = "Density",
title = "Distribution of Number of Beers Sold",
subtitle = "All Breweries",
caption = "Data from beeradvocate.com") +
theme_bw() +
theme(plot.title = element_text(size = 12, face = "bold"),
plot.subtitle = element_text(size = 10, face = "italic"),
plot.caption = element_text(size = 6),
axis.text = element_text(size = 8),
axis.title = element_text(size = 10),
axis.title.x = element_text(margin = margin(5, 0, 5, 0)),
axis.title.y = element_text(margin = margin(0, 5, 0, 0)),
legend.position = "none")
ggsave("Saved Images/All_States.png", plot = p)
This works in conjunction with other packages from the tidyverse, such as purrr. Here we make the same plot as above, but for each state individually. We store each of these plots in a list, and then use pwalk() to save them all at once.
plot_list <- unique(brew_plot$state) %>% list_along()
names(plot_list) <- unique(brew_plot$state)
for (i in seq_along(plot_list)) {
plot <- ggplot(data = filter(brew_plot, state == names(plot_list)[i]),
mapping = aes(x = num_beers)) +
geom_histogram(mapping = aes(y = ..density..), binwidth = 10,
alpha = 0.6, color = "black", fill = "black") +
geom_density(fill = "red", color = "black", alpha = 0.3) +
labs(x = "Number of Beers", y = "Density",
title = "Distribution of Number of Beers Sold",
subtitle = paste0("Breweries in ", names(plot_list)[i]),
caption = "Data from beeradvocate.com") +
theme_bw() +
theme(plot.title = element_text(size = 12, face = "bold"),
plot.subtitle = element_text(size = 10, face = "italic"),
plot.caption = element_text(size = 6),
axis.text = element_text(size = 8),
axis.title = element_text(size = 10),
axis.title.x = element_text(margin = margin(5, 0, 5, 0)),
axis.title.y = element_text(margin = margin(0, 5, 0, 0)),
legend.position = "none")
plot_list[[i]] <- plot
}
filenames <- paste0(names(plot_list), ".png")
pwalk(list(filenames, plot_list), ggsave,
path = paste0(getwd(), "/Saved Images/"))
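The loop above can also be written without an explicit for by moving the plotting code into a function and mapping over the groups. This is a sketch of that alternative, not the approach used in this post: plot_one_group() is a hypothetical helper, and mtcars split by cyl stands in for the brewery data split by state.

```r
library(ggplot2)
library(purrr)

# Hypothetical helper: builds one histogram for a single value of `cyl`.
plot_one_group <- function(cyl_value) {
  ggplot(data = mtcars[mtcars$cyl == cyl_value, ],
         mapping = aes(x = mpg)) +
    geom_histogram(binwidth = 2) +
    labs(title = "Distribution of mpg",
         subtitle = paste0("Cars with ", cyl_value, " cylinders"))
}

# One plot per group, collected in a list.
cyl_values <- sort(unique(mtcars$cyl))
plot_list <- map(cyl_values, plot_one_group)

# walk2() calls ggsave(filename, plot, ...) once per pair;
# the extra arguments are passed on to every call.
out_dir <- tempdir()
filenames <- paste0("cyl_", cyl_values, ".png")
walk2(filenames, plot_list, ggsave, path = out_dir, width = 5, height = 4)
```

With only two parallel inputs, walk2() does the same job as pwalk() on a two-element list; pwalk() becomes useful once more arguments vary per plot.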
Because geoms are layered, you aren’t limited to certain types of pre-defined visualizations (e.g., scatterplots, barplots, etc.). You can continue to add layers to communicate information (or to increase aesthetic appeal).
You also aren’t limited to the normal coordinate system. For example, you can map geospatial locations.
ggplot2 also has a strong online community of contributors who have written packages that extend its capabilities. For example, the gganimate package allows you to make animated visualizations.
brew_loc$brew_descrip <- case_when(
brew_loc$num_brewery %in% 1:5 ~ "1-5 Breweries",
brew_loc$num_brewery %in% 6:10 ~ "6-10 Breweries",
brew_loc$num_brewery %in% 11:20 ~ "11-20 Breweries",
brew_loc$num_brewery %in% 21:40 ~ "21-40 Breweries",
brew_loc$num_brewery > 40 ~ "More than 40 Breweries"
)
brew_loc$brew_descrip <- fct_inorder(brew_loc$brew_descrip) %>% fct_rev()
p <- ggplot(data = brew_loc, mapping = aes(x = city_lon, y = city_lat,
size = num_brewery, frame = brew_descrip)) +
geom_polygon(data = states, mapping = aes(x = long, y = lat, group = group),
color = "white", inherit.aes = FALSE) +
geom_point(aes(x = city_lon, y = city_lat, size = num_brewery),
color = "grey", alpha = 0.2, inherit.aes = FALSE) +
geom_point(color = "red", alpha = 0.5) +
scale_size_area(name = "Number of\nBreweries", breaks = seq(10, 60, 10)) +
coord_map() +
labs(title = "US Breweries: Cities with ") +
theme_void() +
theme(plot.title = element_text(size = 12, face = "bold",
margin = margin(3, 0, 0, 0)),
legend.position = "bottom",
legend.title = element_text(size = 8),
plot.margin = unit(c(0,0,0,0), "in")) +
guides(size = guide_legend(nrow = 1))
gg_animate(p, interval = 2)
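The binning step at the top of this chunk could also be written with base R's cut(), which guarantees that the bins do not overlap. This is a sketch under the assumption that the bins are meant to be 1-5, 6-10, 11-20, 21-40, and more than 40; the num_brewery vector here is made-up example input, not the real brew_loc data.

```r
# Made-up counts that sit on each bin boundary.
num_brewery <- c(1, 5, 6, 10, 11, 20, 21, 40, 41)

# With the default right = TRUE, each interval includes its upper bound:
# (0, 5], (5, 10], (10, 20], (20, 40], (40, Inf].
brew_descrip <- cut(
  num_brewery,
  breaks = c(0, 5, 10, 20, 40, Inf),
  labels = c("1-5 Breweries", "6-10 Breweries", "11-20 Breweries",
             "21-40 Breweries", "More than 40 Breweries")
)

# The result is a factor whose levels are already in bin order.
bin_counts <- table(brew_descrip)
```

Because cut() returns an ordered set of levels, no separate fct_inorder() call is needed, although fct_rev() can still be applied to reverse the order.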
devtools::session_info()
#> Session info --------------------------------------------------------------
#> setting value
#> version R version 3.3.1 (2016-06-21)
#> system x86_64, darwin13.4.0
#> ui RStudio (1.0.34)
#> language (EN)
#> collate en_US.UTF-8
#> tz America/Chicago
#> date 2016-10-10
#> Packages ------------------------------------------------------------------
#> package * version date source
#> animation * 2.4 2015-08-16 cran (@2.4)
#> assertthat 0.1 2013-12-06 CRAN (R 3.3.0)
#> colorspace 1.2-6 2015-03-11 CRAN (R 3.3.0)
#> DBI 0.5-1 2016-09-10 cran (@0.5-1)
#> devtools 1.12.0 2016-06-24 CRAN (R 3.3.0)
#> digest 0.6.10 2016-08-02 cran (@0.6.10)
#> dplyr * 0.5.0.9000 2016-09-29 Github (hadley/dplyr@546a089)
#> evaluate 0.9 2016-04-29 cran (@0.9)
#> forcats * 0.1.1.9000 2016-09-29 Github (hadley/forcats@5d469c1)
#> formatR 1.4 2016-05-09 cran (@1.4)
#> gganimate * 0.1 2016-09-22 Github (dgrtwo/gganimate@26ec501)
#> ggplot2 * 2.1.0.9001 2016-10-05 Github (hadley/ggplot2@3b29891)
#> gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
#> htmltools 0.3.5 2016-03-21 cran (@0.3.5)
#> knitr * 1.14 2016-08-13 CRAN (R 3.3.0)
#> lazyeval 0.2.0.9000 2016-09-19 Github (hadley/lazyeval@c155c3d)
#> magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
#> maps * 3.1.1 2016-07-27 CRAN (R 3.3.0)
#> memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
#> munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
#> plyr 1.8.4 2016-06-08 CRAN (R 3.3.0)
#> purrr * 0.2.2.9000 2016-09-26 Github (hadley/purrr@8c72c35)
#> R6 2.1.3 2016-08-19 cran (@2.1.3)
#> Rcpp 0.12.7 2016-09-05 cran (@0.12.7)
#> readr * 1.0.0.9000 2016-09-17 Github (hadley/readr@37d6eda)
#> rmarkdown 1.0.9016 2016-10-05 Github (rstudio/rmarkdown@fe693c3)
#> scales 0.4.0.9002 2016-10-05 Github (hadley/scales@38f81a7)
#> stringi 1.1.2 2016-10-01 CRAN (R 3.3.1)
#> stringr 1.1.0 2016-08-19 cran (@1.1.0)
#> tibble 1.2-12 2016-10-05 Github (hadley/tibble@090e075)
#> tidyr * 0.6.0.9000 2016-09-17 Github (hadley/tidyr@3c9335b)
#> withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
#> yaml 2.1.13 2014-06-12 cran (@2.1.13)