Recipe #1, geom_medians() and geom_means()

Creating a new geom_*() or stat_*() function is often motivated when plotting would require pre-computation otherwise. By using Stat extension, you can define computation to be performed within the plotting pipeline, as in the code that follows:

ggplot(data = penguins) + 
  aes(x = bill_depth_mm,
      y = bill_length_mm) + 
  geom_point() + 
  geom_means(size = 8, color = "red") # new function

In this exercise, we’ll think about a way to add a point at the medians x and y, defining the new extension function geom_medians() and then you’ll be prompted to define geom_means() based on what you’ve learned.

Step 00: Loading packages and prepping data

Handling missingness is not a discussion of this tutorial, so we’ll only use complete cases.

library(tidyverse)
library(palmerpenguins)
penguins_clean <- remove_missing(penguins) 
glimpse(penguins_clean)
Rows: 333
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g       <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
$ sex               <fct> male, female, female, female, male, female, male, fe…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Step 0: use base ggplot2 to get the job done

It’s good to look at how you’d get things done without Stat extension first, just using ‘base’ ggplot2. The computational moves you do here can serve a reference for building our new extension functionality.

# Compute.
penguins_medians <- penguins_clean |> 
  summarize(bill_length_mm_median = median(bill_length_mm),
            bill_depth_mm_median = median(bill_depth_mm))

# Plot.
penguins_clean |> 
  ggplot() + 
  aes(x = bill_depth_mm, y = bill_length_mm) + 
  geom_point() + 
  geom_point(data = penguins_medians,
             aes(x = bill_depth_mm_median,
                 y = bill_length_mm_median),
             size = 8, color = "red") + 
  labs(title = "Created with base ggplot2")

Use ggplot2::layer_data() to inspect the render-ready data internal in the plot. Your Stat will help prep data to look something like this.

layer_data(plot = last_plot(), 
           i = 2) # layer 2, the computed means, is of interest
     x    y PANEL group shape colour size fill alpha stroke
1 17.3 44.5     1    -1    19    red    8   NA    NA    0.5

Step 1: Define compute. Test.

Now you are ready to begin building your extension function. The first step is to define the compute that should be done under-the-hood when your function is used. We’ll define this in a function called compute_group_medians(). The input is the plot data. You will also need to use the scales argument, which ggplot2 uses internally.

Define compute.

# Define compute.
compute_group_medians <- function(data, ...){ 
  data |> 
    summarize(x = median(x),
              y = median(y))
}
  1. that our function uses a ... argument. Other things get passed alongside the data, like scales, params, through .... You don’t need to worry about these unless writing more complicated Stats.

  2. that the compute function assumes that variables x and y are present. These aesthetic variables names, relevant for building the plot, are generally not found in the raw data inputs for ggplots.

Test compute.

# Test compute. 
penguins_clean |>
  select(x = bill_depth_mm,  
         y = bill_length_mm) |>  
  compute_group_medians()
# A tibble: 1 × 2
      x     y
  <dbl> <dbl>
1  17.3  44.5

that we prepare the data to have columns with names x and y before testing compute_group_medians. Computation will fail if the names x and y are not present given our function definition. Internally in a plot, columns are renamed when mapping aesthetics, e.g. aes(x = bill_depth, y = bill_length).

Step 2: Define new Stat. Test.

Next, we use the ggplot2::ggproto function which allows you to define a new Stat object - which will let us do computation under the hood while building our plot.

Define Stat.

StatMedians <- 
  ggplot2::ggproto(`_class` = "StatMedians",
                   `_inherit` = ggplot2::Stat,
                   compute_group = compute_group_medians,
                   required_aes = c("x", "y"))
  1. that the naming convention for the proto object is CamelCase. The new class should also be named the same, i.e. "StatMedians".
  2. that we inherit from the ‘Stat’ class. In fact, your ggproto object is a subclass and you aren’t fully defining it. You simplify the definition by inheriting class properties from ggplot2::Stat. We have a quick look at defaults of generic Stat below. The required_aes and compute_group elements are generic and in StatMedians, we update the definition.
names(ggplot2::Stat) # which elements exist in Stat
 [1] "compute_layer"   "parameters"      "aesthetics"      "setup_data"     
 [5] "retransform"     "optional_aes"    "non_missing_aes" "default_aes"    
 [9] "finish_layer"    "compute_panel"   "extra_params"    "compute_group"  
[13] "required_aes"    "setup_params"    "dropped_aes"    
ggplot2::Stat$required_aes # generic some kind of NULL behaivor
character(0)
  1. that the compute_group_medians function is used to define our Stat’s compute_group element. This means that data will be transformed by our compute definition – group-wise if groups are specified.
  2. that setting required_aes to ‘x’ and ‘y’ makes sense given compute assumptions. The compute assumes data to be a dataframe with columns x and y. If you data doesn’t have x and y, your compute will fail. Specifying required_aes in your Stat can improve your user interface because standard ggplot2 error messages will issue if required aes are not specified when plotting, e.g. ‘stat_medians() requires the following missing aesthetics: x.’

Test Stat.

You can test out your Stat with many base ggplot2 geom_()* functions.

penguins_clean |> 
  ggplot() + 
  aes(x = bill_depth_mm,
      y = bill_length_mm) + 
  geom_point() + 
  geom_point(stat = StatMedians, size = 7) + 
  labs(title = "Testing StatMedians")

that we don’t use “medians” as the stat argument, which would be more consistent with base ggplot2 documentation. However, if you prefer, you can refer to your newly created Stat this way when testing, i.e. geom_point(stat = "medians", size = 7)

Test Stat group-wise behavior

Test group-wise behavior by using a discrete variable with an group-triggering aesthetic like color, fill, or group, or by faceting.

last_plot() + 
  aes(color = species)

You might be thinking, what we’ve done has a lot of merit itself. Can I just use my Stat as-is within geom_*() functions?

The answer probably depends a lot on audience. If you just want to use the Stat yourself, there might not be much reason to go on to Step 3, user-facing functions. But if you have a wider audience in mind, i.e. internal to organization or open sourcing in a package, probably a more succinct expression of what functionality you deliver will be useful - i.e. write the user-facing functions.

Instead of using a geom_*() function, you might prefer to use the more flexible layer() function in your testing step. In fact, it’s sometimes necessary to go this route; for example, geom_vline() contain no stat argument, but you can use the GeomVline.

A test of StatMedians using this method follows. You can see it is a little more verbose, as there is no default for the position argument, and setting the size must be handled with a little more care.

penguins_clean |> 
  ggplot() + 
  aes(x = bill_depth_mm,
      y = bill_length_mm) + 
  geom_point() + 
  layer(geom = GeomPoint, 
        stat = StatMedians, 
        position = "identity", 
        params = list(size = 7)) + 
  labs(title = "Testing StatMedians with layer() function")

Step 3: Define user-facing functions. Test.

In this next section, we define user-facing functions. It is a bit of a mouthful, but see the tip that follows.

Define stat_*() function

# user-facing function
stat_medians <- function(mapping = NULL, data = NULL, 
                         geom = "point", position = "identity", 
                         ..., show.legend = NA, inherit.aes = TRUE) 
{
    layer(data = data, mapping = mapping, stat = StatMedians, 
        geom = geom, position = position, show.legend = show.legend, 
        inherit.aes = inherit.aes, params = rlang::list2(na.rm = FALSE, 
            ...))
}
  1. that the stat_*() function name derives from the Stat objects’s name, but is snake case. So if I wanted a StatBigCircle based stat_*() function, I’d create stat_big_circle().

  2. that StatMedians is used to define the new layer function, so the computation that defines it, which is to summarize to medians, will be in play before the layer is rendered.

  3. that "point" is specified as the default for the geom argument in the function. This means that the ggplot2::GeomPoint will be used in the layer unless otherwise specified by the user.

You may be thinking, defining a new stat_*() function is a mouthful that’s probably hard to reproduce from memory. So you might use stat_identity()’s definition as scaffolding to write your own layer. i.e:

  • Type stat_identity in your console to print function contents; copy-paste the function definition.
  • Switch out StatIdentity with your Stat, e.g. StatMedian.
  • Switch out "point" other geom (‘rect’, ‘text’, ‘line’ etc) if needed
  • Final touch, list2 will error without export from rlang, so update to rlang::list2.
stat_identity
function (mapping = NULL, data = NULL, geom = "point", position = "identity", 
    ..., show.legend = NA, inherit.aes = TRUE) 
{
    layer(data = data, mapping = mapping, stat = StatIdentity, 
        geom = geom, position = position, show.legend = show.legend, 
        inherit.aes = inherit.aes, params = list2(na.rm = FALSE, 
            ...))
}
<bytecode: 0x55ebad4e6c98>
<environment: namespace:ggplot2>

Define geom_*() function

You can also define geom with identical properties via aliasing.

geom_medians <- stat_medians

Test user-facing functions

## Test user-facing.
penguins_clean |>
  ggplot() +
  aes(x = bill_depth_mm, y = bill_length_mm) +
  geom_point() +
  geom_medians(size = 8)  + 
  labs(title = "Testing geom_medians()")

Test group-wise behavior

last_plot() + 
  aes(color = species) 

Test geom flexibility of stat_*() function.

last_plot() + 
  stat_medians(geom = "label", aes(label = species))  + 
  labs(subtitle = "and stat_medians()")

This approach is not fully vetted. Your comments and feedback are welcome. See discussions 26 and 31

An alternate ‘express’ route below may be helpful in some settings (i.e. in-script definitions and exploratory work).

geom_medians <- function(...){geom_point(stat = StatMedians, ...)}
geom_medians_label <- function(...){geom_label(stat = StatMedians, ...)}

penguins_clean |>
  ggplot() +
  aes(x = bill_depth_mm, 
      y = bill_length_mm) +
  geom_point() +
  geom_medians(size = 8)

last_plot() + 
  aes(color = species) 

last_plot() + 
  aes(label = species) +
  geom_medians_label()

A downside is that the geom is hard-coded, so it is not flexible in this regard compared with the stat_*() counterpart defined using the layer() function.

Also, not as many arguments will be spelled out for the user when using the function.

Done! Time for a review.

Here is a quick review of the definitional code we’ve covered, dropping tests and discussion.

Review
library(tidyverse)

# Step 1. Define compute
compute_group_medians <- function(data, scales){
  
  data |>
    summarise(x = median(x), y = median(y))
  
}

# Step 2. Define Stat
StatMedians = ggproto(`_class` = "StatMedians",
                      `_inherit` = Stat,
                      required_aes = c("x", "y"),
                      compute_group = compute_group_medians)

# Step 3. Define user-facing functions

## define stat_*()
stat_medians <- function(mapping = NULL, data = NULL, 
                         geom = "point", position = "identity", 
                         ..., show.legend = NA, inherit.aes = TRUE) 
{
    layer(data = data, mapping = mapping, stat = StatMedians, 
        geom = geom, position = position, show.legend = show.legend, 
        inherit.aes = inherit.aes, params = rlang::list2(na.rm = FALSE, 
            ...))
}

## define geom_*()
geom_medians <- stat_medians

Your Turn: write geom_means()

Using the medians Recipe #1 as a reference, try to create a stat_means() function that draws a point at the means of x and y. You may also write convenience geom_*() functions.

Step 00: load libraries, data

Step 0: Use base ggplot2 to get the job done

Step 1: Write compute function. Test.

Step 2: Write Stat.

Step 3: Write user-facing functions.

Next up: Recipe 2 geom_id()

How would you write the function which annotates coordinates (x,y) for data points on a scatterplot? Go to Recipe 2.