Recipe #2, geom_index() and geom_coordinates()

The Goal

In the last recipe, you should have seen and practiced how to write a compute function that can be used in a Stat. Specifically, you defined group-wise computation by defining compute_group for your Stat ggproto object. You specified that other behavior should be inherited from the more generic Stat class. You improved the user interface for your Stat by specifying which aesthetics must be provided by the user in the required_aes slot. Finally, you wrote user-facing functions stat_*() and geom_*() based on the new Stat object and an existing Geom object.

In this next recipe, we’ll first see the creation of StatIndex, stat_index(), and geom_index(), which can all be used to label observations with their row numbers.

Along the way, we’ll contrast this with a new extension move: defining panel-wise instead of group-wise computation using compute_panel. Panel-wise computation will be used in the two recipes that follow.

In the exercise for this recipe, you’ll create StatCoordinates, stat_coordinates() and geom_coordinates(), usable to mark points with their x and y coordinates.

Let’s get started! By the end of the exercise, we should be able to specify the plot below with the new geom_coordinates() function:

ggplot(data = cars) + 
  aes(x = speed,
      y = dist) + 
  geom_point() + 
  geom_coordinates(hjust = 1, # new function!
                   vjust = 1, 
                   check_overlap = TRUE) 

But, we’ll start by demonstrating how to annotate the observation index (row number) at x and y, defining the new extension function geom_index(). Then you’ll be prompted to define geom_coordinates() based on what you’ve learned.

Step 00: Loading packages and cleaning data

We’ll use the tidyverse’s tibble() to convert the cars data frame into a tibble. This is just so we’ll have abbreviated output when printing.

library(tidyverse)
cars <- tibble(cars)
cars
# A tibble: 50 × 2
   speed  dist
   <dbl> <dbl>
 1     4     2
 2     4    10
 3     7     4
 4     7    22
 5     8    16
 6     9    10
 7    10    18
 8    10    26
 9    10    34
10    11    17
# ℹ 40 more rows

Step 0: use base ggplot2 to get the job done

It’s a good idea to first work through how you’d get things done without a Stat extension, just using ‘base’ ggplot2. The computational moves you make here can serve as a reference for building your extension function.

# Compute.
cars |> 
  mutate(index = row_number()) |>
  # Plot
  ggplot() + 
  aes(x = speed, y = dist, 
      label = index) + 
  geom_point() + 
  geom_label(vjust = 1, hjust = 1) + 
  labs(title = "Created with base ggplot2")

Use ggplot2::layer_data() to inspect the render-ready data held internally in the plot. Your Stat will help prep data to look something like this.

layer_data(plot = last_plot(), 
           i = 2) |> # layer 2, the label layer, is the one of interest
  head()
  x  y label PANEL group colour  fill size angle hjust vjust alpha family
1 4  2     1     1    -1  black white 3.88     0     1     1    NA       
2 4 10     2     1    -1  black white 3.88     0     1     1    NA       
3 7  4     3     1    -1  black white 3.88     0     1     1    NA       
4 7 22     4     1    -1  black white 3.88     0     1     1    NA       
5 8 16     5     1    -1  black white 3.88     0     1     1    NA       
6 9 10     6     1    -1  black white 3.88     0     1     1    NA       
  fontface lineheight
1        1        1.2
2        1        1.2
3        1        1.2
4        1        1.2
5        1        1.2
6        1        1.2

Step 1: Define compute. Test.

Now you are ready to begin building your extension function. The first step is to define the compute that should be done under the hood when your function is used. We’ll define this in a function called compute_group_index(). The input is the plot data. You will also need to include the scales argument, which ggplot2 uses internally.

Define compute.

# Define compute.
compute_group_index <- function(data, scales){ 
  data |> 
    mutate(label = row_number())
}
Note: You may have noticed …
  1. … the scales argument in the compute definition, which is used internally in ggplot2. While it won’t be used in your test (up next), you do need it so that the computation will work in the ggplot2 setting.

  2. … that the compute function is written for data whose columns carry aesthetic names like x and y, even though this particular function only adds a label column. These aesthetic variable names, relevant for building the plot, are generally not found in the raw data inputs for the plot.

  3. … that the compute function adds a column of data called ‘label’ internally. This means that the Stat can be used with Geoms like GeomText and GeomLabel without the user providing a label!

Test compute.

# Test compute. 
cars |>
  select(x = speed,  
         y = dist) |>  
  compute_group_index()
# A tibble: 50 × 3
       x     y label
   <dbl> <dbl> <int>
 1     4     2     1
 2     4    10     2
 3     7     4     3
 4     7    22     4
 5     8    16     5
 6     9    10     6
 7    10    18     7
 8    10    26     8
 9    10    34     9
10    11    17    10
# ℹ 40 more rows
Note: You may have noticed …

… that we prepare the data to have columns named x and y before testing. This mimics what the Stat receives in a plotting setting, where columns are renamed by mapping aesthetics, e.g. aes(x = speed, y = dist). Compute functions that reference x or y will fail if those columns are absent; this one only adds label, but the same preparation keeps the test close to how the Stat is used.
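The printed test above can also be turned into an assertion, which is handy if you later move the compute function into a package. The following is a minimal sketch, assuming only dplyr and the built-in cars data; the object name tested is made up for illustration.

```r
library(dplyr)

# Same compute definition as in Step 1, repeated so this block runs on its own.
compute_group_index <- function(data, scales) {
  data |>
    mutate(label = row_number())
}

# scales is never evaluated inside the function, so it can be left out in a test.
tested <- datasets::cars |>
  select(x = speed, y = dist) |>
  compute_group_index()

# The label column should simply count the rows in order.
stopifnot(identical(tested$label, seq_len(nrow(tested))))
```

If the assertion passes silently, the compute function behaves as the printed output suggests.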

Step 2: Define new Stat. Test.

Next, we use the ggplot2::ggproto function which allows you to define a new Stat object - which will let us do computation under the hood while building our plot.

Define Stat.

StatIndex <- 
  ggplot2::ggproto(`_class` = "StatIndex",
                   `_inherit` = ggplot2::Stat,
                   compute_group = compute_group_index,
                   required_aes = c("x", "y"))
Note: You may have noticed …
  1. … that the naming convention for the ggproto object is written in CamelCase. The new class should also be named the same, i.e. "StatIndex".

  2. … that we inherit from the ‘Stat’ class. In fact, your ggproto object is a subclass – you are inheriting class properties from ggplot2::Stat.

  3. … that the compute_group_index function is used to define our Stat’s compute_group element. This means that data will be transformed group-wise by our compute definition – i.e. by categories if a categorical variable is mapped.

  4. … that required_aes is set to x and y, the positional aesthetics the layer needs. Specifying required_aes in your Stat can improve your user interface. Standard ggplot2 error messages will be issued if required aesthetics are not specified, e.g. “stat_index() requires the following missing aesthetics: x.”
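The points above can be checked directly on the ggproto object. This is a small sketch, not part of the recipe itself: it repeats the definitions so it runs on its own, and the names incomplete_plot and build_result are made up for illustration.

```r
library(ggplot2)
library(dplyr)

compute_group_index <- function(data, scales) {
  data |>
    mutate(label = row_number())
}

StatIndex <- ggproto(`_class` = "StatIndex",
                     `_inherit` = Stat,
                     compute_group = compute_group_index,
                     required_aes = c("x", "y"))

# The class vector starts with the new class, followed by the parent class.
class(StatIndex)

# The object is a subclass of Stat and carries the required aesthetics.
inherits(StatIndex, "Stat")
StatIndex$required_aes

# Leaving out x should make the plot build fail with an error.
incomplete_plot <- ggplot(cars) +
  aes(y = dist) +
  geom_text(stat = StatIndex)

build_result <- tryCatch(ggplot_build(incomplete_plot),
                         error = function(e) e)
inherits(build_result, "error")
```

Capturing the error with tryCatch() keeps the script running, so the check can sit alongside your other tests.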

Test Stat.

You can test out your Stat with many base ggplot2 geom_*() functions.

cars |> 
  ggplot() + 
  aes(x = speed,
      y = dist) + 
  geom_point() + 
  geom_text(stat = StatIndex, hjust = 1, vjust = 1) + 
  labs(title = "Testing StatIndex")

Note: You may have noticed …

… that we don’t use "index" as the stat argument. But you could! If you prefer, you could write geom_text(stat = "index", hjust = 1, vjust = 1), which will direct to your new StatIndex under the hood.

Test Stat group-wise behavior

Test group-wise behavior by using a discrete variable with a group-triggering aesthetic like color, fill, or group, or by faceting.

last_plot() + 
  aes(color = speed > 15)

Note: You may have noticed …

… that some indices change with color mapping. This is because compute is by group (compute_group is defined), so the index is within group. Contrast this to the case where compute_panel is defined instead.

StatIndexPanel <- 
  ggplot2::ggproto(`_class` = "StatIndexPanel",
                   `_inherit` = ggplot2::Stat,
                   compute_panel = compute_group_index)

cars |> 
  ggplot() + 
  aes(x = speed, y = dist) + 
  geom_point() + 
  geom_text(stat = StatIndexPanel, hjust = 1, vjust = 1) + 
  labs(title = "Testing StatIndexPanel") + 
  aes(color = speed > 15)

Here the indices (row numbers) are not computed within the levels of the TRUE/FALSE color variable speed > 15, but across the whole panel (facet).
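The difference between the two Stats can also be confirmed numerically rather than by eye. The sketch below is one way to do that, assuming the definitions from this recipe (repeated here so the block runs on its own); base_plot, group_labels and panel_labels are illustrative names, and required_aes is added to StatIndexPanel only for symmetry.

```r
library(ggplot2)
library(dplyr)

compute_group_index <- function(data, scales) {
  data |>
    mutate(label = row_number())
}

# Group-wise version: the compute function is called once per group.
StatIndex <- ggproto(`_class` = "StatIndex",
                     `_inherit` = Stat,
                     compute_group = compute_group_index,
                     required_aes = c("x", "y"))

# Panel-wise version: the same function is called once per panel.
StatIndexPanel <- ggproto(`_class` = "StatIndexPanel",
                          `_inherit` = Stat,
                          compute_panel = compute_group_index,
                          required_aes = c("x", "y"))

base_plot <- ggplot(cars) +
  aes(x = speed, y = dist, color = speed > 15)

# layer_data() returns the computed data for the (only) layer in each plot.
group_labels <- layer_data(base_plot + geom_text(stat = StatIndex))$label
panel_labels <- layer_data(base_plot + geom_text(stat = StatIndexPanel))$label

# Group-wise indices restart within each color group ...
max(group_labels)
# ... while panel-wise indices run across all rows in the panel.
max(panel_labels)
```

With group-wise compute the largest index equals the size of the larger color group; with panel-wise compute it equals the number of rows in the data.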

You might be thinking, what we’ve done would already be pretty useful to me. Can I just use my Stat as-is within geom_*() functions?

The short answer is ‘yes’! If you just want to use the Stat yourself locally in a script, there might not be much reason to go on to Step 3, user-facing functions. But if you have a wider audience in mind, e.g. colleagues within your organization or users of an open-source package, a more succinct expression of the functionality you deliver will probably be useful, i.e. write the user-facing functions.

Instead of using a geom_*() function, you might prefer to use the layer() function in your testing step. Occasionally, it’s necessary to go this route; for example, geom_vline() has no stat argument, but you can use GeomVline in layer(). If you are teaching this content, using layer() may help you better connect this step with the next, defining the user-facing functions.

A test of StatIndex using this method follows. You can see it is a little more verbose, as there is no default for the position argument, and setting the size must be handled with a little more care.

cars |> 
  ggplot() + 
  aes(x = speed,
      y = dist) + 
  geom_point() + 
  layer(geom = GeomLabel, 
        stat = StatIndex, 
        position = "identity", 
        params = list(size = 7)) + 
  labs(title = "Testing StatIndex with layer() function")

Step 3: Define user-facing functions. Test.

In this next section, we define user-facing functions. It is a bit of a mouthful, but see the ‘Pro tip: Use stat_identity definition as a template in this step …’ that follows.

Define stat_*() function

# user-facing function
stat_index <- function(mapping = NULL, data = NULL, 
                         geom = "label", position = "identity", 
                         ..., show.legend = NA, inherit.aes = TRUE) 
{
    layer(data = data, mapping = mapping, stat = StatIndex, 
        geom = geom, position = position, show.legend = show.legend, 
        inherit.aes = inherit.aes, params = rlang::list2(na.rm = FALSE, 
            ...))
}
Note: You may have noticed …
  1. … that the stat_*() function name derives from the Stat object’s name, but is snake case. So if I wanted a StatBigCircle based stat_*() function, I’d create stat_big_circle().

  2. … that StatIndex is used to define the new layer function, so the computation that defines it, which is to add the index label variable, will be in play before the layer is rendered.

  3. … that "label" is specified as the default for the geom argument in the function. This means that the ggplot2::GeomLabel will be used in the layer unless otherwise specified by the user.
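To see these points in the object that stat_index() returns, you can inspect the layer before adding it to a plot. This is a sketch under the assumption that the layer object exposes its stat and geom fields, as ggplot2 layer objects do; default_layer and text_layer are illustrative names, and the definitions are repeated so the block runs on its own.

```r
library(ggplot2)
library(dplyr)

compute_group_index <- function(data, scales) {
  data |>
    mutate(label = row_number())
}

StatIndex <- ggproto(`_class` = "StatIndex",
                     `_inherit` = Stat,
                     compute_group = compute_group_index,
                     required_aes = c("x", "y"))

stat_index <- function(mapping = NULL, data = NULL,
                       geom = "label", position = "identity",
                       ..., show.legend = NA, inherit.aes = TRUE) {
  layer(data = data, mapping = mapping, stat = StatIndex,
        geom = geom, position = position, show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = rlang::list2(na.rm = FALSE, ...))
}

# With defaults, the layer pairs StatIndex with GeomLabel.
default_layer <- stat_index()
inherits(default_layer$stat, "StatIndex")
inherits(default_layer$geom, "GeomLabel")

# The geom argument lets the user swap in another Geom, e.g. GeomText.
text_layer <- stat_index(geom = "text")
inherits(text_layer$geom, "GeomText")
```

The Stat stays fixed while the Geom is flexible, which is the convention for stat_*() functions.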

Pro tip: Use stat_identity definition as a template in this step …

You may be thinking that defining a new stat_*() function is a mouthful that’s probably hard to reproduce from memory. So you might use stat_identity()’s definition as scaffolding to write your own layer, i.e.:

  • Type stat_identity in your console to print function contents; copy-paste the function definition.
  • Switch out StatIdentity with your Stat, e.g. StatIndex.
  • Switch out "point" for another geom (‘rect’, ‘text’, ‘line’ etc.) if needed.
  • As a final touch, update list2 to rlang::list2; list2 comes from rlang and is not exported by ggplot2, so the unqualified call will error.
stat_identity
function (mapping = NULL, data = NULL, geom = "point", position = "identity", 
    ..., show.legend = NA, inherit.aes = TRUE) 
{
    layer(data = data, mapping = mapping, stat = StatIdentity, 
        geom = geom, position = position, show.legend = show.legend, 
        inherit.aes = inherit.aes, params = list2(na.rm = FALSE, 
            ...))
}
<bytecode: 0x559c74c75950>
<environment: namespace:ggplot2>
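
With make_constructor(), where your ggplot2 version provides it, the same stat_*() function can be defined in one line (here with "text" as the default geom):

stat_index <- make_constructor(StatIndex, geom = "text")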

Define geom_*() function

Because users are more accustomed to using layers that have the ‘geom’ prefix, you might also define a geom_*() function with identical properties via aliasing.

geom_index <- stat_index

It is more conventional to write out scaffolding code that is nearly identical to the stat_*() definition, but has the geom fixed and the stat flexible.
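A written-out version of that convention might look like the sketch below. It is an illustration rather than the recipe’s own definition: the geom is fixed to GeomText, and the default stat = StatIndex (the object rather than the string "index") is an assumption made here so that the block does not depend on name lookup. The definitions are repeated so the block runs on its own.

```r
library(ggplot2)
library(dplyr)

compute_group_index <- function(data, scales) {
  data |>
    mutate(label = row_number())
}

StatIndex <- ggproto(`_class` = "StatIndex",
                     `_inherit` = Stat,
                     compute_group = compute_group_index,
                     required_aes = c("x", "y"))

# Geom fixed (GeomText), stat flexible (defaults to StatIndex).
geom_index <- function(mapping = NULL, data = NULL,
                       stat = StatIndex, position = "identity",
                       ..., show.legend = NA, inherit.aes = TRUE) {
  layer(data = data, mapping = mapping, stat = stat,
        geom = GeomText, position = position, show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = rlang::list2(na.rm = FALSE, ...))
}

index_plot <- ggplot(cars) +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_index(hjust = 1, vjust = 1)

# Layer 2 is the index layer; its labels are the row numbers.
index_labels <- layer_data(index_plot, i = 2)$label
head(index_labels)
```

Compared with aliasing, this form lets a user keep the text geom while swapping in a different Stat through the stat argument.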

Once make_constructor() is available in the next ggplot2 release, defining the function will be about as easy as aliasing, and it will deliver the fixed-geom, flexible-stat convention, as follows:

geom_index <- make_constructor(GeomText, stat = "index")

Test geom_index()

## Test user-facing.
cars |>
  ggplot() +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_index(hjust = 1, vjust = 1)  + 
  labs(title = "Testing geom_index()")

Test/Enjoy your user-facing functions

Test group-wise behavior

last_plot() + 
  aes(color = speed > 15) 

Use stat_*() function with another Geom

cars |>
  ggplot() +
  aes(x = speed, y = dist) + 
  geom_point() + 
  stat_index(geom = "text", hjust = 1, vjust = 1)  + 
  labs(subtitle = "and stat_index()")

Done! Time for a review.

Here is a quick review of the functions and ggproto objects we’ve covered, dropping tests and discussion.

Note: Review
library(tidyverse)

# Step 1. Define compute
compute_group_index <- function(data, scales){
  
  data |>
    mutate(label = row_number())
  
}

# Step 2. Define Stat
StatIndex = ggproto(`_class` = "StatIndex",
                    `_inherit` = Stat,
                    required_aes = c("x", "y"),
                    compute_group = compute_group_index)

# Step 3. Define user-facing functions

## define stat_*()
stat_index <- function(mapping = NULL, data = NULL, 
                         geom = "label", position = "identity", 
                         ..., show.legend = NA, inherit.aes = TRUE) 
{
    layer(data = data, mapping = mapping, stat = StatIndex, 
        geom = geom, position = position, show.legend = show.legend, 
        inherit.aes = inherit.aes, params = rlang::list2(na.rm = FALSE, 
            ...))
}

## define geom_*()
geom_index <- stat_index

Your Turn: write geom_coordinates()

Using the geom_index() recipe above as a reference, try to create a stat_coordinates() function that labels points with their x and y coordinates. You may also write a convenience geom_*() function.

Step 00: load libraries, data

Step 0: Use base ggplot2 to get the job done

Step 1: Write compute function. Test.

Step 2: Write Stat.

Step 3: Write user-facing functions.

Next up: Recipe 3 geom_bal_point() and geom_support()

In the next recipe, we’ll look at computation when we know we’ll be working with categorical variables, creating geom_bal_point() and geom_support().