ggplot(data = cars) +
aes(x = speed,
y = dist) +
geom_point() +
geom_coordinates(hjust = 1, # new function!
vjust = 1,
check_overlap = T)
Recipe #2, geom_index() and geom_coordinates()
The Goal
Welcome to ‘Easy geom
recipes’ Recipe #2: creating geom_index and geom_coordinates.
Creating a new geom_*() or stat_*() function is often motivated when plotting would require pre-computation otherwise. By using Stat extension, you can define computation to be performed within the plotting pipeline, as in the code that follows:
In this exercise, we’ll demonstrate how to annotate the observation index (row number) at x and y, defining the new extension function geom_index()
. Then you’ll be prompted to define geom_coordinates()
based on what you’ve learned.
Step 00: Loading packages and cleaning data
We’ll use the tidyverse’s tibble to convert the cars dataframe into a tibble. This is just so we’ll have redacted data when printing.
library(tidyverse)
<- tibble(cars)
cars cars
# A tibble: 50 × 2
speed dist
<dbl> <dbl>
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
# ℹ 40 more rows
Step 0: use base ggplot2 to get the job done
It’s good idea to go through at how you’d get things done without Stat extension first, just using ‘base’ ggplot2. The computational moves you make here can serve a reference for building our extension function.
# Compute.
|>
cars mutate(index = row_number()) |>
# Plot
ggplot() +
aes(x = speed, y = dist,
label = index) +
geom_point() +
geom_label(vjust = 1, hjust = 1) +
labs(title = "Created with base ggplot2")
layer_data()
to inspect ggplot’s internal data …
Use ggplot2::layer_data() to inspect the render-ready data internal in the plot. Your Stat will help prep data to look something like this.
layer_data(plot = last_plot(),
i = 2) |> # layer 2, with labels designated is of interest
head()
x y label PANEL group colour fill size angle hjust vjust alpha family
1 4 2 1 1 -1 black white 3.88 0 1 1 NA
2 4 10 2 1 -1 black white 3.88 0 1 1 NA
3 7 4 3 1 -1 black white 3.88 0 1 1 NA
4 7 22 4 1 -1 black white 3.88 0 1 1 NA
5 8 16 5 1 -1 black white 3.88 0 1 1 NA
6 9 10 6 1 -1 black white 3.88 0 1 1 NA
fontface lineheight
1 1 1.2
2 1 1.2
3 1 1.2
4 1 1.2
5 1 1.2
6 1 1.2
Step 1: Define compute. Test.
Now you are ready to begin building your extension function. The first step is to define the compute that should be done under-the-hood when your function is used. We’ll define this in a function called compute_group_index()
. The input is the plot data. You will also need to use the scales argument, which ggplot2 uses internally.
Define compute.
# Define compute.
<- function(data, scales){
compute_group_index |>
data mutate(label = row_number())
}
… that our function uses the
scales
argument. Thescales
argument is used internally in ggplot2. So while it won’t be used in your test, you do need it for defining and using the ggproto Stat object in the next step.… that the compute function assumes that variables x and y are present in the data. These aesthetic variables names, relevant for building the plot, are generally not found in the raw data inputs for ggplots.
… that the compute function adds a column of data called ‘label’ We assume that x and y will be present in our data as the user will declare them. However, the
label
aes, which is required for Geoms like Text and Label will be computed internally in our layer, meaning the user doesn’t need to provide this.
Test compute.
# Test compute.
|>
cars select(x = speed,
y = dist) |>
compute_group_index()
# A tibble: 50 × 3
x y label
<dbl> <dbl> <int>
1 4 2 1
2 4 10 2
3 7 4 3
4 7 22 4
5 8 16 5
6 9 10 6
7 10 18 7
8 10 26 8
9 10 34 9
10 11 17 10
# ℹ 40 more rows
… that we prepare the data to have columns with names x and y before testing compute_group_index
. We just do this to be more consistent with how data will look internally in a ggplot — columns are renamed when mapping aesthetics, e.g. aes(x = speed, y = dist)
, and variables that aren’t mapped won’t be available to the compute.
Step 2: Define new Stat. Test.
Next, we use the ggplot2::ggproto function which allows you to define a new Stat object - which will let us do computation under the hood while building our plot.
Define Stat.
<-
StatIndex ::ggproto(`_class` = "StatIndex",
ggplot2`_inherit` = ggplot2::Stat,
compute_group = compute_group_index)
- … that the naming convention for the ggproto object is CamelCase. The new class should also be named the same, i.e.
"StatIndex"
. - … that we inherit from the ‘Stat’ class. In fact, your ggproto object is a subclass and you aren’t fully defining it. You simplify the definition by inheriting class properties from ggplot2::Stat. We have a quick look at defaults of generic
Stat
below. The required_aes and compute_group elements are generic and in StatMedians, we update the definition.
names(ggplot2::Stat) # which elements exist in Stat
[1] "compute_layer" "parameters" "aesthetics" "setup_data"
[5] "retransform" "optional_aes" "non_missing_aes" "default_aes"
[9] "finish_layer" "compute_panel" "extra_params" "compute_group"
[13] "required_aes" "setup_params" "dropped_aes"
::Stat$required_aes # generic some kind of NULL behavior ggplot2
character(0)
- … that the compute_group_index function is used to define our Stat’s compute_group element. This means that data will be transformed by our compute definition – group-wise if groups are specified.
- … that setting
required_aes
isn’t necessary for this Stat The compute assumes data to be a dataframe but doesn’t rely on columns named x and y, for example, for computation. In general, specifyingrequired_aes
in your Stat can improve your user interface because standard ggplot2 error messages will issue if required aes are not specified when plotting, e.g. ‘stat_index()
requires the following missing aesthetics: x.’
Test Stat.
You can test out your Stat with many base ggplot2 geom_()* functions.
|>
cars ggplot() +
aes(x = speed,
y = dist) +
geom_point() +
geom_text(stat = StatIndex, hjust = 1, vjust = 1) +
labs(title = "Testing StatIndex")
… that we don’t use “index” as the stat argument, which would be more consistent with base ggplot2 documentation. However, if you prefer, you can refer to your newly created Stat this way when testing, i.e. geom_point(stat = "index", size = 7)
Test Stat group-wise behavior
Test group-wise behavior by using a discrete variable with an group-triggering aesthetic like color, fill, or group, or by faceting.
last_plot() +
aes(color = speed > 15)
… that some indices change with color mapping. This is because compute is by group (compute_group
is defined), so the index is within group.
You might be thinking, what we’ve done would already be pretty useful to me. Can I just use my Stat as-is within geom_*() functions?
The short answer is ‘yes’! If you just want to use the Stat yourself locally in a script, there might not be much reason to go on to Step 3, user-facing functions. But if you have a wider audience in mind, i.e. internal to organization or open sourcing in a package, probably a more succinct expression of what functionality you deliver will be useful - i.e. write the user-facing functions.
layer()
function to test instead of geom_*(stat = StatNew)
Instead of using a geom_*()
function, you might prefer to use the layer()
function in your testing step. Occasionally, it’s necessary to go this route; for example, geom_vline()
contain no stat
argument, but you can use the GeomVline in layer()
. If you are teaching this content, using layer()
may help you better connect this step with the next, defining the user-facing functions.
A test of StatIndex using this method follows. You can see it is a little more verbose, as there is no default for the position argument, and setting the size must be handled with a little more care.
|>
cars ggplot() +
aes(x = speed,
y = dist) +
geom_point() +
layer(geom = GeomLabel,
stat = StatIndex,
position = "identity",
params = list(size = 7)) +
labs(title = "Testing StatIndex with layer() function")
Step 3: Define user-facing functions. Test.
In this next section, we define user-facing functions. It is a bit of a mouthful, but see the ‘Pro tip: Use stat_identity
definition as a template in this step …’ that follows.
Define stat_*() function
# user-facing function
<- function(mapping = NULL, data = NULL,
stat_index geom = "label", position = "identity",
show.legend = NA, inherit.aes = TRUE)
...,
{layer(data = data, mapping = mapping, stat = StatIndex,
geom = geom, position = position, show.legend = show.legend,
inherit.aes = inherit.aes, params = rlang::list2(na.rm = FALSE,
...)) }
… that the
stat_*()
function name derives from the Stat objects’s name, but is snake case. So if I wanted a StatBigCircle based stat_*() function, I’d create stat_big_circle().… that
StatIndex
is used to define the new layer function, so the computation that defines it, which is to summarize to medians, will be in play before the layer is rendered.… that
"label"
is specified as the default for the geom argument in the function. This means that theggplot2::GeomLabel
will be used in the layer unless otherwise specified by the user.
stat_identity
definition as a template in this step …
…
You may be thinking, defining a new stat_*() function is a mouthful that’s probably hard to reproduce from memory. So you might use stat_identity()
’s definition as scaffolding to write your own layer. i.e:
- Type
stat_identity
in your console to print function contents; copy-paste the function definition. - Switch out
StatIdentity
with your Stat, e.g.StatIndex
. - Switch out
"point"
other geom (‘rect’, ‘text’, ‘line’ etc) if needed - Final touch,
list2
will error without export from rlang, so update torlang::list2
.
stat_identity
function (mapping = NULL, data = NULL, geom = "point", position = "identity",
..., show.legend = NA, inherit.aes = TRUE)
{
layer(data = data, mapping = mapping, stat = StatIdentity,
geom = geom, position = position, show.legend = show.legend,
inherit.aes = inherit.aes, params = list2(na.rm = FALSE,
...))
}
<bytecode: 0x563c206a29c0>
<environment: namespace:ggplot2>
Define geom_*() function
Because users are more accustom to using layers that have the ‘geom’ prefix, you might also define geom with identical properties via aliasing.
<- stat_index geom_index
… verbatim aliasing as shown above is a bit of a shortcut and assumes that users will use the ‘geom_*()’ function with the stat-geom combination as-is. (For a discussion, see Constructors in ‘Extending ggplot2: A case Study’ in ggplot2: Elegant Graphics for Data Analysis. This section notes, ‘Most ggplot2 users are accustomed to adding geoms, not stats, when building up a plot.’)
An approach that is more consistent with existing guidance would be to hardcode the Geom and allow the user to change the Stat as follows.
# user-facing function
<- function(mapping = NULL, data = NULL,
geom_index stat = "index", position = "identity",
show.legend = NA, inherit.aes = TRUE)
...,
{layer(data = data, mapping = mapping, stat = stat,
geom = GeomPoint, position = position, show.legend = show.legend,
inherit.aes = inherit.aes, params = rlang::list2(na.rm = FALSE,
...)) }
However, because it is unexpected to use geom_index()
with a Stat other than StatIndex (doing so would remove the index-ness) we think that the verbatim aliasing is a reasonable, time and code saving getting-started approach.
Test geom_index()
## Test user-facing.
|>
cars ggplot() +
aes(x = speed, y = dist) +
geom_point() +
geom_index(hjust = 1, vjust = 1) +
labs(title = "Testing geom_index()")
Test/Enjoy your user-facing functions
Test group-wise behavior
last_plot() +
aes(color = speed > 15)
Use stat_*() function with another Geom
|>
cars ggplot() +
aes(x = speed, y = dist) +
geom_point() +
stat_index(geom = "text", hjust = 1, vjust = 1) +
labs(subtitle = "and stat_index()")
This approach is not fully vetted. Your comments and feedback are welcome. See discussions 26 and 31
An alternate ‘express’ route below may be helpful in some settings (i.e. in-script definitions and exploratory work).
<- function(...){geom_label(stat = StatIndex, ...)}
geom_index <- function(...){geom_text(stat = StatIndex, ...)}
geom_index_text
|>
cars ggplot() +
aes(x = speed,
y = dist) +
geom_point() +
geom_index(vjust = 1)
last_plot() +
aes(color = speed > 15)
A downside is that the geom is hard-coded, so it is not flexible in this regard compared with the stat_*() counterpart defined using the layer() function.
Also, not as many arguments will be spelled out for the user when using the function.
Done! Time for a review.
Here is a quick review of the functions and ggproto objects we’ve covered, dropping tests and discussion.
library(tidyverse)
# Step 1. Define compute
<- function(data, scales){
compute_group_index
|>
data mutate(label = row_number())
}
# Step 2. Define Stat
= ggproto(`_class` = "StatIndex",
StatIndex `_inherit` = Stat,
required_aes = c("x", "y"),
compute_group = compute_group_index)
# Step 3. Define user-facing functions
## define stat_*()
<- function(mapping = NULL, data = NULL,
stat_index geom = "label", position = "identity",
show.legend = NA, inherit.aes = TRUE)
...,
{layer(data = data, mapping = mapping, stat = StatIndex,
geom = geom, position = position, show.legend = show.legend,
inherit.aes = inherit.aes, params = rlang::list2(na.rm = FALSE,
...))
}
## define geom_*()
<- stat_index geom_index
Your Turn: write geom_coordinates()
Using the geom_index()
Recipe #2 as a reference, try to create a stat_coordinates()
function that draws a point at the means of x and y. You may also write convenience geom_*() function.
Step 00: load libraries, data
Step 0: Use base ggplot2 to get the job done
Step 1: Write compute function. Test.
Step 2: Write Stat.
Step 3: Write user-facing functions.
Next up: Recipe 3 geom_lm_residuals()
How would you write the function draws residuals based on a linear model fit lm(y ~ x)? Go to Recipe 3.