ggplot(data = penguins) +
  aes(x = bill_depth_mm,
      y = bill_length_mm) +
  geom_point() +
  geom_means(size = 8, color = "red") # new function!
Recipe #1, geom_medians() and geom_means()
The Goal
Why write new geom_*() functions? When a visualization requires computation before plotting, custom geom_*() or stat_*() functions can streamline your workflow. By defining new Stat objects and using them to define new geom_*() functions, you can integrate calculations directly into the plotting pipeline. The snippet that opens this recipe shows the kind of usage we are working toward.
In this exercise, we’ll demonstrate how to define the new extension function geom_medians(), which adds a point at the medians of x and y. Then you’ll be prompted to define geom_means() based on what you’ve learned.
Step 00: Loading packages and prepping data
Handling missingness is beyond the scope of this tutorial, so we’ll only use complete cases.
library(tidyverse)
library(palmerpenguins)
penguins_clean <- remove_missing(penguins)
glimpse(penguins_clean)
Rows: 333
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
$ sex <fct> male, female, female, female, male, female, male, fe…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Step 0: use base ggplot2 to get the job done
It’s a good idea to get things done without a Stat extension first, just using ‘base’ ggplot2. The computational moves you make here can serve as a reference for building your extension function.
# Compute.
penguins_medians <- penguins_clean |>
  summarize(bill_length_mm_median = median(bill_length_mm),
            bill_depth_mm_median = median(bill_depth_mm))
# Plot.
penguins_clean |>
  ggplot() +
  aes(x = bill_depth_mm, y = bill_length_mm) +
  geom_point() +
  geom_point(data = penguins_medians,
             aes(x = bill_depth_mm_median,
                 y = bill_length_mm_median),
             size = 8, color = "red") +
  labs(title = "Created with base ggplot2")
Pro tip: Use layer_data() to inspect ggplot's internal data …
Use ggplot2::layer_data() to inspect the render-ready data inside the plot. Your Stat will help prep data to look something like this.
layer_data(plot = last_plot(),
           i = 2) # layer 2, the computed medians, is of interest
x y PANEL group shape colour size fill alpha stroke
1 17.3 44.5 1 -1 19 red 8 NA NA 0.5
Step 1: Define compute. Test.
Now you are ready to begin building your extension function. The first step is to define the compute that should be done under the hood when your function is used. We’ll define this in a function called compute_group_medians(). The data input will look similar to the plot data. You will also need to include a scales argument, which ggplot2 uses internally.
Define compute.
# Define compute.
compute_group_medians <- function(data, scales){
  data |>
    summarize(x = median(x),
              y = median(y))
}
Two things to note:
- The scales argument in the compute definition is used internally by ggplot2. While it won’t be used in your test (up next), you do need it so that the computation will work in the ggplot2 setting.
- The compute function can only be used with data that has variables named x and y. These aesthetic variable names, relevant for building the plot, are generally not found in the raw data inputs for the plot.
Test compute.
# Test compute.
penguins_clean |>
  select(x = bill_depth_mm,
         y = bill_length_mm) |>
  compute_group_medians()
# A tibble: 1 × 2
x y
<dbl> <dbl>
1 17.3 44.5
Note that we prepare the data to have columns named x and y before testing. Given the function’s definition, computation will fail if variables x and y are not present. In a plotting setting, columns are renamed by mapping aesthetics, e.g. aes(x = bill_depth_mm, y = bill_length_mm).
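To see this renaming in action, the sketch below builds a small scatterplot and inspects the column names that ggplot2 hands to the layer. It uses the built-in mtcars data as a stand-in (an assumption made so the block runs on its own; it is not part of the recipe), and layer_cols is just an illustrative name.

```r
library(ggplot2)

# Stand-in data: mtcars ships with R, so no extra data package is needed.
p <- ggplot(mtcars) +
  aes(x = wt, y = mpg) +
  geom_point()

# Column names seen inside the layer: aesthetic names, not wt and mpg.
layer_cols <- names(layer_data(p, 1))
layer_cols
```

The raw column names do not appear in the layer data, which is why the compute function is written in terms of x and y.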
Step 2: Define new Stat. Test.
Next, we use the ggplot2::ggproto() function to define a new Stat object, which will let us do computation under the hood while building our plot.
Define Stat.
StatMedians <-
  ggplot2::ggproto(`_class` = "StatMedians",
                   `_inherit` = ggplot2::Stat,
                   compute_group = compute_group_medians,
                   required_aes = c("x", "y"))
Notes on this definition:
- By convention, the ggproto object is named in CamelCase. The new class should be given the same name, i.e. "StatMedians".
- We inherit from the ‘Stat’ class. In fact, your ggproto object is a subclass: it inherits class properties from ggplot2::Stat.
- The compute_group_medians function is used to define our Stat’s compute_group element. This means that data will be transformed group-wise by our compute definition, i.e. by categories if a categorical variable is mapped.
- Setting required_aes to x and y reflects the compute function’s requirements. Specifying required_aes in your Stat can improve your user interface: standard ggplot2 error messages will be issued if required aesthetics are not specified, e.g. “stat_medians() requires the following missing aesthetics: x.”
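The effect of required_aes can be checked with a small sketch like the one below. It is an illustration under assumptions, not part of the recipe: it uses the built-in mtcars data, rewrites the compute function in base R so that only ggplot2 is needed, and deliberately leaves y unmapped. The name build_error is hypothetical.

```r
library(ggplot2)

# Base-R variant of the compute function (assumption: avoids loading dplyr).
compute_group_medians <- function(data, scales) {
  data.frame(x = median(data$x), y = median(data$y))
}

StatMedians <- ggproto("StatMedians", Stat,
                       compute_group = compute_group_medians,
                       required_aes = c("x", "y"))

# y is deliberately left unmapped, so building the plot should fail.
p <- ggplot(mtcars) +
  aes(x = wt) +
  geom_point(stat = StatMedians)

# Capture the error instead of stopping the script.
build_error <- tryCatch({ggplot_build(p); NULL},
                        error = function(e) e)

inherits(build_error, "error")
message(conditionMessage(build_error))
```

The exact wording of the message depends on the ggplot2 version, so it is printed here rather than assumed.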
Test Stat.
You can test your Stat by using it in ggplot2 geom_*() functions.
penguins_clean |>
  ggplot() +
  aes(x = bill_depth_mm,
      y = bill_length_mm) +
  geom_point() +
  geom_point(stat = StatMedians, size = 7) +
  labs(title = "Testing StatMedians")
Note that we don’t use "medians" as the stat argument. But you could! If you prefer, you could write geom_point(stat = "medians", size = 7), which will direct to your new StatMedians under the hood.
Test Stat group-wise behavior
Test group-wise behavior by using a discrete variable with a group-triggering aesthetic like color, fill, or group, or by faceting.
last_plot() +
aes(color = species)
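Besides looking at the plot, group-wise behavior can be checked numerically by counting the rows the Stat returns. The sketch below is an illustration under assumptions: it uses the built-in iris data rather than the penguins, and a base-R version of the compute function so that only ggplot2 is required. The names n_single and n_grouped are hypothetical.

```r
library(ggplot2)

# Base-R variant of the compute function (assumption: avoids loading dplyr).
compute_group_medians <- function(data, scales) {
  data.frame(x = median(data$x), y = median(data$y))
}

StatMedians <- ggproto("StatMedians", Stat,
                       compute_group = compute_group_medians,
                       required_aes = c("x", "y"))

p <- ggplot(iris) +
  aes(x = Sepal.Width, y = Sepal.Length) +
  geom_point(stat = StatMedians)

# Without a grouping aesthetic, the Stat returns a single row.
n_single <- nrow(layer_data(p, 1))

# Mapping color to a discrete variable creates one group per species.
n_grouped <- nrow(layer_data(p + aes(color = Species), 1))

c(single = n_single, grouped = n_grouped)
```

With three species in iris, the grouped plot yields three median points, one per group.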
You might be thinking that what we’ve done would already be pretty useful to you. Can you just use your Stat as-is within geom_*() functions?
The short answer is ‘yes’! If you just want to use the Stat yourself locally in a script, there might not be much reason to go on to Step 3, user-facing functions. But if you have a wider audience in mind, e.g. colleagues within your organization or users of an open-source package, a more succinct expression of the functionality you deliver will probably be useful, i.e. write the user-facing functions.
Using the layer() function to test instead of geom_*(stat = StatNew)
Instead of using a geom_*() function, you might prefer to use the layer() function in your testing step. Occasionally, you must go this route; for example, geom_vline() has no stat argument, but you can use GeomVline in layer(). If you are teaching this content, using layer() may help you better connect this step with the next, defining the user-facing functions.
A test of StatMedians using this method follows. You can see it is a little more verbose, as there is no default for the position argument, and setting the size must be handled with a little more care.
penguins_clean |>
  ggplot() +
  aes(x = bill_depth_mm,
      y = bill_length_mm) +
  geom_point() +
  layer(geom = GeomPoint,
        stat = StatMedians,
        position = "identity",
        params = list(size = 7)) +
  labs(title = "Testing StatMedians with layer() function")
Step 3: Define user-facing functions. Test.
In this next section, we define user-facing functions. Doing so is a bit of a mouthful, but see the ‘Pro tip: Use stat_identity’s definition as a template in this step …’ that follows.
Define stat_*() and geom_*() functions.
stat_medians <- function(mapping = NULL, data = NULL,
                         geom = "point", position = "identity",
                         na.rm = FALSE,
                         ..., show.legend = NA, inherit.aes = TRUE)
{
  layer(mapping = mapping, data = data,
        geom = geom, stat = StatMedians,
        position = position, show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = rlang::list2(na.rm = na.rm, ...))
}
A few points about this function:
- The stat_*() function name derives from the Stat object’s name, but is snake case. Given naming conventions, a StatBigCircle-based stat_*() function should be named stat_big_circle().
- StatMedians defines the new layer function and cannot be replaced by the user. StatMedians and the computation that defines it will be in effect before the layer is rendered.
- "point" refers to the object GeomPoint and defines the layer’s geom unless otherwise specified.
Pro tip: Use stat_identity’s definition as a template in this step …
You may be thinking that defining a new stat_*() function is a mouthful that’s probably hard to reproduce from memory. So you might use stat_identity()’s definition as scaffolding to write your own layer, i.e.:
- Type stat_identity in your console to print the function contents; copy-paste the function definition.
- Switch out StatIdentity with your Stat, e.g. StatIndex.
- Switch out "point" for another geom (‘rect’, ‘text’, ‘line’, etc.) if needed.
- Final touch: list2 is not exported by ggplot2 and will error when called unqualified, so update it to rlang::list2.
stat_identity
function (mapping = NULL, data = NULL, geom = "point", position = "identity",
..., show.legend = NA, inherit.aes = TRUE)
{
layer(data = data, mapping = mapping, stat = StatIdentity,
geom = geom, position = position, show.legend = show.legend,
inherit.aes = inherit.aes, params = list2(na.rm = FALSE,
...))
}
<bytecode: 0x564d30d771c0>
<environment: namespace:ggplot2>
Define geom_*() function
Because users are more accustomed to using layers that have the ‘geom’ prefix, you might also define a geom with identical properties via aliasing.
geom_medians <- stat_medians
Verbatim aliasing as shown above is a bit of a shortcut and assumes that users will use the ‘geom_*()’ function with the stat-geom combination as-is. (For a discussion, see Constructors in ‘Extending ggplot2: A case Study’ in ggplot2: Elegant Graphics for Data Analysis. This section notes, ‘Most ggplot2 users are accustomed to adding geoms, not stats, when building up a plot.’)
An approach that is more consistent with existing guidance would be to hardcode the Geom and allow the user to change the Stat as follows.
# user-facing function
geom_index <- function(mapping = NULL, data = NULL,
                       stat = "index", position = "identity",
                       ..., show.legend = NA, inherit.aes = TRUE)
{
  layer(data = data, mapping = mapping, stat = stat,
        geom = GeomPoint, position = position, show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = rlang::list2(na.rm = FALSE, ...))
}
However, because it is unexpected to use geom_index() with a Stat other than StatIndex (doing so would remove the index-ness), we think that verbatim aliasing is a reasonable, time- and code-saving getting-started approach.
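The trade-off can be made concrete with the medians example. The sketch below is an illustration under assumptions: geom_medians_flexible() is a hypothetical name, the compute function is a base-R variant, mtcars stands in for the penguins data, and the Stat is passed as an object rather than as a string. It shows that swapping the Stat in a geom-fixed constructor removes the "medians-ness" of the layer.

```r
library(ggplot2)

# Base-R variant of the compute function (assumption: avoids loading dplyr).
compute_group_medians <- function(data, scales) {
  data.frame(x = median(data$x), y = median(data$y))
}

StatMedians <- ggproto("StatMedians", Stat,
                       compute_group = compute_group_medians,
                       required_aes = c("x", "y"))

# Hypothetical constructor: geom fixed to GeomPoint, stat left flexible.
geom_medians_flexible <- function(mapping = NULL, data = NULL,
                                  stat = StatMedians, position = "identity",
                                  ..., show.legend = NA, inherit.aes = TRUE) {
  layer(data = data, mapping = mapping, stat = stat,
        geom = GeomPoint, position = position, show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = rlang::list2(na.rm = FALSE, ...))
}

base_plot <- ggplot(mtcars) + aes(x = wt, y = mpg)

# Default Stat: one row, the medians.
n_default <- nrow(layer_data(base_plot + geom_medians_flexible(), 1))

# Overriding the Stat: every observation is drawn, so it is no longer a medians layer.
n_identity <- nrow(layer_data(base_plot + geom_medians_flexible(stat = "identity"), 1))

c(default = n_default, identity = n_identity)
```

With the default Stat the layer holds a single row; with stat = "identity" it holds all 32 rows of mtcars.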
Test/Enjoy your user-facing functions
Test geom_medians()
## Test user-facing.
penguins_clean |>
  ggplot() +
  aes(x = bill_depth_mm, y = bill_length_mm) +
  geom_point() +
  geom_medians(size = 8) +
  labs(title = "Testing geom_medians()")
Test group-wise behavior
last_plot() +
aes(color = species)
Test stat_*() function with another Geom.
last_plot() +
stat_medians(geom = "label", aes(label = species)) +
labs(subtitle = "and stat_medians()")
make_constructor() is available in ggplot2 v. 3.5.2.9000. It will write the scaffolding code for you!
stat_medians <- ggplot2::make_constructor(StatMedians, GeomPoint)
geom_medians <- ggplot2::make_constructor(GeomPoint, StatMedians)
geom_medians_label <- ggplot2::make_constructor(GeomLabel, StatMedians)
# check out the function definitions
geom_medians
penguins_clean |>
  ggplot() +
  aes(x = bill_depth_mm,
      y = bill_length_mm) +
  geom_point() +
  geom_medians(size = 8)
last_plot() +
  aes(color = species)
last_plot() +
  aes(label = species) +
  geom_medians_label()
Done! Time for a review.
Here is a quick review of the functions and ggproto objects we’ve covered, dropping tests and discussion.
library(tidyverse)
# Step 1. Define compute
compute_group_medians <- function(data, scales){
  data |>
    summarise(x = median(x), y = median(y))
}
# Step 2. Define Stat
StatMedians <- ggproto(`_class` = "StatMedians",
                       `_inherit` = Stat,
                       required_aes = c("x", "y"),
                       compute_group = compute_group_medians)
# Step 3. Define user-facing functions (user friendly, geom_*() function only shown here)
## use geom_point's definition as a model to follow geom_* conventions: geom is fixed, stat is flexible
geom_medians <- function(mapping = NULL, data = NULL,
                         stat = "medians", position = "identity",
                         ..., show.legend = NA, inherit.aes = TRUE)
{
  layer(data = data, mapping = mapping, stat = stat,
        geom = GeomPoint, position = position, show.legend = show.legend,
        inherit.aes = inherit.aes,
        params = rlang::list2(na.rm = FALSE, ...))
}
Your Turn: write geom_means()
Using the geom_medians() Recipe #1 as a reference, try to create a stat_means() function that draws a point at the means of x and y. You may also write convenience geom_*() functions.
Step 00: load libraries, data
Step 0: Use base ggplot2 to get the job done
Step 1: Write compute function. Test.
Step 2: Write Stat.
Step 3: Write user-facing functions
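If you get stuck on Step 1, here is one possible starting point, offered as a sketch rather than as the intended solution. It mirrors compute_group_medians() with mean() swapped in, and tests it on the built-in mtcars data (an assumption made so the block runs on its own). The name means_check is hypothetical.

```r
library(dplyr)

# One possible compute function for the exercise: means instead of medians.
compute_group_means <- function(data, scales) {
  data |>
    summarize(x = mean(x),
              y = mean(y))
}

# Test compute: rename columns to x and y first, as in the medians recipe.
means_check <- mtcars |>
  select(x = wt, y = mpg) |>
  compute_group_means()

means_check
```

Steps 2 and 3 then follow the same pattern as StatMedians and stat_medians().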
Next up: Recipe 2 geom_id()
How would you write the function which annotates coordinates (x,y) for data points on a scatterplot? Go to Recipe 2.