library(ggplot2)
ggplot(diamonds) +
  aes(x = cut) +
  geom_bar()
ggplot2 sublayer modularity
a brief and whimsical first-look behind the curtains of ggplot2’s layers
If you are a fan of ggplot2, you are probably also a fan of ‘layer’ functions geom_*() and stat_*().
Important clarification before we begin: Sometimes all the ggplot2 functions are referred to as ggplot2 ‘layers’, e.g. scale_*(), coord_*(), etc., as in ‘build up your plot layer-by-layer’.
But we are using the word in the narrower sense used in the ggplot2 documentation.
Maybe you get giddy thinking about geom_bump(), geom_ridgeline(), or classic geom_histogram()?
Well-composed geom_*()s and stat_*()s (i.e. layers) make for more fluid analytic discovery.
But what elements constitute a layer function?
Let’s look at geom_bar() to get a feel for what layer composition means.
In this plot specification, geom_bar() counts the number of observations at each value of x (cut).
There are actually three main characters in every geom_*() or stat_*().
They are geom, stat, and position.
In geom_bar() the geom is fixed, but stat and position are adjustable. You can see that their defaults are "count" and "stack" in the function’s definition.
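One quick way to confirm those defaults without reading the whole definition is to inspect the function's formal arguments. This is only a small sketch using base R's formals(); it assumes geom_bar() exposes stat and position as ordinary arguments with default values, as in the definition printed in this post, which may not hold for every ggplot2 version.

```r
library(ggplot2)

# formals() returns a function's argument list, default values included
bar_defaults <- formals(geom_bar)[c("stat", "position")]

bar_defaults$stat      # the default Stat, referred to by its stem
bar_defaults$position  # the default position adjustment, also by its stem
```

The geom does not appear among the arguments at all, which is another way of seeing that it is the fixed character in geom_bar().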
And instead of using the convenient geom_bar(), we can use the more generic layer() function, which is actually used under the hood to define all geom_*() and stat_*() functions.
We can reproduce geom_bar()’s behavior with layer(), but we must provide all three ‘control operators’: layer(geom = "bar", stat = "count", position = "stack").
Or, equivalently, we can simply name the underlying ggproto objects, GeomBar and StatCount in our case, and the position function, position_stack().
Reiteration: There are actually three ‘control operators’ that define the geom_*() and stat_*() user-facing functions: Geoms, Stats, and position_*() functions.
You can refer to them indirectly by quoting their stem, layer(geom = "bar", stat = "count", position = "stack").
Or use the ggproto objects and position function directly, layer(geom = GeomBar, stat = StatCount, position = position_stack()).
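To check that the spellings really are interchangeable, here is a hedged sketch that builds the same bar chart three ways and compares what each layer hands off for rendering. It leans on layer_data(), and the object names (base, p_geom, p_strings, p_ggproto, cols) are made up for illustration.

```r
library(ggplot2)

base <- ggplot(diamonds) + aes(x = cut)

# Three spellings of the same layer
p_geom    <- base + geom_bar()
p_strings <- base + layer(geom = "bar", stat = "count", position = "stack")
p_ggproto <- base + layer(geom = GeomBar, stat = StatCount,
                          position = position_stack())

# layer_data() returns the data frame a layer passes on to be drawn;
# we compare only the columns that set bar position and height
cols <- c("x", "y", "ymin", "ymax")
all.equal(layer_data(p_geom)[cols], layer_data(p_strings)[cols])
all.equal(layer_data(p_strings)[cols], layer_data(p_ggproto)[cols])
```

If the three specifications are equivalent, both comparisons should report agreement, and printing any of the three plots should draw the same bars.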
Focus: Let’s look at one ‘control operator’, the Stat, more closely.
Stats themselves have a number of control elements.
A Stat transforms plot input data before it is passed off to be rendered.
A Stat’s computation is defined in its compute slots.
And in StatCount, compute is done group-wise, so compute_group() defines StatCount’s data transformation.
We can get a sense of StatCount$compute_group’s behavior by using it on our data.
First, we use select() to make the data look as it would inside of the ggplot2 plotting environment; this mirrors the aes(x = cut) mapping declaration.
Then we see that the data is collapsed by x, and count and prop variables are produced.
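A minimal sketch of those two steps follows. To stay free of extra packages it builds the renamed data with base R's data.frame() rather than select(), and it leaves the scales argument unsupplied, on the assumption (true of the method body printed in this post) that compute_group() never touches it. The names cut_as_x and counted are illustrative.

```r
library(ggplot2)

# Step 1: rename cut to x, as aes(x = cut) would inside the plot
cut_as_x <- data.frame(x = diamonds$cut)

# Step 2: run StatCount's group-wise computation by hand
# (scales is not supplied; the printed method body does not use it)
counted <- StatCount$compute_group(cut_as_x)

counted  # one row per level of cut, with count and prop columns
```

The tens of thousands of diamonds rows collapse to one row per cut level, which is exactly the table you would otherwise have to build yourself before drawing bars.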
We can think about StatCount’s job as doing some computation that the user might otherwise be responsible for.
We can use StatCount$compute_group() to do this computation manually, and then use StatIdentity (which leaves data alone) in layer() to show this work explicitly.
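Here is one way that could look, offered as a sketch rather than the canonical approach: the counts are computed up front, then plotted with StatIdentity so that ggplot2 does no further transformation. The object names (counted, p_manual, p_auto) are illustrative, and p_auto is included only for comparison.

```r
library(ggplot2)

# Do StatCount's work ourselves, before the plot is specified
counted <- StatCount$compute_group(data.frame(x = diamonds$cut))

# StatIdentity passes the data through untouched, so y must be mapped by hand
p_manual <- ggplot(counted) +
  aes(x = x, y = count) +
  layer(geom = GeomBar, stat = StatIdentity, position = position_stack())

# The usual route, where StatCount does the counting inside the layer
p_auto <- ggplot(diamonds) +
  aes(x = cut) +
  geom_bar()

# Bar heights should agree between the two routes
all.equal(layer_data(p_manual)$y, layer_data(p_auto)$y)
```

Printing p_manual and p_auto should give matching bars (apart from the x-axis title); the only difference is who did the counting.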
Key point: We might think of a Stat’s job as lightening the analyst’s load: doing computation that the user would otherwise need to do before plotting.
One final question you might have is ‘how exactly is the height of the bar, y, communicated to the ggplot2 system? Why does that just work?’
This is managed by the default_aes specification for StatCount.
Because there is no variable mapped to y in our plot specification, the y position defaults to after_stat(count), in other words the computed variable count that is available after the StatCount computation is done!
And it is good to do a little mix-and-match thinking to get a further feel for StatCount. Which of the following plots will have identical outputs?
Is this what you anticipated?
Above, we’ve had an outside-in look at some aspects of ‘sublayer modularity’.
To get an inside-out look, building up your own Stat from scratch, you might have a look at ‘easy geom recipes’ or join the ggplot2 extenders club.
function (mapping = NULL, data = NULL, stat = "count", position = "stack",
..., just = 0.5, na.rm = FALSE, orientation = NA, show.legend = NA,
inherit.aes = TRUE)
{
layer(data = data, mapping = mapping, stat = stat, geom = GeomBar,
position = position, show.legend = show.legend, inherit.aes = inherit.aes,
params = list2(just = just, na.rm = na.rm, orientation = orientation,
...))
}
<bytecode: 0x7fe7c772e5b8>
<environment: namespace:ggplot2>
<ggproto method>
<Wrapper function>
function (...)
compute_group(..., self = self)
<Inner function (f)>
function (self, data, scales, width = NULL, flipped_aes = FALSE)
{
data <- flip_data(data, flipped_aes)
x <- data$x
weight <- data$weight %||% rep(1, length(x))
count <- as.vector(rowsum(weight, x, na.rm = TRUE))
bars <- data_frame0(count = count, prop = count/sum(abs(count)),
x = sort(unique0(x)), width = width, flipped_aes = flipped_aes,
.size = length(count))
flip_data(bars, flipped_aes)
}
p1 <- ggplot(data = diamonds) +
  aes(x = cut) +
  layer(geom = GeomBar,
        stat = StatCount,
        position = position_stack())
p2 <- p1 + geom_point(stat = StatCount)
p3 <- p1 + stat_count(geom = GeomPoint)
p4 <- p1 + geom_label(stat = StatCount,
                      aes(label = after_stat(count)))
p5 <- p1 + stat_count(geom = GeomText,
                      aes(label = after_stat(count)),
                      vjust = 1)