TL;DR

The ggxmean package introduces new geom_* functions for fluid visual description of some basic statistical concepts, including the title character, geom_x_mean, which draws a vertical line at the mean of x. We also introduce “Easy Geom Recipes”, an introductory tutorial for creating a class of geoms that perform useful computation and inherit characteristics from more primitive geoms.

On the path to {ggxmean}:

Some time ago, I was sitting on the floor in the back of a packed-out ballroom, watching Thomas Lin Pedersen give a talk: ‘Extend your Ability to Extend ggplot2’.

‘I want to do that’ I thought.

And I had a use case: statistical summaries, especially those for explaining rather basic statistical concepts like variance, covariance, and correlation.

I’d visually walked through these concepts in courses I’d taught. At a chalkboard, you can do this pretty easily.

I also worked on the walk-throughs with ggplot2. Here’s what that looks like:

So, math notation and visual representation builds of basic statistics! They co-evolve speaking to different learning styles. Plus DRY principles for coders and a walk through of calc w num vals, for numerophiles! #ggplot2 #xaringan #flipbookr #rstats https://t.co/JgWLxo94Ms pic.twitter.com/ol08lMGdtD

— Gina Reynolds (@EvaMaeRey) June 25, 2020

With ggplot2 you can get this done, but there was so much prep that had to happen before I could even start on the visualization. I had to calculate the means, standard deviations, etc., all before beginning to plot, and then feed those calculations into existing geom functions like geom_vline and geom_segment.
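
To make that concrete, here is a minimal sketch of the precompute-then-plot workflow. It is not the original walk-through code; it assumes the built-in cars data and just two summaries (the mean and standard deviation of x).

```r
library(ggplot2)

# Step 1: compute the summaries by hand, before any plotting
x_mean <- mean(cars$speed)
x_sd   <- sd(cars$speed)

# Step 2: feed the precomputed numbers into existing geoms
p <- ggplot(cars) +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_vline(xintercept = x_mean) +
  geom_vline(xintercept = x_mean + c(-1, 1) * x_sd,
             linetype = "dashed")

p
```

Swap in another dataset or add a facet and the two steps drift apart: the numbers computed in step 1 no longer match what the plot in step 2 is showing.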

This didn’t feel like the powerful declarative experience that you have a lot of the time using ggplot2. For example, the boxplot experience is:

I want to know about the pattern in this dataset; okay, so ggplot(data = my_data)
I’m picturing my different categories on the x axis; alright, mapping = aes(x = my_category)
And I’d like y to represent my continuous variable; great, then mapping = aes(y = my_continuous_outcome)
And I’ll use boxplots to summarize these groupwise distributions; so it’s '+ geom_boxplot()'.
and bam! I’ve built my plot and I can see group differences!

In this boxplot example, lots of computation happens in the background for us: min, max, 25%, 75%, median. And that is great. I understand the boxplot well; I don’t need to do those computations myself. I’m happy for ggplot2 to do that for me.
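
Here is that boxplot experience as a small runnable sketch. The data and variable names are stand-ins: it assumes ggplot2's bundled mpg data, with class as the category and hwy as the continuous outcome. The last line uses layer_data() to peek at the summaries ggplot2 computed for each box.

```r
library(ggplot2)

p <- ggplot(data = mpg) +   # the dataset I want to know about
  aes(x = class) +          # categories on the x axis
  aes(y = hwy) +            # continuous outcome on the y axis
  geom_boxplot()            # groupwise summaries, computed for us

p

# The per-group statistics that were computed in the background
box_stats <- layer_data(p)
box_stats[, c("x", "ymin", "lower", "middle", "upper", "ymax")]
```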

For this stats walk through, I wanted the same declarative experience. I understand the mean well, and one standard deviation away from the mean, etc. I should be able to ask ggplot2 to do that computation for me – to compute the global mean (or a group-wise mean if I’m in the mood for that) – and put a vertical line there.

But as things stood, my solution to choreographing the stats visualizations felt inelegant and fragile. It wasn’t portable (not easy to move to other cases – maybe data that I or my students might be more passionate about) or dynamic (I couldn’t easily do group-wise work instead of acting globally). Put simply, it wasn’t very fun.

Thomas’ talk and the extension system seemed like the answer to bringing ggplot2 fun to the statistical storytelling that I’d been wanting to do.

Fast forward a few years. I’ve dug into tutorial material like the ‘Extending ggplot2’ vignette and the ‘Extension’ chapter in the newest edition of the ggplot2 book, rewatched Thomas Lin Pedersen’s talk a number of times, and examined ggplot2 code on GitHub and code from other extension packages in the ggplot2 extension gallery. And now I’m happy to introduce the {ggxmean} package!

The ggxmean package allows you to easily add statistical summaries to your data visualization with new geom functions! It makes doing statistical walk-throughs like the covariance, variance, sd, and correlation walk-through elegant and fluid!

The syntax mirrors how you might go about untangling the covariance equation and drawing the mathematical representations on a scatter plot on a classroom chalkboard!
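
For reference, the arithmetic being untangled can be sketched in a few lines of base R. This assumes the usual sample (n - 1) definitions and the built-in cars data; it is the chalkboard calculation, not necessarily the exact code inside the ggxmean geoms.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# Deviations of each observation from the means
x_diff <- x - mean(x)
y_diff <- y - mean(y)

# Products of the paired deviations (the "rectangles" on the chalkboard)
diffs_multiplied <- x_diff * y_diff

# Covariance: the (n - 1)-averaged product of deviations
cov_xy <- sum(diffs_multiplied) / (n - 1)

# Correlation: covariance rescaled by the two standard deviations
r_xy <- cov_xy / (sd(x) * sd(y))

# Check against the built-in functions
all.equal(cov_xy, cov(x, y))
all.equal(r_xy, cor(x, y))
```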

And, plus, moreover, additionally (yeah, this part feels really huge), you can easily ask ggplot2 to do all these computations group-wise if you so choose! For example, in the plot that follows, ggplot recomputes everything for us when we add the faceting by species declaration. That is awesome. ggplot2 is hard at work in the background. [footnote: some of these functions aren’t exported because I’m not confident of the names and some other considerations.]

library(tidyverse)
library(ggxmean)
palmerpenguins::penguins %>% 
  ggplot() +
  aes(x = bill_length_mm) +
  aes(y = flipper_length_mm) +
  geom_point() +
  ggxmean::geom_x_mean() +
  ggxmean::geom_y_mean() +
  ggxmean:::geom_xdiff() +
  ggxmean:::geom_ydiff() +
  ggxmean:::geom_x1sd(linetype = "dashed") +
  ggxmean:::geom_y1sd(linetype = "dashed") +
  ggxmean:::geom_diffsmultiplied() +
  ggxmean:::geom_xydiffsmean(alpha = 1) +
  ggxmean:::geom_rsq1() +
  ggxmean:::geom_corrlabel() +
  facet_wrap(facets = vars(species))

Way leads onto way …

Next, I was interested in visualizing conceptual components of an ordinary least squares (OLS) regression lesson. What does an instructor draw at a chalkboard for each of the named concepts when teaching linear regression? Again, could we isolate those concepts and provide geoms to build up those ideas in code, just as an instructor would do on a chalkboard?

Asking these questions led to a number of new geoms, including those that compute and draw residuals, the intercept, fitted values and more, as seen here:

library(tidyverse)
library(ggxmean)
#library(transformr) #might help w/ animate

## basic example code
cars %>% 
  ggplot() +
  aes(x = speed,
      y = dist) +
  geom_point() + 
  ggxmean::geom_lm() +
  ggxmean::geom_lm_residuals(linetype = "dashed") +
  ggxmean::geom_lm_fitted(color = "goldenrod3", size = 3) +
  ggxmean::geom_lm_conf_int() +
  ggxmean::geom_lm_pred_int() +
  ggxmean::geom_lm_formula() +
  ggxmean::geom_lm_intercept(color = "red", size = 5) +
  ggxmean::geom_lm_intercept_label(size = 4, hjust = 0)
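
The quantities those layers draw are the familiar pieces of a fitted linear model. As a rough sketch in base R (an assumption about what is computed for each layer, not the package's internal code), the same pieces for the cars data are:

```r
fit <- lm(dist ~ speed, data = cars)

coef(fit)                       # intercept and slope

fitted_values <- fitted(fit)    # points on the line at each observed x
resids        <- residuals(fit) # observed y minus fitted y

# Interval bands evaluated at the observed x values
conf_band <- predict(fit, newdata = cars, interval = "confidence")
pred_band <- predict(fit, newdata = cars, interval = "prediction")

head(cbind(cars, fitted = fitted_values, resid = resids))
```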

Extending the scope of ggxmean: student contributions

The work on OLS was a jumping off point for the most recent additions to the ggxmean package, written by Morgan Brown and Madison McGovern, students at West Point, for independent studies.

Brown and McGovern took up the question of data outliers. We applied their work to famous toy datasets: Anscombe’s quartet and the datasauRus dozen. With the functions I’d previously worked on, we can visualize the summary statistics (mean, sds, correlation) that are typically the subject of discussions of Anscombe’s quartet and the datasauRus Dozen. This is shown here:

# first some data munging
datasets::anscombe %>%
  pivot_longer(cols = 1:8) %>%
  mutate(group = paste("Anscombe", 
                       str_extract(name, "\\d"))) %>%
  mutate(var = str_extract(name, "\\w")) %>%
  select(-name) %>%
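  pivot_wider(names_from = var,
              values_from = value,
              values_fn = list) %>%  # x and y repeat within group: collect as list-columns
  unnest(cols = c(x, y)) ->          # then expand back to one row per observation
tidy_anscombe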

tidy_anscombe %>%
  ggplot() +
  aes(x = x, y = y) +
  geom_point() +
  aes(color = group) +
  facet_wrap(facets = vars(group)) +
  ggxmean::geom_x_mean() +
  ggxmean::geom_y_mean() +
  ggxmean:::geom_x1sd(linetype = "dashed") +
  ggxmean:::geom_y1sd(linetype = "dashed") +
  ggxmean::geom_lm() +
  ggxmean::geom_lm_formula() +
  ggxmean:::geom_corrlabel() + 
  guides(color = "none")

But Anscombe’s quartet’s distributions are pretty wonky. Using Morgan’s and Madison’s functions on leverage and influence, we reveal outlying observations. In the following plot, Morgan Brown’s function geom_text_leverage() calculates leverage for each observation:

tidy_anscombe %>%
  ggplot() +
  aes(x = x, y = y) +
  aes(color = group) +
  geom_point() +
  facet_wrap(facets = vars(group)) +
  ggxmean::geom_text_leverage(vjust = 1,   ## Morgan's function!
                              check_overlap = TRUE) + 
  guides(color = "none")
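
For readers who want the number behind the label: leverage for a simple regression can be pulled out of a fitted model with hatvalues(). This is a sketch under the assumption that leverage is computed from a y ~ x linear model within each group; it uses the fourth Anscombe set straight from datasets::anscombe.

```r
# Fit the fourth Anscombe set: every x is 8 except one point at x = 19
fit4 <- lm(y4 ~ x4, data = datasets::anscombe)

# Leverage (diagonal of the hat matrix) for each observation
leverage <- hatvalues(fit4)
round(leverage, 2)

# The lone x = 19 point single-handedly determines the slope
which.max(leverage)
```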

And in the datasauRus::datasaurus_dozen, Madison McGovern’s geom_point_high_cooks() highlights the 10% most influential observations.

datasauRus::datasaurus_dozen %>%
  ggplot() +
  aes(x = x, y = y) +
  geom_point() +
  ggxmean::geom_point_high_cooks( ## Madison's function
    color = "goldenrod",
    size = 5) + 
  facet_wrap(facets = "dataset")
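
The underlying idea can be sketched with base R's cooks.distance(). The 90th-percentile cutoff below is an assumption about how "the 10% most influential" is defined, and the built-in cars data stands in for a single panel.

```r
fit <- lm(dist ~ speed, data = cars)

# Cook's distance: how much the fit changes when each point is left out
cooks_d <- cooks.distance(fit)

# Flag observations above the 90th percentile of Cook's distance
high <- cooks_d > quantile(cooks_d, probs = 0.90)

cars[high, ]
```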

Not on CRAN?

Nope. Not yet, but hopefully soon. We’re open to feedback on code, computation, and conventions (can we be more consistent with function names, etc) in this package.

Extending the circle of extenders with a new point of entry: ‘easy geom recipes’?

Learning the particular mechanism needed to build ggxmean functions took me a chunk of time, about six months. Once I had worked it out, though, I found the mechanism really enabling: I wrote about 20 geom_* functions in ggxmean with it.

Working with Morgan and Madison was a chance to think more about this particular extension mechanism, and about how to make it accessible to new learners. Morgan and Madison are top students in the West Point math department, but relative newcomers to R and ggplot2; and yet they managed to write usable geom extensions in less time than I’d done!

And not only did they write geoms about outlying observations (arguably more statistically interesting than any of the geoms I’d worked on), but they also wrote what are, at first glance, trivial geoms: geom_label_id and geom_coordinate.

ggplot(cars) + 
  aes(x = speed, y = dist) + 
  geom_point() + 
  ggxmean::geom_text_coordinate(hjust = -.05,
                                check_overlap = TRUE)

On a whim, I asked them to put together these additional geoms; they did so very quickly and with very little guidance.

Their competence at translating their skills along with the computational accessibility of these more ‘trivial’ geoms was exciting. Could we make this space much more accessible to a lot more ggplot2 practitioners?

In a second independent study term, Morgan and I put together the tutorial ‘easy geom recipes’ and tested it with extension newcomers. In it, we distilled our process for success in building geoms in {ggxmean}.

For almost all of the geoms in ggxmean we followed the formula:

Step 0: use base ggplot2 to build the desired output
Step 1-3: build your ggplot2 function by
1. writing a compute function based on computations done using the base ggplot2 build (we only use compute_group for these easy recipes)
2. passing the compute_group function to create a ggproto object
3. passing the ggproto object (‘Stat’) to a geom_* function
Step 4: try out and enjoy the geom_*() functionality…

The mechanism is to create geom_*() functions that wrap a stat, inherit from an existing geom, and use continuous required aesthetics. We think that this inheritance mechanism may be a useful point of entry, with easy wins, for folks interested in entering this space!
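
As an illustration of the formula, here is a minimal sketch of steps 1 to 4 for a vertical line at the mean of x. The names compute_group_xmean, StatXmeanSketch and geom_x_mean_sketch are made up for this example, and the code is a simplified stand-in rather than the source of ggxmean::geom_x_mean().

```r
library(ggplot2)

# Step 1: a compute function that turns one group's data into what
# the inherited geom needs (GeomVline needs xintercept)
compute_group_xmean <- function(data, scales) {
  data.frame(xintercept = mean(data$x))
}

# Step 2: pass the compute function to a ggproto Stat object
StatXmeanSketch <- ggproto(
  "StatXmeanSketch", Stat,
  compute_group = compute_group_xmean,
  required_aes = "x",
  # x and y are consumed here on purpose; declaring this avoids a
  # dropped-aesthetics warning in recent ggplot2 versions
  dropped_aes = c("x", "y")
)

# Step 3: pass the Stat to a geom_* function that draws with GeomVline
geom_x_mean_sketch <- function(mapping = NULL, data = NULL,
                               position = "identity", na.rm = FALSE,
                               show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    stat = StatXmeanSketch, geom = GeomVline,
    data = data, mapping = mapping, position = position,
    show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

# Step 4: try it out
p <- ggplot(cars) +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_x_mean_sketch(linetype = "dashed")

p
```

Because the computation lives in compute_group, ggplot2 reruns it for every group and panel, which is where the group-wise behavior comes from.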

The tutorial contains 6 recipes - three that are fully worked out, and three that we invite users to complete. Try out the recipes here.

A go-to starting point for ggplot2 extension is the theme. In my experience, though, using a home-grown theme pales next to the thrill of seeing a home-grown geom_*() in action, doing a bunch of computational work in the background. I think more mathematically and statistically minded folks in the ggplot2 community may have the same experience!

What types of geoms can I expect these recipes to inform?

Geoms that have only continuous required aesthetics and that inherit from simpler primitive geoms (point, segment, text, label) are good candidates.
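
A text-based candidate might look like the sketch below, which labels each point with its coordinates and inherits from GeomText. The names StatCoordinateSketch and geom_text_coordinate_sketch are hypothetical, and this is a simplified illustration rather than the code behind ggxmean::geom_text_coordinate().

```r
library(ggplot2)

# Stat: keep every row and add a label built from the x and y values
StatCoordinateSketch <- ggproto(
  "StatCoordinateSketch", Stat,
  compute_group = function(data, scales) {
    data$label <- paste0("(", data$x, ", ", data$y, ")")
    data
  },
  required_aes = c("x", "y")
)

# Geom function: same stat-wrapping pattern, drawing with GeomText
geom_text_coordinate_sketch <- function(mapping = NULL, data = NULL,
                                        position = "identity", na.rm = FALSE,
                                        show.legend = NA, inherit.aes = TRUE,
                                        ...) {
  layer(
    stat = StatCoordinateSketch, geom = GeomText,
    data = data, mapping = mapping, position = position,
    show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

p <- ggplot(cars) +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_text_coordinate_sketch(hjust = -0.05, check_overlap = TRUE)

p
```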

Deserving of a geom_*() function?

As a guide, I think a geom_*() function may be in order when we have well-articulated statistical concepts and would like the visual vocabulary to match. These cases seem particularly deserving. The I-need-it feeling is also probably a good indicator that it is time to take a stab at writing a new geom.

What’s next?

The ggplot2 extension space is vast and includes the orthogonal components of ggplot2 builds (geoms, scales, coords, facets etc). Even within the space of new geoms, the strategies are varied.

‘Easy geom recipes’ bites into a small, exciting area of the ggplot2 extension space. It expands the number of examples for that space and gives folks material to chew on to get the flavor of this particular mechanism. We think this tutorial gives folks a small win and makes more people curious about entering the ggplot2 extension space! There is certainly more terrain to cover, and perhaps we’ll get to work soon on a ‘more, just-as-easy geom recipes’ that introduces a bit more of it.

{ggxmean} and new statistical geoms

Gina Reynolds, Morgan Brown, Madison McGovern

5/11/2022