R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
spotify_songs <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

spotify_songs %>% head()
##                 track_id                                            track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31                       Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l                       All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7                     Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x               Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef     Beautiful People (feat. Khalid) - Jack Wins Remix
##       track_artist track_popularity         track_album_id
## 1       Ed Sheeran               66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2         Maroon 5               67 63rPSO264uRjW1X5E6cWv6
## 3     Zara Larsson               70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers               60 1nqYsOef1yKKuGOVchbsk6
## 5    Lewis Capaldi               69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6       Ed Sheeran               67 2yiy9cd2QktrNvWC2EUi0k
##                                        track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2                       Memories (Dillon Francis Remix)
## 3                       All the Time (Don Diablo Remix)
## 4                           Call You Mine - The Remixes
## 5               Someone You Loved (Future Humans Remix)
## 6     Beautiful People (feat. Khalid) [Jack Wins Remix]
##   track_album_release_date playlist_name            playlist_id playlist_genre
## 1               2019-06-14     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 2               2019-12-13     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 3               2019-07-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 4               2019-07-19     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 5               2019-03-05     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
## 6               2019-07-11     Pop Remix 37i9dQZF1DXcZDD7cfEKhW            pop
##   playlist_subgenre danceability energy key loudness mode speechiness
## 1         dance pop        0.748  0.916   6   -2.634    1      0.0583
## 2         dance pop        0.726  0.815  11   -4.969    1      0.0373
## 3         dance pop        0.675  0.931   1   -3.432    0      0.0742
## 4         dance pop        0.718  0.930   7   -3.778    1      0.1020
## 5         dance pop        0.650  0.833   1   -4.672    1      0.0359
## 6         dance pop        0.675  0.919   8   -5.385    1      0.1270
##   acousticness instrumentalness liveness valence   tempo duration_ms
## 1       0.1020         0.00e+00   0.0653   0.518 122.036      194754
## 2       0.0724         4.21e-03   0.3570   0.693  99.972      162600
## 3       0.0794         2.33e-05   0.1100   0.613 124.008      176616
## 4       0.0287         9.43e-06   0.2040   0.277 121.956      169093
## 5       0.0803         0.00e+00   0.0833   0.725 123.976      189052
## 6       0.0799         0.00e+00   0.1430   0.585 124.982      163049
spotify_songs %>% skimr::skim()
## Skim summary statistics
##  n obs: 32833 
##  n variables: 23 
##  group variables:  
## 
## ── Variable type:factor ────────────────────────────────────────────────────────
##                  variable missing complete     n n_unique
##            playlist_genre       0    32833 32833        6
##               playlist_id       0    32833 32833      471
##             playlist_name       0    32833 32833      449
##         playlist_subgenre       0    32833 32833       24
##            track_album_id       0    32833 32833    22545
##          track_album_name       5    32828 32833    19743
##  track_album_release_date       0    32833 32833     4530
##              track_artist       5    32828 32833    10692
##                  track_id       0    32833 32833    28356
##                track_name       5    32828 32833    23449
##                                  top_counts ordered
##  edm: 6043, rap: 5746, pop: 5507, r&b: 5431   FALSE
##      4Jk: 247, 37i: 198, 6Kn: 195, 3xM: 189   FALSE
##      Ind: 308, 202: 247, Per: 244, Har: 219   FALSE
##  pro: 1809, sou: 1675, ind: 1672, lat: 1656   FALSE
##          5L1: 42, 5fs: 29, 7Cj: 28, 4VF: 26   FALSE
##         Gre: 139, Ult: 42, Gol: 35, Mal: 30   FALSE
##      202: 270, 201: 244, 201: 235, 201: 220   FALSE
##      Mar: 161, Que: 136, The: 123, Dav: 110   FALSE
##             7BK: 10, 14s: 9, 3ee: 9, 0nb: 8   FALSE
##          Poi: 22, Bre: 21, Ali: 20, For: 20   FALSE
## 
## ── Variable type:integer ───────────────────────────────────────────────────────
##          variable missing complete     n      mean       sd   p0    p25    p50
##       duration_ms       0    32833 32833 225799.81 59834.01 4000 187819 216000
##               key       0    32833 32833      5.37     3.61    0      2      6
##              mode       0    32833 32833      0.57     0.5     0      0      1
##  track_popularity       0    32833 32833     42.48    24.98    0     24     45
##     p75   p100     hist
##  253585 517810 ▁▁▅▇▃▁▁▁
##       9     11 ▇▃▃▃▃▆▃▆
##       1      1 ▆▁▁▁▁▁▁▇
##      62    100 ▇▃▆▇▇▇▃▁
## 
## ── Variable type:numeric ───────────────────────────────────────────────────────
##          variable missing complete     n    mean    sd        p0     p25
##      acousticness       0    32833 32833   0.18   0.22   0         0.015
##      danceability       0    32833 32833   0.65   0.15   0         0.56 
##            energy       0    32833 32833   0.7    0.18   0.00017   0.58 
##  instrumentalness       0    32833 32833   0.085  0.22   0         0    
##          liveness       0    32833 32833   0.19   0.15   0         0.093
##          loudness       0    32833 32833  -6.72   2.99 -46.45     -8.17 
##       speechiness       0    32833 32833   0.11   0.1    0         0.041
##             tempo       0    32833 32833 120.88  26.9    0        99.96 
##           valence       0    32833 32833   0.51   0.23   0         0.33 
##        p50      p75   p100     hist
##    0.08      0.26     0.99 ▇▂▁▁▁▁▁▁
##    0.67      0.76     0.98 ▁▁▁▂▆▇▆▂
##    0.72      0.84     1    ▁▁▁▃▅▇▇▆
##    1.6e-05   0.0048   0.99 ▇▁▁▁▁▁▁▁
##    0.13      0.25     1    ▇▅▂▁▁▁▁▁
##   -6.17     -4.64     1.27 ▁▁▁▁▁▁▇▃
##    0.062     0.13     0.92 ▇▂▁▁▁▁▁▁
##  121.98    133.92   239.44 ▁▁▂▇▇▂▁▁
##    0.51      0.69     0.99 ▂▅▆▇▇▇▅▃

The problem

library(tidyverse)
spotify_songs %>% 
  ggplot() + 
  aes(x = danceability, y = valence) + 
  geom_point()

Randomly sampled exemplars

library(tidyverse)
spotify_songs %>% 
  ggplot() + 
  aes(x = danceability, y = valence) + 
  geom_point(alpha = .04) + 
  geom_point(data = . %>% sample_frac(.03)) # exemplars: a random 3% of rows, drawn at full opacity
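
Because the exemplar layer samples rows at plot time, the highlighted points change on every knit. A minimal sketch of one way to make them reproducible: draw the 3% subset once with a fixed seed and pass it to the second layer. The `songs` data frame is a simulated stand-in for `spotify_songs` (an assumption, so the example runs without the download), and the seed values are arbitrary.

```r
# Simulated stand-in for spotify_songs (hypothetical data, same two columns)
set.seed(2020)
n <- 5000
songs <- data.frame(
  danceability = rbeta(n, 5, 3),
  valence      = runif(n)
)

# Fix the seed, then draw the 3% exemplar rows once so they can be reused
set.seed(42)
exemplar_rows <- sample(nrow(songs), size = round(0.03 * nrow(songs)))
exemplars <- songs[exemplar_rows, ]

# Plot only if ggplot2 is installed; the sampling above does not need it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(songs, aes(x = danceability, y = valence)) +
    geom_point(alpha = 0.04) +      # faint layer: all points
    geom_point(data = exemplars)    # opaque layer: the fixed exemplars
}
```

Storing `exemplars` as its own object also makes it possible to label or inspect the highlighted songs later.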

ggpointdensity solution. Color as a third dimension.

library(tidyverse)
spotify_songs %>% 
  ggplot() + 
  aes(x = danceability, y = valence) + 
  ggpointdensity::geom_pointdensity() +  
  scale_color_viridis_c()
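
The idea behind coloring by point density can be sketched in base R: for each point, count how many other points fall within a small radius, then map that count to color. This is a brute-force illustration of the concept on simulated data, not the package's implementation; the radius `r` and the variable names are assumptions chosen for the example.

```r
# Simulated 2D point cloud (hypothetical data)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rnorm(n)

r <- 0.3  # illustrative neighborhood radius, not a package default

# Full pairwise Euclidean distance matrix (O(n^2), fine for small n)
d <- as.matrix(dist(cbind(x, y)))

# Neighbors within r, excluding the point itself (its self-distance is 0)
n_neighbors <- rowSums(d <= r) - 1

# Points near the center of the cloud should have more neighbors
dist_from_center <- sqrt(x^2 + y^2)
```

The resulting `n_neighbors` vector could be mapped to the color aesthetic in an ordinary `geom_point()` layer. For tens of thousands of rows the full distance matrix becomes expensive, which is one reason to rely on a dedicated package instead.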


Row and column sketching

https://arxiv.org/abs/2009.03979 Journal of Computational and Graphical Statistics, 2022

Visualizing very large matrices involves many formidable problems. Various popular solutions to these problems involve sampling, clustering, projection, or feature selection to reduce the size and complexity of the original task. An important aspect of these methods is how to preserve relative distances between points in the higher-dimensional space after reducing rows and columns to fit in a lower dimensional space. This aspect is important because conclusions based on faulty visual reasoning can be harmful. Judging dissimilar points as similar or similar points as dissimilar on the basis of a visualization can lead to false conclusions. To ameliorate this bias and to make visualizations of very large datasets feasible, we introduce two new algorithms that respectively select a subset of rows and columns of a rectangular matrix. This selection is designed to preserve relative distances as closely as possible. We compare our matrix sketch to more traditional alternatives on a variety of artificial and real datasets.

Leland Wilkinson, Hengrui Luo
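
To make "preserving relative distances" concrete, here is a minimal sketch that scores a column subset by how well the pairwise row distances computed on it track the distances computed on the full matrix. It uses a plain random column sample as the baseline, not the selection algorithms from the paper; the matrix `X`, the subset size, and the choice of Spearman correlation as the score are all assumptions for illustration.

```r
# Simulated data matrix: 200 rows, 30 columns (hypothetical)
set.seed(7)
X <- matrix(rnorm(200 * 30), nrow = 200, ncol = 30)

# Pairwise row distances using all columns
d_full <- dist(X)

# Baseline "sketch": keep 10 columns chosen uniformly at random
keep_cols <- sample(ncol(X), 10)
d_sketch <- dist(X[, keep_cols])

# Rank correlation between the two sets of distances:
# values near 1 mean the relative ordering of distances is well preserved
preservation <- cor(as.vector(d_full), as.vector(d_sketch), method = "spearman")
```

The same score could be computed for any other column or row selection, which gives a simple way to compare a candidate sketch against random sampling.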