This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.2
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
spotify_songs <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
spotify_songs %>% head()
## track_id track_name
## 1 6f807x0ima9a1j3VPbc7VN I Don't Care (with Justin Bieber) - Loud Luxury Remix
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories - Dillon Francis Remix
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the Time - Don Diablo Remix
## 4 75FpbthrwQmzHlBJLuGdC7 Call You Mine - Keanu Silva Remix
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone You Loved - Future Humans Remix
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful People (feat. Khalid) - Jack Wins Remix
## track_artist track_popularity track_album_id
## 1 Ed Sheeran 66 2oCs0DGTsRO98Gh5ZSl2Cx
## 2 Maroon 5 67 63rPSO264uRjW1X5E6cWv6
## 3 Zara Larsson 70 1HoSmj2eLcsrR0vE9gThr4
## 4 The Chainsmokers 60 1nqYsOef1yKKuGOVchbsk6
## 5 Lewis Capaldi 69 7m7vv9wlQ4i0LFuJiE2zsQ
## 6 Ed Sheeran 67 2yiy9cd2QktrNvWC2EUi0k
## track_album_name
## 1 I Don't Care (with Justin Bieber) [Loud Luxury Remix]
## 2 Memories (Dillon Francis Remix)
## 3 All the Time (Don Diablo Remix)
## 4 Call You Mine - The Remixes
## 5 Someone You Loved (Future Humans Remix)
## 6 Beautiful People (feat. Khalid) [Jack Wins Remix]
## track_album_release_date playlist_name playlist_id playlist_genre
## 1 2019-06-14 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 2 2019-12-13 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 3 2019-07-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 4 2019-07-19 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 5 2019-03-05 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## 6 2019-07-11 Pop Remix 37i9dQZF1DXcZDD7cfEKhW pop
## playlist_subgenre danceability energy key loudness mode speechiness
## 1 dance pop 0.748 0.916 6 -2.634 1 0.0583
## 2 dance pop 0.726 0.815 11 -4.969 1 0.0373
## 3 dance pop 0.675 0.931 1 -3.432 0 0.0742
## 4 dance pop 0.718 0.930 7 -3.778 1 0.1020
## 5 dance pop 0.650 0.833 1 -4.672 1 0.0359
## 6 dance pop 0.675 0.919 8 -5.385 1 0.1270
## acousticness instrumentalness liveness valence tempo duration_ms
## 1 0.1020 0.00e+00 0.0653 0.518 122.036 194754
## 2 0.0724 4.21e-03 0.3570 0.693 99.972 162600
## 3 0.0794 2.33e-05 0.1100 0.613 124.008 176616
## 4 0.0287 9.43e-06 0.2040 0.277 121.956 169093
## 5 0.0803 0.00e+00 0.0833 0.725 123.976 189052
## 6 0.0799 0.00e+00 0.1430 0.585 124.982 163049
spotify_songs %>% skimr::skim()
## Skim summary statistics
## n obs: 32833
## n variables: 23
## group variables:
##
## ── Variable type:factor ────────────────────────────────────────────────────────
## variable missing complete n n_unique
## playlist_genre 0 32833 32833 6
## playlist_id 0 32833 32833 471
## playlist_name 0 32833 32833 449
## playlist_subgenre 0 32833 32833 24
## track_album_id 0 32833 32833 22545
## track_album_name 5 32828 32833 19743
## track_album_release_date 0 32833 32833 4530
## track_artist 5 32828 32833 10692
## track_id 0 32833 32833 28356
## track_name 5 32828 32833 23449
## top_counts ordered
## edm: 6043, rap: 5746, pop: 5507, r&b: 5431 FALSE
## 4Jk: 247, 37i: 198, 6Kn: 195, 3xM: 189 FALSE
## Ind: 308, 202: 247, Per: 244, Har: 219 FALSE
## pro: 1809, sou: 1675, ind: 1672, lat: 1656 FALSE
## 5L1: 42, 5fs: 29, 7Cj: 28, 4VF: 26 FALSE
## Gre: 139, Ult: 42, Gol: 35, Mal: 30 FALSE
## 202: 270, 201: 244, 201: 235, 201: 220 FALSE
## Mar: 161, Que: 136, The: 123, Dav: 110 FALSE
## 7BK: 10, 14s: 9, 3ee: 9, 0nb: 8 FALSE
## Poi: 22, Bre: 21, Ali: 20, For: 20 FALSE
##
## ── Variable type:integer ───────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50
## duration_ms 0 32833 32833 225799.81 59834.01 4000 187819 216000
## key 0 32833 32833 5.37 3.61 0 2 6
## mode 0 32833 32833 0.57 0.5 0 0 1
## track_popularity 0 32833 32833 42.48 24.98 0 24 45
## p75 p100 hist
## 253585 517810 ▁▁▅▇▃▁▁▁
## 9 11 ▇▃▃▃▃▆▃▆
## 1 1 ▆▁▁▁▁▁▁▇
## 62 100 ▇▃▆▇▇▇▃▁
##
## ── Variable type:numeric ───────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25
## acousticness 0 32833 32833 0.18 0.22 0 0.015
## danceability 0 32833 32833 0.65 0.15 0 0.56
## energy 0 32833 32833 0.7 0.18 0.00017 0.58
## instrumentalness 0 32833 32833 0.085 0.22 0 0
## liveness 0 32833 32833 0.19 0.15 0 0.093
## loudness 0 32833 32833 -6.72 2.99 -46.45 -8.17
## speechiness 0 32833 32833 0.11 0.1 0 0.041
## tempo 0 32833 32833 120.88 26.9 0 99.96
## valence 0 32833 32833 0.51 0.23 0 0.33
## p50 p75 p100 hist
## 0.08 0.26 0.99 ▇▂▁▁▁▁▁▁
## 0.67 0.76 0.98 ▁▁▁▂▆▇▆▂
## 0.72 0.84 1 ▁▁▁▃▅▇▇▆
## 1.6e-05 0.0048 0.99 ▇▁▁▁▁▁▁▁
## 0.13 0.25 1 ▇▅▂▁▁▁▁▁
## -6.17 -4.64 1.27 ▁▁▁▁▁▁▇▃
## 0.062 0.13 0.92 ▇▂▁▁▁▁▁▁
## 121.98 133.92 239.44 ▁▁▂▇▇▂▁▁
## 0.51 0.69 0.99 ▂▅▆▇▇▇▅▃
library(tidyverse)
spotify_songs %>%
  ggplot() +
  aes(x = danceability, y = valence) +
  geom_point()
library(tidyverse)
spotify_songs %>%
  ggplot() +
  aes(x = danceability, y = valence) +
  geom_point(alpha = .04) + # near-transparent points show where observations pile up
  geom_point(data = . %>% sample_frac(.03)) # exemplars: an opaque 3% random sample
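The exemplar layer above draws a new random 3% of rows every time the document is knit, so the highlighted points differ between renders. Below is a minimal sketch of one way to make the draw repeatable, assuming a fixed seed is acceptable. The helper `sample_exemplars()` and the seed value 2020 are hypothetical choices, not part of the original analysis, and a small synthetic data frame stands in for `spotify_songs` so the block runs offline.

```r
# Hypothetical helper (not in the original document): draw a fixed
# fraction of rows with a fixed seed so the same exemplars come back each time.
sample_exemplars <- function(df, prop = 0.03, seed = 2020) {
  set.seed(seed)                             # fixes the draw; note this resets the global RNG
  n_keep <- max(1, round(nrow(df) * prop))   # keep at least one row
  df[sample.int(nrow(df), n_keep), , drop = FALSE]
}

# Synthetic stand-in for spotify_songs (assumption: only these two columns matter here)
toy <- data.frame(danceability = runif(1000), valence = runif(1000))

ex1 <- sample_exemplars(toy)
ex2 <- sample_exemplars(toy)
identical(ex1, ex2)  # TRUE: both calls return the same rows
nrow(ex1)            # 30, i.e. 3% of 1000 rows
```

Because ggplot2 accepts a function for a layer's `data` argument and calls it with the plot data, the helper could be passed directly, for example `geom_point(data = sample_exemplars)`.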
library(tidyverse)
spotify_songs %>%
  ggplot() +
  aes(x = danceability, y = valence) +
  ggpointdensity::geom_pointdensity() + # color each point by how many neighbors it has
  scale_color_viridis_c()
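To make the idea behind the point-density plot concrete, here is a rough base R sketch of neighbor counting: for each point, count how many other points fall within a given radius. It illustrates the concept only and is not the implementation used by `ggpointdensity`. The helper name `count_neighbors()`, the radius, and the toy coordinates are all assumptions made for this example.

```r
# Hypothetical helper: for each point, count the other points within `radius`.
# This illustrates the idea only; it is not ggpointdensity's implementation.
count_neighbors <- function(x, y, radius) {
  d <- as.matrix(dist(cbind(x, y)))  # all pairwise Euclidean distances
  unname(rowSums(d <= radius)) - 1   # subtract 1 so a point does not count itself
}

# Toy coordinates: two points close together and one far away
x <- c(0, 0, 1)
y <- c(0, 0.1, 1)
counts <- count_neighbors(x, y, radius = 0.2)
counts  # 1 1 0
```

The dense distance matrix grows quadratically with the number of points, so a sketch like this is only practical on a small sample, not on all 32833 songs.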
https://arxiv.org/abs/2009.03979 Journal of Computational and Graphical Statistics, 2022
Visualizing very large matrices involves many formidable problems. Various popular solutions to these problems involve sampling, clustering, projection, or feature selection to reduce the size and complexity of the original task. An important aspect of these methods is how to preserve relative distances between points in the higher-dimensional space after reducing rows and columns to fit in a lower dimensional space. This aspect is important because conclusions based on faulty visual reasoning can be harmful. Judging dissimilar points as similar or similar points as dissimilar on the basis of a visualization can lead to false conclusions. To ameliorate this bias and to make visualizations of very large datasets feasible, we introduce two new algorithms that respectively select a subset of rows and columns of a rectangular matrix. This selection is designed to preserve relative distances as closely as possible. We compare our matrix sketch to more traditional alternatives on a variety of artificial and real datasets.
Leland Wilkinson, Hengrui Luo
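The abstract describes selecting a subset of rows so that relative distances are preserved. The sketch below does not reproduce the paper's algorithms. It swaps in a simpler, well-known stand-in, greedy farthest-point (maximin) selection, to show what a distance-aware alternative to random sampling can look like. The helper name `maximin_rows()`, the fixed starting row, and the toy matrix are assumptions made for illustration.

```r
# Hypothetical helper: greedy farthest-point (maximin) row selection.
# This is a stand-in technique, NOT the algorithm from the paper.
maximin_rows <- function(X, k) {
  X <- as.matrix(X)
  # Euclidean distance from every row of X to row i
  dist_to <- function(i) sqrt(rowSums(sweep(X, 2, X[i, ])^2))
  chosen <- integer(k)
  chosen[1] <- 1L         # arbitrary, deterministic starting row
  min_d <- dist_to(1L)    # distance from each row to its nearest chosen row
  for (j in seq_len(k)[-1]) {
    chosen[j] <- which.max(min_d)             # row farthest from everything chosen so far
    min_d <- pmin(min_d, dist_to(chosen[j]))  # update nearest-chosen distances
  }
  chosen
}

# Toy matrix: four points on a line at 0, 0.1, 1 and 0.5
X <- rbind(c(0, 0), c(0.1, 0), c(1, 0), c(0.5, 0))
picked <- maximin_rows(X, k = 3)
picked  # 1 3 4
```

Unlike `sample_frac()`, which picks rows uniformly at random, this selection spreads the chosen rows across the occupied region, so the near-duplicate point at 0.1 is skipped in favor of points that cover more of the space.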