scatterplot / dotplot / losttheplot

I’m not sure how to game search engine optimisation algorithms, but hopefully you’ll end up here if you’ve googled “things that are better than histograms” or “like scatter plots but with groups and paired and with lines” or “Weissgerber but in R not Excel” or something similar.

Anyway. Weissgerber et al. (2015) have a fantastic paper on data visualisation which is well worth a read.

(tl;dr version: histograms are dishonest and you should plot individual data points instead)

Helpfully, Weissgerber et al. include instructions for plotting these graphs in MS Excel at the end should you wish to give it a go. But, if MS Excel isn’t your bag, it’s easy enough to try in R…

…apart from the fact that nobody really agrees on what to call these plots, which makes it really hard to search for code examples online. Weissgerber et al. refer to them as scatterplots, but in most people’s minds, scatterplots are for plotting two continuous variables against each other. Other writers refer to them as dotplots or stripplots or stripcharts, but if you don’t know the name, you don’t know that this is what you’re looking for, and all you can find is advice on creating different graphs from the ones you want.

As an example, here’s some of my own data from a behavioural task in which participants had to remember things in two different conditions. The histogram with 95% confidence intervals makes it fairly clear that participants are more accurate in condition one than condition two:

The scatterplots / dotplots / whateverplots also show the distribution of the data quite nicely, and because it’s paired data (each participant does both conditions), you can draw a line between each participant’s data point and make it obvious that most of the participants are better in condition one than in condition two. I’ve also jittered the dots so that multiple data points with the same value (e.g.the two 100% points in condition_one) don’t overlap:

It’s easy to generate these plots using ggplot2. All you need is a long form or melted dataframe (called dotdata here) with three columns: participant, condition, and accuracy.

```dotdata\$condition<- factor(dotdata\$condition, as.character(dotdata\$condition))
# re-order the levels in the order of appearance in the dataframe
# otherwise it plots it in alphabetical order

ggplot(dotdata, aes(x=condition, y=accuracy, group=participant)) +
geom_point(aes(colour=condition), size=4.5, position=position_dodge(width=0.1)) +
geom_line(size=1, alpha=0.5, position=position_dodge(width=0.1)) +
xlab('Condition') +
ylab('Accuracy (%)') +
scale_colour_manual(values=c("#009E73", "#D55E00"), guide=FALSE) +
theme_bw()```
Standard