Time is money, money is power, power is pizza, and pizza is knowledge.

I’m not proud to admit it. I was recently in a pub and pulled out my phone to calculate the area of a set of pizzas we were ordering. There were two sizes of pizza on the menu – 12″ and 18″, for £12 and £18 each – and my group were going to order five 12″ pizzas.

I couldn’t let that slide.

“Nah, let’s get three 18″ pizzas instead”, I said. “More pizza, less money.”

Intuitively, it feels wrong. Yes, an 18″ pizza is bigger, but surely five 12″ pizzas are more than only three 18″ pizzas? This is when I worked out the total area of the pizzas and their price per square inch, and for the record, buying three 18″ pizzas at £18 gets you 763 sq inch of pizza for £54, whereas buying five 12″ pizzas at £12 gets you 566 sq inch of pizza for £60. We bought the big ones.

This got me thinking about how to visualise this in Tableau, and I created the following handy dashboard for settling your pub pizza problems (click here for desktop version, or here for mobile version):

mobile pizza dash.png

This was a fun one to build, not least because there’s basically no underlying data. I knew I wanted the inputs for the calculations to depend entirely on parameters, so all I needed was the names of the two pizzas. This is the spreadsheet I created:



The next thing was to create three parameters per pizza; diameter, price, and quantity. The values entered here can be passed to the data with a case statement, e.g.:


…or directly within a formula:


Things were starting to get complicated, so I built out a table to make sure the numbers were working properly:


From here on out, it was relatively straightforward. It’s really hard to accurately judge area when looking at circles and squares (which is one of the reasons I hate packed bubbles and tree maps), so I started out by visualising the pizzas as actual circles vs. bar graphs.

This makes the two pizzas look relatively similar in size:


…whereas this shows that one of them is more than double the other:


One useful little trick I found was to create a calculated field to place the two pizza icons in specific places:


This spaced them out nicely rather than piling them on top of each other or putting them side by side and too close together:


The most fiddly bit was creating the text summary. I dragged a lot of calculated string fields onto the label shelf, and configured it like this:


And an example of some of the fields going into making a sentence which automatically updates is the best pizza deal field here:


I really enjoyed playing around with this viz. People who use Tableau often talk about liberating your data, but this was a nice exercise in what’s possible with basically no data at all!

April pizza


Paint your target: how to create a bullseye graph in Tableau

Ever thought of tracking whether something’s hitting the target by showing an actual target? I was looking through some old radial blogs recently, and realised you could use a scatterplot on x-y coördinates to show accuracy on a bullseye target.

(to cut straight to the viz on Tableau Public, go here; to find out how to create it, keep reading!)

For example, you could set up an image of a target a bit like this as a background image:


…and plot simple x-y points in Tableau over the top to show how close people/departments/countries are to meeting their target:


I used MS Publisher to design these concentric circles in the background image because when you select them all and save as an image, it doesn’t save the blank space as white, so you can use it with any colour worksheet or dashboard. Feel free to download this target image here (and if you use it, give me a shout! I’d love to see what kind of cool things you’re using it for).

Sadly, Superstore doesn’t have target data on it, so I’ve mocked up a quick dataset of sales and target sales per country as follows:


The next thing to do is calculate the “accuracy” on the bullseye; the nearer the sales figure is to the target, the closer to the middle of the bullseye it should be. However, once you’ve met or exceeded your target, you don’t want the points to keep moving. Countries with sales at 101% of the target and at 300% of the target should both still be in the middle.

The exact middle of the target is going to be at coördinates (0,0), which means that we actually want to create a field which takes some kind of inverse of the accuracy, where the greater the accuracy, the smaller the number in that field.


So, let’s create the following calculated fields. I’ve adapted them a bit from this blog on radial bar charts so that you can put the X field on columns and the Y field on rows, which is more intuitive.


IF [Sales] > [Target] THEN 1.05
ELSE [Sales] / [Target]

Radial Field:

1.1 - [Accuracy]

Why have I set the Accuracy equaliser bit to 1.05 instead of 1? And why have I set the Radial Field calculation to 1.1 instead of 1?

Well, it’s a bit fiddly, but it’s about plotting. I want everything where Sales is 100% of Target or more to be in the gold bullseye, so I want that to fill a certain amount of space. If I set the Accuracy equaliser bit to 1 and the Radial Field calculation part to 1, then it plots everything that’s at 100% or more at (0,0) and includes everything that’s 90% or more in the gold bullseye. I want to space out the 100% or more points so that they’re not on top of each other, and I want only the 100% or more points to be included in the gold bullseye. Setting the Radial Field calculation part to 1.1 makes it so that the edge of the gold bullseye denotes 100%. That means that the 100% or more points will be plotted on the edge of the bullseye. So, the 1.05 part in the Accuracy calculation moves those points further inside the bullseye, but not into the exact middle where they’ll be on top of each other.

While I’m at it, there’s a limitation to the Accuracy calculation. Have you spotted how it’s [Sales] > [Target] rather than SUM([Sales]) > SUM([Target])? This is because the angle calculation needs to aggregate the Radial Field further, and will break if the Accuracy or Radial Field calculations are already aggregated. This means that you’ll probably need to do some data processing to make sure that it’s just one row per thing in the dataset.

The next fields to calculate are:

Radial Angle:

(INDEX() -1 ) *
(1/WINDOW_COUNT(COUNT([Radial Field])))
* 2 * PI()


MAX([Radial Field]) * SIN([Radial Angle])


MAX([Radial Field]) * COS([Radial Angle])

Now you can drag X onto columns, Y onto rows, and put Country on detail:


This looks pretty rubbish, but that’s because the INDEX() function in the Radial Angle calculation is a table calc, and Tableau needs to know how to compute it. Edit the X and Y fields to compute using Country instead:


…and now you’ll get some points spaced out properly:


The calculations mean you can plot the points in a radial way; it’ll go through whatever field you’ve put on detail, and plot the points the right distance away from the centre, starting at 12 o’clock and looping round clockwise. With Country, it’s done alphabetically, so Belgium is on the X axis zero line. If you want to order the points differently, you can add a numeric field to the detail shelf, and change the table calculation to compute using that numeric field (but make sure to keep Country there!).

The next thing to do is to put the target in as a background image. Confusingly, this is under the Map options. Once you’ve found it, you have to specify exactly where the background image is going to sit on the x-y grid of coördinates. With this target, there are four concentric circles, denoting 70%, 80%, 90%, and 100%+. Because I set the Radial Field to be 1.1 – [Accuracy], there’s a distance of 0.1 in a circle around point (0,0) and the edge of the bullseye denoting 100%. That means that the bullseye section is going to run from -0.1 to 0.1 on both axes. Counting backwards, there are three other circles which are the same thickness; the light grey one denoting 90% will run from -0.2 to 0.2 on both axes, the mid grey one denoting 80% will run from -0.3 to 0.3 on both axes, and the dark grey one denoting 70% will run from -0.4 to 0.4 on both axes. As the edge of the dark grey circle is the edge of the background image, this is where you need to tell Tableau to position the background image:


Click OK, and there we are! A nice target, with the different points on it:


You can change the formatting to remove the zero lines if you like:


…but I actually kinda like them on this graph. I’ve also put Country on the colour shelf, and made the borders around the circles black.


I haven’t seen this approach before, so I’m not really sure what to call it. I’ve plumped for a bullseye graph, but maybe it already exists under another name. Let me know if somebody else has covered this, and definitely let me know if you find this useful! You can download the Tableau workbook I used to make this example here (it’s 10.2, by the way, but the same approach should work for older versions).


The Colour Revolt: using rank calculations to get around Tableau’s colour defaults

Tableau’s colour palettes are generally pretty good, pretty flexible, and pretty… well, pretty. But sometimes you want to colour-code a dimension with more values than the number of colours in the palette, and this is when Tableau has a bit of a problem.

This blog is about how Tableau assigns colours, how that messes some things up, and how to get around that.

I’m working with the Global Indicators dataset in Tableau 10.2 here. Let’s say I want to plot each country’s CO2 emissions each year. Without using colours at all, we get this:


Not exactly ideal, so I’ll put the Country dimension on the colour shelf. But before I do, I’m just going to take the CO2 Emissions measure off the view so that we can see how Tableau assigns colours to values in a dimension:


Tableau takes all values in the dimension in order, and loops through the colour palette (in this case the 20-shaded automatic palette), assigning the colours in order and restarting every 20 values.

As there are about 200 countries but only 20 colours, ten or so countries will be represented by the same colour. This might not be a problem, depending on what we’re plotting, but then again, it might. Let’s put the CO2 Emissions measure back into the view:


Yeah, it’s a problem. It just so happens that China and the USA are a multiple of 20 apart in the dimension list, so they’ve been assigned the same brown colour. This is a problem.

Even more irritatingly, Tableau assigns the colour palette to all possible values in the dimension before you filter them. So, if I change the filter so that it only has China and the USA…


…they’re still the same colour.

How can we assign colours to only the countries selected by the filter in the view, rather than all countries?

One solution (thanks Anna for the tip!) is to create a calculated Rank() field, which will only look at the countries in the view. I’ll stick with a default rank of the countries themselves for now, which will do it alphabetically, but you can also do it by how they do on a particular measure.


Make this new calculated field discrete, and drag it onto the colour shelf. You’ll also need to separate out the countries by putting Country on detail (which was previously done automatically when Country was on colour).


This has changed the colours… but they’re still the same. This is because Rank() is a table calculation, and we need to change its default settings. Instead of compute using Table (across), compute using Country:


This now ranks the countries alphabetically (in reverse order, but it doesn’t really matter for this purpose) and assigns a different colour to each of them:


Now we have nicely distinguishable colours for dimensions with loads of values when we actually only want to plot a handful!

The colour marks card is a bit annoying, though; it only shows the rank number next to the colour, not the country name. You can get around this by creating a new sheet, applying the country filter to that sheet too, and just putting Country in rows:


You can now use this second sheet as a colour legend in a dashboard.

Of course, you might suggest that a simpler solution to the colour problem is to forget about colours altogether and just use labels on the lines. And you’d be totally correct! But maybe you’ve got a more complicated plot where you’ve got several different values for the same country or something. Like a scatterplot of two measures with a circle for each year:


Labels would look pretty hideous on this, especially if the circles overlapped a lot. But the Rank() colour workaround still applies.


Quantifying three years of a long distance relationship

I read two really useful guides to processing text data recently; an analysis of Trump’s tweets to work out whether it’s him or an intern sending them, and a sentiment analysis of Pride and Prejudice. Three years of a long distance relationship means that I have a nice big corpus of Whatsapp messages between my girlfriend and me, so I did the romantic thing and quantified some of our interactions in R. Also, this required quite a bit of text munging in Excel first, which turned out to be far quicker and easier than using regex in this case.

First of all, let’s look at when we text each other throughout the day. We’re in different time zones, but only by an hour, and since texts are inherently dependent – one text is overwhelmingly likely to lead to another pretty soon after – I haven’t adjusted the times.

text no by hour of day.png

Our texting activity represents our general activity pretty well; nothing much going on until about 7am, then a slow start to the day, a bit of a post-lunch dip, and then an evening peak when we’re more likely to be out and about doing things.

We can also have a look at how many messages we send each other, and how that’s changed over time:

text no by date.png

We’ve sent each other a fairly similar number of texts per day throughout the long distance period, but it looks pretty bad on me that I have consistently sent fewer texts than her…

…or does it? When I plot the length of each text sent, I consistently write longer messages:

text length by date.png

So, there’s two distinct texting styles here; I write longer messages less frequently, she writes shorter messages more frequently. The other thing I like about the text length graph is that you can see the times when we’ve been together and not texted each other that much; three weeks in November 2014 when I was running experiments in London, three weeks around Christmas 2015, and a load of long weekends throughout. It’s not that we don’t text each other at all then, it’s more that those texts tend to be stuff like “have we got milk?”, or simply “pub?”.

Plotting log likelihood ratios of how much each of us uses each word in comparison to the other also captures our texting styles:

top 20 words each (no names).png

For example, we both use the word /ha/ to express laughter, but I spell it “ha” and she spells it “hah”. Likewise, “til” and “till” as abbreviations for “until”, and I seem to use “somebody” while she uses “someone”.

If we filter out equivalent words and proper names (like the pubs, supermarkets, and stations we go to most often), another difference in dialogue style appears:

top 10 words each (no proper names).png

I am apparently a lot more conversational; I write out interjections (hmm, oooh, hey, ohhh) and reactions (fuck’s comes from for fuck’s sake, hoera comes from the Dutch phrase hiep hiep hoera, and boourns comes from, erm, The Simpsons). Apart from hhmmm, she doesn’t write interjections or contextual replies at all. Apart from the interjections and replies, my main thing is adjectives; she tends towards nouns and verbs.

The next step is sentiment analysis. If I plot log likelihood bars for each sentiment, I seem to be an atrociously negative person:

sentiment error bars.png

…but this, I think, is more a problem with the way sentiment analysis works in the syuzhet and tidytext packages using NRC sentiment data. Each word in the NRC corpus has a given value, 0 or 1, for a range of sentiments, and this sentiment analysis style simply adds it up for each word in a given set.

Because of that, it doesn’t really capture the actual sentiment behind the way we’re using these words. Let’s look at the main words driving the differences in each sentiment:

sentiment log likelihood words.pngFor me, a lot of my disgust and anger is coming from the word damn. If I was texting damn! every time I stubbed my toe or something, perhaps that would be accurate; but in this case, a lot of the time I write damn is in sympathy, as in exchanges like:

“My computer crashed this afternoon and I lost all the work I’d done today”
“Damn, that’s horrible”

Meanwhile, the word coop is actually me talking about the coöp / co-op, where I get my groceries. I’m not talking about being trapped, either physically or mentally.

The same goes for my girlfriend being more positive. With words like engagement and ceremony, she’s not joyous or anticipatory about her own upcoming nuptials or anything; rather, several of her colleagues have got engaged and married recently, and most of her uses of the words engagement and ceremony are her complaining about how that’s the only topic of conversation at the office. As for assessment, council, and teacher, she works in education. These are generally neutral descriptions of what’s happened that day.

So, I was hoping to be able to plot some sentiment analyses to show our relationship over time, but either it doesn’t work for text messages, or we’re really fucking obtuse. I think it might be the former.

Instead, I’ll settle for showing how much we both swear over time:

expletives per month.png

Each dot represents the number of occurrences per month of a particular expletive. I’m clearly the more profane here, although I do waver a bit while she’s fairly consistent.

More importantly is how we talk about beer a similar amount:

beer per month.png

Since couples who drink together stay together (or in the words of this study, “concordant drinking couples reported decreased negative marital quality over time”), I think this bodes pretty well for us.

R, Uncategorized

Visualising football league tables

I was looking at the Premiership league table today, and it looks like this:

current league table

It’s pretty informative; we can see that Leicester are top, Aston Villa are bottom, and that the rest of the teams are somewhere in between. If we look at the points column on the far right, we can also see how close things are; Villa are stranded at the bottom and definitely going down, Leicester are five points clear, and there’s a close battle for the final Champions League spot between Manchester City, West Ham, and Manchester United, who are only separated by a single point.

Thing is, that requires reading the points column closely. If you take the league table as a simple visual guide, it doesn’t show the distribution of teams throughout the league very well. If you say that Stoke are 8th, that sounds like a solid mid-table season… but what it doesn’t tell you is that Stoke are as close to 4th place and the Champions League as they are to 10th place, which is also solid mid-table. A more visually honest league table would look something a little like this*:

current league table dragged about a bit

*definitely not to scale.

Screen-shotting a webpage and dragging things about in MS Paint isn’t the best way to go about this, so I’ve scraped the data and had a look at plotting it in R instead.

Firstly, let’s plot each team as a coloured dot, equally spaced apart in the way that the league table shows them:

League position right now

(colour-coding here is automatic; I tried giving each point the team home shirt colours, but just ended up with loads of red, blue, and white dots, which was actually a lot worse)

Now, let’s compare that with the distribution of points to show how the league positions are distributed. Here, I’ve jittered them slightly so that teams with equal points (West Ham and Manchester United in 5th and 6th, Everton and Bournemouth in 12th and 13th) don’t overlap:

League points right now

This is far more informative. It shows just how doomed Aston Villa are, and shows that there’s barely any difference between 10th and 15th. It also shows that the fight for survival is between Norwich, Sunderland, and Newcastle, who are all placed closely together.

Since the information is out there, it’d also be interesting to see how this applies to league position over time. Sadly, Premiership matches aren’t all played at 3pm on Saturday anymore, they’re staggered over several days. This means that the league table will change every couple of days, which is far too much to plot over most of a season. So, I wrote a webscraper to get the league tables every Monday between the start of the season and now, which roughly corresponds to a full round of matches.

Let’s start with looking at league position:

League position over time

This looks more like a nightmare tube map than an informative league table, but there are a few things we can pick out. Obviously, there’s how useless Aston Villa have been, rooted to the bottom since the end of October. We can also see the steady rise of Tottenham, in a dashing shade of lavender, working their way up from 8th in the middle of October to 2nd now. Chelsea’s recovery from flirting with relegation in December to being secure in mid-table now is fairly clear, while we can also see how Crystal Palace have done the reverse, plummeting from 5th at the end of the year to 16th now.

An alternative way of visualising how well teams do over time is by plotting their total number of points over time:

League points over time

This is visually more satisfying than looking at league position over time, as we can see how the clusters of teams in similar positions have formed. Aston Villa have been bottom since October, but they were at least relatively close to Sunderland even at the end of December. Since then, though, the gap between bottom and 19th as opened up to nine points. We can also see how Leicester and Arsenal were neck and neck in first and second for most of the season, but the moment when Leicester really roared ahead was in mid-February. Finally, the relegation fight again looks like it’s a competition between Norwich, Sunderland, and Newcastle for 17th; despite Crystal Palace’s slump, the difference between 16th and 17th is one of the biggest differences between consecutive positions in the league. This is because Norwich, Sunderland, and Newcastle haven’t won many points recently, whereas Swansea and Bournemouth, who were 16th and 15th and also close to the relegation zone back in February, have both had winning streaks in the last month.

One of the drawbacks with plotting points over time is that, for most of the early part of the season, teams are so close together that you can’t really see the clusters and trends.

So, we can also calculate a ratio of how many points a team has compared to the top and bottom team at any given week. To do this, I calculated the points difference between top and bottom teams each week, and then calculated every team’s points as a proportion of where they are.

For example, right now, Leicester have 66 points and Aston Villa have 16. That’s a nice round difference of 50 points across the whole league. Let’s express that points difference on a scale of 0 to 1, where Aston Villa are at one extreme end at 0 and Leicester are at the other extreme end at 1.

Tottenham, in 2nd, have 61 points, or five points fewer than Leicester and 45 points more than Aston Villa. This means that, proportionally, they’re 90% along the points difference spectrum. This means they get a relative position of 0.9, as shown below:

Relative league position over time

This is a lot more complicated, and perhaps needlessly so. It reminds me more of stock market data than a football league table. I plotted it this way to be able to show how close or far teams were from each other in the early parts of the season, but even then, the lines are messy and all over the place until about the start of October, when the main trends start to show. One thing that means is that however badly your team are doing in terms of points and position, there’s little use in sacking a manager before about November; there’s not enough data, and teams are too close together, to show whether it’s a minor blip or a terminal decline. Of course, if your team are doing badly in terms of points and position and playing like they’ve never seen a football before, then there’s a definite problem.

To make it really fancy/silly (delete as appropriate), I’ve plotted form guides of relative league position over time. Instead of joining each individual dot each week as above, it smooths over data points to create an average trajectory. At this point, labelling the relative position is meaningless as it isn’t designed to be read off precisely, but instead provides an overall guide to how well teams are doing:

Relative league position over time smooth narrative (span 0.5)

Here, the narratives of each team’s season are more obvious. Aston Villa started out okay, but sank like a stone after a couple of months. Sunderland were fairly awful for a fairly long time, but the upswing started with Sam Allardyce’s appointment in October and they’ve done well to haul themselves up and into contention for 17th. Arsenal had a poor start to the season, then shot up, rapidly to first near the end of the year, but then they did an Arsenal and got progressively worse from about January onwards. Still, their nosedive isn’t as bad as Manchester City’s; after being top for the first couple of months, they’ve drifted further and further down. It’s more pronounced since Pep Guardiola was announced as their next manager in February, but they were quietly in decline for a while before that anyway. Finally, looking at Chelsea’s narrative line is interesting. While they’ve improved since Guus Hiddink took over, their league position improvement is far more to do with other teams declining over the last couple of months. Four teams (Crystal Palace, Everton, Watford, and West Brom) have crossed Chelsea’s narrative line since February.

I don’t expect these graphs to catch on instead of league tables, but I definitely find them useful for visualising how well teams are doing in comparison to each other, rather than just looking at their position.

Cricket, R

Bigger isn’t always better – the case of the first innings in cricket

I’ve got an unsubstantiated hunch (the best kind of hunch!) about cricket. Well, not just one, I have loads, but this particular hunch is about the first innings of a cricket match, and that bigger isn’t always better.

I greatly enjoyed following England’s first innings against South Africa in the second Test in Cape Town. But, even with the high run rate while Stokes was smashing it everywhere, I was convinced that the higher that first innings got, the less likely we’d be to win it. This goes against the received wisdom in cricket, which is that the bigger the first innings score, the better it is.

So, I’ve had a look at all first innings scores in Tests from 1990 until now (there’s just over a thousand of them). Here’s simple density plot of the distributions of runs scored in the first innings per match result:

density plot of runs

What this seems to show is that there’s a limited sweet spot from just over 380 runs to about 500 runs where a win is the most likely result. Once a team scores over about 500 runs in the first innings, the most likely match result is a draw.

Part of that is probably because of how much time posting a huge first innings takes out of the game. What happens when we look at runs scored vs. balls taken in the first innings?

scatter plot of runs and balls simple

There’s a green cluster in the middle between about 350 and 550 runs and between about 700 and 800 balls. That, I reckon, is the sweet spot for the perfect first innings: scoring a high but not massive number of runs, without taking too much time. England took 755 balls (125.5 overs) in their first innings in Cape Town, so a win was still just about the most likely result there… but, this may just be an exception. We’ll see.

Here’s the same plot with some lines showing a run rate of 2, 3, and 4 runs per over (the steeper the line, the greater the run rate):

scatter plot of runs and balls

Visually, I’m convinced the sweet spot of 380-500 runs at a decent run rate is obviously there. So, let’s try looking at some simple percentages by comparing scores between 380-500 runs with scores over 500 runs, where runs are scored at over 3.5 runs an over:

Run rate over 3.5, runs between 380 and 500
won draw lost        = 62.32% win rate
43     16     10          = 2.69 win:draw ratio

Run rate over 3.5, runs over 500
won draw lost        = 54.29% win rate
57     47      1           = 1.21 win:draw ratio

The win rate goes down slightly for the higher scores, and the win:draw ratio goes down too. i.e. even if you’re scoring well, going beyond 500 just makes the draw more likely and doesn’t actually help your chances of winning.

But, that’s not quite a fair comparison. I said earlier that if you’re going to score more runs, you have to do it at a higher run rate, so comparing all scores above 3.5 an over isn’t exactly fair. Let’s now compare a good score at a good run rate with a high score at a high run rate. Again, I’m taking a good score to be 380-500 and a high score to be over 500. In terms of run rate, I’m quantifying a good run rate as between the mean run rate of all innings and the mean plus one standard deviation (i.e. between 3.13 and 3.72 runs per over), and a high run rate as above the mean plus one standard deviation (i.e. above 3.72 runs per over).

So, is a score of 380-500 at 3.13-3.72 runs per over better than a score of 500+ at 3.72+ ?

380-500 runs at 3.13-3.72 RPO (mean runs: 438 , mean RPO: 3.40)
won draw lost        = 56.10% win rate
46    20     16          = 2.3 win:draw ratio

500+ runs at 3.72+ RPO (mean runs: 587, mean RPO: 4.90)
won draw lost        = 57.14% win rate
44    32     1             = 1.375 win:draw ratio

…the lower, slower score isn’t better, but it isn’t worse either. The likelihood of winning stays the same; the only difference is that batting on makes losing much less likely and drawing much more likely.

This is really counterintuitive, and I find it hard to wrap my head around the fact that scoring 438 at 3.4 an over is about as likely to result in a win as scoring 587 at 4.9 an over. One possibility is that the matches which feature high first innings scores are played on absolute roads, like in the 1997 Colombo snoozeathon between India and Sri Lanka, meaning that a high second innings score is also pretty likely. Therefore, you’d expect the first and second innings scores to correlate in matches where the first innings was 500+ runs at 3.72+ RPO… but they don’t (r=0.07, p=0.52). Nor do the first and second innings scores correlate in matches where the first innings was between 380-500 runs at 3.13-3.72 RPO (r=-0.15, p=0.18). The only indication that a massive first innings score may mean that the pitch is easier to bat on is that the mean second innings score in response to a massive first innings score is 346.90, while the mean second innings score in response to a good first innings score is 307.09. A t-test between the two set of second innings scores is “relatively significant” (as an ever-hopeful colleague of mine used to say) with a p-value of 0.07, but that doesn’t cut it. This is another mystery for another blog post.

Right, back to looking at just the first innings scores and win rate. One last way of exploring this is by creating a matrix of win rates in bins of runs scored and run rate.

I’ve put all innings into bins of 50 runs and bins of 0.5 RPO. This means that every square in the following graphs is represented by a set of matches where that many runs have been scored at that rate. It’s only done for bins with at least five matches in (because you can’t really extrapolate from things where only one or two matches have happened, as that leads to a lot of 0% and 100% win rates).

This graph visualises the win rate per bin; the darker the green, the greater the likelihood of winning based on that kind of first innings:

rough matrix of runs, RPO, win rate - five matches or more, cropped

But what if, instead of plotting the simple win likelihood for all bins, we plot the most likely result based on that bin, along with the likelihood of that result? In this graph, the colour represents the top result – win, draw, or loss – and the intensity of that colour represents the likelihood – the more intense the colour, the more likely that result:

rough matrix of runs, RPO, top result, rate, cropped

In both matrices, the sweet spot with the most green and the most intense green falls within 400 and 500 runs… although it turns out that in terms of overall win likelihood, the best first innings is to score between 500 and 550 runs, scored at over 4 runs per over.

Ultimately, what this shows is that batting on past 500 or so makes losing the match hugely unlikely (but definitely not impossible), so if safety first is your watchword, have at it. However, if you want to win a Test match, there’s not much point in batting on past 500 or so in the first innings, 550 at most, no matter how fast you score (and if you do decide to go for the big imposing total, you’d better hurry up about it). Ben Stokes might have set a load of records, but with a bit of statistical sleuthing, he’d have realised it was pointless because his batting blitz was actually just making it harder for England to win.

Why bother creating these incredible cricketing memories when the statistics say hold back?

…because it’s much more entertaining. If you focus on the statistics all the time, you end up with a team like England under Peter Moores, where nobody knows anything before they’ve looked at the data. Fair enough, then.

R, Science in general

scatterplot / dotplot / losttheplot

I’m not sure how to game search engine optimisation algorithms, but hopefully you’ll end up here if you’ve googled “things that are better than histograms” or “like scatter plots but with groups and paired and with lines” or “Weissgerber but in R not Excel” or something similar.

Anyway. Weissgerber et al. (2015) have a fantastic paper on data visualisation which is well worth a read.

(tl;dr version: histograms are dishonest and you should plot individual data points instead)

Helpfully, Weissgerber et al. include instructions for plotting these graphs in MS Excel at the end should you wish to give it a go. But, if MS Excel isn’t your bag, it’s easy enough to try in R…

…apart from the fact that nobody really agrees on what to call these plots, which makes it really hard to search for code examples online. Weissgerber et al. refer to them as scatterplots, but in most people’s minds, scatterplots are for plotting two continuous variables against each other. Other writers refer to them as dotplots or stripplots or stripcharts, but if you don’t know the name, you don’t know that this is what you’re looking for, and all you can find is advice on creating different graphs from the ones you want.

JEDI KNIGHT - these aren't the scatterplots you're looking for

As an example, here’s some of my own data from a behavioural task in which participants had to remember things in two different conditions. The histogram with 95% confidence intervals makes it fairly clear that participants are more accurate in condition one than condition two:

accuracy for each condition in percent

The scatterplots / dotplots / whateverplots also show the distribution of the data quite nicely, and because it’s paired data (each participant does both conditions), you can draw a line between each participant’s data point and make it obvious that most of the participants are better in condition one than in condition two. I’ve also jittered the dots so that multiple data points with the same value (e.g.the two 100% points in condition_one) don’t overlap:

accuracy for each condition in percent - jitterdots

It’s easy to generate these plots using ggplot2. All you need is a long form or melted dataframe (called dotdata here) with three columns: participant, condition, and accuracy.

dotdata$condition<- factor(dotdata$condition, as.character(dotdata$condition))
# re-order the levels in the order of appearance in the dataframe
# otherwise it plots it in alphabetical order
ggplot(dotdata, aes(x=condition, y=accuracy, group=participant)) +
  geom_point(aes(colour=condition), size=4.5, position=position_dodge(width=0.1)) +
  geom_line(size=1, alpha=0.5, position=position_dodge(width=0.1)) +
  xlab('Condition') +
  ylab('Accuracy (%)') +
  scale_colour_manual(values=c("#009E73", "#D55E00"), guide=FALSE) +