data visualisation, Tableau

Donut charts and dynamic reference banding

Donut charts aren’t everybody’s cup of tea, but I quite like them for showing a percentage against a total which has to be 100%. Things like the percentage of tickets answered within an hour, or an industrial test pass rate as a percentage, or an on time percentage.

The problem is that percentages often come with targets. If you’re measuring a rate, you’re probably measuring it to check that you’re on target. For example, if you’ve got 19.8% of tickets being answered within an hour, you’ve probably also got a target of 15% or 20% or something, and you’d probably want to show that on your donut chart for context, like this:

In Tableau, you can’t do that, not without creating some pretty filthy trigonometric calculations. But I’ve recently found a workaround which I rather like, which I’ll explain in this blog. You can download the supporting workbook from Tableau Public here.

I’ve used Superstore, which isn’t too ideal for percentages and targets, but hey, it’s something everybody uses. Let’s say you’re the head of sales for California. You know you’re a big market, and you want to keep it that way – you want 15% of all of Superstore’s sales to be in California.

You can create donut charts showing this percentage easily by creating two fields. One called [California Sales], which is:
IF [State] = “California” THEN [Sales] END

The other would be [Rest of US Sales], which is:
SUM([Sales]) – SUM([California Sales])

And you’d put it on a donut chart with those two fields as the two measure values, then put measure names on colour, and split it out by category to get something like this:

Sadly, we can’t put a reference line at the 15% mark to show the target. Not easily, at least. But what we can do is to play around with the colours. If the percentage is above the target, we could show the percentage up to the target in yellow, and then the overperformance in green, like so:

And if we adjust the target higher, we could show the percentage up to the actual percentage in yellow, and then the underperformance in red, like so:

This is a little complicated. It requires a few extra calculations; [California Sales Percentage], [Target Distance], [California Sales Base], [Rest of Sales], [California Sales Over], and [California Sales Under]. Let’s go through the logic one by one.

[California Sales Percentage]
In this calculation, you take the existing [California Sales] field that you’ve made, and found out what that is as a percentage of all sales. It’s simply:
SUM([California Sales]) / SUM([Sales])

[Target Distance]
This is how far from the target the California Sales Percentage is. I’ve used [Target] as a parameter to make it adjustable, but you could also hardcode it. It’s simply the California Sales Percentage minus the target; so, if you’ve got an actual % of 21%, and your target is 15%, then the Target Distance will be 6%. It’s simply:
[California Sales Percentage] – [Target]

[California Sales Base]
This calculation will be what’s in yellow in the donut. If your California Sales Percentage is above the target, then you’ll want it to be yellow up to the target, and then green above that, so this base field will simply be the target. If your California Sales Percentage is below the target, then you’ll want it to be yellow up to the actual sales percentage, and then red for the space between the percentage and the target. So, you can calculate it like this:
IF [Target Distance] > 0 THEN ([Target] * SUM([Sales]))
ELSE SUM([California Sales]) END

[Rest of Sales]
This is the bit in grey. If your California Sales Percentage is above the target, then you’ll want it to be grey from the actual sales up to 100%. If your California Sales Percentage is below the target, then you’ll want it to be grey from the target value up to 100%. That can be calculated like this:
IF [Target Distance] < 0 THEN
SUM([Sales]) – ([Target] * SUM([Sales]))
ELSEIF [Target Distance] > 0 THEN
SUM([Sales]) – SUM([California Sales])
END

[California Sales Over]
This is the bit in green. If your California Sales Percentage is above the target, then you’ll want it to be green between the target and the actual sales percentage. If it’s below target, you don’t want it to show up at all, so set it to zero like this:
IF [Target Distance] > 0 THEN
SUM([California Sales]) – ([Target] * SUM([Sales]))
ELSE 0 END

[California Sales Under]
Finally, this is the bit in red. If your California Sales Percentage is below the target, then you’ll want it to be red between the actual sales percentage and the target. If it’s above target, you don’t want it to show up at all, so set it to zero like this:
IF [Target Distance] < 0 THEN
([Target] * SUM([Sales]))-SUM([California Sales])
ELSE 0 END

Okay! Now we’re ready to build our donuts. This is the easy bit.

Build out your donuts like normal, like this:

Now, instead of the current two measure values, we’ll want all four of the colour ones:

For this one, I’ve set the target to 20% so that there are examples of categories that are above and below target, all in one view.

And that’s it! It’s not quite a reference line, but it’s a nice way of showing a percentage performance against a target within a donut chart. As a reminder, you can download my workbook here:
https://public.tableau.com/profile/gwilym#!/vizhome/Dynamicreferencebandingondonutcharts/Donutchart-dynamicreferencebanding

Standard
data visualisation, football

Simulating football leagues: the “magical” 40-point mark.

Every time a new Premier League season starts, somebody, probably a manager for a newly-promoted side, says they’ll only relax once they’ve hit the magical 40-point mark. Claudio Ranieri famously kept banging on about aiming for 40 points and Premier League safety throughout the season when Leicester won it.

The problem with the magical 40-point truism is that it’s not really true. There are a fair few examples out there of how you’re probably safe in the Premier League with 36 or 37 points, as well as the reminder that you can still get relegated with 42 points (West Ham in 2003, which will never not be funny to this Charlton fan).

But the problem with the debunking articles is that they’re also not that accurate. They show maybe 20 seasons of data, showing the number of points the teams in 17th (safe) and 18th (relegated) got. And the frustrating and beautiful thing about football is that it’s full of variance.

Here’s an example league table I’ve generated:
(click any graph to follow through to the interactive version)

equal simulation league table 1

The thing is, all of these teams are the exact same strength. In this incredibly basic simulation of a twenty-team football league, for each of the 380 games there was an equal chance of it being a win, loss, or draw. So, Inter Random finished bottom with 33 points, and Random Albion won it with 63 points, but those two teams were perfectly equal throughout the season. It just so happened that it wasn’t Inter Random’s season.

Here’s another table:

equal simulation league table 2

This is also from the same set of simulations. Inter Random did pretty well this time, finishing 6th with 55 points, while last year’s champions Random Albion finished 19th with 37 points and got relegated. Why are they so bad this season? What happened to them? Nothing happened. Just a different roll of the die.

Let’s do this 10,000 times and look at the breakdown of points won by teams finishing in each position.

equal simulation 10k position x points.PNG

That’s a lot of variance! All of these teams are equal, and every single game had a 33.3% chance of the home team winning it, 33.3% chance of a draw, and 33.3% of the away team winning it. You’d think that this would balance out over the course of a season, but it doesn’t. A team can win the league with as many as 85 points (Random Athletic in simulation number 4349), a team can win the league with only 55 points (also Random Athletic, in simulation number 9384), a team can finish bottom with as many as 45 points (Real Random x2, Sporting Random, Random Argyle), and a team can finish bottom with only 18 points (Dynamo Random, Random United).

And if this is the amount of variance you can get between seasons when everything is equal, what happens when it’s not? Did West Ham get a particularly unlucky roll of the die when they finished 18th with 42 points, and that 36 points is going to see you to safety most of the time? Or is it that the last twenty or so seasons have been at the low end of the variance, and that in any given season, 36 points is still probably going to get you relegated? And is there even a points total where you’re definitely absolutely guaranteed not to get relegated?

On that last point, it’s technically possible to get relegated with 63 points. If two teams are completely useless and lose every single game, and the other 18 teams win every home game and lose every away game apart from the two games away to the bottom two, that means that 18 teams finish on 63 points (57 points from winning all home games, 6 points from winning two away games). One team could finish 18th on goal difference. So, really, 64 points is the real magic safety number.

But this would never realistically happen. So, I’ve also run 10,000 simulations of leagues based on real data. I took every single game from the last four years (2014/15 to 2018/19) of the big five leagues (England, Spain, France, Italy, Germany). Assuming that a team’s actual points total is a relatively good measure of a team’s actual strength – which it isn’t, as shown above, but it’s about as close as I can get – I drew random samples of 20 values for each simulation. Since Italy and Germany only have 18 teams in their top flights, I used each team’s average points per game (PPG) as their underlying team strength. This generated 10,000 realistic leagues of 20 teams of different strengths. I then grouped them into strength tiles of 0.3 points per game – the teams in the weakest tile were between 0.3 and 0.6 PPG, the teams in the strongest tile were between 2.4 and 2.7 PPG. I then compared the frequency of teams in each strength tile scoring a certain number of goals against teams in each strength tile, and sampled from those distributions for each of the 3,800,000 games in the simulations. I experimented with making the tiles smaller, but that meant that there were too few examples of games between teams of particular tiles. I also added a home vs. away boost factor.

This ended up coming out pretty realistic. For example, here’s the average number of goals that teams in each strength tile score and concede:

avg goals for and against per tile.PNG

So, what are the points distributions per position in a more realistic simulation?

accurate simulation 10k position x points.PNG

This looks pleasingly similar to the distributions in my graph of Premier League points by position. Most simulation results cluster around the middle of each band (the black line denotes the average). But at the extreme end, you can win the league with 112 points if you’re already a strong team and you outperform / get lucky, like Sporting Random did here:

highest winning total.png

…and you can also win the league with as little as 64 points if you outperform / get lucky and if the rest of the league underperform / get unlucky, like Real Random did here:

lowest winning total.png

At the bottom of the table, you can get relegated in 18th with 46 points, which is what happened to Random United, a solid midtable team who had a pretty average season… except that everybody else at the bottom of the table completely outperformed expectations / got incredibly lucky:

highest relegated total.png

This chart shows the overlap between the relegation positions and safety. There are some interesting data points at the extreme ends, but the main point is that there are a lot of simulations where a team got 33 points or fewer but finished 17th, and there are several simulations where a team got 38 points or more but still finished 18th:

relegation points.PNG

To put it another way, 93% of teams getting 40 points didn’t get relegated:

cumulative percent curves

You can explore the full interaction between points and position in this graph, where you can set a threshold. Here, this shows how often a team finishes in a particular position when getting at least 40 points – so in 5.85% of simulations, you can get 40+ points, but still finish 18th:

% of teams per position

And to work out what your safety threshold is, this graph shows how many teams end up safe or relegated based on their points total. 35 is the turning point; 50.27% of teams getting 35 points end up safe:

% of relegations per points

As a final view, here’s a breakdown of the variance in positions by each team strength tile. It shows how you can be an incredibly strong team and expect to get 2.4 to 2.7 points per game, and you’ll win the league 63% of the time, but also miss out on the top four entirely a little under 1% of the time:

strength tiles and positions

Forty points isn’t a magic number – you’re safe around 93% of the time if you get 40 points, but it’s not guaranteed.

Standard
Alteryx, data visualisation, Tableau

Eurovision Song Contest: a market basket analysis of voting patterns and international relations.

I’ve been doing a lot of market basket analysis at The Information Lab lately. Market basket analysis is a way of looking for things that people buy at the same time (or that people never buy at the same time) in order to spot trends in people’s behaviour. For example, it’s probably obvious that if somebody buys cereal, they’ll probably also buy milk. Or that if somebody buys tofu, they’re not going to be buying sausages. This is a really nice example of how it all works.

Thing is, after a while, using bread and butter or cereal and milk or sausages and tofu as an example gets kinda dry. And talking about Lego shovels and milligram-level accurate scales is sometimes a little unprofessional, even if it is a perfect example of consumer behaviour.

So, I’ve been analysing the Eurovision Song Contest. The jury votes lend themselves pretty well to market basket analysis, because they’re pretty similar to transactions: each country’s jury (or customer) votes for (or buys) ten countries (or items) at a time (in a basket), and the fact that these countries (or items) are a subset of all possible countries (or items) to vote for (or buy) means that you can make the same selection vs. non-selection distinction. And we all know that some countries always vote for some other countries, regardless of how good the song is, which is part of what makes it fun.

I took the historic Eurovision data collected by Stephan Okhuijsen of Datagraver. Then, using Alteryx, I filtered it to all contests from 1993 onwards, because European countries have been relatively consistent since then. I also filtered it to the final only, and to the jury votes only.

I set the minimum support for a rule to 0.01, which is kind of high for a regular market basket analysis using tens of thousands of SKUs in a supermarket, but works fine for such a closed set of possible choices of countries. I also set the minimum confidence to 0.05. That gave me almost 33,000 association rules, of which about 1,600 were one-to-one country mappings.

The full results are in an interactive dashboard here.

dash

In the matrix at the bottom, you can see who consistently votes for who, and it’s pretty predictable. Cyprus and Greece, for example, almost always give each other the most possible points. There’s a big love in between Moldova and Romania, and between Turkey and Azerbaijan. The Nordics are a bit too cool to give each other full marks every time, but it’s still a bit of a Scandi circle jerk. Andorra love Spain, although it doesn’t seem like that’s reciprocated. Azerbaijan have never voted for Armenia, funnily enough. And Austria have given Australia full marks twice, which I like to believe is because they were hoping to exploit a poor fuzzy matching process in the background scoring:

2 austria australia

But market basket analysis shows how countries behave as a group, where we can see how some associations are Europe-wide, and some are just confined to the two countries. For example, some of the Scandi trends are reflected in votes across Europe; if a country, any country, votes for Denmark, they’re also likely to vote for Norway:

1 denmark to norway

And surprise, surprise, countries that vote for Ukraine will also vote for Russia:

1 ukraine to russia

But the Greece/Cyprus love in is special just for them; in fact, if anything, there’s a slightly negative association between them, meaning that if a country votes for Cyprus, they’re slightly less likely to vote for Greece as well:

1 cyprus to greece

Likewise with Turkey and Azerbaijan. Just because they give each other full points all the time, other European countries don’t link the two together in their voting behaviour at all:

1 turkey to azerbaijan

Meanwhile, even though Azerbaijan will never give points to Armenia, and Armenia have only ever given one point to Azerbaijan, other European countries are far more optimistic. Maybe they hope that voting for both Armenia and Azerbaijan at Eurovision can resolve the Nagorno-Karabakh dispute. Or maybe they just don’t know anything about the Caucasus region and think they’re the same place, I don’t know.

1 azerbaijan to armenia

This is quite nice to illustrate, because the market basket analysis allows you to make the distinction; while there are some obvious associations between countries, like how Greece and Cyprus always vote for each other, it shows that those associations aren’t necessarily transferred to other countries’ voting behaviour.

Click through to the interactive version here to explore in more detail. I’m going to be using this in my teaching examples more often.

Standard
data visualisation, Tableau

Language. Sex. Violins. Other?: how to create Violin Plots in Tableau.

Are you tired of histograms? Do you look at the count distribution of your actual data points and find yourself thinking, yeah, that’s cool and all, but I wish there was a more abstract way of showing this? Then you’ll probably like violin plots. That’s these things here:

Despite their somewhat sexual connotations, violin plots can be really useful for comparing distributions of data. To be honest, if it mattered that much to me, I’d probably go for a boxplot with overlaid, mostly transparent data points… but hey, people still use these, Tableau doesn’t support them natively, and I haven’t found a full tutorial anywhere (apologies if I’ve missed one – let me know!), so here’s how to make them.

To follow along, you can download the Tableau workbook I used from my Tableau Public page here.

It’s all based around Kernel density estimation. This is maths for “take my data, smooth it out a bit, and make it so I can generalise it to data I haven’t got yet”. You can read more about that here, and I’m going to use the same set of six values used in the Wikipedia example.

Here’s what you’ll need, and here’s one I made earlier:

    1. Your data. One column with one row per observation, one column with one row per observation ID. Something a little like this:
      1. data
    2. A handy data scaffold. I’ve used a hundred points, going from zero to 99; if your data has a lot of variance, you might want to whack that up to a thousand, although that’ll make things proper slow. Either way, keep it simple; it should look like this:
      2. scaffold

Okay, nice. Stick these into Tableau, and join them with a custom join calculation so that every row in the data joins to every row in the scaffold (i.e. six rows balloons out to 600 rows here; this is why using a 1000 row scaffold isn’t pretty, performance-wise). I normally just type in “join” on both sides:

join

Also, remember that with a scaffolded dataset, simply summing your values will just multiply the value you actually want by a hundred. Watch out for that.

Okay, we’ve got our data; let’s plot the sample values we want to create a violin plot of.

plot samples.png

What we need to do is draw a kernel around each data point, like this (but better):

plot samples 2

…and add up the y-axis values of those kernels to create the overall kernel density, like this (but a lot better):

plot samples 3

This is why we need the data scaffold; you can’t draw a kernel with one point, so we need a hundred points for each point.

The first thing to do is to create an adjusted x-axis. We want the hundred points for each data point to range from the lowest to the highest value. You can do that like this (ignore the bandwidth part for now):

IF [X] = 0 THEN {MIN([Sample Value])} - [X scaling factor]
ELSEIF [X] = 99 THEN {MAX([Sample Value])} + [X scaling factor]
ELSE
({MIN([Sample Value])} - [X scaling factor]) +
(
ABS(
({MAX([Sample Value])}+[X scaling factor]) - ({MIN([Sample Value])}-[X scaling factor])
)
* ([X]/99)
)
END

Alternatively, you can see that there’s no point making the scaffolded points for the values go all the way across the range, so you could fix it on the Sample ID instead. But I found that this had a knock-on effect down the line that I didn’t like, so let’s leave this for now. If you can make it work, I’d love to hear from you.

We’ve now got a set of Adjusted X data points across the range of the data for each data point:

adjusted x range

The next step is to stick something on the y-axis so that each point goes up the required amount to draw a kernel around each data point. It’ll end up looking like this:

kernel per data point

…and the calculation required to do that is this:

1/({COUNTD([Sample ID])}*[bandwidth (wiki example)])
*
(1/(SQRT(2*PI()))) * EXP(-0.5 * (
([Adjusted X] - [Sample Value])^2)/[bandwidth (wiki example)])

This is done as a normal kernel using the standard normal density function, because that’ll probably do the job well enough for most situations. I’m not going to go into the different types of kernel functions, but you can read about them here, and if a different kernel function tickles your fancy, you can rewrite the (1/(SQRT(2*PI()))) * EXP(-0.5 * ( part of the equation with something else.

I’m also not going to go into bandwidths, because it’s complicated. There are various proper methods for choosing your bandwidth, but if you play about with it, you’ll see that setting the bandwidth too low doesn’t smooth out the curve enough, and setting the bandwidth too high smooths out the curve too much.

ezgif-4-6d764c7c16.gif

Anyway. To create the kernel density estimation for the data points, we need to sum up the individual kernels. This is the easy part in Tableau; CTRL+drag the same kernel calculation field to rows again, take Sample ID off colour/detail, sum it up, and put it on a synchronised dual axis. Voilà.

density estimate.png

This grey curve is half a violin plot on its side. But before we go into how to rotate and fill it, let’s go back to the scaling factor. I’ve kept it at 0 the whole way through, so that the x-axis runs from the smallest data point to the highest data point. That’s fine if you’re showing your actual data, but the whole point of kernel density estimates is to show a probability function… or in other words, “okay this is the data I’ve got, but what if there’s going to be more data like this, where’s it going to go?”. There may well be other values higher than your highest point or lower than your lowest point. So, I created a parameter to mess about with how far the x-axis goes, simply by adding a constant to the highest value and subtracting that same constant from the lowest value. You can adjust it as you see fit; I think setting it to 4 captures this data nicely:

density estimate

Right. That’s the maths behind a violin plot. Now to actually make one.

All we need to do is fill it and rotate it. The filling is easy; just convert it from line to area:

area 1

…but the rotation messes this right up.

area 2

So, we need to redraw it as a polygon. And to do that, we need to redo some of the calculations. Sorry about that.

Firstly, make this change to the Adjusted X calculation:

IF [X] = 0 THEN ({MIN([Sample Value])} - [X scaling factor])
ELSEIF [X] = 1 THEN ({MIN([Sample Value])} - [X scaling factor])
ELSEIF [X] = 99 THEN ({MAX([Sample Value])} + [X scaling factor])
ELSE
({MIN([Sample Value])} - [X scaling factor]) +
(
ABS(
({MAX([Sample Value])}+[X scaling factor]) - ({MIN([Sample Value])}-[X scaling factor])
)
* (([X]-1)/97)
)
END

And now make this change to your kernel calculation:

IF [X] = 0 THEN 0
ELSEIF [X] = 99 THEN 0
ELSE
1/({COUNTD([Sample ID])}*[bandwidth (wiki example)])
*
(1/(SQRT(2*PI()))) * EXP(-0.5 * (
([Adjusted X (polygon)] - [Sample Value])^2)/[bandwidth (wiki example)])
END

That should do the trick. If you’re using a bigger scaffold, remember to update the 99 to 999 and the 97 to 997! Now you can plot your polygon like this:

polygon

And if you repeat the kernel calculation, whack a minus on the front of it, and dual axis it, you can make a nice violin:

violin

These violins take a lot of formatting to make, and it’s an absolute faff to compare two separate distributions. And the LODs for finding the max and min values in the data will require you to add in a FIXED for any dimension you want in the view. They’ll also screw up filters, unless you put them in context. It is possible, though; here’s an unformatted set of violins for Sales in each Category in California using Tableau’s Superstore dataset. With some a fair bit of tidying, this could look pretty good:

violins superstore

Again, it’s not an ideal way of showing the distributions, and hopefully Tableau introduce violin plots in the same way as boxplots in a later version. But for now, this is how you’d do it if you really wanted one.

Standard
data visualisation, Tableau

We chase the waves: how to make a sinusoidal time series in Tableau.

Browsing what other people have done on Tableau Public is a great source of both challenge and inspiration. Recently, I’ve been really taken with Neil Richards’ visualisation of football league winners over time, with a beautiful sine wave showing how long it’s been since each team last won the league. I’ve no idea what to call these plots, but they’re fantastic (click image to see Neil’s original on Tableau Public).

0 Neil's thing

I’ve wanted to take these apart and see how they work for a while, and finally got round to it the other day. It turns out that Neil did a lot of the angle calculations outside Tableau, which is fair enough… so I set myself the challenge of doing it all with table calculations. I got there eventually, but it was a good workout.

[to skip the explanation and just download the workbook I’ve made, click here]

This blog is a walk-through of how to do it. Instead of football data, I’ve used official pope names; I was on a wikipedia spiral and noticed that seven of the eleven popes between 1775 and 1958 were called Pius, taking the Pius count from VI to XII. Naturally this reminded me of Barcelona’s recent league dominance, winning six of the last nine La Liga championships, so it seemed obvious to see if the chart for pope names would be similarly tightly woven.

So. This is all the input data you’ll need: a list of all popes, in order, with a record ID and a number showing how many of that name there’s been so far:

1 popes data

Come to think of it, if you’re good with INDEX() calcs, you might not even need the PopeNameNo column… but I’m not, so I do.

You’ll also need a simple scaffold sheet with 100 points, going from 0 to 99. If you’re trying to visualise something with more data points than the 267 popes I’ve got, you might want to whack up the scaffold to 999 instead.

2 scaffold data

[I’m including the elected-but-not-consecrated Stephen II in this list, because even though he’s not an official pope, all the subsequent Stephens had an increased number until relatively recently, and then it got confusing. So he’s in here.]

Read the two files into Tableau, and create a calculated join with “x” in the join field for the popes data, so that there are 100 points for each pope. Now we’re ready to do some vizzical jiggery-popery!

Although Neil’s vizzes had time on a y-axis going vertically, I’ve spent way too long looking at time series graphs for that to feel intuitive, so I’m reverting to the vanilla “time on x-axis going left” approach.  Let’s stick Pope ID on the x-axis as a continuous dimension:

3 original x axis

Great, we’ve got a line made up of lots of circles. This doesn’t make it easy to see what’s going on, so let’s filter to a single pope name – Leo will do for now:

4 filter to leo

Here’s all the Leos, in order. It was a fairly popular (pope-ular?) pope name in the later part of the first millenium, but then it fell out of favour for a while, with almost 500 years between Leo IX in 1054 and Leo X in 1513.

We want to connect these dots with a line, but if we set the mark type to line, it’ll just be a straight line. Rather, we want a curved line, like this:

5 leo annotated

This is why we’ve got the scaffold table. We can’t just connect two points with a curvy line – or at least, I can’t. Instead, we need to put a load of dots between the two main points, and connect them up. That means figuring out the x and y axis values to put those dots in the right place to connect the two main points with a nice sine wave.

To do that, we’ll need to create a new x-axis measure instead of simply Pope ID, where the distance (in units of popes) is divided by up so that the scaffold points are evenly distributed along the x-axis. But first, that means calculating the distance between popes in units of popes. We can do that with a lookup() calculation:

LOOKUP(ATTR([Pope ID]), 1)

6 next pope id calc

This is working nicely – I’ve stuck it on the tooltip, and hovering over Leo IX, who’s pope number 153 in my list, tells me that the next Leo is pope number 218.

This’ll work fine for this filtered view, but to get it to do this properly, you’ll need to put the Point field from the scaffold table on detail, and edit the table calculation to compute using Point and Pope ID:

7 point, pope id calc

At the moment, all those points are on top of each other on the Pope ID value. This isn’t what we want – we want to spread them out evenly between the Pope ID values. To do that, we’ll need this calculation here. It’s a bit long, and there’s MIN() functions everywhere because of all the table calculations, but hey:

8 x calc

Logically, what it’s doing is this:

  1. There are 100 scaffolding points, going from 0 to 99.
  2. If it’s the first one, i.e. 0, give it the same value as Pope ID. For the Leo IX to Leo X example, this is 153.
  3. If it’s the last one, give it the same value as the next Pope ID with the same name. For the Leo IX to Leo X example, this is 218.
  4. If it’s any of the rest, calculate the difference between the two Pope ID values (i.e. 218 – 153, which is 65 pope units), and then divide that by the maximum point value, which is 99 (if you made your scaffold points 1-100 instead, you’ll have to set this to maximum point value -1, not 100). This is because there’s 99 spaces to fill between all the scaffold points. Then multiply that fraction by the number of the point, and add it to the Pope ID value.

You can also copy and paste it directly from here if that makes it easier:

IF MIN([Point]) = 0 THEN MIN([Pope ID])
ELSEIF MIN([Point]) = MIN([MaxPoint]) THEN [NextPopeID]
ELSE
MIN([Pope ID]) +
(
([NextPopeID] - MIN([Pope ID])) / MIN([MaxPoint]) * MIN([Point])
)
END

Grand. Set the new x-axis value to calculate using Pope ID and Point, restarting every Pope ID, and that’s the x-axis sorted. But these points are still basically just calculating a straight line, whereas what we actually need to do is push them up the y-axis by a different amount, kind of like this:

9 x calc why

Let’s also add a direction calculation, so that the wave between the first and second goes upwards, the wave between the second and third goes downwards, and so on. We can do that by working out whether it’s an odd or even number, and setting the direction accordingly:

IF INT([Pope Name No] % 2) = 0 THEN -1 ELSE 1 END

Now let’s work on our y-axis calculation. It’s got three parts:

  1. Working out a nice sinusoidal curve
  2. Multiplying that value by how long it’s been between popes, so that the longer it is between popes, the higher the curve goes
  3. Multiplying that by the distance calculation so that it goes above or below the x-axis accordingly

The first phase of a sine wave goes from 0 on the y-axis, up to a peak of 1, and then back down to 0 between the x-axis values of 0 and π, like so:

10 sine wave explanation

In our case, we don’t want a wave between 0 pope units and 3.141… pope units; rather, we want to define the beginning and end of this phase of a sine wave to be between one pope and the next pope of the same name. So, for Leo IX to Leo X, we want 153 to be our 0 and 218 to be our π. That means taking the scaffold point, dividing it by the maximum point of 99 to get the % of the distance that that point is along the 0-to-π scale, and then multiply it by π.

That’ll give us the first phase of a sine wave of the same height (of 1) between the popes, regardless of how long it’s been between them. We want the peak to be higher the longer it’s been between popes, so we multiply it by the distance. Then we can multiply by our positive/negative direction calc. Here’s the code:

MIN(SIN([Point]/[MaxPoint] * PI()))
*
([NextPopeID] - MIN([Pope ID]))
*
MIN([Direction])

So, stick the y-axis calc on rows, set the table calculation to calculate using Pope and Pope ID, and voila! We have a nice set of sine waves between our Leos.

10 y calc

(this plot reminds me of doing Fourier transformations for EEG analysis; technically, we haven’t created this complex wave by layering up different sine waves on top of each other, but we can kind of decompose it into sets of individual sine waves as we go along)

The hard work is done now, so let’s bring the rest of our popes back in:

11 no filter all popes

Delightful. The rest of it is all about making it pretty, which I can leave to your personal tastes. But the real question is: what happens with Pius, the Barcelona of second millenium popes? Can we clearly see the era of Pius dominance?

11 pius

…yes, we can.

These graphs can be applied to basically anything that goes in a sequential order and may or may not have repeated values; this graph here is every word from my old band’s EP in order. I like how you can see where the choruses are, because the lines get more tightly woven as the words in the chorus are repeated more often.

12 sinusoidal pangolins.png

I hope this blog makes it clear how to make these graphs! I still don’t know what to call them, but in my head they’re unimaginatively down as sine wave time series. Thanks again to Neil for creating them first, and for making his workbooks downloadable and play-around-withable!

Standard
Alteryx, data visualisation, Tableau

The Gaslight Analysis: when sentiment analysis doesn’t quite work.

I love sentiment analysis. It’s a great way of getting fascinating insights from a glut of text data. You can take a load of Yelp reviews, figure out how people feel about a place, and cross-validate it with the star rating. You can take the works of Jane Austen and plot narrative arcs. You can look at the texting styles of you and your girlfriend. If you’ve got a dataset with clear sentences in standard textbook English, you can find out all sorts of things.

But, here’s the thing with language; it’s gloriously, infuriatingly messy.

That makes it really hard to do really good sentiment analysis – certainly with the free, widely-available tools. Most of those assign certain emotional values to specific words; for example, in the NRC dataset often used with the R package Tidytext, the word “alive” has associations of ANTICIPATION, JOY, POSITIVE, and TRUST, while the word “afraid” has associations of FEAR and NEGATIVE.

This approach works great for sentences like this:

“I bought these shoes last week, and they’re amazing. They feel great, and they make me feel great. Good value too! 10/10, very happy about this.”

…but it doesn’t work for sentences like this:

“I don’t feel good about this. I don’t feel good about this at all. I’d love to get out of this situation right now.”

The second sentence is pretty obviously negative, but it works by negating words. The word “good” isn’t actually good, because it’s being negated by “don’t” a couple of words earlier. And “love” isn’t a positive emotion here, as it’s expressing the desire to get out of the situation, meaning that what’s going on is not a positive thing.

It’s possible to address this with sentiment analysis, but it’s complicated. You’d have to account for every possible way of negating/reversing a word, and there’s a lot of those. You’d have to account for every possible way that a word that’s positive in isolation could actually be referring to a negative overall situation, and that’s a huge task. This is why good sentiment analysis costs a fortune. It’s really complicated.

Luckily, people tend not to speak indirectly all the time, and in aggregate, the twisting, sentiment-negating sentences are cancelled out by the number of straightforward sentences where word-level sentiment analysis does work. But the caveat is that just because you’re using sentiment analysis, that doesn’t mean you’re using it well, and you should really cross-validate it with some other measures.

I’m exploring this with lyrics from Brian Fallon’s bands – The Gaslight Anthem, The Horrible Crowes, and his solo project. I’m looking at Fallon’s lyrics because:

  1. Song lyrics are enough of a deviation from standard English to pose problems for standard sentiment analysis;
  2. At the same time, song lyrics are some of the most obviously emotional usages of language we have;
  3. Fallon writes lyrics in a pretty clear style, often in full sentences, without too many obscure metaphors or references;
  4. I really like his music.

I’ve used the Tidytext package in R for doing sentiment analysis before. This time, I’m using the same NRC sentiment dataset, but trying it out in Alteryx instead. I’ve also visualised it in Tableau, and you can click any of these images to go to an interactive link where you can play around with it yourself.

So, first things first; let’s have a look at sentiment in each song:

1

Looks pretty good so far. Here’s lookin’ at you, kid is a wistful, regretful song; definitely on the negative side, quite a bit of sadness, very little joy. Click the graph to explore other Brian Fallon songs, if you know them.

There’s a lot of different sentiment measures available, so let’s simplify it to looking at positive and negative. Here’s lookin’ at you, kid has 13 negative words, and 4 positive words. If we take difference (9) and divide it by the biggest value (13), we get a ratio of positive to negative words:

Positive – Negative
————————————
MAX(Positive, Negative)

This accounts for the difference between positive and negative words, as well as the number. For example, if one song has 10 positive words and 5 negative words, and another song has 6 positive words and 1 negative word, the difference is the same, but the second song is more positive overall, because it has far more times the number of positive words than negative words.

If we calculate this +/- balance for each song, we can order them as follows:

2

There’s a nice mix of positive and negative songs, and if you know the songs, some of them definitely feel right; Here’s lookin’ at you, kid is negative, so is Get hurt, while 45 is an upbeat, positive song. But there’s definitely some weird ones in there. We did it when we were young is a sad, regretful song, but it’s up there in the top half of positive songs. That doesn’t seem right.

So let’s cross-validate this. Spotify’s Echo Nest data has a measure called Valence, which is a measure of how positive the mood of a song is. You can get all kinds of interesting measures for your Spotify playlist here. When we plot the Spotify Valence (branching off to the left for values under 50), we get this instead:

3.png

Spotify has Here’s lookin’ at you, kid as one of the most negative songs, along with Cherry blossoms, which seems about right to me, but has We did it when we were young as a pretty neutral song, which still doesn’t feel right. Have a look at the difference with Blue jeans and white t-shirts, as well – it’s one of the most negative songs according to Spotify, but one of the most positive according to sentiment analysis. I’d put it somewhere in the middle, maybe a bit more positive than neutral.

Since I keep using my own perspective as a fan and a human, I figured I’d better cross-validate both of these stats with what fans think. I set up a simple survey where to get Brian Fallon fans to rate each song for positivity on a scale of 1 to 7, where 1 meant really negative, 4 meant neutral, and 7 meant really positive. I stressed in the introduction, several times, that it’s not a rating of how much you like each song, or how positive each song makes you personally feel (like, I really like Fallon’s sad songs because they make me feel nice… but they’re still objectively sad), but about the emotion in the song itself. Around 15-20 fans answered for each song, so I averaged their ratings together to get a human-generated emotion rating per song. It’s not the most scientific approach, but it’s good enough for the purposes of this blog.

Here’s what we get, centred around an average of 4 for neutral songs:

4

This time, Blue jeans and white t-shirts comes in as I see it – fairly positive, but not hugely so. We did it when we were young is down there in the negative range, along with Get hurt and Here’s lookin’ at you, kid.

It’s fascinating to see how the three measures agree and disagree for each song. If we rank each song along each measure (with 1 being most positive and 85 being most negative), we can see how the rank difference varies. There are five possible combinations:

  1. All measures disagree with each other
  2. All measures agree with each other
  3. Spotify valence and fan rating agree, but sentiment analysis disagrees
  4. Sentiment analysis and fan rating agree, but Spotify valence disagrees
  5. Sentiment analysis and Spotify valence agree, but fan rating disagrees

…and there’s at least one example of each:

5.15.25.35.45.5

We can see which songs are most consistently rated across all three measures by looking at the difference between each song’s highest and lowest positivity rank:

6

Blood loss is the most consistently rated, with a rank difference of only four places, while The backseat has a massive rank difference of 81 – fans put it as the second most positive song in Brian Fallon’s catalogue, while sentiment analysis rates it as 83rd, ahead of only I believe Jesus brought us together and I witnessed a crime. Spotify puts it at 24th.

Another way of showing this variation is by creating scatterplots of each measure against each other, with each dot representing a song:

7

I’ve run simple correlations on each plot – not exactly statistically kosher, but this is all just exploratory. There is no correlation between valence and sentiment analysis, and more tellingly, no correlation between sentiment analysis and fan ratings. There is a correlation between valence and fan ratings, but it’s not particularly strong.

The overall point, then, is to be careful with sentiment analysis. It’s not that it doesn’t work – it can often work really well, and be a really useful line of investigating data. But relying on sentiment analysis alone, without checking whether it matches measures that should reflect the same kind of thing, might give you some false insights. You don’t want to have Great expectations, or you might Get hurt.

Standard
Alteryx, data visualisation, football, Tableau

The growing gap between the Premiership’s Top Six and the rest.

This is my first football data blog for a while, and I feel all nostalgic! It’s nice to dive into some league table data again, and even nicer now that I have Alteryx; I was able to format my data about 10x quicker than I was in when I first started doing this in R. Then again, I’ve probably also spent 5x more time using Alteryx than R in the last year or so. Anyway.

I’ve been hearing a lot more analysis of the Top 6 in the Premiership recently. I first noticed it in the last couple of seasons, when I saw a few journalists/people on Twitter writing about a “Big Six Mini-League”. Liverpool often seemed to do quite well at this, and Arsenal often seemed to do quite badly at this. Neither team won the actual league.

I’ve started looking at how the Top 6 sides in the Premiership perform each year (using data from this fantastically well-maintained repository), and there’s quite a few interesting stories in here. The first main point is that the big clubs are accelerating away from the rest of the league. The second main point is that any big six mini-league doesn’t really matter, as you can win the Premiership with an underwhelming record against your main rivals if you trash everybody else. I mean, that shouldn’t be much of a surprise – if you’re a Top 6 team, only 30 points are on offer from matches against your rivals, but you can potentially take 84 points from the 28 matches against the rest of the league.

For all these analyses, I’m taking Top 6 literally – meaning the teams that finish that season in the top six positions. Nothing to do with net spend, illustrious history, shirt sales in Indonesia, or anything like that. I then look at the average points-per-game changes by team, position, season, and Top 6/Bottom 14 status. I also filtered out the first three seasons of the Premiership to keep it slightly easier for comparison, since there were 22 teams in the league until 1995-96.

When plotting the average points-per-game per season between the two groups, a clear trend emerges; the Top 6 are better and better at beating the rest of the league:

1

However, this trend appears to be asymmetrical. When looking at the overall average points-per-game for all games across the season, teams that finish in the Top 6 are getting better, but there’s only a negligible decline for the rest of the league. This suggests the bigger, better teams are pulling away from the rest of the league:

2

This effect is most striking when plotting the difference in overall average points-per-game between the two groups:

3

Teams finishing in the Top 6 scored around 0.6 points-per-game more than the rest of the league in the early nineties, but that’s now up to over 1 point-per-game in the latest couple of seasons. That half-a-point difference translates to a 19-point difference across a whole 38-game season.

We can plot each team in each season of the Premiership (since 1995-96, when the league was first reduced to 20 teams) and look at how well they did against the top teams and the rest of the league. In this graph, the straight line represents equal performance vs the Top 6 and Bottom 14:

4

A couple of things stand out:

1. Only a handful of teams have ever done better vs. the Top 6 than the rest of the league. This seems to have no effect on final position.

2. It’s possible to win the league with a poor record against the Top 6 by consistently beating everybody else. Manchester United won the league in 00-01 and 08-09 with only 1.3 PPG vs the Top 6.

3. Manchester City this year are ridiculous.

I find it interesting to compare Liverpool and Arsenal over the years. One narrative I sometimes hear is that Liverpool tend to raise their game for big matches, but are too inconsistent the rest of the time, whereas Arsenal struggle against big sides but do well enough in the rest of the league to consistently finish well. This chart seems to bear that analysis out; Liverpool’s cluster of dots are higher on the chart, but further to the left:

6

…while Arsenal’s cluster is slightly lower but further to the right… and most importantly, more colourful:

5

And if you want to explore other teams and seasons, there’s an interactive version of all these graphs here.

Standard