Eurovision Song Contest: a market basket analysis of voting patterns and international relations.

I’ve been doing a lot of market basket analysis at The Information Lab lately. Market basket analysis is a way of looking for things that people buy at the same time (or that people never buy at the same time) in order to spot trends in people’s behaviour. For example, it’s probably obvious that if somebody buys cereal, they’ll probably also buy milk. Or that if somebody buys tofu, they’re not going to be buying sausages. This is a really nice example of how it all works.

Thing is, after a while, using bread and butter or cereal and milk or sausages and tofu as an example gets kinda dry. And talking about Lego shovels and milligram-level accurate scales is sometimes a little unprofessional, even if it is a perfect example of consumer behaviour.

So, I’ve been analysing the Eurovision Song Contest. The jury votes lend themselves pretty well to market basket analysis, because they’re pretty similar to transactions: each country’s jury (or customer) votes for (or buys) ten countries (or items) at a time (in a basket), and the fact that these countries (or items) are a subset of all possible countries (or items) to vote for (or buy) means that you can make the same selection vs. non-selection distinction. And we all know that some countries always vote for some other countries, regardless of how good the song is, which is part of what makes it fun.

I took the historic Eurovision data collected by Stephan Okhuijsen of Datagraver. Then, using Alteryx, I filtered it to all contests from 1993 onwards, because European countries have been relatively consistent since then. I also filtered it to the final only, and to the jury votes only.

I set the minimum support for a rule to 0.01, which is kind of high for a regular market basket analysis using tens of thousands of SKUs in a supermarket, but works fine for such a closed set of possible choices of countries. I also set the minimum confidence to 0.05. That gave me almost 33,000 association rules, of which about 1,600 were one-to-one country mappings.

The full results are in an interactive dashboard here.

In the matrix at the bottom, you can see who consistently votes for who, and it’s pretty predictable. Cyprus and Greece, for example, almost always give each other the most possible points. There’s a big love in between Moldova and Romania, and between Turkey and Azerbaijan. The Nordics are a bit too cool to give each other full marks every time, but it’s still a bit of a Scandi circle jerk. Andorra love Spain, although it doesn’t seem like that’s reciprocated. Azerbaijan have never voted for Armenia, funnily enough. And Austria have given Australia full marks twice, which I like to believe is because they were hoping to exploit a poor fuzzy matching process in the background scoring:

But market basket analysis shows how countries behave as a group, where we can see how some associations are Europe-wide, and some are just confined to the two countries. For example, some of the Scandi trends are reflected in votes across Europe; if a country, any country, votes for Denmark, they’re also likely to vote for Norway:

And surprise, surprise, countries that vote for Ukraine will also vote for Russia:

But the Greece/Cyprus love in is special just for them; in fact, if anything, there’s a slightly negative association between them, meaning that if a country votes for Cyprus, they’re slightly less likely to vote for Greece as well:

Likewise with Turkey and Azerbaijan. Just because they give each other full points all the time, other European countries don’t link the two together in their voting behaviour at all:

Meanwhile, even though Azerbaijan will never give points to Armenia, and Armenia have only ever given one point to Azerbaijan, other European countries are far more optimistic. Maybe they hope that voting for both Armenia and Azerbaijan at Eurovision can resolve the Nagorno-Karabakh dispute. Or maybe they just don’t know anything about the Caucasus region and think they’re the same place, I don’t know.

This is quite nice to illustrate, because the market basket analysis allows you to make the distinction; while there are some obvious associations between countries, like how Greece and Cyprus always vote for each other, it shows that those associations aren’t necessarily transferred to other countries’ voting behaviour.

Click through to the interactive version here to explore in more detail. I’m going to be using this in my teaching examples more often.

Tableau

How to give your area and bar charts a makeover with connected scatterplots.

I haven’t always been a fan of connected scatterplots, but I’m gradually coming round to them. First it was with centre of gravity maps; now it’s as a replacement for (some) area charts and bars.

I came across a chart a lot like this at work this week:

I’ve mocked it up using some fake data, but it’s pretty much showing how two departments (or groups, or types of things, or categories) do in terms of profit (or another measure) over the year, and how many orders (or another measure again) there were in each month across both groups.

I’m not a fan of the area chart plus bars over the top approach. Firstly, the bars obscure what’s going on with the area chart underneath them, and secondly, area charts can be misleading as it’s hard to parse each thing separately.

Let’s show the same information with a connected scatterplot:

Here, the red and blue lines show the two departments, and the grey line shows the aggregated number of orders and profit. Instantly, the difference between the two departments is a lot clearer; department A has had a lot of variation in number of orders but profit has stayed pretty consistent, while department B has had a lot of variation in profit but the number of orders has stayed consistent.

The lines are joined up by month… but as it is, it’s impossible to tell where the year begins and ends, which makes the whole thing pretty pointless. Let’s show time with size:

The lines get thicker as the month gets more recent, and now it’s easy to see the trends over the year. The variation in orders in department A is all over the place, but the variation in profit in department B is a bit more consistent; profit has gone down over the year. We can also see the aggregate profit and order trends much more clearly on the grey line, with orders going up but profit going down.

The downside of using line size to show time trends is that the thinner parts are hard to see and the thicker parts can be hard to parse. Let’s try it with colour instead:

I personally prefer this approach to using line size, but it’s also a bit of a faff. Tableau doesn’t like it if you try to do a three-way colour split by dimension (i.e. red, blue, and grey) and then change the shade of the colour by a measure, so you have to convert the month to discrete and make sure to order everything correctly. It’s not too taxing though, so it’s worth it if the data doesn’t change all the time.

Finally, you can go the whole hog and do some double encoding with both colour and size on the line to show time:

This is eye-catching, but possibly to the point where it’s more distracting than informative.

I like the connected scatterplots in this example, although there are many situations where the lines will overlap in a way that won’t tell you much. It certainly won’t work well with lots of different departments; here’s the mess you get if you look at all countries in EU Superstore for all months of all years:

Even when filtering to four countries and two years of data, it’s not the clearest way of showing things:

In summary, then, give a connected scatterplot a go. It may well not work, but sometimes it’ll result in something a lot clearer and more informative than a combined area/bar chart.

Alteryx, data visualisation, Maps, Tableau

Alaska Fried Chicken: the UK’s curious approach to naming chicken shops.

I went a little bit viral a couple of weeks ago when I tweeted about chicken shops in the UK which are named after American states which aren’t Kentucky. If I’d thought about it, I’d have written this blog up first, created a Tableau Public viz, and had all kinds of other shit ready to plug once I started getting some serious #numbers… but I didn’t. So, to make up for that, this blog will go through that thread in more detail and answer a few questions I received along the way.

It all started when I walked past Tennessee Fried Chicken in Camberwell, pretty close to where I live. It’s clearly a knock-off KFC, and I wanted to know how many other chicken shops had the same name format: [American state] Fried Chicken.

The first thing to do is to get a list of all the restaurants in the UK. I spent a while wondering how to get this data, but then I remembered that my colleague Luke Stoughton once built a Tableau Public dashboard about food hygiene ratings in the UK. All UK chicken shops – hopefully! – are inspected by the Food Standards Agency. So, Luke kindly showed me his Alteryx workflow for scraping the data from the FSA API, and I adjusted it to look for chicken shops.

My first line of inquiry is pretty stringent: how many chicken shops in the UK are called “X Fried Chicken” where X is an American state which isn’t Kentucky?

Turns out it’s 34. “Tennessee Fried Chicken” – including variants such as Tenessee and Tennesse – is the most popular with 13 chicken shops. The next highest is Kansas with six, which I’m assuming is so the owners can refer to their shops as KFC, although maybe the owner/s just really like tornadoes, wheat, and/or the Wizard of Oz. Then there’s four Californias, a couple of Floridas, and one each of Arizona, Georgia, Michigan, Mississippi, Montana, Ohio, Texas, and Virginia.

[tangent: I’m aware that a lot of these states aren’t exactly famed for their fried chicken, but as a Brit, all I have to go on for most of them are my stereotypes from American media. But hey, maybe it’s still accurate, and Ohio Fried Chicken tastes of opiates and post-industrial decline, Arizona Fried Chicken comes pre-pulped for the senior clientele who can’t chew so well these days, and Florida Fried Chicken is actually just alligator. Michigan Fried Chicken is, I dunno, fried in car oil rather than vegetable oil, and Alaska Fried Chicken is their sneaky way of dealing with the bald eagle problem up there? I’m running out of crude state stereotypes now, I’m afraid. Out of all these states, I’ve only actually been to California.]

There’s also a “DC Fried Chicken”, which is close but not quite close enough for me, and a “South Harrow Tennessee Fried Chicken”, which I’m not counting because either.

Here is where these American State Fried Chicken shops are in the UK:

Interestingly, this isn’t a case of a map simply showing population distributions. The shops cluster around the London and Manchester regions, but with almost none in any other major urban centre.

Let’s have a look at the clusters separately. Here’s the chicken shops around the Manchester area:

None of them are in the proper centre of Manchester itself, but they’re in the towns around. One town in particular stands out: Oldham. Let’s have a look at the centre of Oldham:

Oldham, you’re fantastic. There are six separate “X Fried Chicken” shops in Oldham, and four of them – Georgia, Michigan, Montana, and Virginia – are the only ones by that name in the whole country.

For comparison, here’s the London area:

This is where all the Tennessees are, as well as the one Texas and Mississippi.

It looks like there’s a lot more variety in the north of England compared to the south, and sure enough, a split emerges:

[chicken icon from https://www.flaticon.com/packs/animals-33%5D

Chicken shops in the south of England (and that one Tennessee place in Wales) tend to name their shops after states in the geographical south of the USA, while chicken shops in the north of England name their shops after any states they like.

This is where my initial Twitter thread ended, and I woke up the next day to a lot of comments like “Y IS THEIR NO MARYLAND THEIR IS MARYLAND CHICKEN IN LEICESTER”. Well, yeah, but it’s not Maryland Fried Chicken, is it?

So I re-ran the data to look at chicken shops with an American state in the name. This is the point at which it’s hard to tell if there’s any data drop out; the FSA data categorises places to inspect as restaurants, takeaways, etc., but not as specifically as chicken shops. All I’ve got to go on is the name, so I’ve taken all shops with an American state and the word “chicken” in the name. This would exclude (sadly fictional) places like “South Dakota Spicy Wings” and “The Organic Vermont Quail Emporium”, but it’d also include a lot of false positives; for example, you’d think that taking all takeaway places with “wings” in the name would be safe, but when I manually checked a few on Google Street View (because I’m dedicated to my research), about half of them are Chinese and refer to the owner’s surname, not the delicacy available.

This brings in a few more states – Marlyand, New Jersey, and Nevada:

Let’s have another look at the UK’s south vs north split. We’ve got a bit of midlands representation now, with the Maryland Chickens in Leicester and Nottingham, the Nevada Chickens in Nottingham and Derby, and a California Chicken & Pizza near Dudley. The latitude naming split between the south/midlands and the north isn’t quite as obvious anymore:

…but, there is still a noticeable difference. This graph shows each chicken shop with an American state and the word “chicken” in the name, ordered by latitude going south to north:

In the south and the midlands, there’s the occasional chicken shop that’s going individual – there’s the Texas Fried Chicken in Edmonton, the two Mississippi places in London which don’t seem to be related (Mississippi Chicken & Pizza in Dagenham, Mississippi Fried Chicken in Islington), the Kansas Chicken & Ribs place in Hornsey is almost definitely a different chain from the six Kansas Fried Chicken shops in and around Manchester, and the California Fried Chicken in Luton is probably independent of the California Fried Chickens on the south coast – but most of them are Tennessee or Maryland chains in the same area. In all, the south and midlands have 17 chicken shops named after 8 American states (excluding Kentucky), or a State-to-Chicken-Shop ratio of 0.47.

In the north, however, there’s a proliferation of independent chicken shops – 15 shops named after 9 different states (excluding Kentucky), or a State-to-Chicken-Shop ratio of 0.6. There’s the chain of six Kansas Fried Chicken places and two Florida Fried Chicken places in Manchester and Oldham, but the rest are completely separate. Good job, The North.

The broader question is: why does the UK do this? There’s obviously the copycat nature of it; chicken shops want to seem plausible, and sounding like a KFC (and looking like one too, since they’re almost always designed in red/white/blue colours) links it in people’s minds. I think there’s more to it, though. Having a really American-sounding word in the name is probably a bit like how Japanese companies scatter English words everywhere to sound international and dynamic (even if they make no sense), or how Americans often perceive British names and accents as fancier and more authoritative (even if to British ears it’s somebody from Birmingham called Jenkins). We’re doing the same, but… for fried chicken.

Finally, since this data is all from the Food Standards Agency’s hygiene ratings, it’d be a shame not to look at the actual hygiene ratings:

It looks like independently-named chicken shops named after American states in the north are more hygienic. The chains in the south and midlands – Tennessee, Maryland, California, and especially New Jersey – don’t have great hygiene ratings, and the independent shops do pretty badly too. In contrast, the chicken shops in the north score highly for cleanliness. In fact, a quick linear regression of hygiene onto latitude gives me an R² of 0.74 and a p-value of < 0.0001. Speculations as to why this is on a postcard, please.

Update, November 2018: I’ve finally got round to refreshing the data and putting up an interactive, searchable map. Sadly, it looks like Ohio Fried Chicken has shut down, but there’s another Arizona Fried Chicken now, so… (s)wings and roundabouts. Have a look for (probable) chicken shops in your area here.

Preëmpting your questions/comments:

“I live in […] and my local shop […] isn’t mentioned!”
Maybe you’re talking about a Dallas Chicken place. That’s not a state. Nor is Dixy Chicken, it just sounds a bit American. If it’s definitely a state, then does it have chicken in the name? If not, I won’t have picked it up. I also haven’t picked up shops which have, say, “Vermont Fried Chicken” written on the shop sign if it’s registered in the database as “VFC”. Same with if the state is misspelled, either by the shop or by the data collectors. If it’s all still fine, perhaps the shop is so new that it hasn’t had an inspection… or perhaps the shop is operating illegally and isn’t registered for a hygiene inspection.

“Did you know about Mr. Chicken, the guy who designs the signs?”
I didn’t, but I do now! He’s brilliant.

“How did you do all this?”
I use Alteryx for data scraping/preparation and Tableau for data visualisation.

“I have an idea for something / I want to talk to you about something, can I get in touch?”
Please do! My Twitter handle is @GwilymLockwood, or you can email me on gwilym.lockwood@theinformationlab.co.uk

“Your analysis is amazing, probably the best thing I’ve ever seen with my eyes. Where can I explore more of your stuff?”
Thanks, that’s so kind! There’s a lot of my infographic work on my Tableau Public site here.

Tableau

Strings and roundabouts: how to label your numbers in Tableau effectively

Lots of Tableau dashboards feature big summary numbers. They’re pretty nice, and they look like this:

And what’s even nicer is that you can alter the measure’s default number format to automatically round a specific unit, like this:

But sometimes, the range of numbers is a lot wider than sales per state in Superstore. Let’s have a look at population per country in the World Indicators dataset:

Those are some long numbers, so let’s round them to the nearest unit again. But which one? If we round to the thousands, we get this:

And if we round to the millions, we get this:

It’d be great if we could get Tableau to figure out what the nearest sensible unit is. That functionality doesn’t exist yet (as far as I know!), but we can write a specific optimised rounding calculation for labelling purposes. It’s a bit of a long one:

This calculation returns the number you want as a string. It does this by:

Aggregating the number you’re actually working with already and finding out whether it’s above a billion (in which case you’d want to summarise to whatever number of billions it is), or above a million (in which case you’d want to summarise to whatever number of millions it is), and so on.
Converting it to an absolute number so that it works for negative numbers too.
Taking that aggregated number and dividing by the sensible unit. For example, if your number is 34000000, you’d want to express it as 34 million, so we’re dividing it by a million to return 34.
Rounding that divided figure to one decimal place. This is just my preference, you can do what you like! Set the number to 0 for no decimal places, or 2 for two decimal places, etc.
Convert that number to a string.
Add a text unit abbreviation to the end of it.

Of course, you can also add trillions, quadrillions, and so on, if that’s what your data requires.

This sorts us out nicely:

Now, I’ve deliberately aggregated everything within the calculation, and I’m only using it for labelling purposes. I categorically do not recommend aggregating outside this calculation or using this calculation for calculating anything else. This will result in a shitstorm of rounding errors which can seriously damage your data. But as a final step once you’ve sorted everything out, I find that this is really nice for presenting data.

Here’s the calculation in text so you can copy and paste it into your workbooks:

IF ABS(AVG([Population Total])) >= 1000000000 THEN //round for billions STR(ROUND(AVG([Population Total] / 1000000000), 1)) + "b" ELSEIF ABS(AVG([Population Total])) >= 1000000 THEN //round for millions STR(ROUND(AVG([Population Total] / 1000000), 1)) + "m" ELSEIF ABS(AVG([Population Total])) >= 1000 THEN //round for thousands STR(ROUND(AVG([Population Total] / 1000), 1)) + "k" ELSE STR(ROUND(AVG([Population Total]),0)) END

data visualisation, Tableau

Time is a Machine: how to create a clock in Tableau

I made a clock in Tableau this week, and you can find it on Tableau Public here.

It always shows the current time for the UK, but it shouldn’t be hard to parameterise to update to whatever time zone you’re in.

Essentially, all it is is two points on a scatterplot, connected by lines to the coördinates (0,0), and superimposed on a background image. I made the background image in Powerpoint, based on the clock in the Time episode of Don’t Hug Me I’m Scared.

I’ve written before about using radial calculations to plot distance from the centre and change the lengths while keeping the angles constant. This time, we’re going to change up the trigonometry a bit, and calculate the angle while keeping how far the line goes constant.

Firstly, though, we need some data to work with. All you need to get a DateTime is a single cell in a single column… but for plotting purposes, we’re going to need the following dataset:

That’s all we’ll need! Read that into Tableau, and the rest can be done with calculated fields.

Firstly, we need to find out what the time is. Tableau has the NOW() function, which is really useful. It returns the exact time, down to the second, of the time on your computer (assuming that you’re working in Tableau Desktop with an Excel sheet you’ve created just for this). But when it’s published on Tableau Server, it returns the time of the Tableau Server Data Engine, which seems to be eight hours behind UK time (as of 19th September 2017, when I’m writing this; I’ve no idea how daylight saving changes will affect it).

So, let’s create our Right Now field, and add eight hours to it with the DATEADD() function so that it’ll give us the UK time when published:

Right Now:
DATEADD('hour', 8, NOW())

The next step is to take Right Now and parse out the time parts that we want to plot. Let’s just go with hours and minutes; plotting seconds is possible, but it’ll look like it’s not working if the dashboard isn’t updating every half second or so. So, let’s create an Hours field and a Minutes field as follows:

Hours:
DATEPART('hour', [Right Now])

Minutes:
DATEPART('minute', [Right Now])

This will give the current hour and the current minute as a number. There’s an extra step we need to take, though… the hour hand on a clock doesn’t point at the exact hour number for the whole of the hour, it moves around depending on the minutes that have passed. If it’s half past ten, the hour hand doesn’t point at ten exactly, it points about halfway between the ten and the eleven.

So, let’s create another field called Exact Hour for the exact point between hour marks to plot:

Exact Hour:
[Hours] + ([Minutes] / 60)

This works by giving us the hour (e.g. 6 for 6pm), and then adding the amount of the hour that we’ve got through. For example, if it’s 6.15pm, the number of minutes is 15, and we’re quarter of the way through the hour. 15/60 = 0.25, so the point where the hour hand will point to is 6.25, i.e. quarter of the way from 6 to 7.

After that, we need to create a single field to plot. This is why the underlying data has the Time Unit field, with separate rows for each hand.

Time for plotting:
IF [Time Unit] = "Hours" THEN [Exact Hour] ELSE [Minutes] END

Now that we have our field to plot, we’re ready to do some trigonometry!

We know that we want the clock hands to begin at (0,0) on the scatterplot; what we need to work out is where the clock hands need to end. To be able to plot the X and Y coördinates of where the hands end, we first need to know the angle of the line from (0,0). In simple terms, the scatterplot works like this:

Finding the angle is fairly simple. There are 360° in a circle, and rather conveniently, a clock face is just a big old circle, starting with 0° from the centre at the 12 o’clock position. There are 12 hour points that go round the clock face, so if we want to find out the hour hand’s angle, we divide the hour value by 12 to find out how far around 360° it is, then multiply that fraction by 360. For minutes, the same thing holds, but there are 60 points instead of 12.

Angle:
IF [Time Unit] = "Hours" THEN ([Time for plotting] / 12 ) * 360 ELSE ([Time for plotting] / 60 ) * 360 END

“But wait!”, I hear you shout at the screen. Dividing the hour by 12 might work for the morning, but what about when it’s the afternoon, when Tableau’s DATEPART() function will return the number 18 for 6pm, as it works on a 24 hour format?

You’re completely right, I haven’t accounted for that. But I don’t really need to. If it’s 6pm, the hour is 18. 18/12 is 1.5, and multiplying that fraction by 360 gives us 540°. Sure, 540 is bigger than the 360° that are in a circle… but the wonderful thing about circles is that they’re, well, circular. Plotting 540° on this clock face will look identical to plotting 180°. If it bothers you that they’re not technically the same, feel free to add an IF clause to identify the afternoon and then subtract 12 hours from the Exact Hour field.

Now that we’ve got the right angles, we can calculate where the coördinates go. This is a bit more tricky.

The first thing to bear in mind is that I’ve changed the trigonometric functions to reflect how Tableau will actually plot the angles, rather than using the standard ones in maths textbooks.

Maths textbooks will tell you that to find the coördinates (X,Y) on a circle, given the angle θ and a radius of 1 from the centre point (0,0), the equations are Y = Sin θ and X = Cos θ. I’m not going to go into why or how here, but please just trust me on this one and take it at face value. Y = Sin θ and X = Cos θ.

Those maths textbooks will also give you a diagram like this:

But this isn’t what we have; we have this angle instead:

…so using the exact same calculations won’t quite work for us here, because they calculate it relative to a different axis. But, we can still use the earlier diagram to help us work it out; we just need to rotate it and flip it a bit until we have what we need:

This looks like the angle we’re trying to work out, right?

This means that our X axis is the Y axis in the canonical diagram, and our Y axis is the X axis in the canonical diagram. Let’s just rename the two axes so X goes along the bottom and Y goes up and down again:

Now, for us, X = Sin θ and Y = Cos θ. Nice.

That’s all well and good, but there’s another step before it’ll actually work in Tableau. We’ve calculated our angle in degrees (because that’s what everybody learns at school first, and that’s still what’s the most intuitive thing for me). Thing is, Tableau uses radians with trigonometric functions. When we use radians, 360° is equivalent to 2π… which means that 1° is equivalent to π/180. So, we can still use our angle field, we just have to multiply it by π/180 (radians is another thing that you’ll just have to take my word on for now, I’m afraid; just remember that π = 3.14159… and so on, and π also = 180°).

Finally, we want our clock hands to be different lengths. To do this, you can take the equations and multiply them by a constant. Through trial and error, I found that I liked it best when the minute hand was 1.6x the length of the hour hand, so I multiplied the equations by 1.6 when it was for minutes and by 1 when it was for hours, just to keep it consistent.

The fields are:

X:
IF [Path ID] = 1 THEN IF [Time Unit] = "Minutes" THEN 1.6 * SIN([Angle]* PI() / 180) ELSE 1 * SIN([Angle]* PI() / 180) END ELSE 0 END

Y:
IF [Path ID] = 1 THEN IF [Time Unit] = "Minutes" THEN 1.6 * COS([Angle]* PI() / 180) ELSE 1 * COS([Angle]* PI() / 180) END ELSE 0 END

If you’re wondering why Path ID matters, it’s about connecting the lines to the dots. What we need is to have the lines start at (0,0) and end at (X,Y), but we still need to tell Tableau that the starting point is (0,0) where Path ID = 0.

That’s a lot of trigonometry, but we’re finally done! All you need to do now is to drag SUM(X) to columns and SUM(Y) to rows, and put Time Unit on detail. This will give you two circles. Drag SUM(Y) to rows again, and change it to line. Put Path ID on the Path shelf. Then dual axis the two SUM(Y) fields, and synchronise axes.

This probably doesn’t quite look right yet, because you have to make sure that you fix both the X and Y axes to be between the same range; I’ve fixed both of mine to go from -2 to +2, which has worked out nicely.

That’s it for making the clock! But there’s even more fun to be had in the final step, which is playing around with background images. I found a lot of beautiful handless clock faces online, but most of them have copyright restrictions, so I’m not going to use those. Instead, I went for an homage to Don’t Hug Me I’m Scared, a youtube series with probably my favourite animated clock character of all time. At some point, I might try it out with my own face and see how horrific that looks.

I hope this helps! It was really fun to build and write about. Please leave me a comment if you have any questions, and I’ll do my best to answer.

(title inspiration: Time is a Machine – Listener)

Alteryx, football, Maps, Tableau

Centre of Gravity, Metaphorically: Plotting time-based changes on maps

I haven’t written a blog in far too long. My bad. So, to get back into the swing of things, here’s something I’ve been playing with this week: centre of gravity plots.

It started with an accident. I had some EU member data, and I was simply trying to make a filled map based on the year each country joined, just to see if it was worth plotting. You know, something like this:

Except that I’d been having a clumsy day (the kind of day where I spilled coffee on my desk, twice), and accidentally missed the filled map option and clicked line instead:

Now, I normally don’t like connected scatterplots, but realised that I could change a couple of things to this accident to make quite a nice connected scatterplot on a map, joining up the central latitude and longitude of each country, so I thought I’d follow through with it and see what happened.

(by the way, the colour palette I use is the Viridis Palette, which I absolutely love. You can find the text to copy/paste into your Tableau preferences file here)

Firstly, I changed my “year joined” field from a discrete dimension into a continuous measure so that I could make it a continuous line with AVG(Year joined):

This connects all the countries by their central latitude and longitude as generated by Tableau, but it joins them up in order from left to right on the map. So, I then added AVG(Year joined) to the path shelf as well, which means that each country is joined in chronological order, or in alphabetical order when there’s a tie (as with Belgium, France, Germany, Italy, Luxembourg, and the Netherlands, who formed the EU in 1958):

I was pretty happy with this; it shows the EU’s expansion eastwards over time far, far better than the filled map did.

I got talking to Mark and Neil online, who introduced me to the idea of “centre of gravity” plots, which show the average latitude and longitude of something and how it changes with respect to something else (usually time). In this case, a centre of gravity plot of the EU would show the average central point of Belgium, France, Germany, Italy, Luxembourg, and the Netherlands in 1958, then the average central point of Belgium, France, Germany, Italy, Luxembourg, the Netherlands, Denmark, Ireland, and the UK in 1973… and so on. I figured it should be easy enough, I’d just take Country off detail, replace it with Year joined, and average the latitudes and longitudes together.

Sadly, it doesn’t work that way. The Latitude (generated) and Longitude (generated) fields that Tableau automatically generates when it detects a geographic field like country can’t be aggregated, and can’t be used if the geographic field they’re based on isn’t in the view. That meant I couldn’t average the latitudes and longitudes over multiple countries without creating lots of different groups.

But, there’s a simple way around this! You can create a text table of the latlongs, copy/paste them into Excel or whatever, then read that in as another data source. Firstly, drag your geographic field into the view, and put the latitude on text, like so:

Then copy and paste it all (I just click on there randomly, hit ctrl+A, ctrl+C, switch to Excel, ctrl-V). Now do the same for the longitude. Save the document, and read it in as a separate data source in Tableau. Now you can blend the data on Country, or whatever your geographic field is, and you’ve got actual latlongs that you can use like proper measures.

And so I did. I recreated the line chart with the new fields, but took Country off detail, and made AVG(Latitude) and AVG(Longitude) into moving average table calculations which take the current value and an arbitrarily high number of previous values (I put in 100, just because). This looked pretty good:

…but then I realised that it wasn’t accurate data. Look at the point for 1973, after the UK, Ireland, and Denmark joined. Doesn’t that seem a little far north?

To investigate it fully, I duplicated the sheet as a crosstab, because sometimes, tables are the best way to go. What I found is that I’ve got a bit of Simpson’s Paradox going on; the calculation is taking averages of averages:

Not so great. If we add Country to the view after the Year joined pill, you can see what it should be:

But the problem is, how do we put Country on detail but then get the moving average to ignore it? I tried various LODs, but couldn’t get it to work exactly – if you have a solution, I would love to hear it! My default approach is to try to restructure the data in Alteryx – because that generally solves everything – but I feel like I’m becoming too reliant on restructuring the data rather than working with what Tableau can do.

Anyway, I ended up restructuring the data by generating a row for each country and year that the country has been a member of the EU. That means I can create a data table like this:

…which removes the need for a moving average calculation entirely, because the entire data is moving with the year instead. Just take country off detail / out of the view, and you get the right averages:

Much more accurate:

This is a better way of structuring the data for this particular instance, because the dataset is tiny; 28 countries, 60-ish years, 913 rows in my Excel file. It’s not going to be a good, sustainable solution for a centre of gravity plot over a much bigger dataset though. I did the same thing for the UN – 193 countries, 70-ish years – and ended up with 10,045 rows in my Excel file. It’s easy to see how this could explode with much more data.

It does look interesting, though; I’d never have guessed that the UN’s centre of gravity hadn’t really left the Sahara since its inception:

Finally, since I was on a roll, I plotted the centre of gravity for the English football champions since the first ever professional season in 1888-89. Conceptually, this was slightly different; unlike the EU and the UN, the champion isn’t a group of teams constantly joining over the years (although it is possible to plot that too). Rather, I wanted to create a rolling average of the centre of gravity over the last N years. If you set it to five years, it’s a bit messy, moving around the country quite a lot:

But if you set it to 20 years, the line tells a nice story. You can see how English football started out with the original northern teams being the most powerful, then it moves south after the Second World War, then it moves north-west during the Liverpool/Manchester era of domination, and finally it’s moving south again more recently:

Many thanks to Ian, who showed me how to parameterise this. Firstly, put your hard-coded (i.e. not Tableau generated!) latitude or longitude field in the view, and create a moving average over the last ten years. Or two, or thirteen, or ninety-eight, it doesn’t really matter. Next, drag the moving average latitude/longitude pill from the rows/columns into the measures pane in order to store it. This creates a calculated field. Meanwhile, create a parameter to let you select a number. This will change the period to calculate the moving average over. Open up the new calculated fields, and replace the number ten/two/thirteen/ninety-eight with your newly-created parameter, remembering to leave the minus sign in front of it:

This will let you parameterise your moving average centre of gravity.

It was a lot of fun to play around with these maps this week. I’ve packaged them all up in a Tableau Public workbook here; I hope you find it as interesting as I did!

(title inspiration: Touché Amoré – Gravity, Metaphorically)

Tableau

Time is money, money is power, power is pizza, and pizza is knowledge.

I’m not proud to admit it. I was recently in a pub and pulled out my phone to calculate the area of a set of pizzas we were ordering. There were two sizes of pizza on the menu – 12″ and 18″, for £12 and £18 each – and my group were going to order five 12″ pizzas.

I couldn’t let that slide.

“Nah, let’s get three 18″ pizzas instead”, I said. “More pizza, less money.”

Intuitively, it feels wrong. Yes, an 18″ pizza is bigger, but surely five 12″ pizzas are more than only three 18″ pizzas? This is when I worked out the total area of the pizzas and their price per square inch, and for the record, buying three 18″ pizzas at £18 gets you 763 sq inch of pizza for £54, whereas buying five 12″ pizzas at £12 gets you 566 sq inch of pizza for £60. We bought the big ones.

This got me thinking about how to visualise this in Tableau, and I created the following handy dashboard for settling your pub pizza problems (click here for desktop version, or here for mobile version):

This was a fun one to build, not least because there’s basically no underlying data. I knew I wanted the inputs for the calculations to depend entirely on parameters, so all I needed was the names of the two pizzas. This is the spreadsheet I created:

Simple.

The next thing was to create three parameters per pizza; diameter, price, and quantity. The values entered here can be passed to the data with a case statement, e.g.:

…or directly within a formula:

Things were starting to get complicated, so I built out a table to make sure the numbers were working properly:

From here on out, it was relatively straightforward. It’s really hard to accurately judge area when looking at circles and squares (which is one of the reasons I hate packed bubbles and tree maps), so I started out by visualising the pizzas as actual circles vs. bar graphs.

This makes the two pizzas look relatively similar in size:

…whereas this shows that one of them is more than double the other:

One useful little trick I found was to create a calculated field to place the two pizza icons in specific places:

This spaced them out nicely rather than piling them on top of each other or putting them side by side and too close together:

The most fiddly bit was creating the text summary. I dragged a lot of calculated string fields onto the label shelf, and configured it like this:

And an example of some of the fields going into making a sentence which automatically updates is the best pizza deal field here:

I really enjoyed playing around with this viz. People who use Tableau often talk about liberating your data, but this was a nice exercise in what’s possible with basically no data at all!

Tableau

Paint your target: how to create a bullseye graph in Tableau

Ever thought of tracking whether something’s hitting the target by showing an actual target? I was looking through some old radial blogs recently, and realised you could use a scatterplot on x-y coördinates to show accuracy on a bullseye target.

(to cut straight to the viz on Tableau Public, go here; to find out how to create it, keep reading!)

For example, you could set up an image of a target a bit like this as a background image:

…and plot simple x-y points in Tableau over the top to show how close people/departments/countries are to meeting their target:

I used MS Publisher to design these concentric circles in the background image because when you select them all and save as an image, it doesn’t save the blank space as white, so you can use it with any colour worksheet or dashboard. Feel free to download this target image here (and if you use it, give me a shout! I’d love to see what kind of cool things you’re using it for).

Sadly, Superstore doesn’t have target data on it, so I’ve mocked up a quick dataset of sales and target sales per country as follows:

The next thing to do is calculate the “accuracy” on the bullseye; the nearer the sales figure is to the target, the closer to the middle of the bullseye it should be. However, once you’ve met or exceeded your target, you don’t want the points to keep moving. Countries with sales at 101% of the target and at 300% of the target should both still be in the middle.

The exact middle of the target is going to be at coördinates (0,0), which means that we actually want to create a field which takes some kind of inverse of the accuracy, where the greater the accuracy, the smaller the number in that field.

So, let’s create the following calculated fields. I’ve adapted them a bit from this blog on radial bar charts so that you can put the X field on columns and the Y field on rows, which is more intuitive.

Accuracy:
IF [Sales] > [Target] THEN 1.05 ELSE [Sales] / [Target] END

Radial Field:
1.1 - [Accuracy]
Why have I set the Accuracy equaliser bit to 1.05 instead of 1? And why have I set the Radial Field calculation to 1.1 instead of 1?

Well, it’s a bit fiddly, but it’s about plotting. I want everything where Sales is 100% of Target or more to be in the gold bullseye, so I want that to fill a certain amount of space. If I set the Accuracy equaliser bit to 1 and the Radial Field calculation part to 1, then it plots everything that’s at 100% or more at (0,0) and includes everything that’s 90% or more in the gold bullseye. I want to space out the 100% or more points so that they’re not on top of each other, and I want only the 100% or more points to be included in the gold bullseye. Setting the Radial Field calculation part to 1.1 makes it so that the edge of the gold bullseye denotes 100%. That means that the 100% or more points will be plotted on the edge of the bullseye. So, the 1.05 part in the Accuracy calculation moves those points further inside the bullseye, but not into the exact middle where they’ll be on top of each other.

While I’m at it, there’s a limitation to the Accuracy calculation. Have you spotted how it’s [Sales] > [Target] rather than SUM([Sales]) > SUM([Target])? This is because the angle calculation needs to aggregate the Radial Field further, and will break if the Accuracy or Radial Field calculations are already aggregated. This means that you’ll probably need to do some data processing to make sure that it’s just one row per thing in the dataset.

The next fields to calculate are:

Radial Angle:
(INDEX() -1 ) * (1/WINDOW_COUNT(COUNT([Radial Field]))) * 2 * PI()
X:
MAX([Radial Field]) * SIN([Radial Angle])
Y:
MAX([Radial Field]) * COS([Radial Angle])
Now you can drag X onto columns, Y onto rows, and put Country on detail:

This looks pretty rubbish, but that’s because the INDEX() function in the Radial Angle calculation is a table calc, and Tableau needs to know how to compute it. Edit the X and Y fields to compute using Country instead:

…and now you’ll get some points spaced out properly:

The calculations mean you can plot the points in a radial way; it’ll go through whatever field you’ve put on detail, and plot the points the right distance away from the centre, starting at 12 o’clock and looping round clockwise. With Country, it’s done alphabetically, so Belgium is on the X axis zero line. If you want to order the points differently, you can add a numeric field to the detail shelf, and change the table calculation to compute using that numeric field (but make sure to keep Country there!).

The next thing to do is to put the target in as a background image. Confusingly, this is under the Map options. Once you’ve found it, you have to specify exactly where the background image is going to sit on the x-y grid of coördinates. With this target, there are four concentric circles, denoting 70%, 80%, 90%, and 100%+. Because I set the Radial Field to be 1.1 – [Accuracy], there’s a distance of 0.1 in a circle around point (0,0) and the edge of the bullseye denoting 100%. That means that the bullseye section is going to run from -0.1 to 0.1 on both axes. Counting backwards, there are three other circles which are the same thickness; the light grey one denoting 90% will run from -0.2 to 0.2 on both axes, the mid grey one denoting 80% will run from -0.3 to 0.3 on both axes, and the dark grey one denoting 70% will run from -0.4 to 0.4 on both axes. As the edge of the dark grey circle is the edge of the background image, this is where you need to tell Tableau to position the background image:

Click OK, and there we are! A nice target, with the different points on it:

You can change the formatting to remove the zero lines if you like:

…but I actually kinda like them on this graph. I’ve also put Country on the colour shelf, and made the borders around the circles black.

I haven’t seen this approach before, so I’m not really sure what to call it. I’ve plumped for a bullseye graph, but maybe it already exists under another name. Let me know if somebody else has covered this, and definitely let me know if you find this useful! You can download the Tableau workbook I used to make this example here (it’s 10.2, by the way, but the same approach should work for older versions).

Tableau

The Colour Revolt: using rank calculations to get around Tableau’s colour defaults

Tableau’s colour palettes are generally pretty good, pretty flexible, and pretty… well, pretty. But sometimes you want to colour-code a dimension with more values than the number of colours in the palette, and this is when Tableau has a bit of a problem.

This blog is about how Tableau assigns colours, how that messes some things up, and how to get around that.

I’m working with the Global Indicators dataset in Tableau 10.2 here. Let’s say I want to plot each country’s CO2 emissions each year. Without using colours at all, we get this:

Not exactly ideal, so I’ll put the Country dimension on the colour shelf. But before I do, I’m just going to take the CO2 Emissions measure off the view so that we can see how Tableau assigns colours to values in a dimension:

Tableau takes all values in the dimension in order, and loops through the colour palette (in this case the 20-shaded automatic palette), assigning the colours in order and restarting every 20 values.

As there are about 200 countries but only 20 colours, ten or so countries will be represented by the same colour. This might not be a problem, depending on what we’re plotting, but then again, it might. Let’s put the CO2 Emissions measure back into the view:

Yeah, it’s a problem. It just so happens that China and the USA are a multiple of 20 apart in the dimension list, so they’ve been assigned the same brown colour. This is a problem.

Even more irritatingly, Tableau assigns the colour palette to all possible values in the dimension before you filter them. So, if I change the filter so that it only has China and the USA…

…they’re still the same colour.

How can we assign colours to only the countries selected by the filter in the view, rather than all countries?

One solution (thanks Anna for the tip!) is to create a calculated Rank() field, which will only look at the countries in the view. I’ll stick with a default rank of the countries themselves for now, which will do it alphabetically, but you can also do it by how they do on a particular measure.

Make this new calculated field discrete, and drag it onto the colour shelf. You’ll also need to separate out the countries by putting Country on detail (which was previously done automatically when Country was on colour).

This has changed the colours… but they’re still the same. This is because Rank() is a table calculation, and we need to change its default settings. Instead of compute using Table (across), compute using Country:

This now ranks the countries alphabetically (in reverse order, but it doesn’t really matter for this purpose) and assigns a different colour to each of them:

Now we have nicely distinguishable colours for dimensions with loads of values when we actually only want to plot a handful!

The colour marks card is a bit annoying, though; it only shows the rank number next to the colour, not the country name. You can get around this by creating a new sheet, applying the country filter to that sheet too, and just putting Country in rows:

You can now use this second sheet as a colour legend in a dashboard.

Of course, you might suggest that a simpler solution to the colour problem is to forget about colours altogether and just use labels on the lines. And you’d be totally correct! But maybe you’ve got a more complicated plot where you’ve got several different values for the same country or something. Like a scatterplot of two measures with a circle for each year:

Labels would look pretty hideous on this, especially if the circles overlapped a lot. But the Rank() colour workaround still applies.

Quantifying three years of a long distance relationship

I read two really useful guides to processing text data recently; an analysis of Trump’s tweets to work out whether it’s him or an intern sending them, and a sentiment analysis of Pride and Prejudice. Three years of a long distance relationship means that I have a nice big corpus of Whatsapp messages between my girlfriend and me, so I did the romantic thing and quantified some of our interactions in R. Also, this required quite a bit of text munging in Excel first, which turned out to be far quicker and easier than using regex in this case.

First of all, let’s look at when we text each other throughout the day. We’re in different time zones, but only by an hour, and since texts are inherently dependent – one text is overwhelmingly likely to lead to another pretty soon after – I haven’t adjusted the times.

Our texting activity represents our general activity pretty well; nothing much going on until about 7am, then a slow start to the day, a bit of a post-lunch dip, and then an evening peak when we’re more likely to be out and about doing things.

We can also have a look at how many messages we send each other, and how that’s changed over time:

We’ve sent each other a fairly similar number of texts per day throughout the long distance period, but it looks pretty bad on me that I have consistently sent fewer texts than her…

…or does it? When I plot the length of each text sent, I consistently write longer messages:

So, there’s two distinct texting styles here; I write longer messages less frequently, she writes shorter messages more frequently. The other thing I like about the text length graph is that you can see the times when we’ve been together and not texted each other that much; three weeks in November 2014 when I was running experiments in London, three weeks around Christmas 2015, and a load of long weekends throughout. It’s not that we don’t text each other at all then, it’s more that those texts tend to be stuff like “have we got milk?”, or simply “pub?”.

Plotting log likelihood ratios of how much each of us uses each word in comparison to the other also captures our texting styles:

For example, we both use the word /ha/ to express laughter, but I spell it “ha” and she spells it “hah”. Likewise, “til” and “till” as abbreviations for “until”, and I seem to use “somebody” while she uses “someone”.

If we filter out equivalent words and proper names (like the pubs, supermarkets, and stations we go to most often), another difference in dialogue style appears:

I am apparently a lot more conversational; I write out interjections (hmm, oooh, hey, ohhh) and reactions (fuck’s comes from for fuck’s sake, hoera comes from the Dutch phrase hiep hiep hoera, and boourns comes from, erm, The Simpsons). Apart from hhmmm, she doesn’t write interjections or contextual replies at all. Apart from the interjections and replies, my main thing is adjectives; she tends towards nouns and verbs.

The next step is sentiment analysis. If I plot log likelihood bars for each sentiment, I seem to be an atrociously negative person:

…but this, I think, is more a problem with the way sentiment analysis works in the syuzhet and tidytext packages using NRC sentiment data. Each word in the NRC corpus has a given value, 0 or 1, for a range of sentiments, and this sentiment analysis style simply adds it up for each word in a given set.

Because of that, it doesn’t really capture the actual sentiment behind the way we’re using these words. Let’s look at the main words driving the differences in each sentiment:

For me, a lot of my disgust and anger is coming from the word damn. If I was texting damn! every time I stubbed my toe or something, perhaps that would be accurate; but in this case, a lot of the time I write damn is in sympathy, as in exchanges like:

“My computer crashed this afternoon and I lost all the work I’d done today”
“Damn, that’s horrible”

Meanwhile, the word coop is actually me talking about the coöp / co-op, where I get my groceries. I’m not talking about being trapped, either physically or mentally.

The same goes for my girlfriend being more positive. With words like engagement and ceremony, she’s not joyous or anticipatory about her own upcoming nuptials or anything; rather, several of her colleagues have got engaged and married recently, and most of her uses of the words engagement and ceremony are her complaining about how that’s the only topic of conversation at the office. As for assessment, council, and teacher, she works in education. These are generally neutral descriptions of what’s happened that day.

So, I was hoping to be able to plot some sentiment analyses to show our relationship over time, but either it doesn’t work for text messages, or we’re really fucking obtuse. I think it might be the former.

Instead, I’ll settle for showing how much we both swear over time:

Each dot represents the number of occurrences per month of a particular expletive. I’m clearly the more profane here, although I do waver a bit while she’s fairly consistent.

More importantly is how we talk about beer a similar amount:

Since couples who drink together stay together (or in the words of this study, “concordant drinking couples reported decreased negative marital quality over time”), I think this bodes pretty well for us.

Vizzee Rascal

Gwilym Lockwood's blog about turning numbers into pictures.

Tag Archives: data visualisation

Eurovision Song Contest: a market basket analysis of voting patterns and international relations.

How to give your area and bar charts a makeover with connected scatterplots.

Alaska Fried Chicken: the UK’s curious approach to naming chicken shops.

Strings and roundabouts: how to label your numbers in Tableau effectively

Time is a Machine: how to create a clock in Tableau

Centre of Gravity, Metaphorically: Plotting time-based changes on maps

Time is money, money is power, power is pizza, and pizza is knowledge.

Paint your target: how to create a bullseye graph in Tableau

The Colour Revolt: using rank calculations to get around Tableau’s colour defaults

Quantifying three years of a long distance relationship