# We chase the waves: how to make a sinusoidal time series in Tableau.

Browsing what other people have done on Tableau Public is a great source of both challenge and inspiration. Recently, I’ve been really taken with Neil Richards’ visualisation of football league winners over time, with a beautiful sine wave showing how long it’s been since each team last won the league. I’ve no idea what to call these plots, but they’re fantastic (click image to see Neil’s original on Tableau Public).

I’ve wanted to take these apart and see how they work for a while, and finally got round to it the other day. It turns out that Neil did a lot of the angle calculations outside Tableau, which is fair enough… so I set myself the challenge of doing it all with table calculations. I got there eventually, but it was a good workout.

This blog is a walk-through of how to do it. Instead of football data, I’ve used official pope names; I was on a Wikipedia spiral and noticed that seven of the eleven popes between 1775 and 1958 were called Pius, taking the Pius count from VI to XII. Naturally this reminded me of Barcelona’s recent league dominance, winning six of the last nine La Liga championships, so it seemed obvious to see if the chart for pope names would be similarly tightly woven.

So. This is all the input data you’ll need: a list of all popes, in order, with a record ID and a number showing how many of that name there’s been so far:

Come to think of it, if you’re good with INDEX() calcs, you might not even need the PopeNameNo column… but I’m not, so I do.

You’ll also need a simple scaffold sheet with 100 points, going from 0 to 99. If you’re trying to visualise something with more data points than the 267 popes I’ve got, you might want to whack up the scaffold to 999 instead.

[I’m including the elected-but-not-consecrated Stephen II in this list, because even though he’s not an official pope, all the subsequent Stephens had an increased number until relatively recently, and then it got confusing. So he’s in here.]

Read the two files into Tableau, and join them with a join calculation of the literal string “x” on both sides, so that every pope gets all 100 scaffold points. Now we’re ready to do some vizzical jiggery-popery!

Although Neil’s vizzes had time running vertically up the y-axis, I’ve spent way too long looking at time series graphs for that to feel intuitive, so I’m reverting to the vanilla “time on the x-axis, going left to right” approach. Let’s stick Pope ID on the x-axis as a continuous dimension:

Great, we’ve got a line made up of lots of circles. This doesn’t make it easy to see what’s going on, so let’s filter to a single pope name – Leo will do for now:

Here are all the Leos, in order. It was a fairly popular (pope-ular?) pope name in the later part of the first millennium, but then it fell out of favour for a while, with almost 500 years between Leo IX in 1054 and Leo X in 1513.

We want to connect these dots with a line, but if we set the mark type to line, it’ll just be a straight line. Rather, we want a curved line, like this:

This is why we’ve got the scaffold table. We can’t just connect two points with a curvy line – or at least, I can’t. Instead, we need to put a load of dots between the two main points, and connect them up. That means figuring out the x and y axis values to put those dots in the right place to connect the two main points with a nice sine wave.

To do that, we’ll need to create a new x-axis measure to use instead of simply Pope ID, where the distance between popes (in units of popes) is divided up so that the scaffold points are evenly distributed along the x-axis. But first, that means calculating the distance between popes in units of popes. We can do that with a LOOKUP() calculation, which I’ve saved as NextPopeID:

`LOOKUP(ATTR([Pope ID]), 1)`

This is working nicely – I’ve stuck it on the tooltip, and hovering over Leo IX, who’s pope number 153 in my list, tells me that the next Leo is pope number 218.

This’ll work fine for this filtered view, but to get it to do this properly, you’ll need to put the Point field from the scaffold table on detail, and edit the table calculation to compute using Point and Pope ID:

At the moment, all those points are on top of each other on the Pope ID value. This isn’t what we want – we want to spread them out evenly between the Pope ID values. To do that, we’ll need this calculation here. It’s a bit long, and there are MIN() functions everywhere because everything has to be aggregated to sit alongside the table calculations, but hey:

Logically, what it’s doing is this:

1. There are 100 scaffolding points, going from 0 to 99.
2. If it’s the first one, i.e. 0, give it the same value as Pope ID. For the Leo IX to Leo X example, this is 153.
3. If it’s the last one, give it the same value as the next Pope ID with the same name. For the Leo IX to Leo X example, this is 218.
4. If it’s any of the rest, calculate the difference between the two Pope ID values (i.e. 218 – 153, which is 65 pope units), and then divide that by the maximum point value, which is 99. (If you made your scaffold points 1-100 instead, divide by the maximum point value minus 1, i.e. 99, not 100.) This is because there are 99 gaps to fill between the 100 scaffold points. Then multiply that fraction by the number of the point, and add the result to the Pope ID value.

You can also copy and paste it directly from here if that makes it easier:

```
IF MIN([Point]) = 0 THEN
    MIN([Pope ID])
ELSEIF MIN([Point]) = MIN([MaxPoint]) THEN
    [NextPopeID]
ELSE
    MIN([Pope ID]) + (([NextPopeID] - MIN([Pope ID])) / MIN([MaxPoint]) * MIN([Point]))
END
```
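By the way, the [MaxPoint] field in that calc isn’t something Tableau generates for you. A minimal version – assuming the 0-to-99 scaffold from earlier, and the field name is my choice – is a fixed LOD on the scaffold’s Point field:

```
// MaxPoint: the highest scaffold point value (99 for a 0-99 scaffold).
// A fixed LOD returns it on every row, so it can sit inside MIN() in the calc above.
{ MAX([Point]) }
```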

Grand. Set the new x-axis value to calculate using Pope ID and Point, restarting every Pope ID, and that’s the x-axis sorted. But these points are still basically just calculating a straight line, whereas what we actually need to do is push them up the y-axis by a different amount, kind of like this:

Let’s also add a direction calculation, so that the wave between the first and second popes of a name goes upwards, the wave between the second and third goes downwards, and so on. We can do that by working out whether the pope name number is odd or even, and setting the direction accordingly:

`IF INT([Pope Name No] % 2) = 0 THEN -1 ELSE 1 END`

Now let’s work on our y-axis calculation. It’s got three parts:

1. Working out a nice sinusoidal curve
2. Multiplying that value by how long it’s been between popes, so that the longer it is between popes, the higher the curve goes
3. Multiplying that by the direction calculation so that it goes above or below the x-axis accordingly

The first phase of a sine wave goes from 0 on the y-axis, up to a peak of 1, and then back down to 0 between the x-axis values of 0 and π, like so:

In our case, we don’t want a wave between 0 pope units and 3.141… pope units; rather, we want to define the beginning and end of this phase of a sine wave to be between one pope and the next pope of the same name. So, for Leo IX to Leo X, we want 153 to be our 0 and 218 to be our π. That means taking the scaffold point, dividing it by the maximum point of 99 to get the % of the distance that that point is along the 0-to-π scale, and then multiplying it by π.

That’ll give us the first phase of a sine wave of the same height (of 1) between the popes, regardless of how long it’s been between them. We want the peak to be higher the longer it’s been between popes, so we multiply it by the distance. Then we can multiply by our positive/negative direction calc. Here’s the code:

```
MIN(SIN([Point] / [MaxPoint] * PI()))
* ([NextPopeID] - MIN([Pope ID]))
* MIN([Direction])
```

So, stick the y-axis calc on rows, set the table calculation to compute using Point and Pope ID, and voila! We have a nice set of sine waves between our Leos.

(this plot reminds me of doing Fourier transformations for EEG analysis; technically, we haven’t created this complex wave by layering up different sine waves on top of each other, but we can kind of decompose it into sets of individual sine waves as we go along)

The hard work is done now, so let’s bring the rest of our popes back in:

Delightful. The rest of it is all about making it pretty, which I can leave to your personal tastes. But the real question is: what happens with Pius, the Barcelona of second-millennium popes? Can we clearly see the era of Pius dominance?

…yes, we can.

These graphs can be applied to basically anything that goes in a sequential order and may or may not have repeated values; this graph here is every word from my old band’s EP in order. I like how you can see where the choruses are, because the lines get more tightly woven as the words in the chorus are repeated more often.

I hope this blog makes it clear how to make these graphs! I still don’t know what to call them, but in my head they’re unimaginatively down as sine wave time series. Thanks again to Neil for creating them first, and for making his workbooks downloadable and play-around-withable!


# The Gaslight Analysis: when sentiment analysis doesn’t quite work.

I love sentiment analysis. It’s a great way of getting fascinating insights from a glut of text data. You can take a load of Yelp reviews, figure out how people feel about a place, and cross-validate it with the star rating. You can take the works of Jane Austen and plot narrative arcs. You can look at the texting styles of you and your girlfriend. If you’ve got a dataset with clear sentences in standard textbook English, you can find out all sorts of things.

But, here’s the thing with language; it’s gloriously, infuriatingly messy.

That makes it really hard to do really good sentiment analysis – certainly with the free, widely-available tools. Most of those assign certain emotional values to specific words; for example, in the NRC dataset often used with the R package Tidytext, the word “alive” has associations of ANTICIPATION, JOY, POSITIVE, and TRUST, while the word “afraid” has associations of FEAR and NEGATIVE.

This approach works great for sentences like this:

“I bought these shoes last week, and they’re amazing. They feel great, and they make me feel great. Good value too! 10/10, very happy about this.”

…but it doesn’t work for sentences like this:

“I bought these shoes last week, and they really don’t feel good; I’d love nothing more than to send them back.”

The second sentence is pretty obviously negative, but it works by negating words. The word “good” isn’t actually good, because it’s being negated by “don’t” a couple of words earlier. And “love” isn’t a positive emotion here, as it’s expressing the desire to get out of the situation, meaning that what’s going on is not a positive thing.

It’s possible to address this with sentiment analysis, but it’s complicated. You’d have to account for every possible way of negating/reversing a word, and there’s a lot of those. You’d have to account for every possible way that a word that’s positive in isolation could actually be referring to a negative overall situation, and that’s a huge task. This is why good sentiment analysis costs a fortune. It’s really complicated.

Luckily, people tend not to speak indirectly all the time, and in aggregate, the twisting, sentiment-negating sentences are cancelled out by the number of straightforward sentences where word-level sentiment analysis does work. But the caveat is that just because you’re using sentiment analysis, that doesn’t mean you’re using it well, and you should really cross-validate it with some other measures.

I’m exploring this with lyrics from Brian Fallon’s bands – The Gaslight Anthem, The Horrible Crowes, and his solo project. I’m looking at Fallon’s lyrics because:

1. Song lyrics are enough of a deviation from standard English to pose problems for standard sentiment analysis;
2. At the same time, song lyrics are some of the most obviously emotional usages of language we have;
3. Fallon writes lyrics in a pretty clear style, often in full sentences, without too many obscure metaphors or references;
4. I really like his music.

I’ve used the Tidytext package in R for doing sentiment analysis before. This time, I’m using the same NRC sentiment dataset, but trying it out in Alteryx instead. I’ve also visualised it in Tableau, and you can click any of these images to go to an interactive link where you can play around with it yourself.

So, first things first; let’s have a look at sentiment in each song:

Looks pretty good so far. Here’s lookin’ at you, kid is a wistful, regretful song; definitely on the negative side, quite a bit of sadness, very little joy. Click the graph to explore other Brian Fallon songs, if you know them.

There are a lot of different sentiment measures available, so let’s simplify it to looking at positive and negative. Here’s lookin’ at you, kid has 13 negative words and 4 positive words. If we take the difference (9) and divide it by the biggest value (13), we get a ratio of positive to negative words:

`(Positive – Negative) / MAX(Positive, Negative)`

This accounts for the difference between positive and negative words, as well as the number. For example, if one song has 10 positive words and 5 negative words, and another song has 6 positive words and 1 negative word, the difference is the same, but the second song is more positive overall, because it has six times as many positive words as negative words, rather than twice as many.
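Since the visualisation ends up in Tableau anyway, here’s a sketch of that balance as a Tableau calc – assuming per-song Positive and Negative word-count fields, which are my names rather than anything in the original workbook:

```
// +/- balance per song: +1 means all positive words, -1 means all negative
(SUM([Positive]) - SUM([Negative])) / MAX(SUM([Positive]), SUM([Negative]))
```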

If we calculate this +/- balance for each song, we can order them as follows:

There’s a nice mix of positive and negative songs, and if you know the songs, some of them definitely feel right; Here’s lookin’ at you, kid is negative, so is Get hurt, while 45 is an upbeat, positive song. But there’s definitely some weird ones in there. We did it when we were young is a sad, regretful song, but it’s up there in the top half of positive songs. That doesn’t seem right.

So let’s cross-validate this. Spotify’s Echo Nest data has a measure called Valence, which is a measure of how positive the mood of a song is. You can get all kinds of interesting measures for your Spotify playlist here. When we plot the Spotify Valence (branching off to the left for values under 50), we get this instead:

Spotify has Here’s lookin’ at you, kid as one of the most negative songs, along with Cherry blossoms, which seems about right to me, but has We did it when we were young as a pretty neutral song, which still doesn’t feel right. Have a look at the difference with Blue jeans and white t-shirts, as well – it’s one of the most negative songs according to Spotify, but one of the most positive according to sentiment analysis. I’d put it somewhere in the middle, maybe a bit more positive than neutral.

Since I keep using my own perspective as a fan and a human, I figured I’d better cross-validate both of these stats with what fans think. I set up a simple survey to get Brian Fallon fans to rate each song for positivity on a scale of 1 to 7, where 1 meant really negative, 4 meant neutral, and 7 meant really positive. I stressed in the introduction, several times, that it’s not a rating of how much you like each song, or how positive each song makes you personally feel (like, I really like Fallon’s sad songs because they make me feel nice… but they’re still objectively sad), but of the emotion in the song itself. Around 15-20 fans answered for each song, so I averaged their ratings together to get a human-generated emotion rating per song. It’s not the most scientific approach, but it’s good enough for the purposes of this blog.

Here’s what we get, centred around an average of 4 for neutral songs:

This time, Blue jeans and white t-shirts comes in as I see it – fairly positive, but not hugely so. We did it when we were young is down there in the negative range, along with Get hurt and Here’s lookin’ at you, kid.

It’s fascinating to see how the three measures agree and disagree for each song. If we rank each song along each measure (with 1 being most positive and 85 being most negative), we can see how the rank difference varies. There are five possible combinations:

1. All measures disagree with each other
2. All measures agree with each other
3. Spotify valence and fan rating agree, but sentiment analysis disagrees
4. Sentiment analysis and fan rating agree, but Spotify valence disagrees
5. Sentiment analysis and Spotify valence agree, but fan rating disagrees

…and there’s at least one example of each:

We can see which songs are most consistently rated across all three measures by looking at the difference between each song’s highest and lowest positivity rank:

Blood loss is the most consistently rated, with a rank difference of only four places, while The backseat has a massive rank difference of 81 – fans put it as the second most positive song in Brian Fallon’s catalogue, while sentiment analysis rates it as 83rd, ahead of only I believe Jesus brought us together and I witnessed a crime. Spotify puts it at 24th.

Another way of showing this variation is by creating scatterplots of each measure against each other, with each dot representing a song:

I’ve run simple correlations on each plot – not exactly statistically kosher, but this is all just exploratory. There is no correlation between valence and sentiment analysis, and more tellingly, no correlation between sentiment analysis and fan ratings. There is a correlation between valence and fan ratings, but it’s not particularly strong.

The overall point, then, is to be careful with sentiment analysis. It’s not that it doesn’t work – it can often work really well, and be a really useful line of investigating data. But relying on sentiment analysis alone, without checking whether it matches measures that should reflect the same kind of thing, might give you some false insights. You don’t want to have Great expectations, or you might Get hurt.


# The growing gap between the Premiership’s Top Six and the rest.

This is my first football data blog for a while, and I feel all nostalgic! It’s nice to dive into some league table data again, and even nicer now that I have Alteryx; I was able to format my data about 10x quicker than when I first started doing this in R. Then again, I’ve probably also spent 5x more time using Alteryx than R in the last year or so. Anyway.

I’ve been hearing a lot more analysis of the Top 6 in the Premiership recently. I first noticed it in the last couple of seasons, when I saw a few journalists/people on Twitter writing about a “Big Six Mini-League”. Liverpool often seemed to do quite well at this, and Arsenal often seemed to do quite badly at this. Neither team won the actual league.

I’ve started looking at how the Top 6 sides in the Premiership perform each year (using data from this fantastically well-maintained repository), and there are quite a few interesting stories in here. The first main point is that the big clubs are accelerating away from the rest of the league. The second main point is that any big six mini-league doesn’t really matter, as you can win the Premiership with an underwhelming record against your main rivals if you thrash everybody else. I mean, that shouldn’t be much of a surprise – if you’re a Top 6 team, only 30 points are on offer from the 10 matches against your rivals, but you can potentially take 84 points from the 28 matches against the rest of the league.

For all these analyses, I’m taking Top 6 literally – meaning the teams that finish that season in the top six positions. Nothing to do with net spend, illustrious history, shirt sales in Indonesia, or anything like that. I then look at the average points-per-game changes by team, position, season, and Top 6/Bottom 14 status. I also filtered out the first three seasons of the Premiership to keep it slightly easier for comparison, since there were 22 teams in the league until 1995-96.

When plotting the average points-per-game per season between the two groups, a clear trend emerges; the Top 6 are better and better at beating the rest of the league:

However, this trend appears to be asymmetrical. When looking at the overall average points-per-game for all games across the season, teams that finish in the Top 6 are getting better, but there’s only a negligible decline for the rest of the league. This suggests the bigger, better teams are pulling away from the rest of the league:

This effect is most striking when plotting the difference in overall average points-per-game between the two groups:

Teams finishing in the Top 6 scored around 0.6 points-per-game more than the rest of the league in the early nineties, but that’s now up to over 1 point-per-game in the latest couple of seasons. That half-a-point difference translates to a 19-point difference across a whole 38-game season.

We can plot each team in each season of the Premiership (since 1995-96, when the league was first reduced to 20 teams) and look at how well they did against the top teams and the rest of the league. In this graph, the straight line represents equal performance vs the Top 6 and Bottom 14:

A couple of things stand out:

1. Only a handful of teams have ever done better vs. the Top 6 than the rest of the league. This seems to have no effect on final position.

2. It’s possible to win the league with a poor record against the Top 6 by consistently beating everybody else. Manchester United won the league in 00-01 and 08-09 with only 1.3 PPG vs the Top 6.

3. Manchester City this year are ridiculous.

I find it interesting to compare Liverpool and Arsenal over the years. One narrative I sometimes hear is that Liverpool tend to raise their game for big matches, but are too inconsistent the rest of the time, whereas Arsenal struggle against big sides but do well enough in the rest of the league to consistently finish well. This chart seems to bear that analysis out; Liverpool’s cluster of dots are higher on the chart, but further to the left:

…while Arsenal’s cluster is slightly lower but further to the right… and most importantly, more colourful:

And if you want to explore other teams and seasons, there’s an interactive version of all these graphs here.


# Why won’t my Tableau small multiples chart work?

Andy Kriebel wrote a great tutorial on how to make small multiples charts in Tableau here. It works pretty much all the time… but you’ll also find that if you simply copy and paste the calculations, it might not work with your data.

For example, have a look at Superstore here. I’m plotting sum of Sales for each continuous month per product subcategory. The rows and columns calculations split up the view nicely into a line for each subcategory, which is good:

But look closely, and you’ll see some weird stuff going on; there’s a brown-ish dot for subcategory = Copiers and month = October 2014 in the Chairs section (second row, second from left):

What’s going on there?

It turns out that the rows and columns calculations can’t handle nulls in the underlying dataset. I haven’t dived into this fully, but I’m guessing this is because the index calculation works depending on what’s in the view, rather than being fixed on all the subcategories and months regardless of whether there’s data or not.

In this case, what happens is that the October 2014 missing data for one category – Accessories – shunts everything else up one; the Appliances value turns up in the Accessories small multiple, the Art value turns up in the Appliances small multiple, and so on. The same thing would happen in March 2014 if there was another subcategory after Tables too.

You’ll see that if you switch to calculating using quarters instead of months, this problem disappears completely.

Andy’s calculations are great because they’re really flexible, and they’ll work fine without much further adjustment most of the time. But if you get issues with null data like this, you can try this alternative instead.

With these calculations, I’m going to hardcode the small multiples by whatever thing you’re splitting up the view by. That means that you’d have to create separate fields for every dimension you’d ever want to do it by, which is extra work, but it does take care of the nulls issue.

First, create a calculation called Number (or Subcategory ID, or Steve, or whatever suits you). This is a case statement which assigns a number from 0 to N-1 for a particular dimension.

```
CASE [Sub-Category]
    WHEN 'Accessories' THEN 0
    WHEN 'Appliances' THEN 1
    WHEN 'Art' THEN 2
    WHEN 'Binders' THEN 3
    WHEN 'Bookcases' THEN 4
    WHEN 'Chairs' THEN 5
    WHEN 'Copiers' THEN 6
    WHEN 'Envelopes' THEN 7
    WHEN 'Fasteners' THEN 8
    WHEN 'Furnishings' THEN 9
    WHEN 'Labels' THEN 10
    WHEN 'Machines' THEN 11
    WHEN 'Paper' THEN 12
    WHEN 'Phones' THEN 13
    WHEN 'Storage' THEN 14
    WHEN 'Supplies' THEN 15
    WHEN 'Tables' THEN 16
END
```

Typing all that out is quite a faff, so I generate that text with a concat function in Excel like this:

Now create a calc called Columns with the modulo function like this:

`[Number] % 4`

And then create a calc called Rows by dividing and rounding like this:

`INT ([Number] / 4)`

It’s crucial that you use the same constant in both calcs! I’ve used 4 because that’ll give me 4 columns across the top, meaning that the 17 subcategories in Superstore will be split over four rows of four columns and a fifth row with one column, exactly like Andy’s small multiples do. For example, Copiers has Number 6, so it lands in column 6 % 4 = 2 and row INT(6 / 4) = 1 – third column, second row, counting from zero. If you want to do it another way, you could use 3 instead; that would give you five rows of three columns and a sixth row of two columns. It’s a bit more manual to set up, but it gives you full control over the configuration you want to plot.
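And if you’d rather not hardcode the 4 in two places, here’s a variant I’d suggest – with an integer parameter called [Columns Across], which is my own addition rather than part of Andy’s method:

```
// Columns: position across the grid, from 0 to [Columns Across] - 1
[Number] % [Columns Across]

// Rows: which row of the grid the panel lands in
INT([Number] / [Columns Across])
```

Changing the parameter then reflows the whole grid without editing either calc.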

Now that you’ve got these row and column calcs, you can drag them into the view like this, and generate small multiples which work even with null data:

Just to make sure, let’s colour code it by subcategory too. No differently coloured dots in the wrong places anymore!

Another advantage of this approach is that you can colour the graphs by another field. You can do that with Andy’s calcs too, but you have to be careful about how the table calcs work and what they’re using to compute the calculations. Because my calcs don’t have index() in them, there’s no table calc issues to worry about. Just drag and drop.


# How to give your area and bar charts a makeover with connected scatterplots.

I haven’t always been a fan of connected scatterplots, but I’m gradually coming round to them. First it was with centre of gravity maps; now it’s as a replacement for (some) area charts and bars.

I came across a chart a lot like this at work this week:

I’ve mocked it up using some fake data, but it’s pretty much showing how two departments (or groups, or types of things, or categories) do in terms of profit (or another measure) over the year, and how many orders (or another measure again) there were in each month across both groups.

I’m not a fan of the area chart plus bars over the top approach. Firstly, the bars obscure what’s going on with the area chart underneath them, and secondly, area charts can be misleading as it’s hard to parse each thing separately.

Let’s show the same information with a connected scatterplot:

Here, the red and blue lines show the two departments, and the grey line shows the aggregated number of orders and profit. Instantly, the difference between the two departments is a lot clearer; department A has had a lot of variation in number of orders but profit has stayed pretty consistent, while department B has had a lot of variation in profit but the number of orders has stayed consistent.

The lines are joined up by month… but as it is, it’s impossible to tell where the year begins and ends, which makes the whole thing pretty pointless. Let’s show time with size:

The lines get thicker as the month gets more recent, and now it’s easy to see the trends over the year. The variation in orders in department A is all over the place, but the variation in profit in department B is a bit more consistent; profit has gone down over the year. We can also see the aggregate profit and order trends much more clearly on the grey line, with orders going up but profit going down.

The downside of using line size to show time trends is that the thinner parts are hard to see and the thicker parts can be hard to parse. Let’s try it with colour instead:

I personally prefer this approach to using line size, but it’s also a bit of a faff. Tableau doesn’t like it if you try to do a three-way colour split by dimension (i.e. red, blue, and grey) and then change the shade of the colour by a measure, so you have to convert the month to discrete and make sure to order everything correctly. It’s not too taxing though, so it’s worth it if the data doesn’t change all the time.

Finally, you can go the whole hog and do some double encoding with both colour and size on the line to show time:

This is eye-catching, but possibly to the point where it’s more distracting than informative.

I like the connected scatterplots in this example, although there are many situations where the lines will overlap in a way that won’t tell you much. It certainly won’t work well with lots of different departments; here’s the mess you get if you look at all countries in EU Superstore for all months of all years:

Even when filtering to four countries and two years of data, it’s not the clearest way of showing things:

In summary, then, give a connected scatterplot a go. It may well not work, but sometimes it’ll result in something a lot clearer and more informative than a combined area/bar chart.


# Not all who wander are lost… but it takes work to wander well.

[I originally wrote this as a guest blog for My Scholarly Goop, I’m now crossposting to my own blog]

I used to be jealous of my friend Dave when we were at school. He’s wanted to be a doctor ever since he can remember, and that gave purpose to everything he did. We’d be in a chemistry lesson, and he’d be listening intently, even though he knew the topic, and he knew that he knew the topic, because a mastery of chemistry would be the foundation of the rest of his career. He wouldn’t let losing focus on a dull Tuesday afternoon potentially jeopardise his university applications a few years down the line. I’d be sitting next to him, quietly filling out the sudoku I ripped out of the newspaper in the library.

I envied his sense of purpose and direction. Our lives looked a bit like this:

But in time, I grew a lot less jealous, and embraced my lack of direction. There’s something liberating about not being focused on any particular thing; it gives you room to explore all the spaces in between.

My academic career veered from Japanese to linguistics to cognitive neuroscience, with a smattering of international relations, public policy, and statistics. Along the way, I worked in various jobs in financial software consultancy, accounting, commercial law, translation, and selling dog food. I’m now a data scientist, and I’ve done projects with a Premiership football team, an auction house, a scientific research funding body, a cargo company, and two different medical charities. I’m currently on a six-month placement with Solar Turbines, working on making huge turbine engines more efficient and reliable. I haven’t followed a path so much as got lost in the forest, and I’ve thoroughly enjoyed taking in all the trees.

Now, if you’re like my friend Dave, then great! You know what your goal is, and you’re probably doing what you need to do to get there. But if you’re reading a blog about post-PhD career paths, you’re probably a bit more like me. It’s exciting, because you can do anything! But it’s also overwhelming, because you should consider everything.

Writing about the joys of not having goals may sound flippant, but aimlessly wandering successfully takes a lot of effort. To put it in context, this is what my summer in 2016 looked like:

I applied for something like 35 or 40 jobs in various fields, I went to six interviews, and I received one job offer (I withdrew from another two interviews before attending, because I’d have taken my current job at The Information Lab even if I’d been offered any of the others). Each job application took a good two hours or so to write, so that’s maybe eighty hours of work. The job application process for The Information Lab meant I had to download and learn a new bit of software, so I probably spent at least ten hours on that application alone.

But just the hours spent applying for jobs wasn’t enough. My PhD was about how special kinds of onomatopoeic words might or might not be related to multisensory processing and perhaps synaesthesia… and that’s the general summary version. PhD research is highly specific, which also means it’s highly irrelevant to most non-academic jobs. When you talk about your PhD, this is what you think you’re saying, and how it compares to what most people hear:

Unlike the content of your PhD itself, the wider skills you learn are valuable, widely transferable, and highly sought. However, it’s your responsibility to show this to people. You could be a brilliant analyst, but nobody’s going to listen if you only talk about your skills in the context of your research. So, for every two hours or so I spent on a particular job application, I probably spent another two hours creating a portfolio on my blog. I used my coding, statistical, and data visualisation skills to play with various different data sources, such as looking at how the gender gap in GCSE results in England corresponds to various measures of how “good” a school is, or looking at football stats to show that the 2016 Portugal side are the worst winners of a European Championship. This took a lot of time and effort, but every single organisation that invited me to an interview said that it was this portfolio of work on my blog that had got me there. Not my PhD work.

This means that I spent about 160 hours, or twenty eight-hour days, or just under one working month, on getting one job offer. This probably sounds quite daunting. And yeah, it is. Getting a job outside academia was like writing an extra chapter for my PhD.

But the good news is that while it might be hard work, it won’t be wasted work. I’ve spoken to a fair few PhD students looking for a career outside academia, and I think a lot of people fall into a zone of self-defeat; they’re more than qualified for the jobs they’re applying for, but they undersell themselves because they’ve spent years surrounded by frighteningly competent people during their PhDs and only see themselves in relative terms. I felt like this for a long time myself, and it takes a while to get out of this mindset. I think academia does this to people’s perceptions of themselves:

All this means is that academia itself is probably the biggest obstacle to getting a job outside academia. Take the time to research different careers, different people, different ideas; once you can frame your skills and abilities independently of your own research, you’re halfway there.

If you’re considering a career outside academia, you’re already far more qualified than you think you are, you’ve got more to offer than you think you do, and you can be far more than you think you can. But it’s up to you to prove it.


# Alaska Fried Chicken: the UK’s curious approach to naming chicken shops.

I went a little bit viral a couple of weeks ago when I tweeted about chicken shops in the UK which are named after American states which aren’t Kentucky. If I’d thought about it, I’d have written this blog up first, created a Tableau Public viz, and had all kinds of other shit ready to plug once I started getting some serious #numbers… but I didn’t. So, to make up for that, this blog will go through that thread in more detail and answer a few questions I received along the way.

It all started when I walked past Tennessee Fried Chicken in Camberwell, pretty close to where I live. It’s clearly a knock-off KFC, and I wanted to know how many other chicken shops had the same name format: [American state] Fried Chicken.

The first thing to do is to get a list of all the restaurants in the UK. I spent a while wondering how to get this data, but then I remembered that my colleague Luke Stoughton once built a Tableau Public dashboard about food hygiene ratings in the UK. All UK chicken shops – hopefully! – are inspected by the Food Standards Agency. So, Luke kindly showed me his Alteryx workflow for scraping the data from the FSA API, and I adjusted it to look for chicken shops.

My first line of inquiry is pretty stringent: how many chicken shops in the UK are called “X Fried Chicken” where X is an American state which isn’t Kentucky?

Turns out it’s 34. “Tennessee Fried Chicken” – including variants such as Tenessee and Tennesse – is the most popular with 13 chicken shops. The next highest is Kansas with six, which I’m assuming is so the owners can refer to their shops as KFC, although maybe the owner/s just really like tornadoes, wheat, and/or the Wizard of Oz. Then there’s four Californias, a couple of Floridas, and one each of Arizona, Georgia, Michigan, Mississippi, Montana, Ohio, Texas, and Virginia.

[tangent: I’m aware that a lot of these states aren’t exactly famed for their fried chicken, but as a Brit, all I have to go on for most of them are my stereotypes from American media. But hey, maybe it’s still accurate, and Ohio Fried Chicken tastes of opiates and post-industrial decline, Arizona Fried Chicken comes pre-pulped for the senior clientele who can’t chew so well these days, and Florida Fried Chicken is actually just alligator. Michigan Fried Chicken is, I dunno, fried in car oil rather than vegetable oil, and Alaska Fried Chicken is their sneaky way of dealing with the bald eagle problem up there? I’m running out of crude state stereotypes now, I’m afraid. Out of all these states, I’ve only actually been to California.]

There’s also a “DC Fried Chicken”, which is close but not quite close enough for me, and a “South Harrow Tennessee Fried Chicken”, which I’m not counting either.

Here is where these American State Fried Chicken shops are in the UK:

Interestingly, this isn’t a case of a map simply showing population distributions. The shops cluster around the London and Manchester regions, but with almost none in any other major urban centre.

Let’s have a look at the clusters separately. Here’s the chicken shops around the Manchester area:

None of them are in the proper centre of Manchester itself, but they’re in the towns around. One town in particular stands out: Oldham. Let’s have a look at the centre of Oldham:

Oldham, you’re fantastic. There are six separate “X Fried Chicken” shops in Oldham, and four of them – Georgia, Michigan, Montana, and Virginia – are the only ones by that name in the whole country.

For comparison, here’s the London area:

This is where all the Tennessees are, as well as the one Texas and Mississippi.

It looks like there’s a lot more variety in the north of England compared to the south, and sure enough, a split emerges:

[chicken icon from https://www.flaticon.com/packs/animals-33]

Chicken shops in the south of England (and that one Tennessee place in Wales) tend to name their shops after states in the geographical south of the USA, while chicken shops in the north of England name their shops after any states they like.

This is where my initial Twitter thread ended, and I woke up the next day to a lot of comments like “Y IS THEIR NO MARYLAND THEIR IS MARYLAND CHICKEN IN LEICESTER”. Well, yeah, but it’s not Maryland Fried Chicken, is it?

So I re-ran the data to look at chicken shops with an American state in the name. This is the point at which it’s hard to tell if there’s any data drop out; the FSA data categorises places to inspect as restaurants, takeaways, etc., but not as specifically as chicken shops. All I’ve got to go on is the name, so I’ve taken all shops with an American state and the word “chicken” in the name. This would exclude (sadly fictional) places like “South Dakota Spicy Wings” and “The Organic Vermont Quail Emporium”, but it’d also include a lot of false positives; for example, you’d think that taking all takeaway places with “wings” in the name would be safe, but when I manually checked a few on Google Street View (because I’m dedicated to my research), about half of them are Chinese and refer to the owner’s surname, not the delicacy available.

This brings in a few more states – Maryland, New Jersey, and Nevada:

Let’s have another look at the UK’s south vs north split. We’ve got a bit of midlands representation now, with the Maryland Chickens in Leicester and Nottingham, the Nevada Chickens in Nottingham and Derby, and a California Chicken & Pizza near Dudley. The latitude naming split between the south/midlands and the north isn’t quite as obvious anymore:

…but, there is still a noticeable difference. This graph shows each chicken shop with an American state and the word “chicken” in the name, ordered by latitude going south to north:

In the south and the midlands, there’s the occasional chicken shop going it alone: the Texas Fried Chicken in Edmonton; the two Mississippi places in London which don’t seem to be related (Mississippi Chicken & Pizza in Dagenham, Mississippi Fried Chicken in Islington); the Kansas Chicken & Ribs place in Hornsey, which is almost definitely a different chain from the six Kansas Fried Chicken shops in and around Manchester; and the California Fried Chicken in Luton, which is probably independent of the California Fried Chickens on the south coast. But most of them are Tennessee or Maryland chains in the same area. In all, the south and midlands have 17 chicken shops named after 8 American states (excluding Kentucky), or a State-to-Chicken-Shop ratio of 0.47.

In the north, however, there’s a proliferation of independent chicken shops – 15 shops named after 9 different states (excluding Kentucky), or a State-to-Chicken-Shop ratio of 0.6. There’s the chain of six Kansas Fried Chicken places and two Florida Fried Chicken places in Manchester and Oldham, but the rest are completely separate. Good job, The North.

The broader question is: why does the UK do this? There’s obviously the copycat nature of it; chicken shops want to seem plausible, and sounding like a KFC (and looking like one too, since they’re almost always designed in red/white/blue colours) links it in people’s minds. I think there’s more to it, though. Having a really American-sounding word in the name is probably a bit like how Japanese companies scatter English words everywhere to sound international and dynamic (even if they make no sense), or how Americans often perceive British names and accents as fancier and more authoritative (even if to British ears it’s somebody from Birmingham called Jenkins). We’re doing the same, but… for fried chicken.

Finally, since this data is all from the Food Standards Agency’s hygiene ratings, it’d be a shame not to look at the actual hygiene ratings:

It looks like independently-named chicken shops named after American states in the north are more hygienic. The chains in the south and midlands – Tennessee, Maryland, California, and especially New Jersey – don’t have great hygiene ratings, and the independent shops do pretty badly too. In contrast, the chicken shops in the north score highly for cleanliness. In fact, a quick linear regression of hygiene onto latitude gives me an R² of 0.74 and a p-value of < 0.0001. Speculations as to why this is on a postcard, please.

Update, November 2018: I’ve finally got round to refreshing the data and putting up an interactive, searchable map. Sadly, it looks like Ohio Fried Chicken has shut down, but there’s another Arizona Fried Chicken now, so… (s)wings and roundabouts. Have a look for (probable) chicken shops in your area here.

“I live in […] and my local shop […] isn’t mentioned!”
Maybe you’re talking about a Dallas Chicken place. That’s not a state. Nor is Dixy Chicken, it just sounds a bit American. If it’s definitely a state, then does it have chicken in the name? If not, I won’t have picked it up. I also haven’t picked up shops which have, say, “Vermont Fried Chicken” written on the shop sign if it’s registered in the database as “VFC”. Same with if the state is misspelled, either by the shop or by the data collectors. If it’s all still fine, perhaps the shop is so new that it hasn’t had an inspection… or perhaps the shop is operating illegally and isn’t registered for a hygiene inspection.

“Did you know about Mr. Chicken, the guy who designs the signs?”
I didn’t, but I do now! He’s brilliant.

“How did you do all this?”
I use Alteryx for data scraping/preparation and Tableau for data visualisation.

“I have an idea for something / I want to talk to you about something, can I get in touch?”
Please do! My Twitter handle is @GwilymLockwood, or you can email me on gwilym.lockwood@theinformationlab.co.uk

“Your analysis is amazing, probably the best thing I’ve ever seen with my eyes. Where can I explore more of your stuff?”
Thanks, that’s so kind! There’s a lot of my infographic work on my Tableau Public site here.


# Strings and roundabouts: how to label your numbers in Tableau effectively

Lots of Tableau dashboards feature big summary numbers. They’re pretty nice, and they look like this:

And what’s even nicer is that you can alter the measure’s default number format to automatically round to a specific unit, like this:

But sometimes, the range of numbers is a lot wider than sales per state in Superstore. Let’s have a look at population per country in the World Indicators dataset:

Those are some long numbers, so let’s round them to the nearest unit again. But which one? If we round to the thousands, we get this:

And if we round to the millions, we get this:

It’d be great if we could get Tableau to figure out what the nearest sensible unit is. That functionality doesn’t exist yet (as far as I know!), but we can write a specific optimised rounding calculation for labelling purposes. It’s a bit of a long one:

This calculation returns the number you want as a string. It does this by:

1. Aggregating the number you’re actually working with already and finding out whether it’s above a billion (in which case you’d want to summarise to whatever number of billions it is), or above a million (in which case you’d want to summarise to whatever number of millions it is), and so on.
2. Converting it to an absolute number so that it works for negative numbers too.
3. Taking that aggregated number and dividing by the sensible unit. For example, if your number is 34000000, you’d want to express it as 34 million, so we’re dividing it by a million to return 34.
4. Rounding that divided figure to one decimal place. This is just my preference, you can do what you like! Set the number to 0 for no decimal places, or 2 for two decimal places, etc.
5. Converting that number to a string.
6. Adding a text unit abbreviation to the end of it.

Of course, you can also add trillions, quadrillions, and so on, if that’s what your data requires.

This sorts us out nicely:

Now, I’ve deliberately aggregated everything within the calculation, and I’m only using it for labelling purposes. I categorically do not recommend aggregating outside this calculation or using this calculation for calculating anything else. This will result in a shitstorm of rounding errors which can seriously damage your data. But as a final step once you’ve sorted everything out, I find that this is really nice for presenting data.

Here’s the calculation in text so you can copy and paste it into your workbooks:

```
IF ABS(AVG([Population Total])) >= 1000000000 THEN
    //round for billions
    STR(ROUND(AVG([Population Total] / 1000000000), 1)) + "b"
ELSEIF ABS(AVG([Population Total])) >= 1000000 THEN
    //round for millions
    STR(ROUND(AVG([Population Total] / 1000000), 1)) + "m"
ELSEIF ABS(AVG([Population Total])) >= 1000 THEN
    //round for thousands
    STR(ROUND(AVG([Population Total] / 1000), 1)) + "k"
ELSE
    STR(ROUND(AVG([Population Total]), 0))
END
```
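And if your data does need trillions, the extra branch slots in at the top, following exactly the same pattern – a sketch, with “t” as my abbreviation of choice:

```
IF ABS(AVG([Population Total])) >= 1000000000000 THEN
    //round for trillions
    STR(ROUND(AVG([Population Total] / 1000000000000), 1)) + "t"
ELSEIF ABS(AVG([Population Total])) >= 1000000000 THEN
    //round for billions
    STR(ROUND(AVG([Population Total] / 1000000000), 1)) + "b"
ELSEIF ABS(AVG([Population Total])) >= 1000000 THEN
    //round for millions
    STR(ROUND(AVG([Population Total] / 1000000), 1)) + "m"
ELSEIF ABS(AVG([Population Total])) >= 1000 THEN
    //round for thousands
    STR(ROUND(AVG([Population Total] / 1000), 1)) + "k"
ELSE
    STR(ROUND(AVG([Population Total]), 0))
END
```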


# Time is a Machine: how to create a clock in Tableau

I made a clock in Tableau this week, and you can find it on Tableau Public here.

It always shows the current time for the UK, but it shouldn’t be hard to parameterise to update to whatever time zone you’re in.

Essentially, all it is is two points on a scatterplot, connected by lines to the coördinates (0,0), and superimposed on a background image. I made the background image in Powerpoint, based on the clock in the Time episode of Don’t Hug Me I’m Scared.

I’ve written before about using radial calculations to plot distance from the centre and change the lengths while keeping the angles constant. This time, we’re going to change up the trigonometry a bit, and calculate the angle while keeping how far the line goes constant.

Firstly, though, we need some data to work with. All you need to get a DateTime is a single cell in a single column… but for plotting purposes, we’re going to need the following dataset:
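In table form, that’s essentially just four rows – a start point (Path ID 0) and an end point (Path ID 1) for each hand, using the Time Unit and Path ID fields that come up later:

| Time Unit | Path ID |
|-----------|---------|
| Hours     | 0       |
| Hours     | 1       |
| Minutes   | 0       |
| Minutes   | 1       |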

That’s all we’ll need! Read that into Tableau, and the rest can be done with calculated fields.

Firstly, we need to find out what the time is. Tableau has the `NOW()` function, which is really useful. It returns the exact time, down to the second, of the time on your computer (assuming that you’re working in Tableau Desktop with an Excel sheet you’ve created just for this). But when it’s published on Tableau Server, it returns the time of the Tableau Server Data Engine, which seems to be eight hours behind UK time (as of 19th September 2017, when I’m writing this; I’ve no idea how daylight saving changes will affect it).

So, let’s create our Right Now field, and add eight hours to it with the `DATEADD()` function so that it’ll give us the UK time when published:

Right Now:
```DATEADD('hour', 8, NOW()) ```

The next step is to take Right Now and parse out the time parts that we want to plot. Let’s just go with hours and minutes; plotting seconds is possible, but it’ll look like it’s not working if the dashboard isn’t updating every half second or so. So, let’s create an Hours field and a Minutes field as follows:

Hours:
```DATEPART('hour', [Right Now]) ```

Minutes:
```DATEPART('minute', [Right Now]) ```

This will give the current hour and the current minute as a number. There’s an extra step we need to take, though… the hour hand on a clock doesn’t point at the exact hour number for the whole of the hour, it moves around depending on the minutes that have passed. If it’s half past ten, the hour hand doesn’t point at ten exactly, it points about halfway between the ten and the eleven.

So, let’s create another field called Exact Hour for the exact point between hour marks to plot:

Exact Hour:
`[Hours] + ([Minutes] / 60)`

This works by giving us the hour (e.g. 6 for 6pm), and then adding the amount of the hour that we’ve got through. For example, if it’s 6.15pm, the number of minutes is 15, and we’re quarter of the way through the hour. 15/60 = 0.25, so the point where the hour hand will point to is 6.25, i.e. quarter of the way from 6 to 7.

After that, we need to create a single field to plot. This is why the underlying data has the Time Unit field, with separate rows for each hand.

Time for plotting:
```IF [Time Unit] = "Hours" THEN [Exact Hour] ELSE [Minutes] END```

Now that we have our field to plot, we’re ready to do some trigonometry!

We know that we want the clock hands to begin at (0,0) on the scatterplot; what we need to work out is where the clock hands need to end. To be able to plot the X and Y coördinates of where the hands end, we first need to know the angle of the line from (0,0). In simple terms, the scatterplot works like this:

Finding the angle is fairly simple. There are 360° in a circle, and rather conveniently, a clock face is just a big old circle, starting with 0° from the centre at the 12 o’clock position. There are 12 hour points that go round the clock face, so if we want to find out the hour hand’s angle, we divide the hour value by 12 to find out how far around 360° it is, then multiply that fraction by 360. For minutes, the same thing holds, but there are 60 points instead of 12.

Angle:
```IF [Time Unit] = "Hours" THEN ([Time for plotting] / 12 ) * 360 ELSE ([Time for plotting] / 60 ) * 360 END```

“But wait!”, I hear you shout at the screen. Dividing the hour by 12 might work for the morning, but what about when it’s the afternoon, when Tableau’s DATEPART() function will return the number 18 for 6pm, as it works on a 24 hour format?

You’re completely right, I haven’t accounted for that. But I don’t really need to. If it’s 6pm, the hour is 18. 18/12 is 1.5, and multiplying that fraction by 360 gives us 540°. Sure, 540 is bigger than the 360° that are in a circle… but the wonderful thing about circles is that they’re, well, circular. Plotting 540° on this clock face will look identical to plotting 180°. If it bothers you that they’re not technically the same, feel free to add an IF clause to identify the afternoon and then subtract 12 hours from the Exact Hour field.
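If it does bother you, a minimal version of that IF clause looks like this:

```
// Optional: normalise to a 12-hour clock face.
// Purely cosmetic - 540 degrees plots in the same place as 180 degrees anyway.
IF [Exact Hour] >= 12 THEN [Exact Hour] - 12 ELSE [Exact Hour] END
```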

Now that we’ve got the right angles, we can calculate where the coördinates go. This is a bit more tricky.

The first thing to bear in mind is that I’ve changed the trigonometric functions to reflect how Tableau will actually plot the angles, rather than using the standard ones in maths textbooks.

Maths textbooks will tell you that to find the coördinates (X,Y) on a circle, given the angle θ and a radius of 1 from the centre point (0,0), the equations are Y = Sin θ and X = Cos θ. I’m not going to go into why or how here, but please just trust me on this one and take it at face value. Y = Sin θ and X = Cos θ.

Those maths textbooks will also give you a diagram like this:

But this isn’t what we have; we have this angle instead:

…so using the exact same calculations won’t quite work for us here, because they calculate it relative to a different axis. But, we can still use the earlier diagram to help us work it out; we just need to rotate it and flip it a bit until we have what we need:

This looks like the angle we’re trying to work out, right?

This means that our X axis is the Y axis in the canonical diagram, and our Y axis is the X axis in the canonical diagram. Let’s just rename the two axes so X goes along the bottom and Y goes up and down again:

Now, for us, X = Sin θ and Y = Cos θ. Nice.

That’s all well and good, but there’s another step before it’ll actually work in Tableau. We’ve calculated our angle in degrees (because that’s what everybody learns at school first, and that’s still what’s the most intuitive thing for me). Thing is, Tableau uses radians with trigonometric functions. When we use radians, 360° is equivalent to 2π… which means that 1° is equivalent to π/180. So, we can still use our angle field, we just have to multiply it by π/180 (radians is another thing that you’ll just have to take my word on for now, I’m afraid; just remember that π = 3.14159… and so on, and π also = 180°).

Finally, we want our clock hands to be different lengths. To do this, you can take the equations and multiply them by a constant. Through trial and error, I found that I liked it best when the minute hand was 1.6x the length of the hour hand, so I multiplied the equations by 1.6 when it was for minutes and by 1 when it was for hours, just to keep it consistent.

The fields are:

X:
```
IF [Path ID] = 1 THEN
    IF [Time Unit] = "Minutes" THEN
        1.6 * SIN([Angle] * PI() / 180)
    ELSE
        1 * SIN([Angle] * PI() / 180)
    END
ELSE
    0
END
```

Y:
```
IF [Path ID] = 1 THEN
    IF [Time Unit] = "Minutes" THEN
        1.6 * COS([Angle] * PI() / 180)
    ELSE
        1 * COS([Angle] * PI() / 180)
    END
ELSE
    0
END
```

If you’re wondering why Path ID matters, it’s about connecting the lines to the dots. What we need is to have the lines start at (0,0) and end at (X,Y), but we still need to tell Tableau that the starting point is (0,0) where Path ID = 0.

That’s a lot of trigonometry, but we’re finally done! All you need to do now is to drag SUM(X) to columns and SUM(Y) to rows, and put Time Unit on detail. This will give you two circles. Drag SUM(Y) to rows again, and change it to line. Put Path ID on the Path shelf. Then dual axis the two SUM(Y) fields, and synchronise axes.

This probably doesn’t quite look right yet, because you have to make sure that you fix both the X and Y axes to be between the same range; I’ve fixed both of mine to go from -2 to +2, which has worked out nicely.

That’s it for making the clock! But there’s even more fun to be had in the final step, which is playing around with background images. I found a lot of beautiful handless clock faces online, but most of them have copyright restrictions, so I’m not going to use those. Instead, I went for an homage to Don’t Hug Me I’m Scared, a youtube series with probably my favourite animated clock character of all time. At some point, I might try it out with my own face and see how horrific that looks.

I hope this helps! It was really fun to build and write about. Please leave me a comment if you have any questions, and I’ll do my best to answer.

(title inspiration: Time is a Machine – Listener)


# Centre of Gravity, Metaphorically: Plotting time-based changes on maps

I haven’t written a blog in far too long. My bad. So, to get back into the swing of things, here’s something I’ve been playing with this week: centre of gravity plots.

It started with an accident. I had some EU member data, and I was simply trying to make a filled map based on the year each country joined, just to see if it was worth plotting. You know, something like this:

Except that I’d been having a clumsy day (the kind of day where I spilled coffee on my desk, twice), and accidentally missed the filled map option and clicked line instead:

Now, I normally don’t like connected scatterplots, but I realised that with a couple of tweaks, this accident could become quite a nice connected scatterplot on a map, joining up the central latitude and longitude of each country. So I thought I’d follow through with it and see what happened.

(by the way, the colour palette I use is the Viridis Palette, which I absolutely love. You can find the text to copy/paste into your Tableau preferences file here)

Firstly, I changed my “year joined” field from a discrete dimension into a continuous measure so that I could make it a continuous line with AVG(Year joined):

This connects all the countries by their central latitude and longitude as generated by Tableau, but it joins them up in order from left to right on the map. So, I then added AVG(Year joined) to the path shelf as well, which means that each country is joined in chronological order, or in alphabetical order when there’s a tie (as with Belgium, France, Germany, Italy, Luxembourg, and the Netherlands, who formed the EU in 1958):

I was pretty happy with this; it shows the EU’s expansion eastwards over time far, far better than the filled map did.

I got talking to Mark and Neil online, who introduced me to the idea of “centre of gravity” plots, which show the average latitude and longitude of something and how it changes with respect to something else (usually time). In this case, a centre of gravity plot of the EU would show the average central point of Belgium, France, Germany, Italy, Luxembourg, and the Netherlands in 1958, then the average central point of Belgium, France, Germany, Italy, Luxembourg, the Netherlands, Denmark, Ireland, and the UK in 1973… and so on. I figured it should be easy enough: I’d just take Country off detail, replace it with Year joined, and average the latitudes and longitudes together.

Sadly, it doesn’t work that way. The Latitude (generated) and Longitude (generated) fields that Tableau automatically generates when it detects a geographic field like country can’t be aggregated, and can’t be used if the geographic field they’re based on isn’t in the view. That meant I couldn’t average the latitudes and longitudes over multiple countries without creating lots of different groups.

But, there’s a simple way around this! You can create a text table of the latlongs, copy/paste them into Excel or whatever, then read that in as another data source. Firstly, drag your geographic field into the view, and put the latitude on text, like so:

Then copy and paste it all (I just click somewhere in the table, hit Ctrl+A, Ctrl+C, switch to Excel, Ctrl+V). Now do the same for the longitude. Save the document, and read it in as a separate data source in Tableau. Now you can blend the data on Country, or whatever your geographic field is, and you’ve got actual latlongs that you can use like proper measures.

And so I did. I recreated the line chart with the new fields, but took Country off detail, and made AVG(Latitude) and AVG(Longitude) into moving average table calculations which take the current value and an arbitrarily high number of previous values (I put in 100, just because). This looked pretty good:
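If you’d rather type the table calculation than build it from the quick table calc menu, it’s just a WINDOW_AVG over the current mark and everything before it (the field names come from my copy/pasted latlong source):

```
// moving average of the centroid latitude, across the current year
// and up to 100 previous years – more than this dataset will ever need
WINDOW_AVG(AVG([Latitude]), -100, 0)
```

The longitude version is identical, with [Longitude] swapped in.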

…but then I realised that it wasn’t accurate data. Look at the point for 1973, after the UK, Ireland, and Denmark joined. Doesn’t that seem a little far north?

To investigate it fully, I duplicated the sheet as a crosstab, because sometimes, tables are the best way to go. What I found is that I’ve got a bit of Simpson’s Paradox going on; the calculation is taking averages of averages. Each year’s joiners get collapsed into a single average point first, so the three northerly 1973 joiners – Denmark, Ireland, and the UK – carry as much weight as the six founding members combined:

Not so great. If we add Country to the view after the Year joined pill, you can see what it should be:

But the problem is, how do we put Country on detail but then get the moving average to ignore it? I tried various LODs, but couldn’t get it to work exactly – if you have a solution, I would love to hear it! My default approach is to try to restructure the data in Alteryx – because that generally solves everything – but I feel like I’m becoming too reliant on restructuring the data rather than working with what Tableau can do.

Anyway, I ended up restructuring the data by generating a row for each country and year that the country has been a member of the EU. That means I can create a data table like this:

…which removes the need for a moving average calculation entirely, because the data itself now accumulates year by year instead. Just take Country off detail / out of the view, and you get the right averages:

Much more accurate:

This is a better way of structuring the data for this particular instance, because the dataset is tiny; 28 countries, 60-ish years, 913 rows in my Excel file. It’s not going to be a good, sustainable solution for a centre of gravity plot over a much bigger dataset though. I did the same thing for the UN – 193 countries, 70-ish years – and ended up with 10,045 rows in my Excel file. It’s easy to see how this could explode with much more data.

It does look interesting, though; I’d never have guessed that the UN’s centre of gravity hadn’t really left the Sahara since its inception:

Finally, since I was on a roll, I plotted the centre of gravity for the English football champions since the first ever professional season in 1888-89. Conceptually, this was slightly different; unlike the EU and the UN, the champion isn’t a group of teams constantly joining over the years (although it is possible to plot that too). Rather, I wanted to create a rolling average of the centre of gravity over the last N years. If you set it to five years, it’s a bit messy, moving around the country quite a lot:

But if you set it to 20 years, the line tells a nice story. You can see how English football started out with the original northern teams being the most powerful, then it moves south after the Second World War, then it moves north-west during the Liverpool/Manchester era of domination, and finally it’s moving south again more recently:

Many thanks to Ian, who showed me how to parameterise this. Firstly, put your hard-coded (i.e. not Tableau generated!) latitude or longitude field in the view, and create a moving average over the last ten years. Or two, or thirteen, or ninety-eight, it doesn’t really matter. Next, drag the moving average latitude/longitude pill from the rows/columns into the measures pane in order to store it. This creates a calculated field. Meanwhile, create a parameter to let you select a number. This will change the period to calculate the moving average over. Open up the new calculated fields, and replace the number ten/two/thirteen/ninety-eight with your newly-created parameter, remembering to leave the minus sign in front of it:
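After that edit, the stored calculation ends up looking something like this ([Years to average] is my parameter name, not a Tableau default):

```
// rolling centre of gravity: average latitude across the current year
// and the previous N years, where N comes from the parameter
WINDOW_AVG(AVG([Latitude]), -[Years to average], 0)
```

The longitude field gets the same treatment.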

This will let you parameterise your moving average centre of gravity.

It was a lot of fun to play around with these maps this week. I’ve packaged them all up in a Tableau Public workbook here; I hope you find it as interesting as I did!

(title inspiration: Touché Amoré – Gravity, Metaphorically)
