How to force proportional axes in Tableau

Here’s a quick Tableau tip I’ve used a fair bit lately. It’s a really simple one but I haven’t seen it blogged about before – apologies if you’ve written about it and I’ve unintentionally ripped you off.

When Tableau creates a scatterplot, it automatically sizes the axes to the extent of the data. That’s often exactly how I want it to look, but sometimes, it’s not. For example, in this situation, I’m comparing a forecast I made with the actuals, and the different axis scaling makes my forecast look better than it is:

I want to get around that – instead of Tableau fitting the two axis scales to the data available, I want to force Tableau to fit the two axis scales to fit each other, giving me an equal scale on both axes. I can do this with two calculations and four reference lines, which is a little more effort than just fixing my axes, but it’s also a lot more robust than fixing my axes, which isn’t dynamic.

The calcs are as follows.

Max calculation:

MAX( WINDOW_MAX(SUM([Forecast])), WINDOW_MAX(SUM([Actual])) )

Min calculation:

MIN( WINDOW_MIN(SUM([Forecast])), WINDOW_MIN(SUM([Actual])) )

The max calc finds the window_max of the measures on each axis, and then finds the max of that. In my case, this is SUM([Forecast]) and SUM([Actual]), which you can switch out for whatever your measure is. Basically, it means “what’s the point with the highest value on either axis?”. The min calc does the same thing but with the min of the two window_mins.

That returns the highest and lowest value across both axes. You can then put those on each axis as four reference lines – take the max of the max calc and the min of the min calc on each axis – and then make the reference lines invisible.

And that’s it! A perfectly square scatterplot with the same scale on both axes. If you’re putting it in a dashboard, make sure to fix the sheet size so that it’s square there too.

Tableau

Strings and Roundabouts, pt.2: dynamic decimals

Quite a while ago, I wrote what I thought was a highly-specific blog for a niche use-case – dynamically rounding your Tableau numbers to millions, thousands, billions, or whatever made sense. That ended up being one of my most-viewed blogs.

So today, I’m writing a follow-up. How do you round the number of decimals to a number that actually makes sense?

Take this input data:

If you plot this in Tableau, it’s normally enough to set the default format to Number (standard). That gives us this:

But if you don’t like the scientific formatting for Thing 2 and 7 in Type b, you’ll have to set the number of decimal places to the right number. But that’ll give you this:

Ew.

You can get around this with strings. I don’t use this too often, but it comes in handy now and again. Here’s the formula that you can copy/paste and use in your own workbooks:

REPLACE( REPLACE( RTRIM( REPLACE(STR(ROUND(AVG([Value]),15)), "0", " ")), //get the avg value, round it to 15 dp (or more! or less!), //turn it into a string, and replace the zeros with spaces //then rtrim the trailing spaces " ", "0") //then replace remaining spaces with zeros again , (IF RIGHT( REPLACE( RTRIM( REPLACE(STR(ROUND(AVG([Value]),15)), "0", " ")), " ", "0") , 1) = "." THEN "." ELSE "" END) //if the last character of everything you did above is ".", //then find that ".", otherwise find nothing… , "") //…and remove it

Working from inside out, the calculation does this:

Take the AVG() of your field. You’ll want to change this to whichever aggregation makes most sense for your use case.
e.g. 6.105
Rounds that aggregation to 15 decimal places. This is almost definitely going to be enough, but hey, you might need to up it to 20 or so. I have never needed to do this.
e.g. 6.105000000000000
Turns that into a string.
e.g. “6.105000000000000”
Replaces the zeros in the string to spaces.
e.g. “6.1 5 ”
Uses RTRIM() to remove all trailing spaces on the right of the string.
e.g. “6.1 5”
Replaces any remaining spaces with zeros again.
e.g. “6.105”
If the last character of the string is a decimal point, then there are no decimals needed, so it removes that decimal point by replacing it with nothing; otherwise, it leaves it where it is.
e.g. “6.105”

And there you go – the number is formatted as a string to the exact number of decimals you’ve got in your Excel file.

Interestingly, there are some differences between the way REPLACE() and REGEX_REPLACE() work. It seems that REPLACE() will wait for the aggregation, rounding, and conversion to string before doing anything, whereas REGEX_REPLACE() will give you the same issues you get as if you just turn a number straight into a string without rounding first.

You can see all this in a workbook on Tableau Public here: https://public.tableau.com/profile/gwilym#!/vizhome/200501stringsandroundaboutspt2/Examples

data visualisation, Tableau

Calculating dynamic Z scores in Tableau

What are Z scores? How can you calculate them in Tableau? And once you’ve done that, what can you use them for? This blog will cover all of that, using some fake data from a factory that produces things. We’ll have a look at how the things differ from each other across various different manufacturing dimensions, and use that to see what to do with the thing we’re currently building. It’s all in a Tableau Public workbook here.

Firstly, what’s a Z score, and why would we want to use one?

A Z score is a way of looking at how much more, or less, something is from average in a relative way that accounts for the spread of data. For example, let’s start with height. I’m 6’3″ (or 190cm), and I live in England, where, according to wikipedia at the time of writing, the average male height is 5’9″ (or 175cm). That makes me taller than average.

However, averages don’t tell you anything about the spread of data, which means that taking the simple difference in height doesn’t tell you anything about how tall I am relative to everybody else. If every man in England (apart from me) was somewhere between 5’8″ and 5’10”, I’d be an absolute giant, relatively speaking. But as it is, I’m never the tallest guy in the room, so while I’m taller than average, I only feel averagely tall.

This relative difference from average can be expressed in a Z score, which is essentially saying, “how many standard deviations above or below average is this value?”. A Z score is calculated like this:

Value - Average Value / Standard Deviation of Values

So, my height as a Z score compared to men in England would be:

6'3" - 5'9" / Standard Deviation of Heights (which I don't know)

In the hypothetical example where every other man is between 5’8″ and 5’10”, the spread of heights is small, which means that the standard deviation of heights would be really low, which means that my Z score would be really high. But in the real world, the spread of heights is much greater, so the standard deviation of heights is bigger, which means that my Z score is lower.

It also means you can normalise comparisons over different metrics with different scales. Let’s say I’m an Olympic heptathlete. I’m doing seven different events, and the units they’re measured in are different – some are in metres, like the high jump and the shot put, and some are in seconds, like the hurdles and the sprints. The scale of those units is different too – I’ll be able to throw the shot put many times further than I can jump. That makes comparing my performance across my different events difficult! But Z scores let you compare. If my shot put Z score is +2.1 compared to other athletes while my hurdles score is -0.3 compared to other athletes, I know that I need to work on my hurdles more than my shot put.

OK, so Z scores are a way of normalising data to do comparisons. How do I do it in Tableau?

Sets are fantastic for this. Here’s a quick explanation of why before we move onto how to set it all up.

I like using sets to decide which things I’m focusing on (the “I want to know how normal this thing is” group) and which things are in my reference group (the “I want to take this lot as the basis for all my comparisons” group).

A lot of the time, you’ll want all things to be in both groups. For example, if I’m a professional athlete, I want to compare myself to my peer group, and I’ll want to see how my closest rivals compare to the same peer group too. So, I’d stick all the top athletes in my sport in the main group (so I can see their Z scores) and in the reference group (so that I’m comparing everybody to each other).

Actually, I’m very much not a professional athlete… but when I’m out cycling, I might still want to compare myself to the Tour de France pros to see just how out of my league they are. In that case, I’d want all the professional cyclists in the reference group, and I’d want to put myself in the main group, but what I don’t want to do is put myself in the reference group – my slow trundling up Anerley Hill would only bring the reference group’s average performance down and widen the reference group’s standard deviation, and I’d mistakenly make myself look closer to the pros than I actually am.

That’s why I like using sets and set actions in Tableau. Now for the actual Tableau work!

First of all, let’s talk data structure. I’ve got a long and thin data source; a field for the [Dimension Name], a field for the [Thing], and a field for the [Dimension Value]:

OK. The next step is to set up the sets. I want to create two sets based on my [Thing] field – one for the main analysis set, one for the reference set. You can do this by right-clicking on [Thing] and selecting Create Set.

Now that I’ve got two sets, I can start creating my Z score calculations. The formula for a Z score is:

Value of the thing you want a Z score for - Average value in the reference group / Standard Deviation of values in the reference group

You could do all this in one calculation, but I like breaking mine down into individual parts.

[Reference Set Avg]
{FIXED [Dimension Name]: AVG(IF [Reference Set] THEN [Dimension Value] END)}

[Reference Set StDev]
IF {FIXED [Dimension Name]: COUNTD(IF [Reference Set] THEN [Dimension Value] END)} =1 THEN 0 ELSE {FIXED [Dimension Name]: STDEV(IF [Reference Set] THEN [Dimension Value] END)} END

Now I can use those two calcs in my Z score calc:

[Z Score]
(AVG([Dimension Value]) - AVG([Reference Set Avg])) / AVG([Reference Set Stdev])

That’s all it takes to calculate Z scores! Here’s a scatterplot of my dimension A1. The actual dimension value and the Z score are perfectly correlated, but now we’ve got a normalised value on the y-axis:

And that normalised value is nice and useful, because now we can compare two dimensions with very different scales, like A1 and B:

I often plot Z scores on diverging bar charts. A chart like this will show me how a thing compares to other things across multiple dimensions, and a thing’s idiosyncrasies will stick out:

Similarly, if I want to see what the outliers are across a whole data set, I can create a concatenated [Thing-Dimension] field, plot the absolute Z score, colour by the actual Z score, and sort. This instantly shows me where the biggest outliers in my data are:

Eagle-eyed readers may have noticed that I haven’t calculated a separate field for the analysis set, and I’m just using AVG([Dimension Value]) in the numerator. That’ll calculate the Z score for any [Thing] in the view regardless of whether it’s in the analysis set or not, so those readers may be wondering why we need the analysis set at all. Never fear, we’ll use this set in some more advanced calculations that are coming up.

Making Z scores interactive

With a few extra steps, you can create two sheets to use as set member choosers (I think that drop-down set controllers are coming in 2020.2 or 2020.3, which is exciting! But for now, I’m in 2020.1, and this is the workaround we need to update set membership).

I set up my reference set chooser sheet like this:

…and then the dashboard action like this:

Repeat for the analysis set, and you can build a dashboard a bit like this (click the image to see the interactive version on Tableau Public):

I’m using this to select an individual dimension, and then looking at how 010X compares to 001X through 009X. I’m plotting the actual value on the x-axis, because that’s what I’ll have to adjust in the factory if I decide to make any changes, and I’ve included the Z score in the tooltip.

The nice thing about using sets and set actions is that we can update these Z scores by changing the reference set. Maybe we’ll find out that one of our things, say, 004X, was actually faulty and shouldn’t be included in our set of “normal” things that we’re using as a reference. Do we need to re-run our entire data pipeline? Nope, just deselect it from our reference group selector.

Next steps: comparing Z scores

That’s nice and everything, but let’s take it a bit further. I know that 002X, 003X, and 007X were particularly good things, and ideally, all the things I manufacture in future will be like those three. So, I’ve created a new set called [High performance set], and I want to compare my WIP thing 010X to the high performance set based on the same reference set I selected earlier.

That means I’ve got a lot of comparisons going on:

I also want to group my dimensions into themes. For example, A1 through A8 are technically separate dimensions, but they represent the same kind of thing taken at different points – maybe it’s the thickness of a circular plate at eight different points around the circumference of the plate, or maybe it’s the weight of eight different ball bearings in the same part of the thing, or something like that. So, since they’re all related, I want to see how 010X compares to the high performance set across the A dimensions as a group of dimensions. In my workbook, I’ve simply grouped them by regex-ing out any numbers from the dimension name.

I’ve created a dashboard like this (click for interactive version):

What am I doing here? In the bar chart at the top, I can see how the Z scores for 010X compares to the Z scores for the high performance set for each group of dimensions. I’m finding the Z score for each dimension within a dimension group, and comparing the average Z score for each dimension group for the analysis and high performance sets.

What I’m seeing here is that, on average, the C dimensions in 010X are higher than the high performance set. If I click the C bar, it’ll filter the “compare selection” chart:

This stacked bar chart shows me the Z scores for all C dimensions for the things in the analysis and high performance sets. This is telling me that the high performance things tended to have C dimensions lower than normal across the reference group, and that while 010X also has some C dimensions on the lower side of normal, it’s not as low as the high performance group. So, maybe my manufacturing specifications for the C dimensions are actually a bit high, and I should tune them lower if I want more high performance things.

Building the “compare selection” chart is relatively straightforward – put the [Z score] field on columns, and stack your rows with the Group and Thing dimensions, as well as the IN/OUT value of the analysis set so that it’s sorted nicely:

I’ve also created a calculation that returns a T/F value based on set membership and I’m using it to filter the view. It’s simply:

[Analysis or High Perf set]
[Analysis Set] OR [High performance set]

…and I’ve set the filter to TRUE.

The tricky bit is getting the values for the diverging bar chart. I like using the compare selection sheet as a way of checking the calculations. What we want to work out is the average Z score across all things and dimensions for the analysis set, and the average Z score across all things and dimensions for the high performance set. Then we want to take the analysis set average and subtract the high performance set average to see the difference.

In other words, we want this:

…minus this:

…which should give me 0.857944.

The first thing we need to do is to create a new field: [Thing-Dimension]. It’s just a concatenated field of [Thing] and [Dimension Name], like this:

[Thing-Dimension]
[Thing] + "-" + [Dimension Name]

To be able to plot the average Z scores and difference in a simple bar chart for each dimension group, we can’t have the thing or dimension in the view, which means we need an LOD which includes those fields:

[Z score (LOD include Thing-Dimension)]
(
{INCLUDE [Thing-Dimension]: AVG([Dimension Value])}
- {INCLUDE [Thing-Dimension]: AVG([Reference Set Avg])}
) / {INCLUDE [Thing-Dimension]:AVG([Reference Set Stdev])}

Now we can use that field to work out the difference between our sets:

[Z score difference]
AVG(IF [Analysis Set] THEN [Z score (LOD include Thing-Dimension)] END) - AVG (IF [High performance set] THEN [Z score (LOD include Thing-Dimension)] END

Finally, we can create our bar chart! And it’s nice and simple:

Let’s just check the calc works. Is it 0.857944, as I worked out manually earlier on? Yup, it’s showing up as 0.858 in my tooltip. Lovely:

Now that I’ve compared Z scores across groups of dimensions to get an idea of the general way that my things compare to each other, I can dive back into the actual data to look at what those differences are and potentially fix my manufacturing variance.

Here’s my final dashboard (again, click for the interactive version). I’ve plotted the Z scores for all dimensions for 010X, and I can click any of those Z scores to update the scatterplot and marginal histogram of actual values below. I know that the C dimensions are a bit different for 010X in comparison to the high performance set, so let’s have a look at those:

I can look at that scatterplot and instantly see which of the C dimensions are driving that difference between 010X and the high performance set:

It’s dimensions C2 and C4.

Let’s start with C2. 010X has a high Z score of 2.25, and we can see in the scatterplot that this is a higher value than normal. As it is, that should be raising flags in the factory – that’s a high C2 value, both absolutely and relatively, so we should probably turn it down a bit to be more in line with the others at around 30. As an aside, it’s interesting to see that the high performance set all have low C2 values, so maybe we should turn it down lower than 30 to be closer to the high performance set:

Now, let’s have a look at C4. No issues there, right? 010X has a C4 value which is slightly higher than the average for the reference group, but the Z score is only 0.198, which indicates that it’s pretty much bang on normal. However, we can see that even though it’s normal for the reference group, it’s quite a lot higher than the high performance group. So, again, maybe we’re manufacturing C4 to a specification that says “aim for a C4 value between 30 and 34”, whereas we should consider amending those limits to between 26 and 30 based on how the lowest C4 values have all been the high performance things:

This is just a few of many different ways you can use Z scores and Tableau to look at manufacturing data. There are all kinds of interesting use cases out there – hopefully this explainer helps you build some of your own.

Tableau

Standard errors and confidence intervals in Tableau

tl;dr version

Here’s how to make bar graphs with standard errors and confidence intervals in Tableau. It involves making some calculations yourself, which may or may not differ from Tableau’s built in versions. You can download the workbook showing you how to make the graphs here:
https://public.tableau.com/profile/gwilym#!/vizhome/Standarderrorsandconfidenceintervals/Standarderrorbarsoptions

Why show variance in your graphs?

Sometimes when you plot values on a graph, you want to show not only the aggregated value, but also the variance or uncertainty around it. Now, before I get into this blog properly, I want to say that I don’t actually recommend plotting bar graphs with error bars or confidence intervals, as it can be misleading. The Bar Bar Plots campaign has far more information on it, but ultimately it’s more honest, and really straightforward, to show the actual data points in Tableau, so why wouldn’t you just do that?

Friends don’t let friends make barplots – solid advice from Page Piccinini.

But in the event that you do need to show simple bars and an indication of uncertainty, you’ve got two main options:

Standard errors
Confidence intervals

Introduction to the data

I’m going to use some data I collected during an experiment I ran in 2015. In this experiment, Dutch people learned some Japanese ideophones (vividly descriptive words). But there was a catch – half the words they learned were with the real meanings (e.g. fuwafuwa, which means “fluffy”, and they learned that it meant “pluizig”), and half the words they learned were with the opposite meanings (e.g. debudebu, which means “fat”, but they learned that it meant “dun”, or “thin”). Then they did a quick test to see if they remembered the word associations correctly. You can read more about that here, if you like.

All the following graphs in this blog have been created in this workbook on Tableau Public. Please feel free to download and explore how it’s all made!

Here’s a simple bar graph of the results. For the words they learned with their real meanings, people answered correctly in the test round 86.7% of the time. But when tested on the words they learned with their opposite meanings, people answered correctly only 71.3% of the time.

But this hides the variation in the data. Sure, the average in each condition (and the difference between them) is what I care about, but with simple bar graphs, it’s easy to forget that lots of individual people are below and above the average in each condition. You can see that variation here:

Also, these are averages taken from a sample. I can’t go to a conference and say, “hey everybody, I’ve done the research and Dutch undergrads get 86.7% correct in the real condition and only 71.3% in the opposite condition”… well, I could, but it would be misleading. I can’t guarantee that these results are definitely in line with what the entire population of Dutch undergrads would get if I somehow managed to test all of them, so I need to make some kind of statement about the uncertainty of that result. I can do this with standard errors or confidence intervals.

Standard errors

Let’s start with standard errors. The standard error of the mean is essentially a way of saying how uncertain you are about the mean based on the size of your sample by estimating the standard deviation of the whole population. The wikipedia article on standard errors is pretty good.

The first step is to create a field for the standard error. This is the standard deviation of the scores per condition, divided by the square root of the number of participants:

STDEV([Correct])
/
SQRT(COUNTD([Participant]))

You’ll notice I’ve also got fields for the sample standard deviation and the not sample standard deviation. This is from when I was playing around with different calculations for the standard deviation of the sample vs. the standard deviation of the population. I’m not going to go into it in this blog, but here’s a really nice explainer here, and you can download the workbook to investigate further. In summary, it looks like Tableau’s native STDEV() function uses the formula for the corrected sample standard deviation by default, rather than the population standard deviation. This is pretty nice, it feels like a safer assumption to make. Cheers, Tableau.

Now that we’ve got the standard error, we can create new fields for our upper and lower standard error limits like this:

AVG([Correct]) – [SE]
and
AVG([Correct]) + [SE]

So, now we can create some nice standard error bars. This uses a combination of measure names/values and dual axes, so it’s a little bit complicated. Firstly, create your simple bars for the correct % per condition:

Now, drag the lower standard error field onto rows to create a separate graph. Drag the upper standard error field onto the same axis of that new graph to set up a measure names/measure values situation:

Now, switch the measure values mark type to line, and drag measure values from columns and drop it on the path card:

All you have to do now is create a dual axis graph, synchronise the axes, and remove condition from colour on the standard error lines:

Great! We’ve now got bar graphs with standard error bars. I mean, I still don’t recommend doing this, but it’s a common request.

Confidence intervals

Now, let’s have a look at confidence intervals. They are a range around your sample mean which tell you that, if you repeated the same study over and over, X% (usually 95%) of confidence intervals from future studies will contain the true population mean. They’re hard to explain (there’s a good blog here), but easy to see.

In Tableau, confidence intervals are really straightforward. You can plot your data points, go to the analytics pane, and bring in an “average with 95% CI” reference line, which creates a reference band around the average:

Nice. This is exactly how I’d like to visualise my experimental data! You can see the average per condition, the confidence intervals, and the underlying participant data.

Quick disclaimer: because I’m looking at percentages here, this is a proportion rather than a hard and fast value, so I shouldn’t actually be using confidence intervals at all… but if we pretend that the 86.7% value is actually an average 0.867 value of something like my participants taking 0.867 seconds taken to respond, or young children being 0.867 metres tall at a certain age, or 0.867 kg lost for each week under a new diet plan, then it’s okay. I’m just going to keep going with my percentages, but please bear this in mind.

However, if your journal insists on old school bar graphs, Tableau’s built in average with 95% CI reference band won’t work. Well, technically it will, it’s just that it’ll show you this:

Because we’ve had to take Participant off detail in order to show an aggregation across participants, the reference band doesn’t know how to compute it, and it assumes that there’s just one data point.

One way around this would to built a dual axis graph. Keep the bars with just condition on colour, and create another axis. Add participant to detail, and set the mark type to circle. Make the circles as small as possible and completely transparent, hit dual axis, synchronise axes, and voila. Now you can have an average with 95% CI reference band again.

The downside is that this is pretty ugly. The reference line/band is way outside the edges of the bars, and it just doesn’t have that standard look that you’re used to. What we actually want is something like our standard error lines from earlier, but with confidence intervals.

The good news is that we can do it! But we’ll have to move away from Tableau’s built in confidence intervals, and create our own calculation, just like we did with standard errors.

The first step is to use the standard error field we made earlier to calculate the confidence intervals. When you look up how to calculate confidence intervals, you’ll probably find something saying that 95% confidence intervals are calculated by taking the mean, and adding/subtracting 1.96 multiplied by the standard error. This 1.96 figure is from the Z distribution, which tells you that 95% of normally distributed data is within 1.96 standard deviations of the mean. And because this is a sample of a population, we multiply that 1.96 by the standard error to get our confidence intervals. Here’s another great blog which breaks it all down.

So, we can create separate fields for our upper and lower confidence interval limits like this:

AVG([Correct]) – (1.96 * [SE])
and
AVG([Correct]) + (1.96 * [SE])

Once we’ve done that, we can build our graphs. This is the same technique as the standard error bars earlier. Create the measure names/values and dual axis graph with measure names on the line path, and you’ll get the same kind of graph, but now showing confidence intervals instead of standard errors:

Excellent! We’ve now got our 95% confidence intervals… or do we?

Confidence intervals, pt.2 – what’s going on?

Some of the more statistically minded of you may have been yelling at the screen when I used the 1.96 value from the Z distribution to calculate my confidence intervals. You see, confidence intervals shouldn’t always simply use the Z distribution, even though that’s the standard formula you’ll find when looking up the definition of confidence intervals. Rather, when you’ve got a small sample, which is generally defined as under 30, you should use the T distribution because the size of the sample may skew the normality of the sample. Again, there’s a lot of good information here.

I started investigating this when I noticed that Tableau’s average with 95% confidence interval calculations were different from my manually calculated ones. Have a look at this comparison – you’ll notice that the confidence interval values are slightly different:

I started playing around with the Z/T value in the confidence interval calculation by making it parameter-driven, and I found that Tableau’s confidence interval calculation seemed to use a number like 2.048 rather than 1.96:

This is because Tableau’s confidence interval calculation is using the T distribution rather than the Z distribution. You can find the appropriate T values to use based on your degrees of freedom (which is your sample size minus one) in Appendix B.2 of this very useful pdf (there’s also a table set to 4dp instead of 3dp here). In my case, I’ve got 29 participants, so the degrees of freedom is 28, and the lookup table shows that the relevant T value for a 95% confidence interval is 2.048, so I can put that in my confidence interval calculations. It also looks like Tableau’s confidence intervals are calculated on a more precise number than 2.048, which suggests that the back end is calculating it directly from the T distribution rather than using the fairly common approach of looking it up in a table where everything is rounded to three decimal places. That’s pretty nice too.

My next step was to check whether Tableau switches between the T and Z distributions based on sample size. So, I duplicated my data and fudged the [correct] field by a random number to create a sample of 58 participants. With 58 participants, it’s fine to use the Z distribution to calculate 95% confidence intervals. But even then, it looks like Tableau is using the T distribution – when I set my parameter to 2.0025 using the slightly-more-precise values in the T table here, you can see that the confidence intervals using T values, not Z values, match Tableau’s calculations:

This is pretty good as well, I think. As your sample size increases, the T distribution starts to match the Z distribution more and more closely anyway. Notice how, with 29 participants, the T value was 2.0484, and with 58 participants, it was 2.0025. This is getting closer and closer to 1.96. At 200 participants, the T value would be 1.9719. Overstating the confidence intervals by using the T distribution is safer default behaviour than accidentally understating them by using the Z distribution.

So, to conclude, I’ve found out the following about confidence intervals in Tableau:

They’re based on standard errors which use the corrected sample standard deviation (and Tableau’s STDEV() function returns the corrected sample standard deviation as well).
They’re based on the T distribution regardless of your sample size.

Again, I’ve published the workbook containing my demo graphs and my standard deviation and T vs. Z explorations here: https://public.tableau.com/profile/gwilym#!/vizhome/Standarderrorsandconfidenceintervals/Standarderrorbarsoptions

One final word of thanks to my colleague David for helping me out with some of the troubleshooting!

data visualisation, Tableau

Donut charts and dynamic reference banding

Donut charts aren’t everybody’s cup of tea, but I quite like them for showing a percentage against a total which has to be 100%. Things like the percentage of tickets answered within an hour, or an industrial test pass rate as a percentage, or an on time percentage.

The problem is that percentages often come with targets. If you’re measuring a rate, you’re probably measuring it to check that you’re on target. For example, if you’ve got 19.8% of tickets being answered within an hour, you’ve probably also got a target of 15% or 20% or something, and you’d probably want to show that on your donut chart for context, like this:

In Tableau, you can’t do that, not without creating some pretty filthy trigonometric calculations. But I’ve recently found a workaround which I rather like, which I’ll explain in this blog. You can download the supporting workbook from Tableau Public here.

I’ve used Superstore, which isn’t too ideal for percentages and targets, but hey, it’s something everybody uses. Let’s say you’re the head of sales for California. You know you’re a big market, and you want to keep it that way – you want 15% of all of Superstore’s sales to be in California.

You can create donut charts showing this percentage easily by creating two fields. One called [California Sales], which is:
IF [State] = “California” THEN [Sales] END

The other would be [Rest of US Sales], which is:
SUM([Sales]) – SUM([California Sales])

And you’d put it on a donut chart with those two fields as the two measure values, then put measure names on colour, and split it out by category to get something like this:

Sadly, we can’t put a reference line at the 15% mark to show the target. Not easily, at least. But what we can do is to play around with the colours. If the percentage is above the target, we could show the percentage up to the target in yellow, and then the overperformance in green, like so:

And if we adjust the target higher, we could show the percentage up to the actual percentage in yellow, and then the underperformance in red, like so:

This is a little complicated. It requires a few extra calculations; [California Sales Percentage], [Target Distance], [California Sales Base], [Rest of Sales], [California Sales Over], and [California Sales Under]. Let’s go through the logic one by one.

[California Sales Percentage]
In this calculation, you take the existing [California Sales] field that you’ve made, and found out what that is as a percentage of all sales. It’s simply:
SUM([California Sales]) / SUM([Sales])

[Target Distance]
This is how far from the target the California Sales Percentage is. I’ve used [Target] as a parameter to make it adjustable, but you could also hardcode it. It’s simply the California Sales Percentage minus the target; so, if you’ve got an actual % of 21%, and your target is 15%, then the Target Distance will be 6%. It’s simply:
[California Sales Percentage] – [Target]

[California Sales Base]
This calculation will be what’s in yellow in the donut. If your California Sales Percentage is above the target, then you’ll want it to be yellow up to the target, and then green above that, so this base field will simply be the target. If your California Sales Percentage is below the target, then you’ll want it to be yellow up to the actual sales percentage, and then red for the space between the percentage and the target. So, you can calculate it like this:
IF [Target Distance] > 0 THEN ([Target] * SUM([Sales]))
ELSE SUM([California Sales]) END

[Rest of Sales]
This is the bit in grey. If your California Sales Percentage is above the target, then you’ll want it to be grey from the actual sales up to 100%. If your California Sales Percentage is below the target, then you’ll want it to be grey from the target value up to 100%. That can be calculated like this:
IF [Target Distance] < 0 THEN
SUM([Sales]) – ([Target] * SUM([Sales]))
ELSEIF [Target Distance] > 0 THEN
SUM([Sales]) – SUM([California Sales])
END

[California Sales Over]
This is the bit in green. If your California Sales Percentage is above the target, then you’ll want it to be green between the target and the actual sales percentage. If it’s below target, you don’t want it to show up at all, so set it to zero like this:
IF [Target Distance] > 0 THEN
SUM([California Sales]) – ([Target] * SUM([Sales]))
ELSE 0 END

[California Sales Under]
Finally, this is the bit in red. If your California Sales Percentage is below the target, then you’ll want it to be red between the actual sales percentage and the target. If it’s above target, you don’t want it to show up at all, so set it to zero like this:
IF [Target Distance] < 0 THEN
([Target] * SUM([Sales]))-SUM([California Sales])
ELSE 0 END

Okay! Now we’re ready to build our donuts. This is the easy bit.

Build out your donuts like normal, like this:

Now, instead of the current two measure values, we’ll want all four of the colour ones:

For this one, I’ve set the target to 20% so that there are examples of categories that are above and below target, all in one view.

And that’s it! It’s not quite a reference line, but it’s a nice way of showing a percentage performance against a target within a donut chart. As a reminder, you can download my workbook here:
https://public.tableau.com/profile/gwilym#!/vizhome/Dynamicreferencebandingondonutcharts/Donutchart-dynamicreferencebanding

data visualisation, Tableau

Why won’t my Tableau small multiples chart work?

Andy Kriebel wrote a great tutorial on how to make small multiples charts in Tableau here. It works pretty much all the time… but you’ll also find that if you simply copy and paste the calculations, it might not work with your data.

For example, have a look at Superstore here. I’m plotting sum of Sales for each continuous month per product subcategory. The rows and columns calculations split up the view nicely into a line for each subcategory, which is good:

But look closely, and you’ll see some weird stuff going on; there’s a brown-ish dot for subcategory = Copiers and month = October 2014 in the Chairs section (second row, second from left):

What’s going on there?

It turns out that the rows and columns calculations can’t handle nulls in the underlying dataset. I haven’t dived into this fully, but I’m guessing this is because the index calculation works depending on what’s in the view, rather than being fixed on all the subcategories and months regardless of whether there’s data or not.

In this case, what happens is that the October 2014 missing data for one one category – Accessories – shunts everything else up one; the Appliances value turns up in the Accessories small multiple, the Art value turns up in the Appliances small multiple, and so on. The same thing would happen in March 2014 if there was another subcategory after Tables too.

You’ll see that if you switch to calculating using quarters instead of months, this problem disappears completely.

Andy’s calculations are great because they’re really flexible, and they’ll work fine without much further adjustment most of the time. But if you get issues with null data like this, you can try this alternative instead.

With these calculations, I’m going to hardcode the small multiples by whatever thing you’re splitting up the view by. That means that you’d have to create separate fields for every dimension you’d ever want to do it by, which is extra work, but it does take care of the nulls issue.

First, create a calculation called Number (or Subcategory ID, or Steve, or whatever suits you). This is a case statement which assigns a number from 0 to N-1 for a particular dimension.

CASE [Sub-Category] WHEN 'Accessories' THEN 0 WHEN 'Appliances' THEN 1 WHEN 'Art' THEN 2 WHEN 'Binders' THEN 3 WHEN 'Bookcases' THEN 4 WHEN 'Chairs' THEN 5 WHEN 'Copiers' THEN 6 WHEN 'Envelopes' THEN 7 WHEN 'Fasteners' THEN 8 WHEN 'Furnishings' THEN 9 WHEN 'Labels' THEN 10 WHEN 'Machines' THEN 11 WHEN 'Paper' THEN 12 WHEN 'Phones' THEN 13 WHEN 'Storage' THEN 14 WHEN 'Supplies' THEN 15 WHEN 'Tables' THEN 16 END

Typing all that out is quite a faff, so I generate that text with a concat function in Excel like this:

Now create a calc called Columns with the modulo function like this:

[Number] % 4

And then create a calc called Rows by dividing and rounding like this:

INT ([Number] / 4)

It’s crucial that you use the same constant each time! I’ve used 4 because that’ll give me 4 columns across the top, meaning that the 17 subcategories in superstore will be split over four rows of four columns and a fifth row with one column, exactly like Andy’s small multiples do. If you want to do it another way, you could use 3 instead. That would give you five rows of three columns and a sixth row of two columns. There’s a lot of playing around with the configuration, but it’s also more flexible in terms of the configuration you want to plot.

Now that you’ve got these row and column calcs, you can drag them into the view like this, and generate small multiples which work even with null data:

Just to make sure, let’s colour code it by subcategory too. No differently coloured dots in the wrong places anymore!

Another advantage of this approach is that you can colour the graphs by another field. You can do that with Andy’s calcs too, but you have to be careful about how the table calcs work and what they’re using to compute the calculations. Because my calcs don’t have index() in them, there’s no table calc issues to worry about. Just drag and drop.

Tableau

How to give your area and bar charts a makeover with connected scatterplots.

I haven’t always been a fan of connected scatterplots, but I’m gradually coming round to them. First it was with centre of gravity maps; now it’s as a replacement for (some) area charts and bars.

I came across a chart a lot like this at work this week:

I’ve mocked it up using some fake data, but it’s pretty much showing how two departments (or groups, or types of things, or categories) do in terms of profit (or another measure) over the year, and how many orders (or another measure again) there were in each month across both groups.

I’m not a fan of the area chart plus bars over the top approach. Firstly, the bars obscure what’s going on with the area chart underneath them, and secondly, area charts can be misleading as it’s hard to parse each thing separately.

Let’s show the same information with a connected scatterplot:

Here, the red and blue lines show the two departments, and the grey line shows the aggregated number of orders and profit. Instantly, the difference between the two departments is a lot clearer; department A has had a lot of variation in number of orders but profit has stayed pretty consistent, while department B has had a lot of variation in profit but the number of orders has stayed consistent.

The lines are joined up by month… but as it is, it’s impossible to tell where the year begins and ends, which makes the whole thing pretty pointless. Let’s show time with size:

The lines get thicker as the month gets more recent, and now it’s easy to see the trends over the year. The variation in orders in department A is all over the place, but the variation in profit in department B is a bit more consistent; profit has gone down over the year. We can also see the aggregate profit and order trends much more clearly on the grey line, with orders going up but profit going down.

The downside of using line size to show time trends is that the thinner parts are hard to see and the thicker parts can be hard to parse. Let’s try it with colour instead:

I personally prefer this approach to using line size, but it’s also a bit of a faff. Tableau doesn’t like it if you try to do a three-way colour split by dimension (i.e. red, blue, and grey) and then change the shade of the colour by a measure, so you have to convert the month to discrete and make sure to order everything correctly. It’s not too taxing though, so it’s worth it if the data doesn’t change all the time.

Finally, you can go the whole hog and do some double encoding with both colour and size on the line to show time:

This is eye-catching, but possibly to the point where it’s more distracting than informative.

I like the connected scatterplots in this example, although there are many situations where the lines will overlap in a way that won’t tell you much. It certainly won’t work well with lots of different departments; here’s the mess you get if you look at all countries in EU Superstore for all months of all years:

Even when filtering to four countries and two years of data, it’s not the clearest way of showing things:

In summary, then, give a connected scatterplot a go. It may well not work, but sometimes it’ll result in something a lot clearer and more informative than a combined area/bar chart.

Tableau

Strings and roundabouts: how to label your numbers in Tableau effectively

Lots of Tableau dashboards feature big summary numbers. They’re pretty nice, and they look like this:

And what’s even nicer is that you can alter the measure’s default number format to automatically round a specific unit, like this:

But sometimes, the range of numbers is a lot wider than sales per state in Superstore. Let’s have a look at population per country in the World Indicators dataset:

Those are some long numbers, so let’s round them to the nearest unit again. But which one? If we round to the thousands, we get this:

And if we round to the millions, we get this:

It’d be great if we could get Tableau to figure out what the nearest sensible unit is. That functionality doesn’t exist yet (as far as I know!), but we can write a specific optimised rounding calculation for labelling purposes. It’s a bit of a long one:

This calculation returns the number you want as a string. It does this by:

Aggregating the number you’re actually working with already and finding out whether it’s above a billion (in which case you’d want to summarise to whatever number of billions it is), or above a million (in which case you’d want to summarise to whatever number of millions it is), and so on.
Converting it to an absolute number so that it works for negative numbers too.
Taking that aggregated number and dividing by the sensible unit. For example, if your number is 34000000, you’d want to express it as 34 million, so we’re dividing it by a million to return 34.
Rounding that divided figure to one decimal place. This is just my preference, you can do what you like! Set the number to 0 for no decimal places, or 2 for two decimal places, etc.
Convert that number to a string.
Add a text unit abbreviation to the end of it.

Of course, you can also add trillions, quadrillions, and so on, if that’s what your data requires.

This sorts us out nicely:

Now, I’ve deliberately aggregated everything within the calculation, and I’m only using it for labelling purposes. I categorically do not recommend aggregating outside this calculation or using this calculation for calculating anything else. This will result in a shitstorm of rounding errors which can seriously damage your data. But as a final step once you’ve sorted everything out, I find that this is really nice for presenting data.

Here’s the calculation in text so you can copy and paste it into your workbooks:

IF ABS(AVG([Population Total])) >= 1000000000 THEN //round for billions STR(ROUND(AVG([Population Total] / 1000000000), 1)) + "b" ELSEIF ABS(AVG([Population Total])) >= 1000000 THEN //round for millions STR(ROUND(AVG([Population Total] / 1000000), 1)) + "m" ELSEIF ABS(AVG([Population Total])) >= 1000 THEN //round for thousands STR(ROUND(AVG([Population Total] / 1000), 1)) + "k" ELSE STR(ROUND(AVG([Population Total]),0)) END

data visualisation, Tableau

Time is a Machine: how to create a clock in Tableau

I made a clock in Tableau this week, and you can find it on Tableau Public here.

It always shows the current time for the UK, but it shouldn’t be hard to parameterise to update to whatever time zone you’re in.

Essentially, all it is is two points on a scatterplot, connected by lines to the coördinates (0,0), and superimposed on a background image. I made the background image in Powerpoint, based on the clock in the Time episode of Don’t Hug Me I’m Scared.

I’ve written before about using radial calculations to plot distance from the centre and change the lengths while keeping the angles constant. This time, we’re going to change up the trigonometry a bit, and calculate the angle while keeping how far the line goes constant.

Firstly, though, we need some data to work with. All you need to get a DateTime is a single cell in a single column… but for plotting purposes, we’re going to need the following dataset:

That’s all we’ll need! Read that into Tableau, and the rest can be done with calculated fields.

Firstly, we need to find out what the time is. Tableau has the NOW() function, which is really useful. It returns the exact time, down to the second, of the time on your computer (assuming that you’re working in Tableau Desktop with an Excel sheet you’ve created just for this). But when it’s published on Tableau Server, it returns the time of the Tableau Server Data Engine, which seems to be eight hours behind UK time (as of 19th September 2017, when I’m writing this; I’ve no idea how daylight saving changes will affect it).

So, let’s create our Right Now field, and add eight hours to it with the DATEADD() function so that it’ll give us the UK time when published:

Right Now:
DATEADD('hour', 8, NOW())

The next step is to take Right Now and parse out the time parts that we want to plot. Let’s just go with hours and minutes; plotting seconds is possible, but it’ll look like it’s not working if the dashboard isn’t updating every half second or so. So, let’s create an Hours field and a Minutes field as follows:

Hours:
DATEPART('hour', [Right Now])

Minutes:
DATEPART('minute', [Right Now])

This will give the current hour and the current minute as a number. There’s an extra step we need to take, though… the hour hand on a clock doesn’t point at the exact hour number for the whole of the hour, it moves around depending on the minutes that have passed. If it’s half past ten, the hour hand doesn’t point at ten exactly, it points about halfway between the ten and the eleven.

So, let’s create another field called Exact Hour for the exact point between hour marks to plot:

Exact Hour:
[Hours] + ([Minutes] / 60)

This works by giving us the hour (e.g. 6 for 6pm), and then adding the amount of the hour that we’ve got through. For example, if it’s 6.15pm, the number of minutes is 15, and we’re quarter of the way through the hour. 15/60 = 0.25, so the point where the hour hand will point to is 6.25, i.e. quarter of the way from 6 to 7.

After that, we need to create a single field to plot. This is why the underlying data has the Time Unit field, with separate rows for each hand.

Time for plotting:
IF [Time Unit] = "Hours" THEN [Exact Hour] ELSE [Minutes] END

Now that we have our field to plot, we’re ready to do some trigonometry!

We know that we want the clock hands to begin at (0,0) on the scatterplot; what we need to work out is where the clock hands need to end. To be able to plot the X and Y coördinates of where the hands end, we first need to know the angle of the line from (0,0). In simple terms, the scatterplot works like this:

Finding the angle is fairly simple. There are 360° in a circle, and rather conveniently, a clock face is just a big old circle, starting with 0° from the centre at the 12 o’clock position. There are 12 hour points that go round the clock face, so if we want to find out the hour hand’s angle, we divide the hour value by 12 to find out how far around 360° it is, then multiply that fraction by 360. For minutes, the same thing holds, but there are 60 points instead of 12.

Angle:
IF [Time Unit] = "Hours" THEN ([Time for plotting] / 12 ) * 360 ELSE ([Time for plotting] / 60 ) * 360 END

“But wait!”, I hear you shout at the screen. Dividing the hour by 12 might work for the morning, but what about when it’s the afternoon, when Tableau’s DATEPART() function will return the number 18 for 6pm, as it works on a 24 hour format?

You’re completely right, I haven’t accounted for that. But I don’t really need to. If it’s 6pm, the hour is 18. 18/12 is 1.5, and multiplying that fraction by 360 gives us 540°. Sure, 540 is bigger than the 360° that are in a circle… but the wonderful thing about circles is that they’re, well, circular. Plotting 540° on this clock face will look identical to plotting 180°. If it bothers you that they’re not technically the same, feel free to add an IF clause to identify the afternoon and then subtract 12 hours from the Exact Hour field.

Now that we’ve got the right angles, we can calculate where the coördinates go. This is a bit more tricky.

The first thing to bear in mind is that I’ve changed the trigonometric functions to reflect how Tableau will actually plot the angles, rather than using the standard ones in maths textbooks.

Maths textbooks will tell you that to find the coördinates (X,Y) on a circle, given the angle θ and a radius of 1 from the centre point (0,0), the equations are Y = Sin θ and X = Cos θ. I’m not going to go into why or how here, but please just trust me on this one and take it at face value. Y = Sin θ and X = Cos θ.

Those maths textbooks will also give you a diagram like this:

But this isn’t what we have; we have this angle instead:

…so using the exact same calculations won’t quite work for us here, because they calculate it relative to a different axis. But, we can still use the earlier diagram to help us work it out; we just need to rotate it and flip it a bit until we have what we need:

This looks like the angle we’re trying to work out, right?

This means that our X axis is the Y axis in the canonical diagram, and our Y axis is the X axis in the canonical diagram. Let’s just rename the two axes so X goes along the bottom and Y goes up and down again:

Now, for us, X = Sin θ and Y = Cos θ. Nice.

That’s all well and good, but there’s another step before it’ll actually work in Tableau. We’ve calculated our angle in degrees (because that’s what everybody learns at school first, and that’s still what’s the most intuitive thing for me). Thing is, Tableau uses radians with trigonometric functions. When we use radians, 360° is equivalent to 2π… which means that 1° is equivalent to π/180. So, we can still use our angle field, we just have to multiply it by π/180 (radians is another thing that you’ll just have to take my word on for now, I’m afraid; just remember that π = 3.14159… and so on, and π also = 180°).

Finally, we want our clock hands to be different lengths. To do this, you can take the equations and multiply them by a constant. Through trial and error, I found that I liked it best when the minute hand was 1.6x the length of the hour hand, so I multiplied the equations by 1.6 when it was for minutes and by 1 when it was for hours, just to keep it consistent.

The fields are:

X:
IF [Path ID] = 1 THEN IF [Time Unit] = "Minutes" THEN 1.6 * SIN([Angle]* PI() / 180) ELSE 1 * SIN([Angle]* PI() / 180) END ELSE 0 END

Y:
IF [Path ID] = 1 THEN IF [Time Unit] = "Minutes" THEN 1.6 * COS([Angle]* PI() / 180) ELSE 1 * COS([Angle]* PI() / 180) END ELSE 0 END

If you’re wondering why Path ID matters, it’s about connecting the lines to the dots. What we need is to have the lines start at (0,0) and end at (X,Y), but we still need to tell Tableau that the starting point is (0,0) where Path ID = 0.

That’s a lot of trigonometry, but we’re finally done! All you need to do now is to drag SUM(X) to columns and SUM(Y) to rows, and put Time Unit on detail. This will give you two circles. Drag SUM(Y) to rows again, and change it to line. Put Path ID on the Path shelf. Then dual axis the two SUM(Y) fields, and synchronise axes.

This probably doesn’t quite look right yet, because you have to make sure that you fix both the X and Y axes to be between the same range; I’ve fixed both of mine to go from -2 to +2, which has worked out nicely.

That’s it for making the clock! But there’s even more fun to be had in the final step, which is playing around with background images. I found a lot of beautiful handless clock faces online, but most of them have copyright restrictions, so I’m not going to use those. Instead, I went for an homage to Don’t Hug Me I’m Scared, a youtube series with probably my favourite animated clock character of all time. At some point, I might try it out with my own face and see how horrific that looks.

I hope this helps! It was really fun to build and write about. Please leave me a comment if you have any questions, and I’ll do my best to answer.

(title inspiration: Time is a Machine – Listener)

Alteryx, football, Maps, Tableau

Centre of Gravity, Metaphorically: Plotting time-based changes on maps

I haven’t written a blog in far too long. My bad. So, to get back into the swing of things, here’s something I’ve been playing with this week: centre of gravity plots.

It started with an accident. I had some EU member data, and I was simply trying to make a filled map based on the year each country joined, just to see if it was worth plotting. You know, something like this:

Except that I’d been having a clumsy day (the kind of day where I spilled coffee on my desk, twice), and accidentally missed the filled map option and clicked line instead:

Now, I normally don’t like connected scatterplots, but realised that I could change a couple of things to this accident to make quite a nice connected scatterplot on a map, joining up the central latitude and longitude of each country, so I thought I’d follow through with it and see what happened.

(by the way, the colour palette I use is the Viridis Palette, which I absolutely love. You can find the text to copy/paste into your Tableau preferences file here)

Firstly, I changed my “year joined” field from a discrete dimension into a continuous measure so that I could make it a continuous line with AVG(Year joined):

This connects all the countries by their central latitude and longitude as generated by Tableau, but it joins them up in order from left to right on the map. So, I then added AVG(Year joined) to the path shelf as well, which means that each country is joined in chronological order, or in alphabetical order when there’s a tie (as with Belgium, France, Germany, Italy, Luxembourg, and the Netherlands, who formed the EU in 1958):

I was pretty happy with this; it shows the EU’s expansion eastwards over time far, far better than the filled map did.

I got talking to Mark and Neil online, who introduced me to the idea of “centre of gravity” plots, which show the average latitude and longitude of something and how it changes with respect to something else (usually time). In this case, a centre of gravity plot of the EU would show the average central point of Belgium, France, Germany, Italy, Luxembourg, and the Netherlands in 1958, then the average central point of Belgium, France, Germany, Italy, Luxembourg, the Netherlands, Denmark, Ireland, and the UK in 1973… and so on. I figured it should be easy enough, I’d just take Country off detail, replace it with Year joined, and average the latitudes and longitudes together.

Sadly, it doesn’t work that way. The Latitude (generated) and Longitude (generated) fields that Tableau automatically generates when it detects a geographic field like country can’t be aggregated, and can’t be used if the geographic field they’re based on isn’t in the view. That meant I couldn’t average the latitudes and longitudes over multiple countries without creating lots of different groups.

But, there’s a simple way around this! You can create a text table of the latlongs, copy/paste them into Excel or whatever, then read that in as another data source. Firstly, drag your geographic field into the view, and put the latitude on text, like so:

Then copy and paste it all (I just click on there randomly, hit ctrl+A, ctrl+C, switch to Excel, ctrl-V). Now do the same for the longitude. Save the document, and read it in as a separate data source in Tableau. Now you can blend the data on Country, or whatever your geographic field is, and you’ve got actual latlongs that you can use like proper measures.

And so I did. I recreated the line chart with the new fields, but took Country off detail, and made AVG(Latitude) and AVG(Longitude) into moving average table calculations which take the current value and an arbitrarily high number of previous values (I put in 100, just because). This looked pretty good:

…but then I realised that it wasn’t accurate data. Look at the point for 1973, after the UK, Ireland, and Denmark joined. Doesn’t that seem a little far north?

To investigate it fully, I duplicated the sheet as a crosstab, because sometimes, tables are the best way to go. What I found is that I’ve got a bit of Simpson’s Paradox going on; the calculation is taking averages of averages:

Not so great. If we add Country to the view after the Year joined pill, you can see what it should be:

But the problem is, how do we put Country on detail but then get the moving average to ignore it? I tried various LODs, but couldn’t get it to work exactly – if you have a solution, I would love to hear it! My default approach is to try to restructure the data in Alteryx – because that generally solves everything – but I feel like I’m becoming too reliant on restructuring the data rather than working with what Tableau can do.

Anyway, I ended up restructuring the data by generating a row for each country and year that the country has been a member of the EU. That means I can create a data table like this:

…which removes the need for a moving average calculation entirely, because the entire data is moving with the year instead. Just take country off detail / out of the view, and you get the right averages:

Much more accurate:

This is a better way of structuring the data for this particular instance, because the dataset is tiny; 28 countries, 60-ish years, 913 rows in my Excel file. It’s not going to be a good, sustainable solution for a centre of gravity plot over a much bigger dataset though. I did the same thing for the UN – 193 countries, 70-ish years – and ended up with 10,045 rows in my Excel file. It’s easy to see how this could explode with much more data.

It does look interesting, though; I’d never have guessed that the UN’s centre of gravity hadn’t really left the Sahara since its inception:

Finally, since I was on a roll, I plotted the centre of gravity for the English football champions since the first ever professional season in 1888-89. Conceptually, this was slightly different; unlike the EU and the UN, the champion isn’t a group of teams constantly joining over the years (although it is possible to plot that too). Rather, I wanted to create a rolling average of the centre of gravity over the last N years. If you set it to five years, it’s a bit messy, moving around the country quite a lot:

But if you set it to 20 years, the line tells a nice story. You can see how English football started out with the original northern teams being the most powerful, then it moves south after the Second World War, then it moves north-west during the Liverpool/Manchester era of domination, and finally it’s moving south again more recently:

Many thanks to Ian, who showed me how to parameterise this. Firstly, put your hard-coded (i.e. not Tableau generated!) latitude or longitude field in the view, and create a moving average over the last ten years. Or two, or thirteen, or ninety-eight, it doesn’t really matter. Next, drag the moving average latitude/longitude pill from the rows/columns into the measures pane in order to store it. This creates a calculated field. Meanwhile, create a parameter to let you select a number. This will change the period to calculate the moving average over. Open up the new calculated fields, and replace the number ten/two/thirteen/ninety-eight with your newly-created parameter, remembering to leave the minus sign in front of it:

This will let you parameterise your moving average centre of gravity.

It was a lot of fun to play around with these maps this week. I’ve packaged them all up in a Tableau Public workbook here; I hope you find it as interesting as I did!

(title inspiration: Touché Amoré – Gravity, Metaphorically)

Vizzee Rascal

Gwilym Lockwood's blog about turning numbers into pictures.

Tag Archives: Tableau

How to force proportional axes in Tableau

Strings and Roundabouts, pt.2: dynamic decimals

Calculating dynamic Z scores in Tableau

Standard errors and confidence intervals in Tableau

tl;dr version

Why show variance in your graphs?

Introduction to the data

Standard errors

Confidence intervals

Confidence intervals, pt.2 – what’s going on?

Donut charts and dynamic reference banding

Why won’t my Tableau small multiples chart work?

How to give your area and bar charts a makeover with connected scatterplots.

Strings and roundabouts: how to label your numbers in Tableau effectively

Time is a Machine: how to create a clock in Tableau

Centre of Gravity, Metaphorically: Plotting time-based changes on maps