Tableau

Standard errors and confidence intervals in Tableau

tl;dr version

Here’s how to make bar graphs with standard errors and confidence intervals in Tableau. It involves making some calculations yourself, which may or may not differ from Tableau’s built in versions. You can download the workbook showing you how to make the graphs here:
https://public.tableau.com/profile/gwilym#!/vizhome/Standarderrorsandconfidenceintervals/Standarderrorbarsoptions

Why show variance in your graphs?

Sometimes when you plot values on a graph, you want to show not only the aggregated value, but also the variance or uncertainty around it. Now, before I get into this blog properly, I want to say that I don’t actually recommend plotting bar graphs with error bars or confidence intervals, as it can be misleading. The Bar Bar Plots campaign has far more information on it, but ultimately it’s more honest, and really straightforward, to show the actual data points in Tableau, so why wouldn’t you just do that?

Friends don’t let friends make barplots – solid advice from Page Piccinini.

But in the event that you do need to show simple bars and an indication of uncertainty, you’ve got two main options:

  1. Standard errors
  2. Confidence intervals

Introduction to the data

I’m going to use some data I collected during an experiment I ran in 2015. In this experiment, Dutch people learned some Japanese ideophones (vividly descriptive words). But there was a catch – half the words they learned were with the real meanings (e.g. fuwafuwa, which means “fluffy”, and they learned that it meant “pluizig”), and half the words they learned were with the opposite meanings (e.g. debudebu, which means “fat”, but they learned that it meant “dun”, or “thin”). Then they did a quick test to see if they remembered the word associations correctly. You can read more about that here, if you like.

All the following graphs in this blog have been created in this workbook on Tableau Public. Please feel free to download and explore how it’s all made!

Here’s a simple bar graph of the results. For the words they learned with their real meanings, people answered correctly in the test round 86.7% of the time. But when tested on the words they learned with their opposite meanings, people answered correctly only 71.3% of the time.

But this hides the variation in the data. Sure, the average in each condition (and the difference between them) is what I care about, but with simple bar graphs, it’s easy to forget that lots of individual people are below and above the average in each condition. You can see that variation here:

Also, these are averages taken from a sample. I can’t go to a conference and say, “hey everybody, I’ve done the research and Dutch undergrads get 86.7% correct in the real condition and only 71.3% in the opposite condition”… well, I could, but it would be misleading. I can’t guarantee that these results are definitely in line with what the entire population of Dutch undergrads would get if I somehow managed to test all of them, so I need to make some kind of statement about the uncertainty of that result. I can do this with standard errors or confidence intervals.

Standard errors

Let’s start with standard errors. The standard error of the mean is essentially a way of saying how uncertain you are about the mean based on the size of your sample by estimating the standard deviation of the whole population. The wikipedia article on standard errors is pretty good.

The first step is to create a field for the standard error. This is the standard deviation of the scores per condition, divided by the square root of the number of participants:

STDEV([Correct])
/
SQRT(COUNTD([Participant]))

You’ll notice I’ve also got fields for the sample standard deviation and the not sample standard deviation. This is from when I was playing around with different calculations for the standard deviation of the sample vs. the standard deviation of the population. I’m not going to go into it in this blog, but here’s a really nice explainer here, and you can download the workbook to investigate further. In summary, it looks like Tableau’s native STDEV() function uses the formula for the corrected sample standard deviation by default, rather than the population standard deviation. This is pretty nice, it feels like a safer assumption to make. Cheers, Tableau.

Now that we’ve got the standard error, we can create new fields for our upper and lower standard error limits like this:

AVG([Correct]) – [SE]
and
AVG([Correct]) + [SE]

So, now we can create some nice standard error bars. This uses a combination of measure names/values and dual axes, so it’s a little bit complicated. Firstly, create your simple bars for the correct % per condition:

Now, drag the lower standard error field onto rows to create a separate graph. Drag the upper standard error field onto the same axis of that new graph to set up a measure names/measure values situation:

Now, switch the measure values mark type to line, and drag measure values from columns and drop it on the path card:

All you have to do now is create a dual axis graph, synchronise the axes, and remove condition from colour on the standard error lines:

Great! We’ve now got bar graphs with standard error bars. I mean, I still don’t recommend doing this, but it’s a common request.

Confidence intervals

Now, let’s have a look at confidence intervals. They are a range around your sample mean which tell you that, if you repeated the same study over and over, X% (usually 95%) of confidence intervals from future studies will contain the true population mean. They’re hard to explain (there’s a good blog here), but easy to see.

In Tableau, confidence intervals are really straightforward. You can plot your data points, go to the analytics pane, and bring in an “average with 95% CI” reference line, which creates a reference band around the average:

Nice. This is exactly how I’d like to visualise my experimental data! You can see the average per condition, the confidence intervals, and the underlying participant data.

Quick disclaimer: because I’m looking at percentages here, this is a proportion rather than a hard and fast value, so I shouldn’t actually be using confidence intervals at all… but if we pretend that the 86.7% value is actually an average 0.867 value of something like my participants taking 0.867 seconds taken to respond, or young children being 0.867 metres tall at a certain age, or 0.867 kg lost for each week under a new diet plan, then it’s okay. I’m just going to keep going with my percentages, but please bear this in mind.

However, if your journal insists on old school bar graphs, Tableau’s built in average with 95% CI reference band won’t work. Well, technically it will, it’s just that it’ll show you this:

Because we’ve had to take Participant off detail in order to show an aggregation across participants, the reference band doesn’t know how to compute it, and it assumes that there’s just one data point.

One way around this would to built a dual axis graph. Keep the bars with just condition on colour, and create another axis. Add participant to detail, and set the mark type to circle. Make the circles as small as possible and completely transparent, hit dual axis, synchronise axes, and voila. Now you can have an average with 95% CI reference band again.

The downside is that this is pretty ugly. The reference line/band is way outside the edges of the bars, and it just doesn’t have that standard look that you’re used to. What we actually want is something like our standard error lines from earlier, but with confidence intervals.

The good news is that we can do it! But we’ll have to move away from Tableau’s built in confidence intervals, and create our own calculation, just like we did with standard errors.

The first step is to use the standard error field we made earlier to calculate the confidence intervals. When you look up how to calculate confidence intervals, you’ll probably find something saying that 95% confidence intervals are calculated by taking the mean, and adding/subtracting 1.96 multiplied by the standard error. This 1.96 figure is from the Z distribution, which tells you that 95% of normally distributed data is within 1.96 standard deviations of the mean. And because this is a sample of a population, we multiply that 1.96 by the standard error to get our confidence intervals. Here’s another great blog which breaks it all down.

So, we can create separate fields for our upper and lower confidence interval limits like this:

AVG([Correct]) – (1.96 * [SE])
and
AVG([Correct]) + (1.96 * [SE])

Once we’ve done that, we can build our graphs. This is the same technique as the standard error bars earlier. Create the measure names/values and dual axis graph with measure names on the line path, and you’ll get the same kind of graph, but now showing confidence intervals instead of standard errors:

Excellent! We’ve now got our 95% confidence intervals… or do we?

Confidence intervals, pt.2 – what’s going on?

Some of the more statistically minded of you may have been yelling at the screen when I used the 1.96 value from the Z distribution to calculate my confidence intervals. You see, confidence intervals shouldn’t always simply use the Z distribution, even though that’s the standard formula you’ll find when looking up the definition of confidence intervals. Rather, when you’ve got a small sample, which is generally defined as under 30, you should use the T distribution because the size of the sample may skew the normality of the sample. Again, there’s a lot of good information here.

I started investigating this when I noticed that Tableau’s average with 95% confidence interval calculations were different from my manually calculated ones. Have a look at this comparison – you’ll notice that the confidence interval values are slightly different:

I started playing around with the Z/T value in the confidence interval calculation by making it parameter-driven, and I found that Tableau’s confidence interval calculation seemed to use a number like 2.048 rather than 1.96:

This is because Tableau’s confidence interval calculation is using the T distribution rather than the Z distribution. You can find the appropriate T values to use based on your degrees of freedom (which is your sample size minus one) in Appendix B.2 of this very useful pdf (there’s also a table set to 4dp instead of 3dp here). In my case, I’ve got 29 participants, so the degrees of freedom is 28, and the lookup table shows that the relevant T value for a 95% confidence interval is 2.048, so I can put that in my confidence interval calculations. It also looks like Tableau’s confidence intervals are calculated on a more precise number than 2.048, which suggests that the back end is calculating it directly from the T distribution rather than using the fairly common approach of looking it up in a table where everything is rounded to three decimal places. That’s pretty nice too.

My next step was to check whether Tableau switches between the T and Z distributions based on sample size. So, I duplicated my data and fudged the [correct] field by a random number to create a sample of 58 participants. With 58 participants, it’s fine to use the Z distribution to calculate 95% confidence intervals. But even then, it looks like Tableau is using the T distribution – when I set my parameter to 2.0025 using the slightly-more-precise values in the T table here, you can see that the confidence intervals using T values, not Z values, match Tableau’s calculations:

This is pretty good as well, I think. As your sample size increases, the T distribution starts to match the Z distribution more and more closely anyway. Notice how, with 29 participants, the T value was 2.0484, and with 58 participants, it was 2.0025. This is getting closer and closer to 1.96. At 200 participants, the T value would be 1.9719. Overstating the confidence intervals by using the T distribution is safer default behaviour than accidentally understating them by using the Z distribution.

So, to conclude, I’ve found out the following about confidence intervals in Tableau:

  1. They’re based on standard errors which use the corrected sample standard deviation (and Tableau’s STDEV() function returns the corrected sample standard deviation as well).
  2. They’re based on the T distribution regardless of your sample size.

Again, I’ve published the workbook containing my demo graphs and my standard deviation and T vs. Z explorations here: https://public.tableau.com/profile/gwilym#!/vizhome/Standarderrorsandconfidenceintervals/Standarderrorbarsoptions

One final word of thanks to my colleague David for helping me out with some of the troubleshooting!

Standard
data visualisation, Tableau

Donut charts and dynamic reference banding

Donut charts aren’t everybody’s cup of tea, but I quite like them for showing a percentage against a total which has to be 100%. Things like the percentage of tickets answered within an hour, or an industrial test pass rate as a percentage, or an on time percentage.

The problem is that percentages often come with targets. If you’re measuring a rate, you’re probably measuring it to check that you’re on target. For example, if you’ve got 19.8% of tickets being answered within an hour, you’ve probably also got a target of 15% or 20% or something, and you’d probably want to show that on your donut chart for context, like this:

In Tableau, you can’t do that, not without creating some pretty filthy trigonometric calculations. But I’ve recently found a workaround which I rather like, which I’ll explain in this blog. You can download the supporting workbook from Tableau Public here.

I’ve used Superstore, which isn’t too ideal for percentages and targets, but hey, it’s something everybody uses. Let’s say you’re the head of sales for California. You know you’re a big market, and you want to keep it that way – you want 15% of all of Superstore’s sales to be in California.

You can create donut charts showing this percentage easily by creating two fields. One called [California Sales], which is:
IF [State] = “California” THEN [Sales] END

The other would be [Rest of US Sales], which is:
SUM([Sales]) – SUM([California Sales])

And you’d put it on a donut chart with those two fields as the two measure values, then put measure names on colour, and split it out by category to get something like this:

Sadly, we can’t put a reference line at the 15% mark to show the target. Not easily, at least. But what we can do is to play around with the colours. If the percentage is above the target, we could show the percentage up to the target in yellow, and then the overperformance in green, like so:

And if we adjust the target higher, we could show the percentage up to the actual percentage in yellow, and then the underperformance in red, like so:

This is a little complicated. It requires a few extra calculations; [California Sales Percentage], [Target Distance], [California Sales Base], [Rest of Sales], [California Sales Over], and [California Sales Under]. Let’s go through the logic one by one.

[California Sales Percentage]
In this calculation, you take the existing [California Sales] field that you’ve made, and found out what that is as a percentage of all sales. It’s simply:
SUM([California Sales]) / SUM([Sales])

[Target Distance]
This is how far from the target the California Sales Percentage is. I’ve used [Target] as a parameter to make it adjustable, but you could also hardcode it. It’s simply the California Sales Percentage minus the target; so, if you’ve got an actual % of 21%, and your target is 15%, then the Target Distance will be 6%. It’s simply:
[California Sales Percentage] – [Target]

[California Sales Base]
This calculation will be what’s in yellow in the donut. If your California Sales Percentage is above the target, then you’ll want it to be yellow up to the target, and then green above that, so this base field will simply be the target. If your California Sales Percentage is below the target, then you’ll want it to be yellow up to the actual sales percentage, and then red for the space between the percentage and the target. So, you can calculate it like this:
IF [Target Distance] > 0 THEN ([Target] * SUM([Sales]))
ELSE SUM([California Sales]) END

[Rest of Sales]
This is the bit in grey. If your California Sales Percentage is above the target, then you’ll want it to be grey from the actual sales up to 100%. If your California Sales Percentage is below the target, then you’ll want it to be grey from the target value up to 100%. That can be calculated like this:
IF [Target Distance] < 0 THEN
SUM([Sales]) – ([Target] * SUM([Sales]))
ELSEIF [Target Distance] > 0 THEN
SUM([Sales]) – SUM([California Sales])
END

[California Sales Over]
This is the bit in green. If your California Sales Percentage is above the target, then you’ll want it to be green between the target and the actual sales percentage. If it’s below target, you don’t want it to show up at all, so set it to zero like this:
IF [Target Distance] > 0 THEN
SUM([California Sales]) – ([Target] * SUM([Sales]))
ELSE 0 END

[California Sales Under]
Finally, this is the bit in red. If your California Sales Percentage is below the target, then you’ll want it to be red between the actual sales percentage and the target. If it’s above target, you don’t want it to show up at all, so set it to zero like this:
IF [Target Distance] < 0 THEN
([Target] * SUM([Sales]))-SUM([California Sales])
ELSE 0 END

Okay! Now we’re ready to build our donuts. This is the easy bit.

Build out your donuts like normal, like this:

Now, instead of the current two measure values, we’ll want all four of the colour ones:

For this one, I’ve set the target to 20% so that there are examples of categories that are above and below target, all in one view.

And that’s it! It’s not quite a reference line, but it’s a nice way of showing a percentage performance against a target within a donut chart. As a reminder, you can download my workbook here:
https://public.tableau.com/profile/gwilym#!/vizhome/Dynamicreferencebandingondonutcharts/Donutchart-dynamicreferencebanding

Standard
data visualisation, Tableau

Why won’t my Tableau small multiples chart work?

Andy Kriebel wrote a great tutorial on how to make small multiples charts in Tableau here. It works pretty much all the time… but you’ll also find that if you simply copy and paste the calculations, it might not work with your data.

For example, have a look at Superstore here. I’m plotting sum of Sales for each continuous month per product subcategory. The rows and columns calculations split up the view nicely into a line for each subcategory, which is good:

small multiples not working

But look closely, and you’ll see some weird stuff going on; there’s a brown-ish dot for subcategory = Copiers and month = October 2014 in the Chairs section (second row, second from left):

small multiples not working - highlight point

What’s going on there?

It turns out that the rows and columns calculations can’t handle nulls in the underlying dataset. I haven’t dived into this fully, but I’m guessing this is because the index calculation works depending on what’s in the view, rather than being fixed on all the subcategories and months regardless of whether there’s data or not.

In this case, what happens is that the October 2014 missing data for one one category – Accessories – shunts everything else up one; the Appliances value turns up in the Accessories small multiple, the Art value turns up in the Appliances small multiple, and so on. The same thing would happen in March 2014 if there was another subcategory after Tables too.

table with nulls

You’ll see that if you switch to calculating using quarters instead of months, this problem disappears completely.

Andy’s calculations are great because they’re really flexible, and they’ll work fine without much further adjustment most of the time. But if you get issues with null data like this, you can try this alternative instead.

With these calculations, I’m going to hardcode the small multiples by whatever thing you’re splitting up the view by. That means that you’d have to create separate fields for every dimension you’d ever want to do it by, which is extra work, but it does take care of the nulls issue.

First, create a calculation called Number (or Subcategory ID, or Steve, or whatever suits you). This is a case statement which assigns a number from 0 to N-1 for a particular dimension.

CASE [Sub-Category]
WHEN 'Accessories' THEN 0
WHEN 'Appliances' THEN 1
WHEN 'Art' THEN 2
WHEN 'Binders' THEN 3
WHEN 'Bookcases' THEN 4
WHEN 'Chairs' THEN 5
WHEN 'Copiers' THEN 6
WHEN 'Envelopes' THEN 7
WHEN 'Fasteners' THEN 8
WHEN 'Furnishings' THEN 9
WHEN 'Labels' THEN 10
WHEN 'Machines' THEN 11
WHEN 'Paper' THEN 12
WHEN 'Phones' THEN 13
WHEN 'Storage' THEN 14
WHEN 'Supplies' THEN 15
WHEN 'Tables' THEN 16
END

Typing all that out is quite a faff, so I generate that text with a concat function in Excel like this:

excel help for calc

Now create a calc called Columns with the modulo function like this:

[Number] % 4

And then create a calc called Rows by dividing and rounding like this:

INT ([Number] / 4)

It’s crucial that you use the same constant each time! I’ve used 4 because that’ll give me 4 columns across the top, meaning that the 17 subcategories in superstore will be split over four rows of four columns and a fifth row with one column, exactly like Andy’s small multiples do. If you want to do it another way, you could use 3 instead. That would give you five rows of three columns and a sixth row of two columns. There’s a lot of playing around with the configuration, but it’s also more flexible in terms of the configuration you want to plot.

Now that you’ve got these row and column calcs, you can drag them into the view like this, and generate small multiples which work even with null data:

small multiples fixed (simple)

Just to make sure, let’s colour code it by subcategory too. No differently coloured dots in the wrong places anymore!

small multiples fixed plus colour

Another advantage of this approach is that you can colour the graphs by another field. You can do that with Andy’s calcs too, but you have to be careful about how the table calcs work and what they’re using to compute the calculations. Because my calcs don’t have index() in them, there’s no table calc issues to worry about. Just drag and drop.

small multiples fixed plus region colour

Standard
Tableau

How to give your area and bar charts a makeover with connected scatterplots.

I haven’t always been a fan of connected scatterplots, but I’m gradually coming round to them. First it was with centre of gravity maps; now it’s as a replacement for (some) area charts and bars.

I came across a chart a lot like this at work this week:

1 area and bars

I’ve mocked it up using some fake data, but it’s pretty much showing how two departments (or groups, or types of things, or categories) do in terms of profit (or another measure) over the year, and how many orders (or another measure again) there were in each month across both groups.

I’m not a fan of the area chart plus bars over the top approach. Firstly, the bars obscure what’s going on with the area chart underneath them, and secondly, area charts can be misleading as it’s hard to parse each thing separately.

Let’s show the same information with a connected scatterplot:

2 simple connected scatterplot

Here, the red and blue lines show the two departments, and the grey line shows the aggregated number of orders and profit. Instantly, the difference between the two departments is a lot clearer; department A has had a lot of variation in number of orders but profit has stayed pretty consistent, while department B has had a lot of variation in profit but the number of orders has stayed consistent.

The lines are joined up by month… but as it is, it’s impossible to tell where the year begins and ends, which makes the whole thing pretty pointless. Let’s show time with size:

3 connected scatterplot two colour, size

The lines get thicker as the month gets more recent, and now it’s easy to see the trends over the year. The variation in orders in department A is all over the place, but the variation in profit in department B is a bit more consistent; profit has gone down over the year. We can also see the aggregate profit and order trends much more clearly on the grey line, with orders going up but profit going down.

The downside of using line size to show time trends is that the thinner parts are hard to see and the thicker parts can be hard to parse. Let’s try it with colour instead:

4 connected scatterplot, colour range, same size

I personally prefer this approach to using line size, but it’s also a bit of a faff. Tableau doesn’t like it if you try to do a three-way colour split by dimension (i.e. red, blue, and grey) and then change the shade of the colour by a measure, so you have to convert the month to discrete and make sure to order everything correctly. It’s not too taxing though, so it’s worth it if the data doesn’t change all the time.

Finally, you can go the whole hog and do some double encoding with both colour and size on the line to show time:

5 connected scatterplot, colour range, size

This is eye-catching, but possibly to the point where it’s more distracting than informative.

I like the connected scatterplots in this example, although there are many situations where the lines will overlap in a way that won’t tell you much. It certainly won’t work well with lots of different departments; here’s the mess you get if you look at all countries in EU Superstore for all months of all years:

6 eu superstore all

Even when filtering to four countries and two years of data, it’s not the clearest way of showing things:

7 eu superstore some

In summary, then, give a connected scatterplot a go. It may well not work, but sometimes it’ll result in something a lot clearer and more informative than a combined area/bar chart.

Standard
Tableau

Strings and roundabouts: how to label your numbers in Tableau effectively

Lots of Tableau dashboards feature big summary numbers. They’re pretty nice, and they look like this:

1 numbers

And what’s even nicer is that you can alter the measure’s default number format to automatically round a specific unit, like this:

2 round to k.png

But sometimes, the range of numbers is a lot wider than sales per state in Superstore. Let’s have a look at population per country in the World Indicators dataset:

3 populations.png

Those are some long numbers, so let’s round them to the nearest unit again. But which one? If we round to the thousands, we get this:

4 populations k.png

And if we round to the millions, we get this:

5 populations m.png

It’d be great if we could get Tableau to figure out what the nearest sensible unit is. That functionality doesn’t exist yet (as far as I know!), but we can write a specific optimised rounding calculation for labelling purposes. It’s a bit of a long one:

6 pop rounded optimised

This calculation returns the number you want as a string. It does this by:

  1. Aggregating the number you’re actually working with already and finding out whether it’s above a billion (in which case you’d want to summarise to whatever number of billions it is), or above a million (in which case you’d want to summarise to whatever number of millions it is), and so on.
  2. Converting it to an absolute number so that it works for negative numbers too.
  3. Taking that aggregated number and dividing by the sensible unit. For example, if your number is 34000000, you’d want to express it as 34 million, so we’re dividing it by a million to return 34.
  4. Rounding that divided figure to one decimal place. This is just my preference, you can do what you like! Set the number to 0 for no decimal places, or 2 for two decimal places, etc.
  5. Convert that number to a string.
  6. Add a text unit abbreviation to the end of it.

Of course, you can also add trillions, quadrillions, and so on, if that’s what your data requires.

This sorts us out nicely:

7 boom

Now, I’ve deliberately aggregated everything within the calculation, and I’m only using it for labelling purposes. I categorically do not recommend aggregating outside this calculation or using this calculation for calculating anything else. This will result in a shitstorm of rounding errors which can seriously damage your data. But as a final step once you’ve sorted everything out, I find that this is really nice for presenting data.

Here’s the calculation in text so you can copy and paste it into your workbooks:

IF ABS(AVG([Population Total])) >= 1000000000 THEN
//round for billions
STR(ROUND(AVG([Population Total] / 1000000000), 1)) + "b"
ELSEIF ABS(AVG([Population Total])) >= 1000000 THEN
//round for millions
STR(ROUND(AVG([Population Total] / 1000000), 1)) + "m"
ELSEIF ABS(AVG([Population Total])) >= 1000 THEN
//round for thousands
STR(ROUND(AVG([Population Total] / 1000), 1)) + "k"
ELSE
STR(ROUND(AVG([Population Total]),0))
END

Standard
data visualisation, Tableau

Time is a Machine: how to create a clock in Tableau

I made a clock in Tableau this week, and you can find it on Tableau Public here.

1 dhmis clock

It always shows the current time for the UK, but it shouldn’t be hard to parameterise to update to whatever time zone you’re in.

Essentially, all it is is two points on a scatterplot, connected by lines to the coördinates (0,0), and superimposed on a background image. I made the background image in Powerpoint, based on the clock in the Time episode of Don’t Hug Me I’m Scared.

I’ve written before about using radial calculations to plot distance from the centre and change the lengths while keeping the angles constant. This time, we’re going to change up the trigonometry a bit, and calculate the angle while keeping how far the line goes constant.

Firstly, though, we need some data to work with. All you need to get a DateTime is a single cell in a single column… but for plotting purposes, we’re going to need the following dataset:

2 data

That’s all we’ll need! Read that into Tableau, and the rest can be done with calculated fields.

Firstly, we need to find out what the time is. Tableau has the NOW() function, which is really useful. It returns the exact time, down to the second, of the time on your computer (assuming that you’re working in Tableau Desktop with an Excel sheet you’ve created just for this). But when it’s published on Tableau Server, it returns the time of the Tableau Server Data Engine, which seems to be eight hours behind UK time (as of 19th September 2017, when I’m writing this; I’ve no idea how daylight saving changes will affect it).

So, let’s create our Right Now field, and add eight hours to it with the DATEADD() function so that it’ll give us the UK time when published:

Right Now:
DATEADD('hour', 8, NOW())

The next step is to take Right Now and parse out the time parts that we want to plot. Let’s just go with hours and minutes; plotting seconds is possible, but it’ll look like it’s not working if the dashboard isn’t updating every half second or so. So, let’s create an Hours field and a Minutes field as follows:

Hours:
DATEPART('hour', [Right Now])

Minutes:
DATEPART('minute', [Right Now])

This will give the current hour and the current minute as a number. There’s an extra step we need to take, though… the hour hand on a clock doesn’t point at the exact hour number for the whole of the hour, it moves around depending on the minutes that have passed. If it’s half past ten, the hour hand doesn’t point at ten exactly, it points about halfway between the ten and the eleven.

3 hour hand issues

So, let’s create another field called Exact Hour for the exact point between hour marks to plot:

Exact Hour:
[Hours] + ([Minutes] / 60)

This works by giving us the hour (e.g. 6 for 6pm), and then adding the amount of the hour that we’ve got through. For example, if it’s 6.15pm, the number of minutes is 15, and we’re quarter of the way through the hour. 15/60 = 0.25, so the point where the hour hand will point to is 6.25, i.e. quarter of the way from 6 to 7.

After that, we need to create a single field to plot. This is why the underlying data has the Time Unit field, with separate rows for each hand.

Time for plotting:
IF [Time Unit] = "Hours" THEN
[Exact Hour]
ELSE
[Minutes]
END

Now that we have our field to plot, we’re ready to do some trigonometry!

We know that we want the clock hands to begin at (0,0) on the scatterplot; what we need to work out is where the clock hands need to end. To be able to plot the X and Y coördinates of where the hands end, we first need to know the angle of the line from (0,0). In simple terms, the scatterplot works like this:

4 angles

Finding the angle is fairly simple. There are 360° in a circle, and rather conveniently, a clock face is just a big old circle, starting with 0° from the centre at the 12 o’clock position. There are 12 hour points that go round the clock face, so if we want to find out the hour hand’s angle, we divide the hour value by 12 to find out how far around 360° it is, then multiply that fraction by 360. For minutes, the same thing holds, but there are 60 points instead of 12.

Angle:
IF [Time Unit] = "Hours" THEN
([Time for plotting] / 12 ) * 360
ELSE
([Time for plotting] / 60 ) * 360
END

“But wait!”, I hear you shout at the screen. Dividing the hour by 12 might work for the morning, but what about when it’s the afternoon, when Tableau’s DATEPART() function will return the number 18 for 6pm, as it works on a 24 hour format?

You’re completely right, I haven’t accounted for that. But I don’t really need to. If it’s 6pm, the hour is 18. 18/12 is 1.5, and multiplying that fraction by 360 gives us 540°. Sure, 540 is bigger than the 360° that are in a circle… but the wonderful thing about circles is that they’re, well, circular. Plotting 540° on this clock face will look identical to plotting 180°. If it bothers you that they’re not technically the same, feel free to add an IF clause to identify the afternoon and then subtract 12 hours from the Exact Hour field.

Now that we’ve got the right angles, we can calculate where the coördinates go. This is a bit more tricky.

The first thing to bear in mind is that I’ve changed the trigonometric functions to reflect how Tableau will actually plot the angles, rather than using the standard ones in maths textbooks.

Maths textbooks will tell you that to find the coördinates (X,Y) on a circle, given the angle θ and a radius of 1 from the centre point (0,0), the equations are Y = Sin θ and X = Cos θ. I’m not going to go into why or how here, but please just trust me on this one and take it at face value. Y = Sin θ and X = Cos θ.

Those maths textbooks will also give you a diagram like this:

5 maths 1

But this isn’t what we have; we have this angle instead:

5 maths 2

…so using the exact same calculations won’t quite work for us here, because they calculate it relative to a different axis. But, we can still use the earlier diagram to help us work it out; we just need to rotate it and flip it a bit until we have what we need:

5 maths 3.2

This looks like the angle we’re trying to work out, right?

This means that our X axis is the Y axis in the canonical diagram, and our Y axis is the X axis in the canonical diagram. Let’s just rename the two axes so X goes along the bottom and Y goes up and down again:

5 maths 4

Now, for us, X = Sin θ and Y = Cos θ. Nice.

That’s all well and good, but there’s another step before it’ll actually work in Tableau. We’ve calculated our angle in degrees (because that’s what everybody learns at school first, and that’s still what’s the most intuitive thing for me). Thing is, Tableau uses radians with trigonometric functions. When we use radians, 360° is equivalent to 2π… which means that 1° is equivalent to π/180. So, we can still use our angle field, we just have to multiply it by π/180 (radians is another thing that you’ll just have to take my word on for now, I’m afraid; just remember that π = 3.14159… and so on, and π also = 180°).

Finally, we want our clock hands to be different lengths. To do this, you can take the equations and multiply them by a constant. Through trial and error, I found that I liked it best when the minute hand was 1.6x the length of the hour hand, so I multiplied the equations by 1.6 when it was for minutes and by 1 when it was for hours, just to keep it consistent.

The fields are:

X:
IF [Path ID] = 1 THEN
IF [Time Unit] = "Minutes" THEN
1.6 * SIN([Angle]* PI() / 180)
ELSE
1 * SIN([Angle]* PI() / 180)
END
ELSE 0
END

Y:
IF [Path ID] = 1 THEN
IF [Time Unit] = "Minutes" THEN
1.6 * COS([Angle]* PI() / 180)
ELSE
1 * COS([Angle]* PI() / 180)
END
ELSE 0
END

If you’re wondering why Path ID matters, it’s about connecting the lines to the dots. What we need is to have the lines start at (0,0) and end at (X,Y), but we still need to tell Tableau that the starting point is (0,0) where Path ID = 0.

That’s a lot of trigonometry, but we’re finally done! All you need to do now is to drag SUM(X) to columns and SUM(Y) to rows, and put Time Unit on detail. This will give you two circles. Drag SUM(Y) to rows again, and change it to line. Put Path ID on the Path shelf. Then dual axis the two SUM(Y) fields, and synchronise axes.

This probably doesn’t quite look right yet, because you have to make sure that you fix both the X and Y axes to be between the same range; I’ve fixed both of mine to go from -2 to +2, which has worked out nicely.

6 clock.png

That’s it for making the clock! But there’s even more fun to be had in the final step, which is playing around with background images. I found a lot of beautiful handless clock faces online, but most of them have copyright restrictions, so I’m not going to use those. Instead, I went for an homage to Don’t Hug Me I’m Scared, a youtube series with probably my favourite animated clock character of all time. At some point, I might try it out with my own face and see how horrific that looks.

I hope this helps! It was really fun to build and write about. Please leave me a comment if you have any questions, and I’ll do my best to answer.

(title inspiration: Time is a Machine – Listener)

Standard
Alteryx, football, Maps, Tableau

Centre of Gravity, Metaphorically: Plotting time-based changes on maps

I haven’t written a blog in far too long. My bad. So, to get back into the swing of things, here’s something I’ve been playing with this week: centre of gravity plots.

It started with an accident. I had some EU member data, and I was simply trying to make a filled map based on the year each country joined, just to see if it was worth plotting. You know, something like this:

1 eu filled.png

Except that I’d been having a clumsy day (the kind of day where I spilled coffee on my desk, twice), and accidentally missed the filled map option and clicked line instead:

2 broken line.png

Now, I normally don’t like connected scatterplots, but realised that I could change a couple of things to this accident to make quite a nice connected scatterplot on a map, joining up the central latitude and longitude of each country, so I thought I’d follow through with it and see what happened.

(by the way, the colour palette I use is the Viridis Palette, which I absolutely love. You can find the text to copy/paste into your Tableau preferences file here)

Firstly, I changed my “year joined” field from a discrete dimension into a continuous measure so that I could make it a continuous line with AVG(Year joined):

3 connected line left right.png

This connects all the countries by their central latitude and longitude as generated by Tableau, but it joins them up in order from left to right on the map. So, I then added AVG(Year joined) to the path shelf as well, which means that each country is joined in chronological order, or in alphabetical order when there’s a tie (as with Belgium, France, Germany, Italy, Luxembourg, and the Netherlands, who formed the EU in 1958):

4 connected line year.png

I was pretty happy with this; it shows the EU’s expansion eastwards over time far, far better than the filled map did.

I got talking to Mark and Neil online, who introduced me to the idea of “centre of gravity” plots, which show the average latitude and longitude of something and how it changes with respect to something else (usually time). In this case, a centre of gravity plot of the EU would show the average central point of Belgium, France, Germany, Italy, Luxembourg, and the Netherlands in 1958, then the average central point of Belgium, France, Germany, Italy, Luxembourg, the Netherlands, Denmark, Ireland, and the UK in 1973… and so on. I figured it should be easy enough, I’d just take Country off detail, replace it with Year joined, and average the latitudes and longitudes together.

Sadly, it doesn’t work that way. The Latitude (generated) and Longitude (generated) fields that Tableau automatically generates when it detects a geographic field like country can’t be aggregated, and can’t be used if the geographic field they’re based on isn’t in the view. That meant I couldn’t average the latitudes and longitudes over multiple countries without creating lots of different groups.

But, there’s a simple way around this! You can create a text table of the latlongs, copy/paste them into Excel or whatever, then read that in as another data source. Firstly, drag your geographic field into the view, and put the latitude on text, like so:

5 create table.png

Then copy and paste it all (I just click on there randomly, hit ctrl+A, ctrl+C, switch to Excel, ctrl-V). Now do the same for the longitude. Save the document, and read it in as a separate data source in Tableau. Now you can blend the data on Country, or whatever your geographic field is, and you’ve got actual latlongs that you can use like proper measures.

And so I did. I recreated the line chart with the new fields, but took Country off detail, and made AVG(Latitude) and AVG(Longitude) into moving average table calculations which take the current value and an arbitrarily high number of previous values (I put in 100, just because). This looked pretty good:

6 cog flawed averages.png

…but then I realised that it wasn’t accurate data. Look at the point for 1973, after the UK, Ireland, and Denmark joined. Doesn’t that seem a little far north?

7 cog flawed illustrated

To investigate it fully, I duplicated the sheet as a crosstab, because sometimes, tables are the best way to go. What I found is that I’ve got a bit of Simpson’s Paradox going on; the calculation is taking averages of averages:

8 cog flawed explained.png

Not so great. If we add Country to the view after the Year joined pill, you can see what it should be:

9 what it should be doing.png

But the problem is, how do we put Country on detail but then get the moving average to ignore it? I tried various LODs, but couldn’t get it to work exactly – if you have a solution, I would love to hear it! My default approach is to try to restructure the data in Alteryx – because that generally solves everything – but I feel like I’m becoming too reliant on restructuring the data rather than working with what Tableau can do.

Anyway, I ended up restructuring the data by generating a row for each country and year that the country has been a member of the EU. That means I can create a data table like this:

10 restructured table

…which removes the need for a moving average calculation entirely, because the entire data is moving with the year instead. Just take country off detail / out of the view, and you get the right averages:

11 restructured table 2

Much more accurate:

12 EU cog.png

This is a better way of structuring the data for this particular instance, because the dataset is tiny; 28 countries, 60-ish years, 913 rows in my Excel file. It’s not going to be a good, sustainable solution for a centre of gravity plot over a much bigger dataset though. I did the same thing for the UN – 193 countries, 70-ish years – and ended up with 10,045 rows in my Excel file. It’s easy to see how this could explode with much more data.

It does look interesting, though; I’d never have guessed that the UN’s centre of gravity hadn’t really left the Sahara since its inception:

12 UN cog

Finally, since I was on a roll, I plotted the centre of gravity for the English football champions since the first ever professional season in 1888-89. Conceptually, this was slightly different; unlike the EU and the UN, the champion isn’t a group of teams constantly joining over the years (although it is possible to plot that too). Rather, I wanted to create a rolling average of the centre of gravity over the last N years. If you set it to five years, it’s a bit messy, moving around the country quite a lot:

13 english football 5 years.png

But if you set it to 20 years, the line tells a nice story. You can see how English football started out with the original northern teams being the most powerful, then it moves south after the Second World War, then it moves north-west during the Liverpool/Manchester era of domination, and finally it’s moving south again more recently:

14 english football 20 years.png

Many thanks to Ian, who showed me how to parameterise this. Firstly, put your hard-coded (i.e. not Tableau generated!) latitude or longitude field in the view, and create a moving average over the last ten years. Or two, or thirteen, or ninety-eight, it doesn’t really matter. Next, drag the moving average latitude/longitude pill from the rows/columns into the measures pane in order to store it. This creates a calculated field. Meanwhile, create a parameter to let you select a number. This will change the period to calculate the moving average over. Open up the new calculated fields, and replace the number ten/two/thirteen/ninety-eight with your newly-created parameter, remembering to leave the minus sign in front of it:

15 calc mov avg param

This will let you parameterise your moving average centre of gravity.

It was a lot of fun to play around with these maps this week. I’ve packaged them all up in a Tableau Public workbook here; I hope you find it as interesting as I did!

(title inspiration: Touché Amoré – Gravity, Metaphorically)

Standard