Alteryx

Between order and (statistical) model: how the crosstab tool in Alteryx orders things alphabetically but inconsistently

I was using my Mahalanobis Distance calculation recently on some of my Spotify listening data, and I ran into difficulty. When I calculated the MD value of one song compared to the benchmark group, it gave me a value of 3.12. Nice. But, when I calculated the MD values of several songs at once compared to the same benchmark group, I got a value of 0.67 for that same song that was 3.12 when calculated individually. The same thing happened for lots of other songs; I got one value when calculating it individually, and another when calculating a whole bunch of them together.

This was weird, and after several hours of diagnosing what was going on, I finally found it. There’s an inconsistency with the Crosstab tool that I’d never noticed before, and this had a critical knock-on effect.

I’ll walk through it step by step with some random data. Here’s the content in a text input tool; note the variety of capitals, lower case, and numbers:

2

For the MD calculation, what I need is two tables; one where there’s a column for each Thing Name, like this:

3

And one where there’s a row for each Thing Name, like this:

4

It should be simple to generate this, but it isn’t because the Crosstab tool orders the Thing Name alphabetically.

First, let’s see what happens when generating the table with Thing Name as columns. Set the Crosstab tool up like this (for the aggregation, you can choose First, Average, Sum, it doesn’t make a difference with this dataset):

5

Run the workflow, and this is the output. Note how the output has reordered the Thing Name alphabetically:

6

It’s put the Thing Names beginning with numbers first, put those in ascending order, then taken all the Thing Names beginning with letters, and put those in alphabetical order a through z.

Right. Let’s now look at what happens when generating a table with one row per Thing Name. Set up the Crosstab tool like this (again, aggregation method doesn’t matter):

7

And here’s the output:

8

 

This time, it’s put the Thing Name in the rows in alphabetical order slightly differently. First come the Thing Names beginning with numbers in ascending numerical order as before… but then it’s treating Thing Names beginning with CAPITAL LETTERS and Thing Names beginning lower case letters separately. It runs through the capital-first Thing Names A through Z, and then and only then does it run through the lower case-first Thing Names a-z.

Considering that the MD calculation involves matrix multiplication where it’s assumed that the order of items in the rows and columns is identical, this creates a massive problem down the line!

There are two solutions. One is to CAPITALISE EVERYTHING before even starting, which will probably work in most cases… but if your Thing Names are identical except for case (e.g. if XXX, xXx, and xxX are different variables), it will collapse them together a bit like it does for punctuation. This is not ideal.

The other solution is to use record IDs and manual reordering to ensure that the rows and columns stay in the same order, like this (which is how I generated the first two tables in this blog):

9

This was an incredibly simple thing that was messing up my calculations, but it took me hours to find. If you’re running into issues – and even if you don’t think you are – check the order of things in your tables!

Standard
Tableau

Time is money, money is power, power is pizza, and pizza is knowledge.

I’m not proud to admit it. I was recently in a pub and pulled out my phone to calculate the area of a set of pizzas we were ordering. There were two sizes of pizza on the menu – 12″ and 18″, for £12 and £18 each – and my group were going to order five 12″ pizzas.

I couldn’t let that slide.

“Nah, let’s get three 18″ pizzas instead”, I said. “More pizza, less money.”

Intuitively, it feels wrong. Yes, an 18″ pizza is bigger, but surely five 12″ pizzas are more than only three 18″ pizzas? This is when I worked out the total area of the pizzas and their price per square inch, and for the record, buying three 18″ pizzas at £18 gets you 763 sq inch of pizza for £54, whereas buying five 12″ pizzas at £12 gets you 566 sq inch of pizza for £60. We bought the big ones.

This got me thinking about how to visualise this in Tableau, and I created the following handy dashboard for settling your pub pizza problems (click here for desktop version, or here for mobile version):

mobile pizza dash.png

This was a fun one to build, not least because there’s basically no underlying data. I knew I wanted the inputs for the calculations to depend entirely on parameters, so all I needed was the names of the two pizzas. This is the spreadsheet I created:

1

Simple.

The next thing was to create three parameters per pizza; diameter, price, and quantity. The values entered here can be passed to the data with a case statement, e.g.:

2

…or directly within a formula:

3

Things were starting to get complicated, so I built out a table to make sure the numbers were working properly:

4.png

From here on out, it was relatively straightforward. It’s really hard to accurately judge area when looking at circles and squares (which is one of the reasons I hate packed bubbles and tree maps), so I started out by visualising the pizzas as actual circles vs. bar graphs.

This makes the two pizzas look relatively similar in size:

5.png

…whereas this shows that one of them is more than double the other:

6

One useful little trick I found was to create a calculated field to place the two pizza icons in specific places:

9.png

This spaced them out nicely rather than piling them on top of each other or putting them side by side and too close together:

10.png

The most fiddly bit was creating the text summary. I dragged a lot of calculated string fields onto the label shelf, and configured it like this:

8

And an example of some of the fields going into making a sentence which automatically updates is the best pizza deal field here:

7

I really enjoyed playing around with this viz. People who use Tableau often talk about liberating your data, but this was a nice exercise in what’s possible with basically no data at all!

April pizza

Standard
Tableau

Paint your target: how to create a bullseye graph in Tableau

Ever thought of tracking whether something’s hitting the target by showing an actual target? I was looking through some old radial blogs recently, and realised you could use a scatterplot on x-y coördinates to show accuracy on a bullseye target.

(to cut straight to the viz on Tableau Public, go here; to find out how to create it, keep reading!)

For example, you could set up an image of a target a bit like this as a background image:

1

…and plot simple x-y points in Tableau over the top to show how close people/departments/countries are to meeting their target:

10

I used MS Publisher to design these concentric circles in the background image because when you select them all and save as an image, it doesn’t save the blank space as white, so you can use it with any colour worksheet or dashboard. Feel free to download this target image here (and if you use it, give me a shout! I’d love to see what kind of cool things you’re using it for).

Sadly, Superstore doesn’t have target data on it, so I’ve mocked up a quick dataset of sales and target sales per country as follows:

2

The next thing to do is calculate the “accuracy” on the bullseye; the nearer the sales figure is to the target, the closer to the middle of the bullseye it should be. However, once you’ve met or exceeded your target, you don’t want the points to keep moving. Countries with sales at 101% of the target and at 300% of the target should both still be in the middle.

The exact middle of the target is going to be at coördinates (0,0), which means that we actually want to create a field which takes some kind of inverse of the accuracy, where the greater the accuracy, the smaller the number in that field.

3

So, let’s create the following calculated fields. I’ve adapted them a bit from this blog on radial bar charts so that you can put the X field on columns and the Y field on rows, which is more intuitive.

Accuracy:

IF [Sales] > [Target] THEN 1.05
ELSE [Sales] / [Target]
END

Radial Field:

1.1 - [Accuracy]

Why have I set the Accuracy equaliser bit to 1.05 instead of 1? And why have I set the Radial Field calculation to 1.1 instead of 1?

Well, it’s a bit fiddly, but it’s about plotting. I want everything where Sales is 100% of Target or more to be in the gold bullseye, so I want that to fill a certain amount of space. If I set the Accuracy equaliser bit to 1 and the Radial Field calculation part to 1, then it plots everything that’s at 100% or more at (0,0) and includes everything that’s 90% or more in the gold bullseye. I want to space out the 100% or more points so that they’re not on top of each other, and I want only the 100% or more points to be included in the gold bullseye. Setting the Radial Field calculation part to 1.1 makes it so that the edge of the gold bullseye denotes 100%. That means that the 100% or more points will be plotted on the edge of the bullseye. So, the 1.05 part in the Accuracy calculation moves those points further inside the bullseye, but not into the exact middle where they’ll be on top of each other.

While I’m at it, there’s a limitation to the Accuracy calculation. Have you spotted how it’s [Sales] > [Target] rather than SUM([Sales]) > SUM([Target])? This is because the angle calculation needs to aggregate the Radial Field further, and will break if the Accuracy or Radial Field calculations are already aggregated. This means that you’ll probably need to do some data processing to make sure that it’s just one row per thing in the dataset.

The next fields to calculate are:

Radial Angle:

(INDEX() -1 ) *
(1/WINDOW_COUNT(COUNT([Radial Field])))
* 2 * PI()

X:

MAX([Radial Field]) * SIN([Radial Angle])

Y:

MAX([Radial Field]) * COS([Radial Angle])

Now you can drag X onto columns, Y onto rows, and put Country on detail:

4

This looks pretty rubbish, but that’s because the INDEX() function in the Radial Angle calculation is a table calc, and Tableau needs to know how to compute it. Edit the X and Y fields to compute using Country instead:

5

…and now you’ll get some points spaced out properly:

6

The calculations mean you can plot the points in a radial way; it’ll go through whatever field you’ve put on detail, and plot the points the right distance away from the centre, starting at 12 o’clock and looping round clockwise. With Country, it’s done alphabetically, so Belgium is on the X axis zero line. If you want to order the points differently, you can add a numeric field to the detail shelf, and change the table calculation to compute using that numeric field (but make sure to keep Country there!).

The next thing to do is to put the target in as a background image. Confusingly, this is under the Map options. Once you’ve found it, you have to specify exactly where the background image is going to sit on the x-y grid of coördinates. With this target, there are four concentric circles, denoting 70%, 80%, 90%, and 100%+. Because I set the Radial Field to be 1.1 – [Accuracy], there’s a distance of 0.1 in a circle around point (0,0) and the edge of the bullseye denoting 100%. That means that the bullseye section is going to run from -0.1 to 0.1 on both axes. Counting backwards, there are three other circles which are the same thickness; the light grey one denoting 90% will run from -0.2 to 0.2 on both axes, the mid grey one denoting 80% will run from -0.3 to 0.3 on both axes, and the dark grey one denoting 70% will run from -0.4 to 0.4 on both axes. As the edge of the dark grey circle is the edge of the background image, this is where you need to tell Tableau to position the background image:

7

Click OK, and there we are! A nice target, with the different points on it:

8

You can change the formatting to remove the zero lines if you like:

9

…but I actually kinda like them on this graph. I’ve also put Country on the colour shelf, and made the borders around the circles black.

10

I haven’t seen this approach before, so I’m not really sure what to call it. I’ve plumped for a bullseye graph, but maybe it already exists under another name. Let me know if somebody else has covered this, and definitely let me know if you find this useful! You can download the Tableau workbook I used to make this example here (it’s 10.2, by the way, but the same approach should work for older versions).

Standard
Alteryx, football, Tableau

The relationship between away team performance and distance travelled in the English football league

If you follow football, you often hear about arduous away trips to the other side of the country. This seems to imply that the further an away trip is, the more difficult it is for the away team.

However, is that actually true? Do away teams really do worse when they’ve travelled a long way to get there, or is there no difference?

The football league season has just finished, so I’ve taken each match result from the Championship, League One, and League Two in the 2016-17 season. After some searching, I got the coordinates of each football league team’s stadium, and used the spatial tools in Alteryx to calculate the distance between each stadium. I then joined that to a dataset of the match results, and you can download and play with that dataset here. I stuck that into Tableau, and you can explore the interactive version here.

First, let’s have a look at how many points away teams win on average when travelling different distances. I’ve broken the distance travelled into bins of 25 miles as the crow flies from the away team’s stadium to the home team’s stadium, then found the average number of points an away team wins when travelling distances in that bin (I excluded the games where the away team travelled over 300 miles as there were only two match ups in that bin – Plymouth vs Hartlepool and Plymouth vs Carlisle).

It turns out that it actually seems easier for away teams when they travel further away:

Teams travelling under 25 miles win just under a point on average, while teams travelling over 200 miles win between 1.3 and 1.6 points on average.

This is surprising, but there could be several reasons contributing to this:

  1. Local rivalries. It’s possible that away teams do worse in derby matches than in other matches; this is something to investigate further.
  2. Team bonding. It’s possible that travelling a longer distance together is a shared experience that can help with team bonding.
  3. Southern economic dominance. England is relatively centralised, economically speaking; most of the wealth is in the south. Teams in the South travel further than average to away games, so perhaps the distance advantage actually shows a southern economic advantage; teams in richer areas can buy better players.
  4. Centralisation vs. sparser regions. England is relatively centralised, geographically speaking; most of the population lives in the bits in the middle, and teams in the Midlands travel the least distance on average. Perhaps teams in more centralised areas (e.g. Walsall, Coventry) have more competition for resources like new talent and crowd attendance, while teams in less centralised areas (e.g Exeter, Newcastle) might have less competition for those resources.

I also used Tableau’s clustering algorithm to separate out teams and their away performances based on distance travelled, and it resulted in four basic away performance phenotypes (which you can explore properly and search for your own team here):

Since I had the stadium details, I had a look at whether the stadium capacity made a difference. This isn’t a sophisticated analysis – better teams tend to be more financially successful and therefore invest in bigger stadiums, so it’s probably just a proxy for how good the home team is overall, rather than capturing how a large home crowd could intimidate an away team.

Finally, this heat map combines the two previous graphs and shows that away teams tend to do better when they travel further to a smaller ground. This potentially shows the centralisation issue discussed earlier; the lack of data in the bottom right corner of the graph shows that there are very few big stadiums in parts of the country like the far North West, North East, and South West, where away teams have to travel a long way to get to.

So, it looks like the further an away team travels, the better they tend to do… although that could reflect more complicated economic and geographic factors.

Standard
Alteryx, R

How to be an R soul: an introduction to the R Tool in Alteryx

Alteryx is great for a lot of analysis, and the in-built tools improve with every release. But sometimes you just need to work with the R code directly; maybe you’ve inherited an R document that you need to reproduce, or maybe you need to use a specific package for sentiment analysis, or maybe you’re just far more used to R syntax and want to make sure the model is running exactly as you intend.

This is where the R tool comes in handy.

For this blog, I re-ran a section of one of the experiments I did for my PhD. You can find the data and R analysis script here (better still, download the Rmarkdown html and view in your browser to see the code and the command line output), and you can read the paper here. One section of the analysis compared mixed models using the lme4 package, which I’m not sure how to do in Alteryx. I’m sure there’s a way, but the R tool is perfect for making sure that I reproduce the results exactly.

First, drop the R tool into the workflow:

1

It’s not enough to just connect the previous tool to the R tool input, though; you have to specifically tell the R tool to load the data in. You can do that with this bit of code at the top of the scripting panel. The R tool takes multiple inputs so you can bring in various different pieces of data; the R tool recognises them as #1, #2, #3, etc. This line says “read input #1 into the R tool as a dataframe and store it as behdata within the R script”:

2

You then need to load the R packages you’ll be using. It’s a bit tricky to install extra R packages in Alteryx if the installer doesn’t match your version, but Alteryx comes with quite a lot of useful R packages pre-installed anyway (see here for Alteryx 10 and here for Alteryx 11). However, even if the packages are already installed, they need to be loaded each time.

3

Now, you can continue with the R code… for the most part.

4

Once you’ve done your coding, you’ll need to write the results to the R tool output. This code is pretty similar to the input; it reads “write the object modelcomparison to R tool output 1”:

5

However, because Alteryx works with dataframes, you can only write dataframes to Alteryx. This means you’ll have to convert matrices into dataframes, and if you’re dealing with lists, you’ll have to coerce them to dataframes before you can do anything with them.

Sadly, the R tool doesn’t have a command line. When I want to look at the properties of the model, in R I’d simply type summary(modelname) and get a nice result in the command line:

6

One way of doing this in Alteryx would be to store the summary as an object and then write to one of the outputs. However, a model summary like this is a list in R, which can’t be written to Alteryx without converting it to a dataframe first. If you try it, you’ll see this error:

7.png

and it’s a little more complicated than that, but that’s another blog for another time.

 

Standard
Science

Our revolution is a joke

I nearly didn’t go on the science march / the march for science / the super happy fun time science party on Saturday. I hadn’t been intending to, for reasons covered far better here, here, and especially here and post-hoc here (I’d like to buy the author of this one a beer just for the phrase “public displays of self-reckoned intelligence as a kind of performance art”). But while marching for science probably does nothing, sitting at home definitely does nothing, so I figured I’d go along for a bit of political protesting.

[side note: yes, I think it’s political. Science itself is, or at least should be, apolitical; but saying that political institutions like goverments should take scientific research seriously and make policies based on carefully researched and analysed evidence is still a political position]

I had a sign and everything. As 1) not that many people who weren’t already marching gave knew or cared about it, and as 2) signs that say “yay science!!1!” or signs that are very clever and obscure science jokes are pretty much useless, mine was mostly aimed at the other marchers:

post truth post modernism

I was trying to say that the post-truth politics that many academics were marching against and the post-modernist discourse that many of the same academics support are two sides of the same ideologically-motivated, evidence-free, autoethnographic coin… but it wasn’t very good, because it didn’t make BuzzFeed’s Literally Just 35 Signs From The Science March That Will Make You Laugh listicle. Which is a shame.

I’m not a historian, but I’m pretty sure that protests which were ultimately effective had a better message than just pop culture references, irony, or pith:effectiveness of protest vs ironic pop culture memes

Of course, with so many funny signs on display, a lot of people looked like they were having a great time. I’m glad that they were out in the sunshine enjoying themselves, but there’s a time and a place for it, and a serious political protest is not that place:

carnival vs serious political protest

After the rally in Parliament Square was done, the organisers and speakers all got on stage and sang Monty Python’s Galaxy Song, because the perfect way to trivialise a serious political point about the lack of evidence-based government policy is to make the headline act a load of awkward British people mumbling a twee song from an absurd musical comedy film:

pie chart of serious protests

All this is incredibly frustrating, because I agree with the general motivations for the march. I want what those people want to happen to happen. That’s why I was there.

But this isn’t going to help.

I’m not really sure what will help either, to be fair. As a scientist, my PhD was about how we can learn and process the sounds of words, not how we can hold governments to account. But as scientists, we can start by cutting out the jokes. After all, when we want our articles to be taken seriously, we keep the titles serious because articles with funny titles get fewer citations and less attention; why should it be any different for our protest signs?

Standard
Tableau

The Colour Revolt: using rank calculations to get around Tableau’s colour defaults

Tableau’s colour palettes are generally pretty good, pretty flexible, and pretty… well, pretty. But sometimes you want to colour-code a dimension with more values than the number of colours in the palette, and this is when Tableau has a bit of a problem.

This blog is about how Tableau assigns colours, how that messes some things up, and how to get around that.

I’m working with the Global Indicators dataset in Tableau 10.2 here. Let’s say I want to plot each country’s CO2 emissions each year. Without using colours at all, we get this:

1

Not exactly ideal, so I’ll put the Country dimension on the colour shelf. But before I do, I’m just going to take the CO2 Emissions measure off the view so that we can see how Tableau assigns colours to values in a dimension:

2

Tableau takes all values in the dimension in order, and loops through the colour palette (in this case the 20-shaded automatic palette), assigning the colours in order and restarting every 20 values.

As there are about 200 countries but only 20 colours, ten or so countries will be represented by the same colour. This might not be a problem, depending on what we’re plotting, but then again, it might. Let’s put the CO2 Emissions measure back into the view:

3

Yeah, it’s a problem. It just so happens that China and the USA are a multiple of 20 apart in the dimension list, so they’ve been assigned the same brown colour. This is a problem.

Even more irritatingly, Tableau assigns the colour palette to all possible values in the dimension before you filter them. So, if I change the filter so that it only has China and the USA…

4

…they’re still the same colour.

How can we assign colours to only the countries selected by the filter in the view, rather than all countries?

One solution (thanks Anna for the tip!) is to create a calculated Rank() field, which will only look at the countries in the view. I’ll stick with a default rank of the countries themselves for now, which will do it alphabetically, but you can also do it by how they do on a particular measure.

5

Make this new calculated field discrete, and drag it onto the colour shelf. You’ll also need to separate out the countries by putting Country on detail (which was previously done automatically when Country was on colour).

6

This has changed the colours… but they’re still the same. This is because Rank() is a table calculation, and we need to change its default settings. Instead of compute using Table (across), compute using Country:

7

This now ranks the countries alphabetically (in reverse order, but it doesn’t really matter for this purpose) and assigns a different colour to each of them:

8

Now we have nicely distinguishable colours for dimensions with loads of values when we actually only want to plot a handful!

The colour marks card is a bit annoying, though; it only shows the rank number next to the colour, not the country name. You can get around this by creating a new sheet, applying the country filter to that sheet too, and just putting Country in rows:

9

You can now use this second sheet as a colour legend in a dashboard.

Of course, you might suggest that a simpler solution to the colour problem is to forget about colours altogether and just use labels on the lines. And you’d be totally correct! But maybe you’ve got a more complicated plot where you’ve got several different values for the same country or something. Like a scatterplot of two measures with a circle for each year:

10

Labels would look pretty hideous on this, especially if the circles overlapped a lot. But the Rank() colour workaround still applies.

Standard