Alteryx, data visualisation, football, Tableau

The growing gap between the Premiership’s Top Six and the rest.

This is my first football data blog for a while, and I feel all nostalgic! It’s nice to dive into some league table data again, and even nicer now that I have Alteryx; I was able to format my data about 10x quicker than when I first started doing this in R. Then again, I’ve probably also spent 5x more time using Alteryx than R in the last year or so. Anyway.

I’ve been hearing a lot more analysis of the Top 6 in the Premiership recently. I first noticed it in the last couple of seasons, when I saw a few journalists/people on Twitter writing about a “Big Six Mini-League”. Liverpool often seemed to do quite well at this, and Arsenal often seemed to do quite badly. Neither team won the actual league.

I’ve started looking at how the Top 6 sides in the Premiership perform each year (using data from this fantastically well-maintained repository), and there are quite a few interesting stories in here. The first main point is that the big clubs are accelerating away from the rest of the league. The second main point is that any big six mini-league doesn’t really matter, as you can win the Premiership with an underwhelming record against your main rivals if you thrash everybody else. I mean, that shouldn’t be much of a surprise – if you’re a Top 6 team, only 30 points are on offer from the 10 matches against your rivals, but you can potentially take 84 points from the 28 matches against the rest of the league.

For all these analyses, I’m taking Top 6 literally – meaning the teams that finish that season in the top six positions. Nothing to do with net spend, illustrious history, shirt sales in Indonesia, or anything like that. I then look at the average points-per-game changes by team, position, season, and Top 6/Bottom 14 status. I also filtered out the first three seasons of the Premiership to keep it slightly easier for comparison, since there were 22 teams in the league until 1995-96.
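
If you fancy replicating that prep outside Alteryx, the points-per-game logic is simple enough to sketch in R. Here’s a rough version; the matches and top6 data frames and their column names are invented for illustration:

    library(dplyr)

    # matches: season, home_team, away_team, home_goals, away_goals
    # top6: season, team (the six teams finishing 1st-6th that season)

    # one row per team per match, with points won
    long <- bind_rows(
      matches %>% transmute(
        season, team = home_team, opponent = away_team,
        points = ifelse(home_goals > away_goals, 3,
                 ifelse(home_goals == away_goals, 1, 0))),
      matches %>% transmute(
        season, team = away_team, opponent = home_team,
        points = ifelse(away_goals > home_goals, 3,
                 ifelse(away_goals == home_goals, 1, 0)))
    )

    # flag Top 6 status for each team and its opponent, then average
    ppg <- long %>%
      mutate(team_top6 = paste(season, team) %in% paste(top6$season, top6$team),
             opp_top6 = paste(season, opponent) %in% paste(top6$season, top6$team)) %>%
      group_by(season, team_top6, opp_top6) %>%
      summarise(ppg = mean(points), .groups = "drop")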

When plotting the average points-per-game per season between the two groups, a clear trend emerges; the Top 6 are getting better and better at beating the rest of the league:

[Chart: average points-per-game in matches between the Top 6 and the Bottom 14, by season]

However, this trend appears to be asymmetrical. When looking at the overall average points-per-game for all games across the season, teams that finish in the Top 6 are getting better, but there’s only a negligible decline for everybody else. This suggests the bigger, better teams are pulling away from the rest of the league:

[Chart: overall average points-per-game per season, Top 6 and Bottom 14]

This effect is most striking when plotting the difference in overall average points-per-game between the two groups:

[Chart: difference in overall average points-per-game between the Top 6 and the Bottom 14, by season]

Teams finishing in the Top 6 scored around 0.6 points-per-game more than the rest of the league in the early nineties, but that’s now up to over 1 point-per-game in the latest couple of seasons. That extra half a point per game translates to a 19-point difference across a whole 38-game season.

We can plot each team in each season of the Premiership (since 1995-96, when the league was first reduced to 20 teams) and look at how well they did against the top teams and the rest of the league. In this graph, the straight line represents equal performance vs the Top 6 and Bottom 14:

[Chart: points-per-game vs the Top 6 plotted against points-per-game vs the Bottom 14, one dot per team per season since 1995-96]

A few things stand out:

1. Only a handful of teams have ever done better vs. the Top 6 than against the rest of the league. This seems to have no effect on final position.

2. It’s possible to win the league with a poor record against the Top 6 by consistently beating everybody else. Manchester United won the league in 00-01 and 08-09 with only 1.3 PPG vs the Top 6.

3. Manchester City this year are ridiculous.

I find it interesting to compare Liverpool and Arsenal over the years. One narrative I sometimes hear is that Liverpool tend to raise their game for big matches, but are too inconsistent the rest of the time, whereas Arsenal struggle against big sides but do well enough in the rest of the league to consistently finish well. This chart seems to bear that analysis out; Liverpool’s cluster of dots is higher on the chart, but further to the left:

[Chart: Liverpool’s seasons, PPG vs the Top 6 against PPG vs the Bottom 14]

…while Arsenal’s cluster is slightly lower but further to the right… and most importantly, more colourful:

[Chart: Arsenal’s seasons, PPG vs the Top 6 against PPG vs the Bottom 14]

And if you want to explore other teams and seasons, there’s an interactive version of all these graphs here.

Alteryx, data visualisation, Maps, Tableau

Alaska Fried Chicken: the UK’s curious approach to naming chicken shops.

I went a little bit viral a couple of weeks ago when I tweeted about chicken shops in the UK which are named after American states which aren’t Kentucky. If I’d thought about it, I’d have written this blog up first, created a Tableau Public viz, and had all kinds of other shit ready to plug once I started getting some serious #numbers… but I didn’t. So, to make up for that, this blog will go through that thread in more detail and answer a few questions I received along the way.

It all started when I walked past Tennessee Fried Chicken in Camberwell, pretty close to where I live. It’s clearly a knock-off KFC, and I wanted to know how many other chicken shops had the same name format: [American state] Fried Chicken.

The first thing to do is to get a list of all the restaurants in the UK. I spent a while wondering how to get this data, but then I remembered that my colleague Luke Stoughton once built a Tableau Public dashboard about food hygiene ratings in the UK. All UK chicken shops – hopefully! – are inspected by the Food Standards Agency. So, Luke kindly showed me his Alteryx workflow for scraping the data from the FSA API, and I adjusted it to look for chicken shops.
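
If you don’t have an Alteryx workflow to hand, the same lookup can be sketched in R with httr. The endpoint and version header below are the FSA API’s; the response field names are from memory, so treat them as assumptions:

    library(httr)

    # ask the FSA ratings API for establishments with "fried chicken" in the name
    resp <- GET("https://api.ratings.food.gov.uk/Establishments",
                query = list(name = "fried chicken", pageSize = 5000),
                add_headers("x-api-version" = "2", accept = "application/json"))

    shops <- content(resp, as = "parsed")$establishments

    # flatten the bits we care about: name, hygiene rating, coordinates
    shops_df <- do.call(rbind, lapply(shops, function(s) {
      data.frame(name = s$BusinessName,
                 rating = s$RatingValue,
                 lat = as.numeric(s$geocode$latitude),
                 lon = as.numeric(s$geocode$longitude),
                 stringsAsFactors = FALSE)
    }))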

My first line of inquiry is pretty stringent: how many chicken shops in the UK are called “X Fried Chicken” where X is an American state which isn’t Kentucky?
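
A strict version of that filter is one regex, built from R’s built-in state.name vector and applied to the shops_df from the sketch above (note that misspelled variants like Tenessee need fuzzier matching than this):

    # "X Fried Chicken" where X is an American state which isn't Kentucky
    states <- setdiff(state.name, "Kentucky")
    pattern <- paste0("^(", paste(states, collapse = "|"), ") Fried Chicken")
    knockoffs <- shops_df[grepl(pattern, shops_df$name, ignore.case = TRUE), ]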

Turns out it’s 34. “Tennessee Fried Chicken” – including variants such as Tenessee and Tennesse – is the most popular with 13 chicken shops. The next highest is Kansas with six, which I’m assuming is so the owners can refer to their shops as KFC, although maybe the owner/s just really like tornadoes, wheat, and/or the Wizard of Oz. Then there’s four Californias, a couple of Floridas, and one each of Arizona, Georgia, Michigan, Mississippi, Montana, Ohio, Texas, and Virginia.

[Map: US states with an “X Fried Chicken” shop in the UK]

[tangent: I’m aware that a lot of these states aren’t exactly famed for their fried chicken, but as a Brit, all I have to go on for most of them are my stereotypes from American media. But hey, maybe it’s still accurate, and Ohio Fried Chicken tastes of opiates and post-industrial decline, Arizona Fried Chicken comes pre-pulped for the senior clientele who can’t chew so well these days, and Florida Fried Chicken is actually just alligator. Michigan Fried Chicken is, I dunno, fried in car oil rather than vegetable oil, and Alaska Fried Chicken is their sneaky way of dealing with the bald eagle problem up there? I’m running out of crude state stereotypes now, I’m afraid. Out of all these states, I’ve only actually been to California.]

There’s also a “DC Fried Chicken”, which is close but not quite close enough for me, and a “South Harrow Tennessee Fried Chicken”, which I’m not counting either.

Here is where these American State Fried Chicken shops are in the UK:

[Map: locations of “X Fried Chicken” shops across the UK]

Interestingly, this isn’t a case of a map simply showing population distributions. The shops cluster around the London and Manchester regions, but with almost none in any other major urban centre.

Let’s have a look at the clusters separately. Here’s the chicken shops around the Manchester area:

[Map: chicken shops around the Greater Manchester area]

None of them are in the proper centre of Manchester itself, but in the towns around it. One town in particular stands out: Oldham. Let’s have a look at the centre of Oldham:

[Map: chicken shops in the centre of Oldham]

Oldham, you’re fantastic. There are six separate “X Fried Chicken” shops in Oldham, and four of them – Georgia, Michigan, Montana, and Virginia – are the only ones by that name in the whole country.

For comparison, here’s the London area:

[Map: chicken shops in the Greater London area]

This is where all the Tennessees are, as well as the one Texas and Mississippi.

It looks like there’s a lot more variety in the north of England compared to the south, and sure enough, a split emerges:

[Chart: scatterplot of each shop’s UK latitude against the latitude of its namesake US state]

[chicken icon from https://www.flaticon.com/packs/animals-33]

Chicken shops in the south of England (and that one Tennessee place in Wales) tend to name their shops after states in the geographical south of the USA, while chicken shops in the north of England name their shops after any states they like.

This is where my initial Twitter thread ended, and I woke up the next day to a lot of comments like “Y IS THEIR NO MARYLAND THEIR IS MARYLAND CHICKEN IN LEICESTER”. Well, yeah, but it’s not Maryland Fried Chicken, is it?

So I re-ran the data to look at chicken shops with an American state in the name. This is the point at which it’s hard to tell if there’s any data drop out; the FSA data categorises places to inspect as restaurants, takeaways, etc., but not as specifically as chicken shops. All I’ve got to go on is the name, so I’ve taken all shops with an American state and the word “chicken” in the name. This would exclude (sadly fictional) places like “South Dakota Spicy Wings” and “The Organic Vermont Quail Emporium”, but it’d also include a lot of false positives; for example, you’d think that taking all takeaway places with “wings” in the name would be safe, but when I manually checked a few on Google Street View (because I’m dedicated to my research), about half of them are Chinese and refer to the owner’s surname, not the delicacy available.

This brings in a few more states – Maryland, New Jersey, and Nevada:

[Map: US states with any shop named after them with “chicken” in the name]

Let’s have another look at the UK’s south vs north split. We’ve got a bit of midlands representation now, with the Maryland Chickens in Leicester and Nottingham, the Nevada Chickens in Nottingham and Derby, and a California Chicken & Pizza near Dudley. The latitude naming split between the south/midlands and the north isn’t quite as obvious anymore:

[Chart: the same latitude scatterplot, without the “Fried” restriction]

…but, there is still a noticeable difference. This graph shows each chicken shop with an American state and the word “chicken” in the name, ordered by latitude going south to north:

[Chart: each shop with an American state and “chicken” in the name, ordered by latitude from south to north]

In the south and the midlands, there’s the occasional chicken shop that’s going individual – there’s the Texas Fried Chicken in Edmonton, the two seemingly unrelated Mississippi places in London (Mississippi Chicken & Pizza in Dagenham, Mississippi Fried Chicken in Islington), the Kansas Chicken & Ribs place in Hornsey (almost definitely a different chain from the six Kansas Fried Chicken shops in and around Manchester), and the California Fried Chicken in Luton (probably independent of the California Fried Chickens on the south coast) – but most of them are Tennessee or Maryland chains in the same area. In all, the south and midlands have 17 chicken shops named after 8 American states (excluding Kentucky), or a State-to-Chicken-Shop ratio of 0.47.

In the north, however, there’s a proliferation of independent chicken shops – 15 shops named after 9 different states (excluding Kentucky), or a State-to-Chicken-Shop ratio of 0.6. There’s the chain of six Kansas Fried Chicken places and two Florida Fried Chicken places in Manchester and Oldham, but the rest are completely separate. Good job, The North.

The broader question is: why does the UK do this? There’s obviously the copycat nature of it; chicken shops want to seem plausible, and sounding like a KFC (and looking like one too, since they’re almost always designed in red/white/blue colours) links it in people’s minds. I think there’s more to it, though. Having a really American-sounding word in the name is probably a bit like how Japanese companies scatter English words everywhere to sound international and dynamic (even if they make no sense), or how Americans often perceive British names and accents as fancier and more authoritative (even if to British ears it’s somebody from Birmingham called Jenkins). We’re doing the same, but… for fried chicken.

Finally, since this data is all from the Food Standards Agency’s hygiene ratings, it’d be a shame not to look at the actual hygiene ratings:

[Chart: food hygiene ratings of state-named chicken shops]

It looks like the independent state-named chicken shops in the north are more hygienic. The chains in the south and midlands – Tennessee, Maryland, California, and especially New Jersey – don’t have great hygiene ratings, and the independent shops there do pretty badly too. In contrast, the chicken shops in the north score highly for cleanliness. In fact, a quick linear regression of hygiene onto latitude gives me an R² of 0.74 and a p-value of < 0.0001. Speculations as to why this is on a postcard, please.
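
(For the curious: that regression is nothing fancier than lm() on the shop-level data, along these lines, using the made-up column names from the earlier sketch:)

    # does latitude predict hygiene rating? non-numeric ratings
    # such as "Exempt" become NA and drop out of the model
    fit <- lm(as.numeric(rating) ~ lat, data = shops_df)
    summary(fit)  # R-squared and the p-value for latitude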

Preëmpting your questions/comments:

“I live in […] and my local shop […] isn’t mentioned!”
Maybe you’re talking about a Dallas Chicken place. That’s not a state. Nor is Dixy Chicken; it just sounds a bit American. If it’s definitely a state, then does it have chicken in the name? If not, I won’t have picked it up. I also haven’t picked up shops which have, say, “Vermont Fried Chicken” written on the shop sign if it’s registered in the database as “VFC”. The same goes if the state is misspelled, either by the shop or by the data collectors. If it’s all still fine, perhaps the shop is so new that it hasn’t had an inspection… or perhaps the shop is operating illegally and isn’t registered for a hygiene inspection.

“Did you know about Mr. Chicken, the guy who designs the signs?”
I didn’t, but I do now! He’s brilliant.

“How did you do all this?”
I use Alteryx for data scraping/preparation and Tableau for data visualisation.

“I have an idea for something / I want to talk to you about something, can I get in touch?”
Please do! My Twitter handle is @GwilymLockwood, or you can email me on gwilym.lockwood@theinformationlab.co.uk

“Your analysis is amazing, probably the best thing I’ve ever seen with my eyes. Where can I explore more of your stuff?”
Thanks, that’s so kind! There’s a lot of my infographic work on my Tableau Public site here.

Alteryx, football, Maps, Tableau

Centre of Gravity, Metaphorically: Plotting time-based changes on maps

I haven’t written a blog in far too long. My bad. So, to get back into the swing of things, here’s something I’ve been playing with this week: centre of gravity plots.

It started with an accident. I had some EU member data, and I was simply trying to make a filled map based on the year each country joined, just to see if it was worth plotting. You know, something like this:

[Map: EU member states, filled by year joined]

Except that I’d been having a clumsy day (the kind of day where I spilled coffee on my desk, twice), and accidentally missed the filled map option and clicked line instead:

[Map: the same data accidentally drawn as a line mark]

Now, I normally don’t like connected scatterplots, but I realised that I could change a couple of things about this accident to make quite a nice connected scatterplot on a map, joining up the central latitude and longitude of each country, so I thought I’d follow through with it and see what happened.

(by the way, the colour palette I use is the Viridis Palette, which I absolutely love. You can find the text to copy/paste into your Tableau preferences file here)

Firstly, I changed my “year joined” field from a discrete dimension into a continuous measure so that I could make it a continuous line with AVG(Year joined):

[Map: countries connected by a continuous AVG(Year joined) line, joined left to right]

This connects all the countries by their central latitude and longitude as generated by Tableau, but it joins them up in order from left to right on the map. So, I then added AVG(Year joined) to the path shelf as well, which means that each country is joined in chronological order, or in alphabetical order when there’s a tie (as with Belgium, France, Germany, Italy, Luxembourg, and the Netherlands, who formed the EU in 1958):

[Map: countries connected in chronological order of joining the EU]

I was pretty happy with this; it shows the EU’s expansion eastwards over time far, far better than the filled map did.

I got talking to Mark and Neil online, who introduced me to the idea of “centre of gravity” plots, which show the average latitude and longitude of something and how it changes with respect to something else (usually time). In this case, a centre of gravity plot of the EU would show the average central point of Belgium, France, Germany, Italy, Luxembourg, and the Netherlands in 1958, then the average central point of Belgium, France, Germany, Italy, Luxembourg, the Netherlands, Denmark, Ireland, and the UK in 1973… and so on. I figured it should be easy enough, I’d just take Country off detail, replace it with Year joined, and average the latitudes and longitudes together.

Sadly, it doesn’t work that way. The Latitude (generated) and Longitude (generated) fields that Tableau automatically generates when it detects a geographic field like country can’t be aggregated, and can’t be used if the geographic field they’re based on isn’t in the view. That meant I couldn’t average the latitudes and longitudes over multiple countries without creating lots of different groups.

But, there’s a simple way around this! You can create a text table of the latlongs, copy/paste them into Excel or whatever, then read that in as another data source. Firstly, drag your geographic field into the view, and put the latitude on text, like so:

[Screenshot: text table with the geographic field in the view and latitude on Text]

Then copy and paste it all (I just click in there randomly, hit ctrl+A, ctrl+C, switch to Excel, ctrl+V). Now do the same for the longitude. Save the document, and read it in as a separate data source in Tableau. Now you can blend the data on Country, or whatever your geographic field is, and you’ve got actual latlongs that you can use like proper measures.

And so I did. I recreated the line chart with the new fields, but took Country off detail, and made AVG(Latitude) and AVG(Longitude) into moving average table calculations which take the current value and an arbitrarily high number of previous values (I put in 100, just because). This looked pretty good:

[Map: centre of gravity line, flawed version]

…but then I realised that it wasn’t accurate data. Look at the point for 1973, after the UK, Ireland, and Denmark joined. Doesn’t that seem a little far north?

[Map: the flawed centre of gravity line, with the suspicious 1973 point highlighted]

To investigate it fully, I duplicated the sheet as a crosstab, because sometimes, tables are the best way to go. What I found is that I’ve got a bit of Simpson’s Paradox going on; the calculation is taking averages of averages:

[Crosstab: the flawed calculation taking averages of averages]

Not so great. If we add Country to the view after the Year joined pill, you can see what it should be:

[Crosstab: the correct averages, with Country in the view]

But the problem is, how do we put Country on detail but then get the moving average to ignore it? I tried various LODs, but couldn’t get it to work exactly – if you have a solution, I would love to hear it! My default approach is to try to restructure the data in Alteryx – because that generally solves everything – but I feel like I’m becoming too reliant on restructuring the data rather than working with what Tableau can do.

Anyway, I ended up restructuring the data by generating a row for each country and year that the country has been a member of the EU. That means I can create a data table like this:

[Table: restructured data, one row per country per year of membership]

…which removes the need for a moving average calculation entirely, because the data itself now moves with the year. Just take Country off detail / out of the view, and you get the right averages:

[Table: the restructured data giving the right averages]

Much more accurate:

[Map: the EU’s centre of gravity over time]

This is a better way of structuring the data for this particular instance, because the dataset is tiny; 28 countries, 60-ish years, 913 rows in my Excel file. It’s not going to be a good, sustainable solution for a centre of gravity plot over a much bigger dataset though. I did the same thing for the UN – 193 countries, 70-ish years – and ended up with 10,045 rows in my Excel file. It’s easy to see how this could explode with much more data.
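
If I were doing this over a much bigger dataset, one way to avoid the row explosion would be to skip the expansion and build the running averages cumulatively outside Tableau. A sketch in R, with the same naive averaging of latlongs as above and invented column names:

    library(dplyr)

    # eu: one row per country, with year_joined, lat, lon
    cog <- eu %>%
      group_by(year_joined) %>%
      summarise(sum_lat = sum(lat), sum_lon = sum(lon), joiners = n()) %>%
      arrange(year_joined) %>%
      mutate(members = cumsum(joiners),
             cog_lat = cumsum(sum_lat) / members,  # running average latitude
             cog_lon = cumsum(sum_lon) / members)  # running average longitude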

It does look interesting, though; I’d never have guessed that the UN’s centre of gravity hadn’t really left the Sahara since its inception:

[Map: the UN’s centre of gravity over time]

Finally, since I was on a roll, I plotted the centre of gravity for the English football champions since the first ever professional season in 1888-89. Conceptually, this was slightly different; unlike the EU and the UN, the champion isn’t a group of teams constantly joining over the years (although it is possible to plot that too). Rather, I wanted to create a rolling average of the centre of gravity over the last N years. If you set it to five years, it’s a bit messy, moving around the country quite a lot:

[Map: English champions’ centre of gravity, five-year moving average]

But if you set it to 20 years, the line tells a nice story. You can see how English football started out with the original northern teams being the most powerful, then it moves south after the Second World War, then it moves north-west during the Liverpool/Manchester era of domination, and finally it’s moving south again more recently:

[Map: English champions’ centre of gravity, 20-year moving average]

Many thanks to Ian, who showed me how to parameterise this. Firstly, put your hard-coded (i.e. not Tableau generated!) latitude or longitude field in the view, and create a moving average over the last ten years. Or two, or thirteen, or ninety-eight, it doesn’t really matter. Next, drag the moving average latitude/longitude pill from the rows/columns into the measures pane in order to store it. This creates a calculated field. Meanwhile, create a parameter to let you select a number. This will change the period to calculate the moving average over. Open up the new calculated fields, and replace the number ten/two/thirteen/ninety-eight with your newly-created parameter, remembering to leave the minus sign in front of it:

[Screenshot: the moving average calculation with the parameter swapped in]

This will let you parameterise your moving average centre of gravity.

It was a lot of fun to play around with these maps this week. I’ve packaged them all up in a Tableau Public workbook here; I hope you find it as interesting as I did!

(title inspiration: Touché Amoré – Gravity, Metaphorically)

Alteryx

Between order and (statistical) model: how the crosstab tool in Alteryx orders things alphabetically but inconsistently

I was using my Mahalanobis Distance calculation recently on some of my Spotify listening data, and I ran into difficulty. When I calculated the MD value of one song compared to the benchmark group, it gave me a value of 3.12. Nice. But, when I calculated the MD values of several songs at once compared to the same benchmark group, I got a value of 0.67 for that same song that was 3.12 when calculated individually. The same thing happened for lots of other songs; I got one value when calculating it individually, and another when calculating a whole bunch of them together.

This was weird, and after several hours of diagnosing what was going on, I finally found it. There’s an inconsistency with the Crosstab tool that I’d never noticed before, and this had a critical knock-on effect.

I’ll walk through it step by step with some random data. Here’s the content in a text input tool; note the variety of capitals, lower case, and numbers:

[Table: the text input data, with a mix of capitals, lower case, and numbers]

For the MD calculation, what I need is two tables; one where there’s a column for each Thing Name, like this:

[Table: one column per Thing Name]

And one where there’s a row for each Thing Name, like this:

[Table: one row per Thing Name]

It should be simple to generate this, but it isn’t, because the Crosstab tool reorders the Thing Names alphabetically – and, as it turns out, not always in the same way.

First, let’s see what happens when generating the table with Thing Name as columns. Set the Crosstab tool up like this (for the aggregation, you can choose First, Average, Sum, it doesn’t make a difference with this dataset):

[Screenshot: Crosstab tool configured with Thing Name as column headers]

Run the workflow, and this is the output. Note how the output has reordered the Thing Name alphabetically:

[Table: the crosstab output, with the Thing Names reordered alphabetically]

It’s put the Thing Names beginning with numbers first, put those in ascending order, then taken all the Thing Names beginning with letters, and put those in alphabetical order a through z, regardless of case.

Right. Let’s now look at what happens when generating a table with one row per Thing Name. Set up the Crosstab tool like this (again, aggregation method doesn’t matter):

[Screenshot: Crosstab tool configured with Thing Name as rows]

And here’s the output:

[Table: the crosstab output, with rows ordered numbers first, then capitals, then lower case]

This time, it’s put the Thing Names in the rows in alphabetical order slightly differently. First come the Thing Names beginning with numbers, in ascending numerical order as before… but then it treats Thing Names beginning with CAPITAL LETTERS and Thing Names beginning with lower case letters separately. It runs through the capital-first Thing Names A through Z, and then and only then does it run through the lower case-first Thing Names a through z.
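
I don’t know what the Crosstab tool is doing internally, but the two orderings look exactly like the difference between a locale-aware sort and a raw byte-order sort, which you can reproduce in R:

    x <- c("apple", "Banana", "2fast", "cherry", "Apple")

    # locale-aware sort: digits first, then letters with case ignored,
    # so apple and Apple end up next to each other
    sort(x)

    # byte-order sort (C locale): digits, then capitals A-Z, then lower case a-z
    sort(x, method = "radix")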

Considering that the MD calculation involves matrix multiplication where it’s assumed that the order of items in the rows and columns is identical, this creates a massive problem down the line!

There are two solutions. One is to CAPITALISE EVERYTHING before even starting, which will probably work in most cases… but if your Thing Names are identical except for case (e.g. if XXX, xXx, and xxX are different variables), it will collapse them together a bit like it does for punctuation. This is not ideal.

The other solution is to use record IDs and manual reordering to ensure that the rows and columns stay in the same order, like this (which is how I generated the first two tables in this blog):

[Screenshot: workflow using Record IDs and manual reordering to keep rows and columns aligned]

This was an incredibly simple thing that was messing up my calculations, but it took me hours to find. If you’re running into issues – and even if you don’t think you are – check the order of things in your tables!

Alteryx, football, Tableau

The relationship between away team performance and distance travelled in the English football league

If you follow football, you often hear about arduous away trips to the other side of the country. This seems to imply that the further an away trip is, the more difficult it is for the away team.

However, is that actually true? Do away teams really do worse when they’ve travelled a long way to get there, or is there no difference?

The football league season has just finished, so I’ve taken each match result from the Championship, League One, and League Two in the 2016-17 season. After some searching, I got the coordinates of each football league team’s stadium, and used the spatial tools in Alteryx to calculate the distance between each pair of stadiums. I then joined that to a dataset of the match results, and you can download and play with that dataset here. I stuck that into Tableau, and you can explore the interactive version here.
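
(If you want to reproduce the distance calculation without Alteryx’s spatial tools, it’s just the haversine formula; here’s a sketch in R:)

    # great-circle distance in miles between two latlong points
    haversine_miles <- function(lat1, lon1, lat2, lon2) {
      to_rad <- pi / 180
      dlat <- (lat2 - lat1) * to_rad
      dlon <- (lon2 - lon1) * to_rad
      a <- sin(dlat / 2)^2 +
           cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
      3959 * 2 * asin(sqrt(a))  # mean Earth radius of ~3,959 miles
    }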

First, let’s have a look at how many points away teams win on average when travelling different distances. I’ve broken the distance travelled into bins of 25 miles as the crow flies from the away team’s stadium to the home team’s stadium, then found the average number of points an away team wins when travelling distances in that bin (I excluded the games where the away team travelled over 300 miles as there were only two match ups in that bin – Plymouth vs Hartlepool and Plymouth vs Carlisle).
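
The binning step looks something like this in R, with invented column names (results holds one row per match, with the away team’s distance travelled and points won):

    # 25-mile bins: [0,25), [25,50), ..., [275,300)
    results$bin <- cut(results$distance_miles,
                       breaks = seq(0, 300, by = 25), right = FALSE)

    # average points won by the away team in each bin
    aggregate(away_points ~ bin, data = results, FUN = mean)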

It turns out that it actually seems easier for away teams when they travel further away:

Teams travelling under 25 miles win just under a point on average, while teams travelling over 200 miles win between 1.3 and 1.6 points on average.

This is surprising, but there could be several reasons contributing to this:

  1. Local rivalries. It’s possible that away teams do worse in derby matches than in other matches; this is something to investigate further.
  2. Team bonding. It’s possible that travelling a longer distance together is a shared experience that can help with team bonding.
  3. Southern economic dominance. England is relatively centralised, economically speaking; most of the wealth is in the south. Teams in the South travel further than average to away games, so perhaps the distance advantage actually shows a southern economic advantage; teams in richer areas can buy better players.
  4. Centralisation vs. sparser regions. England is relatively centralised, geographically speaking; most of the population lives in the bits in the middle, and teams in the Midlands travel the least distance on average. Perhaps teams in more central areas (e.g. Walsall, Coventry) have more competition for resources like new talent and crowd attendance, while teams in more remote areas (e.g. Exeter, Newcastle) might have less competition for those resources.

I also used Tableau’s clustering algorithm to separate out teams and their away performances based on distance travelled, and it resulted in four basic away performance phenotypes (which you can explore properly and search for your own team here).

Since I had the stadium details, I had a look at whether the stadium capacity made a difference. This isn’t a sophisticated analysis – better teams tend to be more financially successful and therefore invest in bigger stadiums, so it’s probably just a proxy for how good the home team is overall, rather than capturing how a large home crowd could intimidate an away team.

Finally, this heat map combines the two previous graphs and shows that away teams tend to do better when they travel further to a smaller ground. This potentially shows the centralisation issue discussed earlier; the lack of data in the bottom right corner of the graph shows that there are very few big stadiums in parts of the country like the far North West, North East, and South West, where away teams have to travel a long way to get to.

So, it looks like the further an away team travels, the better they tend to do… although that could reflect more complicated economic and geographic factors.

Alteryx, R

How to be an R soul: an introduction to the R Tool in Alteryx

Alteryx is great for a lot of analysis, and the in-built tools improve with every release. But sometimes you just need to work with the R code directly; maybe you’ve inherited an R document that you need to reproduce, or maybe you need to use a specific package for sentiment analysis, or maybe you’re just far more used to R syntax and want to make sure the model is running exactly as you intend.

This is where the R tool comes in handy.

For this blog, I re-ran a section of one of the experiments I did for my PhD. You can find the data and R analysis script here (better still, download the Rmarkdown html and view in your browser to see the code and the command line output), and you can read the paper here. One section of the analysis compared mixed models using the lme4 package, which I’m not sure how to do in Alteryx. I’m sure there’s a way, but the R tool is perfect for making sure that I reproduce the results exactly.

First, drop the R tool into the workflow:

[Screenshot: the R tool dropped into the workflow]

It’s not enough to just connect the previous tool to the R tool input, though; you have to specifically tell the R tool to load the data in. You can do that with this bit of code at the top of the scripting panel. The R tool takes multiple inputs so you can bring in various different pieces of data; the R tool recognises them as #1, #2, #3, etc. This line says “read input #1 into the R tool as a dataframe and store it as behdata within the R script”:

[Screenshot: the input line at the top of the scripting panel]
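
In text form, that’s the standard read.Alteryx call:

    # read R tool input #1 as a data frame and store it as behdata
    behdata <- read.Alteryx("#1", mode = "data.frame")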

You then need to load the R packages you’ll be using. It’s a bit tricky to install extra R packages in Alteryx if the installer doesn’t match your version, but Alteryx comes with quite a lot of useful R packages pre-installed anyway (see here for Alteryx 10 and here for Alteryx 11). However, even if the packages are already installed, they need to be loaded each time.

[Screenshot: loading the R packages]
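
For this analysis, that just means lme4:

    # packages come pre-installed with Alteryx's R, but still need loading each run
    library(lme4)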

Now, you can continue with the R code… for the most part.

[Screenshot: the rest of the R analysis code]

Once you’ve done your coding, you’ll need to write the results to the R tool output. This code is pretty similar to the input; it reads “write the object modelcomparison to R tool output 1”:

[Screenshot: the output line]
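
Again in text form:

    # write the object modelcomparison to R tool output 1
    write.Alteryx(modelcomparison, 1)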

However, because Alteryx works with dataframes, you can only write dataframes to Alteryx. This means you’ll have to convert matrices into dataframes, and if you’re dealing with lists, you’ll have to coerce them to dataframes before you can do anything with them.

Sadly, the R tool doesn’t have a command line. When I want to look at the properties of the model, in R I’d simply type summary(modelname) and get a nice result in the command line:

[Screenshot: summary(modelname) output in the R command line]

One way of doing this in Alteryx would be to store the summary as an object and then write to one of the outputs. However, a model summary like this is a list in R, which can’t be written to Alteryx without converting it to a dataframe first. If you try it, you’ll see this error:
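
If all you actually need are the fixed-effect estimates, one workaround is to pull out the coefficient table – which is a matrix, not a list – and coerce that instead. A sketch, with model standing in for whatever your lme4 model object is called:

    # the coefficient table inside a model summary is a matrix, not a list
    coefs <- as.data.frame(coef(summary(model)))
    coefs$term <- rownames(coefs)  # keep the term names as a proper column
    write.Alteryx(coefs, 2)        # send it to R tool output 2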

[Screenshot: the error when writing a list to an Alteryx output]

It’s a little more complicated than that, but that’s another blog for another time.


Alteryx

Now you’re making me cross(tab)… getting around character glitches in Alteryx’s crosstab tool.

I was building an Alteryx app for a client this week, and spent an hour or two tripping up over a really straightforward issue. My workflow worked just fine for a small subset of the data that I was testing it on… but when I fed in the rest of the data, I got this error message:

[Screenshot: the unhelpful error message]

This isn’t helpful. My data is perfectly clean, thank you very much. I’m not having that. The workflow was working fine for a subset of the data, so there’s no reason it should have tripped up just because more data was added. Or so I thought… but it turns out that Alteryx’s Crosstab tool has a problem with special characters.

Let’s start from the beginning. I’m building an app with a drop down menu which lets you filter the data to a single value. That looks a little bit like this:

[Screenshot: the app set-up with a drop down tool filtering the data]

You can manually type in the possibilities in the drop down tool, but if there’s a lot of them (which there generally are), it’s a bit of an arse ache, and it’s not dynamic either in case the data changes in future… so the best option is to populate the drop down menu with the field names of a connected tool:

[Screenshot: drop down tool configured to take field names from a connected tool]

Irritatingly, there isn’t an option in the drop down tool configuration to take distinct values from the rows of a particular field of a connected tool. This means that you have to take the field where the interesting stuff is and crosstab it, so that all the values become column headings.

This is pretty straightforward. First, I used a summarize tool and grouped the data by the field which has the values you want in the drop down tool. Then, because you can’t crosstab a single field, I simply grouped by the same field again. That gave me this output:

[Table: the output after grouping by the same field twice]

…and I just crosstabbed it so that I’d get A * B, A + B, and A – B as the field names, and also A * B, A + B, and A – B as the first row of data.

But no:

[Screenshot: the error and warning about multiple fields named “A___B”]

The warning message is more informative than the error message here. What’s going on with the multiple fields named “A___B”?

It turns out that the crosstab tool automatically changes special characters, like *, +, and -, to underscores in field names. In my subset of data at work, I wasn’t working with any values with special characters in them; but when I brought the rest of the data in, there were values that were textually different, like A * B and A + B, which became the same thing once the special characters were replaced with underscores. I’m not sure why it does this; my guess is that it’s something to do with making field names compatible with languages like SQL and R, which are more restrictive about the characters they allow in field names.
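
You can mimic the collision with a one-line substitution in R; my regex is a guess at the tool’s exact rule, but the effect is the same:

    fields <- c("A * B", "A + B", "A - B")

    # replace every non-alphanumeric character with an underscore
    gsub("[^A-Za-z0-9]", "_", fields)
    # all three come back as "A___B"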

I wasted quite a bit of time trying to work out what was going on here, but luckily, there’s a simple solution. Instead of grouping by the field in the summarize tool twice, just group by that field once. Then, add a Record ID tool in, so that you get something like this:

[Table: the grouped values with a Record ID column]

Now, you can crosstab successfully. Put the Record ID field as the new column headers, and the thing you’re actually interested in as the values:

[Screenshot: crosstab with Record ID as the column headers and the field values as values]

The next step is to use a dynamic rename tool to take the column names from the first row of data. Unlike the crosstab tool, the dynamic rename tool doesn’t change special characters when assigning new column names:

[Screenshot: dynamic rename taking column names from the first row of data]

…and there you go. Now you have an app where the populated drop down menu works with special characters!
