How to update a production file in Excel very carefully

First things first – keeping and updating important data in an Excel file is not a good idea. There’s a massive risk of human error, which is why there’s a whole industry of information management tools that do a much better job at it.

But sometimes it’s inevitable, and you’ll end up with a situation where a file called CrucialInformation.csv or something like that is feeding into all kinds of different systems, meaning that it’s now a production file that other things depend on. And maybe you’re the person who’s been given the responsibility of keeping it updated. The data person at your job has said “this file is really important, make sure you don’t accidentally put wrong information in or delete anything out of it because then everything will break, okay, have fun, bye!”, and left you to it.

So, how do you responsibly manage a production csv file without breaking things and while minimising mistakes?

In this example, let’s say we work for a shop that sells food items, and the government has introduced new restrictions around items with the ingredient madeupylose in them. Items with high levels of madeupylose can only be sold to over 18s from behind the counter with the cigarettes, items with medium levels of madeupylose can be sold to over 18s from the normal shelves, and items with low levels of madeupylose can be sold to anybody as normal. This isn’t one of the main ingredients that we track in the proper systems yet, so while the data team are working on getting this set up in the databases, we need to manually track which items have madeupylose in them as a short-term solution.

(just to be explicit, this is a completely made-up ingredient in a completely made-up example! but I do use this approach for making targeted updates to large files)

We’ve got a production file called MadeupyloseItems.csv which stores the data, and it looks like this:

Let’s go through how to make controlled updates to this file and not break everything.

Step 1: don’t even touch it, just make copies

First of all, the safest thing to do is to pretty much never touch the production file. Firstly, because having a file open will lock it, so any background process that’s trying to access the data inside it won’t be able to. Secondly, because creating copies means we can have an archive and muck about with things safe in the knowledge that we won’t mess it up.

Wherever the production file is saved, I generally create a folder called “_archive”:

I do all the actual work on files saved in here, and almost never touch the file itself. Let’s have a look at what’s already in there:

There are two archived, datestamped files in there already – the csv from 12th September 2022, and a copy of the live production file with the datestamp from when it was last saved. It’s important to keep them there just in case I make a mistake and need to restore an old version.

There’s also the Madeupylose updater file, which I use to make my updates.

Step 2: create a totally separate Excel file to manage changes

I use a dedicated Excel file to manage the changes I make to production csv files. In the first tab, I copy/paste all the data from the most recent file – in this case, I’ve copy/pasted it out of the 220914 Madeupylose.csv file in the _archive folder so I don’t have to open the live file:

In the next tab, I make my actual changes. I’ve got some items that I’m adding in and some items that I’m recategorising:

I can use this file to add in some checks – for example, have I spelled “High”, “Medium”, and “Low” correctly? Are they all the same format? There are loads of things you could do here, like sticking a filter on to check individual values, making sure that there are no duplicates, and so on.

Once I’m happy with my updates, I now need to add them to the production file somehow. I could potentially just open the file and add them in… but that would involve opening the file, for a start, and maybe disrupting another process. And it’s not just a case of adding them in – I could simply copy/paste the new items, but I’d have to find the already-categorised items and have to change them, which would be quite easy to make a mistake on (and I really don’t want to hit CTRL+F a load of times).

So, this is where I create a third tab called New Data and use a couple of xlookups.

Step 3: xlookups

If you haven’t come across xlookups before, they’re just like vlookups but simpler and better.

Firstly, I’m going to create a new column in the ExistingData tab, and I’ll use this formula:

=XLOOKUP(ExistingData!A2, Updates!A:A, Updates!A:A)

What this does is take the item number in the existing data tab, look for it in the updates tab, and if it finds it in the updates tab, it’ll return that number. If it doesn’t find it, it gives me an #N/A value:

What this means is that I can distinguish between the items that have already been categorised that I’ve changed and the ones that I haven’t. I know that anything with an #N/A value in the lookup column is an item that I haven’t changed at all. So, I can add a filter to the lookup column, and select only the #N/A values:

I’ll copy/paste columns A and B for these items into the NewData tab. Then, I’ll copy/paste columns A and B for all items in the Updates tab into the NewData tab underneath (and remember to remove the headers!). Now, I’ve got the full information in one place:

What I like about this approach is that it works the same whether you’re updating 20 records or 20,000 records – once you’ve made your updates in the updates tab, the xlookups to get it all transferred over only take seconds, and you can be sure that you’ve definitely got all the things you need in the NewData tab.

Step 4: update the production file

We’re now ready to update the production file. I prefer to take an extra step to create a whole new csv and never actually touch the production file. So, I’m going to take everything in the NewData tab, copy/paste it into a new file, and save that as a csv called “220916 MadeupyloseItems.csv” in the _archive folder:

We’ve got the archived version of the current file – it’s the one from 220914, so there’s a backup in case I’ve messed up the latest update somehow. We’ve also got the archived version of the new file that I want to be the production file. The final step is to move that into production by overwriting the production file. That’s as simple as going back to the 220916 MadeupyloseItems.csv file in Excel, hit Save As, and overwrite the live file:

And there we have it! We’ve updated the live production file without even touching it, we’ve carefully tracked which lines we’re adding in and which lines we’re changing, and we’ve got a full set of backups just in case.


SAP: what’s the difference between the merchandise category hierarchy and the article hierarchy?

I’ve been working with product classification in SAP recently, and something that comes up a lot is the difference between the merchandise category hierarchy and the article hierarchy. Both can be used for showing hierarchies and reporting, so what’s the difference?

The merchandise category is obligatory – you have to have the merchandise category when setting up SAP. It’s also a pain to change, so you need to think really carefully out what you want the merchandise category to be when you’re starting out. You don’t necessarily have to create a merchandise category hierarchy, but you do have to assign merchandise categories. The article hierarchy isn’t obligatory at all, and it’s more flexible to change.

The purpose of having separate hierarchies is that they represent different aspects of a thing. The merchandise category is about what a thing actually is, while the article hierarchy is about what you do with it.

To illustrate this, let’s take a load of different nuts – peanuts, almonds, walnuts, chestnuts, and so on. Under a merchandise category perspective, you’d want to classify these into fundamental properties of the item – for example, you could use the GS1 standards, which separate them into bricks like Nuts/Seeds – Unprepared/Unprocessed (in shell) and Nuts/Seeds – Prepared/Processed (out of shell). That would mean that nuts like loose chestnuts and walnuts in shells would be in the first category, salted peanuts and flaked almonds would be in the second category. These are fundamental to what the product actually is.

Under an article hierarchy perspective, you’d maybe want to set up a separate way of classifying these nuts depending on what you do with them in your store. While salted peanuts and flaked almonds are both Nuts/Seeds – Prepared/Processed (out of shell), they’re quite different in terms of what people actually buy them for – salted peanuts are more of a snacking thing and you’d probably expect to see them in the same aisle as crisps and Bombay mix, while flaked almonds are more of a baking/cooking thing and you’d probably expect to see them in the same aisle as flour and glacé cherries. And maybe you’d want to reflect that in your internal organisation too, having one team of people responsible for snacks and one team responsible for baking/cooking, and splitting nuts across the two. This isn’t fundamental to what the product actually is, as they’re both nuts, but it’s more about what you do with it.

I’ve found that the best analogy for merchandise categories vs. article hierarchy is to step away from shop products and talk about animals.

If merchandise category is about what something fundamentally is, then it’s like the order of species, or taxonomic classification. Dogs and wolves are separate species, but they’re both in the dog family, and if you go up a little higher, that includes foxes in the canidae genus, and if you go up higher again to carnivores, this is where you get to cats as well, and the level above that is all mammals, so that’s where cows come in, and so on. Chickens are birds, which are totally different from mammals at a very high level, so you’ve got to go all the way from species up to the phylum chordata to get to the level where cows and chickens are the same kind of thing. That’s the taxonomic classification of animals, and it’s just how it is. Scientists can provide evidence to move things around a little bit, but it’s pretty much set. This is the merchandise category hierarchy of the animal world.

(wordpress has compressed this a lot – open the full size image in a new tab or use an image reader browser extension like Imagus to see full size)

But now, let’s say you run a pet shop, or a zoo. You know your taxonomic classification perfectly, but you also know that selling dogs and wolves in a pet shop is a terrible idea, because while dogs and wolves are technically very similar, one of them is very much not a pet. Or you’re planning out which animals to keep together in which bits of the zoo, and while cows and chickens are nothing like each other on the taxonomic classification, they’re both farmyard animals (and they won’t eat each other), so it would probably make sense to keep cows and chickens together in the “life on the farm” section of the zoo rather than having cows and lions in the “but they’re both mammals, it’ll all be fine” section of the zoo.

So, this is where you’d create a separate article hierarchy. It’s not about what an animal is, it’s about what you do with that animal, how you organise things, who’s responsible for caring for them, and so on.

This is a lot easier to change and it lets you build different reporting on top of the merchandise categories, but it’s a completely different way of seeing things from the merchandise category hierarchy view of the world. This is why the merchandise category hierarchy and the article hierarchy are different, and why the merchandise category level is obligatory (although putting it into a merchandise category hierarchy is optional), while the article hierarchy is optional.


How to open an open Excel file in Alteryx

Before I start this blog, a disclaimer:

This is disgusting. Don’t do it. If you are running a production workflow off a manually maintained Excel file, you have bigger problems than not being able to open an Excel file in Alteryx because somebody else is using it.

With that out of the way… you know how you can’t open an Excel file in Alteryx if you’ve got that file open? Or worse, if it’s an xlsx on a shared drive, and you don’t even have it open, but somebody else does? Yeah, not fun.

The “solution” is to email everybody to ask them to close the file, or to create a copy and read that in, changing all your inputs to reference the copy, then probably forgetting to delete the copy and now you’re running it off the copy and nobody’s updating that so you’re stuck in a cycle of overwriting the copy whenever you remember, and … this is not really a solution.

The ideal thing to do would be to change your process so that whatever’s in the xlsx is in a database. That’d be nice. But if that’s not an option for you, I’ve created a set of macros that will read an open xlsx in Alteryx.

In the first part of this blog, I’ll show you how to use the macros. In the second, I’ll explain how it works. If you’re not interested in the details, the only thing you need to know is this:

  1. They use run command tools (which means your organisation might be a bit weird about you downloading them, but just show them this blog, nobody who uses comic sans could be an evil man).
  2. Those run commands mimic the process of you going to the file explorer, clicking on a file, hitting copy, hitting paste in the same location, reading the data out of that copy, then deleting that copy. Leave no trace, it’s not just for hiking.
  3. This means that command windows will briefly open up while you’re running the workflow. That’s fine, just don’t click anything or hit any keys while that’s happening or you might accidentally interrupt something (I’ve mashed the keys a bit trying to do this deliberately and the worst I’ve done is stop it copying the file or stop it deleting the file, but that’s still annoying. Anyway, you’ve been warned.).

The macros

There are three macros, which you can find here:

1a: Open an open xlsx – sheet names from full path

1b: Open an open xlsx – sheet names from browse

2: Open an open xlsx – read specific sheet

Using these macros is a two-step process. The first step takes the xlsx and reads in the sheet names. There are two options here; you can browse for the xlsx file directly in the macro using option 1b, or you can use a directory tool and feed in the necessary xlsx file using option 1a (this is a few extra steps, but it’s my preferred approach). You can then filter to the sheet(s) you want to open, and feed that into the second step, which takes the sheet(s) and reads in the data.

Example 1

Let’s work through this simple example. I’ve got a file, imaginatively called 201210.test.xlsx, and I can’t read it because somebody else has it open. In these three steps, I can get the file, select the sheet I need, and get the data out of it:

In the first step, I’m using macro 1b: Open an open xlsx – sheet names from browse. The input is incredibly simple – just hit the folder icon and browse for the xlsx you need:

But this one comes with one caveat – for the same reason that you can’t open an xlsx if it’s open, the file browse doesn’t work if the xlsx is open either. You only need to set this macro up once, though – once you’ve selected the file you need, you’re good, and you can run this workflow whenever. You only need to have the file closed when you first bring this macro onto the canvas and select the file through the file browser.

This returns the sheet names:

All you need to do next is filter to the sheet name you need. I want to open Sheet2, so I filter to that and feed it into the next macro, which is macro 2: Open an open xlsx – read specific sheet. This takes three input fields – the full path of the xlsx, the sheet you want to open, and the line you want to import from:

[FullPath] and [Sheet Names] are the names of the fields returned by the first macro, so you shouldn’t even need to change anything here, just whack it in.

That gives you the data in Sheet2!

Example 2

Let’s do the same thing with a directory tool and my preferred approach – using macro 1a: Open an open xlsx – sheet names from full path:

In the first step, I use the directory tool to give me the full paths of all the xlsx files in a particular folder. This is really powerful, because it means I can choose a file in a dynamic fashion – for example, if the manually maintained process has a different xlsx for each month, you can sort by CreationTime from most recent to least recent, then use the select records tool to select the first record. That’ll always feed in the most recently created xlsx into the first macro. In this simple example, I’m just going to filter to the file I want, then deselect the columns I don’t need, which is everything except FullPath:

I can now feed the full path into the macro, like so:

…and the next steps are identical to example 1.

Example 3

If you need to read in multiple sheets, the get open xlsx sheet data macro will do that and union them together, as long as they’re all in the same format. I’ve got another example xlsx, which is unsurprisingly called 201210 test 2.xlsx. I’ve got four sheets, one for each quarter, where the data structure is identical:

The data structure has to be identical, can’t stress that enough

With this, you don’t have to filter to a specific sheet – you can just bung the get open xlsx sheet data macro straight on the output of the first one, like this:

That’s all you have to do! If this is something you’ve been looking for, please download them from the public gallery, test them out, and let me know if there are any bugs. But seriously, please don’t use this as a production process.

How it works

It’s all about the run commands. I’ll walk through macro 1a: open an open xlsx from a full path, but they’re all basically the same. There are four main steps internally:

  1. Creating the file path and the batch commands
  2. Copying the file
  3. Getting data out of the copy
  4. Deleting the copy

The first regex tool just finds the file extension. I’ll use this later to build the new file path:

The formula tool has three calculations in it; one to create a new full path, one to create the copy command, and one to create the delete command.

The NewFullPath calculation takes my existing full path (i.e. the xlsx that’s open) and adds a timestamp and flag to it. It does this by replacing “.xlsx” with “2021-02-12 17.23.31temptodelete1a.xlsx”. It’s ugly, but it’s a useful way of making pretty sure that you create a unique copy so that you’re not going to just call it “- copy.xlsx” and then delete somebody else’s copy.

The Command calculation uses the xcopy command (many thanks to Ian Baldwin, your friendly neighbourhood run-command-in-Alteryx hero, for helping me figure out why I couldn’t get this working). What this does is create a command to feed into the run command tool that says “take this full path and create a copy of it called [NewFullPath]”.

The DeleteCommand calculation uses the delete command. You feed that into the run command tool, and it simply takes the new full path and tells Windows to delete it.

Now that you’ve got the commands, it’s run command time (and block until done, just to make sure it’s happening in the right order).

To copy the file: put a select tool down to deselect everything except the command field. Now set up your run command tool like this:

You want to treat the command like a .bat file. You can do this by setting the write source output to be a .bat file that runs in temp – I just called it CopyFiles.bat. To do this, you’ll need to hit output, then type it in in the configuration window that pops up, and keep the other settings as above too. If you’re just curious how this macro works, you don’t need to change anything here, it’ll run just fine (let me know if it doesn’t).

To get the list of sheet names, the first step is to create a new column with the full path, a pipe separator, then “<List of Sheet Names>” to feed into the dynamic input:

From there, you can use the existing xlsx as the data source template, and use NewFullPathSheetNames from the previous formula as the data source list field, and change the entire file path:

That’ll return the sheet names only:

So the last step is an append tool to bring back the original FullPath. That’s all you need as the output of this step to go into the sheet read macro (and the only real difference in that macro is that the formula tool and dynamic input tool is set up to read the data out of the sheets rather than get a list of sheet names).

Finally, to delete the copy, deselect everything except the DeleteCommand field, and set it up in the same way as the copy run command tool earlier:

And that’s about it. I hope this explanation is full enough for you to muck about with run command tools with a little confidence. I really like xcopy functions as a way of mass-shifting things around, it’s a powerful way of doing things. But like actual power tools, it can be dangerous – be careful when deleting things, because, well, that can go extremely badly for you.

And finally, none of this is good. This is like the data equivalent of me having a leaky roof, putting a bucket on the floor, and then coming up with an automatic way of switching out the full bucket for an empty one before it overflows. What I should actually do is fix the roof. If you’re using these macros, I highly recommend changing your underlying data management practices/processes instead!


Hacking redacted documents with statistics: the Ghislaine Maxwell deposition

The original Slate article introduces this better than I can, because they’re actual journalists and I’m just a data botherer:

On Thursday morning, a federal court released a 2016 deposition given by Ghislaine Maxwell, the 58-year-old British woman charged by the federal government with enticing underage girls to have sex with Jeffrey Epstein. That deposition, which Maxwell has fought to withhold, was given as part of a defamation suit brought by Virginia Roberts Giuffre, who alleges that she was lured to become Epstein’s sex slave. That defamation suit was settled in 2017. Epstein died by suicide in 2019.

In the deposition, Maxwell was pressed to answer questions about the many famous men in Epstein’s orbit, among them Bill Clinton, Alan Dershowitz, and Prince Andrew. In the document that was released on Thursday, those names and others appear under black bars. According to the Miami Herald, which sued for this and other documents to be released, the deposition was released only after “days of wrangling over redactions.”

Slate: “We Cracked the Redactions in the Ghislaine Maxwell Deposition”

It’s some grim shit. I haven’t been following the story that closely, and I don’t particularly want to read all 400-odd pages of testimony.

But this bit caught my eye:

It turns out, though, that those redactions are possible to crack. That’s because the deposition—which you can read in full here—includes a complete alphabetized index of the redacted and unredacted words that appear in the document.

This is … not exactly redacted. It looks pretty redacted in the text itself:

Above: Page 231 of the depositions with black bars redacting some names.

But the index helpfully lists out all the redacted words. With the original word lengths intact. In alphabetical order. You don’t need any sophisticated statistical methods to see that many of the redacted black bars on page 231 concern a short word which begins with the letter A, and is followed by either the letter I or the letter L:

Above: The index of the deposition, helpfully listing out all the redacted words alphabetically anbd referencing them to individual lines on individual pages.

It also doesn’t take much effort to scroll through the rest of the index, and notice that another short word beginning with the letters GO occurs in exactly the same place:

Above: I wish all metadata was this good.

And once you’ve put those two things together, it’s not a huge leap to figure out that this is probably about Al Gore.

When I first read the Slate article on Friday 23rd October 2020, around 8am UK time, Slate had already listed out a few names they’d figured out by manually going through the index and piecing together coöccurring words. And that reminded me of market basket analysis, one of my favourite statistical processes. I love it because you can figure out where things occur together at scale, and I love it because conceptually it’s not even that hard, it’s just fractions.

Market basket analysis is normally used for retail data to look at what people buy together, like burgers and burger buns, and what people don’t tend to buy together, like veggie sausages and bacon. But since it’s basically just looking at what happens together, you can apply it to all sorts of use cases outside supermarkets.

In this case, our shopping basket is the individual page of the deposition, and our items are the redacted words. If two individual redacted words occur together on the same page(s) more than you’d expect by chance, then those two words are probably a first name and a surname. And if we can figure out the first letter(s) of the words from their positions in the index, we’ve got the initials of our redacted people.

For example, let’s take the two words which are probably Al Gore, A1 and GO1. If the redacted word A1 appears on 2% of pages, and if the redacted word GO1 appears on 3% of pages, then if there’s no relationship between the two words, you’d expect A1 and GO1 to appear together on 3% of 2% of pages, i.e. on 0.06% of pages. But if those two words appear together on 1% of pages, that’s ~16x more often than you’d expect by chance, suggesting that there’s a relationship between the words there.

So, I opened up the Maxwell deposition pdf, which you can find here, and spent a happy Friday evening going through scanned pdf pages (which are the worst, even worse than the .xls UK COVID-19 test and trace debacle, please never store your data like this, thank you) and turning it into something usable. Like basically every data project I’ve ever worked on, about 90% of my time was spent getting the data into a form that I could actually use.

oh god why

Working through it…

How data should look.

And now we’re ready for some stats. I used Alteryx to look at possible one-to-one association rules between redacted words. Since I don’t know the actual order of the redacted words, there are two possible orders for any two words: Word1 Word2 and Word2 Word1. For example, the name that’s almost definitely Al Gore is represented by the two words “A1” and “GO1” in my data. If there’s a high lift between those two words, that tells me it’s likely to be a name, but I’m not sure if that name is “A1 GO1” or “GO1 A1”.

After running the market basket analysis and sorting by lift, I get these results. Luckily, Slate have already identified a load of names, so I’m reasonably confident that this approach works:

A list of initials and names, in Excel this time to make it more human-readable.

Like I said, I haven’t been following this story that closely, and I’m not close enough to be able to take a guess at the names. But I’m definitely intrigued by the top one – Slate haven’t cracked it as of the time of writing. The numbers suggest that there’s somebody called either Je___ Ja___ or Ja___ Je___ who’s being talked about here:

Je___ Ja___ or Ja___ Je___?

I don’t particularly want to speculate on who’s involved in this. It’s a nasty business and I’d prefer to stay out of it. But there are a few things that this document illustrates perfectly:

  1. It’s not really redacted if you’ve still got the indexing, come on, seriously
  2. Even fairly simple statistical procedures can be really useful
  3. Different fields should look at the statistical approaches used in other fields more often – it really frustrates me that I almost never see any applications of market basket analysis outside retail data
  4. Please never store any data in a pdf if you want people to be able to use that data

Synthesised size/sound sound symbolism: the MS Paint version

Earlier today, I presented my CogSci paper Synthesised size/sound sound symbolism. The six page paper is short and to the point, but hey, it technically counts as a publication, so I figure it’d be remiss of me not to render it badly in MS Paint.

We start with the Japanese ideophone learning study. Participants learned some Japanese ideophones, half with their real Dutch translations, half with their opposite translations, and it turned out that they were way better at remembering the ideophones with their real translations. Importantly, this didn’t happen when we did exactly the same thing with regular, arbitrary adjectives. You can read that paper here, and download some of the experiment materials for it here if you want to try it yourself. You can also read the replication study we did with EEG here, and download the data and analysis scripts for that here.

The results looked a bit like this:


But while we can see that there is an obvious effect, we don’t know how it works. Is it a learning boost that people get from the cross-modal correspondences in the real condition?

match boost lerning

Or is it a learning hindrance that people get from the cross-modal clashes in the opposite condition?

mismatch hindrance learning

(or indeed, is it both?)

Without adding a neutral condition where the words neither obviously match nor mismatch their meanings, we don’t really know.

So, we created some synthesised size/sound sound-symbolic pseudowords, which is easier said than done. It’s well known that people associate voiced consonants and low, back vowels with large size, and voiceless consonants and front, high vowels with small size. This is probably because of the mouth shape you make when saying those sounds:

a and i

We created big-sounding words (like badobado), small-sounding words (like kitikiti), and neutral-sounding words which were halfway in between (like depedepe).

A neutral condition could tell us if it’s a graded effect…

graded effect.png

a match boost effect…

match boost.png

or a mismatch hindrance effect:

mismatch hindrance.png

Turns out it’s a match boost effect. Participants learned the match pseudowords (e.g. badobado and big, kitikiti and small) better than the neutral pseudowords (e.g. kedekede and big, depedepe and small), but there wasn’t a difference in how well they learned the neutral and mismatch pseudowords (e.g. godagoda and small, tikutiku and big).

results n=30

For good measure, we did it again with double the original sample size, because it’s nice to check things.

first experiment and replication blog picture

…and, yes, we found exactly the same thing:

results n=60

So, it looks like it’s a special match boost effect from the cross-modal correspondences, not a graded effect reflecting all cross-modal information.

This is a nice paradigm which can be easily altered to try with different languages or different stimuli, e.g. using big and small shapes rather than the words “big” and “small” in order to rule out letter confounds. I’ll put up all the Presentation scripts, synthesised stimuli, and analysis scripts once I’ve tidied them up a lot (if you think my MS Paint scribbles are messy, you should see my code). It should be pretty straightforward for anybody to download and redo this, so it’d make a good project for a Bachelors/Masters intern. I’d love to see this get taken further and tried out with useful variations and changes. But good luck coming up with a more satisfying title!


Rolling form and how Arsenal lost the Premier League title

Is form taken seriously in the media discussion of football? Every game preview acknowledges the results from a team’s last five games, often with green, grey, and red blobs so you can see how much they’ve won, drawn, and lost recently… but it’s rarely quantified beyond that. This is kind of understandable, since you get three points for a win regardless of what form you or your opposition have been in, but I feel like a bit more of a deep dive into form data is needed.

I’ve been scraping a lot of game data from last season’s Premier League with the ultimate goal of making an adjusted table for rolling form. Which teams beat form teams most often? Which teams benefited most from playing teams at a low ebb more often than other teams? Is rolling form a better predictor of a match winner than a team’s position?

This has thrown up an absolute glut of data that can (and will) fill several blog posts. For now, though, I’d like to focus on Arsenal’s woes last season.

One of the main media narratives around Arsenal is that they’re flat track bullies; they swat aside mediocre teams with ease, looking fluent and impressive while doing so, but they come unstuck against the top teams. These criticisms extend to the individual players, with Olivier Giroud and Per Mertesacker often held up as examples of almost-but-not-quite players; good enough to beat most teams, but not in the elite level that would take Arsenal past the Chelseas, the Manchester Uniteds, the Manchester Citys (despite finishing above all of them last season, but whatever).

Thing is, that’s completely untrue.

Here’s a graph of Arsenal’s wins, losses, and draws last season. The dots and lines represent each game, and are positioned on the y-axis according to Arsenal’s and their opposition’s form over the last five games (in mean points per game) just before each game. This excludes the first five matches of the season, which are always a bit dicey and rarely reflect the final standings.

Arsenal form plot.png

What strikes me most clearly is that Arsenal lost all their games to teams in average to poor form. Indeed, three losses were against teams who’d been averaging less than a point per game, which, if applied to a whole season, would set them out as clear relegation candidates. Moreover, around half these losses have come when Arsenal have been in a good run of form themselves, averaging two or more points per game (which, if applied to a whole season, would have seen them score 76 points, five more than they actually managed). Meanwhile, a lot of Arsenal’s wins actually came against teams who were doing well, averaging 1.5 points per game or more. This suggests that Arsenal were winning the difficult games and losing the easy games.

This isn’t much use on its own, so let’s look at Leicester and Tottenham for comparison.

Not only did Leicester lose less often – obviously – their losses never came against teams averaging less than one point per game. The same goes for Tottenham, who lost their games to teams in a fairly decent run of form.

We can also look at how Arsenal did according to their league position and the league position of their opposition just before the game.

Arsenal position plot.png

This plot is also pretty illuminating. The bulk of Arsenal’s losses came against teams in the bottom half of the table, while a lot of their record against teams in the top five when they played them was actually pretty good, winning four, drawing three, and losing only one. There’s also a nice split between their wins and draws which shows that Arsenal generally beat the lower table teams and generally drew with the high mid-table teams when they played them.

Again, let’s compare that with Leicester and Tottenham.

Leicester only lost to teams in the top half of the table. Tottenham lost to teams in the relegation zone twice, but the rolling form plot showed that these relegation zone teams had been doing pretty well at the time.

To get a look at the whole league, we can plot the mean rolling form and the mean league position for the opposition for each team in losses. That’s a confusing sentence; another way of phrasing it is saying that this is looking at the average form and average league position of the teams that each team lost to last season.

form plot for losses plus lm.png

position plot for losses plus lm.png

Both these graphs show a slight relationship between how well a team did overall and the nature of their losses – the better performing teams tended to lose to better opposition, i.e., by losing to teams with higher points per game in the last five games and teams who were higher in the table at the time.

All except Arsenal. In fact, Arsenal were the worst team least season in terms of losing to teams in poor form. The teams that Arsenal lost to were in the worst average form and in the lowest average position compared with losses by any other team in the league.

In short, this contradicts the main narrative of Arsenal not being good enough against the top teams. Rather, Arsenal aren’t good enough against the bad teams, and lost out on the league by losing to teams in poor form in the relegation zone. Next time Arsenal play a poorly-performing bottom-table team, maybe a Hull or a Middlesborough in a bit of a rut, I’ll stick a tenner on Arsenal to lose; they’re only a little less likely to lose than they are to win.

wenger water bottle.gif


ERP graph competition!

A while back, I blogged about creating better ERP graphs. The data I used to generate those graphs has now been published, and all the data and analysis scripts from that paper are available to download here.

One of the things I find most fun about research is playing about with plotting graphs. There are so many different ways to visualise ERP data that it’s hard to pick one. I played around with a few different styles for my Collabra paper, and settled for plotting the two conditions with 95% confidence intervals. I was also tempted to plot the individual ERPs, as shown in the second plot, but felt the CIs were cleaner and more useful.

However, I also liked Guillaume’s approach of plotting the grand average difference wave and the difference waves for individual participants. I’ve done it with and without 95% CIs in plots 3 and 4. The difference in those two is that the green/orange distinction shows significance; the 320-784ms window was significant, the rest of the epoch wasn’t.

But I’d love to see how these can be improved! All my graphs and all the code needed are below. You can download eegdata.txt from my OSF page, but be careful not to click on the file name itself, otherwise your browser will probably freeze:

osf download instructions

After that, load it into R, and have at it. The only extra packages I use to make the graphs in this blog are ggplot2, ggthemes, tidyr, and dplyr.

If you think you can do better, email me your graphs and code to gwilym(dot)lockwood(at)mpi(dot)nl. I’ll post another blog in a few weeks with my favourite contributions… and I’ll buy the winner a beer 🙂


Parietal electrodes, title, all trials, lines and 95pc, capitals in legend (29-3-16 submission)

eegdata$condition <- gsub("real", "Real", eegdata$condition)
eegdata$condition <- gsub("opposite", "Opposite", eegdata$condition)
# create dataframe by measuring across participants
dfsmallarea <- filter(eegdata, smallarea == "parietal")
dfsmallarea <- aggregate(measurement ~ smallarea*time*condition, dfsmallarea,mean)
# work it out for all trials
parietaldf <- filter(eegdata, smallarea == "parietal")
parietaldf <- aggregate(measurement ~ time*participant*condition*electrode, parietaldf,mean)
std <- function(x)sd(x)/sqrt(length(x))
# i.e. std is a function to give you the standard error of the mean
# this is the standard deviation of a sample divided by the square root of the sample size
SD <- rep(NA,length(dfsmallarea$time))       # creates empty vector for standard deviation at each time point, which will be huge
SE <- rep(NA,length(dfsmallarea$time))       # creates empty vector for standard error at each time point
CIupper <- rep(NA,length(dfsmallarea$time))  # creates empty vector for upper 95% confidence limit at each time point
CIlower <- rep(NA,length(dfsmallarea$time))  # creates empty vector for lower 95% confidence limit at each time point
for (i in 1:length(dfsmallarea$time)){
  something <- subset(parietaldf,time==dfsmallarea$time[i] & condition==dfsmallarea$condition[i], select=measurement)
  SD[i] = sd(something$measurement)
  SE[i] = std(something$measurement)
  CIupper[i] = dfsmallarea$measurement[i] + (SE[i] * 1.96)
  CIlower[i] = dfsmallarea$measurement[i] - (SE[i] * 1.96)
dfsmallarea$CIL <- CIlower
dfsmallarea$CIU <- CIupper
# Now let's plot things, starting with all trials

colours <- c("#D55E00", "#009E73")  # colourblind friendly - red/orange for opposite, green for real
dfsmallarea$time <- as.integer(as.character(dfsmallarea$time))plot <- ggplot(dfsmallarea, aes(x=time, y=measurement, color=condition)) + 
  geom_line(size=1, alpha = 1)+
  scale_linetype_manual(values=c(1,1) )+  #, guide=FALSE)+
  scale_y_continuous(limits=c(-7, 7), breaks=seq(-7,7,by=1))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - all trials")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_bw() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+ theme(plot.title=  element_text( face="bold"), axis.text = element_text(size=8)) +
  geom_smooth(aes(ymin = CIL, ymax = CIU, fill=condition), stat="identity", alpha = 0.3) + #95% CIs
  scale_fill_manual(values=colours)+ #, guide=FALSE) +
  scale_colour_manual(values=colours) #, guide=FALSE)

Created by Pretty R at inside-R.org


Parietal electrodes, all participants at once, all data (size 0.3, alpha 0.3) (lighter)

temp <- filter(eegdata, smallarea == "parietal")
temp <- select(temp, smallarea,time, measurement, condition, participant)
temp <- aggregate(measurement ~ smallarea*time*condition*participant, temp,mean)
temp2 <- aggregate(measurement ~ smallarea*time*condition, temp,mean)
temp2$participant <- "Grand Average"
temp2$GA <- "Grand Average"
temp$GA <- "individuals"
temp2 <- select(temp2, smallarea,time, measurement, condition, participant, GA)
temp3 <- rbind(temp, temp2)
temp3$conbypar = paste(temp3$condition, temp3$participant, sep="")
temp3$time <- as.integer(as.character(temp3$time))plot <- ggplot(temp3, aes(x=time, y=measurement,group=conbypar)) + 
  geom_line(size=0.3, alpha=0.3, aes(colour=condition))+
  geom_line(data = subset(temp3, GA == "Grand Average"), size=1.5, alpha=1, aes(colour=condition)) +
  scale_colour_manual(values=colours, guide=FALSE)+
  scale_y_continuous(limits=c(-15, 15), breaks=seq(-15,15,by=3))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - individual participants for all trials")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_few() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+   theme(plot.title=  element_text( face="bold"),
              axis.text = element_text(size=8))

Created by Pretty R at inside-R.org


Parietal electrodes, diffwave and individual diffwaves

temp <- filter(eegdata, smallarea == "parietal")
temp <- select(temp, smallarea,time, measurement, condition, participant)
temp <- aggregate(measurement ~ smallarea*time*condition*participant, temp,mean)
temp2 <- aggregate(measurement ~ smallarea*time*condition, temp,mean)
temp2$participant <- "Grand Average"
temp2$GA <- "Grand Average"
temp$GA <- "individuals"
temp2 <- select(temp2, smallarea,time, measurement, condition, participant, GA)
temp3 <- rbind(temp, temp2)
temp3$conbypar = paste(temp3$condition, temp3$participant, sep="")
temp4 <- spread(temp2, condition, measurement) # unmelt to create diffwave calculation
temp5 <- spread(temp, condition, measurement) 
temp4$diffwave <- temp4$real - temp4$opposite
temp5$diffwave <- temp5$real - temp5$opposite
temp6 <- rbind(temp4, temp5)
temp6$conbypar = paste(temp6$condition, temp6$participant, sep="")
temp6$sig <- ifelse(temp6$time %in% c(320:784), "yes", "no")
temp6$time <- as.integer(as.character(temp6$time))
sigcolours <- c("#D55E00", "#D55E00", "#009E73")plot <- ggplot(temp6, aes(x=time, y=diffwave,group=conbypar)) + 
  geom_line(size=0.3, alpha=0.3)+
  geom_line(data = subset(temp6, GA == "Grand Average"), size=2, alpha=1, aes(colour=sig)) +
  scale_colour_manual(values=sigcolours, guide=FALSE)+
  scale_y_continuous(limits=c(-16, 16), breaks=seq(-16,16,by=2))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - difference wave and individual difference waves")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_few() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+   theme(plot.title=  element_text( face="bold"),
              axis.text = element_text(size=8))

Created by Pretty R at inside-R.org


Parietal electrodes, diffwave and individual diffwaves plus CIs

for (i in 1:length(temp6$time)){
  something <- subset(temp6,time==temp6$time[i], select=diffwave)
  SD[i] = sd(something$diffwave)
  SE[i] = std(something$diffwave)
  SDupper[i] = temp6$diffwave[i] + SD[i]
  SDlower[i] = temp6$diffwave[i] - SD[i]
  SEupper[i] = temp6$diffwave[i] + SE[i]
  SElower[i] = temp6$diffwave[i] - SE[i]
  CIupper[i] = temp6$diffwave[i] + (SE[i] * 1.96)
  CIlower[i] = temp6$diffwave[i] - (SE[i] * 1.96)
temp6$CIL <- CIlower
temp6$CIU <- CIupper
temp6$sig <- ifelse(temp6$time %in% c(320:784), "yes",ifelse(temp6$time %in% c(-200:318), "no1",
temp6$time <- as.integer(as.character(temp6$time))
sigcolours <- c("#D55E00", "#D55E00", "#009E73")plot <- ggplot(temp6, aes(x=time, y=diffwave,group=conbypar)) + 
  geom_line(size=0.3, alpha=0.3)+
  geom_line(data = subset(temp6, GA == "Grand Average"), size=2, alpha=1, aes(colour=sig)) +
  scale_colour_manual(values=sigcolours, guide=FALSE)+
  scale_y_continuous(limits=c(-16, 16), breaks=seq(-16,16,by=2))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - difference wave and individual difference waves")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_few() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+   theme(plot.title=  element_text( face="bold"),
              axis.text = element_text(size=8))plot+ theme(plot.title=  element_text( face="bold"), axis.text = element_text(size=8)) +
  geom_smooth(data = subset(temp6, GA == "Grand Average"), linetype=0, aes(ymin = CIL, ymax = CIU,group=sig, fill=sig), stat="identity", alpha = 0.5) + #95% CIs
  scale_fill_manual(values=sigcolours, guide=FALSE)

Created by Pretty R at inside-R.org


How iconicity helps people learn new words: the MS Paint version

I have a new paper out!

Gwilym Lockwood, Peter Hagoort, and Mark Dingemanse. “How Iconicity Helps People Learn New Words: Neural Correlates and Individual Differences in Sound-Symbolic Bootstrapping.” Collabra 2, no. 1 (July 6, 2016). doi:10.1525/collabra.42.

The paper can be read and downloaded right here:

…and because we’re doing the whole open thing properly, you can also download the original stimuli, data, and analysis scripts here. You probably have no intention of sifting through it all, but the point is that you can:

The experiment was pretty similar to my Sound symbolism boosts novel word learning paper from a few months ago, except that this time it was only with the ideophones, and I used EEG to measure participants’ brain activity while they learned them. People learned 19 ideophones with their real Dutch translations and 19 ideophones with their opposite Dutch translations. After I told them about that and said sorry for the deception, they heard all the ideophones again and had to guess what the real translation was from a choice of two antonyms.

The first important thing is that the results were almost identical. People got the answers right 86.7% of the time for the ideophones in the real condition, and 71.3% of the time for the ideophones in the opposite condition. When they had to guess, they got  it right 73% of the time. These figures replicate the first study very closely (86.7% to 86.1%, 71.3% to 71.1%, and 73% to 72.3%), which is excellent news.

accuracy for JEPLMC and Collabra experiments - jitterdots (0-100) 9by7 for APS presentation

All kinds of things can happen in scientific studies, so replicating a study is really important for showing that the effect is real and not just coincidental. Sadly, replications aren’t considered to be very glamorous, so a shocking amount of published science is either unreplicated or unreplicable:

first experiment and replication blog picture

The second part of this new paper is that I also measured people’s brain activity using EEG. Once you average all the trials together, you get a signal of changing activity over time in response to a thing, which is known as an event-related potential. It looks a bit like this, and isn’t that useful by itself:

erp picture single line

Instead, you have to compare two conditions. If they differ at a certain point, that’s what tells you about how the brain processes things:

erp picture two lines

In this experiment, I found a big P3 effect in the test round:

figure 8 - Parietal electrodes, title, all trials, lines and 95pc, capitals in legend (29-3-16 submission)

The P3 is linked to memory and learning, so it’s not surprising that it came up in a task involving memory and learning. A lower P3 is linked to things being more difficult to learn, so again, it’s not surprising that the ideophones in the opposite condition have a lower P3 when they were harder to learn.

But, it wasn’t that simple. If it was a straight up learning effect, you would expect a correlation between how well people did in a condition and that condition’s P3 amplitude; people with lower test round scores in the opposite condition should have a lower P3 amplitude in the opposite condition. But they don’t.

However, there was a correlation between how sensitive participants were to sound symbolism (as measured by their meaning guessing accuracy in the task after the test) and how big the ERP difference between conditions was. When I split the participants into two groups (people scoring above and below the 73% average in the guessing task) and plotted their ERPs separately, it turns out that the P3 effect is big for the people who are more sensitive to sound symbolism and barely there at all for the people who are less sensitive to sound symbolism:

het talige brein picture

You can also see that the P3 amplitude in the real condition was the same across both groups. What’s behind the difference is how the amplitude in the opposite condition changes. This suggests that most people can recognise cross-modal correspondences and exploit them in word learning, but that some people are more sensitive to sound symbolism and get put off by cross-modal clashes as well.

I reckon that the variation in how sensitive people are to sound symbolism goes a little like this:

model of ss and cross modal perception

…and we do have some preliminary data from a massive cross-modal perception and synaesthesia study that shows that synaesthetes are better at the Japanese ideophone guessing game than regular people, but that’s another blog for another time.


Portugal: the worst Euro finalists ever

I haven’t been impressed with Portugal this tournament (and not just because I’m a bitter Welshman). They haven’t been very good; never really impressing in any of their matches, winning without dominating in the knock outs and held to draws by Iceland and Hungary in the group stages.

I’m pretty sure they’re the worst European Championship finalists I’ve seen, and perhaps the worst ever. But how can you measure how underwhelming a team is?

Step forward Elo ratings. Far better than the joke that is the FIFA rankings, Elo ratings adjust after every single international match based on the teams’ previous Elo ratings. For example, before Iceland and England faced off last week, Iceland had an Elo rating of 1688 and England had an Elo rating of 1929. After Iceland beat England, they exchanged 40 points – Iceland’s Elo rating went up 40 points to 1728, and England’s Elo rating went down 40 points to 1889. This is calculated using the status of the match, the number of goals scored, the result, and the expected result (more on that here, and have a browse of some more examples here).

I’ve taken every finalist (not including the winner of France vs. Germany, which kicks off about an hour from the time of writing) of the Euros since 1984, which was the first tournament where there were knock-out matches (before that, there was a group stage round-robin and the top two teams played off in a final). I’ve calculated each finalist’s mean Elo score throughout the tournament – not including the final itself – as well as calculating the mean Elo score of all the teams they played along the way.

That’s plotted right here:

Elo graph annotated

The dot size and colour shows the difference in each finalist’s Elo score from the start of the tournament until after the semifinal;the larger and lighter the dot, the more the team improved on their way to the final.

France in 2000 are probably the best team to reach a final – not only were they an excellent team at the time, they also beat a lot of strong opposition to get there. Spain in 2012 were the strongest team, but the opposition they faced wasn’t as tough. Greece in 2004 were the weakest team, but they beat some really strong teams along the way, which makes their achievement really impressive.

Portugal this year are dropping off the graph on the bottom left. They aren’t a high quality side – barely better than Greece in 2004 or Denmark in 1992 – but they haven’t been beating any impressive opposition either. They’ve not been that good, and they’ve had the weakest path to the final of any Euros finalist.

So, yes, it looks like my hunch was right – Portugal are the most underwhelming team to reach the final of a European Championship ever since there’s been a proper knock-out round. Sorry, Ronnie.

sad Ronaldo


Visualising competition in the Premier League and La Liga

Apologies, it’s taken me far too long to get round to writing this.

A little while ago, I set up a quiz to see whether people could accurately guess whether a spatially-plotted league table was from the Premier League or La Liga. The idea was to see if the (British) stereotype of the Premier League being competitive and La Liga being Barcelona and Real Madrid plus eighteen others was born out by the league tables. I did this because I’m pretty bored of reading comments like this:

“Right, yeah, English teams are bobbins in the Champions League because the Premier League might not have the highest quality, but it’s the best league in the world because of the competition and how close all the teams are. You never know who’s going to win it! It’s not like in Spain, where it’s always going to be Barcelona or Real Madrid winning it by a mile, and all the other teams don’t even matter.”

According to this stereotype, a spatial dotplot of the Premier League would look like this:

premier league stereotype

…while a spatial dotplot of La Liga would look like this:

la liga stereotype

So, I did what any reasonable person would do; I scraped a load of data, visualised the league tables as dotplots in random order, and created an online quiz to see if people could guess which league was which. If the differences between the two leagues are obvious, it should be easy enough to tell which league is which from a dotplot. The rest of this blog is basically a love letter to ggplot2.

People saw graphs like this, and simply had to say whether it was La Liga – Premier League or Premier League – La Liga:

League positions no lines no leagues 2008 classic

In this case, La Liga is on the left, and the Premier League is on the right. It’s the 2008-09 season, when Barcelona finished nine points ahead of Real Madrid and Manchester United finished four points ahead of Liverpool. The position of the league on the x-axis was shuffled, so that each league was on the left side and the right side five times each, and the questions were presented to people in a random order.

172 people did the quiz, and scored an average of 62%.

The most well-answered question was the 2011-12 season, where 74% of people correctly answered that La Liga is on the left and the Premier League is on the right:

League positions no lines leagues 2011 classic

…and the least well-answered question was the 2006-07 season, where only 35% of people correctly answered that La Liga is on the left and the Premier League is on the right:

League positions no lines leagues 2006 classic

This wasn’t a perfect psychology experiment. People doing the quiz already knew what it was about and were aware that I was looking into the stereotype, but it is a useful demonstration that it’s not easy to guess which league is which. Perhaps the stereotype isn’t entirely accurate. Indeed, if we take the ten seasons in the quiz plus the season just finished and do some descriptive statistics, the mean number of points in the Premier League is 52.1 and the mean number of points in La Liga is 52.4. while the standard deviation of points is 17.0 in the Premier League and 16.3 in La Liga. This would suggest that there’s no real difference in how widely the points are spread across the teams.

However, there might be a general trend when we look at all of them put together. If we take the ten seasons in the quiz, plus the season just finished, and plot the number of points per position per season, we get this graph:

League positions no lines leagues last 11 years classic

This actually does suggest that there may be something in the stereotype of the top teams in La Liga pulling away from the rest of the league. The dots at the top of the league are stretched higher than they are in the Premier League.

Similarly, we can look at the mean number of points per position in the last 11 seasons:

Mean league points per position no lines leagues last 11 years classic

…and this also seems to show that the top teams in La Liga are spread out a bit further than the top teams in the Premier League.

Sure enough, when we look at the stats for the top four teams, it shows a bit more spread; the mean number of points for the top four teams is very similar – 78.4 in the Premier League, 78.1 in La Liga – but the standard deviation is 8.2 in the Premier League and 12.1 in La Liga.

One last thing we can look at plotting is goal difference. Here’s the goal difference per position for each of the last 11 seasons (a goal difference of 0 is shown by the horizontal black line):

GD no lines leagues last 11 years classic

Again, the top teams in Spain seem to be more spread out compared to the rest of the league than the top teams in England are. In fact, when we average it together for the last 11 seasons, we get this:

Mean gd per position no lines leagues last 11 years classic

…which shows that, on average, the second-placed team in La Liga generally has a better goal difference than the winner of the Premier League (57.6 to 51.4), but that the third-placed team in La Liga generally has the same goal difference as the fourth-placed team in the Premier League (30.8 and 30.2 respectively).

How about combining these graphs, with points up the y-axis and goal difference denoted by dot size?

exaggerated mean pts and gd per position no lines leagues last 11 years classic

It seems that, on average, there may be some truth in the stereotype. The top two teams in La Liga dominate their competitors, both in terms of points and goal difference, whereas there seems to be less separating the top teams in the Premier League.

…and just for bonus points, let’s do the same thing for the top five European leagues over the last 11 seasons. This time it’s done by points per game rather than points, as the Bundesliga only has 18 teams:

exaggerated mean pts per game and gd per position no lines all five leagues last 11 years classic

It looks like the Premiership does have the most competitive title race after all.