Uncategorized

How clock changes work in the UK and what it means for your data processes

Twice a year, clock changes in the UK cause mild chaos. The extent of this chaos depends on your computer settings, your remote desktop/server settings, and how exactly you’ve set stuff up to figure out what the time is.

We normally describe it by saying that the clocks go forward at the end of March, and the clocks go back at the end of October. I don’t find this particularly intuitive; it doesn’t actually address what we’re doing or why, it just describes what you need to do with your watch.

I think it’s easier to think of it not as the clocks changing, but as the whole country shifting time zones, because that’s what we’re actually doing.

What’s the time?

Firstly, though… what’s the time?

As I write this, sitting in the office in Leeds, the UK, it’s about 11am on 21st May 2025. Thing is, yes, that’s the time, but it’s just my local definition of the time. To help with understanding time zones, we need to know The Time.

This is UTC, or Coordinated Universal Time. UTC isn’t just the time. UTC is The Time, and it’s the same wherever you are. Try googling it:

We could simply sack off the concept of time zones and run everything off UTC. The world of data would be a lot easier if we did and it’d save me a load of hassle at work.

But, this is a rubbish idea. Culturally, we’re used to the time being something meaningful about when it’s light and when it’s dark. The whole point of “AM” is that it means “before the middle of the day where the sun as at its highest point” and “PM” means “after that”.

If we ran everything off UTC, then “midday” in the UK would be grand, but “midday” in Japan would be a few hours after the sun has set. Also, “midnight”, when the date changes, would be around when people are starting work or school. You’d get on the school bus on Tuesday at 2315, get to school, and start Maths at “midnight” on Wednesday. Not a great way to run things. This very much not-to-scale diagram shows what that’s like during the northern hemisphere summer:

What are time zones?

So, each country (or each region of big wide countries) chooses what they want the time to be in that country, based on what makes sense for daylight in that country. Luckily, everybody around the world is pretty much in agreement that you generally want “midday” i.e. 1200 to be around the time that the sun is at its highest point and you want the date to change from Tuesday to Wednesday overnight.

You do this by setting the time where you are relative to The Time, i.e. relative to UTC.

It’s easier to explain time zones using places far away from the UK because it’s more obvious why you’d do it. If Japan used UTC time for everything, then, as I’m writing right now, 11am UTC time on 21st May 2025, the time in Japan would be 11am on 21st May 2025 but everything would be dark because the sun has already gone down.

In Japan’s case, they’ve looked at the clocks and the daylight and thought “we want the time here to be 9 hours ahead of the UTC time, so the time in Japan is UTC+9, which means that when it’s midday UTC, it’s 9pm here, which makes much more sense than just using UTC time like in Gwilym’s diagram above”. This time zone is called Japan Standard Time, or JST.

South Korea have done the same thing (Korea Standard Time, KST), as have North Korea (Pyongyang Time, PYT), Palau (Palau Time, PWT), Timor-Leste (Timor-Leste Time, TLT), most of the Sakha Republic in Russia (Yakutsk Time, YAKT), and some of Indonesia (Eastern Indonesia Time, WIT). These are all individual time zones defined by being UTC+9, and you can kind of think of them as all living in one super time zone UTC+9. At least, I do.

In the UK, UTC already aligns almost exactly with what we want. Midday UTC is pretty much at the sun’s highest point. Midnight is pretty much at the midpoint between when the sun goes down and when it comes back up again. So, we could just use UTC as the time by creating a time zone where the time is UTC+0.

However, there is no single UK time zone. We switch between UTC+0, which we call Greenwich Mean Time or GMT, and UTC+1, which we call British Summer Time, or BST. Because of this, sometimes people think that UTC and GMT are the same thing. They’re not; they give you the same end result, but GMT is a time zone based on UTC+0.

What happens when the clocks change?

In the UK, we change the clocks at the end of March, when the clocks go forward, and end of October, when the clocks go back.

We talk about the clocks “going forward” because of the act of how you adjust a clock. At 1am on a Saturday night/Sunday morning in late March, we skip an hour, and move straight from it being 1am to it being 2am. Then in October, we reverse it; the clocks go back because we get to 2am, and then move the clock back to being 1am again, essentially repeating an hour.

I find that it’s a lot easier to think of it as moving time zone rather than just changing the clocks. In the winter (late October through late March), we’re on GMT, i.e. UTC+0. This means that the time in the UK is the same time as The Time, UTC.

Then, in late March, we switch time zones to BST, i.e. suddenly the time in the UK is UTC+1 instead of UTC+0. It’s the same experience as going to France and changing your watch; you’ve changed time zone, but without physically travelling.

In late October, we switch back to GMT, i.e. UTC+0. It’s the same experience as coming back from France and changing your watch back.

The implications for daylight in the UK is that the we’re shifting the when sunrise and sunset happen. Well, sort of – sunrise and sunset still happen at the same time, based on UTC, it’s more that we’re shifting when our experience of it happens.

In late March, switching to UTC+1 means that the sun rises an hour later and also sets an hour later. So, we get more light in the evenings for longer, but less light first thing in the morning. It also means that the sun’s highest point is at about 1pm rather than literally midday.

In late October, switching back to UTC+0 means that the sun rises an hour earlier and also sets an hour later. So, we get more light earlier in the mornings, but it gets darker earlier in the evenings. It also means that the sun’s highest point is at about 12 noon.

It’s probably easiest to show this using some actual clock times throughout the year. Where I live, in Leeds, what this means is (using 2025 dates):

Mar 29th, the last day in GMT: sunrise is 0546, sunset is 1836
Mar 30th, the first day in BST: sunrise is 0643, sunset is 1938
- I notice this on my cycling club rides on Wednesday evenings, which start at 6.30pm. Suddenly we go from meeting up in the twilight and having bike lights on constantly to having a bit more light and being able to see the sun set in the Dales.
June 21st, longest day: sunrise is 0435, sunset is 2141
Oct 25th, last day in BST: sunrise is 0753, sunset is 1746
Oct 26th, first day in GMT: sunrise is 0654, sunset is 1644
- I notice this shift at work. I’ve gone from arriving at the office in the dark to arriving in the office in the light again, at least for another couple of weeks.
Dec 21st, shortest day: sunrise is 0822, sunset is 1546

If we stuck with GMT the whole year round, then on June 21st, we’d have sunrise at 0335 and sunset at 2041. Personally, I’d find this pretty annoying because I love the UK’s long summer evenings. It’s really nice to sit in a beer garden in daylight at 9.30pm, or to go for a long bike ride after work without needing bike lights, whereas I get up at about 6am so the difference between a 0335 sunrise and a 0435 sunrise makes no difference to me.

If we stuck with BST the whole year round, then on Dec 21st, we’d have sunrise at 0922 and sunset at 1646. These late, late mornings can be pretty depressing, and some arguments go that it’s important for traffic safety to have more daylight in the morning. I’m less convinced; in December, based on a 8.30am to 5pm working day, you’ll either be driving to work in the dark or driving home in the dark anyway. Personally, I’d stick to BST the whole year round and simply rebrand it “British Standard Time” instead of British Summer Time, which funnily enough is exactly what we did from 1968 to 1971. This UK parliamentary document is a nice long read about it.

What does this mean for data?

Right, 1600 words in, time to finally get to the point. For anybody working in data, it’s really important to know what time your processes run, and what definition of time you’re using for it.

People (i.e. your non-data colleagues) will expect things to happen at the same time based on their experience of time. Let’s say you automate a data model refresh at 8.30am and automate an email subscription to this report called “the daily 9am report” which gets sent out at 9am. The people receiving this report will expect it to be available at 9am *as they experience 9am*, regardless of whether we’re on UTC+0 or UTC+1.

The problem is, a lot of default data settings are set to UTC time. For example, maybe your Power BI / Tableau / SQL server is fixed to UTC time (or maybe it’s defaulted to the time in California if it’s some kind of service you’re buying from a tech company). That means:

If you first set up the daily 9am report subscription for 9am during late March to late October when we’re on UTC+1, then it’s happening at 8am UTC time. So, in late October when we switch to UTC+0, that subscription will still get sent out at 8am UTC time, but to your report recipients, it’ll happen at 8am as we experience it. This also means that, if the reason you’ve scheduled the data refresh for 8.30am and the report subscription for 9am in the first place is that the source data is only available at 8am (regardless of what time zone we’re in), then the report will get sent out at 8am with no data in it. Or in other words, the report will be an hour early, but it’ll be useless because it’s got nothing in it.
If you first set up the daily 9am report subscription for 9am during late October to late March when we’re on UTC+0, then it’s happening at 9am UTC time. So, in late March when we switch to UTC+1, the daily 9am report subscription will land in people’s inboxes at 10am as we experience it, i.e. it’s getting sent out an hour later than expected, which people won’t be happy about.

Luckily, a lot of services let you specify time zones in your refresh schedules or processes, and adapt to local time settings. Or you can set your server time to reflect local time settings rather than UTC, which means that the server running the jobs will automatically keep the right definition of what 9am means.

But, whenever you’re doing anything that involves automating something at a particular time, watch out for UTC. If something’s being done based on UTC, then it’ll cause problems next March or October. Especially if, for example, you’ve automated something or set a date field definition based on getdate()/now() just after midnight, or taken a date and converted it to a datetime which will set that datetime to midnight, and then suddenly after the clocks change, that’s now 11pm on the day before!

You also want to make sure you understand the difference between your local time and your server time. For example, my laptop auto-updates between GMT and BST, so I never have to worry about it. But some of the servers we use run on UTC only. So, there’s sometimes a mismatch between my local laptop time and my server time, meaning that when I automate something, it might not do what I think it’s doing. Just because a query with getdate() or now() works fine on my laptop, that doesn’t mean it’ll return the same result when I schedule it on a server.

Finally, this has just focused on the UK and when we change our clocks – some countries also do this, other countries don’t, and of the other counties that also do this, they might not do it at the same time! So if you’re working internationally, it’s all compounded.

Uncategorized

How to update a production file in Excel very carefully

First things first – keeping and updating important data in an Excel file is not a good idea. There’s a massive risk of human error, which is why there’s a whole industry of information management tools that do a much better job at it.

But sometimes it’s inevitable, and you’ll end up with a situation where a file called CrucialInformation.csv or something like that is feeding into all kinds of different systems, meaning that it’s now a production file that other things depend on. And maybe you’re the person who’s been given the responsibility of keeping it updated. The data person at your job has said “this file is really important, make sure you don’t accidentally put wrong information in or delete anything out of it because then everything will break, okay, have fun, bye!”, and left you to it.

So, how do you responsibly manage a production csv file without breaking things and while minimising mistakes?

In this example, let’s say we work for a shop that sells food items, and the government has introduced new restrictions around items with the ingredient madeupylose in them. Items with high levels of madeupylose can only be sold to over 18s from behind the counter with the cigarettes, items with medium levels of madeupylose can be sold to over 18s from the normal shelves, and items with low levels of madeupylose can be sold to anybody as normal. This isn’t one of the main ingredients that we track in the proper systems yet, so while the data team are working on getting this set up in the databases, we need to manually track which items have madeupylose in them as a short-term solution.

(just to be explicit, this is a completely made-up ingredient in a completely made-up example! but I do use this approach for making targeted updates to large files)

We’ve got a production file called MadeupyloseItems.csv which stores the data, and it looks like this:

Let’s go through how to make controlled updates to this file and not break everything.

Step 1: don’t even touch it, just make copies

First of all, the safest thing to do is to pretty much never touch the production file. Firstly, because having a file open will lock it, so any background process that’s trying to access the data inside it won’t be able to. Secondly, because creating copies means we can have an archive and muck about with things safe in the knowledge that we won’t mess it up.

Wherever the production file is saved, I generally create a folder called “_archive”:

I do all the actual work on files saved in here, and almost never touch the file itself. Let’s have a look at what’s already in there:

There are two archived, datestamped files in there already – the csv from 12th September 2022, and a copy of the live production file with the datestamp from when it was last saved. It’s important to keep them there just in case I make a mistake and need to restore an old version.

There’s also the Madeupylose updater file, which I use to make my updates.

Step 2: create a totally separate Excel file to manage changes

I use a dedicated Excel file to manage the changes I make to production csv files. In the first tab, I copy/paste all the data from the most recent file – in this case, I’ve copy/pasted it out of the 220914 Madeupylose.csv file in the _archive folder so I don’t have to open the live file:

In the next tab, I make my actual changes. I’ve got some items that I’m adding in and some items that I’m recategorising:

I can use this file to add in some checks – for example, have I spelled “High”, “Medium”, and “Low” correctly? Are they all the same format? There are loads of things you could do here, like sticking a filter on to check individual values, making sure that there are no duplicates, and so on.

Once I’m happy with my updates, I now need to add them to the production file somehow. I could potentially just open the file and add them in… but that would involve opening the file, for a start, and maybe disrupting another process. And it’s not just a case of adding them in – I could simply copy/paste the new items, but I’d have to find the already-categorised items and have to change them, which would be quite easy to make a mistake on (and I really don’t want to hit CTRL+F a load of times).

So, this is where I create a third tab called New Data and use a couple of xlookups.

Step 3: xlookups

If you haven’t come across xlookups before, they’re just like vlookups but simpler and better.

Firstly, I’m going to create a new column in the ExistingData tab, and I’ll use this formula:

=XLOOKUP(ExistingData!A2, Updates!A:A, Updates!A:A)

What this does is take the item number in the existing data tab, look for it in the updates tab, and if it finds it in the updates tab, it’ll return that number. If it doesn’t find it, it gives me an #N/A value:

What this means is that I can distinguish between the items that have already been categorised that I’ve changed and the ones that I haven’t. I know that anything with an #N/A value in the lookup column is an item that I haven’t changed at all. So, I can add a filter to the lookup column, and select only the #N/A values:

I’ll copy/paste columns A and B for these items into the NewData tab. Then, I’ll copy/paste columns A and B for all items in the Updates tab into the NewData tab underneath (and remember to remove the headers!). Now, I’ve got the full information in one place:

What I like about this approach is that it works the same whether you’re updating 20 records or 20,000 records – once you’ve made your updates in the updates tab, the xlookups to get it all transferred over only take seconds, and you can be sure that you’ve definitely got all the things you need in the NewData tab.

Step 4: update the production file

We’re now ready to update the production file. I prefer to take an extra step to create a whole new csv and never actually touch the production file. So, I’m going to take everything in the NewData tab, copy/paste it into a new file, and save that as a csv called “220916 MadeupyloseItems.csv” in the _archive folder:

We’ve got the archived version of the current file – it’s the one from 220914, so there’s a backup in case I’ve messed up the latest update somehow. We’ve also got the archived version of the new file that I want to be the production file. The final step is to move that into production by overwriting the production file. That’s as simple as going back to the 220916 MadeupyloseItems.csv file in Excel, hit Save As, and overwrite the live file:

And there we have it! We’ve updated the live production file without even touching it, we’ve carefully tracked which lines we’re adding in and which lines we’re changing, and we’ve got a full set of backups just in case.

Uncategorized

SAP: what’s the difference between the merchandise category hierarchy and the article hierarchy?

I’ve been working with product classification in SAP recently, and something that comes up a lot is the difference between the merchandise category hierarchy and the article hierarchy. Both can be used for showing hierarchies and reporting, so what’s the difference?

The merchandise category is obligatory – you have to have the merchandise category when setting up SAP. It’s also a pain to change, so you need to think really carefully out what you want the merchandise category to be when you’re starting out. You don’t necessarily have to create a merchandise category hierarchy, but you do have to assign merchandise categories. The article hierarchy isn’t obligatory at all, and it’s more flexible to change.

The purpose of having separate hierarchies is that they represent different aspects of a thing. The merchandise category is about what a thing actually is, while the article hierarchy is about what you do with it.

To illustrate this, let’s take a load of different nuts – peanuts, almonds, walnuts, chestnuts, and so on. Under a merchandise category perspective, you’d want to classify these into fundamental properties of the item – for example, you could use the GS1 standards, which separate them into bricks like Nuts/Seeds – Unprepared/Unprocessed (in shell) and Nuts/Seeds – Prepared/Processed (out of shell). That would mean that nuts like loose chestnuts and walnuts in shells would be in the first category, salted peanuts and flaked almonds would be in the second category. These are fundamental to what the product actually is.

Under an article hierarchy perspective, you’d maybe want to set up a separate way of classifying these nuts depending on what you do with them in your store. While salted peanuts and flaked almonds are both Nuts/Seeds – Prepared/Processed (out of shell), they’re quite different in terms of what people actually buy them for – salted peanuts are more of a snacking thing and you’d probably expect to see them in the same aisle as crisps and Bombay mix, while flaked almonds are more of a baking/cooking thing and you’d probably expect to see them in the same aisle as flour and glacé cherries. And maybe you’d want to reflect that in your internal organisation too, having one team of people responsible for snacks and one team responsible for baking/cooking, and splitting nuts across the two. This isn’t fundamental to what the product actually is, as they’re both nuts, but it’s more about what you do with it.

I’ve found that the best analogy for merchandise categories vs. article hierarchy is to step away from shop products and talk about animals.

If merchandise category is about what something fundamentally is, then it’s like the order of species, or taxonomic classification. Dogs and wolves are separate species, but they’re both in the dog family, and if you go up a little higher, that includes foxes in the canidae genus, and if you go up higher again to carnivores, this is where you get to cats as well, and the level above that is all mammals, so that’s where cows come in, and so on. Chickens are birds, which are totally different from mammals at a very high level, so you’ve got to go all the way from species up to the phylum chordata to get to the level where cows and chickens are the same kind of thing. That’s the taxonomic classification of animals, and it’s just how it is. Scientists can provide evidence to move things around a little bit, but it’s pretty much set. This is the merchandise category hierarchy of the animal world.

(wordpress has compressed this a lot – open the full size image in a new tab or use an image reader browser extension like Imagus to see full size)

But now, let’s say you run a pet shop, or a zoo. You know your taxonomic classification perfectly, but you also know that selling dogs and wolves in a pet shop is a terrible idea, because while dogs and wolves are technically very similar, one of them is very much not a pet. Or you’re planning out which animals to keep together in which bits of the zoo, and while cows and chickens are nothing like each other on the taxonomic classification, they’re both farmyard animals (and they won’t eat each other), so it would probably make sense to keep cows and chickens together in the “life on the farm” section of the zoo rather than having cows and lions in the “but they’re both mammals, it’ll all be fine” section of the zoo.

So, this is where you’d create a separate article hierarchy. It’s not about what an animal is, it’s about what you do with that animal, how you organise things, who’s responsible for caring for them, and so on.

This is a lot easier to change and it lets you build different reporting on top of the merchandise categories, but it’s a completely different way of seeing things from the merchandise category hierarchy view of the world. This is why the merchandise category hierarchy and the article hierarchy are different, and why the merchandise category level is obligatory (although putting it into a merchandise category hierarchy is optional), while the article hierarchy is optional.

Uncategorized

How to open an open Excel file in Alteryx

Before I start this blog, a disclaimer:

This is disgusting. Don’t do it. If you are running a production workflow off a manually maintained Excel file, you have bigger problems than not being able to open an Excel file in Alteryx because somebody else is using it.

With that out of the way… you know how you can’t open an Excel file in Alteryx if you’ve got that file open? Or worse, if it’s an xlsx on a shared drive, and you don’t even have it open, but somebody else does? Yeah, not fun.

The “solution” is to email everybody to ask them to close the file, or to create a copy and read that in, changing all your inputs to reference the copy, then probably forgetting to delete the copy and now you’re running it off the copy and nobody’s updating that so you’re stuck in a cycle of overwriting the copy whenever you remember, and … this is not really a solution.

The ideal thing to do would be to change your process so that whatever’s in the xlsx is in a database. That’d be nice. But if that’s not an option for you, I’ve created a set of macros that will read an open xlsx in Alteryx.

In the first part of this blog, I’ll show you how to use the macros. In the second, I’ll explain how it works. If you’re not interested in the details, the only thing you need to know is this:

They use run command tools (which means your organisation might be a bit weird about you downloading them, but just show them this blog, nobody who uses comic sans could be an evil man).
Those run commands mimic the process of you going to the file explorer, clicking on a file, hitting copy, hitting paste in the same location, reading the data out of that copy, then deleting that copy. Leave no trace, it’s not just for hiking.
This means that command windows will briefly open up while you’re running the workflow. That’s fine, just don’t click anything or hit any keys while that’s happening or you might accidentally interrupt something (I’ve mashed the keys a bit trying to do this deliberately and the worst I’ve done is stop it copying the file or stop it deleting the file, but that’s still annoying. Anyway, you’ve been warned.).

The macros

There are three macros, which you can find here:

1a: Open an open xlsx – sheet names from full path
https://gallery.alteryx.com/#!app/Open-an-open-xlsx—sheet-names-from-full-path/60267d29826fd3151084eb8c

1b: Open an open xlsx – sheet names from browse
https://gallery.alteryx.com/#!app/Open-an-open-xlsx—sheet-names-from-browse/60267d330462d71a2ce6075d

2: Open an open xlsx – read specific sheet
https://gallery.alteryx.com/#!app/Open-an-open-xlsx—read-specific-sheet/60267d388a933713608a284d

Using these macros is a two-step process. The first step takes the xlsx and reads in the sheet names. There are two options here; you can browse for the xlsx file directly in the macro using option 1b, or you can use a directory tool and feed in the necessary xlsx file using option 1a (this is a few extra steps, but it’s my preferred approach). You can then filter to the sheet(s) you want to open, and feed that into the second step, which takes the sheet(s) and reads in the data.

Example 1

Let’s work through this simple example. I’ve got a file, imaginatively called 201210.test.xlsx, and I can’t read it because somebody else has it open. In these three steps, I can get the file, select the sheet I need, and get the data out of it:

In the first step, I’m using macro 1b: Open an open xlsx – sheet names from browse. The input is incredibly simple – just hit the folder icon and browse for the xlsx you need:

But this one comes with one caveat – for the same reason that you can’t open an xlsx if it’s open, the file browse doesn’t work if the xlsx is open either. You only need to set this macro up once, though – once you’ve selected the file you need, you’re good, and you can run this workflow whenever. You only need to have the file closed when you first bring this macro onto the canvas and select the file through the file browser.

This returns the sheet names:

All you need to do next is filter to the sheet name you need. I want to open Sheet2, so I filter to that and feed it into the next macro, which is macro 2: Open an open xlsx – read specific sheet. This takes three input fields – the full path of the xlsx, the sheet you want to open, and the line you want to import from:

[FullPath] and [Sheet Names] are the names of the fields returned by the first macro, so you shouldn’t even need to change anything here, just whack it in.

That gives you the data in Sheet2!

Example 2

Let’s do the same thing with a directory tool and my preferred approach – using macro 1a: Open an open xlsx – sheet names from full path:

In the first step, I use the directory tool to give me the full paths of all the xlsx files in a particular folder. This is really powerful, because it means I can choose a file in a dynamic fashion – for example, if the manually maintained process has a different xlsx for each month, you can sort by CreationTime from most recent to least recent, then use the select records tool to select the first record. That’ll always feed in the most recently created xlsx into the first macro. In this simple example, I’m just going to filter to the file I want, then deselect the columns I don’t need, which is everything except FullPath:

I can now feed the full path into the macro, like so:

…and the next steps are identical to example 1.

Example 3

If you need to read in multiple sheets, the get open xlsx sheet data macro will do that and union them together, as long as they’re all in the same format. I’ve got another example xlsx, which is unsurprisingly called 201210 test 2.xlsx. I’ve got four sheets, one for each quarter, where the data structure is identical:

The data structure has to be identical, can’t stress that enough

With this, you don’t have to filter to a specific sheet – you can just bung the get open xlsx sheet data macro straight on the output of the first one, like this:

That’s all you have to do! If this is something you’ve been looking for, please download them from the public gallery, test them out, and let me know if there are any bugs. But seriously, please don’t use this as a production process.

How it works

It’s all about the run commands. I’ll walk through macro 1a: open an open xlsx from a full path, but they’re all basically the same. There are four main steps internally:

Creating the file path and the batch commands
Copying the file
Getting data out of the copy
Deleting the copy

The first regex tool just finds the file extension. I’ll use this later to build the new file path:

The formula tool has three calculations in it; one to create a new full path, one to create the copy command, and one to create the delete command.

The NewFullPath calculation takes my existing full path (i.e. the xlsx that’s open) and adds a timestamp and flag to it. It does this by replacing “.xlsx” with “2021-02-12 17.23.31temptodelete1a.xlsx”. It’s ugly, but it’s a useful way of making pretty sure that you create a unique copy so that you’re not going to just call it “- copy.xlsx” and then delete somebody else’s copy.

The Command calculation uses the xcopy command (many thanks to Ian Baldwin, your friendly neighbourhood run-command-in-Alteryx hero, for helping me figure out why I couldn’t get this working). What this does is create a command to feed into the run command tool that says “take this full path and create a copy of it called [NewFullPath]”.

The DeleteCommand calculation uses the delete command. You feed that into the run command tool, and it simply takes the new full path and tells Windows to delete it.

Now that you’ve got the commands, it’s run command time (and block until done, just to make sure it’s happening in the right order).

To copy the file: put a select tool down to deselect everything except the command field. Now set up your run command tool like this:

You want to treat the command like a .bat file. You can do this by setting the write source output to be a .bat file that runs in temp – I just called it CopyFiles.bat. To do this, you’ll need to hit output, then type it in in the configuration window that pops up, and keep the other settings as above too. If you’re just curious how this macro works, you don’t need to change anything here, it’ll run just fine (let me know if it doesn’t).

To get the list of sheet names, the first step is to create a new column with the full path, a pipe separator, then “<List of Sheet Names>” to feed into the dynamic input:

From there, you can use the existing xlsx as the data source template, and use NewFullPathSheetNames from the previous formula as the data source list field, and change the entire file path:

That’ll return the sheet names only:

So the last step is an append tool to bring back the original FullPath. That’s all you need as the output of this step to go into the sheet read macro (and the only real difference in that macro is that the formula tool and dynamic input tool is set up to read the data out of the sheets rather than get a list of sheet names).

Finally, to delete the copy, deselect everything except the DeleteCommand field, and set it up in the same way as the copy run command tool earlier:

And that’s about it. I hope this explanation is full enough for you to muck about with run command tools with a little confidence. I really like xcopy functions as a way of mass-shifting things around, it’s a powerful way of doing things. But like actual power tools, it can be dangerous – be careful when deleting things, because, well, that can go extremely badly for you.

And finally, none of this is good. This is like the data equivalent of me having a leaky roof, putting a bucket on the floor, and then coming up with an automatic way of switching out the full bucket for an empty one before it overflows. What I should actually do is fix the roof. If you’re using these macros, I highly recommend changing your underlying data management practices/processes instead!

Uncategorized

Hacking redacted documents with statistics: the Ghislaine Maxwell deposition

The original Slate article introduces this better than I can, because they’re actual journalists and I’m just a data botherer:

On Thursday morning, a federal court released a 2016 deposition given by Ghislaine Maxwell, the 58-year-old British woman charged by the federal government with enticing underage girls to have sex with Jeffrey Epstein. That deposition, which Maxwell has fought to withhold, was given as part of a defamation suit brought by Virginia Roberts Giuffre, who alleges that she was lured to become Epstein’s sex slave. That defamation suit was settled in 2017. Epstein died by suicide in 2019.
In the deposition, Maxwell was pressed to answer questions about the many famous men in Epstein’s orbit, among them Bill Clinton, Alan Dershowitz, and Prince Andrew. In the document that was released on Thursday, those names and others appear under black bars. According to the Miami Herald, which sued for this and other documents to be released, the deposition was released only after “days of wrangling over redactions.”
Slate: “We Cracked the Redactions in the Ghislaine Maxwell Deposition”

It’s some grim shit. I haven’t been following the story that closely, and I don’t particularly want to read all 400-odd pages of testimony.

But this bit caught my eye:

It turns out, though, that those redactions are possible to crack. That’s because the deposition—which you can read in full here—includes a complete alphabetized index of the redacted and unredacted words that appear in the document.

This is … not exactly redacted. It looks pretty redacted in the text itself:

Above: Page 231 of the depositions with black bars redacting some names.

But the index helpfully lists out all the redacted words. With the original word lengths intact. In alphabetical order. You don’t need any sophisticated statistical methods to see that many of the redacted black bars on page 231 concern a short word which begins with the letter A, and is followed by either the letter I or the letter L:

Above: The index of the deposition, helpfully listing out all the redacted words alphabetically anbd referencing them to individual lines on individual pages.

It also doesn’t take much effort to scroll through the rest of the index, and notice that another short word beginning with the letters GO occurs in exactly the same place:

Above: I wish all metadata was this good.

And once you’ve put those two things together, it’s not a huge leap to figure out that this is probably about Al Gore.

When I first read the Slate article on Friday 23rd October 2020, around 8am UK time, Slate had already listed out a few names they’d figured out by manually going through the index and piecing together coöccurring words. And that reminded me of market basket analysis, one of my favourite statistical processes. I love it because you can figure out where things occur together at scale, and I love it because conceptually it’s not even that hard, it’s just fractions.

Market basket analysis is normally used for retail data to look at what people buy together, like burgers and burger buns, and what people don’t tend to buy together, like veggie sausages and bacon. But since it’s basically just looking at what happens together, you can apply it to all sorts of use cases outside supermarkets.

In this case, our shopping basket is the individual page of the deposition, and our items are the redacted words. If two individual redacted words occur together on the same page(s) more than you’d expect by chance, then those two words are probably a first name and a surname. And if we can figure out the first letter(s) of the words from their positions in the index, we’ve got the initials of our redacted people.

For example, let’s take the two words which are probably Al Gore, A1 and GO1. If the redacted word A1 appears on 2% of pages, and if the redacted word GO1 appears on 3% of pages, then if there’s no relationship between the two words, you’d expect A1 and GO1 to appear together on 3% of 2% of pages, i.e. on 0.06% of pages. But if those two words appear together on 1% of pages, that’s ~16x more often than you’d expect by chance, suggesting that there’s a relationship between the words there.

So, I opened up the Maxwell deposition pdf, which you can find here, and spent a happy Friday evening going through scanned pdf pages (which are the worst, even worse than the .xls UK COVID-19 test and trace debacle, please never store your data like this, thank you) and turning it into something usable. Like basically every data project I’ve ever worked on, about 90% of my time was spent getting the data into a form that I could actually use.

oh god why

Working through it…

How data should look.

And now we’re ready for some stats. I used Alteryx to look at possible one-to-one association rules between redacted words. Since I don’t know the actual order of the redacted words, there are two possible orders for any two words: Word1 Word2 and Word2 Word1. For example, the name that’s almost definitely Al Gore is represented by the two words “A1” and “GO1” in my data. If there’s a high lift between those two words, that tells me it’s likely to be a name, but I’m not sure if that name is “A1 GO1” or “GO1 A1”.

After running the market basket analysis and sorting by lift, I get these results. Luckily, Slate have already identified a load of names, so I’m reasonably confident that this approach works:

A list of initials and names, in Excel this time to make it more human-readable.

Like I said, I haven’t been following this story that closely, and I’m not close enough to be able to take a guess at the names. But I’m definitely intrigued by the top one – Slate haven’t cracked it as of the time of writing. The numbers suggest that there’s somebody called either Je___ Ja___ or Ja___ Je___ who’s being talked about here:

Je___ Ja___ or Ja___ Je___?

I don’t particularly want to speculate on who’s involved in this. It’s a nasty business and I’d prefer to stay out of it. But there are a few things that this document illustrates perfectly:

It’s not really redacted if you’ve still got the indexing, come on, seriously
Even fairly simple statistical procedures can be really useful
Different fields should look at the statistical approaches used in other fields more often – it really frustrates me that I almost never see any applications of market basket analysis outside retail data
Please never store any data in a pdf if you want people to be able to use that data

Uncategorized

Synthesised size/sound sound symbolism: the MS Paint version

Earlier today, I presented my CogSci paper Synthesised size/sound sound symbolism. The six page paper is short and to the point, but hey, it technically counts as a publication, so I figure it’d be remiss of me not to render it badly in MS Paint.

We start with the Japanese ideophone learning study. Participants learned some Japanese ideophones, half with their real Dutch translations, half with their opposite translations, and it turned out that they were way better at remembering the ideophones with their real translations. Importantly, this didn’t happen when we did exactly the same thing with regular, arbitrary adjectives. You can read that paper here, and download some of the experiment materials for it here if you want to try it yourself. You can also read the replication study we did with EEG here, and download the data and analysis scripts for that here.

The results looked a bit like this:

histograms1

But while we can see that there is an obvious effect, we don’t know how it works. Is it a learning boost that people get from the cross-modal correspondences in the real condition?

match boost lerning

Or is it a learning hindrance that people get from the cross-modal clashes in the opposite condition?

mismatch hindrance learning

(or indeed, is it both?)

Without adding a neutral condition where the words neither obviously match nor mismatch their meanings, we don’t really know.

So, we created some synthesised size/sound sound-symbolic pseudowords, which is easier said than done. It’s well known that people associate voiced consonants and low, back vowels with large size, and voiceless consonants and front, high vowels with small size. This is probably because of the mouth shape you make when saying those sounds:

a and i

We created big-sounding words (like badobado), small-sounding words (like kitikiti), and neutral-sounding words which were halfway in between (like depedepe).

A neutral condition could tell us if it’s a graded effect…

graded effect.png

a match boost effect…

match boost.png

or a mismatch hindrance effect:

mismatch hindrance.png

Turns out it’s a match boost effect. Participants learned the match pseudowords (e.g. badobado and big, kitikiti and small) better than the neutral pseudowords (e.g. kedekede and big, depedepe and small), but there wasn’t a difference in how well they learned the neutral and mismatch pseudowords (e.g. godagoda and small, tikutiku and big).

results n=30

For good measure, we did it again with double the original sample size, because it’s nice to check things.

first experiment and replication blog picture

…and, yes, we found exactly the same thing:

results n=60

So, it looks like it’s a special match boost effect from the cross-modal correspondences, not a graded effect reflecting all cross-modal information.

This is a nice paradigm which can be easily altered to try with different languages or different stimuli, e.g. using big and small shapes rather than the words “big” and “small” in order to rule out letter confounds. I’ll put up all the Presentation scripts, synthesised stimuli, and analysis scripts once I’ve tidied them up a lot (if you think my MS Paint scribbles are messy, you should see my code). It should be pretty straightforward for anybody to download and redo this, so it’d make a good project for a Bachelors/Masters intern. I’d love to see this get taken further and tried out with useful variations and changes. But good luck coming up with a more satisfying title!

Uncategorized

Rolling form and how Arsenal lost the Premier League title

Is form taken seriously in the media discussion of football? Every game preview acknowledges the results from a team’s last five games, often with green, grey, and red blobs so you can see how much they’ve won, drawn, and lost recently… but it’s rarely quantified beyond that. This is kind of understandable, since you get three points for a win regardless of what form you or your opposition have been in, but I feel like a bit more of a deep dive into form data is needed.

I’ve been scraping a lot of game data from last season’s Premier League with the ultimate goal of making an adjusted table for rolling form. Which teams beat form teams most often? Which teams benefited most from playing teams at a low ebb more often than other teams? Is rolling form a better predictor of a match winner than a team’s position?

This has thrown up an absolute glut of data that can (and will) fill several blog posts. For now, though, I’d like to focus on Arsenal’s woes last season.

One of the main media narratives around Arsenal is that they’re flat track bullies; they swat aside mediocre teams with ease, looking fluent and impressive while doing so, but they come unstuck against the top teams. These criticisms extend to the individual players, with Olivier Giroud and Per Mertesacker often held up as examples of almost-but-not-quite players; good enough to beat most teams, but not in the elite level that would take Arsenal past the Chelseas, the Manchester Uniteds, the Manchester Citys (despite finishing above all of them last season, but whatever).

Thing is, that’s completely untrue.

Here’s a graph of Arsenal’s wins, losses, and draws last season. The dots and lines represent each game, and are positioned on the y-axis according to Arsenal’s and their opposition’s form over the last five games (in mean points per game) just before each game. This excludes the first five matches of the season, which are always a bit dicey and rarely reflect the final standings.

Arsenal form plot.png

What strikes me most clearly is that Arsenal lost all their games to teams in average to poor form. Indeed, three losses were against teams who’d been averaging less than a point per game, which, if applied to a whole season, would set them out as clear relegation candidates. Moreover, around half these losses have come when Arsenal have been in a good run of form themselves, averaging two or more points per game (which, if applied to a whole season, would have seen them score 76 points, five more than they actually managed). Meanwhile, a lot of Arsenal’s wins actually came against teams who were doing well, averaging 1.5 points per game or more. This suggests that Arsenal were winning the difficult games and losing the easy games.

This isn’t much use on its own, so let’s look at Leicester and Tottenham for comparison.

Not only did Leicester lose less often – obviously – their losses never came against teams averaging less than one point per game. The same goes for Tottenham, who lost their games to teams in a fairly decent run of form.

We can also look at how Arsenal did according to their league position and the league position of their opposition just before the game.

Arsenal position plot.png

This plot is also pretty illuminating. The bulk of Arsenal’s losses came against teams in the bottom half of the table, while a lot of their record against teams in the top five when they played them was actually pretty good, winning four, drawing three, and losing only one. There’s also a nice split between their wins and draws which shows that Arsenal generally beat the lower table teams and generally drew with the high mid-table teams when they played them.

Again, let’s compare that with Leicester and Tottenham.

Leicester only lost to teams in the top half of the table. Tottenham lost to teams in the relegation zone twice, but the rolling form plot showed that these relegation zone teams had been doing pretty well at the time.

To get a look at the whole league, we can plot the mean rolling form and the mean league position for the opposition for each team in losses. That’s a confusing sentence; another way of phrasing it is saying that this is looking at the average form and average league position of the teams that each team lost to last season.

form plot for losses plus lm.png

position plot for losses plus lm.png

Both these graphs show a slight relationship between how well a team did overall and the nature of their losses – the better performing teams tended to lose to better opposition, i.e., by losing to teams with higher points per game in the last five games and teams who were higher in the table at the time.

All except Arsenal. In fact, Arsenal were the worst team least season in terms of losing to teams in poor form. The teams that Arsenal lost to were in the worst average form and in the lowest average position compared with losses by any other team in the league.

In short, this contradicts the main narrative of Arsenal not being good enough against the top teams. Rather, Arsenal aren’t good enough against the bad teams, and lost out on the league by losing to teams in poor form in the relegation zone. Next time Arsenal play a poorly-performing bottom-table team, maybe a Hull or a Middlesborough in a bit of a rut, I’ll stick a tenner on Arsenal to lose; they’re only a little less likely to lose than they are to win.

wenger water bottle.gif

Uncategorized

ERP graph competition!

A while back, I blogged about creating better ERP graphs. The data I used to generate those graphs has now been published, and all the data and analysis scripts from that paper are available to download here.

One of the things I find most fun about research is playing about with plotting graphs. There are so many different ways to visualise ERP data that it’s hard to pick one. I played around with a few different styles for my Collabra paper, and settled for plotting the two conditions with 95% confidence intervals. I was also tempted to plot the individual ERPs, as shown in the second plot, but felt the CIs were cleaner and more useful.

However, I also liked Guillaume’s approach of plotting the grand average difference wave and the difference waves for individual participants. I’ve done it with and without 95% CIs in plots 3 and 4. The difference in those two is that the green/orange distinction shows significance; the 320-784ms window was significant, the rest of the epoch wasn’t.

Parietal electrodes, title, all trials, lines and 95pc, capitals in legend (29-3-16 submission)

Parietal electrodes, all participants at once, all data (size 0.3, alpha 0.3) (lighter)

Parietal electrodes, diffwave and individual diffwaves

Parietal electrodes, diffwave and individual diffwaves plus CIs

But I’d love to see how these can be improved! All my graphs and all the code needed are below. You can download eegdata.txt from my OSF page, but be careful not to click on the file name itself, otherwise your browser will probably freeze:

osf download instructions

After that, load it into R, and have at it. The only extra packages I use to make the graphs in this blog are ggplot2, ggthemes, tidyr, and dplyr.

If you think you can do better, email me your graphs and code to gwilym(dot)lockwood(at)mpi(dot)nl. I’ll post another blog in a few weeks with my favourite contributions… and I’ll buy the winner a beer 🙂

Parietal electrodes, title, all trials, lines and 95pc, capitals in legend (29-3-16 submission)

eegdata$condition <- gsub("real", "Real", eegdata$condition)
eegdata$condition <- gsub("opposite", "Opposite", eegdata$condition)
 
# create dataframe by measuring across participants
 
dfsmallarea <- filter(eegdata, smallarea == "parietal")
dfsmallarea <- aggregate(measurement ~ smallarea*time*condition, dfsmallarea,mean)
 
# work it out for all trials
 
parietaldf <- filter(eegdata, smallarea == "parietal")
parietaldf <- aggregate(measurement ~ time*participant*condition*electrode, parietaldf,mean)
 
std <- function(x)sd(x)/sqrt(length(x))
# i.e. std is a function to give you the standard error of the mean
# this is the standard deviation of a sample divided by the square root of the sample size
 
SD <- rep(NA,length(dfsmallarea$time))       # creates empty vector for standard deviation at each time point, which will be huge
SE <- rep(NA,length(dfsmallarea$time))       # creates empty vector for standard error at each time point
CIupper <- rep(NA,length(dfsmallarea$time))  # creates empty vector for upper 95% confidence limit at each time point
CIlower <- rep(NA,length(dfsmallarea$time))  # creates empty vector for lower 95% confidence limit at each time point
 
for (i in 1:length(dfsmallarea$time)){
  something <- subset(parietaldf,time==dfsmallarea$time[i] & condition==dfsmallarea$condition[i], select=measurement)
  SD[i] = sd(something$measurement)
  SE[i] = std(something$measurement)
  CIupper[i] = dfsmallarea$measurement[i] + (SE[i] * 1.96)
  CIlower[i] = dfsmallarea$measurement[i] - (SE[i] * 1.96)
 
}
 
dfsmallarea$CIL <- CIlower
dfsmallarea$CIU <- CIupper
 
# Now let's plot things, starting with all trials

colours <- c("#D55E00", "#009E73")  # colourblind friendly - red/orange for opposite, green for real
 
dfsmallarea$time <- as.integer(as.character(dfsmallarea$time))plot <- ggplot(dfsmallarea, aes(x=time, y=measurement, color=condition)) + 
  geom_line(size=1, alpha = 1)+
  scale_linetype_manual(values=c(1,1) )+  #, guide=FALSE)+
  scale_y_continuous(limits=c(-7, 7), breaks=seq(-7,7,by=1))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - all trials")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_bw() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+ theme(plot.title=  element_text( face="bold"), axis.text = element_text(size=8)) +
  geom_smooth(aes(ymin = CIL, ymax = CIU, fill=condition), stat="identity", alpha = 0.3) + #95% CIs
  scale_fill_manual(values=colours)+ #, guide=FALSE) +
  scale_colour_manual(values=colours) #, guide=FALSE)

Created by Pretty R at inside-R.org

Parietal electrodes, all participants at once, all data (size 0.3, alpha 0.3) (lighter)

temp <- filter(eegdata, smallarea == "parietal")
temp <- select(temp, smallarea,time, measurement, condition, participant)
temp <- aggregate(measurement ~ smallarea*time*condition*participant, temp,mean)
temp2 <- aggregate(measurement ~ smallarea*time*condition, temp,mean)
temp2$participant <- "Grand Average"
temp2$GA <- "Grand Average"
temp$GA <- "individuals"
temp2 <- select(temp2, smallarea,time, measurement, condition, participant, GA)
temp3 <- rbind(temp, temp2)
temp3$conbypar = paste(temp3$condition, temp3$participant, sep="")
 
temp3$time <- as.integer(as.character(temp3$time))plot <- ggplot(temp3, aes(x=time, y=measurement,group=conbypar)) + 
  geom_line(size=0.3, alpha=0.3, aes(colour=condition))+
  geom_line(data = subset(temp3, GA == "Grand Average"), size=1.5, alpha=1, aes(colour=condition)) +
  scale_colour_manual(values=colours, guide=FALSE)+
  scale_y_continuous(limits=c(-15, 15), breaks=seq(-15,15,by=3))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - individual participants for all trials")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_few() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+   theme(plot.title=  element_text( face="bold"),
              axis.text = element_text(size=8))

Created by Pretty R at inside-R.org

Parietal electrodes, diffwave and individual diffwaves

temp <- filter(eegdata, smallarea == "parietal")
temp <- select(temp, smallarea,time, measurement, condition, participant)
temp <- aggregate(measurement ~ smallarea*time*condition*participant, temp,mean)
temp2 <- aggregate(measurement ~ smallarea*time*condition, temp,mean)
temp2$participant <- "Grand Average"
temp2$GA <- "Grand Average"
temp$GA <- "individuals"
temp2 <- select(temp2, smallarea,time, measurement, condition, participant, GA)
temp3 <- rbind(temp, temp2)
temp3$conbypar = paste(temp3$condition, temp3$participant, sep="")
 
 
temp4 <- spread(temp2, condition, measurement) # unmelt to create diffwave calculation
temp5 <- spread(temp, condition, measurement) 
 
temp4$diffwave <- temp4$real - temp4$opposite
temp5$diffwave <- temp5$real - temp5$opposite
 
temp6 <- rbind(temp4, temp5)
temp6$conbypar = paste(temp6$condition, temp6$participant, sep="")
 
temp6$sig <- ifelse(temp6$time %in% c(320:784), "yes", "no")
 
temp6$time <- as.integer(as.character(temp6$time))
sigcolours <- c("#D55E00", "#D55E00", "#009E73")plot <- ggplot(temp6, aes(x=time, y=diffwave,group=conbypar)) + 
  geom_line(size=0.3, alpha=0.3)+
  geom_line(data = subset(temp6, GA == "Grand Average"), size=2, alpha=1, aes(colour=sig)) +
  scale_colour_manual(values=sigcolours, guide=FALSE)+
  scale_y_continuous(limits=c(-16, 16), breaks=seq(-16,16,by=2))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - difference wave and individual difference waves")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_few() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+   theme(plot.title=  element_text( face="bold"),
              axis.text = element_text(size=8))

Created by Pretty R at inside-R.org

Parietal electrodes, diffwave and individual diffwaves plus CIs

for (i in 1:length(temp6$time)){
  something <- subset(temp6,time==temp6$time[i], select=diffwave)
  SD[i] = sd(something$diffwave)
  SE[i] = std(something$diffwave)
  SDupper[i] = temp6$diffwave[i] + SD[i]
  SDlower[i] = temp6$diffwave[i] - SD[i]
  SEupper[i] = temp6$diffwave[i] + SE[i]
  SElower[i] = temp6$diffwave[i] - SE[i]
  CIupper[i] = temp6$diffwave[i] + (SE[i] * 1.96)
  CIlower[i] = temp6$diffwave[i] - (SE[i] * 1.96)
 
}
 
temp6$CIL <- CIlower
temp6$CIU <- CIupper
 
temp6$sig <- ifelse(temp6$time %in% c(320:784), "yes",ifelse(temp6$time %in% c(-200:318), "no1",
                           "no2"))
 
temp6$time <- as.integer(as.character(temp6$time))
sigcolours <- c("#D55E00", "#D55E00", "#009E73")plot <- ggplot(temp6, aes(x=time, y=diffwave,group=conbypar)) + 
  geom_line(size=0.3, alpha=0.3)+
  geom_line(data = subset(temp6, GA == "Grand Average"), size=2, alpha=1, aes(colour=sig)) +
  scale_colour_manual(values=sigcolours, guide=FALSE)+
  scale_y_continuous(limits=c(-16, 16), breaks=seq(-16,16,by=2))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - difference wave and individual difference waves")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_few() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+   theme(plot.title=  element_text( face="bold"),
              axis.text = element_text(size=8))plot+ theme(plot.title=  element_text( face="bold"), axis.text = element_text(size=8)) +
  geom_smooth(data = subset(temp6, GA == "Grand Average"), linetype=0, aes(ymin = CIL, ymax = CIU,group=sig, fill=sig), stat="identity", alpha = 0.5) + #95% CIs
  scale_fill_manual(values=sigcolours, guide=FALSE)

Created by Pretty R at inside-R.org

Uncategorized

How iconicity helps people learn new words: the MS Paint version

I have a new paper out!

Gwilym Lockwood, Peter Hagoort, and Mark Dingemanse. “How Iconicity Helps People Learn New Words: Neural Correlates and Individual Differences in Sound-Symbolic Bootstrapping.” Collabra 2, no. 1 (July 6, 2016). doi:10.1525/collabra.42.

The paper can be read and downloaded right here:
http://www.collabra.org/article/10.1525/collabra.42/

…and because we’re doing the whole open thing properly, you can also download the original stimuli, data, and analysis scripts here. You probably have no intention of sifting through it all, but the point is that you can:
https://osf.io/ema3t/

The experiment was pretty similar to my Sound symbolism boosts novel word learning paper from a few months ago, except that this time it was only with the ideophones, and I used EEG to measure participants’ brain activity while they learned them. People learned 19 ideophones with their real Dutch translations and 19 ideophones with their opposite Dutch translations. After I told them about that and said sorry for the deception, they heard all the ideophones again and had to guess what the real translation was from a choice of two antonyms.

The first important thing is that the results were almost identical. People got the answers right 86.7% of the time for the ideophones in the real condition, and 71.3% of the time for the ideophones in the opposite condition. When they had to guess, they got it right 73% of the time. These figures replicate the first study very closely (86.7% to 86.1%, 71.3% to 71.1%, and 73% to 72.3%), which is excellent news.

accuracy for JEPLMC and Collabra experiments - jitterdots (0-100) 9by7 for APS presentation

All kinds of things can happen in scientific studies, so replicating a study is really important for showing that the effect is real and not just coincidental. Sadly, replications aren’t considered to be very glamorous, so a shocking amount of published science is either unreplicated or unreplicable:

first experiment and replication blog picture

The second part of this new paper is that I also measured people’s brain activity using EEG. Once you average all the trials together, you get a signal of changing activity over time in response to a thing, which is known as an event-related potential. It looks a bit like this, and isn’t that useful by itself:

erp picture single line

Instead, you have to compare two conditions. If they differ at a certain point, that’s what tells you about how the brain processes things:

erp picture two lines

In this experiment, I found a big P3 effect in the test round:

figure 8 - Parietal electrodes, title, all trials, lines and 95pc, capitals in legend (29-3-16 submission)

The P3 is linked to memory and learning, so it’s not surprising that it came up in a task involving memory and learning. A lower P3 is linked to things being more difficult to learn, so again, it’s not surprising that the ideophones in the opposite condition have a lower P3 when they were harder to learn.

But, it wasn’t that simple. If it was a straight up learning effect, you would expect a correlation between how well people did in a condition and that condition’s P3 amplitude; people with lower test round scores in the opposite condition should have a lower P3 amplitude in the opposite condition. But they don’t.

However, there was a correlation between how sensitive participants were to sound symbolism (as measured by their meaning guessing accuracy in the task after the test) and how big the ERP difference between conditions was. When I split the participants into two groups (people scoring above and below the 73% average in the guessing task) and plotted their ERPs separately, it turns out that the P3 effect is big for the people who are more sensitive to sound symbolism and barely there at all for the people who are less sensitive to sound symbolism:

het talige brein picture

You can also see that the P3 amplitude in the real condition was the same across both groups. What’s behind the difference is how the amplitude in the opposite condition changes. This suggests that most people can recognise cross-modal correspondences and exploit them in word learning, but that some people are more sensitive to sound symbolism and get put off by cross-modal clashes as well.

I reckon that the variation in how sensitive people are to sound symbolism goes a little like this:

model of ss and cross modal perception

…and we do have some preliminary data from a massive cross-modal perception and synaesthesia study that shows that synaesthetes are better at the Japanese ideophone guessing game than regular people, but that’s another blog for another time.

Uncategorized

Portugal: the worst Euro finalists ever

I haven’t been impressed with Portugal this tournament (and not just because I’m a bitter Welshman). They haven’t been very good; never really impressing in any of their matches, winning without dominating in the knock outs and held to draws by Iceland and Hungary in the group stages.

I’m pretty sure they’re the worst European Championship finalists I’ve seen, and perhaps the worst ever. But how can you measure how underwhelming a team is?

Step forward Elo ratings. Far better than the joke that is the FIFA rankings, Elo ratings adjust after every single international match based on the teams’ previous Elo ratings. For example, before Iceland and England faced off last week, Iceland had an Elo rating of 1688 and England had an Elo rating of 1929. After Iceland beat England, they exchanged 40 points – Iceland’s Elo rating went up 40 points to 1728, and England’s Elo rating went down 40 points to 1889. This is calculated using the status of the match, the number of goals scored, the result, and the expected result (more on that here, and have a browse of some more examples here).

I’ve taken every finalist (not including the winner of France vs. Germany, which kicks off about an hour from the time of writing) of the Euros since 1984, which was the first tournament where there were knock-out matches (before that, there was a group stage round-robin and the top two teams played off in a final). I’ve calculated each finalist’s mean Elo score throughout the tournament – not including the final itself – as well as calculating the mean Elo score of all the teams they played along the way.

That’s plotted right here:

Elo graph annotated

The dot size and colour shows the difference in each finalist’s Elo score from the start of the tournament until after the semifinal;the larger and lighter the dot, the more the team improved on their way to the final.

France in 2000 are probably the best team to reach a final – not only were they an excellent team at the time, they also beat a lot of strong opposition to get there. Spain in 2012 were the strongest team, but the opposition they faced wasn’t as tough. Greece in 2004 were the weakest team, but they beat some really strong teams along the way, which makes their achievement really impressive.

Portugal this year are dropping off the graph on the bottom left. They aren’t a high quality side – barely better than Greece in 2004 or Denmark in 1992 – but they haven’t been beating any impressive opposition either. They’ve not been that good, and they’ve had the weakest path to the final of any Euros finalist.

So, yes, it looks like my hunch was right – Portugal are the most underwhelming team to reach the final of a European Championship ever since there’s been a proper knock-out round. Sorry, Ronnie.

sad Ronaldo

Vizzee Rascal

Gwilym Lockwood's blog about turning numbers into pictures.

Category Archives: Uncategorized

How clock changes work in the UK and what it means for your data processes

What’s the time?

What are time zones?

What happens when the clocks change?

What does this mean for data?

How to update a production file in Excel very carefully

Step 1: don’t even touch it, just make copies

Step 3: xlookups

Step 4: update the production file

SAP: what’s the difference between the merchandise category hierarchy and the article hierarchy?

How to open an open Excel file in Alteryx

The macros

Example 1

Example 2

Example 3

How it works

Hacking redacted documents with statistics: the Ghislaine Maxwell deposition

Synthesised size/sound sound symbolism: the MS Paint version

Rolling form and how Arsenal lost the Premier League title

ERP graph competition!

How iconicity helps people learn new words: the MS Paint version

Portugal: the worst Euro finalists ever