How to open an open Excel file in Alteryx

Before I start this blog, a disclaimer:

This is disgusting. Don’t do it. If you are running a production workflow off a manually maintained Excel file, you have bigger problems than not being able to open an Excel file in Alteryx because somebody else is using it.

With that out of the way… you know how you can’t open an Excel file in Alteryx if you’ve got that file open? Or worse, if it’s an xlsx on a shared drive, and you don’t even have it open, but somebody else does? Yeah, not fun.

The “solution” is to email everybody to ask them to close the file, or to create a copy and read that in, changing all your inputs to reference the copy, then probably forgetting to delete the copy and now you’re running it off the copy and nobody’s updating that so you’re stuck in a cycle of overwriting the copy whenever you remember, and … this is not really a solution.

The ideal thing to do would be to change your process so that whatever’s in the xlsx is in a database. That’d be nice. But if that’s not an option for you, I’ve created a set of macros that will read an open xlsx in Alteryx.

In the first part of this blog, I’ll show you how to use the macros. In the second, I’ll explain how it works. If you’re not interested in the details, the only thing you need to know is this:

  1. They use run command tools (which means your organisation might be a bit weird about you downloading them, but just show them this blog, nobody who uses comic sans could be an evil man).
  2. Those run commands mimic the process of you going to the file explorer, clicking on a file, hitting copy, hitting paste in the same location, reading the data out of that copy, then deleting that copy. Leave no trace, it’s not just for hiking.
  3. This means that command windows will briefly open up while you’re running the workflow. That’s fine, just don’t click anything or hit any keys while that’s happening or you might accidentally interrupt something (I’ve mashed the keys a bit trying to do this deliberately and the worst I’ve done is stop it copying the file or stop it deleting the file, but that’s still annoying. Anyway, you’ve been warned.).

The macros

There are three macros, which you can find here:

1a: Open an open xlsx – sheet names from full path

1b: Open an open xlsx – sheet names from browse

2: Open an open xlsx – read specific sheet

Using these macros is a two-step process. The first step takes the xlsx and reads in the sheet names. There are two options here; you can browse for the xlsx file directly in the macro using option 1b, or you can use a directory tool and feed in the necessary xlsx file using option 1a (this is a few extra steps, but it’s my preferred approach). You can then filter to the sheet(s) you want to open, and feed that into the second step, which takes the sheet(s) and reads in the data.

Example 1

Let’s work through this simple example. I’ve got a file, imaginatively called 201210.test.xlsx, and I can’t read it because somebody else has it open. In these three steps, I can get the file, select the sheet I need, and get the data out of it:

In the first step, I’m using macro 1b: Open an open xlsx – sheet names from browse. The input is incredibly simple – just hit the folder icon and browse for the xlsx you need:

But this one comes with one caveat – for the same reason that you can’t open an xlsx if it’s open, the file browse doesn’t work if the xlsx is open either. You only need to set this macro up once, though – once you’ve selected the file you need, you’re good, and you can run this workflow whenever. You only need to have the file closed when you first bring this macro onto the canvas and select the file through the file browser.

This returns the sheet names:

All you need to do next is filter to the sheet name you need. I want to open Sheet2, so I filter to that and feed it into the next macro, which is macro 2: Open an open xlsx – read specific sheet. This takes three input fields – the full path of the xlsx, the sheet you want to open, and the line you want to import from:

[FullPath] and [Sheet Names] are the names of the fields returned by the first macro, so you shouldn’t even need to change anything here, just whack it in.

That gives you the data in Sheet2!

Example 2

Let’s do the same thing with a directory tool and my preferred approach – using macro 1a: Open an open xlsx – sheet names from full path:

In the first step, I use the directory tool to give me the full paths of all the xlsx files in a particular folder. This is really powerful, because it means I can choose a file in a dynamic fashion – for example, if the manually maintained process has a different xlsx for each month, you can sort by CreationTime from most recent to least recent, then use the select records tool to select the first record. That’ll always feed in the most recently created xlsx into the first macro. In this simple example, I’m just going to filter to the file I want, then deselect the columns I don’t need, which is everything except FullPath:

I can now feed the full path into the macro, like so:

…and the next steps are identical to example 1.

Example 3

If you need to read in multiple sheets, the get open xlsx sheet data macro will do that and union them together, as long as they’re all in the same format. I’ve got another example xlsx, which is unsurprisingly called 201210 test 2.xlsx. I’ve got four sheets, one for each quarter, where the data structure is identical:

The data structure has to be identical, can’t stress that enough

With this, you don’t have to filter to a specific sheet – you can just bung the get open xlsx sheet data macro straight on the output of the first one, like this:

That’s all you have to do! If this is something you’ve been looking for, please download them from the public gallery, test them out, and let me know if there are any bugs. But seriously, please don’t use this as a production process.

How it works

It’s all about the run commands. I’ll walk through macro 1a: open an open xlsx from a full path, but they’re all basically the same. There are four main steps internally:

  1. Creating the file path and the batch commands
  2. Copying the file
  3. Getting data out of the copy
  4. Deleting the copy

The first regex tool just finds the file extension. I’ll use this later to build the new file path:

The formula tool has three calculations in it; one to create a new full path, one to create the copy command, and one to create the delete command.

The NewFullPath calculation takes my existing full path (i.e. the xlsx that’s open) and adds a timestamp and flag to it. It does this by replacing “.xlsx” with “2021-02-12 17.23.31temptodelete1a.xlsx”. It’s ugly, but it’s a useful way of making pretty sure that you create a unique copy so that you’re not going to just call it “- copy.xlsx” and then delete somebody else’s copy.

The Command calculation uses the xcopy command (many thanks to Ian Baldwin, your friendly neighbourhood run-command-in-Alteryx hero, for helping me figure out why I couldn’t get this working). What this does is create a command to feed into the run command tool that says “take this full path and create a copy of it called [NewFullPath]”.

The DeleteCommand calculation uses the delete command. You feed that into the run command tool, and it simply takes the new full path and tells Windows to delete it.

Now that you’ve got the commands, it’s run command time (and block until done, just to make sure it’s happening in the right order).

To copy the file: put a select tool down to deselect everything except the command field. Now set up your run command tool like this:

You want to treat the command like a .bat file. You can do this by setting the write source output to be a .bat file that runs in temp – I just called it CopyFiles.bat. To do this, you’ll need to hit output, then type it in in the configuration window that pops up, and keep the other settings as above too. If you’re just curious how this macro works, you don’t need to change anything here, it’ll run just fine (let me know if it doesn’t).

To get the list of sheet names, the first step is to create a new column with the full path, a pipe separator, then “<List of Sheet Names>” to feed into the dynamic input:

From there, you can use the existing xlsx as the data source template, and use NewFullPathSheetNames from the previous formula as the data source list field, and change the entire file path:

That’ll return the sheet names only:

So the last step is an append tool to bring back the original FullPath. That’s all you need as the output of this step to go into the sheet read macro (and the only real difference in that macro is that the formula tool and dynamic input tool is set up to read the data out of the sheets rather than get a list of sheet names).

Finally, to delete the copy, deselect everything except the DeleteCommand field, and set it up in the same way as the copy run command tool earlier:

And that’s about it. I hope this explanation is full enough for you to muck about with run command tools with a little confidence. I really like xcopy functions as a way of mass-shifting things around, it’s a powerful way of doing things. But like actual power tools, it can be dangerous – be careful when deleting things, because, well, that can go extremely badly for you.

And finally, none of this is good. This is like the data equivalent of me having a leaky roof, putting a bucket on the floor, and then coming up with an automatic way of switching out the full bucket for an empty one before it overflows. What I should actually do is fix the roof. If you’re using these macros, I highly recommend changing your underlying data management practices/processes instead!


Hacking redacted documents with statistics: the Ghislaine Maxwell deposition

The original Slate article introduces this better than I can, because they’re actual journalists and I’m just a data botherer:

On Thursday morning, a federal court released a 2016 deposition given by Ghislaine Maxwell, the 58-year-old British woman charged by the federal government with enticing underage girls to have sex with Jeffrey Epstein. That deposition, which Maxwell has fought to withhold, was given as part of a defamation suit brought by Virginia Roberts Giuffre, who alleges that she was lured to become Epstein’s sex slave. That defamation suit was settled in 2017. Epstein died by suicide in 2019.

In the deposition, Maxwell was pressed to answer questions about the many famous men in Epstein’s orbit, among them Bill Clinton, Alan Dershowitz, and Prince Andrew. In the document that was released on Thursday, those names and others appear under black bars. According to the Miami Herald, which sued for this and other documents to be released, the deposition was released only after “days of wrangling over redactions.”

Slate: “We Cracked the Redactions in the Ghislaine Maxwell Deposition”

It’s some grim shit. I haven’t been following the story that closely, and I don’t particularly want to read all 400-odd pages of testimony.

But this bit caught my eye:

It turns out, though, that those redactions are possible to crack. That’s because the deposition—which you can read in full here—includes a complete alphabetized index of the redacted and unredacted words that appear in the document.

This is … not exactly redacted. It looks pretty redacted in the text itself:

Above: Page 231 of the depositions with black bars redacting some names.

But the index helpfully lists out all the redacted words. With the original word lengths intact. In alphabetical order. You don’t need any sophisticated statistical methods to see that many of the redacted black bars on page 231 concern a short word which begins with the letter A, and is followed by either the letter I or the letter L:

Above: The index of the deposition, helpfully listing out all the redacted words alphabetically anbd referencing them to individual lines on individual pages.

It also doesn’t take much effort to scroll through the rest of the index, and notice that another short word beginning with the letters GO occurs in exactly the same place:

Above: I wish all metadata was this good.

And once you’ve put those two things together, it’s not a huge leap to figure out that this is probably about Al Gore.

When I first read the Slate article on Friday 23rd October 2020, around 8am UK time, Slate had already listed out a few names they’d figured out by manually going through the index and piecing together coöccurring words. And that reminded me of market basket analysis, one of my favourite statistical processes. I love it because you can figure out where things occur together at scale, and I love it because conceptually it’s not even that hard, it’s just fractions.

Market basket analysis is normally used for retail data to look at what people buy together, like burgers and burger buns, and what people don’t tend to buy together, like veggie sausages and bacon. But since it’s basically just looking at what happens together, you can apply it to all sorts of use cases outside supermarkets.

In this case, our shopping basket is the individual page of the deposition, and our items are the redacted words. If two individual redacted words occur together on the same page(s) more than you’d expect by chance, then those two words are probably a first name and a surname. And if we can figure out the first letter(s) of the words from their positions in the index, we’ve got the initials of our redacted people.

For example, let’s take the two words which are probably Al Gore, A1 and GO1. If the redacted word A1 appears on 2% of pages, and if the redacted word GO1 appears on 3% of pages, then if there’s no relationship between the two words, you’d expect A1 and GO1 to appear together on 3% of 2% of pages, i.e. on 0.06% of pages. But if those two words appear together on 1% of pages, that’s ~16x more often than you’d expect by chance, suggesting that there’s a relationship between the words there.

So, I opened up the Maxwell deposition pdf, which you can find here, and spent a happy Friday evening going through scanned pdf pages (which are the worst, even worse than the .xls UK COVID-19 test and trace debacle, please never store your data like this, thank you) and turning it into something usable. Like basically every data project I’ve ever worked on, about 90% of my time was spent getting the data into a form that I could actually use.

oh god why

Working through it…

How data should look.

And now we’re ready for some stats. I used Alteryx to look at possible one-to-one association rules between redacted words. Since I don’t know the actual order of the redacted words, there are two possible orders for any two words: Word1 Word2 and Word2 Word1. For example, the name that’s almost definitely Al Gore is represented by the two words “A1” and “GO1” in my data. If there’s a high lift between those two words, that tells me it’s likely to be a name, but I’m not sure if that name is “A1 GO1” or “GO1 A1”.

After running the market basket analysis and sorting by lift, I get these results. Luckily, Slate have already identified a load of names, so I’m reasonably confident that this approach works:

A list of initials and names, in Excel this time to make it more human-readable.

Like I said, I haven’t been following this story that closely, and I’m not close enough to be able to take a guess at the names. But I’m definitely intrigued by the top one – Slate haven’t cracked it as of the time of writing. The numbers suggest that there’s somebody called either Je___ Ja___ or Ja___ Je___ who’s being talked about here:

Je___ Ja___ or Ja___ Je___?

I don’t particularly want to speculate on who’s involved in this. It’s a nasty business and I’d prefer to stay out of it. But there are a few things that this document illustrates perfectly:

  1. It’s not really redacted if you’ve still got the indexing, come on, seriously
  2. Even fairly simple statistical procedures can be really useful
  3. Different fields should look at the statistical approaches used in other fields more often – it really frustrates me that I almost never see any applications of market basket analysis outside retail data
  4. Please never store any data in a pdf if you want people to be able to use that data

Synthesised size/sound sound symbolism: the MS Paint version

Earlier today, I presented my CogSci paper Synthesised size/sound sound symbolism. The six page paper is short and to the point, but hey, it technically counts as a publication, so I figure it’d be remiss of me not to render it badly in MS Paint.

We start with the Japanese ideophone learning study. Participants learned some Japanese ideophones, half with their real Dutch translations, half with their opposite translations, and it turned out that they were way better at remembering the ideophones with their real translations. Importantly, this didn’t happen when we did exactly the same thing with regular, arbitrary adjectives. You can read that paper here, and download some of the experiment materials for it here if you want to try it yourself. You can also read the replication study we did with EEG here, and download the data and analysis scripts for that here.

The results looked a bit like this:


But while we can see that there is an obvious effect, we don’t know how it works. Is it a learning boost that people get from the cross-modal correspondences in the real condition?

match boost lerning

Or is it a learning hindrance that people get from the cross-modal clashes in the opposite condition?

mismatch hindrance learning

(or indeed, is it both?)

Without adding a neutral condition where the words neither obviously match nor mismatch their meanings, we don’t really know.

So, we created some synthesised size/sound sound-symbolic pseudowords, which is easier said than done. It’s well known that people associate voiced consonants and low, back vowels with large size, and voiceless consonants and front, high vowels with small size. This is probably because of the mouth shape you make when saying those sounds:

a and i

We created big-sounding words (like badobado), small-sounding words (like kitikiti), and neutral-sounding words which were halfway in between (like depedepe).

A neutral condition could tell us if it’s a graded effect…

graded effect.png

a match boost effect…

match boost.png

or a mismatch hindrance effect:

mismatch hindrance.png

Turns out it’s a match boost effect. Participants learned the match pseudowords (e.g. badobado and big, kitikiti and small) better than the neutral pseudowords (e.g. kedekede and big, depedepe and small), but there wasn’t a difference in how well they learned the neutral and mismatch pseudowords (e.g. godagoda and small, tikutiku and big).

results n=30

For good measure, we did it again with double the original sample size, because it’s nice to check things.

first experiment and replication blog picture

…and, yes, we found exactly the same thing:

results n=60

So, it looks like it’s a special match boost effect from the cross-modal correspondences, not a graded effect reflecting all cross-modal information.

This is a nice paradigm which can be easily altered to try with different languages or different stimuli, e.g. using big and small shapes rather than the words “big” and “small” in order to rule out letter confounds. I’ll put up all the Presentation scripts, synthesised stimuli, and analysis scripts once I’ve tidied them up a lot (if you think my MS Paint scribbles are messy, you should see my code). It should be pretty straightforward for anybody to download and redo this, so it’d make a good project for a Bachelors/Masters intern. I’d love to see this get taken further and tried out with useful variations and changes. But good luck coming up with a more satisfying title!


Rolling form and how Arsenal lost the Premier League title

Is form taken seriously in the media discussion of football? Every game preview acknowledges the results from a team’s last five games, often with green, grey, and red blobs so you can see how much they’ve won, drawn, and lost recently… but it’s rarely quantified beyond that. This is kind of understandable, since you get three points for a win regardless of what form you or your opposition have been in, but I feel like a bit more of a deep dive into form data is needed.

I’ve been scraping a lot of game data from last season’s Premier League with the ultimate goal of making an adjusted table for rolling form. Which teams beat form teams most often? Which teams benefited most from playing teams at a low ebb more often than other teams? Is rolling form a better predictor of a match winner than a team’s position?

This has thrown up an absolute glut of data that can (and will) fill several blog posts. For now, though, I’d like to focus on Arsenal’s woes last season.

One of the main media narratives around Arsenal is that they’re flat track bullies; they swat aside mediocre teams with ease, looking fluent and impressive while doing so, but they come unstuck against the top teams. These criticisms extend to the individual players, with Olivier Giroud and Per Mertesacker often held up as examples of almost-but-not-quite players; good enough to beat most teams, but not in the elite level that would take Arsenal past the Chelseas, the Manchester Uniteds, the Manchester Citys (despite finishing above all of them last season, but whatever).

Thing is, that’s completely untrue.

Here’s a graph of Arsenal’s wins, losses, and draws last season. The dots and lines represent each game, and are positioned on the y-axis according to Arsenal’s and their opposition’s form over the last five games (in mean points per game) just before each game. This excludes the first five matches of the season, which are always a bit dicey and rarely reflect the final standings.

Arsenal form plot.png

What strikes me most clearly is that Arsenal lost all their games to teams in average to poor form. Indeed, three losses were against teams who’d been averaging less than a point per game, which, if applied to a whole season, would set them out as clear relegation candidates. Moreover, around half these losses have come when Arsenal have been in a good run of form themselves, averaging two or more points per game (which, if applied to a whole season, would have seen them score 76 points, five more than they actually managed). Meanwhile, a lot of Arsenal’s wins actually came against teams who were doing well, averaging 1.5 points per game or more. This suggests that Arsenal were winning the difficult games and losing the easy games.

This isn’t much use on its own, so let’s look at Leicester and Tottenham for comparison.

Not only did Leicester lose less often – obviously – their losses never came against teams averaging less than one point per game. The same goes for Tottenham, who lost their games to teams in a fairly decent run of form.

We can also look at how Arsenal did according to their league position and the league position of their opposition just before the game.

Arsenal position plot.png

This plot is also pretty illuminating. The bulk of Arsenal’s losses came against teams in the bottom half of the table, while a lot of their record against teams in the top five when they played them was actually pretty good, winning four, drawing three, and losing only one. There’s also a nice split between their wins and draws which shows that Arsenal generally beat the lower table teams and generally drew with the high mid-table teams when they played them.

Again, let’s compare that with Leicester and Tottenham.

Leicester only lost to teams in the top half of the table. Tottenham lost to teams in the relegation zone twice, but the rolling form plot showed that these relegation zone teams had been doing pretty well at the time.

To get a look at the whole league, we can plot the mean rolling form and the mean league position for the opposition for each team in losses. That’s a confusing sentence; another way of phrasing it is saying that this is looking at the average form and average league position of the teams that each team lost to last season.

form plot for losses plus lm.png

position plot for losses plus lm.png

Both these graphs show a slight relationship between how well a team did overall and the nature of their losses – the better performing teams tended to lose to better opposition, i.e., by losing to teams with higher points per game in the last five games and teams who were higher in the table at the time.

All except Arsenal. In fact, Arsenal were the worst team least season in terms of losing to teams in poor form. The teams that Arsenal lost to were in the worst average form and in the lowest average position compared with losses by any other team in the league.

In short, this contradicts the main narrative of Arsenal not being good enough against the top teams. Rather, Arsenal aren’t good enough against the bad teams, and lost out on the league by losing to teams in poor form in the relegation zone. Next time Arsenal play a poorly-performing bottom-table team, maybe a Hull or a Middlesborough in a bit of a rut, I’ll stick a tenner on Arsenal to lose; they’re only a little less likely to lose than they are to win.

wenger water bottle.gif


ERP graph competition!

A while back, I blogged about creating better ERP graphs. The data I used to generate those graphs has now been published, and all the data and analysis scripts from that paper are available to download here.

One of the things I find most fun about research is playing about with plotting graphs. There are so many different ways to visualise ERP data that it’s hard to pick one. I played around with a few different styles for my Collabra paper, and settled for plotting the two conditions with 95% confidence intervals. I was also tempted to plot the individual ERPs, as shown in the second plot, but felt the CIs were cleaner and more useful.

However, I also liked Guillaume’s approach of plotting the grand average difference wave and the difference waves for individual participants. I’ve done it with and without 95% CIs in plots 3 and 4. The difference in those two is that the green/orange distinction shows significance; the 320-784ms window was significant, the rest of the epoch wasn’t.

But I’d love to see how these can be improved! All my graphs and all the code needed are below. You can download eegdata.txt from my OSF page, but be careful not to click on the file name itself, otherwise your browser will probably freeze:

osf download instructions

After that, load it into R, and have at it. The only extra packages I use to make the graphs in this blog are ggplot2, ggthemes, tidyr, and dplyr.

If you think you can do better, email me your graphs and code to gwilym(dot)lockwood(at)mpi(dot)nl. I’ll post another blog in a few weeks with my favourite contributions… and I’ll buy the winner a beer 🙂


Parietal electrodes, title, all trials, lines and 95pc, capitals in legend (29-3-16 submission)

eegdata$condition <- gsub("real", "Real", eegdata$condition)
eegdata$condition <- gsub("opposite", "Opposite", eegdata$condition)
# create dataframe by measuring across participants
dfsmallarea <- filter(eegdata, smallarea == "parietal")
dfsmallarea <- aggregate(measurement ~ smallarea*time*condition, dfsmallarea,mean)
# work it out for all trials
parietaldf <- filter(eegdata, smallarea == "parietal")
parietaldf <- aggregate(measurement ~ time*participant*condition*electrode, parietaldf,mean)
std <- function(x)sd(x)/sqrt(length(x))
# i.e. std is a function to give you the standard error of the mean
# this is the standard deviation of a sample divided by the square root of the sample size
SD <- rep(NA,length(dfsmallarea$time))       # creates empty vector for standard deviation at each time point, which will be huge
SE <- rep(NA,length(dfsmallarea$time))       # creates empty vector for standard error at each time point
CIupper <- rep(NA,length(dfsmallarea$time))  # creates empty vector for upper 95% confidence limit at each time point
CIlower <- rep(NA,length(dfsmallarea$time))  # creates empty vector for lower 95% confidence limit at each time point
for (i in 1:length(dfsmallarea$time)){
  something <- subset(parietaldf,time==dfsmallarea$time[i] & condition==dfsmallarea$condition[i], select=measurement)
  SD[i] = sd(something$measurement)
  SE[i] = std(something$measurement)
  CIupper[i] = dfsmallarea$measurement[i] + (SE[i] * 1.96)
  CIlower[i] = dfsmallarea$measurement[i] - (SE[i] * 1.96)
dfsmallarea$CIL <- CIlower
dfsmallarea$CIU <- CIupper
# Now let's plot things, starting with all trials

colours <- c("#D55E00", "#009E73")  # colourblind friendly - red/orange for opposite, green for real
dfsmallarea$time <- as.integer(as.character(dfsmallarea$time))plot <- ggplot(dfsmallarea, aes(x=time, y=measurement, color=condition)) + 
  geom_line(size=1, alpha = 1)+
  scale_linetype_manual(values=c(1,1) )+  #, guide=FALSE)+
  scale_y_continuous(limits=c(-7, 7), breaks=seq(-7,7,by=1))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - all trials")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_bw() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+ theme(plot.title=  element_text( face="bold"), axis.text = element_text(size=8)) +
  geom_smooth(aes(ymin = CIL, ymax = CIU, fill=condition), stat="identity", alpha = 0.3) + #95% CIs
  scale_fill_manual(values=colours)+ #, guide=FALSE) +
  scale_colour_manual(values=colours) #, guide=FALSE)

Created by Pretty R at inside-R.org


Parietal electrodes, all participants at once, all data (size 0.3, alpha 0.3) (lighter)

temp <- filter(eegdata, smallarea == "parietal")
temp <- select(temp, smallarea,time, measurement, condition, participant)
temp <- aggregate(measurement ~ smallarea*time*condition*participant, temp,mean)
temp2 <- aggregate(measurement ~ smallarea*time*condition, temp,mean)
temp2$participant <- "Grand Average"
temp2$GA <- "Grand Average"
temp$GA <- "individuals"
temp2 <- select(temp2, smallarea,time, measurement, condition, participant, GA)
temp3 <- rbind(temp, temp2)
temp3$conbypar = paste(temp3$condition, temp3$participant, sep="")
temp3$time <- as.integer(as.character(temp3$time))plot <- ggplot(temp3, aes(x=time, y=measurement,group=conbypar)) + 
  geom_line(size=0.3, alpha=0.3, aes(colour=condition))+
  geom_line(data = subset(temp3, GA == "Grand Average"), size=1.5, alpha=1, aes(colour=condition)) +
  scale_colour_manual(values=colours, guide=FALSE)+
  scale_y_continuous(limits=c(-15, 15), breaks=seq(-15,15,by=3))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - individual participants for all trials")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_few() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+   theme(plot.title=  element_text( face="bold"),
              axis.text = element_text(size=8))

Created by Pretty R at inside-R.org


Parietal electrodes, diffwave and individual diffwaves

temp <- filter(eegdata, smallarea == "parietal")
temp <- select(temp, smallarea,time, measurement, condition, participant)
temp <- aggregate(measurement ~ smallarea*time*condition*participant, temp,mean)
temp2 <- aggregate(measurement ~ smallarea*time*condition, temp,mean)
temp2$participant <- "Grand Average"
temp2$GA <- "Grand Average"
temp$GA <- "individuals"
temp2 <- select(temp2, smallarea,time, measurement, condition, participant, GA)
temp3 <- rbind(temp, temp2)
temp3$conbypar = paste(temp3$condition, temp3$participant, sep="")
temp4 <- spread(temp2, condition, measurement) # unmelt to create diffwave calculation
temp5 <- spread(temp, condition, measurement) 
temp4$diffwave <- temp4$real - temp4$opposite
temp5$diffwave <- temp5$real - temp5$opposite
temp6 <- rbind(temp4, temp5)
temp6$conbypar = paste(temp6$condition, temp6$participant, sep="")
temp6$sig <- ifelse(temp6$time %in% c(320:784), "yes", "no")
temp6$time <- as.integer(as.character(temp6$time))
sigcolours <- c("#D55E00", "#D55E00", "#009E73")plot <- ggplot(temp6, aes(x=time, y=diffwave,group=conbypar)) + 
  geom_line(size=0.3, alpha=0.3)+
  geom_line(data = subset(temp6, GA == "Grand Average"), size=2, alpha=1, aes(colour=sig)) +
  scale_colour_manual(values=sigcolours, guide=FALSE)+
  scale_y_continuous(limits=c(-16, 16), breaks=seq(-16,16,by=2))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - difference wave and individual difference waves")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_few() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+   theme(plot.title=  element_text( face="bold"),
              axis.text = element_text(size=8))

Created by Pretty R at inside-R.org


Parietal electrodes, diffwave and individual diffwaves plus CIs

for (i in 1:length(temp6$time)){
  something <- subset(temp6,time==temp6$time[i], select=diffwave)
  SD[i] = sd(something$diffwave)
  SE[i] = std(something$diffwave)
  SDupper[i] = temp6$diffwave[i] + SD[i]
  SDlower[i] = temp6$diffwave[i] - SD[i]
  SEupper[i] = temp6$diffwave[i] + SE[i]
  SElower[i] = temp6$diffwave[i] - SE[i]
  CIupper[i] = temp6$diffwave[i] + (SE[i] * 1.96)
  CIlower[i] = temp6$diffwave[i] - (SE[i] * 1.96)
temp6$CIL <- CIlower
temp6$CIU <- CIupper
temp6$sig <- ifelse(temp6$time %in% c(320:784), "yes",ifelse(temp6$time %in% c(-200:318), "no1",
temp6$time <- as.integer(as.character(temp6$time))
sigcolours <- c("#D55E00", "#D55E00", "#009E73")plot <- ggplot(temp6, aes(x=time, y=diffwave,group=conbypar)) + 
  geom_line(size=0.3, alpha=0.3)+
  geom_line(data = subset(temp6, GA == "Grand Average"), size=2, alpha=1, aes(colour=sig)) +
  scale_colour_manual(values=sigcolours, guide=FALSE)+
  scale_y_continuous(limits=c(-16, 16), breaks=seq(-16,16,by=2))+ 
  scale_x_continuous(limits=c(-200,1000),breaks=seq(-200,1000,by=100),labels=c("-200", "-100","0","100","200","300","400","500","600","700","800","900","1000"))+
  ggtitle("Parietal electrodes - difference wave and individual difference waves")+
  ylab("Amplitude (µV)")+
  xlab("Time (ms)")+
  theme_few() +
  geom_vline(xintercept=0) +
  geom_hline(yintercept=0)plot+   theme(plot.title=  element_text( face="bold"),
              axis.text = element_text(size=8))plot+ theme(plot.title=  element_text( face="bold"), axis.text = element_text(size=8)) +
  geom_smooth(data = subset(temp6, GA == "Grand Average"), linetype=0, aes(ymin = CIL, ymax = CIU,group=sig, fill=sig), stat="identity", alpha = 0.5) + #95% CIs
  scale_fill_manual(values=sigcolours, guide=FALSE)

Created by Pretty R at inside-R.org


How iconicity helps people learn new words: the MS Paint version

I have a new paper out!

Gwilym Lockwood, Peter Hagoort, and Mark Dingemanse. “How Iconicity Helps People Learn New Words: Neural Correlates and Individual Differences in Sound-Symbolic Bootstrapping.” Collabra 2, no. 1 (July 6, 2016). doi:10.1525/collabra.42.

The paper can be read and downloaded right here:

…and because we’re doing the whole open thing properly, you can also download the original stimuli, data, and analysis scripts here. You probably have no intention of sifting through it all, but the point is that you can:

The experiment was pretty similar to my Sound symbolism boosts novel word learning paper from a few months ago, except that this time it was only with the ideophones, and I used EEG to measure participants’ brain activity while they learned them. People learned 19 ideophones with their real Dutch translations and 19 ideophones with their opposite Dutch translations. After I told them about that and said sorry for the deception, they heard all the ideophones again and had to guess what the real translation was from a choice of two antonyms.

The first important thing is that the results were almost identical. People got the answers right 86.7% of the time for the ideophones in the real condition, and 71.3% of the time for the ideophones in the opposite condition. When they had to guess, they got  it right 73% of the time. These figures replicate the first study very closely (86.7% to 86.1%, 71.3% to 71.1%, and 73% to 72.3%), which is excellent news.

accuracy for JEPLMC and Collabra experiments - jitterdots (0-100) 9by7 for APS presentation

All kinds of things can happen in scientific studies, so replicating a study is really important for showing that the effect is real and not just coincidental. Sadly, replications aren’t considered to be very glamorous, so a shocking amount of published science is either unreplicated or unreplicable:

first experiment and replication blog picture

The second part of this new paper is that I also measured people’s brain activity using EEG. Once you average all the trials together, you get a signal of changing activity over time in response to a thing, which is known as an event-related potential. It looks a bit like this, and isn’t that useful by itself:

erp picture single line

Instead, you have to compare two conditions. If they differ at a certain point, that’s what tells you about how the brain processes things:

erp picture two lines

In this experiment, I found a big P3 effect in the test round:

figure 8 - Parietal electrodes, title, all trials, lines and 95pc, capitals in legend (29-3-16 submission)

The P3 is linked to memory and learning, so it’s not surprising that it came up in a task involving memory and learning. A lower P3 is linked to things being more difficult to learn, so again, it’s not surprising that the ideophones in the opposite condition have a lower P3 when they were harder to learn.

But, it wasn’t that simple. If it was a straight up learning effect, you would expect a correlation between how well people did in a condition and that condition’s P3 amplitude; people with lower test round scores in the opposite condition should have a lower P3 amplitude in the opposite condition. But they don’t.

However, there was a correlation between how sensitive participants were to sound symbolism (as measured by their meaning guessing accuracy in the task after the test) and how big the ERP difference between conditions was. When I split the participants into two groups (people scoring above and below the 73% average in the guessing task) and plotted their ERPs separately, it turns out that the P3 effect is big for the people who are more sensitive to sound symbolism and barely there at all for the people who are less sensitive to sound symbolism:

het talige brein picture

You can also see that the P3 amplitude in the real condition was the same across both groups. What’s behind the difference is how the amplitude in the opposite condition changes. This suggests that most people can recognise cross-modal correspondences and exploit them in word learning, but that some people are more sensitive to sound symbolism and get put off by cross-modal clashes as well.

I reckon that the variation in how sensitive people are to sound symbolism goes a little like this:

model of ss and cross modal perception

…and we do have some preliminary data from a massive cross-modal perception and synaesthesia study that shows that synaesthetes are better at the Japanese ideophone guessing game than regular people, but that’s another blog for another time.


Portugal: the worst Euro finalists ever

I haven’t been impressed with Portugal this tournament (and not just because I’m a bitter Welshman). They haven’t been very good; never really impressing in any of their matches, winning without dominating in the knock outs and held to draws by Iceland and Hungary in the group stages.

I’m pretty sure they’re the worst European Championship finalists I’ve seen, and perhaps the worst ever. But how can you measure how underwhelming a team is?

Step forward Elo ratings. Far better than the joke that is the FIFA rankings, Elo ratings adjust after every single international match based on the teams’ previous Elo ratings. For example, before Iceland and England faced off last week, Iceland had an Elo rating of 1688 and England had an Elo rating of 1929. After Iceland beat England, they exchanged 40 points – Iceland’s Elo rating went up 40 points to 1728, and England’s Elo rating went down 40 points to 1889. This is calculated using the status of the match, the number of goals scored, the result, and the expected result (more on that here, and have a browse of some more examples here).

I’ve taken every finalist (not including the winner of France vs. Germany, which kicks off about an hour from the time of writing) of the Euros since 1984, which was the first tournament where there were knock-out matches (before that, there was a group stage round-robin and the top two teams played off in a final). I’ve calculated each finalist’s mean Elo score throughout the tournament – not including the final itself – as well as calculating the mean Elo score of all the teams they played along the way.

That’s plotted right here:

Elo graph annotated

The dot size and colour shows the difference in each finalist’s Elo score from the start of the tournament until after the semifinal;the larger and lighter the dot, the more the team improved on their way to the final.

France in 2000 are probably the best team to reach a final – not only were they an excellent team at the time, they also beat a lot of strong opposition to get there. Spain in 2012 were the strongest team, but the opposition they faced wasn’t as tough. Greece in 2004 were the weakest team, but they beat some really strong teams along the way, which makes their achievement really impressive.

Portugal this year are dropping off the graph on the bottom left. They aren’t a high quality side – barely better than Greece in 2004 or Denmark in 1992 – but they haven’t been beating any impressive opposition either. They’ve not been that good, and they’ve had the weakest path to the final of any Euros finalist.

So, yes, it looks like my hunch was right – Portugal are the most underwhelming team to reach the final of a European Championship ever since there’s been a proper knock-out round. Sorry, Ronnie.

sad Ronaldo


Visualising competition in the Premier League and La Liga

Apologies, it’s taken me far too long to get round to writing this.

A little while ago, I set up a quiz to see whether people could accurately guess whether a spatially-plotted league table was from the Premier League or La Liga. The idea was to see if the (British) stereotype of the Premier League being competitive and La Liga being Barcelona and Real Madrid plus eighteen others was born out by the league tables. I did this because I’m pretty bored of reading comments like this:

“Right, yeah, English teams are bobbins in the Champions League because the Premier League might not have the highest quality, but it’s the best league in the world because of the competition and how close all the teams are. You never know who’s going to win it! It’s not like in Spain, where it’s always going to be Barcelona or Real Madrid winning it by a mile, and all the other teams don’t even matter.”

According to this stereotype, a spatial dotplot of the Premier League would look like this:

premier league stereotype

…while a spatial dotplot of La Liga would look like this:

la liga stereotype

So, I did what any reasonable person would do; I scraped a load of data, visualised the league tables as dotplots in random order, and created an online quiz to see if people could guess which league was which. If the differences between the two leagues are obvious, it should be easy enough to tell which league is which from a dotplot. The rest of this blog is basically a love letter to ggplot2.

People saw graphs like this, and simply had to say whether it was La Liga – Premier League or Premier League – La Liga:

League positions no lines no leagues 2008 classic

In this case, La Liga is on the left, and the Premier League is on the right. It’s the 2008-09 season, when Barcelona finished nine points ahead of Real Madrid and Manchester United finished four points ahead of Liverpool. The position of the league on the x-axis was shuffled, so that each league was on the left side and the right side five times each, and the questions were presented to people in a random order.

172 people did the quiz, and scored an average of 62%.

The most well-answered question was the 2011-12 season, where 74% of people correctly answered that La Liga is on the left and the Premier League is on the right:

League positions no lines leagues 2011 classic

…and the least well-answered question was the 2006-07 season, where only 35% of people correctly answered that La Liga is on the left and the Premier League is on the right:

League positions no lines leagues 2006 classic

This wasn’t a perfect psychology experiment. People doing the quiz already knew what it was about and were aware that I was looking into the stereotype, but it is a useful demonstration that it’s not easy to guess which league is which. Perhaps the stereotype isn’t entirely accurate. Indeed, if we take the ten seasons in the quiz plus the season just finished and do some descriptive statistics, the mean number of points in the Premier League is 52.1 and the mean number of points in La Liga is 52.4. while the standard deviation of points is 17.0 in the Premier League and 16.3 in La Liga. This would suggest that there’s no real difference in how widely the points are spread across the teams.

However, there might be a general trend when we look at all of them put together. If we take the ten seasons in the quiz, plus the season just finished, and plot the number of points per position per season, we get this graph:

League positions no lines leagues last 11 years classic

This actually does suggest that there may be something in the stereotype of the top teams in La Liga pulling away from the rest of the league. The dots at the top of the league are stretched higher than they are in the Premier League.

Similarly, we can look at the mean number of points per position in the last 11 seasons:

Mean league points per position no lines leagues last 11 years classic

…and this also seems to show that the top teams in La Liga are spread out a bit further than the top teams in the Premier League.

Sure enough, when we look at the stats for the top four teams, it shows a bit more spread; the mean number of points for the top four teams is very similar – 78.4 in the Premier League, 78.1 in La Liga – but the standard deviation is 8.2 in the Premier League and 12.1 in La Liga.

One last thing we can look at plotting is goal difference. Here’s the goal difference per position for each of the last 11 seasons (a goal difference of 0 is shown by the horizontal black line):

GD no lines leagues last 11 years classic

Again, the top teams in Spain seem to be more spread out compared to the rest of the league than the top teams in England are. In fact, when we average it together for the last 11 seasons, we get this:

Mean gd per position no lines leagues last 11 years classic

…which shows that, on average, the second-placed team in La Liga generally has a better goal difference than the winner of the Premier League (57.6 to 51.4), but that the third-placed team in La Liga generally has the same goal difference as the fourth-placed team in the Premier League (30.8 and 30.2 respectively).

How about combining these graphs, with points up the y-axis and goal difference denoted by dot size?

exaggerated mean pts and gd per position no lines leagues last 11 years classic

It seems that, on average, there may be some truth in the stereotype. The top two teams in La Liga dominate their competitors, both in terms of points and goal difference, whereas there seems to be less separating the top teams in the Premier League.

…and just for bonus points, let’s do the same thing for the top five European leagues over the last 11 seasons. This time it’s done by points per game rather than points, as the Bundesliga only has 18 teams:

exaggerated mean pts per game and gd per position no lines all five leagues last 11 years classic

It looks like the Premiership does have the most competitive title race after all.


Dutch lessons for Brexit.

I don’t post political things. Not because I’m 26 and apathetic like much of my generation, but because most political shit I see is virtue-signalling rather than actual discussion, and I don’t want to contribute to that.

I’m making an exception today.

I’m British and I voted for the UK to stay in the EU. This wasn’t a difficult decision for me; while I do agree with some of the arguments about the problems of the EU, I don’t think that leaving the EU solves any of those problems. Thing is, though, this vote was never really about EU membership. It’s a domestic British revolution expressed through the wrong international forum.

It’s a strange experience to watch this develop from somewhere else. I live in the Netherlands, where everything is continuing as normal and I can’t get a sense of what the atmosphere in the UK is like. I’ve had a few pro-EU friends from home say things like “this fucking country, I’m so jealous you get to miss out on this” or “I bet you’re glad you’re outside the UK right now!”.

I’m not. I might be the kind of middle-class multilingual metropolitan lefty liberal prick that the anger of a lot of Brexit voters is directed against, but the UK is still my country, my culture, my people, and I hate seeing it tearing itself apart. I wish I was back there now. I think that people who say “I don’t like X, I’m moving to Y”, whether it’s Brits upping sticks because of Brexit, or Americans moving to Canada because of Trump, is sort of cowardly.

And funnily enough, this is something that living in the Netherlands for three years has taught me.

British people are fucking weird about being British. Vast swaths of the country are embarrassed about being British, and English doubly so. It’s not new either. George Orwell wrote about it in the 1940s:

England is perhaps the only great country whose intellectuals are ashamed of their own nationality. In left-wing circles it is always felt that there is something slightly disgraceful in being an Englishman and that it is a duty to snigger at every English institution, from horse racing to suet puddings. It is a strange fact, but it is unquestionably true that almost any English intellectual would feel more ashamed of standing to attention during “God Save the King” than of stealing from a poor box.

The longer I spend in the Netherlands, the more I realise how ridiculous this is. I’ve seen a lot of people writing a lot of things recently about how patriotism leads to division, how flying the British flag is the start of a slippery slope that leads to white people shouting at brown people to fuck off home. Three years ago, I’d have agreed, but it’s arse about face and it’s taken me moving to the Netherlands to see that.

We need small acts of patriotism, of banal nationalism, because it’s the small overt acts of patriotism that allow a country to positively define itself. More than that, the small communal acts of patriotism are actually tools for integration and inclusivity, not against; it’s a signal that says “hey, this is who we are, this is what we do, and taking part will make you welcomed and accepted”.

The Dutch get this. They’re a country comfortable in their own skin, or at least, far more so than the UK. On Koningsdag, the Dutch dress up in orange, sell second-hand shit in parks, and get drunk in an overt celebration of Dutchness in its boozy, thrifty, orange glory. Through British lenses, this should naturally develop into Dutch people laying out markers of what isn’t Dutch and excluding anybody who doesn’t fit, but it’s not, it doesn’t work like that. Small acts of patriotism through which Dutch identity and nationality is positively expressed allows people to integrate better. I stick on my two-stripe 1974 shirt and get drunk with Dutch people, and in doing so, this is me, an immigrant on benefits who doesn’t speak the language that well, taking the chance to show that I want to be part of this place, and in doing so, this is the Dutch accepting me. That’s not to say the Dutch don’t have their problems… but I’ve never seen a British Het land van.

Without this, how can an immigrant – especially an immigrant from a much more culturally distant background – integrate into the UK if we’re too embarrassed to show what being British actually means? If you’ve moved to a different country, you’re probably there because you want to be. Without a template for integration, a forum for showing your enthusiasm to be here, these tensions feel inevitable.

And it’s more than immigration. This atmosphere, this feeling, this communal gezelligheid of Dutchness, creates an atmosphere where Dutch people care about Dutch people more. I don’t think it’s a coincidence that the Netherlands has some of the lowest economic inequality in Europe. Meanwhile, think of the message that this embarrassment sends to a lot of British people whose lives are pretty shit – not only are we doing nothing to address your problems, we’re telling you that your national identity, which is pretty much all you have to hold on to, is something to be ashamed of.

I have a horrible feeling that we’re going to waste this crisis. Because it’s not just a crisis, it’s an opportunity, an opportunity for the real, far-reaching reform that we so desperately need. Many people who voted to leave the EU say that this is a fundamental issue of democracy. Brilliant. Let’s take that enthusiasm and establish a 21st century proportional voting system (fuck, I’d settle for a 20th century voting system, it’s still more up to date), and let’s get rid of the House of Lords. Watching everything unfold after this referendum, the fact that four million people voted UKIP last year and got one MP was a deafening sign of things to come. Many people who voted to leave the EU say that this is about reinvestment in our own people rather than supporting other countries. Not sure I agree, but fine, let’s fucking do that, let’s take this imaginary £350m a week and renationalise public services, build affordable council housing, extend the NHS, support British industry, and decentralise everything from London so that the rest of the country actually has stuff to do.

This is Caroline Lucas’ time. If she can run with the Green party’s current policies and acknowledge and celebrate British (and especially English) patriotism, then we might just make something of this.

Sadly, nothing will happen if we can’t work out who we are. And if we don’t, then the UK really won’t be anything to be proud of.