Alteryx, data visualisation, Tableau

The Gaslight Analysis: when sentiment analysis doesn’t quite work.

I love sentiment analysis. It’s a great way of getting fascinating insights from a glut of text data. You can take a load of Yelp reviews, figure out how people feel about a place, and cross-validate it with the star rating. You can take the works of Jane Austen and plot narrative arcs. You can look at the texting styles of you and your girlfriend. If you’ve got a dataset with clear sentences in standard textbook English, you can find out all sorts of things.

But, here’s the thing with language; it’s gloriously, infuriatingly messy.

That makes it really hard to do really good sentiment analysis – certainly with the free, widely-available tools. Most of those assign certain emotional values to specific words; for example, in the NRC dataset often used with the R package Tidytext, the word “alive” has associations of ANTICIPATION, JOY, POSITIVE, and TRUST, while the word “afraid” has associations of FEAR and NEGATIVE.

This approach works great for sentences like this:

“I bought these shoes last week, and they’re amazing. They feel great, and they make me feel great. Good value too! 10/10, very happy about this.”

…but it doesn’t work for sentences like this:

“I don’t feel good about this. I don’t feel good about this at all. I’d love to get out of this situation right now.”

The second sentence is pretty obviously negative, but it works by negating words. The word “good” isn’t actually good, because it’s being negated by “don’t” a couple of words earlier. And “love” isn’t a positive emotion here, as it’s expressing the desire to get out of the situation, meaning that what’s going on is not a positive thing.

It’s possible to address this with sentiment analysis, but it’s complicated. You’d have to account for every possible way of negating/reversing a word, and there’s a lot of those. You’d have to account for every possible way that a word that’s positive in isolation could actually be referring to a negative overall situation, and that’s a huge task. This is why good sentiment analysis costs a fortune. It’s really complicated.

Luckily, people tend not to speak indirectly all the time, and in aggregate, the twisting, sentiment-negating sentences are cancelled out by the number of straightforward sentences where word-level sentiment analysis does work. But the caveat is that just because you’re using sentiment analysis, that doesn’t mean you’re using it well, and you should really cross-validate it with some other measures.

I’m exploring this with lyrics from Brian Fallon’s bands – The Gaslight Anthem, The Horrible Crowes, and his solo project. I’m looking at Fallon’s lyrics because:

  1. Song lyrics are enough of a deviation from standard English to pose problems for standard sentiment analysis;
  2. At the same time, song lyrics are some of the most obviously emotional usages of language we have;
  3. Fallon writes lyrics in a pretty clear style, often in full sentences, without too many obscure metaphors or references;
  4. I really like his music.

I’ve used the Tidytext package in R for doing sentiment analysis before. This time, I’m using the same NRC sentiment dataset, but trying it out in Alteryx instead. I’ve also visualised it in Tableau, and you can click any of these images to go to an interactive link where you can play around with it yourself.

So, first things first; let’s have a look at sentiment in each song:


Looks pretty good so far. Here’s lookin’ at you, kid is a wistful, regretful song; definitely on the negative side, quite a bit of sadness, very little joy. Click the graph to explore other Brian Fallon songs, if you know them.

There’s a lot of different sentiment measures available, so let’s simplify it to looking at positive and negative. Here’s lookin’ at you, kid has 13 negative words, and 4 positive words. If we take difference (9) and divide it by the biggest value (13), we get a ratio of positive to negative words:

Positive – Negative
MAX(Positive, Negative)

This accounts for the difference between positive and negative words, as well as the number. For example, if one song has 10 positive words and 5 negative words, and another song has 6 positive words and 1 negative word, the difference is the same, but the second song is more positive overall, because it has far more times the number of positive words than negative words.

If we calculate this +/- balance for each song, we can order them as follows:


There’s a nice mix of positive and negative songs, and if you know the songs, some of them definitely feel right; Here’s lookin’ at you, kid is negative, so is Get hurt, while 45 is an upbeat, positive song. But there’s definitely some weird ones in there. We did it when we were young is a sad, regretful song, but it’s up there in the top half of positive songs. That doesn’t seem right.

So let’s cross-validate this. Spotify’s Echo Nest data has a measure called Valence, which is a measure of how positive the mood of a song is. You can get all kinds of interesting measures for your Spotify playlist here. When we plot the Spotify Valence (branching off to the left for values under 50), we get this instead:


Spotify has Here’s lookin’ at you, kid as one of the most negative songs, along with Cherry blossoms, which seems about right to me, but has We did it when we were young as a pretty neutral song, which still doesn’t feel right. Have a look at the difference with Blue jeans and white t-shirts, as well – it’s one of the most negative songs according to Spotify, but one of the most positive according to sentiment analysis. I’d put it somewhere in the middle, maybe a bit more positive than neutral.

Since I keep using my own perspective as a fan and a human, I figured I’d better cross-validate both of these stats with what fans think. I set up a simple survey where to get Brian Fallon fans to rate each song for positivity on a scale of 1 to 7, where 1 meant really negative, 4 meant neutral, and 7 meant really positive. I stressed in the introduction, several times, that it’s not a rating of how much you like each song, or how positive each song makes you personally feel (like, I really like Fallon’s sad songs because they make me feel nice… but they’re still objectively sad), but about the emotion in the song itself. Around 15-20 fans answered for each song, so I averaged their ratings together to get a human-generated emotion rating per song. It’s not the most scientific approach, but it’s good enough for the purposes of this blog.

Here’s what we get, centred around an average of 4 for neutral songs:


This time, Blue jeans and white t-shirts comes in as I see it – fairly positive, but not hugely so. We did it when we were young is down there in the negative range, along with Get hurt and Here’s lookin’ at you, kid.

It’s fascinating to see how the three measures agree and disagree for each song. If we rank each song along each measure (with 1 being most positive and 85 being most negative), we can see how the rank difference varies. There are five possible combinations:

  1. All measures disagree with each other
  2. All measures agree with each other
  3. Spotify valence and fan rating agree, but sentiment analysis disagrees
  4. Sentiment analysis and fan rating agree, but Spotify valence disagrees
  5. Sentiment analysis and Spotify valence agree, but fan rating disagrees

…and there’s at least one example of each:

We can see which songs are most consistently rated across all three measures by looking at the difference between each song’s highest and lowest positivity rank:


Blood loss is the most consistently rated, with a rank difference of only four places, while The backseat has a massive rank difference of 81 – fans put it as the second most positive song in Brian Fallon’s catalogue, while sentiment analysis rates it as 83rd, ahead of only I believe Jesus brought us together and I witnessed a crime. Spotify puts it at 24th.

Another way of showing this variation is by creating scatterplots of each measure against each other, with each dot representing a song:


I’ve run simple correlations on each plot – not exactly statistically kosher, but this is all just exploratory. There is no correlation between valence and sentiment analysis, and more tellingly, no correlation between sentiment analysis and fan ratings. There is a correlation between valence and fan ratings, but it’s not particularly strong.

The overall point, then, is to be careful with sentiment analysis. It’s not that it doesn’t work – it can often work really well, and be a really useful line of investigating data. But relying on sentiment analysis alone, without checking whether it matches measures that should reflect the same kind of thing, might give you some false insights. You don’t want to have Great expectations, or you might Get hurt.


Quantifying three years of a long distance relationship

I read two really useful guides to processing text data recently; an analysis of Trump’s tweets to work out whether it’s him or an intern sending them, and a sentiment analysis of Pride and Prejudice. Three years of a long distance relationship means that I have a nice big corpus of Whatsapp messages between my girlfriend and me, so I did the romantic thing and quantified some of our interactions in R. Also, this required quite a bit of text munging in Excel first, which turned out to be far quicker and easier than using regex in this case.

First of all, let’s look at when we text each other throughout the day. We’re in different time zones, but only by an hour, and since texts are inherently dependent – one text is overwhelmingly likely to lead to another pretty soon after – I haven’t adjusted the times.

text no by hour of day.png

Our texting activity represents our general activity pretty well; nothing much going on until about 7am, then a slow start to the day, a bit of a post-lunch dip, and then an evening peak when we’re more likely to be out and about doing things.

We can also have a look at how many messages we send each other, and how that’s changed over time:

text no by date.png

We’ve sent each other a fairly similar number of texts per day throughout the long distance period, but it looks pretty bad on me that I have consistently sent fewer texts than her…

…or does it? When I plot the length of each text sent, I consistently write longer messages:

text length by date.png

So, there’s two distinct texting styles here; I write longer messages less frequently, she writes shorter messages more frequently. The other thing I like about the text length graph is that you can see the times when we’ve been together and not texted each other that much; three weeks in November 2014 when I was running experiments in London, three weeks around Christmas 2015, and a load of long weekends throughout. It’s not that we don’t text each other at all then, it’s more that those texts tend to be stuff like “have we got milk?”, or simply “pub?”.

Plotting log likelihood ratios of how much each of us uses each word in comparison to the other also captures our texting styles:

top 20 words each (no names).png

For example, we both use the word /ha/ to express laughter, but I spell it “ha” and she spells it “hah”. Likewise, “til” and “till” as abbreviations for “until”, and I seem to use “somebody” while she uses “someone”.

If we filter out equivalent words and proper names (like the pubs, supermarkets, and stations we go to most often), another difference in dialogue style appears:

top 10 words each (no proper names).png

I am apparently a lot more conversational; I write out interjections (hmm, oooh, hey, ohhh) and reactions (fuck’s comes from for fuck’s sake, hoera comes from the Dutch phrase hiep hiep hoera, and boourns comes from, erm, The Simpsons). Apart from hhmmm, she doesn’t write interjections or contextual replies at all. Apart from the interjections and replies, my main thing is adjectives; she tends towards nouns and verbs.

The next step is sentiment analysis. If I plot log likelihood bars for each sentiment, I seem to be an atrociously negative person:

sentiment error bars.png

…but this, I think, is more a problem with the way sentiment analysis works in the syuzhet and tidytext packages using NRC sentiment data. Each word in the NRC corpus has a given value, 0 or 1, for a range of sentiments, and this sentiment analysis style simply adds it up for each word in a given set.

Because of that, it doesn’t really capture the actual sentiment behind the way we’re using these words. Let’s look at the main words driving the differences in each sentiment:

sentiment log likelihood words.pngFor me, a lot of my disgust and anger is coming from the word damn. If I was texting damn! every time I stubbed my toe or something, perhaps that would be accurate; but in this case, a lot of the time I write damn is in sympathy, as in exchanges like:

“My computer crashed this afternoon and I lost all the work I’d done today”
“Damn, that’s horrible”

Meanwhile, the word coop is actually me talking about the coöp / co-op, where I get my groceries. I’m not talking about being trapped, either physically or mentally.

The same goes for my girlfriend being more positive. With words like engagement and ceremony, she’s not joyous or anticipatory about her own upcoming nuptials or anything; rather, several of her colleagues have got engaged and married recently, and most of her uses of the words engagement and ceremony are her complaining about how that’s the only topic of conversation at the office. As for assessment, council, and teacher, she works in education. These are generally neutral descriptions of what’s happened that day.

So, I was hoping to be able to plot some sentiment analyses to show our relationship over time, but either it doesn’t work for text messages, or we’re really fucking obtuse. I think it might be the former.

Instead, I’ll settle for showing how much we both swear over time:

expletives per month.png

Each dot represents the number of occurrences per month of a particular expletive. I’m clearly the more profane here, although I do waver a bit while she’s fairly consistent.

More importantly is how we talk about beer a similar amount:

beer per month.png

Since couples who drink together stay together (or in the words of this study, “concordant drinking couples reported decreased negative marital quality over time”), I think this bodes pretty well for us.

R, Uncategorized

Visualising football league tables

I was looking at the Premiership league table today, and it looks like this:

current league table

It’s pretty informative; we can see that Leicester are top, Aston Villa are bottom, and that the rest of the teams are somewhere in between. If we look at the points column on the far right, we can also see how close things are; Villa are stranded at the bottom and definitely going down, Leicester are five points clear, and there’s a close battle for the final Champions League spot between Manchester City, West Ham, and Manchester United, who are only separated by a single point.

Thing is, that requires reading the points column closely. If you take the league table as a simple visual guide, it doesn’t show the distribution of teams throughout the league very well. If you say that Stoke are 8th, that sounds like a solid mid-table season… but what it doesn’t tell you is that Stoke are as close to 4th place and the Champions League as they are to 10th place, which is also solid mid-table. A more visually honest league table would look something a little like this*:

current league table dragged about a bit

*definitely not to scale.

Screen-shotting a webpage and dragging things about in MS Paint isn’t the best way to go about this, so I’ve scraped the data and had a look at plotting it in R instead.

Firstly, let’s plot each team as a coloured dot, equally spaced apart in the way that the league table shows them:

League position right now

(colour-coding here is automatic; I tried giving each point the team home shirt colours, but just ended up with loads of red, blue, and white dots, which was actually a lot worse)

Now, let’s compare that with the distribution of points to show how the league positions are distributed. Here, I’ve jittered them slightly so that teams with equal points (West Ham and Manchester United in 5th and 6th, Everton and Bournemouth in 12th and 13th) don’t overlap:

League points right now

This is far more informative. It shows just how doomed Aston Villa are, and shows that there’s barely any difference between 10th and 15th. It also shows that the fight for survival is between Norwich, Sunderland, and Newcastle, who are all placed closely together.

Since the information is out there, it’d also be interesting to see how this applies to league position over time. Sadly, Premiership matches aren’t all played at 3pm on Saturday anymore, they’re staggered over several days. This means that the league table will change every couple of days, which is far too much to plot over most of a season. So, I wrote a webscraper to get the league tables every Monday between the start of the season and now, which roughly corresponds to a full round of matches.

Let’s start with looking at league position:

League position over time

This looks more like a nightmare tube map than an informative league table, but there are a few things we can pick out. Obviously, there’s how useless Aston Villa have been, rooted to the bottom since the end of October. We can also see the steady rise of Tottenham, in a dashing shade of lavender, working their way up from 8th in the middle of October to 2nd now. Chelsea’s recovery from flirting with relegation in December to being secure in mid-table now is fairly clear, while we can also see how Crystal Palace have done the reverse, plummeting from 5th at the end of the year to 16th now.

An alternative way of visualising how well teams do over time is by plotting their total number of points over time:

League points over time

This is visually more satisfying than looking at league position over time, as we can see how the clusters of teams in similar positions have formed. Aston Villa have been bottom since October, but they were at least relatively close to Sunderland even at the end of December. Since then, though, the gap between bottom and 19th as opened up to nine points. We can also see how Leicester and Arsenal were neck and neck in first and second for most of the season, but the moment when Leicester really roared ahead was in mid-February. Finally, the relegation fight again looks like it’s a competition between Norwich, Sunderland, and Newcastle for 17th; despite Crystal Palace’s slump, the difference between 16th and 17th is one of the biggest differences between consecutive positions in the league. This is because Norwich, Sunderland, and Newcastle haven’t won many points recently, whereas Swansea and Bournemouth, who were 16th and 15th and also close to the relegation zone back in February, have both had winning streaks in the last month.

One of the drawbacks with plotting points over time is that, for most of the early part of the season, teams are so close together that you can’t really see the clusters and trends.

So, we can also calculate a ratio of how many points a team has compared to the top and bottom team at any given week. To do this, I calculated the points difference between top and bottom teams each week, and then calculated every team’s points as a proportion of where they are.

For example, right now, Leicester have 66 points and Aston Villa have 16. That’s a nice round difference of 50 points across the whole league. Let’s express that points difference on a scale of 0 to 1, where Aston Villa are at one extreme end at 0 and Leicester are at the other extreme end at 1.

Tottenham, in 2nd, have 61 points, or five points fewer than Leicester and 45 points more than Aston Villa. This means that, proportionally, they’re 90% along the points difference spectrum. This means they get a relative position of 0.9, as shown below:

Relative league position over time

This is a lot more complicated, and perhaps needlessly so. It reminds me more of stock market data than a football league table. I plotted it this way to be able to show how close or far teams were from each other in the early parts of the season, but even then, the lines are messy and all over the place until about the start of October, when the main trends start to show. One thing that means is that however badly your team are doing in terms of points and position, there’s little use in sacking a manager before about November; there’s not enough data, and teams are too close together, to show whether it’s a minor blip or a terminal decline. Of course, if your team are doing badly in terms of points and position and playing like they’ve never seen a football before, then there’s a definite problem.

To make it really fancy/silly (delete as appropriate), I’ve plotted form guides of relative league position over time. Instead of joining each individual dot each week as above, it smooths over data points to create an average trajectory. At this point, labelling the relative position is meaningless as it isn’t designed to be read off precisely, but instead provides an overall guide to how well teams are doing:

Relative league position over time smooth narrative (span 0.5)

Here, the narratives of each team’s season are more obvious. Aston Villa started out okay, but sank like a stone after a couple of months. Sunderland were fairly awful for a fairly long time, but the upswing started with Sam Allardyce’s appointment in October and they’ve done well to haul themselves up and into contention for 17th. Arsenal had a poor start to the season, then shot up, rapidly to first near the end of the year, but then they did an Arsenal and got progressively worse from about January onwards. Still, their nosedive isn’t as bad as Manchester City’s; after being top for the first couple of months, they’ve drifted further and further down. It’s more pronounced since Pep Guardiola was announced as their next manager in February, but they were quietly in decline for a while before that anyway. Finally, looking at Chelsea’s narrative line is interesting. While they’ve improved since Guus Hiddink took over, their league position improvement is far more to do with other teams declining over the last couple of months. Four teams (Crystal Palace, Everton, Watford, and West Brom) have crossed Chelsea’s narrative line since February.

I don’t expect these graphs to catch on instead of league tables, but I definitely find them useful for visualising how well teams are doing in comparison to each other, rather than just looking at their position.

Education, Open Data, R

The gender gap in school achievement: exploring UK GCSE data

I was reading this article in the Washington Post a couple of days ago. It’s about data from Florida which shows that girls outperform boys at school, and that the gender gap is bigger at worse schools.

It’s well established that girls outperform boys at school, but seeing it visualised and quantified like that was fascinating, and I wanted to reproduce that data for UK schools. We frequently use American statistics to talk about social issues in the UK, which frustrates me; sometimes we’re close enough for it to generalise, but sometimes it doesn’t and it’s like there’s a gigantic metaphorical ocean between the two societies. We know that British girls outperform British boys, but I wanted to see how similar the situation is.

Luckily, the UK government has one of the best records for open data in the world, and so this information is pretty easily found here and here. The main challenge is actually getting through all the data to find the good bits, as so much of it is available, but I found it in the end. So, I shoved all that into R and messed about with some dataframes. Note that I’m not working with private schools here, just state schools… all 2488 of them which have full data for all metrics reported below. Also, all the data is only fully available for England, not the whole of the UK.

The first thing is to decide how to measure achievement. Here, I’m focusing on GCSEs, the standard qualification which most UK teenagers take at 16 and which marks the end of mandatory education. There are two good metrics for measuring GCSE achievement: the percentage of students who get at least five A*-C grades, and the average capped GCSE point score. The first is simple. Students generally take GCSEs in somewhere between seven and ten different subjects, and the percentage of them who score a grade C or above in at least five GCSEs is one of the main metrics that British people obsess over (for people outside the UK, I’m serious, the national newspapers print this figure for all state schools every August when exam results come out). The second is a little more complicated, and it’s explained here. It’s measured by attributing a certain number of points per exam grade (58 for an A*, 52 for an A, and so on down in sixes). It then measures only a student’s top eight GCSEs. So, if you took 11 GCSEs, scored 6 A*s, 4 As, and a B, you’d get 6 x 58 plus 2 x 52 equals 452. This is then averaged across the school. Literally nobody outside government departments ever uses this, but it’s actually a pretty good measure; focusing on the five A*-C rate is a bit blind to quality over quantity, as a student who gets four A*s and four Ds harms the school’s statistics while a student who gets five Cs and three Fs is good for the school’s statistics, despite the first student clearly doing better overall.

The next thing is to decode the wording of the original article: “the gender gap is bigger at worse schools”. There are several ways of talking about what makes a school good or bad, so I’ll focus on three different metrics:

  1. The rating given to each school by the assessment organisation Ofsted. Each school is inspected every couple of years, and gets given an overall grade: outstanding, good, requires improvement, or inadequate. This is a useful, state-sanctioned measure of how good a school is.
  2. The average GCSE achievement data per school. Presumably, better schools get better results. This is a useful measure of how good a school is in terms of what many parents say they care about.
  3. The average wealth of the student body at the school. Let’s face it, when a lot of middle-class British people say “we were lucky enough that our son got into a good local school”, what they actually mean is “we’re so glad there’s no poor people there”. We can measure the average wealth of the student body by looking at the percentage of students who are eligible for free school meals. The higher the percentage, the poorer the student intake.

Firstly, let’s look at the gender gap in GCSE achievement by Ofsted data. This is categorical, so we can have some nice straightforward histograms. Boys are in light blue, girls are in dark pink. Sure, it’s gendered, but it’s an effective and intuitive colour scheme.

histogram of five A star to C rate and each sex per ofsted rating.png

As you’d expect, the outstanding schools get better results than the good schools, and so on and so on. But, it seems that girls outperform boys across the board, regardless of how good the school is (I did an ANOVA on this; the gender gap effect is slightly less for outstanding schools, but it’s a negligible difference. The gender gap at outstanding schools is 7.5 percentage points versus about 8.5-9.5 percentage points for the other three assessments).

histogram of GCSE capped points score and each sex per ofsted rating.png

…and this is mirrored in the capped GCSE points average. Again there’s a tiny bit less of a gender gap in the outstanding schools compared to the rest, but girls do better than boys everywhere.

Right, so much for Ofsted. Let’s look at overall school GCSE achievement. This is continuous, so it’s going in a scatter plot. Plotting every single school’s boys’ and girl’s result was really messy, so this averages across schools on each percentage point on the x-axis (i.e. what you see at 50% is the average boys’ five A*-C rate and the average girls’ A*-C rate across all schools which got a 50% overall five A*-C rate). Likewise in the second plot with every single capped GCSE average points score, where each points score on the x-axis is rounded to a whole number and averaged with others of the same number. Rest assured that the lines of best fit are essentially identical in the larger, messier plots. I did do plots with standard errors, but thought I’d forgotten… then I looked closely, and realised that the standard errors were so small that they were barely distinguishable from the lines.

scatterplot of GCSE results for each sex across GCSE results.png

scatterplot of capped GCSE results for each sex across rounded school capped GCSE results.png

This one tells a clear story, and is very, very similar to Figure 1 in the Washington Post article which shows the standardised maths and reading assessment plot. However, there are two main differences:

  1. If anything the very worst schools seem to have less of a gender gap, especially in the five A*-C rate plot … although this is probably more about a lack of data at that end. (this is one of the few times I think it’s a good thing to have a lack of data)
  2. It basically doesn’t matter how good or bad the school is, the difference between boys and girls is consistent across all levels of achievement. The only place where boys and girls are almost equal is right at the top, where there’s a ceiling effect; assuming that each school is 50% boys, 50% girls, there can’t be a big difference between the two if a school is getting 99% five A*-Cs overall.

And now for the free school meals data, or the middle-class poverty aversion question. I’m going to bombard you with graphs here. First, just to show you, here are the messy ones where all rates for all schools are plotted:


…but like I said, it’s messy and hard to focus on, it’s like somebody spilt muisjes on the screen.

So, here’s the same plots but with all schools averaged together at each data point. This isn’t even at each percentage point, it’s to the nearest 0.1 of a percentage point, because there’s that much data.

scatterplot of GCSE results for each sex acrossfree school meal eligibility rate (loess se).png

scatterplot of average capped GCSE results for each sex across free school meal eligibility rate (loess se).png

This also tells a very clear story. The schools with richer students get better results. I also found out the Pope’s religion, and something about bears and woods. But, again, there are the same two main points:

  1. There seems to be less of a gender difference in achievement at worse (well, poorer) schools, but this is probably because there aren’t that many seriously deprived areas. Not to say we don’t have deprivation in the UK, we definitely do, and it’s growing, but there are very few schools where over half the students qualify for free school meals (which probably says more about our ridiculously strict benefits threshold rather than the state of poverty).
  2. The performance and achievement gap remains even at the very best (well, richest) schools.

There’s also race data available, but I feel like that’s a topic for another blog at another time. This one is already long enough!

The point is this: while the Washington Post article was fascinating, it doesn’t fully generalise to British society. In the UK, the gender gap for school achievement barely gets bigger at “worse” schools, regardless of how you measure what a bad school is… which is a good thing, I guess? In fact, the gender gap for school achievement seems to be entrenched across education achievement and wealth.

Are girls outperforming boys, or are boys lagging behind? Is it both? I’m not an education specialist, I’m just a guy with Rstudio, so I’m reluctant to speculate… but I will anyway.

I think what I’ve ruled out here is any obvious overriding education level or socio-economic effects of the gender achievement gap. It could be that girls are simply more intelligent than boys, although such a simplistic solution seems unlikely. It could be a social peer pressure effect, in that it is more acceptable to be feminine and work hard at school than it is to be masculine and work hard at school (although that wouldn’t explain the reports that this gender difference is present at very, very early ages). It could be that teaching is a female-dominated profession; female teachers may knowingly or unknowingly choose course materials preferred by girls over materials preferred by boys, female teachers may knowingly or unknowingly favour, reward, and encourage problem-solving strategies preferred by girls over strategies preferred by boys, etc. etc., and that this may get entrenched over time. It could be that a culture which encourages and promotes girls’ education, given their denial of access to it until relatively recently, accidentally creates a culture where boys feel undervalued and demotivated. It could be that girls collaborate with each other on homework and exam revision more than boys do, which has been shown to effectively improve learning. It could be that exams favour a stereotypical female attention to detail over a stereotypical male “good enough” approach. It could be that more boys than girls simply don’t give a shit about their handwriting enough to make their answers legible. It could be that girls hit puberty a bit earlier than boys and are therefore out of adolescence a bit earlier than boys, meaning that girls are on average more mature when they take their GCSEs (but again, not if there’s an early years difference too).

It’s probably all of the above, and more, and it’s complicated. And it’s a problem.

Cricket, R

Bigger isn’t always better – the case of the first innings in cricket

I’ve got an unsubstantiated hunch (the best kind of hunch!) about cricket. Well, not just one, I have loads, but this particular hunch is about the first innings of a cricket match, and that bigger isn’t always better.

I greatly enjoyed following England’s first innings against South Africa in the second Test in Cape Town. But, even with the high run rate while Stokes was smashing it everywhere, I was convinced that the higher that first innings got, the less likely we’d be to win it. This goes against the received wisdom in cricket, which is that the bigger the first innings score, the better it is.

So, I’ve had a look at all first innings scores in Tests from 1990 until now (there’s just over a thousand of them). Here’s simple density plot of the distributions of runs scored in the first innings per match result:

density plot of runs

What this seems to show is that there’s a limited sweet spot from just over 380 runs to about 500 runs where a win is the most likely result. Once a team scores over about 500 runs in the first innings, the most likely match result is a draw.

Part of that is probably because of how much time posting a huge first innings takes out of the game. What happens when we look at runs scored vs. balls taken in the first innings?

scatter plot of runs and balls simple

There’s a green cluster in the middle between about 350 and 550 runs and between about 700 and 800 balls. That, I reckon, is the sweet spot for the perfect first innings: scoring a high but not massive number of runs, without taking too much time. England took 755 balls (125.5 overs) in their first innings in Cape Town, so a win was still just about the most likely result there… but, this may just be an exception. We’ll see.

Here’s the same plot with some lines showing a run rate of 2, 3, and 4 runs per over (the steeper the line, the greater the run rate):

scatter plot of runs and balls

Visually, I’m convinced the sweet spot of 380-500 runs at a decent run rate is obviously there. So, let’s try looking at some simple percentages by comparing scores between 380-500 runs with scores over 500 runs, where runs are scored at over 3.5 runs an over:

Run rate over 3.5, runs between 380 and 500
won draw lost        = 62.32% win rate
43     16     10          = 2.69 win:draw ratio

Run rate over 3.5, runs over 500
won draw lost        = 54.29% win rate
57     47      1           = 1.21 win:draw ratio

The win rate goes down slightly for the higher scores, and the win:draw ratio goes down too. i.e. even if you’re scoring well, going beyond 500 just makes the draw more likely and doesn’t actually help your chances of winning.

But, that’s not quite a fair comparison. I said earlier that if you’re going to score more runs, you have to do it at a higher run rate, so comparing all scores above 3.5 an over isn’t exactly fair. Let’s now compare a good score at a good run rate with a high score at a high run rate. Again, I’m taking a good score to be 380-500 and a high score to be over 500. In terms of run rate, I’m quantifying a good run rate as between the mean run rate of all innings and the mean plus one standard deviation (i.e. between 3.13 and 3.72 runs per over), and a high run rate as above the mean plus one standard deviation (i.e. above 3.72 runs per over).

So, is a score of 380-500 at 3.13-3.72 runs per over better than a score of 500+ at 3.72+ ?

380-500 runs at 3.13-3.72 RPO (mean runs: 438 , mean RPO: 3.40)
won draw lost        = 56.10% win rate
46    20     16          = 2.3 win:draw ratio

500+ runs at 3.72+ RPO (mean runs: 587, mean RPO: 4.90)
won draw lost        = 57.14% win rate
44    32     1             = 1.375 win:draw ratio

…the lower, slower score isn’t better, but it isn’t worse either. The likelihood of winning stays the same; the only difference is that batting on makes losing much less likely and drawing much more likely.

This is really counterintuitive, and I find it hard to wrap my head around the fact that scoring 438 at 3.4 an over is about as likely to result in a win as scoring 587 at 4.9 an over. One possibility is that the matches which feature high first innings scores are played on absolute roads, like in the 1997 Colombo snoozeathon between India and Sri Lanka, meaning that a high second innings score is also pretty likely. Therefore, you’d expect the first and second innings scores to correlate in matches where the first innings was 500+ runs at 3.72+ RPO… but they don’t (r=0.07, p=0.52). Nor do the first and second innings scores correlate in matches where the first innings was between 380-500 runs at 3.13-3.72 RPO (r=-0.15, p=0.18). The only indication that a massive first innings score may mean that the pitch is easier to bat on is that the mean second innings score in response to a massive first innings score is 346.90, while the mean second innings score in response to a good first innings score is 307.09. A t-test between the two set of second innings scores is “relatively significant” (as an ever-hopeful colleague of mine used to say) with a p-value of 0.07, but that doesn’t cut it. This is another mystery for another blog post.

Right, back to looking at just the first innings scores and win rate. One last way of exploring this is by creating a matrix of win rates in bins of runs scored and run rate.

I’ve put all innings into bins of 50 runs and bins of 0.5 RPO. This means that every square in the following graphs is represented by a set of matches where that many runs have been scored at that rate. It’s only done for bins with at least five matches in (because you can’t really extrapolate from things where only one or two matches have happened, as that leads to a lot of 0% and 100% win rates).

This graph visualises the win rate per bin; the darker the green, the greater the likelihood of winning based on that kind of first innings:

rough matrix of runs, RPO, win rate - five matches or more, cropped

But what if, instead of plotting the simple win likelihood for all bins, we plot the most likely result based on that bin, along with the likelihood of that result? In this graph, the colour represents the top result – win, draw, or loss – and the intensity of that colour represents the likelihood – the more intense the colour, the more likely that result:

rough matrix of runs, RPO, top result, rate, cropped

In both matrices, the sweet spot with the most green and the most intense green falls within 400 and 500 runs… although it turns out that in terms of overall win likelihood, the best first innings is to score between 500 and 550 runs, scored at over 4 runs per over.

Ultimately, what this shows is that batting on past 500 or so makes losing the match hugely unlikely (but definitely not impossible), so if safety first is your watchword, have at it. However, if you want to win a Test match, there’s not much point in batting on past 500 or so in the first innings, 550 at most, no matter how fast you score (and if you do decide to go for the big imposing total, you’d better hurry up about it). Ben Stokes might have set a load of records, but with a bit of statistical sleuthing, he’d have realised it was pointless because his batting blitz was actually just making it harder for England to win.

Why bother creating these incredible cricketing memories when the statistics say hold back?

…because it’s much more entertaining. If you focus on the statistics all the time, you end up with a team like England under Peter Moores, where nobody knows anything before they’ve looked at the data. Fair enough, then.


Using R to stick Excel columns into individual .txt files

MS Excel is great for sorting out stimuli so that it’s all nice and neat and in one place, like when I’ve organised my millions of EEG trigger codes:


…but some programmes (I’m looking at you here, Presentation) require individual .txt files of each column:


It’s easy enough to just copy the columns, paste them into Notepad, and save as .txt files if you don’t have too many, but to do that every time you make changes is really frustrating.

It’s easy enough to get Matlab to sort this out for you as well, but I’m kind of Matlab-phobic and prefer to use R for everything that needs scripting.

So, here’s a quick and dirty little snippet of code that goes through a spreadsheet and saves the contents of every individual column as a separate .txt file with the title of whatever is in the first row. All you have to do first is save your Excel sheet as a .csv file.

stimuli <- read.csv("ALLthestimuli.csv", 
                    na.strings=c("", "NA"), 
for (i in 1:length(colnames(stimuli))){
    file = paste(as.character(colnames(stimuli)[i]), ".txt", sep=""), 
    row.names = FALSE, col.names = FALSE, quote = FALSE)

It’s messy code, but it does the job just fine and saves lots of time and frustration.

R, Science in general

scatterplot / dotplot / losttheplot

I’m not sure how to game search engine optimisation algorithms, but hopefully you’ll end up here if you’ve googled “things that are better than histograms” or “like scatter plots but with groups and paired and with lines” or “Weissgerber but in R not Excel” or something similar.

Anyway. Weissgerber et al. (2015) have a fantastic paper on data visualisation which is well worth a read.

(tl;dr version: histograms are dishonest and you should plot individual data points instead)

Helpfully, Weissgerber et al. include instructions for plotting these graphs in MS Excel at the end should you wish to give it a go. But, if MS Excel isn’t your bag, it’s easy enough to try in R…

…apart from the fact that nobody really agrees on what to call these plots, which makes it really hard to search for code examples online. Weissgerber et al. refer to them as scatterplots, but in most people’s minds, scatterplots are for plotting two continuous variables against each other. Other writers refer to them as dotplots or stripplots or stripcharts, but if you don’t know the name, you don’t know that this is what you’re looking for, and all you can find is advice on creating different graphs from the ones you want.

JEDI KNIGHT - these aren't the scatterplots you're looking for

As an example, here’s some of my own data from a behavioural task in which participants had to remember things in two different conditions. The histogram with 95% confidence intervals makes it fairly clear that participants are more accurate in condition one than condition two:

accuracy for each condition in percent

The scatterplots / dotplots / whateverplots also show the distribution of the data quite nicely, and because it’s paired data (each participant does both conditions), you can draw a line between each participant’s data point and make it obvious that most of the participants are better in condition one than in condition two. I’ve also jittered the dots so that multiple data points with the same value (e.g.the two 100% points in condition_one) don’t overlap:

accuracy for each condition in percent - jitterdots

It’s easy to generate these plots using ggplot2. All you need is a long form or melted dataframe (called dotdata here) with three columns: participant, condition, and accuracy.

dotdata$condition<- factor(dotdata$condition, as.character(dotdata$condition))
# re-order the levels in the order of appearance in the dataframe
# otherwise it plots it in alphabetical order
ggplot(dotdata, aes(x=condition, y=accuracy, group=participant)) +
  geom_point(aes(colour=condition), size=4.5, position=position_dodge(width=0.1)) +
  geom_line(size=1, alpha=0.5, position=position_dodge(width=0.1)) +
  xlab('Condition') +
  ylab('Accuracy (%)') +
  scale_colour_manual(values=c("#009E73", "#D55E00"), guide=FALSE) + 

Putting the graph into electroencephalography

ERPists – click to jump to the main point of this blog, which is about plotting measures of confidence and variance. Or just read on from the start, because there’s a lot of highly proficient MS Paint figures in here.

UPDATE! The paper where this dataset comes from is now published in Collabra. You can read the paper here and download all the raw data and analysis scripts here.

ERP graphs are often subjected to daft plotting practices that make them highly frustrating to look at.

ERPing the derp

Negative voltage is often (but not always) plotted upwards, which is counterintuitive but generally justified with “oh but that’s how we’ve always done it”. Axes are rarely labelled, apart from a small key tucked away somewhere in the corner of the graph which still doesn’t give you precise temporal accuracy (which is kind of the point of using EEG in the first place). And finally, these graphs are often generated using ERP programmes then saved as particular file extensions, which then get cramped up or kind of blurry when resized to fit journals’ image criteria. This means that a typical ERP graph looks something a little like this:

typical erp graph

…and the graph is supposed to be interpreted something a little like this:

erp intuitive 2

…although realistically, reading a typical ERP graph is a bit more like this:

erp context

Some of these problems are to do with standard practices; others, due to lack of expertise in generating graphics; and more still are due to journal requirements, which generally specify that graphics must conform to a size which is too small to allow for proper visual inspection of somebody’s data, and also charge approximately four million dollars for the privilege of having these little graphs in colour because of printing costs despite the fact that nobody really reads actual print journals anymore.

Anyway. Many researchers grumble about these pitfalls, but accept that it comes with the territory.

However, one thing I’ve rarely heard discussed, and even more rarely seen plotted, is the representation of different statistical information in ERP graphs.

ERP graphs show the mean voltage across participants on the y-axis at each time point represented on the x-axis (although because of sampling rates, it generally isn’t a different mean voltage for each millisecond, it’s more often a mean voltage for every two milliseconds). Taking the mean readings across trials and across participants is exactly what ERPs are for – they average out the many, many random or irrelevant fluctuations in the EEG data to generate a relatively consistent measure of a brain response to a particular stimulus.

Decades of research have shown that many of these ERPs are reliably generated, so if you get a group of people to read two sentences – one where the sentence makes perfect sense, like the researcher wrote the blog, and one where the final word is replaced with something that’s kind of weird, like the researcher wrote the bicycle – you can bet that there will be a bigger (i.e. more negative) N400 after the kind of weird final words than the ones that make sense. The N400 is named like that because it’s a negative-going wave that normally peaks at around 400ms.

Well, that is, it’ll look like that when you average across the group. You’ll get a nice clean average ERP showing quite clearly what the effect is (I’ve plotted it with positive-up axes, with time points labelled in 100ms intervals, and with two different colours to show the conditions):

standard N400

But, the strength of the ERP – that it averages out noisy data – is also its major weakness. As Steve Levinson points out in a provocative and entertaining jibe at the cognitive sciences, individual variation is huge, both between different groups across the world and between the thirty or so undergraduates who are doing ERP studies for course credit or beer money. The original sin of the cognitive sciences is to deny the variation and diversity in human cognition in an attempt to find the universal human cognitive capabilities. This means that averaging across participants in ERP studies and plotting that average is quite misleading of what’s actually going on… even if the group average is totally predictable. To test this out, I had a look at the ERP plot of a study that I’m writing up now (and to generate my plots, I use R and the ggplot2 package, both of which are brilliant). When I average across all 29 participants and plot the readings from the electrode right in the middle of the top of the head, it looks like this:

Cz electrode (RG onset timelock + NO GUIDE)

There’s a fairly clear effect of the green condition; there’s a P3 followed by a late positivity. This comes out as hugely statistically significant using ANOVAs (the traditional tool of the ERPist) and cluster-based permutation tests in the FieldTrip toolbox (which is also brilliant).

But. What’s it like for individual participants? Below, I’ve plotted some of the participants where no trials were lost to artefacts, meaning that the ERPs for each participant are clearer since they’ve been averaged over all the experimental trials.

Here’s participant 9:

ppt09 Cz electrode (RG onset timelock + NO GUIDE)

Participant 9 reflects the group average quite well. The green line is much higher than the orange line, peaking at about 300ms, and then the green line is also more positive than the orange line for the last few hundred milliseconds. This is nice.

Here’s participant 13:

ppt13 Cz electrode (RG onset timelock + NO GUIDE)

Participant 13 is not reflective of the group average. There’s no P3 effect, and the late positivity effect is actually reversed between conditions. There might even be a P2 effect in the orange condition. Oh dear. I wonder if this individual variation will get lost in the averaging process?

Here’s participant 15:

ppt15 Cz electrode (RG onset timelock + NO GUIDE)

Participant 15 shows the P3 effect, albeit about 100ms later than participant 9 does, but there isn’t really a late positivity here. Swings and roundabouts, innit.

However, despite this variation, if I average the three of them together, I get a waveform that is relatively close to the group average:

ppt9-13-15 Cz electrode (RG onset timelock + NO GUIDE)

The P3 effect is fairly clear, although the late positivity isn’t… but then again, it’s only from three participants, and EEG studies should generally use at least 20-25 participants. It would also be ideal if participants could do hundreds or thousands of trials so that the ERPs for each participant are much more reliable, but this experiment took an hour and a half as it is; nobody wants to sit in a chair strapped into a swimming cap full of electrodes for a whole day.

So, on the one hand, this shows that ERPs from a tenth of the sample size can actually be quite reflective of the group average ERPs… but on the other hand, this shows that even ERPs averaged over only three participants can still obscure the highly divergent readings of one of them.

Now, if only there were a way of calculating an average, knowing how accurate that average is, and also knowing what the variation in the sample size is like…

…which, finally, brings me onto the main point of this blog:

Why do we only plot the mean across all participants when we could also include measures of confidence and variance?

In behavioural data, it’s relatively common to plot line graphs where the line is the mean across participants, while there’s also a shaded area around the line which typically shows 95% confidence intervals. Graphs with confidence intervals look a bit like this (although normally a bit less like an earthworm with a go-faster stripe on it):

updated ci picture for eeg graphs


This is pretty useful in visualising data. It’s taking a statistical measure of how reliable the measurement is, and plotting it in a way that’s easy to see.

So. Why aren’t ERPs plotted with confidence intervals? The obvious stumbling point is the ridiculous requirements of journals (see above), which would make the shading quite hard to do. But, if we all realised that everything happens on the internet now, where colour printing isn’t a thing, then we could plot and publish ERPs that look like this:

Cz electrode (RG onset timelock + 95pc CIs + NO GUIDE)

It’s nice, isn’t it? It also makes it fairly clear where the main effects are; not only do the lines diverge, the shaded areas do too. This might even go some way towards addressing Steve Levinson’s valid concerns about cognitive science data ignoring individual data… although only within one population. My data was acquired from 18-30 year old Dutch university students, and cannot be generalised to, say, 75 year old illiterate Hindi speakers with any degree of certainty, let alone 95%.

This isn’t really measuring the variance within a sample, though. How can we plot an ERP graph which gives some indication of how participant 13 had a completely different response from participants 9 and 15? Well, we could try plotting it with the shaded areas showing one standard deviation either side of the mean instead. It looks like this:

Cz electrode (RG onset timelock + SDs + NO GUIDE)

…which, let’s face it, is pretty gross. The colours overlap a lot, and it’s just kind of messy. But, it’s still informative; it indicates a fair chunk of the variation within my 29 participants, and it’s still fairly clear where the main effects are.

Is this a valid way of showing ERP data? I quite like it, but I’m not sure if other ERP researchers would find this useful (or indeed sensible). I’m also not sure if I’ve missed something obvious about this which makes it impractical or incorrect. It could well be that the amplitudes at each time point aren’t normally distributed, which would require some more advanced approaches to showing confidence intervals, but it’s something to go on at least.

I’d love to hear people’s opinions in the comments below.

To summarise, then:

– ERP graphs aren’t all that great

– but they could be if we plotted them logically

– and they could be really great if we plotted more than just the sample mean