Engineers can't gauge their own interview performance. And that makes them harder to hire.

By Aline Lerner | Published:

interviewing.io is an anonymous technical interviewing platform. We started it because resumes suck and because we believe that anyone, regardless of how they look on paper, should have the opportunity to prove their mettle. In the past few months, we’ve amassed over 600 technical interviews along with their associated data and metadata. Interview questions tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role at a top company, and interviewers typically come from a mix of larger companies like Google, Facebook, and Twitter, as well as engineering-focused startups like Asana, Mattermark, KeepSafe, and more.

Over the course of the next few posts, we’ll be sharing some { unexpected, horrifying, amusing, ultimately encouraging } things we’ve learned. In this blog’s heroic maiden voyage, we’ll be tackling people’s surprising inability to gauge their own interview performance and the very real implications this finding has for hiring.

First, a bit about setup

When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. After each interview, people leave one another feedback, and each party can see what the other person said about them once they both submit their reviews. If both people find each other competent and pleasant, they have the option to unmask. Overall, interviewees tend to do quite well on the platform, with just under half of interviews resulting in a “yes” from the interviewer.

If you’re curious, we have a few public recordings of interviews done on the platform, so you can watch and see what an interview is really like. In addition to these, our feedback forms are attached below. There is one direct yes/no question, and we also ask about a few different aspects of interview performance using a 1-4 scale. We also ask interviewees some extra questions that we don’t share with their interviewers, and one of those questions is about how well they think they did. In this post, we’ll be focusing on the technical score an interviewer gives an interviewee and the interviewee’s self-assessment (both are circled below). For context, a technical score of 3 or above seems to be the rough cut-off for hirability.

Perceived versus actual performance

Below, you can see the distribution of people’s actual technical performance (as rated by their interviewers) and the distribution of their perceived performance (how they rated themselves) for the same set of interviews.¹

You might notice right away that there is a little bit of disparity, but things get interesting when you plot perceived vs. actual performance for each interview. Below, is a heatmap of the data where the darker areas represent higher interview concentration. For instance, the darkest square represents interviews where both perceived and actual performance was rated as a 3. You can hover over each square to see the exact interview count (denoted by “z”).

If you run a regression on this data², you get an R-squared of only 0.24, and once you take away the worst interviews, it drops down even further to a 0.16. For context, R-squared is a measurement of how well you can fit empirical data to some mathematical model. It’s on a scale from 0 to 1 with 0 meaning that everything is noise and 1 meaning that everything fits perfectly. In other words, even though some small positive relationship between actual and perceived performance does exist, it is not a strong, predictable correspondence.

You can also see there’s a non-trivial amount of impostor syndrome going on in the graph above, which probably comes as no surprise to anyone who’s been an engineer.

Gayle Laakmann McDowell of Cracking the Coding Interview fame has written quite a bit about how bad people are at gauging their own interview performance, and it’s something that I had noticed anecdotally when I was doing recruiting, so it was nice to see some empirical data on that front. In her writing, Gayle mentions that it’s the job of a good interviewer to make you feel like you did OK even if you bombed. I was curious about whether that’s what was going on here, but when I ran the numbers, there wasn’t any relationship between how highly an interviewer was rated overall and how off their interviewees’ self-assessments were, in one direction or the other.

Ultimately, this isn’t a big data set, and we will continue to monitor the relationship between perceived and actual performance as we host more interviews, but we did find that this relationship emerged very early on and has continued to persist with more and more interviews — R-squared has never exceeded 0.26 to date.

Why this matters for hiring

Now here’s the actionable and kind of messed up part. As you recall, during the feedback step that happens after each interview, we ask interviewees if they’d want to work with their interviewer. As it turns out, there’s a very statistically significant relationship (p < 0.0008) between whether people think they did well and whether they’d want to work with the interviewer. This means that when people think they did poorly, they may be a lot less likely to want to work with you³. And by extension, it means that in every interview cycle, some portion of interviewees are losing interest in joining your company just because they didn’t think they did well, despite the fact that they actually did.

How can one mitigate these losses? Give positive, actionable feedback immediately (or as soon as possible)! This way people don’t have time to go through the self-flagellation gauntlet that happens after a perceived poor performance, followed by the inevitable rationalization that they totally didn’t want to work there anyway.

Lastly, a quick shout-out to Statwing and Plotly for making terrific data analysis and graphing tools respectively.

Footnotes

There are only 254 interviews represented here because not all interviews in our data set had comprehensive, mutual feedback. Moreover, we realize that raw scores don’t tell the whole story and will be focusing on standardization of these scores and the resulting rat’s nest in our next post. That said, though interviewer strictness does vary, we gate interviewers pretty heavily based on their background and experience, so the overall bar is high and comparable to what you’d find at a good company in the wild. ↩
Here we are referring to linear regression, and though we tried fitting a number of different curves to the data, they all sucked. ↩
In our data, people were 3 times less likely to want to work with their interviewers when they thought they did poorly. ↩