You can’t fix diversity in tech without fixing the technical interview.

Posted on November 2nd, 2016.

In the last few months, several large players, including Google and Facebook, have released their latest and ultimately disappointing diversity numbers. Even with increased effort and resources poured into diversity hiring programs, Facebook’s headcount for women and people of color hasn’t really increased in the past 3 years. Google’s numbers have looked remarkably similar, and both players have yet to make significant impact in the space, despite a number of initiatives spanning everything from a points system rewarding recruiters for bringing in diverse candidates, to increased funding for tech education, to efforts to hire more diverse candidates in key leadership positions.

Why have gains in diversity hiring been so lackluster across the board?

Facebook justifies these disappointing numbers by citing the ubiquitous pipeline problem, namely that not enough people from underrepresented groups have access to the education and resources they need to be set up for success. And Google’s take appears to be similar, judging from what portion of their diversity-themed, forward-looking investments are focused on education.

In addition to blaming the pipeline, since Facebook’s and Google’s announcements, a growing flurry of conversations has loudly waxed causal about the real reason diversity hiring efforts haven’t worked. These have included everything from how diversity training isn’t sticky enough, to how work environments remain exclusionary and thereby unappealing to diverse candidates, to improper calibration of performance reviews, to not accounting for how marginalized groups actually respond to diversity-themed messaging.

While we are excited that more resources are being allocated to education and inclusive workplaces, at interviewing.io, we posit another reason for why diversity hiring initiatives aren’t working. After drawing on data from thousands of technical interviews, it’s become clear to us that technical interviewing is a process whose results are nondeterministic and often arbitrary. We believe that technical interviewing is a broken process for everyone but that the flaws within the system hit underrepresented groups the hardest… because they haven’t had the chance to internalize just how much of technical interviewing is a numbers game. Getting a few interview invites here and there through increased diversity initiatives isn’t enough. It’s a beginning, but it’s not enough. It takes a lot of interviews to get used to the process and the format and to understand that the stuff you do in technical interviews isn’t actually the stuff you do at work every day. And it takes people in your social circle all going through the same experience, screwing up interviews here and there, and getting back on the horse to realize that poor performance in one interview isn’t predictive of whether you’ll be a good engineer.

A brief history of technical interviewing

A definitive work on the history of technical interviewing was surprisingly hard to find, but I was able to piece together a narrative by scouring books like How Would You Move Mount Fuji?, Programming Interviews Exposed, and the bounty of the internets. The story goes something like this.

Technical interviewing has its roots as far back as 1950s Palo Alto, at Shockley Semiconductor Laboratories. Shockley’s interviewing methodology came out of a need to separate the innovative, rapidly moving, Cold War-fueled tech space from the hiring approaches taken in more traditionally established, skills-based, assembly-line industries. And so, he relied on questions that could quickly gauge analytical ability, intellect, and potential. One canonical question in this category has to do with coins: you have 8 identical-looking coins, except one is lighter than the rest. Figure out which one it is with just two weighings on a pan balance.
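(For the curious, the canonical solution is to weigh three coins against three others, which narrows the lighter coin down to a group of at most three that a second weighing resolves. Below is a small Python sketch of that strategy, with made-up coin weights, in case you want to convince yourself it works.)

```python
def weigh(coins, left, right):
    """Pan balance: -1 if the left pan is lighter, 1 if the right pan is lighter, 0 if balanced."""
    l, r = sum(coins[i] for i in left), sum(coins[i] for i in right)
    return -1 if l < r else (1 if r < l else 0)

def find_light_coin(coins):
    # First weighing: coins 0-2 vs. coins 3-5.
    first = weigh(coins, [0, 1, 2], [3, 4, 5])
    if first == 0:
        # Balanced, so the light coin is 6 or 7; the second weighing settles it.
        return 6 if weigh(coins, [6], [7]) == -1 else 7
    suspects = [0, 1, 2] if first == -1 else [3, 4, 5]
    # Second weighing: compare two of the three remaining suspects.
    second = weigh(coins, [suspects[0]], [suspects[1]])
    if second == 0:
        return suspects[2]
    return suspects[0] if second == -1 else suspects[1]

# Sanity check: the strategy finds the light coin no matter where it hides.
for light in range(8):
    coins = [10] * 8
    coins[light] = 9
    assert find_light_coin(coins) == light
```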

The techniques that Shockley developed were adapted by Microsoft during the 90s, as the first dot-com boom spurred an explosion in tech hiring. Constrained by both hiring volume and the same high analytical/adaptability bar that Shockley had set, Microsoft, too, needed to vet people quickly for potential — as software engineering became increasingly complex over the course of the dot-com boom, it was no longer possible to have a few centralized “master programmers” manage the design and then delegate away the minutiae. Even rank and file developers needed to be able to produce under a variety of rapidly evolving conditions, where mere mastery of specific skills wasn’t enough.

The puzzle format, in particular, was easy to standardize because individual hiring managers didn’t have to come up with their own interview questions, and a company could quickly build up its own interchangeable question repository.

This mentality also applied to the interview process itself — rather than having individual teams run their own processes and pipelines, it made much more sense to standardize things. This way, in addition to questions, you could effectively plug and play the interviewers themselves — any interviewer within your org could be quickly trained up and assigned to speak with any candidate, independent of prospective team.

Puzzle questions were a good solution for this era for a different reason. Collaborative editing of documents didn’t become a thing until Google Docs’ launch in 2007. Without that capability, writing code in a phone interview was untenable — if you’ve ever tried to talk someone through how to code something up without at least a shared piece of paper in front of you, you know how painful it can be. In the absence of being able to write code in front of someone, the puzzle question was a decent proxy. Technology marched on, however, and its evolution made it possible to move from the proxy of puzzles to more concrete, coding-based interview questions. Around the same time, Google itself publicly disavowed puzzle questions as having little predictive value.

So where does this leave us? Technical interviews are moving in the direction of more concreteness, but they are still very much a proxy for the day-to-day work that a software engineer actually does. The hope was that the proxy would be decent enough, but it was always understood that proxies are what these questions were, and that the cost-benefit of relying on a proxy worked out in cases where problem solving trumped specific skills and where the need for scale trumped everything else.

As it happens, elevating problem-solving ability and the need for a scalable process are both eminently reasonable motivations. But here’s the unfortunate part: the second reason, namely the need for scalability, doesn’t apply in most cases. Very few companies are large enough to need plug and play interviewers. But coming up with interview questions and processes is really hard, so despite their differing needs, smaller companies often take their cues from the larger players, not realizing that companies like Google are successful at hiring because the work they do attracts an assembly line of smart, capable people… and that their success at hiring is often despite their hiring process and not because of it. So you end up with a de facto interviewing cargo cult, where smaller players blindly mimic the actions of their large counterparts and blindly hope for the same results.

The worst part is that these results may not even be repeatable… for anyone. To show you what I mean, I’ll talk a bit about some data we collected at interviewing.io.

Technical interviewing is broken for everybody

Interview outcomes are kind of arbitrary
interviewing.io is a platform where people can practice technical interviewing anonymously and, in the process, find jobs. Interviewers and interviewees meet in a collaborative coding environment and jump right into a technical interview question. After each interview, both sides rate one another, and interviewers rate interviewees on their technical ability. And the same interviewee can do multiple interviews, each of which is with a different interviewer and/or different company, and this opens the door for some interesting and somewhat controlled comparative analysis.

We were curious to see how consistent the same interviewee’s performance was from interview to interview, so we dug into our data. After looking at thousands of interviews on the platform, we discovered something alarming: interviewee performance from interview to interview varied quite a bit, even for people with a high average performance. In the graph below, every point represents the mean technical score for an individual interviewee who has done 2 or more interviews on interviewing.io. The y-axis is the standard deviation of performance, so the higher up you go, the more volatile interview performance becomes.

As you can see, roughly 25% of interviewees are consistent in their performance, but the rest are all over the place. And over a third of people with a high mean technical score (>=3) bombed at least one interview.
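If you want to poke at data like this yourself, here’s a minimal sketch of the computation behind the graph. This is not our actual pipeline: the toy data, the column names, and the “zero standard deviation means consistent” definition are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per interview, scores on the 1-4 scale.
interviews = pd.DataFrame({
    "interviewee_id": [1, 1, 2, 2, 2, 3, 3],
    "technical_score": [4, 2, 3, 3, 3, 4, 1],
})

# Keep people with 2+ interviews, then compute mean and volatility per person.
counts = interviews.groupby("interviewee_id")["technical_score"].transform("count")
repeat = interviews[counts >= 2]
stats = repeat.groupby("interviewee_id")["technical_score"].agg(["mean", "std"])

# Share of "consistent" interviewees (no spread at all across their interviews).
consistent_share = (stats["std"] == 0).mean()

# Among people with a high mean (>= 3), how many bombed (scored <= 2) at least once?
high_mean = stats["mean"] >= 3
min_score = repeat.groupby("interviewee_id")["technical_score"].min()
bombed_share = (min_score[high_mean] <= 2).mean()

print(stats)
print(f"consistent: {consistent_share:.0%}, high performers with at least one bomb: {bombed_share:.0%}")
```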

Despite the noise, from the graph above, you can make some guesses about which people you’d want to interview. However, keep in mind that each person above represents a mean. Let’s pretend that, instead, you had to make a decision based on just one data point. That’s where things get dicey. Looking at this data, it’s not hard to see why technical interviewing is often perceived as a game. And, unfortunately, it’s a game where people often can’t tell how they’re doing.

No one can tell how they’re doing
I mentioned above that on interviewing.io, we collect post-interview feedback. In addition to asking interviewers how their candidates did, we also ask interviewees how they think they did. Comparing those numbers for each interview showed us something really surprising: people are terrible at gauging their own interview performance, and impostor syndrome is particularly prevalent. In fact, people underestimate their performance over twice as often as they overestimate it. Take a look at the graph below to see what I mean:

Note that, in our data, impostor syndrome knows no gender or pedigree — it hits engineers on our platform across the board, regardless of who they are or where they come from.

Now here’s the messed up part. During the feedback step that happens after each interview, we ask interviewees if they’d want to work with their interviewer. As it turns out, there’s a very strong relationship between whether people think they did well and whether they would indeed want to work with the interviewer — when people think they did poorly, even if they actually didn’t, they may be a lot less likely to want to work with you. And, by extension, it means that in every interview cycle, some portion of interviewees are losing interest in joining your company just because they didn’t think they did well, despite the fact that they actually did.

As a result, companies are losing candidates from all walks of life because of a fundamental flaw in the process.

Poor performances hit marginalized groups the hardest
Though impostor syndrome appears to hit engineers from all walks of life, we’ve found that women get hit the hardest in the face of an actually poor performance. As we learned above, poor performances in technical interviewing happen to most people, even people who are generally very strong. However, when we looked at our data, we discovered that after a poor performance, women are 7 times more likely to stop practicing than men:

A bevy of research appears to support confidence-based attrition as a very real cause for women departing from STEM fields, but I would expect that the implications of the attrition we witnessed extend beyond women to underrepresented groups, across the board.

What the real problem is

At the end of the day, because technical interviewing is indeed a game, like all games, it takes practice to improve. However, unless you’ve been socialized to expect and be prepared for the game-like aspect of the experience, it’s not something that you can necessarily intuit. And if you go into your interviews expecting them to be indicative of your aptitude at the job, which is, at the outset, not an unreasonable assumption, you will be crushed the first time you crash and burn. But the process isn’t a great or predictable indicator of your aptitude. And on top of that, you likely can’t tell how you’re doing even when you do well.

These are issues that everyone who’s gone through the technical interviewing gauntlet has grappled with. But not everyone has the wherewithal or social support to realize that the process is imperfect and to stick with it. And the fewer people like you are involved, whether it’s because they’re not the same color as you or the same gender, or because not a lot of people at your school study computer science, or because you’re a dropout, or for any number of other reasons, the less support or insider knowledge or 10,000-foot view of the situation you’ll have. Full stop.

Inclusion and education aren’t enough

To help remedy the lack of diversity in its headcount, Facebook has committed to three actionable steps on varying time frames. The first step revolves around creating a more inclusive interview/work environment for existing candidates. The other two are focused on addressing the perceived pipeline problem in tech:

  • Short Term: Building a Diverse Slate of Candidates and an Inclusive Working Environment
  • Medium Term: Supporting Students with an Interest in Tech
  • Long Term: Creating Opportunity and Access

Indeed, efforts to promote inclusiveness and increased funding for education are extremely noble, especially in the face of potentially not being able to see results for years in the case of the latter. However, both take a narrow view of the problem and both continue to funnel candidates into a broken system.

Erica Baker really cuts to the heart of it in her blog post about Twitter hiring a head of D&I:

“What irks me the most about this is that no company, Twitter or otherwise, should have a VP of Diversity and Inclusion. When the VP of Engineering… is thinking about hiring goals for the year, they are not going to concern themselves with the goals of the VP of Diversity and Inclusion. They are going to say ‘hiring more engineers is my job, worrying about the diversity of who I hire is the job of the VP of Diversity and Inclusion.’ When the VP of Diversity and Inclusion says ‘your org is looking a little homogenous, do something about it,’ the VP of Engineering won’t prioritize that because the VP of Engineering doesn’t report to the VP of Diversity and Inclusion, so knows there usually isn’t shit the VP of Diversity and Inclusion can do if the Eng org doesn’t see some improvement in diversity.”

Indeed, this is sad, but true. When faced with a high-visibility conundrum like diversity hiring, a pragmatic and even reasonable reaction on any company’s part is to make a few high-profile hires and throw money at the problem. Then, it looks like you’re doing something, and spinning up a task force or a department or new set of titles is a lot easier than attempting to uproot the entire status quo.

As such, we end up with a newly minted, well-funded department pumping a ton of resources into feeding people who haven’t yet learned that interviewing is a game into a broken, nondeterministic machine of a process, made further worse by the fact that said process favors confidence and persistence over bona fide ability… and by the fact that the link between success in navigating said process and subsequent on-the-job performance is tenuous at best.

How to fix things

In the evolution of the technical interview, we saw a gradual reduction in the need for proxies as the technology to write code together remotely emerged; with its advent, abstract, largely arbitrary puzzle questions could start to be phased out.

What’s the next step? Technology has the power to free us from relying on proxies, so that we can look at each individual as an indicative, unique bundle of performance-based data points. At interviewing.io, we make it possible to move away from proxies by looking at each interviewee as a collection of data points that tell a story, rather than one arbitrary glimpse of something they did once.

But that’s not enough either. Interviews themselves need to continue to evolve. The process itself needs to be repeatable, predictive of aptitude at the actual job, and not a system to be gamed, where knowing the rules confers a huge advantage. And the larger organizations whose processes act as a template for everyone else need to lead the charge. Only then can we really be welcoming to a truly diverse group of candidates.


After a lot more data, technical interview performance really is kind of arbitrary.

Posted on October 13th, 2016.

interviewing.io is a platform where people can practice technical interviewing anonymously, and if things go well, get jobs at top companies in the process. We started it because resumes suck and because we believe that anyone, regardless of how they look on paper, should have the opportunity to prove their mettle.

In February of 2016, we published a post about how people’s technical interview performance, from interview to interview, seemed quite volatile. At the time, we just had a few hundred interviews to draw on, so as you can imagine, we were quite eager to rerun the numbers with the advent of more data. After drawing on over a thousand interviews, the numbers hold up. In other words, technical interview outcomes do really seem to be kind of arbitrary.

The setup

When an interviewer and an interviewee match on interviewing.io, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. After each interview, people leave one another feedback, and each party can see what the other person said about them once they both submit their reviews.

After every interview, interviewers rate interviewees on a few different dimensions, including technical ability. Technical ability gets rated on a scale of 1 to 4, where 1 is “poor” and 4 is “amazing!” (you can see the feedback form here). On our platform, a score of 3 or above has generally meant that the person was good enough to move forward.

At this point, you might say, that’s nice and all, but what’s the big deal? Lots of companies collect this kind of data in the context of their own pipelines. Here’s the thing that makes our data special: the same interviewee can do multiple interviews, each of which is with a different interviewer and/or different company, and this opens the door for some pretty interesting and somewhat controlled comparative analysis.

Performance from interview to interview really is arbitrary

If you’ve read our first post on this subject, you’ll recognize the visualization below. For the as yet uninitiated, every icon represents the mean technical score for an individual interviewee who has done 2 or more interviews on the platform. The y-axis is the standard deviation of performance, so the higher up you go, the more volatile interview performance becomes. If you hover over each icon, you can drill down and see how that person did in each of their interviews. Anytime you see bolded text with a dotted underline, you can hover over it to see the relevant data viz. Try it now to expand everyone’s performance. You can also hover over the labels along the x-axis to drill into the performance of people whose means fall into those buckets.

Standard Dev vs. Mean of Interviewee Performance
(1316 Interviews w/ 259 Interviewees)

As you can see, roughly 20% of interviewees are consistent in their performance (down from 25% the last time we did this analysis), and the rest are all over the place. If you look at the graph above, despite the noise, you can probably make some guesses about which people you’d want to interview. However, keep in mind that each icon represents a mean. Let’s pretend that, instead, you had to make a decision based on just one data point. That’s where things get dicey.1 For instance:

  • Many people who scored at least one 4 also scored at least one 2.
  • And as you saw above, a good amount of people who scored at least one 4 also scored at least one 1.
  • If we look at high performers (mean of 3.3 or higher), we still see a fair amount of variation.
  • Things get really murky when we consider “average” performers (mean between 2.6 and 3.3).

What do the most volatile interviewees have in common?

In the plot below, you can see interview performance over time for interviewees with the highest standard deviations on the platform (the cutoff we used was a standard dev of 1 or more, and this accounted for roughly 12% of our users). Note that the mix of dashed and dotted lines is purely visual — this way it’s easier to follow each person’s performance path.

So, what do the most highly volatile performers have in common? The answer appears to be, well, nothing. About half were working at top companies while interviewing, and half weren’t. The breakdown of top school vs. not was roughly 60/40. And years of experience didn’t have much to do with it either — a plurality of interviewees had between 2 and 6 years of experience, with the rest spread all over the board (varying between 1 and 20 years).

So, all in all, the factors that go into performance volatility are likely a lot more nuanced than the traditional cues we often use to make value judgments about candidates.

Why does volatility matter?

I discussed the implications of these findings for technical hiring at length in the last post, but briefly, a noisy, non-deterministic interview process does no favors to either candidates or companies. Both end up expending a lot more effort to get a lot less signal than they ought, and in a climate where software engineers are at such a premium, noisy interviews only serve to exacerbate the problem.

But beyond micro and macro inefficiencies, I suspect there’s something even more insidious and unfortunate going on here. Once you’ve done a few traditional technical interviews, the volatility and lack of determinism in the process is something you figure out anecdotally and kind of accept. And if you have the benefit of having friends who’ve also been through it, it only gets easier. What if you don’t, however?

In a previous post, we talked about how women quit interview practice 7 times more often than men after just one bad interview. It’s not too much of a leap to say that this is probably happening to any number of groups who are underrepresented/underserved by the current system. In other words, though it’s a broken process for everyone, the flaws within the system hit these groups the hardest… because they haven’t had the chance to internalize just how much of technical interviewing is a game. More on this subject in our next post!

What can we do about it?

So, yes, the state of technical hiring isn’t great right now, but here’s what we can say. If you’re looking for a job, the best piece of advice we can give you is to really internalize that interviewing is a numbers game. Between the kind of volatility we discussed in this post, impostor syndrome, poor evaluation techniques, and how hard it can be to get meaningful, realistic practice, it takes a lot of interviews to find a great job.

And if you’re hiring people, in the absence of a radical shift in how we vet technical ability, we’ve learned that drawing on aggregate performance is much more meaningful than making such an important decision based on one single, arbitrary interview. Not only can aggregate performance help correct for an uncharacteristically poor performance, but it can also weed out people who eventually do well in an interview by chance or those who, over time, simply up and memorize Cracking the Coding Interview. At interviewing.io, even after just a handful of interviews, we have a much better picture of what someone is capable of and where they stack up than a single company would after a single interview, and aggregate data tells a much more compelling, repeatable story than one, arbitrary data point.

1At this point you might say that it’s erroneous and naive to compare raw technical scores to one another for any number of reasons, not the least of which is that one interviewer’s 4 is another interviewer’s 2. For a comprehensive justification of using raw scores comparatively, please check out the appendix to our previous post on this subject. Just to make sure the numbers hold up, I reran them, and this time, our R-squared is even higher than before (0.41 vs. 0.39 last time).

Huge thanks to Ian Johnson, creator of d3 Building Blocks, who made the graph entitled Standard Dev vs. Mean of Interviewee Performance (the one with the icons) as well as all the visualizations that go with it.


People are still bad at gauging their own interview performance. Here’s the data.

Posted on September 8th, 2016.

interviewing.io is a platform where people can practice technical interviewing anonymously, and if things go well, get jobs at top companies in the process. We started it because resumes suck and because we believe that anyone, regardless of how they look on paper, should have the opportunity to prove their mettle.

At the end of 2015, we published a post about how people are terrible at gauging their own interview performance. At the time, we just had a few hundred interviews to draw on, so as you can imagine, we were quite eager to rerun the numbers with the advent of more data. After drawing on roughly one thousand interviews, we were surprised to find that the numbers have really held up, and that people continue to be terrible at gauging their own interview performance.

The setup

When an interviewer and an interviewee match on interviewing.io, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. After each interview, people leave one another feedback, and each party can see what the other person said about them once they both submit their reviews.

If you’re curious, you can see what the feedback forms look like below — in addition to one direct yes/no question, we also ask about a few different aspects of interview performance using a 1-4 scale. We also ask interviewees some extra questions that we don’t share with their interviewers, and one of those questions is about how well they think they did. For context, a technical score of 3 or above seems to be the rough cut-off for hirability.

Feedback form for interviewers

Feedback form for interviewees

Perceived versus actual performance… revisited

Below are two heatmaps of perceived vs. actual performance per interview (for interviews where we had both pieces of data). In each heatmap, the darker areas represent higher interview concentration. For instance, the darkest square represents interviews where both perceived and actual performance was rated as a 3. You can hover over each square to see the exact interview count (denoted by “z”).

The first heatmap is our old data:

And the second heatmap is our data as of August 2016:

As you can see, even with the advent of a lot more interviews, the heatmaps look remarkably similar. The R-squared for a linear regression on the first data set is 0.24. And for the more recent data set, it’s dropped to 0.18. In both cases, even though some small positive relationship between actual and perceived performance does exist, it is not a strong, predictable correspondence.
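For reference, here’s a minimal sketch of how one might compute that kind of R-squared and build the underlying count table. The paired scores below are invented and simply stand in for the real feedback data.

```python
import numpy as np
from scipy import stats

# Hypothetical paired ratings, one pair per interview, on the 1-4 scale.
actual = np.array([3, 4, 2, 3, 1, 4, 3, 2, 3, 4])
perceived = np.array([2, 3, 2, 2, 1, 3, 4, 1, 2, 3])

# Linear regression of perceived on actual; rvalue**2 is the R-squared quoted above.
result = stats.linregress(actual, perceived)
print(f"R-squared = {result.rvalue ** 2:.2f}")

# A heatmap-style count table: rows are actual scores, columns are perceived scores.
table = np.zeros((4, 4), dtype=int)
for a, p in zip(actual, perceived):
    table[a - 1, p - 1] += 1
print(table)
```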

You can also see there’s a non-trivial amount of impostor syndrome going on in the graph above, which probably comes as no surprise to anyone who’s been an engineer. Take a look at the graph below to see what I mean.

The x-axis is the difference between actual and perceived performance, i.e. actual minus perceived. In other words, a negative value means that you overestimated your performance, and a positive one means that you underestimated it. Therefore, every bar above 0 is impostor syndrome country, and every bar below zero belongs to its foulsome, overconfident cousin, the Dunning-Kruger effect.1

On interviewing.io (though I wouldn’t be surprised if this finding extrapolated to the qualified engineering population at large), impostor syndrome plagues interviewees roughly twice as often as Dunning-Kruger. Which, I guess, is better than the alternative.

Why people underestimate their performance

With all this data, I couldn’t resist digging into interviews where interviewees gave themselves 1’s and 2’s but where interviewers gave them 4’s to try to figure out if there were any common threads. And, indeed, a few trends emerged. The interviews that tended to yield the most interviewee impostor syndrome were ones where question complexity was layered. In other words, the interviewer would start with a fairly simple question and then, when the interviewee completed it successfully, they would change things up to make it harder. Lather, rinse, repeat. In some cases, an interviewer could get through up to 4 layered tiers in about an hour. Inevitably, even a good interviewee will hit a wall eventually, even if the place where it happens is way further out than the boundary for most people who attempt the same question.

Another trend I observed had to do with interviewees beating themselves up for issues that mattered a lot to them but fundamentally didn’t matter much to their interviewer: off-by-one errors, small syntax errors that made it impossible to compile their code (even though everything was semantically correct), getting big-O wrong the first time and then correcting themselves, and so on.

Interestingly enough, how far off people were in gauging their own performance was independent of how highly rated (overall) their interviewer was or how strict their interviewer was.

With that in mind, if I learned anything from watching these interviews, it was this. Interviewing is a flawed, human process. Both sides want to do a good job, but sometimes the things that matter to each side are vastly different. And sometimes the standards that both sides hold themselves to are vastly different as well.

Why this (still) matters for hiring, and what you can do to make it better

Techniques like layered questions are important for sussing out just how good a potential candidate is and can make for a really engaging, positive experience, so removing them isn’t a good solution. And there probably isn’t that much you can do directly to stop an engineer from beating themselves up over a small syntax error (especially if it’s one the interviewer didn’t care about). However, all is not lost!

As you recall, during the feedback step that happens after each interview, we ask interviewees if they’d want to work with their interviewer. As it turns out, there’s a very statistically significant relationship between whether people think they did well and whether they’d want to work with the interviewer. This means that when people think they did poorly, they may be a lot less likely to want to work with you. And by extension, it means that in every interview cycle, some portion of interviewees are losing interest in joining your company just because they didn’t think they did well, despite the fact that they actually did.
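We don’t spell out the statistics behind that claim here, but as a sketch, one common way to test for this kind of relationship is a chi-squared test on the contingency table of perceived performance vs. willingness to work with the interviewer. The counts below are purely illustrative.

```python
from scipy.stats import chi2_contingency

# Rows: interviewee thought they did well / thought they did poorly.
# Columns: would / would not want to work with the interviewer.
table = [[120, 30],   # thought they did well
         [45, 55]]    # thought they did poorly

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p_value:.2g}")
```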

How can one mitigate these losses? Give positive, actionable feedback immediately (or as soon as possible)! This way people don’t have time to go through the self-flagellation gauntlet that happens after a perceived poor performance, followed by the inevitable rationalization that they totally didn’t want to work there anyway.

1I’m always terrified of misspelling “Dunning-Kruger” and not double-checking it because of overconfidence in my own spelling abilities.


We built voice modulation to mask gender in technical interviews. Here’s what happened.

Posted on June 29th, 2016.

interviewing.io is a platform where people can practice technical interviewing anonymously and, in the process, find jobs based on their interview performance rather than their resumes. Since we started, we’ve amassed data from thousands of technical interviews, and in this blog, we routinely share some of the surprising stuff we’ve learned. In this post, I’ll talk about what happened when we built real-time voice masking to investigate the magnitude of bias against women in technical interviews. In short, we made men sound like women and women sound like men and looked at how that affected their interview performance. We also looked at what happened when women did poorly in interviews, how drastically that differed from men’s behavior, and why that difference matters for the thorny issue of the gender gap in tech.

The setup

When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. Interview questions on the platform tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role, and interviewers typically come from a mix of large companies like Google, Facebook, Twitch, and Yelp, as well as engineering-focused startups like Asana, Mattermark, and others.

After every interview, interviewers rate interviewees on a few different dimensions.

Feedback form for interviewers

As you can see, we ask the interviewer if they would advance their interviewee to the next round. We also ask about a few different aspects of interview performance using a 1-4 scale. On our platform, a score of 3 or above is generally considered good.

Women historically haven’t performed as well as men…

One of the big motivators to think about voice masking was the increasingly uncomfortable disparity in interview performance on the platform between men and women1. At that time, we had amassed over a thousand interviews with enough data to do some comparisons and were surprised to discover that women really were doing worse. Specifically, men were getting advanced to the next round 1.4 times more often than women. Interviewee technical scores didn’t fare much better either — men on the platform had an average technical score of 3 out of 4, as compared to 2.5 out of 4 for women.

Despite these numbers, it was really difficult for me to believe that women were just somehow worse at computers, so when some of our customers asked us to build voice masking to see if that would make a difference in the conversion rates of female candidates, we didn’t need much convincing.

… so we built voice masking

Since we started working on interviewing.io, we knew that, in order to achieve true interviewee anonymity, we’d eventually have to deal with hiding gender, but we put it off for a while because building a real-time voice modulator wasn’t technically trivial. Some early ideas included sending female users a Bane mask.

Early voice masking prototype (drawing by Marcin Kanclerz)

When the Bane mask thing didn’t work out, we decided we ought to build something within the app, and if you play the videos below, you can get an idea of what voice masking on interviewing.io sounds like. In the first one, I’m talking in my normal voice.

And in the second one, I’m modulated to sound like a man.2

Armed with the ability to hide gender during technical interviews, we were eager to see what the hell was going on and get some insight into why women were consistently underperforming.

The experiment

The setup for our experiment was simple. Every Tuesday evening at 7 PM Pacific, interviewing.io hosts what we call practice rounds. In these practice rounds, anyone with an account can show up, get matched with an interviewer, and go to town. And during a few of these rounds, we decided to see what would happen to interviewees’ performance when we started messing with their perceived genders.

In the spirit of not giving away what we were doing and potentially compromising the experiment, we told both interviewees and interviewers that we were slowly rolling out our new voice masking feature and that they could opt in or out of helping us test it out. Most people opted in, and we informed interviewees that their voice might be masked during a given round and asked them to refrain from sharing their gender with their interviewers. For interviewers, we simply told them that interviewee voices might sound a bit processed.

We ended up with 234 total interviews (roughly 2/3 male and 1/3 female interviewees), which fell into one of three categories:

  • Completely unmodulated (useful as a baseline)
  • Modulated without pitch change
  • Modulated with pitch change

You might ask why we included the second condition, i.e. modulated interviews that didn’t change the interviewee’s pitch. As you probably noticed, if you played the videos above, the modulated one sounds fairly processed. The last thing we wanted was for interviewers to assume that any processed-sounding interviewee must summarily have been the opposite gender of what they sounded like. So we threw that condition in as a further control.

The results

After running the experiment, we ended up with some rather surprising results. Contrary to what we expected (and probably contrary to what you expected as well!), masking gender had no effect on interview performance with respect to any of the scoring criteria (would advance to next round, technical ability, problem solving ability). If anything, we started to notice some trends in the opposite direction of what we expected: for technical ability, it appeared that men who were modulated to sound like women did a bit better than unmodulated men and that women who were modulated to sound like men did a bit worse than unmodulated women. Though these trends weren’t statistically significant, I am mentioning them because they were unexpected and definitely something to watch for as we collect more data.

On the subject of sample size, we have no delusions that this is the be-all and end-all of pronouncements on the subject of gender and interview performance. We’ll continue to monitor the data as we collect more of it, and it’s very possible that as we do, everything we’ve found will be overturned. I will say, though, that had there been any staggering gender bias on the platform, with a few hundred data points, we would have gotten some kind of result. So that, at least, was encouraging.

So if there’s no systemic bias, why are women performing worse?

After the experiment was over, I was left scratching my head. If the issue wasn’t interviewer bias, what could it be? I went back and looked at the seniority levels of men vs. women on the platform as well as the kind of work they were doing in their current jobs, and neither of those factors seemed to differ significantly between groups. But there was one nagging thing in the back of my mind. I spend a lot of my time poring over interview data, and I had noticed something peculiar when observing the behavior of female interviewees. Anecdotally, it seemed like women were leaving the platform a lot more often than men. So I ran the numbers.

What I learned was pretty shocking. As it happens, women leave interviewing.io roughly 7 times as often as men after they do badly in an interview. And the numbers for two bad interviews aren’t much better. You can see the breakdown of attrition by gender below (the differences between men and women are indeed statistically significant with P < 0.00001).

Also note that, as much as possible, I corrected for people leaving the platform because they found a job (practicing interviewing isn’t that fun after all, so you’re probably only going to do it if you’re still looking), because they were just trying out the platform out of curiosity, or because they didn’t like something else about their interviewing.io experience.
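For the curious, here’s a rough sketch of how one might check a gap like this for significance. The counts are invented (chosen so that the women-to-men attrition ratio works out to roughly 7x, as in our data), and this isn’t necessarily the exact test we ran.

```python
from scipy.stats import fisher_exact

# Illustrative counts only: of the people who had a bad interview,
# how many left the platform afterwards vs. kept practicing, by gender.
women_after_bad = [35, 15]    # [left, stayed]
men_after_bad = [40, 360]     # [left, stayed]

odds_ratio, p_value = fisher_exact([women_after_bad, men_after_bad])
left_rate_women = 35 / (35 + 15)
left_rate_men = 40 / (40 + 360)
print(f"women leave at {left_rate_women / left_rate_men:.1f}x the rate of men, p = {p_value:.2g}")
```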

A totally speculative thought experiment

So, if these are the kinds of behaviors that happen in the interviewing.io microcosm, how much is applicable to the broader world of software engineering? Please bear with me as I wax hypothetical and try to extrapolate what we’ve seen here to our industry at large. And also, please know that what follows is very speculative, based on not that much data, and could be totally wrong… but you gotta start somewhere.

If you consider the attrition data points above, you might want to do what any reasonable person would do in the face of an existential or moral quandary, i.e. fit the data to a curve. An exponential decay curve seemed reasonable for attrition behavior, and you can see what I came up with below. The x-axis is the number of what I like to call “attrition events”, namely things that might happen to you over the course of your computer science studies and subsequent career that might make you want to quit. The y-axis is what portion of people are left after each attrition event. The red curve denotes women, and the blue curve denotes men.

Now, as I said, this is pretty speculative, but it really got me thinking about what these curves might mean in the broader context of women in computer science. How many “attrition events” does one encounter between primary and secondary education, entering a collegiate program in CS, and then embarking on a career? So, I don’t know, let’s say there are 8 of these events between getting into programming and looking around for a job. If that’s true, then we need 3 times as many women studying computer science as men to get to the same number in our pipelines. Note that that’s 3 times more than men, not 3 times more than there are now. If we think about how many there are now, which, depending on your source, is between a third and a quarter of the number of men, then to get to pipeline parity, we actually have to increase the number of women studying computer science by an entire order of magnitude.
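To make the arithmetic behind this thought experiment concrete, here’s a tiny sketch. The per-event attrition rates are assumptions, picked only so that women quit several times more often per event than men.

```python
# Toy survival model: survivors after k attrition events = (1 - p) ** k,
# where p is the (assumed, illustrative) chance of quitting at each event.
p_quit_men, p_quit_women = 0.02, 0.14   # women ~7x more likely to quit per event
events = 8

survive_men = (1 - p_quit_men) ** events
survive_women = (1 - p_quit_women) ** events

# How many women need to start, per man, to end up with equal numbers at the end?
print(f"men remaining:   {survive_men:.2f}")
print(f"women remaining: {survive_women:.2f}")
print(f"starting ratio needed (women per man): {survive_men / survive_women:.1f}")
```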

Prior art, or why maybe this isn’t so nuts after all

Since gathering these findings and starting to talk about them a bit in the community, I began to realize that there was some supremely interesting academic work being done on gender differences around self-perception, confidence, and performance. Some of the work below found slightly different trends than we did, but it’s clear that anyone attempting to answer the question of the gender gap in tech would be remiss in not considering the effects of confidence and self-perception in addition to the more salient matter of bias.

In a study investigating the effects of perceived performance on the likelihood of subsequent engagement, Dunning (of Dunning-Kruger fame) and Ehrlinger administered a scientific reasoning test to male and female undergrads and then asked them how they did. Not surprisingly, though there was no difference in performance between genders, women underrated their own performance more often than men. Afterwards, participants were asked whether they’d like to enter a Science Jeopardy contest on campus in which they could win cash prizes. Again, women were significantly less likely to participate, with participation likelihood being directly correlated with self-perception rather than actual performance.3

In a different study, sociologists followed a number of male and female STEM students over the course of their college careers via diary entries authored by the students. One prevailing trend that emerged immediately was the difference between how men and women handled the “discovery of their [place in the] pecking order of talent, an initiation that is typical of socialization across the professions.” For women, realizing that they may no longer be at the top of the class and that there were others who were performing better, “the experience [triggered] a more fundamental doubt about their abilities to master the technical constructs of engineering expertise [than men].”

And of course, what survey of gender difference research would be complete without an allusion to the wretched annals of dating? When I told the interviewing.io team about the disparity in attrition between genders, the resounding response was along the lines of, “Well, yeah. Just think about dating from a man’s perspective.” Indeed, a study published in the Archives of Sexual Behavior confirms that men treat rejection in dating very differently than women, even going so far as to say that men “reported they would experience a more positive than negative affective response after… being sexually rejected.”

Maybe tying coding to sex is a bit tenuous, but, as they say, programming is like sex — one mistake and you have to support it for the rest of your life.

Why I’m not depressed by our results and why you shouldn’t be either

Prior art aside, I would like to leave off on a high note. I mentioned earlier that men are doing a lot better on the platform than women, but here’s the startling thing. Once you factor out interview data from both men and women who quit after one or two bad interviews, the disparity goes away entirely. So while the attrition numbers aren’t great, I’m massively encouraged by the fact that at least in these findings, it’s not about systemic bias against women or women being bad at computers or whatever. Rather, it’s about women being bad at dusting themselves off after failing, which, despite everything, is probably a lot easier to fix.

1Roughly 15% of our users are female. We want way more, but it’s a start.

2If you want to hear more examples of voice modulation or are just generously down to indulge me in some shameless bragging, we got to demo it on NPR and in Fast Company.

3In addition to asking interviewers how interviewees did, we also ask interviewees to rate themselves. After reading the Dunning and Ehrlinger study, we went back and checked to see what role self-perception played in attrition. In our case, the answer is, I’m afraid, TBD, as we’re going to need more self-ratings to say anything conclusive.


Technical interview performance is kind of arbitrary. Here’s the data.

Posted on February 17th, 2016.

Note: Though I wrote most of the words in this post, there are a few people outside of interviewing.io whose work made it possible. Ian Johnson, creator of d3 Building Blocks, created the graph entitled Standard Dev vs. Mean of Interviewee Performance (the one with the icons) as well as all the interactive visualizations that go with it. Dave Holtz did all the stats work for computing the probability of people failing individual interviews. You can see more about his work on his blog.

interviewing.io is a platform where people can practice technical interviewing anonymously and, in the process, find jobs. In the past few months, we’ve amassed data from hundreds of interviews, and when we looked at how the same people performed from interview to interview, we were really surprised to find quite a bit of volatility, which, in turn, made us question the reliability of single interview outcomes.

The setup

When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice1, text chat, and a whiteboard and jump right into a technical question. Interview questions on the platform tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role, and interviewers typically come from a mix of large companies like Google, Facebook, and Yelp, as well as engineering-focused startups like Asana, Mattermark, KeepSafe, and more.

After every interview, interviewers rate interviewees on a few different dimensions, including technical ability. Technical ability gets rated on a scale of 1 to 4, where 1 is “meh” and 4 is “amazing!” (you can see the feedback form here). On our platform, a score of 3 or above has generally meant that the person was good enough to move forward.

At this point, you might say, that’s nice and all, but what’s the big deal? Lots of companies collect this kind of data in the context of their own pipelines. Here’s the thing that makes our data special: the same interviewee can do multiple interviews, each of which is with a different interviewer and/or different company, and this opens the door for some pretty interesting and somewhat controlled comparative analysis.

Performance from interview to interview is pretty volatile

Let’s start with some visuals. In the graph below, every icon represents the mean technical score for an individual interviewee who has done 2 or more interviews on the platform2. The y-axis is the standard deviation of performance, so the higher up you go, the more volatile interview performance becomes. If you hover over each icon, you can drill down and see how that person did in each of their interviews. Anytime you see bolded text with a dotted underline, you can hover over it to see the relevant data viz. Try it now to expand everyone’s performance. You can also hover over the labels along the x-axis to drill into the performance of people whose means fall into those buckets.

Standard Dev vs. Mean of Interviewee Performance
(299 Interviews w/ 67 Interviewees)

As you can see, roughly 25% of interviewees are consistent in their performance, and the rest are all over the place3. If you look at the graph above, despite the noise, you can probably make some guesses about which people you’d want to interview. However, keep in mind that each icon represents a mean. Let’s pretend that, instead, you had to make a decision based on just one data point. That’s where things get dicey. For instance:

  • Many people who scored at least one 4 also scored at least one 2.
  • If we look at high performers (mean of 3.3 or higher), we still see a fair amount of variation.
  • Things get really murky when we consider “average” performers (mean between 2.6 and 3.3).

To me, looking at this data and then pretending that I had to make a hiring decision based on one interview outcome felt a lot like peering into some beautiful, lavishly appointed parlor through a keyhole. Sometimes you see a piece of art on the wall, sometimes you see the liquor selection, and sometimes you just see the back of the couch.

At this point you might say that it’s erroneous and naive to compare raw technical scores to one another for any number of reasons, not the least of which is that one interviewer’s 4 is another interviewer’s 2. We definitely share this concern and address it in the appendix of this post. It does bear mentioning, though, that most of our interviewers are coming from companies with strong engineering brands and that correcting for brand strength didn’t change interviewee performance volatility, nor did correcting for interviewer rating.

So, in a real life situation, when you’re trying to decide whether to advance someone to an onsite, you’re probably trying to avoid two things — false positives (bringing in people below your bar by mistake) and false negatives (rejecting people who should have made it in). Most top companies’ interviewing paradigm is that false negatives are less bad than false positives. This makes sense, right? With a big enough pipeline and enough resources, even with a high false negative rate, you’ll still get the people you want. With a high false positive rate, you might get cheaper hiring, but you do potentially irreversible damage to your product, culture, and future hiring standards in the process. And of course, the companies setting the hiring standards and practices for an entire industry ARE the ones with the big pipelines and seemingly inexhaustible resources.

The dark side of optimizing for high false negative rates, though, rears its head in the form of our current engineering hiring crisis. Do single interview instances, in their current incarnation, give enough signal? Or amidst so much demand for talent, are we turning away qualified people because we’re all looking at a large, volatile graph through a tiny keyhole?

So, hyperbolic moralizing aside, given how volatile interview performance is, what are the odds that a good candidate will fail an individual phone screen?

Odds of failing a single interview based on past performance

Below, you can see the distribution of mean performance throughout our population of interviewees.

In order to figure out the probability that a candidate with a given mean score would fail an interview, we had to do some stats work. First, we broke interviewees up into cohorts based on their mean scores (rounded to the nearest 0.25). Then, for each cohort, we calculated the probability of failing, i.e. of getting a score of 2 or less. Finally, to work around our starting data set not being huge, we resampled our data. In our resampling procedure, we treated an interview outcome as a multinomial distribution, or in other words, pretended that each interview was a roll of a weighted, 4-sided die corresponding to that candidate’s cohort. We then re-rolled the dice a bunch of times to create a new, “simulated” dataset for each cohort and calculated new probabilities of failure for each cohort using these data sets. Below, you can see the results of repeating this process 10,000 times.
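Here’s a condensed sketch of that resampling procedure (not our exact code, and the cohort’s score distribution below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical empirical score distribution (scores 1-4) for one cohort,
# estimated from that cohort's observed interviews.
score_probs = [0.05, 0.20, 0.45, 0.30]
n_interviews = 40        # observed interviews in the cohort
n_resamples = 10_000

fail_rates = []
for _ in range(n_resamples):
    # Roll a weighted, 4-sided die once per interview in the cohort...
    simulated = rng.choice([1, 2, 3, 4], size=n_interviews, p=score_probs)
    # ...and recompute the probability of failing (a score of 2 or less).
    fail_rates.append(np.mean(simulated <= 2))

fail_rates = np.array(fail_rates)
lo, hi = np.percentile(fail_rates, [2.5, 97.5])
print(f"estimated failure rate: {fail_rates.mean():.2f} (95% CI {lo:.2f}-{hi:.2f})")
```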

As you can see, a lot of the distributions above overlap with one another. This is important because these overlaps tell us that there may not be statistically significant differences between those groups (e.g. between 2.75 and 3). Certainly, with the advent of a LOT more data, the delineations between cohorts may become clearer. On the other hand, if we do need a huge amount of data to detect differences in failure rate, it might suggest that people are intrinsically highly variable in their performance. At the end of the day, while we can confidently say that there is a significant difference between the bottom end of the spectrum (2.25) versus the top end (3.75), for people in the middle, things are murky.

Nevertheless, using these distributions, we did attempt to compute the probability that a candidate with a certain mean score would fail a single interview (see below — the shaded areas encapsulate a 95% confidence interval). The fact that people who are overall pretty strong (e.g. mean ~= 3) can mess up technical interviews as much as 22% of the time shows that there’s definitely room for improvement in the process, and this is further exacerbated by the general murkiness in the middle of the spectrum.

Is interviewing doomed?

Generally, when we think of interviewing, we think of something that ought to have repeatable results and carry a strong signal. However, the data we’ve collected, meager though it might be, tells a different story. And it resonates with both my anecdotal experience as a recruiter and with the sentiments we’ve seen echoed in the community. Zach Holman’s Startup Interviewing is Fucked hits on the disconnect between interview process and the job it’s meant to fill, the fine gentlemen of TripleByte reached similar conclusions by looking at their own data, and one of the more poignant expressions of inconsistent interviewing results recently came from rejected.us.

You can bet that many people who are rejected after a phone screen by Company A but do better during a different phone screen and ultimately end up somewhere traditionally reputable are getting hit up by Company A’s recruiters 6 months later. And despite everyone’s best efforts, the murky, volatile, and ultimately stochastic circle jerk of a recruitment process marches on.

So yes, one possible conclusion is certainly that technical interviewing itself is indeed fucked and that a single interview instance doesn’t provide a reliable, deterministic signal. Algorithmic interviews are a hotly debated topic and one we’re deeply interested in teasing apart. One thing in particular we’re very excited about is tracking interview performance as a function of interview type, as we get more and more different interviewing types/approaches happening on the platform. Indeed, one of our long-term goals is to really dig into our data, look at the landscape of different interview styles, and make some serious data-driven statements about what types of technical interviews lead to the highest signal.

In the meantime, however, I am leaning toward the idea that drawing on aggregate performance is much more meaningful than making such an important decision based on one single, arbitrary interview. Not only can aggregate performance help correct for an uncharacteristically poor performance, but it can also weed out people who eventually do well in an interview by chance or those who, over time, submit to the beast and memorize Cracking the Coding Interview. I know it’s not always practical or possible to gather aggregate performance data in the wild, but at the very least, in cases where a candidate’s performance is borderline or where their performance differs wildly from what you’d expect, it might make sense to interview them one more time, perhaps focusing on slightly different material, before making the final decision.

Appendix: The part where we tentatively justify using raw scores for comparative performance analysis

For the skeptical, inquiring minds among you who realize that using raw coding scores to evaluate an interviewee has some pretty obvious problems, we’ve included this section. The issue is that even though our interviewers tend to come from companies with high engineering bars, raw scores are still just one piece of feedback: they don’t adjust for interviewer strictness (e.g. one interviewer’s 4 could be another interviewer’s 2), and they don’t adjust well to changes in skill over time. Internally, we actually use a more complex and comprehensive rating system when determining skill, and if we can show that raw scores align with the ratings we calculate, then we don’t feel so bad about using raw scores comparatively.

Our rating system works something like this:

  1. We create a single score for each interview based on a weighted average of each feedback item.
  2. For each interviewer, we pit all the interviewees they’ve interviewed against one another using this score.
  3. We use a Bayesian ranking system (a modified version of Glicko-2) to generate a rating for each interviewee based on the outcome of these competitions.

As a result, each person is only rated based on their score as it compares to other people who were interviewed by the same interviewer. That means one interviewer’s score is never directly compared to another’s, and so we can correct for the hairy issue of inconsistent interviewer strictness.
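For the curious, here is a rough sketch of that per-interviewer pairwise idea. It is not our production code: a plain Elo-style update stands in for the modified Glicko-2 we actually use, and the field names (interviewer_id, interviewee_id, score) are placeholders for illustration.

from collections import defaultdict
from itertools import combinations

K = 32  # Elo learning rate

def expected(r_a, r_b):
    # Expected win probability of a vs. b under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def rate_interviewees(interviews):
    # interviews: list of dicts with interviewer_id, interviewee_id, score.
    ratings = defaultdict(lambda: 1500.0)

    # Group interviews by interviewer so a score is only ever compared
    # against other interviewees seen by the same interviewer.
    by_interviewer = defaultdict(list)
    for iv in interviews:
        by_interviewer[iv["interviewer_id"]].append(iv)

    for pool in by_interviewer.values():
        # Pit every interviewee in this interviewer's pool against every other.
        for a, b in combinations(pool, 2):
            if a["interviewee_id"] == b["interviewee_id"]:
                continue  # don't compare someone against their own other interview
            if a["score"] == b["score"]:
                outcome = 0.5  # treat equal scores as a draw
            else:
                outcome = 1.0 if a["score"] > b["score"] else 0.0
            ea = expected(ratings[a["interviewee_id"]], ratings[b["interviewee_id"]])
            ratings[a["interviewee_id"]] += K * (outcome - ea)
            ratings[b["interviewee_id"]] += K * ((1.0 - outcome) - (1.0 - ea))
    return dict(ratings)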

So, why am I bringing this up at all? You’re all smart people, and you can tell when someone is waving their hands around and pretending to do math. Before we did all this analysis, we wanted to make sure that we believed our own data. We’ve done a lot of work to build a ratings system we believe in, so we correlated it with raw coding scores to see how well raw scores track actual skill.
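As a sketch of that sanity check (the post doesn’t name the exact statistic, so the choice of Spearman rank correlation below is an assumption, and the two dicts are placeholders keyed by interviewee):

from scipy.stats import spearmanr

def rating_vs_raw_score_check(ratings, mean_scores):
    # ratings, mean_scores: dicts keyed by interviewee_id.
    common = sorted(set(ratings) & set(mean_scores))
    rho, p_value = spearmanr([ratings[k] for k in common],
                             [mean_scores[k] for k in common])
    return rho, p_value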

These results are pretty strong. Not strong enough for us to rely on raw scores exclusively but strong enough to believe that raw scores are useful for determining approximate candidate strength.

1While listening to interviews day in and day out, I came up with a drinking game. Every time someone thinks the answer is hash table, take a drink. And every time the answer actually is hash table, take two drinks.4

2This is data as of January 2016, and there are only 299 interviews because not all interviews have enough feedback data and because we threw out everyone with fewer than 2 interviews. Moreover, one thing we don’t show in this graph is the passage of time; a version that plots people’s performance over time is kind of a hot mess.

3We were curious to see if volatility varied at all with people’s mean scores. In other words, were weaker players more volatile than strong ones? The answer is no — when we ran a regression on standard deviation vs. mean, we couldn’t come up with any meaningful relationship (R-squared ~= 0.03), which means that people are all over the place regardless of how strong they are on average.

4I almost died.

Thanks to Andrew Marsh for co-authoring the appendix, to Plotly for making a terrific graphing product, and to everyone who read drafts of this behemoth.

Featured

Uncategorized

Engineers can’t gauge their own interview performance. And that makes them harder to hire.

Posted on December 15th, 2015.

interviewing.io is an anonymous technical interviewing platform. We started it because resumes suck and because we believe that anyone, regardless of how they look on paper, should have the opportunity to prove their mettle. In the past few months, we’ve amassed over 600 technical interviews along with their associated data and metadata. Interview questions tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role at a top company, and interviewers typically come from a mix of larger companies like Google, Facebook, and Twitter, as well as engineering-focused startups like Asana, Mattermark, KeepSafe, and more.

Over the course of the next few posts, we’ll be sharing some { unexpected, horrifying, amusing, ultimately encouraging } things we’ve learned. In this blog’s heroic maiden voyage, we’ll be tackling people’s surprising inability to gauge their own interview performance and the very real implications this finding has for hiring.

First, a bit about setup

When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. After each interview, people leave one another feedback, and each party can see what the other person said about them once they both submit their reviews. If both people find each other competent and pleasant, they have the option to unmask. Overall, interviewees tend to do quite well on the platform, with just under half of interviews resulting in a “yes” from the interviewer.

If you’re curious, you can see what the feedback forms look like below. As you can see, in addition to one direct yes/no question, we also ask about a few different aspects of interview performance using a 1-4 scale. We also ask interviewees some extra questions that we don’t share with their interviewers, and one of those questions is about how well they think they did. In this post, we’ll be focusing on the technical score an interviewer gives an interviewee and the interviewee’s self-assessment (both are circled below). For context, a technical score of 3 or above seems to be the rough cut-off for hirability.

Feedback form for interviewers

Feedback form for interviewees


Perceived versus actual performance

Below, you can see the distribution of people’s actual technical performance (as rated by their interviewers) and the distribution of their perceived performance (how they rated themselves) for the same set of interviews.1

You might notice right away that there is a little bit of disparity, but things get interesting when you plot perceived vs. actual performance for each interview. Below is a heatmap of the data where the darker areas represent higher interview concentration. For instance, the darkest square represents interviews where both perceived and actual performance were rated as a 3. You can hover over each square to see the exact interview count (denoted by “z”).

If you run a regression on this data2, you get an R-squared of only 0.24, and once you take away the worst interviews, it drops even further to 0.16. For context, R-squared is a measure of how well empirical data fits a mathematical model. It’s on a scale from 0 to 1, with 0 meaning that everything is noise and 1 meaning that everything fits perfectly. In other words, even though some small positive relationship between actual and perceived performance does exist, it is not a strong, predictable correspondence.
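If you want to reproduce that kind of number on your own data, the calculation is nothing exotic; a minimal sketch, assuming two equal-length lists of 1-4 scores with one pair per interview:

from scipy.stats import linregress

def r_squared(actual_scores, perceived_scores):
    # Simple linear regression of perceived on actual performance.
    fit = linregress(actual_scores, perceived_scores)
    return fit.rvalue ** 2  # ~0.24 on our full data set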

You can also see there’s a non-trivial amount of impostor syndrome going on in the graph above, which probably comes as no surprise to anyone who’s been an engineer.

Gayle Laakmann McDowell of Cracking the Coding Interview fame has written quite a bit about how bad people are at gauging their own interview performance, and it’s something that I had noticed anecdotally when I was doing recruiting, so it was nice to see some empirical data on that front. In her writing, Gayle mentions that it’s the job of a good interviewer to make you feel like you did OK even if you bombed. I was curious about whether that’s what was going on here, but when I ran the numbers, there wasn’t any relationship between how highly an interviewer was rated overall and how off their interviewees’ self-assessments were, in one direction or the other.

Ultimately, this isn’t a big data set, and we will continue to monitor the relationship between perceived and actual performance as we host more interviews, but we did find that this relationship emerged very early on and has continued to persist with more and more interviews — R-squared has never exceeded 0.26 to date.

Why this matters for hiring

Now here’s the actionable and kind of messed up part. As you recall, during the feedback step that happens after each interview, we ask interviewees if they’d want to work with their interviewer. As it turns out, there’s a highly statistically significant relationship (p < 0.0008) between whether people think they did well and whether they’d want to work with the interviewer. This means that when people think they did poorly, they may be a lot less likely to want to work with you3. And by extension, it means that in every interview cycle, some portion of interviewees lose interest in joining your company just because they didn’t think they did well, even though they actually did.
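We haven’t said which test produced that p-value, but a chi-squared test on the 2x2 table of answers (thought they did well? x would work with the interviewer?) is one straightforward way to check for such a relationship; the counts below are hypothetical, purely for illustration.

from scipy.stats import chi2_contingency

def relationship_test(table):
    # table: 2x2 counts, rows = thought they did well (no/yes),
    # columns = would work with the interviewer (no/yes).
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value

print(relationship_test([[40, 30], [20, 80]]))  # hypothetical counts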

How can one mitigate these losses? Give positive, actionable feedback immediately (or as soon as possible)! This way people don’t have time to go through the self-flagellation gauntlet that happens after a perceived poor performance, followed by the inevitable rationalization that they totally didn’t want to work there anyway.

Lastly, a quick shout-out to Statwing and Plotly for making terrific data analysis and graphing tools respectively.

1There are only 254 interviews represented here because not all interviews in our data set had comprehensive, mutual feedback. Moreover, we realize that raw scores don’t tell the whole story and will be focusing on standardization of these scores and the resulting rat’s nest in our next post. That said, though interviewer strictness does vary, we gate interviewers pretty heavily based on their background and experience, so the overall bar is high and comparable to what you’d find at a good company in the wild.

2Here we are referring to linear regression, and though we tried fitting a number of different curves to the data, they all sucked.

3In our data, people were 3 times less likely to want to work with their interviewers when they thought they did poorly.