interviewing.io blog
better interviewing through data

What do the best interviewers have in common? We looked at thousands of real interviews to find out.

Posted on November 29th, 2017.

At interviewing.io, we’ve analyzed and written at some depth about what makes for a good interview from the perspective of an interviewee. However, despite the inherent power imbalance, interviewing is a two-way street. I wrote a while ago about how, in this market, recruiting isn’t about vetting as much as it is about selling, and not engaging candidates in the course of talking to them for an hour is a woefully missed opportunity. But, just like solving interview questions is a learned skill that takes time and practice, so, too, is the other side of the table. Being a good interviewer takes time and effort and a fundamental willingness to get out of autopilot and engage meaningfully with the other person.

Of course, everyone and their uncle has strong opinions about what makes someone a good interviewer, so instead of waxing philosophical, we’ll present some data and focus on analytically answering questions like… Does it matter how strong of an engineering brand your company has, for instance? Do the questions you ask actually help get candidates excited? How important is it to give good hints to your candidate? How much should you talk about yourself? And is it true that, at the end of the day, what you say is way less important than how you make people feel?[1] And so on.

Before I delve into our findings, I’ll say a few words about interviewing.io and the data we collect.

The setup

interviewing.io is an anonymous technical interviewing platform. On interviewing.io, people can practice technical interviewing anonymously, and if things go well, unlock real (still anonymous) interviews with companies like Lyft, Twitch, Quora, and more.

The cool thing is that both practice and real interviews with companies take place within the interviewing.io ecosystem. As a result, we’re able to collect quite a bit of interview data and analyze it to better understand technical interviewing. One of the most important pieces of data we collect is feedback from both the interviewer and interviewee about how they thought the interview went and what they thought of each other. If you’re curious, you can see what the feedback forms for interviewers and interviewees look like below — in addition to one direct yes/no question, we also ask about a few different aspects of interview performance using a 1-4 scale. We also ask interviewees some extra questions that we don’t share with their interviewers, one of which is their own take on how they thought they did.

Feedback form for interviewers

Feedback form for interviewees

In this post, we’ll be analyzing feedback and outcomes of thousands of real interviews with companies to figure out what traits the best interviewers have in common.

Before we get into the nitty-gritty of individual interviewer behaviors, let’s first put the value of a good interviewer in context by looking at the impact of a company’s brand on the outcome. After all, if brand matters a lot, then maybe being a good interviewer isn’t as important as we might think.

Brand strength

So, does brand really matter for interview outcomes? One quick caveat before we get into the data: every interview on the platform is user-initiated. In other words, once you unlock our jobs portal (you have to do really well in practice interviews to do so), you decide who you talk to. So, candidates talking to companies on our platform will be predisposed to move forward because they’ve chosen the company in the first place. And, as should come as no surprise to anyone, companies with a very strong brand have an easier time pulling candidates (on our platform and out in the world at large) than their lesser-known counterparts. Moreover, many of the companies we work with do have a pretty strong brand, so our pool isn’t representative of the entire branding landscape. However, all is not lost — in addition to working with very recognizable brands, we work with a number of small, up-and-coming startups, so we hope that if you, the reader, are coming from a company that’s doing cool stuff but that hasn’t yet become a household name, our findings likely apply to you. And, as you’ll see, getting candidates in the door isn’t the same as keeping them.

To try to quantify brand strength, we used three different measures: the company’s Klout Score (yes, that still exists), its Mattermark Mindshare Score, and its score on Glassdoor (under general reviews).[2]

When we looked at interview outcomes relative to brand strength, its impact was not statistically significant. In other words, we found no detectable effect of brand strength on either whether the candidate wanted to move forward or how excited the candidate was to work at the company.

This was a bit surprising, so I decided to dig deeper. Maybe brand strength doesn’t matter overall but matters when the interviewer or the questions they asked aren’t highly rated? In other words, can brand buttress less-than-stellar interviewers? Not so, according to our data. Brand didn’t matter even when you corrected for interviewer quality. In fact, of the top 10 best-rated companies on our platform, half have no brand to speak of, 3 are mid-sized YC companies that command respect in Bay Area circles but are definitely not universally recognizable, and only 2 have anything approaching household name status.

So, what’s the takeaway here? Maybe the most realistic thing we can say is that while brand likely matters a lot for getting candidates in the door, once they’re in, no matter how well-branded you are, they’re yours to lose.

Choosing the question

If brand doesn’t matter once you’ve actually gotten a candidate in the door, then what does? Turns out, the questions you ask matter a TON. As you recall, feedback on interviewing.io is symmetric, which means that in addition to the interviewer rating the candidate, the candidate also rates the interviewer, and one of the things we ask candidates is how good the question(s) they got asked were.

Question quality was extremely significant (p < 0.002 with an effect size of 1.25) when it came to whether the candidate wanted to move forward with the company. This held both when candidates did well and when they did poorly.
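To make "this held both when candidates did well and when they did poorly" a bit more concrete, here is a minimal sketch of the kind of stratified comparison that sentence implies. It is not the analysis behind the numbers above: the records are made up purely for illustration, and the function name `move_forward_rates`, the tuple layout, and the use of the 1-4 scale for question ratings are all assumptions.

```python
from collections import defaultdict


def move_forward_rates(records):
    """Share of candidates who wanted to move forward, grouped by
    (did the candidate do well?, question rating)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [wanted_to_move_forward, total]
    for did_well, rating, wants_next in records:
        bucket = counts[(did_well, rating)]
        bucket[0] += int(wants_next)
        bucket[1] += 1
    return {key: yes / total for key, (yes, total) in counts.items()}


# Hypothetical records, NOT real interviewing.io data:
# (candidate_did_well, question_rating_1_to_4, wants_to_move_forward)
toy_records = [
    (True, 4, True), (True, 4, True), (True, 1, False), (True, 1, True),
    (False, 4, True), (False, 4, False), (False, 1, False), (False, 1, False),
]

rates = move_forward_rates(toy_records)
for key in sorted(rates):
    print(key, rates[key])
```

Splitting by outcome first is the point of the sketch: if better-rated questions go with a higher move-forward rate inside both the "did well" and "did poorly" groups, the relationship is not just a side effect of candidates liking interviews they passed. A real analysis would still need a significance test on top of these raw rates.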

While we obviously can’t share the best questions (these are company interviews, after all), we can look at what candidates had to say about the best and worst-rated questions on the platform.

The good

I liked the fact that questions were building on top of each other so that previous work was not wasted and finding ways to improve on the given solution.

Always nice to get questions that are more than just plain algorithms.

Really good asking of a classic question, opened my mind up to edge cases and considerations that I never contemplated the couple of times I’ve been exposed to the internals of this data structure.

This was the longest interviewing.io interview I have ever done, and it is also the most enjoyable one! I really like how we started with a simple data structure and implemented algorithms on top of it. It felt like working on a simple small-scale project and was fun.

He chose an interesting and challenging interview problem that made me feel like I was learning while I was solving it. I can’t think of any improvements. He would be great to work with.

I liked the question: it takes a relatively simple algorithms problem (build and traverse a tree) and adds some depth. I also liked that the interviewer connected the problem to a real product at [Redacted] which made it feel less like a toy problem and more like a pared-down version of a real problem.

This is my favorite question that I’ve encountered on this site. it was one of the only ones that seem like it had actual real-life applicability and was drawn from a real (or potentially real) business challenge. And it also nicely wove in challenges like complexity, efficiency, and blocking.

The bad

Question wasn’t straightforward and it required a lot of thinking/understanding since functions/data structures weren’t defined until a lot later. [Redacted] is definitely a cool company to work for, but some form of structure in interviews would have been a lot more helpful. Spent a long time figuring out what the question is even asking, and interviewer was not language-agnostic.

I was expecting a more technical/design question that showcases the ability to think about a problem. Having a domain-specific question (regex) limits the ability to show one’s problem-solving skills. I am sure with enough research one could come up with a beautiful regex expression but unless this is something one does often, I don’t think it [makes for] a very good assessment.

This is not a good general interview question. A good interview question should have more than one solution with simplified constraints.

Anatomy of a good interview question

  1. Layer complexity (including asking a warmup)
  2. No trivia
  3. Real-world components/relevance to the work the company is doing are preferable to textbook algorithmic problems
  4. If you’re asking a classic algorithmic question, that’s ok, but you ought to bring some nuance and depth to the table, and if you can teach the interviewee something interesting in the process, even better!

Asking the question

One of the other things we ask candidates after their interviews is how helpful their interviewer was in guiding them to the solution. Providing your candidate with well-timed hints that get them out of the weeds without giving away too much is a delicate art that takes a lot of practice (and a lot of repetition), but how much does it matter?

As it turns out, being able to do this well matters a ton. Being good at providing hints was extremely significant (p < 0.00001 with an effect size of 2.95) when it came to whether the candidate wanted to move forward with the company (as before, we corrected for whether the interview went well).

You can see for yourself what candidates thought of their interviewers when it came to their helpfulness and engagement below. Though this attribute is a bit harder to quantify, it seems that hint quality is actually a specific instance of something bigger, namely the notion of turning something inherently adversarial into a collaborative exercise that leaves both people in a better place than where they started.[3]

And if you can’t do that every time, then at the very least, be present and engaged during the interview. And no matter what the devil on your shoulder tells you, no good will ever come of opening Reddit in another tab.[4]

One of the most memorable, pithy conversations I ever had about interviewing was with a seasoned engineer who had spent years as a very senior software architect at a huge tech company before going back to what he’d always liked in the first place, writing code. He’d conducted a lot of interviews over a career spanning several decades, and after trying out a number of different interview styles, what he settled on was elegant, simple, and satisfying. According to him, the purpose of any interview is to “see if we can be smart together.” I like that so much, and it’s advice I repeat whenever anyone will listen.

The good

I liked that you laid out the structure of the interview at the outset and mentioned that the first question did not have any tricks. That helped set the pace of the interview so I didn’t spend an inordinate amount of time on the first one.

The interview wasn’t easy, but it was really fun. It felt more like making a design discussion with a colleague than an interview. I think the question was designed/prepared to fill the 45 minute slot perfectly.

I’m impressed by how quickly he identified the issue (typo) in my hash computation code and how gently he led me to locating it myself with two very high-level hints (“what other test cases would you try?” and “would your code always work if you look for the pattern that’s just there at the beginning of the string?”). Great job!

He never corrected me, instead asked questions and for me to elaborate in areas where I was incorrect – I very much appreciate this.

The question seemed very overwhelming at first but the interviewer was good at helping to break it down into smaller problems and suggest we focus on one of those first.

The bad

[It] was a little nerve-wracking hearing you yawn while I was coding.

What I found much more difficult about this interview was the lack of back and forth as I went along, even if it was simple affirmation that “yes, that code you just wrote looks good”. There were times when it seemed like I was the only one who had talked in the past five minutes (I’m sure that’s an exaggeration). This made it feel much more like a performance than like a collaboration, and my heart was racing at the end as a result.

While the question was very straightforward, and [he] was likely looking for me to blow through it with no prompting whatsoever in order to consider moving forward in an interview process, it would have been helpful to get a discussion or even mild hinting from him when I was obviously stuck thinking about an approach to solve the problem. While I did get to the answer in the end, having a conversation about it would have made it feel more like a journey and learning experience. That would have also been a strong demonstration of the collaborative culture that exists while working with teams of people at a tech company, and would have sold me more vis-a-vis my excitement level.

If an interview is set to 45 minutes, the questions should fit this time frame, because people plan accordingly. I think that if you plan to have a longer interview you should notify the interviewee beforehand, so he can be ready for it.

One issue I had with the question though is what exactly he was trying to evaluate from me with the question. At points we [were] talking about very nitty-gritty details about python linked list or array iteration, but it was unclear at any point if that was what he was judging me on. I think in the future he could outline at the beginning what exactly he was looking for with the problem in order to keep the conversation focused and ensure he is well calibrated judging candidates.

Try to be more familiar with all the possible solutions to the problem you choose to pose to the candidate. Try to work on communicating more clearly with the candidate.

Anatomy of a good interview

  1. Set expectations, and control timing/pacing
  2. Be engaged!
  3. Familiarity with the problem and its associated rabbit holes/garden paths
  4. Good balance of hints and letting candidate think
  5. Turn the interview into a collaborative exercise where both people are free to be smart together

The art of storytelling… and the importance of being human

Beyond choosing and crafting good questions and being engaged (but not overbearing) during the interview, what else do top-rated interviewers have in common?

The pervasive common thread I noticed among the best interviewers on our platform is, as above, a bit hard to quantify but dovetails well with the notion of being engaged and creating a collaborative experience. It’s taking a dehumanizing process and elevating it to an organic experience between two capable, thinking humans. Many times, that translates into revealing something real about yourself and telling a story. It can be sharing a bit about the company you work at and why, out of all the places you could have landed, you ended up there. Or some aspect of the company’s mission that resonated with you specifically. Or how the projects you’ve worked on tie into your own, personal goals.

The good

I like the interview format, in particular how it was primarily a discussion about cool tech, as well as an honest description of the company… the discussion section was valuable, and may be a better gauge of fit anyway. It’s nice to see a company which places value on that 🙂

The interviewer was helpful throughout the interview. He didn’t mind any questions on their company’s internal technology decisions, or how it’s structured. I liked that the interviewer gave me a good insight of how the company functions.

Extremely kind and very generous with explaining everything they do at [redacted]. Really interested in the technical challenges they’re working on. Great!

Interesting questions but the most valuable and interesting thing were the insights he gave me about [redacted]. He sounded very passionate about engineering in general, particularly about the challenges they are facing at [redacted]. Would love to work with him.

The bad

[A] little bit of friendly banter (even if it’s just “how are you doing”?) at the very beginning of the interview would probably help a bit with keeping the candidate calm and comfortable.

I thought the interview was very impersonal, [and] I could not get a good read on the goal or mission of the company.

And, as we wrote about in a previous post, one of the most genuine, human things of all is giving people immediate, actionable feedback. As you recall, during the feedback step that happens after each interview, we ask interviewees if they’d want to work with their interviewer. As it turns out, there’s a very statistically significant relationship (p < 0.00005)[5] between whether people think they did well and whether they’d want to work with the interviewer. This means that when people think they did poorly, they may be a lot less likely to want to work with you. And by extension, it means that in every interview cycle, some portion of interviewees are losing interest in joining your company just because they didn’t think they did well, despite the fact that they actually did.

How can one mitigate these losses? Give positive, actionable feedback immediately (or as soon as possible)! This way people don’t have time to go through the self-flagellation gauntlet that happens after a perceived poor performance, followed by the inevitable rationalization that they totally didn’t want to work there anyway.

How to be human

  1. Talk about what your company does… and what specifically about it appealed to you and made you want to join
  2. Talk about what you’re currently working on and how that fits in with what you’re passionate about
  3. When you like a candidate, give positive feedback as quickly as you can to save them from the self-flagellation that they’ll likely go through otherwise… and which might make them rationalize away wanting to work with you
  4. And, you know, be friendly. A little bit of warmth can go a long way.

Becoming a better interviewer

Interviewing people is hard. It’s hard to come up with good questions, it’s hard to give a good interview, and it’s especially hard to be human in the face of conducting a never-ending parade of interviews. But, being a good interviewer is massively important. As we saw, while your company’s brand will get people in the door, once they’ve reached the technical interview, the playing field is effectively level, and you can no longer use your brand as a crutch to mask poor questions or a lack of engagement. And in this market, where the best candidates have a ton of options, when wielded properly, a good interview that elevates a potentially cold, transactional interaction into something real and genuine can become the selling point that gets great engineers to work for you, whether you’re a household name or a startup that just got its first users.

Given how important it is to do interviews well, what are some things you can do to get better right away? One thing I found incredibly useful for coming up with good, original questions is to start a shared doc with your team where every time someone solves a problem they think is interesting, no matter how small, they jot down a quick note. These notes don’t have to be fleshed out at all, but they can be the seeds for unique interview questions that give candidates insight into the day-to-day at your company. Turning these disjointed seeds into interview questions takes thought and effort (you have to prune out a lot of the details and distill the essence of the problem into something it doesn’t take the candidate a lot of work/setup to grok, and you’ll likely have to iterate on the question a few times before you get it right), but the payoff can be huge.

Another thing you can do to get actionable feedback like the kind you saw in this post (and then immediately level up) is to get on interviewing.io as an interviewer. If you interview people in our double-blind practice pool, no one will know who you are or which company you represent, which means that you get a truly unbiased take on your interviewing ability, which includes your question quality, how excited people would be to work with you, and how good you are at helping people along without giving away too much. It’s also a great way to go beyond your team, which can be pretty awkward, and try out new questions on a very engaged, high-quality user base. You’ll also get to keep replays of your interviews so you can revisit crucial moments and figure out exactly what you need to do to get better next time.

Become a better interviewer with honest, actionable feedback from candidates

Want to hone your skills as an interviewer? Want to help new interviewers at your company warm up before they officially get added to your interview loops? You can sign up to our platform as an interviewer, or (especially for groups) ping us at interviewers@interviewing.io.

[1] “People will forget what you said, people will forget what you did, but people will never forget how you made them feel.” -Maya Angelou

[2] It’s important to call out that brand and engineering brand are two separate things that can diverge pretty wildly. For instance, Target has a strong brand overall but probably not the best engineering brand (sorry). Heap, on the other hand, is one of the better-respected places to work among engineers (both on interviewing.io and off), but it doesn’t have a huge overall brand. Both the Klout and Mattermark Mindshare scores aren’t terrible for quantifying brand strength, but they’re not amazing at engineering brand strength (they’re high for Target and low for Heap). The Glassdoor score is a bit better because reviewers tend to skew engineering-heavy, but it’s still not that great of a measure. So, if anyone has a better way to quantify this stuff, let me know. If I were doing it, I’d probably look at GitHub repos of the company and its employees, who their investors are, and so on and so forth. But that’s a project that’s out of scope for this post.

[3] If you’re familiar with Dan Savage’s campsite rule for relationships, I think there should be a similar one for interviewing… leave your candidates in better shape than when you found them.

[4] Let us save you the time: Trump is bad, dogs are cute, someone ate something.

[5] This time with even more significance!

If you care about diversity, don’t just hire from the same five schools

Posted on October 24th, 2017.

EDIT: Our university hiring platform is now on Product Hunt!

If you’re a software engineer, you probably believe that, despite some glitches here and there, folks who have the technical chops can get hired as software engineers. We regularly hear stories about college dropouts, who, through hard work and sheer determination, bootstrapped themselves into millionaires. These stories appeal to our sense of wonder and our desire for fairness in the world, but the reality is very different. For many students looking for their first job, the odds of breaking into a top company are slim because they will likely never even have the chance to show their skills in an interview. For these students (typically ones without a top school on their resume), their first job is often a watershed moment where success or failure can determine which opportunities will be open to them from that point forward and ultimately define the course of their entire career. In other words, having the right skills as a student is nowhere near enough to get you a job at a top-tier tech company.

To make this point concrete, consider three (fictitious, yet indicative) student personas, similar in smarts and skills but attending vastly different colleges. All are seeking jobs as software engineers at top companies upon graduation.

Mason goes to Harvard. He has a mediocre GPA but knows that doesn’t matter to tech companies, where some of his friends already work. Come September, recent graduates and alums fly back to campus on their company’s dime in order to recruit him. While enjoying a nice free meal in Harvard Square, he has the opportunity to ask these successful engineers questions about their current work. If he likes the company, all he has to do is accept the company’s standing invitation to interview on campus the next morning.

Emily is a computer science student at a mid-sized school ranked in the top 30 for computer science. She has solid coursework in algorithms under her belt, a good GPA, and experience as an engineering intern at a local bank. On the day of her campus’s career fair, she works up the courage to approach companies – this will be her only chance to interact with companies where she dreams of working. Despite the tech industry being casual, the attire of this career fair is business formal with a tinge of sweaty. So after awkwardly putting together an outfit she would never wear again[1], she locates an ancient printer on the far side of campus and prints 50 copies of her resume. After pushing through the lines in order to line up at the booths of tech companies, she gives her resume to every single tech company at the fair over the course of several hours. She won’t find out for two more weeks if she got any interviews.

Anthony goes to a state school near the town where he grew up. He is top in his class, as well as a self-taught programmer, having gone above and beyond his coursework to hack together some apps. His school’s career fair has a bunch of local, non-tech employers. He has no means of connecting with tech companies face-to-face and doesn’t know anyone who works in tech. So, he applies to nearly a hundred tech companies indiscriminately through their website online, uploading his resume and carefully crafted cover letter. He will probably never hear from them.

Career fair mania

The status quo in university recruiting revolves around career fairs and in-person campus recruiting, which have serious limitations. For one, they are extremely expensive, especially at elite schools. Prime real estate at the MIT career fair will run you a steep $18,000, for entry alone. That’s not counting the price of swag (which gets more exorbitant each year), travel, and, most importantly, the opportunity cost of attending engineers’ time. While college students command the lowest salaries, it’s not uncommon for tech companies to spend 50% more on recruiting a student than a senior engineer.

At elite schools, the lengths to which companies go to differentiate themselves are becoming more exorbitant with each passing year. In fact, students at elite colleges suffer from company overload because every major tech company, big and small, is trying to recruit them. All of this, while students at non-elite colleges are scrambling to get their foot in the door without any recruiters, let alone VPs of high-profile companies, visiting their campus.

Of course, due to this cost, companies are limited in their ability to visit colleges in person, and even large companies can visit around 15 or 20 colleges at most. This strategy overlooks top students at solid CS programs that are out of physical reach.

In an effort to overcome this, companies are attending conferences and hackathons out of desperation to reach students at other colleges. The sponsorship tier for the Grace Hopper Conference, the premier gathering for women in tech, tops out at $100,000, with the sponsorship tier to get a single interview booth starting at $30,000. Additionally, larger companies send representatives (usually engineers) to large hackathons in an effort to recruit students in the midst of a 48-hour all-nighter. However, the nature of in-person career fairs and events is that not all students will be present. Grace Hopper is famously expensive to attend as a student, especially when factoring in airfare and hotel.

This cost is inefficient at best, and prohibitive at worst, especially for small startups with low budget and brand. Career fairs serve a tiny portion of companies and a tiny portion of students, and the rest are caught in the pecuniary crossfire. Demand for talented engineers out of college who bring a different lived experience to tech has never been higher, yet companies are passing on precisely these students via traditional methods. Confounding the issue even further is the fundamental question of whether having attended a top school has much bearing on candidate quality in the first place (more on that in the section on technical screening below).

Homogeneity of hires

The focus of companies on elite schools has notable, negative implications for the diversity of their applicants. In particular, many schools that companies traditionally visit are notably lacking in diversity, especially when it comes to race and socioeconomic status. According to a survey of computer science students at Stanford, there were just fifteen Hispanic female and fifteen black female computer science majors in the 2015 graduating class total. In this analysis, the Stanford 2015 CS major was 9% Hispanic and 6% black. According to a 2015 analysis, the Harvard CS major was just 3% black and 5% Hispanic. Companies that are diversity-forward and constrained to recruiting at the same few schools end up competing over this small pool of diverse students. Meanwhile, there is an entire ocean of qualified, racially diverse students from less traditional backgrounds whom companies are overlooking.

The focus on elite schools also has meaningful implications on socioeconomic diversity. According to a detailed New York Times infographic, “four in 10 students from the top 0.1 percent attend an Ivy League or elite university, roughly equivalent to the share of students from poor families who attend any two- or four-year college.” The infographic highlights the rigid segmentation of students by class background in college matriculation.

Source: New York Times

The article finds that the few lower-income students who end up at elite colleges do about as well as their more affluent classmates but that attending an elite versus non-elite college makes a huge difference in future income.

The focus of tech companies on elite schools lends credence to this statistic, codifying the rigidity with which students at elite colleges are catapulted into the 1 percent, while others are left behind. Career-wise, it’s that first job or internship you get while you’re still in school that can determine what opportunities you have access to in the future. And yet, students at non-elite colleges have trouble accessing these very internships and jobs, or even getting a meager first round interview, contributing to the lack of social mobility in our society not for lack of skills but for lack of connections. This sucks. A lot.

The technical screen

Let’s return to our three students. Let’s say that Emily, the student who attended her college’s career fair, gets called back by one or two companies for a first round interview if her resume meets the criteria that companies are looking for. Not having an internship at a top tech company already (quite the catch-22) puts her at a disadvantage. Anthony has little to no chance of hearing back from employers via his applications online, but let’s say that by some miracle he lands a phone screen with one of the tech giants (his best shot, as there are more recruiters to look through the resume dump on the other end).

What are their experiences when it comes to prepping for upcoming technical interviews?

Mason, the Harvard student, attends an event on campus with Facebook engineers teaching him how to pass the technical interview. He also accepts a few interviews at companies he’s less excited about for practice, and just in case. While he of course needs to be sharp and prepare in order to get good at these sorts of algorithmic problems, he has all of the resources he could ask for and more at his disposal. Unsurprisingly, his Facebook interview goes well.

Emily’s school has an informal, undergraduate computer science club in which they are collectively reading technical interviewing guides and trying to figure out what tech companies want from them. She has a couple of interviews lined up, but all of them are for jobs she’s desperate to get. They trade tips after interviews but ultimately have a shaky understanding of what they did right and wrong in the absence of post-interview feedback from companies. Only a couple of alumni from their school have made it to top tech companies in the past, and so they lack the kinds of information that Mason has on what companies are looking for. (E.g. Don’t be afraid to take hints, make sure to explain your thought process, what the heck is this CoderPad thing anyway…)

Anthony doesn’t know anyone who has a tech job like the one he’s interviewing for, and only one of his friends is also interviewing. He doesn’t know where to start when it comes to getting ready for his upcoming interview at GoogFaceSoft. He only has one shot at it with no practice interviews lined up. He prepares by googling “tech interview questions” and stumbles upon a bunch of unrealistic interview questions, many of them behavioral or outdated. He might be offered the interview and be fit for the job, but he sure doesn’t know how to pass the interview.

For students who may be unfamiliar with the art of the technical interview, algorithmic interviews can be mystifying, leading to an imbalance of information on how to succeed. Given that technical interviewing is a game, it is important that everyone knows the rules, spoken and unspoken. There are many practice resources available, but no amount of reading and re-reading Cracking the Coding Interview can prepare you for that moment when you are suddenly in a live, technical phone screen with another human.

We built a better way to hire

Ultimately, as long as university hiring relies on a campus-by-campus approach, the status quo will continue to be fundamentally inefficient and unmeritocratic. No company, not even the tech giants, can cover every school or every resume submitted online. And, in the absence of any meaningful information on a student’s resume, companies default to their university as the only proxy. This approach is inefficient at best and, at worst, it’s the first in a series of watershed moments that derail the promise of social mobility for the non-elite, taking with them any hope of promoting diversity among computer science students.

Because this level of inequity, placed for maximum damage right at the start of people’s careers, really pissed us off, we decided to do something about it. interviewing.io’s answer to the unfortunate status quo is a university-specific hiring platform. If you’re already familiar with how core interviewing.io works, you’ll see that the premise is exactly the same. We give out free practice to students, and use their performance in practice to identify top performers, completely independently of their pedigree. Those top performers then get to interview with companies like Lyft and Quora on our platform. In other words, we’re excited to provide students with pathways into tech that don’t involve going to an elite school or knowing someone on the inside. So far, we’ve been very pleased with the results. You can see our student demographics and where they’re coming from below. Students from all walks of life, whether they’re from MIT or a school you’d never visit, are flocking to the platform, and we couldn’t be prouder.

school tier distribution

interviewing.io evaluates students based on their coding skills, not their resume. We are open to students regardless of their university affiliation, college major, and pretty much anything else (we ask for your class year to make sure you’re available when companies want you and that’s about it). Unlike traditional campus recruiting, we attract students organically (getting free practice with engineers from top companies is a pretty big draw) from schools big and small from across the country.

student heatmap

We’re also proud that almost 40 percent of our university candidates come from backgrounds that are underrepresented in tech.

student heatmap

Because of our completely blind, skills-first approach, we’ve seen an interesting phenomenon happen time and time again: when a student unmasks at the end of a successful interview, the company in question realizes that the student who just aced their technical phone screen was one whose resume was sitting at the bottom of the pile all along.

In addition to identifying top students who bring a different lived experience to tech, we’re excited about the economics of our model. With interviewing.io, a mid-sized startup can staff their entire intern class for the same cost as attending 1-2 career fairs at top schools… with a good chunk of those interns coming from underrepresented backgrounds. Want to hire interns and new grads in the most efficient, fair way possible? Sign up to be an employer on our university platform!

Meena runs interviewing.io’s university hiring platform. We help companies hire college students from all over the US, with a focus on diversity. Prior to joining interviewing.io, Meena was a software engineer at Clever, and before that, Meena was in college on the other side of the engineer interviewing equation.

1At least her school didn’t send out this.


We analyzed thousands of technical interviews on everything from language to code style. Here’s what we found.

Posted on June 13th, 2017.

Note: Though I wrote most of the words in this post, the legendary Dave Holtz did the heavy lifting on the data side. See more of his work on his blog.

If you’re reading this post, there’s a decent chance that you’re about to re-enter the crazy and scary world of technical interviewing. Maybe you’re a college student or fresh grad who is going through the interviewing process for the first time. Maybe you’re an experienced software engineer who hasn’t even thought about interviews for a few years. Either way, the first step in the interviewing process is usually to read a bunch of online interview guides (especially if they’re written by companies you’re interested in) and to chat with friends about their experiences with the interviewing process (both as an interviewer and interviewee). More likely than not, what you read and learn in this first, “exploratory” phase of the interview process will inform how you choose to prepare moving forward.

There are a few issues with this typical approach to interview preparation:

  • Most interview guides are written from the perspective of one company. While Company A may really value efficient code, Company B may place more of an emphasis on high-level problem-solving skills. Unless your heart is set on Company A, you probably don’t want to give too much weight to what they value.
  • People lie sometimes, even if they don’t mean to. In writing, companies may say they’re language agnostic, or that it’s worthwhile to explain your thought process, even if the answer isn’t quite right. However, it’s not clear if this is actually how they act! We’re not saying that tech companies are nefarious liars who are trying to mislead their applicant pool. We’re just saying that sometimes implicit biases sneak in and people aren’t even aware of them.
  • A lot of the “folk knowledge” that you hear from friends and acquaintances may not be based in fact at all. A lot of people assume that short interviews spell doom. Similarly, everyone can recall one long interview after which they’ve thought to themselves, “I really hit it off with that interviewer, I’ll definitely get passed on to the next stage.” In the past, we’ve seen that people are really bad at gauging how they did in interviews. This time, we wanted to look directly at indicators like interview length and see if those actually matter.

Here at interviewing.io, we are uniquely positioned to approach technical interviews and their outcomes in a data-driven way. This time, we’ve opted for a quick (if not dirty) and quantitative analysis. In other words, rather than digging deep into individual interviews, we focused on easily measurable attributes that many interviews share, like duration and language choice. In upcoming posts, we’ll be delving deeper into the interview content itself. If you’re new to our blog and want to get some context about how interviewing.io works and what interview data we collect, please take a look at the section called “The setup” below. Otherwise, please skip over that and head straight for the results!

The setup

interviewing.io is a platform where people can practice technical interviewing anonymously, and if things go well, unlock the ability to interview anonymously, whenever they’d like, with top companies like Uber, Lyft, and Twitch. The cool thing is that both practice interviews and real interviews with companies take place within the interviewing.io ecosystem. As a result, we’re able to collect quite a bit of interview data and analyze it to better understand technical interviews, the signal they carry, what works and what doesn’t, and which aspects of an interview might actually matter for the outcome.

Each interview, whether it’s practice or real, starts with the interviewer and interviewee meeting in a collaborative coding environment with voice, text chat, and a whiteboard, at which point they jump right into a technical question. Interview questions tend to fall into the category of what you’d encounter in a phone screen for a back-end software engineering role. During these interviews, we collect everything that happens, including audio transcripts, data and metadata describing the code that the interviewee wrote and tried to run, and detailed feedback from both the interviewer and interviewee about how they think the interview went and what they thought of each other.

If you’re curious, you can see what the feedback forms for interviewers and interviewees look like below — in addition to one direct yes/no question, we also ask about a few different aspects of interview performance using a 1-4 scale. We also ask interviewees some extra questions that we don’t share with their interviewers, and one of the things we ask is whether an interviewee has previously seen the question they just worked on.

Feedback form for interviewers

Feedback form for interviewees

The results

Before getting into the thick of it, it’s worth noting that the conclusions below are based on observational data, which means we can’t make strong causal claims… but we can still share surprising relationships we’ve observed and explain what we found so you can draw your own conclusions.

Having seen the interview question before

“We’re talking about practice!” -Allen Iverson

First things first. It doesn’t take a rocket scientist to suggest that one of the best ways to do better in interviews is to… practice interviewing. There are a number of resources out there to help you practice, ours among them. One of the main benefits of working through practice problems is that you reduce the likelihood of being asked to solve something you’ve never seen before. Balancing that binary search tree will be much less intimidating if you’ve already done it once or twice.

We looked at a sample of ~3000 interviews and compared the outcome to whether the interviewee had seen the interview question before. You can see the results in the plot below.

seen_interview_before_plot

Unsurprisingly, interviewees who had seen the question were 16.6% more likely to be considered hirable by their interviewer. This difference is statistically significant (p < 0.001).1

Does it matter what language you code in?

“Whoever does not love the language of his birth is lower than a beast and a foul smelling fish.” -Jose Rizal

You might imagine that different languages lead to better interviews. For instance, maybe the readability of Python gives you a leg up in interviews. Or perhaps the fact that certain languages handle data structures in a particularly clean way makes common interview questions easier. We wanted to see whether or not there were statistically significant differences in interview performance across different interview languages.

To investigate, we grouped interviews on our platform by interview language and filtered out any languages that were used in fewer than 5 interviews (this only threw out a handful of interviews). After doing this, we were able to look at interview outcome and how it varied as a function of interview language.

The results of that analysis are in the chart below. Any non-overlapping confidence intervals represent a statistically significant difference in how likely an interviewee is to ‘pass’ an interview, as a function of interview language. Although we don’t do a pairwise comparison for every possible pair of languages, the data below suggest that generally speaking, there aren’t statistically significant differences between the success rate when interviews are conducted in different languages.2

interview_varies_with_success_rate_plot
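
For readers who want to run this kind of comparison on their own data, here is a minimal sketch in Python. It assumes interviews arrive as hypothetical (language, passed) pairs and uses a Wilson score interval for the 95% confidence interval. The post does not say which interval method was used, so treat that choice, and every function name here, as an assumption rather than the actual analysis code.

```python
import math
from collections import defaultdict

MIN_INTERVIEWS = 5  # threshold described in the post
Z_95 = 1.96         # normal quantile for a 95% interval


def wilson_interval(passes, total, z=Z_95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p_hat = passes / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z * z / (4 * total * total)
    ) / denom
    return center - half, center + half


def pass_rates_by_language(interviews):
    """interviews: iterable of (language, passed) pairs (hypothetical shape).

    Returns {language: (pass_rate, ci_low, ci_high)}.
    """
    counts = defaultdict(lambda: [0, 0])  # language -> [passes, total]
    for language, passed in interviews:
        counts[language][0] += int(passed)
        counts[language][1] += 1

    results = {}
    for language, (passes, total) in counts.items():
        if total < MIN_INTERVIEWS:
            continue  # drop languages used in fewer than 5 interviews
        low, high = wilson_interval(passes, total)
        results[language] = (passes / total, low, high)
    return results


def intervals_overlap(a, b):
    """a, b: (rate, low, high) tuples as returned above."""
    return a[1] <= b[2] and b[1] <= a[2]
```

Two languages whose intervals overlap under `intervals_overlap` would be read as showing no clear difference, in the sense described above.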

That said, one of the most common mistakes we’ve observed qualitatively is people choosing languages they’re not comfortable in and then messing up basic stuff like array length lookup, iterating over an array, instantiating a hash table, and so on. This is especially mortifying when interviewees purposely pick a fancy-sounding language to impress their interviewer. Trust us, wielding your language of choice comfortably beats out showing off in a fancy-sounding language you don’t know well, every time.

Even if language doesn’t matter… is it advantageous to code in the company’s language of choice?

“God help me, I’ve gone native.” -Margaret Blaine

It’s all well and good that, in general, interview language doesn’t seem particularly correlated with performance. However, you might imagine that there could be an effect depending on the language that a given company uses. You could imagine a Ruby shop saying “we only hire Ruby developers, if you interview in Python we’re less likely to hire you.” On the flip side, you could imagine that a company that writes all of their code in Python is going to be much more critical of an interviewee in Python – they know the ins and outs of the language, and might judge the candidate for doing all sorts of “non-pythonic” things during their interview.

The chart below is similar to the chart which showed differences in interview success rate (as measured by interviewers being willing to hire the interviewee) for C++, Java, and Python. However, this chart also breaks out performance by whether or not the interview language is in the company’s stack. We restrict this analysis to C++, Java and Python because these are the three languages where we had a good mixture of interviews where the company did and did not use that language. The results here are mixed. When the interview language is Python or C++, there’s no statistically significant difference between the success rates for interviews where the interview language is or is not a language in the company’s stack. However, interviewees who interviewed in Java were more likely to succeed when interviewing with a Java shop (p=0.037).

So, why is it that coding in the company’s language seems to be helpful when it’s Java, but not when it’s Python or C++? One possible explanation is that the communities that exist around certain programming languages (such as Java) place a higher premium on previous experience with the language. Along these lines, it’s also possible that interviewers from companies that use Java are more likely to ask questions that favor those with a pre-existing knowledge of Java’s idiosyncrasies.

language_success_rate_company_plot

What about the relationship between what language you program in and how good of a communicator you’re perceived to be?

“To handle a language skillfully is to practice a kind of evocative sorcery.” -Charles Baudelaire

Even if language choice doesn’t matter that much for overall performance (Java-wielding companies notwithstanding), we were curious whether different language choices led to different outcomes in other interview dimensions. For instance, an extremely readable language, like Python, may lead to interview candidates who are assessed to have communicated better. On the other hand, a low-level language like C++ might lead to higher scores for technical ability. Furthermore, very readable or low-level languages might lead to correlations between these two scores (for instance, maybe there’s a C++ interview candidate who can’t explain at all what he or she is doing but who writes very efficient code). The chart below suggests that there isn’t really any observable difference between how candidates’ technical and communication abilities are perceived, across a variety of programming languages.

Furthermore, no matter what, poor technical ability seems highly correlated with poor communication ability – regardless of language, it’s relatively rare for candidates to perform well technically but not effectively communicate what they’re doing (or vice versa), largely (and fortunately) debunking the myth of the incoherent, fast-talking, awkward engineer.3

Interview duration

“It’s fine when you careen off disasters and terrifyingly bad reviews and rejection and all that stuff when you’re young; your resilience is just terrific.” -Harold Prince

We’ve all had the experience of leaving an interview and just feeling like it went poorly. Often, that feeling of certain underperformance is motivated by rules of thumb that we’ve either come up with ourselves or heard repeated over and over again. You might find yourself thinking, “the interview didn’t last long? That’s probably a bad sign… ” or “I barely wrote anything in that interview! I’m definitely not going to pass.” Using our data, we wanted to see whether these rules of thumb for evaluating your interview performance had any merit.

First, we looked at the length of the interview. Does a shorter interview mean you were such a trainwreck that the interviewer just had to stop the interview early? Or was it maybe the case that the interviewer had less time than normal, or had seen in just a short amount of time that you were an awesome candidate? The plot below shows the distributions of interview length (measured in minutes) for both successful and unsuccessful candidates. A quick look at this chart suggests that there is no difference in the distribution of interview lengths between interviews that go well and interviews that don’t: the average length of interviews where the interviewer wanted to hire the candidate was 51.00 minutes, whereas the average length of interviews where the interviewer did not was 49.95 minutes. This difference is not statistically significant.4

interview_duration_plot
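
The footnotes name a Fisher-Pitman permutation test for comparisons like this one. As a rough illustration of the idea (not the analysis code behind this post), here is a Monte Carlo version in plain Python: pool the two groups, reshuffle the labels many times, and count how often the shuffled difference in means is at least as large as the observed one. The function name, the resample count, and the fixed seed are all assumptions made for the sketch.

```python
import random


def permutation_test_mean_diff(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Returns an estimated p-value. An exact test would enumerate every
    relabeling; random resampling is a common approximation.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a

    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign group labels
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / n_b
        if abs(mean_a - mean_b) >= observed:
            extreme += 1

    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_resamples + 1)
```

Called with two lists of interview lengths in minutes, a large returned value would be consistent with the "no significant difference" reading above.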

Amount of code written

“Brevity is the soul of wit.” -William Shakespeare

You may have experienced an interview where you were totally stumped. The interviewer asks you a question you barely understand, you repeat back to him or her “binary search what?”, and you basically write no code during your interview. You might hope that you could still pass an interview like this through sheer wit, charm, and high-level problem-solving skills. In order to assess whether or not this was true, we looked at the final character length of code written by the interviewee. The plot below shows the distributions of character length for both successful and unsuccessful interviews. A quick look at this chart suggests that there is a difference between the two: interviews that don’t go well tend to have less code. There are two phenomena that may contribute to this. First, unsuccessful interviewees may write less code to begin with. Additionally, they may be more prone to delete large swathes of code they’ve written that either don’t run or don’t return the expected result.

interview_code_length_plot

On average, successful interviews had final interview code that was 2045 characters long, whereas unsuccessful ones had code that was 1760 characters long. That’s a big difference! This finding is statistically significant and probably not very surprising.

Code modularity

“The mark of a mature programmer is willingness to throw out code you spent time on when you realize it’s pointless.” -Bram Cohen

In addition to just looking at how much code you write, we can also think about the type of code you write. Conventional wisdom suggests that good programmers don’t recycle code – they write modular code that can be reused over and over again. We wanted to know if that type of behavior was actually rewarded during the interview process. In order to do so, we looked at interviews conducted in Python5 and counted how many function definitions appeared in the final version of the interview. We wanted to know if successful interviewees defined more functions. While having more functions is not the definition of modularity, in our experience, it’s a pretty strong signal of it. As always, it’s impossible to make strong causal claims about this – it might be the case that certain interviewers (who are more or less lenient) ask interview questions that lend themselves to more or fewer functions. Nonetheless, it is an interesting trend to investigate!

The plot below shows the distribution of the number of Python functions defined for both candidates who the interviewer said they would hire and candidates who the interviewer said they would not hire. A quick look at this chart suggests that there is a difference in the distribution of function definitions between interviews that go well and interviews that don’t. Successful interviewees seem to define more functions.

python_functions_plot

On average, successful candidates interviewing in Python define 3.29 functions, whereas unsuccessful candidates define 2.71 functions. This finding is statistically significant. The upshot here is that interviewers really do reward the kind of code they say they want you to write.
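
The footnotes mention that function definitions in Python are easy to identify with a simple parsing script. One way such a script could look, using Python’s standard `ast` module, is sketched below. This is an illustration only: the function name is made up, and the regex fallback for code that doesn’t parse (interview code is often unfinished) is an assumption, not something the post describes.

```python
import ast
import re

# Fallback pattern for sources that fail to parse: a line starting with `def name(`.
DEF_PATTERN = re.compile(r"^\s*def\s+\w+\s*\(", re.MULTILINE)


def count_function_definitions(source):
    """Count function definitions (including nested ones and methods)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Unfinished or invalid code: approximate with a regex instead.
        return len(DEF_PATTERN.findall(source))
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        for node in ast.walk(tree)
    )
```

Note that this counts methods and nested functions too; whether those should count toward “modularity” is a judgment call the sketch leaves open.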

Does it matter if your code runs?

“Move fast and break things. Unless you are breaking stuff, you are not moving fast enough.” -Mark Zuckerberg
“The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.” -Brian Kernighan

A common refrain in technical interviews is that interviewers don’t actually care if your code runs – what they care about is problem-solving skills. Since we collect data on the code interviewees run and whether or not that code compiles, we wanted to see if there was evidence for this in our data. Is there any difference between the percentage of code that compiles error-free in successful interviews versus unsuccessful interviews? Furthermore, can interviewees actually still get hired, even if they make tons of syntax errors?

In order to get at this question, we looked at the data. We restricted our dataset to interviews longer than 10 minutes with more than 5 unique instances of code being executed. This helped filter out interviews where interviewers didn’t actually want the interviewee to run code, or where the interview was cut short for some reason. We then measured the percent of code runs that resulted in errors.6 Of course, there are some limitations to this approach – for instance, candidates could execute code that does compile but gives a slightly incorrect answer. They could also get the right answer and write it to stderr! Nonetheless, this should give us a directional sense of whether or not there’s a difference.

The chart below gives a summary of this data. The x-axis shows the percentage of code executions that were error-free in a given interview. So an interview with 3 code executions and 1 error message (2 of 3 runs, or about 67%, error-free) would count towards the “60%-70%” bucket. The y-axis indicates the percentage of all interviews that fall in that bucket, for both successful and unsuccessful interviews. Just eyeballing the chart below, one gets the sense that on average, successful candidates run more code that goes off without an error. But is this difference statistically significant?

does_code_compile2

On average, successful candidates’ code ran successfully (didn’t result in errors) 64% of the time, whereas unsuccessful candidates’ attempts to compile code ran successfully 60% of the time, and this difference was indeed significant. Again, while we can’t make any causal claims, the main takeaway is that successful candidates do usually write code that runs better, despite what interviewers may tell you at the outset of an interview.
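
To make the measurement concrete, here is a small sketch of how the error-free percentage and its chart bucket could be computed. It follows the rule given in the footnotes (a run counts as an error if its output contains “error” or “traceback”) and the inclusion filter described above. The data shape, a list of hypothetical (source, output) pairs per interview, the case-insensitive matching, and all function names are assumptions.

```python
ERROR_MARKERS = ("error", "traceback")  # terms named in the footnotes


def is_error_output(output):
    """True if the output mentions an error marker (case-insensitive; assumed)."""
    lowered = output.lower()
    return any(marker in lowered for marker in ERROR_MARKERS)


def count_clean_runs(runs):
    """runs: list of (source, output) pairs. Returns (error_free, total)."""
    clean = sum(not is_error_output(output) for _, output in runs)
    return clean, len(runs)


def bucket_label(clean, total, width=10):
    """Map an error-free count to a chart bucket such as '70%-80%'."""
    percent = clean * 100 // total            # integer math avoids float edge cases
    low = min(percent // width * width, 100 - width)  # 100% joins the top bucket
    return f"{low}%-{low + width}%"


def qualifies(duration_minutes, runs):
    """Filter from the post: longer than 10 minutes, more than 5 unique runs."""
    unique_sources = {source for source, _ in runs}
    return duration_minutes > 10 and len(unique_sources) > 5
```

For example, an interview with 4 code executions of which 1 printed a traceback has 3 of 4 runs error-free (75%) and would land in the “70%-80%” bucket under this scheme.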

Should you wait and gather your thoughts before writing code?

“Never forget the power of silence, that massively disconcerting pause which goes on and on and may at last induce an opponent to babble and backtrack nervously.” -Lance Morrow

We were also curious whether or not successful interviewees tended to take their time in the interview. Interview questions are often complex! After being presented with a question, there might be some benefit to taking a step back and coming up with a plan, rather than jumping right into things. In order to get a sense of whether or not this was true, we measured how far into a given interview candidates first executed code. Below is a histogram showing how far into interviews both successful and unsuccessful interviewees first ran code. Looking quickly at the histogram, you can tell that successful candidates do in fact wait a bit longer to start running code, although the magnitude of the effect isn’t huge.

how_soon_run_code_plot

More specifically, on average, candidates with successful interviews first run code 27% of the way through the interview, whereas candidates with unsuccessful interviews first run code 23.9% of the way into the interview, and this difference is significant. Of course, there are alternate explanations for what’s happening here. For instance, perhaps successful candidates are better at taking the time to sweet-talk their interviewer. Furthermore, the usual caveat that we can’t make causal claims applies – if you just sit in an interview for an extra 5 minutes in complete silence, it won’t help your chances. Nonetheless, there does seem to be a difference between the two cohorts.
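
The “how far into the interview” figure is just the time of the first code run divided by the total interview length. A minimal sketch of that calculation follows; the timestamp format (seconds) and the tuple layout are hypothetical, since the post doesn’t describe how the data is stored.

```python
from statistics import mean


def first_run_fraction(start, end, run_times):
    """Fraction of the interview elapsed at the first code run.

    Returns None if the candidate never ran code or the times are invalid.
    """
    if not run_times or end <= start:
        return None
    return (min(run_times) - start) / (end - start)


def mean_first_run_fraction(interviews):
    """interviews: iterable of (start, end, run_times) tuples, all in seconds."""
    fractions = (first_run_fraction(*interview) for interview in interviews)
    return mean(f for f in fractions if f is not None)
```

A candidate who first runs code 15 minutes into a 50-minute interview would score 0.3 here, in the same range as the averages quoted above.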

Conclusions

All in all, this post was our first attempt to understand what does and does not typically lead to an interviewer saying “you know what, I’d really like to hire this person.” Because all of our data are observational, it’s hard to make causal claims about what we see. While successful interviewees may exhibit certain behaviors, adopting those behaviors doesn’t guarantee success. Nonetheless, it does allow us to support (or call bullshit on) a lot of the advice you’ll read on the internet about how to be a successful interviewee.

That said, there is much still to be done. This was a first, quantitative pass over our data (which is, in many ways, a treasure trove of interview secrets), but we’re excited to do a deeper, qualitative dive and actually start to categorize different questions to see which carry the most signal as well as really get our head around 2nd order behaviors that you can’t measure easily by running a regex over a code sample or measuring how long an interview took. If you want to help us with this and are excited to listen to a bunch of technical interviews, drop me a line (at aline@interviewing.io)!

1All error bars in this post represent a 95% confidence interval.

2There were more languages than these on our platform, but the more obscure the language, the fewer data points we have. For instance, all interviews in Brainfuck were clearly successful. Kidding.

3The best engineers I’ve met have also been legendarily good at breaking down complex concepts and explaining them to laypeople. Why the infuriating myth of the socially awkward, incoherent tech nerd continues to exist, I have absolutely no idea.

4For every comparison of distributions in this post, we use a Fisher-Pitman permutation test to compare the difference in the means of the distributions.

5We limit this analysis to interviews in Python because it lends itself particularly well to the identification of function definitions with a simple parsing script.

6We calculate this by looking at what percentage of the time the interviewee executed code that resulted in either an error or non-error output that contained the term “error” or “traceback.”


LinkedIn endorsements are dumb. Here’s the data.

Posted on February 27th, 2017.

If you’re an engineer who’s been endorsed on LinkedIn for any number of languages/frameworks/skills, you’ve probably noticed that something isn’t quite right. Maybe they’re frameworks you’ve never touched or languages you haven’t used since freshman year of college. No matter the specifics, you’re probably at least a bit wary of the value of the LinkedIn endorsements feature. The internets, too, don’t disappoint in enumerating some absurd potential endorsements or in bemoaning the lack of relevance of said endorsements, even when they’re given in earnest.

Having a gut feeling for this is one thing, but we were curious about whether we could actually come up with some numbers that showed how useless endorsements can be, and we weren’t disappointed. If you want graphs and numbers, scroll down to the “Here’s the data” section below. Otherwise, humor me and read my completely speculative take on why endorsements exist in the first place.

LinkedIn endorsements are just noisy crowdsourced tagging

Pretend for a moment that you’re a recruiter who’s been tasked with filling an engineering role. You’re one of many people who pay LinkedIn ~$9K/year for a recruiter seat on their platform1. That hefty price tag broadens your search radius (which is otherwise artificially constrained) and lets you search the entire system. Let’s say you have to find a strong back-end engineer. How do you begin?

Unfortunately, LinkedIn’s faceted search (pictured below) doesn’t come with a “can code” filter2.

So, instead of searching for what you really want, you have to rely on proxies. Some obvious proxies, even though they’re not that great, might be where someone went to school or where they’ve worked before. However, if you need to look for engineering ability, you’re going to have to get more specific. If you’re like most recruiters, you’ll first look for the main programming language your company uses (despite knowledge of a specific language not being a good indicator of programming ability and despite most hiring managers not caring which languages their engineers know) and then go from there.

Now pretend you’re LinkedIn. You have no data about how good people are at coding, and though you do have a lot of resume/biographical data, that doesn’t tell the whole story. You can try relying on engineers filling in their own profiles with languages they know, but given that engineers tend to be pretty skittish about filling in their LinkedIn profile with a bunch of buzzwords, what do you do?

You build a crowdsourced tagger, of course! Then, all of a sudden, your users will do your work for you. Why do I think this is the case? Well, if LinkedIn cared about true endorsements rather than perpetuating the skills-based myth that keeps recruiters in their ecosystem, they could have written a weighted endorsement system by now, at the very least. That way, an endorsement from someone with expertise in some field might mean more than an endorsement from your mom (unless, of course, she’s an expert in the field).

But they don’t do that, or at least they don’t surface it in candidate search. It’s not worth it. Because the point of endorsements isn’t to get at the truth. It’s to keep recruiters feeling like they’re getting value out of the faceted search they’re paying almost $10K per seat for. In other words, improving the fidelity of endorsements would likely cannibalize LinkedIn’s revenue.

You could make the counterargument that despite the noise, LinkedIn endorsements still carry enough signal to be a useful first-pass filter and that having them is more useful than not having them. This is the question I was curious about, so I decided to cross-reference our users’ interview data with their LinkedIn endorsements.

The setup

So, what data do we have? First, for context, interviewing.io is a platform where people can practice technical interviewing anonymously with interviewers from top companies and, in the process, find jobs. Do well in practice, and you get guaranteed (and anonymous!) technical interviews at companies like Uber, Twitch, Lyft, and more. Over the course of our existence, we’ve amassed performance data from close to 5,000 real and practice interviews.

When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. Interview questions on the platform tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role.

After every interview, interviewers rate interviewees on a few different dimensions, including technical ability. Technical ability gets rated on a scale of 1 to 4, where 1 is “poor” and 4 is “amazing!”. On our platform, a score of 3 or above has generally meant that the person was good enough to move forward. You can see what our feedback form looks like below:

[Image: the interviewer feedback form, with the technical ability rating circled]

As promised, I cross-referenced our data with our users’ LinkedIn profiles and found some interesting, albeit not that surprising, stuff.

Endorsements vs. what languages people actually program in

The first thing I looked at was whether the programming language people interviewed in most frequently had any relationship to the programming language for which they were most endorsed. It was nice that, across the board, people tended to prefer one language for their interviews, so we didn’t really have a lot of edge cases to contend with.

It turns out that people’s interview language of choice matched their most endorsed language on LinkedIn just under 50% of the time.

Of course, just because you’ve been endorsed a lot for a specific language doesn’t mean that you’re not good at the other languages you’ve been endorsed for. To dig deeper, I took a look at whether our users had been endorsed for their interview language of choice at all. It turns out that people were endorsed for their language of choice 72% of the time. This isn’t a particularly powerful statement, though, because most people on our platform have been endorsed for at least 5 programming languages.

That said, even when an engineer had been endorsed for their interview language of choice, that language appeared in their “featured skills” section only 31% of the time. This means that most of the time, recruiters would have to click “View more” (see below) to see the language that people prefer to code in, if it’s even listed in the first place.

So, how often were people endorsed for their language of choice? Quantifying endorsements3 is a bit fuzzy, but to answer this meaningfully, I looked at how often people were endorsed for that language relative to how often they were endorsed for their most-endorsed language, in the cases when the two languages weren’t the same (recall that this happened about half the time). Perhaps if these numbers were close to 1 most of the time, then endorsements might carry some signal. As you can see in the histogram below, this was not the case at all.

The x-axis above is how often people were endorsed for their interview language of choice relative to their most-endorsed language. The bars on the left are cases when someone was barely endorsed for their language of choice, and all the way to the right are cases when people were endorsed for both languages equally often. All told, the distribution is actually pretty uniform, making for more noise than signal.
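For the curious, here is a minimal sketch of the ratio described above. It is not the code used for this post; the function name and the sample profile are made up purely for illustration.

```python
from typing import Dict, Optional


def relative_endorsement(endorsements: Dict[str, int],
                         interview_language: str) -> Optional[float]:
    """Endorsements for the interview language divided by endorsements for
    the most-endorsed language.

    Returns None when the comparison does not apply: the profile has no
    endorsements, or the interview language already is the most-endorsed one.
    """
    if not endorsements:
        return None
    top_language = max(endorsements, key=endorsements.get)
    if top_language == interview_language or endorsements[top_language] == 0:
        return None
    # A language that was never endorsed counts as zero endorsements.
    return endorsements.get(interview_language, 0) / endorsements[top_language]


# Hypothetical profile (made-up numbers): endorsement counts per language.
profile = {"Java": 40, "Python": 10, "C++": 4}
print(relative_endorsement(profile, "Python"))  # 0.25
```

A value near 0 means the person was barely endorsed for the language they actually interview in; a value of 1 means both languages were endorsed equally often.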

Endorsements vs. interview performance

The next thing I looked at was whether there was any correspondence between how heavily endorsed someone was on LinkedIn and their interview performance. This time, to quantify the strength of someone’s endorsements4, I looked at how many times someone was endorsed for their most-endorsed language and correlated that to their average technical score in interviews on interviewing.io.

Below, you can see a scatter plot of technical ability vs. LinkedIn endorsements, as well as my attempt to fit a line through it. As you can see, the R^2 is piss-poor, meaning that there isn’t a relationship between how heavily endorsed someone is and their technical ability to speak of.
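If you want to try this kind of fit on your own data, a minimal sketch looks something like the following. It assumes an ordinary least-squares line (the post doesn't say exactly how the line was fit), and the sample numbers are invented for illustration, not taken from our data.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both x and y must vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 is the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up example: endorsement counts vs. average technical score (1 to 4).
endorsement_counts = [2, 10, 25, 40, 60]
average_scores = [3.5, 2.0, 3.0, 2.5, 3.25]
slope, intercept, r_squared = fit_line(endorsement_counts, average_scores)
print(f"slope={slope:.4f} intercept={intercept:.4f} R^2={r_squared:.4f}")
```

An R^2 near 1 would mean endorsement counts explain most of the variation in scores; an R^2 near 0 means they explain almost none of it.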

Endorsements vs. no endorsements… and closing thoughts

Lastly, I took a look at whether having any endorsements in the first place mattered with respect to interview performance. If I’m honest, I was hoping there’d be a negative correlation, i.e. if you don’t have endorsements, you’re a better coder. After running some significance testing, though, it became clear that having any endorsements at all (or not) doesn’t matter.
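The post doesn't say which significance test was used, so treat the following as one possible approach rather than the actual method: a permutation test on the difference in mean technical score between two groups (say, people with and without endorsements). The function name, its parameters, and the sample scores are all assumptions made for illustration.

```python
import random
from typing import Sequence


def permutation_test(group_a: Sequence[float], group_b: Sequence[float],
                     n_permutations: int = 2000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in group means.

    Returns an approximate p-value: the share of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    rng = random.Random(seed)
    size_a = len(group_a)
    observed = abs(sum(group_a) / size_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:size_a], pooled[size_a:]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up average technical scores for two hypothetical groups.
with_endorsements = [3.0, 2.5, 3.5, 2.0, 3.0]
without_endorsements = [3.0, 3.5, 2.5, 2.0, 3.5]
print(permutation_test(with_endorsements, without_endorsements))
```

A large p-value here would be consistent with the finding above: the two groups are not distinguishable on interview performance.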

So, where does this leave us? As long as there’s money to be made in peddling low-signal proxies, endorsements won’t go away and probably won’t get much better. It is my hope, though, that any recruiters reading this will take a second look at the candidates they’re sourcing and try to, where possible, look at each candidate as more than the sum of their buzzword parts.

Thanks to Liz Graves for her help with the data annotation for this post.

1Roughly 60% of LinkedIn’s revenue comes from recruiting, so you can see why this stuff matters.

2You know what comes with a “can code” filter? interviewing.io does! We know how people are doing in rigorous, live technical interviews, which, in turn, lets us reliably predict how well they will do in future interviews. Roughly 60%3 of our candidates pass technical phone screens and make it onsite. Want to use us to hire?

3There are a lot of possible approaches to comparing endorsements, to each other and to other stuff. In this post, I decided to mimic, as much as possible, how a recruiter might think about a candidate’s endorsements when looking at their profile. Recruiters are busy (I know; I used to be one) and get paid to make quick judgments. Therefore, given that LinkedIn doesn’t normalize endorsements for you, if a recruiter wanted to do it, they’d have to actually add up all of someone’s endorsements and then do a bunch of pairwise division. This isn’t sustainable, and it’s much easier and faster to look at the absolute numbers. For this exact reason, when comparing the endorsements for two languages, I chose to normalize them relative to each other rather than relative to all other endorsements. And when trying to quantify the strength of someone’s programming endorsements as a whole, I opted to just count the number of endorsements for someone’s most-endorsed language.

4See footnote 3 above; I used the same rationale.

Lessons from 3,000 technical interviews… or how what you do after graduation matters way more than where you went to school

Posted on December 28th, 2016.

The first blog post I published that got any real attention was called “Lessons from a year’s worth of hiring data”. It was my attempt to understand what attributes of someone’s resume actually mattered for getting a software engineering job. Surprisingly, as it turned out, where someone went to school didn’t matter at all, and far and away, the strongest signal came from the number of typos and grammatical errors on their resume.

Since then, I’ve discovered (and written about) how useless resumes are, but ever since writing that first post, I’ve been itching to do something similar with interviewing.io’s data. For context, interviewing.io is a platform where people can practice technical interviewing anonymously and, in the process, find jobs — do well in practice, and you get guaranteed (and anonymous!) technical interviews at companies like Uber, Twitch, Lyft, and more. Over the course of our existence, we’ve amassed performance data from thousands of real and practice interviews. Data from these interviews sets us up nicely to look at what signals from an interviewee’s background might matter when it comes to performance.

As often happens, what we found was surprising, and some of it runs counter to things I’ve said and written on the subject. More on that in a bit.

The setup

When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. Interview questions on the platform tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role, and interviewers typically come from a mix of large companies like Google, Facebook, and Uber, as well as engineering-focused startups like Asana, Mattermark, KeepSafe, and more.

After every interview, interviewers rate interviewees on a few different dimensions, including technical ability. Technical ability gets rated on a scale of 1 to 4, where 1 is “poor” and 4 is “amazing!”. On our platform, a score of 3 or above has generally meant that the person was good enough to move forward. You can see what our feedback form looks like below:

[Image: the interviewer feedback form, with the technical ability rating circled]

The results

To run the analysis for this post, we cross-referenced interviewees’ average technical scores (circled in red in the feedback form above) with the attributes below to see which ones mattered most. Here’s the full attribute list1:

  • Attended a top computer science school
  • Worked at a top company
  • Took classes on Udacity/Coursera2
  • Founded a startup
  • Master’s degree
  • Years of experience

Of all of these, only 3 attributes emerged as statistically significant: top school, top company, and classes on Udacity/Coursera. Apparently, as the fine gentlemen of Metallica once said, nothing else matters. In the graph below, you can see the effect size of each of the significant attributes (attributes that didn’t achieve significance don’t have bars).

As I said at the outset, these results were quite surprising, and I’ll take a stab at explaining each of the outcomes below.

Top school & top company

Going into this, I expected top company to matter but not top school. The company thing makes sense — you’re selecting people who’ve successfully been through at least one interview gauntlet, so the odds of them succeeding at future ones should be higher.

Top school is a bit more muddy, and it was indeed the least impactful of the significant attributes. Why did schooling matter in this iteration of the data but didn’t matter when I was looking at resumes? I expect the answer lies in the disparity between performance in an isolated technical phone screen versus what happens when a candidate actually goes on site. With the right preparation, the technical phone interview is manageable, and top schools often have rigorous algorithms classes and a culture of preparing for technical phone screens (to see why this culture matters and how it might create an unfair advantage for those immersed in it, see my post about how we need to rethink the technical interview). Whether passing an algorithmic technical phone screen means you’re a great engineer is another matter entirely and hopefully the subject of a future post.

Udacity/Coursera

MOOC participation (Udacity and Coursera in particular, as those were the ones interviewing.io users gravitated to most) mattering as much as it did (and mattering way more than pedigree) was probably the most surprising finding here, and so it merited some additional digging.

In particular, I was curious about the interplay between MOOCs and top schools, so I partitioned MOOC participants into people who had attended top schools vs. people who hadn’t. When I did that, something startling emerged. For people who attended top schools, completing Udacity or Coursera courses didn’t appear to matter. However, for people who did not, the effect was huge, so huge, in fact, that it dominated the board. Moreover, interviewees who attended top schools performed significantly worse than interviewees who had not attended top schools but HAD taken a Udacity or Coursera course.
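The partitioning step itself is simple. Here is a minimal sketch of it, assuming each interviewee is reduced to a hypothetical (attended top school, took a MOOC, average technical score) record; the records below are invented and say nothing about the real results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (attended_top_school, took_mooc, average_technical_score)
Interviewee = Tuple[bool, bool, float]


def mean_score_by_group(rows: Iterable[Interviewee]) -> Dict[Tuple[bool, bool], float]:
    """Average technical score for each (top school, MOOC) combination."""
    groups: Dict[Tuple[bool, bool], List[float]] = defaultdict(list)
    for top_school, took_mooc, score in rows:
        groups[(top_school, took_mooc)].append(score)
    return {key: mean(scores) for key, scores in groups.items()}


# Made-up records, for illustration only.
sample = [
    (True, True, 3.0),
    (True, False, 2.5),
    (False, True, 3.5),
    (False, False, 2.0),
    (False, True, 3.0),
]
for (top_school, took_mooc), avg in sorted(mean_score_by_group(sample).items()):
    print(f"top_school={top_school} mooc={took_mooc} mean_score={avg:.2f}")
```

Comparing group means like this only shows where the differences are; whether a difference is statistically significant is a separate test.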

So, what does this mean? Of course (as you’re probably thinking to yourself while you read this), correlation doesn’t imply causation. As such, rather than MOOCs being a magic pill, I expect that people who gravitate toward online courses (and especially those who might have a chip on their shoulder about their undergrad pedigree and end up drinking from the MOOC firehose) already tend to be abnormally driven. But, even with that, I’d be hard pressed to say that completing great online CS classes isn’t going to help you become a better interviewee, especially if you didn’t have the benefit of a rigorous algorithms class up until then. Indeed, a lot of the courses we saw people take focused on algorithms, so it’s no surprise that supplementing your preparation with courses like this could be tremendously useful. Some of the most popular courses we saw were:

Udacity

  • Design of Computer Programs
  • Intro to Algorithms
  • Computability, Complexity & Algorithms

Coursera

  • Algorithms Specialization
  • Functional Programming Principles in Scala
  • Machine Learning
  • Algorithms on Graphs

Founder status

Having been a founder didn’t matter at all when it came to technical interview performance. This, too, isn’t that surprising. The things that make one a good founder are not necessarily the things that make one a good engineer, and if you just came out of running a startup and are looking to get back into an individual contributor role, odds are, your interview skills will be a bit rusty. This is, of course, true of folks who’ve been in industry but out of interviewing for some time, as you’ll see below.

Master’s degree & years of experience

No surprises here. I’ve ranted quite a bit about the disutility of master’s degrees, so I won’t belabor the point.

Years of experience, too, shouldn’t be that surprising. For context, our average user has about 5 years of experience, with most having between 2 and 10. I think we’ve all anecdotally observed that the time spent away from your schooling doesn’t do you any favors when it comes to interview prep. You can see a scatter plot of interview performance vs. years of experience below as well as my attempt to fit a line through it (as you can see, the R^2 is piss poor, meaning that there isn’t a relationship to speak of).

Closing thoughts

If you know me, or even if you’ve read some of my writing, you know that, in the past, I’ve been quite loudly opposed to the concept of pedigree as a useful hiring signal. With that in mind, I feel like I owe it to you to clearly acknowledge, up front, that what we found this time runs counter to my stance. But that’s the whole point, isn’t it? You live, you get some data, you make some graphs, you learn, you make new graphs, and you adjust. Even with this new data, I’m excited to see that what mattered way more than pedigree was the actions people took to better themselves (in this case, rounding out their existing knowledge with MOOCs), regardless of their background.

Most importantly, these findings have done nothing to change interviewing.io’s core mission. We’re creating an efficient and meritocratic way for candidates and companies to find each other, and as long as you can code, we couldn’t care less about who you are or where you come from. In our ideal world, all these conversations about which proxies matter more than others would be moot non-starters because coding ability would stand for, well, coding ability. And that’s the world we’re building.

Thanks to Roman Rivilis for his help with data annotation for this post.

1For fun, we tried relating browser and operating system choice to interview performance, (smugly) expecting Chrome users to dominate. Not so. Browser choice didn’t matter, nor did what OS people used while interviewing.

2We got this data from looking at interviewees’ LinkedIn profiles.

You can’t fix diversity in tech without fixing the technical interview.

Posted on November 2nd, 2016.

In the last few months, several large players, including Google and Facebook, have released their latest and ultimately disappointing diversity numbers. Even with increased effort and resources poured into diversity hiring programs, Facebook’s headcount for women and people of color hasn’t really increased in the past 3 years. Google’s numbers have looked remarkably similar, and both players have yet to make a significant impact in the space, despite a number of initiatives spanning everything from a points system rewarding recruiters for bringing in diverse candidates, to increased funding for tech education, to efforts to hire more diverse candidates in key leadership positions.

Why have gains in diversity hiring been so lackluster across the board?

Facebook justifies these disappointing numbers by citing the ubiquitous pipeline problem, namely that not enough people from underrepresented groups have access to the education and resources they need to be set up for success. And Google’s take appears to be similar, judging from what portion of their diversity-themed, forward-looking investments are focused on education.

In addition to blaming the pipeline, since Facebook’s and Google’s announcements, a growing flurry of conversations have loudly waxed causal about the real reason diversity hiring efforts haven’t worked. These have included everything from how diversity training isn’t sticky enough, to how work environments remain exclusionary and thereby unappealing to diverse candidates, to improper calibration of performance reviews, to not accounting for how marginalized groups actually respond to diversity-themed messaging.

While we are excited that more resources are being allocated to education and inclusive workplaces, at interviewing.io, we posit another reason for why diversity hiring initiatives aren’t working. After drawing on data from thousands of technical interviews, it’s become clear to us that technical interviewing is a process whose results are nondeterministic and often arbitrary. We believe that technical interviewing is a broken process for everyone but that the flaws within the system hit underrepresented groups the hardest… because they haven’t had the chance to internalize just how much of technical interviewing is a numbers game. Getting a few interview invites here and there through increased diversity initiatives isn’t enough. It’s a beginning, but it’s not enough. It takes a lot of interviews to get used to the process and the format and to understand that the stuff you do in technical interviews isn’t actually the stuff you do at work every day. And it takes people in your social circle all going through the same experience, screwing up interviews here and there, and getting back on the horse to realize that poor performance in one interview isn’t predictive of whether you’ll be a good engineer.

A brief history of technical interviewing

A definitive work on the history of technical interviewing was surprisingly hard to find, but I was able to piece together a narrative by scouring books like How Would You Move Mount Fuji, Programming Interviews Exposed, and the bounty of the internets. The story goes something like this.

Technical interviewing has its roots as far back as 1950s Palo Alto, at Shockley Semiconductor Laboratories. Shockley’s interviewing methodology came out of a need to separate the innovative, rapidly moving, Cold War-fueled tech space from the hiring approaches of more traditional, skills-based, assembly-line industries. And so, he relied on questions that could gauge analytical ability, intellect, and potential quickly. One canonical question in this category has to do with coins. You have 8 identical-looking coins, except one is lighter than the rest. Figure out which one it is with just two weighings on a pan balance.
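In case you want to check your answer, here is one way to solve the coin puzzle, sketched in code. The trick is to weigh three coins against three: each weighing has three outcomes, so it narrows the candidates to at most three coins, and a second weighing finishes the job. The `weigh` helper simulating the pan balance is made up for this example.

```python
from typing import Sequence


def weigh(coins: Sequence[float], left: Sequence[int], right: Sequence[int]) -> int:
    """Simulate a pan balance on the coins at the given indices.

    Returns -1 if the left pan is lighter, 1 if the right pan is lighter,
    and 0 if the pans balance.
    """
    left_weight = sum(coins[i] for i in left)
    right_weight = sum(coins[i] for i in right)
    return (left_weight > right_weight) - (left_weight < right_weight)


def find_light_coin(coins: Sequence[float]) -> int:
    """Return the index of the single lighter coin among 8, using 2 weighings."""
    # Weighing 1: coins 0-2 against coins 3-5, leaving 6 and 7 aside.
    first = weigh(coins, [0, 1, 2], [3, 4, 5])
    if first == 0:
        # The pans balance, so the light coin is 6 or 7.
        # Weighing 2: compare them directly.
        return 6 if weigh(coins, [6], [7]) == -1 else 7
    suspects = [0, 1, 2] if first == -1 else [3, 4, 5]
    # Weighing 2: compare two of the three suspects; if they balance,
    # the coin left out is the light one.
    second = weigh(coins, [suspects[0]], [suspects[1]])
    if second == 0:
        return suspects[2]
    return suspects[0] if second == -1 else suspects[1]


# Try every possible position of the light coin.
for light_index in range(8):
    coins = [10] * 8
    coins[light_index] = 9
    assert find_light_coin(coins) == light_index
print("found the light coin in two weighings for all 8 positions")
```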

The techniques that Shockley developed were adapted by Microsoft during the 90s, as the first dot-com boom spurred an explosion in tech hiring. Like Shockley, Microsoft was constrained by both hiring volume and a high bar for analytical ability and adaptability, and it, too, needed to vet people quickly for potential. As software engineering became increasingly complex over the course of the dot-com boom, it was no longer possible to have a few centralized “master programmers” manage the design and then delegate away the minutiae. Even rank and file developers needed to be able to produce under a variety of rapidly evolving conditions, where just mastery of specific skills wasn’t enough.

The puzzle format, in particular, was easy to standardize because individual hiring managers didn’t have to come up with their own interview questions, and a company could quickly build up its own interchangeable question repository.

This mentality also applied to the interview process itself — rather than having individual teams run their own processes and pipelines, it made much more sense to standardize things. This way, in addition to questions, you could effectively plug and play the interviewers themselves — any interviewer within your org could be quickly trained up and assigned to speak with any candidate, independent of prospective team.

Puzzle questions were a good solution for this era for a different reason. Collaborative editing of documents didn’t become a thing until Google Docs’ launch in 2007. Without that capability, writing code in a phone interview was untenable: if you’ve ever tried to talk someone through how to code something up without at least a shared piece of paper in front of you, you know how painful it can be. In the absence of being able to write code in front of someone, the puzzle question was a decent proxy. Technology marched on, however, and its evolution made it possible to move from the proxy of puzzles to more concrete, coding-based interview questions. Around the same time, Google itself publicly disavowed puzzle questions.

So where does this leave us? Technical interviews are moving in the direction of more concreteness, but they are still very much a proxy for the day-to-day work that a software engineer actually does. The hope was that the proxy would be decent enough, but it was always understood that that’s what they were and that the cost-benefit of relying on a proxy worked out in cases where problem solving trumped specific skills and where the need for scale trumped everything else.

As it happens, elevating problem-solving ability and the need for a scalable process are both eminently reasonable motivations. But here’s the unfortunate part: the second reason, namely the need for scalability, doesn’t apply in most cases. Very few companies are large enough to need plug and play interviewers. But coming up with interview questions and processes is really hard, so despite their differing needs, smaller companies often take their cues from the larger players, not realizing that companies like Google are successful at hiring because the work they do attracts an assembly line of smart, capable people… and that their success at hiring is often despite their hiring process and not because of it. So you end up with a de facto interviewing cargo cult, where smaller players blindly mimic the actions of their large counterparts and blindly hope for the same results.

The worst part is that these results may not even be repeatable… for anyone. To show you what I mean, I’ll talk a bit about some data we collected at interviewing.io.

Technical interviewing is broken for everybody

Interview outcomes are kind of arbitrary

interviewing.io is a platform where people can practice technical interviewing anonymously and, in the process, find jobs. Interviewers and interviewees meet in a collaborative coding environment and jump right into a technical interview question. After each interview, both sides rate one another, and interviewers rate interviewees on their technical ability. And the same interviewee can do multiple interviews, each of which is with a different interviewer and/or different company, and this opens the door for some interesting and somewhat controlled comparative analysis.

We were curious to see how consistent the same interviewee’s performance was from interview to interview, so we dug into our data. After looking at thousands of interviews on the platform, we’ve discovered something alarming: interviewee performance from interview to interview varied quite a bit, even for people with a high average performance. In the graph below, every point represents the mean technical score for an individual interviewee who has done 2 or more interviews on interviewing.io. The y-axis is the standard deviation of performance, so the higher up you go, the more volatile interview performance becomes.

As you can see, roughly 25% of interviewees are consistent in their performance, but the rest are all over the place. And over a third of people with a high mean (>=3) technical performance bombed at least one interview.
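To make the two axes concrete, here is a minimal sketch of the per-interviewee summary. It uses the population standard deviation, which is an assumption (the post doesn’t say which variant was used), and the interviewee names and scores are invented.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def performance_summary(
    scores_by_interviewee: Dict[str, List[int]],
) -> Dict[str, Tuple[float, float]]:
    """Map each interviewee with 2 or more interviews to
    (mean technical score, standard deviation of technical score)."""
    return {
        who: (mean(scores), pstdev(scores))
        for who, scores in scores_by_interviewee.items()
        if len(scores) >= 2
    }


# Made-up technical scores on the 1 to 4 scale, for illustration only.
sample = {
    "steady": [3, 3, 3],      # same score every time: no volatility
    "volatile": [4, 2, 4, 1], # strong in some interviews, weak in others
    "one_timer": [4],         # skipped: only one interview
}
for who, (avg, spread) in performance_summary(sample).items():
    print(f"{who}: mean={avg:.2f} std_dev={spread:.2f}")
```

A standard deviation of 0 means the person got the same score in every interview; the larger it gets, the less a single interview tells you about them.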

Despite the noise, from the graph above, you can make some guesses about which people you’d want to interview. However, keep in mind that each person above represents a mean. Let’s pretend that, instead, you had to make a decision based on just one data point. That’s where things get dicey. Looking at this data, it’s not hard to see why technical interviewing is often perceived as a game. And, unfortunately, it’s a game where people often can’t tell how they’re doing.

No one can tell how they’re doing

I mentioned above that on interviewing.io, we collect post-interview feedback. In addition to asking interviewers how their candidates did, we also ask interviewees how they think they did. Comparing those numbers for each interview showed us something really surprising: people are terrible at gauging their own interview performance, and impostor syndrome is particularly prevalent. In fact, people underestimate their performance over twice as often as they overestimate it. Take a look at the graph below to see what I mean:

Note that, in our data, impostor syndrome knows no gender or pedigree — it hits engineers on our platform across the board, regardless of who they are or where they come from.

Now here’s the messed up part. During the feedback step that happens after each interview, we ask interviewees if they’d want to work with their interviewer. As it turns out, there’s a very strong relationship between whether people think they did well and whether they would indeed want to work with the interviewer — when people think they did poorly, even if they actually didn’t, they may be a lot less likely to want to work with you. And, by extension, it means that in every interview cycle, some portion of interviewees are losing interest in joining your company just because they didn’t think they did well, despite the fact that they actually did.

As a result, companies are losing candidates from all walks of life because of a fundamental flaw in the process.

Poor performances hit marginalized groups the hardest

Though impostor syndrome appears to hit engineers from all walks of life, we’ve found that women get hit the hardest in the face of an actually poor performance. As we learned above, poor performances in technical interviewing happen to most people, even people who are generally very strong. However, when we looked at our data, we discovered that after a poor performance, women are 7 times more likely to stop practicing than men:

A bevy of research appears to support confidence-based attrition as a very real cause for women departing from STEM fields, but I would expect that the implications of the attrition we witnessed extend beyond women to underrepresented groups, across the board.

What the real problem is

At the end of the day, because technical interviewing is indeed a game, like all games, it takes practice to improve. However, unless you’ve been socialized to expect and be prepared for the game-like aspect of the experience, it’s not something that you can necessarily intuit. And if you go into your interviews expecting them to be indicative of your aptitude at the job, which is, at the outset, not an unreasonable assumption, you will be crushed the first time you crash and burn. But the process isn’t a great or predictable indicator of your aptitude. And on top of that, you likely can’t tell how you’re doing even when you do well.

These are issues that everyone who’s gone through the technical interviewing gauntlet has grappled with. But not everyone has the wherewithal or social support to realize that the process is imperfect and to stick with it. And the fewer people like you are involved, whether it’s because they’re not the same color as you or the same gender or because not a lot of people at your school study computer science or because you’re a dropout or for any number of other reasons, the less support or insider knowledge or 10,000 foot view of the situation you’ll have. Full stop.

Inclusion and education aren’t enough

To help remedy the lack of diversity in its headcount, Facebook has committed to three actionable steps on varying time frames. The first step revolves around creating a more inclusive interview/work environment for existing candidates. The other two are focused on addressing the perceived pipeline problem in tech:

  • Short Term: Building a Diverse Slate of Candidates and an Inclusive Working Environment
  • Medium Term: Supporting Students with an Interest in Tech
  • Long Term: Creating Opportunity and Access

Indeed, efforts to promote inclusiveness and increased funding for education are extremely noble, especially in the face of potentially not being able to see results for years in the case of the latter. However, both take a narrow view of the problem and both continue to funnel candidates into a broken system.

Erica Baker really cuts to the heart of it in her blog post about Twitter hiring a head of D&I:

“What irks me the most about this is that no company, Twitter or otherwise, should have a VP of Diversity and Inclusion. When the VP of Engineering… is thinking about hiring goals for the year, they are not going to concern themselves with the goals of the VP of Diversity and Inclusion. They are going to say ‘hiring more engineers is my job, worrying about the diversity of who I hire is the job of the VP of Diversity and Inclusion.’ When the VP of Diversity and Inclusion says ‘your org is looking a little homogenous, do something about it,’ the VP of Engineering won’t prioritize that because the VP of Engineering doesn’t report to the VP of Diversity and Inclusion, so knows there usually isn’t shit the VP of Diversity and Inclusion can do if the Eng org doesn’t see some improvement in diversity.”

Indeed, this is sad, but true. When faced with a high-visibility conundrum like diversity hiring, a pragmatic and even reasonable reaction on any company’s part is to make a few high-profile hires and throw money at the problem. Then, it looks like you’re doing something, and spinning up a task force or a department or new set of titles is a lot easier than attempting to uproot the entire status quo.

As such, we end up with a newly minted, well-funded department pumping a ton of resources into feeding people who’ve not yet learned that interviewing is a game into a broken, nondeterministic machine of a process made even worse by the fact that said process favors confidence and persistence over bona fide ability… and where the link between success in navigating said process and subsequent on-the-job performance is tenuous at best.

How to fix things

In the evolution of the technical interview, we saw a gradual reduction in the need for proxies as the technology to write code together remotely emerged; with its advent, abstract, largely arbitrary puzzle questions could start to be phased out.

What’s the next step? Technology has the power to free us from relying on proxies, so that we can look at each individual as an indicative, unique bundle of performance-based data points. At interviewing.io, we make it possible to move away from proxies by looking at each interviewee as a collection of data points that tell a story, rather than one arbitrary glimpse of something they did once.

But that’s not enough either. Interviews themselves need to continue to evolve. The process itself needs to be repeatable, predictive of aptitude at the actual job, and not a system to be gamed, where knowing the rules confers a huge benefit. And the larger organizations whose processes act as a template for everyone else need to lead the charge. Only then can we really be welcoming to a truly diverse group of candidates.

After a lot more data, technical interview performance really is kind of arbitrary.

Posted on October 13th, 2016.

interviewing.io is a platform where people can practice technical interviewing anonymously, and if things go well, get jobs at top companies in the process. We started it because resumes suck and because we believe that anyone, regardless of how they look on paper, should have the opportunity to prove their mettle.

In February of 2016, we published a post about how people’s technical interview performance, from interview to interview, seemed quite volatile. At the time, we just had a few hundred interviews to draw on, so as you can imagine, we were quite eager to rerun the numbers with the advent of more data. After drawing on over a thousand interviews, the numbers hold up. In other words, technical interview outcomes do really seem to be kind of arbitrary.

The setup

When an interviewer and an interviewee match on interviewing.io, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. After each interview, people leave one another feedback, and each party can see what the other person said about them once they both submit their reviews.

After every interview, interviewers rate interviewees on a few different dimensions, including technical ability. Technical ability gets rated on a scale of 1 to 4, where 1 is “poor” and 4 is “amazing!” (you can see the feedback form here). On our platform, a score of 3 or above has generally meant that the person was good enough to move forward.

At this point, you might say, that’s nice and all, but what’s the big deal? Lots of companies collect this kind of data in the context of their own pipelines. Here’s the thing that makes our data special: the same interviewee can do multiple interviews, each of which is with a different interviewer and/or different company, and this opens the door for some pretty interesting and somewhat controlled comparative analysis.

Performance from interview to interview really is arbitrary

If you’ve read our first post on this subject, you’ll recognize the visualization below. For the as yet uninitiated, every icon represents the mean technical score for an individual interviewee who has done 2 or more interviews on the platform. The y-axis is standard deviation of performance, so the higher up you go, the more volatile interview performance becomes. If you hover over each icon, you can drill down and see how that person did in each of their interviews. Anytime you see bolded text with a dotted underline, you can hover over it to see relevant data viz. Try it now to expand everyone’s performance. You can also hover over the labels along the x-axis to drill into the performance of people whose means fall into those buckets.

Standard Dev vs. Mean of Interviewee Performance
(1316 Interviews w/ 259 Interviewees)

As you can see, roughly 20% of interviewees are consistent in their performance (down from 25% the last time we did this analysis), and the rest are all over the place. If you look at the graph above, despite the noise, you can probably make some guesses about which people you’d want to interview. However, keep in mind that each icon represents a mean. Let’s pretend that, instead, you had to make a decision based on just one data point. That’s where things get dicey.1 For instance:

  • Many people who scored at least one 4 also scored at least one 2.
  • And as you saw above, a good number of people who scored at least one 4 also scored at least one 1.
  • If we look at high performers (mean of 3.3 or higher), we still see a fair amount of variation.
  • Things get really murky when we consider “average” performers (mean between 2.6 and 3.3).

What do the most volatile interviewees have in common?

In the plot below, you can see interview performance over time for interviewees with the highest standard deviations on the platform (the cutoff we used was a standard dev of 1 or more, and this accounted for roughly 12% of our users). Note that the mix of dashed and dotted lines is purely visual — this way it’s easier to follow each person’s performance path.

So, what do the most highly volatile performers have in common? The answer appears to be, well, nothing. About half were working at top companies while interviewing, and half weren’t. Breakdown of top school was roughly 60/40. And years of experience didn’t have much to do with it either: a plurality of interviewees had between 2 and 6 years of experience, with the rest all over the board (varying between 1 and 20 years).

So, all in all, the factors that go into performance volatility are likely a lot more nuanced than the traditional cues we often use to make value judgments about candidates.
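For anyone who wants to run the same kind of analysis on their own interview data, here is a minimal sketch of the per-interviewee mean and standard deviation, plus the "standard dev of 1 or more" cutoff mentioned above. The scores below are made up for illustration and are not platform data, the function names are hypothetical, and the choice of population standard deviation (rather than sample) is an assumption, since this post doesn't say which one was used.

```python
from statistics import mean, pstdev

# Hypothetical technical scores (1-4) per interviewee; NOT real platform data.
scores_by_interviewee = {
    "interviewee_a": [3, 3, 3],
    "interviewee_b": [4, 2, 1, 4],
    "interviewee_c": [2, 3],
}


def summarize(scores):
    """Return (mean, standard deviation) for one interviewee's scores.

    Uses the population standard deviation; this is an assumption.
    """
    return mean(scores), pstdev(scores)


def is_volatile(scores, cutoff=1.0):
    """Flag an interviewee whose standard deviation meets the cutoff."""
    return pstdev(scores) >= cutoff


for name, scores in scores_by_interviewee.items():
    # Only interviewees with 2 or more interviews are included.
    if len(scores) < 2:
        continue
    avg, sd = summarize(scores)
    print(f"{name}: mean={avg:.2f} sd={sd:.2f} volatile={is_volatile(scores)}")
```

With only a handful of interviews per person, the sample and population versions of the standard deviation can differ noticeably, so whichever you pick, it's worth using it consistently.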

Why does volatility matter?

I discussed the implications of these findings for technical hiring at length in the last post, but briefly, a noisy, non-deterministic interview process does no favors to either candidates or companies. Both end up expending a lot more effort to get a lot less signal than they ought, and in a climate where software engineers are at such a premium, noisy interviews only serve to exacerbate the problem.

But beyond micro and macro inefficiencies, I suspect there’s something even more insidious and unfortunate going on here. Once you’ve done a few traditional technical interviews, the volatility and lack of determinism in the process is something you figure out anecdotally and kind of accept. And if you have the benefit of having friends who’ve also been through it, it only gets easier. What if you don’t, however?

In a previous post, we talked about how women quit interview practice 7 times more often than men after just one bad interview. It’s not too much of a leap to say that this is probably happening to any number of groups who are underrepresented/underserved by the current system. In other words, though it’s a broken process for everyone, the flaws within the system hit these groups the hardest… because they haven’t had the chance to internalize just how much of technical interviewing is a game. More on this subject in our next post!

What can we do about it?

So, yes, the state of technical hiring isn’t great right now, but here’s what we can say. If you’re looking for a job, the best piece of advice we can give you is to really internalize that interviewing is a numbers game. Between the kind of volatility we discussed in this post, impostor syndrome, poor evaluation techniques, and how hard it can be to get meaningful, realistic practice, it takes a lot of interviews to find a great job.

And if you’re hiring people, in the absence of a radical shift in how we vet technical ability, we’ve learned that drawing on aggregate performance is much more meaningful than making such an important decision based on one single, arbitrary interview. Not only can aggregate performance help correct for an uncharacteristically poor performance, but it can also weed out people who eventually do well in an interview by chance or those who, over time, simply up and memorize Cracking the Coding Interview. At interviewing.io, even after just a handful of interviews, we have a much better picture of what someone is capable of and where they stack up than a single company would after a single interview, and aggregate data tells a much more compelling, repeatable story than one arbitrary data point.

1At this point you might say that it’s erroneous and naive to compare raw technical scores to one another for any number of reasons, not the least of which is that one interviewer’s 4 is another interviewer’s 2. For a comprehensive justification of using raw scores comparatively, please check out the appendix to our previous post on this subject. Just to make sure the numbers hold up, I reran them, and this time, our R-squared is even higher than before (0.41 vs. 0.39 last time).

Huge thanks to Ian Johnson, creator of d3 Building Blocks, who made the graph entitled Standard Dev vs. Mean of Interviewee Performance (the one with the icons) as well as all the visualizations that go with it.

People are still bad at gauging their own interview performance. Here’s the data.

Posted on September 8th, 2016.

interviewing.io is a platform where people can practice technical interviewing anonymously, and if things go well, get jobs at top companies in the process. We started it because resumes suck and because we believe that anyone, regardless of how they look on paper, should have the opportunity to prove their mettle.

At the end of 2015, we published a post about how people are terrible at gauging their own interview performance. At the time, we just had a few hundred interviews to draw on, so as you can imagine, we were quite eager to rerun the numbers with the advent of more data. After drawing on roughly one thousand interviews, we were surprised to find that the numbers have really held up, and that people continue to be terrible at gauging their own interview performance.

The setup

When an interviewer and an interviewee match on interviewing.io, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. After each interview, people leave one another feedback, and each party can see what the other person said about them once they both submit their reviews.

If you’re curious, you can see what the feedback forms look like below — in addition to one direct yes/no question, we also ask about a few different aspects of interview performance using a 1-4 scale. We also ask interviewees some extra questions that we don’t share with their interviewers, and one of those questions is about how well they think they did. For context, a technical score of 3 or above seems to be the rough cut-off for hirability.

Feedback form for interviewers

Feedback form for interviewees

Perceived versus actual performance… revisited

Below are two heatmaps of perceived vs. actual performance per interview (for interviews where we had both pieces of data). In each heatmap, the darker areas represent higher interview concentration. For instance, the darkest square represents interviews where both perceived and actual performance were rated as a 3. You can hover over each square to see the exact interview count (denoted by “z”).

The first heatmap is our old data:

And the second heatmap is our data as of August 2016:

As you can see, even with the advent of a lot more interviews, the heatmaps look remarkably similar. The R-squared for a linear regression on the first data set is 0.24. And for the more recent data set, it’s dropped to 0.18. In both cases, even though some small positive relationship between actual and perceived performance does exist, it is not a strong, predictable correspondence.
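If you'd like to compute this kind of R-squared yourself, here's a rough sketch for a simple one-variable linear regression, where R-squared equals the squared correlation between the two variables. The score pairs are invented for illustration (they are not the data behind the heatmaps), and `r_squared` is a hypothetical helper name.

```python
def r_squared(xs, ys):
    """R-squared of a least-squares line fit of ys on xs.

    For a single predictor this equals the squared Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R-squared is undefined when a variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical (actual, perceived) scores on the 1-4 scale; NOT real data.
actual = [4, 3, 2, 3, 4, 1, 3, 2]
perceived = [2, 3, 3, 2, 3, 2, 4, 1]

print(f"R-squared: {r_squared(actual, perceived):.2f}")
```

A value near 0 means perceived performance tells you very little about actual performance, and a value near 1 means the two move together almost perfectly.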

You can also see there’s a non-trivial amount of impostor syndrome going on in the graph above, which probably comes as no surprise to anyone who’s been an engineer. Take a look at the graph below to see what I mean.

The x-axis is the difference between actual and perceived performance, i.e. actual minus perceived. In other words, a negative value means that you overestimated your performance, and a positive one means that you underestimated it. Therefore, every bar above 0 is impostor syndrome country, and every bar below zero belongs to its foulsome, overconfident cousin, the Dunning-Kruger effect.1

On interviewing.io (though I wouldn’t be surprised if this finding extrapolated to the qualified engineering population at large), impostor syndrome plagues interviewees roughly twice as often as Dunning-Kruger. Which, I guess, is better than the alternative.
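The actual-minus-perceived bookkeeping described above is simple enough to sketch in a few lines. Again, the pairs below are invented for illustration and the names are hypothetical; the only thing taken from this post is the sign convention (positive means you underestimated yourself, negative means you overestimated).

```python
from collections import Counter


def classify(actual, perceived):
    """Label one interview by the sign of (actual - perceived)."""
    diff = actual - perceived
    if diff > 0:
        return "underestimated"  # the impostor syndrome side
    if diff < 0:
        return "overestimated"  # the Dunning-Kruger side
    return "accurate"


# Hypothetical (actual, perceived) score pairs; NOT real platform data.
pairs = [(4, 2), (3, 3), (2, 3), (4, 3), (3, 1), (1, 2)]

counts = Counter(classify(a, p) for a, p in pairs)
for label in ("underestimated", "accurate", "overestimated"):
    print(f"{label}: {counts[label]}")
```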

Why people underestimate their performance

With all this data, I couldn’t resist digging into interviews where interviewees gave themselves 1’s and 2’s but where interviewers gave them 4’s to try to figure out if there were any common threads. And, indeed, a few trends emerged. The interviews that tended to yield the most interviewee impostor syndrome were ones where question complexity was layered. In other words, the interviewer would start with a fairly simple question and then, when the interviewee completed it successfully, they would change things up to make it harder. Lather, rinse, repeat. In some cases, an interviewer could get through up to 4 layered tiers in about an hour. Inevitably, even a good interviewee will hit a wall eventually, even if the place where it happens is way further out than the boundary for most people who attempt the same question.

Another trend I observed had to do with interviewees beating themselves up for issues that mattered a lot to them but fundamentally didn’t matter much to their interviewer: off-by-one errors, small syntax errors that made it impossible to compile their code (even though everything was semantically correct), getting big-O wrong the first time and then correcting themselves, and so on.

Interestingly enough, how far off people were in gauging their own performance was independent of how highly rated (overall) their interviewer was or how strict their interviewer was.

With that in mind, if I learned anything from watching these interviews, it was this. Interviewing is a flawed, human process. Both sides want to do a good job, but sometimes the things that matter to each side are vastly different. And sometimes the standards that both sides hold themselves to are vastly different as well.

Why this (still) matters for hiring, and what you can do to make it better

Techniques like layered questions are important to sussing out just how good a potential candidate is and can make for a really engaging positive experience, so removing them isn’t a good solution. And there probably isn’t that much you can do directly to stop an engineer from beating themselves up over a small syntax error (especially if it’s one the interviewer didn’t care about). However, all is not lost!

As you recall, during the feedback step that happens after each interview, we ask interviewees if they’d want to work with their interviewer. As it turns out, there’s a very statistically significant relationship between whether people think they did well and whether they’d want to work with the interviewer. This means that when people think they did poorly, they may be a lot less likely to want to work with you. And by extension, it means that in every interview cycle, some portion of interviewees are losing interest in joining your company just because they didn’t think they did well, despite the fact that they actually did.

How can one mitigate these losses? Give positive, actionable feedback immediately (or as soon as possible)! This way people don’t have time to go through the self-flagellation gauntlet that happens after a perceived poor performance, followed by the inevitable rationalization that they totally didn’t want to work there anyway.

1I’m always terrified of misspelling “Dunning-Kruger” and not double-checking it because of overconfidence in my own spelling abilities.

We built voice modulation to mask gender in technical interviews. Here’s what happened.

Posted on June 29th, 2016.

interviewing.io is a platform where people can practice technical interviewing anonymously and, in the process, find jobs based on their interview performance rather than their resumes. Since we started, we’ve amassed data from thousands of technical interviews, and in this blog, we routinely share some of the surprising stuff we’ve learned. In this post, I’ll talk about what happened when we built real-time voice masking to investigate the magnitude of bias against women in technical interviews. In short, we made men sound like women and women sound like men and looked at how that affected their interview performance. We also looked at what happened when women did poorly in interviews, how drastically that differed from men’s behavior, and why that difference matters for the thorny issue of the gender gap in tech.

The setup

When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. Interview questions on the platform tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role, and interviewers typically come from a mix of large companies like Google, Facebook, Twitch, and Yelp, as well as engineering-focused startups like Asana, Mattermark, and others.

After every interview, interviewers rate interviewees on a few different dimensions.

Feedback form for interviewers

As you can see, we ask the interviewer if they would advance their interviewee to the next round. We also ask about a few different aspects of interview performance using a 1-4 scale. On our platform, a score of 3 or above is generally considered good.

Women historically haven’t performed as well as men…

One of the big motivators to think about voice masking was the increasingly uncomfortable disparity in interview performance on the platform between men and women1. At that time, we had amassed over a thousand interviews with enough data to do some comparisons and were surprised to discover that women really were doing worse. Specifically, men were getting advanced to the next round 1.4 times more often than women. Interviewee technical score wasn’t faring that well either — men on the platform had an average technical score of 3 out of 4, as compared to a 2.5 out of 4 for women.

Despite these numbers, it was really difficult for me to believe that women were just somehow worse at computers, so when some of our customers asked us to build voice masking to see if that would make a difference in the conversion rates of female candidates, we didn’t need much convincing.

… so we built voice masking

Ever since we started working on interviewing.io, we knew that achieving true interviewee anonymity would eventually mean hiding gender, but we put it off for a while because building a real-time voice modulator wasn’t technically trivial. Some early ideas included sending female users a Bane mask.

Early voice masking prototype (drawing by Marcin Kanclerz)

When the Bane mask thing didn’t work out, we decided we ought to build something within the app, and if you play the videos below, you can get an idea of what voice masking on interviewing.io sounds like. In the first one, I’m talking in my normal voice.

And in the second one, I’m modulated to sound like a man.2

Armed with the ability to hide gender during technical interviews, we were eager to see what the hell was going on and get some insight into why women were consistently underperforming.

The experiment

The setup for our experiment was simple. Every Tuesday evening at 7 PM Pacific, interviewing.io hosts what we call practice rounds. In these practice rounds, anyone with an account can show up, get matched with an interviewer, and go to town. And during a few of these rounds, we decided to see what would happen to interviewees’ performance when we started messing with their perceived genders.

In the spirit of not giving away what we were doing and potentially compromising the experiment, we told both interviewees and interviewers that we were slowly rolling out our new voice masking feature and that they could opt in or out of helping us test it out. Most people opted in, and we informed interviewees that their voice might be masked during a given round and asked them to refrain from sharing their gender with their interviewers. For interviewers, we simply told them that interviewee voices might sound a bit processed.

We ended up with 234 total interviews (roughly 2/3 male and 1/3 female interviewees), which fell into one of three categories:

  • Completely unmodulated (useful as a baseline)
  • Modulated without pitch change
  • Modulated with pitch change

You might ask why we included the second condition, i.e. modulated interviews that didn’t change the interviewee’s pitch. As you probably noticed, if you played the videos above, the modulated one sounds fairly processed. The last thing we wanted was for interviewers to assume that any processed-sounding interviewee must automatically be the opposite gender of what they sounded like. So we threw that condition in as a further control.

The results

After running the experiment, we ended up with some rather surprising results. Contrary to what we expected (and probably contrary to what you expected as well!), masking gender had no effect on interview performance with respect to any of the scoring criteria (would advance to next round, technical ability, problem solving ability). If anything, we started to notice some trends in the opposite direction of what we expected: for technical ability, it appeared that men who were modulated to sound like women did a bit better than unmodulated men and that women who were modulated to sound like men did a bit worse than unmodulated women. Though these trends weren’t statistically significant, I am mentioning them because they were unexpected and definitely something to watch for as we collect more data.

On the subject of sample size, we have no delusions that this is the be-all and end-all of pronouncements on the subject of gender and interview performance. We’ll continue to monitor the data as we collect more of it, and it’s very possible that as we do, everything we’ve found will be overturned. I will say, though, that had there been any staggering gender bias on the platform, with a few hundred data points, we would have gotten some kind of result. So that, at least, was encouraging.

So if there’s no systemic bias, why are women performing worse?

After the experiment was over, I was left scratching my head. If the issue wasn’t interviewer bias, what could it be? I went back and looked at the seniority levels of men vs. women on the platform as well as the kind of work they were doing in their current jobs, and neither of those factors seemed to differ significantly between groups. But there was one nagging thing in the back of my mind. I spend a lot of my time poring over interview data, and I had noticed something peculiar when observing the behavior of female interviewees. Anecdotally, it seemed like women were leaving the platform a lot more often than men. So I ran the numbers.

What I learned was pretty shocking. As it happens, women leave interviewing.io roughly 7 times as often as men after they do badly in an interview. And the numbers for two bad interviews aren’t much better. You can see the breakdown of attrition by gender below (the differences between men and women are indeed statistically significant with P < 0.00001).

Also note that, as much as possible, I corrected for people leaving the platform because they found a job (practicing interviewing isn’t that fun after all, so you’re probably only going to do it if you’re still looking), were just trying out the platform out of curiosity, or didn’t like something else about their interviewing.io experience.

A totally speculative thought experiment

So, if these are the kinds of behaviors that happen in the interviewing.io microcosm, how much is applicable to the broader world of software engineering? Please bear with me as I wax hypothetical and try to extrapolate what we’ve seen here to our industry at large. And also, please know that what follows is very speculative, based on not that much data, and could be totally wrong… but you gotta start somewhere.

If you consider the attrition data points above, you might want to do what any reasonable person would do in the face of an existential or moral quandary, i.e. fit the data to a curve. An exponential decay curve seemed reasonable for attrition behavior, and you can see what I came up with below. The x-axis is the number of what I like to call “attrition events”, namely things that might happen to you over the course of your computer science studies and subsequent career that might make you want to quit. The y-axis is what portion of people are left after each attrition event. The red curve denotes women, and the blue curve denotes men.

Now, as I said, this is pretty speculative, but it really got me thinking about what these curves might mean in the broader context of women in computer science. How many “attrition events” does one encounter between primary and secondary education and entering a collegiate program in CS and then starting to embark on a career? So, I don’t know, let’s say there are 8 of these events between getting into programming and looking around for a job. If that’s true, then we need 3 times as many women studying computer science as men to get to the same number in our pipelines. Note that that’s 3 times more than men, not 3 times more than there are now. If we think about how many there are now, which, depending on your source, is between 1/3 and 1/4 of the number of men, to get to pipeline parity, we actually have to increase the number of women studying computer science by an entire order of magnitude.
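To make the arithmetic above a bit more concrete, here's a back-of-the-envelope sketch. It assumes that each attrition event retains a constant fraction of each group, which is what an exponential decay curve implies but is still an assumption, and all of the function names are hypothetical. It takes only three numbers from this post: 8 events, a 3x starting ratio, and women currently numbering between 1/3 and 1/4 of men.

```python
def required_ratio(retention_men, retention_women, events):
    """Women-to-men starting ratio needed so equal numbers remain.

    Assumes each event keeps a constant fraction of each group, so the
    fraction remaining after `events` events is retention ** events.
    """
    return (retention_men / retention_women) ** events


def implied_retention_gap(target_ratio, events):
    """Per-event retention ratio (men / women) implied by a target ratio."""
    return target_ratio ** (1 / events)


def increase_needed(target_ratio, current_ratio):
    """Factor by which the current number of women would have to grow."""
    return target_ratio / current_ratio


EVENTS = 8        # number of attrition events assumed in the post
TARGET_RATIO = 3  # women need to start at 3x the number of men

gap = implied_retention_gap(TARGET_RATIO, EVENTS)
print(f"Implied per-event retention gap (men/women): {gap:.3f}")

# Women currently number between 1/3 and 1/4 of men, per the post.
for current_ratio in (1 / 3, 1 / 4):
    factor = increase_needed(TARGET_RATIO, current_ratio)
    print(f"Current ratio {current_ratio:.2f}: grow by roughly {factor:.0f}x")
```

The point of the sketch is that a fairly small per-event difference in retention compounds quickly: eight events are enough to turn it into a 3x gap, and starting from 1/3 to 1/4 of the number of men turns that into roughly a 9x to 12x increase.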

Prior art, or why maybe this isn’t so nuts after all

Since gathering these findings and starting to talk about them a bit in the community, I began to realize that there was some supremely interesting academic work being done on gender differences around self-perception, confidence, and performance. Some of the work below found slightly different trends than we did, but it’s clear that anyone attempting to answer the question of the gender gap in tech would be remiss in not considering the effects of confidence and self-perception in addition to the more salient matter of bias.

In a study investigating the effects of perceived performance on the likelihood of subsequent engagement, Dunning (of Dunning-Kruger fame) and Ehrlinger administered a scientific reasoning test to male and female undergrads and then asked them how they did. Not surprisingly, though there was no difference in performance between genders, women underrated their own performance more often than men. Afterwards, participants were asked whether they’d like to enter a Science Jeopardy contest on campus in which they could win cash prizes. Again, women were significantly less likely to participate, with participation likelihood being directly correlated with self-perception rather than actual performance.3

In a different study, sociologists followed a number of male and female STEM students over the course of their college careers via diary entries authored by the students. One prevailing trend that emerged immediately was the difference between how men and women handled the “discovery of their [place in the] pecking order of talent, an initiation that is typical of socialization across the professions.” For women, realizing that they may no longer be at the top of the class and that there were others who were performing better, “the experience [triggered] a more fundamental doubt about their abilities to master the technical constructs of engineering expertise [than men].”

And of course, what survey of gender difference research would be complete without an allusion to the wretched annals of dating? When I told the interviewing.io team about the disparity in attrition between genders, the resounding response was along the lines of, “Well, yeah. Just think about dating from a man’s perspective.” Indeed, a study published in the Archives of Sexual Behavior confirms that men treat rejection in dating very differently than women, even going so far as to say that men “reported they would experience a more positive than negative affective response after… being sexually rejected.”

Maybe tying coding to sex is a bit tenuous, but, as they say, programming is like sex — one mistake and you have to support it for the rest of your life.

Why I’m not depressed by our results and why you shouldn’t be either

Prior art aside, I would like to leave off on a high note. I mentioned earlier that men are doing a lot better on the platform than women, but here’s the startling thing. Once you factor out interview data from both men and women who quit after one or two bad interviews, the disparity goes away entirely. So while the attrition numbers aren’t great, I’m massively encouraged by the fact that at least in these findings, it’s not about systemic bias against women or women being bad at computers or whatever. Rather, it’s about women being bad at dusting themselves off after failing, which, despite everything, is probably a lot easier to fix.

1Roughly 15% of our users are female. We want way more, but it’s a start.

2If you want to hear more examples of voice modulation or are just generously down to indulge me in some shameless bragging, we got to demo it on NPR and in Fast Company.

3In addition to asking interviewers how interviewees did, we also ask interviewees to rate themselves. After reading the Dunning and Ehrlinger study, we went back and checked to see what role self-perception played in attrition. In our case, the answer is, I’m afraid, TBD, as we’re going to need more self-ratings to say anything conclusive.

Technical interview performance is kind of arbitrary. Here’s the data.

Posted on February 17th, 2016.

Note: Though I wrote most of the words in this post, there are a few people outside of interviewing.io whose work made it possible. Ian Johnson, creator of d3 Building Blocks, created the graph entitled Standard Dev vs. Mean of Interviewee Performance (the one with the icons) as well as all the interactive visualizations that go with it. Dave Holtz did all the stats work for computing the probability of people failing individual interviews. You can see more about his work on his blog.

interviewing.io is a platform where people can practice technical interviewing anonymously and, in the process, find jobs. In the past few months, we’ve amassed data from hundreds of interviews, and when we looked at how the same people performed from interview to interview, we were really surprised to find quite a bit of volatility, which, in turn, made us question the reliability of single interview outcomes.

The setup

When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice1, text chat, and a whiteboard and jump right into a technical question. Interview questions on the platform tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role, and interviewers typically come from a mix of large companies like Google, Facebook, and Yelp, as well as engineering-focused startups like Asana, Mattermark, KeepSafe, and more.

After every interview, interviewers rate interviewees on a few different dimensions, including technical ability. Technical ability gets rated on a scale of 1 to 4, where 1 is “meh” and 4 is “amazing!” (you can see the feedback form here). On our platform, a score of 3 or above has generally meant that the person was good enough to move forward.

At this point, you might say, that’s nice and all, but what’s the big deal? Lots of companies collect this kind of data in the context of their own pipelines. Here’s the thing that makes our data special: the same interviewee can do multiple interviews, each of which is with a different interviewer and/or different company, and this opens the door for some pretty interesting and somewhat controlled comparative analysis.

Performance from interview to interview is pretty volatile

Let’s start with some visuals. In the graph below, every icon represents the mean technical score for an individual interviewee who has done 2 or more interviews on the platform2. The y-axis is standard deviation of performance, so the higher up you go, the more volatile interview performance becomes. If you hover over an icon, you can drill down and see how that person did in each of their interviews. Anytime you see bolded text with a dotted underline, you can hover over it to see relevant data viz. Try it now to expand everyone’s performance. You can also hover over the labels along the x-axis to drill into the performance of people whose means fall into those buckets.

Standard Dev vs. Mean of Interviewee Performance
(299 Interviews w/ 67 Interviewees)

As you can see, roughly 25% of interviewees are consistent in their performance, and the rest are all over the place3. If you look at the graph above, despite the noise, you can probably make some guesses about which people you’d want to interview. However, keep in mind that each icon represents a mean. Let’s pretend that, instead, you had to make a decision based on just one data point. That’s where things get dicey. For instance:

  • Many people who scored at least one 4 also scored at least one 2.
  • If we look at high performers (mean of 3.3 or higher), we still see a fair amount of variation.
  • Things get really murky when we consider “average” performers (mean between 2.6 and 3.3).
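If you want to run the same kind of check on your own interview data, here is a minimal sketch of the two calculations behind the graph and the bullets above: mean and standard deviation per interviewee, plus a filter for people who scored both a 4 and a 2. The scores below are made up for illustration, and using the sample (rather than population) standard deviation is an assumption of this sketch.

```python
from statistics import mean, stdev

# Hypothetical data: interviewee id -> technical scores (1-4), one per interview.
scores_by_interviewee = {
    "a": [4, 4, 4],
    "b": [4, 2, 3, 2],
    "c": [3, 3],
    "d": [2, 4],
    "e": [1],  # dropped below: fewer than 2 interviews
}

def summarize(scores_by_interviewee, min_interviews=2):
    """Return {interviewee: (mean, sample standard deviation)}."""
    summary = {}
    for person, scores in scores_by_interviewee.items():
        if len(scores) < min_interviews:
            continue  # a standard deviation needs at least 2 data points
        summary[person] = (mean(scores), stdev(scores))
    return summary

def mixed_results(scores_by_interviewee):
    """People who scored at least one 4 and at least one 2."""
    return sorted(
        person
        for person, scores in scores_by_interviewee.items()
        if 4 in scores and 2 in scores
    )

summary = summarize(scores_by_interviewee)
for person, (avg, sd) in sorted(summary.items()):
    print(f"{person}: mean={avg:.2f} stdev={sd:.2f}")
print("scored both a 4 and a 2:", mixed_results(scores_by_interviewee))
```

Plotting the first number of each pair on the x-axis and the second on the y-axis gives the same kind of picture as the graph above.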

To me, looking at this data and then pretending that I had to make a hiring decision based on one interview outcome felt a lot like peering into some beautiful, lavishly appointed parlor through a keyhole. Sometimes you see a piece of art on the wall, sometimes you see the liquor selection, and sometimes you just see the back of the couch.

At this point you might say that it’s erroneous and naive to compare raw technical scores to one another for any number of reasons, not the least of which is that one interviewer’s 4 is another interviewer’s 2. We definitely share this concern and address it in the appendix of this post. It does bear mentioning, though, that most of our interviewers are coming from companies with strong engineering brands and that correcting for brand strength didn’t change interviewee performance volatility, nor did correcting for interviewer rating.

So, in a real life situation, when you’re trying to decide whether to advance someone to onsite, you’re probably trying to avoid two things: false positives (bringing in people below your bar by mistake) and false negatives (rejecting people who should have made it in). Most top companies’ interviewing paradigm is that false negatives are less bad than false positives. This makes sense, right? With a big enough pipeline and enough resources, even with a high false negative rate, you’ll still get the people you want. With a high false positive rate, you might get cheaper hiring, but you do potentially irreversible damage to your product, culture, and future hiring standards in the process. And of course, the companies setting the hiring standards and practices for an entire industry ARE the ones with the big pipelines and seemingly inexhaustible resources.

The dark side of optimizing for high false negative rates, though, rears its head in the form of our current engineering hiring crisis. Do single interview instances, in their current incarnation, give enough signal? Or amidst so much demand for talent, are we turning away qualified people because we’re all looking at a large, volatile graph through a tiny keyhole?

So, hyperbolic moralizing aside, given how volatile interview performance is, what are the odds that a good candidate will fail an individual phone screen?

Odds of failing a single interview based on past performance

Below, you can see the distribution of mean performance throughout our population of interviewees.

In order to figure out the probability that a candidate with a given mean score would fail an interview, we had to do some stats work. First, we broke interviewees up into cohorts based on their mean scores (rounded to the nearest 0.25). Then, for each cohort, we calculated the probability of failing, i.e. of getting a score of 2 or less. Finally, to work around our starting data set not being huge, we resampled our data. In our resampling procedure, we treated an interview outcome as a multinomial distribution, or in other words, pretended that each interview was a roll of a weighted, 4-sided die corresponding to that candidate’s cohort. We then re-rolled the dice a bunch of times to create a new, “simulated” dataset for each cohort and calculated new probabilities of failure for each cohort using these data sets. Below, you can see the results of repeating this process 10,000 times.
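For the curious, the procedure just described can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact code behind our numbers: the scores are made up, the function names are hypothetical, each simulated dataset is assumed to have as many rolls as the cohort has real interviews, and the 95% range is taken as a simple percentile interval.

```python
import random
from collections import Counter, defaultdict
from statistics import mean

FAIL_THRESHOLD = 2  # a score of 2 or less counts as failing

def cohort_of(mean_score):
    """Round a mean score to the nearest 0.25."""
    return round(mean_score * 4) / 4

def pool_by_cohort(scores_by_interviewee):
    """Group every interview score under its interviewee's cohort."""
    pooled = defaultdict(list)
    for scores in scores_by_interviewee.values():
        pooled[cohort_of(mean(scores))].extend(scores)
    return dict(pooled)

def failure_probability_samples(scores, n_sims=10_000, seed=0):
    """Re-roll a weighted 4-sided die built from one cohort's scores.

    Returns one simulated failure probability per simulated dataset.
    """
    rng = random.Random(seed)
    counts = Counter(scores)
    faces = [1, 2, 3, 4]
    weights = [counts.get(face, 0) for face in faces]
    n = len(scores)  # assumption: simulated datasets match the cohort's size
    samples = []
    for _ in range(n_sims):
        rolls = rng.choices(faces, weights=weights, k=n)
        samples.append(sum(1 for r in rolls if r <= FAIL_THRESHOLD) / n)
    return samples

def percentile_interval(samples, lower=0.025, upper=0.975):
    """Approximate 95% interval from the simulated failure probabilities."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]

# Hypothetical data: interviewee id -> technical scores (1-4).
scores_by_interviewee = {
    "a": [3, 3, 2, 4],
    "b": [3, 4, 3, 2],
    "c": [2, 2, 3],
    "d": [4, 4, 3, 4],
}

for cohort, scores in sorted(pool_by_cohort(scores_by_interviewee).items()):
    samples = failure_probability_samples(scores)
    low, high = percentile_interval(samples)
    print(f"cohort {cohort:.2f}: 95% of simulated failure rates in [{low:.2f}, {high:.2f}]")
```

With cohorts this small the intervals come out very wide, which is the same small-data problem the resampling is meant to make visible.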

As you can see, a lot of the distributions above overlap with one another. This is important because these overlaps tell us that there may not be statistically significant differences between those groups (e.g. between 2.75 and 3). Certainly, with a LOT more data, the delineations between cohorts may become clearer. On the other hand, if we do need a huge amount of data to detect differences in failure rate, it might suggest that people are intrinsically highly variable in their performance. At the end of the day, while we can confidently say that there is a significant difference between the bottom end of the spectrum (2.25) versus the top end (3.75), for people in the middle, things are murky.

Nevertheless, using these distributions, we did attempt to compute the probability that a candidate with a certain mean score would fail a single interview (see below — the shaded areas encapsulate a 95% confidence interval). The fact that people who are overall pretty strong (e.g. mean ~= 3) can mess up technical interviews as much as 22% of the time shows that there’s definitely room for improvement in the process, and this is further exacerbated by the general murkiness in the middle of the spectrum.

Is interviewing doomed?

Generally, when we think of interviewing, we think of something that ought to have repeatable results and carry a strong signal. However, the data we’ve collected, meager though it might be, tells a different story. And it resonates with both my anecdotal experience as a recruiter and with the sentiments we’ve seen echoed in the community. Zach Holman’s Startup Interviewing is Fucked hits on the disconnect between interview process and the job it’s meant to fill, the fine gentlemen of TripleByte reached similar conclusions by looking at their own data, and one of the more poignant expressions of inconsistent interviewing results recently came from rejected.us.

You can bet that many people who are rejected after a phone screen by Company A but do better during a different phone screen and ultimately end up somewhere traditionally reputable are getting hit up by Company A’s recruiters 6 months later. And despite everyone’s best efforts, the murky, volatile, and ultimately stochastic circle jerk of a recruitment process marches on.

So yes, one possible conclusion is certainly that technical interviewing itself is indeed fucked and doesn’t provide a reliable, deterministic signal for one interview instance. Algorithmic interviews are a hotly debated topic and one we’re deeply interested in teasing apart. One thing in particular we’re very excited about is tracking interview performance as a function of interview type, as we get more and more different interviewing types/approaches happening on the platform. Indeed, one of our long-term goals is to really dig into our data, look at the landscape of different interview styles, and make some serious data-driven statements about what types of technical interviews lead to the highest signal.

In the meantime, however, I am leaning toward the idea that drawing on aggregate performance is much more meaningful than making such an important decision based on one single, arbitrary interview. Not only can aggregate performance help correct for an uncharacteristically poor performance, but it can also weed out people who eventually do well in an interview by chance or those who, over time, submit to the beast and memorize Cracking the Coding Interview. I know it’s not always practical or possible to gather aggregate performance data in the wild, but at the very least, in cases where a candidate’s performance is borderline or where their performance differs wildly from what you’d expect, it might make sense to interview them one more time, perhaps focusing on slightly different material, before making the final decision.

Appendix: The part where we tentatively justify using raw scores for comparative performance analysis

For the skeptical, inquiring minds among you who realize that using raw coding scores to evaluate an interviewee has some pretty obvious problems, we’ve included this section. The issue is that even though our interviewers tend to come from companies with high engineering bars, raw scores are still comprised of just one piece of feedback, they don’t adjust for interviewer strictness (e.g. one interviewer’s 4 could be another interviewer’s 2), and they don’t adjust well to changes in skill over time. Internally, we actually use a more complex and comprehensive rating system when determining skill, and if we can show that raw scores align with the ratings we calculate, then we don’t feel so bad about using raw scores comparatively.

Our rating system works something like this:

  1. We create a single score for each interview based on a weighted average of each feedback item.
  2. For each interviewer, we pit all the interviewees they’ve interviewed against one another using this score.
  3. We use a Bayesian ranking system (a modified version of Glicko-2) to generate a rating for each interviewee based on the outcome of these competitions.

As a result, each person is only rated based on their score as it compares to other people who were interviewed by the same interviewer. That means one interviewer’s score is never directly compared to another’s, and so we can correct for the hairy issue of inconsistent interviewer strictness.
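To make the first two steps concrete, here is a rough sketch. The feedback item names and weights are hypothetical (the real ones aren’t listed here), and the Glicko-2 update from step 3 is left out entirely. The sketch only produces the head-to-head outcomes that a rating system of that kind would consume.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical feedback items and integer weights, for illustration only.
WEIGHTS = {"technical": 5, "problem_solving": 3, "communication": 2}

def interview_score(feedback):
    """Step 1: weighted average of the feedback items (each rated 1-4)."""
    total = sum(WEIGHTS[item] * feedback[item] for item in WEIGHTS)
    return total / sum(WEIGHTS.values())

def matchups(interviews):
    """Step 2: compare interviewees only against others seen by the same interviewer.

    Returns (a, b, outcome) tuples where outcome is 1.0 if a outscored b,
    0.0 if b outscored a, and 0.5 for a tie.
    """
    by_interviewer = defaultdict(list)
    for row in interviews:
        score = interview_score(row["feedback"])
        by_interviewer[row["interviewer"]].append((row["interviewee"], score))

    results = []
    for rated in by_interviewer.values():
        for (a, score_a), (b, score_b) in combinations(rated, 2):
            if a == b:
                continue  # same person seen twice by one interviewer
            outcome = 1.0 if score_a > score_b else 0.0 if score_a < score_b else 0.5
            results.append((a, b, outcome))
    return results

# Interviewer "x" is strict and "y" is lenient, but their scores never meet.
interviews = [
    {"interviewer": "x", "interviewee": "a",
     "feedback": {"technical": 2, "problem_solving": 2, "communication": 2}},
    {"interviewer": "x", "interviewee": "b",
     "feedback": {"technical": 3, "problem_solving": 2, "communication": 2}},
    {"interviewer": "y", "interviewee": "c",
     "feedback": {"technical": 4, "problem_solving": 4, "communication": 4}},
    {"interviewer": "y", "interviewee": "d",
     "feedback": {"technical": 4, "problem_solving": 3, "communication": 4}},
]

for a, b, outcome in matchups(interviews):
    print(a, "vs", b, "->", outcome)
```

Note that interviewee b, with a 2.5 from the strict interviewer, is never compared against d, who got a 3.7 from the lenient one. Only within-interviewer comparisons feed the ratings.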

So, why am I bringing this up at all? You’re all smart people, and you can tell when someone is waving their hands around and pretending to do math. Before we did all this analysis, we wanted to make sure that we believed our own data. We’ve done a lot of work to build a ratings system we believe in, so we correlated that with raw coding scores to see how strong they are at determining actual skill.

These results are pretty strong. Not strong enough for us to rely on raw scores exclusively but strong enough to believe that raw scores are useful for determining approximate candidate strength.
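If you want to try a comparison like this on your own data, a correlation check can be as simple as the sketch below. It assumes a plain Pearson correlation, which may not be exactly what we ran, and the numbers are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: each interviewee's mean raw score and computed rating.
raw_means = [2.0, 2.5, 3.0, 3.5, 4.0]
ratings = [1380, 1450, 1490, 1600, 1640]

print(f"correlation: {pearson(raw_means, ratings):.2f}")
```

A value close to 1 would mean raw scores and ratings rank people in nearly the same order.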

1While listening to interviews day in and day out, I came up with a drinking game. Every time someone thinks the answer is hash table, take a drink. And every time the answer actually is hash table, take two drinks.4

2This is data as of January 2016, and there are only 299 interviews because not all interviews have enough feedback data and because we threw out everyone with fewer than 2 interviews. Moreover, one thing we don’t show in this graph is the passage of time, so you can see people’s performance over time (it’s kind of a hot mess).

3We were curious to see if volatility varied at all with people’s mean scores. In other words, were weaker players more volatile than strong ones? The answer is no — when we ran a regression on standard deviation vs. mean, we couldn’t come up with any meaningful relationship (R-squared ~= 0.03), which means that people are all over the place regardless of how strong they are on average.

4I almost died.

Thanks to Andrew Marsh for co-authoring the appendix, to Plotly for making a terrific graphing product, and to everyone who read drafts of this behemoth.