Monday, June 23, 2014

Data dredging and statistical significance explained, or how to win a meritless discrimination lawsuit


I just came up with the following interview question for data scientists:

You apply for a position at a company and interview with five individuals, each of whom then casts a hire / no-hire vote. You don't get the job and decide to file a discrimination lawsuit against the company. You don't specify which of the five interviewers was being discriminatory, or whether you were being discriminated against based on gender, race, religion, age, or sexual orientation. The judge forces the company to disclose historical hire/no-hire decisions for each of the five interviewers and declares that if you can show that one of the interviewers is significantly prejudiced (at p = 0.05) with respect to one of the five characteristics, then they will rule in your favor. Assuming that none of the interviewers are actually biased, what is your chance of winning the case?

Can you guess the answer?

The answer is: approximately 72%. Yep, you are almost three times as likely to win the case as to lose it, even when no actual discrimination is taking place.

This apparent paradox is a typical example of data dredging, and it is what people mean when they warn about the perils of big data, or even the downfall of science. The problem is not intrinsic to big data or science; it stems from what most problems tend to stem from: people doing things without fully understanding what they are doing.

So I thought I would take this chance to explain some of the key concepts behind statistical significance and data dredging using the discrimination lawsuit paradox as an illustration.

Statistical significance

Let's say you have historical interview decision data for interviewer X and it looks like this:

Candidate   Gender   X's decision
1           female   hire
2           female   no hire
3           male     hire
4           female   hire
5           male     no hire
6           male     no hire
7           female   hire
8           female   no hire
9           female   no hire
10          female   hire
11          male     no hire
12          male     no hire
13          female   no hire
14          male     no hire
15          female   hire
16          male     no hire
17          male     hire
18          male     hire
19          male     no hire
20          male     no hire
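
If it helps to see the numbers in code, here is the same table encoded as a plain Python list, along with the per-gender hire rates it implies. Nothing below is new data, just the 20 rows above:

    # The interview table above, encoded so the hire rates can be checked directly.
    decisions = [
        ("female", "hire"), ("female", "no hire"), ("male", "hire"), ("female", "hire"),
        ("male", "no hire"), ("male", "no hire"), ("female", "hire"), ("female", "no hire"),
        ("female", "no hire"), ("female", "hire"), ("male", "no hire"), ("male", "no hire"),
        ("female", "no hire"), ("male", "no hire"), ("female", "hire"), ("male", "no hire"),
        ("male", "hire"), ("male", "hire"), ("male", "no hire"), ("male", "no hire"),
    ]

    for gender in ("male", "female"):
        total = sum(1 for g, _ in decisions if g == gender)
        hires = sum(1 for g, d in decisions if g == gender and d == "hire")
        print(f"{gender}: {hires}/{total} hired ({hires / total:.0%})")
    # male: 3/11 hired (27%), female: 5/9 hired (56%)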

The question is: is X discriminating based on gender, or in other words, does X have a bias towards men or women?

We can see that he (or she?) voted hire for 3 out of 11 men (27%) and 5 out of 9 women (56%) — more than double the rate. So it looks like he may have some bias against men. But does this really mean that he is discriminating, or did he just happen to come across some really qualified female candidates and some really unqualified male ones?

This is the question which statistical significance attempts to answer, but it is important to realize that no mathematical system can actually answer this question conclusively, so we go after "the next best thing" and instead try to answer a related question:

"Let's give X the benefit of the doubt and assume that he is not discriminating (the 'null hypothesis'). What are the odds that he would be so unlucky with his male candidates to end up with this amount (or more) of skew towards women?"

To answer this question, let's assume that X was completely blind to gender when he cast his eight hire votes and that gender was randomly assigned to the candidates after the fact. If that were the case, the chance of men getting x of the hire votes and women getting 8 - x would be:
    C(11, x) * C(9, 8 - x) / C(20, 8)

where C(n, k) denotes the binomial coefficient "n choose k". This gives the following distribution:
Hires for men    Hires for women    Probability
0                8                  0.007%
1                7                  0.314%
2                6                  3.668%
3                5                  16.504%
4                4                  33.008%
5                3                  30.807%
6                2                  13.203%
7                1                  2.358%
8                0                  0.131%

The p-value is simply the sum of the probabilities in the row in question (here 16.504%) and all the rows which are even more "extreme" (here, even more biased against men). In this case the p-value is around 20%, or 0.2, which is usually not considered "statistically significant". The lower the p-value, the more statistically significant the result; typical thresholds for statistical significance are 0.1, 0.05, and 0.01. So if there had been only two men among the eight that X voted hire for, the bias would have been considered significant, with a p-value of around 0.04. But since he (or she) voted hire for three men, the bias should not be considered statistically significant.
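
If you want to check these numbers yourself, here is a minimal sketch (plain Python, standard library only) that reproduces the probability table and the p-value above:

    # Probabilities under the null hypothesis: 11 men, 9 women, 8 hire votes,
    # with the votes assigned independently of gender.
    from math import comb

    MEN, WOMEN, HIRES = 11, 9, 8
    TOTAL = MEN + WOMEN

    def prob_men_hired(x):
        """P(exactly x of the 8 hire votes go to men) under the null hypothesis."""
        return comb(MEN, x) * comb(WOMEN, HIRES - x) / comb(TOTAL, HIRES)

    for x in range(HIRES + 1):
        print(f"{x} men hired: {prob_men_hired(x):.3%}")

    # One-sided p-value: 3 or fewer men among the hires
    # (the observed result or anything more skewed against men).
    p_value = sum(prob_men_hired(x) for x in range(0, 4))
    print(f"p-value = {p_value:.3f}")   # ≈ 0.205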

A common misconception at this point is to assume that this means the probability that X is biased is 1 - p = 80%. But that's simply not how probability works, which we can easily see by applying Bayes' rule:

P(X is biased | this or more extreme result)
     = P(NOT null hypothesis | this or more extreme result)
     = 1 - P(null hypothesis | this or more extreme result)
     = 1 - P(this or more extreme result | null hypothesis) * P(null hypothesis) / P(this or more extreme result)
     = 1 - p * P(null hypothesis) / P(this or more extreme result)
     ≠ 1 - p

This misconception keeps getting repeated, leading to a lot of confusion, but the truth is that you simply can't compute the actual probability that X is biased, unless you have access to all the different parallel universes, in some of which X is biased and in others he isn't.
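
To get a feel for how far off 1 - p can be, here is a toy calculation. The prior P(null hypothesis) and the probability of such a result given bias are made-up numbers, purely for illustration:

    # Toy numbers, purely for illustration: assume 10% of interviewers are biased
    # and that a biased one produces a result this extreme (or more) half the time.
    p = 0.205                     # P(this or more extreme result | null), from above
    p_null = 0.90                 # assumed prior P(null hypothesis)
    p_result_given_biased = 0.50  # assumed P(this or more extreme result | biased)

    p_result = p * p_null + p_result_given_biased * (1 - p_null)
    print(1 - p * p_null / p_result)   # ≈ 0.21, nothing like 1 - p ≈ 0.80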

Note: In real life we would also have to account for the overall rate of hire vs. no-hire decisions for the two genders, which may not actually be equal for reasons unrelated to discrimination. For example, girls may tend to choose different colleges than boys, and some colleges may be better than others at preparing candidates for the job in question. But this would make the math more complicated, so for the purpose of this example, let's assume that the overall rates are the same for men and women in the general population.

Data dredging

Coming back to the original interview question: we have five different interviewers, and each of the five interviewers can be prejudiced with respect to five different possible factors (gender, race, religion, age, and sexual orientation). This makes 25 possible combinations of interviewer and prejudice. For each of these 25 combinations we can compute the p-value just like we did in the previous section. Assuming that none of the interviewers is actually biased, the probability of any one of these results showing statistically significant bias with a p-value threshold of 0.05 is 0.05: that's just the definition of the p-value. But this means that the probability of at least one of the 25 showing statistically significant bias is 1 - (1 - 0.05)^25 ≈ 0.72.
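
For completeness, the arithmetic in one short Python snippet:

    # 25 independent tests, each with a 5% chance of a false positive under the null.
    interviewers, attributes, alpha = 5, 5, 0.05
    tests = interviewers * attributes
    print(1 - (1 - alpha) ** tests)   # ≈ 0.72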

In general, the more hypotheses you test in the hope of uncovering a "statistically significant" one, the more likely it is that at least one of them will show significance due to pure chance, even if no underlying pattern exists (a false positive). This chance gets lower if you lower the p-value threshold, but no matter how low the threshold is, given enough possible hypotheses, the probability of at least one false positive quickly approaches 1 (see graph below).
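
Here is a small sketch of how such a graph can be produced (assuming numpy and matplotlib are available):

    # Probability of at least one false positive as a function of the number of
    # hypotheses tested, for a few common p-value thresholds.
    import numpy as np
    import matplotlib.pyplot as plt

    n = np.arange(1, 201)   # number of hypotheses tested
    for alpha in (0.01, 0.05, 0.1):
        plt.plot(n, 1 - (1 - alpha) ** n, label=f"p-value threshold = {alpha}")

    plt.xlabel("Number of hypotheses tested")
    plt.ylabel("Probability of at least one false positive")
    plt.legend()
    plt.show()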

[Graph: probability of at least one false positive vs. number of hypotheses tested, for several p-value thresholds]
Conclusion

This shows that you should always take every statistically significant result with a grain of salt and ask yourself: how many hypotheses were tested before this seemingly significant result was found? You can see many examples of this in medicine (testing numerous drugs to see what works, or testing multiple combinations of genes and conditions to find a correlation), as well as software development (A/B testing multiple features and adopting only the ones which show "significant" improvement). There is nothing wrong with computing p-values, but taking action without knowing what they mean exactly can quite literally have perilous results.

Having gone through this whole chain of thought, I decided that it probably wasn't a good idea to give this interview question to actual candidates, just in case it gives them any ideas... ;)