A Guide to “Big Data’s Disparate Impact”
November 19, 2014
What does it mean that data mining can reproduce patterns of unfairness? What counts as unlawful discrimination when it comes to an algorithm? What guidance do our laws provide?
These are the questions tackled by Big Data’s Disparate Impact, a recent paper by Solon Barocas and Andrew D. Selbst. The paper begins with the technical fundamentals of data mining, and uses this foundation to analyze the policy questions that lie beyond. It concludes with a clear and unsettling message: “to a large degree, existing law cannot handle these problems.”
I’ve been wanting to preview this piece for Equal Future for some time. Hopefully, this whirlwind tour—greatly simplified for brevity—inspires you to explore the full paper.
What is data mining?
Data mining is the practice of sifting through data for useful patterns and relationships. These patterns, once discovered, are usually incorporated into a “model” that can be used to predict future outcomes.
For example: By analyzing thousands of employees’ job histories, a computer might discover a close correlation between the distance of an individual’s commute and that individual’s tenure at their job. This insight might be incorporated into a job recruiting model and used by an employer to evaluate its applicants.
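To make the idea of a “model” a little more concrete, here is a minimal sketch in Python. The records are made up for illustration and the helper names (`pearson`, `fit_line`, `predict_tenure`) are hypothetical; the paper does not describe any particular algorithm, so the simple straight-line fit below is only an assumption about what such a model could look like.

```python
# Hypothetical, made-up records: (commute distance in miles, tenure in years).
records = [(2, 6.0), (5, 5.5), (8, 4.0), (12, 3.5), (20, 2.0), (30, 1.0)]


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


commutes = [c for c, _ in records]
tenures = [t for _, t in records]

# Step 1: discover the pattern (how closely commute and tenure move together).
r = pearson(commutes, tenures)

# Step 2: fold the pattern into a "model" that scores new applicants.
slope, intercept = fit_line(commutes, tenures)


def predict_tenure(commute_miles):
    return intercept + slope * commute_miles


print(f"correlation: {r:.2f}")
print(f"predicted tenure at 3 miles: {predict_tenure(3):.1f} years")
print(f"predicted tenure at 25 miles: {predict_tenure(25):.1f} years")
```

In this toy data, longer commutes go with shorter tenure, so the model scores an applicant who lives far away lower, whatever the reason for the distance.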
How can data mining “go wrong”?
Although data mining conjures up images of sophisticated computers, Barocas and Selbst explain that humans guide data mining processes at many steps along the way. Each of these steps is a chance for something to go awry:
- People must translate real life questions into problems that a computer can understand. A data mining algorithm must be told what it’s looking for. This process, called determining “target variables,” can have a big impact on how a data mining process performs.
For example: What counts as a “good employee”? Different people will define “good” in different ways: someone who hits high sales goals, someone who has a spotless discipline record, or someone who stays in the job for many years. Whichever definition is chosen will frame the entire data mining process.
- People must give a computer data to learn from. If this “training data” is incomplete, inaccurate, or itself biased, then the results of the data mining process are likely to be flawed.
For example: Sorting medical school applicants based on prior admissions decisions can be a prejudiced process if these prior admissions decisions were infected with racial discrimination.
- People must decide what kind of information a computer should pay attention to. This process, called “feature selection,” determines which parts of the training data are relevant to consider. It may also be difficult to obtain data that is “sufficiently rich” to permit precise distinctions.
For example: Hiring decisions that significantly weigh the reputation of an applicant’s college or university might exclude “equally competent members of protected classes” if those members “happen to graduate from these colleges or universities at disproportionately low rates.” There may be more precise and accurate features available.
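The first of the steps above, choosing a target variable, can be sketched in a few lines of Python. Everything here is hypothetical: the three employees, their numbers, and the three scoring rules are invented to mirror the “good employee” example, not taken from the paper.

```python
# Hypothetical employees; every value here is made up for illustration.
employees = [
    {"name": "A", "annual_sales": 900, "discipline_incidents": 3, "years_in_job": 2},
    {"name": "B", "annual_sales": 600, "discipline_incidents": 0, "years_in_job": 4},
    {"name": "C", "annual_sales": 500, "discipline_incidents": 1, "years_in_job": 9},
]

# Three candidate target variables, each a different definition of "good".
# A higher score means "better" under that definition.
target_definitions = {
    "high sales": lambda e: e["annual_sales"],
    "spotless record": lambda e: -e["discipline_incidents"],
    "long tenure": lambda e: e["years_in_job"],
}

# Find who counts as the "best" employee under each definition.
best_by_definition = {
    label: max(employees, key=score)["name"]
    for label, score in target_definitions.items()
}

for label, name in best_by_definition.items():
    print(f"{label}: employee {name}")
```

Each definition crowns a different employee, so a model trained to find more people like the “best” one will learn something different depending on which definition was picked before any data was mined.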
How can seemingly benign data serve as a “proxy” for more sensitive data?
It can be hard to “hide” demographic traits from a data mining process, especially when those traits are strongly tied to the ultimate purpose of the data mining. Here, it’s easiest to skip straight to an example.
For example: Imagine you are trying to build a model that predicts a person’s height. Men tend to be taller than women. Your training data doesn’t contain information about individuals’ sex, but it does include information about individuals’ occupations. A data mining process might learn that preschool teachers (more than 97% of whom are women) tend to be shorter than construction workers (more than 90% of whom are men). This insight simply reflects the fact that each profession is a reliable “proxy” for sex, which is itself correlated with height.
In other words, given a big enough dataset, a data mining process will “determine the extent to which membership in a protected class is relevant to the sought-after trait whether or not [protected class membership] is an input.” There are ways to test for such proxying, but doing so can be a difficult and involved process.
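The height example can be reproduced with a small synthetic dataset. This is only a sketch: the counts and heights below are invented and only roughly echo the occupational skew described above, and the `proxy_accuracy` check is one simple, assumed way to look for proxying, not a method described in the paper.

```python
from collections import Counter, defaultdict

# Synthetic people. Sex is kept only so we can audit the result afterwards;
# the "model" below never sees it. All counts and heights are made up.
people = (
    [{"occupation": "preschool teacher", "sex": "F", "height_cm": 163}] * 97
    + [{"occupation": "preschool teacher", "sex": "M", "height_cm": 177}] * 3
    + [{"occupation": "construction worker", "sex": "F", "height_cm": 163}] * 10
    + [{"occupation": "construction worker", "sex": "M", "height_cm": 177}] * 90
)


def mean_height_by(key):
    """A very simple 'model': predict the average height within each group."""
    totals = defaultdict(lambda: [0.0, 0])
    for p in people:
        totals[p[key]][0] += p["height_cm"]
        totals[p[key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def proxy_accuracy(feature, sensitive):
    """How often the sensitive trait can be guessed from the feature alone,
    by always guessing the most common sensitive value within each group."""
    counts = defaultdict(Counter)
    for p in people:
        counts[p[feature]][p[sensitive]] += 1
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(people)


# The model only uses occupation, yet its predictions split along sex lines.
height_by_occupation = mean_height_by("occupation")

# Audit: compare guessing sex from occupation against guessing blindly.
baseline = max(Counter(p["sex"] for p in people).values()) / len(people)
accuracy = proxy_accuracy("occupation", "sex")

print(height_by_occupation)
print(f"guessing sex blindly: {baseline:.1%}, from occupation: {accuracy:.1%}")
```

In this toy data, occupation alone recovers sex far better than a blind guess, which is exactly what makes it a proxy. Real datasets have many features that interact, which is part of why testing for proxies is hard in practice.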
How might data mining be harmful even when everything “goes right”?
A perfectly designed data mining process can accurately expose existing unfairness in society. For example, mainstream credit scores are not equally distributed across racial groups. However, there is strong evidence that they are predictive of loan repayment (and not because they proxy for race). These scores reflect the fact that some minorities face a range of unique obstacles that make it more difficult for them to repay loans.
What does the law have to say when data mining has a disparate impact?
The authors conclude that the law is “largely ill equipped to address the discrimination that results from data mining.” They focus their analysis on Title VII—which prohibits employment discrimination based on race, color, religion, sex and national origin—because it expressly allows for “disparate impact” liability when a neutral practice has unfair results. Put simply, Title VII tries to put an end to historical discriminatory trends, while still permitting employers a reasonable amount of discretion in hiring. The crux of the law is the “business necessity” defense, which allows employment practices that are legitimate, useful, and accurate—even if those practices do have a disparate impact.
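As a rough illustration of what a disparate impact looks like in numbers, the sketch below compares hiring rates across two groups. The groups and counts are hypothetical, and the ratio of selection rates is just one simple descriptive measure chosen here as an assumption; neither this piece nor the summary of the paper above prescribes a formula.

```python
# Hypothetical hiring outcomes per group: (applicants, hires). Made-up numbers.
outcomes = {"group_a": (200, 60), "group_b": (150, 27)}


def selection_rates(outcomes):
    """Share of applicants hired within each group."""
    return {
        group: hires / applicants
        for group, (applicants, hires) in outcomes.items()
    }


def impact_ratio(outcomes):
    """Lowest selection rate divided by the highest (1.0 means equal rates)."""
    rates = selection_rates(outcomes)
    return min(rates.values()) / max(rates.values())


rates = selection_rates(outcomes)
ratio = impact_ratio(outcomes)

for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of applicants hired")
print(f"impact ratio: {ratio:.2f}")
```

A ratio like this can flag a disparity in outcomes.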
But the precise contours of disparate impact law remain murky, making it hard to measure what should be allowed when it comes to data mining. When is a factor too strongly correlated with protected status, making it illegitimate to use? How should we compensate for the fact that collected datasets themselves reflect existing real-world biases? Is restricting an algorithm’s inputs really the best way of making sure that its results will be fair?
No one yet has answers to these important questions. Giving concrete meaning to civil rights protections in the context of computerized decisions will require an ongoing exchange between technologists and the civil rights community.