Differential Privacy: The Big Tech Solution to Big Data Privacy

Gregory Hong is an IPilogue Writer and a 1L JD candidate at Osgoode Hall Law School.

The AI revolution has brought about significant concerns about the privacy of big data. Thankfully, over the past decade, big tech has found a solution to this problem: differential privacy, which actors have implemented in various ways. The technology is not limited to big tech anymore either; the U.S. government has implemented differential privacy for their 2020 census data. Furthermore, the European Union is contemplating following suit – indicating that policymakers are on board with differential privacy as a standard means of protecting large, tabulated datasets.

What problem does differential privacy aim to solve?

Differential privacy was created to combat the Fundamental Law of Information Recovery, which states that “overly accurate answers to too many questions will destroy privacy in a spectacular way.” For instance, in a striking example,

Latanya Sweeney showed that gender, date of birth, and zip code are sufficient to uniquely identify the vast majority of Americans. By linking these attributes in a supposedly anonymized healthcare database to public voter records, she was able to identify the individual health record of the Governor of Massachusetts.

In 2008, a de-anonymization attack against the Netflix Prize dataset, which at the time contained anonymous movie ratings of 500,000 Netflix subscribers. The attacker compared this to the Internet Movie Database (IMDb) and successfully identified the Netflix records of known users, uncovering information such as their apparent political preferences.

How does one defend against such an attack?

De-anonymization attacks follow the principle that overly accurate answers to too many questions will destroy privacy. Defending a database against too many questions is impractical, thus there must be a method to make answers inaccurate without affecting the data’s utility. Per Microsoft’s AI Lab, this method is achieved by introducing “statistical noise”. The noise (effectively small alterations to the data) is significant enough to protect the individual’s privacy, but small enough that it will not impact the extracted answers’ accuracy.

Why is this relevant to law?

Differential privacy essentially protects an individual’s information by presenting the impression that their information were not used in the analysis at all, which is more likely to comply with legal requirements for privacy protection. Differential privacy also masks individual contributions to ensure that using an individual’s data will not reveal any personally identifiable information, making it impossible to infer any information specific to an individual.

Alabama v. United States Department of Commerce raised (and voluntarily dismissed) legal arguments against differential privacy by alleging that “the defendants’ decision to produce “manipulated” census data to the states for redistricting would result in the delivery of inaccurate data for geographic regions beyond the state's total population in violation of the Census Act”. As the plaintiff voluntarily dismissed the case, we will need to wait to see if this argument is successful in the future. However, it is obvious that the courts find the addition of statistical noise to violate the data’s integrity, which would be a serious problem for differential privacy.