Publication | Open Access
De-anonymizing private data by matching statistics
Citations: 27
References: 10
Year: 2013
Unknown Venue
Privacy Protection, Engineering, Information Security, Information Forensics, Pseudonymization, Summary Statistics, Anonymized Histograms, Data Science, Data Anonymization, Data Management, Statistics, De-anonymizing Private Data, Data Privacy, Data Re-identification, Computer Science, Privacy Anonymity, Mobile User, Privacy, Data Security, Cryptography, Statistical Inference
Recent research has illustrated privacy breaches that can be effected on an anonymized dataset by an attacker who has access to auxiliary information about the users. Most of these attack strategies rely on the uniqueness of specific aspects of the users' data - e.g., observing a mobile user at just a few points in the time-location space is sufficient to uniquely identify him/her from an anonymized set of users. In this work, we consider de-anonymization attacks on anonymized summary statistics in the form of histograms. Such summary statistics are useful for many applications that do not need knowledge about exact user behavior. We consider an attacker who has access to an anonymized set of histograms of K users' data and an independent set of data belonging to the same users. Modeling the users' data as i.i.d., we study the composite hypothesis testing problem of identifying the correct matching between the anonymized histograms from the first set and the user data from the second. We propose a Generalized Likelihood Ratio Test as a solution to this problem and show that the solution can be identified using a minimum weight matching algorithm on a K × K complete bipartite weighted graph. We show that a variant of this solution is asymptotically optimal as the data lengths are increased. We apply the algorithm to mobility traces of over 1000 users on the EPFL campus collected during two weeks and show that up to 70% of the users can be correctly matched. These results show that anonymized summary statistics of mobility traces themselves contain a significant amount of information that can be used to uniquely identify users by an attacker who has access to auxiliary information about the statistics.
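The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the edge weights, so the sketch assumes, purely for illustration, that the weight between anonymized histogram i and auxiliary histogram j is the Kullback-Leibler divergence between their smoothed empirical distributions. The names `normalize`, `kl_divergence`, and `match_histograms` are hypothetical, and the matching is found by brute force over all permutations, which is only practical for very small K.

```python
import math
from itertools import permutations


def normalize(counts, eps=1e-9):
    """Turn raw counts into a probability vector, smoothed to avoid log(0)."""
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def match_histograms(anonymized, auxiliary):
    """Return perm where anonymized[i] is matched to auxiliary[perm[i]].

    ASSUMPTION: edge weight = KL divergence between the two empirical
    distributions; the abstract does not specify the weight it uses.
    Brute force over all K! permutations, so suitable only for small K.
    """
    k = len(anonymized)
    anon = [normalize(h) for h in anonymized]
    aux = [normalize(h) for h in auxiliary]
    # K x K cost matrix of the complete bipartite graph.
    cost = [[kl_divergence(anon[i], aux[j]) for j in range(k)] for i in range(k)]
    best = min(
        permutations(range(k)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(k)),
    )
    return list(best)


# Toy data: three users, three location bins, auxiliary set in shuffled order.
anonymized = [[8, 1, 1], [1, 8, 1], [1, 1, 8]]
auxiliary = [[1, 7, 2], [2, 1, 7], [7, 2, 1]]
print(match_histograms(anonymized, auxiliary))  # [2, 0, 1]
```

For realistic values of K (for example, the roughly 1000 users mentioned above), the brute-force search would need to be replaced by a polynomial-time assignment solver such as the Hungarian algorithm; `scipy.optimize.linear_sum_assignment` is one readily available option that accepts the same cost matrix.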