Protocols and Data Structures for Knowledge Discovery on Distributed Private Databases
Primary Supervisor
Estivill-Castro, Vladimir
Other Supervisors
Topor, Rodney
Abstract
Data mining has developed many techniques for the automatic analysis of today’s rapidly collected data. Yahoo collects 12 TB of query logs daily, and this is a quarter of what Google collects. For many important problems, the data is collected in distributed form by different institutions and organisations, and it can relate to businesses and individuals. The accuracy of the knowledge that data mining brings to decision making depends on considering the collective datasets that describe a phenomenon. But privacy, confidentiality and trust emerge as major issues when partitioned datasets are analysed among competitors, governments and other data holders that have conflicts of interest. Managing privacy is of the utmost importance in the emergent applications of data mining. For example, data mining has been identified as one of the most useful tools for the global collective fight on terror and crime [80]. Parties holding partitions of the database are very interested in the results, but may not trust the others with their data, or may be reluctant to release their data freely without some assurances regarding privacy. Data mining technology that reveals patterns in large databases could compromise information that an individual or an organisation regards as private. The aim is to find the right balance between maximising the analysis results (which are useful for each party) and keeping the inferences that disclose private information about organisations or individuals to a minimum. We address two core data analysis tasks, namely clustering and regression. For these to be solvable in the privacy context, we focus on the efficiency and practicality of the protocols. Because associative queries are central to clustering (and to many other data mining tasks), we provide protocols for privacy-preserving k-nearest neighbour (k-NN) queries.
Our methods improve on previous methods for k-NN queries in privacy-preserving data mining (which are based on Fagin’s A0 algorithm) because we leak at least an order of magnitude fewer candidates and we achieve logarithmic performance on average. Our methods for k-NN queries rest on two pillars: data structures and metrics. This thesis provides protocols for the privacy-preserving computation of various common metrics and for the construction of the necessary data structures. We present new algorithms for the secure multiparty computation of some basic operations (such as a new solution to Yao’s comparison problem and new protocols for linear algebra, in particular the scalar product). These algorithms are used to construct protocols for different metrics (we provide protocols for all Minkowski metrics, the cosine metric and the chessboard metric) and to perform associative queries in the privacy context. To be efficient, our protocols for associative queries are supported by specific data structures. Thus, we present the construction of privacy-preserving data structures such as R-Trees [42, 7], KD-Trees [8, 53, 33] and the SASH [8, 60]. We demonstrate the use of all these tools, and we provide a new version of the well-known clustering algorithm DBSCAN [42, 7] that is suitable for applications that demand privacy. Similarly, we apply our machinery to provide new multi-linear regression protocols that are suitable for privacy applications. Our algorithms are more efficient than earlier methods and protocols. In particular, ensuring privacy adds only a linear-cost overhead for most of the protocols presented here. That is, our methods are essentially as costly as concentrating all the data at one site, performing the data-mining task, and disregarding privacy. However, in some cases we make use of a trusted third party.
This is not a problem when more than two parties are involved, since there is always one party that can act as the third.
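As an illustration of how a third party can assist a two-party scalar product, the following is a minimal sketch of a well-known helper-assisted construction (in the style of commodity-server protocols). It is not the protocol developed in the thesis. The names `helper_randomness` and `shared_scalar_product`, the modulus `P`, and the single-process simulation of the three parties are all assumptions made for this example. The helper supplies only correlated randomness and never sees either input; the two data holders end up with random additive shares of the scalar product.

```python
import secrets

# Assumed prime modulus for the example; all arithmetic is done modulo P.
P = 2**61 - 1


def dot(u, v):
    """Scalar product of two equal-length integer vectors, modulo P."""
    return sum(a * b for a, b in zip(u, v)) % P


def helper_randomness(n):
    """Hypothetical third party: hands out correlated randomness only.

    Returns (Ra, ra) for Alice and (Rb, rb) for Bob such that
    ra + rb == Ra . Rb (mod P). The helper never sees x or y.
    """
    Ra = [secrets.randbelow(P) for _ in range(n)]
    Rb = [secrets.randbelow(P) for _ in range(n)]
    ra = secrets.randbelow(P)
    rb = (dot(Ra, Rb) - ra) % P
    return (Ra, ra), (Rb, rb)


def shared_scalar_product(x, y):
    """Simulate the message flow; Alice holds x, Bob holds y.

    Returns (v_alice, v_bob) with v_alice + v_bob == x . y (mod P).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    (Ra, ra), (Rb, rb) = helper_randomness(len(x))

    # Alice -> Bob: x masked with her random vector Ra.
    x_masked = [(xi + r) % P for xi, r in zip(x, Ra)]
    # Bob -> Alice: y masked with his random vector Rb.
    y_masked = [(yi + r) % P for yi, r in zip(y, Rb)]

    # Bob picks his output share at random and sends t to Alice.
    v_bob = secrets.randbelow(P)
    t = (dot(x_masked, y) + rb - v_bob) % P

    # Alice removes the masking terms; what remains is x . y - v_bob.
    v_alice = (t - dot(Ra, y_masked) + ra) % P
    return v_alice, v_bob


if __name__ == "__main__":
    a, b = shared_scalar_product([1, 2, 3], [4, 5, 6])
    print((a + b) % P)  # the shares recombine to the scalar product mod P
```

Each party sees only vectors masked by uniformly random values, so neither share reveals the other party's input on its own. The sketch assumes semi-honest parties and a helper that does not collude with either of them.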
Thesis Type
Thesis (PhD Doctorate)
Degree Program
Doctor of Philosophy (PhD)
School
School of Information and Communication Technology
Rights Statement
The author owns the copyright in this thesis, unless stated otherwise.
Item Access Status
Public
Subject
Data mining
Data analysis tasks
Secure-multiparty-computation
Computer protocols
Knowledge discovery