Coupled Clustering Ensemble by Exploring Data Interdependence

Loading...
Thumbnail Image
File version

Accepted Manuscript (AM)

Author(s)
Wang, Can
Chi, Chi-Hung
She, Zhong
Cao, Longbing
Stantic, Bela
Griffith University Author(s)
Primary Supervisor
Other Supervisors
Editor(s)
Date
2018
Size
File type(s)
Location
License
Abstract

Clustering ensembles combine multiple partitions of data into a single clustering solution. It is an effective technique for improving the quality of clustering results. Current clustering ensemble algorithms are usually built on the pairwise agreements between clusterings that focus on the similarity via consensus functions, between data objects that induce similarity measures from partitions and re-cluster objects, and between clusters that collapse groups of clusters into meta-clusters. In most of those models, there is a strong assumption on IIDness (i.e., independent and identical distribution), which states that base clusterings perform independently of one another and all objects are also independent. In the real world, however, objects are generally likely related to each other through features that are either explicit or even implicit. There is also latent but definite relationship among intermediate base clusterings because they are derived from the same set of data. All these demand a further investigation of clustering ensembles that explores the interdependence characteristics of data. To solve this problem, a new coupled clustering ensemble (CCE) framework that works on the interdependence nature of objects and intermediate base clusterings is proposed in this article. The main idea is to model the coupling relationship between objects by aggregating the similarity of base clusterings, and the interactive relationship among objects by addressing their neighborhood domains. Once these interdependence relationships are discovered, they will act as critical supplements to clustering ensembles. We verified our proposed framework by using three types of consensus function: clustering-based, object-based, and cluster-based. Substantial experiments on multiple synthetic and real-life benchmark datasets indicate that CCE can effectively capture the implicit interdependence relationships among base clusterings and among objects with higher clustering accuracy, stability, and robustness compared to 14 state-of-the-art techniques, supported by statistical analysis. In addition, we show that the final clustering quality is dependent on the data characteristics (e.g., quality and consistency) of base clusterings in terms of sensitivity analysis. Finally, the applications in document clustering, as well as on the datasets with much larger size and dimensionality, further demonstrate the effectiveness, efficiency, and scalability of our proposed models.

Journal Title

ACM Transactions on Knowledge Discovery from Data

Conference Title
Book Title
Edition
Volume

12

Issue

6

Thesis Type
Degree Program
School
Publisher link
Patent number
Funder(s)
Grant identifier(s)
Rights Statement
Rights Statement

© ACM, 2018. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Volume 12 Issue 6, October 2018, Article No. 63, https://doi.org/10.1145/3230967

Item Access Status
Note
Access the data
Related item(s)
Subject

Artificial intelligence

Pattern recognition

Data mining and knowledge discovery

Cybersecurity and privacy

Data management and data science

Distributed computing and systems software

Persistent link to this record
Citation
Collections