Speeding up Subgraph Isomorphism Search in Large Graphs
Graphs are a widely used model for representing complex data in many domains. Subgraph isomorphism search is a fundamental operation in many graph databases and data mining applications that handle graph data. This thesis studies this classic problem by proposing a set of novel techniques from three different aspects. First, this thesis considers speeding up subgraph isomorphism search by exploiting relationships among data vertices. Most subgraph isomorphism algorithms for the In-Memory model (IM) are based on a backtracking method that computes solutions by incrementally enumerating all candidate combinations. We observe that all current algorithms blindly verify each individual mapping separately, often leading to extensive duplicate calculations. We propose two novel concepts, Syntactic Equivalence and Query Dependent Equivalence, which we use to group specific candidate data vertices into a hypervertex. The data vertices belonging to the same hypervertex can be mapped to the same query vertex. Thus, whether the vertices in a hypervertex contribute to a solution can be determined for all of them simultaneously instead of separately for each. Our extensive experimental study on real datasets shows that existing subgraph isomorphism algorithms can be significantly accelerated by our approach. Secondly, this thesis considers multi-query optimization, where multiple queries are processed together to reduce the overall processing time. We propose a novel method for efficiently detecting useful common subgraphs and a data structure to organize them. We propose a heuristic algorithm based on this data structure to compute a query execution order so that cached intermediate results can be effectively utilized. To balance memory usage against the time needed to retrieve cached results, we present a novel structure for caching the intermediate results. 
We provide strategies for revising existing single-query subgraph isomorphism algorithms to seamlessly utilize the cached results, which leads to significant performance improvement. Experiments on real datasets demonstrate the effectiveness and efficiency of our multi-query optimization approach. In the third part, this thesis considers subgraph isomorphism search in distributed environments. We observe that current state-of-the-art distributed solutions rely on either crippling joins or cumbersome indices, which makes them hard to use in practice. Moreover, most of them follow the synchronous model, whose performance is often bottlenecked by the slowest machine in the cluster. Motivated by this, we take a dramatically different approach and propose PADS, a Practical Asynchronous Distributed Subgraph enumeration system. We conducted extensive experiments to evaluate the performance of PADS. Compared with existing join-oriented solutions, our system not only shows significantly better query processing efficiency but is also highly practical. Even compared with heavily indexed solutions, our approach performs better in many cases.
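As a rough illustration of the hypervertex idea from the first part, the sketch below groups data vertices that a backtracking search could treat interchangeably. The abstract does not define Syntactic Equivalence, so the criterion used here is an assumption: two vertices are grouped when they carry the same label and have exactly the same set of neighbours (which covers only non-adjacent vertices). The function name `group_equivalent_vertices` and the dictionary-based graph representation are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def group_equivalent_vertices(labels, adj):
    """Group vertices that share a label and an identical neighbour set.

    labels: dict mapping vertex -> label
    adj:    dict mapping vertex -> set of neighbouring vertices
    Returns a sorted list of groups (each a sorted list of vertices).

    Assumption: "same label and same neighbour set" stands in for the
    thesis's equivalence notion, which the abstract does not spell out.
    """
    groups = defaultdict(list)
    for v in labels:
        # Vertices with an identical (label, neighbourhood) key can be
        # swapped for one another in any embedding.
        key = (labels[v], frozenset(adj[v]))
        groups[key].append(v)
    return sorted(sorted(g) for g in groups.values())


# Star-shaped data graph: centre 0 (label A), leaves 1-3 (label B), leaf 4 (label C).
labels = {0: "A", 1: "B", 2: "B", 3: "B", 4: "C"}
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}

hypervertices = group_equivalent_vertices(labels, adj)
print(hypervertices)  # [[0], [1, 2, 3], [4]]
```

In this toy graph the three B-labelled leaves fall into one group, so a search that matches a B-labelled query vertex would need to verify only one representative instead of three separate candidates.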
Thesis (PhD Doctorate)
Doctor of Philosophy (PhD)
School of Info & Comm Tech
The author owns the copyright in this thesis, unless stated otherwise.
In-Memory model (IM)
Query Dependent Equivalence