Design and evaluation of small–large outer joins in cloud computing environments

File version
Accepted Manuscript (AM)
Author(s)
Cheng, L
Tachmazidis, I
Kotoulas, S
Antoniou, G
Date
2017
License
http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract

Large-scale analytics is a key application area for data processing and parallel computing research. One of the most common (and challenging) operations in this domain is the join. Although inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work analyzing outer joins, especially in widely used cloud computing environments. A common type of outer join is the small–large outer join, where one relation is relatively small and the other is large. Conventional implementations for this case, such as those based on hash redistribution, often incur significant network communication, while duplication-based approaches are complex and inefficient. In this work, we present a new method called DDR (duplication and direct redistribution), which aims to enable efficient small–large outer joins in cloud computing environments while being easy to implement using existing predicates in data processing frameworks. We present the detailed implementation of our approach and evaluate its performance through extensive experiments on the widely used MapReduce and Spark platforms. We show that the proposed method is scalable and can achieve significant performance improvements over the conventional approaches. Compared to the state-of-the-art method, the DDR algorithm is shown to be easier to implement and to achieve similar or better performance under different outer join workloads; it can therefore be considered a new option for current data analysis applications. Moreover, our detailed experimental results provide insights into current small–large outer join implementations, allowing system developers to make a more informed choice for their data analysis applications.
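
The abstract names the two ingredients of DDR (duplication and direct redistribution) but does not spell out its steps. The following is a minimal single-process sketch of how a duplication-based small–large left outer join could be organised, under assumptions that are not taken from the record: the small relation is copied to every simulated node, each node joins it against its local partition of the large relation, and locally unmatched rows are redistributed by row id so that one node can decide whether a row is unmatched everywhere. The function name, the row layout `(key, value)`, and the `row_id % n` placement rule are all hypothetical, and this should not be read as the authors' implementation.

```python
from collections import Counter, defaultdict


def small_large_left_outer_join(small, large_partitions):
    """Simulate small LEFT OUTER JOIN large over len(large_partitions) nodes.

    small: list of (key, value) rows of the small relation.
    large_partitions: one list of (key, value) rows per simulated node.
    Returns a list of (key, small_value, large_value_or_None) tuples.
    """
    n = len(large_partitions)
    if n == 0:
        # No large data at all: every small row is dangling.
        return [(key, value, None) for key, value in small]

    results = []
    # inbox[i] holds the ids of locally unmatched small rows sent to node i
    # (assumed placement rule: row_id % n).
    inbox = defaultdict(list)

    for partition in large_partitions:
        # Duplication: every node works on a full copy of `small`.
        index = defaultdict(list)
        for key, value in partition:
            index[key].append(value)

        # Local inner join against this node's partition of the large relation.
        for row_id, (key, value) in enumerate(small):
            matches = index.get(key, [])
            for large_value in matches:
                results.append((key, value, large_value))
            if not matches:
                # Redistribution: report the unmatched row id to one node.
                inbox[row_id % n].append(row_id)

    # A small row is globally unmatched only if all n nodes reported it.
    for node in range(n):
        for row_id, count in Counter(inbox[node]).items():
            if count == n:
                key, value = small[row_id]
                results.append((key, value, None))
    return results
```

For example, with `small = [(1, "a"), (2, "b"), (3, "c")]` and two partitions `[[(1, "x"), (4, "w")], [(1, "y"), (3, "z")]]`, only the row with key 2 is unmatched on both nodes, so it is the only one emitted with a `None` on the large side. Only row ids of unmatched small rows cross the simulated network here, which is the kind of saving over hashing the whole large relation that the abstract alludes to.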

Journal Title
Journal of Parallel and Distributed Computing
Volume
110
Rights Statement
© 2017 Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Licence (http://creativecommons.org/licenses/by-nc-nd/4.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, providing that the work is properly cited.
Subject
Software engineering
Citation
Cheng, L; Tachmazidis, I; Kotoulas, S; Antoniou, G, Design and evaluation of small–large outer joins in cloud computing environments, Journal of Parallel and Distributed Computing, 2017, 110, pp. 2-15