Performance Comparison of Distributed DNN Training on Optical Versus Electrical Interconnect Systems

Author(s)
Dai, Fei
Chen, Yawen
Huang, Zhiyi
Zhang, Haibo
Tian, Hui
Date
2024
Location

Tianjin, China

Abstract

Parallel and distributed Deep Neural Network (DNN) training has become integral to data centers, significantly reducing DNN training time. The type of interconnect among nodes and the chosen all-reduce algorithm critically affect this speed-up. This paper examines the efficiency differences in distributed DNN training across optical and electrical interconnect systems using various all-reduce algorithms. We first explore the Ring and Recursive Doubling (RD) all-reduce algorithms in both systems and then formulate a communication cost model for these algorithms. A performance comparison is then carried out via extensive experiments. Our results reveal that, in 1024-node systems, the Ring algorithm outperforms the RD algorithm in optical and electrical interconnects when data transfer exceeds 64 MB and 1024 MB, respectively. We also find that both the Ring and RD algorithms in optical interconnect systems reduce average communication time by around 75% compared to electrical interconnect systems across four different DNNs. Interestingly, the communication time of the RD algorithm, but not the Ring algorithm, decreases as the number of wavelengths increases in optical interconnects. These findings provide valuable insights into DNN training optimization across various interconnect systems and lay the groundwork for future related research.
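
The existence of a message-size threshold above which Ring beats RD can be illustrated with the textbook alpha-beta (latency-bandwidth) cost model. The sketch below is not the cost model formulated in the paper, which is not reproduced in this record; it ignores reduction compute, topology, and wavelength effects. The function names and the values of `alpha` (per-message latency in seconds) and `beta` (seconds per byte) are hypothetical and chosen only for illustration.

```python
import math


def _require_power_of_two(p: int) -> None:
    # Recursive doubling in its basic form assumes a power-of-two node count.
    if p < 2 or p & (p - 1) != 0:
        raise ValueError("p must be a power of two and at least 2")


def ring_allreduce_time(n_bytes: float, p: int, alpha: float, beta: float) -> float:
    """Textbook Ring all-reduce cost: 2(p-1) steps, each moving n/p bytes."""
    return 2 * (p - 1) * (alpha + (n_bytes / p) * beta)


def rd_allreduce_time(n_bytes: float, p: int, alpha: float, beta: float) -> float:
    """Textbook Recursive Doubling cost: log2(p) steps, each exchanging all n bytes."""
    _require_power_of_two(p)
    steps = p.bit_length() - 1  # exact log2 for a power of two
    return steps * (alpha + n_bytes * beta)


def crossover_bytes(p: int, alpha: float, beta: float) -> float:
    """Message size at which the two textbook costs are equal (needs p >= 4)."""
    _require_power_of_two(p)
    if p < 4:
        raise ValueError("no finite crossover for p < 4 in this model")
    steps = p.bit_length() - 1
    return alpha * (2 * (p - 1) - steps) / (beta * (steps - 2 * (p - 1) / p))


if __name__ == "__main__":
    # Hypothetical link parameters: 1 microsecond latency, 10 GB/s bandwidth.
    p, alpha, beta = 1024, 1e-6, 1e-10
    n_star = crossover_bytes(p, alpha, beta)
    for n in (n_star / 4, n_star, 4 * n_star):
        ring = ring_allreduce_time(n, p, alpha, beta)
        rd = rd_allreduce_time(n, p, alpha, beta)
        print(f"n={n:.3e} B  ring={ring:.3e} s  rd={rd:.3e} s")
```

In this simplified model Ring pays many small latency-bound steps while RD pays few steps that each move the full buffer, so Ring wins only once the buffer is large enough for bandwidth to dominate. The thresholds reported in the abstract (64 MB and 1024 MB) come from the paper's own model and experiments and should not be expected to match the numbers this sketch prints.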

Conference Title

Algorithms and Architectures for Parallel Processing: 23rd International Conference, ICA3PP 2023, Tianjin, China, October 20–22, 2023, Proceedings, Part I

Volume

14487

Subject

Information and computing sciences

Citation

Dai, F; Chen, Y; Huang, Z; Zhang, H; Tian, H, Performance Comparison of Distributed DNN Training on Optical Versus Electrical Interconnect Systems, Algorithms and Architectures for Parallel Processing: 23rd International Conference, ICA3PP 2023, Tianjin, China, October 20–22, 2023, Proceedings, Part I, 2024, pp. 401-418