RegMiner: Towards constructing a large regression dataset from code evolution history
File version
Author(s)
Lin, Y
Ng, SH
Wu, Y
Peng, X
Dong, JS
Mei, H
Griffith University Author(s)
Primary Supervisor
Other Supervisors
Editor(s)
Date
Size
File type(s)
Location
Virtual
License
Abstract
Bug datasets lay significant empirical and experimental foundation for various SE/PL researches such as fault localization, software testing, and program repair. Current well-known datasets are constructed manually, which inevitably limits their scalability, representativeness, and the support for the emerging data-driven research. In this work, we propose an approach to automate the process of harvesting replicable regression bugs from the code evolution history. We focus on regression bugs, as they (1) manifest how a bug is introduced and fixed (as non-regression bugs), (2) support regression bug analysis, and (3) incorporate more specification (i.e., both the original passing version and the fixing version) than nonregression bug dataset for bug analysis. Technically, we address an information retrieval problem on code evolution history. Given a code repository, we search for regressions where a test can pass a regression-fixing commit, fail a regression-inducing commit, and pass a previous working commit. We address the challenges of (1) identifying potential regression-fixing commits from the code evolution history, (2) migrating the test and its code dependencies over the history, and (3) minimizing the compilation overhead during the regression search. We build our tool, RegMiner, which harvested 1035 regressions over 147 projects in 8 weeks, creating the largest replicable regression dataset within the shortest period, to the best of our knowledge. Our extensive experiments show that (1) RegMiner can construct the regression dataset with very high precision and acceptable recall, and (2) the constructed regression dataset is of high authenticity and diversity. We foresee that a continuously growing regression dataset opens many data-driven research opportunities in the SE/PL communities.
Journal Title
Conference Title
ISSTA 2022: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis
Book Title
Edition
Volume
Issue
Thesis Type
Degree Program
School
Publisher link
Patent number
Funder(s)
Grant identifier(s)
Rights Statement
Rights Statement
Item Access Status
Note
Access the data
Related item(s)
Subject
Software engineering
Persistent link to this record
Citation
Song, X; Lin, Y; Ng, SH; Wu, Y; Peng, X; Dong, JS; Mei, H, RegMiner: Towards constructing a large regression dataset from code evolution history, ISSTA 2022: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 314-326