Missing value imputation for the analysis of incomplete traffic accident data
Death, injury and disability resulting from road traffic crashes continue to be a major global public health problem. Recent data suggest that traffic crashes kill more than 1.25 million people each year, with non-fatal injuries affecting a further 20–50 million people. It is predicted that by 2030 road traffic accidents will have become the fifth leading cause of death and that the number of people dying annually from traffic accidents will have doubled from current levels. Both developed and developing countries suffer the consequences of a growing human population and, with it, a growing vehicle population. Methods to reduce accident severity are therefore of great interest to traffic agencies and the public at large. To analyse traffic accident factors effectively, we need a complete historical database of traffic accidents. Any missing data in the database could prevent the discovery of important environmental and road accident factors and lead to invalid conclusions. In this paper, we present a novel imputation method that exploits within-record and between-record correlations to impute missing numerical or categorical values. In addition, our algorithm accounts for uncertainty in real-world data by sampling from a list of potential imputed values according to their affinity degree. We evaluated our algorithm using four publicly available traffic accident databases from the United States. The first is from the largest open federal database in the United States (explore.data.gov), and the second is based on the National Incident Based Reporting System (NIBRS) of the city and county of Denver (data.opencolorado.org). The other two are from New York's open data portal (Motor Vehicle Crashes-case information: 2011 and Motor Vehicle Crashes-individual information: 2011, data.ny.gov). We compare our algorithm with four state-of-the-art imputation methods in terms of missing value imputation accuracy and root mean square error (RMSE).
Our results indicate that the proposed method performs significantly better than the four existing algorithms against which it was compared.
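The sampling step and the two evaluation measures mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the affinity degree is computed, so the sketch assumes each candidate value already comes with a non-negative affinity weight, and the function names (sample_by_affinity, imputation_accuracy, rmse) are hypothetical.

```python
import math
import random


def sample_by_affinity(candidates, rng=random):
    """Draw one candidate value with probability proportional to its affinity.

    `candidates` is a list of (value, affinity) pairs. How the affinity is
    derived is an assumption left outside this sketch.
    """
    values = [value for value, _ in candidates]
    weights = [affinity for _, affinity in candidates]
    return rng.choices(values, weights=weights, k=1)[0]


def imputation_accuracy(true_values, imputed_values):
    """Fraction of imputed values that exactly match the held-out true values."""
    matches = sum(t == p for t, p in zip(true_values, imputed_values))
    return matches / len(true_values)


def rmse(true_values, imputed_values):
    """Root mean square error between true and imputed numerical values."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative candidates for a missing categorical field (made-up weights).
road_surface = sample_by_affinity([("dry", 0.7), ("wet", 0.2), ("icy", 0.1)])
```

Sampling in proportion to affinity, rather than always taking the highest-affinity candidate, is one way to reflect the uncertainty the abstract refers to; accuracy applies to categorical attributes and RMSE to numerical ones.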
Information and Computing Sciences not elsewhere classified