Application of hierarchical agglomerative and k-means clustering on product sales data for forecasting

School of Business | Master's thesis
Degree programme
Information and Service Management (ISM)
This master’s thesis applies two clustering methods to the product-level sales data of a grocery retailer, with the aim of improving the quality of demand forecasts and thereby achieving greater business value. Its main theoretical contribution is to add to the existing literature on time series clustering, with a focus on grocery retail product sales data. Its practical contribution is to recommend a suitable clustering method, based on an empirical analysis of the retailer’s data, that could potentially improve the forecast accuracy of products.

Weekly sales data for 2018 and 2019 for 3379 products are collected from one store of a major European grocery retail chain, and Hierarchical Agglomerative Clustering (HAC) (Ward, 1963) and K-means (MacQueen, 1967) are applied to the sales time series. After clustering, sample time series from the clusters are visually analyzed for similarities in their behavior. To find empirical evidence of the effect of clustering, forecasts are computed for the clustered time series over a 9-week test period covering January and February 2020. These aggregated forecasts are then disaggregated to item level, and forecast accuracy is calculated for each time series. Recommendations on suitable clustering methods are based on a comparison between the forecast accuracies achieved with each clustering method and those achieved by the default forecasting approach currently used by the retailer’s forecast support system (FSS).

The results suggest that clustering does not identify clear groups of time series with similar behavior in the retailer’s data. The default forecasting approach outperforms HAC, giving better forecast accuracy for 1847 of the 3379 items. Compared with K-means, the default approach gives better forecast accuracy for 1691 items, marginally more than half of the total, in 5 or more weeks of the test period. In conclusion, the clustering methods appear to assign time series to clusters randomly, and there is no clear evidence of logical grouping of the time series. This study therefore recommends against using clustering as a means of improving forecasts and suggests exploring more advanced forecasting techniques as an avenue for future research.
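
The disaggregation and comparison steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state how cluster-level forecasts are split back to items or which accuracy measure is used, so the sketch assumes a split proportional to each item's historical sales share and a per-week absolute error; the function names `disaggregate` and `weeks_won` and the toy data are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def disaggregate(history, clusters, cluster_forecast):
    """Split each cluster-level forecast across its items.

    Assumption (not from the thesis): every item receives a share of the
    cluster forecast equal to its share of the cluster's historical sales.
    """
    # Total historical sales per cluster.
    totals = defaultdict(float)
    for item, series in history.items():
        totals[clusters[item]] += sum(series)

    item_forecast = {}
    for item, series in history.items():
        cluster = clusters[item]
        share = sum(series) / totals[cluster] if totals[cluster] else 0.0
        item_forecast[item] = [share * f for f in cluster_forecast[cluster]]
    return item_forecast


def weeks_won(actual, candidate, baseline):
    """Count weeks where the candidate forecast has a strictly smaller
    absolute error than the baseline forecast."""
    return sum(
        abs(a - c) < abs(a - b)
        for a, c, b in zip(actual, candidate, baseline)
    )


# Hypothetical toy data: three items, two clusters, two test weeks.
history = {"A": [10, 10], "B": [30, 30], "C": [5, 5]}
clusters = {"A": 0, "B": 0, "C": 1}
cluster_forecast = {0: [40.0, 44.0], 1: [5.0, 6.0]}

item_forecast = disaggregate(history, clusters, cluster_forecast)
```

With a 9-week test period, an item would count toward the default approach when `weeks_won(actual, default_forecast, clustered_forecast)` is 5 or more, which mirrors the "5 or more weeks" criterion reported for the K-means comparison.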
Thesis advisor
Malo, Pekka
Keywords
clustering, time series clustering, forecasting, forecast accuracy