Mean Absolute Percentage Error (MAPE) has served its duty and should now retire


According to Gartner (2018 Gartner Sales & Operations Planning Success Survey), the most popular evaluation metric for forecasts in Sales and Operations Planning is Mean Absolute Percentage Error (MAPE). This needs to change. Modern forecasts concern small quantities on a disaggregated level such as product-location-day. For such granular forecasts, MAPE values are extremely hard to judge, which disqualifies them as useful forecast quality indicators. MAPE also deeply misleads users by both exaggerating some problems and disguising others, nudging them towards forecasts with systematic bias. The situations in which MAPE is suitable are becoming increasingly rare. This is not dry theory: we simulate a supermarket whose replenishment relies on a MAPE-optimizing forecast. The resulting under- and overstocks in the fast- and slow-sellers quickly push the store out of business.

When absolute and relative errors contradict — whom should we trust?

You predicted a demand of 7.2 apples and 9 were eventually sold. You predicted 91.8 bottles of water and 108 were sold. You predicted 1.9 cans of tuna and one was sold. How do you judge these forecasting errors? A straightforward approach is to take the absolute deviation of the prediction from the actual and divide by that actual, i.e. the relative absolute error, possibly expressed as a percentage (absolute percentage error, APE). That sounds more involved than it is: coming up with APE as a first shot at “forecast quality evaluation” is quite typical. For the three examples, you obtain APEs of a seemingly moderate 20% (=|7.2-9|/9), a modest 15% (=|91.8-108|/108) and an alarming 90% (=|1.9-1|/1), respectively. The MAPE, mean absolute percentage error, is the arithmetic mean of these three percentages and amounts to 41.67%. These error percentages convey that the forecast on tuna is worse than the one on apples, and that the forecast on bottles outperforms the others. But does this truly reflect forecast quality? Look again at the beginning of this section: the large absolute difference between forecasted and actual water bottles is worrisome, and its small relative error cannot really reassure you. On the other hand, the 90% error on tuna could be due to random (bad) luck, since it amounts to only a single item. Should you keep your intuition quiet and blindly rely on the APEs? Consequently, should you revise the tuna forecast and leave the water forecast as it is? If another forecast is issued, with an overall MAPE of only 30%, is that new forecast necessarily better?
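For concreteness, here is a minimal sketch of this computation in Python; the item names and numbers are simply the three examples above, not data from any real system.

```python
# Minimal sketch: APE per item and MAPE across items,
# using the apples / water bottles / tuna cans example from the text.

forecasts = {"apples": 7.2, "water bottles": 91.8, "tuna cans": 1.9}
actuals   = {"apples": 9,   "water bottles": 108,  "tuna cans": 1}

def ape(forecast, actual):
    """Absolute percentage error: |forecast - actual| / actual."""
    return abs(forecast - actual) / actual

apes = {item: ape(forecasts[item], actuals[item]) for item in forecasts}
mape = sum(apes.values()) / len(apes)

for item, value in apes.items():
    print(f"{item}: APE = {value:.0%}")   # 20%, 15%, 90%
print(f"MAPE = {mape:.2%}")               # 41.67%
```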

Of course, under no circumstance would I ever seriously ask you to ignore your human judgement! This unpleasant paradox is resolved below: MAPE is unsuitable for modern probabilistic forecasts on a granular level (i.e. product-location-day, on which “small” numbers or even “0” can occur), due to several intolerable and unsolvable problems. A forecast’s MAPE doesn’t tell us how good that forecast is, but how oddly APE behaves.

Consciously ignoring scale: When percentage errors can make sense

Before diving into granular forecasting in retail (on product-location-day level), let’s suppose we predict a much larger quantity: the yearly gross domestic product (GDP) of countries, measured in US$. Such a forecast might be used to define policies for entire countries, and these policies should be equally applicable to countries of different sizes. Therefore, it is fair to weight each country equally in this use case: a 5% error on the US GDP (about 25 trillion US$) hurts just as much as a 5% error on the Tuvalu GDP (about 66 million US$, 380,000 times smaller than the US GDP). Here, absolute percentage error (APE) makes sense: the actual GDP is never close to 0 (which would cause a terrible headache when dividing by it, I’ll come to that below), and the forecast aim is not to get the overall GDP of the planet right, but to be as close as possible for each individual country, across scales ranging from millions to trillions. Minimizing the total absolute error of the model (i.e. the error in US$, not in percentages) puts the largest economies into the spotlight and disregards the small ones. It does not weight each country equally, but by its economic power. A model with a nice 3% error on the US GDP and an unacceptable 200% error on the Tuvalu GDP would appear to be “better” in absolute US$ terms than a model with a 4% error on the US GDP and a 10% error on the Tuvalu GDP. MAPE, on the other hand, points towards using the latter forecast, which sacrifices a lot of absolute GDP accuracy on the US (1% of 25 trillion US$) for a modest absolute improvement of the accuracy on Tuvalu (190% of 66 million US$). The US GDP is much larger than Tuvalu’s, but one would consciously, and for good reasons, decide to treat them equally. Both the US and Tuvalu can be considered “large” in the sense that one cannot expect statistical fluctuations or “bad luck” to be responsible for the forecast error: deviations will typically be statistically significant and point towards model improvement potential.
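To make the trade-off concrete, here is a small sketch comparing the two hypothetical models by MAPE and by total absolute error. The GDP figures and error percentages are the illustrative ones from the paragraph above, not real data or real models.

```python
# Hedged sketch: two hypothetical GDP forecast models, evaluated once by MAPE
# (each country weighted equally) and once by total absolute error in US$
# (each country weighted by its economic size).

actual_gdp = {"US": 25e12, "Tuvalu": 66e6}        # approximate GDP in US$

# Percentage errors of the two hypothetical models from the text.
model_a_pct = {"US": 0.03, "Tuvalu": 2.00}        # 3% on US, 200% on Tuvalu
model_b_pct = {"US": 0.04, "Tuvalu": 0.10}        # 4% on US, 10% on Tuvalu

def mape(pct_errors):
    """Mean of the absolute percentage errors (given as fractions)."""
    return sum(pct_errors.values()) / len(pct_errors)

def total_abs_error(pct_errors, actuals):
    """Total absolute error in US$ implied by the percentage errors."""
    return sum(pct_errors[c] * actuals[c] for c in actuals)

for name, pct in [("model A", model_a_pct), ("model B", model_b_pct)]:
    print(f"{name}: MAPE = {mape(pct):.1%}, "
          f"total absolute error = {total_abs_error(pct, actual_gdp):,.0f} US$")

# model A: MAPE = 101.5%, total absolute error ~ 0.75 trillion US$  (wins in US$)
# model B: MAPE =   7.0%, total absolute error ~ 1.00 trillion US$  (wins on MAPE)
```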

In summary, whenever individual instances of a forecast with very different magnitudes should be treated equally, i.e. whenever we are fine with comparing enormous apples to minuscule oranges, MAPE can make sense. But is an equal treatment always fair?
