In the world of machine learning, performance measures are king. We use them to evaluate the quality of our models and compare their effectiveness against others. However, there's more to performance than the average risk (expected loss) alone. In this article, we explore the limitations of relying solely on risk-based measures and why it's time to broaden our horizons in evaluating machine learning models.
Risk-Based Measures: The Classical Paradigm
Risk-based measures have been the de facto standard for evaluating machine learning performance since the dawn of statistical machine learning. The most common measure is the average test loss, which estimates how well a model generalizes to unseen data. While this measure has served us well in many cases, it can be misleading when applied too broadly.
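As a concrete reference point, here is a minimal sketch of what "average test loss" means in practice. The per-example losses and the helper name are hypothetical; the only assumption is a NumPy environment.

```python
import numpy as np

def average_test_loss(losses: np.ndarray) -> float:
    """Empirical risk: the mean of per-example losses on a held-out test set."""
    return float(np.mean(losses))

# Hypothetical per-example losses on a test set.
test_losses = np.array([0.1, 0.3, 0.05, 2.0, 0.2])
print(average_test_loss(test_losses))  # 0.53
```

A single number like this is convenient for model comparison, which is exactly why it became the default summary of performance.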
The Problem with Average Risk
Averaging risk across different classes or samples can mask important differences between them. For instance, consider a binary classification problem where the model performs much worse on class A than on class B. If class B dominates the test set, the overall average loss can still look excellent, even though predictions for class A are unreliable. By abstracting away these nuances, we risk overlooking aspects of performance that are critical in certain applications.
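The sketch below illustrates this masking effect with hypothetical numbers: the overall average looks acceptable while the minority class is an order of magnitude worse. The loss values and labels are made up for illustration only.

```python
import numpy as np

# Hypothetical per-example losses and labels on an imbalanced test set:
# class B (label 1) dominates; class A (label 0) is rare and poorly predicted.
losses = np.array([1.8, 2.1, 0.05, 0.1, 0.08, 0.07, 0.06, 0.09, 0.1, 0.05])
labels = np.array([0,   0,   1,    1,   1,    1,    1,    1,    1,   1])

overall = losses.mean()
per_class = {int(c): float(losses[labels == c].mean()) for c in np.unique(labels)}

print(f"overall average loss: {overall:.3f}")   # 0.450 -- looks fine
print(f"per-class average loss: {per_class}")   # {0: 1.95, 1: 0.075} -- class A is far worse
```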
Examples and Implications
To drive this point home, let’s consider a few examples:
- In robust optimization, average test loss is not the right yardstick. As the name suggests, this paradigm optimizes performance under the worst-case scenario rather than the average case. Judging such a model by its average test loss can make a non-robust model look better than a robust one, defeating the purpose of the approach.
- Game theory provides another instance where average test loss is insufficient. In multi-agent or adversarial settings, and when fairness matters, we often care about outcomes for the worst-off player or group, and measuring performance solely through average test loss obscures these trade-offs.
- Distributional robustness is another area where average test loss falls short. By focusing only on minimizing average risk under the training distribution, we may neglect considerations like calibration or how the model behaves under distribution shift; see the sketch after this list for a simple contrast between average and worst-group loss.
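To make the contrast concrete, here is a minimal sketch of a worst-group loss, the kind of quantity reported in (group) distributionally robust evaluation instead of the plain average. The group assignments and loss values are hypothetical.

```python
import numpy as np

def average_loss(losses: np.ndarray) -> float:
    """Standard risk-based summary: mean loss over all examples."""
    return float(losses.mean())

def worst_group_loss(losses: np.ndarray, groups: np.ndarray) -> float:
    """Worst-case average loss over groups -- a robustness-oriented summary."""
    return float(max(losses[groups == g].mean() for g in np.unique(groups)))

# Hypothetical losses split across two subpopulations.
losses = np.array([0.1, 0.2, 0.15, 1.5, 1.7])
groups = np.array([0,   0,   0,    1,   1])

print(average_loss(losses))              # 0.73 -- hides the weak subpopulation
print(worst_group_loss(losses, groups))  # 1.60 -- what a robust evaluation reports
```

The two summaries can rank models differently, which is precisely why the choice of measure matters.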
Conclusion: Time for a Broader Definition of Performance
While average test loss has been an indispensable tool for evaluating machine learning models, it's time to broaden our definition of "good" performance. By incorporating measures that reflect robustness, fairness, calibration, and other critical properties, we can build evaluations that better match the complexities of real-world applications. It's no longer enough to rely solely on risk-based measures; instead, let's embrace a more nuanced understanding of what constitutes good performance in machine learning.