I mentor others on Machine Learning and AI as a way to contribute to the ML community. I help my mentees review their machine learning projects, share suggestions, and discuss ideas for improvement. Recently, several of those discussions turned to the topic of “metrics”.
Last month, one of my mentees sent me her ML project and asked for feedback. She was working on a side project to predict the spread of wildfire. She did a good job diving into the data and analyzing factors such as wind, rain, and temperature. I said to her: “This project looks interesting and you did good baseline work. But what’s the metric that you want to optimize for?”
Two weeks earlier, another friend of mine asked me about getting into the ML space and for interview tips. She asked, “How should I introduce my project at an ML interview?” I told her: “Start from your problem and metric. What is the key metric you want to optimize for? Then you can explain what challenges you faced and how you improved the metric by addressing those challenges.”
Metrics are key to machine learning projects:
- When I start a project, I spend effort picking the right metrics to optimize for.
- When I read a paper, I deliberately look for the key metric used to evaluate its experiments.
- When I listen to others describe their experiments, I ask about the key metrics they use for evaluation, what their baseline looks like, and what drove the improvements.
Why are metrics important?
There are a dozen things you can do in a typical machine learning project: adjust input features, change the model architecture, change hyperparameters, etc. To maximize your own productivity, you want to do the most impactful thing first. How do you tell which change is more impactful than the others? This is where metrics come in. If changing the model architecture leads to a 0.5% improvement on the key metric, whereas cleaning the data for an hour leads to a 10% improvement, it is clear what to do next: clean the data.
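The comparison above can be sketched as a small experiment log ranked by metric gain per hour of effort. This is only an illustration: the 0.5% and 10% gains and the one hour of data cleaning come from the example above, while the other hours, the hyperparameter entry, and the function name `rank_by_gain_per_hour` are assumptions made up for the sketch.

```python
# Hypothetical experiment log: each candidate change, the gain it produced on
# the key metric (in percentage points), and the hours spent on it.
# Only the 0.5 and 10.0 gains and the 1 hour of cleaning come from the text.
experiments = [
    {"change": "new model architecture", "gain_pct": 0.5, "hours": 8.0},
    {"change": "clean the data", "gain_pct": 10.0, "hours": 1.0},
    {"change": "tune hyperparameters", "gain_pct": 1.5, "hours": 4.0},
]


def rank_by_gain_per_hour(log):
    """Sort changes from most to least metric gain per hour invested."""
    return sorted(log, key=lambda e: e["gain_pct"] / e["hours"], reverse=True)


ranked = rank_by_gain_per_hour(experiments)
for entry in ranked:
    rate = entry["gain_pct"] / entry["hours"]
    print(f'{entry["change"]}: {rate:.2f} points per hour')
```

Gain per hour is just one possible way to prioritize; the point is that a single key metric makes the options comparable at all.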
Metrics give us an unbiased answer to how well our ML application serves our requirements. If we are developing a detection application, the metrics tell us how accurately the application detects the objects of interest. If we are developing a translation tool, the metrics tell us how well the tool translates from the source language to the target language. This gives a quantitative idea of how far along we are toward the final objective.
Metrics help us prioritize. In real life, we are always constrained by resources, time, and our valuable attention. We are forced to choose carefully how to allocate those constrained resources to maximize the chance of success. Having a clear metric in a project guides engineers and researchers to concentrate their time and attention on the things that improve the metric the most.
What metrics to use?
It’s important to select good, clear metrics, but it is hard. There are a few exercises that can help you select good metrics:
- Clearly define your task: what exactly are you trying to achieve? If you want to build an automated system, what are the input and output? If you want to improve user engagement, what defines engagement? 
- Clearly define your success criteria: What does success look like for your project? 
Good metrics need to be well-defined and numerical, not a vague idea (engagement, quality, etc.).
The good news is that, for many projects, you can leverage the experience and work of your predecessors. If it is an established field, there is usually a common metric that the community has agreed upon. If it is a new field, you can look at similar work from the past and use its metrics as a starting point for your own. Commonly used metrics include accuracy, precision and recall, fairness, latency, throughput, etc.
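As a minimal sketch of three of the metrics listed above, accuracy, precision, and recall can be computed for a binary classifier directly from the labels. The function name `classification_metrics` and the toy labels are assumptions made for illustration, not taken from any particular project.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or labeled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Toy labels, invented for this example.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Even on this tiny example the three numbers disagree, which is why it matters to decide up front which one reflects your success criteria.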
Having a Single-Number Metric
One lesson I learned from Andrew Ng is to always try to establish a single-number evaluation metric to optimize. Having multiple metrics makes it harder to compare methods and leads to confusion. For example:
Courtesy of Andrew Ng’s book “Machine Learning Yearning”
Classifier A has a better recall rate, whereas classifier B has higher precision. Neither classifier is obviously superior, so people inside a team may have different opinions on which one to pick.
Instead, if we use F1, which combines precision and recall into a single number, it is straightforward to choose classifier A, and everyone can agree on that. Note that you can always adjust the beta weight in the F-score to reflect your actual preference between precision and recall.
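To make the role of beta concrete, here is a minimal sketch of the F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), applied to two classifiers. The precision and recall values are illustrative numbers chosen to match the description above (A has the higher recall, B the higher precision), and the helper `best_classifier` is a name invented for this sketch.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: A has higher recall, B has higher precision.
classifiers = {
    "A": {"precision": 0.95, "recall": 0.90},
    "B": {"precision": 0.98, "recall": 0.85},
}


def best_classifier(candidates, beta=1.0):
    """Pick the candidate with the highest F-beta score."""
    return max(
        candidates,
        key=lambda name: f_beta(
            candidates[name]["precision"], candidates[name]["recall"], beta
        ),
    )


for beta in (0.5, 1.0, 2.0):
    scores = {n: round(f_beta(c["precision"], c["recall"], beta), 4)
              for n, c in classifiers.items()}
    print(f"beta={beta}: {scores} -> pick {best_classifier(classifiers, beta)}")
```

With these numbers, F1 (beta = 1) prefers A, while a precision-leaning beta of 0.5 prefers B. Either way the team argues once about beta, then ranks every classifier with one number.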
“Having a single-number evaluation metric speeds up your ability to make a decision when you are selecting among a large number of classifiers. It gives a clear preference ranking among all of them, and therefore a clear direction for progress.” — Andrew Ng
What are the limitations of metrics?
Metrics serve as high-level guidance on the current model performance and the remaining gap to the final objective. On the other hand, they deliberately ignore lots of details, clues, and caveats. For example, when developing a machine vision system, my metric can tell me that my system captures 95% of the target defects, but it won’t tell me why the remaining 5% are missed or how to close the gap. I still need to conduct error analysis to find the biggest root causes of error and address them. A compass can point you in the right direction, but you also need a map and awareness of your surroundings to walk out of a forest.
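One simple form of error analysis is to review the missed cases by hand, tag each with a suspected cause, and count the tags. The sketch below assumes that tagging has already been done; the image names, cause labels, and the function `rank_error_causes` are all hypothetical.

```python
from collections import Counter

# Hypothetical hand-labeled review of defects the system failed to detect.
missed_defects = [
    {"image": "img_014", "cause": "low contrast"},
    {"image": "img_027", "cause": "defect too small"},
    {"image": "img_031", "cause": "low contrast"},
    {"image": "img_058", "cause": "mislabeled ground truth"},
    {"image": "img_063", "cause": "low contrast"},
    {"image": "img_090", "cause": "defect too small"},
]


def rank_error_causes(errors):
    """Return (cause, count, share of all errors), most frequent cause first."""
    counts = Counter(e["cause"] for e in errors)
    total = sum(counts.values())
    return [(cause, n, n / total) for cause, n in counts.most_common()]


ranking = rank_error_causes(missed_defects)
for cause, n, share in ranking:
    print(f"{cause}: {n} misses ({share:.0%} of errors)")
```

The cause at the top of the list is where the next round of work is most likely to move the metric.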
Final Thought
Metrics are critical in machine learning projects. They help a team prioritize its resources and concentrate on a single, clear objective. I am always amazed by the speed and momentum with which my team can execute once we are aligned on a single metric to optimize. In the end, we are usually able to accomplish goals that seemed impossible at the beginning.