In my first role as a machine learning (ML) product manager, a thought-provoking question sparked intense debate across teams: How do we know whether this product actually works? The product I managed served both internal staff and external customers; it helped internal teams assess customer problems so they could prioritize the right solutions. Given how intertwined these user groups were, choosing the right metrics to gauge the product's impact and direction was essential.
Neglecting to evaluate whether your product meets expectations is like flying a plane without air traffic control: without clarity on what is and isn't working, making informed decisions for customers becomes nearly impossible. Worse, if metrics aren't clearly defined, team members will invent their own, and each team may end up with a different interpretation of 'accuracy' or 'quality,' pulling toward divergent objectives.
For instance, when I discussed my annual objectives and metrics with the engineering team, their response was, “This metric is business-focused; we already track precision and recall.”
Initial Steps: Determining What You Need to Understand About Your AI Product
Once you set out to establish product metrics, where should you start? In my experience, managing an ML product that serves diverse customer bases demands a deliberate approach to defining metrics. What signals that a model is effective? Measuring whether our internal teams were well served by the models would not surface insight quickly enough, while measuring whether customers adopted the recommended solutions risked misleading conclusions, particularly if customers engaged with a recommended alternative merely as a way to reach support.
With today's large language models (LLMs), we are no longer limited to a single output type; products can return text, images, and even music. This wider range of outputs calls for a correspondingly wider range of metrics, spanning different formats, customer types, and more.
When I plan metrics for my products, I start by clarifying what I want to learn about their impact on customers, framed as a set of key questions. Pinning down these core questions makes it far easier to choose the right metrics. Here are a few illustrative examples, with a short sketch of the corresponding computations after the list:
Did the customer receive an output? → Metric for coverage
How long did it take for the product to deliver an output? → Metric for latency
Did the user appreciate the output? → Metrics for customer feedback, adoption, and retention
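To make the mapping concrete, here is a minimal sketch of how these three questions might turn into metric computations over a per-request event log. The RequestEvent fields (output_returned, latency_ms, thumbs_up) are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-request record; the field names are assumptions for
# illustration, not a prescribed schema.
@dataclass
class RequestEvent:
    request_id: str
    output_returned: bool          # did the customer receive an output?
    latency_ms: Optional[float]    # time from request to delivered output, if any
    thumbs_up: Optional[bool]      # explicit user feedback, if given

def coverage(events: list[RequestEvent]) -> float:
    """Share of requests that returned an output to the customer."""
    return sum(e.output_returned for e in events) / len(events)

def p95_latency_ms(events: list[RequestEvent]) -> float:
    """95th-percentile time to deliver an output, over served requests only."""
    served = sorted(e.latency_ms for e in events if e.latency_ms is not None)
    return served[int(0.95 * (len(served) - 1))]

def positive_feedback_rate(events: list[RequestEvent]) -> float:
    """Share of rated outputs that received positive feedback."""
    rated = [e for e in events if e.thumbs_up is not None]
    return sum(e.thumbs_up for e in rated) / len(rated)
```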
Following the identification of key questions, the next step is to formulate sub-questions related to ‘input’ and ‘output’ signals. Output metrics serve as lagging indicators, reflecting events that have already occurred. In contrast, input metrics act as leading indicators, allowing for trend analysis and outcome predictions. Below are examples of integrating sub-questions for both lagging and leading indicators. Not every question will require this dual metric structure.
Did the customer receive an output? → Coverage
How long did it take for the product to yield results? → Latency
Did the user find the output satisfactory? → Customer feedback, adoption, and retention. Two sub-questions feed this one:
Did the user assess the output as correct or incorrect? (Output)
Was the output deemed good or fair? (Input)
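One way to keep that structure explicit is to record, for each question and sub-question, the metric that answers it and whether it is a leading (input) or lagging (output) indicator. The plan below is purely illustrative; the metric names are assumptions rather than a fixed taxonomy.

```python
# Illustrative plan tying each question (and sub-question) to a metric and
# its indicator type; the metric names are assumptions, not a fixed taxonomy.
METRIC_PLAN = [
    {"question": "Did the customer receive an output?",
     "metric": "coverage",               "indicator": "lagging (output)"},
    {"question": "How long did it take to deliver an output?",
     "metric": "latency",                "indicator": "lagging (output)"},
    {"question": "Did the user rate the output correct or incorrect?",
     "metric": "positive_feedback_rate", "indicator": "lagging (output)"},
    {"question": "Was the output good or fair per the rubric?",
     "metric": "rubric_pass_rate",       "indicator": "leading (input)"},
]
```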
The third and final phase involves determining how to collect your metrics. Most data is amassed through systematic instrumentation via data engineering. However, certain questions (like the third listed above) may benefit from manual evaluation or automated assessments of model outputs. While developing automated evaluations is ideal, initiating with manual assessments of “was the output good/fair” and creating clear rubrics for classification will help lay a solid foundation for a more structured automated evaluation process later on.
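For the manual-first route, a minimal sketch might look like the following, assuming a hypothetical three-grade rubric ('good', 'fair', 'poor') and grades assigned by reviewers.

```python
import random

# Hypothetical three-grade rubric; a real rubric should spell out what
# qualifies an output for each grade.
RUBRIC_LABELS = ("good", "fair", "poor")

def sample_for_review(outputs: list[dict], n: int = 50, seed: int = 0) -> list[dict]:
    """Draw a fixed random sample of outputs for manual grading."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(n, len(outputs)))

def rubric_pass_rate(graded: list[dict]) -> float:
    """Share of graded outputs labeled 'good' or 'fair'.

    Each item is expected to carry a 'grade' key assigned by a human reviewer
    (or, later, by an automated evaluator) using RUBRIC_LABELS.
    """
    return sum(item["grade"] in ("good", "fair") for item in graded) / len(graded)
```

Because the rubric labels and the pass-rate calculation are kept separate, the same metric definition continues to work once human grades are replaced by an automated evaluator.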
Practical Applications: AI Search and Listing Descriptions
The aforementioned framework can be utilized across any ML-based product to determine essential metrics. For instance, consider search capabilities.
Did the customer receive an output? → Coverage: % of search sessions displaying results to customers (Output)
How long did it take for the product to generate an output? → Latency: time taken to present search results (Output)
Did the user appreciate the output? → Customer feedback, adoption, and retention
Did the user indicate whether the output was correct or incorrect? → % of search sessions receiving positive feedback on results, or % of sessions leading to user clicks (Output)
Was the output considered good or fair? → % of search results labeled 'good/fair' per the quality standard (Input)
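As a rough sketch, these search metrics could be computed from session logs along the following lines; the SearchSession fields are assumed for illustration and would depend on how sessions are actually instrumented.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical search-session record; the field names are assumptions.
@dataclass
class SearchSession:
    session_id: str
    results_shown: bool        # did the session display any results?
    latency_ms: float          # time taken to present results
    clicked_result: bool       # did the user click on a result?
    feedback: Optional[str]    # "positive", "negative", or None if not rated

def search_coverage(sessions: list[SearchSession]) -> float:
    """% of search sessions that displayed results to the customer (output)."""
    return 100 * sum(s.results_shown for s in sessions) / len(sessions)

def click_through_rate(sessions: list[SearchSession]) -> float:
    """% of sessions that led to a click, a proxy for liking the output (output)."""
    return 100 * sum(s.clicked_result for s in sessions) / len(sessions)

def positive_feedback_rate(sessions: list[SearchSession]) -> float:
    """% of rated sessions whose results received positive feedback (output)."""
    rated = [s for s in sessions if s.feedback is not None]
    return 100 * sum(s.feedback == "positive" for s in rated) / len(rated)
```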
Let’s extend this to generating descriptions for listings (such as menu items on DoorDash or product entries on Amazon).
Did the customer receive an output? → Coverage: % of listings with generated descriptions (Output)
How long did it take to produce an output? → Latency: time taken to generate descriptions for users (Output)
Did the user like the output? → Customer feedback, adoption, and retention
Did the user indicate whether the output was right or wrong? → % of listings with generated descriptions requiring edits by the content team/seller/customer (Output)
Was the output judged as good or fair? → % of descriptions rated 'good/fair' according to the quality rubric (Input)
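A similar sketch for listing descriptions, assuming hypothetical fields recording whether a description was generated, whether it was edited, and its rubric grade:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical listing record; the field names are assumptions for illustration.
@dataclass
class Listing:
    listing_id: str
    generated_description: Optional[str]  # None if no description was generated
    was_edited: bool                      # edited by content team, seller or customer
    rubric_grade: Optional[str]           # "good", "fair", "poor", or None if ungraded

def description_coverage(listings: list[Listing]) -> float:
    """% of listings with a generated description (lagging/output)."""
    return 100 * sum(l.generated_description is not None for l in listings) / len(listings)

def edit_rate(listings: list[Listing]) -> float:
    """% of generated descriptions that needed edits (lagging/output)."""
    generated = [l for l in listings if l.generated_description is not None]
    return 100 * sum(l.was_edited for l in generated) / len(generated)

def rubric_quality_rate(listings: list[Listing]) -> float:
    """% of graded descriptions rated 'good' or 'fair' (leading/input)."""
    graded = [l for l in listings if l.rubric_grade is not None]
    return 100 * sum(l.rubric_grade in ("good", "fair") for l in graded) / len(graded)
```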
This approach adapts readily to other ML-based products and can go a long way toward defining the right metrics for your model.
Sharanya Rao is a group product manager at Intuit.
Source: venturebeat.com