In our previous article, “AI Evals: Understanding and Writing Them — Must-Have Skills for AI PMs”, we explored what AI evals are, why they are critical for building reliable AI products, the R-F-R-G-E-S-R-L evals framework, and the key challenges in evaluating AI systems.
In this article, we will take a deeper dive into the different types of AI evals, examining how each type works, what it measures, and how it can be applied to ensure your AI products meet both technical and business expectations.
Evaluating AI isn’t just about technical correctness; it’s about ensuring outputs align with user needs and business impact. Metrics can be grouped by task type, and each can be reference-based or reference-free depending on whether outputs are compared to a known “ground truth.” In reference-based evals, the model output is compared to a predefined “gold” answer, while in reference-free evals, the output is judged against internal logic or the source context.
Now, let’s look at the different types of AI evaluation metrics (classification, regression, ranking/retrieval, generative, embedding, and business/composite), along with the reference-based vs. reference-free distinction for each.
1️⃣ Classification Metrics
Task: Predict discrete labels (e.g., spam vs non-spam)
Reference-Based Examples: Accuracy, Precision, Recall, F1 Score, ROC-AUC
Reference-Free Examples: Human evaluation for label appropriateness or fairness
What it catches: Correct predictions, misclassification patterns, edge-case failures
PM Perspective: Helps prioritize improvements on high-impact categories and measure alignment with user expectations
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Accuracy | Overall correctness | Fraction of correct predictions over total predictions | When class distribution is balanced | Simple overall measure, but can hide poor performance on minority classes |
Precision | Correctness of positive predictions | True Positives / (True Positives + False Positives) | When false positives are costly (e.g., misclassifying legitimate email as spam) | Helps PMs understand risk of incorrect alerts to users |
Recall (Sensitivity) | Coverage of positive cases | True Positives / (True Positives + False Negatives) | When missing positives is costly (e.g., detecting fraud, disease) | Helps PMs prioritize catching all critical cases |
F1 Score | Balance of precision & recall | Harmonic mean of precision and recall | When there’s class imbalance or trade-offs between false positives/negatives | Provides a single score for comparing models in edge-case scenarios |
ROC-AUC | Model’s ability to distinguish classes | Area under the Receiver Operating Characteristic curve (TPR vs FPR at different thresholds) | When overall discrimination matters | Shows model’s robustness across thresholds, useful for threshold tuning |
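To make these concrete, here is a minimal sketch of computing the classification metrics above with scikit-learn; the label and score arrays are toy placeholders, not real data.

```python
# Minimal sketch: classification evals with scikit-learn (toy data).
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Ground-truth labels (1 = spam, 0 = not spam) and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Predicted probabilities for the positive class, used for ROC-AUC.
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))
```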
2️⃣ Regression Metrics
Task: Predict continuous values (e.g., sales, temperature)
Reference-Based Examples: Mean Absolute Error (MAE), Mean Squared Error (MSE), R²
Reference-Free Examples: Sometimes strict numeric ground truth is unavailable or incomplete, or business context matters more than exact numbers. In such cases:
Expert review of predicted trends: Humans check if forecasts follow expected patterns (e.g., seasonal peaks, expected declines).
Sanity checks: Ensure outputs are within plausible ranges or logical constraints.
Anomaly detection: Identify predictions that are statistically or operationally unusual, even if exact ground truth isn’t known.
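As a concrete illustration of the reference-free checks above, here is a minimal sketch of a range sanity check and a simple anomaly flag; the plausible range and the z-score threshold are assumptions you would set with domain experts.

```python
# Reference-free sanity checks on forecasts, with no ground truth needed.
# The plausible range and z-score threshold are illustrative assumptions.
import numpy as np

forecasts = np.array([120.0, 135.0, 128.0, 900.0, 131.0, -5.0])

# Sanity check: values must fall inside a domain-defined plausible range.
low, high = 0.0, 500.0
out_of_range = (forecasts < low) | (forecasts > high)

# Anomaly check: flag forecasts far from the batch median (robust z-score).
median = np.median(forecasts)
mad = np.median(np.abs(forecasts - median)) or 1.0
robust_z = 0.6745 * (forecasts - median) / mad
anomalies = np.abs(robust_z) > 3.5

for i, value in enumerate(forecasts):
    if out_of_range[i] or anomalies[i]:
        print(f"Flag forecast #{i}: {value} (in_range={not out_of_range[i]}, z={robust_z[i]:.1f})")
```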
What it catches: Accuracy of predictions, deviation patterns, trends in errors
PM Perspective: Ensures outputs are reliable enough to drive decisions and business KPIs
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Mean Absolute Error (MAE) | Average magnitude of errors | Average of \|Predicted − Actual\| across all samples | When all errors are equally important | Provides an intuitive sense of average deviation, easy to explain to stakeholders |
Mean Squared Error (MSE) | Average squared errors | Average of (Predicted − Actual)² | When large errors are especially costly | Penalizes big mistakes more, useful for risk-sensitive forecasts |
Root Mean Squared Error (RMSE) | Same as MSE, but in original units | Square root of MSE | When errors should be interpreted in the target’s original units | Easy for PMs to communicate expected error magnitude |
R² (Coefficient of Determination) | How well the model explains variance | 1 − (SS_residual / SS_total) | When understanding variance capture matters | Shows overall predictive power relative to simple baselines |
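A minimal sketch of the reference-based regression metrics above, using scikit-learn on toy numbers:

```python
# Minimal sketch: reference-based regression evals with scikit-learn (toy data).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 210, 230, 310])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # RMSE, in the original units of the target
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R²={r2:.3f}")
```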
3️⃣ Ranking / Retrieval Metrics
Task: Return ranked lists (e.g., search results, recommendations)
Reference-Based Examples: Precision@K, Recall@K, NDCG, MRR, Hit Rate
Reference-Free Examples: Sometimes ground truth relevance is unavailable or incomplete, or user behavior provides stronger signals than labeled data.
Embedding similarity: Measures semantic similarity between query and retrieved items.
User engagement metrics: Click-through rate (CTR), dwell time, or other interaction data.
Relevance scoring by AI judge: LLMs can judge whether results are useful without explicit labels.
What it catches: Relevance of top results, ranking order quality, missed relevant items
PM Perspective: Measures impact on engagement and usability, not just algorithmic correctness
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Precision@K | Fraction of relevant items in the top K results | Count of relevant items in top K / K | When users only see the top results | Shows whether top results meet user needs — critical for UX |
Recall@K | Fraction of all relevant items retrieved in top K | Relevant items in top K / Total relevant items | When completeness is important | Helps PMs ensure important items are not missed |
NDCG (Normalized Discounted Cumulative Gain) | Quality of ranking, prioritizing highly relevant items at the top | Sum of relevance scores discounted by position, normalized | When ranking order matters | Measures user satisfaction with ranking, not just raw retrieval |
MRR (Mean Reciprocal Rank) | Average position of first relevant item | Reciprocal of rank of first relevant item, averaged over queries | When first relevant result is most important | Useful for quick-access tasks, e.g., search or QA |
Hit Rate / Top-K Accuracy | Whether at least one relevant item appears in top K | Binary check per query | When any relevant item in top K suffices | Shows success of retrieval for user’s top expectations |
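Here is a minimal hand-rolled sketch of Precision@K, Recall@K, and MRR on toy query results; production systems compute these over large labeled sets, but the logic is the same.

```python
# Minimal sketch: Precision@K, Recall@K, and MRR on toy ranked results.
# `ranked` is the system's ranked result IDs; `relevant` is the labeled
# set of relevant IDs for that query.

def precision_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    return sum(1 for item in top_k if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    return sum(1 for item in top_k if item in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

queries = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),   # first relevant item at rank 2
    (["d5", "d6", "d4", "d8"], {"d4"}),          # first relevant item at rank 3
]

k = 3
print("P@3:", [round(precision_at_k(r, rel, k), 2) for r, rel in queries])
print("R@3:", [round(recall_at_k(r, rel, k), 2) for r, rel in queries])
print("MRR:", sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries))
```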
4️⃣ Generative / LLM Metrics
Task: Generate text, code, summaries, or answers
Reference-Based Examples: BLEU, ROUGE, METEOR, Exact Match, Perplexity
Reference-Free Examples: Reference-free metrics assess output quality without needing a fixed answer, often using human judgment or embeddings.
Human Evaluation: Experts rate outputs on fluency, coherence, helpfulness, or safety.
LLM-as-Judge: A stronger LLM evaluates outputs based on a rubric (e.g., helpfulness, relevance); see the sketch after this list.
BERTScore / Embedding Similarity: Measures semantic similarity between output and context/reference.
Faithfulness / Grounding Checks: Evaluates whether factual claims are supported by evidence or retrieved data.
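As referenced above, here is a minimal LLM-as-Judge sketch, assuming the openai (>=1.0) Python client; the model name, rubric, and example texts are illustrative placeholders you would replace with your own judge model and criteria.

```python
# Sketch of an LLM-as-Judge eval, assuming the openai>=1.0 Python client.
# The model name, rubric, and example texts are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = """Rate the ANSWER to the QUESTION on a 1-5 scale for:
- helpfulness (does it address the user's need?)
- faithfulness (is every claim supported by the CONTEXT?)
Return only JSON like {"helpfulness": 4, "faithfulness": 5, "reason": "..."}"""

def judge(question: str, context: str, answer: str) -> str:
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    response = client.chat.completions.create(
        model="gpt-4o",           # placeholder: use your judge model of choice
        temperature=0,            # deterministic scoring
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(judge(
    question="What is our refund window?",
    context="Policy: refunds are accepted within 30 days of purchase.",
    answer="You can request a refund within 30 days of buying the product.",
))
```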
What it catches: Fluency, coherence, factual accuracy, helpfulness, hallucinations
PM Perspective: Balances technical quality with user-perceived utility, critical for LLM-driven products
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
BLEU | N-gram overlap between generated and reference text | Counts matching n-grams | Machine translation or structured text | Gives an objective score for literal similarity, but may miss meaning |
ROUGE | Recall-focused n-gram overlap | Measures how much reference content is covered | Summarization tasks | Useful for measuring coverage of key content |
METEOR | Semantic and lexical matching | Includes synonyms, stemming, paraphrase matching | Translation, summarization | More aligned with human judgment than BLEU |
Exact Match (EM) | Strict string match | Checks if generated text exactly matches reference | Question answering | Useful for precise tasks but too strict for creative output |
Perplexity | Model “surprise” on reference text | Exponential of the average negative log-likelihood of the reference text under the model | Language modeling | Lower perplexity indicates better language modeling |
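A minimal sketch of two reference-based checks from the table above, Exact Match computed by hand and BLEU via NLTK's sentence_bleu; the texts are toy examples, and smoothing is applied because short sentences often have missing higher-order n-grams.

```python
# Minimal sketch: reference-based generative evals (Exact Match and BLEU).
# Texts are toy examples; real evals would run over a full labeled dataset.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat"
candidate = "the cat is sitting on the mat"

# Exact Match: strict string equality after light normalization.
exact_match = reference.strip().lower() == candidate.strip().lower()

# BLEU: n-gram overlap between candidate and reference tokens.
bleu = sentence_bleu(
    [reference.split()],             # list of tokenized references
    candidate.split(),               # tokenized candidate
    smoothing_function=SmoothingFunction().method1,
)

print(f"Exact Match: {exact_match}")
print(f"BLEU: {bleu:.3f}")
```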
5️⃣ Embedding / Similarity Metrics
Task: Measure semantic similarity or clustering (e.g., vector search, recommendations)
Reference-Based Examples: Cosine similarity against labeled reference pairs
Reference-Free Examples: When no labeled reference exists or human validation is more relevant:
Human validation: Experts judge whether results are semantically correct.
LLM-as-judge: Stronger LLM scores similarity, relevance, or appropriateness.
User engagement metrics: Indirect measure of semantic quality (clicks, dwell time).
What it catches: Correctness of semantic relationships, similarity ranking, clustering quality
PM Perspective: Ensures that AI understands meaning, not just keywords
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Cosine Similarity | Angle-based similarity between vectors | Measures similarity of query and candidate embeddings | Semantic search, matching tasks | Ensures AI retrieves or ranks meaningfully similar content |
Euclidean / L2 Distance | Magnitude-based distance in embedding space | Straight-line distance between embedding vectors; smaller distance → higher similarity | Clustering or nearest neighbor search | Detects items that are close in meaning |
Dot Product / Inner Product | Similarity reflecting both direction and magnitude | Sum of element-wise products of the two vectors | Large-scale retrieval and ranking in embedding space | Useful for real-time recommendation or ranking pipelines |
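A minimal sketch of cosine similarity and L2 distance using NumPy; the vectors are tiny toy embeddings, and in practice they would come from an embedding model (for example, a sentence-transformers model).

```python
# Minimal sketch: cosine similarity and L2 distance on toy embedding vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query     = np.array([0.9, 0.1, 0.3])
doc_close = np.array([0.8, 0.2, 0.4])   # semantically similar document
doc_far   = np.array([0.1, 0.9, 0.7])   # unrelated document

print("query vs close doc:", round(cosine_similarity(query, doc_close), 3))
print("query vs far doc  :", round(cosine_similarity(query, doc_far), 3))
# Euclidean (L2) distance: smaller means closer in embedding space.
print("L2 to close doc   :", round(float(np.linalg.norm(query - doc_close)), 3))
```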
6️⃣ Business / Composite Metrics
Task: Tie AI outputs to real-world impact
Reference-Based Examples: Task success rate, reduction in support load, CTR improvements
Reference-Free Examples: Sometimes, business impact is subjective or qualitative.
Human judgment of usefulness: Experts or end-users rate whether AI outputs are helpful.
Alignment with product goals: Human review ensures AI outputs support strategic objectives, even if not numerically measurable.
What it catches: Whether AI delivers real value to users and the business
PM Perspective: Bridges the gap between technical evaluation and product outcomes
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Task Success Rate | Fraction of tasks completed correctly | Compare AI-assisted completion to expected results | Workflow automation, self-service bots | Shows how well AI meets user goals |
Reduction in Support Load | Impact on operational efficiency | Measure decrease in support tickets or human interventions | Customer support, service automation | Demonstrates business ROI of AI deployment |
CTR / Engagement Improvements | User interaction with AI-driven content | Track clicks, views, or interactions compared to baseline | Recommendations, search, content personalization | Shows user adoption and satisfaction |
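These business metrics are simple to compute once the underlying events are logged; here is a minimal sketch with illustrative counts (not real data):

```python
# Minimal sketch: business-facing metrics from raw event counts (illustrative).
tasks_attempted, tasks_succeeded = 1200, 1050
tickets_before, tickets_after = 4000, 3100
clicks_baseline, impressions_baseline = 1800, 60000
clicks_ai, impressions_ai = 2600, 61000

task_success_rate = tasks_succeeded / tasks_attempted
support_load_reduction = (tickets_before - tickets_after) / tickets_before
ctr_baseline = clicks_baseline / impressions_baseline
ctr_ai = clicks_ai / impressions_ai
ctr_lift = (ctr_ai - ctr_baseline) / ctr_baseline

print(f"Task success rate      : {task_success_rate:.1%}")
print(f"Support load reduction : {support_load_reduction:.1%}")
print(f"CTR lift vs baseline   : {ctr_lift:.1%}")
```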
Foundation Models vs Application-Centric Evals
When evaluating AI, it’s important to distinguish between foundation models and application-centric use cases, because the evaluation approach differs significantly.
1️⃣ Foundation Model Evals
Focus: Assess the core capabilities of large pre-trained models — such as LLMs, multimodal models, or vision models — independent of a specific product.
Key Characteristics:
General-purpose testing: Measures abilities like reasoning, summarization, coding, or multi-lingual understanding.
Reference-based and reference-free metrics: Often combines benchmarks (e.g., MMLU, GLUE, HumanEval) with human evaluation for creativity, factuality, or alignment.
Goal: Understand model potential, strengths, and weaknesses before deployment.
Example Metrics:
Accuracy on standard NLP benchmarks
Factuality checks on generated text
Performance on reasoning or multi-step tasks
MMLU (Massive Multitask Language Understanding): Tests a model’s knowledge across 57 subjects, from STEM to humanities, using multiple-choice questions.
GLUE (General Language Understanding Evaluation): Measures core NLP abilities like sentiment analysis, sentence similarity, and reasoning across standard datasets.
HumanEval: Evaluates code-generation skills by checking if model-written code solves programming problems and passes test cases.
These benchmarks assess core capabilities of foundation models, helping PMs understand what the model can do before applying it to a product.
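As a simplified illustration of how a multiple-choice benchmark like MMLU is scored, here is a tiny accuracy sketch; the items and the ask_model stub are placeholders, and real harnesses (such as lm-evaluation-harness) handle prompting, answer parsing, and thousands of questions.

```python
# Tiny sketch of MMLU-style scoring: multiple-choice accuracy on toy items.
items = [
    {"question": "Which planet is known as the Red Planet?",
     "choices": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"},
     "answer": "B"},
    {"question": "What is 7 * 8?",
     "choices": {"A": "54", "B": "56", "C": "64", "D": "48"},
     "answer": "B"},
]

def ask_model(question: str, choices: dict) -> str:
    """Placeholder for a real model call; must return an option letter."""
    return "B"

correct = sum(ask_model(i["question"], i["choices"]) == i["answer"] for i in items)
print(f"Multiple-choice accuracy: {correct / len(items):.1%}")
```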
2️⃣ Application-Centric Evals
Focus: Evaluate how the AI performs within a specific product or workflow, measuring impact on users and business outcomes.
Key Characteristics:
Task-specific: Measures how well the AI performs in context, e.g., customer support automation, search, or content recommendation.
Outcome-oriented: Uses metrics tied to user success and product KPIs (e.g., task completion rate, CTR, reduction in support tickets).
Goal: Ensure AI outputs are helpful, safe, and aligned with business goals.
Example Metrics:
Task success rate in a virtual assistant
Engagement improvement from recommendations
Reduction in manual work for internal workflows
PM Perspective:
Connects AI performance to real user impact.
Reveals product-level gaps not visible in foundation-model benchmarks.
Key Takeaways
Foundation model evals answer “What can this model do?”
Application-centric evals answer “How well does this model solve our users’ problems?”
Evaluating AI is more than just measuring technical correctness — it’s about ensuring models deliver real value to users and the business. From classification and regression to generative, ranking, embedding, and business-focused metrics, a strong evaluation strategy combines reference-based and reference-free approaches to capture both objective performance and subjective quality.
For AI PMs, understanding these metrics is not just a technical skill — it’s a product skill. It enables you to identify failures, prioritize improvements, measure impact, and confidently make product decisions that align AI performance with user needs and business goals.