In our previous article, “AI Evals: Understanding and Writing Them — Must-Have Skills for AI PMs”, we explored what AI evals are, why they are critical for building reliable AI products, the R-F-R-G-E-S-R-L evals framework, and the key challenges in evaluating AI systems.
In this article, we will take a deeper dive into the different types of AI evals, examining how each type works, what it measures, and how it can be applied to ensure your AI products meet both technical and business expectations.
Evaluating AI isn’t just about technical correctness; it’s about ensuring outputs align with user needs and business impact. Metrics can be grouped by task type, and each can be reference-based or reference-free depending on whether outputs are compared to a known “ground truth.” In reference-based evals, the model output is compared to a predefined “gold” answer, while in reference-free evals, the output is judged against internal logic or the source context.
Now, let’s look at the different types of AI evaluation metrics (classification, regression, ranking/retrieval, generative, embedding, and business/composite), along with the reference-based vs. reference-free distinction for each.
1️⃣ Classification Metrics
Task: Predict discrete labels (e.g., spam vs non-spam)
Reference-Based Examples: Accuracy, Precision, Recall, F1 Score, ROC-AUC
Reference-Free Examples: Human evaluation for label appropriateness or fairness
What it catches: Correct predictions, misclassification patterns, edge-case failures
PM Perspective: Helps prioritize improvements on high-impact categories and measure alignment with user expectations
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Accuracy | Overall correctness | Fraction of correct predictions over total predictions | When class distribution is balanced | Simple overall measure, but can hide poor performance on minority classes |
Precision | Correctness of positive predictions | True Positives / (True Positives + False Positives) | When false positives are costly (e.g., misclassifying legitimate email as spam) | Helps PMs understand risk of incorrect alerts to users |
Recall (Sensitivity) | Coverage of positive cases | True Positives / (True Positives + False Negatives) | When missing positives is costly (e.g., detecting fraud, disease) | Helps PMs prioritize catching all critical cases |
F1 Score | Balance of precision & recall | Harmonic mean of precision and recall | When there’s class imbalance or trade-offs between false positives/negatives | Provides a single score for comparing models in edge-case scenarios |
ROC-AUC | Model’s ability to distinguish classes | Area under the Receiver Operating Characteristic curve (TPR vs FPR at different thresholds) | When overall discrimination matters | Shows model’s robustness across thresholds, useful for threshold tuning |
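To make these concrete, here is a minimal sketch of computing the classification metrics above with scikit-learn; the label and score arrays are toy placeholders, not real data.

```python
# Minimal sketch: classification evals with scikit-learn (toy data).
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Ground-truth labels (1 = spam, 0 = not spam) and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Predicted probabilities for the positive class, used for ROC-AUC.
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))
```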
2️⃣ Regression Metrics
Task: Predict continuous values (e.g., sales, temperature)
Reference-Based Examples: Mean Absolute Error (MAE), Mean Squared Error (MSE), R²
Reference-Free Examples: Sometimes strict numeric ground truth is unavailable or incomplete, or business context matters more than exact numbers. In such cases:
Expert review of predicted trends: Humans check if forecasts follow expected patterns (e.g., seasonal peaks, expected declines).
Sanity checks: Ensure outputs are within plausible ranges or logical constraints.
Anomaly detection: Identify predictions that are statistically or operationally unusual, even if exact ground truth isn’t known.
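As a concrete illustration of the reference-free checks above, here is a minimal sketch of a range sanity check and a simple anomaly flag; the plausible range and the z-score threshold are assumptions you would set with domain experts.

```python
# Reference-free sanity checks on forecasts, with no ground truth needed.
# The plausible range and z-score threshold are illustrative assumptions.
import numpy as np

forecasts = np.array([120.0, 135.0, 128.0, 900.0, 131.0, -5.0])

# Sanity check: values must fall inside a domain-defined plausible range.
low, high = 0.0, 500.0
out_of_range = (forecasts < low) | (forecasts > high)

# Anomaly check: flag forecasts far from the batch median (robust z-score).
median = np.median(forecasts)
mad = np.median(np.abs(forecasts - median)) or 1.0
robust_z = 0.6745 * (forecasts - median) / mad
anomalies = np.abs(robust_z) > 3.5

for i, value in enumerate(forecasts):
    if out_of_range[i] or anomalies[i]:
        print(f"Flag forecast #{i}: {value} (in_range={not out_of_range[i]}, z={robust_z[i]:.1f})")
```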
What it catches: Accuracy of predictions, deviation patterns, trends in errors
PM Perspective: Ensures outputs are reliable enough to drive decisions and business KPIs
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Mean Absolute Error (MAE) | Average magnitude of errors | Average of \|Predicted − Actual\| across all samples | When all errors are equally important | Provides an intuitive sense of average deviation, easy to explain to stakeholders |
Mean Squared Error (MSE) | Average squared errors | Average of (Predicted − Actual)² | When large errors are especially costly | Penalizes big mistakes more, useful for risk-sensitive forecasts |
Root Mean Squared Error (RMSE) | Same as MSE, but in original units | Square root of MSE | When errors should be interpreted in the target’s original units | Easy for PMs to communicate expected error magnitude |
R² (Coefficient of Determination) | How well the model explains variance | 1 − (SS_residual / SS_total) | When understanding variance capture matters | Shows overall predictive power relative to simple baselines |
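A minimal sketch of the reference-based regression metrics above, using scikit-learn on toy numbers:

```python
# Minimal sketch: reference-based regression evals with scikit-learn (toy data).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 210, 230, 310])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # RMSE, in the original units of the target
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R²={r2:.3f}")
```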
3️⃣ Ranking / Retrieval Metrics
Task: Return ranked lists (e.g., search results, recommendations)
Reference-Based Examples: Precision@K, Recall@K, NDCG, MRR, Hit Rate
Reference-Free Examples: Sometimes ground truth relevance is unavailable or incomplete, or user behavior provides stronger signals than labeled data.
Embedding similarity: Measures semantic similarity between query and retrieved items.
User engagement metrics: Click-through rate (CTR), dwell time, or other interaction data.
Relevance scoring by AI judge: LLMs can judge whether results are useful without explicit labels.
What it catches: Relevance of top results, ranking order quality, missed relevant items
PM Perspective: Measures impact on engagement and usability, not just algorithmic correctness
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Precision@K | Fraction of relevant items in the top K results | Count of relevant items in top K / K | When users only see the top results | Shows whether top results meet user needs — critical for UX |
Recall@K | Fraction of all relevant items retrieved in top K | Relevant items in top K / Total relevant items | When completeness is important | Helps PMs ensure important items are not missed |
NDCG (Normalized Discounted Cumulative Gain) | Quality of ranking, prioritizing highly relevant items at the top | Sum of relevance scores discounted by position, normalized | When ranking order matters | Measures user satisfaction with ranking, not just raw retrieval |
MRR (Mean Reciprocal Rank) | Average position of first relevant item | Reciprocal of rank of first relevant item, averaged over queries | When first relevant result is most important | Useful for quick-access tasks, e.g., search or QA |
Hit Rate / Top-K Accuracy | Whether at least one relevant item appears in top K | Binary check per query | When any relevant item in top K suffices | Shows success of retrieval for user’s top expectations |
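Here is a minimal hand-rolled sketch of Precision@K, Recall@K, and MRR on toy query results; production systems compute these over large labeled sets, but the logic is the same.

```python
# Minimal sketch: Precision@K, Recall@K, and MRR on toy ranked results.
# `ranked` is the system's ranked result IDs; `relevant` is the labeled
# set of relevant IDs for that query.

def precision_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    return sum(1 for item in top_k if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    top_k = ranked[:k]
    return sum(1 for item in top_k if item in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

queries = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),   # first relevant item at rank 2
    (["d5", "d6", "d4", "d8"], {"d4"}),          # first relevant item at rank 3
]

k = 3
print("P@3:", [round(precision_at_k(r, rel, k), 2) for r, rel in queries])
print("R@3:", [round(recall_at_k(r, rel, k), 2) for r, rel in queries])
print("MRR:", sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries))
```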
4️⃣ Generative / LLM Metrics
Task: Generate text, code, summaries, or answers
Reference-Based Examples: BLEU, ROUGE, METEOR, Exact Match, Perplexity
Reference-Free Examples: Reference-free metrics assess output quality without needing a fixed answer, often using human judgment or embeddings.
Human Evaluation: Experts rate outputs on fluency, coherence, helpfulness, or safety.
LLM-as-Judge: A stronger LLM evaluates outputs based on a rubric (e.g., helpfulness, relevance); see the sketch after this list.
BERTScore / Embedding Similarity: Measures semantic similarity between output and context/reference.
Faithfulness / Grounding Checks: Evaluates whether factual claims are supported by evidence or retrieved data.
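As referenced above, here is a minimal LLM-as-Judge sketch, assuming the openai (>=1.0) Python client; the model name, rubric, and example texts are illustrative placeholders you would replace with your own judge model and criteria.

```python
# Sketch of an LLM-as-Judge eval, assuming the openai>=1.0 Python client.
# The model name, rubric, and example texts are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = """Rate the ANSWER to the QUESTION on a 1-5 scale for:
- helpfulness (does it address the user's need?)
- faithfulness (is every claim supported by the CONTEXT?)
Return only JSON like {"helpfulness": 4, "faithfulness": 5, "reason": "..."}"""

def judge(question: str, context: str, answer: str) -> str:
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    response = client.chat.completions.create(
        model="gpt-4o",           # placeholder: use your judge model of choice
        temperature=0,            # deterministic scoring
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(judge(
    question="What is our refund window?",
    context="Policy: refunds are accepted within 30 days of purchase.",
    answer="You can request a refund within 30 days of buying the product.",
))
```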
What it catches: Fluency, coherence, factual accuracy, helpfulness, hallucinations
PM Perspective: Balances technical quality with user-perceived utility, critical for LLM-driven products
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
BLEU | N-gram overlap between generated and reference text | Counts matching n-grams | Machine translation or structured text | Gives an objective score for literal similarity, but may miss meaning |
ROUGE | Recall-focused n-gram overlap | Measures how much reference content is covered | Summarization tasks | Useful for measuring coverage of key content |
METEOR | Semantic and lexical matching | Includes synonyms, stemming, paraphrase matching | Translation, summarization | More aligned with human judgment than BLEU |
Exact Match (EM) | Strict string match | Checks if generated text exactly matches reference | Question answering | Useful for precise tasks but too strict for creative output |
Perplexity | Model “surprise” on reference text | Exponential of the average negative log-likelihood of the reference text under the model | Language modeling | Lower perplexity indicates better language modeling |
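A minimal sketch of two reference-based checks from the table above, Exact Match computed by hand and BLEU via NLTK's sentence_bleu; the texts are toy examples, and smoothing is applied because short sentences often have missing higher-order n-grams.

```python
# Minimal sketch: reference-based generative evals (Exact Match and BLEU).
# Texts are toy examples; real evals would run over a full labeled dataset.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat"
candidate = "the cat is sitting on the mat"

# Exact Match: strict string equality after light normalization.
exact_match = reference.strip().lower() == candidate.strip().lower()

# BLEU: n-gram overlap between candidate and reference tokens.
bleu = sentence_bleu(
    [reference.split()],             # list of tokenized references
    candidate.split(),               # tokenized candidate
    smoothing_function=SmoothingFunction().method1,
)

print(f"Exact Match: {exact_match}")
print(f"BLEU: {bleu:.3f}")
```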
5️⃣ Embedding / Similarity Metrics
Task: Measure semantic similarity or clustering (e.g., vector search, recommendations)
Reference-Based Examples: Cosine similarity against labeled reference pairs
Reference-Free Examples: When no labeled reference exists or human validation is more relevant:
Human validation: Experts judge whether results are semantically correct.
LLM-as-judge: Stronger LLM scores similarity, relevance, or appropriateness.
User engagement metrics: Indirect measure of semantic quality (clicks, dwell time).
What it catches: Correctness of semantic relationships, similarity ranking, clustering quality
PM Perspective: Ensures that AI understands meaning, not just keywords
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Cosine Similarity | Angle-based similarity between vectors | Measures similarity of query and candidate embeddings | Semantic search, matching tasks | Ensures AI retrieves or ranks meaningfully similar content |
Euclidean / L2 Distance | Magnitude-based distance in embedding space | Straight-line distance between embedding vectors; smaller distance → higher similarity | Clustering or nearest neighbor search | Detects items that are close in meaning |
Dot Product / Inner Product | Similarity reflecting both direction and magnitude | Sum of element-wise products of the two vectors | Large-scale retrieval and ranking in embedding space | Useful for real-time recommendation or ranking pipelines |
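A minimal sketch of cosine similarity and L2 distance using NumPy; the vectors are tiny toy embeddings, and in practice they would come from an embedding model (for example, a sentence-transformers model).

```python
# Minimal sketch: cosine similarity and L2 distance on toy embedding vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query     = np.array([0.9, 0.1, 0.3])
doc_close = np.array([0.8, 0.2, 0.4])   # semantically similar document
doc_far   = np.array([0.1, 0.9, 0.7])   # unrelated document

print("query vs close doc:", round(cosine_similarity(query, doc_close), 3))
print("query vs far doc  :", round(cosine_similarity(query, doc_far), 3))
# Euclidean (L2) distance: smaller means closer in embedding space.
print("L2 to close doc   :", round(float(np.linalg.norm(query - doc_close)), 3))
```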
6️⃣ Business / Composite Metrics
Task: Tie AI outputs to real-world impact
Reference-Based Examples: Task success rate, reduction in support load, CTR improvements
Reference-Free Examples: Sometimes, business impact is subjective or qualitative.
Human judgment of usefulness: Experts or end-users rate whether AI outputs are helpful.
Alignment with product goals: Human review ensures AI outputs support strategic objectives, even if not numerically measurable.
What it catches: Whether AI delivers real value to users and the business
PM Perspective: Bridges the gap between technical evaluation and product outcomes
Metric | What It Measures | How It Works | When It Matters | PM Insight |
|---|---|---|---|---|
Task Success Rate | Fraction of tasks completed correctly | Compare AI-assisted completion to expected results | Workflow automation, self-service bots | Shows how well AI meets user goals |
Reduction in Support Load | Impact on operational efficiency | Measure decrease in support tickets or human interventions | Customer support, service automation | Demonstrates business ROI of AI deployment |
CTR / Engagement Improvements | User interaction with AI-driven content | Track clicks, views, or interactions compared to baseline | Recommendations, search, content personalization | Shows user adoption and satisfaction |
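These business metrics are simple to compute once the underlying events are logged; here is a minimal sketch with illustrative counts (not real data):

```python
# Minimal sketch: business-facing metrics from raw event counts (illustrative).
tasks_attempted, tasks_succeeded = 1200, 1050
tickets_before, tickets_after = 4000, 3100
clicks_baseline, impressions_baseline = 1800, 60000
clicks_ai, impressions_ai = 2600, 61000

task_success_rate = tasks_succeeded / tasks_attempted
support_load_reduction = (tickets_before - tickets_after) / tickets_before
ctr_baseline = clicks_baseline / impressions_baseline
ctr_ai = clicks_ai / impressions_ai
ctr_lift = (ctr_ai - ctr_baseline) / ctr_baseline

print(f"Task success rate      : {task_success_rate:.1%}")
print(f"Support load reduction : {support_load_reduction:.1%}")
print(f"CTR lift vs baseline   : {ctr_lift:.1%}")
```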
Foundation Models vs Application-Centric Evals
When evaluating AI, it’s important to distinguish between foundation models and application-centric use cases, because the evaluation approach differs significantly.
1️⃣ Foundation Model Evals
Focus: Assess the core capabilities of large pre-trained models — such as LLMs, multimodal models, or vision models — independent of a specific product.
Key Characteristics:
General-purpose testing: Measures abilities like reasoning, summarization, coding, or multi-lingual understanding.
Reference-based and reference-free metrics: Often combines benchmarks (e.g., MMLU, GLUE, HumanEval) with human evaluation for creativity, factuality, or alignment.
Goal: Understand model potential, strengths, and weaknesses before deployment.
Example Metrics:
Accuracy on standard NLP benchmarks
Factuality checks on generated text
Performance on reasoning or multi-step tasks
MMLU (Massive Multitask Language Understanding): Tests a model’s knowledge across 57 subjects, from STEM to humanities, using multiple-choice questions.
GLUE (General Language Understanding Evaluation): Measures core NLP abilities like sentiment analysis, sentence similarity, and reasoning across standard datasets.
HumanEval: Evaluates code-generation skills by checking if model-written code solves programming problems and passes test cases.
These benchmarks assess core capabilities of foundation models, helping PMs understand what the model can do before applying it to a product.
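As a simplified illustration of how a multiple-choice benchmark like MMLU is scored, here is a tiny accuracy sketch; the items and the ask_model stub are placeholders, and real harnesses (such as lm-evaluation-harness) handle prompting, answer parsing, and thousands of questions.

```python
# Tiny sketch of MMLU-style scoring: multiple-choice accuracy on toy items.
items = [
    {"question": "Which planet is known as the Red Planet?",
     "choices": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"},
     "answer": "B"},
    {"question": "What is 7 * 8?",
     "choices": {"A": "54", "B": "56", "C": "64", "D": "48"},
     "answer": "B"},
]

def ask_model(question: str, choices: dict) -> str:
    """Placeholder for a real model call; must return an option letter."""
    return "B"

correct = sum(ask_model(i["question"], i["choices"]) == i["answer"] for i in items)
print(f"Multiple-choice accuracy: {correct / len(items):.1%}")
```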
2️⃣ Application-Centric Evals
Focus: Evaluate how the AI performs within a specific product or workflow, measuring impact on users and business outcomes.
Key Characteristics:
Task-specific: Measures how well the AI performs in context, e.g., customer support automation, search, or content recommendation.
Outcome-oriented: Uses metrics tied to user success and product KPIs (e.g., task completion rate, CTR, reduction in support tickets).
Goal: Ensure AI outputs are helpful, safe, and aligned with business goals.
Example Metrics:
Task success rate in a virtual assistant
Engagement improvement from recommendations
Reduction in manual work for internal workflows
PM Perspective:
Connects AI performance to real user impact.
Reveals product-level gaps not visible in foundation-model benchmarks.
Key Takeaways
Foundation model evals answer “What can this model do?”
Application-centric evals answer “How well does this model solve our users’ problems?”
Evaluating AI is more than just measuring technical correctness — it’s about ensuring models deliver real value to users and the business. From classification and regression to generative, ranking, embedding, and business-focused metrics, a strong evaluation strategy combines reference-based and reference-free approaches to capture both objective performance and subjective quality.
For AI PMs, understanding these metrics is not just a technical skill — it’s a product skill. It enables you to identify failures, prioritize improvements, measure impact, and confidently make product decisions that align AI performance with user needs and business goals.