In our previous article, “AI Evals: Understanding and Writing Them — Must-Have Skills for AI PMs”, we explored what AI evals are, why they are critical for building reliable AI products, the R-F-R-G-E-S-R-L evals framework, and the key challenges in evaluating AI systems.

In this article, we will take a deeper dive into the different types of AI evals, examining how each type works, what it measures, and how it can be applied to ensure your AI products meet both technical and business expectations.

Evaluating AI isn’t just about technical correctness; it’s about ensuring outputs align with user needs and business impact. Metrics can be grouped by task type, and each can be reference-based or reference-free, depending on whether outputs are compared to a known “ground truth.” In reference-based evals, the model output is compared to a predefined “gold” answer, while in reference-free evals, the output is judged against internal logic or the source context.
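
To make the distinction concrete, here is a tiny illustrative sketch (toy logic and toy data, not a production eval): the reference-based check compares against a gold answer, while the reference-free check uses only the source context.

```python
# Toy illustration only: reference-based vs reference-free checks on a QA output.

def reference_based_eval(model_output: str, gold_answer: str) -> bool:
    """Reference-based: compare the output to a predefined gold answer."""
    return model_output.strip().lower() == gold_answer.strip().lower()

def reference_free_eval(model_output: str, source_context: str) -> float:
    """Reference-free: score the output against the source context itself
    (a crude grounding heuristic: share of output words found in the context)."""
    output_words = set(model_output.lower().split())
    context_words = set(source_context.lower().split())
    return len(output_words & context_words) / max(len(output_words), 1)

print(reference_based_eval("Paris", "paris"))                     # True
print(reference_free_eval("the capital of france is paris",
                          "paris is the capital of france"))      # 1.0
```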

Now, let’s look at the different types of AI evaluation metrics (classification, regression, ranking/retrieval, generative, embedding, and business metrics), each with the reference-based vs reference-free distinction.

1️⃣ Classification Metrics

Task: Predict discrete labels (e.g., spam vs non-spam)
Reference-Based Examples: Accuracy, Precision, Recall, F1 Score, ROC-AUC
Reference-Free Examples: Human evaluation for label appropriateness or fairness
What it catches: Correct predictions, misclassification patterns, edge-case failures
PM Perspective: Helps prioritize improvements on high-impact categories and measure alignment with user expectations

| Metric | What It Measures | How It Works | When It Matters | PM Insight |
| --- | --- | --- | --- | --- |
| Accuracy | Overall correctness | Fraction of correct predictions over total predictions | When class distribution is balanced | Simple overall measure, but can hide poor performance on minority classes |
| Precision | Correctness of positive predictions | True Positives / (True Positives + False Positives) | When false positives are costly (e.g., misclassifying legitimate email as spam) | Helps PMs understand risk of incorrect alerts to users |
| Recall (Sensitivity) | Coverage of positive cases | True Positives / (True Positives + False Negatives) | When missing positives is costly (e.g., detecting fraud, disease) | Helps PMs prioritize catching all critical cases |
| F1 Score | Balance of precision & recall | Harmonic mean of precision and recall | When there’s class imbalance or trade-offs between false positives/negatives | Provides a single score for comparing models in edge-case scenarios |
| ROC-AUC | Model’s ability to distinguish classes | Area under the Receiver Operating Characteristic curve (TPR vs FPR at different thresholds) | When overall discrimination matters | Shows model’s robustness across thresholds, useful for threshold tuning |
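
To make these concrete, here is a minimal sketch computing the reference-based classification metrics above with scikit-learn; the spam-classifier labels and scores are made-up toy data.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Ground-truth labels and model outputs (1 = spam, 0 = not spam); toy data.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                   # hard predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted probabilities for class 1

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))          # harmonic mean of precision & recall
print("ROC-AUC  :", roc_auc_score(y_true, y_score))    # uses scores, not hard labels
```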

2️⃣ Regression Metrics

Task: Predict continuous values (e.g., sales, temperature)
Reference-Based Examples: Mean Absolute Error (MAE), Mean Squared Error (MSE), R²
Reference-Free Examples: Sometimes strict numeric ground truth is unavailable or incomplete, or business context matters more than exact numbers. In those cases:

  • Expert review of predicted trends: Humans check if forecasts follow expected patterns (e.g., seasonal peaks, expected declines).

  • Sanity checks: Ensure outputs are within plausible ranges or logical constraints.

  • Anomaly detection: Identify predictions that are statistically or operationally unusual, even if exact ground truth isn’t known.


What it catches: Accuracy of predictions, deviation patterns, trends in errors
PM Perspective: Ensures outputs are reliable enough to drive decisions and business KPIs

| Metric | What It Measures | How It Works | When It Matters | PM Insight |
| --- | --- | --- | --- | --- |
| Mean Absolute Error (MAE) | Average magnitude of errors | Average of \|Predicted − Actual\| across all samples | When all errors are equally important | Provides an intuitive sense of average deviation, easy to explain to stakeholders |
| Mean Squared Error (MSE) | Average squared errors | Average of (Predicted − Actual)² | When large errors are especially costly | Penalizes big mistakes more, useful for risk-sensitive forecasts |
| Root Mean Squared Error (RMSE) | Same as MSE, but in original units | Square root of MSE | When interpretation in the units of the target variable matters | Easy for PMs to communicate expected error magnitude |
| R² (Coefficient of Determination) | How well the model explains variance | 1 − (SS_residual / SS_total) | When understanding variance capture matters | Shows overall predictive power relative to simple baselines |
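
As a quick illustration, the sketch below computes MAE, MSE, RMSE, and R² with scikit-learn on made-up forecast numbers, and adds a simple reference-free sanity check of the kind described above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy sales forecast vs actuals (illustrative numbers only).
y_true = np.array([120, 150, 180, 200, 170])
y_pred = np.array([110, 160, 175, 210, 150])

mae  = mean_absolute_error(y_true, y_pred)    # average |error|
mse  = mean_squared_error(y_true, y_pred)     # average squared error
rmse = np.sqrt(mse)                           # back in original units
r2   = r2_score(y_true, y_pred)               # variance explained vs a mean baseline

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R2={r2:.2f}")

# A simple reference-free sanity check: flag forecasts outside a plausible range.
plausible = (y_pred >= 0) & (y_pred <= 1000)
print("All forecasts within plausible range:", bool(plausible.all()))
```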

3️⃣ Ranking / Retrieval Metrics

Task: Return ranked lists (e.g., search results, recommendations)
Reference-Based Examples: Precision@K, Recall@K, NDCG, MRR, Hit Rate
Reference-Free Examples: Sometimes ground truth relevance is unavailable or incomplete, or user behavior provides stronger signals than labeled data.

  • Embedding similarity: Measures semantic similarity between query and retrieved items.

  • User engagement metrics: Click-through rate (CTR), dwell time, or other interaction data.

  • Relevance scoring by AI judge: LLMs can judge whether results are useful without explicit labels.


PM Perspective: Measures impact on engagement and usability, not just algorithmic correctness

| Metric | What It Measures | How It Works | When It Matters | PM Insight |
| --- | --- | --- | --- | --- |
| Precision@K | Fraction of relevant items in the top K results | Count of relevant items in top K / K | When users only see the top results | Shows whether top results meet user needs — critical for UX |
| Recall@K | Fraction of all relevant items retrieved in top K | Relevant items in top K / Total relevant items | When completeness is important | Helps PMs ensure important items are not missed |
| NDCG (Normalized Discounted Cumulative Gain) | Quality of ranking, prioritizing highly relevant items at the top | Sum of relevance scores discounted by position, normalized | When ranking order matters | Measures user satisfaction with ranking, not just raw retrieval |
| MRR (Mean Reciprocal Rank) | Average position of first relevant item | Reciprocal of rank of first relevant item, averaged over queries | When first relevant result is most important | Useful for quick-access tasks, e.g., search or QA |
| Hit Rate / Top-K Accuracy | Whether at least one relevant item appears in top K | Binary check per query | When any relevant item in top K suffices | Shows success of retrieval for user’s top expectations |
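
Here is a hand-rolled sketch of the reference-based ranking metrics above for a single query; the document IDs and relevance grades are illustrative only.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top K results that are relevant."""
    return sum(1 for item in ranked_ids[:k] if item in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of all relevant items that appear in the top K."""
    return sum(1 for item in ranked_ids[:k] if item in relevant_ids) / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item (0 if none found)."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevance_scores, k):
    """Relevance discounted by log of position."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevance_scores[:k], start=1))

def ndcg_at_k(relevance_scores, k):
    """DCG normalized by the ideal (perfectly sorted) DCG."""
    ideal_dcg = dcg_at_k(sorted(relevance_scores, reverse=True), k)
    return dcg_at_k(relevance_scores, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: search results for one query.
ranked   = ["doc3", "doc1", "doc7", "doc2", "doc5"]
relevant = {"doc1", "doc2", "doc9"}
grades   = [0, 3, 0, 2, 0]   # graded relevance of each ranked result

print("Precision@3  :", round(precision_at_k(ranked, relevant, 3), 2))  # 0.33
print("Recall@3     :", round(recall_at_k(ranked, relevant, 3), 2))     # 0.33
print("RR (1 query) :", reciprocal_rank(ranked, relevant))              # 0.5
print("NDCG@5       :", round(ndcg_at_k(grades, 5), 3))
```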

4️⃣ Generative / LLM Metrics

Task: Generate text, code, summaries, or answers
Reference-Based Examples: BLEU, ROUGE, METEOR, Exact Match, Perplexity
Reference-Free Examples: Reference-free metrics assess output quality without needing a fixed answer, often using human judgment or embeddings.

  • Human Evaluation: Experts rate outputs on fluency, coherence, helpfulness, or safety.

  • LLM-as-Judge: A stronger LLM evaluates outputs based on a rubric (e.g., helpfulness, relevance); a minimal sketch follows this list.

  • BERTScore / Embedding Similarity: Measures semantic similarity between output and context/reference.

  • Faithfulness / Grounding Checks: Evaluates whether factual claims are supported by evidence or retrieved data.
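
For the LLM-as-Judge pattern, a minimal sketch is below; `call_llm` is a hypothetical placeholder for whichever LLM API you use, and the rubric is only an example.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper: wrap whichever LLM API your stack uses and return raw text."""
    raise NotImplementedError("Plug in your LLM client here.")

JUDGE_RUBRIC = """You are grading an AI assistant's answer.
Score each criterion 1-5 and return JSON only:
{{"helpfulness": <1-5>, "relevance": <1-5>, "rationale": "<one sentence>"}}

Question: {question}
Answer: {answer}"""

def judge_answer(question: str, answer: str) -> dict:
    """Ask a (stronger) LLM to score an output against a simple rubric."""
    raw = call_llm(JUDGE_RUBRIC.format(question=question, answer=answer))
    return json.loads(raw)  # in practice, validate and handle malformed JSON
```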


What it catches: Fluency, coherence, factual accuracy, helpfulness, hallucinations
PM Perspective: Balances technical quality with user-perceived utility, critical for LLM-driven products

| Metric | What It Measures | How It Works | When It Matters | PM Insight |
| --- | --- | --- | --- | --- |
| BLEU | N-gram overlap between generated and reference text | Counts matching n-grams | Machine translation or structured text | Gives an objective score for literal similarity, but may miss meaning |
| ROUGE | Recall-focused n-gram overlap | Measures how much reference content is covered | Summarization tasks | Useful for measuring coverage of key content |
| METEOR | Semantic and lexical matching | Includes synonyms, stemming, paraphrase matching | Translation, summarization | More aligned with human judgment than BLEU |
| Exact Match (EM) | Strict string match | Checks if generated text exactly matches reference | Question answering | Useful for precise tasks but too strict for creative output |
| Perplexity | Model “surprise” on reference text | Likelihood of reference sequence under model | Language modeling | Lower perplexity indicates better language modeling |
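
As a small illustration, the sketch below computes Exact Match, a simplified ROUGE-1 recall, and unigram BLEU (via NLTK, assuming the package is installed) on toy sentences.

```python
from nltk.translate.bleu_score import sentence_bleu  # assumes the nltk package is installed

def exact_match(candidate: str, reference: str) -> bool:
    """Strict string match after basic normalization."""
    return candidate.strip().lower() == reference.strip().lower()

def rouge1_recall(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 recall: share of unique reference words that appear in the candidate."""
    cand_tokens = candidate.lower().split()
    ref_tokens = set(reference.lower().split())
    return sum(1 for tok in ref_tokens if tok in cand_tokens) / len(ref_tokens)

reference = "the cat sat on the mat"
candidate = "a cat sat on a mat"

print("Exact Match   :", exact_match(candidate, reference))               # False
print("ROUGE-1 recall:", round(rouge1_recall(candidate, reference), 2))   # 0.8
print("BLEU-1        :", round(sentence_bleu([reference.split()],
                                             candidate.split(),
                                             weights=(1, 0, 0, 0)), 2))   # ~0.67
```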

5️⃣ Embedding / Similarity Metrics

Task: Measure semantic similarity or clustering (e.g., vector search, recommendations)
Reference-Based Examples: Cosine similarity against labeled reference pairs
Reference-Free Examples: When no labeled reference exists or human validation is more relevant:

  • Human validation: Experts judge whether results are semantically correct.

  • LLM-as-judge: Stronger LLM scores similarity, relevance, or appropriateness.

  • User engagement metrics: Indirect measure of semantic quality (clicks, dwell time).


What it catches: Correctness of semantic relationships, similarity ranking, clustering quality
PM Perspective: Ensures that AI understands meaning, not just keywords

| Metric | What It Measures | How It Works | When It Matters | PM Insight |
| --- | --- | --- | --- | --- |
| Cosine Similarity | Angle-based similarity between vectors | Measures similarity of query and candidate embeddings | Semantic search, matching tasks | Ensures AI retrieves or ranks meaningfully similar content |
| Euclidean / L2 Distance | Magnitude-based distance in embedding space | Smaller distance → higher similarity | Clustering or nearest neighbor search | Detects items that are close in meaning |
| Dot Product / Inner Product | Similarity weighted by vector magnitude | Sum of element-wise products of two vectors; common in large-scale retrieval | When ranking relevance in embedding space | Useful for real-time recommendation or ranking pipelines |
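
A minimal sketch of these similarity measures with NumPy; the embeddings below are tiny toy vectors, whereas real embedding models produce hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Angle-based similarity: 1.0 means same direction, 0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings for a query and a candidate document.
query     = np.array([0.9, 0.1, 0.3])
candidate = np.array([0.8, 0.2, 0.4])

print("Cosine similarity:", round(cosine_similarity(query, candidate), 3))
print("L2 distance      :", round(float(np.linalg.norm(query - candidate)), 3))
print("Dot product      :", round(float(np.dot(query, candidate)), 3))
```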

6️⃣ Business / Composite Metrics

Task: Tie AI outputs to real-world impact
Reference-Based Examples: Task success rate, reduction in support load, CTR improvements
Reference-Free Examples: Sometimes, business impact is subjective or qualitative.

  • Human judgment of usefulness: Experts or end-users rate whether AI outputs are helpful.

  • Alignment with product goals: Human review ensures AI outputs support strategic objectives, even if not numerically measurable.

What it catches: Whether AI delivers real value to users and the business
PM Perspective: Bridges the gap between technical evaluation and product outcomes

| Metric | What It Measures | How It Works | When It Matters | PM Insight |
| --- | --- | --- | --- | --- |
| Task Success Rate | Fraction of tasks completed correctly | Compare AI-assisted completion to expected results | Workflow automation, self-service bots | Shows how well AI meets user goals |
| Reduction in Support Load | Impact on operational efficiency | Measure decrease in support tickets or human interventions | Customer support, service automation | Demonstrates business ROI of AI deployment |
| CTR / Engagement Improvements | User interaction with AI-driven content | Track clicks, views, or interactions compared to baseline | Recommendations, search, content personalization | Shows user adoption and satisfaction |
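
These metrics are plain arithmetic once the underlying counts are tracked; here is a toy sketch with made-up numbers.

```python
# Toy numbers for illustration only.
tasks_attempted, tasks_completed = 500, 430
tickets_before, tickets_after    = 1200, 950
ctr_baseline, ctr_with_ai        = 0.042, 0.051

task_success_rate = tasks_completed / tasks_attempted
support_reduction = (tickets_before - tickets_after) / tickets_before
ctr_lift          = (ctr_with_ai - ctr_baseline) / ctr_baseline

print(f"Task success rate   : {task_success_rate:.1%}")   # 86.0%
print(f"Support load down   : {support_reduction:.1%}")   # 20.8%
print(f"CTR lift vs baseline: {ctr_lift:.1%}")            # 21.4%
```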

Foundation Models vs Application-Centric Evals

When evaluating AI, it’s important to distinguish between foundation models and application-centric use cases, because the evaluation approach differs significantly.

1️⃣ Foundation Model Evals

Focus: Assess the core capabilities of large pre-trained models — such as LLMs, multimodal models, or vision models — independent of a specific product.

Key Characteristics:

  • General-purpose testing: Measures abilities like reasoning, summarization, coding, or multi-lingual understanding.

  • Reference-based and reference-free metrics: Often combines benchmarks (e.g., MMLU, GLUE, HumanEval) with human evaluation for creativity, factuality, or alignment.

  • Goal: Understand model potential, strengths, and weaknesses before deployment.

Example Metrics:

  • Accuracy on standard NLP benchmarks

  • Factuality checks on generated text

  • Performance on reasoning or multi-step tasks

  • MMLU (Massive Multitask Language Understanding): Tests a model’s knowledge across 57 subjects, from STEM to humanities, using multiple-choice questions.

  • GLUE (General Language Understanding Evaluation): Measures core NLP abilities like sentiment analysis, sentence similarity, and reasoning across standard datasets.

  • HumanEval: Evaluates code-generation skills by checking if model-written code solves programming problems and passes test cases.
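
As a rough illustration of how HumanEval-style checks work, the sketch below executes a model-generated snippet against unit tests and reports whether the problem passes; the generated code and tests are made up, and real harnesses sandbox execution with timeouts.

```python
def run_humaneval_style_check(generated_code: str, test_cases: list[tuple]) -> bool:
    """Execute generated code and check it against unit tests
    (simplified; real harnesses isolate execution in a sandbox)."""
    namespace: dict = {}
    exec(generated_code, namespace)   # defines the candidate function
    candidate = namespace["add"]      # function name expected by this toy problem
    return all(candidate(*args) == expected for args, expected in test_cases)

# Pretend this string came back from the model for the prompt "write add(a, b)".
model_output = "def add(a, b):\n    return a + b\n"
tests = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]

print("Problem passed:", run_humaneval_style_check(model_output, tests))  # True
```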

These benchmarks assess core capabilities of foundation models, helping PMs understand what the model can do before applying it to a product.

2️⃣ Application-Centric Evals

Focus: Evaluate how the AI performs within a specific product or workflow, measuring impact on users and business outcomes.

Key Characteristics:

  • Task-specific: Measures how well the AI performs in context, e.g., customer support automation, search, or content recommendation.

  • Outcome-oriented: Uses metrics tied to user success and product KPIs (e.g., task completion rate, CTR, reduction in support tickets).

  • Goal: Ensure AI outputs are helpful, safe, and aligned with business goals.

Example Metrics:

  • Task success rate in a virtual assistant

  • Engagement improvement from recommendations

  • Reduction in manual work for internal workflows

PM Perspective:

  • Connects AI performance to real user impact.

  • Reveals product-level gaps not visible in foundation-model benchmarks.

Key Takeaways

Foundation model evals answer “What can this model do?”
Application-centric evals answer “How well does this model solve our users’ problems?”

Evaluating AI is more than just measuring technical correctness — it’s about ensuring models deliver real value to users and the business. From classification and regression to generative, ranking, embedding, and business-focused metrics, a strong evaluation strategy combines reference-based and reference-free approaches to capture both objective performance and subjective quality.

For AI PMs, understanding these metrics is not just a technical skill — it’s a product skill. It enables you to identify failures, prioritize improvements, measure impact, and confidently make product decisions that align AI performance with user needs and business goals.
