Evaluating ML/GenAI Models

Posted Feb 17, 2026 Updated Mar 14, 2026

By Henry Xiao

6 min read

When your data is messy, incomplete, biased, or misaligned with the task, it inevitably shows up as weak or unstable evaluation metrics. This guide explains major evaluation metrics across generative AI, binary classification, and regression - including real-world examples, what “good” looks like, and remediation strategies when metrics are low.

Summary Table
Deep Dive

Summary Table of Machine Learning & Generative AI Evaluation Metrics

Note: Thresholds are general industry heuristics; real “good” values depend on domain, data distribution, and business constraints.

Text Generative-AI Metrics

Metric	What It Measures	When to Use	Good Threshold	Targets (Model Types)
ROUGE‑1	Unigram (word‑level) overlap	Summarization, QA, content recall	0.4–0.5+ good; 0.5–0.6+ strong	Summarization models, LLM eval
ROUGE‑2	Bigram (two‑word sequence) overlap	Fluency, coherence, phrase accuracy	0.2–0.3 good; >0.35 strong	Summaries, paraphrasers, LLMs
ROUGE‑L	Longest Common Subsequence (structure similarity)	Narrative structure, sentence‑level accuracy	>0.4 useful; >0.5 strong	Summarization, reasoning LLMs
BLEU	Precision‑oriented n‑gram match + brevity penalty	Translation, paraphrasing	30–40 good; 45–55+ strong	Machine translation, multilingual LLMs

Binary Classification Metrics

Metric	What It Measures	When to Use	Good Threshold	Targets (Model Types)
AUC‑ROC	Ranking ability between positive & negative classes	Balanced datasets; general classification	0.7–0.8 OK; 0.8–0.9 good; >0.9 high	Fraud detection, risk scoring, medical tests
AUC‑PR	Precision‑Recall tradeoff	Imbalanced datasets (rare events)	2–3× the positive base rate	Fraud, medical diagnosis, anomaly detection
Precision	% of predicted positives that are true	When false positives hurt	>0.8 common target	Spam filters, fraud alerts
Recall (TPR)	% of actual positives detected	When false negatives hurt	>0.7–0.8 good; >0.85 strong	Medical tests, safety models
False Positive Rate (FPR)	% of negatives incorrectly predicted positive	Critical when over-flagging is costly	<10% acceptable; <5% strong	Security screening, compliance systems
F1 Score	Harmonic mean of precision & recall	Balanced precision/recall importance	>0.7 good; >0.8 strong	Churn prediction, NER, general classification

Regression Metrics

Metric	What It Measures	When to Use	Good Threshold	Targets (Model Types)
R²	% of variance explained by model	General regression, explainability	>0.5 OK; >0.7 strong; >0.9 excellent	Pricing models, risk scoring
MAE	Average absolute error	Interpretability needed; robust to outliers	<10–20% of mean target	Forecasting, scheduling
MSE	Squared error (punishes large mistakes)	When large errors are extra costly	Lower is better (scale-dependent)	Safety systems, physical modeling
RMSE	Root of MSE, in same units as target	Comparing to business KPIs	<10–20% relative error	Demand forecasting, time-series models

Deep Dive

1. Text Generation Metrics (ROUGE, BLEU)

Text generation metrics measure how similar model outputs are to reference texts. They are widely used for tasks like summarization, translation, and generative AI fine-tuning.


flowchart LR    
A[Reference Text] --> C[Compare Token Overlap]    
B[Generated Text] --> C    
C --> D1[ROUGE-1: Unigrams]    
C --> D2[ROUGE-2: Bigrams]    
C --> D3[ROUGE-L: Longest Common Subsequence]

ROUGE‑1

Definition: Measures overlap of individual words (unigrams).
Usage: Summarization and content recall.

Real‑world example

Reference: “The Federal Reserve raised interest rates due to inflation concerns.”
Generated summary containing many of these words → higher ROUGE‑1.

Decent quality

0.4–0.5+ = good
0.5–0.6+ = strong

If low, improve by

Adding domain‑specific training samples
Fixing noisy or inconsistent reference summaries
Improving decoding strategies (beam search, temperature)

ROUGE‑2

Definition: Measures bigram overlap (two-word sequences).
Usage: Fluency and phrase accuracy.

Real‑world example

Bigram match: “raised interest” appearing in both reference and output.

Decent quality

0.2–0.3 = good
>0.35 = strong

If low

Add more high-quality examples with clear phrasing
Improve tokenization and text normalization

ROUGE‑L

Definition: Longest Common Subsequence (LCS) between generated and reference text.
Usage: Evaluates structural similarity.

Real‑world example

Two summaries using similar sentence structures achieve higher ROUGE‑L even if some words differ.

Decent quality

0.4+ = useful
0.5+ = strong

If low

Fine‑tune on datasets emphasizing reasoning and structure
Remove inconsistent or unstructured reference texts ***

BLEU

Definition: Precision‑based n‑gram match score with brevity penalty.
Usage: Machine translation, paraphrasing.

Real‑world example

“How are you?” → “¿Cómo estás?”
Perfect n‑gram match → BLEU = 100 (for a short sentence).

Decent quality

30–40 = acceptable
45–55+ = high-quality translation

If low

Add multi‑reference translations
Improve preprocessing and normalization
Include more domain-specific parallel corpora

2. Binary Classification Metrics

Used in fraud detection, healthcare diagnostics, spam filtering, customer churn prediction, and more.

flowchart TD
    A[Model Predictions] --> B[Confusion Matrix]
    B --> C1["Precision = TP / (TP + FP)"]
    B --> C2["Recall = TP / (TP + FN)"]
    B --> C3[F1 = Harmonic Mean]
    B --> C4[TPR, FPR]

AUC‑ROC

Definition: Measures ability to distinguish positive vs. negative classes.
Usage: Ranking and risk scoring.

Real-world example

A fraud detection model with AUC‑ROC = 0.90 correctly ranks random fraud cases above non‑fraud cases 90% of the time.

Decent quality

0.7–0.8 = acceptable
0.8–0.9 = good
>0.9 = excellent

If low

Remove label leakage
Add more representative examples
Use better feature engineering

AUC‑PR

Definition: Precision‑recall curve area, best for imbalanced data.

Real-world example

Cancer detection where positives are 1%.
AUC‑PR benchmark is 0.01 (base rate).
A model achieving 0.40 is performing extremely well.

Decent quality

2–3× the base rate is reasonable
Higher is better

If low

Improve sampling of positive cases
Apply oversampling/SMOTE techniques
Use task‑specific loss weighting

Precision

Definition: % of predicted positives that are correct.

Real-world example

Spam classifier:
100 emails marked as spam → 95 truly spam → precision = 0.95

Decent quality

>0.8 typical for many domains

If low

Raise decision threshold
Reduce noisy negative samples

Recall (True Positive Rate)

Definition: % of actual positives the model detects.

Real-world example

Medical imaging model detecting diabetic retinopathy.
Recall = 0.95 means 95% of positive cases are captured.

Decent quality

>0.7–0.8 = acceptable
>0.85 = strong

If low

Lower decision threshold
Add better domain‑specific features

False Positive Rate (FPR)

Definition: % of negatives incorrectly predicted as positives.

Real-world example

Airport screening model:
5 out of 100 non‑threat items flagged → FPR = 5%

Decent quality

<10% acceptable
<5% strong

If high

Improve calibration
Re‑evaluate model thresholds

F1 Score

Definition: Harmonic mean of precision and recall.

Real-world example

Churn model:
Precision = 0.8, Recall = 0.8 → F1 = 0.8

Decent quality

>0.7 good
>0.8 strong

If low

Increase minority-class examples
Adjust threshold to balance precision/recall

3. Regression Metrics

Used in forecasting, pricing models, risk modeling, and time-series problems.

R² (Coefficient of Determination)

Definition: Measures how much variance in the target is explained.

Real-world example

House prices:
R² = 0.85 → 85% of price variance explained by features.

Decent quality

>0.5 acceptable
>0.7 strong
>0.9 excellent

If low

Add more predictive features
Consider nonlinear models

MAE (Mean Absolute Error)

Definition: Mean absolute difference between predictions and true values.

Real-world example

Weather forecasting:
MAE = 2°F → predictions usually within ±2°F.

Decent quality

<10–20% of mean target value

If high

Remove outliers
Normalize or scale features

MSE (Mean Squared Error)

Definition: Squared average error; punishes large mistakes.

Real-world example

Autonomous drone path prediction — large deviations are dangerous, so MSE is critical.

If high

Investigate extreme errors
Add regularization to prevent overfitting

RMSE (Root Mean Squared Error)

Definition: Square root of MSE; interpretable in target units.

Real-world example

Retail demand forecasting:
Average demand 100 units/day, RMSE = 12 → 12% average error.

Decent quality

<10–20% relative error

If high

Add seasonal features
Use ensemble or gradient boosting models

AI, ML Ops

AI ML Ops

This post is licensed under CC BY 4.0 by the author.

Table of Contents

Summary Table of Machine Learning & Generative AI Evaluation Metrics

Text Generative-AI Metrics

Binary Classification Metrics

Regression Metrics

Deep Dive

1. Text Generation Metrics (ROUGE, BLEU)

ROUGE‑1

Real‑world example

Decent quality

If low, improve by

ROUGE‑2

Real‑world example

Decent quality

If low

ROUGE‑L

Real‑world example

Decent quality

If low

BLEU

Real‑world example

Decent quality

If low

2. Binary Classification Metrics

AUC‑ROC

Real-world example

Decent quality

If low

AUC‑PR

Real-world example

Decent quality

If low

Precision

Real-world example

Decent quality

If low

Recall (True Positive Rate)

Real-world example

Decent quality

If low

False Positive Rate (FPR)

Real-world example

Decent quality

If high

F1 Score

Real-world example

Decent quality

If low

3. Regression Metrics

R² (Coefficient of Determination)

Real-world example

Decent quality

If low

MAE (Mean Absolute Error)

Real-world example

Decent quality

If high

MSE (Mean Squared Error)

Real-world example

If high

RMSE (Root Mean Squared Error)

Real-world example

Decent quality

If high

Trending Tags