Evaluating ML/GenAI Models

When your data is messy, incomplete, biased, or misaligned with the task, it inevitably shows up as weak or unstable evaluation metrics. This guide explains major evaluation metrics across generative AI, binary classification, and regression - including real-world examples, what “good” looks like, and remediation strategies when metrics are low.



Summary Table of Machine Learning & Generative AI Evaluation Metrics

Note: Thresholds are general industry heuristics; real “good” values depend on domain, data distribution, and business constraints.


Text Generative-AI Metrics

| Metric | What It Measures | When to Use | Good Threshold | Targets (Model Types) |
|---|---|---|---|---|
| ROUGE‑1 | Unigram (word‑level) overlap | Summarization, QA, content recall | 0.4–0.5+ good; 0.5–0.6+ strong | Summarization models, LLM eval |
| ROUGE‑2 | Bigram (two‑word sequence) overlap | Fluency, coherence, phrase accuracy | 0.2–0.3 good; >0.35 strong | Summaries, paraphrasers, LLMs |
| ROUGE‑L | Longest Common Subsequence (structure similarity) | Narrative structure, sentence‑level accuracy | >0.4 useful; >0.5 strong | Summarization, reasoning LLMs |
| BLEU | Precision‑oriented n‑gram match + brevity penalty | Translation, paraphrasing | 30–40 good; 45–55+ strong | Machine translation, multilingual LLMs |

Binary Classification Metrics

| Metric | What It Measures | When to Use | Good Threshold | Targets (Model Types) |
|---|---|---|---|---|
| AUC‑ROC | Ranking ability between positive & negative classes | Balanced datasets; general classification | 0.7–0.8 OK; 0.8–0.9 good; >0.9 high | Fraud detection, risk scoring, medical tests |
| AUC‑PR | Precision‑Recall tradeoff | Imbalanced datasets (rare events) | 2–3× the positive base rate | Fraud, medical diagnosis, anomaly detection |
| Precision | % of predicted positives that are true | When false positives hurt | >0.8 common target | Spam filters, fraud alerts |
| Recall (TPR) | % of actual positives detected | When false negatives hurt | >0.7–0.8 good; >0.85 strong | Medical tests, safety models |
| False Positive Rate (FPR) | % of negatives incorrectly predicted positive | Critical when over-flagging is costly | <10% acceptable; <5% strong | Security screening, compliance systems |
| F1 Score | Harmonic mean of precision & recall | Balanced precision/recall importance | >0.7 good; >0.8 strong | Churn prediction, NER, general classification |

Regression Metrics

| Metric | What It Measures | When to Use | Good Threshold | Targets (Model Types) |
|---|---|---|---|---|
| R² | % of variance explained by model | General regression, explainability | >0.5 OK; >0.7 strong; >0.9 excellent | Pricing models, risk scoring |
| MAE | Average absolute error | Interpretability needed; robust to outliers | <10–20% of mean target | Forecasting, scheduling |
| MSE | Mean of squared errors (punishes large mistakes) | When large errors are extra costly | Lower is better (scale-dependent) | Safety systems, physical modeling |
| RMSE | Root of MSE, in same units as target | Comparing to business KPIs | <10–20% relative error | Demand forecasting, time-series models |

Deep Dive

1. Text Generation Metrics (ROUGE, BLEU)

Text generation metrics measure how similar model outputs are to reference texts. They are widely used for tasks like summarization, translation, and generative AI fine-tuning.



flowchart LR
    A[Reference Text] --> C[Compare Token Overlap]
    B[Generated Text] --> C
    C --> D1[ROUGE-1: Unigrams]
    C --> D2[ROUGE-2: Bigrams]
    C --> D3[ROUGE-L: Longest Common Subsequence]

ROUGE‑1

Definition: Measures overlap of individual words (unigrams).
Usage: Summarization and content recall.

Real‑world example

Reference: “The Federal Reserve raised interest rates due to inflation concerns.”
Generated summary containing many of these words → higher ROUGE‑1.
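
The unigram overlap described here can be sketched in a few lines of plain Python. This is a simplified illustration, not a reference implementation: the function name `rouge_1`, the whitespace tokenization, the lowercasing, and the candidate summary are all assumptions made for the example. Punctuation is left out of the strings to keep tokenization trivial.

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """Unigram overlap between a reference and a candidate (whitespace tokens)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the federal reserve raised interest rates due to inflation concerns"
candidate = "the federal reserve raised rates because of inflation"  # hypothetical model output
scores = rouge_1(reference, candidate)
print(scores)  # 6 shared words: recall 6/10, precision 6/8
```

Published ROUGE scores usually come from an established package with stemming and tokenization options, so numbers from a sketch like this are only comparable with themselves.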

Decent quality

  • 0.4–0.5+ = good
  • 0.5–0.6+ = strong

If low, improve by

  • Adding domain‑specific training samples
  • Fixing noisy or inconsistent reference summaries
  • Improving decoding strategies (beam search, temperature)

ROUGE‑2

Definition: Measures bigram overlap (two-word sequences).
Usage: Fluency and phrase accuracy.

Real‑world example

Bigram match: “raised interest” appearing in both reference and output.
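
To see why bigram overlap is a stricter test than word overlap, the sketch below computes n-gram recall for n = 1 and n = 2 on the same pair of sentences. The helper names (`ngrams`, `rouge_n_recall`) and the candidate sentence are hypothetical, and tokenization is plain whitespace splitting.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(reference: str, candidate: str, n: int = 2) -> float:
    """Share of reference n-grams that also appear in the candidate (clipped counts)."""
    ref = Counter(ngrams(reference.lower().split(), n))
    cand = Counter(ngrams(candidate.lower().split(), n))
    overlap = sum((ref & cand).values())
    total = sum(ref.values())
    return overlap / total if total else 0.0

reference = "the federal reserve raised interest rates due to inflation concerns"
candidate = "the fed raised interest rates over inflation concerns"  # hypothetical output
rouge1 = rouge_n_recall(reference, candidate, n=1)
rouge2 = rouge_n_recall(reference, candidate, n=2)
print(f"ROUGE-1 recall: {rouge1:.2f}, ROUGE-2 recall: {rouge2:.2f}")
```

Only three of the nine reference bigrams survive ("raised interest", "interest rates", "inflation concerns"), even though six of the ten words match. This is one reason typical ROUGE‑2 values sit well below ROUGE‑1.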

Decent quality

  • 0.2–0.3 = good
  • >0.35 = strong

If low

  • Add more high-quality examples with clear phrasing
  • Improve tokenization and text normalization

ROUGE‑L

Definition: Longest Common Subsequence (LCS) between generated and reference text.
Usage: Evaluates structural similarity.

Real‑world example

Two summaries using similar sentence structures achieve higher ROUGE‑L even if some words differ.
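
A minimal sketch of the LCS computation, assuming whitespace tokens and a plain F1 combination of LCS-based precision and recall (some implementations weight recall differently). The function names and the candidate sentence are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference: str, candidate: str) -> dict:
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(cand) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"lcs": lcs, "precision": precision, "recall": recall, "f1": f1}

reference = "the federal reserve raised interest rates due to inflation concerns"
candidate = "the central bank raised interest rates because of inflation"  # hypothetical output
result = rouge_l(reference, candidate)
print(result)  # the shared in-order words are: the, raised, interest, rates, inflation
```

The matched words do not need to be adjacent, only in the same order, which is why ROUGE‑L tolerates substitutions such as "central bank" for "federal reserve".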

Decent quality

  • 0.4+ = useful
  • 0.5+ = strong

If low

  • Fine‑tune on datasets emphasizing reasoning and structure
  • Remove inconsistent or unstructured reference texts

BLEU

Definition: Precision‑based n‑gram match score with brevity penalty.
Usage: Machine translation, paraphrasing.

Real‑world example

“How are you?” → “¿Cómo estás?”
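If the model output matches the reference translation exactly, every n‑gram matches and BLEU = 100. Very short sentences contain few or no higher‑order n‑grams, so their scores are sensitive to tokenization and smoothing.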
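
The sketch below shows the two ingredients of BLEU, clipped n‑gram precision and the brevity penalty, in a deliberately simplified form: one reference, whitespace tokens, no smoothing. The function `simple_bleu` and the example sentences are assumptions for illustration. Established toolkits apply their own tokenization and smoothing, so their numbers will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Unsmoothed single-reference BLEU on a 0-100 scale (simplified sketch)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        cand_counts = Counter(ngrams(cand, n))
        total = sum(cand_counts.values())
        clipped = sum((ref_counts & cand_counts).values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: candidates shorter than the reference are scaled down.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat is sitting on the mat"
exact = simple_bleu(reference, "the cat is sitting on the mat")
shorter = simple_bleu(reference, "the cat is sitting on")  # all n-grams match, but too short
two_tokens_4gram = simple_bleu("¿cómo estás?", "¿cómo estás?")           # no 3- or 4-grams exist
two_tokens_2gram = simple_bleu("¿cómo estás?", "¿cómo estás?", max_n=2)
print(exact, round(shorter, 1), two_tokens_4gram, two_tokens_2gram)
```

The truncated candidate has perfect n‑gram precision yet loses about a third of its score to the brevity penalty. The two‑token sentence only reaches 100 in this sketch when the maximum n‑gram order is lowered.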

Decent quality

  • 30–40 = acceptable
  • 45–55+ = high-quality translation

If low

  • Add multi‑reference translations
  • Improve preprocessing and normalization
  • Include more domain-specific parallel corpora

2. Binary Classification Metrics

Used in fraud detection, healthcare diagnostics, spam filtering, customer churn prediction, and more.

flowchart TD
    A[Model Predictions] --> B[Confusion Matrix]
    B --> C1["Precision = TP / (TP + FP)"]
    B --> C2["Recall = TP / (TP + FN)"]
    B --> C3[F1 = Harmonic Mean]
    B --> C4[TPR, FPR]
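
The quantities in the diagram follow directly from the four confusion-matrix counts. A minimal sketch with made-up labels (1 = positive, 0 = negative); the helper names are hypothetical:

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def safe_div(a, b):
    """Avoid ZeroDivisionError when a class or prediction type is absent."""
    return a / b if b else 0.0

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # illustrative ground truth
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # illustrative model decisions
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision = safe_div(tp, tp + fp)  # of predicted positives, how many are right
recall = safe_div(tp, tp + fn)     # of actual positives, how many are found (TPR)
fpr = safe_div(fp, fp + tn)        # of actual negatives, how many are flagged
f1 = safe_div(2 * precision * recall, precision + recall)
print(tp, fp, fn, tn, precision, recall, round(fpr, 3), f1)
```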

AUC‑ROC

Definition: Measures ability to distinguish positive vs. negative classes.
Usage: Ranking and risk scoring.

Real-world example

A fraud detection model with AUC‑ROC = 0.90 assigns a higher score to a randomly chosen fraud case than to a randomly chosen non‑fraud case about 90% of the time.
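
The ranking interpretation can be checked directly by comparing every positive score with every negative score. The sketch below does exactly that, counting ties as half a win. It is quadratic in the number of examples and meant for intuition, not for large datasets. The function name and the scores are made up.

```python
def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0, 0]                     # 1 = fraud (illustrative)
scores = [0.9, 0.8, 0.35, 0.7, 0.3, 0.2, 0.1]      # model risk scores (illustrative)
auc = auc_roc(y_true, scores)
print(round(auc, 3))  # 11 of 12 pairs are ordered correctly
```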

Decent quality

  • 0.7–0.8 = acceptable
  • 0.8–0.9 = good
  • >0.9 = excellent

If low

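  • Check for label noise and train/test mismatch (label leakage usually inflates AUC rather than lowering it, so it is a concern when scores look too good)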
  • Add more representative examples
  • Use better feature engineering

AUC‑PR

Definition: Precision‑recall curve area, best for imbalanced data.

Real-world example

Cancer detection where positives are 1%.
A random classifier's AUC‑PR is roughly the positive base rate, here 0.01.
A model achieving 0.40 is performing extremely well.
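
One common way to estimate the area under the precision‑recall curve is average precision: the mean of the precision values at the rank of each true positive. The sketch below uses that estimate on a tiny made-up dataset and compares it with the base rate. It ignores tied scores, and `average_precision` is a hypothetical helper name.

```python
def average_precision(y_true, scores):
    """Mean of precision measured at the rank of each true positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos

y_true = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # illustrative: 2 positives in 10
scores = [0.2, 0.9, 0.8, 0.1, 0.6, 0.3, 0.4, 0.05, 0.15, 0.25]
ap = average_precision(y_true, scores)
base_rate = sum(y_true) / len(y_true)
print(f"AP = {ap:.3f}, base rate = {base_rate:.2f}, lift = {ap / base_rate:.1f}x")
```

Reporting the lift over the base rate alongside the raw value makes results comparable across datasets with different class balance.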

Decent quality

  • 2–3× the base rate is reasonable
  • Higher is better

If low

  • Improve sampling of positive cases
  • Apply oversampling/SMOTE techniques
  • Use task‑specific loss weighting

Precision

Definition: % of predicted positives that are correct.

Real-world example

Spam classifier:
100 emails marked as spam → 95 truly spam → precision = 0.95

Decent quality

  • >0.8 typical for many domains

If low

  • Raise decision threshold
  • Reduce noisy negative samples

Recall (True Positive Rate)

Definition: % of actual positives the model detects.

Real-world example

Medical imaging model detecting diabetic retinopathy.
Recall = 0.95 means 95% of positive cases are captured.

Decent quality

  • >0.7–0.8 = acceptable
  • >0.85 = strong

If low

  • Lower decision threshold
  • Add better domain‑specific features
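
Raising or lowering the decision threshold trades precision against recall. The sketch below evaluates one set of made-up scores at three thresholds. The function name, labels, and scores are illustrative assumptions.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.40, 0.60, 0.35, 0.20, 0.10]

results = {t: precision_recall_at(y_true, scores, t) for t in (0.7, 0.5, 0.3)}
for threshold, (p, r) in results.items():
    print(f"threshold {threshold}: precision {p:.2f}, recall {r:.2f}")
```

In this toy data, recall climbs from 0.50 to 1.00 as the threshold drops from 0.7 to 0.3, while precision falls from 1.00 to about 0.67.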

False Positive Rate (FPR)

Definition: % of negatives incorrectly predicted as positives.

Real-world example

Airport screening model:
5 out of 100 non‑threat items flagged → FPR = 5%

Decent quality

  • <10% acceptable
  • <5% strong

If high

  • Improve calibration
  • Re‑evaluate model thresholds

F1 Score

Definition: Harmonic mean of precision and recall.

Real-world example

Churn model:
Precision = 0.8, Recall = 0.8 → F1 = 0.8
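
Because F1 is a harmonic mean, it is pulled toward the lower of the two values. A short sketch, with a second, lopsided case added for contrast (the 0.95 / 0.30 figures are made up):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.8, 0.8)     # the churn example: stays at about 0.8
lopsided = f1_score(0.95, 0.30)   # high precision cannot hide low recall
arithmetic_mean = (0.95 + 0.30) / 2
print(round(balanced, 3), round(lopsided, 3), round(arithmetic_mean, 3))
```

The lopsided model scores about 0.46 on F1 even though the simple average of its precision and recall is above 0.6.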

Decent quality

  • >0.7 good
  • >0.8 strong

If low

  • Increase minority-class examples
  • Adjust threshold to balance precision/recall

3. Regression Metrics

Used in forecasting, pricing models, risk modeling, and time-series problems.


R² (Coefficient of Determination)

Definition: Measures how much variance in the target is explained.

Real-world example

House prices:
R² = 0.85 → 85% of price variance is explained by the model.
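
R² compares the model's squared error with that of a baseline that always predicts the mean. A minimal sketch with made-up house prices (in thousands); the function name is hypothetical:

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1 - ss_res / ss_tot

prices = [200.0, 250.0, 300.0, 350.0, 400.0]   # illustrative actual prices
preds = [210.0, 240.0, 310.0, 340.0, 390.0]    # illustrative model predictions
r2 = r_squared(prices, preds)
r2_mean_baseline = r_squared(prices, [300.0] * 5)  # always predicting the mean
print(round(r2, 3), r2_mean_baseline)
```

A model that only predicts the mean scores 0, and a model worse than that baseline scores below 0, so R² is not bounded below by zero on held-out data.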

Decent quality

  • >0.5 acceptable
  • >0.7 strong
  • >0.9 excellent

If low

  • Add more predictive features
  • Consider nonlinear models

MAE (Mean Absolute Error)

Definition: Mean absolute difference between predictions and true values.

Real-world example

Weather forecasting:
MAE = 2°F → predictions are off by 2°F on average; individual forecasts can miss by more.
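
A small sketch with made-up temperatures shows what an MAE of 2°F can hide. The values and the helper name are illustrative.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [70, 72, 68, 75, 71]      # observed highs in °F (illustrative)
forecast = [71, 72, 63, 74, 74]    # forecast highs in °F (illustrative)

errors = [abs(a - f) for a, f in zip(actual, forecast)]
mae_value = mae(actual, forecast)
print(errors, mae_value)  # the average is 2°F, yet one day is off by 5°F
```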

Decent quality

  • <10–20% of mean target value

If high

  • Remove outliers
  • Normalize or scale features

MSE (Mean Squared Error)

Definition: Mean of the squared errors; punishes large mistakes.

Real-world example

Autonomous drone path prediction: large deviations are dangerous, so MSE is critical.
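
The sketch below contrasts two made-up sets of predictions with the same MAE: one with small errors everywhere, one with a single large deviation. The numbers are illustrative and unitless.

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [10.0, 10.0, 10.0, 10.0]
steady = [12.0, 8.0, 12.0, 8.0]    # every prediction is off by 2
spiky = [10.0, 10.0, 10.0, 18.0]   # one prediction is off by 8

print("steady:", mae(actual, steady), mse(actual, steady))
print("spiky: ", mae(actual, spiky), mse(actual, spiky))
```

Both prediction sets have an MAE of 2, but the MSE of the spiky set is four times larger, which is the behavior you want when rare large errors are the real risk.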

If high

  • Investigate extreme errors
  • Add regularization to prevent overfitting

RMSE (Root Mean Squared Error)

Definition: Square root of MSE; interpretable in target units.

Real-world example

Retail demand forecasting:
Average demand 100 units/day, RMSE = 12 → RMSE is 12% of average demand (relative error).
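
The relative figure is simply RMSE divided by the mean of the target. A minimal sketch, using made-up daily demand chosen so the numbers match the example:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

demand = [90.0, 110.0, 100.0, 95.0, 105.0]    # units/day, mean is 100 (illustrative)
forecast = [102.0, 98.0, 112.0, 107.0, 93.0]  # illustrative forecasts

rmse_value = rmse(demand, forecast)
mean_demand = sum(demand) / len(demand)
relative_rmse = rmse_value / mean_demand
print(rmse_value, f"{relative_rmse:.0%}")
```

Normalizing by the mean only makes sense when the target stays well away from zero. For intermittent demand, a different normalization may be more informative.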

Decent quality

  • <10–20% relative error

If high

  • Add seasonal features
  • Use ensemble or gradient boosting models

This post is licensed under CC BY 4.0 by the author.