📊 Enter Confusion Matrix Values
📌 Enter the four values from your binary classification confusion matrix. Predicted Positive is the left column (TP + FP); Actual Positive is the top row (TP + FN).
← PREDICTED →
ACTUAL
TP
True Positive
FN
False Negative (Type II)
FP
False Positive (Type I)
TN
True Negative
Confusion Matrix Summary
Accuracy
⚠️ Disclaimer: Choose your evaluation metric based on your specific problem, class balance, and the relative costs of false positives vs false negatives. No single metric is universally best.

Sources & Methodology

All formulas verified against Google ML Crash Course, Wikipedia, and peer-reviewed research on classification metrics.
📘
Google Machine Learning Crash Course — Classification Metrics
Authoritative reference for accuracy, precision, recall, TPR, FPR, and their relationships used in machine learning model evaluation. Defines all metrics used in this calculator.
📊
Chicco D., Jurman G. (2020) — The advantages of the MCC over F1 score and accuracy in binary classification evaluation
Peer-reviewed paper demonstrating that Matthews Correlation Coefficient (MCC) is a more reliable metric than F1 score and accuracy for imbalanced binary classification tasks. Basis for recommending MCC for imbalanced datasets.
All formulas (n = TP+TN+FP+FN):
Accuracy = (TP+TN)/n | Precision = TP/(TP+FP) | Recall = TP/(TP+FN)
Specificity = TN/(TN+FP) | F1 = 2×(P×R)/(P+R) | NPV = TN/(TN+FN)
Balanced Acc = (Recall+Specificity)/2 | FPR = FP/(FP+TN) | FNR = FN/(FN+TP)
MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
FDR = 1 − Precision | FOR = 1 − NPV
Values are undefined (NaN) when the denominator is zero.
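The formula set above can be sketched as a single Python function. This is a minimal illustration, not the calculator's actual implementation; the name `classification_metrics` is hypothetical:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the binary-classification metrics listed above.
    Returns NaN wherever a denominator is zero, matching the
    calculator's stated convention."""
    def ratio(num, den):
        return num / den if den else math.nan

    n = tp + tn + fp + fn
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    npv = ratio(tn, tn + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, n),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": ratio(2 * precision * recall, precision + recall),
        "npv": npv,
        "balanced_accuracy": (recall + specificity) / 2,
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "fdr": 1 - precision,   # false discovery rate
        "for": 1 - npv,         # false omission rate
    }
```

With the cancer-screening example used throughout this page (TP=80, TN=70, FP=10, FN=20), this returns accuracy ≈ 0.833 and MCC ≈ 0.671.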

Accuracy, Precision, Recall, F1 — Complete Guide to Classification Metrics

In binary classification, accuracy alone is rarely enough to evaluate a model. A spam filter that marks every email as "not spam" would be 100% accurate on a dataset with no spam — yet completely useless. To properly evaluate classifiers, you need a suite of metrics derived from the confusion matrix: the 2×2 table of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

The Confusion Matrix Explained

Predicted Positive | Predicted Negative
Actual Positive: TP (correctly identified positive) | FN (missed positive — Type II error)
Actual Negative: FP (false alarm — Type I error) | TN (correctly identified negative)

Example: Cancer screening on 180 patients (100 with cancer, 80 without).
Model predicts: 80 TP (correctly identified cancer), 20 FN (missed cancer), 10 FP (false alarms), 70 TN.
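Tallying the four cells from raw labels is a one-line count per cell. A minimal sketch (the helper name `confusion_counts` is illustrative, with 1 = positive):

```python
def confusion_counts(actual, predicted):
    """Tally TP, FP, TN, FN from parallel lists of 0/1 labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # hit
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarm (Type I)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # correct rejection
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # miss (Type II)
    return tp, fp, tn, fn
```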

All 12 Classification Metrics — Formulas and Interpretation

Metric | Formula | Meaning | Best For
Accuracy | (TP+TN)/n | Overall fraction correct | Balanced classes
Precision (PPV) | TP/(TP+FP) | Of predicted positives, how many are real | Costly false positives
Recall (Sensitivity) | TP/(TP+FN) | Of actual positives, how many found | Costly false negatives
Specificity (TNR) | TN/(TN+FP) | Of actual negatives, how many identified | Clinical tests
F1 Score | 2PR/(P+R) | Harmonic mean of precision & recall | Imbalanced data
Balanced Accuracy | (Recall+Specificity)/2 | Average per-class accuracy | Highly imbalanced
MCC | (TP×TN−FP×FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Correlation between actual & predicted | Best overall for imbalanced
NPV | TN/(TN+FN) | Of predicted negatives, how many are real | Ruling out disease
FPR | FP/(FP+TN) | False alarm rate (1 − Specificity) | ROC curve x-axis
FNR | FN/(FN+TP) | Miss rate (1 − Recall) | Safety-critical
FDR | FP/(TP+FP) = 1 − Precision | Of predicted positives, how many are false | Complement of Precision
FOR | FN/(TN+FN) = 1 − NPV | Of predicted negatives, how many are false | Complement of NPV

When Accuracy Is Misleading — Class Imbalance Problem

If 99% of data belongs to the negative class, a model that always predicts negative achieves 99% accuracy while being completely useless. This is the accuracy paradox:

Rule: When the minority class is less than 20–30% of the data, do not use accuracy as your primary metric. Use F1, balanced accuracy, or MCC instead.
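The paradox is easy to reproduce with made-up numbers (the 990/10 split below is hypothetical, chosen for illustration):

```python
# Accuracy paradox: an always-negative classifier on 99% negative data.
tp, fn = 0, 10      # every positive is missed
tn, fp = 990, 0     # every negative is "correctly" predicted

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 0.99 — looks great
recall = tp / (tp + fn)                               # 0.0 — finds nothing
balanced_accuracy = (recall + tn / (tn + fp)) / 2     # 0.5 — no better than chance
```

Balanced accuracy (and MCC) expose the useless model that plain accuracy hides.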

Precision vs Recall Trade-off

Precision and recall are inversely related at most threshold settings. Raising the classification threshold increases precision but decreases recall. Lowering it increases recall but decreases precision. The F1 score is the harmonic mean that balances both. For asymmetric costs, use F-beta where β > 1 to emphasize recall, or β < 1 to emphasize precision.

Fβ = (1+β²) × (Precision × Recall) / (β² × Precision + Recall)
β = 1: standard F1 (equal weight). β = 2: double weight on recall (cancer detection). β = 0.5: double weight on precision (content recommendation).
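The Fβ formula translates directly to code; `f_beta` is an illustrative helper, shown here with the precision/recall values from the running example:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall, beta < 1 weights precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Running example: Precision = 0.889, Recall = 0.80
f1  = f_beta(0.889, 0.8, 1)    # ≈ 0.842 (harmonic mean)
f2  = f_beta(0.889, 0.8, 2)    # ≈ 0.816 (pulled toward the lower recall)
f05 = f_beta(0.889, 0.8, 0.5)  # ≈ 0.870 (pulled toward the higher precision)
```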

Matthews Correlation Coefficient (MCC) — The Best Single Metric

MCC is considered the single most reliable metric for binary classification evaluation, especially for imbalanced data. It is essentially the Pearson correlation coefficient between the actual and predicted binary labels. MCC = +1 means perfect prediction, MCC = 0 means no better than random, MCC = −1 means perfect inverse prediction. Unlike F1 and accuracy, MCC gives a high score only when the model performs well on both classes.

MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Example: TP=80, TN=70, FP=10, FN=20.
Numerator: 80×70 − 10×20 = 5600 − 200 = 5400
Denominator: √((90)(100)(80)(90)) = √64,800,000 ≈ 8050
MCC = 5400/8050 ≈ 0.671 (strong positive correlation)
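The same arithmetic, checked in Python:

```python
import math

tp, tn, fp, fn = 80, 70, 10, 20
num = tp * tn - fp * fn                                        # 5600 − 200 = 5400
den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) # √64,800,000 ≈ 8050
mcc = num / den                                                # ≈ 0.671
```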
💡 Quick metric selection guide:
Balanced classes, equal error costs → use Accuracy.
High FP cost (spam filter, false security alerts) → optimize Precision.
High FN cost (cancer screening, fraud) → optimize Recall.
Need one number for imbalanced data → use MCC.
Need an intuitive single metric → use F1 or Balanced Accuracy.
Clinical diagnostic test evaluation → report Sensitivity and Specificity separately (they are prevalence-invariant, unaffected by class balance).
Frequently Asked Questions
How do you calculate accuracy from a confusion matrix?
Accuracy = (TP + TN) / (TP + TN + FP + FN). It measures the fraction of all predictions that are correct. Example: TP=80, TN=70, FP=10, FN=20. Total n = 180. Accuracy = (80+70)/180 = 150/180 = 83.3%. This is the simplest metric but can be misleading for imbalanced datasets where one class is much more common than the other.
What is the difference between precision and accuracy?
Precision (PPV) = TP/(TP+FP). It measures: of all the cases you predicted as positive, how many are actually positive? Example: you predicted 90 positives (TP=80, FP=10). Precision = 80/90 = 88.9%. Accuracy considers all four cells (TP, TN, FP, FN); precision only considers the positive predictions (TP, FP). Use precision when false positives are costly: flagging legitimate emails as spam is worse than missing spam.
What is recall, and when should you prioritize it?
Recall (sensitivity, TPR) = TP/(TP+FN). It measures: of all actual positives, how many did you correctly identify? Example: 100 actual positives (TP=80, FN=20). Recall = 80/100 = 80%. Use recall when false negatives are costly: missing a cancer diagnosis (FN) is far worse than a false alarm (FP). In these cases, maximize recall even if it means more false positives.
What is the F1 score and when should you use it?
F1 = 2×(Precision×Recall)/(Precision+Recall). It is the harmonic mean of precision and recall. Use it when you want to balance both and classes are somewhat imbalanced. Example: Precision=0.889, Recall=0.80. F1 = 2×0.711/1.689 = 0.842 = 84.2%. F1 penalizes extreme values: a model with 100% precision and 0% recall gets F1=0, not the average (50%).
What is balanced accuracy?
Balanced accuracy = (Recall + Specificity)/2. It averages sensitivity and specificity, giving equal weight to both classes regardless of their size. For a model that always predicts the majority class: Recall = 0%, Specificity = 100%, balanced accuracy = 50% (correctly identifying it as no better than chance). Regular accuracy would give 90%+ for a 90% majority class. Use balanced accuracy whenever your positive class represents less than 30% of the data.
What is the Matthews Correlation Coefficient (MCC)?
MCC = (TP×TN−FP×FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It ranges from −1 to +1. MCC = +1: perfect predictions. MCC = 0: random chance. MCC = −1: perfectly wrong. Unlike F1 and accuracy, MCC uses all four cells of the confusion matrix and is only high when both precision and recall are high AND both specificity and NPV are high. Research by Chicco & Jurman (2020) recommends MCC as the single most reliable metric for binary classification evaluation.
What are sensitivity and specificity?
Specificity (TNR) = TN/(TN+FP). Sensitivity (Recall) = TP/(TP+FN). Together they describe a test's performance: sensitivity asks "what fraction of disease cases does the test detect?" Specificity asks "what fraction of healthy cases does the test correctly clear?" They are used in clinical diagnostic test evaluation. Both are prevalence-invariant (unlike precision, which changes with class balance). The ROC curve plots sensitivity vs (1−specificity) across all decision thresholds.
What is negative predictive value (NPV)?
NPV = TN/(TN+FN). It answers: when the test says negative, how likely is the patient truly negative? Critical in medicine: a high NPV means the test is good at ruling OUT disease. Unlike specificity, NPV depends on disease prevalence. For rare diseases (low prevalence), even a moderately specific test has very high NPV. For common diseases, NPV is lower even with the same test.
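The prevalence dependence is easy to verify numerically. A sketch assuming a hypothetical test with 90% sensitivity and 90% specificity (`npv_at_prevalence` is an illustrative helper):

```python
def npv_at_prevalence(sensitivity, specificity, prevalence, n=100_000):
    """NPV of a test with fixed sensitivity/specificity at a given prevalence."""
    pos = n * prevalence          # truly diseased
    neg = n - pos                 # truly healthy
    tn = neg * specificity        # healthy, correctly cleared
    fn = pos * (1 - sensitivity)  # diseased, missed
    return tn / (tn + fn)

rare   = npv_at_prevalence(0.9, 0.9, 0.01)  # rare disease: NPV ≈ 99.9%
common = npv_at_prevalence(0.9, 0.9, 0.30)  # common disease: NPV ≈ 95.5%
```

Same test, same sensitivity and specificity, but the NPV drops as the disease becomes more common.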
Why is accuracy misleading for imbalanced datasets?
Accuracy is misleading when classes are imbalanced. Example: 1000 patients, 10 with a rare disease (1%). A model predicting "healthy" for everyone gets 99% accuracy but 0% recall: it misses every patient. This is why you always need to check both accuracy AND recall (or F1 or MCC) when class imbalance exists. Rule of thumb: if the minority class < 20% of the data, do not use accuracy as your primary metric.
Which metric should you choose for your problem?
Use Accuracy: balanced classes, all errors cost equally (e.g. digit recognition). Use F1: imbalanced data, you want to balance precision and recall (information retrieval, spam detection). Use MCC: highly imbalanced data, you want the single most reliable metric (medical diagnosis, fraud, rare event detection). Use Balanced Accuracy: when MCC is hard to interpret and you want an intuitive 0–100% scale. For medical tests: always report sensitivity AND specificity separately.
What are the false positive rate (FPR) and false negative rate (FNR)?
FPR (false positive rate) = FP/(FP+TN) = 1 − Specificity. It measures the fraction of actual negatives incorrectly flagged as positive, and serves as the x-axis of ROC curves. FNR (false negative rate) = FN/(FN+TP) = 1 − Recall. It measures the fraction of actual positives missed. The ideal classifier has FPR=0 and FNR=0. In practice there is a trade-off: lowering the decision threshold to reduce FNR increases FPR, and vice versa.
What is a confusion matrix?
A confusion matrix is a 2×2 table for binary classification showing: TP (predicted positive, actually positive), FP (predicted positive, actually negative — Type I error), FN (predicted negative, actually positive — Type II error), TN (predicted negative, actually negative). All of the classification metrics on this page are derived from these four values. For multi-class problems, the confusion matrix is k×k for k classes, and per-class metrics are computed using one-vs-rest comparisons.
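The one-vs-rest reduction can be sketched in a few lines (`per_class_counts` is an illustrative helper name):

```python
def per_class_counts(actual, predicted, cls):
    """One-vs-rest TP/FP/FN/TN for class `cls` in a multi-class problem:
    `cls` is treated as positive, every other class as negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == cls and p == cls)
    fp = sum(1 for a, p in pairs if a != cls and p == cls)
    fn = sum(1 for a, p in pairs if a == cls and p != cls)
    tn = sum(1 for a, p in pairs if a != cls and p != cls)
    return tp, fp, fn, tn
```

Running this once per class and averaging the resulting per-class metrics gives macro-averaged precision, recall, etc.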