📌 Enter visitors and conversions for Control (A) and Variant (B). The calculator will test whether B is statistically different from A.
📌 Plan your A/B test before running it. Find the sample size per variant you need to reliably detect a given improvement in conversion rate.
Planner inputs:
- Baseline rate (%): current conversion rate of the control (0.01% to 99%)
- MDE (%): relative lift you want to detect (e.g. 20% = from 5% to 6%)
- Power: probability of detecting a true effect
- Daily traffic (visits): used to estimate test runtime in days
Test Result
⚠️ Disclaimer: Statistical significance is a necessary but not sufficient condition for making business decisions. Consider effect size, practical significance, and business context before acting on results.

Sources & Methodology

Two-proportion z-test formulas verified against NIST and standard statistical methods textbooks. Sample size formulas use exact power calculations.
📘
NIST/SEMATECH e-Handbook — Two-Proportion Z-Test
Reference for two-proportion z-test methodology, pooled proportion standard error calculation, and p-value determination for one-tailed and two-tailed tests.
📊
Cohen, J. (1988) — Statistical Power Analysis (2nd ed.)
Definitive reference for sample size calculations incorporating statistical power (1−β) and Type I error (α) for two-proportion tests, including the relationship between MDE, sample size, and power.
Significance test formulas:
p₁ = x₁/n₁ (control rate)   p₂ = x₂/n₂ (variant rate)
Pooled: p̂ = (x₁+x₂)/(n₁+n₂)   SE = √(p̂(1−p̂)(1/n₁+1/n₂))
z = (p₂−p₁) / SE   |   Uplift = (p₂−p₁)/p₁ × 100%
Sample size formula:
n = (zα/2 + zβ)² × (p₁(1−p₁)+p₂(1−p₂)) / (p₂−p₁)²
Normal CDF computed via the Abramowitz & Stegun rational approximation.

A/B Testing Statistics — Complete Guide to Significance, Sample Size & MDE

A/B testing (also called split testing) is the gold standard for evidence-based decision making in product development, conversion rate optimization (CRO), email marketing, and user experience design. The statistical foundation is the two-proportion z-test, which determines whether the observed difference in conversion rates between control and variant is real (statistically significant) or just random noise.

How A/B Test Statistical Significance Works

The null hypothesis (H⊂0;) states that both variants have the same true conversion rate. Statistical significance means rejecting this null hypothesis — concluding that the observed difference is too large to be explained by random chance at your chosen confidence level.

z = (p₂ − p₁) / √(p̂(1−p̂)(1/n₁ + 1/n₂))
Where: p₁ = control conversion rate, p₂ = variant conversion rate, p̂ = pooled rate.

Example: Control: 10,000 visitors, 500 conversions (5.0%). Variant: 10,000 visitors, 580 conversions (5.8%).
p̂ = (500+580)/(10,000+10,000) = 1080/20,000 = 0.054
SE = √(0.054 × 0.946 × (1/10,000 + 1/10,000)) = √(0.0000102) = 0.003196
z = (0.058 − 0.050) / 0.003196 = 0.008/0.003196 = 2.503
Two-tailed p = 2 × (1 − Φ(2.503)) = 0.0123
Result: p = 1.23% < 5% → Statistically significant at 95% confidence
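A two-proportion z-test like this can be computed with a short script. This is a minimal sketch using only the standard library (the function name `two_proportion_ztest` is illustrative, not part of the calculator); the inputs here are 500 and 580 conversions out of 10,000 visitors each, i.e. rates of 5.0% and 5.8%:

```python
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF, Phi(z), via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

def two_proportion_ztest(x1, n1, x2, n2, two_tailed=True):
    # Pooled two-proportion z-test: p-hat pooled across both arms
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - norm_cdf(abs(z))) if two_tailed else 1 - norm_cdf(z)
    return z, p_value

z, p = two_proportion_ztest(500, 10_000, 580, 10_000)
# z ≈ 2.503, two-tailed p ≈ 0.0123 → significant at 95%
```

Passing `two_tailed=False` returns the one-tailed p-value (≈0.0062 for the same data), which illustrates why one-tailed tests reach significance more easily.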

One-Tailed vs Two-Tailed A/B Tests

This is one of the most debated decisions in A/B testing methodology.

Most CRO practitioners and companies recommend two-tailed testing as the default. One-tailed is appropriate only when you know for certain that the change cannot possibly harm the metric of interest.

Sample Size Planning — The Most Important Step

Running an A/B test without a pre-planned sample size is one of the most common CRO mistakes. It leads to the peeking problem and inflated false positive rates. Calculate required sample size before you start, then do not check results until you have collected the full planned sample.

n per variant = (zα/2 + zβ)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂−p₁)²
Key inputs:
zα/2 = 1.960 for 95% confidence, 1.645 for 90%, 2.576 for 99%
zβ = 0.842 for 80% power, 1.036 for 85%, 1.282 for 90%

Example: Baseline 5%, detect 20% relative lift (5% → 6%), 95% CI, 80% power.
n = (1.960 + 0.842)² × (0.05 × 0.95 + 0.06 × 0.94) / (0.06 − 0.05)²
n = 7.851 × 0.1039 / 0.0001 ≈ 8,157 visitors per variant (8,165 with the calculator's exact power computation)
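The same formula can be evaluated in code. This is a minimal sketch (the helper `norm_ppf` uses bisection instead of a statistics library, and `sample_size_per_variant` is an illustrative name):

```python
from math import ceil, erf, sqrt

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

def norm_ppf(q):
    # Inverse normal CDF by bisection; ample precision for z-scores
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    # n = (z_alpha/2 + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = norm_ppf(1 - alpha / 2)  # two-tailed critical value
    z_beta = norm_ppf(power)
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return ceil(n)  # always round up

n = sample_size_per_variant(0.05, 0.20)  # baseline 5%, detect 20% relative lift
# → 8155 with full-precision z-scores (rounded hand calculation gives ≈8,157)
```

At 500 visitors per day per variant, the estimated runtime is 8,155 / 500 ≈ 16.3 days.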

Minimum Detectable Effect (MDE) — Setting Realistic Test Goals

The MDE is the smallest relative improvement you want to be able to detect. Setting the MDE too small leads to impractically large sample requirements; setting it too large means real but smaller improvements go undetected.

Baseline Rate | MDE (Relative) | Absolute Change | n per variant (95% CI, 80% power) | Days at 500 visits/day
5% | 10% | 5% → 5.5% | 30,753 | 61.5
5% | 20% | 5% → 6% | 8,165 | 16.3
5% | 50% | 5% → 7.5% | 1,518 | 3.0
10% | 10% | 10% → 11% | 28,817 | 57.6
10% | 20% | 10% → 12% | 7,534 | 15.1
20% | 10% | 20% → 22% | 24,252 | 48.5

The Peeking Problem — Why You Cannot Check Early

The peeking problem is the #1 statistical error in A/B testing. If you check results multiple times and stop the test as soon as you see significance, your actual false positive rate is much higher than your nominal alpha. Research by Johari et al. (2015) showed that peeking at results 5 times and stopping when p<0.05 inflates the false positive rate from 5% to approximately 14%.

🚫 Do not peek: Pre-commit to your sample size, collect all the data, and check results exactly once. If you need ongoing monitoring, use sequential testing methods (like the alpha-spending approach or mSPRT) specifically designed to control false positives with multiple looks. Standard p-values are only valid for a single pre-specified sample size check.
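The inflation is easy to demonstrate with an A/A simulation, where both arms share the same true rate, so every "significant" result is a false positive. This is a minimal sketch (function names are illustrative; the exact rates vary with the seed and number of simulations):

```python
import random
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

def z_stat(x1, n1, x2, n2):
    # Two-proportion z-statistic with pooled standard error
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x2 / n2 - x1 / n1) / se if se > 0 else 0.0

def aa_test(n_total, looks, rate=0.05):
    # One A/A test: returns whether any interim look crossed |z| > 1.96
    # (peeking) and whether the single final look did
    a = [random.random() < rate for _ in range(n_total)]
    b = [random.random() < rate for _ in range(n_total)]
    peeked = any(abs(z_stat(sum(a[:n]), n, sum(b[:n]), n)) > 1.96 for n in looks)
    final = abs(z_stat(sum(a), n_total, sum(b), n_total)) > 1.96
    return peeked, final

random.seed(7)
sims, looks = 500, [1000, 2000, 3000, 4000, 5000]  # 5 looks at the data
results = [aa_test(5000, looks) for _ in range(sims)]
peek_rate = sum(p for p, _ in results) / sims   # stop at first "significant" look
final_rate = sum(f for _, f in results) / sims  # single look at the planned n
# peek_rate comes out well above the nominal 5% (the text cites ≈14%
# for five looks); final_rate stays near 5%
```

The stop-at-first-significance strategy rejects far more often than the single pre-planned check, even though no true effect exists.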

Relative Uplift vs Absolute Uplift

Absolute uplift = variant rate − control rate. Easy to understand but ignores the base rate. Relative uplift = (variant − control) / control × 100. Better for comparing across different baselines. A 1% absolute improvement on a 2% baseline is a 50% relative improvement — massive. The same 1% absolute improvement on a 50% baseline is only 2% relative — modest. Always report both; misleading results often arise from reporting only one.
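The two measures are one line each in code (`uplifts` is an illustrative helper, not part of the calculator):

```python
def uplifts(p_control, p_variant):
    absolute = p_variant - p_control                      # in proportion units
    relative = (p_variant - p_control) / p_control * 100  # in percent
    return absolute, relative

abs_up, rel_up = uplifts(0.05, 0.058)
# 5.0% → 5.8%: 0.8 percentage points absolute, 16% relative
```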

Multiple Variants and Bonferroni Correction

When testing A vs B, C, D simultaneously (A/B/C/D test or multivariate test), each additional comparison inflates your false positive rate. With 3 variants tested against control (3 tests), the probability of at least one false positive at α=0.05 per test is 1−(0.95)³ = 14.3%. The Bonferroni correction divides α by the number of comparisons: for 3 tests, use α=0.05/3=0.017 per test. This is conservative; the Holm-Bonferroni step-down method is slightly more powerful.
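Both corrections are straightforward to implement. The sketch below uses hypothetical p-values (not from any real test) chosen to show a case where Holm-Bonferroni rejects more hypotheses than plain Bonferroni:

```python
def bonferroni(p_values, alpha=0.05):
    # Compare every p-value against the single threshold alpha / m
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def holm_bonferroni(p_values, alpha=0.05):
    # Step-down: test p-values in ascending order against alpha / (m - rank),
    # stopping at the first non-rejection
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

pvals = [0.010, 0.020, 0.040]  # three variants vs control (hypothetical)
# bonferroni(pvals)      → [True, False, False] (threshold ≈0.0167 for all)
# holm_bonferroni(pvals) → [True, True, True]   (thresholds ≈0.0167, 0.025, 0.05)
```

Both methods control the family-wise error rate at alpha; Holm's sequentially relaxed thresholds are what make it uniformly more powerful.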

Frequently Asked Questions
What does statistical significance mean in an A/B test?
Statistical significance means the observed difference in conversion rates is unlikely to be due to random chance. At 95% confidence, there is only a 5% probability that you would see this large a difference if there were no true underlying effect. Measured by p-value: p<0.05 = significant at 95%, p<0.01 = significant at 99%. A significant result does not automatically mean you should implement the change; consider effect size, practical significance, and business context too.
How do I calculate whether my A/B test result is significant?
Use a two-proportion z-test: (1) Compute conversion rates p₁=x₁/n₁ and p₂=x₂/n₂. (2) Pooled proportion p̂=(x₁+x₂)/(n₁+n₂). (3) SE=√(p̂(1−p̂)(1/n₁+1/n₂)). (4) z=(p₂−p₁)/SE. (5) Convert z to a p-value via the standard normal distribution. For two-tailed at 95%: significant if |z|>1.960. This calculator handles all steps automatically.
How many visitors do I need per variant?
n per variant = (zα/2+zβ)² × (p₁(1−p₁)+p₂(1−p₂)) / (p₂−p₁)². For baseline 5%, 20% relative MDE (to 6%), 95% CI, 80% power: n=8,165 per variant. Higher confidence, higher power, or smaller MDE all increase n. Always use the sample size planner tab above BEFORE starting the test, not after.
What is the Minimum Detectable Effect (MDE)?
MDE is the smallest relative lift you want to reliably detect. Example: if baseline is 5% and you want to detect a 20% relative improvement, MDE=20% means you aim to detect a change from 5% to 6%. Choose MDE based on the smallest business-meaningful improvement: if a 5% relative lift is too small to be worth implementing, set MDE to at least 5%. Smaller MDE = more data required. Do not set MDE post-hoc based on what you observed.
Should I use a one-tailed or two-tailed test?
Two-tailed asks: is the variant different from control in either direction? Significance at 95%: |z|>1.960. Recommended default, because it also protects against regressions. One-tailed asks: is the variant better than control? Significance at 95%: z>1.645. It requires roughly 20% fewer observations (at 80% power) but ignores negative effects. Best practice: always use two-tailed unless you can guarantee the change cannot possibly hurt. Many companies have been burned by one-tailed tests that showed "significance" while the variant was actually hurting the metric.
What is the peeking problem?
Peeking means checking test results before collecting the full pre-planned sample and stopping when you see significance. Each check inflates the false positive rate. Checking 5 times and stopping at p<0.05 gives a ~14% actual false positive rate instead of 5%. Solutions: (1) Pre-commit to sample size and check once at the end. (2) Use sequential testing methods (mSPRT, always-valid p-values) designed for continuous monitoring. (3) Apply Bonferroni or alpha-spending corrections for multiple planned checks.
What is statistical power, and how much do I need?
Power = 1−β = probability of correctly detecting a true effect. Standard: 80% power means a 20% chance of missing a real improvement. Higher power needs more data: 90% power requires ~33% more visitors than 80% power. Low power wastes resources on tests that have little chance of detecting real improvements. Set power during test design; running an underpowered test means you are likely to miss real winners.
What is the difference between absolute and relative uplift?
Absolute uplift = variant rate − control rate. Example: 5.8%−5.0% = 0.8 percentage points. Relative uplift = (5.8−5.0)/5.0×100 = 16% relative improvement. Always report both. Relative uplift is more meaningful for comparing across different baselines: 0.8% absolute is a 16% relative improvement on a 5% base but only 1.6% relative on a 50% base. Misleading reporting often uses whichever number looks bigger.
How long should I run an A/B test?
Run time = required n per variant / daily traffic per variant. Always run for at least 1–2 full business cycles (7–14 days minimum) to capture day-of-week effects. Some products have strong weekly seasonality (weekdays vs weekends) that can distort short tests. Do not stop early even if the test looks significant. Minimum recommendation from most CRO practitioners: 7 days regardless of when statistical significance is reached.
What does an inconclusive result mean?
An inconclusive result means you failed to reject the null hypothesis, but it does NOT mean control and variant are identical. It means your data does not provide sufficient evidence to conclude a true difference exists at your chosen confidence level. Options: (1) Collect more data (check your power). (2) Reconsider the MDE: was the effect too small to detect with your traffic? (3) Investigate whether the test was contaminated. (4) If the test ran long enough with good power, declare the change neutral and do not implement.
Should I use frequentist or Bayesian A/B testing?
Frequentist (this calculator): uses p-values and pre-specified sample size. Standard in most companies. Clear stopping rules but requires pre-commitment. Bayesian: calculates P(variant > control) directly, allows flexible stopping, gives more intuitive results, but requires prior beliefs and is harder to implement correctly. Both are valid. Frequentist with good study design (this calculator) is the most widely understood and replicated approach. For continuous monitoring with no fixed sample size, Bayesian or sequential frequentist methods are better choices.
How do I handle multiple variants (A/B/n tests)?
When testing multiple variants (A vs B, C, D), divide α by the number of comparisons. For 3 variants vs control (3 tests) at 95% CI: α=0.05/3=0.017 per test → z must exceed 2.39. This prevents inflation of the family-wise false positive rate: each test must be more stringent to maintain the overall 5% error rate. Alternatively, use the Holm-Bonferroni sequential step-down method, which is more powerful than the simple Bonferroni correction.