Statistics calculators give researchers, analysts, and students the numbers behind data, but statistics is also a field where systematic misinterpretation is routine even among trained professionals. In one JAMA survey, 88% of medical residents expressed confidence in their understanding of p-values, yet 100% misinterpreted them. The p-value is not the probability the null hypothesis is true. Statistical significance is not practical significance. And dividing by N instead of N−1 when calculating sample standard deviation produces a biased estimate that understates the true population variance.
Statistics is both a discipline and, when the underlying concepts are unclear, a reliable source of systematic error. Standard deviation has two versions (population and sample) that differ by whether you divide by N or N−1, and the N−1 correction exists for a specific mathematical reason that matters for small samples. P-values measure one very specific thing and are routinely misinterpreted as measuring something else entirely. And sample size has a quadratic relationship with precision: halving your margin of error requires four times as many observations. The percentile calculator, quartile calculator, and sample size calculator cover the mechanical calculations; this page covers the conceptual context that makes those calculations interpretable.
Standard deviation measures how spread out data values are from the mean. Two versions exist. Population standard deviation (σ) divides the sum of squared deviations by N, the full count of data points; this is correct when you have every member of the population you care about. Sample standard deviation (s) divides by N−1; this is used when your data is a sample drawn from a larger population and you want to estimate the population's standard deviation. The N−1 denominator is Bessel's correction, named after Friedrich Bessel. The reason: the sample mean is calculated from the same data points, so those points are systematically closer to their own sample mean than they would be to the true population mean. Dividing by N therefore produces a variance that systematically underestimates the true population variance; dividing by N−1 corrects the bias. For large samples (N = 100), the difference is about 1% and negligible. For small samples (N = 5), the variance from dividing by N is 20% smaller than the variance from dividing by N−1, a meaningful bias.
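A minimal sketch of that gap using Python's standard library (`statistics.pstdev` divides by N, `statistics.stdev` by N−1); the five-point dataset is invented for illustration:

```python
import statistics

# Hypothetical five-point sample; small N makes Bessel's correction visible.
data = [4.0, 7.0, 9.0, 12.0, 18.0]

pop_sd = statistics.pstdev(data)   # divides squared deviations by N
samp_sd = statistics.stdev(data)   # divides by N-1 (Bessel's correction)

print(f"population SD (/N):   {pop_sd:.3f}")
print(f"sample SD (/(N-1)):   {samp_sd:.3f}")
# Variance ratio is (N-1)/N = 4/5: the /N variance is 20% smaller,
# and the /N standard deviation is sqrt(4/5), about 10.6% smaller.
print(f"variance ratio: {pop_sd**2 / samp_sd**2:.2f}")  # -> 0.80
```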
A p-value is the probability of observing data as extreme as or more extreme than your result, assuming the null hypothesis is true. Nothing more, nothing less. P = 0.04 means: if there were truly no effect (the null hypothesis), there would be only a 4% chance of seeing data this extreme or more extreme purely by chance. The most common error is treating the p-value as the probability the null hypothesis is true, or the probability the result is real; this is the "inverse probability fallacy," and those are completely different quantities. A p = 0.04 does not mean there is a 96% chance the result is real. It means the data would be observed by chance only 4% of the time if there were no effect.
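To make that definition concrete, here is a hedged simulation sketch: under a true null (two groups of 20 drawn from the same normal distribution), we count how often random sampling alone produces a mean difference at least as extreme as a hypothetical observed one. The group size, noise model, and observed difference of 0.65 are invented for illustration:

```python
import random
import statistics

random.seed(42)

N_PER_GROUP = 20
N_SIMS = 20_000
OBSERVED = 0.65  # hypothetical observed mean difference

def null_mean_diff() -> float:
    """Mean difference between two groups when the null is exactly true."""
    a = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    b = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    return statistics.mean(a) - statistics.mean(b)

# Two-sided p-value: fraction of null-world results at least this extreme.
extreme = sum(abs(null_mean_diff()) >= OBSERVED for _ in range(N_SIMS))
print(f"simulated p ~ {extreme / N_SIMS:.3f}")  # ~ 0.04
# This estimates P(data this extreme | null true),
# NOT P(null true | data) -- the inverse probability fallacy.
```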
Statistical significance (p < 0.05) tells you a result is unlikely to be noise. Effect size (Cohen's d, r, eta-squared) tells you how large the effect is. With large enough samples, even trivially small effects become statistically significant. A study with n = 500 participants finding p < 0.001 for a meditation intervention sounds compelling. If Cohen's d = 0.10, the standardised effect is 0.1 standard deviations; on a 100-point scale with an SD of 15, that is a 1.5-point change. Statistically significant, practically negligible. Cohen's benchmarks: small effect = d ≈ 0.2, medium = d ≈ 0.5, large = d ≈ 0.8. The APA Publication Manual now requires effect sizes to be reported alongside p-values for this reason. A p-value without an effect size is an incomplete statistical report.
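Computing an effect size is mechanical once the groups are in hand. Here is a minimal Cohen's d sketch for two independent groups using the pooled-SD formula; the test scores are invented:

```python
import statistics

def cohens_d(group_a: list, group_b: list) -> float:
    """Cohen's d for independent groups, using the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)  # N-1
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical 100-point test scores for two small groups.
control   = [62, 71, 55, 68, 74, 59, 66, 70, 64, 58]
treatment = [64, 72, 57, 69, 75, 61, 67, 71, 65, 60]

print(f"d = {cohens_d(treatment, control):.2f}")  # ~0.23, "small" by Cohen's benchmarks
```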
P-value misinterpretation is the norm, not the exception, even among experts. A seminal JAMA survey (Goodman 2008) tested medical residents on p-value interpretation: 88% expressed fair to complete confidence in their understanding, and 100% got the interpretation wrong. The most common errors: treating the p-value as the probability the null is true, treating it as the probability the result will replicate, and conflating statistical significance with practical importance. The American Statistical Association issued a formal statement in 2016 clarifying that a p-value below 0.05 does not by itself constitute adequate evidence for a scientific claim. The replication crisis in psychology, medicine, and social science is partly traceable to over-reliance on p < 0.05 without consideration of effect size, study power, and pre-registration.
Cohen’s benchmarks are guidelines for effect interpretation in the absence of domain-specific context. A d = 0.2 is “small” in abstract terms but may be very meaningful if the intervention is cheap and safe. A d = 0.8 may be meaningful in education but insufficient in medical treatment contexts. Always interpret effect size relative to the domain and practical stakes.
| Cohen’s d | Classification | Distribution Overlap | Practical Example |
|---|---|---|---|
| 0.10 | Negligible | ~92% overlap | 1.5-pt change on 100-pt scale (often noise) |
| 0.20 | Small | ~85% overlap | Height difference between 15–16 year olds |
| 0.50 | Medium | ~67% overlap | IQ difference between clerical and semi-skilled workers |
| 0.80 | Large | ~53% overlap | IQ difference between PhD and typical college freshman |
| 1.20 | Very Large | ~37% overlap | Substantial clinically meaningful difference |
| 2.0+ | Huge | <22% overlap | Group differences clearly visible without statistics |
The table below assumes you are estimating a proportion at 95% confidence with maximum variance (p = 0.5). To halve the margin of error, you need four times the sample. This diminishing return is why large surveys require careful cost-benefit analysis: going from ±5% to ±2.5% costs 4× more but delivers only 2× the precision. A minimal sketch of the underlying formula follows the table.
| Sample Size (n) | Margin of Error | To Halve MOE | Notes |
|---|---|---|---|
| 100 | ±9.8% | → need n=400 | Rough estimates only |
| 385 | ±5.0% | → need n=1,537 | Common survey standard |
| 600 | ±4.0% | → need n=2,401 | Political polling minimum |
| 1,067 | ±3.0% | → need n=4,268 | National survey standard |
| 1,537 | ±2.5% | → need n=6,147 | High-precision surveys |
| 9,604 | ±1.0% | → need n=38,416 | Census-level precision |
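Here is that sketch, assuming a proportion estimate at 95% confidence with worst-case p = 0.5 (z ≈ 1.96):

```python
import math

Z_95 = 1.96  # two-sided critical value for 95% confidence
P = 0.5      # worst-case proportion (maximum variance)

def margin_of_error(n: int) -> float:
    """Margin of error for a proportion estimate at 95% confidence."""
    return Z_95 * math.sqrt(P * (1 - P) / n)

def required_n(moe: float) -> int:
    """Sample size needed for a target margin of error (rounded up)."""
    return math.ceil(Z_95**2 * P * (1 - P) / moe**2)

print(f"n=385   -> MOE of +/-{margin_of_error(385):.1%}")  # ~ +/-5.0%
print(f"+/-2.5% -> n={required_n(0.025)}")                 # 1537: 4x the sample
```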
A z-score measures how many standard deviations a data point is from the mean. Positive z-scores are above average; negative are below. Z-scores are used for normalising datasets and for calculating percentile rank from a known distribution; a short conversion sketch follows the table below.
| Z-Score | Percentile Rank | Meaning |
|---|---|---|
| −3.0 | 0.13% | Extreme low — rarer than 1 in 750 |
| −2.0 | 2.28% | Low — bottom 2.3% |
| −1.0 | 15.87% | Below average |
| 0.0 | 50.00% | Exactly at the mean |
| +1.0 | 84.13% | Above average |
| +1.65 | 95.05% | Top 5% threshold (one-tailed) |
| +1.96 | 97.50% | 95% confidence interval boundary |
| +2.0 | 97.72% | Top 2.3% |
| +2.576 | 99.50% | 99% confidence interval boundary |
| +3.0 | 99.87% | Top 0.13% — rare |
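A minimal conversion sketch using `statistics.NormalDist` from the Python standard library (3.8+); the raw score, mean, and SD are invented:

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

def z_score(x: float, mean: float, sd: float) -> float:
    return (x - mean) / sd

# Hypothetical score from a distribution with mean 100 and SD 15.
z = z_score(124.6, mean=100, sd=15)
percentile = std_normal.cdf(z) * 100
print(f"z = {z:.2f}, percentile rank ~ {percentile:.1f}")  # z = 1.64, ~94.9

# Inverse lookup: the z-score marking the top 5% (one-tailed).
print(f"top-5% threshold: z = {std_normal.inv_cdf(0.95):.3f}")  # ~1.645
```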
Statistical significance ≠ practical significance, the most consequential misunderstanding in applied statistics: with large enough samples, any effect, no matter how small, will be statistically significant. A randomised trial with 10,000 participants comparing two teaching methods might find p < 0.0001, massively significant. If the effect is d = 0.05, that is approximately a 0.75-point improvement on a 100-point test (assuming an SD of 15). The finding is real (not noise), replicable, and practically meaningless for most policy decisions. Conversely, a study with n = 30 finding p = 0.08 and d = 0.65 has a genuine medium effect that failed to reach significance because the study was underpowered; the failure to reach p < 0.05 is a failure of sample size, not a failure of the effect. The ASA and APA both now call for effect sizes alongside p-values precisely because p-values alone cannot distinguish "real but tiny" from "real and meaningful."
Use the standard deviation calculator for any dataset where you want to understand spread. Enter the data and specify whether you want population standard deviation (you have all the data you care about) or sample standard deviation (your data is a sample and you want to estimate the population). For most research and survey contexts, use sample standard deviation (N-1). Use the ascending order calculator as a first step before calculating median, quartiles, and percentiles — all of these require sorted data. Use the percentile calculator to find where a specific value ranks in a distribution, and the quartile calculator to find Q1, Q2, Q3, and IQR for box plot construction and outlier detection.
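As a companion to the quartile calculator, here is a minimal sketch of the 1.5×IQR outlier rule using `statistics.quantiles`; the dataset is invented, and note that quartile conventions vary (`method="inclusive"` matches the hinge-style definition many calculators use):

```python
import statistics

# Hypothetical dataset with one suspiciously large value.
data = sorted([12, 15, 14, 10, 18, 16, 13, 11, 17, 42])

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
outliers = [x for x in data if x < low_fence or x > high_fence]
print(f"fences: [{low_fence}, {high_fence}], outliers: {outliers}")  # [42]
```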
When interpreting any p-value result: read the p-value correctly (the probability of data this extreme given that the null is true), report the effect size alongside it (Cohen's d for means, r for correlations, odds ratio for categorical outcomes), and consider study power. A non-significant p-value does not mean "no effect"; it means "insufficient evidence to reject the null at this significance level." An underpowered study can miss a real and meaningful effect. Always run a power analysis before data collection to ensure your sample size is large enough to detect the effect size you consider meaningful.
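A minimal power-analysis sketch using the normal approximation for a two-sided, two-sample comparison (n per group ≈ 2(z₁₋α/₂ + z₁₋β)² / d²); the exact t-based answer is slightly larger for small samples:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n to detect effect size d (two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # 0.842 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d**2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: n ~ {n_per_group(d)} per group")
# d = 0.2 -> 393, d = 0.5 -> 63, d = 0.8 -> 25 (normal approximation)
```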
Use the sample size calculator before designing any survey or study. Enter your required confidence level (95% is standard), desired margin of error (5% for general surveys, 3% for policy-relevant research), and expected proportion (use 0.5 if unknown — this maximises the required sample). Remember the quadratic relationship: cutting margin of error in half requires four times the sample. Population size has minimal impact on required sample size for large populations (a nationally representative sample of 385 achieves ±5% margin of error whether the population is 100,000 or 100 million) — this counterintuitive result confuses many people who assume larger populations always require larger samples.
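The population-size claim can be checked with the finite population correction, n_adj = n₀ / (1 + (n₀ − 1) / N); a minimal sketch:

```python
import math

def fpc_adjusted_n(n0: int, population: int) -> int:
    """Apply the finite population correction to an infinite-population n."""
    return math.ceil(n0 / (1 + (n0 - 1) / population))

n0 = 385  # +/-5% at 95% confidence, p = 0.5
for pop in (10_000, 100_000, 100_000_000):
    print(f"population {pop:>11,}: n = {fpc_adjusted_n(n0, pop)}")
# 10,000 -> 371; 100,000 -> 384; 100,000,000 -> 385
```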
Three statistical errors are pervasive across fields. First: p-hacking, running multiple analyses and reporting only the one that crosses p < 0.05. Each additional analysis increases the chance of false positives; with 20 independent tests at α = 0.05, the expected number of false positives is one, and the probability of at least one is about 64% (1 − 0.95²⁰), even if every null hypothesis is true. Second: treating a non-significant result as evidence of no effect, especially from an underpowered study. Absence of evidence is not evidence of absence when the study had insufficient power to detect a meaningful effect. Third: using N instead of N−1 for sample standard deviation on small datasets, which systematically understates spread and biases every downstream calculation, including confidence intervals and t-tests, that depends on the standard deviation estimate.
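The p-hacking arithmetic is easy to verify by simulation; a minimal sketch in which every null hypothesis is true, so each test rejects with probability α regardless of the data model:

```python
import random

random.seed(0)
ALPHA, N_TESTS, N_EXPERIMENTS = 0.05, 20, 100_000

# Under a true null, p-values are uniform on [0, 1], so a uniform draw
# below ALPHA stands in for a "significant" result.
hits = sum(
    any(random.random() < ALPHA for _ in range(N_TESTS))
    for _ in range(N_EXPERIMENTS)
)
print(f"P(at least one false positive) ~ {hits / N_EXPERIMENTS:.3f}")
print(f"analytic: 1 - 0.95**20 = {1 - 0.95**20:.3f}")  # 0.642
```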