What a reliability number actually tells you (and what it doesn't)

A new survey lands in front of you. The reliability tile reads α = 0.85, color-coded green. The instinct is to relax. Reliability is "good," so the scale is fine, the data is fine, and the analysis can move on to the interesting parts.

It is the right instinct, partly. An alpha of 0.85 is a useful signal. It is not a verdict. The number answers a specific, narrow question, and a lot of decisions depend on understanding which question that is.

The narrow question alpha answers

Cronbach's alpha (and split-half reliability, and most of the close cousins ReliCheck reports) measures one thing: how consistently the items in a scale move together within a single administration. When all six items on a "team trust" scale tend to be answered similarly by the same respondent, the scale is internally consistent. When responses to those six items look unrelated, the scale is not.

That property matters. A scale that is not internally consistent cannot be averaged into a meaningful composite, because there is no shared underlying signal for the average to capture. So a low alpha is a real warning. A high alpha says you have cleared that bar.
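
As a rough illustration of what the statistic computes, here is a minimal Python sketch of the standard Cronbach's alpha formula: k/(k-1) times (1 minus the sum of the item variances divided by the variance of the total score). The function name and both small response tables are made up for this example, and the sketch is not meant to describe how ReliCheck computes its numbers.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sum of the sample variances of each item (column).
    item_var_sum = sum(variance(col) for col in zip(*rows))
    # Sample variance of each respondent's total score.
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical data: 5 respondents x 4 items on a 1-5 scale.
# Here respondents answer all four items in a similar way.
consistent = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
# Here the four items show no common pattern across respondents.
unrelated = [
    [4, 2, 5, 1],
    [2, 5, 3, 3],
    [5, 3, 1, 4],
    [1, 4, 4, 5],
    [3, 1, 2, 2],
]

alpha_consistent = cronbach_alpha(consistent)
alpha_unrelated = cronbach_alpha(unrelated)
print(f"consistent items: alpha = {alpha_consistent:.2f}")
print(f"unrelated items:  alpha = {alpha_unrelated:.2f}")
```

The first table yields a high alpha and the second a very low one (alpha can even go negative when items do not covary), which is the "no underlying signal" case described above.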

Three things alpha does not tell you

1. Whether the scale measures the right construct

A high alpha says items hang together. It does not say what they hang together around. A six-item scale labeled "engagement" can produce alpha = 0.91 while actually measuring "how much I like my manager." The label on the scale is a hypothesis. The reliability statistic does not test that hypothesis.

Construct validity, the question of whether a score reflects what the name claims, is built up from separate evidence: correlations with related and unrelated measures, expert review, and predictive checks against outcomes. Reliability sits underneath those tests. It does not replace them.

2. Whether the scale is unidimensional

Alpha can stay respectable even when a scale is measuring two correlated things at once. A "wellbeing" scale that mixes physical and emotional items can produce alpha = 0.82 simply because the two facets are positively related. Summing them into one score then loses information that mattered.

The fix is to read alpha alongside the inter-item correlation matrix and the KMO statistic, which indicates whether the correlations are suitable for factor analysis in the first place. When unidimensionality matters (it usually does for a single composite score), no single number settles the question.
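
To make the two-facet problem concrete, the following Python sketch simulates a six-item "wellbeing" scale with three physical and three emotional items. Everything numeric here is an assumption chosen for illustration: the facet correlation of 0.5, the noise level of 0.7, the sample size of 500, and the random seed. It is a toy simulation, not real survey data.

```python
import random
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cronbach_alpha(rows):
    k = len(rows[0])
    item_var_sum = sum(variance(col) for col in zip(*rows))
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
facet_r = 0.5           # assumed correlation between the two facets
rows = []
for _ in range(500):
    physical = rng.gauss(0, 1)
    emotional = facet_r * physical + (1 - facet_r ** 2) ** 0.5 * rng.gauss(0, 1)
    rows.append(
        [physical + rng.gauss(0, 0.7) for _ in range(3)]     # items 0-2
        + [emotional + rng.gauss(0, 0.7) for _ in range(3)]  # items 3-5
    )

items = list(zip(*rows))
pairs = [(i, j) for i in range(6) for j in range(i + 1, 6)]
# A pair is "within" a facet when both items come from the same half.
within = [pearson(items[i], items[j]) for i, j in pairs if (i < 3) == (j < 3)]
between = [pearson(items[i], items[j]) for i, j in pairs if (i < 3) != (j < 3)]

alpha = cronbach_alpha(rows)
mean_within = mean(within)
mean_between = mean(between)
print(f"alpha = {alpha:.2f}")
print(f"mean r within facets  = {mean_within:.2f}")
print(f"mean r between facets = {mean_between:.2f}")
```

Under these assumptions alpha comes out comfortably high, yet the correlation matrix splits into two visibly different blocks. That block pattern is the signal alpha alone does not surface.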

3. Whether the score will replicate

An alpha computed on 25 respondents is a reliability estimate, not a reliability fact. The same scale could plausibly come back at 0.62 or 0.91 in the next sample of 25. Below about 100 respondents, treat reliability numbers as preliminary and report them with the sample size attached. Above 100, the estimate stabilizes. Above 300, it is roughly trustworthy as a property of the scale in that population.
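
A small simulation can show how much alpha moves from sample to sample. The sketch below draws repeated samples from one hypothetical single-factor population and compares the spread of alpha at n = 25 and n = 300. The population model (six items, noise standard deviation 1.2), the 200 replications, and the seed are all assumptions made for illustration; the exact figures will differ for any real scale.

```python
import random

def sample_variance(xs):
    # Hand-rolled sample variance (n - 1 denominator), kept simple for speed.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(rows):
    k = len(rows[0])
    item_var_sum = sum(sample_variance(col) for col in zip(*rows))
    total_var = sample_variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def simulate_sample(rng, n, k=6, noise_sd=1.2):
    """n respondents answering k items that share one latent trait."""
    rows = []
    for _ in range(n):
        trait = rng.gauss(0, 1)
        rows.append([trait + rng.gauss(0, noise_sd) for _ in range(k)])
    return rows

def alpha_spread(rng, n, replications=200):
    """Return (min, max, standard deviation) of alpha across repeated samples."""
    alphas = [cronbach_alpha(simulate_sample(rng, n)) for _ in range(replications)]
    return min(alphas), max(alphas), sample_variance(alphas) ** 0.5

rng = random.Random(42)  # fixed seed so the sketch is repeatable
small = alpha_spread(rng, n=25)
large = alpha_spread(rng, n=300)
print(f"n = 25:  alpha from {small[0]:.2f} to {small[1]:.2f} (sd {small[2]:.3f})")
print(f"n = 300: alpha from {large[0]:.2f} to {large[1]:.2f} (sd {large[2]:.3f})")
```

The population is identical in both runs, so any difference in spread comes from sample size alone.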

Treat reliability as sample-specific, never as portable. An instrument that produced alpha = 0.88 with full-time employees may produce 0.65 with contract workers or 0.71 with a translated version. Reliability is a property of scores in a sample, not a property of the instrument in the abstract.

How to read a reliability number well

  1. Note the value and the sample size in the same breath. "Alpha = 0.85, n = 142" reads differently than "Alpha = 0.85, n = 22."
  2. Check item-total correlations next. A high alpha with one weak or negatively correlated item often means the alpha is being held up by everything except that item, which is worth knowing.
  3. Look at the inter-item correlation matrix. Healthy single-construct scales show correlations roughly in the 0.30 to 0.70 range. Correlations clustered above 0.85 across the matrix suggest redundancy, not consistency.
  4. Pair reliability with at least one validity check before claiming the scale measures what its name says. Convergent or criterion correlations from a pilot are usually enough.
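
Steps 2 and 3 can be sketched in a few lines of Python. The example below computes corrected item-total correlations (each item against the sum of the other items) and the pairwise inter-item correlations for a small made-up data set. The response table is hypothetical, the 0.30 cutoff for a weak item is a common rule of thumb assumed here rather than a fixed standard, and the 0.85 cutoff is taken from step 3.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: 8 respondents x 4 items on a 1-5 scale.
# The last item (index 3) is deliberately unrelated to the others.
rows = [
    [5, 4, 5, 2],
    [4, 4, 5, 3],
    [2, 1, 2, 4],
    [3, 3, 3, 1],
    [1, 2, 1, 3],
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 5, 4, 3],
]
items = [list(col) for col in zip(*rows)]
k = len(items)

# Step 2: corrected item-total correlation (item vs. sum of the other items).
item_total = []
for i in range(k):
    rest = [sum(r) - r[i] for r in rows]
    item_total.append(pearson(items[i], rest))
weak_items = [i for i, r in enumerate(item_total) if r < 0.30]

# Step 3: pairwise inter-item correlations, flagging possible redundancy.
pairwise = {
    (i, j): pearson(items[i], items[j])
    for i in range(k)
    for j in range(i + 1, k)
}
high_pairs = [pair for pair, r in pairwise.items() if r > 0.85]

for i, r in enumerate(item_total):
    print(f"item {i}: corrected item-total r = {r:.2f}")
print("weak items (r < 0.30):", weak_items)
print("pairs with r > 0.85:", high_pairs)
```

With only eight respondents the correlations themselves are noisy, so this is a reading aid for the checklist above, not a substitute for an adequately sized pilot.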

The bottom line

Reliability is a useful, narrow check. A high reliability number rules out one specific kind of bad scale (one that is not measuring anything consistently). It does not rule out the other kinds: scales measuring the wrong construct, scales secretly measuring two things at once, or scales whose nice numbers came from a sample too small to generalize from.

For the longer treatment of every flavor of reliability and validity, see the Reliability guide. For the formulas behind the numbers, see the Methodology page.