Reliability guide

Reliability and validity, in plain language

A working researcher's guide to the statistics that decide whether your survey results mean anything. What each one is, when to use it, and how to read the numbers ReliCheck shows you.

What this guide covers

This is a practical guide for anyone whose work depends on a survey number being defensible: instructors writing a course evaluation report, an HR team explaining engagement scores, an evaluator submitting outcomes to a funder. We'll move from short definitions to specific tests to the gotchas that catch even careful researchers off guard.

Reliability: the short version

Reliability is whether your survey behaves like a steady ruler: if you measured the same thing again, would you get a similar answer? A steady ruler is the first thing you need before you can trust a measurement. Reliability does not tell you whether you measured the right thing, only whether you measured it consistently.

Plain language

A reliable scale is a steady ruler. A ruler can be steady and still be the wrong ruler for the job. Reliability is necessary, but it is not sufficient.

Validity: the short version

Validity is whether your measurement reflects what you intended to measure. A scale labeled "engagement" needs to actually capture engagement, not job satisfaction or general optimism. Validity is harder to demonstrate than reliability and usually requires more than one piece of evidence.

Types of reliability

Internal consistency

How well the items in a single scale agree with each other. Reported as Cronbach's alpha or split-half. Best when your items are supposed to measure the same underlying idea.

Test-retest reliability

How stable the score is when the same respondents take the survey again, days or weeks later. Reported as a correlation between time-1 and time-2 scores.

Parallel-forms reliability

How closely two equivalent versions of a scale produce the same scores when given to the same respondents. Used when you need to give different forms to different people.

Inter-rater reliability

How much different raters agree when scoring open responses, observations, or behaviors. Reported as ICC, Cohen's kappa, or weighted kappa.

Types of validity

Content validity

Whether the items in your scale fully cover the construct. Usually demonstrated through expert review and item mapping.

Construct validity

Whether the scale relates to other measures the way theory predicts. The bedrock of psychometric defensibility.

Criterion validity

Whether the scale predicts an outcome you actually care about: retention, performance, change over time.

Face validity

Whether the items look like they measure the construct. The weakest form, but politically important when stakeholders read your items.

Which statistic applies?

Pick the statistic that matches what you actually need to defend. A course evaluation needs internal consistency on each subscale and content validity for the topics covered. A pre/post outcome study needs test-retest reliability, so that observed change can be separated from measurement noise. An employee engagement instrument needs both internal consistency and criterion validity against productivity or retention outcomes.

Reading a reliability number

Cronbach's alpha and split-half

Conventional thresholds, with the usual caveat that thresholds are conventions, not laws:

  • α ≥ 0.80: strong; safe to report.
  • 0.70 ≤ α < 0.80: acceptable; report with confidence intervals.
  • 0.60 ≤ α < 0.70: weak; investigate items, sample size, dimensionality.
  • α < 0.60: unreliable; do not report a single composite score.
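As a minimal sketch of where the number comes from, the standard Cronbach's alpha formula (k / (k − 1) × (1 − sum of item variances / variance of total scores)) can be computed and mapped onto the tiers above. The function names and the sample responses are illustrative assumptions, not ReliCheck's API.

```python
from statistics import variance

def cronbach_alpha(rows):
    """rows: one list of k numeric item scores per respondent (complete cases only)."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item, then of each respondent's summed score.
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_tier(alpha):
    """Map alpha onto the conventional bands listed above."""
    if alpha >= 0.80:
        return "strong"
    if alpha >= 0.70:
        return "acceptable"
    if alpha >= 0.60:
        return "weak"
    return "unreliable"

# Hypothetical 1-to-5 Likert responses: six respondents, four items.
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3), alpha_tier(alpha))
```

Six respondents is far too few for a reportable estimate; the data is only there to make the arithmetic visible.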

KMO sampling adequacy

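The Kaiser-Meyer-Olkin (KMO) measure tells you whether the items in your sample share enough common variance to be analyzed together. It compares the correlations between items with their partial correlations. A KMO of at least 0.70 is what you want for a defensible scale.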

Item-total correlations

Each item's correlation with the rest of the scale (corrected for the item's own contribution). An item with a corrected item-total correlation below 0.20 is a weak contributor and should be reviewed.
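A short sketch of the correction, assuming complete-case numeric responses: each item is correlated with the sum of the other items, so the item does not inflate its own correlation. The helper names, the sample data, and the way the 0.20 cutoff is applied are illustrative assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corrected_item_total(rows):
    """Correlate each item with the total of the remaining items."""
    k = len(rows[0])
    result = []
    for i in range(k):
        item = [r[i] for r in rows]
        rest = [sum(r) - r[i] for r in rows]  # total score minus this item
        result.append(pearson(item, rest))
    return result

# Hypothetical responses: the fifth item runs against the other four.
responses = [
    [4, 5, 4, 4, 3],
    [2, 2, 3, 2, 4],
    [5, 5, 5, 4, 2],
    [3, 3, 2, 3, 3],
    [4, 4, 4, 5, 2],
    [1, 2, 1, 2, 4],
]
correlations = corrected_item_total(responses)
flagged = [i for i, r in enumerate(correlations) if r < 0.20]
print([round(r, 2) for r in correlations], flagged)
```

A negative value, as for the last item here, often points to an unflipped reverse-scored item rather than a bad question.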

In the dashboard

You don't have to read the numbers cold. The Reliability Explainer at the top of the Reliability tab summarizes the alpha tier in plain language and names the weakest items by their actual prompt text. The numbers are still there one click below the narration.

McDonald's omega

Cronbach's alpha is the workhorse of reliability reporting, but it assumes every item contributes equally to the underlying construct (the tau-equivalence assumption). Real scales rarely meet that assumption; some items are stronger indicators than others. McDonald's omega is a closely related coefficient that drops the assumption: it estimates reliability under a congeneric model where item loadings can differ. ReliCheck reports both side by side on the Reliability tab.

How to read it

Omega lives on the same 0-to-1 scale as alpha and is interpreted using the same conventional bands. Omega at or above 0.80 is strong; 0.70 to 0.80 is workable; below 0.70 calls for investigation. For most scales alpha and omega land within a couple of hundredths of each other (for example, 0.82 versus 0.84).

When alpha and omega disagree

If omega is noticeably higher than alpha, the scale is fine and a few items just carry more weight than others; alpha is being penalized by the tau-equivalence assumption it cannot meet. If omega is noticeably lower than alpha (rare in practice), the scale may be more multidimensional than alpha gives it credit for; check the factor structure on the Validity tab. The pragmatic move when reporting is to lead with omega when scales are heterogeneous and alpha when reviewers will look for it. Both numbers in the tab let you do either without recomputing.

How the Strength Index uses the gap

The Survey Strength Index's Reliability domain anchors on McDonald's omega when it can be estimated, with Cronbach's alpha as the fallback. Points are banded: omega at or above 0.90, 0.80, 0.70, and 0.60 earns progressively fewer points within the 25-point Reliability allocation. Beyond the base band, the index also reads the gap between alpha and omega as a proxy for how well tau-equivalence holds. When the two coefficients agree, the standard single-factor interpretation is safe. When they differ by more than 0.05, the index deducts a small penalty and adds a "1-factor assumption is mildly strained" signal; above 0.10, the penalty rises and the signal escalates to "items may not load equally on one factor." Surveys whose alpha and omega both clear 0.80 with a gap under 0.05 earn an explicit positive signal noting that reliability holds under the weaker measurement-model assumption.
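The gap rules can be sketched as a small decision function. This is only an illustration of the thresholds described above: the function name is hypothetical, the gap is assumed to be the absolute difference, the wording of the positive signal is a paraphrase, and the penalty sizes are left out because this guide does not state them.

```python
def gap_signal(alpha, omega):
    """Return the signal text implied by the alpha-omega gap, or None."""
    gap = abs(omega - alpha)  # assumption: direction of the gap is ignored
    if gap > 0.10:
        return "items may not load equally on one factor"
    if gap > 0.05:
        return "1-factor assumption is mildly strained"
    if gap < 0.05 and alpha >= 0.80 and omega >= 0.80:
        # Paraphrased positive signal; the exact dashboard wording is not given here.
        return "reliability holds under the weaker measurement-model assumption"
    return None

print(gap_signal(0.82, 0.84))
print(gap_signal(0.70, 0.78))
print(gap_signal(0.60, 0.75))
print(gap_signal(0.72, 0.74))
```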

Standard error of measurement and confidence intervals

A reliability coefficient describes a scale's scores across a sample; it does not say how precisely any one person was measured. The Standard Error of Measurement (SEM) translates the reliability coefficient back into the units of the scale itself, so a reader can answer "how close is this person's observed score to their true score?" SEM is computed as SD multiplied by the square root of (1 minus alpha), where SD is the standard deviation of the summed scale scores across respondents. A scale with alpha 0.85 and a total-score SD of 6.0 has an SEM of 6.0 × √0.15 ≈ 2.32, so an observed score falls within about 1.96 × 2.32 ≈ 4.6 points of the true score roughly 95 percent of the time, assuming normally distributed measurement error.
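The worked example can be reproduced in a few lines. The function name is illustrative, and the 1.96 multiplier assumes normally distributed measurement error.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), in the units of the scale."""
    return sd * sqrt(1 - reliability)

sem = standard_error_of_measurement(sd=6.0, reliability=0.85)
half_width = 1.96 * sem  # half-width of an approximate 95 percent band
print(round(sem, 2), round(half_width, 2))  # 2.32 4.55
```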

ReliCheck reports SEM inline under the alpha gauge on the Reliability tab so the precision number is one glance away from the reliability number it depends on.

Confidence intervals on alpha and omega

Alpha and omega are sample estimates of population reliability, so they carry sampling error. ReliCheck reports a 95 percent confidence interval under each gauge so a reviewer sees the band of plausible population values rather than a single point estimate.

The alpha CI uses the Feldt-Woodruff method, which inverts an F-distribution test on the ratio of error to total variance. Each bound takes the form 1 minus (1 minus alpha) times an F critical value at the chosen significance level: the upper-tail critical value gives the lower bound, and the lower-tail critical value gives the upper bound. The implementation works on any complete-case Likert matrix with at least two items and two respondents.
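Written out, the commonly cited Feldt form of the interval looks like the following, where n is the number of respondents, k the number of items, γ the significance level (0.05 for a 95 percent interval), and F with subscripts p; d1, d2 the p-th quantile of the F distribution with d1 and d2 degrees of freedom. That ReliCheck uses exactly these degrees of freedom is an assumption based on the standard derivation.

```latex
\[
\hat{\alpha}_{\text{lower}} = 1 - (1 - \hat{\alpha})\, F_{1-\gamma/2;\; n-1,\; (n-1)(k-1)},
\qquad
\hat{\alpha}_{\text{upper}} = 1 - (1 - \hat{\alpha})\, F_{\gamma/2;\; n-1,\; (n-1)(k-1)}
\]
```

Because the upper-tail quantile is the larger of the two, it subtracts more from 1 and so produces the lower bound.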

The omega CI uses a non-parametric bootstrap percentile method. ReliCheck resamples respondents with replacement 100 times, recomputes omega on each resample, and reports the 2.5th and 97.5th percentiles. Because each resample refits a one-factor principal axis factoring (PAF) model, the computation is small but not free, so the omega CI is gated to surveys with at least 30 respondents and 3 items. Below that threshold the dashboard reads "95% CI needs n at least 30" so users know why the band is absent rather than seeing a blank.
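The resampling loop can be sketched as follows. To keep the example self-contained, Cronbach's alpha stands in for omega as the statistic, since omega would need a factor-model fit on every resample; the function names, the seed, and the synthetic data are all assumptions for illustration.

```python
import random
from statistics import quantiles, variance

def cronbach_alpha(rows):
    """Stand-in statistic; the guide's method recomputes omega here instead."""
    k = len(rows[0])
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def bootstrap_percentile_ci(rows, statistic, n_resamples=100, seed=0):
    """95 percent percentile CI from resampling respondents with replacement."""
    if len(rows) < 30 or len(rows[0]) < 3:
        raise ValueError("95% CI needs n at least 30 and at least 3 items")
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_resamples):
        sample = [rows[rng.randrange(n)] for _ in range(n)]  # whole respondents
        estimates.append(statistic(sample))
    cuts = quantiles(estimates, n=40, method="inclusive")  # steps of 2.5 percent
    return cuts[0], cuts[-1]  # 2.5th and 97.5th percentiles

# Synthetic 1-to-5 responses: a latent level per respondent plus item noise.
data_rng = random.Random(42)
rows = []
for _ in range(40):
    level = data_rng.randint(1, 5)
    rows.append([min(5, max(1, level + data_rng.choice([-1, 0, 0, 1]))) for _ in range(4)])

lower, upper = bootstrap_percentile_ci(rows, cronbach_alpha)
print(round(lower, 3), round(upper, 3))
```

Resampling whole respondents, rather than individual answers, keeps each person's item pattern intact, which is what the coefficient depends on.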

Per-construct reliability

Most useful surveys measure more than one thing. A workplace climate survey might cover engagement, autonomy, manager support, and growth opportunities, each with three to six items. Whole-scale alpha lumps them together and reports one number for the instrument, which is rarely what reviewers want to see. The honest question is whether each construct holds together on its own.

ReliCheck reports per-construct alpha and omega for every construct tag that appears on the Likert items. The Construct Mapper AI can propose a grouping, or the analyst can tag each item by hand in the Questions builder. Once items carry a construct field, the Reliability page's Item statistics sub-tab adds a Per-construct reliability table with one row per construct: items, alpha, omega, and a note. Constructs with fewer than two items are listed but not scored; constructs with two items show alpha but not omega; constructs with three or more items show both. A KPI strip at the top of the table names the strongest subscale and compares each result to the whole-scale alpha so a reviewer can see at a glance whether the instrument's reliability is driven by one strong construct or spread evenly across the survey.

Test-retest reliability

Alpha, omega, and split-half all describe internal consistency within a single administration. Test-retest answers a different question: when the same respondents take the survey twice, do they get similar scores? Stability across time is the prerequisite for any instrument used to track change.

ReliCheck pairs respondents across two waves of the same survey. The analyst picks the wave column (a single-choice question whose answer marks Time 1 vs Time 2, or a column from a longitudinal upload) and the respondent ID column. The analyzer pairs every complete-case respondent who appears in both waves and reports two coefficients: paired Pearson r (rank-order stability) and ICC(3,1), the two-way mixed-effects intraclass correlation that psychometricians cite as the formal test-retest coefficient.
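A minimal sketch of the pairing and the two coefficients, assuming one total score per respondent per wave. ICC(3,1) is computed in its consistency form from two-way ANOVA mean squares; the function names, IDs, and scores are hypothetical.

```python
from math import sqrt

def pair_waves(wave1, wave2):
    """Keep only respondent IDs present in both waves (IDs must match exactly)."""
    ids = sorted(set(wave1) & set(wave2))
    return [wave1[i] for i in ids], [wave2[i] for i in ids]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def icc_3_1(x, y):
    """ICC(3,1), consistency form, for two waves of paired scores."""
    n, k = len(x), 2
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(x, y)]
    col_means = [sum(x) / n, sum(y) / n]
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between respondents
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between waves
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

wave1 = {"r01": 10, "r02": 12, "r03": 15, "r04": 18, "r05": 20, "r06": 22, "r07": 16}
wave2 = {"r01": 11, "r02": 13, "r03": 14, "r04": 19, "r05": 19, "r06": 24, "r7": 17}

t1, t2 = pair_waves(wave1, wave2)  # "r07" and "r7" do not match, so that pair is lost
r, icc = pearson_r(t1, t2), icc_3_1(t1, t2)
print(len(t1), round(r, 3), round(icc, 3))
```

In the consistency form, a uniform shift between waves (everyone scoring two points higher, say) does not lower the coefficient, because it is absorbed by the between-waves term.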

How to read the numbers

The top two bands match those used for alpha and omega; the lower cutoff sits at 0.50 rather than 0.60. ICC at or above 0.80 is strong stability; the instrument is safe to use for individual change tracking. 0.70 to 0.80 is workable; group-level change tracking is reliable, individual scores carry some noise. 0.50 to 0.70 is modest; use group means with confidence intervals and avoid high-stakes individual decisions. Below 0.50 the instrument is too noisy for change interpretation; revise the items or confirm the respondent ID column paired correctly.

When constructs are tagged

The dashboard adds a per-construct breakdown: ICC and paired r for each subscale, computed on the column-sliced subset of the response matrix that covers only that construct's items. A multi-construct survey can show very different stability profiles across constructs, which a global ICC would hide.

Common reasons test-retest comes in low

Three frequent causes of a weak ICC have nothing to do with the items: respondent IDs that did not match across waves (a typo on Wave 2 makes the pair miss), too long a gap between waves so real change shows up as instability, or too small a sample to estimate ICC reliably. Check pairing before blaming the instrument.

Common pitfalls

Treating alpha as proof of validity

A high alpha tells you the items hang together. It does not tell you they measure what you claim. Validity needs separate evidence.

Reverse-scored items left unflagged

If a "negatively worded" item is included without flipping its scoring, alpha tanks and the scale looks broken when it is not. Always show which items were reverse-scored.
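The effect is easy to demonstrate. In this sketch the third item is negatively worded; flipping it with (scale minimum + scale maximum − value) restores the scale. The function names and responses are illustrative assumptions.

```python
from statistics import variance

def cronbach_alpha(rows):
    k = len(rows[0])
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def reverse_score(value, scale_min=1, scale_max=5):
    """Flip a response on a bounded scale: 1 becomes 5, 2 becomes 4, and so on."""
    return scale_min + scale_max - value

REVERSED_ITEMS = {2}  # zero-based index of the negatively worded item

raw = [
    [4, 5, 2, 4],
    [2, 2, 3, 2],
    [5, 5, 1, 4],
    [3, 3, 4, 3],
    [4, 4, 2, 5],
    [1, 2, 5, 2],
]
fixed = [
    [reverse_score(v) if i in REVERSED_ITEMS else v for i, v in enumerate(row)]
    for row in raw
]
alpha_raw, alpha_fixed = cronbach_alpha(raw), cronbach_alpha(fixed)
print(round(alpha_raw, 2), round(alpha_fixed, 2))
```

Keeping the reversed item indexes in one named set makes it straightforward to list them in the report.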

Chasing alpha by deleting items

Removing items to push alpha higher narrows your construct and breaks content validity. The remaining items might be highly consistent and also highly redundant.

Reporting reliability from a tiny sample

Alpha estimates from fewer than 30 respondents are unstable. Report the respondent count alongside the value, and treat any alpha computed from fewer than 30 respondents as provisional.

Assuming reliability transfers

A scale validated with college students may not be reliable in a workplace cohort. Re-check internal consistency in your population every time you use a scale.

What ReliCheck shows you, and what it does not

ReliCheck reports internal consistency (Cronbach's alpha, McDonald's omega, split-half with Spearman-Brown correction, standard error of measurement), KMO sampling adequacy, item-total correlations, alpha-if-deleted, inter-item correlations, and per-item descriptives. It surfaces sample-size warnings and missing-data patterns automatically.

What ReliCheck does not do: tell you whether your construct is the right one, whether your sampling frame is appropriate, or whether your interpretation is sound. Reliability is a check on the measurement; validity is a judgment on the meaning. The numbers help; they do not replace the analyst.

From numbers to decisions

After reliability is computed, ReliCheck runs three more checks on the dataset itself. The Response Quality Check flags straight-lining, duplicate response vectors, and very short open-ended answers. The "Can I Use These Results?" Advisor bundles reliability, response quality, and sample size into a four-tier verdict: yes, yes with cautions, use with care, or not yet.
