Why it matters

Two questions every survey result has to answer.

Before you trust a number on a slide, ask: are people answering consistently, and is the survey actually measuring what we think it is? Those two questions are the whole job of reliability and validity, and they're the difference between a finding you can defend and a guess wearing a percentage sign.

When the numbers can't be trusted

Bad survey data doesn't just sit there. It makes decisions for you.

The trouble with unreliable or invalid data isn't that it looks suspicious. It looks fine. It moves through the deck, into the strategy, into next quarter's budget. Only later does anyone realize the conclusion was built on noise.

Scenario one

The "engagement is up" report

An HR team announces a six-point lift in employee engagement. The CEO praises the wellness initiative. Six months later, the best engineers leave anyway.

The survey was measuring how comfortable people felt being honest, not how engaged they actually were.

The cost: a year of investment in the wrong intervention, and a leadership team that no longer believes the people data.

Scenario two

The brand tracker that swings

A marketing team reports a 14-point NPS jump after a campaign. Leadership doubles the budget. Next quarter, NPS drops 19 points, with no campaign change.

The instrument bounced because the question wording was tweaked between waves and the sample mix was different. Nothing actually changed in the brand.

The cost: ad spend chasing a phantom signal, and a board that stops trusting any tracker number.

Scenario three

The accreditation rejection

A program submits course-evaluation data to its accreditor. The reviewer asks a single question: "What's the reliability of this instrument?" Nobody knows.

The submission goes back for another year of data collection, this time with proper psychometric reporting.

The cost: a delayed accreditation cycle, faculty re-doing surveys they thought were done, and a director's hardest year on the job.

In plain English

Reliability and validity, defined without the textbook

Two ideas. Each one answers a different question about whether your numbers can carry the weight you're about to put on them.

1. Reliability

Reliability = consistency.

A reliable survey gives you the same answer when nothing real has changed. If you ask the same person the same questions twice in a calm week, the answers should look alike. If two questions are supposed to measure the same thing, they should agree.

The bathroom-scale test. A reliable scale gives you the same weight when you step on it three times in a row. An unreliable scale gives you 162, 168, and 159, and now no number it ever shows you can be trusted.

2. Validity

Validity = measuring the right thing.

A valid survey actually measures what it claims to measure. A "leadership effectiveness" score should reflect leadership effectiveness, not how nice the manager seems on a good Tuesday, or how scared the team is to be honest.

The scale that's heavy by 30 lbs. It's perfectly consistent. Step on it ten times, get the same answer ten times. But every reading is wrong. Reliability without validity is a precise way to be wrong.

The four states your data can be in

Picture where repeated answers to the same question land on a target. The bullseye is the truth.

  • Reliable & valid: tight, on the truth. The goal.
  • Reliable, not valid: consistently wrong. Looks confident, isn't.
  • Valid, not reliable: centered on the truth on average, but you can't trust any single answer.
  • Neither: noise. Decisions made here are decisions made by chance.

What ReliCheck measures for you

The checks that turn raw answers into evidence

You don't need to memorize the names. The platform runs these in the background and tells you, in plain language, where your data is strong and where it isn't.

Reliability checks

Are answers consistent enough to trust?

Internal consistency (Cronbach's α)

If five questions are supposed to measure the same idea, do they actually agree with each other? A score above 0.70 says yes; below 0.50 says you're measuring different things you've been treating as one.
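
Under the hood, α is simple arithmetic on the item variances. A minimal sketch in Python with NumPy; the function and the six-person, five-item response matrix are illustrative, not ReliCheck's implementation:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 6 people, 5 items, each rated 1-5.
responses = np.array([
    [4, 4, 5, 4, 4],
    [2, 3, 2, 2, 3],
    [5, 5, 4, 5, 5],
    [3, 3, 3, 2, 3],
    [4, 5, 4, 4, 4],
    [1, 2, 1, 2, 1],
])
print(round(cronbach_alpha(responses), 2))  # about 0.95: the items move together
```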

Test-retest stability

When the same person answers a week later (and nothing real has changed), do they answer roughly the same? If not, the instrument is reading mood, not the thing you wanted to measure.
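
The check itself is just a correlation between the two waves. A toy example with invented scores:

```python
import numpy as np

# Hypothetical scale scores for the same 8 people, one calm week apart.
wave1 = np.array([3.2, 4.1, 2.8, 4.5, 3.9, 2.5, 4.0, 3.4])
wave2 = np.array([3.4, 4.0, 2.9, 4.4, 3.7, 2.7, 4.1, 3.3])

r = np.corrcoef(wave1, wave2)[0, 1]  # test-retest correlation
print(round(r, 2))  # near 1.0 here: stable; near 0 would mean you're reading mood
```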

Inter-rater agreement

When two reviewers grade the same open-ended response, do they agree? If not, your "themes" are really one analyst's opinion dressed up as data.
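
A standard way to score this is Cohen's kappa, which discounts the agreement two raters would reach by chance. A small sketch with invented labels:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters labelling the same responses."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: both raters independently picking the same label.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = ["pay", "workload", "pay", "culture", "workload", "pay"]
b = ["pay", "workload", "culture", "culture", "workload", "pay"]
print(round(cohens_kappa(a, b), 2))  # 0.75: real agreement, not coincidence
```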

Validity checks

Are we measuring the right thing?

Content validity

Do the questions actually cover the topic, not just the easy parts of it? A "wellbeing" survey that only asks about sleep is missing most of wellbeing.
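
One concrete way to check coverage is to have a small expert panel rate each item's relevance and compute an item-level content validity index (I-CVI): the share of experts calling the item relevant. The items, votes, and the 0.78 rule of thumb below are illustrative:

```python
# Hypothetical: five experts rate each item 1 (relevant to wellbeing) or 0.
ratings = {
    "sleep quality":    [1, 1, 1, 1, 1],
    "sense of purpose": [1, 1, 1, 0, 1],
    "commute length":   [0, 1, 0, 0, 0],
}
for item, votes in ratings.items():
    i_cvi = sum(votes) / len(votes)  # item-level content validity index
    flag = "keep" if i_cvi >= 0.78 else "revise or drop"  # common cut-off
    print(f"{item}: I-CVI = {i_cvi:.2f} -> {flag}")
```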

Construct validity

Are we really measuring "trust in leadership," or just "I had a good week"? We test this by checking that scores move with things they should move with, and don't move with things they shouldn't.
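
In practice that means checking convergent correlations (should be high) against discriminant ones (should be near zero). A sketch on simulated data; every variable name and number here is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated scores: our "trust in leadership" scale, a related measure
# it should track, and an unrelated one it shouldn't.
trust = rng.normal(0, 1, n)
related = 0.7 * trust + rng.normal(0, 0.7, n)  # e.g. confidence-in-manager scale
unrelated = rng.normal(0, 1, n)                # e.g. commute satisfaction

convergent = np.corrcoef(trust, related)[0, 1]      # expect high (~0.7)
discriminant = np.corrcoef(trust, unrelated)[0, 1]  # expect near zero
print(round(convergent, 2), round(discriminant, 2))
```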

Criterion validity

Do survey scores predict the outcome you actually care about, like turnover, repurchase, or course completion? If a high score doesn't connect to a real-world result, the measure isn't doing its job.
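
In its simplest form, this is a correlation between the score and the outcome it's supposed to predict. An invented example, scores and departures alike:

```python
import numpy as np

# Hypothetical: engagement score at survey time, and whether the same
# person had left the company 12 months later (1 = left).
score = np.array([8.5, 6.0, 9.0, 4.5, 7.0, 3.5, 8.0, 5.0])
left = np.array([0, 0, 0, 1, 0, 1, 0, 1])

r = np.corrcoef(score, left)[0, 1]  # point-biserial correlation
print(round(r, 2))  # strongly negative here: low scores precede turnover
```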

Why it matters for your team

The same two questions look different from every chair

Reliability and validity aren't abstract. They're the difference between a finding that survives the next meeting and one that quietly falls apart.

For researchers

Findings that survive peer review, and the ones that don't.

Reviewers don't ask "is this surprising?" They ask "is this well measured?" If you can't report Cronbach's α, McDonald's ω, and convergent validity evidence, the manuscript stalls.

Bad data here looks like: a scale with α = 0.41 that you only catch at the analysis stage, after 18 months of fieldwork.

  • Reliability flags during pilot, before full data collection
  • Item-rest correlations to spot questions dragging the score down (sketched below)
  • Codebooks & method appendices ready for submission
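
The item-rest check from that list, sketched in plain Python. Each item is correlated with the sum of the remaining items; a low or negative value flags a question dragging the scale down. The responses are invented, with item 3 deliberately worded backwards so you can see it surface:

```python
import numpy as np

def item_rest_correlations(items: np.ndarray) -> np.ndarray:
    """Correlate each item with the sum of the *other* items."""
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])

# Hypothetical 1-5 responses; item 3 runs against the other three.
responses = np.array([
    [4, 4, 2, 4],
    [2, 3, 4, 2],
    [5, 5, 1, 5],
    [3, 3, 3, 3],
    [4, 5, 2, 4],
])
print(item_rest_correlations(responses).round(2))  # item 3 comes out negative
```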

For marketers

NPS lifts your CFO will actually believe.

A tracker is only useful if the number means the same thing this quarter as it did last quarter. Different wording, different sample, different mode, and "we're up 12 points" turns into "we changed the ruler."

Bad data here looks like: a brand tracker that bounces 18 points wave to wave, until leadership stops reading it altogether.

  • Question wording locked across waves; deviations flagged
  • Significance testing on every reported lift (see the sketch below)
  • Sample-composition warnings when audiences shift
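
The significance check from that list, sketched. NPS is a difference of proportions, so each wave's score carries a computable standard error; a lift only counts when it clears the combined noise of both waves. All counts below are invented:

```python
import math

def nps_and_se(promoters: int, passives: int, detractors: int):
    """NPS (as a proportion) and its standard error for one wave."""
    n = promoters + passives + detractors
    nps = (promoters - detractors) / n
    var = (promoters + detractors) / n - nps ** 2  # scoring +1 / 0 / -1
    return nps, math.sqrt(var / n)

# Hypothetical waves: is a 14-point lift more than sampling noise?
nps1, se1 = nps_and_se(promoters=180, passives=150, detractors=170)
nps2, se2 = nps_and_se(promoters=210, passives=160, detractors=130)

z = (nps2 - nps1) / math.sqrt(se1 ** 2 + se2 ** 2)
print(f"lift = {100 * (nps2 - nps1):.0f} pts, z = {z:.2f}")
# |z| above ~1.96 suggests a real shift; below it, the lift may be noise.
```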

For People & HR

Engagement scores managers actually trust.

If a pulse score swings 1.2 points between Tuesday and Thursday, no manager will act on it. If a 9/10 engagement reading comes from a team that just lost three engineers, you're measuring social desirability, not engagement.

Bad data here looks like: a "we're doing great" report two weeks before a wave of resignations.

  • Validated engagement scales with published reliability
  • Anonymity thresholds (k-anonymity) so honest answers stay honest (sketched below)
  • Drift detection that flags when a score moves more than the instrument's noise allows
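
The anonymity threshold from that list, sketched. A group's average is reported only when enough people answered that no individual can be singled out; the K_MIN value, field names, and responses are all hypothetical:

```python
K_MIN = 5  # hypothetical anonymity threshold

def safe_rollups(rows: list, group_key: str) -> dict:
    """Average scores per group, suppressing groups smaller than K_MIN."""
    groups: dict = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row["score"])
    return {
        g: (sum(s) / len(s) if len(s) >= K_MIN else "suppressed (n < K_MIN)")
        for g, s in groups.items()
    }

# Hypothetical pulse responses: the two-person team never gets a line item.
rows = ([{"team": "Platform", "score": s} for s in (7, 8, 6, 9, 7)]
        + [{"team": "Design", "score": s} for s in (4, 9)])
print(safe_rollups(rows, "team"))
```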

For education

Course evaluations and accreditation submissions that hold up.

Accreditation reviewers will ask one question about every instrument: what's its reliability? If you can't answer, the cycle restarts. And section-level evaluations swing wildly when the instrument is noisy, making instructor decisions feel arbitrary.

Bad data here looks like: the same instructor scoring 4.7 in one section and 3.2 in another, with no way to tell what's signal and what's noise.

  • Section- and instructor-level reliability rollups
  • Reliability flags when α drops below threshold for a comparison
  • Audit-ready method appendices for accreditation submissions

Run your next survey on data you can defend

ReliCheck runs the reliability and validity checks in the background and shows you, in plain language, where your data is strong and where it isn't.