Writing a new scale is the easy part. Establishing that the scale measures what its name claims and that the score is consistent enough to act on takes more work than most teams plan for. The good news is that the process is repeatable, and even a 30-respondent pilot can carry you most of the way there.
Below is the workflow our team recommends when someone is building a new instrument from scratch (or substantially adapting an existing one for a new population). It assumes you cannot run a 500-respondent calibration study before launch, which is the situation most working teams face.
Step 1: Expert review before respondents see anything
Two to four people who know the construct review every item before any data collection. Their job is to flag items that conflate two ideas, items that lean on jargon, and items where the answer scale does not match the question stem. A 90-minute session with three reviewers will catch more design errors than 200 respondents will.
Capture the review in writing. ReliCheck's AI question reviewer flags the same kinds of issues automatically (double-barreled, leading, vague, double-negative), but the human review catches construct-specific problems that no AI will.
Step 2: Cognitive interviews with five respondents
Sit with five people from the target population, hand them the instrument on the device they will use, and ask them to think aloud as they answer. Five interviews is enough to surface the patterns that will derail a larger sample. Common findings: a word the team thought was universal that is not, a frequency anchor that means different things to different people, a question order that creates a halo effect.
Five interviews is a number from the usability research literature, not a guess. With a homogeneous target population, five participants surface the great majority of comprehension problems. Heterogeneous populations need more.
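The usability figure is usually traced to a simple problem-discovery model: if each participant has probability p of hitting a given problem, n participants are expected to surface a share 1 - (1 - p)^n of the problems. The sketch below works through that arithmetic. The value p = 0.31 is the average commonly cited from Nielsen and Landauer's usability studies and is an assumption here; nothing in this workflow establishes the right value for cognitive interviews on a survey. The function name is hypothetical.

```python
def share_of_problems_found(n_participants, p_detect=0.31):
    """Expected share of problems surfaced by n participants.

    p_detect is the assumed chance that one participant hits a given problem.
    """
    return 1 - (1 - p_detect) ** n_participants


# Homogeneous population (assumed p = 0.31) versus a more varied one,
# where any single participant is less likely to hit a given problem
# (assumed p = 0.15, purely for illustration).
for n in (3, 5, 8, 12):
    homogeneous = share_of_problems_found(n)
    varied = share_of_problems_found(n, p_detect=0.15)
    print(f"n={n:2d}  p=0.31: {homogeneous:.0%}   p=0.15: {varied:.0%}")
```

Under these assumptions five participants reach roughly 84 percent at p = 0.31 but only about 56 percent at p = 0.15, which is the arithmetic behind adding interviews for heterogeneous populations.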
Step 3: Run a pilot with 30 to 100 respondents
Thirty respondents with complete data is the smallest pilot worth running, so treat that as the floor. With 30, you can compute a real Cronbach's α and item-total correlations, though the estimates will have wide confidence intervals. With 100, the estimates stabilize enough that you can start treating them as properties of the scale.
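One way to see how wide those intervals are is a percentile bootstrap on α. The sketch below is illustrative only: the data are simulated (a hypothetical six-item, 1-to-5 scale driven by one latent trait), the function names are made up for this example, and the percentile bootstrap is just one of several interval methods, not necessarily the one ReliCheck uses.

```python
import random
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of item scores per respondent."""
    k = len(rows[0])
    item_var_sum = sum(variance(col) for col in zip(*rows))
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - item_var_sum / total_var)


def bootstrap_alpha_ci(rows, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample respondents with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(rows, k=len(rows))
        try:
            estimates.append(cronbach_alpha(sample))
        except ZeroDivisionError:
            continue  # degenerate resample with no total-score variance
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lower, upper


def simulate_pilot(n, k=6, seed=1):
    """Simulated (not real) responses: shared trait plus item noise, clipped to 1-5."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        trait = rng.gauss(0, 1)
        rows.append([min(5, max(1, round(3 + trait + rng.gauss(0, 1)))) for _ in range(k)])
    return rows


for n in (30, 100):
    data = simulate_pilot(n)
    lower, upper = bootstrap_alpha_ci(data)
    print(f"n={n:3d}  alpha={cronbach_alpha(data):.2f}  95% CI=({lower:.2f}, {upper:.2f})")
```

Running the same function on your own pilot data, and reporting the interval next to the point estimate, makes it clear how much weight a 30-respondent α can bear.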
What to look at:
- α and ω (Cronbach's α and McDonald's ω). If both sit above 0.70 with every item retained, the scale is internally consistent enough to keep iterating on. If either falls below 0.70, use the item-level checks below to find the cause.
- Item-total correlations. Any item below 0.30 is a weak contributor. Below 0.20, drop or rewrite it.
- Inter-item correlation matrix. Healthy scales show correlations roughly in the 0.30 to 0.70 range. Correlations above 0.85 signal redundancy, not consistency. Drop one item from each redundant pair.
- α-if-deleted. If removing a single item raises α by 0.05 or more, that item is working against the rest of the scale. Check its wording before deciding whether to drop it.
Step 4: Document what you did
Write a one-page methods note that lists the construct definition, the items, the population, the pilot sample size, the reliability statistics, the items dropped, and the items revised. ReliCheck generates most of this automatically as the methodology appendix; the construct definition and the rationale for revisions are the two parts you have to write yourself.
The note pays for itself the first time someone asks "where did this scale come from?", whether at a peer review, an internal stakeholder meeting, or an accreditation visit. Without it, every conversation starts from scratch.
What this gets you
A scale that has been through expert review, cognitive interviews, and a pilot with documented reliability statistics is not "validated" in the formal sense (that takes confirmatory factor analysis with a fresh sample), but it is defensible. Defensible is the bar most working teams need. It carries you through internal use, applied evaluation, and most peer-reviewed publication contexts where the construct is well established.
For deeper reading on formal validation, see the Reliability guide, which covers each statistic in plain language along with its thresholds and pitfalls.