Validity guide · ReliCheck

What this guide covers

This is the validity counterpart to the reliability guide. If reliability tells you whether your scale is a steady ruler, validity tells you whether you brought the right ruler in the first place. The guide walks through factor analysis (the workhorse of construct validity), the choices the ReliCheck Validity dashboard asks you to make, and the conventions for reading the output without overclaiming.

Why validity is a separate question

A steady ruler still has to measure the right thing. A high reliability score means your items move together; it does not mean they move because of the construct you have in mind. A 10-item engagement scale could have a strong reliability number and still be measuring general job satisfaction, optimism, or how much someone likes their commute. Reliability is necessary, but it is not enough. Validity asks the harder question: is the underlying thing behind these items the one you named?

Plain language

Reliability is the ruler's steadiness. Validity is whether the ruler measures length, weight, or temperature. Both matter, and they take different evidence to defend.

What factor analysis does

Exploratory factor analysis looks at how your items correlate with each other and asks: how many underlying dimensions would best explain this pattern of correlations? Each underlying dimension is a factor. Each item gets a loading on each factor, which is roughly the correlation between the item and the factor. Items with high loadings on the same factor are probably measuring the same thing; items with high loadings on different factors are measuring different things.

If you ran factor analysis on a workplace survey and two clean factors fell out, one mostly capturing engagement items and one mostly capturing workload items, you would have evidence that engagement and workload are distinct constructs in your data. If your "engagement" scale split across both factors, you would have evidence that the scale is doing two jobs at once, and you should consider treating it as two subscales instead of one composite.

PCA vs principal axis factoring

The Method dropdown in the ReliCheck Validity dashboard offers two extraction methods, and they answer slightly different questions.

Principal components (PCA)

PCA explains the total variance in your items, including the variance that is unique to each item (measurement error and item-specific quirks). It is fast, deterministic, and a fine first look. Use it when you want a quick sense of how many dimensions are in your data.
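As a rough illustration of what principal-component extraction does, the sketch below decomposes a correlation matrix and scales each eigenvector by the square root of its eigenvalue to get unrotated loadings. The function name `pca_loadings` and the four-item matrix are hypothetical, and this is not a description of ReliCheck's internal code.

```python
import numpy as np

def pca_loadings(R, n_factors):
    """Unrotated principal-component loadings from a correlation matrix R."""
    eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # A loading is the eigenvector entry scaled by sqrt(eigenvalue).
    loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
    return eigvals, loadings

# Hypothetical data: items 1-2 correlate strongly, as do items 3-4.
R = np.array([
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.6],
    [0.1, 0.1, 0.6, 1.0],
])
eigvals, loadings = pca_loadings(R, n_factors=2)
communalities = (loadings ** 2).sum(axis=1)    # variance explained per item
```

The eigenvalues always sum to the number of items, because PCA accounts for the total variance, unique variance included. The sign of each loading column is arbitrary at this stage.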

Principal axis factoring (PAF)

PAF explains only the shared variance among the items, which is the variance you can actually attribute to a latent construct. It does this by iterating: it estimates how much of each item's variance is shared, decomposes the correlation matrix on that basis, and repeats until the estimates stabilize. Use PAF when you are arguing for construct validity in a manuscript or report. The numbers are slightly more conservative than PCA and more defensible to a reviewer.
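The iteration described above can be sketched as follows. This is a minimal version under stated assumptions: communalities start at the squared multiple correlations, and the loop stops when they change by less than `tol`. The name `principal_axis` and both defaults are illustrative choices, not ReliCheck's documented settings.

```python
import numpy as np

def principal_axis(R, n_factors, max_iter=200, tol=1e-6):
    """Iterated principal axis factoring on a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    # Starting guess for shared variance: squared multiple correlations.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)            # replace 1s with communalities
        eigvals, eigvecs = np.linalg.eigh(reduced)
        top = np.argsort(eigvals)[::-1][:n_factors]
        lam = np.clip(eigvals[top], 0.0, None)   # guard against tiny negatives
        loadings = eigvecs[:, top] * np.sqrt(lam)
        new_h2 = (loadings ** 2).sum(axis=1)
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:
            break
    return loadings, h2

# Hypothetical three-item scale where every pair correlates 0.5.
R = np.full((3, 3), 0.5)
np.fill_diagonal(R, 1.0)
paf_loadings, paf_communalities = principal_axis(R, n_factors=1)
```

For this toy matrix the PAF loadings settle near 0.71, while the first principal component of the same matrix would give about 0.82, which illustrates why PAF numbers tend to look more conservative.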

For most surveys, the two methods agree on how many factors are present and on which items load where. They tend to disagree on the exact size of the loadings.

How many factors to keep

ReliCheck uses the Kaiser criterion, which retains every factor with an eigenvalue of at least one. A factor with an eigenvalue below one explains less variance than a single standardized item would on its own, so it is generally not substantive. The scree plot in the Validity dashboard shows every factor's eigenvalue as a bar, green when above the cutoff and coral when below, with a dashed reference line at an eigenvalue of one.

Kaiser is a useful default, not a law. Two other things to consider:

The elbow. Look for the bend in the scree plot where the bars flatten. If Kaiser retains four factors but the elbow is at two, the third and fourth factors are probably noise.
Interpretability. A factor only counts if you can name it. If the items loading on factor three do not share a coherent meaning, retaining it adds clutter, not insight.
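The Kaiser rule itself is simple enough to sketch in a few lines. The function name `kaiser_retained` and the eigenvalues below are made up for illustration; the elbow and interpretability checks still need a human.

```python
def kaiser_retained(eigenvalues, cutoff=1.0):
    """Count factors at or above the cutoff and their share of total variance."""
    total = sum(eigenvalues)   # equals the number of items for a correlation matrix
    kept = [ev for ev in eigenvalues if ev >= cutoff]
    explained = [ev / total for ev in kept]
    return len(kept), explained

# Hypothetical eigenvalues from a four-item correlation matrix.
n_factors, explained = kaiser_retained([1.8, 1.4, 0.4, 0.4])
cumulative = sum(explained)
```

Here two factors are retained, and together they account for 80 percent of the variance in the four items.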

Rotation, in plain terms

The first time you extract factors, the math points them at the directions of maximum variance. Those directions are mathematically convenient but almost never line up with the human concepts you care about. Rotation re-aims the factors so each item loads strongly on one factor and weakly on the others, which makes the result interpretable.

None

Useful for diagnostics. You can see the raw structure, and items that load on more than one factor stand out. Not what you want for a final report.

Varimax

An orthogonal rotation that keeps the factors uncorrelated with each other. Items end up loading clearly on one factor and near zero on the rest. Use Varimax when you have theoretical reasons to expect the constructs to be independent, like engagement and workload in some models, or when you want the cleanest possible interpretation.
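To make the idea of re-aiming concrete, here is a sketch of raw Varimax using Kaiser's pairwise planar rotations (no row normalization). Whether ReliCheck uses this exact variant is not stated in this guide, so treat the function and the example loadings as assumptions for illustration.

```python
import numpy as np

def varimax(loadings, max_sweeps=50, tol=1e-10):
    """Raw Varimax: rotate each pair of factors by its closed-form angle."""
    L = np.array(loadings, dtype=float)
    p, k = L.shape
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for i in range(k - 1):
            for j in range(i + 1, k):
                x, y = L[:, i].copy(), L[:, j].copy()
                u, v = x ** 2 - y ** 2, 2 * x * y
                A, B = u.sum(), v.sum()
                C, D = (u ** 2 - v ** 2).sum(), 2 * (u * v).sum()
                angle = 0.25 * np.arctan2(D - 2 * A * B / p,
                                          C - (A ** 2 - B ** 2) / p)
                L[:, i] = x * np.cos(angle) + y * np.sin(angle)
                L[:, j] = -x * np.sin(angle) + y * np.cos(angle)
                largest_angle = max(largest_angle, abs(angle))
        if largest_angle < tol:        # no pair needed further rotation
            break
    return L

# Hypothetical clean structure, deliberately rotated 30 degrees out of alignment.
simple = np.array([[0.8, 0.0], [0.7, 0.1], [0.1, 0.7], [0.0, 0.8]])
theta = np.deg2rad(30)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ Q
rotated = varimax(unrotated)
```

Because the rotation is orthogonal, each item's communality is unchanged; only the way that variance is split across the factors moves.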

Oblimin and Promax (coming soon)

Oblique rotations that allow the factors to correlate. Use these when your constructs are theoretically related, like motivation and engagement, or self-efficacy and confidence. The ReliCheck dashboard lists these in the Rotation dropdown so you know what is on the roadmap; they will ship in a follow-up release.

KMO and Bartlett's test

Two prerequisites tell you whether factor analysis is even worth running on your data.

KMO sampling adequacy

The Kaiser-Meyer-Olkin measure ranges from zero to one and asks whether your correlation matrix has enough shared variance to factor. ReliCheck uses Kaiser's conventional labels: 0.90 and up is marvelous, 0.80 to 0.90 is meritorious, 0.70 to 0.80 is middling, 0.60 to 0.70 is mediocre, below 0.60 is unacceptable. (Kaiser's original scale splits that last band further, calling 0.50 to 0.60 miserable and only values below 0.50 unacceptable.) Aim for 0.70 or higher before treating your factors as solid.
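For readers who want to see where the number comes from, the overall KMO compares squared correlations with squared partial correlations. The sketch below uses the standard formula; the function names and the three-item matrix are hypothetical, and the labels follow the bands listed in this guide.

```python
import numpy as np

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure for a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    inv = np.linalg.inv(R)
    d = np.diag(inv)
    partial = -inv / np.sqrt(np.outer(d, d))    # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)       # mask for off-diagonal cells
    r2 = (R[off] ** 2).sum()
    p2 = (partial[off] ** 2).sum()
    return r2 / (r2 + p2)

def kmo_label(value):
    """Label bands as listed in this guide."""
    if value >= 0.90:
        return "marvelous"
    if value >= 0.80:
        return "meritorious"
    if value >= 0.70:
        return "middling"
    if value >= 0.60:
        return "mediocre"
    return "unacceptable"

# Hypothetical three-item scale where every pair correlates 0.5.
R = np.full((3, 3), 0.5)
np.fill_diagonal(R, 1.0)
value = kmo(R)
label = kmo_label(value)
```

With only three moderately correlated items the measure comes out at about 0.69, which lands in the mediocre band.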

Bartlett's test of sphericity

Bartlett's test asks whether the correlation matrix is statistically distinguishable from an identity matrix. A significant result (typically p less than .05) means the items are correlated enough to factor. A non-significant result is a stop sign: the items are not co-varying, and any factors you extract are essentially noise. ReliCheck shows the test result in the Validity dashboard footer.
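The test statistic is a scaled log-determinant of the correlation matrix. The sketch below computes the statistic and its degrees of freedom using the standard formula; the function name and sample values are hypothetical. The p-value would come from a chi-square survival function such as `scipy.stats.chi2.sf(statistic, df)`, which is left out here to keep the example dependency-light.

```python
import numpy as np

def bartlett_sphericity(R, n_obs):
    """Bartlett's chi-square statistic and degrees of freedom."""
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)            # log of the determinant of R
    statistic = -(n_obs - 1 - (2 * p + 5) / 6) * logdet
    df = p * (p - 1) // 2
    return statistic, df

# Hypothetical: three items, every pair correlating 0.5, 200 respondents.
R = np.full((3, 3), 0.5)
np.fill_diagonal(R, 1.0)
statistic, df = bartlett_sphericity(R, n_obs=200)
```

An identity matrix has a determinant of one, so its log-determinant and the statistic are both zero; the more the items correlate, the smaller the determinant and the larger the statistic.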

Reading a loading matrix

The Factor Loading Matrix in the Validity dashboard shows one row per item and one column per retained factor, with cells colored on a red-to-white-to-green gradient (red is a strong negative loading, white is near zero, green is strong positive). Conventional thresholds for interpretation:

|loading| ≥ 0.50: a strong loading. The item is doing real work for that factor.
0.30 ≤ |loading| < 0.50: a moderate loading. Worth keeping if it makes theoretical sense, worth reviewing if it does not.
|loading| < 0.30: a weak loading. The item is barely contributing.

The Status column tags an item Strong when it has at least one strong loading and a communality (the share of the item's variance explained by the retained factors) of at least 0.40. Items that miss either condition are tagged Review. A Review item is not necessarily a bad item; it is one that did not fit cleanly into the current factor solution and deserves a second look.
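The thresholds and the Status rule above can be written out as a short sketch. The function names are hypothetical, and communality is computed as the sum of squared loadings across the retained factors, which assumes an orthogonal solution.

```python
def loading_strength(loading):
    """Band a single loading using the thresholds in this guide."""
    size = abs(loading)
    if size >= 0.50:
        return "strong"
    if size >= 0.30:
        return "moderate"
    return "weak"

def item_status(loadings, min_communality=0.40):
    """Strong needs one strong loading AND enough communality; else Review."""
    communality = sum(value ** 2 for value in loadings)
    has_strong = any(loading_strength(value) == "strong" for value in loadings)
    return "Strong" if has_strong and communality >= min_communality else "Review"

# Hypothetical items with loadings on two retained factors.
status_clean = item_status([0.72, 0.15])    # communality about 0.54
status_thin = item_status([0.55, 0.10])     # communality about 0.31
status_split = item_status([0.45, 0.45])    # no single strong loading
```

The second item shows why both conditions matter: a loading of 0.55 counts as strong, yet the retained factors explain less than 40 percent of that item's variance.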

In the dashboard

The full loading matrix and factor structure are summarized at the top of the Validity tab by the Validity Narrator in one or two plain-language sentences. It names whether the data supports one construct or several and points to the strongest grouping by item content.

Per-factor alpha

Each Factor Structure Summary card in the dashboard shows a Cronbach's alpha for that factor on its own, computed from just the items that load on it. This is the validity-aware reliability number: instead of asking how well your entire item bank hangs together, it asks how well each subscale hangs together. Reverse-loading items are mirrored before the alpha is computed so a negative loading does not artificially deflate the value.
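A small sketch of the idea, using the textbook alpha formula. The response data and the 1-to-5 scale used for mirroring are assumptions made for this example; the guide does not specify how ReliCheck mirrors items internally.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of responses per item, all the same length."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]     # each respondent's sum score
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))

def mirror(column, low=1, high=5):
    """Reverse-score a column on an assumed low..high response scale."""
    return [low + high - value for value in column]

# Hypothetical factor with three items; the third is negatively worded.
item_a = [1, 2, 3, 4, 5]
item_b = [2, 2, 3, 5, 5]
item_c_reversed = [5, 4, 3, 1, 1]

alpha_raw = cronbach_alpha([item_a, item_b, item_c_reversed])
alpha_mirrored = cronbach_alpha([item_a, item_b, mirror(item_c_reversed)])
```

Without mirroring, the negatively worded item cancels the other two in the sum score and alpha collapses (it can even go negative); after mirroring, the same three items produce a high alpha.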

A clean factor with five items and a per-factor alpha of 0.85 is the most defensible kind of result: the items measure one thing, and they measure it consistently.

Confirmatory factor analysis (CFA)

Exploratory factor analysis asks what structure emerges from the data. Confirmatory factor analysis asks whether the structure you proposed actually fits the data. Once you have assigned each item to a construct in the Questions builder, the Validity tab runs a single-factor or multi-factor CFA and reports the fit indices reviewers expect.

Chi-square divided by degrees of freedom

The chi-square test asks whether the model-implied covariance matrix could plausibly have produced the data you observed. Big sample sizes make almost every model fail this test, so the raw chi-square is rarely useful on its own. The ratio of chi-square to df is the rough rule: a value at or below 3 is solid, between 3 and 5 is workable, above 5 is a sign the model is not capturing what is in the data.
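The rule of thumb above amounts to one division and two comparisons. In this sketch the function name and the label "poor" for the above-5 band are shorthand chosen for the example, not dashboard wording.

```python
def chi_square_ratio(chi_square, df):
    """Return the chi-square/df ratio and the band it falls in."""
    ratio = chi_square / df
    if ratio <= 3:
        band = "solid"
    elif ratio <= 5:
        band = "workable"
    else:
        band = "poor"
    return ratio, band

# Hypothetical CFA result.
ratio, band = chi_square_ratio(chi_square=85.0, df=40)
```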

Comparative Fit Index (CFI) and Tucker-Lewis Index (TLI)

Both indices compare your model against an independence baseline (no relationships at all). Values run from roughly 0 to 1; higher is better. The conventional cutoff is CFI at or above 0.95 for good fit and 0.90 for acceptable. TLI penalizes model complexity slightly more than CFI, so the two often disagree by a few points; a strong scale clears both thresholds.
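Both indices can be computed from the chi-square and degrees of freedom of your model and of the baseline model. The sketch below uses the standard textbook formulas; the function names and input values are hypothetical.

```python
def cfi(chi_model, df_model, chi_base, df_base):
    """Comparative Fit Index from model and baseline chi-square values."""
    model_excess = max(chi_model - df_model, 0.0)
    worst_excess = max(chi_model - df_model, chi_base - df_base, 0.0)
    if worst_excess == 0.0:
        return 1.0                       # nothing left to improve on
    return 1.0 - model_excess / worst_excess

def tli(chi_model, df_model, chi_base, df_base):
    """Tucker-Lewis Index; works on chi-square/df ratios, so df costs more."""
    base_ratio = chi_base / df_base
    model_ratio = chi_model / df_model
    return (base_ratio - model_ratio) / (base_ratio - 1.0)

# Hypothetical fit: model chi-square 85 on 40 df, baseline 1200 on 55 df.
cfi_value = cfi(85.0, 40, 1200.0, 55)
tli_value = tli(85.0, 40, 1200.0, 55)
```

In this example CFI clears 0.95 while TLI sits just under it, the kind of small disagreement the paragraph above describes.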

RMSEA and SRMR

Root Mean Square Error of Approximation summarizes how much misfit the model leaves per degree of freedom, adjusted for sample size. Conventional cutoffs: at or below 0.06 for good fit, 0.06 to 0.08 for acceptable, and above 0.08 a cause for concern. ReliCheck also reports the 90 percent confidence interval; the upper bound should sit under 0.08 for a clean read. SRMR is the average size of the residual correlations (what the model could not explain); 0.08 or below is good. The CFI, TLI, RMSEA, and SRMR cutoffs follow Hu and Bentler's (1999) widely cited recommendations.
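The RMSEA point estimate follows directly from the model chi-square. The sketch below uses the common formula with N minus 1 in the denominator (some software uses N instead); the function names, band labels, and input values are hypothetical.

```python
import math

def rmsea(chi_square, df, n_obs):
    """RMSEA point estimate; zero when chi-square does not exceed df."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n_obs - 1)))

def rmsea_band(value):
    """Bands from the cutoffs in this guide."""
    if value <= 0.06:
        return "good"
    if value <= 0.08:
        return "acceptable"
    return "concern"

# Hypothetical fit: chi-square 85 on 40 df with 300 respondents.
value = rmsea(85.0, 40, 300)
band = rmsea_band(value)
```

The same chi-square with a larger sample gives a smaller RMSEA, which is why the index is less sample-sensitive than the raw chi-square test.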

When the indices disagree, the rule of thumb is: trust CFI plus RMSEA for overall model fit, and use SRMR for spotting items that the model handles poorly. The standardized loadings table on the CFA card shows you per-item whether each item loads strongly on its assigned factor.

Measurement invariance

If you compare scores across departments, demographic groups, or experimental arms, you are implicitly assuming the survey measures the same thing the same way for every group. Measurement invariance is the formal test of that assumption. ReliCheck fits three nested models in sequence on the same group axis:

Configural, metric, scalar

Configural invariance holds when the same factor structure shows up in every group; the items "hang together" the same way, but the strength of the relationships can differ. Metric invariance adds a constraint that the factor loadings are equal across groups; the items relate to the construct the same way. Scalar invariance adds equal item intercepts; this is the level you need before it is safe to compare mean scores across groups.

How ReliCheck decides whether invariance holds

Following Cheung and Rensvold (2002) and Chen (2007), the change in fit indices between levels is the signal. Invariance holds at a given level when the absolute change in CFI is at most 0.010, the absolute change in RMSEA is at most 0.015, and the absolute change in SRMR is at most 0.030 for metric or 0.015 for scalar. The card reports each delta in a single diagnostics table and surfaces a verdict pill so you can tell at a glance which level was reached.
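The decision rule above can be sketched as a pair of threshold checks applied in sequence. The limits are the ones quoted in this guide; the function names, the dictionary layout, and the fit values are hypothetical, and the configural model is simply treated as the starting point.

```python
LIMITS = {
    "metric": {"cfi": 0.010, "rmsea": 0.015, "srmr": 0.030},
    "scalar": {"cfi": 0.010, "rmsea": 0.015, "srmr": 0.015},
}

def level_holds(level, previous_fit, current_fit):
    """True when every absolute change stays within the level's limits."""
    return all(
        abs(current_fit[name] - previous_fit[name]) <= limit
        for name, limit in LIMITS[level].items()
    )

def highest_level(fits):
    """Walk configural -> metric -> scalar, stopping at the first failure."""
    reached = "configural"
    for previous, level in (("configural", "metric"), ("metric", "scalar")):
        if not level_holds(level, fits[previous], fits[level]):
            break
        reached = level
    return reached

# Hypothetical fit indices for the three nested models.
fits = {
    "configural": {"cfi": 0.962, "rmsea": 0.051, "srmr": 0.041},
    "metric":     {"cfi": 0.957, "rmsea": 0.053, "srmr": 0.049},
    "scalar":     {"cfi": 0.941, "rmsea": 0.060, "srmr": 0.055},
}
verdict = highest_level(fits)
```

In this example the step to scalar drops CFI by 0.016, so the verdict stops at metric invariance and mean comparisons across groups would be risky.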

What an invariance failure means

Failing metric invariance means the items do not relate to the construct the same way across groups; one group might be reading "I feel like I belong" more literally than another, or a translated version weighted certain items differently. Failing scalar invariance means the item intercepts differ; one group is using the scale at a different baseline than another even when their underlying trait level is the same. Either failure means raw mean comparisons across groups are risky, and you should report the invariance result alongside any group differences in your write-up.

Item response theory (IRT)

Classical test theory talks about scales: how reliable is the whole instrument, how well do the items hang together. Item response theory talks about items: where on the underlying trait does each item separate respondents, and how sharply does it discriminate. The two are complementary; IRT does not replace alpha and CFA, it answers a different question about the same data.

The Graded Response Model

ReliCheck fits Samejima's (1969) Graded Response Model on a 21-point quadrature grid. Every Likert item gets a discrimination parameter (a) and ordered category boundaries (b₁, b₂, ..., b_{K-1} for an item with K response categories) with standard errors. Discrimination tells you how sharply the item separates respondents at different trait levels: above 1.5 is strong, 0.8 to 1.5 is workable, below 0.8 is weak. Boundaries tell you the trait level at which the cumulative probability of responding at or above each category crosses 50 percent.
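To show how discrimination and boundaries turn into response probabilities, here is a sketch of the Graded Response Model's category probabilities in the logistic metric (no 1.7 scaling constant, which is an assumption). The function name and item parameters are hypothetical.

```python
import math

def grm_category_probs(theta, a, boundaries):
    """Probability of each response category at trait level theta."""
    # Cumulative probability of responding at or above each category.
    cumulative = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in boundaries]
        + [0.0]
    )
    # A category's probability is the gap between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(boundaries) + 1)]

# Hypothetical five-category Likert item.
a = 1.5
boundaries = [-1.5, -0.5, 0.5, 1.5]
probs = grm_category_probs(theta=0.5, a=a, boundaries=boundaries)
at_or_above_fourth = sum(probs[3:])
```

At a trait level equal to the third boundary (0.5), the chance of answering in the fourth category or higher is exactly 50 percent, which is the interpretation of a boundary given above.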

Item information and the test information function

Where does each item carry its weight on the trait scale? The item information function answers that. An item with high information at trait level 0 (the middle of the scale) is most useful for telling apart average-trait respondents; an item with information that peaks at +2 only helps you tell high-trait respondents apart. The test information function adds all the items together. A test information value of 4 corresponds to a marginal reliability of about 0.80, which is the same target as Cronbach's alpha.
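The figure quoted above is consistent with the conversion reliability = I / (I + 1) for a trait scaled to unit variance. That convention is an assumption here (some texts use 1 - 1/I instead, which gives slightly lower values), and the item information values are made up.

```python
import math

def reliability_from_information(information):
    """Reliability implied by test information, assuming a unit-variance trait."""
    return information / (information + 1.0)

def standard_error(information):
    """Standard error of the trait estimate at a given information level."""
    return 1.0 / math.sqrt(information)

# Hypothetical item information values at trait level 0; the test sums them.
item_information = [1.2, 0.9, 1.1, 0.8]
test_information = sum(item_information)
reliability = reliability_from_information(test_information)
se = standard_error(test_information)
```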

Marginal reliability and person ability

Marginal reliability is the IRT counterpart to Cronbach's alpha. Values above 0.85 are strong, 0.75 to 0.85 are workable, below 0.65 is shaky. Each respondent also gets an expected a posteriori (EAP) ability estimate with a posterior standard error, computed over the same trait distribution the model was fit on. EAP scores are what you would use for a paper on subgroup gaps in latent trait, rather than raw sum scores.
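An EAP score is a weighted average of grid points, with weights given by the prior times the likelihood of the respondent's answers. The sketch below assumes a standard-normal prior and 21 evenly spaced points from -4 to 4; the grid range, function names, and item parameters are assumptions, since the guide only states the number of points.

```python
import math

def grm_prob(theta, a, boundaries, response):
    """GRM probability of a 0-indexed response category."""
    cumulative = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in boundaries]
        + [0.0]
    )
    return cumulative[response] - cumulative[response + 1]

def eap(responses, items, n_points=21, low=-4.0, high=4.0):
    """Posterior mean and SD of the trait on an evenly spaced grid."""
    grid = [low + i * (high - low) / (n_points - 1) for i in range(n_points)]
    weights = []
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)      # unnormalized N(0, 1) prior
        for response, (a, boundaries) in zip(responses, items):
            weight *= grm_prob(theta, a, boundaries, response)
        weights.append(weight)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

# Hypothetical three-category items given as (discrimination, boundaries).
items = [(1.5, [-1.0, 1.0]), (1.0, [-1.0, 1.0]), (2.0, [-1.0, 1.0])]
high_score, high_sd = eap([2, 2, 2], items)
mid_score, mid_sd = eap([1, 1, 1], items)
low_score, low_sd = eap([0, 0, 0], items)
```

Because of the prior, even a respondent who picks the top category on every item gets a finite estimate rather than an infinite one, and the posterior standard deviation says how uncertain that estimate is.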

Common pitfalls

Too small a sample

Factor analysis is unstable below about 100 respondents, and a common rule of thumb is at least five respondents per item. ReliCheck warns when the sample is small enough to make the factor structure unreliable; treat the result as exploratory in that case.
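Both rules of thumb are easy to check before running anything. This sketch is illustrative only: the function name and messages are hypothetical, and the exact conditions behind ReliCheck's own warning are not documented here.

```python
def sample_size_warnings(n_respondents, n_items, min_total=100, min_per_item=5):
    """Return a list of rule-of-thumb warnings (empty when both rules pass)."""
    warnings = []
    if n_respondents < min_total:
        warnings.append(f"fewer than {min_total} respondents")
    if n_respondents < min_per_item * n_items:
        warnings.append(f"fewer than {min_per_item} respondents per item")
    return warnings

small_sample = sample_size_warnings(80, 10)     # under 100 respondents
long_survey = sample_size_warnings(120, 30)     # only 4 respondents per item
comfortable = sample_size_warnings(300, 20)
```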

Confusing PCA with theory

PCA chooses factors that explain variance. The result is a description of your data, not a theory of the construct. Use it to see structure, then defend the structure with PAF, a rotation, and conceptual argument.

Skipping rotation

An unrotated solution is almost never the right thing to report. Items spread across factors and the result looks weaker than it is. Always rotate before interpreting.

Overinterpreting cross-loadings

An item that loads 0.55 on one factor and 0.42 on another is doing two jobs. Drop it, rewrite it, or assign it to the factor where it makes the most conceptual sense; do not let the math alone decide.
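One way to surface candidates for that judgment call is to list items with more than one non-trivial loading. The 0.30 threshold below is borrowed from the moderate band earlier in this guide and is an assumption, as are the function names and item labels; the flag only marks items for review and does not decide their fate.

```python
def salient_factors(loadings, threshold=0.30):
    """Indices of factors on which the item loads at or above the threshold."""
    return [index for index, value in enumerate(loadings) if abs(value) >= threshold]

def is_cross_loading(loadings, threshold=0.30):
    """True when the item has a salient loading on two or more factors."""
    return len(salient_factors(loadings, threshold)) >= 2

# Hypothetical loading rows for three items on two factors.
rows = {
    "item_1": [0.55, 0.42],
    "item_2": [0.72, 0.10],
    "item_3": [-0.35, 0.61],
}
flagged = [name for name, loadings in rows.items() if is_cross_loading(loadings)]
```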

Forgetting that EFA is exploratory

Exploratory factor analysis surfaces structure. It does not confirm it. The serious follow-up is a confirmatory factor analysis on a second sample, which ReliCheck runs directly on the Validity tab under the EFA dashboard. You no longer need to leave for Mplus, lavaan, or AMOS to close that loop.

What ReliCheck shows you, and what it does not

The Validity dashboard reports factor count by the Kaiser criterion, per-factor and cumulative variance explained, KMO sampling adequacy, Bartlett's test of sphericity, the full loading matrix with a gradient legend, per-factor Cronbach's alpha, communalities, and an item-by-item Strong or Review status. The dropdowns let you switch between PCA and principal axis factoring, and between no rotation and Varimax, and the result updates in place. AI-suggested factor names propose short construct labels based on the item wording (never on response data), and you can edit any label by clicking it.

Underneath the EFA dashboard sit the Confirmatory Factor Analysis card, the Measurement Invariance card, and (on its own tab) the Item Response Theory card. The full sequence from EFA structure to CFA fit to invariance across groups to IRT item-level discrimination runs inside the same view; the methodology section your reviewers expect is right there.

What ReliCheck does not do (yet): oblique rotation in EFA (coming soon), bifactor or hierarchical CFA, multidimensional IRT, or judge whether the items you wrote belong in the scale in the first place. The dashboard tells you what your data looks like. You still decide what it means.

Validity earlier in the workflow

Factor analysis describes the structure your data already has. The Construct Mapper in the Questions builder proposes the structure before respondents see the survey, so the factor analysis has a hypothesis to test rather than a blind fishing expedition. The Survey Purpose Checker audits the draft against the decision the survey is meant to inform, which is a different kind of validity entirely.
