Statistical Modeling Makes Weak Results Look Better

[Featured image: computer monitor on a desk surrounded by busy, colorful graphs illustrating complex statistical modeling.]

I am revising my Industrial-Organizational (I-O) Psychology textbook. I’ve reviewed abstracts for a few hundred recent papers, looking for new material to include. A couple of intervention studies caught my eye because they are rare in my field, so I added them to the list of possibles. Unfortunately, when I looked closer, I couldn’t use them. I had hoped they would show how interventions affect important organizational outcomes like sales or turnover. Instead of comparing the intervention group with the control group, the authors just ran some statistical model tests. I wondered why the authors would not report group differences, and then it hit me. There were no statistically significant effects of the intervention on the intended outcomes, but statistical modeling makes weak results look better. The authors had to do that to make their paper publishable.

Experimental Designs Are the Gold Standard

Whether they are laboratory studies, drug trials in medicine, or intervention studies in business, the experimental design is considered the gold standard for drawing causal conclusions. With an intervention design, you randomly assign some employees to experience the intervention (perhaps a training program) while others serve as a control group who do not get trained. You then collect data on the outcomes of that training, some immediately after training and some after allowing time for the intervention to take hold. Kirkpatrick outlines four types of relevant outcomes. The first two are assessed immediately after training; the last two are assessed after some period of time, perhaps a couple of months.

  • Reactions are whether the person enjoyed the training and found it of value.
  • Knowledge is the amount learned.
  • Behavior is whether the person performs the job differently.
  • Results are the impact of the training on organizational outcomes.

The use of this approach allows for a strong conclusion about the impact of the intervention. It is based on an inductive inference: if the intervention worked on the current employees (the difference in means between the two groups was statistically significant), it would be expected to work on others. If you tried the intervention again, you would have some confidence that it would work. Of course, if the intervention actually had no effect, there is still a 5% chance of getting a statistically significant result, the Type 1 error that serves as an alternative explanation. To counter this, we have to repeat the study, perhaps by training the control group, to be sure the intervention works a second time.
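To make the logic concrete, here is a minimal sketch of the kind of group comparison an intervention study rests on. The sample sizes, score scale, and effect size are invented for illustration, and the t-test is just one common way to test the difference in means.

```python
# Minimal sketch of the intervention-vs-control comparison described above.
# All data here are simulated; nothing is from a real study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical post-training knowledge scores for 50 control and 50 trained employees,
# where the training adds a modest true boost.
control = rng.normal(loc=50, scale=10, size=50)
trained = rng.normal(loc=56, scale=10, size=50)

t, p = stats.ttest_ind(trained, control)
print(f"mean difference = {trained.mean() - control.mean():.1f}")
print(f"t = {t:.2f}, p = {p:.4f}")  # p < .05: unlikely to be a fluke, though a 5% Type 1 error risk remains
```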

The Logic of Statistical Modeling

Statistical modeling allows you to determine whether the pattern of correlations among a group of variables fits what you would expect if a given model is correct. For example, we might reason that good leadership builds trust with direct reports, which leads them to be satisfied with the job, which in turn leads to better job performance. If that is correct, we would expect a certain pattern of correlations when we collect data on the four variables. The underlying logic is based on an “If A Then B” deductive inference: if my model is correct, then I will have good model fit. The weakness in this logic is that it cannot be run the other way; B does not imply A. Just because the model fits the data doesn’t mean the model is correct, only that it is one of many possible models that could have produced those results. If you break it down, it is a case of trying to draw causal conclusions from correlations.
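As a rough illustration of what “a certain pattern of correlations” means, the sketch below simulates a pure leadership → trust → satisfaction → performance chain with invented path strengths. For standardized variables in such a chain, the correlation between the two ends should be roughly the product of the paths along the way.

```python
# Sketch of the correlation pattern a simple causal chain implies.
# The variables and path coefficients are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Chain: leadership -> trust -> satisfaction -> performance,
# with standardized path coefficients of .6, .5, and .4.
leadership = rng.standard_normal(n)
trust = 0.6 * leadership + np.sqrt(1 - 0.6**2) * rng.standard_normal(n)
satisfaction = 0.5 * trust + np.sqrt(1 - 0.5**2) * rng.standard_normal(n)
performance = 0.4 * satisfaction + np.sqrt(1 - 0.4**2) * rng.standard_normal(n)

# For a pure chain, the implied end-to-end correlation is the product of the paths:
# .6 * .5 * .4 = .12.
r = np.corrcoef(leadership, performance)[0, 1]
print(f"observed r = {r:.3f}, implied r = {0.6 * 0.5 * 0.4:.3f}")
```

Model fit only checks whether the observed correlations match the ones the model implies; it cannot tell you that this chain, rather than some other arrangement of the same variables, produced them.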

Statistical Modeling Makes Weak Results Look Better

Suppose you spent a lot of time and effort conducting a training intervention study with a group of sales employees, hoping to find that it improved the customer service performance of individuals and sales for the organization. You are meticulous and measure all four Kirkpatrick outcomes. You compare the intervention and control groups, and all you find is that knowledge increased. Service performance and sales were unaffected. You know that reporting that an intervention did not have the intended effect would be tough to publish, given biases in academic publishing. An alternative is to leverage the possibility that statistical modeling makes weak results look better. Assuming that knowledge, customer service, and sales are correlated, there is a good chance that a model of knowledge leading to better customer service leading to sales will be supported: If A Then B.

But there are other likely alternatives that explain what is going on here. We know that cognitive ability can affect how much people learn in training, and it affects how well people perform on the job. If cognitive ability is the real driver, it could produce a pattern of results where

  1. The training impacts knowledge but not performance because knowing more doesn’t help with sales.
  2. Knowledge and performance are related because they share a common driver of cognitive ability.

So there is reason not to take the model test too seriously. It tells us that a given pattern of relationships supports a model, but it doesn’t shed light on why. It cannot tell us, for example, whether cognitive ability or some other factor is the real driver.
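To see how this can happen, the sketch below simulates data in which cognitive ability drives knowledge, customer service, and sales, while the training raises only knowledge. All variables and coefficients are invented; the point is the pattern, not the numbers.

```python
# Simulated illustration of the cognitive-ability alternative described above.
# Every value here is made up for the example.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
trained = rng.integers(0, 2, size=n)   # 1 = intervention group, 0 = control
ability = rng.standard_normal(n)       # the unmeasured common driver

# Training moves knowledge; ability moves knowledge, service, and sales.
knowledge = 0.5 * ability + 0.4 * trained + rng.standard_normal(n)
service = 0.5 * ability + rng.standard_normal(n)
sales = 0.5 * ability + rng.standard_normal(n)

# 1. The intervention shows up on knowledge but not on service or sales.
for name, y in [("knowledge", knowledge), ("service", service), ("sales", sales)]:
    gap = y[trained == 1].mean() - y[trained == 0].mean()
    print(f"training effect on {name}: {gap:.3f}")

# 2. Yet knowledge, service, and sales are all correlated through their shared cause,
#    so a knowledge -> service -> sales chain model can still look supported.
print("r(knowledge, service):", round(np.corrcoef(knowledge, service)[0, 1], 3))
print("r(service, sales):    ", round(np.corrcoef(service, sales)[0, 1], 3))
print("r(knowledge, sales):  ", round(np.corrcoef(knowledge, sales)[0, 1], 3))
```

Under these assumed numbers, the group comparison shows no effect on the outcomes the practitioner cares about, yet the correlations the chain model needs are all there.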

The Academic-Practice Divide

Much has been written about the academic-practice divide and why practitioners do not apply the results of academic research. My take is that most of what is in academic journals is not actionable in that it does not provide evidence about what works in organizations. The use of statistical modeling to analyze data from intervention studies is a case in point. The practitioner wants to know whether a particular intervention has the desired effects on important outcomes. It can be just as important to know what doesn’t work as what does. The academic needs to publish, and to do so they have to report results showing that something worked. Showing that it didn’t can be hard to publish. As long as that’s the system, academics will leverage the fact that statistical modeling can make weak results look better, and practitioners won’t find those results useful.

Image generated by DALL-E 4.0. Prompts: “Illustrate complex statistical modeling”, “try the one of the left again but less cluttered”, “Better but make it more colorful”.
