Persuasive design didn't predict efficacy in 92 mental-health-app RCTs, and that should bother us

I was rereading a meta-analysis at my kitchen table on a Saturday morning and the line that kept catching me was this one. Across 92 randomized trials of mental health apps, with 16,728 participants between them, the number of "persuasive design" principles in an app had no significant relationship with either how engaged people were or how much the app helped.

That is the kind of finding I want to take seriously, because if it's right, a lot of design discourse in digital health is doing less work than we think.

The paper is a meta-analysis of persuasive design, engagement, and efficacy in 92 RCTs of mental health apps, published in npj Digital Medicine and indexed in PubMed Central. I want to walk through what it actually says, because I think the headline gets read one of two wrong ways, and the real point is more useful than either.

The setup, briefly

The team pulled 119 studies covering 30,251 participants for the systematic review, and ran the meta-analysis on the 92 studies that were randomized trials and reported usable outcome data. Every app was coded against the Persuasive Systems Design framework, which is Oinas-Kukkonen's 28-principle taxonomy across four buckets: primary task support, dialogue support, system credibility, and social support.

Each app got somewhere between 1 and 12 of those 28 principles. The mode was 5. The five most common were tunnelling (88% of apps), rehearsal (84%), trustworthiness (80%), reminders (55%), and personalisation (50%).

Apps worked. Pooled clinical effect was Hedges' g = -0.43 (95% CI -0.53 to -0.34), which is a small-to-medium effect, on par with face-to-face guided self-help in the same conditions. Heterogeneity was high (I² = 83.4%), which I'll come back to.
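For anyone who hasn't computed these two numbers by hand, here is a minimal sketch of how a pooled effect and I² fall out of per-study effect sizes. It uses DerSimonian-Laird random-effects pooling, which is a common choice but an assumption on my part, not a claim about the paper's exact method. The effect sizes and variances are made up for illustration; they are not the trial data.

```python
import math


def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling (assumed method, see lead-in).

    effects   -- per-study effect sizes (e.g. Hedges' g)
    variances -- per-study sampling variances of those effects
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and estimate.
    w = [1.0 / v for v in variances]
    fixed = sum(wi * g for wi, g in zip(w, effects)) / sum(w)

    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (g - fixed) ** 2 for wi, g in zip(w, effects))
    df = k - 1

    # Between-study variance (tau^2), truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # I^2: percentage of total variation attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each study's variance.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * g for wi, g in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    return {
        "pooled": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
        "i2": i2,
        "tau2": tau2,
    }


# Hypothetical studies, negative g meaning symptom reduction.
example_effects = [-0.10, -0.80, -0.35, -0.60, -0.25]
example_variances = [0.01, 0.02, 0.015, 0.03, 0.01]
summary = random_effects_pool(example_effects, example_variances)
```

I² is roughly the share of variation in effects beyond what sampling error alone would produce. A value above 80% is a sign that the trials are not all estimating one common effect.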

The headline finding, the one I keep returning to, is that the count of persuasive principles in an app did not predict either engagement or efficacy. Engagement and efficacy were not reliably related to each other either: the correlation between them across studies was r = 0.21, p = 0.43. Not significant. Not even directionally clean.
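To get a feel for why a correlation of 0.21 can be statistically invisible, here is a rough sketch using the Fisher z approximation. I'm not quoting the number of studies that contributed to that correlation, so the study count k below is a hypothetical input, and the paper may well have used a different test.

```python
import math
from statistics import NormalDist


def fisher_z_test(r, k):
    """Approximate two-sided p-value and 95% CI for a Pearson r over k studies.

    Uses the Fisher z transform; k is the number of data points (studies).
    """
    if k < 4:
        raise ValueError("need at least 4 studies for this approximation")
    z = math.atanh(r)                 # Fisher z transform of r
    se = 1.0 / math.sqrt(k - 3)       # approximate standard error of z
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z) / se))
    ci = (math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se))
    return p, ci


# k = 20 and k = 200 are illustrative values only, not figures from the paper.
p_small, ci_small = fisher_z_test(0.21, 20)
p_large, ci_large = fisher_z_test(0.21, 200)
```

With a couple of dozen studies, the interval around r = 0.21 spans zero, so the data cannot even pin down the sign of the relationship. The same r across a few hundred studies would be a different story.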

How this gets misread

Two readings I want to push back on before I get to mine.

The first misreading is "persuasive design doesn't work." That is not what the paper says. The paper says counting principles per app does not predict outcomes across a heterogeneous set of trials. That is a different claim. A specific principle implemented well in a specific app for a specific condition could absolutely be doing the work. A count is just a count.

The second misreading is "engagement doesn't matter." Also not what the paper says. The paper says that across these studies, with the engagement metrics they happen to report, the relationship is statistically invisible. The paper itself flags that 24% of studies didn't report engagement at all, and the 76% that did used 25 different metrics. You cannot pool that. Honestly, the bigger story here is methodological. We do not have a shared way to measure engagement in this field, and that is making the entire literature noisier than it should be.

What I think the paper is actually telling us

Here's my read.

If you took a pile of 92 mental health apps and asked "do they include things from a checklist," the answer was yes, mostly. The typical app had 5 (that was the mode). The most common principles were the easy ones to ship. Tunnelling is just guided sequential flows. Rehearsal is repeated practice. Reminders are reminders. These are baseline features in any modern app, mental health or not.

So the variance the meta-analysis is trying to explain is not variance in whether apps tried to be persuasive. It's variance between apps that all tried to be persuasive in roughly similar ways. And the count of features they shipped doesn't carry information about whether the implementation was any good.

This matches what I see at work. Two products can both ship "personalized reminders" and one of them sends a generic 9am ping while the other actually models when the user is most likely to act and adapts. Both score 1 in a coding scheme. They are not the same intervention.
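A toy sketch of that collapse, assuming a simple presence/absence coding. The CodedPrinciple type, its implementation field, and both products are hypothetical; the actual coding procedure in the paper is more careful than this, but the information loss has the same shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedPrinciple:
    name: str            # PSD principle name, e.g. "reminders"
    implementation: str  # hypothetical free-text note on how it was built


def psd_count(principles):
    """Presence/absence coding: count distinct principle names only."""
    return len({p.name for p in principles})


product_a = [
    CodedPrinciple("reminders", "fixed 9am push to every user"),
    CodedPrinciple("personalisation", "first name in the greeting"),
]
product_b = [
    CodedPrinciple("reminders", "send time adapted to when the user tends to act"),
    CodedPrinciple("personalisation", "content selected from past responses"),
]

# Same score under the count, different interventions underneath.
same_score = psd_count(product_a) == psd_count(product_b)
same_intervention = product_a == product_b
```

Here same_score is true and same_intervention is false, which is the whole problem with using the count as a predictor.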

The other thing this paper makes obvious is the weakness of theoretical grounding in the field. The authors note that 55% of the studies did not link their intervention to any behavior change theory. More than half. If you don't know which mechanism you're trying to activate, the persuasive features you ship are a bag of stuff, not a coherent intervention.

I think this is where the BCT taxonomy and COM-B earn their keep, even though this paper deliberately doesn't use either. BCTs at least tell you what specific active ingredient you intended. COM-B tells you which capability or motivation or opportunity barrier you're trying to address. PSD tells you what UI patterns you used. These are not interchangeable layers.

What to do if you ship behavior-change software

A few takeaways I think hold up.

Stop counting features and start naming mechanisms. "Our app has 8 evidence-based engagement principles" is the wrong sentence. "Our app uses self-monitoring with feedback to address an awareness gap, plus action planning to address an opportunity gap, both grounded in COM-B" is a better one. The first is countable. The second is testable.

Pick a small set of engagement metrics and stick with them for the life of the product. The 25-metric problem in this meta-analysis is also the 25-metric problem inside most companies I've worked with. Different teams measure different things, change them when they look bad, and end up unable to compare across releases. Pick two or three metrics, write down why those, and don't change them.
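One way to make "write it down and don't change it" concrete is to keep the metric definition in code as an immutable record with its rationale attached. This is a sketch, not a recommendation of this particular metric: the active-days definition, the 28-day window, and all the names here are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)  # frozen: the definition cannot be edited in place
class MetricDefinition:
    name: str
    window_days: int
    rationale: str


# Hypothetical metric, defined once and reused for every release.
ACTIVE_DAYS_28 = MetricDefinition(
    name="active_days_28",
    window_days=28,
    rationale="Counts days of use, so one long session does not look like a habit.",
)


def active_days(definition, signup, session_dates):
    """Distinct days with at least one session in the window after signup."""
    end = signup + timedelta(days=definition.window_days)
    return len({d for d in session_dates if signup <= d < end})


signup = date(2024, 1, 1)
sessions = [
    date(2024, 1, 1),
    date(2024, 1, 1),   # second session on the same day counts once
    date(2024, 1, 3),
    date(2024, 2, 15),  # outside the 28-day window, ignored
]
score = active_days(ACTIVE_DAYS_28, signup, sessions)
```

If a team wants a different window, that is a new named definition reported alongside the old one, not a quiet edit to the existing one.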

Build the harder principles, not just the easy ones. The most common PSD principles in this dataset are also the cheapest to implement. Tunnelling and rehearsal are basically table stakes. The harder ones, like genuine personalization, social comparison, and simulation, are rare in part because they are hard. If nearly everyone in the dataset shipped the cheap ones and no relationship emerged, that's consistent with the cheap ones having hit a ceiling and the harder ones being where any remaining lift lives.

The g = -0.43 still matters. The apps did help people. The aggregate effect is real and roughly comparable to other low-intensity interventions. Don't read this paper as "apps don't work." Read it as "apps work, and the question of why is harder to answer than the field has been pretending."

The deeper point

Meta-analyses of digital health interventions are starting to feel like astronomy before the telescope. We can see something is there. We can measure it in the aggregate. We don't really know what we're looking at, and our instruments aren't precise enough to tell us.

I think this paper is a useful instrument check. It tells us that counting persuasive features doesn't work as an explanation, and the field has been overrelying on count-style explanations because they're easy to publish.

The next round of meta-analyses needs better-coded interventions. BCT-coded, COM-B-mapped, with engagement measured the same way across studies. Until we have that, every meta-analysis will end the same way. Wide confidence intervals, lots of heterogeneity, and a discussion section that says "more standardization needed."

If you're a product person reading this and thinking it's depressing, I genuinely don't think it is. I think it means the field has a real opportunity to shift from "did we include the feature" to "did the feature do the thing." That is a better question to be asking, and your team is closer to the data than any meta-analysis will ever be.

Read the paper directly if you ship in this space. Pay attention to the supplement, especially the breakdown of which principles appeared in which trials. There's more signal in the specifics than in the headline.

If you've coded a digital health intervention against BCTs or COM-B and want to compare notes on what your team learned, please reach out. I genuinely don't have this figured out and I'd like to.