“Well, I do not believe in Treatment Heterogeneity,” I say to my supervisors.

They respond by laughing. “This one will be included in your speech,” my PI adds.

Their reaction is completely natural. I am in the first year of my PhD, doing research in the field of causal inference, specifically on models for treatment effect heterogeneity. Claiming that the very phenomenon I want to capture with my models does not exist is obviously quite odd. It is akin to a colleague working on extreme value modelling walking up to their supervisor and saying that extreme weather events do not exist, or that modelling them is impossible.

My statement is obviously hyperbole, but it is grounded in reality. Although the intuition behind treatment heterogeneity is straightforward (some underlying individual characteristics modify the treatment effect an individual can be expected to experience), finding evidence of treatment effect heterogeneity in real data sets is quite hard. In this blog post, I want to focus on the challenge of finding that evidence, and leave aside the question of whether heterogeneity should be taken into account in policy considerations at all, which is another important debate in its own right.

Signal to Noise in Causal Models

When running a causal model, our principal aim is to estimate the Average Treatment Effect (ATE), the effect of treatment on the population on average. We might also want to go beyond this, into the realm of personalized medicine, and investigate the Conditional Average Treatment Effect (CATE): the benefit or harm an individual can expect from treatment given a set of observed covariates.
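In potential-outcomes notation (my own addition for precision, writing Y_i(1) and Y_i(0) for the outcomes of individual i under treatment and control), these two estimands can be written as:

\text{ATE} = \mathbb{E}\left[Y_i(1) - Y_i(0)\right], \qquad \text{CATE}(x) = \mathbb{E}\left[Y_i(1) - Y_i(0) \mid X_i = x\right]

The CATE is simply the average treatment effect restricted to individuals with covariate value x.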

More concretely, we can decompose an outcome of interest Y_i into a prognostic part and a treatment part. We can express this formally as:

Y_i = \mu(X_i) + \tau(X_i)Z_i + \varepsilon_i

Here, \mu(X_i) is the prognostic function, describing the relationship between underlying covariates and the outcome for individuals regardless of which arm they are in. The term Z_i is the treatment indicator (where 1 is treated and 0 is control), and \tau(X_i) is the treatment function—the core of our interest.

To make this estimable, we often assume a linear form for the treatment function, such as \tau(X_i) = \alpha + \beta X_i. In this setup, if we center our covariates X_i around zero, \alpha represents the treatment effect for the “average” individual (effectively the ATE), while \beta X_i captures the heterogeneity.
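As a quick sanity check of this setup (a minimal sketch of my own, with made-up values \mu(X_i) = 2X_i, \alpha = 1 and \beta = 0.3), simulating the model and fitting an ordinary least-squares regression with a treatment-by-covariate interaction recovers \alpha and \beta:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Centered covariate, randomized treatment indicator, and noise
X = rng.normal(size=n)
Z = rng.binomial(1, 0.5, size=n)
eps = rng.normal(size=n)

# Hypothetical ingredients of Y = mu(X) + tau(X) * Z + eps
mu = 2.0 * X                 # prognostic part
alpha, beta = 1.0, 0.3       # ATE and heterogeneity coefficient
tau = alpha + beta * X       # linear treatment function
Y = mu + tau * Z + eps

# OLS on [1, X, Z, X*Z]: the Z coefficient estimates alpha, the X*Z coefficient beta
design = np.column_stack([np.ones(n), X, Z, X * Z])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(coef)  # roughly [0.0, 2.0, 1.0, 0.3]
```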

Statistically speaking, and assuming these two parts of the outcome are independent, the treatment function has to explain whatever variance differs between the two arms. Thinking in terms of signal-to-noise ratio, in a realistic medical setting we might expect the covariates to explain around 50% of the variance (the prognostic part), with the rest being residual noise. Of the remaining half of the pie, only a very small slice, say around one percent of the total variance of the outcome, can realistically be expected to be explained by the treatment function.
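To put rough numbers on this picture, here is a small simulation (my own illustration, with arbitrarily chosen coefficients and, for simplicity, independent prognostic and effect-modifying covariates) in which the prognostic part and the noise each account for roughly half of the outcome variance, while the heterogeneity term accounts for about one percent:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Independent prognostic and effect-modifying covariates (a simplification)
X_prog = rng.normal(size=n)
X_mod = rng.normal(size=n)
Z = rng.binomial(1, 0.5, size=n)
eps = rng.normal(size=n)

mu = 1.0 * X_prog                    # prognostic part
alpha, beta = 0.3, 0.2               # illustrative ATE and modifier strength
Y = mu + (alpha + beta * X_mod) * Z + eps

var_Y = Y.var()
print("prognostic share   :", round(mu.var() / var_Y, 3))                  # ~0.49
print("ATE share          :", round((alpha * Z).var() / var_Y, 3))         # ~0.01
print("heterogeneity share:", round((beta * X_mod * Z).var() / var_Y, 3))  # ~0.01
print("noise share        :", round(eps.var() / var_Y, 3))                 # ~0.49
```

Even with a million simulated individuals, the heterogeneity slice remains a sliver next to the prognostic and noise components.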


Conceptually, within the treatment function, we can also see that in most cases the bulk of the effect is driven by the ATE (\alpha) rather than by the modification term. For example, with a new chemotherapy we might observe that mortality decreases by 50% for the population on average, while men experience a 45% decrease and women a 55% decrease; each group deviates from the average effect by only 5 percentage points. The average treatment effect is an order of magnitude greater than the treatment-modification component.
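If we force these stylized numbers into the linear treatment function above (coding X_i = +1 for men and X_i = -1 for women, and treating the percentage reductions as additive effects purely for illustration), the split between ATE and modification reads:

\tau(X_i) = \alpha + \beta X_i = -0.50 + 0.05\,X_i, \qquad \tau(+1) = -0.45 \ (\text{men}), \quad \tau(-1) = -0.55 \ (\text{women})

The constant term dwarfs the modification term by a factor of ten.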

The numbers here are exaggerated for the sake of illustration; in practice, the individual contribution of a covariate to the treatment modifier on an odds scale is probably closer to 1%, if there is any treatment heterogeneity at all. The main challenge of individualized medicine and causal inference is thus of a simple statistical nature: separating signal from noise. When treatment effect modifiers make up only a small part of the explained variance, they either need to be very pronounced, or we need a lot of samples to learn such a small signal from the data.
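To get a feeling for the sample sizes this implies, here is a rough power simulation of my own (a continuous outcome, a single modifier contributing about one percent of the outcome variance, tested with a standard OLS interaction term via statsmodels). Even in this idealized setting, a trial with a few hundred patients will miss the modifier most of the time:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

def detects_modifier(n, alpha_eff=0.3, beta_eff=0.2, level=0.05):
    """Simulate one trial and test the treatment-by-covariate interaction."""
    X = rng.normal(size=n)
    Z = rng.binomial(1, 0.5, size=n)
    Y = 1.0 * X + (alpha_eff + beta_eff * X) * Z + rng.normal(size=n)
    design = sm.add_constant(np.column_stack([X, Z, X * Z]))
    fit = sm.OLS(Y, design).fit()
    return fit.pvalues[3] < level  # column 3 is the X*Z interaction term

for n in (100, 250, 500, 1_000, 2_500):
    power = np.mean([detects_modifier(n) for _ in range(200)])
    print(f"n = {n:>5}: estimated power ~ {power:.2f}")
```

Binary outcomes, smaller modifier effects, and multiple candidate modifiers all push the required sample size up further, which is part of why convincing examples are so rare.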

Biological vs. Statistical Evidence

By a very pronounced signal, I mean cases like the role of Vitamin C deficiency in the effectiveness of Vitamin C supplements, an example given by Richard Hahn in a recent episode of the Causal Bandits podcast (Kuo, 2024). It is almost nonsensical to investigate such a case with statistical methods, as we could establish it using biological knowledge and simple deduction.

On the other hand, there are treatment modifiers that we would really like to find but that are probably subtler and less pronounced. Indications of this type of heterogeneity were found for canagliflozin, a treatment for diabetes (Perkovic et al., 2019). The heterogeneity here is somewhat mechanical: the drug works by inhibiting glucose reabsorption in the kidneys, causing sugar to be excreted in the urine, so its efficacy relies on kidney function. Patients with a reduced glomerular filtration rate (a measure of kidney function) physically cannot excrete as much glucose, leading to a smaller treatment benefit compared to those with healthy kidneys.


Oftentimes, however, these signals are so small that we need a lot of data to find them, and clinical trials and observational studies are rarely powered to detect them. In the same podcast episode, Hahn underlines that he believes there is a middle ground in which causal inference tools can genuinely be useful for personalized medicine (Kuo, 2024).

The Power of Large Data

There are a couple of examples of this middle ground, but they are hard to come by. A recent paper by Welz et al. (2025) showed that there is indeed treatment heterogeneity: subpopulations defined by sex and gender responded better or worse to a new lung cancer screening method in the Netherlands and Belgium. Crucially, further analysis in the paper showed that these differences are mainly related to underlying histologies: certain groups were more prone to lung cancer types (such as adenocarcinomas) that have a better prognosis and thus respond better to screening. The paper also came with useful policy recommendations on relaxing screening eligibility.

It has to be noted that the data employed in this analysis were large: two cohorts of 15,000 and 53,000 individuals, respectively. The study was well powered, and different machine learning methods produced consistent results.


Conclusion

So, when I tell my supervisors that I don’t believe in treatment heterogeneity, I am not denying the reality of personalized medicine; I am respecting the difficulty of finding it. As Welz et al. (2025) show, the signal can be there, buried deep in the noise, but we should not expect to find it everywhere. Until we have the massive datasets required to see these subtle variations, skepticism isn’t just natural; it’s statistically necessary.
