It was the fall of 2023, and I was hosting a birthday party at my shared apartment in Rennes. My exchange semester had just begun, and the party was a perfect opportunity to gather the new friends I’d made. As it happened, a delayed regional train held up the lion’s share of them. This twist of fate gave me the chance to have a proper chat with the one friend who did arrive on time.
As the conversation rolled on, she asked me: “So what is that poster on your door about?” The poster, in French and made in GIMP, proclaimed: “La forme functionelle? Non, merci on a la statistique” (Functional form? No thanks, we have statistics!). The statement, dripping with student-esque arrogance, meant that when it comes to learning things, we need no guidance from structures or teachers; experience and data will suffice. My French, delivered with an accent of unclear origin, spoke for itself.
At the time, I knew the statement on that poster was overly optimistic, but recent advances in statistics and machine learning seemed to justify this optimism. Just a year prior, ChatGPT had swept the world. It did not consist of a web of if-statements checking grammar and rules; instead, gradient descent on a model with billions of parameters, let loose on the content of the internet, taught the model language. It was a problem of sample size, not of structure. This reflects a classical intuition in statistics: with enough data, and under the right assumptions, the estimator converges toward the ‘true value’. A simplified version of this idea appears in Alex Hormozi’s dictum that ‘volume negates luck,’ by which he means that an individual’s skill in business or marketing increases the more they practice the task.
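The convergence intuition can be made concrete with a minimal sketch. Everything in it is an illustrative assumption of mine: the normal distribution, the true mean of 5.0, the spread of 2.0, and the hypothetical helper name `sample_mean_error`. It only shows how far the sample mean sits from the true value for a few sample sizes.

```python
import random


def sample_mean_error(n, true_mean=5.0, seed=42):
    """Absolute gap between the sample mean of n draws and the true mean.

    The normal distribution, true_mean and the spread of 2.0 are
    illustrative assumptions, not values taken from the essay.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    draws = [rng.gauss(true_mean, 2.0) for _ in range(n)]
    return abs(sum(draws) / n - true_mean)


# The gap tends to shrink as the sample grows, although not
# necessarily at every single step.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean_error(n), 4))
```

The "right assumptions" matter here: the draws are independent and come from one fixed distribution, which is exactly the kind of structure the poster pretended we could do without.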
The example I gave my friend was that in my quest to learn French, I had benefited much more from the small talk with classmates and roommates than from the weekly French classes at the Grande École I was attending. It made sense; compared to the daily exposure I got, the French classes amounted to filling a pool with tea glasses. Here we can see that these ideas from statistics apply not only to estimating quantities but also to more complex “learning problems,” as I would like to call them. After all, in language learning, I was trying to communicate my thoughts correctly, and the loss function was whether my roommates understood my jokes.
So can we all get up in a crowd and declare that we can surmount every problem just with more data and experience? I, unfortunately, have to say to my younger self: “Not so fast, young man.” But I also do not want to be an old man yelling at clouds.
In Conjectures and Refutations, Popper presents a framework for science in which hypotheses must first be formulated (that is, conjectured) before they are tested against our observations. Data obviously plays an important role in this process, since without it we cannot test the conjectured hypotheses. The hypotheses do not follow from the data, but from logic. In The Beginning of Infinity, Deutsch gives the example that before ever witnessing an atomic bomb, scientists were able to theorize that one would work and would have disastrous effects if unleashed. Similarly, before embarking on a project, a PI or researcher conjectures which direction to head in and which hypotheses are realistic before sending out PhD students and postdocs. Thus, the role of theory and structure in learning problems and science does not seem so redundant after all.
This is especially the case in causal inference, where structure is essential for untangling which effects play which role. Temporal structure, for example, can be massively helpful.
Consider an epidemiological study. Genetic factors are fixed at birth, while individuals have some choice over their lifestyle. These genetic factors might influence a person’s lifestyle choices. However, it would be nonsensical to assume that lifestyle choices could change those underlying genetic factors. This temporal structure is valuable information we must use if we want to learn how these different factors influence an outcome of interest.
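A small simulation can illustrate why this ordering matters. It is only a sketch under assumptions I made up for the purpose: a single "gene" variable that influences both lifestyle and the outcome, linear effects with coefficients 0.8, 1.0 and 2.0, and the hypothetical helper `estimate_effects`. It compares a naive regression of the outcome on lifestyle with one that first removes the part explained by the gene.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def slope(x, y):
    """Ordinary least squares slope of y on x (single regressor)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    b, mx, my = slope(x, y), mean(x), mean(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def estimate_effects(n, seed=0):
    """Return (naive, adjusted) estimates of the lifestyle effect.

    Assumed data-generating process (illustrative only): the gene comes
    first and affects lifestyle and outcome, never the other way round.
    The true lifestyle effect is set to 1.0.
    """
    rng = random.Random(seed)
    gene = [rng.gauss(0, 1) for _ in range(n)]
    lifestyle = [0.8 * g + rng.gauss(0, 1) for g in gene]
    outcome = [1.0 * l + 2.0 * g + rng.gauss(0, 1)
               for l, g in zip(lifestyle, gene)]

    naive = slope(lifestyle, outcome)  # ignores the gene entirely
    # Adjust for the gene by regressing residual on residual.
    adjusted = slope(residuals(gene, lifestyle), residuals(gene, outcome))
    return naive, adjusted


naive, adjusted = estimate_effects(20_000)
print("naive:", round(naive, 2), "adjusted:", round(adjusted, 2))
```

In this setup the naive estimate overstates the lifestyle effect, while the adjusted one lands near the assumed value of 1.0. More data alone does not fix the naive estimate; knowing that the gene comes first is what tells us to adjust for it.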
Yes, it is fascinating how we learn from exposure and experience, both in statistics and in daily life. Children and adolescents pick up gestures and expressions on the fly, and empirical science has given us many advances. But for both everyday learning and empirical science, logic, theory, and structure provide an indispensable guiding hand.