Don't Do Statistics Before Causality

If you don't put causality into your models, you won't get it out of them

Sep 19, 2021

I’ve identified a gap in my analysis, and I have been studying since my last post to close it. That gap is causality, though another name for it would be scientific rationality. The studying was triggered by this video, and catalyzed by this being the second time causality has caught my attention. I now believe that causality should come before any statistics.

Causality v. Statistics

Causality is concerned with what-causes-what. Statistics is concerned with forecasts, correlation, and significance. If you’re a trader, you care about causality. You might see some index or news release and put on a trade because you think it will cause a price to move.

Our statistical models do not contain any causal information. They are data zombies that mindlessly take inputs they are given and walk them over to where they are designed to put them. In this sense models are not even wrong, though they can be uselessly applied.

We are all guilty of including variables in a regression or other model without regard for causality, instead focusing on metrics like AIC or mean squared error to tell us when we have the best model. Those metrics would give the geocentric model of the solar system high marks because it made accurate predictions despite being dead wrong.

If you care about understanding what really moves your output (be it prices or otherwise), you need to start with a causality model. As speculators, we want the model that best maps causes to prices, which is not necessarily the one that provides the most accurate forecast. The best forecast may underweight the greatest cause and overweight secondary effects that shouldn’t be traded on.

Mapping Data Relationships

A causality model is just a graph. Nodes represent events, and arrows between nodes represent the direction of the relationship. Producing this model first allows you to include the right features, and it also helps you interpret the physical meaning of any coefficients your statistics spit out for those features.

We naturally think this way. We always have a model like this in our head, but we might not explicitly lay it out and think about it.

A causal graph indicating that X causes Y, but also influences Z, which in turn influences Y

This map guides how we build our statistical models and how we interpret the model output. If the statistical work that follows the construction of this causality model fails to find a result, our causality model may be wrong. We should find ways to adjust it and try again.
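To make the idea of "a causality model is just a graph" concrete, here is a minimal sketch of the X, Y, Z example written as a plain Python dictionary. The names `causal_graph`, `causal_paths`, and `mediators` are hypothetical helpers made up for this illustration, not part of any library.

```python
# Hypothetical encoding of the example graph: each key points at the
# nodes it is assumed to cause (X -> Y, X -> Z, Z -> Y).
causal_graph = {
    "X": ["Y", "Z"],
    "Z": ["Y"],
    "Y": [],
}


def causal_paths(graph, source, target, path=None):
    """Return every directed path from source to target as a list of nodes."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    paths = []
    for child in graph.get(source, []):
        if child not in path:  # guard against accidental cycles
            paths.extend(causal_paths(graph, child, target, path))
    return paths


def mediators(graph, source, target):
    """Nodes that sit strictly between source and target on some directed path."""
    found = set()
    for p in causal_paths(graph, source, target):
        found.update(p[1:-1])
    return found


paths = causal_paths(causal_graph, "X", "Y")
for p in paths:
    print(" -> ".join(p))
print("mediators between X and Y:", mediators(causal_graph, "X", "Y"))
```

Writing the graph down this way forces you to state every arrow explicitly, and listing the paths shows that X reaches Y both directly and through Z.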

For the above example you’d probably get a more precise forecast of Y by including Z. Most data scientists would throw Z into their forecasts of Y without a second thought. In my opinion, that’s not what a speculator wants. By adding Z to your model, you’re controlling for Z, so the coefficient on X no longer reflects the total effect of X.

If you instead model X’s total impact on Y without including Z, you’ll know how to act when you see a surprising value released for X. If you included Z in your model, you’d seriously underweight the impact X will have, because the coefficient on X would capture only its direct effect and leave out the part that flows through Z. You need a causality model to make these kinds of decisions.
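A small simulation can illustrate the difference. This is only a sketch under made-up assumptions: the relationships are linear, the noise is Gaussian, and the coefficients `x_to_z`, `z_to_y`, and `x_to_y` are invented for the example, not estimated from any market data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed structural coefficients, chosen purely for illustration.
x_to_z = 0.8  # X -> Z
z_to_y = 0.5  # Z -> Y
x_to_y = 1.0  # X -> Y (direct)

# Simulate data that follows the graph: X causes Z, and both cause Y.
x = rng.normal(size=n)
z = x_to_z * x + rng.normal(size=n)
y = x_to_y * x + z_to_y * z + rng.normal(size=n)


def ols(target, *features):
    """Least-squares coefficients of target on features (intercept first)."""
    design = np.column_stack([np.ones(len(target)), *features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Y ~ X: the coefficient on X picks up both the direct and the via-Z path.
total_effect = ols(y, x)[1]

# Y ~ X + Z: controlling for Z leaves only the direct path in X's coefficient.
direct_effect = ols(y, x, z)[1]

print(f"coefficient on X without Z: {total_effect:.2f}")
print(f"coefficient on X with Z:    {direct_effect:.2f}")
print(f"true total effect:          {x_to_y + x_to_z * z_to_y:.2f}")
```

Under these assumptions the regression without Z should recover a coefficient near the true total effect (direct effect plus `x_to_z * z_to_y`), while the regression with Z should recover only the direct effect. The second model forecasts Y more precisely, but its coefficient on X is the wrong number to trade on when X surprises.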

Summary

Before you start throwing all your variables together and calculating regressions, build yourself a causality model that outlines how you think things relate to each other. You can use published papers, Wikipedia, whatever you like to try to piece those relationships together, but you can’t use data. Data will not tell you this information; only science will.

Without a causality model guiding your process, you may construct accurate but misleading models. You might also include variables that actually make your forecasts worse. You won’t really know what your model means, or how to generalize any kind of wisdom out of it that would help you as a speculator.

This has been a criminally brief introduction to causality. I recommend checking out the video I mentioned at the start for a much deeper introduction. I’m digging through the creator’s textbook Statistical Rethinking to get my head screwed on straight with this topic and produce far more rational analysis as a result.

