The Art of Planning, DeepSeek and Politics

A while back I discussed in a post some of the nuances of data driven decision making when we need to answer a straight question at a point in time, e.g.: “shall we do this now?”.

That was a case presented as if our decision “now” would have no relationship to the future, meaning that it would have no impact on our future decisions on the same problem.

Often, of course, we do know that our decisions have an impact on the future, but the issue here is slightly different: we are looking not only at impact on the future but at impact on future decisions.

This is the problem of sequential decisioning. Instead of “shall we do this now?” it answers the question:

“Shall we adopt this strategy/policy over time?”

In other words it tries to solve a dynamic problem, so that the sequence of decisions made could not have been bettered no matter what happened. When you have an optimal solution to this problem, whatever decision is made, at any point in time, is a decision that one would never regret.

I will answer three questions here:

  • What is an example of such a problem?
  • Is there a way to solve such problems?
  • What is the relationship with DeepSeek and politics?

Sequential decisioning problem example – Customer Engagement Management

A typical example could be that of working on marketing incentives on a customer base. The problem is that of deriving a policy for optimal marketing incentives depending on some measures of engagement in the customer base.

Why is it sequential? Because we have a feedback loop between what we plan to do, the marketing incentive, and the customer behaviour that triggers those incentives in the first place: engagement.

Whenever we have such loops we cannot solve the problem at a single point in time while ignoring the future.

The solution could look something like: “Invest $X in customer rewards whenever the following engagement KPIs (email open rate, web visits, product interaction, etc.) drop below certain levels.”

It is important to note that we need something to aim for, e.g. maximise return on marketing incentives.

Does a solution to such problems exist?

The good news is that yes, solutions exist, but we have only been able to deal with cases that are amenable to some mathematical/statistical modelling.

We can definitely find a solution if we have a good model that answers the following:

  • How does a given marketing incentive influence customer engagement on average?
  • Given a certain level of customer engagement, what is the expected value of the relationship with the customer?
  • How does the engagement of our customers evolve over time, in the absence of any incentive?

So we do need some KPIs that give us a good sense of what level of customer engagement we have, and we need a sense of the expected value of the relationship with the customer given a particular set of KPIs. To be clear, what we need is something true on average, and at least over a certain period of time.

For example, we should be able to say that, on average, a customer using our products daily for the past month will deliver X value over 2 years.

We also need to be able to say that a given marketing incentive, on average, increases customers’ daily engagement by X% and costs some defined amount of $.

We also need to be able to say something like: our customer engagement rate tends to decrease by X% every few months, again on average.

The above modelling of engagement and customer value is not something that most businesses would find difficult today. Engagement rates, or attrition rates over time, are easy to get, as are the results of marketing campaigns on engagement rates. Harder to get is a reliable estimate of lifetime value given current engagement metrics, but we can solve for a shorter time horizon in such cases.

Richard Bellman is the mathematician credited with having solved such problems in the most general way.

His equation, the Bellman equation, is all you need after you have a mathematical model of your problem.

The equation, in the form I’d like to present it, is the one below, where V is the value of the optimal policy π*:
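In standard textbook notation (my choice of symbols here: state s, e.g. the current engagement level; action a, e.g. the $ spent on incentives; instant reward r; discount factor γ; transition probabilities P over next states s’), one common way to write it is:

```latex
V^{\pi^*}(s) \;=\; \max_{a}\Big[\, r(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, V^{\pi^*}(s') \,\Big]
```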

It says the following:

The optimal strategy, depending on parameters a (e.g. how much $ is spent on incentives), is the one that maximizes the instant rewards now, as well as the expected future rewards (future rewards conditional on the policy), no matter what the variables of the problem are at that point in time (e.g. customer engagement and marketing budget left).

It is a beautiful and intuitive mathematical statement which is familiar to most humans:

The best course of action is that which balances instant gratification with delayed gratification.

We all work this way, and this is arguably what makes us human.

To be clear, this problem can be, and is, solved in a large variety of sequential decisioning settings.

The Bellman equation is the basic tool for all planning optimizations in supply management, portfolio management, power grid management and more…
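To make this less abstract, here is a minimal sketch in Python of how a toy version of the engagement problem could be solved with value iteration once a model is in place. Every number in it (engagement levels, customer values, incentive costs, transition probabilities, discount factor) is made up purely for illustration:

```python
# Toy customer-engagement MDP solved with value iteration.
# All numbers are made up; a real model would estimate rewards and
# transition probabilities from engagement and campaign data.
states = ["low", "medium", "high"]        # customer engagement levels
actions = [0.0, 10.0, 30.0]               # $ spent on incentives per customer per period

# assumed expected value delivered per period at each engagement level
value_per_period = {"low": 5.0, "medium": 20.0, "high": 40.0}

# assumed transition probabilities P(next engagement | engagement, action index)
P = {
    ("low", 0): [0.90, 0.10, 0.00], ("low", 1): [0.60, 0.35, 0.05], ("low", 2): [0.30, 0.50, 0.20],
    ("medium", 0): [0.30, 0.60, 0.10], ("medium", 1): [0.15, 0.60, 0.25], ("medium", 2): [0.05, 0.50, 0.45],
    ("high", 0): [0.05, 0.35, 0.60], ("high", 1): [0.02, 0.23, 0.75], ("high", 2): [0.01, 0.14, 0.85],
}

gamma = 0.95                               # discount factor: weight given to future rewards
V = {s: 0.0 for s in states}               # initial guess for the value of each state

for _ in range(500):                       # repeated Bellman updates until the values settle
    V = {
        s: max(
            value_per_period[s] - a
            + gamma * sum(p * V[s2] for p, s2 in zip(P[(s, i)], states))
            for i, a in enumerate(actions)
        )
        for s in states
    }

# the optimal policy: for each engagement level, the spend attaining the max above
policy = {}
for s in states:
    best_i = max(
        range(len(actions)),
        key=lambda i: value_per_period[s] - actions[i]
        + gamma * sum(p * V[s2] for p, s2 in zip(P[(s, i)], states)),
    )
    policy[s] = actions[best_i]

print(V)        # value of each engagement state under the optimal policy
print(policy)   # how much to spend at each engagement level (given these made-up numbers)
```

The loop is just the Bellman equation applied over and over until the values stop changing, and the policy it spits out is exactly the “invest $X when engagement drops below a certain level” type of rule mentioned earlier.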

Politics?

Well, in some sense politics is the art of collective, or at least representative, policy making. And here a problem obviously arises: how do we balance instant versus delayed gratification when the views of the collective might differ? What if the collective doesn’t even agree on the various aspects of the problem at hand (e.g. the KPIs, the expected rewards, etc.)?

A key aspect of this should be the following: LEARNING.

The illustration below should serve as a guide:

There has to be a feedback loop, as the Bellman equation gives conditions for optimality but often the solution can only be found iteratively. Furthermore, we also need to be open-minded about revising the overall model of the problem if we gather evidence that the conditions for optimality do not seem to come about.

So we have multiple challenges, which are true for politics but also for business management. We need:

  • Consensus on how to frame the problem
  • A shared vision on the balance between instant versus delayed rewards (or my reward versus your reward)
  • A framework for adapting as we proceed

The above is what I would call being “data driven”, basically “reality” driven, but it is hard to get there. We all obviously use data when we have it, but the real challenge is to operate in such a way that the data will be there when needed.

How about DeepSeek?

If you have followed thus far you might have understood that solving a sequential problem basically involves algorithms that learn the dynamics between a policy maker’s action (e.g. marketing $ spent) and the long term feedback of the policy. This is the principle behind “reinforcement learning” in the AI world, as one wants the learning to be reinforced by incentives for moving in the right direction. This was considered a very promising framework for AI in the early days, but it is an approach that requires time, a fair bit of imagination and data, and it had somewhat faded from the spotlight in recent years… until DeepSeek stormed the AI world.

A key element of DeepSeek’s successful R1 generative AI model was that it leveraged reinforcement learning, which pushed it to develop advanced reasoning at a lower cost.

This was achieved through very clever redesigning of the transformer architecture, but I won’t go through it (as I have not yet understood it well myself).

As usual let me point you to a great book on the above topics.

My favorite is “Economic Dynamics” by John Stachurski. Great resource on sequential problem solving under uncertainty. Pretty theoretical, but that’s inevitable here.

Basics of Mathematical Modelling – Shapes of Uncertainty

Most of us, especially in business, but also in our private lives, perform some basic mathematical modelling all the time. We can do it wrong or we can do it right, but we do it… constantly.

The most basic form of mathematical/statistical modelling is to give shapes to our uncertainty (ignoring the shape is also part of it).

Let me tell you a story that should clarify what I mean.

John and his team at SalesX, an online marketplace, are trying to outsource their call centre operations and are looking for potential suppliers. They sent out an RFP (request for proposals) and they received a few responses. (note: despite the name SalesX is not owned by Elon Musk).

John calls his team in his office:

John: “Hi, everyone, could you give me a clear high level summary of the various responses we received from the call centers?”

Jane: “Yes, we shortlisted 3 responses for you, although we are quite confident on a winner…”

John: “Ok, what are the average call handling times of the shortlisted suppliers?”

Jane: “They are all about 2 minutes, but one provider is considerably cheaper. It is quite clear we should go for them. SmartCalls is the name of the company and they are heavily relying on automation and AI to support their operations, which keeps their costs very low, a very clever team.”

John: “Ok, sounds pretty good, let’s meet them but still keep another response in the race as we dig deeper beyond performance and price.”

Laura, in Jane’s team, is not convinced by such a swift decision, and questions Jane…

Laura: “Jane, 2 minutes is great but what about the % of calls that are not resolved in one session? that one is…”

Jane: “Yes, that number is higher for SmartCalls, but this is hard to compare across providers, right? Their clients have different rules for re-routing inquiries in house…”

Laura: “John, Jane, before moving forward with SmartCalls, let me reach out to the shortlisted suppliers to request some additional information. It will be very hard to switch to another supplier once we are done with this…”

John: “Ok but I give you 3 days, I hope you are after something substantial… SmartCalls is super cheap. Handling calls is not rocket science…”

Laura goes back to the suppliers and asks them for a full view of the distribution of call handling times, and she finds out the following, summarized in the chart below, which she promptly shares with John:

John looks at it but he is not too clear on what it means…

John: “Ok Laura, translate please…”

Laura: “Basically SmartCalls’ business model is that of leveraging AI with an inexperienced and cheap workforce: they can deal quickly with a large number of calls relating to simple issues, but their operators lack the creativity or experience to deal with issues of medium complexity. The operators all work remotely with no chance of sharing information with each other…”

John: “Wait, wait… the chart first, what does that mean?”

Laura: “Oh… isn’t it obvious? SmartCalls has a 2-minute average call time, yes, but this is driven by a large number of very quick calls; when it comes to customer satisfaction, there’s a good number of calls that go beyond 3-4 minutes.”

John: “Ok I get it, their calls are either quick or rather long, whilst iHelp, for example, is able to be more consistent, with most calls handled in about 2 minutes, right?”

Laura: “Yes, they avoid shortcutting the initial problem identification phase and have a longer list of mandatory screening questions, but this pays off. They are able to share the calls with specialized teams and…”

John: “Ok I get it indeed. I also see several of SmartCalls’ calls going beyond 5 minutes, which is our threshold for bringing customer calls back in house… good work Laura. Jane, good work on hiring Laura.”

Jane is a bit deflated, but she ultimately smiles as she is proud of having trained Laura well.

The quick story above is a good example of how we routinely perform unsophisticated statistical modeling, especially when we implicitly anchor ourselves to one-dimensional metrics, like averages.

When John heard the average call times he implicitly assumed that those averages were comparable and meaningful in their own right. This means he assumed (modelled) the range and frequency of call times to be similar in shape across suppliers, which is a fair bit of statistical modelling in practice (wrong modelling, in this case).

After all, what can you really tell from comparing averages unless you make the rather strong assumption that those averages summarize the underlying data appropriately? If you do so, although you might not know it, you are doing statistical modelling.
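As a tiny illustration of the point, here is a Python sketch (with made-up numbers loosely inspired by the story above) of two call-time distributions that share roughly the same average but have very different shapes:

```python
# Two simulated call-time distributions with similar means but different shapes.
# All parameters are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# "SmartCalls"-like: many very quick calls plus a long tail of difficult ones
smart = np.concatenate([
    rng.normal(1.0, 0.3, 7000),   # quick, AI-assisted calls (~1 minute)
    rng.normal(4.5, 1.0, 3000),   # complex calls the operators struggle with
]).clip(min=0.2)

# "iHelp"-like: consistent handling around 2 minutes
ihelp = rng.normal(2.0, 0.5, 10000).clip(min=0.2)

for name, calls in [("SmartCalls-like", smart), ("iHelp-like", ihelp)]:
    print(f"{name:16s} mean = {calls.mean():.2f} min, "
          f"share over 5 min = {(calls > 5).mean():.1%}")
# Both averages come out close to 2 minutes, yet the share of calls breaching
# the 5-minute threshold (the in-housing trigger in the story) is very different.
```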

Laura, instead, decided to look at the actual shape of the data and to avoid any high level, uninformed assumption, yet she is still assuming quite a few things, among which:

  • The data she received is representative (not skewed toward some specific types of calls)
  • The distribution of the data will be consistent over time
  • The distribution of the data, which is based on how the call centers handle other clients’ calls, is relevant to SalesX’s customer base

That’s basically statistical inference, and whenever you make a decision or judge data, although you might not think about it, you, as well, are doing statistics and data science.

The question is: “Do you know when you are doing that well, or when you are being ineffective?”

Another key aspect of giving shape to uncertainty is the question of which metrics to measure: average call times, abandon rate, both, or other metrics. Which ones to choose?

This is, in a sense, the operational side of the problem I presented above: when it comes to data that tracks a business activity or process, which metrics truthfully summarize the shape of that data?

I can recommend two good books on this topic. One is a technical text, Optimal Control Theory by Donald E. Kirk; the other is a more accessible read focusing specifically on designing meaningful metrics that track Objectives and Key Results (OKRs), Measure What Matters by John Doerr.

Optimal Control Theory goes well beyond the subject of performance measures, but the first 50 pages are a good introduction to the overall framework.

What is the value of Machine Learning? – A stock trading example

Since the days of the big data buzzwords, now over 20 years ago, data driven changes to business models, product value, customer journeys and more have gone through massive enhancements.

Nevertheless in the early days there were several discussions about the supposed “value” of data and such transformations. I know it because I was there, in most cases on the selling side.

It was not uncommon to find middle managers considering data driven transformations as a “nice to have” or simple ways to copy the competition, or something to do from a PR perspective. It was also not unusual for large businesses to look down on new tech startups as if looking at kids playing with unnecessary & fancy toys.

There was a lot of: “we do things this way because it works”, “our customers don’t need this”, “open source is dangerous”, “we do not need fancy machine learning for our simple problems” and so on.

There was also a lot of confidence in existing practices, something to the extent of: “I have been doing this for 20 years, I know better. My expertise beats your data science kid…”

To be clear, a lot of the skepticism was healthy and based on more than gut feelings, but over time the most negative voices disappeared. Nevertheless, a question remains relevant when starting any data driven & AI powered transformation: “What value will it deliver, considering the unknowns and costs?”

Is it just headcount reduction? Or reduced waste? Or better products? How will we measure all of this?

I am not going to answer those questions in general, in my view it would be silly to do so. How technology driven innovations can benefit a given solution to a problem is very contextual, not something we can broadly generalise.

I will though share an example, and a good example I believe; an example where value is upfront and easy to measure: how much value can a simple machine learning stock trading algorithm deliver when put against a simple rule-of-thumb trading strategy?

Disclaimer: I am not an experienced stock trader, I only started recently, following the sharp devaluation of the Japanese Yen (I am based and work in Tokyo). Do not reach out for any investment advice.

In any case, I thought it would be interesting to quickly put the two trading strategies against one another. Both are very simple and were devised within a few days, in the small window of time I have between my kids falling asleep and my inability to keep my eyes open.

The two strategies are as follows (on a daily trading routine) and are based only on S&P 500 Stocks:

  • Rule of Thumb: Every day check S&P 500 stocks and sort them by strength of linear growth over the past 15 days and by how much the price has gone down in the past 3 days. Buy the 50 stocks that had the strongest linear growth but lost the most value in the last 3 days. The rationale is that the price should bounce back, regressing to the mean of the linear growth.
  • Machine Learning: Create a model that, based on 15 days of S&P 500 stock price history, gives the probability that the price will grow the next day. Buy stocks proportionally to that probability, when the probability is > 50%. The model is trained on data from 50 random dates in the first part of 2023.

As you can notice, even the Machine Learning approach is really simple: we are feeding the model only 15 days of stock price data and creating only 3 features: avg 15-day growth, avg 15-day volatility and last 3 days’ price change (as a ratio). This is super basic, and indeed only used to compare with the basic Rule of Thumb approach. The idea is that we only want to assess the value of the Machine Learning algorithm (a simple logistic regression) and keep things as similar as possible to the basic Rule of Thumb strategy.
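For the curious, below is a rough sketch in Python of what such a setup can look like. This is not my exact code; prices (a days-by-stocks array of closing prices), train_dates and today are placeholders for data and dates loaded elsewhere:

```python
# Sketch of the 3 features and the logistic regression behind the ML strategy.
# `prices` is assumed to be a (num_days x num_stocks) NumPy array of closing prices;
# `train_dates` and `today` are placeholder date indices defined elsewhere.
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_features(window_15d):
    """Build the 3 features from a 15-day price window (rows = days, cols = stocks)."""
    daily_ret = np.diff(window_15d, axis=0) / window_15d[:-1]
    avg_growth = daily_ret.mean(axis=0)                    # avg 15-day growth
    volatility = daily_ret.std(axis=0)                     # avg 15-day volatility
    last3_change = window_15d[-1] / window_15d[-4] - 1     # last 3 days price change (ratio)
    return np.column_stack([avg_growth, volatility, last3_change])

# training set: features on 50 random dates, label = did the stock go up the next day?
X_train = np.vstack([make_features(prices[d - 15:d]) for d in train_dates])
y_train = np.concatenate([(prices[d] > prices[d - 1]).astype(int) for d in train_dates])

model = LogisticRegression().fit(X_train, y_train)

# daily rule: buy proportionally to P(price up tomorrow), only when it exceeds 50%
p_up = model.predict_proba(make_features(prices[today - 15:today]))[:, 1]
weights = np.where(p_up > 0.5, p_up, 0.0)
if weights.sum() > 0:
    weights = weights / weights.sum()       # portfolio weights for today's buys
```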

Here are the results! (trading from July 2023 to end of June 2024, price/value indexed at the first day of trading):

Now it does not look like a massive win, but we are looking at a return on the investment of ~50% as opposed to ~15% (S&P index gains) over a year. It is also of note that the ML model was actually doing worse when the S&P500 index overall was going down in the second half of 2023. The rule of thumb instead seems to basically track the S&P500 index overall.

To be honest, I could have given up on the ML trading algorithm when the S&P500 index was going down, but overall it is interesting to see what such a simple ML model could achieve in a year.

When I showed that chart to my wife she said: “You made up this chart with your dreams I think… stick to the ETF, but keep dreaming”. I do not blame her skepticism, the above chart looks too good to be true from Jan 24 onward.

In order to get more clarity and “curb my enthusiasm” I tried the same comparative analysis over a different period when the S&P500 index was indeed quite volatile, from July 2022 to end of June 2023 (ML model trained with 50 dates from the first half of 2022).

Below are the results:

It looks like in more volatile times the Machine Learning trading algorithm tracks the S&P500 index pretty closely, but we also see that the Rule of Thumb approach does not transfer well, as it fails miserably.

It is also of note that over 250 trading days, the ML approach beat the S&P index on 142 days and the Rule of Thumb approach on 245 days.

In conclusion, no Jim Simons level wizardry (look Jim Simons up if interested in algorithmic trading) but, for our purpose, the example shows pretty clearly what value Machine Learning can add even in a very simple setting and with highly volatile data.

What would then be possible integrating additional data sources like business performance, macro indicators, internet searches of stock terms, investor updates and more?

Forecast: AI assistants will enable even further “scientization” of business practices and decision making, but we will need people able to articulate the solutions to a variety of audiences.

Note: a decent book on financial analytics is below; it is based on R code, but easy to transfer to Python if needed:

Statistics, Machine Learning and Deep Learning…

Recently I am sure you must have often heard about AGI (Artificial General Intelligence) and that it has something to do with Deep Learning and all that. But what is Deep Learning? What are the differences with things you might be more familiar with, like Stats or Machine Learning?

Let me give you a tour, and in the end you’ll find out that Deep Learning is not as mysterious as it sounds (in fact the backbone is linear algebra plus the calculus you studied at college).

Patterns, Patterns and more Patterns:

Overall we are talking about patterns, and Maths is fundamentally the discipline that “speaks” about patterns. In a sense this is the most fundamental definition of Mathematics (see Hofstadter’s foreword to ‘Gödel’s Proof’ by Nagel & Newman).

Let me put in front of you 3 apples, 3 pens, 3 books, 3 plates and 3 balls… now forget what ‘3’ means in my sentence (if you can), then answer this question: What is the pattern among the various objects that I have presented you? The answer should be that they all seem to be a group of… three distinct objects of a given category. There you have it, a definition of the most basic pattern, and that’s what we call a number. It might seem I am digressing but I want you to get a bit abstract here so let me indulge.

There are then hundreds of specific disciplines that study patterns, and the relationships and properties of patterns. Calculus, for example, is the branch of mathematics that studies patterns between infinitely small quantities and how those patterns generalise; geometry can be looked at as the study of patterns in shapes (or distances), etc.

From this perspective Statistics, Machine Learning and Deep learning are of the same family, they sit in a branch of Mathematics that studies how to reconstruct or recognise patterns. For example, I could give you the below series of numbers:

2, 4, 8, 16, 32 etc.etc.

And we could try and understand whether the pattern is: number at position n+1 is twice that at position number n.

How Statistics, Machine Learning and Deep Learning differ is in how they represent patterns, and in what type of questions about patterns they focus on answering.

With “representation” I mean something pretty basic, in essence a map that allows you to work with the pattern. For example for the square function you can use several representations:

X^2, or X*X, or you could draw a square with a side of length x, or go more abstract and draw a parabola through the origin, or describe it as a derivative, or an integral, or any other way you want really, so long as that representation is useful to you.

That representation will only be a useful map to you so long as it helps you move toward your objective, e.g. are you trying to draw a trajectory? Or prove a geometrical theorem? or solve an equation? etc.etc.etc.

Statistics:

Patterns are represented by parameters of special functions we call distributions, so statistics focuses on finding parameters and telling us how confident we should be that the pattern we identified can be mapped by a given relationship determined by those parameters. We don’t always want to “predict” in Statistics; more often we want to make sure we are onto something substantial or not (e.g. is the proposed medicine sure to drive benefits? Does the defendant’s DNA match the blood sample?).

Statistical inference, to use an example, would aim to answer the question of whether the income of a customer has an impact on how much the customer is likely to spend on your products. It will give you a yes or no answer (yes for jewellery… not so much for bottled water, perhaps?).
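In code, that kind of question looks roughly like this (a toy Python sketch on simulated, made-up data):

```python
# Toy sketch: does customer income have an impact on spend? (simulated data)
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
income = rng.normal(50_000, 15_000, 500)                 # simulated customer incomes
spend = 100 + 0.002 * income + rng.normal(0, 40, 500)    # spend weakly linked to income + noise

result = stats.linregress(income, spend)
print(f"slope = {result.slope:.5f}, p-value = {result.pvalue:.4f}")
# A small p-value is evidence against "income has no effect on spend".
# The output is a yes/no style answer with a measure of confidence,
# not a prediction of spend for a new customer.
```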

Statistics is a highly developed discipline that is able to define boundaries and provide indisputable proofs, as it is a branch of Mathematics. For example it can answer the following general question:

Under certain assumptions on my data, what is the “best” test to verify my hypothesis? What is an admissible test? What is the dynamic policy that maximises utility? Best strategy to win a game? etc.etc.

Machine Learning (or Statistical Learning):

With Machine Learning we are interested in teaching computers to identify patterns, for the purpose of making predictions. For example, if the income of my customer is X how much is she likely to spend on our products? Our task will be that of giving as much data as we can, and a learning framework, for the machine to establish that pattern between income and spend (if there is one).

In Machine Learning it is also important how patterns are represented. We can only use representations that can be implemented in computer programs efficiently.

For example decision trees represent patterns as a series of if-else statements… if this -> then this else that etc.etc. So, in the case of decision trees, the pattern needs to be amenable to mapping to simple if-else statements (e.g. if income is between 100 and 200 then spend will be 10, if greater than 200 then 20 etc.etc.), but overall all Machine Learning algorithms have their own specific ways of representing patterns.

To give another example, K Nearest Neighbours is an algorithm where patterns are identified “spatially”, by mapping the input data to an abstract space where we can define distances. In our example a KNN learning approach would put incomes and spend in a two dimensional space and try to find, when you give it only income, the spend that puts the point in a region of space that “fits” with the geometry of the shape the training data drew.
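Here is a minimal Python sketch of the two representations just described, fitted on made-up income/spend data:

```python
# Same prediction task, two different pattern representations (simulated data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
income = rng.uniform(20_000, 200_000, 1000).reshape(-1, 1)
spend = 100 + 0.002 * income.ravel() + rng.normal(0, 40, 1000)

tree = DecisionTreeRegressor(max_depth=3).fit(income, spend)  # pattern as if-else splits on income
knn = KNeighborsRegressor(n_neighbors=25).fit(income, spend)  # pattern as "what did similar incomes spend?"

new_customer = np.array([[120_000]])
print(tree.predict(new_customer), knn.predict(new_customer))  # two different routes to a prediction
```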

Unlike Statistics (and Decision Theory as a consequence), Machine Learning does not give closed, definitive answers to well defined questions. There is no theorem that goes something like: given some assumptions on your data, this specific learning approach is guaranteed to give you the best prediction.

Machine Learning is therefore closer to engineering than Mathematics, a lot of the effort (and fun) in Machine Learning is that you have to play with the tools to go somewhere and find new opportunities.

Obviously the algorithms are not found by trial and error only; there is a lot of pure statistics behind Machine Learning, but it is mixed with research in computer science as well.

Deep Learning (or Deep Neural Networks):

Deep Learning is a specific branch of Machine Learning where patterns are represented by networks of fundamental computational units called “Neurons”.

Such Neurons are not that complicated, as they perform basic algebra + a simple transformation (to be decided upfront), but the overall architecture of the networks plus their size can make them incredibly powerful. In fact Neural Networks have a nice theorem to them, the universal approximation theorem, which states that, under certain assumptions, neural nets can indeed approximate any well behaved function, guaranteed! (We do not have such theorems for decision trees or KNN.)
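To see how simple a single unit really is, here is a one-neuron sketch in Python (weights and inputs made up):

```python
# One "neuron": a weighted sum of its inputs plus a fixed transformation.
import numpy as np

def neuron(inputs, weights, bias):
    """Basic algebra (weighted sum + bias), then a simple transformation (ReLU here)."""
    pre_activation = np.dot(weights, inputs) + bias   # the "basic algebra" part
    return max(0.0, pre_activation)                   # the transformation decided upfront

x = np.array([0.2, -1.0, 0.5])     # inputs coming from the previous layer
w = np.array([0.7, 0.1, -0.3])     # learned weights
print(neuron(x, w, bias=0.05))
# A deep network is many of these units stacked in layers; the power comes from
# the architecture and the scale, not from the individual unit.
```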

Chat GPT, for example, has millions of Neurons and hundreds of billions of parameters that are only devoted to the task of predicting what sentence from a given language best fits the “pattern” of the conversation that a given prompt initiated (in a human like way).

As with Machine Learning, and even more so, Deep Learning is closer to Computer Science than Statistics, since we basically need to proceed experimentally within technical constraints. We do not have solid theoretical foundations for most of the results obtained in Deep Learning; we literally do not know in much detail what Chat GPT is doing. Through trial and error, a certain network architecture (called the transformer) has been found to be effective, but this is also driven by which architectures we can implement within the constraints of existing computers and computer languages, so Deep Learning is also about the hardware.

A lot of current research focuses on architectures and how to set up various layers of the network.

To be clear this is not “trial and error” in the common language sense, but works more like experimental Physics, with a close interplay with theoretical advances and areas of inquiry.

I hope you found the above useful but for anything a bit deeper on Deep Learning, I recommend the two books below:

Both are very readable introductions, and they can also give a sense of what might be needed to work on AI as opposed to consume AI, or what might be needed to develop AI in house (lots of compute power, computer scientists and mathematicians for sure).

Causal inference: suggested readings and thoughts

I have studied statistics and probability for over 20 years and I have been constantly engaged in all things data, AI and analytics throughout my career. In my experience, and I am sure in the experience of most of you with a similar background, there are core “simple” questions in a business setting that are still very hard to answer in most contexts where uncertainty plays a big role:

Looking at the data, can you tell me what caused this to happen? (lower sales, higher returns etc.etc.)

Or

If we do X can we expect Y?

…and other variations on similar questions.

The two books I have read recently focus on such questions and on the story of how statisticians and computer scientists have led a revolution that, over the past 30-50 years, succeeded in providing us with tools that allow us to answer those questions precisely (I add “precisely” since, obviously, those questions are always answered somehow, but often through unnecessary pain, and often with the wrong answers).

Those tools and that story are still largely unknown outside of Academia and the AI world.

Let’s come to the books:

1 – Causal Inference by Paul R. Rosenbaum

2 – The Book of Why by Judea Pearl and Dana Mackenzie

Both books are an incredibly good read.

Causal Inference takes the statistical inference approach and tells the story of how causes are derived from statistical experiments. Most of you will know the mantra: “Correlation does not imply Causation”; yet this book outlines in fairly simple terms how correlations can be leveraged, and are indeed leveraged, to infer causation.

The typical example here is how the “does smoking cause cancer?” question was answered, and it was answered, obviously, without randomised trials.

The Book of Why is a harder read and goes deeper into philosophical questions. This is natural given the authors are trying to share with us how the language of causation has been developed mathematically in the last 30 years, and the core objective here is to develop the tools that would allow machines to answer causal queries.

I want to get more into the details of some of the key concepts covered in the books and also to give you a sense of how useful the readings could be.

Starting with Rosenbaum’s, I would point out that this is also, overall, a great book to get a sense of how statistical inference, and the theory of decisions under uncertainty, is developing.

This book is a gem, no less.

It starts very simply with the true story of George Washington and how he died after having been bled by his doctor (common practice back then) and asks: would Washington have died, or recovered, had he not been bled?

He then moves to explain randomised trials, causal effects, matching, instruments, propensity scores and more.

Key here is that the only tool for statistical inference that was well developed and accepted up to the 70s was the randomised trial: that is, for example in medicine, giving a treatment to a random sample of individuals, a placebo to the others, and checking the difference in outcomes to make inferences.

This procedure itself is not even causal; in principle it is still flawed with respect to answering a causal query (logically), but it works as follows (a rough sketch in code is given after the list):

  • I see outcome O associated with treatment T
  • What are the chances that I would see O regardless of treatment T?
  • If chances are low, that is evidence of treatment T causing outcome O (unlikely association after all is interpreted to imply causation)
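
Here is that logic as a minimal Python sketch: a simple permutation test on simulated outcomes, with all numbers made up for illustration:

```python
# How likely would the observed treatment/outcome association be if T did nothing?
import numpy as np

rng = np.random.default_rng(3)
treated = rng.normal(1.0, 2.0, 100)     # simulated outcomes under treatment T
control = rng.normal(0.2, 2.0, 100)     # simulated outcomes under placebo
observed_diff = treated.mean() - control.mean()

# permutation test: shuffle the treatment labels many times and count how often
# a difference at least as large appears "by chance alone"
pooled = np.concatenate([treated, control])
count = 0
for _ in range(10_000):
    rng.shuffle(pooled)
    if pooled[:100].mean() - pooled[100:].mean() >= observed_diff:
        count += 1

p_value = count / 10_000
print(f"observed difference = {observed_diff:.2f}, chance of seeing it anyway = {p_value:.4f}")
# If that chance is low, the association is taken as evidence that T causes O.
```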

Rosenbaum goes on to explain why the above works in the causal inference framework, as it is interpreted as a counterfactual statement supported by a particular type of experiment, but then moves on to explain that even observational studies (where there is no placebo, for example) can provide answers that are as robust as randomised trials/experiments.

Other key points are really about the struggle that the statistical community had to go through, and still goes through today, when working with observational studies. Of note is the historical account of the debate when the relationship between smoking and lung cancer was investigated, with the unending “what ifs”… what if smokers are less careful? What if smokers are also drinking alcohol and alcohol causes cancer? And so on.

An illuminating read, which also sheds light on how the debate on climate change is addressed by the scientific community.

Moving on to The Book of Why

I love one of the statements that is found earlier in the book:

“You cannot answer a question that you cannot ask, you cannot ask a question you have no words for”

You can tell this book goes deeper into philosophy and artificial intelligence as it really aims to share with us the development of a language that can deal with posing and answering causal queries:

“Is the rooster crowing causing the sun to rise?”.

Roosters always crow before sunrise, so the association is there, but can we easily express, in precise terms, the obvious concept that roosters are not causing the sun to rise? Can we even express that question in a way a computer could answer?

The authors go into the development of those tools and the story of what hindered this development, namely the “reductionist” school in statistics. Quoting Karl Pearson and his follower Niles:

  • “To contrast causation and correlation is unwarranted as causation is simply perfect correlation”
  • “The ultimate scientific statement of description can always be thrown back upon…a contingency table”

The reductionist school was very attractive as it made the whole endeavour much simpler, and mathematically precise with the tools available at the time. There was a sense of “purity” to that approach, as it was self-consistent although limited, but it indeed attempted to imply that causation, which was a hard problem, was unnecessary (best way to solve a hard problem, right?). Ironically, as the authors also point out, this approach of assuming that data is all there is and that associations are enough to draw an effective picture of the world is something that novices in machine learning still believe today (I will probably talk more about this in a later blogpost).

Pearson himself ended up having to clarify that correlations are not all there is, and are often misleading. He later compiled what his school called “spurious correlations”, which are different from correlations that arise “organically”, although what that meant (namely, causal correlations) was never addressed.

The authors also introduce the ladder of causation, see below:

It is referenced throughout the book and is indeed a meaty concept to grasp as one goes through the 400 pages of intellectual mastery.

What Pearl and Mackenzie achieve, that Rosenbaum does not even aim to discuss, is to invite the reader to reflect upon what understanding is, and how fundamental to our intelligence causality is.

They also then share the tools that allow for answering very practical questions, e.g.:

Our sales went up 30% this month, great, how much of it is due to the recent marketing campaign and how much is it driven by other factors?

The data scientists among you know that tools to address that question are out there, namely structural equation modelling and do-calculus; these are closely related to the structural causal models that Pearl promotes, and, ultimately, the framework of introducing causal hypotheses is unavoidable.
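To give a flavour of what those tools look like, one of the central formulas in Pearl’s framework is the back-door adjustment, where X is the intervention (e.g. the campaign), Y the outcome (e.g. sales) and Z a suitable set of confounders:

```latex
P(Y \mid do(X)) \;=\; \sum_{z} P(Y \mid X,\, Z=z)\, P(Z=z)
```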

Conclusions:

I recommend the books to any knowledge seeker, and to anyone involved in decision making (or in selling benefits that are hard to measure).

I would start with Rosenbaum’s book as it is less than 200 pages and, if time is scarce, I would prioritise reading Pearl and Mackenzie’s book up to chapter 6 first (190 pages).