The Art of Planning, DeepSeek and Politics

A while back I discussed in a post some of the nuances of data-driven decision making when we need to answer a straightforward question at a point in time, e.g.: “shall we do this now?”.

That was a case presented as if our decision “now” would have no relationship to the future, meaning that it would have no impact on our future decisions on the same problem.

Often, of course, we do know that our decisions have an impact on the future, but the issue here is slightly different: we are looking not only at the impact on the future, but at the impact on future decisions.

This is the problem of sequential decisioning. Instead of “shall we do this now?” it answers the question:

“Shall we adopt this strategy/policy over time?”

In other words it tries to solve a dynamic problem, so that the sequence of decisions made could not have been bettered no matter what happened. When you have an optimal solution to this problem, whatever decision is made, at any point in time, is a decision that one would never regret.

I will answer three questions here:

  • What is an example of such a problem?
  • Is there a way to solve such problems?
  • What is the relationship with DeepSeek and politics?

Sequential decisioning problem example – Customer Engagement Management

A typical example is that of working on marketing incentives for a customer base: the problem of deriving a policy for optimal marketing incentives depending on some measures of engagement across that customer base.

Why is it sequential? Because we have a feedback loop between what we plan to do, the marketing incentive, and the customer behaviour that triggers those incentives: engagement.

Whenever we have such loops we cannot solve the problem at a single point in time, regardless of the future.

The solution could look something like: “Invest $X in customer rewards whenever the following engagement KPIs – email open rate, web visits, product interaction, etc. – drop below certain levels.”
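Such a threshold policy is easy to write down once the KPIs and levels are chosen. Here is a minimal sketch in Python, where the KPI names, thresholds and dollar amounts are entirely made up for illustration:

```python
# A made-up threshold policy: spend on rewards when engagement KPIs dip.
THRESHOLDS = {"email_open_rate": 0.20, "weekly_web_visits": 2, "product_interactions": 5}

def incentive_budget(kpis: dict) -> float:
    """Return the $ to invest this period, given current engagement KPIs."""
    lagging = [name for name, level in THRESHOLDS.items() if kpis.get(name, 0) < level]
    return 50.0 * len(lagging)   # e.g. $50 of rewards per lagging KPI (illustrative)

print(incentive_budget({"email_open_rate": 0.15, "weekly_web_visits": 3, "product_interactions": 1}))
```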

It is important to note that we need something to aim for, e.g. maximise return on marketing incentives.

Does a solution to such problems exist?

The good news is that yes, solutions exist, but we have only been able to deal with cases that are amenable to some mathematical/statistical modelling.

We can 100% find a solution if we have a good model that answers the following:

  • How does a given marketing incentive influence customer engagement on average?
  • Given a certain level of customer engagement, what is the expected value of the relationship with the customer?
  • How does the engagement of our customers evolve over time, in the absence of any incentive?

So we do need some KPIs that give us a good sense of what level of customer engagement we have, and we need a sense of what the expected value of the relationship with the customer is, given a particular set of KPIs. To be clear, what we need is something true on average, and at least over a certain period of time.

For example, we should be able to say that, on average, a customer using our products daily for the past month will deliver X value over 2 years.

We also need to be able to say that a given marketing incentive, on average, increases customers’ daily engagement by X% and costs some defined amount of $.

We also need to be able to say something like: our customer engagement rate tends to decrease by X% every few months, again on average.
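To make these ingredients concrete, here is a toy sketch in Python of what such a model could look like. All the numbers (decay rate, incentive lift, value per unit of engagement) are invented for illustration; only the structure matters.

```python
# A toy model of customer engagement dynamics (illustrative numbers only).

DECAY = 0.05               # engagement drops ~5% per month with no incentive
LIFT_PER_DOLLAR = 0.002    # each $ of incentive lifts engagement by 0.2% on average
VALUE_PER_POINT = 40.0     # expected 2-year value per point of engagement

def next_engagement(engagement: float, incentive_dollars: float) -> float:
    """Average engagement next month, given this month's engagement and spend."""
    decayed = engagement * (1 - DECAY)
    lifted = decayed + LIFT_PER_DOLLAR * incentive_dollars
    return min(lifted, 1.0)   # engagement is a score between 0 and 1

def expected_value(engagement: float) -> float:
    """Expected value of the customer relationship at a given engagement level."""
    return VALUE_PER_POINT * engagement

# Example: a fairly engaged customer, with a $50 incentive this month
e = 0.6
print(next_engagement(e, 50.0), expected_value(e))
```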

The above modelling of engagement and customer value is not something that most businesses would find difficult today. Engagement rates, or attrition rates over time, are easy to get, as are the results of marketing campaigns on engagement rates. Harder to get is a reliable estimate of lifetime value given current engagement metrics, but we can solve for a shorter time horizon in such cases.

Richard Bellman is the mathematician credited with having solved such problems in the most general way.

His equation, the Bellman equation, is all you need after you have a mathematical model of your problem.

The equation, in the form I’d like to present it, is the one below, where V is the value of the optimal policy π*:
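In its standard form, writing s for the state (e.g. current customer engagement), a for the action (e.g. the $ spent on incentives), r for the instant reward and γ for a discount factor weighing future rewards, it reads:

```latex
V^{\pi^*}(s) \;=\; \max_{a} \Big\{ \, r(s, a) \;+\; \gamma \, \mathbb{E}\big[\, V^{\pi^*}(s') \mid s, a \,\big] \Big\}
```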

It says the following:

The optimal strategy, depending on parameters a (e.g. how much $ is spent on incentives), is the one that maximizes the instant rewards now, as well as the expected future rewards (future rewards conditional on the policy), no matter what the variables of the problem are at that point in time (e.g. customer engagement and marketing budget left).

It is a beautiful and intuitive mathematical statement which is familiar to most humans:

The best course of action is that which balances instant gratification with delayed gratification.

We all work this way, and this is arguably what makes us human.

To be clear, this problem can be, and is, solved in a large variety of sequential decisioning settings.

The Bellman equation is the basic tool for all planning optimizations in supply management, portfolio management, power grid management and more…

Politics?

Well, in some sense politics is the art of collective, or at least representative, policy making. And here a problem obviously arises: how to balance instant versus delayed gratification when the views of the collective might differ? What if the collective doesn’t even agree on the various aspects of the problem at hand (e.g. the KPIs, the expected rewards, etc.)?

A key aspect of this should be the following: LEARNING.

The illustration below should serve as guidance:

There has to be a feedback loop: the Bellman equation gives conditions for optimality, but often the solution can only be found iteratively. Furthermore, we also need to be open-minded about revising the overall model of the problem if we gather evidence that the conditions for optimality do not seem to come about.
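To make the “found iteratively” point concrete, below is a minimal value-iteration sketch in Python for a tiny, made-up version of the engagement problem (a handful of discrete engagement levels and incentive amounts, with illustrative numbers throughout). It repeatedly applies the Bellman update until the value estimates stop changing, and then reads off the implied policy.

```python
import numpy as np

# Tiny, made-up engagement problem: 5 engagement levels, 3 incentive levels.
states = np.linspace(0.2, 1.0, 5)        # engagement scores
actions = np.array([0.0, 50.0, 100.0])   # $ spent on incentives
gamma = 0.9                               # discount factor

def reward(s, a):
    # value generated at engagement s, minus incentive cost (illustrative)
    return 40.0 * s - 0.2 * a

def next_state_index(s_idx, a):
    # deterministic toy dynamics: engagement decays, incentives push it up
    s_next = states[s_idx] * 0.95 + 0.002 * a
    return int(np.abs(states - s_next).argmin())

V = np.zeros(len(states))
for _ in range(500):                      # repeated Bellman updates
    V_new = np.array([
        max(reward(s, a) + gamma * V[next_state_index(i, a)] for a in actions)
        for i, s in enumerate(states)
    ])
    if np.max(np.abs(V_new - V)) < 1e-6:  # stop when values have converged
        break
    V = V_new

# The implied policy: best incentive for each engagement level
policy = [actions[np.argmax([reward(s, a) + gamma * V[next_state_index(i, a)]
                             for a in actions])]
          for i, s in enumerate(states)]
print(policy)
```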

So we have multiple challenges, which are true for politics but also for business management. We need:

  • Consensus on how to frame the problem
  • A shared vision on the balance between instant versus delayed rewards (or my reward versus your reward)
  • A framework for adapting as we proceed

The above is what I would call being “data driven” – basically “reality driven” – but it is hard to get there. We all obviously use data when we have it, but the real challenge is to operate in such a way that the data will be there when needed.

How about DeepSeek?

If you have followed me thus far, you might have understood that solving a sequential problem basically involves algorithms that learn the dynamics between a policy maker’s action (e.g. marketing $ spent) and the long-term feedback of the policy. This is the principle behind “reinforcement learning” in the AI world, as one wants the learning to be reinforced by incentives for moving in the right direction. This was considered a very promising framework for AI in the early days, but it is an approach that requires time, a fair bit of imagination as well as data, and it was not widely pursued in recent years… until DeepSeek stormed the AI world.
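As a rough sketch of the general principle (not of what DeepSeek actually does), here is the classic tabular Q-learning update in Python: the estimate of an action’s long-term value is nudged toward “instant reward plus discounted future reward”, which is exactly the Bellman balance from earlier. The states, actions and numbers are all made up.

```python
import random

# Tabular Q-learning: a minimal illustration of "learning reinforced by rewards".
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration rate
Q = {}                                    # Q[(state, action)] -> estimated long-term reward

def choose_action(state, actions):
    if random.random() < epsilon:                                  # explore occasionally
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # otherwise exploit

def update(state, action, reward, next_state, actions):
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    # Nudge the estimate toward "instant reward + discounted future reward"
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# Example usage with made-up states and actions:
actions = ["no_incentive", "small_reward", "big_reward"]
a = choose_action("low_engagement", actions)
update("low_engagement", a, reward=1.0, next_state="medium_engagement", actions=actions)
print(Q)
```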

A key element of DeepSeek’s successful R1 generative AI model was that it leveraged reinforcement learning, which forced it to develop advanced reasoning at a lower cost.

This was achieved through a very clever redesign of the transformer architecture, but I won’t go through it (as I have not yet understood it well myself).

As usual let me point you to a great book on the above topics.

My favorite is “Economic Dynamics” by John Stachurski: a great resource on sequential problem solving under uncertainty. Pretty theoretical, but that’s inevitable here.

Basics of Mathematical Modelling – Shapes of Uncertainty

Most of us, especially in business, but also in our private lives, perform some basic mathematical modelling all the time. We can do it wrong or we can do it right, but we do it… constantly.

The most basic form of mathematical/statistical modelling is to give shapes to our uncertainty (ignoring the shape is also part of it).

Let me tell you a story that should clarify what I mean.

John and his team at SalesX, an online marketplace, are trying to outsource their call centre operations and are looking for potential suppliers. They sent out an RFP (request for proposals) and they received a few responses. (note: despite the name SalesX is not owned by Elon Musk).

John calls his team in his office:

John: “Hi, everyone, could you give me a clear high level summary of the various responses we received from the call centers?”

Jane: “Yes, we shortlisted 3 responses for you, although we are quite confident on a winner…”

John: “Ok, what are the average call handling times of the shortlisted suppliers?”

Jane: “They are all about 2 minutes, but one provider is considerably cheaper. It is quite clear we should go for them. SmartCalls is the name of the company and they are heavily relying on automation and AI to support their operations, which keeps their costs very low, a very clever team.”

John: “Ok, sounds pretty good, let’s meet them but still keep another response in the race as we dig deeper beyond performance and price.”

Laura, in Jane’s team, is not convinced by such a swift decision, and questions Jane…

Laura: “Jane, 2 minutes is great, but what about the % of calls that are not resolved in one session? That one is…”

Jane: “Yes, that number is higher for SmartCalls, but this is hard to compare across providers, right? Their clients have different rules for re-routing inquiries in house…”

Laura: “John, Jane, before moving forward with SmartCalls, let me reach out to the shortlisted suppliers to request some additional information. It will be very hard to switch to another supplier once we are done with this…”

John: “Ok but I give you 3 days, I hope you are after something substantial… SmartCalls is super cheap. Handling calls is not rocket science…”

Laura goes back to the suppliers and asks them for a full view of the distribution of call handling times. She finds out the following, summarized in the chart below, which she promptly shares with John:

John looks at it, but he is not too clear on what it means…

John: “Ok Laura, translate please…”

Laura: “Basically, SmartCalls’ business model is that of leveraging AI with an inexperienced and cheap workforce; they can deal quickly with a large number of calls relating to simple issues, but their operators lack the creativity or experience to deal with issues of medium complexity. The operators all work remotely, with no chance of sharing information with each other…”

John: “Wait, wait… the chart first, what does that mean?”

Laura: “Oh… isn’t that obvious? SmartCalls has a 2-minute call average, yes, but this is driven by a large number of very quick calls; when it comes to customer satisfaction, there’s a good number of calls that go beyond 3-4 minutes.”

John: “Ok I get it, their calls are either quick or rather long, whilst iHelp, for example, is able to be more consistent, with most calls handled in about 2 minutes, right?”

Laura: “Yes, they avoid shortcutting the initial problem identification phase and have a longer list of mandatory screening questions, but this pays off. They are able to share the calls with specialized teams and…”

John: “Ok I get it indeed. I also see several of SmartCalls’ calls going beyond 5 minutes, which is our threshold for bringing customer calls back in house… good work Laura. Jane, good work on hiring Laura.”

Jane is a bit deflated, but she ultimately smiles as she is proud of having trained Laura well.

The quick story above is a good example of how we routinely perform unsophisticated statistical modelling, especially when we implicitly anchor ourselves to one-dimensional metrics, like averages.

When John heard the average call times, he implicitly assumed that those averages were comparable and meaningful in their own right. This means he assumed (modelled) the range and frequency of call times to be similar in shape across suppliers, which is a fair bit of statistical modelling in practice (wrong modelling, in this case).

After all, what can you really tell from comparing averages unless you make the rather strong assumption that those averages are informative in summarizing the underlying data appropriately? If you do so, although you might not know it, you are doing statistical modelling.
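The SmartCalls-versus-iHelp situation is easy to reproduce numerically: two sets of call times can have (almost) the same average while having very different shapes. The numbers below are invented purely to illustrate the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# "SmartCalls": many very quick calls plus a tail of long ones (bimodal).
smartcalls = np.concatenate([
    rng.normal(1.0, 0.3, 7000),   # quick, automated calls (~1 min)
    rng.normal(4.5, 1.0, 3000),   # harder calls that drag on (~4-5 min)
])

# "iHelp": consistent handling around 2 minutes.
ihelp = rng.normal(2.1, 0.5, 10000)

for name, times in [("SmartCalls", smartcalls), ("iHelp", ihelp)]:
    times = np.clip(times, 0.2, None)
    print(name,
          "mean:", round(times.mean(), 2),
          "% over 5 min:", round(100 * (times > 5).mean(), 1))
# Similar means, very different tails: the average alone hides the risk.
```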

Laura, instead, decided to look at the actual shape of the data and to avoid any high level, uninformed assumption, yet she is still assuming quite a few things, among which:

  • The data she received is representative (not skewed toward some specific types of calls)
  • The distribution of the data will be consistent over time
  • The distribution of the data, which reflects how the call centres handle other clients’ calls, is relevant to SalesX’s customer base

That’s basically statistical inference, and whenever you make a decision or judge data, although you might not think about it, you, as well, are doing statistics and data science.

The question is: “Do you know when you are doing that well, or when you are being ineffective?”

Another key aspect of giving shapes to uncertainty is the question of which metrics to measure: average call times, abandon rate, both, or other metrics. Which ones to choose?

This is, in a way, the operational side of the problem I presented above: when it comes to data that tracks a business activity or process, which metrics truthfully summarize the shape of that data?

I can recommend two good books on this topic. One is a technical text, Optimal Control Theory by Donald E. Kirk; the other is a more accessible read focusing specifically on designing meaningful metrics that track Objectives and Key Results (OKRs), Measure What Matters by John Doerr.

Optimal Control Theory goes well beyond the subject of performance measures, but the first 50 pages are a good introduction to the overall framework.

Statistics, Machine Learning and Deep Learning…

Recently, I am sure, you must have often heard about AGI (Artificial General Intelligence) and that it has something to do with Deep Learning and all that. But what is Deep Learning? How does it differ from things you might be more familiar with, like Statistics or Machine Learning?

Let me give you a tour, and in the end you’ll find out that Deep Learning is not as mysterious as it sounds (in fact the backbone is linear algebra plus the calculus you studied at college).

Patterns, Patterns and more Patterns:

Overall we are talking about patterns, and Maths is fundamentally the discipline that “speaks” about patterns. In a sense this is the most fundamental definition of Mathematics (see Hofstadter’s foreword to ‘Gödel’s Proof’ by Nagel & Newman).

Let me put in front of you 3 apples, 3 pens, 3 books, 3 plates and 3 balls… now forget what ‘3’ means in my sentence (if you can), then answer this question: what is the pattern among the various objects that I have presented to you? The answer should be that they all seem to be a group of… three distinct objects of a given category. There you have it, a definition of the most basic pattern, and that’s what we call a number. It might seem I am digressing, but I want you to get a bit abstract here, so let me indulge.

There are then hundreds of specific disciplines that study patterns, relationships and properties of patterns. Calculus, for example, is the branch of mathematics that studies patterns between infinitely small quantities and how those patterns generalise; geometry can be looked at as the study of patterns in shapes (or distances); and so on.

From this perspective Statistics, Machine Learning and Deep Learning are of the same family: they sit in a branch of Mathematics that studies how to reconstruct or recognise patterns. For example, I could give you the below series of numbers:

2, 4, 8, 16, 32 etc.etc.

And we could try and understand whether the pattern is: the number at position n+1 is twice the number at position n.
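In code, checking that candidate pattern against the series is a one-liner:

```python
series = [2, 4, 8, 16, 32]
# Candidate pattern: each term is twice the previous one.
print(all(b == 2 * a for a, b in zip(series, series[1:])))   # True
```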

Where Statistics, Machine Learning and Deep Learning differ is in how they represent patterns, and in what type of questions about patterns they focus on answering.

By “representation” I mean something pretty basic: in essence, a map that allows you to work with the pattern. For example, for the square function you can use several representations:

X^2, or X*X, or you could draw a square with a side of length X, or go more abstract and draw a parabola that passes through the origin, or express it as a derivative, or an integral, or any other way you want really, so long as that representation is useful to you.

That representation will only be a useful map to you so long as it helps you move toward your objective: are you trying to draw a trajectory? Prove a geometrical theorem? Solve an equation?

Statistics:

Patterns are represented by parameters of special functions we call distributions, so statistics focuses on finding those parameters and telling us how confident we should be that the pattern we identified can be mapped by a given relationship determined by those parameters. We don’t always want to “predict” in Statistics; more often we want to make sure whether we are onto something substantial or not (e.g. does the proposed medicine actually drive benefits? Does the defendant’s DNA match the blood sample?).

Statistical inference, to use an example, tries to answer the question of whether the income of a customer has an impact on how much the customer is likely to spend on your products. It will give you a yes-or-no answer (yes for jewellery… not so much for bottled water, perhaps?).
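As a minimal sketch of what that yes-or-no answer looks like in practice, here is a simple regression test in Python on simulated data (the income-to-spend relationship is deliberately baked into the simulation, so the test should find it):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data: income in $k, spend with a built-in dependence on income plus noise.
income = rng.uniform(20, 150, 200)
spend = 5 + 0.3 * income + rng.normal(0, 10, 200)

result = stats.linregress(income, spend)
print("slope:", round(result.slope, 2), "p-value:", result.pvalue)
# A small p-value says: a slope this large would be very unlikely
# if income truly had no impact on spend -> "yes, income matters".
```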

Statistics is a highly developed discipline that is able to define boundaries and provide indisputable proofs, as it is a branch of Mathematics. For example, it can answer the following general questions:

Under certain assumptions on my data, what is the “best” test to verify my hypothesis? What is an admissible test? What is the dynamic policy that maximises utility? What is the best strategy to win a game? And so on.

Machine Learning (or Statistical Learning):

With Machine Learning we are interested in teaching computers to identify patterns, for the purpose of making predictions. For example, if the income of my customer is X, how much is she likely to spend on our products? Our task will be that of giving the machine as much data as we can, plus a learning framework, so that it can establish the pattern between income and spend (if there is one).

In Machine Learning it is also important how patterns are represented. We can only use representations that can be implemented in computer programs efficiently.

For example, decision trees represent patterns as a series of if-else statements… if this, then this, else that. So, in the case of decision trees, the pattern needs to be amenable to mapping onto simple if-else statements (e.g. if income is between 100 and 200 then spend will be 10, if greater than 200 then 20, and so on), but overall all Machine Learning algorithms have their own specific ways of representing patterns.
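A decision tree really is just nested if-else statements learned from data. As a sketch, here are the thresholds from the example above hard-coded by hand (a library would learn them from the data instead); the value for lower incomes is made up:

```python
def predicted_spend(income: float) -> float:
    """A hand-written 'decision tree' mirroring the if-else example above."""
    if income > 200:
        return 20.0
    elif income >= 100:     # income between 100 and 200
        return 10.0
    else:
        return 5.0          # made-up value for lower incomes

print(predicted_spend(150))   # -> 10.0
# A library such as scikit-learn would learn the same kind of structure from data,
# e.g. DecisionTreeRegressor().fit(incomes.reshape(-1, 1), spends)
```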

To give another example, K Nearest Neighbours is an algorithm where patterns are identified “spatially” by mapping the input data to an abstract space where we can define distances. In our example, a KNN learning approach would put incomes and spend in a two-dimensional space and try to find, when you give it only an income, the spend that puts the point in a region of space that “fits” with the geometry of the shape the training data drew.
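And the same income-to-spend example in a K-Nearest-Neighbours style: predict by averaging the spend of the K training customers whose income is closest. A minimal one-dimensional sketch with made-up data:

```python
# Minimal K-Nearest-Neighbours regression in one dimension (made-up data).
incomes = [30, 45, 60, 80, 110, 150, 200, 250]     # training incomes ($k)
spends  = [4,  6,  7,  9,  11,  14,  19,  22]      # observed spends ($)

def knn_predict(income: float, k: int = 3) -> float:
    # Find the k training points "closest" to the query income...
    nearest = sorted(zip(incomes, spends), key=lambda p: abs(p[0] - income))[:k]
    # ...and predict the average of their spends.
    return sum(s for _, s in nearest) / k

print(knn_predict(95))   # prediction for a customer earning $95k
```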

Unlike Statistics (and, as a consequence, Decision Theory), Machine Learning does not give closed-form answers to well-defined questions. There is no theorem that goes something like: given some assumptions on your data, this specific learning approach is guaranteed to give you the best prediction.

Machine Learning is therefore closer to engineering than to Mathematics: a lot of the effort (and fun) in Machine Learning comes from having to play with the tools to get somewhere and find new opportunities.

Obviously the algorithms are not found by trial and error only; there is a lot of pure statistics behind Machine Learning, but it is mixed with research in computer science as well.

Deep Learning (or Deep Neural Networks):

Deep Learning is a specific branch of Machine Learning where patterns are represented by networks of fundamental computational units called “neurons”.

Such neurons are not that complicated, as they perform basic algebra plus a simple transformation (to be decided upfront), but the overall architecture of the networks, plus their size, can make them incredibly powerful. In fact neural networks have a nice theorem to them, the universal approximation theorem, which states that, under certain assumptions, neural nets can approximate any well-behaved function to arbitrary accuracy, guaranteed! (We do not have such theorems for decision trees or KNN.)
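To show how unmysterious a single neuron is, here is the forward pass of a tiny network written from scratch in Python (random weights, no training; the point is only the “basic algebra plus a simple transformation” structure):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)   # the "simple transformation" (activation)

# A tiny 2-layer network: 3 inputs -> 4 hidden neurons -> 1 output.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

def forward(x):
    h = relu(W1 @ x + b1)   # each hidden neuron: weighted sum + activation
    return W2 @ h + b2      # output neuron: another weighted sum

print(forward(np.array([0.5, -1.2, 3.0])))
# Deep Learning is this, stacked many layers deep, with the weights learned from data.
```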

ChatGPT, for example, has millions of neurons and hundreds of billions of parameters that are devoted solely to the task of predicting what sentence from a given language best fits the “pattern” of the conversation that a given prompt initiated (in a human-like way).

As with Machine Learning, and even more so, Deep Learning is closer to Computer Science than to Statistics, since we basically need to proceed experimentally within technical constraints. We do not have solid theoretical foundations for many of the results obtained in Deep Learning; we literally do not know in much detail what ChatGPT is doing. Through trial and error, a certain network architecture (called the transformer) has been found to be effective, but this is also driven by what architectures we can implement within the constraints of existing computers and programming languages, so Deep Learning is also about the hardware.

A lot of current research focuses on architectures and how to set up various layers of the network.

To be clear, this is not “trial and error” in the common-language sense; it works more like experimental Physics, with a close interplay with theoretical advances and areas of inquiry.

I hope you found the above useful, but for anything a bit deeper on Deep Learning I recommend the two books below:

Both are very readable introductions, and they can also give a sense of what might be needed to work on AI as opposed to consuming AI, or what might be needed to develop AI in house (lots of compute power, computer scientists and mathematicians, for sure).