Basics of Mathematical Modelling – Shapes of Uncertainty

Most of us, especially in business, but also in our private lives, perform some basic mathematical modelling all the time. We can do it wrong or we can do it right, but we do it … constantly.

The most basic form of mathematical/statistical modelling is to give shapes to our uncertainty (and ignoring the shape is itself a modelling choice).

Let me tell you a story that should clarify what I mean.

John and his team at SalesX, an online marketplace, are trying to outsource their call centre operations and are looking for potential suppliers. They sent out an RFP (request for proposals) and received a few responses. (Note: despite the name, SalesX is not owned by Elon Musk.)

John calls his team into his office:

John: “Hi, everyone, could you give me a clear high level summary of the various responses we received from the call centers?”

Jane: “Yes, we shortlisted 3 responses for you, although we are quite confident about the winner…”

John: “Ok, what are the average call handling times of the shortlisted suppliers?”

Jane: “They are all about 2 minutes, but one provider is considerably cheaper. It is quite clear we should go for them. SmartCalls is the name of the company and they are heavily relying on automation and AI to support their operations, which keeps their costs very low, a very clever team.”

John: “Ok, sounds pretty good, let’s meet them but still keep another response in the race as we dig deeper beyond performance and price.”

Laura, in Jane’s team, is not convinced by such a swift decision, and questions Jane…

Laura: “Jane, 2 minutes is great, but what about the % of calls that are not resolved in one session? That one is…”

Jane: “Yes, that number is higher for SmartCalls, but this is hard to compare across providers, right? Their clients have different rules for re-routing inquiries in house…”

Laura: “John, Jane, before moving forward with SmartCalls, let me reach out to the shortlisted suppliers to request some additional information. It will be very hard to switch to another supplier once we are done with this…”

John: “Ok but I give you 3 days, I hope you are after something substantial… SmartCalls is super cheap. Handling calls is not rocket science…”

Laura goes back to the suppliers and asks them for a full view of the distribution of call handling times. She finds out the following, summarized in the chart below, which she promptly shares with John:

John looks at it but he is not too clear on what it means…

John: “Ok Laura, translate please…”

Laura: “Basically, SmartCall’s business model is to leverage AI with an inexperienced and cheap workforce. They can deal quickly with a large number of calls relating to simple issues, but their operators lack the creativity or experience to deal with issues of medium complexity. The operators all work remotely with no chance of sharing information with each other…”

John: “Wait, wait… the chart first, what does that mean?”

Laura: “Oh… isn’t that obvious? SmartCall has a 2 minute call average, yes, but this is driven by a large number of very quick calls; when it comes to customer satisfaction, there’s a good number of calls that go beyond 3-4 minutes.”

John: “Ok I get it, their calls are either quick or rather long, whilst iHelp, for example, is able to be more consistent, with most calls handled in about 2 minutes, right?”

Laura: “Yes, they avoid shortcutting the initial problem identification phase and have a longer list of mandatory screening questions, but this pays off. They are able to share the calls with specialized teams and…”

John: “Ok I get it indeed. I also see several of SmartCall’s calls going beyond 5 minutes, which is our threshold for bringing customer calls back in house… good work Laura. Jane, good work on hiring Laura.”

Jane is a bit deflated, but she ultimately smiles as she is proud of having trained Laura well.

The quick story above is a good example of how we routinely perform unsophisticated statistical modelling, especially when we implicitly anchor ourselves to one-dimensional metrics, like averages.

When John heard the average call times he implicitly assumed that those averages were comparable and meaningful in their own right. This means he assumed (modelled) the range and frequency of call times to be similar in shape across suppliers, which is a fair bit of statistical modelling in practice (wrong modelling, in this case).

After all, what can you really tell from comparing averages unless you make the rather strong assumption that those averages summarize the underlying data appropriately? If you do so, although you might not know it, you are doing statistical modelling.
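To make the point concrete, here is a minimal sketch of two call-time distributions that share a ~2 minute average but have very different shapes. The specific mixtures, parameters and the 5-minute threshold are my own assumptions for illustration, not data from the story:

```python
import numpy as np

rng = np.random.default_rng(42)
n_calls = 10_000

# Hypothetical SmartCalls: a mix of very quick, AI-assisted calls and long, escalated ones.
smartcalls = np.concatenate([
    rng.normal(1.0, 0.3, int(n_calls * 0.7)),   # 70% quick calls around 1 minute
    rng.normal(4.5, 1.0, int(n_calls * 0.3)),   # 30% long calls around 4.5 minutes
])

# Hypothetical iHelp: consistent handling, most calls close to 2 minutes.
ihelp = rng.normal(2.0, 0.4, n_calls)

for name, times in [("SmartCalls", smartcalls), ("iHelp", ihelp)]:
    times = np.clip(times, 0.2, None)  # no negative call times
    print(f"{name:10s} mean={times.mean():.2f} min, "
          f"share over 5 min={(times > 5).mean():.1%}")
```

The averages come out nearly identical, yet one provider sends a visible share of calls past the 5-minute mark while the other almost never does; that is exactly the information the average hides.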

Laura, instead, decided to look at the actual shape of the data and to avoid any high level, uninformed assumption, yet she is still assuming quite a few things, among which:

  • The data she received is representative (not skewed toward some specific types of calls)
  • The distribution of the data will be consistent over time
  • The distribution of the data, which reflects how the call centres handle other clients’ calls, is relevant to SalesX’s customer base

That’s basically statistical inference, and whenever you make a decision or judge data, although you might not think about it, you, as well, are doing statistics and data science.

The question is: “Do you know when you are doing that well, or when you are being ineffective?”

Another key aspect of giving shapes to uncertainty is the question of which metrics to measure: average call times, abandon rate, both, or other metrics entirely. Which ones to choose?

This is, in a way, the operational side of the problem I presented above: when it comes to data that tracks a business activity or process, which metrics truthfully summarize the shape of that data?

I can recommend two good books on this topic. One is a technical text, Optimal Control Theory by Donald E. Kirk; the other is a more accessible read focusing specifically on designing meaningful metrics that track Objectives and Key Results (OKRs), Measure What Matters by John Doerr.

Optimal Control Theory goes well beyond the subject of performance measures, but the first 50 pages are a good introduction to the overall framework.

What is the value of Machine Learning? – A stock trading example

Since the days of the big data buzzwords, now over 20 years ago, data driven changes to business models, product value, customer journeys and more have gone through massive enhancements.

Nevertheless in the early days there were several discussions about the supposed “value” of data and such transformations. I know it because I was there, in most cases on the selling side.

It was not uncommon to find middle managers considering data driven transformations as a “nice to have” or simple ways to copy the competition, or something to do from a PR perspective. It was also not unusual for large businesses to look down on new tech startups as if looking at kids playing with unnecessary & fancy toys.

There was a lot of: “we do things this way because it works”, “our customers don’t need this”, “open source is dangerous”, “we do not need fancy machine learning for our simple problems” and so on.

There was also a lot of confidence in existing practices, something to the extent of: “I have been doing this for 20 years, I know better. My expertise beats your data science kid…”

To be clear, a lot of the skepticism was healthy and based on more than gut feelings, but over time the most negative voices disappeared. Nevertheless, a question remains relevant when starting any data driven & AI powered transformation: “What value will it deliver, considering the unknowns and costs?”

Is it just headcount reduction? Or reduced waste? Or better products? How will we measure all of this?

I am not going to answer those questions in general; in my view it would be silly to do so. How technology driven innovations can benefit a given solution to a problem is very contextual, not something we can broadly generalise.

I will though share an example, and a good one I believe. An example where value is upfront and easy to measure: how much value can a simple machine learning stock trading algorithm deliver when put against a simple rule-of-thumb trading strategy?

Disclaimer: I am not an experienced stock trader; I only started recently, following the sharp devaluation of the Japanese Yen (I am based and work in Tokyo). Do not reach out for any investment advice.

In any case, I thought it would be interesting to quickly put the two trading strategies against one another. Both are very simple, and were devised within a few days in the small window of time I have between my kids falling asleep and my inability to keep my eyes open.

The two strategies are as follows (on a daily trading routine) and are both based only on S&P 500 stocks:

  • Rule of Thumb: Every day, rank S&P 500 stocks by the strength of their linear growth over the past 15 days and by how much the price has gone down in the past 3 days. Buy the 50 stocks that had the strongest linear growth but lost the most value in the last 3 days. The rationale is that the price should bounce back to regress to the mean of the linear growth.
  • Machine Learning: Create a model that, given 15 days of S&P 500 stock price history, gives the probability that the price will grow the next day. Buy stock proportionally to that probability, when the probability is > 50%. The model is trained on data from 50 random dates in the first part of 2023.

As you can see, even the Machine Learning approach is really simple: we feed the model only 15 days of stock price data and create only 3 features: average 15-day growth, average 15-day volatility and the last 3 days’ price change (as a ratio). This is super basic, and indeed only used to compare with the basic Rule of Thumb approach. The idea is that we only want to assess the value of the Machine Learning algorithm (a simple logistic regression) and keep things as similar as possible to the basic Rule of Thumb strategy.
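For the curious, below is a minimal sketch of the kind of feature engineering and model described above. The three features and the logistic regression match the description in the text, but the synthetic price series, column names and labelling logic are my own assumptions, not the actual code behind the results:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def make_features(prices: pd.Series) -> pd.DataFrame:
    """Build the three features described in the text from a daily price series."""
    daily_ret = prices.pct_change()
    feats = pd.DataFrame({
        "avg_growth_15d": daily_ret.rolling(15).mean(),   # avg 15-day growth
        "volatility_15d": daily_ret.rolling(15).std(),    # avg 15-day volatility
        "change_3d": prices.pct_change(3),                # last 3 days price change (ratio)
    })
    # Label: 1 if the price goes up the next day, 0 otherwise.
    feats["up_next_day"] = (prices.shift(-1) > prices).astype(int)
    return feats.dropna().iloc[:-1]  # drop warm-up rows and the last row (no next-day label)

# Synthetic prices standing in for one S&P 500 stock.
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, 300))))

data = make_features(prices)
X, y = data[["avg_growth_15d", "volatility_15d", "change_3d"]], data["up_next_day"]

model = LogisticRegression().fit(X, y)
prob_up = model.predict_proba(X.tail(1))[0, 1]
print(f"Probability the price grows tomorrow: {prob_up:.1%}")
# Trading rule from the text: buy proportionally to prob_up, only when prob_up > 0.5.
```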

Here are the results! (trading from July 2023 to the end of June 2024, price/value indexed at the first day of trading):

Now it does not look like a massive win, but we are looking at a return on the investment of ~50%, as opposed to ~15% (S&P index gains), over a year. It is also of note that the ML model was actually doing worse when the S&P 500 index overall was going down in the second half of 2023. The Rule of Thumb instead seems to basically track the S&P 500 index.

To be honest, I could have given up on the ML trading algorithm when the S&P 500 index was going down, but overall it is interesting to see what such a simple ML model could achieve in a year.

When I showed that chart to my wife she said: “You made up this chart with your dreams I think… stick to the ETF, but keep dreaming”. I do not blame her skepticism; the chart above looks too good to be true from January 2024 onward.

In order to get more clarity and “curb my enthusiasm”, I tried the same comparative analysis over a different period when the S&P 500 index was indeed quite volatile, from July 2022 to the end of June 2023 (ML model trained on 50 dates from the first half of 2022).

Below are the results:

It looks like in more volatile times the Machine Learning trading algorithm tracks the S&P 500 index pretty closely, but we also see that the Rule of Thumb approach does not transfer well: it fails miserably.

It is also of note that, over the 250 trading days, the ML approach beat the S&P index on 142 days and the Rule of Thumb approach on 245 days.

In conclusion, no Jim Simons level wizardry (look Jim Simons up if you are interested in algorithmic trading), but for our purposes the example shows pretty clearly what value Machine Learning can add, even in a very simple setting and with highly volatile data.

What would then be possible by integrating additional data sources like business performance, macro indicators, internet searches of stock terms, investor updates and more?

Forecast: AI assistants will enable even further “scientization” of business practices and decision making, but we will need people able to articulate the solutions to a variety of audiences.

Note: A decent book on financial analytics is below; the code is in R, but it is easy to transfer to Python if needed:

Statistics, Machine Learning and Deep Learning…

Recently, I am sure, you must have often heard about AGI (Artificial General Intelligence) and that it has something to do with Deep Learning and all that. But what is Deep Learning? What are the differences with things you might be more familiar with, like Stats or Machine Learning?

Let me give you a tour, and in the end you’ll find out that Deep Learning is not as mysterious as it sounds (in fact the backbone is linear algebra plus the calculus you studied at college).

Patterns, Patterns and more Patterns:

Overall we are talking about patterns, and Maths is fundamentally the discipline that “speaks” about patterns. In a sense this is the most fundamental definition of Mathematics (see Hofstadter’s foreword to ‘Gödel’s Proof’ by Nagel & Newman).

Let me put in front of you 3 apples, 3 pens, 3 books, 3 plates and 3 balls… now forget what ‘3’ means in my sentence (if you can), then answer this question: what is the pattern among the various objects that I have presented to you? The answer should be that they all seem to be a group of… three distinct objects of a given category. There you have it, a definition of the most basic pattern, and that’s what we call a number. It might seem I am digressing, but I want you to get a bit abstract here, so let me indulge.

There are hundreds of specific disciplines that study patterns, and the relationships and properties of patterns. Calculus, for example, is the branch of mathematics that studies patterns between infinitely small quantities and how those patterns generalise; geometry can be looked at as the study of patterns in shapes (or distances), etc.

From this perspective Statistics, Machine Learning and Deep Learning are of the same family; they sit in a branch of Mathematics that studies how to reconstruct or recognise patterns. For example, I could give you the series of numbers below:

2, 4, 8, 16, 32 etc.etc.

And we could try and understand whether the pattern is: the number at position n+1 is twice that at position n.

Where Statistics, Machine Learning and Deep Learning differ is in how they represent patterns, and what types of questions about patterns they focus on answering.

By “representation” I mean something pretty basic, in essence a map that allows you to work with the pattern. For example, for the square function you can use several representations:

x², or x·x, or you could draw a square with a side of length x, or go more abstract and draw a parabola passing through the origin, or use a derivative, or an integral, or any other way you want really, so long as that representation is useful to you.

That representation will only be a useful map so long as it helps you move toward your objective: are you trying to draw a trajectory? Prove a geometrical theorem? Solve an equation? Etc.

Statistics:

Patterns are represented by parameters of special functions we call distributions. Statistics focuses on finding those parameters and telling us how confident we should be that the pattern we identified can be mapped by a given relationship determined by those parameters. We don’t always want to “predict” in Statistics; more often we want to make sure we are onto something substantial (e.g. is the proposed medicine sure to drive benefits? Does the defendant’s DNA match the blood sample?).

Statistical inference, to use an example, would like to answer the question of whether the income of a customer has an impact on how much the customer is likely to spend on your products. It will give you a yes or no answer (yes for jewellery… not so much for bottled water perhaps?).
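As a concrete illustration of that kind of yes/no question, here is a minimal sketch using a simple linear regression and its p-value. The data is simulated, and the “income drives jewellery spend” effect is an assumption I baked into the simulation purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200

# Simulated customers: income in thousands, two kinds of spend.
income = rng.normal(60, 15, n)
spend_jewellery = 5 + 0.4 * income + rng.normal(0, 10, n)  # income matters (by construction)
spend_water = 8 + rng.normal(0, 2, n)                      # income irrelevant (by construction)

for label, y in [("jewellery", spend_jewellery), ("bottled water", spend_water)]:
    result = stats.linregress(income, y)
    verdict = "yes" if result.pvalue < 0.05 else "no"
    print(f"{label:13s} slope={result.slope:+.2f}, p-value={result.pvalue:.3g} -> income matters? {verdict}")
```

The p-value is the statistical machinery answering “are we onto something substantial, or could this pattern be noise?”.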

Statistics is a highly developed discipline that is able to define boundaries and provide indisputable proofs, as it is a branch of Mathematics. For example, it can answer general questions of the following kind:

Under certain assumptions on my data, what is the “best” test to verify my hypothesis? What is an admissible test? What is the dynamic policy that maximises utility? The best strategy to win a game? Etc.

Machine Learning (or Statistical Learning):

With Machine Learning we are interested in teaching computers to identify patterns for the purpose of making predictions. For example, if the income of my customer is X, how much is she likely to spend on our products? Our task will be that of giving the machine as much data as we can, and a learning framework, for it to establish the pattern between income and spend (if there is one).

In Machine Learning it also matters how patterns are represented: we can only use representations that can be implemented efficiently in computer programs.

For example, decision trees represent patterns as a series of if-else statements: if this, then this, else that. So, in the case of decision trees, the pattern needs to be amenable to mapping to simple if-else statements (e.g. if income is between 100 and 200 then spend will be 10, if greater than 200 then 20, and so on). Each Machine Learning algorithm has its own specific way of representing patterns.
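To make the if-else idea tangible, here is a minimal sketch of a decision tree on the income/spend toy example. The numbers and the use of scikit-learn are my own assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data mirroring the example in the text: spend jumps at income thresholds.
income = np.array([[80], [120], [150], [180], [220], [250], [300]])
spend = np.array([5, 10, 10, 10, 20, 20, 20])

tree = DecisionTreeRegressor(max_depth=2).fit(income, spend)
# The learned pattern is literally a set of if-else rules on income:
print(export_text(tree, feature_names=["income"]))
```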

To give another example, K Nearest Neighbours is an algorithm where patterns are identified “spatially”, by mapping the input data to an abstract space where we can define distances. In our example, a KNN learning approach would put incomes and spend in a two dimensional space and, when you give it only an income, try to find the spend that puts the point in a region of space that “fits” with the geometry of the shape the training data drew.
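And a matching sketch for the K Nearest Neighbours idea: predict spend for a new income by averaging the spend of the “nearest” incomes in the training data. Again, the toy data and the use of scikit-learn are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

income = np.array([[80], [120], [150], [180], [220], [250], [300]])
spend = np.array([5, 10, 10, 10, 20, 20, 20])

knn = KNeighborsRegressor(n_neighbors=3).fit(income, spend)
# Predict spend for income 210: the 3 nearest training incomes are 220, 180 and 250,
# so the prediction is the average of 20, 10 and 20 (~16.7).
print(knn.predict([[210]]))
```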

Unlike Statistics (and Decision Theory as a consequence), Machine Learning does not give closed answers to well defined questions. There is no theorem that goes something like: given some assumptions on your data, this specific learning approach is guaranteed to give you the best prediction.

Machine Learning is therefore closer to engineering than Mathematics: a lot of the effort (and fun) in Machine Learning is that you have to play with the tools to go somewhere and find new opportunities.

Obviously the algorithms are not found by trial and error only; there is a lot of pure statistics behind Machine Learning, but it is mixed with research in computer science as well.

Deep Learning (or Deep Neural Networks):

Deep Learning is a specific branch of Machine Learning where patterns are represented by networks of fundamental computational units called “Neurons”.

Such Neurons are not that complicated, as they perform basic algebra plus a simple transformation (decided upfront), but the overall architecture of the network plus its size can make them incredibly powerful. In fact, Neural Networks come with a nice theorem, the universal approximation theorem, which states that, under certain assumptions, neural nets can indeed represent any well behaved function, guaranteed! (We do not have such theorems for decision trees or KNN.)
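Just to show there is no magic, here is a minimal sketch of what a single Neuron does: a weighted sum plus a fixed transformation. The specific weights and the choice of ReLU and sigmoid as the transformations are mine, picked only for illustration:

```python
import numpy as np

def neuron(x, weights, bias, activation):
    """Basic algebra (a weighted sum plus a bias) followed by a simple transformation."""
    return activation(np.dot(weights, x) + bias)

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs to the neuron
w = np.array([0.8, 0.1, -0.4])   # learned weights
print(neuron(x, w, bias=0.2, activation=relu))
print(neuron(x, w, bias=0.2, activation=sigmoid))

# A "network" is just many of these wired together; a tiny 2-layer example:
W1, b1 = np.array([[0.8, 0.1, -0.4], [0.3, -0.2, 0.5]]), np.array([0.2, -0.1])
W2, b2 = np.array([0.7, -1.1]), 0.05
hidden = relu(W1 @ x + b1)        # first layer: two neurons
output = sigmoid(W2 @ hidden + b2)  # second layer: one neuron
print(output)
```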

ChatGPT, for example, has millions of Neurons and hundreds of billions of parameters that are devoted only to the task of predicting what sentence from a given language best fits the “pattern” of the conversation that a given prompt initiated (in a human like way).

As with Machine Learning, and even more so, Deep Learning is closer to Computer Science than Statistics, since we basically need to proceed experimentally within technical constraints. We do not have solid theoretical foundations for most of the results obtained in Deep Learning; we literally do not know in much detail what ChatGPT is doing. Through trial and error, a certain network architecture (called the transformer) has been found to be effective, but this is also driven by what architectures we can implement within the constraints of existing computers and computer languages, so Deep Learning is also about the hardware.

A lot of current research focuses on architectures and how to set up various layers of the network.

To be clear this is not “trial and error” in the common language sense, but works more like experimental Physics, with a close interplay with theoretical advances and areas of inquiry.

I hope you found the above useful, but for anything a bit deeper on Deep Learning I recommend the two books below:

Both are very readable introductions, and they can also give a sense of what might be needed to work on AI as opposed to consuming AI, or what might be needed to develop AI in house (lots of compute power, computer scientists and mathematicians for sure).

Where is the uncertainty?

When it comes to making decisions we are often faced with the issue of dealing with uncertainty, meaning that we have to decide on the basis of the limited information we have and our best understanding at a point in time.

In this post I want to tell a story that hopefully shines some light on the multiple places where uncertainty might be lurking, the eventual costs of such uncertainties, and how to deal with those costs.

Note: This is a data heavy post, easier to read by anyone with some exposure to statistics, but I tried to make the story as easy to read as possible.

THE GOOD & BAD BEES STORY

In this story we have:

  • Sophie: the owner of a business that relies on Bees (for example a honey farm)
  • Tim: Sophie’s husband and COO of the company
  • Reggie: A smart employee

The business is faced with a great threat, Bad Bees have started to appear that damage the business by effectively poisoning Good Bees… what to do?

Sophie: “Tim! Come over! So, what is the issue with the Bad Bees?”

Tim: “Well, we might have 10%-20% Bad Bees, and they generally spoil the work of 3-5 bees and…”

Sophie: “Cut with the numbers tale! What’s the damage?”

Tim: “Well, revenues down at least 30%…possibly 100%”

Sophie: “Wow, ok you got my attention now …”

Reggie: “Well, we have some news from the Govt, we should be able to manage some of the impact… the bad bees are larger and hotter than normal ones, but otherwise totally similar… we can try and identify them and shoot them with lasers!”

Tim: “We have lasers?”

Reggie: “Yes, just in case”

Tim: “Ok, how does that work?”

Reggie: “See below, it’s a chart showing the Bees sizes and body temperatures released by the Govt…”

Tim: “What does that mean?”

Sophie: “Tim, aren’t you supposed to be clever? It means that we can try and classify Bees as Good or Bad using length and temperature, and eventually shoot the bad ones with the lasers! Isn’t it obvious?”

Reggie: “That’s right Sophie but… we can’t rely on the temperature unfortunately, current detectors are not able to measure that with any reliability, our cameras can calculate the Bee body length very accurately though and then… “

Sophie: “Lasers! I like the plan, work it out and bring me the numbers ASAP”

Reggie goes back to his team and they start to work through a solution.

Here we see the first uncertainty coming in: if we could use both temperature and body length the issue could be solved, but we will need to rely on limited information. Knowing only the length leaves us with uncertainty (potentially resolvable with body temperature) on whether a Bee is Good or Bad.

See below:

With a sample of 100 genetically tested Bad and Good Bees, it looks like the Good Bees are about 10cm (…yes, in this world we have engineered massive bees, not sure why) whilst Bad Bees are often much larger. We now have a second source of uncertainty: not only can we not rely on temperature data, we also have to rely on limited data to assess the difference in lengths between Good and Bad Bees.

Reggie nevertheless goes ahead and fits a machine learning model that tries to predict whether a Bee is Good or Bad based on body length, and he gets the ROC curve below:

ROC stands for “Receiver Operating Characteristic”, but that curve basically tells us how good Reggie has been with his modelling: the higher the AUC (area under the curve), the better.

The AUC tells us the probability that, picking one Good Bee and one Bad Bee at random, the model ranks the Bad Bee as the more likely to be Bad; in other words, how well body length can tell the two apart across the possible length thresholds (e.g. 10cm).

Overall the model is good: it will be right 80% of the time, on average. The lasers will work well… on average. But if you go back to the previous chart (the Bee lengths chart) you can see that, despite the good AUC, we have a large overlap in the length distributions of the Bees; in fact about 50% of Bad Bees have a body length that is very common for Good Bees too, so always beware of averages. It is relatively easy to be right on average, but the lasers will struggle to be right consistently (I will share more about this in a later post).
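Since the data is simulated, here is a minimal sketch of the kind of setup described: two Normal length distributions, a logistic regression on length, and the resulting AUC. The exact means and standard deviations are my guesses for illustration, not the ones behind the charts:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 100  # small genetically tested sample, as in the story

# Assumed length distributions (cm): Good Bees around 10, Bad Bees larger on average.
good_len = rng.normal(10, 1.5, n)
bad_len = rng.normal(12.5, 2.0, n)

lengths = np.concatenate([good_len, bad_len]).reshape(-1, 1)
is_bad = np.concatenate([np.zeros(n), np.ones(n)])

model = LogisticRegression().fit(lengths, is_bad)
scores = model.predict_proba(lengths)[:, 1]
print(f"AUC: {roc_auc_score(is_bad, scores):.2f}")  # roughly how well length separates the Bees
```

With these assumed parameters the AUC lands in the same ballpark as Reggie’s ~0.8, even though the two length distributions overlap substantially.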

Well, here we actually have a third source of uncertainty: did Reggie do his modelling as well as possible? Is this predictive model as good as it gets, given the previous two uncertainty constraints?

Given that I have simulated the data, I can tell you he got it almost right. The best possible model would be right 83% of the time instead of the ~80% Reggie achieved, but that 3% can be very costly indeed, as it means being wrong once every 5.8 laser beams rather than once every 5 (yes, ROC curves are hard to read and interpret…).

Finally, if you remember, we have additional uncertainties which are of a nature beyond the modelling exercise; these are uncertainties that we have not yet been able to model, namely:

  • Prevalence: the Govt says about 10%-20% of Bees will be Bad Bees, which is quite a range
  • Cost of the Bad Bees: the Govt says that a Bad Bee is likely to spoil the work of 3-5 Good Bees, still quite a range…

Reggie goes back to Sophie and Tim.

Reggie: “Ok I have got the plan. We have developed a predictive model…”

Sophie: “With ChatGPT? I know that thing is smarter than Tim…”

Reggie: “No, no we have done it ourselves…”

Sophie: “Oh my god…”

Reggie: “No, don’t be like that, the news is good! But I need your help and opinion on how many Bad Bees we want to be prepared for, and what level of damage…”

Sophie: “Ask ChatGPT Reggie!”

Reggie: “The chat thing doesn’t know!….”

Tim: “Hey, can we move beyond the ChatGPT stuff?”

Sophie: “Reggie, explain…”

Reggie: “Ok I have developed this predictive model, it can be right 80% of the time, impressive I know…”

Sophie: “Is that impressive?, anyway continue…”

Reggie: “Right right, yes the model is good, but we need to decide on a length threshold for the lasers. The model leverages the length of the Bee that we detect from our cameras, and we need to decide when to fire. Take a look at the chart…”

Reggie: “If we believe the Bad Bees will be 20%, and that each will damage the work of 5 Good Bees, we need to start firing at any Bee longer than 10cm… this will mean killing half of the Good Bees, but also 85% of the Bad ones!”

Sophie: “That’s not good enough! You want to kill 50% of our lovely Bees!!”

Tim: “Calm down Sophie, we will ask ChatGPT later, let him finish…”

Reggie: “Right. Then, if we assume that prevalence will be 10% and the damage will be that of 3 Good Bees, we can hit at ~13cm. This will kill almost no Good Bees! But it will also only hit about 30% of the Bad Bees…”

Sophie: “Ok, let’s talk money. What are we looking at in terms of impact on revenues? How much revenue will we lose if we hit 50% of our Good Bees but the Bad Bees won’t be that many? Or that bad?”

Reggie: “That would be 55% lower revenues Sophie…”

Sophie: “… What about if we are going for the less aggressive scenario, but the Bad Bees are actually going to be super Bad?”

Reggie: ” We would lose 80% in that scenario Sophie”

Tim: “There you have it! the lasers should fire away!…”

Sophie: “Wait, I have just asked the Chat thing, the proposed approach is to start with the super firing strategy, but then reassess every week or so by sending the dead Bees for a genetic assessment to check prevalence”

Tim: “We can also measure the output of the Good Bees work to see what’s the damage from the Bad Bees we won’t hit?”

Sophie: “Finally you say something clever, Tim! Ok, we will proceed this way… get the lasers and videos ready, as well as the process to work with the Govt in a timely manner, and the predictive model governance to make sure things are smoothly updated.”

Sophie: “Reggie, why is uncertainty so costly?”

Reggie: “I don’t know Sophie, it’s a matter of numbers I guess…”

The little story finishes here, and I can tell you that Sophie and her team will make it, eventually also selling their machine learning assets to smaller farms and growing their business beyond the small operation it was!

Given that I fully simulated the data, it is interesting to note that the various uncertainties carry very different costs.

The sample and model uncertainty, in this case, accounts for about 10% of the costs of the uncertainty, whilst, across the various scenarios, the uncertainty on the prevalence or “badness” of the Bees is actually more prominent, accounting for the remaining 90% of costs.

So, here, Sophie’s monitoring approach will be fundamental, whilst improvements by Reggie’s team on the modelling might not change the outcomes much.

This was a very simple example with very well behaved data (I simulated the whole process with Normal distributions, for the data scientists out there), but in real settings the nature of the uncertainties and their costs can be much larger.
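For the data-inclined, here is a minimal sketch of how the threshold decision can be costed out under different prevalence and “badness” assumptions. The length distributions and the revenue formula are simplifications I chose for illustration, not the exact simulation behind the story:

```python
import numpy as np
from scipy import stats

# Assumed length distributions (cm), as in the earlier sketch.
good = stats.norm(10, 1.5)
bad = stats.norm(12.5, 2.0)

def revenue_kept(threshold, prevalence, damage_per_bad_bee):
    """Fraction of Good-Bee output kept if we fire at every Bee longer than `threshold`."""
    good_killed = good.sf(threshold)     # Good Bees above the threshold get hit
    bad_survive = bad.cdf(threshold)     # Bad Bees below the threshold survive
    surviving_good = (1 - prevalence) * (1 - good_killed)
    surviving_bad = prevalence * bad_survive
    # Each surviving Bad Bee spoils the work of `damage_per_bad_bee` Good Bees.
    spoiled = min(surviving_bad * damage_per_bad_bee, surviving_good)
    return (surviving_good - spoiled) / (1 - prevalence)

for threshold in [10, 11.5, 13]:
    worst = revenue_kept(threshold, prevalence=0.20, damage_per_bad_bee=5)
    benign = revenue_kept(threshold, prevalence=0.10, damage_per_bad_bee=3)
    print(f"fire above {threshold:>4}cm: revenue kept {worst:.0%} (worst case) / {benign:.0%} (benign case)")
```

Sweeping the threshold across the prevalence and damage ranges is precisely where the unmodelled uncertainties show their cost: the gap between the worst and benign cases dwarfs the few percentage points the model itself could still gain.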

Let me leave you with two questions:

How often do you think about where the uncertainties lie? And which are more or less costly, or addressable?

How often is that thinking clearly expressed quantitatively?