Galileo and the scalability of progress

As I write this blog I am surrounded by a social and political dialogue that does not seem to proceed by, or be guided by, the standards and signposts I was educated to follow in aiming for progress. It is all quite confusing, and it has brought back memories of a great hero of mine: Galileo Galilei.

The man in the image below:

Galileo lived about 450 years ago and he is largely credited as one of the driving forces of the scientific revolution. That is the revolution that took us from 1-2bps of growth in output and standards of living for hundreds of thousands of years, to the skyrocketing acceleration of innovation, and quality of life, of the last few hundred years. We all hope this pace will continue for as long as possible.

When I grew up I did get the point that Galileo was a genius, and that he was prosecuted for his heretical ideas about the Earth moving around the Sun, but I did not understand, as much as I do now, how fundamental his life and example had been in driving change and progress in a world that had been intellectually paralyzed for millennia.

When Galileo started his scientific career, to be educated meant to be able to understand the body of Aristotle’s philosophy, the Bible, and to have a good command of geometry as it was developed by the ancient Greeks.

Instead of focusing on “How do we know, explain, or understand such and such…”, the prevailing framework was to argue on the basis of “by what authority do we claim?”. That intellectual tradition led to what is called “Scholasticism”. Scholasticism basically meant that everyone’s effort went into coming up with some sort of top-down explanation of the facts, one that had to be aligned with or, even better, justified by some passage of what Aristotle wrote, or what was in the Bible.

Facts and reason did not matter as much; it was all about narratives, how good a story was, and how well it aligned with “first principles”, or whatever fashion was prevailing in a given epoch.

For example Aristotle, who was nevertheless a great philosopher, envisioned a world where the Earth was at the centre, and somehow at the bottom, of the universe, which, in his view, explained why everything fell down. He also devised a world in which different substances behaved differently, but everything liked to move in circles; the argument was that circles are perfect.

Aristotle never tried to really test his theories as we would do today, after Galileo’s lesson. All he did was provide plausible arguments and sensory examples. In this sense he was an empiricist, but not a scientist. I will clarify the distinction later, but here’s an example.

Aristotle observed that heavy bodies seemed to fall faster to the ground, and that very light and thin ones didn’t. He then provided an explanation that sounds like “common sense” within his philosophy: heavier objects have more “matter” and long to reach Earth, the mother of matter, as quickly as possible, while others are happy to take a bit more time, as they are also made of a substance that longs for the heavens (like air).

Believe it or not, this was to remain the prevailing understanding of how things move for about two thousand years, in Europe at least – crazy how well a story can sell, right?

Nevertheless Aristotle was wrong on all counts (when it comes to physics).

Here’s how this relates to Galileo, and to science. Galileo was of a different temper from a young age. He thought that all the philosophical speculations were a lot of fun, but that ideas needed to be held accountable before reality. Galileo didn’t just believe in “experiments” and gathering data, what we call empiricism; he developed an approach that was tougher, and gave birth to western science in the process. Galileo thought ideas and conjectures needed to be put on “trial”. In Italian he called these “cimenti” which, for the fans of Game of Thrones, best translates to “trial by combat”.

So when Galileo decided to investigate the motion of objects, he immediately sought to test his view, that objects move following specific geometrical and mathematical laws, against the view of Aristotle, and famously dropped objects from the leaning tower of Pisa and behold!:

After two millennia someone actually bothered to find out that, after all, objects of a similar fabric, but of very different weights, fall at the same speed! (his “cimenti” were actually on inclined planes but… the power of a story). And in a fierce trial by combat Aristotle’s theory lies on the ground, dead as a stone. This is the key difference: Galileo didn’t tell stories about how his view “made sense”, he took his idea and put it out there, to fly or get killed.

Why did it take two thousand years? – That is a long story, but we can all testify that the behavior that was followed for millennia is still around us: believe in authority, don’t ask questions, optics are all that matters, tell a good story, it’s common sense, etc.

Galileo was, like Aristotle, also a great debater and writer, and in his famous books written as dialogues he indeed puts the question to Simplicio:

“But, tell me Simplicio, have you ever made the experiment to see whether in fact a lead ball and a wooden one let fall from the same height reach the ground at the same time?”

This is science: truth is not determined by who said what, common sense, or examples, but by what nature/reality reveals when cornered. We discover the truth, we do not and should not create it!

This spirit is what led to the cures that help us when we are sick, and the technologies that improved our lives beyond the wildest imagination of anyone living in Galileo’s time. This is what made progress scalable: a method to drive progress that can be distributed everywhere. Everyone can pick things up today from where Galileo left them, do the experiment and move things forward; there is no authority, no story or view to follow, no boundary beyond honesty.

This matters everywhere. In politics, in science, and indeed in business.

But Galileo was also a revolutionary genius specifically within physics, and I do want to close with what we today call the “principle of relativity”, and with how Galileo deeply unlocked the power of imagination in science. This may seem a contradiction, as by now you might think that Galileo was wary of speculation, but the truth is the opposite: once he provided an effective method for progress, he then unleashed imagination in a productive way.

Most people that did not go far in enjoying the study of science, and physics, believe that what is hard about science is something like the math, the jargon, or the baggage of things to memorize. In truth the concepts are hard; the formalism is often very easy and mechanical.

Listen carefully to what Galileo said about motion. He observed the following:

“No one can actually tell whether he or she is moving; velocity is not real, there is no physical fact about velocity, as velocity is only relative” (this was an important point for him to support that the Earth was moving around the Sun).

He then goes on to the famous example of someone sitting below deck in the cabin of a smoothly sailing ship, where there is no way to tell whether the ship is moving or not.

He was right, but this concept is tough: how can velocity be an illusion? If the change in velocity (acceleration) is real, and we can tell if a ship is changing its speed, how can velocity not be physically real? How can the change in something “unreal”, like velocity, be real, like acceleration?

I won’t answer those questions now, and I hope the thirst for knowledge will push some of you toward picking up a physics and classical mechanics book to find out more, but my point is the following:

You don’t need maths and all the technical baggage to understand more about Galileo and his science.

And my broader point is what is closest to my heart in writing this blog post:

You don’t need any title to go out there and defend what has served us well, and lifted us after hundreds of millennia of no progress: the methodical pursuit of truth, useful truth, I would say. Facts matter, reality is out there. Opinions, narratives, stories, optics, bad formulas and all that alone do not move us forward; they never did and never will.

We should all embrace the “cimenti”, the trial by combat of our scientific tradition, first and foremost on our own ideas and conjectures, and never forget we are aiming for progress, not fabrications.

To get a quick view on Galileo’s life and key ideas I recommend the following book:

The Art of Planning, DeepSeek and Politics

A while back I discussed in a post some of the nuances of data driven decision making when we need to answer a straight question at a point in time, e.g.: “shall we do this now?”.

That was a case presented as if our decision “now” would have no relationship to the future, meaning that it would have no impact on our future decisions on the same problem.

Often, of course, we do know that our decisions have an impact on the future, but the issue here is slightly different: we are looking not only at an impact on the future, but at an impact on future decisions.

This is the problem of sequential decisioning. Instead of “shall we do this now?”, it answers the question:

“Shall we adopt this strategy/policy over time?”

In other words it tries to solve a dynamic problem, so that the sequence of decisions made could not have been bettered, no matter what happened. When you have an optimal solution to this problem, whatever decision is made, at any point in time, is a decision that one would never regret.

I will answer three questions here:

  • What is an example of such a problem?
  • Is there a way to solve such problems?
  • What is the relationship with DeepSeek and politics?

Sequential decisioning problem example – Customer Engagement Management

A typical example could be that of working on marketing incentives on a customer base. The problem is that of deriving a policy for optimal marketing incentives depending on some measures of engagement in the customer base.

Why is it sequential? Because we have a feedback loop between what we plan to do, the marketing incentive, and the behaviour of our customers that triggers those incentives: engagement.

Whenever we have such loops we cannot solve the problem at one point in time and regardless of the future.

The solution could look something like: “Invest $X in customer rewards whenever the following engagement KPIs – email open rate, web visits, product interaction, etc. – drop below certain levels.”

It is important to note that we need something to aim for, e.g. maximise return on marketing incentives.

Does a solution to such problems exist?

The good news is that yes, solutions exist, but only for cases that are amenable to some mathematical/statistical modelling.

We can 100% find a solution if we have a good model that answers the following:

  • How does a given marketing incentive influence customer engagement on average?
  • Given a certain level of customer engagement, what is the expected value of the relationship with the customer?
  • How does the engagement of our customers evolve over time, in the absence of any incentive?

So we do need some KPIs that give us a good sense of what level of customer engagement we have, and we need to have a sense of the expected value of the relationship with the customer given a particular set of KPIs. To be clear, what we need is something true on average, and at least over a certain period of time.

For example, we should be able to say that, on average, a customer using our products daily for the past month will deliver X value over 2 years.

We also need to be able to say that a given marketing incentive, on average, increases customers’ daily engagement by X% and costs some defined amount of $.

We also need to be able to say something like: our customer engagement rate tends to decrease by X% every few months, again on average.

The above modelling of engagement and customer value is not something that most businesses would find difficult today. Engagement rates, or attrition rates over time, are easy to get, as are the results of marketing campaigns on engagement rates. Harder to get is a reliable estimate of lifetime value given current engagement metrics, but we can solve for a shorter time horizon in such cases.

Richard Bellman is the mathematician credited with having solved such problems in the most general way.

His equation, the Bellman equation, is all you need after you have a mathematical model of your problem.

The equation, in the form I’d like to present it, is the one below, where V is the value of the optimal policy pi*:
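For readers who like to see the symbols, a textbook way of writing it (this presentation assumes discounted future rewards; s is the state, a the action, r the immediate reward, γ a discount factor between 0 and 1) is:

V^{\pi^*}(s) = \max_{a} \left[ \, r(s, a) + \gamma \, \mathbb{E}\left[ V^{\pi^*}(s') \mid s, a \right] \, \right]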

It says the following:

The optimal strategy, depending on parameters a (e.g. how much $ is spent on incentives), is the one that maximizes the instant rewards now, as well as the expected future rewards (future rewards conditional on the policy), no matter what the variables of the problem are at that point in time (e.g. customer engagement and marketing budget left).

It is a beautiful and intuitive mathematical statement which is familiar to most humans:

The best course of action is that which balances instant gratification with delayed gratification.

We all work this way, and this is arguably what makes us human.

To be clear, this problem can be, and is, solved in a large variety of sequential decisioning settings.

The Bellman equation is the basic tool for all planning optimizations in supply management, portfolio management, power grid management and more…
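To make this concrete, below is a minimal value-iteration sketch in R for the marketing example. Every number in it (state values, transition probabilities, the effect of an incentive, the discount factor) is made up purely for illustration; in a real application they would come from the models of engagement and customer value discussed above.

states  <- c("low", "medium", "high")          # discretised engagement levels
actions <- c(0, 50, 100)                       # incentive spend in $
gamma   <- 0.9                                 # discount on future rewards

# reward(s, a): monthly value generated at engagement level s, minus the spend a
value_of_state <- c(low = 10, medium = 40, high = 80)
reward <- function(s, a) value_of_state[s] - a

# transition(s, a): assumed probabilities of next month's engagement level;
# spending more makes upward moves more likely (illustrative numbers only)
transition <- function(s, a) {
  lift <- a / 500
  base <- switch(s,
    low    = c(0.8, 0.2, 0.0),
    medium = c(0.3, 0.5, 0.2),
    high   = c(0.1, 0.3, 0.6))
  p <- pmax(base + c(-lift, 0, lift), 0)
  p / sum(p)
}

# Value iteration: repeatedly apply the Bellman update until V stabilises
V <- setNames(rep(0, length(states)), states)
for (i in 1:500) {
  V_new <- sapply(states, function(s) {
    max(sapply(actions, function(a) {
      reward(s, a) + gamma * sum(transition(s, a) * V)
    }))
  })
  delta <- max(abs(V_new - V))
  V <- V_new
  if (delta < 1e-6) break
}

# The optimal policy: the best incentive spend at each engagement level
policy <- sapply(states, function(s) {
  actions[which.max(sapply(actions, function(a) {
    reward(s, a) + gamma * sum(transition(s, a) * V)
  }))]
})
print(round(V, 1))   # long-run value of each engagement level
print(policy)        # optimal spend per engagement level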

Politics?

Well, in some sense politics is the art of collective, or at least representative, policy making. And here a problem obviously arises: how do we balance instant versus delayed gratification well when the views of the collective might differ? What if the collective doesn’t even agree on the various aspects of the problem at hand (e.g. the KPIs, the expected rewards, etc.)?

A key aspect of this should be the following: LEARNING.

The illustration below should be of guidance:

There has to be a feedback loop, as the Bellman equation gives conditions for optimality, but often the solution can only be found iteratively. Furthermore, we also need to be open minded about revising the overall model of the problem, if we gather evidence that the conditions for optimality do not seem to come about.

So we have multiple challenges, which are true for politics but also for business management. We need:

  • Consensus on how to frame the problem
  • A shared vision on the balance between instant versus delayed rewards (or my reward versus your reward)
  • A framework for adapting as we proceed

The above is what I would call being “data driven” – basically “reality driven” – but it is hard to get there. We all obviously use data when we have it, but the real challenge is to operate in such a way that the data will be there when needed.

How about DeepSeek?

If you have followed thus far you might have understood that solving a sequential problem basically involves algorithms that learn the dynamics between a policy maker’s action (e.g. marketing $ spent) and the long term feedback of the policy. This is the principle behind “reinforcement learning” in the AI world, as one wants the learning to be reinforced by incentives for moving in the right direction. This was considered a very promising framework for AI in the early days, but it is an approach that requires time, a fair bit of imagination, as well as data, and it had become less prominent in recent years… until DeepSeek stormed the AI world.

A key element of DeepSeek’s successful R1 generative AI model was that it leveraged reinforcement learning, which forced it to develop advanced reasoning at a lower cost.

This was achieved through very clever redesigning of the transformer architecture, but I won’t go through it (as I have not yet understood it well myself).

As usual let me point you to a great book on the above topics.

My favorite is “Economic Dynamics” by John Stachurski. It is a great resource on sequential problem solving under uncertainty. Pretty theoretical, but that’s inevitable here.

Basics of Mathematical Modelling – Shapes of Uncertainty

Most of us, especially in business, but also in our private lives, perform some basic mathematical modelling all the time. We can do it wrong or we can do it right, but we do it… constantly.

The most basic form of mathematical/statistical modelling is to give shapes to our uncertainty (ignoring the shape is also part of it).

Let me tell you a story that should clarify what I mean.

John and his team at SalesX, an online marketplace, are trying to outsource their call centre operations and are looking for potential suppliers. They sent out an RFP (request for proposals) and they received a few responses. (note: despite the name SalesX is not owned by Elon Musk).

John calls his team in his office:

John: “Hi, everyone, could you give me a clear high level summary of the various responses we received from the call centers?”

Jane: “Yes, we shortlisted 3 responses for you, although we are quite confident on a winner…”

John: “Ok, what are the average call handling times of the shortlisted suppliers?”

Jane: “They are all about 2 minutes, but one provider is considerably cheaper. It is quite clear we should go for them. SmartCalls is the name of the company and they are heavily relying on automation and AI to support their operations, which keeps their costs very low, a very clever team.”

John: “Ok, sounds pretty good, let’s meet them but still keep another response in the race as we dig deeper beyond performance and price.”

Laura, in Jane’s team, is not convinced by such a swift decision, and questions Jane…

Laura: “Jane, 2 minutes is great but what about the % of calls that are not resolved in one session? that one is…”

Jane: “Yes, that number is higher for SmartCalls, but this is hard to compare across providers, right? Their clients have different rules for re-routing inquiries in house…”

Laura: “John, Jane, before moving forward with SmartCalls, let me reach out to the shortlisted suppliers to request some additional information. It will be very hard to switch to another supplier once we are done with this…”

John: “Ok but I give you 3 days, I hope you are after something substantial… SmartCalls is super cheap. Handling calls is not rocket science…”

Laura goes back to the suppliers and asks them for a full view of the distribution of call handling times, and she finds out the following, summarized in the chart below, which she promptly shares with John:

John looks at it but he is not too clear on what it means…

John: “Ok Laura, translate please…”

Laura: “Basically SmartCall’s business model is that of leveraging AI with an inexperienced and cheap workforce. They can deal quickly with a large number of calls relating to simple issues, but their operators lack the creativity or experience to deal with issues of medium complexity. The operators all work remotely with no chance of sharing information with each other…”

John: “Wait, wait… the chart first, what does that mean?”

Laura: “Oh… isn’t that obvious? SmartCall has a 2-minute call average, yes, but this is driven by a large number of very quick calls; when it comes to customer satisfaction, there’s a good number of calls that go beyond 3-4 minutes.”

John: “Ok I get it, their calls are either quick or rather long, whilst iHelp, for example, is able to be more consistent, with most calls handled in about 2 minutes, right?”

Laura: “Yes, they avoid shortcutting the initial problem identification phase and have a longer list of mandatory screening questions, but this pays off. They are able to share the calls with specialized teams and…”

John: “Ok I get it indeed. I also see several of SmartCall’s calls going beyond 5 minutes, which is our threshold for bringing customer calls back in house… good work Laura. Jane, good work on hiring Laura.”

Jane is a bit deflated, but she ultimately smiles as she is proud of having trained Laura well.

The quick story above is a good example of how we routinely perform unsophisticated statistical modeling, especially when we implicitly anchor ourselves to some one-dimensional metric, like an average.

When John heard the average call times he implicitly assumed that those averages were comparable and meaningful in their own right. This means he assumed (modelled) the range and frequency of call times to be similar in shape across suppliers, which is a fair bit of statistical modelling in practice (wrong modelling, in this case).

After all, what can you really tell from comparing averages unless you make the rather strong assumption that those averages summarize the underlying data appropriately? If you do so, although you might not know it, you are doing statistical modelling.
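A quick simulation makes the point concrete. The two suppliers below are entirely made up, but they share practically the same average handling time while having very different shapes, which only shows up once you look beyond the average:

set.seed(42)

# "SmartCalls" (hypothetical): a mix of very quick AI-resolved calls and long ones
smartcalls <- c(rnorm(8000, mean = 1.2, sd = 0.3),   # quick, simple calls
                rnorm(2000, mean = 5.2, sd = 1.0))   # long, escalated calls

# "iHelp" (hypothetical): consistent handling around 2 minutes
ihelp <- rnorm(10000, mean = 2.0, sd = 0.4)

mean(smartcalls); mean(ihelp)              # both averages come out at ~2 minutes
quantile(smartcalls, c(0.5, 0.9, 0.99))    # but the tails tell another story
quantile(ihelp,      c(0.5, 0.9, 0.99))

# Share of calls beyond the 5-minute in-house threshold mentioned by John
mean(smartcalls > 5); mean(ihelp > 5)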

Laura, instead, decided to look at the actual shape of the data and to avoid any high level, uninformed assumption, yet she is still assuming quite a few things, among which:

  • The data she received is representative (not skewed toward some specific types of calls)
  • The distribution of the data will be consistent over time
  • The distribution of the data, which reflects how the call centers handle other clients’ calls, is relevant to SalesX’s customer base

That’s basically statistical inference, and whenever you make a decision or judge data, although you might not think about it, you, as well, are doing statistics and data science.

The question is: “Do you know when you are doing that well, or when you are being ineffective?”

Another key aspect of giving shapes to uncertainty is the question of which metrics to measure: average call times, abandon rate, both, or other metrics. Which ones to choose?

This is, in a sense, the operational side of the problem I presented above: when it comes to data that tracks a business activity or process, which metrics truthfully summarize the shape of that data?

I can recommend two good books on this topic. One is a technical text, Optimal Control Theory by Donald E. Kirk; the other is a more accessible read focusing specifically on designing meaningful metrics that track Objectives and Key Results (OKRs), Measure What Matters by John Doerr.

Optimal Control Theory goes well beyond the subject of performance measures, but the first 50 pages are a good introduction to the overall framework.

Dealing with Uncertainty – Can Maths lead us astray?

Here I want to discuss how dealing with uncertainty can be rather unintuitive even in a fairly straightforward example.

Suppose that we are facing questions like below:

Should I make a specific investment where the outcome is uncertain to a certain degree x? How much uncertainty should I be comfortable with?

I want to stress that the investment here is generic: buying stock, investing in a marketing campaign, hiring personnel, giving out a credit card, etc. But we are talking of a point-in-time Yes/No decision (I will cover sequential decisions at a later time).

At first glance this is a pretty easy question to answer. Suppose that the expected income when the investment pays out is $1 and the loss is $5 when things go wrong; the condition we generally want to satisfy is below (P(x) being the probability of x):

  • E(Income)*P(Pay out) > E(Loss)*P(Loss) which, considering P(Pay out) = 1 – P(Loss), becomes:
  • E(Income)*[1 – P(Loss)] > E(Loss)*P(Loss) which, with some algebra (rearranging terms), becomes:
    • P(Loss) < E(Income)/[E(Income) + E(Loss)]

Which is simply a statement that we should go ahead and take the risk as long as the expected gain outweighs the expected loss. The notation E(variable) means the expectation of the variable, i.e. its probability-weighted average outcome.

If we have a situation where the expected income when things go right is $1 and the loss is $5 when things go wrong, it looks like our best action should be to dare as long as P(Loss) < 1/(1+5) = 16.6%.

This means that we demand at least 5 wins for every loss (83.4%/16.6%), which should be intuitively right, as the cost of a loss is 5 times the gain from a win.
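In code this is a one-liner, which also makes it easy to see how the threshold moves if the payoff profile changes:

# Break-even loss probability for a $1 expected gain versus a $5 expected loss
income <- 1
loss   <- 5
income / (income + loss)   # ~0.167: take the risk while P(Loss) stays below this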

Now, I want to point out that this analysis already relies on quite some assumptions:

  • 1 – That the expected income is independent of the pay out probability. This is a fair assumption but, in general, the relationship plays a role, e.g., you buy a stock option which is priced low, yet the price is cheap since the pay out probability is considered low by the market. In this case pay out probability and expected income (conditional on pay out) are inversely related. But we also have other situations where the opposite is true (like in credit granting).
  • 2 – We have a way to estimate a probability of pay out and loss. This is non-trivial: how do we manage that? How would different individuals/models come up with different probabilities?
  • 3 – We can make that investment multiple times. It does not make sense to guide yourself with probabilities and maximising expected utility if all you can do is play once. Although this is still the “rational” choice by the book, the topic is debated (see Theory of Decision under Uncertainty by Itzhak Gilboa – a beautiful book, though highly technical).

Ok, assuming the above, let us continue, and I will show you a conundrum.

Nowadays our problem seems easy: let us get a machine learning model and set the threshold at 16.6%, so that whenever the model tells me that P(Loss) > 16.6% I won’t go ahead with the investment (or bet, or prospect acquisition). I am glossing over the model details, but it will be a model trained on the useful data for a particular use case (e.g. probability of conversion for a marketing investment, probability of default when giving out credit, etc.).

Great, I have created a simulation and I have a classifier performing as below:

Let me explain the chart briefly:

Sensitivity tells us, given a threshold, how good my model is at identifying Bad Biz opportunities; it is also called the True Positive Rate (where positive means “I am positive this investment is not a good idea”). So, sensitivity should increase as the threshold decreases, as we make the model sensitive to feeble signals of Bad Biz. Specificity, conversely, is how good the model is at being “specific” with its recommendation, meaning how good it is at avoiding collateral damage by wrongly giving me a high likelihood of Bad Biz when we are looking at good investments. Specificity increases as the threshold increases, as we ask the model to pull the trigger only when the signals are strong.

Overall, in this example, we have a decent model predicting whether the investment I am considering is Bad Biz, i.e. not a good idea, something that will lose money. The Area Under the Curve (AUC) = 0.82 in the chart above (which is called a Receiver Operating Characteristic, or ROC, chart) tells us that the model is 82% likely to correctly rank two investment opportunities in terms of their risk of being bad business (Bad Biz). This is a global assessment of the model’s performance across thresholds, and takes into account the sensitivity and specificity we have achieved. The higher the AUC the better the model (on average with respect to the thresholds).

Now, if I ask a common algorithm in R (a statistical programming language) to provide me with the best threshold to guide my strategy, I get below:

> # Best threshold for investment, using the pROC package: rocobj is the roc()
> # object fitted on the classifier scores, sample_prevalence is the observed
> # share of Bad Biz in the sample, and
> # best.weights = c(relative cost of a missed Bad Biz, prevalence)
> coord1 <- min(coords(
+   rocobj,
+   "best",
+   input = "threshold",
+   ret = "threshold",
+   best.weights = c(5, sample_prevalence)
+ ))
> coord1
[1] 0.2466861

So it tells me that I should go as far as 24.6%! That’s pretty different from the “theoretical” threshold we previously calculated, which was ~16.6%. What’s going on?

What is going on is that we were not thinking statistically earlier on, but rather probabilistically (mathematically), which is something to consider when it comes to any decision in the real world.

When it comes to actual decisions from models, we need to deal with the model as it is, given its actual performance, which is tied to the sample of data we worked with, and how well that data represents “now”. In fact, thinking about it, what happens if your model changes or improves, or if there are no bad investments out there in a given period? Should we always blindly use the 16.6% probability of Bad Biz? Shouldn’t we react to the actual facts at hand?

The underlying issue is that the utility threshold derived theoretically is correct, as mathematics/probability is always correct; nevertheless reality is messy and the score our model gives us is not a probability, but rather a statistical estimate of a probability. In fact the model gives us estimated odds, and odds also need to know the underlying baseline (this is key for the initiated: classifier scores are not probabilities, but rather random variables themselves!).

The clearest way to articulate the actual, empirical outcome, is to use the concept of a policy Net Benefit, which simply measures what value I can expect out of my policy in terms of benefits (from good decisions) and costs (when we get it wrong).

What is the key Net Benefit equation? For a classifier driven policy, depending on threshold x, it is the following:

Net Benefit(x) = TPR(x)*Prevalence*LossPrevented – FPR(x)*(1-Prevalence)*IncomeSuppressed

TPR stands for True Positive Rate, which is simply the fraction of Bad Biz correctly flagged beyond model threshold x, and FPR is, conversely, the False Positive Rate: the fraction of investments that are good, but that we let go at threshold x. Prevalence is the overall share of Bad Biz regardless of my policy, i.e. if I had no policy, that is the average rate of bad decisions I would get.
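For those who want to play with the formula, a direct translation into R could look like the sketch below; the TPR/FPR pairs are hypothetical operating points, not the actual values from my simulation:

# Net Benefit of a classifier-driven policy at threshold x
net_benefit <- function(tpr, fpr, prevalence,
                        loss_prevented = 5, income_suppressed = 1) {
  tpr * prevalence * loss_prevented -
    fpr * (1 - prevalence) * income_suppressed
}

# Example: compare two candidate operating points (hypothetical TPR/FPR values)
net_benefit(tpr = 0.85, fpr = 0.45, prevalence = 0.20)  # a lower, more aggressive threshold
net_benefit(tpr = 0.75, fpr = 0.30, prevalence = 0.20)  # a higher, more selective threshold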

So, what will the Net Benefit of my policy be if we use the expected utility threshold, versus the value that the statistical software (R) algorithm gave us? See below:

It turns out that R is actually maximizing the Net Benefit equation, which is great, as it is aiming at maximizing our expected profits “all” things considered, including the model we are using and its performance (the optimal threshold can also be derived analytically with some calculus, but I won’t go there this time).

The difference in Net Benefit between the two thresholds is high: +9% for the “empirical” threshold, which can be a lot of money in a scaled-up context.

To recap, the overall story is a bit bendy: we have maths, which is correct, and yet the best action is a rather different one, which is derived statistically. What connects the two thresholds, the maths-derived and the stats-derived one?

The connection is the classifier itself and “where” the classifier errors are along the utility derived threshold.

On average the classifier scores are perfect: the % of Bad Biz is 20% and the average classifier score is exactly 20%, but… see below:

Our “theoretical” utility threshold happens to fall around deciles 6 and 7, where the score (grey bar) materially overpredicts, and this pushes the threshold up. For the threshold determination, what matters is not the “global” performance of the model but how the model performs around the utility threshold. Nevertheless the issue I have presented above is real: even a well calibrated model will always present distortions, unless it is perfect, and the distortions will be larger when the “Bad” event is rare, therefore we need to deal with it “in practice”.

One thing that I can tell is that if your classifier improves, you can worry less about “Theory vs Reality”, see below how things improve if the classifier achieves 0.87 AUC (from 0.82):

The thresholds get much closer together, and the overall Net Benefit of the policy increases by +16%! (Yes, investing in Decision Science is probably worth it).

It is of note that the policy value decreases much more steeply to the left (when we are conservative) than to the right (when we take more risk), so: take more risks when the Bad Biz baseline is low.

It is also worth noticing that, in reality, we do not know that an average loss is going to be $5 and a win $1, nor how noisy things would be around those averages, so, things get much more interesting, and yet the relationship between Maths and Stats needs even more expertise to be dealt with.

Next time, I will try to share more about “sequential” decisions:

What if your choice “now” impacts your future ones? E.g. what if refusing a Biz opportunity from someone precludes us from doing business with them in the future? Now you might need to rethink your thresholds, as turning down a certain Bad Biz means turning down several future opportunities.

A key tool to deal with that question is Dynamic Programming (recently rebranded as Reinforcement Learning) a wonderful and powerful idea from the 50s.

What is the value of Machine Learning? – A stock trading example

Since the days of the big data related buzz words, now over 20 years ago, data driven changes to business models, product values, customer journeys and more have gone through massive enhancements.

Nevertheless in the early days there were several discussions about the supposed “value” of data and such transformations. I know it because I was there, in most cases on the selling side.

It was not uncommon to find middle managers considering data driven transformations as a “nice to have” or simple ways to copy the competition, or something to do from a PR perspective. It was also not unusual for large businesses to look down on new tech startups as if looking at kids playing with unnecessary & fancy toys.

There was a lot of: “we do things this way because it works”, “our customers don’t need this”, “open source is dangerous”, “we do not need fancy machine learning for our simple problems” and so on.

There was also a lot of confidence in existing practices, something to the extent of: “I have been doing this for 20 years, I know better. My expertise beats your data science kid…”

To be clear, a lot of the skepticism was healthy and based on more than gut feelings, but over time the most negative voices disappeared. Nevertheless, a question remains relevant when starting any data driven & AI powered transformation: “What value will it deliver considering the unknowns and costs?”

Is it just headcount reduction? Or reduced wastes? Or better products? How will we measure all of this?

I am not going to answer those questions in general, in my view it would be silly to do so. How technology driven innovations can benefit a given solution to a problem is very contextual, not something we can broadly generalise.

I will though share an example, and a good example I believe. An example where value is upfront and easy to measure: how much value can a simple machine learning stock trading algorithm deliver when put up against a simple rule-of-thumb trading strategy?

Disclaimer: I am not an experienced stock trader, I have only started recently following the sharp devaluation of the Japanese Yen (I am based and work in Tokyo). Do not reach out for any investment advice.

In any case, I thought it would be interesting to quickly put the two trading strategies against one another. Both are very simple and were devised within a few days, in the small window of time I have between my kids falling asleep and my inability to keep my eyes open.

The two strategies are as follows (on a daily trading routine) and are based only on S&P 500 Stocks:

  • Rule Of Thumb: Every day check the S&P 500 stocks and sort them by strength of linear growth in the past 15 days, and by how much the price has gone down in the past 3 days. Buy the 50 stocks that had the strongest linear growth but lost the most value in the last 3 days. The rationale is that the price should bounce back to regress to the mean of the linear growth.
  • Machine Learning: Create a model that, based on 15 days of S&P 500 stock price history, gives the probability that the price will grow the next day. Buy stock proportionally to that probability, when the probability is > 50%. The model is trained on data from 50 random dates in the first part of 2023.

As you can see, even the Machine Learning approach is really simple: we are feeding the model only 15 days of stock price data and actually creating only 3 features: average 15-day growth, average 15-day volatility and the last 3 days’ price change (as a ratio). This is super basic, and indeed only used to compare with the basic Rule of Thumb approach. The idea is that we only want to assess the value of the Machine Learning algorithm (a simple logistic regression) and keep things as similar as possible to the basic Rule of Thumb strategy.
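As a rough illustration of the ML side (a sketch of the same idea, not my actual code), the feature construction and the logistic regression could look something like this in R, assuming a data frame called prices with daily closing prices per ticker:

library(dplyr)

# prices: data frame with columns ticker, date, close (daily closing prices)
make_features <- function(prices) {
  prices %>%
    arrange(ticker, date) %>%
    group_by(ticker) %>%
    mutate(
      ret_1d     = close / lag(close) - 1,
      growth_15d = close / lag(close, 15) - 1,             # growth over 15 days
      vol_15d    = sapply(seq_along(ret_1d), function(i)   # rolling 15-day volatility
                     sd(ret_1d[max(1, i - 14):i], na.rm = TRUE)),
      chg_3d     = close / lag(close, 3) - 1,              # last 3 days' price change
      up_next    = as.integer(lead(close) > close)         # target: price up tomorrow?
    ) %>%
    ungroup() %>%
    na.omit()
}

feats <- make_features(prices)
train <- feats %>% filter(date %in% sample(unique(date), 50))  # 50 random training dates
model <- glm(up_next ~ growth_15d + vol_15d + chg_3d,
             data = train, family = binomial)

# Daily trading rule: buy proportionally to the predicted probability when > 50%
feats$p_up <- predict(model, newdata = feats, type = "response")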

Here are the results! (trading from July 23 to June end 24, price/value indexed at the first day of trading):

Now it does not look like a massive win, but we are looking at a return on investment of ~50% as opposed to ~15% (the S&P index gains) over a year. It is also of note that the ML model was actually doing worse when the S&P500 index overall was going down in the second half of 2023. The rule of thumb instead seems to basically track the S&P500 index overall.

To be honest I could have given up on the ML trading algorithm when the S&P500 index was going down actually, but overall it is interesting to see what such a simple ML model could achieve in a year.

When I showed that chart to my wife she said: “You made up this chart with your dreams I think… stick to the ETF, but keep dreaming”. I do not blame her skepticism; the chart above looks too good to be true from Jan ’24 onward.

In order to get more clarity and “curb my enthusiasm” I have tried the same comparative analysis over a different period, when the S&P500 index was indeed quite volatile, from July ’22 to the end of June ’23 (ML model trained on 50 dates from the first half of 2022).

Below are the results:

It looks like in more volatile times the Machine Learning trading algorithm tracks the S&P500 index pretty closely, but we also see that the Rule of Thumb approach does not transfer well, as it fails miserably.

It is also of note that, over 250 trading days, the ML approach beat the S&P index on 142 days and beat the Rule of Thumb approach on 245 days.

In conclusion no Jim Simons level wizardry (look Jim Simons up if interested in algorithmic trading) but as for our purpose, the example shows pretty clearly what value Machine Learning can add even in a very simple setting and with highly volatile data.

What would then be possible by integrating additional data sources like business performance, macro indicators, internet searches for stock terms, investor updates and more?

Forecast: AI assistants will enable even further “scientization” of business practices and decision making, but we will need people able to articulate the solutions to a variety of audiences.

Note: a decent book on financial analytics is below; code-wise it is based on R, but easy to transfer to Python if needed:

Statistics, Machine Learning and Deep Learning…

Recently I am sure you must have often heard about AGI (Artificial General Intelligence) and that it has something to do with Deep Learning and all that. But what is Deep Learning? What are the differences from things you might be more familiar with, like Stats or Machine Learning?

Let me give you a tour, and in the end you’ll find out that Deep Learning is not as mysterious as it sounds (in fact the backbone is linear algebra plus the calculus you studied at college).

Patterns, Patterns and more Patterns:

Overall we are talking about patterns, and Maths is fundamentally the discipline that “speaks” about patterns. In a sense this is the most fundamental definition of Mathematics (see Hofstadter’s foreword to ‘Gödel’s Proof’ by Nagel & Newman).

Let me put in front of you 3 apples, 3 pens, 3 books, 3 plates and 3 balls… now forget what ‘3’ means in my sentence (if you can), then answer this question: what is the pattern among the various objects that I have presented to you? The answer should be that they all seem to be a group of… three distinct objects of a given category. There you have it, a definition of the most basic pattern, and that’s what we call a number. It might seem I am digressing, but I want you to get a bit abstract here, so let me indulge.

There are then hundreds of specific disciplines that study patterns and the relationships and properties of patterns. Calculus, for example, is the branch of mathematics that studies patterns between infinitely small quantities and how those patterns generalise; geometry can be looked at as the study of patterns in shapes (or distances), etc.

From this perspective Statistics, Machine Learning and Deep learning are of the same family, they sit in a branch of Mathematics that studies how to reconstruct or recognise patterns. For example, I could give you the below series of numbers:

2, 4, 8, 16, 32 etc.etc.

And we could try and understand whether the pattern is: number at position n+1 is twice that at position number n.

Where Statistics, Machine Learning and Deep Learning differ is in how they represent patterns, and in what type of questions about patterns they focus on answering.

With “representation” I mean something pretty basic, in essence a map that allows you to work with the pattern. For example for the square function you can use several representations:

X^2, or X*X, or you could draw a square with a side of length x, or go more abstract and draw a parabola that meets the origin, or as a derivative, or an integral, or any other way you want really, so long as that representation is useful to you.

That representation will only be a useful map to you so long as it helps you move toward your objective, e.g. are you trying to draw a trajectory? Or prove a geometrical theorem? or solve an equation? etc.etc.etc.

Statistics:

Patterns are represented by parameters of special functions we call distributions, so statistics focuses on finding parameters and telling us how confident we should be that the pattern we identified can be mapped by a given relationship determined by those parameters. We don’t always want to “predict” in Statistics; more likely we want to make sure whether we are onto something substantial or not (e.g. is the proposed medicine sure to drive benefits? Does the defendant’s DNA match the blood sample?).

Statistical Inference, to use an example, would like to answer the question of whether the income of a customer has an impact on how much the customer is likely to spend on your products. It will give you a yes or no answer (yes for jewellery… not so much for bottled water, perhaps?).

Statistics is a highly developed discipline that is able to define boundaries and provide indisputable proofs, as it is a branch of Mathematics. For example it can answer the following general questions:

Under certain assumptions on my data, what is the “best” test to verify my hypothesis? What is an admissible test? What is the dynamic policy that maximises utility? What is the best strategy to win a game? Etc.

Machine Learning (or Statistical Learning):

With Machine Learning we are interested in teaching computers to identify patterns, for the purpose of making predictions. For example, if the income of my customer is X how much is she likely to spend on our products? Our task will be that of giving as much data as we can, and a learning framework, for the machine to establish that pattern between income and spend (if there is one).

In Machine Learning it is also important how patterns are represented. We can only use representations that can be implemented in computer programs efficiently.

For example decision trees represent patterns as a series of if-else statements… if this -> then this else that etc.etc. So, in the case of decision trees, the pattern needs to be amenable to mapping to simple if-else statements (e.g. if income is between 100 and 200 then spend will be 10, if greater than 200 then 20 etc.etc.), but overall all Machine Learning algorithms have their own specific ways of representing patterns.

To give another example, K Nearest Neighbours is an algorithm where patterns are identified “spatially” by mapping the input data to an abstract space where we can define distances. In our example a KNN learning approach would put incomes and spend in a two dimensional space and try to find, when you give it only income, the spend that puts the point in a region of space that “fits” with the geometry of the shape the training data drew.
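To make the two representations tangible, here is a toy sketch in R on the income/spend example, with made-up data: the decision tree learns if-else splits on income, while a hand-rolled nearest-neighbours rule predicts spend by averaging the customers closest in income.

library(rpart)
set.seed(1)

# Toy data: spend grows with income in steps, plus some noise
income <- runif(500, 20, 300)
spend  <- ifelse(income < 100, 10, ifelse(income < 200, 15, 25)) + rnorm(500, 0, 2)
df     <- data.frame(income, spend)

# Decision tree: the learned pattern is a set of if-else rules on income
tree <- rpart(spend ~ income, data = df)
print(tree)                               # shows the income splits the tree chose

# K nearest neighbours, hand-rolled: the pattern lives in the geometry of the data
knn_spend <- function(new_income, k = 10) {
  nearest <- order(abs(df$income - new_income))[1:k]   # k closest incomes
  mean(df$spend[nearest])                              # average their spend
}

predict(tree, data.frame(income = 150))   # tree prediction for income = 150
knn_spend(150)                            # KNN prediction for the same customer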

Unlike Statistics (and Decision Theory as a consequence), Machine Learning does not give closed, definitive answers to well-defined questions. There is no theorem that goes something like: given some assumptions on your data, this specific learning approach is guaranteed to give you the best prediction.

Machine Learning is therefore closer to engineering than Mathematics, a lot of the effort (and fun) in Machine Learning is that you have to play with the tools to go somewhere and find new opportunities.

Obviously the algorithms are not found by trial and error only, there is a lot of pure statistics behind Machine Learning but it is mixed with research in computer science as well.

Deep Learning (or Deep Neural Networks):

Deep Learning is a specific branch of Machine Learning where patterns are to be represented by Networks of fundamental computational units called “Neurons”.

Such Neurons are not that complicated, as they perform basic algebra plus a simple transformation (to be decided upfront), but the overall architecture of the networks, plus their size, can make them incredibly powerful. In fact Neural Networks have a nice theorem to them, the universal approximation theorem, which states that, under certain assumptions, neural nets can indeed approximate any well behaved function, guaranteed! (we do not have such theorems for decision trees or KNN).
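To see how un-mysterious a single Neuron is, here is one in a couple of lines of R: a weighted sum of the inputs plus a bias, passed through a fixed non-linear transformation (tanh here; the choice of transformation is one of those upfront decisions):

# One "neuron": weighted sum of the inputs + a bias, then a non-linear squashing
neuron <- function(x, w, b) tanh(sum(w * x) + b)

neuron(x = c(0.5, -1.2, 3.0), w = c(0.1, 0.4, -0.2), b = 0.05)

A deep network is “just” very many of these wired together in layers, which is where both the power and the opacity come from.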

ChatGPT, for example, has millions of Neurons and hundreds of billions of parameters that are only devoted to the task of predicting what sentence from a given language best fits the “pattern” of the conversation that a given prompt initiated (in a human-like way).

As with Machine Learning, and even more so, Deep Learning is closer to Computer Science than Statistics, since we basically need to proceed experimentally within technical constraints. We do not have solid theoretical foundations for the results that are obtained in Deep Learning; we literally do not know what ChatGPT is doing in much detail. Through trial and error, a certain network architecture (called the transformer) has been found to be effective, but this is also driven by what architectures we can implement within the constraints of existing computers and computer languages, so Deep Learning is also about the hardware.

A lot of current research focuses on architectures and how to set up various layers of the network.

To be clear this is not “trial and error” in the common language sense, but works more like experimental Physics, with a close interplay with theoretical advances and areas of inquiry.

I hope you found the above useful but for anything a bit deeper on Deep Learning, I recommend the two books below:

Both very readable introductions, and can also give a sense of what might be needed to work on AI as opposed to consume AI, or what might be needed to develop AI in house (lots of compute power, computer scientists and mathematicians for sure).

Where is the uncertainty?

When it comes to making decisions we are often faced with the issue of dealing with uncertainties, meaning that we have to make decisions on the basis of the limited information we have, and our best understanding at a point in time.

In this post I want to tell a story that hopes to shine some light on the multiple places where uncertainty might be lurking, on the potential costs of such uncertainties, and on how to deal with those costs.

Note: This is a data heavy post, easier to read by anyone with some exposure to statistics, but I tried to make the story as easy to read as possible.

THE GOOD & BAD BEES STORY

In this story we have:

  • Sophie: the owner of a business that relies on Bees (for example a honey farm)
  • Tim: Sophie’s husband and COO of the company
  • Reggie: A smart employee

The business is faced with a great threat: Bad Bees have started to appear that damage the business by effectively poisoning Good Bees… what to do?

Sophie: “Tim! Come over. So what is the issue with the Bad Bees?”

Tim: “Well, we might have 10%-20% bad bees and they generally spoil the work of 3-5 bees and…”

Sophie: “Cut with the numbers tale! What’s the damage?”

Tim: “Well, revenues down at least 30%…possibly 100%”

Sophie: “Wow, ok you got my attention now …”

Reggie: “Well, we have some news from the Govt, we should be able to manage some of the impact… the bad bees are larger and hotter than normal ones, but otherwise totally similar… we can try and identify them and shoot them with lasers!”

Tim: “We have lasers?”

Reggie: “Yes, just in case”

Tim: “Ok, how does that work?”

Reggie: “See below, it’s a chart showing the Bees sizes and body temperatures released by the Govt…”

Tim: “What does that mean?”

Sophie: “Tim, aren’t you supposed to be clever? It means that we can try and classify Bees in Good or Bad Bees using length and temperature and eventually shoot the bad ones with the lasers! Isn’t it obvious?”

Reggie: “That’s right Sophie but… we can’t rely on the temperature unfortunately, current detectors are not able to measure that with any reliability, our cameras can calculate the Bee body length very accurately though and then… “

Sophie: “Lasers! I like the plan, work it out and bring me the numbers ASAP”

Reggie goes back to his team and they start to work through a solution.

We see the first uncertainty coming in: if we could use both temperature and body length the issue could be solved, but we will need to rely on limited information. Knowing only the length leaves us with uncertainty (potentially resolvable with body temperature) on whether a Bee is Good or Bad.

See below:

With a sample of 100 genetically tested Bad and Good Bees it looks like the Good Bees are about 10cm (…yes, in this world we have engineered massive bees, not sure why) whilst Bad Bees are often much larger. We now have a second source of uncertainty: not only can we not rely on temperature data, but we also have to rely on limited data to assess the difference in lengths between Good and Bad Bees.

Reggie nevertheless goes ahead and fits a machine learning model that tries to predict whether a Bee is Good or Bad based on the body length, and he gets the below ROC curve:

ROC stands for “Receiver Operating Characteristic”, but that curve basically tells us how good Reggie has been with his modelling, the higher the AUC (area under the curve) the better.

The AUC tells us the probability that the model, using body length, will correctly rank a randomly chosen Bad Bee above a randomly chosen Good Bee.

Overall the model is good; it will be right 80% of the time on average. The Lasers will work well… on average, but if you go back to the previous chart (the Bees’ lengths chart) you can see that, despite the good AUC, we have a large overlap in the length distributions of the Bees. In fact about 50% of Bad Bees have a body length that is very common for Good Bees too, so always beware of averages. It is relatively easy to be right on average, but the lasers will struggle to be right consistently (I will share more about this in a later post).
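For the data-inclined, the whole exercise can be reproduced in a few lines of R with simulated bee lengths (Normal distributions, as in my simulation, although the exact parameters here are illustrative) and the pROC package:

library(pROC)
set.seed(7)

# Simulated body lengths in cm (illustrative parameters only)
good_bees <- rnorm(900, mean = 10, sd = 1.5)   # Good Bees around 10cm
bad_bees  <- rnorm(100, mean = 13, sd = 2.5)   # Bad Bees larger, but overlapping

length_cm <- c(good_bees, bad_bees)
is_bad    <- c(rep(0, 900), rep(1, 100))

# With a single predictor the "model" is essentially the length itself
rocobj <- roc(is_bad, length_cm)
auc(rocobj)                         # how well length separates the two kinds of Bees

# Trade-off at a given firing threshold, e.g. 11cm
mean(bad_bees > 11)     # share of Bad Bees hit by the lasers
mean(good_bees > 11)    # share of Good Bees lost as collateral damage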

Well, here we actually have a third source of uncertainty: did Reggie do his modelling as well as possible? Is this predictive model as good as it gets given the previous two uncertainty constraints?

Given I have simulated the data, I can tell you he got it almost right: the best possible model would be right 83% of the time instead of the ~80% Reggie achieved. That 3% can be very costly indeed, as it means being wrong once every 5.8 laser beams rather than once every 5 (yes, ROC curves are hard to read and interpret…).

Finally, if you remember, we have additional uncertainties, which are of a nature beyond the modelling exercise; those are uncertainties that we have not yet been able to model, that is:

  • Prevalence: Govt says about 10%-20% of Bees will be Bad bees, that’s quite a range
  • Cost of the Bad bees: Govt says that a Bad Bee is likely to spoil the work of 3-5 Good bees, still quite a range…

Reggie goes back to Sophie and Tim.

Reggie: “Ok I have got the plan. We have developed a predictive model…”

Sophie: “With ChatGPT? I know that thing is smarter than Tim…”

Reggie: “No, no we have done it ourselves…”

Sophie: “Oh my god…”

Reggie: “No, don’t be like that, the news is good! But I need your help and opinion on how many Bad Bees we want to be prepared for, and what level of damage…”

Sophie: “Ask ChatGPT Reggie!”

Reggie: “The chat thing doesn’t know!….”

Tim: “Hey, can we move beyond the ChatGPT stuff?”

Sophie: “Reggie, explain…”

Reggie: “Ok I have developed this predictive model, it can be right 80% of the time, impressive I know…”

Sophie: “Is that impressive?, anyway continue…”

Reggie: “Right right, yes the model is good but we need to decide on a threshold of length for the lasers, the model leverages the length of the Bee that we detect from our cameras, and we need to decide when to fire. Take a look at the chart…”

Reggie: “If we believe the Bad Bees will be 20%, and will damage the work of 5 Good Bees each, we need to start firing at any Bee longer than 10cm… this will mean killing half of the Good Bees, but also 85% of the Bad ones!”

Sophie: “That’s not good enough! You want to kill 50% of our lovely Bees!!”

Tim: “Calm down Sophie, we will ask ChatGPT later, let him finish…”

Reggie: “Right, then if we assume that prevalence will be 10% and the damage will be that of 3 Good Bees, we can hit at ~13cm. This will kill almost no Good Bees! But it will also only hit about 30% of the Bad Bees…”

Sophie: “Ok, let’s talk money, what are we looking at in terms of impact on revenues? How much revenue will we lose if we hit 50% of our Good Bees but the Bad Bees won’t be that many? Or that Bad?”

Reggie: “That would be 55% lower revenues Sophie…”

Sophie: “… What about if we are going for the less aggressive scenario, but the Bad Bees are actually going to be super Bad?”

Reggie: “We would lose 80% in that scenario Sophie”

Tim: “There you have it! the lasers should fire away!…”

Sophie: “Wait, I have just asked the Chat thing, the proposed approach is to start with the super firing strategy, but then reassess every week or so by sending the dead Bees for a genetic assessment to check prevalence”

Tim: “We can also measure the output of the Good Bees work to see what’s the damage from the Bad Bees we won’t hit?”

Sophie: “Finally you say something clever Tim! Ok we will proceed this way… get the lasers and videos ready, as well as the process to work together with Govt on a timely manner, and the predictive model governance to make sure things are smoothly updated”

Sophie: “Reggie, why is uncertainty so costly?”

Reggie: “I don’t know Sophie, it’s a matter of numbers I guess…”

The little story finishes here, and I can tell you that Sophie and team will make it, and will eventually also sell their machine learning assets to smaller farms, growing their business beyond the small operation it was!

Given I have fully simulated the data it is interesting to note that the various uncertainties carry very different costs.

The sample and model uncertainty, in this case, accounts for about 10% of the costs of the uncertainty, whilst, in various scenarios the uncertainty on prevalence or “Badness” of the Bees is actually more prominent, accounting for the remaining 90% of costs.

So, here Sophie’s monitoring approach will be fundamental, whilst Reggie’s team’s improvements on the modelling might not change the outcomes much.

This was a very simple example with very well behaved data (I simulated the whole process with Normal distributions, for the data scientists out there), but in reality the nature of the uncertainties can be messier and their costs much larger.

Let me leave you with two questions:

How often do you think of where the uncertainties lie? And which are more or less costly or addressable?

How often is that thinking clearly expressed quantitatively?

Causal inference: suggested readings and thoughts

I have studied statistics and probability for over 20 years and I have been constantly engaged in all things data, AI and analytics throughout my career. In my experience, and I am sure in the experience of most of you with a similar background, there are core “simple” questions in a business setting that are still very hard to answer in most contexts where uncertainty plays a big role:

Looking at the data, can you tell me what caused this to happen? (lower sales, higher returns, etc.)

Or

If we do X can we expect Y?

…and other variations on similar questions.

The two books I have read recently focus on such questions and on the story of how statisticians and computer scientists have led a revolution that, in the past 30-50 years, succeeded in providing us with tools that allow us to answer those questions precisely (I add “precisely” since, obviously, those questions always get answered somehow, but through unnecessary pain, and often with the wrong answers indeed).

Those tools, and that story, are still largely unknown outside of academia and the AI world.

Let’s come to the books:

1 – Causal Inference by Paul R. Rosenbaum

2 – The Book of Why by Judea Pearl and Dana Mackenzie

Both books are an incredibly good read.

Causal Inference takes the statistical inference approach and tells the story of how causes are derived from statistical experiments. Most of you will know the mantra “Correlation does not imply Causation”, yet this book outlines in fairly simple terms how correlations can be, and indeed are, leveraged to infer causation.

The typical example here is how the “does smoking cause cancer?” question was answered, and it was answered, obviously, without randomised trials.

The Book of Why is a harder read and goes deeper into philosophical questions. This is natural given the authors are trying to share with us how the language of causation has been developed mathematically in the last 30 years, and the core objective here is to develop the tools that would allow machines to answer causal queries.

I want to get more into the details of some of the key concepts covered in the books and also to give you a sense of how useful the readings could be.

Starting with Rosenbaum’s, I would point out that this is, overall, also a great book for getting a sense of how statistical inference, and the theory of decisions under uncertainty, is developing.

This book is a gem, no less.

It starts very simply with the true story of George Washington, who died after having been bled by his doctor (common practice back then), and asks: would Washington have died, or recovered, had he not been bled?

He then moves on to explain randomised trials, causal effects, matching, instruments, propensity scores and more.

The key point here is that the only tool for statistical inference about causes that was well developed and accepted up to the 70s was the randomised trial: in medicine, for example, giving a treatment to a random sample of individuals and a placebo to the others, then checking the difference in outcomes to make inferences.

This procedure is not in itself causal; in principle it is still logically flawed with respect to answering a causal query, but it works as follows (a small simulation sketch follows the list):

  • I see outcome O associated with treatment T
  • What are the chances that I would see O regardless of treatment T?
  • If chances are low, that is evidence of treatment T causing outcome O (an association that is unlikely to arise by chance is, after all, interpreted to imply causation)
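
Here is the sketch mentioned above: a toy randomised trial in R, with all numbers made up, where the question “would I see a difference this large if the treatment did nothing?” is answered by shuffling the treatment labels:

  set.seed(1)
  # Hypothetical trial: 50 treated, 50 on placebo
  treated <- rnorm(50, mean = 1)     # the treatment shifts the outcome a little
  placebo <- rnorm(50, mean = 0)
  observed_diff <- mean(treated) - mean(placebo)

  # How often would a difference this large appear if T did nothing?
  outcomes   <- c(treated, placebo)
  null_diffs <- replicate(10000, {
    shuffled <- sample(outcomes)     # break any link between treatment and outcome
    mean(shuffled[1:50]) - mean(shuffled[51:100])
  })
  mean(abs(null_diffs) >= abs(observed_diff))   # low value = evidence that T causes O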

Rosenbaum goes on to explain why the above works in the causal inference framework, as it is interpreted as a counterfactual statement supported by a particular type of experiment, and then moves on to show that even observational studies (where there is no placebo, for example) can provide answers that are as robust as randomised trials/experiments.

Other key points concern the struggle that the statistical community went through, and still goes through today, when working with observational studies. Of note is the historical account of the debate when the relationship between smoking and lung cancer was investigated, with the unending “what ifs”… what if smokers are less careful? What if smokers also drink alcohol, and alcohol causes cancer? And so on.

An illuminating read, which also sheds light on how the debate on climate change is addressed by the scientific community.

Moving on to The Book of Why

I love one of the statements found early in the book:

“You cannot answer a question that you cannot ask, you cannot ask a question you have no words for”

You can tell this book goes deeper into philosophy and artificial intelligence as it really aims to share with us the development of a language that can deal with posing and answering causal queries:

“Is the rooster crowing causing the sun to rise?”.

Roosters always crow before sunrise, so the association is there, but can we easily express in precise terms the obvious fact that roosters do not cause the sun to rise? Can we even express that question in a way a computer could answer?

The authors go into the development of those tools and the story of what hindered this development, which is the “reductionist” school in statistics. Taking quotes from Karl Pearson and his follower Niles:

  • “To contrast causation and correlation is unwarranted as causation is simply perfect correlation”
  • “The ultimate scientific statement of description can always be thrown back upon…a contingency table”

The reductionist school was very attractive as it made the whole endeavour much simpler, and mathematically precise with the tools available at the time. There was a sense of “purity” to that approach, as it was self-consistent although limited, but it effectively implied that causation, which was a hard problem, was unnecessary (the best way to solve a hard problem, right?). Ironically, as the authors also point out, this approach of assuming that data is all there is, and that associations are enough to draw an effective picture of the world, is something that novices in machine learning still believe today (I will probably talk more about this in a later blog post).

Pearson himself ended up having to admit that correlations are not all there is, and are often misleading. He later compiled what his school called “spurious correlations”, which are different from correlations that arise “organically”, although what that actually meant (namely, causal correlations) was never addressed.

The authors also introduce the ladder of causation (association, intervention and counterfactuals; in other words seeing, doing and imagining), see below:

It is referenced throughout the book and is indeed a meaty concept to grasp as one goes through the 400 pages of intellectual mastery.

What Pearl and Mackenzie achieve, and Rosenbaum does not even aim for, is to invite the reader to reflect upon what understanding is, and how fundamental causality is to our intelligence.

They also then share the tools that allow for answering very practical questions, e.g.:

Our sales went up 30% this month; great, but how much of that is due to the recent marketing campaign and how much is driven by other factors?

The data scientists among you know that tools to address that question are out there, namely structural equation modelling and do-calculus; these are closely related to the structural causal models that Pearl promotes and, ultimately, the step of introducing causal hypotheses is unavoidable.
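
As a hedged toy illustration of the point (variable names and coefficients entirely made up): suppose seasonality drives both marketing spend and sales. A naive regression of sales on marketing mixes up the two sources of the uplift, whilst adjusting for the confounder, as a structural causal model would prescribe via the backdoor criterion, recovers something close to the effect you would measure under an intervention on marketing:

  set.seed(7)
  n <- 5000
  season    <- rnorm(n)                                    # confounder: seasonality
  marketing <- 0.8 * season + rnorm(n)                     # spend reacts to the season
  sales     <- 0.3 * marketing + 1.0 * season + rnorm(n)   # true causal effect of marketing = 0.3

  coef(lm(sales ~ marketing))["marketing"]            # naive estimate, inflated by the confounder
  coef(lm(sales ~ marketing + season))["marketing"]   # backdoor adjustment, close to 0.3

The point is not the regression itself, but that the adjustment is justified by an explicit causal diagram rather than by habit.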

Conclusions:

I recommend the books to any knowledge seeker, and to anyone involved in decision making (or in selling benefits that are hard to measure).

I would start with Rosenbaum’s book, as it is less than 200 pages, and, if time is scarce, I would prioritise reading Pearl and Mackenzie’s book up to chapter 6 first (190 pages).

Bias & Variance: data science or more?

Due to the nature of my profession I am often engaged in conversations that revolve around the definition, the role and the potential of data science and artificial intelligence. Often the conversation starts out of the desire to understand how to use this new bag of tools called “data science”… a typical case of a solution trying to find a problem. I am not criticising this approach per se, as it is only natural that a relatively new discipline will go through this exploratory phase, but I do observe that the conversation often starts away from data scientists and data professionals in general.

At the core of those conversations is the desire to outline a data science strategy in a business context.

The issue is that the knowledge necessary to have a fruitful discussion is, sometimes, technical in nature and, therefore, I will now share an important technical aspect of doing data science that I believe scales up to be a guiding principle for more general business strategy: the bias-variance trade-off.

It is a fact of statistical learning that the error of a predictive model is the sum of an error due to bias and an error due to variance (note: the variance of the model, not of the data), on top of the irreducible noise in the data itself.

In other words, our predictive models will either be too stable, and wrong because of strong assumptions (BIAS), or too unstable, because of their sensitivity to individual data points (VARIANCE).

You could also use the terms under-fitting and over-fitting, but fitting sounds technical as well.
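
To make the trade-off concrete, here is a small hedged sketch in R (entirely simulated data): a straight line is too rigid and misses the curve whatever sample you give it (bias), whilst a 15-degree polynomial chases the noise of the particular sample it sees (variance).

  set.seed(3)
  true_f <- function(x) sin(2 * pi * x)
  x <- runif(20)
  y <- true_f(x) + rnorm(20, sd = 0.3)
  grid <- data.frame(x = seq(0, 1, length.out = 200))

  rigid    <- lm(y ~ x)             # high bias: strong (linear) assumption
  flexible <- lm(y ~ poly(x, 15))   # high variance: very sensitive to the particular sample

  mean((predict(rigid, grid)    - true_f(grid$x))^2)   # mostly bias: wrong whatever sample you draw
  mean((predict(flexible, grid) - true_f(grid$x))^2)   # mostly variance: re-draw the data and this changes a lot

Re-run the flexible fit on a fresh sample and its predictions move visibly; the rigid fit barely changes, it is just consistently off.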

The drawing below shows where some popular data science techniques stand in this two dimensional scale (bias & variance):

[Chart: popular data science techniques placed on the bias-variance scale]

Older, traditional methods often carry strong assumptions (simple linear models, for example), yet provide decent and, most importantly, coherent results when the data is updated. Some of the most modern methods, instead, try to dispense with assumptions as much as possible and rely completely on the data, yet they can produce nonsensical results if the data varies too much, or if the data genuinely contains nonsensical behaviour that the model then faithfully reproduces.

It also follows that low-bias methods need much more data (unless the data is highly structured), and cannot be used lightly in contexts where the objective is to model general behaviours (as in forecasting).

Historically, then, we see that businesses have relied on the two types of approach being performed by different units: subject matter experts providing forecasts (for example), and insight analysts providing complex numerical tables and charts. The first category is (generally) formed by biased experts (e.g. “the product will sell 10% more if priced down by a fifth”), the second by research types looking impartially at the data (“the chart says that your customers are already unhappy with the price, yet still buying your product”). Inevitably, on several occasions, it turns out that the SMEs are right if the price is changed by exactly a fifth but terribly wrong otherwise, and that the researchers have drawn conclusions from a small sample of customers with unclear methodological issues.

The drawing below sort of summarises the above paragraph:

[Drawing: summary of the paragraph above]

Now, when I think “what can data science do for this organisation?”, my technical brain answers: it can help the organisation strike a better bias-variance trade-off! And this actually means business transformation. Data science can bring together the insight and product teams; it can also bring management closer to customers and partners, and all of this by simply striking a better balance between a business’s instinct toward common practices/assumptions (bias) and a business’s need to react quickly to new information (variance).

We can easily see that the two dimensions are not bad at modelling decision making in several other aspects of life, e.g. will I vote for the party I have always voted for (bias/assumptions), or judge strictly on the manifesto commitments (variance/data)?

It is also a good framework for prioritising items in a data science transformation (net of business value): for example, if you don’t have customer data, start collecting it and keep relying on smart and experienced marketers, whilst if your product data is in good shape, start developing advanced models for cost-effective production and quality control.

Data science is pretty much a discipline born out of the growth of available data and, therefore, of the possibility to move away from bias. Ignoring this fact when designing a data science strategy is akin to ignoring what a long-range catapult could do in the Middle Ages.

You can lose the war for it.

Entropy and fraud analytics

When I was studying Physics at university, I fell deeply in love with one “quantity”, or “concept”: the ENTROPY of a system.

Perhaps you have not heard of it, but it is so important that in some ways it relates to:

  1. Time itself
  2. Disorder
  3. Information
  4. Energy
  5. Optimisation
  6. Economics and much more…

I find it incredibly interesting that the time evolution of the universe in one direction (“universe”, from the Latin, roughly means “one direction”… please do not think of the boy band!) rather than another is still a fundamentally unexplained feature of reality. In fact, it appears that systems that process information (like us) would not perceive time if it weren’t for the tendency of Entropy to increase (and we would be immortal too!).

Closing the Physics parenthesis: in the context of this blog post, I will treat Entropy as a measure of “disorder”, and also as a measure of the “information” contained in “something”.

To be more explicit, I see something as possessing a high level of disorder if changing it slightly does not change the way I see it. For example, if your desk is a mess (high disorder), I can move some paper from the left to the right and I still basically see a messy desk. If your desk is tidy (low disorder) and I put a t-shirt on it… I easily perceive that something is not as it should be.

Also, when I refer to the information possessed by something I mean pretty much the number of “meaningful” answers to straight yes/no questions that something can give me. Here “meaningful” is not defined but I hope you can follow.

What does this have to do with fraud analytics?

Let me put it this way, by asking you a couple of questions:

  1. When a certain system (you could think of a payment processing infrastructure, or a trading desk) is being cheated, do you think the system will be more or less tidy?
  2. Do you think the information content of a system where fraud happens is larger or smaller than that of a normal system?

When fraudsters act, they do not act randomly; they act on purpose and follow patterns, and therefore your system could show signs of being “tidier” than usual. For example, processing several hundred payments of the same amount, or seeing traders all following a particular investment strategy (perhaps suggesting insider trading?), might not be a natural state of affairs.

When you see something following a pattern, you instinctively think that there is SOMEONE behind it… in other words, it cannot be random. This is equivalent to saying that Entropy tends to be high, unless someone works to make it low (e.g. by tidying your desk).

This is where Entropy can come in and help fraud analysts monitor a system and see whether too many patterns are emerging.

It gets even more interesting, since we can also calculate the Entropy of X given Y. We can therefore analyse relationships between variables with all the weaponry that statisticians use to establish relationships.

Let’s look at some numbers.

Here’s a vector of random numbers: 52  31  22  52 100  46  11  24  77  21

Here’s a vector of not so random numbers: 10  10  10  20  20  20  30  30  30 100 

Using R (the statistical computing language), for example, we can calculate the Entropy of the random collection of numbers. We get 2.16 (don’t worry about the units for now).

If we calculate the Entropy of the not so random vector we get 1.3.

But let’s now add some additional information and context. Let’s consider the vector of not so random numbers as the amounts withdrawn on certain days. Let’s also look at the days: Mon, Tue, Wed, Tue, Wed, Thu, Thu, Fri, Sat, Sun.

Now what’s the Entropy of those cash withdrawals given the days they were made?

0.4

That tells us that this is a pattern; not necessarily a fraudulent one, but if we have a hypothesis about the typical Entropy of cash withdrawals we can get a sense of whether a bot or fraudster has brought order where there shouldn’t be any!
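
For the data scientists who want to reproduce these numbers, a minimal base R sketch is below. I am assuming Entropy here means the Shannon entropy (in natural log units) of the empirical distribution over the distinct values, and conditional Entropy means H(X|Y) = H(X,Y) - H(Y):

  # Shannon entropy (natural log) of the empirical distribution over distinct values
  shannon <- function(x) {
    p <- table(x) / length(x)
    -sum(p * log(p))
  }

  random_amounts    <- c(52, 31, 22, 52, 100, 46, 11, 24, 77, 21)
  patterned_amounts <- c(10, 10, 10, 20, 20, 20, 30, 30, 30, 100)
  days              <- c("Mon", "Tue", "Wed", "Tue", "Wed", "Thu", "Thu", "Fri", "Sat", "Sun")

  shannon(random_amounts)      # ~2.16
  shannon(patterned_amounts)   # ~1.3

  # Conditional entropy H(X | Y) = H(X, Y) - H(Y)
  conditional_shannon <- function(x, y) shannon(paste(x, y)) - shannon(y)
  conditional_shannon(patterned_amounts, days)   # ~0.4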

Overall, the idea of utilising Entropy in analytics is a fascinating example of the variety of tools, and sources of inspiration, that can help data professionals (and data-driven organisations) achieve their objectives.