

What to do when your model results are ridiculous?

Max Henrion, 12 Nov 2018. Categories: Analytics and OR, Case studies and applications, Energy and environment, Modeling methods, News

Experienced analysts and modelers know that “Oh shit!” moment all too well. You inspect the first results from your new model and they are obviously wrong. They are an order of magnitude off from what you expected. Or you change a key assumption, but it has no effect on the result. Or it has a big effect, but in the direction opposite to what you expected. Or it recommends as the best option what is clearly a bad decision. So you dig through the guts of the model to find out why. Often the problem turns out to be a simple mistake: you mistyped a formula, or you divided by a conversion constant when you should have multiplied, or the data is badly formatted. But sometimes, after exhaustive scrutiny of the model code and the data, and after rerunning the sensitivity analyses, you just can’t see anything wrong, at least not in the model. Eventually, it starts to dawn on you. The problem is a lot more interesting than a mere modeling mistake: it is the original intuition behind your expectations that is wrong. Finally, that “Oh shit!” feeling turns into a valuable new insight and a reason to celebrate!

The Apple Tree Bake-Off

Here’s a case from early in my career as a decision analyst. This case also shows why it helps to have a normative framework like decision analysis – instead of treating human expertise and intuition as the “gold standard.”  

At the second Conference on Uncertainty in Artificial Intelligence (UAI) back in 1986, I found myself in a heated debate with another young professor. I was trying to convince him that the best way to treat uncertainty in an expert system or decision support system is to use probability distributions structured in Bayesian belief networks and influence diagrams. (Influence diagrams extend belief networks by adding decisions and objectives to the chance variables.) He replied that natural human reasoning just doesn’t use numerical probabilities or Bayesian inference. It’s more natural to represent human expertise as a set of if-then rules with heuristic measures of the “strength of evidence”. Such heuristic rule-based expert systems were the basis of the first boom in artificial intelligence in the early 1980s. (The boom ended with a bust, the “AI Winter”, that descended at the end of that decade, as people realized that rule-based systems were not sufficiently scalable.) 

The early UAI conferences spawned many such debates. The Society for UAI was founded by researchers (including me) from a range of then-disparate fields, including AI, logic, statistics, decision theory, psychology, and philosophy. We wanted to clarify and, we hoped, resolve the differences among the many competing schemes for reasoning under uncertainty, including rule-based expert systems, Bayesian belief and causal networks, influence diagrams, fuzzy set theory, nonmonotonic logic, Dempster-Shafer theory, and more. Some of us wanted to find or create an approach that was grounded in a compelling theory of rationality but also tractable in terms of the human effort to build systems and the computational effort to run them. 

To his great credit, and unlike other participants, my interlocutor was not content to leave our debate a matter of words, or even theorems. He proposed a bake-off: Each of us would build an expert system for the same problem using our preferred methods. We could then compare our approaches and their performance in a real example. He invited me to spend a week with him that summer at his university. He identified a task, the diagnosis and treatment of root disorders in apple trees, and a willing expert, Dr. Daniel R. Cooley, a plant pathologist with ten years’ experience in the domain. US orchardists sold over a billion dollars’ worth of apples per year in the 1980s, and twice that today. Untreated root disorders can ruin an apple orchard, and its owner.

Treating ailing apple trees

We took turns interviewing Dr. Cooley to encode his domain knowledge. He explained three key afflictions of apple trees: extreme cold during the winter, waterlogged soil, and phytophthora, a fungus-like pathogen. (A related species of phytophthora causes Sudden Oak Death, which plagues oak trees in several parts of the US, including my literal backyard in the Santa Cruz Mountains.) We each interviewed the expert in our own style of “knowledge engineering”. My colleague encoded the expert’s knowledge as a set of rules. I encoded his knowledge as an influence diagram. Both our models represented the relationships among the causes, symptoms, tests, and treatments for these disorders. This diagram shows some key elements of our two models:

[Figure: Comparison of the rule-based expert system and the influence diagram for apple tree treatment]

Most of the nodes are common to both systems. The gray dashed arrows depict the rules that my colleague created using the then-popular KEE software from IntelliCorp, for example from the observation Phytophthora test positive to the disorder Phytophthora infection:

If {Phytophthora test} is {positive} then {Phytophthora infection} is strongly supported.

The solid black arrows show causal relationships encoded as conditional probability distributions. For example, the probability of Phytophthora test positive is given conditional on the disorder Phytophthora infection being present, and conditional on it being absent, i.e., the true-positive and false-positive rates:

Pr[Phytophthora test positive | Phytophthora infection] = 90%

Pr[Phytophthora test positive | No Phytophthora infection] = 5%

You can see that the dashed gray arrow depicting the rule goes in the opposite direction to the solid black arrow depicting the conditional probabilities. The influence diagram system applies Bayesian inference to “reverse the arrows” and estimate the probability of Phytophthora infection given the observed test result and all the other observed variables that relate to it.
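To make “reversing the arrows” concrete, here is a minimal Python sketch of Bayes' rule applied to the test result alone. The 90% and 5% rates come from the text above; the prior probabilities of infection are purely illustrative assumptions, since the article does not state the expert's prior, and the function name `posterior_infection` is hypothetical. The real model also conditioned on the other observed variables, which this sketch ignores.

```python
def posterior_infection(prior, true_positive_rate=0.90, false_positive_rate=0.05):
    """Return Pr[infection | test positive] using Bayes' rule.

    prior is Pr[infection] before seeing the test (an assumed value here).
    The two rates default to the figures quoted in the article.
    """
    # Total probability of a positive test, summed over infected and uninfected trees.
    p_positive = true_positive_rate * prior + false_positive_rate * (1.0 - prior)
    return true_positive_rate * prior / p_positive


# Illustrative priors only; none of these comes from the expert.
for prior in (0.05, 0.20, 0.50):
    posterior = posterior_infection(prior)
    print(f"prior = {prior:.2f}  ->  Pr[infection | positive test] = {posterior:.3f}")
```

The point of the sketch is that the same positive test supports quite different conclusions depending on the prior, which a fixed “strongly supported” rule cannot express.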

Dr. Cooley told us that the standard treatment for trees infected by phytophthora is to apply a fungicide, which my colleague encoded as this rule:

If {Phytophthora infection} is {positive} then {Treat with fungicide} is strongly supported.

I built my model as an influence diagram in Demos (the predecessor of Analytica). The influence diagram has more nodes than the rule-based system: it includes the five nodes on the bottom right of the figure above, which had no counterpart in the rule-based system. These nodes calculate the objective, which is to minimize Total cost. Total cost depends on the Level of tree damage (which depends on the two disorders) and on the Fungicide treatment cost or Replacement cost, according to the decision to treat or replace the tree. The Total cost also includes the effect of permanent root damage in reducing the Income from a healthy tree. This decision analysis approach recommends Treat with fungicide only if that reduces expected Total cost relative to replacing the tree.

Testing and refining the model

The next phase was to test and debug the two systems. This provided an interesting contrast. We applied each system to a set of example cases, each with a set of observations and the expert’s corresponding recommendation. When the rule-based system suggested a treatment different from the expert’s recommendation, my colleague modified or added rules until it did agree with the expert. For the rule-based system, the expert’s judgment was the gold standard.

Similarly, when the influence diagram model disagreed with the expert (for example, in many cases it recommended tree replacement even when phytophthora was strongly suspected), I carefully checked my model with him to make sure that the influences, probabilities, and costs accurately reflected his considered knowledge and judgment. He confirmed that they did. However, as a decision analyst familiar with the extensive psychological research on the fallibility of human judgment under uncertainty, I did not assume that his inferences and recommendations were necessarily valid. So I then explained to him why the model recommended against the fungicide: its effectiveness is limited because, based on what he had told me, there is a low probability that the tree is both curable (i.e., it actually has an infection) and not already so damaged that it will never provide an economic yield of apples. So it is more cost-effective to just replace the tree with a new seedling. After careful reflection he found this explanation convincing; he added that it should be of great practical value to orchardists, who should consider revising their standard protocol for treatment.
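The logic of that explanation can be sketched as a toy expected-cost comparison in Python. This is not the original Demos model: the cost structure is a deliberate simplification, every number is a made-up illustration, and the names `expected_costs` and `cost_if_treatment_fails` are hypothetical. It only shows how a low joint probability of “infected and still salvageable” can tip the decision toward replacement even when infection is strongly suspected.

```python
def expected_costs(p_infection, p_salvageable, fungicide_cost,
                   replacement_cost, cost_if_treatment_fails):
    """Compare the expected cost of treating with fungicide against replacing now.

    Simplifying assumption: fungicide only helps if the tree is infected AND
    not already permanently damaged, and those two events are independent.
    If treatment fails, the tree must still be replaced, and the orchardist
    bears an extra cost (for example, a lost season of income).
    """
    p_treatment_works = p_infection * p_salvageable
    treat = fungicide_cost + (1.0 - p_treatment_works) * (
        replacement_cost + cost_if_treatment_fails
    )
    replace = replacement_cost
    return {"treat": treat, "replace": replace}


# Hypothetical inputs: infection strongly suspected, but the tree is
# probably already too damaged to recover. Costs are in arbitrary units.
costs = expected_costs(
    p_infection=0.8,
    p_salvageable=0.3,
    fungicide_cost=20.0,
    replacement_cost=50.0,
    cost_if_treatment_fails=30.0,
)
best_option = min(costs, key=costs.get)
print(costs, "-> lowest expected cost:", best_option)
```

With these assumed inputs, replacement has the lower expected cost. Raising `p_salvageable` toward 1 reverses the recommendation, which is the kind of sensitivity check the expert and I could discuss once the reasoning was explicit.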

The value of ridiculous results

When the results of a model appear ridiculous – in conflict with our commonsense intuitions – there are two possible reasons: Most often there’s a bug in the model. It doesn't properly reflect the available data, the experts’ knowledge, or the modelers’ intent.  In that case, we aim to diagnose the bug and fix it.  

But sometimes, as in this case of ailing apple trees, the problem is more interesting. As we scrutinize the model to see how it generates this unexpected result, we come to realize that the model is actually correct. The bug was in our original intuition. We can now revise and improve our intuition to reflect this deeper understanding. It’s time to stop trying to fix sick apple trees with fungicide.  

And then it’s time to celebrate! We learned something new and interesting from the modeling process. Indeed, improving our intuition and getting new insights is often the main reason to build a model in the first place. If a model always behaved as we expected, what would be the point of building one?


The Association for Uncertainty in Artificial Intelligence continues to flourish and held its 34th Conference in Monterey, California in August 2018. It has largely transcended its origins in the “uncertainty wars” among advocates of alternative representations. Most contributors now take for granted the use of probabilistic Bayesian networks, as do modern standard AI textbooks, such as Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig. With the remarkable success of deep learning in the last decade, a key focus is on automated learning of probabilistic models from massive quantities of data, rather than hand-crafted knowledge engineering of heuristic rules from domain experts. However, we decision analysts who try to help people make complex decisions with long-term implications cannot rely just on historical data. We must still rely in part on expert judgment. We continue to work with human experts and decision makers using influence diagrams and probabilities to represent their knowledge, decision options, and objectives, often in combination with data-derived models. Combining statistical and machine learning methods applied to large amounts of historical data with decision analysis methods that model client objectives and expert judgments about future uncertainties provides practical value far beyond what either approach can offer alone.


Max Henrion

Max Henrion has 25 years of experience as a researcher, educator, software designer, consultant, and entrepreneur, specializing in the design and effective use of decision technologies. He is the Founder and CEO of Lumina Decision Systems. He has led teams to create decision-support tools in a wide variety of applications, including energy, environment, R&D management, healthcare, telecommunications, aerospace, security, and consumer choice. He is the lead designer of Lumina’s flagship product line, Analytica, the software about which PC Week said “Everything that’s wrong with the common spreadsheet is fixed in Analytica”.
