Validation and verification techniques — do-a-thon



What are your indicators that your model has produced “correct” results? What do you look at first? I suggest we gather experience and create a list, or even code, of easy-to-apply validation and verification techniques for energy system models.


Hi Frauke. A useful discussion to start.

I take it the notion of validation is ensuring that a particular numerical model — comprising code and data — represents the system under consideration sufficiently well to allow the research or policy questions being traversed to be legitimately addressed (the notion of verification seems synonymous to me, perhaps someone can clarify the distinction).

In the energy system domain, numerical models are normally expressed as scenarios and evaluated by difference against a selected baseline case, often but not always some estimate of business as usual. This approach is used because it is generally not possible to foretell with confidence how factors that are necessarily exogenous to the model might develop over time (for instance, technology development and cost reduction, or future natural resource stocks and prices). Technically, a scenario can be treated as an instance of a model framework, with most projects (with the exception of spreadsheet models) having adopted this architecture. Each individual scenario is normally accompanied by a supporting narrative or storyline. Despite claims to the contrary, scenarios are not value-free and in no sense can ever be considered correct.

Let’s start by suggesting what validation is not. It is not ensuring that your model reproduces historical data and dynamics with sufficient accuracy. Given that most energy system models are very thin caricatures of reality, seeking to match past system states and the current trajectory can easily stray into what statisticians call overfitting. Christian and Griffiths (2017:ch 7) provide a good review of overfitting (and for agent-based modelers, the rest of the book is well worth reading).

I make this comment because a recent energy model review process (and I am avoiding names) asked if one’s model had been confirmed against historical data. In my view, the degree of tuning implied by this question can be problematic. That is not to say that comparisons against historical data should not be undertaken, indeed they may prove quite revealing. It is that calibrating a highly simplified model against historical data is unlikely to improve model fidelity — indeed, the opposite may well occur.

Instead, it is doubtless better if the model possesses what might be termed intrinsic legitimacy (for want of a better phrase), meaning that both the system elements and the architecture of the model are, in their own right, representative and defensible approximations of reality, even if the model does not reproduce current events particularly well. Intrinsic legitimacy thus requires good specifications and an informed debate about how the various model elements, as proxies for actual components, might best be represented numerically, at what level of detail and resolution, and how they should be parameterized to create scenarios. Model elements can extend to decision-takers under a hybrid agent-based modeling (ABM) paradigm.

There are a good number of system metrics that can be calculated from both the input and output data to check for model integrity. These metrics are often intensities and can usually be assessed instinctively for plausibility — examples include aggregate conversion efficiencies and weighted-average production costs.
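
As a minimal sketch of the two example intensities just mentioned, the following computes an aggregate conversion efficiency and an output-weighted average production cost. The `Plant` record, its field names, and the numbers are hypothetical and serve only to illustrate the calculation.

```python
from dataclasses import dataclass


@dataclass
class Plant:
    """Hypothetical per-plant annual totals taken from model input/output data."""
    name: str
    fuel_in_mwh: float    # annual fuel energy input
    power_out_mwh: float  # annual electricity output
    cost_eur: float       # annual production cost


def aggregate_efficiency(plants):
    """Total output divided by total input across all plants."""
    total_in = sum(p.fuel_in_mwh for p in plants)
    total_out = sum(p.power_out_mwh for p in plants)
    return total_out / total_in


def weighted_average_cost(plants):
    """Output-weighted average production cost in EUR/MWh."""
    total_out = sum(p.power_out_mwh for p in plants)
    return sum(p.cost_eur for p in plants) / total_out


plants = [
    Plant("gas_ccgt", 1000.0, 550.0, 33000.0),
    Plant("coal", 2000.0, 800.0, 32000.0),
]
eff = aggregate_efficiency(plants)
cost = weighted_average_cost(plants)

# A value outside physical bounds signals a data or modelling error.
assert 0.0 < eff < 1.0, "aggregate efficiency outside physical bounds"
print(f"aggregate efficiency: {eff:.3f}, average cost: {cost:.2f} EUR/MWh")
```

Because both numbers are intensities, they can be compared directly against published statistics or simple engineering intuition.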

I am especially interested in the role of network effects (discussed by Outhred and Kaye 1996 and also described here) in energy systems, their significance, and how best to capture these in models. I have long believed that an analog of Nyquist–Shannon sampling (which sets a lower limit on the frequency of discrete sampling for a predefined fidelity) applies such that energy system models should adequately reproduce the influence of abrupt constraints in the system — with one example being the upper bound capacities of engineering assets. One way of exploring the nature of network effects would be to rerun a model with some appropriate level of noise in the input data — while noting that this exercise is distinct and different from the usual motivation for performing sensitivity analysis, more often applied to smooth rather than lumpy systems (these are descriptive rather than technical terms) in order to investigate characterization uncertainties.
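
One possible reading of the noise experiment suggested above, sketched on a deliberately tiny merit-order system: the same relative noise on demand leaves the marginal price untouched when the system sits far from a capacity limit, but makes it jump when demand sits close to one. The unit list, the function names, and the noise level are all assumptions made for illustration and are not taken from any particular model.

```python
import random
import statistics

# Hypothetical toy system in merit order:
# (capacity in MW, marginal cost in EUR/MWh)
UNITS = [(100.0, 10.0), (50.0, 40.0), (50.0, 120.0)]


def marginal_price(demand_mw, units=UNITS):
    """Marginal cost of the last unit needed to meet demand."""
    remaining = demand_mw
    for capacity, cost in units:
        if remaining <= capacity:
            return cost
        remaining -= capacity
    raise ValueError("demand exceeds total capacity")


def rerun_with_noise(base_demand_mw, rel_sigma, runs, seed=0):
    """Rerun the toy model with Gaussian relative noise on demand."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    prices = []
    for _ in range(runs):
        noisy = base_demand_mw * (1.0 + rng.gauss(0.0, rel_sigma))
        prices.append(marginal_price(noisy))
    return prices


far = rerun_with_noise(60.0, 0.02, 200)   # well below the 100 MW limit
near = rerun_with_noise(99.0, 0.02, 200)  # just below the 100 MW limit

print("spread far from the constraint: ", statistics.pstdev(far))
print("spread near the constraint:     ", statistics.pstdev(near))
```

The point of the comparison is the difference in spread between the two cases, not the absolute numbers.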

In the background is also the idea that the model code was implemented as intended, in other words, free from bugs. Known as code correctness, this requirement stresses the need for software engineering practices: a stated formal model, simplified test cases, automated unit testing (for object-oriented languages), evaluation of output metrics, runtime integrity tests and assertions, fuzz testing, execution logging, commit reviews, code reviews more generally, and the use of coding conventions (Pfenninger et al 2017:§3.2 covers some of these measures in detail). (The Wikipedia page also indicates that software engineering is not without critics, although creativity and software engineering should nonetheless be able to coexist.)
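
To make the idea of runtime integrity tests and assertions concrete, here is a small sketch of two checks that could run after every solve: an energy balance check and a capacity bounds check. The function names, dictionary layout, and tolerance are assumptions for illustration only.

```python
import math


def check_energy_balance(supply_mwh, use_mwh, rel_tol=1e-6):
    """Raise if total supply and total use differ by more than rel_tol."""
    supply = sum(supply_mwh.values())
    use = sum(use_mwh.values())
    if not math.isclose(supply, use, rel_tol=rel_tol):
        raise AssertionError(
            f"energy balance violated: supply={supply} MWh, use={use} MWh"
        )


def check_capacity_bounds(dispatch_mw, capacity_mw):
    """Raise if any unit is dispatched below zero or above its capacity."""
    for name, value in dispatch_mw.items():
        if not 0.0 <= value <= capacity_mw[name]:
            raise AssertionError(
                f"{name}: dispatch {value} MW outside [0, {capacity_mw[name]}] MW"
            )


# Hypothetical results for one period
supply = {"generation": 950.0, "imports": 80.0}
use = {"demand": 1000.0, "exports": 20.0, "losses": 10.0}

check_energy_balance(supply, use)
check_capacity_bounds(
    {"gas": 300.0, "wind": 120.0},
    {"gas": 400.0, "wind": 150.0},
)
print("integrity checks passed")
```

Checks of this kind are cheap enough to leave switched on in production runs, so that a violated invariant stops the run instead of silently propagating into the results.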

In a similar vein, data accuracy is material. The same information is normally used both to view the system and to create a set of scenarios, in which case the provenance of the information is crucial, as is an open debate about its semantics and accuracy. Ideally, an explicit domain ontology should also be available and agreed.

There is no substitute for having model frameworks and scenario sets continually used, assessed, fixed, and extended by a wide variety of developers, users, and analysts. One reason why the open source development paradigm is so important for energy policy models is that open development can excel here in ways that closed development cannot remotely match.

Finally, a caution about the over-interpretation of model results (and perhaps it is findings that are the target of verification?). Although Bruckner (2016) deals with integrated assessment models, his comments about the limits to evaluation are equally valid for energy system models. And because both types of model are often predicated on rational choice and least cost in some restricted sense, the resulting system trajectories offer neither predictions nor requirements.

Some thoughts! With best wishes, Robbie.


Bruckner, Thomas (January 2016). “Decarbonizing the global energy system: an updated summary of the IPCC report on mitigating climate change”. Energy Technology. 4 (1): 19–30. ISSN 2194-4296. doi:10.1002/ente.201500387. Paywalled (email me).

Christian, Brian and Tom Griffiths (6 April 2017). “Chapter 7: Overfitting”. In Algorithms to live by: the computer science of human decisions. London, United Kingdom: William Collins. pp 149–168. ISBN 978-0-00-754799-9.

Outhred, Hugh R and R John Kaye (1996). Electricity transmission pricing and technology. In Michael A Einhorn and Riaz Siddiqi. Electricity transmission pricing and technology. Boston, Massachusetts, USA: Kluwer. ISBN 978-94-010-7304-2. doi:10.1007/978-94-009-1804-7.

Pfenninger, Stefan, Lion Hirth, Ingmar Schlecht, Eva Schmid, Frauke Wiese, Tom Brown, Chris Davis, Matthew Gidden, Heidi Heinrichs, Clara Heuberger, Simon Hilpert, Uwe Krien, Carsten Matke, Arjuna Nebel, Robbie Morrison, Berit Müller, Guido Pleßmann, Matthias Reeg, Jörn C Richstein, Abhishek Shivakumar, Iain Staffell, Tim Tröndle, and Clemens Wingenbach (2017). “Opening the black box of energy modelling: strategies and lessons learned”. Energy Strategy Reviews. 19: 63–71. ISSN 2211-467X. doi:10.1016/j.esr.2017.12.002.


In our experience, key metrics to evaluate performance of dispatch and price forecast models are:

Prices: in [€/MWh]

  • difference between annual average historical vs. modelled price
  • standard deviation of the hourly difference between historical vs. modelled price
  • maximum (or other quantiles) of the hourly difference between historical vs. modelled price

Price spreads: in [€/MWh]

  • difference between annual average historical vs. modelled price spread (for each pair of countries)
  • standard deviation of the hourly difference between historical vs. modelled price spread
  • maximum (or other quantiles) of the hourly difference between historical vs. modelled price spread

Production: (for dispatchable plants) in [MWh]

  • difference between the historical vs. the modelled annual generation output (for dispatchable plants)
  • standard deviation of the hourly difference between historical vs. modelled generation output
  • maximum (or other quantiles) of the hourly difference between historical vs. modelled generation output

Cross-zonal flows: in [MWh]

  • difference between the historical vs. the modelled annual import (or export) for each border
  • standard deviation of the hourly difference between historical vs. modelled import (or export) for each border
  • maximum (or other quantiles) of the hourly difference between import (or export) for each border
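
The three metrics repeated in each group above (difference of averages, standard deviation of the hourly difference, and maximum or another quantile of the hourly difference) can be sketched once and applied to prices, spreads, generation, or flows alike. The function name, the nearest-rank quantile rule, and the sample values below are assumptions for illustration.

```python
import math
import statistics


def error_metrics(historical, modelled, quantile=0.95):
    """Compare two equally long hourly series (same unit, same time index)."""
    if len(historical) != len(modelled):
        raise ValueError("series must have the same length")
    diff = [m - h for h, m in zip(historical, modelled)]
    abs_diff = sorted(abs(d) for d in diff)
    # Nearest-rank quantile of the absolute hourly difference
    rank = max(math.ceil(quantile * len(abs_diff)) - 1, 0)
    return {
        "difference_of_averages": statistics.fmean(modelled)
        - statistics.fmean(historical),
        "std_of_hourly_difference": statistics.pstdev(diff),
        "max_abs_hourly_difference": abs_diff[-1],
        "quantile_abs_hourly_difference": abs_diff[rank],
    }


# Hypothetical four-hour price series in EUR/MWh
historical = [30.0, 40.0, 50.0, 60.0]
modelled = [32.0, 38.0, 55.0, 59.0]

metrics = error_metrics(historical, modelled)
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```

For price spreads, the same function can be applied to the hourly spread series of each pair of countries, and for cross-zonal flows to the import or export series of each border.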


Outcomes from the group using more detailed technology models e.g. for municipal district heating modelling:

  • Plausibility checks
    • Duration curves depict operational frequency, (part-load) behaviour, ranking among technologies, …
    • Heat and power production compared to overall demand captures temporal dimension
  • Validation
    • Checking plant behaviour in linearized models against non-linear modelling or real results e.g. in PQ-diagrams indicates model quality
    • Validation against real data is hard and questionable due to random factors and the human dimension on the demand and production sides, e.g. through the groups involved on the plant operator side

General leading question towards model comparison and “general” validation:

At which levels can different model types be compared? E.g. technologies/regions can always be compared…


Long-term investment and operation energy system models.
First ideas for a good-practice list of what to consider for validation:

  • plan time for it following the model runs
  • check model results with results from others (publications)
  • sensitivity analysis on the most influential input parameters
  • sensitivity analysis on the model formulation by comparing it to other model results
  • Show the results to others earlier than you think you should…
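
A sensitivity analysis on the most influential input parameters can start with a simple one-at-a-time sweep. The sketch below perturbs each input by ±10 % and ranks the inputs by the largest relative change in the output. The toy cost function stands in for a full model run, and its name, arguments, and base values are purely illustrative assumptions.

```python
def toy_lcoe(capex_eur_per_kw, fixed_opex_eur_per_kw_a,
             full_load_hours, annuity_factor):
    """Simplified levelized cost in EUR/MWh (no fuel or variable costs)."""
    annual_cost = capex_eur_per_kw * annuity_factor + fixed_opex_eur_per_kw_a
    return annual_cost / full_load_hours * 1000.0


def one_at_a_time(model, base, rel_step=0.1):
    """Largest relative output change when each input moves by ±rel_step."""
    base_out = model(**base)
    effects = {}
    for name, value in base.items():
        changes = []
        for factor in (1.0 - rel_step, 1.0 + rel_step):
            perturbed = dict(base, **{name: value * factor})
            changes.append(model(**perturbed) / base_out - 1.0)
        effects[name] = max(abs(c) for c in changes)
    return effects


# Hypothetical base case
BASE = {
    "capex_eur_per_kw": 1000.0,
    "fixed_opex_eur_per_kw_a": 20.0,
    "full_load_hours": 2000.0,
    "annuity_factor": 0.08,
}

effects = one_at_a_time(toy_lcoe, BASE)
ranking = sorted(effects, key=effects.get, reverse=True)
for name in ranking:
    print(f"{name}: {effects[name]:.1%}")
```

A one-at-a-time sweep ignores interactions between parameters, so it is best treated as a first screening step before more thorough methods.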

Input data and assumptions:

  • get feedback for scenarios from local experts
  • decisive parameters are (list to be continued):
    • Full load hours of wind and solar
    • LCOE / CAPEX and OPEX of all technologies

Output Data (Decisive output parameters…list to be continued):

  • Dual Variables of investments should be similar to the assumed investment costs

Special question: Does the model have to fit completely for historic years?

  • validation with historic results / statistical data yes, but not convincing for 2050
  • calibration does not make complete sense, except for the start year using statistical data
  • simplified energy flow diagram should be similar
  • stick to the fact: we are not predicting the future; it is more an “if this, then that” analysis
  • Grid model: Here validation with historical data does make sense


Some ideas for validation of LCA or sustainability databases:

  • OpenStreetMap (e.g. OpenGridMap).
  • Enipedia (trying to move beyond energy into industry)
  • Remote sensing (e.g. aggregate land use and land use change)
  • Remote sensing (atmospheric emissions, e.g. methane, GHG)
  • Comparison with top-down (IO) and bottom-up (LCA)
  • Passive crowd-sourced data (e.g. senseBox, bus sensors)
  • Prioritized crowd-sourced data (check up individually on particular hot spots)
  • National GHG inventories
  • Add up total health burdens or other total indicators (e.g. global species extinction rate), compare with total predicted values from health meta-studies. Some unpublished work on this from Bo Weidema.

Data quality indicators:

  • Mass balance (substance-specific) & energetic balances (or at least nothing too impossible)
  • Water, land, monetary, etc. balances
  • Could measures from robust statistics help in finding outliers?
  • Pedigree matrix as a measure of data quality or uncertainty from use of proxy datasets
  • People processing trade data (e.g. for EEMRIO) do a lot of manual checking - possibility of working together to have common algorithms or expectations?
  • Try to build on the techniques used in hunting for the “rogue” CFC manufacturer
  • End- or intermediate-level models can check assumptions before starting to work with data, e.g. the ocelot project defines models as a list of functions, and half of these functions are checking assumptions in data, as opposed to actually doing things.
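
On the question of whether robust statistics can help in finding outliers: one common option is the modified z-score, which uses the median and the median absolute deviation (MAD) in place of the mean and standard deviation, so that a single extreme value cannot mask itself. The sketch below uses the widely quoted threshold of 3.5. The function name and sample data are assumptions for illustration.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Indices of values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0.0:
        # More than half the values are identical; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > threshold
    ]


# Hypothetical dataset entries, e.g. emission factors for similar processes
factors = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]

outliers = mad_outliers(factors)
print("suspect entries:", [(i, factors[i]) for i in outliers])
```

Flagged entries are candidates for manual checking, not automatic deletion, since a genuine but unusual process would be flagged in the same way.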


Validation of operational model:

  • QQ plot (quantile-quantile plots) to spot outliers
  • duration curve: order time-series by value to check peak values
  • start with a simple model, then improve it and compare its performance with the old version
  • compare the performance of your own model with other similar models: Pearson correlation coefficient, RMSD, mean absolute error, etc.
  • annual average values, e.g. coefficient of performance, etc.
  • cross-validation of any curve
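
Several of the items above reduce to a few lines of code. The sketch below shows a duration curve (the time series ordered by value, largest first) together with RMSD, mean absolute error, and the Pearson correlation coefficient for comparing a modelled series with a reference series. Function names and sample values are assumptions for illustration.

```python
import math


def duration_curve(series):
    """Order a time series by value, largest first, to inspect peak values."""
    return sorted(series, reverse=True)


def rmsd(reference, modelled):
    """Root-mean-square deviation between two equally long series."""
    squares = [(m - r) ** 2 for r, m in zip(reference, modelled)]
    return math.sqrt(sum(squares) / len(squares))


def mean_absolute_error(reference, modelled):
    """Mean of the absolute differences between two equally long series."""
    return sum(abs(m - r) for r, m in zip(reference, modelled)) / len(reference)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical reference and modelled series
reference = [1.0, 2.0, 3.0, 4.0]
modelled = [1.5, 2.5, 2.5, 4.5]

print("duration curve:", duration_curve(modelled))
print("RMSD:", rmsd(reference, modelled))
print("MAE:", mean_absolute_error(reference, modelled))
print("Pearson r:", round(pearson(reference, modelled), 3))
```

Plotting the duration curves of the reference and modelled series on the same axes gives a quick visual check of peak values and part-load behaviour that the scalar metrics alone can hide.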


An interesting article that applies ex post modeling to test whether cost optimization is a realistic approximation of the evolution of energy systems:


Hello @ksyranid Just to repeat your reference (Trutnevyte 2016) in full. And to mention work that @ClaraHeuberger presented at the openmod Munich workshop (12–13 October 2017) and which is now published (Heuberger et al 2018). This work uses rolling horizon optimization to capture and study myopic planning and limited foresight. The paper itself talks about “unicorn technologies”. Both articles sound a warning about the practice of tuning policy models too closely to the recent past.


Heuberger, Clara F, Iain Staffell, Nilay Shah, and Niall Mac Dowell (21 May 2018). “Impact of myopic decision-making and disruptive events in power systems planning”. Nature Energy. ISSN 2058-7546. doi:10.1038/s41560-018-0159-3. Paywalled.

Trutnevyte, Evelina (1 July 2016). “Does cost optimization approximate the real-world energy transition?”. Energy. 106: 182–193. ISSN 0360-5442. doi:10.1016/ Paywalled.