Do-a-thon: Towards a common data standard for integrated assessment and energy systems modelling

Proposal for: Do-a-thon
by Daniel Huppmann (IIASA) and Stefan Pfenninger (ETHZ)

Session Title
Towards a common data standard for integrated assessment and energy systems modelling

Session Description

Previous discussions have made two things clear: a common standard, and potentially conversion tools between data formats, would make sense; and multiple efforts to develop such standards and tools are now starting up or already underway.

Jointly hosted by the Horizon 2020 projects SENTINEL and openENTRANCE, the aim of this session is to coordinate ongoing efforts on common (or at least inter-operable) data exchange formats for energy system and integrated assessment (i.e., human-earth-climate systems) models/frameworks.

Background : There are multiple ongoing projects in Europe aiming to develop the technical infrastructure (e.g., online databases) and required data standards (i.e., templates and formats) to facilitate integration and model linkage across different frameworks and tools. Each of these projects includes several (up to a dozen) research teams across Europe, working with different methodologies, focusing on different sectors, and modelling varying spatial and temporal scales. Within each project, the infrastructure and formats should enable efficient collaboration and data exchange while supporting the FAIR principles and open, collaborative science.

Aim : Compare currently used implementations of data exchange formats and determine the scope for harmonization and/or development of conversion tools across these projects.

Definition/scope of “data exchange format” : The discussion should encompass both the technical specifications and the application/implementation aspects, i.e.:

  • Which file type is used?
  • What is the schema structure?
    Example: in a tabular format, what are required/optional columns?
  • What is the required scope?
    For example, are aggregates required to be included in the dataset, or is there an expectation that a user computes aggregates herself?
  • What are the naming conventions (ontology) to describe the data?
    Example: for the spatial dimension, which region identifiers are used?
  • Which metadata fields/tags are mandatory/optional?
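
To make the schema and scope questions above concrete, here is a minimal Python sketch of what checking required/optional columns and computing an aggregate could look like for a tabular format. The column names, the region codes, and the helper functions are assumptions chosen for illustration, not the format of any of the projects mentioned, and the aggregation naively assumes all rows share one unit.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
REQUIRED_COLUMNS = {"model", "scenario", "region", "variable", "unit", "year", "value"}
OPTIONAL_COLUMNS = {"source", "comment"}


def check_columns(header):
    """Return (missing, unknown) column names for a table header."""
    columns = set(header)
    missing = REQUIRED_COLUMNS - columns
    unknown = columns - REQUIRED_COLUMNS - OPTIONAL_COLUMNS
    return sorted(missing), sorted(unknown)


def aggregate_by_region(rows, target_region):
    """Sum values over all regions per (variable, year).

    Assumes every row uses the same unit; a real tool would check this.
    """
    totals = {}
    for row in rows:
        key = (row["variable"], row["year"])
        totals[key] = totals.get(key, 0.0) + float(row["value"])
    return {(target_region, var, year): total for (var, year), total in totals.items()}


sample = """model,scenario,region,variable,unit,year,value
demo,base,AT,Final Energy,EJ/yr,2030,1.0
demo,base,DE,Final Energy,EJ/yr,2030,8.5
"""
reader = csv.DictReader(io.StringIO(sample))
missing, unknown = check_columns(reader.fieldnames)
rows = list(reader)
totals = aggregate_by_region(rows, "AT+DE")
print(missing, unknown, totals)
```

Whether such an aggregate row has to be shipped with the dataset or is left to the user is exactly the scope question raised in the list.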

The intended outcomes are:

  • A brief session summary which can inform further work within the projects working on this topic, including the Horizon 2020 projects openENTRANCE, SENTINEL, and Spine, as well as the OpenEnergyDatabase and Open Power System Data.
  • Establishment of an ongoing discussion forum for further exchange between interested parties working on related projects, similar to the Scientific Working Group on Data Protocols and Management of the IAMC.

Would you like to be responsible for this Session?
Yes

Do you need any special infrastructure for this Session?
A projector and, if possible, sufficient space for up to 30 participants to split into smaller groups of 4-5 people.

Do you have any recommendations for who could be part of this Session?
As many representatives as possible from other ongoing or planned projects involved in developing or operating tools and platforms for sharing data and code related to energy modelling.

@ludwig.huelk would be great if you were to get involved here too!

Thanks for the follow-up on the working group at the EMP-E. Of course I’m in.
Maybe it would make sense to take out some of the suggested topics to be more focused.
It is already planned to have a separate session on the ontology.

This do‑a‑thon looks entirely sensible and the Berlin workshop is certainly a good opportunity to build upon previous work by this community on data and metadata standards (much of it pushed along by the OpenEnergy Platform and OPSD projects mentioned in the original post).

This do‑a‑thon is also timely given that some significant EU‑funded modeling infrastructure projects are either just kicking off (SENTINEL and openENTRANCE) or partly underway (Spine). The openENTRANCE project, as I understand it, is as much about the social dimension of cooperation as it is about the technical dimensions of interoperability. So I think the social context is at least obliquely significant to this discussion, if only to consider which stakeholders should be involved.

The title mentions both “integrated assessment modeling” and “energy system modeling” and those two worlds have been disjoint for way too long. That said, I suspect that providing a common language (or semantics or ontology) that works across both domains is going to be rather challenging.


Some issues, minor and otherwise, that might also be traversed at the do‑a‑thon:

Data model sophistication: The data model underpinning this exercise can range from pedestrian data structures to high‑level abstractions with supporting semantics. Abstractions require implementation, so that choice is essentially progressive. A first question might therefore be: what level of sophistication is appropriate to this initiative at this juncture?

The Spine project offers a central generic abstract data model, lacking semantics, with bi‑directional translators to service both data sources and energy model instances. Another early question would then be how much of the Spine approach can and should be used here and whether it could even be a core component? Spine is based on an EAV/CR or entity-attribute-value/class-relation approach with the actual semantics left to each team to define. So Spine, if you like, has no explicit social dimension. And although Spine does not currently support object composition (to aid compilations of datasets as discussed shortly), it could be thus extended. Adoption of Spine is simply a question on my part and not an implicit suggestion.
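
For readers unfamiliar with the entity-attribute-value idea, a minimal Python sketch follows. It is not Spine's actual schema or API; the entity names, attribute names, and the `to_records` helper are hypothetical and only illustrate how a generic store can hold facts without attaching meaning to them.

```python
from collections import defaultdict

# Each fact is one (entity, attribute, value) triple. The store itself
# attaches no meaning to the attribute names; that is left to each team.
triples = [
    ("plant_a", "class", "power_plant"),
    ("plant_a", "capacity_mw", 400),
    ("plant_a", "fuel", "gas"),
    ("node_1", "class", "node"),
]

# Relations link entities: (relation class, tuple of member entities).
relations = [("plant__node", ("plant_a", "node_1"))]


def to_records(triples):
    """Pivot the triples into one attribute dictionary per entity."""
    records = defaultdict(dict)
    for entity, attribute, value in triples:
        records[entity][attribute] = value
    return dict(records)


records = to_records(triples)
print(records["plant_a"])
```

The pivot step hints at what a translator has to do in each direction: the generic store stays the same, while the mapping to model-specific records carries the semantics.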

Standardized derived metrics: Described as computed aggregates above. Yes please!

Data license tracking: If each dataset (or more specifically, each legally separate work) were accompanied by standardized legal metadata, then users could filter on license compatibility and perhaps also machine‑generate license notices, with listings of contributors, when merging datasets (resulting in a new single work rather than a compilation of several existing works).
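
A rough Python sketch of the filtering and notice generation just described, assuming each dataset carries a license identifier and a contributor list. The dataset names, contributors, and the `accepted` set are made up for illustration; which licenses are actually compatible for a given merge is a legal question this sketch does not answer.

```python
# Hypothetical legal metadata, one entry per legally separate work.
datasets = [
    {"name": "demand", "license": "CC-BY-4.0", "contributors": ["Alice"]},
    {"name": "weather", "license": "CC0-1.0", "contributors": ["Bob", "Alice"]},
    {"name": "prices", "license": "proprietary", "contributors": ["Carol"]},
]


def filter_by_license(datasets, accepted):
    """Keep only datasets whose license identifier is in the accepted set."""
    return [d for d in datasets if d["license"] in accepted]


def license_notice(datasets):
    """Build one notice line per dataset, listing license and contributors."""
    lines = []
    for d in datasets:
        people = ", ".join(d["contributors"])
        lines.append(f"{d['name']}: {d['license']}, contributors: {people}")
    return "\n".join(lines)


# The accepted set is supplied by the user; it is an assumption here.
usable = filter_by_license(datasets, accepted={"CC0-1.0", "CC-BY-4.0"})
notice = license_notice(usable)
print(notice)
```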

UML class diagrams: Also worth thinking about is the adoption of UML class diagrams for depicting the underlying data models and similar concepts. (I am a big fan of UML diagrams.)

Technical matters: Preferred practice for CSV data. Preferred character encoding (clearly either ASCII or UTF‑8). Preferred license for metadata (CC0‑1.0). Support for object composition.
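
On the CSV and encoding point, a small Python sketch of writing and reading a file with the encoding stated explicitly rather than left to the platform default. The file name and the sample rows are placeholders.

```python
import csv
import os
import tempfile

# Sample rows with non-ASCII characters, which plain ASCII could not hold.
rows = [["region", "value"], ["Österreich", "1.5"], ["Côte d'Ivoire", "0.2"]]

path = os.path.join(tempfile.mkdtemp(), "example.csv")

# State the encoding explicitly; newline="" lets the csv module
# control line endings itself.
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back == rows)
```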

Graph objects: Analysts normally think of scalars, timeseries, tabular data, and key‑value pairs as the primitives. But graph objects should perhaps also be supported as primitives — while noting these objects usually manifest as lists of various kinds.
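
As an illustration of a graph manifesting as lists, here is a Python sketch with a node list and an edge list, plus a helper that derives an adjacency mapping. The node names and the `adjacency` function are hypothetical.

```python
# A small directed graph stored as two plain lists.
nodes = ["gen", "bus_a", "bus_b", "load"]
edges = [("gen", "bus_a"), ("bus_a", "bus_b"), ("bus_b", "load")]


def adjacency(nodes, edges):
    """Map each node to the list of nodes it points to."""
    adj = {node: [] for node in nodes}
    for source, target in edges:
        adj[source].append(target)
    return adj


adj = adjacency(nodes, edges)
print(adj)
```

Both lists fit naturally into tabular files, which is one reason graphs tend to show up in this form in exchange formats.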

Workflow: I guess the organizers will not wish to stray into the area of scripted workflow, but that question is probably also obliquely relevant.

Process: Some thought about how best to converge to agreement would be useful.