The current WiNDC data processing pipeline
July 11, 2025
This week I would like to discuss how WiNDC takes our raw data and transforms it into the calibrated datasets used by our models. This is a multi-step process where the data passes through both Python and GAMS.
Python Data Processing
The purpose of the Python code is to tidy the data and export it in a format readable by GAMS. I'm going to get fairly specific; here is a link to the windc_datastream repository if you'd like more reference. This is not the cleanest package, and I've debated doing a pass to refactor it, but I intend to replace it with Julia code.
The parent class for all datasets is Parser, located in
windc_data/dataparser.py. This class has all the methods
for reading the data, cleaning it, and exporting it to a GDX file.
However, since every dataset is different, some methods must be
overridden in the child classes. The methods to be overridden are
clearly documented in the class docstring.
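As a rough illustration, a child class ends up looking something like the sketch below. The hook names read and clean are hypothetical stand-ins for the actual methods documented in dataparser.py, and the import path may differ.

```python
# Hypothetical sketch of a dataset-specific subclass; the real hook names
# are documented in windc_data/dataparser.py and may differ from these.
import pandas as pd
from windc_data.dataparser import Parser  # actual import path may vary


class BeaUseExample(Parser):
    """Illustrative child class for the BEA Use table."""

    def read(self):
        # Load each raw source into the self.data dictionary, keyed by variable.
        self.data = {"use": pd.DataFrame({"IOCode": ["111CA"], "2022": [1.0]})}

    def clean(self):
        # Dataset-specific tidying: wide -> long, explicit column types, etc.
        self.data["use"] = self.data["use"].melt(
            id_vars="IOCode", var_name="year", value_name="value"
        )
```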
The data goes through a series of transformations, which are documented in the class docstrings. The main steps are:
- Reading the Data: Load the data into a dictionary self.data with keys for each variable. Eventually these will be merged into a single dataframe.
- Cleaning the Data: The most important step of the process, this makes the data follow a predictable format, best described as “tidy”. For example, in bea_use.py, the file responsible for loading the BEA Use table, the data is transformed from a wide format to a long format and the columns are given explicit types (see the sketch after this list).
- Automated Cleaning: The next four steps are the same for every dataset, so they are implemented in the parent class: strip whitespace, concatenate the dictionary data, remove any zeros, and regularize label names. The last step ensures, for example, that “OHIO” becomes “Ohio”, which will get replaced by “oh” later in the process.
- Build Notations: Every dataset stores the state names differently. This step builds a dictionary to link the state names to their GAMS notation. For example, “Ohio” becomes “oh”, “Texas” becomes “tx”, etc. The same mappings are built for units, years, and sector identifiers.
- Automated Replacements: The next three steps apply the regularizations we just built to the data.
- Final Transformations: The last step is a catch-all
for any final transformations that need to be applied to the data. For
example, in
bea_use.py, we rename the columns to match the GAMS notation.
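To make the cleaning and notation steps concrete, here is a small, self-contained pandas sketch of the kind of transformation involved. The column names, values, and mapping are illustrative, not the actual ones used in bea_use.py.

```python
import pandas as pd

# A tiny, made-up "wide" table in the spirit of the raw state-level data.
wide = pd.DataFrame(
    {"State": ["OHIO", "Texas"], "2022": [1.0, 2.0], "2023": [3.0, 4.0]}
)

# Wide -> long ("tidy") format with explicit column types.
tidy = wide.melt(id_vars="State", var_name="year", value_name="value")
tidy["year"] = tidy["year"].astype(int)
tidy["value"] = tidy["value"].astype(float)

# Regularize labels ("OHIO" -> "Ohio"), then map to GAMS notation ("Ohio" -> "oh").
tidy["State"] = tidy["State"].str.strip().str.title()
notation = {"Ohio": "oh", "Texas": "tx"}  # built per dataset in practice
tidy["State"] = tidy["State"].map(notation)
```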
Each dataset additionally has a _build_gdx_dict method
that returns a dictionary of dictionaries that contain information on
the data to be exported to GDX. This includes both sets and parameters. For
example, bea_use.py creates two sets, one for the sectors
and one for the commodities, and one parameter for the data itself.
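I won't reproduce the real schema here, but conceptually the returned structure looks something like the following. The keys and field names are illustrative, as are self.sectors and self.commodities.

```python
def _build_gdx_dict(self):
    # Illustrative structure only: the real keys and fields are defined by
    # the Parser class; self.sectors / self.commodities stand in for
    # whatever the subclass collected while cleaning.
    return {
        "s": {"type": "set", "description": "sectors", "records": self.sectors},
        "c": {"type": "set", "description": "commodities", "records": self.commodities},
        "use": {
            "type": "parameter",
            "domain": ["s", "c"],
            "description": "BEA Use table",
            "records": self.data,
        },
    }
```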
Finally, we export the data to a GDX using the
gamsapi[transfer] package; here is a link to the package on
PyPI. This package is a wrapper around the GAMS API that allows us
to export data to GDX files. The Parser class has a method
to_gdx that takes the GDX file name as an argument and
exports the data to the file.
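Under the hood this relies on gams.transfer. The following is a minimal sketch of how that package is typically used, not the actual to_gdx implementation; the set and parameter names and records are made up.

```python
import pandas as pd
import gams.transfer as gt

# Minimal sketch of a GDX export with gams.transfer; symbols are illustrative.
records = pd.DataFrame(
    {"s": ["agr", "mfg"], "c": ["food", "steel"], "value": [1.0, 2.0]}
)

m = gt.Container()
s = gt.Set(m, "s", records=records["s"].unique().tolist(), description="sectors")
c = gt.Set(m, "c", records=records["c"].unique().tolist(), description="commodities")
gt.Parameter(m, "use", domain=[s, c], records=records, description="Use table")
m.write("use.gdx")
```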
In this process the actual data is hardly modified; at most, some values get converted from millions to billions. However, for an individual dataset it can be hard to track exactly what has changed or to follow the process.
GAMS Data Processing
This is where things get messier. The GAMS code is responsible for
taking the data from the GDX files and transforming it into the
calibrated datasets used by the models. To make things simpler, I am
going to briefly describe how the national-level data is processed,
focusing on the core directory in the windc_build
repository.
Each module in windc_build has a build.gms
file that acts as the main file for that module. This file calls each of
the other files in the directory with the appropriate arguments. It also
ensures the directory structure is in place for the output files.
The first file we call is partition_bea.gms, which takes
the BEA tables and partitions them into the appropriate subtables. This
file is responsible for creating almost all the parameters
necessary for the national model. Let’s look at id0, or
intermediate demand. This gets extracted from the full table using
the sectors and commodities defined by the data parser. It is possible
that id0 contains negative values, which isn't ideal. Any
negative flows get moved to supply as positive values. However, a similar
process for supply isn’t carried out until calibrate.gms. I
have debated changing this behavior, but I wanted to ensure data
consistency and there are only a few negative flows in the data.
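Conceptually, the negative-flow handling does something like the pandas sketch below. This is not the actual GAMS code: the real script works on GAMS parameters, and "supply" here just stands in for the corresponding supply-side parameter.

```python
import pandas as pd

# Conceptual pandas version of the negative-flow fix in partition_bea.gms.
id0 = pd.DataFrame({"steel": [10.0, -2.0]}, index=["agr", "mfg"])   # intermediate demand
supply = pd.DataFrame({"steel": [0.0, 5.0]}, index=["agr", "mfg"])  # supply-side counterpart

negatives = id0.clip(upper=0.0)   # keep only the negative flows
supply = supply - negatives       # move them to supply as positive values
id0 = id0.clip(lower=0.0)         # and zero them out in intermediate demand
```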
The final step in partition_bea.gms is to export the
data to a GDX file. This is the standard process for every file:
load data from a GDX, process it, and export it to a GDX file. This is a GAMS
limitation; there are no subroutines, so you must save the data to use it
later. Most of our data files are small and GAMS is very fast, so this isn't
a big issue. But it is still inefficient and adds complexity to the
code.
Next up is calibrate.gms. We require our data to satisfy
zero profit (column sums of Use and Supply must be equal), market
clearance (row sums of Use and Supply must be equal), and balanced
margins. This seems like a fairly straightforward process, but
it gets complex because we check for negative values before calibrating.
Checking is fine; it's good to have error checking. However, it makes the
script much more complex. This is, again, a GAMS limitation: this should
be a function that we could call from other places in the code base.
In particular, va_0 and m_0 get adjusted quite
a bit; they get replaced with an adjusted share. One more variable,
bopdef, is created in calibrate.gms.
This variable is used to define the balance of payments constraints in
the model.
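To make the balance conditions above concrete, here is roughly what zero profit and market clearance mean in terms of the two tables. This is a conceptual pandas check on made-up numbers, not the actual GAMS code.

```python
import pandas as pd

# Toy Use/Supply tables (rows = commodities, columns = sectors); values are made up.
use = pd.DataFrame({"agr": [3.0, 1.0], "mfg": [2.0, 4.0]}, index=["food", "steel"])
supply = pd.DataFrame({"agr": [4.0, 0.0], "mfg": [1.0, 5.0]}, index=["food", "steel"])

# Zero profit: each sector's column sum must match between Use and Supply.
zero_profit_ok = (use.sum(axis=0) - supply.sum(axis=0)).abs().max() < 1e-6

# Market clearance: each commodity's row sum must match between Use and Supply.
market_clearance_ok = (use.sum(axis=1) - supply.sum(axis=1)).abs().max() < 1e-6
```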
Conclusion
The point of this post is to demonstrate the complexity of the data processing in WiNDC. Having two different languages and two different processes makes it hard to track the data. This is an example of the two-language problem, and Julia is a great language for solving it.
As I move towards an API solution for data delivery, it will be essential that I can easily deploy the data to whatever structure will be holding it. An easy-to-maintain code base will be key, and a single language will also make it easier to create automated deployments and tests.
I have made progress in this conversion and will have updates in the future.