The current WiNDC data processing pipeline
July 11, 2025
This week I would like to discuss how WiNDC takes our raw data and transforms it into the calibrated datasets used by our models. This is a multi-step process where the data passes through both Python and GAMS.
Python Data Processing
The purpose of the Python code is to tidy the data and export it in a format readable by GAMS. I'm going to get fairly specific; here is a link to the windc_datastream repository if you'd like more reference. This is not the cleanest package, and I've debated doing a pass to refactor it, but I intend to replace it with Julia code.
The parent class for all datasets is Parser, located in
windc_data/dataparser.py. This class has all the methods
for reading the data, cleaning it, and exporting it to a GDX file.
However, since every dataset is different, some methods must be
overridden in the child classes. The methods to be overridden are
clearly documented in the class docstring.
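As a rough illustration, a child class ends up looking something like the sketch below. The hook names read and clean are hypothetical stand-ins for the actual methods documented in dataparser.py, and the import path may differ.

```python
# Hypothetical sketch of a dataset-specific subclass; the real hook names
# are documented in windc_data/dataparser.py and may differ from these.
import pandas as pd
from windc_data.dataparser import Parser  # actual import path may vary


class BeaUseExample(Parser):
    """Illustrative child class for the BEA Use table."""

    def read(self):
        # Load each raw source into the self.data dictionary, keyed by variable.
        self.data = {"use": pd.DataFrame({"IOCode": ["111CA"], "2022": [1.0]})}

    def clean(self):
        # Dataset-specific tidying: wide -> long, explicit column types, etc.
        self.data["use"] = self.data["use"].melt(
            id_vars="IOCode", var_name="year", value_name="value"
        )
```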
The data goes through a series of transformations, which are documented in the class docstrings. The main steps are:
- Reading the Data: Load the data into a dictionary self.data with keys for each variable. Eventually these will be merged into a single dataframe.
- Cleaning the Data: The most important step of the process, this makes the data follow a predictable format, best described as “tidy”. For example, in bea_use.py, the file responsible for loading the BEA Use table, the data is transformed from a wide format to a long format and the columns are given explicit types (see the sketch after this list).
- Automated Cleaning: The next four steps are the same for every dataset, so they are implemented in the parent class: strip whitespace, concatenate the dictionary data, remove any zeros, and regularize label names. The last step ensures, for example, that “OHIO” becomes “Ohio”, which will get replaced by “oh” later in the process.
- Build Notations: Every dataset stores the state names differently. This step builds a dictionary to link the state names to their GAMS notation. For example, “Ohio” becomes “oh”, “Texas” becomes “tx”, etc. The same mappings are built for units, years, and sector identifiers.
- Automated Replacements: The next three steps apply the regularizations we just built to the data.
- Final Transformations: The last step is a catch-all
for any final transformations that need to be applied to the data. For
example, in
bea_use.py, we rename the columns to match the GAMS notation.
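To make the cleaning and notation steps concrete, here is a small, self-contained pandas sketch of the kind of transformation involved. The column names, values, and mapping are illustrative, not the actual ones used in bea_use.py.

```python
import pandas as pd

# A tiny, made-up "wide" table in the spirit of the raw state-level data.
wide = pd.DataFrame(
    {"State": ["OHIO", "Texas"], "2022": [1.0, 2.0], "2023": [3.0, 4.0]}
)

# Wide -> long ("tidy") format with explicit column types.
tidy = wide.melt(id_vars="State", var_name="year", value_name="value")
tidy["year"] = tidy["year"].astype(int)
tidy["value"] = tidy["value"].astype(float)

# Regularize labels ("OHIO" -> "Ohio"), then map to GAMS notation ("Ohio" -> "oh").
tidy["State"] = tidy["State"].str.strip().str.title()
notation = {"Ohio": "oh", "Texas": "tx"}  # built per dataset in practice
tidy["State"] = tidy["State"].map(notation)
```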
Each dataset additionally has a _build_gdx_dict method
that returns a dictionary of dictionaries that contain information on
the data to be exported to GDX. This includes both sets and parameters. For
example, bea_use.py creates two sets, one for the sectors
and one for the commodities, and one parameter for the data itself.
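I won't reproduce the real schema here, but conceptually the returned structure looks something like the following. The keys and field names are illustrative, as are self.sectors and self.commodities.

```python
def _build_gdx_dict(self):
    # Illustrative structure only: the real keys and fields are defined by
    # the Parser class; self.sectors / self.commodities stand in for
    # whatever the subclass collected while cleaning.
    return {
        "s": {"type": "set", "description": "sectors", "records": self.sectors},
        "c": {"type": "set", "description": "commodities", "records": self.commodities},
        "use": {
            "type": "parameter",
            "domain": ["s", "c"],
            "description": "BEA Use table",
            "records": self.data,
        },
    }
```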
Finally, we export the data to a GDX using the
gamsapi[transfer] package; here is a link to the package on
PyPI. This package is a wrapper around the GAMS API that allows us
to export data to GDX files. The Parser class has a method
to_gdx that takes the GDX file name as an argument and
exports the data to the file.
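Under the hood this relies on gams.transfer. The following is a minimal sketch of how that package is typically used, not the actual to_gdx implementation; the set and parameter names and records are made up.

```python
import pandas as pd
import gams.transfer as gt

# Minimal sketch of a GDX export with gams.transfer; symbols are illustrative.
records = pd.DataFrame(
    {"s": ["agr", "mfg"], "c": ["food", "steel"], "value": [1.0, 2.0]}
)

m = gt.Container()
s = gt.Set(m, "s", records=records["s"].unique().tolist(), description="sectors")
c = gt.Set(m, "c", records=records["c"].unique().tolist(), description="commodities")
gt.Parameter(m, "use", domain=[s, c], records=records, description="Use table")
m.write("use.gdx")
```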
In this process the actual data is hardly modified; at most, some values get converted from millions to billions. However, for an individual dataset it can be hard to track exactly what has changed or to follow the process.
GAMS Data Processing
This is where things get messier. The GAMS code is responsible for
taking the data from the GDX files and transforming it into the
calibrated datasets used by the models. To make things simpler, I am
going to briefly describe how the national-level data is processed,
focusing on the core directory in the windc_build
repository.
Each module in windc_build has a build.gms
file that acts as the main file for that module. This file calls each of
the other files in the directory with the appropriate arguments. It also
ensures the directory structure is in place for the output files.
The first file we call is partition_bea.gms, which takes
the BEA tables and partitions them into the appropriate subtables. This
file is responsible for creating almost all the parameters
necessary for the national model. Let’s look at id0, or
intermediate demand. This gets extracted from the full table using
the sectors and commodities defined by the data parser. It is possible
that id0 contains negative values, which isn't ideal. Any
negative flows get moved to supply as positive values. However, a similar
process for supply isn’t carried out until calibrate.gms. I
have debated changing this behavior, but I wanted to ensure data
consistency and there are only a few negative flows in the data.
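Conceptually, the negative-flow handling does something like the pandas sketch below. This is not the actual GAMS code: the real script works on GAMS parameters, and "supply" here just stands in for the corresponding supply-side parameter.

```python
import pandas as pd

# Conceptual pandas version of the negative-flow fix in partition_bea.gms.
id0 = pd.DataFrame({"steel": [10.0, -2.0]}, index=["agr", "mfg"])   # intermediate demand
supply = pd.DataFrame({"steel": [0.0, 5.0]}, index=["agr", "mfg"])  # supply-side counterpart

negatives = id0.clip(upper=0.0)   # keep only the negative flows
supply = supply - negatives       # move them to supply as positive values
id0 = id0.clip(lower=0.0)         # and zero them out in intermediate demand
```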
The final step in partition_bea.gms is to export the
data to a GDX file. This is the standard process for every file:
load data from a GDX, process it, and export it to a GDX file. This is a GAMS
limitation; there are no subroutines, so you must save the data to use it
later. Most of our data files are small and GAMS is very fast, so this isn't
a big issue. But it is still inefficient and adds complexity to the
code.
Next up is calibrate.gms. We require our data to satisfy
zero profit (column sums of Use and Supply must be equal), market
clearance (row sums of Use and Supply must be equal), and balanced
margins. This seems like a fairly straightforward process, but
it gets complex because we check for negative values before calibrating.
Checking is fine; it's good to have error checking. However, it makes the
script much more complex. This is, again, a GAMS limitation: this should
be a function that we could call from other places in the code base.
In particular, va_0 and m_0 get adjusted quite
a bit; they get replaced with an adjusted share. One more variable,
bopdef, is created in calibrate.gms.
This variable is used to define the balance of payments constraints in
the model.
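To make the balance conditions above concrete, here is roughly what zero profit and market clearance mean in terms of the two tables. This is a conceptual pandas check on made-up numbers, not the actual GAMS code.

```python
import pandas as pd

# Toy Use/Supply tables (rows = commodities, columns = sectors); values are made up.
use = pd.DataFrame({"agr": [3.0, 1.0], "mfg": [2.0, 4.0]}, index=["food", "steel"])
supply = pd.DataFrame({"agr": [4.0, 0.0], "mfg": [1.0, 5.0]}, index=["food", "steel"])

# Zero profit: each sector's column sum must match between Use and Supply.
zero_profit_ok = (use.sum(axis=0) - supply.sum(axis=0)).abs().max() < 1e-6

# Market clearance: each commodity's row sum must match between Use and Supply.
market_clearance_ok = (use.sum(axis=1) - supply.sum(axis=1)).abs().max() < 1e-6
```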
Conclusion
The point of this post is to demonstrate the complexity of the data processing in WiNDC. Having two different languages and two different processes makes it hard to track the data. This is an example of the two-language problem, and Julia is a great language for solving it.
As I move towards an API solution for data delivery, it will be essential that I can easily deploy the data to whatever structure will be holding it. An easy-to-maintain code base will be key, and a single language will also make it easier to create automated deployments and tests.
I have made progress in this conversion and will have updates in the future.