
Structuring data for API access and WiNDC.jl

Mitch Phillipson August 01, 2025


How should we be storing and using WiNDC and economic data? We currently use sparse arrays to store sets and parameters. However, this approach means every model is tightly coupled to its specific application; for example, the National module has fewer parameters than the Household module. This makes it difficult to write generic code that can be used across modules. In this post I discuss a new data model that is more flexible and can be used across all modules.

The information in this post is the basis for WiNDC.jl and the WiNDC API. I make explicit reference to column names and parameters in the WiNDC National module, which are subject to change; the concepts in this post will remain the same. This structure is available on GitHub, which is the best place to find the most up-to-date information about the data structure.

Understanding the WiNDC National Data

I am going to use the WiNDC National module as an example to understand our new data model. But first we should understand where the data comes from, the assumptions we make, and how we store the data in GAMS.

All the data for the National module comes from the Supply-Use tables (SUT) provided by the Bureau of Economic Analysis (BEA). The Supply table shows the total supply of goods and services in the economy, while the Use table shows how those goods and services are used in production and consumption.

Here is a figure that shows the structure of the SUT as provided by the BEA:

The SUT is a comprehensive representation of the economy, capturing all transactions between sectors. Because these tables represent both sides of the economy, we get three natural assumptions about the data:

  1. Zero Profit Condition: The total value of inputs must equal the total value of outputs for each sector. This means that the sum of all the rows in the Use table must equal the sum of all the columns in the Supply table for each sector. As an equation, this can be written as $\sum_{c} ID_{c,s} + VA_s + OT_s = \sum_{c} IS_{c,s}$, where $s$ is a fixed sector and the sums are over all commodities $c$.

    Note that this condition does not say that individual firms cannot make profits; surplus is contained in value added. Rather, it says that over the entire economy all payments are accounted for. Or, the economy itself can’t make a profit.

  2. Market Clearance: The total supply of each commodity must equal the total use of that commodity. This means the sum across the columns of the Supply table must equal the sum across the rows of the Use table for each commodity. In other words, all the goods produced must be consumed.

  3. Margin Balance: This is less obvious, but the two columns representing margins, Transport and Trade, must each sum to zero. In WiNDC we decompose these columns (by sign) into two parameters, margin_supply and margin_demand.
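
    Written in the same style as the zero profit condition (my notation, not necessarily the labels WiNDC uses), margin balance says $\sum_{c} M_{c,m} = 0$ for each margin column $m \in \{\text{Trade}, \text{Transport}\}$, where $M_{c,m}$ is the value of commodity $c$ in margin column $m$; splitting these entries by sign is what produces margin_supply and margin_demand.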

From these tables we create sets and parameters. The sets represent the labels on the rows and columns of the SUT (sectors, commodities, value added, etc.) and the parameters represent the values in the SUT, indexed by the sets. For example, intermediate_demand is a parameter indexed by commodities and sectors, and exports is indexed only by commodities.

Above, I wrote the equation for the zero profit condition, but I did not write the market clearance condition. That’s because the market clearance condition is a long equation; it uses every parameter except value added and other tax. If only there were an easy way to group by commodity and sum the values.

Going Long rather than Wide

The standard trick is to make the data long rather than wide. This means we will make a single table with four columns: row, column, parameter and value.

  • The column row contains the row labels (commodities, value added and other tax),
  • The column column contains the column labels (sectors, final demand, etc.),
  • The parameter column contains the parameter names (intermediate_demand, exports, etc.)
  • The value column contains the values from the SUT. By convention, we make values from use positive and values from supply negative.

We refer to the row and column columns as the domain of the data. In general, our table will have $N+2$ columns, where $N$ is the dimension of the data. If our data included multiple years, then we would have a year column in the domain and our table would have 5 columns.
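
To make this concrete, here is a minimal sketch of what a few rows of the data table could look like, written in Julia with DataFrames.jl. The codes, parameter names, and values here are illustrative, not actual WiNDC data.

```julia
using DataFrames

# A tiny, made-up slice of the long data table. Real tables use BEA
# row and column codes; the numbers here are only for illustration.
data = DataFrame(
    row       = ["ag", "mfg", "va", "othtax", "ag", "ag"],
    column    = ["mfg", "mfg", "mfg", "mfg", "F040", "mfg"],
    parameter = ["intermediate_demand", "intermediate_demand",
                 "value_added", "other_tax",
                 "exports", "intermediate_supply"],
    value     = [10.0, 20.0, 12.0, 1.0, 3.0, -18.0],  # supply entries are negative
)
```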

The parameter column is necessary because intermediate demand and intermediate supply are both indexed by commodities and sectors, but they are different parameters.

It might appear that the parameter column duplicates information. For instance, the exports parameter is only indexed by commodities, meaning the labels in column will always agree with the labels in parameter. In the summary-level tables the column label for exports is F040, and in the detailed tables it is F04000. The parameter column will always be exports, which separates the data from the labels: I can filter my large table to view exports without having to worry about which aggregation level I am using.
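
Continuing the toy table above, that filter is a single line (a sketch, not the actual WiNDC.jl API):

```julia
# All export flows, regardless of whether the column label is
# F040 (summary level) or F04000 (detail level).
exports = subset(data, :parameter => ByRow(==("exports")))
```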

Let’s call this long table the data table, as it stores all of our data.

Accounting for Sets

The data table does not contain any information about the sets. To account for this, we create a sets table that contains the following columns:

  • name: The name of the set, preferring the singular version of the name (sector, commodity, etc.)
  • description: A description of the set, which can be used to provide more context about the set.
  • domain: Which column of the data table the set is used in (row, column).

Notice the sets table does not contain the elements of the sets. If it did then we would be duplicating the description and domain information for each element. Instead, we will create a separate elements table that contains the actual elements of the sets. The columns of the elements table will be:

  • name: The name of the element, which is the same as the label in the row and column columns of the data table.
  • description: A description of the element, which can be used to provide more context about the element. For example, F040 means export.
  • set: The name of the set that the element belongs to, which is used to link the element to the set in the sets table.
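
Continuing the toy example, the sets and elements tables might look something like the following sketch; the set names, element codes, and descriptions are illustrative.

```julia
using DataFrames

# The sets and which column of the data table each one labels.
# The real tables also have sets for value added, final demand, etc.
sets = DataFrame(
    name        = ["commodity", "sector"],
    description = ["Goods and services", "Producing sectors"],
    domain      = ["row", "column"],
)

# The elements of each set. An element's name matches the labels used
# in the row and column columns of the data table.
elements = DataFrame(
    name        = ["ag", "mfg", "ag", "mfg"],
    description = ["Agriculture", "Manufacturing", "Agriculture", "Manufacturing"],
    set         = ["commodity", "commodity", "sector", "sector"],
)
```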

Computing Market Clearance

Using these three tables, let’s explore the logic of computing the market clearance condition. The general idea is to filter the row column of data to only commodities, then group by row and sum the values. Let’s break down how we do this with our tables:

  1. Use the sets table to find which column of data contains the commodities; this gives us row.
  2. Filter the elements to find each commodity.
  3. Subset data on row to only include the commodities.
  4. Group data by row and sum the values.
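
With the toy tables sketched above, these steps map almost one-to-one onto DataFrames.jl operations. This is a sketch of the logic rather than the actual WiNDC.jl implementation, and the made-up numbers will not actually sum to zero; the point is the mechanics.

```julia
# 1. Which column of `data` holds the commodities? Here it is "row".
commodity_domain = only(sets[sets.name .== "commodity", :domain])

# 2. The commodity elements themselves.
commodities = elements[elements.set .== "commodity", :name]

# 3. Keep only the rows of `data` indexed by a commodity.
commodity_data = subset(data, Symbol(commodity_domain) => ByRow(in(commodities)))

# 4. Group by the commodity label and sum. Market clearance holds when
#    every net value is zero (supply entries carry a negative sign).
market_clearance = combine(groupby(commodity_data, Symbol(commodity_domain)),
                           :value => sum => :net)
```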

The other balance conditions can be computed in similar ways. For zero profit, you group by sector rather than commodity. For margin balance, we create a set that contains the two margin codes for Transport and Trade.

The parameter column is not used to compute any of the balance conditions, although you could explicitly extract the parameters needed to compute the balance.

Understanding the parameter Column

The parameter column contains the smallest parameter of interest that the data belongs to. Let’s take the three blocks other final demand, personal consumption, and exports. Together, these three blocks make up final demand. If we want to extract final demand, we can filter the parameter column to only these three blocks. However, this requires us to know the names of the blocks and is inflexible if the blocks change. For example, other final demand is not a single column; it includes categories like government spending, investment, and inventory changes. If we decide to break other final demand into three blocks, then we would have to update our code to include the new blocks.

Instead, we create a fourth table called parameters that contains the following columns:

  • name: The name of the parameter; this is where we would have final_demand.
  • subtable: A reference to the parameter column in the data table; this is where we would have other_final_demand, personal_consumption, and exports.

This table may see modification moving forward; for example, we don’t yet have descriptions for parameters. It may become part of the sets and elements tables in the future, or be split into two similar tables. However, for now it is sufficient to understand the data model.

To extract final demand we filter the name column of the parameters table. This will return three rows with other_final_demand, personal_consumption, and exports in the subtable column which we can use to filter the data table. If we end up disaggregating other_final_demand into three separate blocks, then we can simply update the parameters table to include the new blocks without changing any of our code.
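
Here is what that lookup could look like with the toy tables from earlier. The parameter and subtable names follow the post, but the code itself is only a sketch.

```julia
using DataFrames

# The parameters table: final_demand is the union of three subtables.
parameters = DataFrame(
    name     = ["final_demand", "final_demand", "final_demand"],
    subtable = ["other_final_demand", "personal_consumption", "exports"],
)

# Look up which subtables make up final demand, then pull those rows
# from the data table. Disaggregating other_final_demand later only
# changes this table, not the filtering code.
fd_subtables = parameters[parameters.name .== "final_demand", :subtable]
final_demand = subset(data, :parameter => ByRow(in(fd_subtables)))
```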

Future Work

Currently, I have implemented the data model described above and provided a generic calibration routine. There is a lot of work to be done:

  • Determine optimal data structure for the parameters table.
  • Add domain checking to ensure, for example, that intermediate demand is only indexed by commodities and sectors.
  • Add methods to aggregate and disaggregate the data.
  • Profile the code base and optimize the performance of the data model.

I’ll be working on this project in August; expect future posts as I make progress on these tasks.