An Overview of WiNDC Data
January 08, 2026
This week I released early versions of the three main WiNDC data packages in Julia: WiNDC National, WiNDC Regional, and WiNDC Household.
These repositories all include documentation showing how to get started, with a simple example of loading and working with the data. In this post I will provide a brief overview of each data package and its contents.
WiNDC Container
WiNDCContainer.jl is the engine for each of the WiNDC data packages. At its core, WiNDCContainer links three DataFrames: one for the data values, one for the sets, and one for the elements of the sets. This allows for easy manipulation of the data using standard Julia tools. Each of the WiNDC data packages builds on this framework to provide specific data for different levels of aggregation.
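To make this structure concrete, here is a minimal sketch of three linked tables. The column names and contents are illustrative only, not the actual WiNDCContainer.jl schema:

```julia
using DataFrames

# Sets: one row per set name
sets = DataFrame(set = ["commodity", "region"],
                 description = ["Goods and services", "US regions"])

# Elements: one row per (set, element) pair
elements = DataFrame(set = ["commodity", "commodity", "region"],
                     element = ["agr", "mfg", "WI"])

# Values: one row per nonzero data point, with columns referring to set elements
data = DataFrame(commodity = ["agr", "mfg"],
                 region = ["WI", "WI"],
                 value = [12.3, 45.6])
```

Because each table is a plain DataFrame, the usual join, filter, and groupby tooling applies directly.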
Traditional modelers may wonder why we didn't use a sparse array. The reason is that Julia lacks a canonical multi-dimensional sparse array. While there are several sparse array packages in Julia, none of them provides the functionality we need for working with WiNDC data. Here is a short list of sparse array packages in Julia:
- SparseArrays.jl - The built-in Julia package for sparse matrices, but it is limited to two dimensions (see the brief demonstration after this list).
- SparseArrayKit.jl - Focused on high-dimensional linear algebra for quantum operations. It lacks documentation, and the keys are required to be integers.
- SparseAxisArray - A JuMP-specific sparse array implementation that supports named axes. However, it does not perform domain checking and does not default to zero for missing values.
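For example, the built-in package handles the two-dimensional case well, but it offers no higher-dimensional analogue and no way to index by named set elements:

```julia
using SparseArrays

# A 2-D sparse matrix is easy to build...
A = sparse([1, 2], [1, 3], [1.0, 2.0])   # 2×3 SparseMatrixCSC

# ...but there is no built-in N-dimensional version, and the indices
# must be integers rather than set elements like "agr" or "WI".
A[1, 1]   # 1.0
```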
One option we considered early on was creating a new sparse array specifically for WiNDC data. We prototyped this approach, but ultimately decided against it for several reasons. First, as our data grew in size and complexity, our code became slow and difficult to maintain. Each new feature required significant changes to the sparse array implementation, leading to a lot of technical debt. Second, Julia doesn't perform well when there are a large number of objects in global scope, which is often the case when working with WiNDC data. This led to more performance issues that were difficult to resolve. Finally, and this is also true for GAMS, having a large number of parameters makes it difficult to get an overview of the data. For example, we have strict requirements for the signs on our data. If we store our data across 20 different sparse arrays, it becomes difficult to ensure that all of the data is correctly signed.
This brings us to our DataFrame-based approach. We found that most of the operations we wanted to perform on the data were more naturally expressed in terms of DataFrame operations. For example, the raw data gets loaded into a DataFrame and then cleaned. This data can then easily be joined with our base data to perform operations. DataFrames also have the advantage of being largely language agnostic, meaning that users familiar with R, Python, or SQL can easily understand how to work with the data in Julia. DataFrames are also naturally sparse, meaning that missing values are simply not stored. If we join two DataFrames together, any missing values will automatically be filled with missing, which can then be handled appropriately, such as by replacing them with zeros or filtering them out.
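As a small illustration of that last point, with made-up tables rather than actual WiNDC data, a join followed by replacing missing values with zeros looks like this:

```julia
using DataFrames

base = DataFrame(commodity = ["agr", "mfg", "srv"], output = [10.0, 20.0, 30.0])
new_data = DataFrame(commodity = ["agr", "srv"], exports = [2.0, 5.0])

# A left join keeps every commodity in `base`; commodities absent from
# `new_data` get missing in the :exports column.
df = leftjoin(base, new_data, on = :commodity)

# Replace missing values with zeros (or filter them out, as appropriate).
df.exports = coalesce.(df.exports, 0.0)
```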
A DataFrame has a lot of advantages over arrays. It is very easy to manipulate and filter data using DataFrame operations. Visualizing data with Plots.jl, Plotly.jl, or Makie.jl is straightforward. The main disadvantage comes when inserting data into a model. There are two general strategies for this:
- Construct a DataFrame where each row contains the data for a single constraint.
- Convert a subset of your DataFrame into a dictionary (in effect, a simple sparse array) and use that to populate your model.
I have been using the second approach in my MPSGE models and it has worked well; you can see an example of this in the Household Model Documentation. The data that goes into a single constraint is typically a small subset of the overall data, so converting it into a dictionary or sparse array is efficient and straightforward. The first method can work very well, but it struggles when there are a large number of columns, for example if you need to index over each commodity.
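As a rough sketch of the second strategy (the column names, sets, and model below are hypothetical, not the Household Model's actual code), the dictionary acts as a simple sparse array with a zero default:

```julia
using DataFrames, JuMP

df = DataFrame(region = ["WI", "WI", "MN"],
               commodity = ["agr", "mfg", "agr"],
               value = [1.0, 2.0, 3.0])

# Convert the subset of data into a dictionary keyed by (region, commodity).
supply = Dict((row.region, row.commodity) => row.value for row in eachrow(df))

regions, commodities = ["WI", "MN"], ["agr", "mfg"]

model = Model()
@variable(model, x[r in regions, c in commodities] >= 0)

# Missing keys default to zero when populating the constraints.
@constraint(model, [r in regions, c in commodities],
            x[r, c] <= get(supply, (r, c), 0.0))
```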
I plan to write additional documentation on best practices for working with DataFrames in Julia. I may also offer a course or workshop on this topic in the future, depending on interest. If you are interested, please reach out to me.
WiNDC National
The WiNDC National package contains the framework and a simple model for working with national-level datasets. We provide methods to load data for both the US and Australia. If you are reading this hoping for an in-depth analysis of how we go from the raw data to the final WiNDC National data, then you should read the documentation. My goal, for each of our data packages, is to provide a clear and transparent path from the raw data to the final data used in models. This will always be located in the package documentation so we can be confident it is up-to-date.
The WiNDC National data is a great place to get started with WiNDC data in Julia. It is the simplest of the three data packages and has the most documentation. If you are new to WiNDC or Julia, I recommend starting here. I’ve written a simple example that walks through loading and aggregating the WiNDC National data so that the MPSGE model can be solved on a trial PATH license.
The model we use is fully documented in the WiNDC National documentation. It is written in MPSGE. If you are unfamiliar with MPSGE, think of it as a domain-specific language for writing CGE models. All you need to do is specify the inputs and outputs for each sector, and MPSGE automatically generates the necessary CES equations. This allows you to focus on the economic structure of your model, rather than the mathematical details. We are planning to add more features, documentation, and functionality to MPSGE in the coming months. Join our Mailing List to stay up-to-date on the latest developments.
When we create our data, we store only the necessary data. This is in contrast to GAMS, where a large number of parameters are composites (combinations of other parameters) that need to be updated whenever the underlying data changes. In Julia, we can compute these composite parameters in functions using DataFrame operations, for example Armington supply. This makes it easier to maintain and update the data, as we only need to change the underlying data and not all of the composite parameters.
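As a rough illustration (with hypothetical column names and a simple sum standing in for the actual aggregation, not the WiNDC National code), a composite like total Armington supply can be computed on demand instead of being stored:

```julia
using DataFrames

# Underlying data: domestic output and imports by commodity.
df = DataFrame(commodity = ["agr", "mfg"],
               domestic  = [10.0, 20.0],
               imports   = [2.0, 4.0])

# Compute the composite parameter when it is needed, rather than storing it.
armington_supply(df) = select(df,
    :commodity,
    [:domestic, :imports] => ByRow(+) => :armington_supply)

armington_supply(df)   # one row per commodity with the composite value
```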
This package embodies the core principles of WiNDC: transparency, reproducibility, and ease of use. We want to make it as easy as possible for researchers to access and work with WiNDC data in Julia. Our goal is to hold all of our data offerings to this standard. We will continue to improve our documentation to ensure that users can easily understand how the data is built and how to work with it.
WiNDC Regional
The WiNDC Regional package disaggregates the United States national data to 51 regions: the 50 US states plus the District of Columbia. The data is created for each year of available national data, 1997 to 2024.
The documentation detailing how the data is built is mostly complete, but there are still some sections that need to be filled in. We will be working to complete this documentation soon. We are currently focusing our attention on the WiNDC Household package.
WiNDC Household
Our most recent package, the WiNDC Household package, adds five household income groups to the WiNDC Regional data. This allows for analysis of the income-distributional effects of policy changes.
This package is available as an early alpha release. The documentation will get you started with loading the data and running the canonical model. However, it is missing a comprehensive data build section. Further, the package currently builds data only for 2024. We will be working on adding additional years and adding the household build options from GAMS.
I encourage users to try out this package and provide feedback. You can raise issues on the GitHub repository or reach out to me directly. Your feedback will help us improve the package and ensure it meets the needs of the community.
Planned Future Data Offerings
We have a lot of plans for the future. These projects don’t have a timeline yet, but are in the planning stages. If you are interested in supporting any of these projects, please reach out to me directly.
Bilateral Trade Flows Between US States
I wrote a blog post about this topic in October; read that post here. Essentially, the idea is to disaggregate the WiNDC Regional data to include bilateral trade flows between US states using a CGE gravity model with iceberg trade costs. Note that neither the CFS nor FAF datasets provide true bilateral trade flows between states. The CFS data double counts shipments as they move between states. The FAF data is better, but it is highly aggregated.
Recreation of GTAP WiNDC
GTAP is an incredibly important dataset for global economic analysis. We are working to load GTAP data natively into Julia. Once this is complete, we will recreate the GTAP WiNDC dataset in Julia. This will allow for seamless integration between WiNDC data and GTAP data. I also plan to provide tools built into the other WiNDC packages to disaggregate to GTAP sectors.
Energy Data
WiNDC used to have an energy data package called Bluenote. This package has been deprecated for some time. I would like to create a successor to this package in Julia. This package would provide energy data for the WiNDC models, allowing for more detailed analysis of energy policies.