WiNDC - Loading GTAP in Julia

Loading GTAP in Julia

Mitch Phillipson October 10, 2025

julia gtap

I’ve been spending the past week formalizing the process of loading GTAP data into Julia. The first step was to create a Julia package to read GEMPACK HAR files. I’ve been working with Maros Ivanic, who has a package for reading these files in R, and we’ve been collaborating to ensure that the Julia package is efficient and user-friendly.

Our package is called HeaderArrayFile.jl, and I have a pull request open to update to a new, easier to maintain version of the package. To follow along with the code in this post, you’ll want to install the package from my fork:

using Pkg
Pkg.activate(".")

Pkg.add(url="https://github.com/mitchphillipson/HeaderArrayFile.jl")

You will also want to install both DataFrames.jl and NamedArrays.jl:

Pkg.add("DataFrames")
Pkg.add("NamedArrays")

This code is being tested on GTAP9 data, obtain the data here. I am going to assume you have downloaded and unzipped the GTAP9 data into a folder on your computer. You will need to change the file paths in the code below to point to where you have stored the GTAP9 data. I will have mine in a data directory in my working directory.

using HeaderArrayFile
const HAR = HeaderArrayFile

using DataFrames
using NamedArrays

data_dir = "data" # Change this to point to where you have stored the GTAP9 data

The const HAR = HeaderArrayFile line creates an alias for the package name. The HeaderArrayFile package doesn’t export any functions, so we need to use the package name to access the functions, for example HeaderArrayFile.File(...). By creating the alias HAR, we can shorten this to HAR.File(...).

Sets

The first file we will explore is gsdsets.har, which contains the sets used in the GTAP model. We can read this file using the read_har function from the HeaderArrayFile.jl package:

gsdsets = HAR.File(joinpath(data_dir, "gsdset.har"))

In the REPL you’ll see the output:

HarFile with 21 records:
  "xxcd" => HarRecord("Creation Date and Time", HeaderArrayFile.HarSet)
  "xxcr" => HarRecord("Creating Program", HeaderArrayFile.HarSet)
  "xxcp" => HarRecord("compiler used to make EXE which created this file", HeaderArrayFile.HarSet)
  "xxhs" => HarRecord("File History", HeaderArrayFile.HarSet)
  "drel" => HarRecord("GTAP data release identifier", HeaderArrayFile.HarSet)
  "dver" => HarRecord("Format of GTAP Data", HeaderArrayFile.HarParameter)
  "h1" => HarRecord("Set REG  regions", HeaderArrayFile.HarSet)
  "h2" => HarRecord("Set TRAD_COMM  traded commodities", HeaderArrayFile.HarSet)
  "h3" => HarRecord("Set NSAV_COMM  non-savings commodities", HeaderArrayFile.HarSet)
  "h4" => HarRecord("Set DEMD_COMM  demanded commodities", HeaderArrayFile.HarSet)
  "h5" => HarRecord("Set PROD_COMM  produced commodities", HeaderArrayFile.HarSet)
  "h6" => HarRecord("Set ENDW_COMM  endowment commodities", HeaderArrayFile.HarSet)
  "h7" => HarRecord("Set ENDWS_COMM  sluggish endowment commodities", HeaderArrayFile.HarSet)
  "h8" => HarRecord("Set ENDWM_COMM  mobile endowment commodities", HeaderArrayFile.HarSet)
  "h9" => HarRecord("Set CGDS_COMM  capital goods commodities", HeaderArrayFile.HarSet)
  "marg" => HarRecord("Set MARG_COMM  margin commodities", HeaderArrayFile.HarSet)
  "tars" => HarRecord("Set TARTYPE", HeaderArrayFile.HarSet)
  "tarl" => HarRecord("Long Names of Types of Tariffs shown in the set TARTYPE (TAR)", HeaderArrayFile.HarSet)
  "h1l" => HarRecord("Region Long Names", HeaderArrayFile.HarSet)
  "h2l" => HarRecord("Descriptions of set TRAD_COMM", HeaderArrayFile.HarSet)
  "h6l" => HarRecord("Long Names for endowment commodities", HeaderArrayFile.HarSet)

It looks like h1 is the set of regions, so let’s extract that record and convert it to a DataFrame:

REG = DataFrame(gsdsets["h1"])

We can also return this as a NamedArray:

NamedArray(gsdsets["h1"])

Parameters

Let’s look at some actual data.

gsddat = HAR.File(joinpath(data_dir, "gsddat.har"))

The parameters vdfa and vdfm are the total value of domestic purchases at agent and market prices, respectively. Let’s extract these parameters and convert them to DataFrames:

vdfa = DataFrame(gsddat["vdfa"])
vdfm = DataFrame(gsddat["vdfm"])

Using these we can calculate the tax value by subtracting the market price from the agent price. We can do this using a join and a transformation:

tax_value = outerjoin(
    vdfa,
    vdfm,
    on = [:trad_comm, :prod_comm, :reg],
    renamecols = "_agent" => "_market"
) |>
x -> transform!(x,
    [:value_agent, :value_market] => ((a, m) -> a - m) => :value
) |>
x -> select!(x, Not([:value_agent, :value_market])) |>
x -> subset!(x, :value => ByRow(!=(0)))

If you are unfamiliar with DataFrame operations, I’ll break this down step-by-step:

We use outerjoin to join the two DataFrames vdfa and vdfm on the columns :trad_comm, :prod_comm, and :reg. We also rename the :value columns to :value_agent and :value_market to avoid name clashes. An outerjoin sticks the two DataFrames together based on the matching values in the specified columns.
We then use transform! to create a new column :value that contains the difference between :value_agent and :value_market. This gives us the tax value for each record. We use the ! version of transform to modify the DataFrame in place, this is more memory efficient.
Next, we use select! to remove the now-unnecessary :value_agent and :value_market columns from the DataFrame.
Finally, we use subset! to keep only the records where the :value column is not equal to zero.

Let’s compare this process to using NamedArrays.

vdfa_na = NamedArray(gsddat["vdfa"])
vdfm_na = NamedArray(gsddat["vdfm"])

tax_value_na = vdfa_na - vdfm_na

This is much simpler, we can just subtract the two NamedArrays directly. However, we lose some of the flexibility of working with DataFrames. For example, if we cannot filter out the zero values, Julia lacks a good multi-dimensional sparse array implementation.

However, DataFrames have other advantages. For example, what indices have negative tax values?

tax_value |>
   x -> subset(x, :value => ByRow(<(0)))

For a NamedArray, you can extract the negative values, but can’t easily get the indices:

tax_value_na[tax_value_na .< 0]

I believe to do this you need to loop over the indices, storing the indices where the value is negative. This is a fairly slow operation compared to the DataFrame approach.

Additional Functions

If you want to view only the parameters in the gsddat file, you can use the parameters function:

HAR.parameters(gsddat)

HAR.sets(gsddat)

Conclusion

I am currently working on loading the GTAP data in the WiNDCContainer style, or as a DataFrame. Loading the data as a DataFrame is more time consuming, however I find it to be much easier to debug and work with the data. I will post more about this as I make progress.

If you use this to load the arrays and set up the GTAP model, please let me know! I would love to hear about how you are using it.