Loading GTAP in Julia
October 10, 2025
I’ve been spending the past week formalizing the process of loading GTAP data into Julia. The first step was to create a Julia package to read GEMPACK HAR files. I’ve been working with Maros Ivanic, who has a package for reading these files in R, and we’ve been collaborating to ensure that the Julia package is efficient and user-friendly.
Our package is called HeaderArrayFile.jl, and I have a pull request open to update to a new, easier to maintain version of the package. To follow along with the code in this post, you’ll want to install the package from my fork:
using Pkg
Pkg.activate(".")
Pkg.add(url="https://github.com/mitchphillipson/HeaderArrayFile.jl")You will also want to install both DataFrames.jl and
NamedArrays.jl:
Pkg.add("DataFrames")
Pkg.add("NamedArrays")This code is being tested on GTAP9 data, obtain the data here.
I am going to assume you have downloaded and unzipped the GTAP9 data
into a folder on your computer. You will need to change the file paths
in the code below to point to where you have stored the GTAP9 data. I
will have mine in a data directory in my working
directory.
using HeaderArrayFile
const HAR = HeaderArrayFile
using DataFrames
using NamedArrays
data_dir = "data" # Change this to point to where you have stored the GTAP9 dataThe const HAR = HeaderArrayFile line creates an alias
for the package name. The HeaderArrayFile package doesn’t
export any functions, so we need to use the package name to access the
functions, for example HeaderArrayFile.File(...). By
creating the alias HAR, we can shorten this to
HAR.File(...).
Sets
The first file we will explore is gsdsets.har, which
contains the sets used in the GTAP model. We can read this file using
the read_har function from the
HeaderArrayFile.jl package:
gsdsets = HAR.File(joinpath(data_dir, "gsdset.har"))In the REPL you’ll see the output:
HarFile with 21 records:
"xxcd" => HarRecord("Creation Date and Time", HeaderArrayFile.HarSet)
"xxcr" => HarRecord("Creating Program", HeaderArrayFile.HarSet)
"xxcp" => HarRecord("compiler used to make EXE which created this file", HeaderArrayFile.HarSet)
"xxhs" => HarRecord("File History", HeaderArrayFile.HarSet)
"drel" => HarRecord("GTAP data release identifier", HeaderArrayFile.HarSet)
"dver" => HarRecord("Format of GTAP Data", HeaderArrayFile.HarParameter)
"h1" => HarRecord("Set REG regions", HeaderArrayFile.HarSet)
"h2" => HarRecord("Set TRAD_COMM traded commodities", HeaderArrayFile.HarSet)
"h3" => HarRecord("Set NSAV_COMM non-savings commodities", HeaderArrayFile.HarSet)
"h4" => HarRecord("Set DEMD_COMM demanded commodities", HeaderArrayFile.HarSet)
"h5" => HarRecord("Set PROD_COMM produced commodities", HeaderArrayFile.HarSet)
"h6" => HarRecord("Set ENDW_COMM endowment commodities", HeaderArrayFile.HarSet)
"h7" => HarRecord("Set ENDWS_COMM sluggish endowment commodities", HeaderArrayFile.HarSet)
"h8" => HarRecord("Set ENDWM_COMM mobile endowment commodities", HeaderArrayFile.HarSet)
"h9" => HarRecord("Set CGDS_COMM capital goods commodities", HeaderArrayFile.HarSet)
"marg" => HarRecord("Set MARG_COMM margin commodities", HeaderArrayFile.HarSet)
"tars" => HarRecord("Set TARTYPE", HeaderArrayFile.HarSet)
"tarl" => HarRecord("Long Names of Types of Tariffs shown in the set TARTYPE (TAR)", HeaderArrayFile.HarSet)
"h1l" => HarRecord("Region Long Names", HeaderArrayFile.HarSet)
"h2l" => HarRecord("Descriptions of set TRAD_COMM", HeaderArrayFile.HarSet)
"h6l" => HarRecord("Long Names for endowment commodities", HeaderArrayFile.HarSet)
It looks like h1 is the set of regions, so let’s extract
that record and convert it to a DataFrame:
REG = DataFrame(gsdsets["h1"])We can also return this as a NamedArray:
NamedArray(gsdsets["h1"])Parameters
Let’s look at some actual data.
gsddat = HAR.File(joinpath(data_dir, "gsddat.har"))The parameters vdfa and vdfm are the total
value of domestic purchases at agent and market prices, respectively.
Let’s extract these parameters and convert them to
DataFrames:
vdfa = DataFrame(gsddat["vdfa"])
vdfm = DataFrame(gsddat["vdfm"])Using these we can calculate the tax value by subtracting the market price from the agent price. We can do this using a join and a transformation:
tax_value = outerjoin(
vdfa,
vdfm,
on = [:trad_comm, :prod_comm, :reg],
renamecols = "_agent" => "_market"
) |>
x -> transform!(x,
[:value_agent, :value_market] => ((a, m) -> a - m) => :value
) |>
x -> select!(x, Not([:value_agent, :value_market])) |>
x -> subset!(x, :value => ByRow(!=(0)))If you are unfamiliar with DataFrame operations, I’ll break this down step-by-step:
- We use
outerjointo join the two DataFramesvdfaandvdfmon the columns:trad_comm,:prod_comm, and:reg. We also rename the:valuecolumns to:value_agentand:value_marketto avoid name clashes. An outerjoin sticks the two DataFrames together based on the matching values in the specified columns. - We then use
transform!to create a new column:valuethat contains the difference between:value_agentand:value_market. This gives us the tax value for each record. We use the!version oftransformto modify the DataFrame in place, this is more memory efficient. - Next, we use
select!to remove the now-unnecessary:value_agentand:value_marketcolumns from the DataFrame. - Finally, we use
subset!to keep only the records where the:valuecolumn is not equal to zero.
Let’s compare this process to using NamedArrays.
vdfa_na = NamedArray(gsddat["vdfa"])
vdfm_na = NamedArray(gsddat["vdfm"])
tax_value_na = vdfa_na - vdfm_naThis is much simpler, we can just subtract the two
NamedArrays directly. However, we lose some of the
flexibility of working with DataFrames. For example, if we
cannot filter out the zero values, Julia lacks a good multi-dimensional
sparse array implementation.
However, DataFrames have other advantages. For example, what indices have negative tax values?
tax_value |>
x -> subset(x, :value => ByRow(<(0)))For a NamedArray, you can extract the negative values, but can’t easily get the indices:
tax_value_na[tax_value_na .< 0]I believe to do this you need to loop over the indices, storing the indices where the value is negative. This is a fairly slow operation compared to the DataFrame approach.
Additional Functions
If you want to view only the parameters in the gsddat
file, you can use the parameters function:
HAR.parameters(gsddat)HAR.sets(gsddat)Conclusion
I am currently working on loading the GTAP data in the WiNDCContainer style, or as a DataFrame. Loading the data as a DataFrame is more time consuming, however I find it to be much easier to debug and work with the data. I will post more about this as I make progress.
If you use this to load the arrays and set up the GTAP model, please let me know! I would love to hear about how you are using it.