Comparing Annual GDP by State data to Summary data
September 05, 2025
Every year the BEA releases the SAGDP dataset, which provides state-level data on a range of categories. We use five of these tables in our state-level disaggregation process. Today I am going to compare one of these tables, Compensation of Employees, to the national-level summary data.
If you’d like to follow along in a new Julia environment, you can download the SAGDP data from here. You also need two files, state_fips.csv and industry_codes.csv. Alternatively, you can download the code from my GitHub repo; all you need to do is execute the code in main.jl.
I’ll be using Julia code in the post. If you follow along and any code breaks, please let me know!
Setup
If you want to follow along without downloading my code, you can install the required packages with the following code:
using Pkg
Pkg.activate(".")
Pkg.add("WiNDCNational")
Pkg.add("DataFrames")
Pkg.add("CSV")This code will create a new environment in the current directory and
install the required packages. You can also do this from the Julia
package manager by typing ] in the Julia REPL.
The code that I provided instead uses Pkg.instantiate()
to install the packages from the Project.toml and
Manifest.toml files. This is why it’s nice to use
environments, as it ensures that you have the exact same package
versions as I do.
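If you went the GitHub route, reproducing my environment looks something like this (assuming you start Julia in the repo directory):
using Pkg
Pkg.activate(".")   # activate the environment defined by Project.toml
Pkg.instantiate()   # install the exact versions recorded in Manifest.toml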
Now we need to load the packages we just installed and load some data.
using WiNDCNational
using DataFrames
using CSV
summary_raw = build_us_table(:summary)
summary, _ = calibrate(summary_raw)
data_dir = "data" # Change this to your data directory, if needed
state_fips = CSV.read(
joinpath(data_dir,"state_fips.csv"),
DataFrame;
types = Dict(
:fips => String,
:state => String,
),
select = [:fips, :state]
)
industry_codes = CSV.read(
joinpath(data_dir,"industry_codes.csv"),
DataFrame;
types = Dict(:naics => Symbol),
drop = [:Description]
) |>
dropmissing
If you are new to Julia and data, this might need some explanation.
Let’s discuss industry_codes.csv. We load this with the CSV.read function. It takes a file path and a sink, in this case a DataFrame. Everything after the ; is an optional keyword argument. I want the naics column to be a Symbol and I don’t want the Description column.
Finally, after the CSV.read function, I use the pipe
operator |> to pass the result to the
dropmissing function. This removes any rows with missing
values.
We are going to see the |> operator a lot today and
it’s useful to see a small example. Let’s say we have
f(x) = x^2, we can call this function like
f(3). We can also use the pipe operator to pass
3 to f like this: 3 |> f. The right-hand side is a function that takes one argument: the left-hand side. You can chain multiple functions together like this:
3 |> f |> sqrt. This is equivalent to
sqrt(f(3)). If your function has multiple inputs, you can
use an anonymous function. For example, if we have
g(x,y) = x + y, we can call it like g(3,4). To
use the pipe operator, we can do
3 |> x -> g(x,4).
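Here is the whole thing in runnable form, if you want to try it in the REPL (toy functions just for illustration):
f(x) = x^2
g(x, y) = x + y
3 |> f              # same as f(3), returns 9
3 |> f |> sqrt      # same as sqrt(f(3)), returns 3.0
3 |> x -> g(x, 4)   # anonymous function wraps the two-argument g, returns 7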
Function for Loading the Data
All the SAGDP tables have the same format, so we can
write a function to load them:
function load_sagdp(
path::String;
data_dir = data_dir,
industry_codes::DataFrame = industry_codes,
state_fips::DataFrame = state_fips,
)
df = CSV.read(
joinpath(data_dir,path),
DataFrame,
footerskip = 4,
missingstring = ["(NA)", "(D)", "(NM)","(L)", "(T)"],
drop = [:GeoName, :Region, :TableName, :IndustryClassification],
types = Dict(:GeoFIPS => String)
) |>
x -> stack(x, Not(:GeoFIPS, :LineCode, :Unit, :Description), variable_name=:year, value_name=:value) |>
dropmissing |>
x -> transform(x,
:year => ByRow(y -> parse(Int, y)) => :year,
:value => (y -> y ./ 1_000_000) => :value, #Convert to billions
) |>
x -> innerjoin(x, state_fips, on = :GeoFIPS => :fips) |>
x -> innerjoin(x, industry_codes, on = :LineCode) |>
x -> select(x, Not(:GeoFIPS, :LineCode, :Unit, :Description)) |>
x -> subset(x, :value => ByRow(!iszero))
return df
end
This function has one required argument, path, which is the file name of the table to load. The other arguments are optional and have default values; you shouldn’t need to change them. Note that you get an error if you run this function before those default globals are defined. This is fine for this small example but could cause issues in a package.
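A minimal sketch of that failure mode, using a hypothetical function h: keyword defaults are evaluated at call time, so a default that names a global only works once that global exists.
h(; data_dir = data_dir) = data_dir   # the default looks up the *global* data_dir when called
h()                     # UndefVarError if no global data_dir is defined yet
h(data_dir = "other")   # always fine, the default is never evaluated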
Let’s break this function down; it really just loads and then cleans the data. The first step is to read the CSV:
CSV.read(
joinpath(data_dir,path),
DataFrame,
footerskip = 4,
missingstring = ["(NA)", "(D)", "(NM)","(L)", "(T)"],
drop = [:GeoName, :Region, :TableName, :IndustryClassification],
types = Dict(:GeoFIPS => String)
)
You should be able to understand most of this, especially if you open
the CSV. You’ll see some metadata in the last four rows, so we skip them
with footerskip = 4. The missingstring keyword tells the function to treat these strings as missing values; each of those represents a different type of missing data. We also drop some
columns we don’t need and make sure that GeoFIPS is a
string.
Next we stack the data and drop missing values:
x -> stack(x, Not(:GeoFIPS, :LineCode, :Unit, :Description), variable_name=:year, value_name=:value) |>
dropmissing
The data starts in a wide format with all the years in different
columns. This is not ideal for analysis, so we use the
stack function to convert it to a long format. We keep
GeoFIPS, LineCode, Unit, and
Description as they are and stack the rest of the columns.
The new column names are year and value. After
stacking, we drop any rows with missing values.
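Here is a toy version of that reshape, with made-up numbers:
wide = DataFrame(GeoFIPS = ["01000", "02000"], var"1997" = [1.0, 2.0], var"1998" = [3.0, 4.0])
stack(wide, Not(:GeoFIPS), variable_name = :year, value_name = :value)
# 4×3 DataFrame: one row per (GeoFIPS, year) pair instead of one column per year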
The two columns we created aren’t quite right, so we transform them:
x -> transform(x,
:year => ByRow(y -> parse(Int, y)) => :year,
:value => (y -> y./1_000_000) => :value, #Convert to billions
)
The format is [input columns] => function => [output columns]. Here we are using a single input column, :year, applying the parse function to each row, and storing the result back in :year. The ByRow function implicitly applies our function to each row. This isn’t strictly necessary; look at :value, where we apply the function to the entire column at once. I usually default to ByRow as it’s easier to conceptualize.
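Here are both styles side by side on a toy table:
toy = DataFrame(year = ["1997", "1998"], value = [1_000_000.0, 2_000_000.0])
transform(toy,
    :year  => ByRow(y -> parse(Int, y)) => :year,   # applied row by row
    :value => (v -> v ./ 1_000_000)     => :value,  # applied to the whole column
)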
Penultimately, we translate the FIPS codes and LineCode values to state names and industry names:
x -> innerjoin(x, state_fips, on = :GeoFIPS => :fips) |>
x -> innerjoin(x, industry_codes, on = :LineCode)
If you understand joins, this should be straightforward. The only
“special” thing is the on keyword. This tells the function
which columns to join on. The syntax is
left table column => right table column, or just
column if the names are the same. This can also be a vector
if you’re joining on multiple columns.
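A quick sketch of the renaming form of on, with a single made-up row:
left  = DataFrame(GeoFIPS = ["01000"], value = [1.0])
right = DataFrame(fips = ["01000"], state = ["Alabama"])
innerjoin(left, right, on = :GeoFIPS => :fips)   # matches GeoFIPS against fips
# 1×3 DataFrame with columns GeoFIPS, value, and state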
Finally, we drop some columns we don’t need and remove any rows with a value of zero:
x -> select(x, Not(:GeoFIPS, :LineCode, :Unit, :Description)) |>
x -> subset(x, :value => ByRow(!iszero))
That was a lot. Let’s do something fun with the data!
Compensation of Employees or Labor Demand
The first table we will look at is SAGDP4, which
contains data on the compensation of employees across different states
and industries. You can read the xml file for a full
description.
Let’s use our function to load the data:
labor_path = "SAGDP4__ALL_AREAS_1997_2023.csv"
labor = load_sagdp(labor_path)
We can compare this data to the national accounts. First, aggregate the labor data to the national level:
aggregated_labor = labor |>
x -> groupby(x, [:year, :naics]) |>
x -> combine(x, :value => sum => :labor)
Then join it to the national accounts:
df = outerjoin(
table(summary, :Labor_Demand),
aggregated_labor,
on = [:year, :col => :naics],
) |>
x -> transform(x, [:value,:labor] => ByRow((+)) => :diff)
I did an outerjoin, which will produce a lot of missing values; that’s fine, we’ll use them to determine which industries are missing. Note that :diff is computed with (+): the summary Labor_Demand values are negative because they are inputs, so adding the positive state aggregate gives the difference.
One would expect these values to be exactly equal, but they are not. Let’s first find the industries where the difference is small, say less than $1 billion:
df |>
x -> dropmissing(x, :diff) |>
x -> subset(x, :diff => ByRow(y-> abs(y) < 1))
This returns all but 304 rows, so most industries with data are fairly close to the expected values. These are the good industries. If you just want to see the 60 industry codes, you can do:
df |>
x -> dropmissing(x, :diff) |>
x -> subset(x, :diff => ByRow(y-> abs(y) < 1)) |>
x -> unique(x, :col) # change :col to [:col, :year] to get unique pairs
Now let’s look in the opposite direction, at industries where the values differ, and sort the differences:
df |>
x -> dropmissing(x, :diff) |>
x -> subset(x, :diff => ByRow(y-> abs(y) >= 1)) |>
x -> sort(x, :diff)
These values are quite different, which is interesting. This is
probably due to the imperfect matching between
LineCodes and NAICS codes. You’d think the BEA
would use the same labelling on both datasets, but they don’t.
Finally, let’s take a look at the missing values. These are industries that appear in the national accounts but not in the state-level data.
df |>
x -> subset(x, :diff => ByRow(ismissing)) |>
x -> unique(x, :col)
These are not a surprise. We can see they do not appear in industry_codes.csv, which is why they are missing.
Disaggregation
So how do we use this data to disaggregate the national accounts? Here is my suggested process:
1. Use the actual values for any industry where the difference is small (e.g. less than $1 billion).
2. For industries where the difference is large, use value shares of the state-level data to disaggregate the national accounts (see the sketch after this list).
3. For industries that are missing, assume equal shares in all states. This is not ideal and may change to population shares in the future.
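To make step 2 concrete, here is the shares computation on made-up numbers for a single industry-year:
national    = -100.0                            # national labor demand (negative, it's an input)
state_labor = [60.0, 30.0, 10.0]                # state-level compensation for that industry-year
shares      = state_labor ./ sum(state_labor)   # [0.6, 0.3, 0.1]
national .* shares                              # [-60.0, -30.0, -10.0], sums back to the national value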
First up, the small differences. For this I will find the industry-year pairs with small differences, use them to filter the labor data, then add/rename columns to match the expected format, and make the values negative as they are inputs:
step_1 = df |>
x -> dropmissing(x, :diff) |>
x -> subset(x, :diff => ByRow(y-> abs(y) < 1)) |>
x -> select(x, :col, :year) |>
x -> innerjoin(
labor,
x,
on = [:naics => :col, :year]
) |>
x -> transform(x,
:year => ByRow(y -> [:labor, :labor_demand]) => [:row, :parameter],
:value => ByRow((-)) => :value,
) |>
x -> select(x, :row, :naics => :col, :state, :year, :parameter, :value)
Steps 2 and 3 are actually the same process: disaggregating by shares. The difference is that in step 2 we have actual values, while in step 3 we assume equal shares. I’ll make two sub-dataframes for these two cases, then combine them.
missing_categories = df |>
x -> subset(x, :diff => ByRow(ismissing)) |>
x -> unique(x, [:col, :year]) |>
x -> select(x, :col, :year) |>
x -> crossjoin(x, state_fips[!,[:state]]) |> # Add in the states
x -> transform(x, :col => ByRow(y -> 1) => :labor)
big_differences = df |>
x -> dropmissing(x,:diff) |>
x -> subset(x, :diff => ByRow(y -> abs(y) >=1)) |>
x -> select(x, :col, :year) |>
x -> innerjoin(
x,
labor,
on = [:col => :naics, :year]
) |>
x -> rename(x, :value => :labor)
step_2_3 = vcat(missing_categories, big_differences) |>
x -> innerjoin(
table(summary, :Labor_Demand),
x,
on = [:col, :year]
) |>
x -> groupby(x, [:col, :year]) |>
x -> combine(x,
:state => identity => :state,
[:value, :labor] => ((v,l) -> v .* (l ./ sum(l))) => :value,
) |>
x -> transform(x, :year => ByRow(y -> [:labor, :labor_demand]) => [:row, :parameter])
The combine in step_2_3 performs the disaggregation by shares; you should be able to see the shares computation. This is one case where you cannot use ByRow, as we need the entire column to compute the shares. Also, we preserve the state column by using identity.
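A toy version of that combine, with two groups of made-up values:
toy = DataFrame(col = [:a, :a, :b], state = ["AL", "AK", "WI"], labor = [3.0, 1.0, 2.0])
combine(groupby(toy, :col),
    :state => identity => :state,            # carry the state column through
    :labor => (l -> l ./ sum(l)) => :share,  # whole-column function: shares within each group
)
# :a gets shares 0.75 and 0.25; :b gets 1.0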
Finally, we can combine step_1 and step_2_3
to get our final disaggregated labor demand data:
X = vcat(step_1, step_2_3)
Checking Our Work
The last thing to do is aggregate X back to the national
level and compare it to the summary data. We should get exactly the same
values as in the summary data, or close to them.
X |>
x -> groupby(x, [:year, :col]) |>
x -> combine(x, :value => sum => :labor) |>
x -> outerjoin(
table(summary, :Labor_Demand),
x,
on = [:year, :col]
) |>
x -> transform(x, [:value, :labor] => ByRow((-)) => :diff) |>
x -> sort(x, :diff)
All the differences are within 1, the number we used to define small differences. That choice is arbitrary; try changing it to 0.5 or 2 and see how it affects the results.
Next Steps
This process can be repeated for the other four SAGDP
tables we use in our disaggregation process. The code is very similar
and is a good exercise for the reader.