Comparing Annual GDP by State data to Summary data
September 05, 2025
Every year the BEA releases the SAGDP dataset, which provides state-level data on a range of categories. We use five of these tables in our state-level disaggregation process. Today I am going to compare one of these tables, Compensation of Employees, to the national-level summary data.
If you’d like to follow along in a new Julia environment, you can download the SAGDP data from here. You also need two files, state_fips.csv and industry_codes.csv. Alternatively, you can download the code from my GitHub repo; all you need to do is execute the code in main.jl.
I’ll be using Julia code in the post. If you follow along and any code breaks, please let me know!
Setup
If you want to follow along without downloading my code, you can install the required packages with the following code:
using Pkg
Pkg.activate(".")
Pkg.add("WiNDCNational")
Pkg.add("DataFrames")
Pkg.add("CSV")This code will create a new environment in the current directory and
install the required packages. You can also do this from the Julia
package manager by typing ] in the Julia REPL.
The code that I provided instead uses Pkg.instantiate()
to install the packages from the Project.toml and
Manifest.toml files. This is why it’s nice to use
environments, as it ensures that you have the exact same package
versions as I do.
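If you went the GitHub route, reproducing my environment looks something like this (assuming you start Julia in the repo directory):
using Pkg
Pkg.activate(".")   # activate the environment defined by Project.toml
Pkg.instantiate()   # install the exact versions recorded in Manifest.toml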
Now we need to load the packages we just installed and load some data.
using WiNDCNational
using DataFrames
using CSV
summary_raw = build_us_table(:summary)
summary, _ = calibrate(summary_raw)
data_dir = "data" # Change this to your data directory, if needed
state_fips = CSV.read(
joinpath(data_dir,"state_fips.csv"),
DataFrame;
types = Dict(
:fips => String,
:state => String,
),
select = [:fips, :state]
)
industry_codes = CSV.read(
joinpath(data_dir,"industry_codes.csv"),
DataFrame;
types = Dict(:naics => Symbol),
drop = [:Description]
) |>
dropmissing
If you are new to Julia and data, this might need some explanation.
Let’s discuss industry_codes.csv. We load this with the CSV.read function. It takes a file path and a sink, in this case a DataFrame. Everything after the ; is an optional keyword argument. I want the naics column to be a Symbol and I don’t want the Description column.
Finally, after the CSV.read function, I use the pipe
operator |> to pass the result to the
dropmissing function. This removes any rows with missing
values.
We are going to see the |> operator a lot today and
it’s useful to see a small example. Let’s say we have
f(x) = x^2, we can call this function like
f(3). We can also use the pipe operator to pass
3 to f like this: 3 |> f. The right-hand side is a function that takes one argument: the left-hand side. You can chain multiple functions together like this:
3 |> f |> sqrt. This is equivalent to
sqrt(f(3)). If your function has multiple inputs, you can
use an anonymous function. For example, if we have
g(x,y) = x + y, we can call it like g(3,4). To
use the pipe operator, we can do
3 |> x -> g(x,4).
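Here is the whole thing in runnable form, if you want to try it in the REPL (toy functions just for illustration):
f(x) = x^2
g(x, y) = x + y
3 |> f              # same as f(3), returns 9
3 |> f |> sqrt      # same as sqrt(f(3)), returns 3.0
3 |> x -> g(x, 4)   # anonymous function wraps the two-argument g, returns 7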
Function for Loading the Data
All the SAGDP tables have the same format, so we can
write a function to load them:
function load_sagdp(
path::String;
data_dir = data_dir,
industry_codes::DataFrame = industry_codes,
state_fips::DataFrame = state_fips,
)
df = CSV.read(
joinpath(data_dir,path),
DataFrame,
footerskip = 4,
missingstring = ["(NA)", "(D)", "(NM)","(L)", "(T)"],
drop = [:GeoName, :Region, :TableName, :IndustryClassification],
types = Dict(:GeoFIPS => String)
) |>
x -> stack(x, Not(:GeoFIPS, :LineCode, :Unit, :Description), variable_name=:year, value_name=:value) |>
dropmissing |>
x -> transform(x,
:year => ByRow(y -> parse(Int, y)) => :year,
:value => (y -> y ./ 1_000_000) => :value, #Convert to billions
) |>
x -> innerjoin(x, state_fips, on = :GeoFIPS => :fips) |>
x -> innerjoin(x, industry_codes, on = :LineCode) |>
x -> select(x, Not(:GeoFIPS, :LineCode, :Unit, :Description)) |>
x -> subset(x, :value => ByRow(!iszero))
return df
end
This function has one required argument, path, which is the file name of the table to load. The other arguments are optional and have default values; you shouldn’t need to change them. Note that you get an error if you run this function before those default globals are defined. This is fine for this small example but could cause issues in a package.
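A minimal sketch of that failure mode, using a hypothetical function h: keyword defaults are evaluated at call time, so a default that names a global only works once that global exists.
h(; data_dir = data_dir) = data_dir   # the default looks up the *global* data_dir when called
h()                     # UndefVarError if no global data_dir is defined yet
h(data_dir = "other")   # always fine, the default is never evaluated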
Let’s break this function down; it really just loads and then cleans the data. The first step is to read the CSV:
CSV.read(
joinpath(data_dir,path),
DataFrame,
footerskip = 4,
missingstring = ["(NA)", "(D)", "(NM)","(L)", "(T)"],
drop = [:GeoName, :Region, :TableName, :IndustryClassification],
types = Dict(:GeoFIPS => String)
)
You should be able to understand most of this, especially if you open
the CSV. You’ll see some metadata in the last four rows, so we skip them
with footerskip = 4. The missingstring keyword tells the function to treat these strings as missing values; each of those represents a different type of missing data. We also drop some
columns we don’t need and make sure that GeoFIPS is a
string.
Next we stack the data and drop missing values:
x -> stack(x, Not(:GeoFIPS, :LineCode, :Unit, :Description), variable_name=:year, value_name=:value) |>
dropmissing
The data starts in a wide format with all the years in different
columns. This is not ideal for analysis, so we use the
stack function to convert it to a long format. We keep
GeoFIPS, LineCode, Unit, and
Description as they are and stack the rest of the columns.
The new column names are year and value. After
stacking, we drop any rows with missing values.
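Here is a toy version of that reshape, with made-up numbers:
wide = DataFrame(GeoFIPS = ["01000", "02000"], var"1997" = [1.0, 2.0], var"1998" = [3.0, 4.0])
stack(wide, Not(:GeoFIPS), variable_name = :year, value_name = :value)
# 4×3 DataFrame: one row per (GeoFIPS, year) pair instead of one column per year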
The two columns we created aren’t quite right, so we transform them:
x -> transform(x,
:year => ByRow(y -> parse(Int, y)) => :year,
:value => (y -> y./1_000_000) => :value, #Convert to billions
)
The format is [input columns] => function => [output columns]. Here we are using a single input column, :year, applying the parse function to each row, and storing the result back in :year. The ByRow function implicitly applies our function to each row. This isn’t strictly necessary; look at :value, where we apply the function to the entire column at once. I usually default to ByRow as it’s easier to conceptualize.
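Here are both styles side by side on a toy table:
toy = DataFrame(year = ["1997", "1998"], value = [1_000_000.0, 2_000_000.0])
transform(toy,
    :year  => ByRow(y -> parse(Int, y)) => :year,   # applied row by row
    :value => (v -> v ./ 1_000_000)     => :value,  # applied to the whole column
)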
Penultimately, we translate the FIPS codes and LineCode values to state names and industry names:
x -> innerjoin(x, state_fips, on = :GeoFIPS => :fips) |>
x -> innerjoin(x, industry_codes, on = :LineCode)
If you understand joins, this should be straightforward. The only
“special” thing is the on keyword. This tells the function
which columns to join on. The syntax is
left table column => right table column, or just
column if the names are the same. This can also be a vector
if you’re joining on multiple columns.
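A quick sketch of the renaming form of on, with a single made-up row:
left  = DataFrame(GeoFIPS = ["01000"], value = [1.0])
right = DataFrame(fips = ["01000"], state = ["Alabama"])
innerjoin(left, right, on = :GeoFIPS => :fips)   # matches GeoFIPS against fips
# 1×3 DataFrame with columns GeoFIPS, value, and state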
Finally, we drop some columns we don’t need and remove any rows with a value of zero:
x -> select(x, Not(:GeoFIPS, :LineCode, :Unit, :Description)) |>
x -> subset(x, :value => ByRow(!iszero))
That was a lot. Let’s do something fun with the data!
Compensation of Employees or Labor Demand
The first table we will look at is SAGDP4, which
contains data on the compensation of employees across different states
and industries. You can read the xml file for a full
description.
Let’s use our function to load the data:
labor_path = "SAGDP4__ALL_AREAS_1997_2023.csv"
labor = load_sagdp(labor_path)
We can compare this data to the national accounts. First, aggregate the labor data to the national level:
aggregated_labor = labor |>
x -> groupby(x, [:year, :naics]) |>
x -> combine(x, :value => sum => :labor)
Then join it to the national accounts:
df = outerjoin(
table(summary, :Labor_Demand),
aggregated_labor,
on = [:year, :col => :naics],
) |>
x -> transform(x, [:value,:labor] => ByRow((+)) => :diff)
I did an outerjoin, which will produce a lot of missing values; that’s fine, we’ll use them to determine which industries are missing. Note that :diff is computed with (+): the summary Labor_Demand values are negative because they are inputs, so adding the positive state aggregate gives the difference.
One would expect these values to be exactly equal, but they are not. Let’s first find the industries where the difference is small, say less than $1 billion:
df |>
x -> dropmissing(x, :diff) |>
x -> subset(x, :diff => ByRow(y-> abs(y) < 1))
This returns all but 304 rows, so most industries with data are fairly close to the expected values. These are the good industries. If you just want to see the 60 industry codes, you can do:
df |>
x -> dropmissing(x, :diff) |>
x -> subset(x, :diff => ByRow(y-> abs(y) < 1)) |>
x -> unique(x, :col) # change :col to [:col, :year] to get unique pairs
Now let’s look in the opposite direction, at industries where the values differ, and sort the differences:
df |>
x -> dropmissing(x, :diff) |>
x -> subset(x, :diff => ByRow(y-> abs(y) >= 1)) |>
x -> sort(x, :diff)
These values are quite different, which is interesting. This is
probably due to the imperfect matching between
LineCodes and NAICS codes. You’d think the BEA
would use the same labelling on both datasets, but they don’t.
Finally, let’s take a look at the missing values. These are industries that appear in the national accounts but not in the state-level data.
df |>
x -> subset(x, :diff => ByRow(ismissing)) |>
x -> unique(x, :col)
These are not a surprise. We can see they do not appear in industry_codes.csv, which is why they are missing.
Disaggregation
So how do we use this data to disaggregate the national accounts? Here is my suggested process:
1. Use the actual values for any industry where the difference is small (e.g. less than $1 billion).
2. For industries where the difference is large, use value shares of the state-level data to disaggregate the national accounts (see the sketch after this list).
3. For industries that are missing, assume equal shares in all states. This is not ideal and may change to population shares in the future.
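To make step 2 concrete, here is the shares computation on made-up numbers for a single industry-year:
national    = -100.0                            # national labor demand (negative, it's an input)
state_labor = [60.0, 30.0, 10.0]                # state-level compensation for that industry-year
shares      = state_labor ./ sum(state_labor)   # [0.6, 0.3, 0.1]
national .* shares                              # [-60.0, -30.0, -10.0], sums back to the national value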
First up, the small differences. For this I will find the industry-year pairs with small differences, use them to filter the labor data, then add/rename columns to match the expected format, and make the values negative as they are inputs:
step_1 = df |>
x -> dropmissing(x, :diff) |>
x -> subset(x, :diff => ByRow(y-> abs(y) < 1)) |>
x -> select(x, :col, :year) |>
x -> innerjoin(
labor,
x,
on = [:naics => :col, :year]
) |>
x -> transform(x,
:year => ByRow(y -> [:labor, :labor_demand]) => [:row, :parameter],
:value => ByRow((-)) => :value,
) |>
x -> select(x, :row, :naics => :col, :state, :year, :parameter, :value)
Steps 2 and 3 are actually the same process: disaggregating by shares. The difference is that in step 2 we have actual values, while in step 3 we assume equal shares. I’ll make two sub-dataframes for these two cases, then combine them.
missing_categories = df |>
x -> subset(x, :diff => ByRow(ismissing)) |>
x -> unique(x, [:col, :year]) |>
x -> select(x, :col, :year) |>
x -> crossjoin(x, state_fips[!,[:state]]) |> # Add in the states
x -> transform(x, :col => ByRow(y -> 1) => :labor)
big_differences = df |>
x -> dropmissing(x,:diff) |>
x -> subset(x, :diff => ByRow(y -> abs(y) >=1)) |>
x -> select(x, :col, :year) |>
x -> innerjoin(
x,
labor,
on = [:col => :naics, :year]
) |>
x -> rename(x, :value => :labor)
step_2_3 = vcat(missing_categories, big_differences) |>
x -> innerjoin(
table(summary, :Labor_Demand),
x,
on = [:col, :year]
) |>
x -> groupby(x, [:col, :year]) |>
x -> combine(x,
:state => identity => :state,
[:value, :labor] => ((v,l) -> v .* (l ./ sum(l))) => :value,
) |>
x -> transform(x, :year => ByRow(y -> [:labor, :labor_demand]) => [:row, :parameter])
The combine in step_2_3 performs the disaggregation by shares; you should be able to see the shares computation. This is one case where you cannot use ByRow, as we need the entire column to compute the shares. Also, we preserve the state column by using identity.
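A toy version of that combine, with two groups of made-up values:
toy = DataFrame(col = [:a, :a, :b], state = ["AL", "AK", "WI"], labor = [3.0, 1.0, 2.0])
combine(groupby(toy, :col),
    :state => identity => :state,            # carry the state column through
    :labor => (l -> l ./ sum(l)) => :share,  # whole-column function: shares within each group
)
# :a gets shares 0.75 and 0.25; :b gets 1.0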
Finally, we can combine step_1 and step_2_3
to get our final disaggregated labor demand data:
X = vcat(step_1, step_2_3)
Checking Our Work
The last thing to do is aggregate X back to the national
level and compare it to the summary data. We should get exactly the same
values as in the summary data, or close to them.
X |>
x -> groupby(x, [:year, :col]) |>
x -> combine(x, :value => sum => :labor) |>
x -> outerjoin(
table(summary, :Labor_Demand),
x,
on = [:year, :col]
) |>
x -> transform(x, [:value, :labor] => ByRow((-)) => :diff) |>
x -> sort(x, :diff)
All the differences are within 1, the number we used to define small differences. That choice is arbitrary; try changing it to 0.5 or 2 and see how it affects the results.
Next Steps
This process can be repeated for the other four SAGDP
tables we use in our disaggregation process. The code is very similar
and is a good exercise for the reader.