Getting the data

On March 1, 2022, Tidy Tuesday presented the alternative fuel stations database as the weekly challenge. The alternative fuel data center at the U.S. Department of Energy has great tools to visualize the availability, spread, and evolution of alternative fuel stations. Instead of focusing on the stations themselves, I thought this would be a good opportunity to combine this nicely kept database with U.S. Census data. I was curious about the relationship between the zip code median income and the availability of electric fueling stations.

First, I got a count per zip code of electric fuel, available, and public stations.

# STATIONS per zip code -----------------------------------------------
stations <- readRDS("data/stations.rds")
stations <- stations %>% 
  filter(STATUS_CODE == "E",            #Available
         FUEL_TYPE_CODE == "ELEC",      #Electric
         ACCESS_CODE == "public") %>%   #Public
  group_by(ZIP) %>% 
  summarize(e_stations = n())

Then I downloaded the 2020 estimates from the American Community Survey for individual earnings (S2001) and family income (S1903) for the last 12 months, adjusted for inflation. I retrieved only the median earnings of individuals 16 years and older and the median income of families per zip code.

# INCOME per zip code --------------------------------------------------

# S2001 EARNINGS IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS)
# Estimate!!Total!!Population 16 years and over with earnings!!Median earnings (dollars)

S2001 <- read_csv("data/S2001/S2001.csv", 
                  col_select = c(NAME,S2001_C01_002E),
                  col_types = list(S2001_C01_002E = col_double())) %>%
  mutate(NAME = substr(NAME,7,11)) %>% 
  rename(ZIP = NAME, ind_med_earnings = S2001_C01_002E)
attr(S2001,'spec') <- NULL
S2001 <- S2001[-1,]
   

# S1903 MEDIAN INCOME IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS)
# Estimate!!Median income (dollars)!!FAMILIES!!Families

S1903 <- read_csv("data/S1903/S1903.csv",
                  col_select = c(NAME, S1903_C03_015E),
                  col_types = list(S1903_C03_015E = col_double())) %>% 
  mutate(NAME = substr(NAME,7,11)) %>% 
  rename(ZIP = NAME, fam_med_income = S1903_C03_015E)
attr(S1903,'spec') <- NULL
S1903 <- S1903[-1,]

iee_data <- left_join(S2001,S1903)

I joined the electric station data to the income data. To keep track of the lack of existence of electric fuel stations I did not want to rely on counting the NAs, so I created a “has_e” column. I also added a quartile factor for family median income and individual median earnings.

qs <- quantile(iee_data$fam_med_income, na.rm = T)
qs_lbls <- sapply(1:4,function(i) paste0("Q",i," ",scales::dollar(qs[i])," to ",scales::dollar(qs[i+1])))
iee_data$fam_quant <- cut(iee_data$fam_med_income, qs, include.lowest = T, labels = qs_lbls)

qs <- quantile(iee_data$ind_med_earnings, na.rm = T)
qs_lbls <- sapply(1:4,function(i) paste0("Q",i," ",scales::dollar(qs[i])," to ",scales::dollar(qs[i+1])))
iee_data$ind_quant <- cut(iee_data$ind_med_earnings, qs, include.lowest = T, labels = qs_lbls)

rm(qs,qs_lbls)

iee_data <- iee_data %>% 
  left_join(stations, by = c("ZIP")) %>% 
  mutate(has_e = ifelse(is.na(e_stations),"no","yes"))

Visualizing and interpreting the data

Zip code incomes and electric fuel stations

These density charts confirm my hunch. The distribution of zip codes with electric stations shifts to the right compared to those without. In other words, the median incomes/earnings of zip codes with at least one electric station is higher than those without.

iee_medians <- iee_data %>% 
  group_by(has_e) %>% 
  summarise(fam_m = median(fam_med_income, na.rm = T),
            ind_m = median(ind_med_earnings, na.rm = T))

iee_data %>% 
  filter(!is.na(fam_med_income)) %>% 
  ggplot(aes(fam_med_income)) + 
  geom_density(aes(fill = has_e, color = has_e), alpha = 1/3) +
  geom_vline(data = iee_medians, aes(xintercept = fam_m, color = has_e), size = 0.5) +
  geom_text(data = iee_medians, 
            aes(label = scales::dollar(fam_m), x = fam_m, y = 4e-5), 
            angle = 90, size = 3) +
  scale_y_continuous(limits = c(0,6e-5), name = "density") +
  scale_x_continuous(labels = scales::dollar_format(), name = LBL_FAM) +
  scale_color_discrete(name = LBL_ES) +
  scale_fill_discrete(name = LBL_ES) +
  theme_minimal()

iee_data %>% 
  filter(!is.na(ind_med_earnings)) %>% 
  ggplot(aes(ind_med_earnings)) + 
  geom_density(aes(fill = has_e, color = has_e), alpha = 1/3) +
  geom_vline(data = iee_medians, aes(xintercept = ind_m, color = has_e), size = 0.5) +
  geom_text(data = iee_medians, 
            aes(label = scales::dollar(ind_m), x = ind_m, y = 1e-5), 
            angle = 90, size = 3) +
  scale_y_continuous(limits = c(0,6e-5), name = "density") +
  scale_x_continuous(labels = scales::dollar_format(), name = LBL_IND) +
  scale_color_discrete(name = LBL_ES) +
  scale_fill_discrete(name = LBL_ES) +
  theme_minimal()

I plotted the histograms just to see how the quantity of zip codes with and without stations compared, since this is lost in the density charts. As expected, there are more zip codes without stations.

iee_data %>% 
  filter(!is.na(fam_med_income)) %>% 
  ggplot(aes(fam_med_income, fill = has_e, color = has_e)) + 
  geom_histogram(alpha = .2, binwidth = 2000) +
  scale_y_continuous(limits = c(0,3000), name = LBL_ZCC) +
  scale_x_continuous(labels = scales::dollar_format(), name = LBL_FAM) +
  scale_color_discrete(name = LBL_ES) +
  scale_fill_discrete(name = LBL_ES) +
  theme_minimal()

iee_data %>% 
  filter(!is.na(ind_med_earnings)) %>% 
  ggplot(aes(ind_med_earnings, fill = has_e, color = has_e)) + 
  geom_histogram(alpha = .2, binwidth = 2000) +
  scale_y_continuous(limits = c(0,3000), name = LBL_ZCC) +
  scale_x_continuous(labels = scales::dollar_format(), name = LBL_IND) +
  scale_color_discrete(name = LBL_ES) +
  scale_fill_discrete(name = LBL_ES) +
  theme_minimal()

E station proportions according to income quartiles

To dig deeper into how lower income zip codes compare to higher income ones, I calculated the proportion of zip codes with e stations in each income quartile. The difference here is more telling. The proportion of zip codes with e stations is 17% in the bottom family median income quartile and 47% in the highest, which means zip codes with family median incomes in the top 25% are 2.76 times more likely to have an e station than those in the bottom 25%. In other words, for every low income zip code, there are 2.76 high income ones that have at least one e station.

fam_prop <- iee_data %>% 
  filter(!is.na(fam_med_income)) %>% 
  group_by(has_e,fam_quant) %>%  
  summarise(freq = n()) %>% 
  group_by(fam_quant) %>% 
  mutate(prop = round(freq/sum(freq),2)) %>%
  ungroup()

ind_prop <- iee_data %>% 
  filter(!is.na(ind_med_earnings)) %>% 
  group_by(has_e,ind_quant) %>%  
  summarise(freq = n()) %>% 
  group_by(ind_quant) %>% 
  mutate(prop = round(freq/sum(freq),2)) %>%
  ungroup()

iee_data %>% 
  filter(!is.na(fam_med_income)) %>% 
  group_by(has_e) %>% 
  ggplot(aes(fam_quant, fill = has_e)) +
  geom_bar(alpha = 3/4) +
  geom_text(data = fam_prop, aes(label = scales::percent(prop), x = fam_quant, y = c(rep(5000,4),rep(1000,4)))) +
  scale_x_discrete(labels = c("Q1","Q2","Q3","Q4"), name = paste(LBL_FAM,"quartiles")) +
  scale_y_continuous(name = LBL_ZCC) +
  scale_fill_discrete(name = LBL_ES) +
  theme_minimal()

iee_data %>% 
  filter(!is.na(ind_med_earnings)) %>% 
  group_by(has_e) %>% 
  ggplot(aes(ind_quant, fill = has_e)) +
  geom_bar(alpha = 3/4) +
  geom_text(data = ind_prop, aes(label = scales::percent(prop), x = ind_quant, y = c(rep(5000,4),rep(1000,4)))) +
  scale_x_discrete(labels = c("Q1","Q2","Q3","Q4"), name = paste(LBL_IND,"quartiles")) +
  scale_y_continuous(name = LBL_ZCC) +
  scale_fill_discrete(name = LBL_ES) +
  theme_minimal()

Each quartile is made up of a quarter of all zip code incomes, organized from lowest to largest such that Q1 contains the bottom 25% and Q4 contains the top 25%.

e family median income quartiles n %
no Q1 $5,313 to $57,045 6229 0.83
yes Q1 $5,313 to $57,045 1288 0.17
no Q2 $57,045 to $70,682 5843 0.78
yes Q2 $57,045 to $70,682 1673 0.22
no Q3 $70,682 to $88,548.50 5456 0.73
yes Q3 $70,682 to $88,548.50 2058 0.27
no Q4 $88,548.50 to $249,489 3972 0.53
yes Q4 $88,548.50 to $249,489 3544 0.47
e individual median earnings quartiles n %
no Q1 $2,588 to $28,750 6145 0.79
yes Q1 $2,588 to $28,750 1619 0.21
no Q2 $28,750 to $33,913 5740 0.75
yes Q2 $28,750 to $33,913 1901 0.25
no Q3 $33,913 to $41,017.25 5734 0.74
yes Q3 $33,913 to $41,017.25 1967 0.26
no Q4 $41,017.25 to $246,151 4490 0.58
yes Q4 $41,017.25 to $246,151 3212 0.42

Quantity of stations per zip code and income/earnings

I learned these heatmaps are not good visuals to interpret the relationship between stations and zip code income. They tell more a story of the distribution of income than of the relationship between station/income. They are good to visualize the quantity of stations per zip code and see the ourliers. It’s worth noting that zip code 94025, Menlo Park, in California’s Silicon Valley has a whopping 359 electric fuel stations and a median family income of $204,557!!

iee_data %>% 
  filter(!is.na(fam_med_income),!is.na(e_stations)) %>% 
  ggplot(aes(fam_med_income, e_stations)) +
  geom_bin_2d(bins = 100) +
  scale_x_continuous(labels = scales::dollar_format(), name = LBL_FAM) +
  scale_y_continuous(name = "e station count") +
  scale_fill_continuous(type = "viridis", name = LBL_ZCC)  +
  theme_minimal()

iee_data %>% 
  filter(!is.na(ind_med_earnings),!is.na(e_stations)) %>% 
  ggplot(aes(ind_med_earnings, e_stations)) +
  geom_bin_2d(bins = 100) +
  scale_x_continuous(labels = scales::dollar_format(), name = LBL_IND) +
  scale_y_continuous(name = "e station count") +
  scale_fill_continuous(type = "viridis", name = LBL_ZCC)  +
  theme_minimal()

Individual earnings and family income

In the charts above, I was surprised by how different the relationship between income and stations looks depending on whether you use family median income or individual median earnings. To see why, I plotted the density chart comparing these two. As expected, the median income of families, is higher and more spread out than those of individual earnings. The U.S. census explains the difference between income and earnings.. I’m shocked that the family median income is over $70,000 (earnings/income for the past 12 months in 2020 inflation-adjusted dollars).

iee_medians <- tibble(type = c("ind_med_earnings","fam_med_income"),
                      amount = c(median(iee_data$ind_med_earnings, na.rm = T), 
                                 median(iee_data$fam_med_income, na.rm = T)))

iee_data %>% 
  select(ZIP,ind_med_earnings, fam_med_income) %>% 
  pivot_longer(cols = 2:3, names_to = "type", values_to = "amount") %>% 
  ggplot(aes(amount)) +
  geom_density(aes(fill = type, color = type), alpha = 1/3) +
  geom_vline(data = iee_medians, aes(xintercept = amount, color = type), size = 0.5) +
  geom_text(data = iee_medians, 
            aes(label = scales::dollar(amount), x = amount, y = 1e-5), 
            angle = 90, size = 3) +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(name = "density") +
  theme_minimal()

Conclusion

Which income indicator you use matters. There is indeed a relationship between zip code income/earnings and electric fuel stations. The lesson, however, is more about which indicator to use.

This project was done as an exercise to learn R, Rmarkdown, and the basics of joining data from different sources. In that regard, the biggest challenge was finding and cleaning U.S. Census data.


Get data and code at github.com/guasi/e_stations_rmd