On March 1, 2022, Tidy Tuesday presented the alternative fuel stations database as the weekly challenge. The alternative fuel data center at the U.S. Department of Energy has great tools to visualize the availability, spread, and evolution of alternative fuel stations. Instead of focusing on the stations themselves, I thought this would be a good opportunity to combine this nicely kept database with U.S. Census data. I was curious about the relationship between the zip code median income and the availability of electric fueling stations.
First, I got a count per zip code of electric fuel, available, and public stations.
# STATIONS per zip code -----------------------------------------------
<- readRDS("data/stations.rds")
stations <- stations %>%
stations filter(STATUS_CODE == "E", #Available
== "ELEC", #Electric
FUEL_TYPE_CODE == "public") %>% #Public
ACCESS_CODE group_by(ZIP) %>%
summarize(e_stations = n())
Then I downloaded the 2020 estimates from the American Community Survey for individual earnings (S2001) and family income (S1903) for the last 12 months, adjusted for inflation. I retrieved only the median earnings of individuals 16 years and older and the median income of families per zip code.
# INCOME per zip code --------------------------------------------------
# S2001 EARNINGS IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS)
# Estimate!!Total!!Population 16 years and over with earnings!!Median earnings (dollars)
<- read_csv("data/S2001/S2001.csv",
S2001 col_select = c(NAME,S2001_C01_002E),
col_types = list(S2001_C01_002E = col_double())) %>%
mutate(NAME = substr(NAME,7,11)) %>%
rename(ZIP = NAME, ind_med_earnings = S2001_C01_002E)
attr(S2001,'spec') <- NULL
<- S2001[-1,]
S2001
# S1903 MEDIAN INCOME IN THE PAST 12 MONTHS (IN 2020 INFLATION-ADJUSTED DOLLARS)
# Estimate!!Median income (dollars)!!FAMILIES!!Families
<- read_csv("data/S1903/S1903.csv",
S1903 col_select = c(NAME, S1903_C03_015E),
col_types = list(S1903_C03_015E = col_double())) %>%
mutate(NAME = substr(NAME,7,11)) %>%
rename(ZIP = NAME, fam_med_income = S1903_C03_015E)
attr(S1903,'spec') <- NULL
<- S1903[-1,]
S1903
<- left_join(S2001,S1903) iee_data
I joined the electric station data to the income data. To keep track of the lack of existence of electric fuel stations I did not want to rely on counting the NAs, so I created a “has_e” column. I also added a quartile factor for family median income and individual median earnings.
<- quantile(iee_data$fam_med_income, na.rm = T)
qs <- sapply(1:4,function(i) paste0("Q",i," ",scales::dollar(qs[i])," to ",scales::dollar(qs[i+1])))
qs_lbls $fam_quant <- cut(iee_data$fam_med_income, qs, include.lowest = T, labels = qs_lbls)
iee_data
<- quantile(iee_data$ind_med_earnings, na.rm = T)
qs <- sapply(1:4,function(i) paste0("Q",i," ",scales::dollar(qs[i])," to ",scales::dollar(qs[i+1])))
qs_lbls $ind_quant <- cut(iee_data$ind_med_earnings, qs, include.lowest = T, labels = qs_lbls)
iee_data
rm(qs,qs_lbls)
<- iee_data %>%
iee_data left_join(stations, by = c("ZIP")) %>%
mutate(has_e = ifelse(is.na(e_stations),"no","yes"))
These density charts confirm my hunch. The distribution of zip codes with electric stations shifts to the right compared to those without. In other words, the median incomes/earnings of zip codes with at least one electric station is higher than those without.
<- iee_data %>%
iee_medians group_by(has_e) %>%
summarise(fam_m = median(fam_med_income, na.rm = T),
ind_m = median(ind_med_earnings, na.rm = T))
%>%
iee_data filter(!is.na(fam_med_income)) %>%
ggplot(aes(fam_med_income)) +
geom_density(aes(fill = has_e, color = has_e), alpha = 1/3) +
geom_vline(data = iee_medians, aes(xintercept = fam_m, color = has_e), size = 0.5) +
geom_text(data = iee_medians,
aes(label = scales::dollar(fam_m), x = fam_m, y = 4e-5),
angle = 90, size = 3) +
scale_y_continuous(limits = c(0,6e-5), name = "density") +
scale_x_continuous(labels = scales::dollar_format(), name = LBL_FAM) +
scale_color_discrete(name = LBL_ES) +
scale_fill_discrete(name = LBL_ES) +
theme_minimal()
%>%
iee_data filter(!is.na(ind_med_earnings)) %>%
ggplot(aes(ind_med_earnings)) +
geom_density(aes(fill = has_e, color = has_e), alpha = 1/3) +
geom_vline(data = iee_medians, aes(xintercept = ind_m, color = has_e), size = 0.5) +
geom_text(data = iee_medians,
aes(label = scales::dollar(ind_m), x = ind_m, y = 1e-5),
angle = 90, size = 3) +
scale_y_continuous(limits = c(0,6e-5), name = "density") +
scale_x_continuous(labels = scales::dollar_format(), name = LBL_IND) +
scale_color_discrete(name = LBL_ES) +
scale_fill_discrete(name = LBL_ES) +
theme_minimal()
I plotted the histograms just to see how the quantity of zip codes with and without stations compared, since this is lost in the density charts. As expected, there are more zip codes without stations.
%>%
iee_data filter(!is.na(fam_med_income)) %>%
ggplot(aes(fam_med_income, fill = has_e, color = has_e)) +
geom_histogram(alpha = .2, binwidth = 2000) +
scale_y_continuous(limits = c(0,3000), name = LBL_ZCC) +
scale_x_continuous(labels = scales::dollar_format(), name = LBL_FAM) +
scale_color_discrete(name = LBL_ES) +
scale_fill_discrete(name = LBL_ES) +
theme_minimal()
%>%
iee_data filter(!is.na(ind_med_earnings)) %>%
ggplot(aes(ind_med_earnings, fill = has_e, color = has_e)) +
geom_histogram(alpha = .2, binwidth = 2000) +
scale_y_continuous(limits = c(0,3000), name = LBL_ZCC) +
scale_x_continuous(labels = scales::dollar_format(), name = LBL_IND) +
scale_color_discrete(name = LBL_ES) +
scale_fill_discrete(name = LBL_ES) +
theme_minimal()
To dig deeper into how lower income zip codes compare to higher income ones, I calculated the proportion of zip codes with e stations in each income quartile. The difference here is more telling. The proportion of zip codes with e stations is 17% in the bottom family median income quartile and 47% in the highest, which means zip codes with family median incomes in the top 25% are 2.76 times more likely to have an e station than those in the bottom 25%. In other words, for every low income zip code, there are 2.76 high income ones that have at least one e station.
<- iee_data %>%
fam_prop filter(!is.na(fam_med_income)) %>%
group_by(has_e,fam_quant) %>%
summarise(freq = n()) %>%
group_by(fam_quant) %>%
mutate(prop = round(freq/sum(freq),2)) %>%
ungroup()
<- iee_data %>%
ind_prop filter(!is.na(ind_med_earnings)) %>%
group_by(has_e,ind_quant) %>%
summarise(freq = n()) %>%
group_by(ind_quant) %>%
mutate(prop = round(freq/sum(freq),2)) %>%
ungroup()
%>%
iee_data filter(!is.na(fam_med_income)) %>%
group_by(has_e) %>%
ggplot(aes(fam_quant, fill = has_e)) +
geom_bar(alpha = 3/4) +
geom_text(data = fam_prop, aes(label = scales::percent(prop), x = fam_quant, y = c(rep(5000,4),rep(1000,4)))) +
scale_x_discrete(labels = c("Q1","Q2","Q3","Q4"), name = paste(LBL_FAM,"quartiles")) +
scale_y_continuous(name = LBL_ZCC) +
scale_fill_discrete(name = LBL_ES) +
theme_minimal()
%>%
iee_data filter(!is.na(ind_med_earnings)) %>%
group_by(has_e) %>%
ggplot(aes(ind_quant, fill = has_e)) +
geom_bar(alpha = 3/4) +
geom_text(data = ind_prop, aes(label = scales::percent(prop), x = ind_quant, y = c(rep(5000,4),rep(1000,4)))) +
scale_x_discrete(labels = c("Q1","Q2","Q3","Q4"), name = paste(LBL_IND,"quartiles")) +
scale_y_continuous(name = LBL_ZCC) +
scale_fill_discrete(name = LBL_ES) +
theme_minimal()
Each quartile is made up of a quarter of all zip code incomes, organized from lowest to largest such that Q1 contains the bottom 25% and Q4 contains the top 25%.
e | family median income quartiles | n | % |
---|---|---|---|
no | Q1 $5,313 to $57,045 | 6229 | 0.83 |
yes | Q1 $5,313 to $57,045 | 1288 | 0.17 |
no | Q2 $57,045 to $70,682 | 5843 | 0.78 |
yes | Q2 $57,045 to $70,682 | 1673 | 0.22 |
no | Q3 $70,682 to $88,548.50 | 5456 | 0.73 |
yes | Q3 $70,682 to $88,548.50 | 2058 | 0.27 |
no | Q4 $88,548.50 to $249,489 | 3972 | 0.53 |
yes | Q4 $88,548.50 to $249,489 | 3544 | 0.47 |
e | individual median earnings quartiles | n | % |
---|---|---|---|
no | Q1 $2,588 to $28,750 | 6145 | 0.79 |
yes | Q1 $2,588 to $28,750 | 1619 | 0.21 |
no | Q2 $28,750 to $33,913 | 5740 | 0.75 |
yes | Q2 $28,750 to $33,913 | 1901 | 0.25 |
no | Q3 $33,913 to $41,017.25 | 5734 | 0.74 |
yes | Q3 $33,913 to $41,017.25 | 1967 | 0.26 |
no | Q4 $41,017.25 to $246,151 | 4490 | 0.58 |
yes | Q4 $41,017.25 to $246,151 | 3212 | 0.42 |
I learned these heatmaps are not good visuals to interpret the relationship between stations and zip code income. They tell more a story of the distribution of income than of the relationship between station/income. They are good to visualize the quantity of stations per zip code and see the ourliers. It’s worth noting that zip code 94025, Menlo Park, in California’s Silicon Valley has a whopping 359 electric fuel stations and a median family income of $204,557!!
%>%
iee_data filter(!is.na(fam_med_income),!is.na(e_stations)) %>%
ggplot(aes(fam_med_income, e_stations)) +
geom_bin_2d(bins = 100) +
scale_x_continuous(labels = scales::dollar_format(), name = LBL_FAM) +
scale_y_continuous(name = "e station count") +
scale_fill_continuous(type = "viridis", name = LBL_ZCC) +
theme_minimal()
%>%
iee_data filter(!is.na(ind_med_earnings),!is.na(e_stations)) %>%
ggplot(aes(ind_med_earnings, e_stations)) +
geom_bin_2d(bins = 100) +
scale_x_continuous(labels = scales::dollar_format(), name = LBL_IND) +
scale_y_continuous(name = "e station count") +
scale_fill_continuous(type = "viridis", name = LBL_ZCC) +
theme_minimal()
In the charts above, I was surprised by how different the relationship between income and stations looks depending on whether you use family median income or individual median earnings. To see why, I plotted the density chart comparing these two. As expected, the median income of families, is higher and more spread out than those of individual earnings. The U.S. census explains the difference between income and earnings.. I’m shocked that the family median income is over $70,000 (earnings/income for the past 12 months in 2020 inflation-adjusted dollars).
<- tibble(type = c("ind_med_earnings","fam_med_income"),
iee_medians amount = c(median(iee_data$ind_med_earnings, na.rm = T),
median(iee_data$fam_med_income, na.rm = T)))
%>%
iee_data select(ZIP,ind_med_earnings, fam_med_income) %>%
pivot_longer(cols = 2:3, names_to = "type", values_to = "amount") %>%
ggplot(aes(amount)) +
geom_density(aes(fill = type, color = type), alpha = 1/3) +
geom_vline(data = iee_medians, aes(xintercept = amount, color = type), size = 0.5) +
geom_text(data = iee_medians,
aes(label = scales::dollar(amount), x = amount, y = 1e-5),
angle = 90, size = 3) +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(name = "density") +
theme_minimal()
Which income indicator you use matters. There is indeed a relationship between zip code income/earnings and electric fuel stations. The lesson, however, is more about which indicator to use.
This project was done as an exercise to learn R, Rmarkdown, and the basics of joining data from different sources. In that regard, the biggest challenge was finding and cleaning U.S. Census data.
Get data and code at github.com/guasi/e_stations_rmd