In this tutorial, we will first introduce United States Census data and give an overview of how to access and use it. The later part covers how to access and extract Census data in R using the package tidycensus. We will also briefly cover downloading shapefiles with the R package tigris.
Part of this section was derived from the online book Analyzing US Census Data: Methods, Maps, and Models in R section 1 and 2.
The decennial US Census is intended to be a complete enumeration of the US population to assist with apportionment, which refers to the balanced arrangement of Congressional districts to ensure appropriate representation in the United States House of Representatives. It asks a limited set of questions on race, ethnicity, age, sex, and housing tenure.
The American Community Survey (ACS) is now the premier source of detailed demographic information about the US population. The ACS is mailed to approximately 3.5 million households per year (representing around 3 percent of the US population), allowing for annual data updates. The Census Bureau releases two ACS datasets to the public: the 1-year ACS, which covers areas of population 65,000 and greater, and the 5-year ACS, which is a moving average of data over a 5-year period that covers geographies down to the Census block group. ACS data are distinct from decennial Census data in that they represent estimates rather than precise counts, and in turn are characterized by margins of error around those estimates.
Aggregate data from the decennial US Census, American Community Survey, and other Census surveys are made available to the public at different enumeration units. Enumeration units are geographies at which Census data are tabulated. They include both legal entities such as states and counties, and statistical entities that are not official jurisdictions but used to standardize data tabulation. The smallest unit at which data are made available from the decennial US Census is the block, and the smallest unit available in the ACS is the block group, which represents a collection of blocks. Other surveys are generally available at higher levels of aggregation.
Enumeration units represent different levels of the Census hierarchy. This hierarchy is summarized in the graphic below (from https://www.census.gov/programs-surveys/geography/guidance/hierarchy.html).
The central axis of the diagram represents the central Census hierarchy of enumeration units, as each geography from Census blocks all the way up to the nation nests within its parent unit. This means that block groups are fully composed of Census blocks, Census tracts are fully composed of block groups, and so forth. An example of nesting is shown in the graphic below for 2020 Census tracts in Harris County, Texas.
The plot illustrates how Census tracts in Harris County neatly nest within their parent geography, the county. This means that the sum of Census data for Census tracts in Harris County will equal Harris County's published total at the county level.
Reviewing the diagram shows that some Census geographies, like congressional districts, only nest within states, and that some other geographies do not nest within any parent geography at all. A good example of this is the Zip Code Tabulation Area (ZCTA), which is used by the Census Bureau to represent postal code geographies in the US. A brief illustration is represented by the graphic below.
There is a data download interface available, where users can interactively search several Census Bureau datasets (the decennial Census & ACS, along with the Economic Census, Population Estimates Program, and others), generate custom queries by geography, and download data extracts.
Census APIs are characterized by an API endpoint, which is a base web address for a given Census dataset, and a query, which customizes the data returned from the API. In most cases, users will interact with the Census API through software libraries that offer simplified programmatic access to the API’s data resources. The example covered in this tutorial is the R package tidycensus (K. Walker and Herman 2021). Users of the Census API through these software libraries will require a Census API key, which is free and fast to acquire.
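To make the endpoint-plus-query structure concrete, here is a hedged sketch of querying the 2010 Summary File 1 API directly from R with the httr and jsonlite packages (the endpoint and query parameters follow the Census API's documented pattern; `P001001` is total population, as used later in this tutorial):

```r
library(httr)
library(jsonlite)

# Base endpoint for the 2010 decennial Census Summary File 1 API
endpoint <- "https://api.census.gov/data/2010/dec/sf1"

# Query: request total population (P001001) and area name for every state
resp <- GET(endpoint, query = list(get = "P001001,NAME", `for` = "state:*"))
result <- fromJSON(content(resp, as = "text"))

head(result) # first row holds the column headers, subsequent rows the data
```

Software libraries like tidycensus construct exactly this kind of request behind the scenes, then reshape the raw JSON response into an analysis-ready data frame.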
The tidycensus package (K. Walker and Herman 2021), first released in 2017, is an R package designed to facilitate the process of acquiring and working with US Census Bureau population data in the R environment. The package has two distinct goals. First, tidycensus aims to make Census data available to R users in a tidyverse-friendly format, helping kick-start the process of generating insights from US Census data. Second, the package is designed to streamline the data wrangling process for spatial Census data analysts. With tidycensus, R users can request geometry along with attributes for their Census data, helping facilitate mapping and spatial analysis.
The US Census Bureau makes a wide range of datasets available to the user community through their APIs and other data download resources. tidycensus is not a comprehensive portal to these data resources; instead, it focuses on a select number of datasets implemented in a series of core functions. These core functions in tidycensus include:
- `get_decennial()`, which requests data from the US Decennial Census APIs for 2000, 2010, and 2020.
- `get_acs()`, which requests data from the 1-year and 5-year American Community Survey samples. Data are available from the 1-year ACS back to 2005 and the 5-year ACS back to 2005-2009.
- `get_estimates()`, an interface to the Population Estimates APIs. These datasets include yearly estimates of population characteristics by state, county, and metropolitan area, along with components-of-change demographic estimates like births, deaths, and migration rates.
- `get_pums()`, which accesses data from the ACS Public Use Microdata Sample APIs. These samples include anonymized individual-level records from the ACS organized by household and are highly useful for many different social science analyses. `get_pums()` is covered in more depth in Chapters 9 and 10 of the source book.
- `get_flows()`, an interface to the ACS Migration Flows APIs, which includes information on in- and out-flows from various geographies for the 5-year ACS samples, enabling origin-destination analyses.
To get started with tidycensus, users should install the package with `install.packages("tidycensus")` if not yet installed, load it with `library(tidycensus)`, and set their Census API key with the `census_api_key()` function. API keys can be obtained at https://api.census.gov/data/key_signup.html. After you've signed up for an API key, be sure to activate the key from the email you receive from the Census Bureau so it works correctly. Declaring `install = TRUE` when calling `census_api_key()` will install the key for use in future R sessions, which may be convenient for many users.
```r
# remotes::install_github("walkerke/tidycensus", force = TRUE)
library(tidycensus)
# census_api_key("YOUR KEY GOES HERE", install = TRUE)
```
Let's start with `get_decennial()`, which is used to access decennial Census data from the 2000, 2010, and 2020 decennial US Censuses. To get data from the decennial US Census, users must specify a string representing the requested `geography`; a vector of Census variable IDs, passed to `variables`; or optionally a Census table ID, passed to `table`. The code below gets data on total population by state from the 2010 decennial Census.
```r
total_population_10 <- get_decennial(
  geography = "state",
  variables = "P001001",
  year = 2010
)
#> Getting data from the 2010 decennial Census
#> Using Census Summary File 1

total_population_10
```
The function returns a tibble of data from the 2010 US Census (the function's default year) with information on total population by state, and assigns it to the object `total_population_10`. Data for 2000 or 2020 can also be obtained by supplying the appropriate year to the `year` parameter.
By default, `get_decennial()` uses the argument `sumfile = "sf1"`, which fetches data from the decennial Census Summary File 1. This summary file exists for the 2000 and 2010 decennial US Censuses, and includes core demographic characteristics for Census geographies. The 2000 and 2010 decennial Census data also include Summary File 2, which contains information on a range of population and housing unit characteristics and is specified as `"sf2"`. Detailed demographic information in the 2000 decennial Census, such as income and occupation, can be found in Summary Files 3 (`"sf3"`) and 4 (`"sf4"`). Data from the 2000 and 2010 Decennial Censuses for island territories other than Puerto Rico must be accessed via their corresponding summary files: `"as"` for American Samoa, `"mp"` for the Northern Mariana Islands, `"gu"` for Guam, and `"vi"` for the US Virgin Islands.
2020 Decennial Census data are available from the PL 94-171 Redistricting summary file, which is specified with `sumfile = "pl"` and is also available for 2010. The Redistricting summary files include a limited subset of variables from the decennial US Census to be used for legislative redistricting. These variables include total population and housing units; race and ethnicity; voting-age population; and group quarters population. For example, the code below retrieves information on the American Indian & Alaska Native population by state from the 2020 decennial Census.
```r
aian_2020 <- get_decennial(
  geography = "state",
  variables = "P1_005N",
  year = 2020,
  sumfile = "pl"
)
#> Getting data from the 2020 decennial Census
#> Using the PL 94-171 Redistricting Data summary file

aian_2020
```
The argument `sumfile = "pl"` is assumed (and in turn not required) when users request data for 2020, and will remain so until the main Demographic and Housing Characteristics File is released in mid-to-late 2022.
For many geographies, tidycensus supports more granular requests that are subsetted by state or even by county, if supported by the API. For example, we can use the `state` parameter to subset county-level population for Texas:
```r
total_population_10_TX <- get_decennial(
  geography = "county",
  variables = "P001001",
  state = "Texas",
  year = 2010
)
#> Getting data from the 2010 decennial Census
#> Using FIPS code '48' for state 'Texas'
#> Using Census Summary File 1

total_population_10_TX
```
Similarly, `get_acs()` retrieves data from the American Community Survey. As discussed in the previous section, the ACS includes a wide variety of variables detailing characteristics of the US population not found in the decennial Census. The example below fetches data on the number of residents born in Mexico by state.
```r
born_in_mexico <- get_acs(
  geography = "state",
  variables = "B05006_150",
  year = 2020
)
#> Getting data from the 2016-2020 5-year ACS

born_in_mexico
```
If the year is not specified, `get_acs()` defaults to the most recent five-year ACS sample, which at the time of this writing is 2016-2020. The data returned are similar in structure to those returned by `get_decennial()`, but include an `estimate` column (for the ACS estimate) and a `moe` column (for the margin of error around that estimate) instead of a `value` column. Different years and different surveys are available by adjusting the `year` and `survey` parameters. `survey` defaults to the 5-year ACS; however, this can be changed to the 1-year ACS with the argument `survey = "acs1"`. For example, the following code will fetch data from the 1-year ACS for 2019:
```r
born_in_mexico_1yr <- get_acs(
  geography = "state",
  variables = "B05006_150",
  survey = "acs1",
  year = 2019
)
#> Getting data from the 2019 1-year ACS
#> The 1-year ACS provides data for geographies with populations of 65,000 and greater.
```
Note the differences between the 5-year ACS estimates and the 1-year ACS estimates shown. For states with larger Mexican-born populations like Arizona, California, and Colorado, the 1-year ACS data will represent the most up-to-date estimates, albeit characterized by larger margins of error relative to their estimates. For states with smaller Mexican-born populations like Alabama, Alaska, and Arkansas, however, the estimate returns `NA`, R's notation representing missing data. If you encounter this in your data's `estimate` column, it will generally mean that the estimate is too small for a given geography to be deemed reliable by the Census Bureau. In this case, only the states with the largest Mexican-born populations have data available for that variable in the 1-year ACS, meaning that the 5-year ACS should be used to make full state-wise comparisons if desired.
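As a quick check, a minimal dplyr sketch (assuming the `born_in_mexico_1yr` object created above) lists the states whose 1-year estimate was suppressed:

```r
library(dplyr)

# States where the 1-year ACS estimate for B05006_150 is missing (NA)
born_in_mexico_1yr %>%
  filter(is.na(estimate)) %>%
  pull(NAME)
```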
Variables from the ACS detailed tables, data profiles, summary tables, comparison profile, and supplemental estimates are available through tidycensus's `get_acs()` function; the function will auto-detect the dataset in which to look for variables based on their names. Alternatively, users can supply a table name to the `table` parameter in `get_acs()`; this will return data for every variable in that table. For example, to get all variables associated with table B01001, which covers sex broken down by age, from the 2016-2020 5-year ACS:
```r
age_table <- get_acs(
  geography = "state",
  table = "B01001",
  year = 2020
)
#> Getting data from the 2016-2020 5-year ACS

age_table
```
To find all of the variables associated with a given ACS table, tidycensus downloads a dataset of variables from the Census Bureau website and looks up the variable codes for download. If the `cache_table` parameter is set to `TRUE`, the function instructs tidycensus to cache this dataset on the user's computer for faster future access. This only needs to be done once per ACS or Census dataset if the user would like to specify this option.
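For example, a cached version of the same B01001 request would add one argument (a sketch; the first run downloads and caches the variable lookup, subsequent runs reuse it):

```r
# Cache the B01001 variable lookup locally so future requests are faster
age_table <- get_acs(
  geography = "state",
  table = "B01001",
  year = 2020,
  cache_table = TRUE
)
```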
The `geography` parameter in `get_acs()` and `get_decennial()` allows users to request data aggregated to common Census enumeration units. At the time of this writing, tidycensus accepts enumeration units nested within states and/or counties, when applicable. Census blocks are available in `get_decennial()` but not in `get_acs()`, as block-level data are not available from the American Community Survey. To request data within states and/or counties, state and county names can be supplied to the `state` and `county` parameters, respectively. Arguments should be formatted in the way that they are accepted by the US Census Bureau API, specified in the table below. If an "Available by" geography is in bold, that argument is required for that geography.
The only geographies available in 2000 are `"state"`, `"county"`, `"county subdivision"`, `"tract"`, `"block group"`, and `"place"`. Some geographies available from the Census API are not available in tidycensus at the moment, as they require more complex hierarchy specification than the package supports, and not all variables are available at every geography.
| Geography | Definition | Available by | Available in |
|---|---|---|---|
| `"us"` | United States | | `get_acs()`, `get_decennial()`, `get_estimates()` |
| `"region"` | Census region | | `get_acs()`, `get_decennial()`, `get_estimates()` |
| `"division"` | Census division | | `get_acs()`, `get_decennial()`, `get_estimates()` |
| `"state"` | State or equivalent | state | `get_acs()`, `get_decennial()`, `get_estimates()`, `get_flows()` |
| `"county"` | County or equivalent | state, county | `get_acs()`, `get_decennial()`, `get_estimates()`, `get_flows()` |
| `"county subdivision"` | County subdivision | **state**, county | `get_acs()`, `get_decennial()`, `get_estimates()`, `get_flows()` |
| `"tract"` | Census tract | **state**, county | `get_acs()`, `get_decennial()` |
| `"block group"` | Census block group | **state**, county | `get_acs()` (2013-), `get_decennial()` |
| `"block"` | Census block | **state**, **county** | `get_decennial()` |
| `"place"` | Census-designated place | state | `get_acs()`, `get_decennial()`, `get_estimates()` |
| `"alaska native regional corporation"` | Alaska native regional corporation | state | `get_acs()`, `get_decennial()` |
| `"american indian area/alaska native area/hawaiian home land"` | Federal and state-recognized American Indian reservations and Hawaiian home lands | state | `get_acs()`, `get_decennial()` |
| `"american indian area/alaska native area (reservation or statistical entity only)"` | Only reservations and statistical entities | state | `get_acs()`, `get_decennial()` |
| `"american indian area (off-reservation trust land only)/hawaiian home land"` | Only off-reservation trust lands and Hawaiian home lands | state | `get_acs()` |
| `"metropolitan statistical area/micropolitan statistical area"` OR `"cbsa"` | Core-based statistical area | state | `get_acs()`, `get_decennial()`, `get_estimates()`, `get_flows()` |
| `"combined statistical area"` | Combined statistical area | state | `get_acs()`, `get_decennial()`, `get_estimates()` |
| `"new england city and town area"` | New England city/town area | state | `get_acs()`, `get_decennial()` |
| `"combined new england city and town area"` | Combined New England area | state | `get_acs()`, `get_decennial()` |
| `"urban area"` | Census-defined urbanized areas | | `get_acs()`, `get_decennial()` |
| `"congressional district"` | Congressional district for the year-appropriate Congress | state | `get_acs()`, `get_decennial()` |
| `"school district (elementary)"` | Elementary school district | state | `get_acs()`, `get_decennial()` |
| `"school district (secondary)"` | Secondary school district | state | `get_acs()`, `get_decennial()` |
| `"school district (unified)"` | Unified school district | state | `get_acs()`, `get_decennial()` |
| `"public use microdata area"` | PUMA (geography associated with Census microdata samples) | state | `get_acs()` |
| `"zip code tabulation area"` OR `"zcta"` | Zip code tabulation area | state | `get_acs()`, `get_decennial()` |
| `"state legislative district (upper chamber)"` | State senate districts | **state** | `get_acs()`, `get_decennial()` |
| `"state legislative district (lower chamber)"` | State house districts | **state** | `get_acs()`, `get_decennial()` |
| `"voting district"` | Voting districts (2020 only) | **state** | `get_decennial()` |
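As an illustration of the state and county arguments, the following sketch requests tract-level data within a single county (`B01003_001`, total population, is the same variable used in the CBSA example below):

```r
# Census tract population for Harris County, Texas;
# tract-level requests are subsetted by state (and optionally county)
harris_tracts <- get_acs(
  geography = "tract",
  variables = "B01003_001",
  state = "TX",
  county = "Harris",
  year = 2020
)
```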
The `geography` parameter must be typed exactly as specified in the table above to request data correctly from the Census API; use the table as a reference, and copy-paste the longer strings. For core-based statistical areas and zip code tabulation areas, two heavily-requested geographies, the aliases `"cbsa"` and `"zcta"` can be used, respectively, to fetch data for those geographies.
```r
cbsa_population <- get_acs(
  geography = "cbsa",
  variables = "B01003_001",
  year = 2020
)
#> Getting data from the 2016-2020 5-year ACS
```
One additional challenge when searching for Census variables is understanding variable IDs, which are required to fetch data from the Census and ACS APIs. There are thousands of variables available across the different datasets and summary files. To make searching easier for R users, tidycensus offers the `load_variables()` function. This function obtains a dataset of variables from the Census Bureau website and formats it for fast searching, ideally in RStudio.
The function takes two required arguments: `year`, which takes the year or endyear of the Census dataset or ACS sample, and `dataset`, which references the dataset name. For the 2000 or 2010 Decennial Census, use `"sf1"` or `"sf2"` as the dataset name to access variables from Summary Files 1 and 2, respectively. The 2000 Decennial Census also accepts `"sf3"` and `"sf4"` for Summary Files 3 and 4. For 2020, the only dataset supported at the time of this writing is `"pl"` for the PL 94-171 Redistricting dataset; more datasets will be supported as the 2020 Census data are released. An example request would look like `load_variables(year = 2020, dataset = "pl")` for variables from the 2020 Decennial Census Redistricting data.
For variables from the American Community Survey, users should specify the dataset as `"acs1"` for the 1-year ACS or `"acs5"` for the 5-year ACS. If no suffix to these dataset names is specified, users will retrieve data from the ACS Detailed Tables. Variables from the ACS Data Profile, Summary Tables, and Comparison Profile are also available by appending the suffixes `/profile`, `/summary`, or `/cprofile`, respectively. For example, a user requesting variables from the 2020 5-year ACS Detailed Tables would specify `load_variables(year = 2020, dataset = "acs5")`; a request for variables from the Data Profile would be `load_variables(year = 2020, dataset = "acs5/profile")`. In addition to these datasets, the ACS Supplemental Estimates variables can be accessed with the dataset name `"acsse"`.
As this function requires processing thousands of variables from the Census Bureau, which may take a few moments depending on the user's internet connection, the user can specify `cache = TRUE` in the function call to store the data in the user's cache directory for future access. On subsequent calls of `load_variables()`, `cache = TRUE` will direct the function to look in the cache directory for the variables rather than on the Census website.
An example of how `load_variables()` works is as follows:

```r
v16 <- load_variables(2016, "acs5", cache = TRUE)
```
The returned data frame always has three columns: `name`, which refers to the Census variable ID; `label`, which is a descriptive data label for the variable; and `concept`, which refers to the topic of the data and often corresponds to a table of Census data. For the 5-year ACS detailed tables, the returned data frame also includes a fourth column, `geography`, which specifies the smallest geography at which a given variable is available from the Census API. The data frame can be filtered using tidyverse tools for variable exploration. However, the RStudio integrated development environment includes an interactive data viewer that is ideal for browsing this dataset, and allows for interactive sorting and filtering. The data viewer can be accessed with the `View()` function:

```r
View(v16)
```

By browsing the table in this way, users can identify the appropriate variable IDs (found in the `name` column) that can be passed to the `variables` parameter in `get_acs()` or `get_decennial()`.
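A tidyverse filter over this dataset might look like the following sketch (the exact `concept` search string is an assumption; browse the dataset to confirm the wording used for your year):

```r
library(dplyr)
library(stringr)

# Search the v16 variable dataset for median-household-income-related variables;
# ACS concept labels are upper-case strings such as "MEDIAN HOUSEHOLD INCOME ..."
v16 %>%
  filter(str_detect(concept, "MEDIAN HOUSEHOLD INCOME"))
```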
Users may note that the raw variable IDs in the ACS, as consumed by the API, require a suffix of `E` or `M`. tidycensus does not require this suffix, as it will automatically return both the estimate and margin of error for a given requested variable. Additionally, if users desire an entire table of related variables from the ACS, they should supply the characters prior to the underscore from a variable ID to the `table` parameter.
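Building on the earlier `born_in_mexico` request: the variable `"B05006_150"` belongs to table `B05006`, so the whole table can be requested like this (a sketch, subject to the same API-key setup as above):

```r
# "B05006_150" -> the table ID is the part before the underscore: "B05006"
b05006_table <- get_acs(
  geography = "state",
  table = "B05006",
  year = 2020
)
```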
In the next section, we will demonstrate how to access and clean Census data using county-level sociodemographic data from Texas as an example.
Part of this section was derived from the online book Analyzing US Census Data: Methods, Maps, and Models in R section 3 and 4.
This chapter covers wrangling Census data with the R package tidyverse and exploring Census data with visualization. We will demonstrate the process using county-level sociodemographic data from Texas as an example.
To extract sociodemographic data of interest from the 2016-2020 American Community Survey 5-year estimates, we first load the codebook, which allows us to search for and locate the variables of interest:
```r
# Census API token: replace "Your KEY" with your requested API token from
# https://api.census.gov/data/key_signup.html
# census_api_key("Your KEY", install = TRUE, overwrite = TRUE)

# View all the possible variables we can use in the dataset
acs20 <- load_variables(year = 2020, dataset = "acs5/profile", cache = TRUE)

# Search in the codebook page
# View(acs20)
```
If we are interested in county-level sociodemographic variables (% unemployed, % living under poverty, per capita income, etc.), we first need to list the variables of interest by searching for keywords in the codebook page viewed above:
```r
# View the loaded variables and filter to choose the variables of interest at the county level
varlist <- c("DP05_0001", "DP03_0088", "DP03_0005P", "DP03_0128P",
             "DP03_0099P", "DP04_0077P", "DP02_0060P", "DP02_0061P",
             "DP05_0077P", "DP04_0058P", "DP02_0072P", "DP02_0114P",
             "DP04_0047P")
subset(acs20, acs20$name %in% varlist)

# Get 2020 ACS data for all Texas counties
TX_county <- get_acs(geography = "county", state = "TX",
                     variables = varlist,
                     year = 2020) # geometry = TRUE will also return the shapefiles
#> Getting data from the 2016-2020 5-year ACS
#> Using the ACS Data Profile
#> Using FIPS code '48' for state 'TX'

TX_county
```
If you are interested in getting county-level data for all US states, you can set the `state` parameter to `NULL`.
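That nationwide request might look like the following sketch (reusing the `varlist` vector from above; this pulls every county in the US, so it takes noticeably longer):

```r
# County-level ACS data for every US county (state = NULL drops the state filter)
US_county <- get_acs(geography = "county", state = NULL,
                     variables = varlist,
                     year = 2020)
```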
With the data retrieved using tidycensus, we need to process and reshape it into the format needed for further analysis and visualization.
```r
library(tidyverse)

# Clean the dataset
TX_county_dt <- TX_county[, c(1, 3, 4)] %>%
  spread(key = "variable", value = "estimate") %>% # from long to wide format
  mutate(Crowded_housing = 100 - DP04_0077P,        # define variable: crowded housing
         No_high_school = DP02_0060P + DP02_0061P,  # define variable: no high school diploma
         Racial_minority = 100 - DP05_0077P         # define variable: racial minority
  ) %>%
  # mutate_each(funs(replace(., . < 0, NA))) %>% # replace negative (missing-coded) values with NA
  rename(No_vehicle = DP04_0058P, Unemployed = DP03_0005P, Renters = DP04_0047P,
         Per_capita_income = DP03_0088, Living_poverty = DP03_0128P,
         Uninsured = DP03_0099P, Population = DP05_0001, Disability = DP02_0072P,
         Limited_English = DP02_0114P) # rename the variables

# Retain the variables of interest
TX_county_dt <- dplyr::select(TX_county_dt, GEOID, Population, Crowded_housing,
                              No_high_school, Renters, Racial_minority, No_vehicle,
                              Unemployed, Per_capita_income, Living_poverty, Uninsured,
                              Disability, Limited_English)

# options(digits = 3)
TX_county_dt
```
The data contain 254 observations and 13 columns, with each row containing data for one county in Texas and each column representing a variable. GEOID is the county FIPS code. Each measure is a percentage, except per capita income, which is in US dollars.
The table1 package is useful for creating summary statistics tables: https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html
```r
library(table1)

table1(~ Population + Per_capita_income + Unemployed + Crowded_housing + No_high_school +
         Renters + Racial_minority + No_vehicle + Living_poverty + Uninsured + Disability +
         Limited_English, data = TX_county_dt)
```
| | Overall (N=254) |
|---|---|
| **Population** | |
| Mean (SD) | 113000 (405000) |
| Median [Min, Max] | 18500 [117, 4680000] |
| **Per_capita_income** | |
| Mean (SD) | 27100 (5850) |
| Median [Min, Max] | 26700 [13300, 47700] |
| **Unemployed** | |
| Mean (SD) | 2.92 (1.39) |
| Median [Min, Max] | 2.80 [0, 9.20] |
| **Crowded_housing** | |
| Mean (SD) | 4.02 (2.40) |
| Median [Min, Max] | 3.60 [0, 12.8] |
| **No_high_school** | |
| Mean (SD) | 18.2 (8.51) |
| Median [Min, Max] | 16.9 [3.40, 78.1] |
| **Renters** | |
| Mean (SD) | 28.0 (9.46) |
| Median [Min, Max] | 26.8 [3.60, 82.3] |
| **Racial_minority** | |
| Mean (SD) | 44.5 (21.4) |
| Median [Min, Max] | 41.8 [0, 99.0] |
| **No_vehicle** | |
| Mean (SD) | 5.06 (2.42) |
| Median [Min, Max] | 5.00 [0, 12.4] |
| **Living_poverty** | |
| Mean (SD) | 15.3 (6.39) |
| Median [Min, Max] | 14.5 [0, 42.6] |
| **Uninsured** | |
| Mean (SD) | 17.4 (5.11) |
| Median [Min, Max] | 17.1 [4.00, 35.2] |
| **Disability** | |
| Mean (SD) | 15.9 (4.75) |
| Median [Min, Max] | 16.0 [4.60, 47.9] |
| **Limited_English** | |
| Mean (SD) | 26.6 (18.9) |
| Median [Min, Max] | 20.2 [2.60, 94.1] |
```r
cor <- cor(TX_county_dt[, 2:13], method = "pearson", use = "complete.obs")
cor <- as.data.frame(round(cor, 2))

# Rename variables (in the same order as columns 2:13 of TX_county_dt)
names <- c("Population", "Crowded housing", "No high school diploma", "Renters",
           "Racial minority", "No vehicle", "Unemployment", "Per capita income",
           "Living in poverty", "Uninsured", "Disability", "Limited English")
names(cor) <- names
row.names(cor) <- names

# Plot the lower triangle of the correlation matrix with coefficients labeled
library(ggcorrplot)
ggcorrplot(cor, type = "lower", lab = TRUE, lab_size = 1.5)
```
Alternatively, we can request data in wide format using tidycensus, which will spread the estimates (variable names ending in "E") and margins of error (variable names ending in "M") across the columns.
```r
tx_wide <- get_acs(
  geography = "county",
  state = "Texas",
  variables = c(living_poverty = "DP03_0128P", # rename the variables
                unemployed = "DP03_0005P",
                Per_capita_income = "DP03_0088"),
  output = "wide", # wide format
  year = 2020
)
#> Getting data from the 2016-2020 5-year ACS
#> Using the ACS Data Profile
#> Using FIPS code '48' for state 'Texas'

tx_wide
```
We can create some visualizations using the ggplot2 package, such as a histogram of the percentage living under poverty at the county level in Texas.
```r
ggplot(tx_wide, aes(x = living_povertyE)) +
  geom_histogram() +
  labs(x = "% people living in poverty", y = "count")
```
Explore the relationship between % poverty and % unemployed using a scatter plot:
```r
ggplot(tx_wide, aes(x = living_povertyE, y = unemployedE)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "% people living in poverty", y = "% people unemployed")
```
The `geom_smooth()` function draws a fitted line representing the relationship between the two columns on the plot. The argument `method = "lm"` draws a straight line based on a linear model fit; smoothed relationships can be visualized as well with `method = "loess"`.
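The loess variant is a one-argument change to the same scatter plot:

```r
# Same scatter plot, but with a locally weighted (loess) smoother
# instead of a straight linear fit
ggplot(tx_wide, aes(x = living_povertyE, y = unemployedE)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(x = "% people living in poverty", y = "% people unemployed")
```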
We can also create a bar chart identifying the top 20 counties in Texas with the highest per capita income.
```r
tx_wide %>%
  mutate(County = str_split_fixed(NAME, ",", 2)[, 1]) %>% # retain only the county name
  arrange(desc(Per_capita_incomeE)) %>%                   # order by per capita income
  slice(1:20) %>%                                         # top 20
  ggplot(aes(y = reorder(County, Per_capita_incomeE),     # order bars from highest to lowest
             x = Per_capita_incomeE)) +
  geom_col() +
  labs(x = "Per capita income, $", y = "") +
  ggtitle("The top 20 counties in Texas with the highest per capita income")
```
While tidycensus has tools available for working with margins of error in a data wrangling workflow, it is also often useful to visualize those margins of error to illustrate the degree of uncertainty around estimates, especially when making comparisons between those estimates. Below is an example presenting the uncertainty around each estimate using error bars:
```r
tx_wide %>%
  mutate(County = str_split_fixed(NAME, ",", 2)[, 1]) %>% # retain only the county name
  arrange(desc(Per_capita_incomeE)) %>%                   # order by per capita income
  slice(1:20) %>%                                         # top 20
  ggplot(aes(x = Per_capita_incomeE, y = reorder(County, Per_capita_incomeE))) +
  geom_errorbarh(aes(xmin = Per_capita_incomeE - Per_capita_incomeM,
                     xmax = Per_capita_incomeE + Per_capita_incomeM)) +
  geom_point(size = 3, color = "darkgreen") +
  theme_minimal(base_size = 12.5) +
  labs(title = "Per capita income",
       subtitle = "Top 20 counties in Texas",
       x = "2016-2020 ACS estimate",
       y = "")
```
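Among the margin-of-error tools mentioned above, tidycensus includes helpers such as `moe_sum()` and `moe_prop()` that implement the Census Bureau's approximation formulas for margins of error of derived estimates. A minimal sketch of `moe_prop()`, which returns the MOE for a derived proportion (the input values here are made up for illustration):

```r
library(tidycensus)

# MOE for a derived proportion num/denom, given the MOEs of the numerator
# and denominator (Census Bureau approximation formula)
moe_prop(num = 25, denom = 100, moe_num = 5, moe_denom = 3)
```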
Sometimes we want to obtain a time series of ACS estimates to explore temporal demographic shifts. For an illustrative example, we'll obtain 5-year ACS data from 2010 through 2019 on median house value for Harris County and Travis County, Texas. `map_dfr()` is used to iterate over a named vector of years, creating a time-series dataset of median home value in Harris and Travis County since 2010. We use the formula specification for anonymous functions, so that `~ .x` translates to `function(x) x`.
```r
my_vars <- c(
  "B25077_001"
) # list the variables of interest (median house value)

years <- 2010:2019 # years to iterate over
names(years) <- years

county_mhv_tx <- map_dfr(
  years,
  ~ get_acs(
    geography = "county",
    state = "Texas",
    variables = my_vars,
    year = .x,
    survey = "acs5",
    geometry = FALSE
  ),
  .id = "year" # when combining results, add an id variable (name of the list item)
) %>%
  arrange(variable, NAME) %>%
  rename(value = estimate) %>%
  select(-variable) %>%
  mutate(year = as.numeric(year))

county_mhv_tx
```
Arguably the most common chart type chosen for time-series visualization is the line chart, which ggplot2 handles capably with the `geom_line()` function. Here we compare the temporal trends of median house value in Harris and Travis County over time (2010-2019):
```r
compare <- subset(county_mhv_tx, county_mhv_tx$NAME %in%
                    c("Harris County, Texas", "Travis County, Texas"))

ggplot(compare, aes(x = year, y = value, color = NAME)) +
  geom_line() +
  geom_point() +
  labs(title = "Median home value in Harris and Travis County, TX",
       x = "Year",
       y = "ACS estimate",
       caption = "5-year ACS estimates, 2010-2019",
       color = "County") +
  theme_minimal(base_size = 12) +
  scale_x_continuous(breaks = c(2010:2019)) +
  scale_y_continuous(labels = scales::label_dollar(scale = .001, suffix = "k"))
```
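Because the `moe` column is carried through the time-series dataset above, the uncertainty around each estimate can also be drawn behind the trend lines (a sketch using `geom_ribbon()`; the shaded band is the estimate plus or minus its margin of error):

```r
# Shaded margin-of-error band around each county's trend line
ggplot(compare, aes(x = year, y = value, color = NAME)) +
  geom_ribbon(aes(ymin = value - moe, ymax = value + moe, fill = NAME),
              alpha = 0.2, color = NA) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "ACS estimate", color = "County", fill = "County")
```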
If you prefer to download the shapefiles using the web interface, you can navigate and download shapefiles by specifying the year and geographic level on the website. After saving the shapefile in a local folder under your working directory, the rgdal or sf package can be used to read the shapefiles into R. In this example, the 2020 Texas county-level shapefile was downloaded:
# using rgdal
library(rgdal)
# indicate dsn (data source, folder name) and layer name
TX_County <- readOGR(dsn = "TX_county",layer = "TX_county", verbose=F)
Warning: OGR support is provided by the sf and terra packages among others
class(TX_County)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
Note that the returned object is a “SpatialPolygonsDataFrame”, which consists of data columns on characteristics of the counties, and spatial polygons, which contains a set of spatially explicit shapes/polygons that represent the geographic locations of the counties.
head(TX_County@data,10)
Or you can use read_sf from the sf package, which will return an sf object.
library(sf)
TX_County1 <- read_sf('TX_county/TX_county.shp')
class(TX_County1)
[1] "sf" "tbl_df" "tbl" "data.frame"
head(TX_County1)
Simple feature collection with 6 features and 17 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -102.0904 ymin: 28.02283 xmax: -93.6882 ymax: 35.18324
Geodetic CRS: GRS 1980(IUGG, 1980)
The sf object includes a data frame with a series of columns representing characteristics of those counties, like a name, postal code, and Census ID (the GEOID column). It also contains a special list-column, geometry, which is made up of a sequence of longitude/latitude coordinate pairs that collectively represent the boundary of each county.
You can easily convert an sf object to a SpatialPolygonsDataFrame as follows:
# TX_County <- as(TX_County1, "Spatial")
The tigris R package simplifies the process for R users of obtaining and using Census geographic datasets. Functions in tigris download a requested Census geographic dataset from the US Census Bureau website, then load the dataset into R as a spatial object.
You can download shapefiles at different geographical levels. For example, use states() function to retrieve state-level shapefile, counties() for county-level, and tracts() for census-tract level.
When you call a tigris function, it does the following:
Downloads your data from the US Census Bureau website
Stores your data in a user cache directory or in a temporary directory
Loads your data into your R session with rgdal::readOGR() or sf::st_read()
see reference here.
The state argument can be a two-digit FIPS code, a state name, or a state abbreviation; for example, 48 stands for Texas.
library(tigris)
options(tigris_use_cache=TRUE) #cache Census shapefile downloads
txcounty <- counties(state=48, cb=TRUE, year=2020)
Notice the argument cb=TRUE. It returns Cartographic Boundary Files, which have less detail than TIGER/Line Shapefiles. (Detailed descriptions can be found here.)
Let's see a comparison for Texas, where black lines are the TIGER/Line file and red lines are the Cartographic Boundary file.
With the tracts() function, you can retrieve the census tract shapefile for a specific county. Here's an example for Harris County in Texas. (Note: you may want to use the FIPS code instead of the county name to accurately find the data you need. For instance, "Harris" matches both "Harris County" and "Harrison County", an ambiguity that can end up returning empty data.)
harristracts <- tracts(state=48, county = "Harris County", cb=T, year=2020)
Using FIPS code '201' for 'Harris County'
# View the polygons using package tmap
tm_shape(harristracts) + tm_polygons()
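The name-matching pitfall noted above is easy to see with plain string matching (the county names here are just a toy vector, not a tigris lookup):

```r
# "Harris" is a prefix of two different county names, so a partial match
# is ambiguous; the FIPS code ("201" for Harris County) is not.
tx_names <- c("Harris County", "Harrison County", "Travis County")
grep("Harris", tx_names, value = TRUE)  # "Harris County" "Harrison County"
```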
To retrieve county-level shapefiles for multiple states, you can specify a vector of FIPS codes in the state argument. For example, 17 = Illinois, 26 = Michigan, 55 = Wisconsin.
multi_st_county <- counties(state=c(17,26,55), cb=TRUE, year=2020)
tm_shape(multi_st_county) + tm_polygons()
However, since census tract shapefiles are only available at the state level, you cannot pass a vector of FIPS codes directly into the function. Instead, you can use the rbind_tigris() function to combine multiple tract files.
sts <- c(17,26,55)
combined <- rbind_tigris(
lapply(sts, function(x){
tracts(x, cb=TRUE)
})
)
Retrieving data for the year 2021
Retrieving data for the year 2021
Retrieving data for the year 2021
tm_shape(combined) + tm_polygons()
It is straightforward to get the tracts file for the entire US: simply specify NULL in the state and county arguments.
ustract <- tracts(state = NULL, county = NULL, cb = TRUE, year = 2019)
Retrieving Census tracts for the entire United States
tm_shape(ustract) + tm_polygons()
A common problem for national display of the United States is the fragmented nature of US states and territories geographically. The continental United States can be displayed on a map in a relatively straightforward way, and there are a number of projected coordinate reference systems designed for correct display of the continental US. Often, analysts and cartographers then have to make decisions about how to handle Alaska, Hawaii, and Puerto Rico, which cannot be reasonably plotted using default US projections.
tigris offers a solution to this problem with the shift_geometry() function. shift_geometry() takes an opinionated approach to the shifting and rescaling of Alaska, Hawaii, and Puerto Rico geometries to offer four options for an alternative view of the US. The function works by projecting geometries in Alaska, Hawaii, and Puerto Rico to appropriate coordinate reference systems for those areas, then re-sizing the geometries (if requested) and moving them to an alternative layout in relationship to the rest of the US using the Albers Equal Area CRS.
us_states <- states(cb = TRUE, resolution = "20m")
Retrieving data for the year 2021
us_states_shifted <- shift_geometry(us_states)
tm_shape(us_states_shifted)+tm_polygons()
This view uses two default arguments: preserve_area = FALSE, which shrinks Alaska and inflates Hawaii and Puerto Rico, and position = "below", which places these areas below the continental United States. Alternatively, we can set preserve_area = TRUE and position = "outside" (used together below, but they can be mixed and matched) for a different view:
us_states_outside <- shift_geometry(us_states,
preserve_area = TRUE,
position = "outside")
tm_shape(us_states_outside)+tm_polygons()
Before creating maps in R, we need to merge the data of interest (in this case, non-census data) with shapefiles. The following example uses the Texas county-level COVID-19 cases data from the Department of Health and Human Services website.
# read in data
library(readxl)
cases <- read_excel("Texas COVID-19 Cumulative Confirmed Cases by County.xlsx",
sheet="Cases by County 2023",
skip=2)[1:254,1:2] # cumulative cases by 1/1/2023
colnames(cases) <- c("County", "Freq")
head(cases)
library(dplyr)
# Merge with SpatialPolygonDataFrame
#merge by county name
TX_County@data <- left_join(TX_County@data, cases, by=c("NAME"="County"))
# Merge with sf object
TX_County1 <- left_join(TX_County1, cases,by=c("NAME"="County"))
If you have individual-level address data and would like to obtain census information from addresses, you first need to geocode them: https://geocoding.geo.census.gov/geocoder/
# Map the number of cases at county level
library(RColorBrewer)
library(classInt)
library(maptools)
### select variable to plot
plotvar <- TX_County@data$Freq ### variable to plot
# specify breaks
nclr <- 5
brks <- round(quantile(plotvar, probs=seq(0,1,1/(nclr))), digits=1)
colornum <- findInterval(plotvar,brks,all.inside=T)
plotclr <- brewer.pal(nclr,"BuPu") # you can choose different colors schemes
colcode <- plotclr[colornum]
### make the map
plot(TX_County, col=colcode, axes=TRUE)
legend("bottomleft",legend=leglabs(round(brks,digits=1)),fill=plotclr,cex=0.8,bty="n")
title("Number of cumulative COVID-19 cases at county level \nin Texas, as of Jan 1, 2023")
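The classing logic in the chunk above, quantile() for break points and findInterval() for class assignment, can be checked on a toy vector, independent of any shapefile:

```r
# Assign each value to one of 3 quantile classes; the class index would
# then pick a color from a palette such as brewer.pal(3, "BuPu").
vals <- c(2, 15, 40, 90, 400, 1200)
nclr <- 3
brks <- quantile(vals, probs = seq(0, 1, 1/nclr))   # 4 break points, 3 classes
classes <- findInterval(vals, brks, all.inside = TRUE)
classes  # 1 1 2 2 3 3  (class 1 = lightest color, class 3 = darkest)
```

all.inside = TRUE is what keeps the minimum and maximum values inside classes 1 and nclr rather than falling outside the break sequence.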
As covered in the previous section, Census geographies are available from the tigris R package as simple features objects, using the data model from the sf R package. tidycensus wraps several common geographic data functions in the tigris package to allow R users to return simple feature geometry pre-linked to downloaded demographic data with a single function call. The key argument to accomplish this is geometry = TRUE, which is available in the core data download functions in tidycensus: get_acs(), get_decennial(), and get_estimates().
Traditionally, getting “spatial” Census data requires a tedious multi-step process that can involve several software platforms. These steps include:
Fetching shapefiles from the Census website;
Downloading a CSV of data, then cleaning and formatting it;
Loading geometries and data into your desktop GIS of choice;
Aligning key fields in your desktop GIS and joining your data.
geometry = TRUE combines the automated data download functionality of tidycensus and tigris to allow R users to bypass this process entirely. The following example illustrates the use of the geometry = TRUE argument, fetching information on population, per capita income, and unemployment rate for all Texas counties. We will use this data for mapping in the next few sections.
library(tidycensus)
# 2019 ACS data for all Texas counties - total population,
# Per capita income and % unemployed
TX_county_dt <- get_acs(geography = "county", state="TX",
variables=c("B01001_001", "B19301_001", "DP03_0005P"),
year=2019, geometry=T)
Getting data from the 2015-2019 5-year ACS
Fetching data by table type ("B/C", "S", "DP") and combining the result.
# format the data
TX_county_dt <- TX_county_dt[,-5] %>%
tidyr::spread(key="variable", value="estimate") %>%
rename(Population=B01001_001, Per_capita_income=B19301_001,
Unemployment=DP03_0005P)
TX_county_dt
Simple feature collection with 254 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -106.6456 ymin: 25.83738 xmax: -93.50829 ymax: 36.5007
Geodetic CRS: NAD83
One of the most common ways to visualize statistical information on a map is with choropleth mapping. Choropleth maps use shading to represent how underlying data values vary by feature in a spatial dataset. The COVID-19 cases plot at Texas county level earlier in this chapter is an example of a choropleth map.
We can use ggplot2 to create a choropleth map:
############################
### Using ggplot2 Method 1
############################
library(ggplot2)
# more about mapping in ggplot: http://mazamascience.com/WorkingWithData/?p=1494
# map
library(scales)
map1 <- ggplot(TX_county_dt)+
geom_sf(aes(fill=Population), color="grey60", alpha=0.8)+ # specify variable to map
scale_fill_distiller(palette = "YlOrRd",na.value = "grey50",
breaks = pretty_breaks(n = 6),
label=comma,values = c(1,0),
guide=guide_colorbar(reverse = TRUE))+
#geom_point(data=text, lat=lat, long=long)+ #can add point data here!
labs(fill="")+xlab("Longitude")+ylab("Latitude")+
ggtitle("Map of Texas County level Population")
map1 #display
The county polygons can be styled using ggplot2 conventions and the geom_sf() function. The geom_sf() function in the above example interprets the geometry of the sf object (in this case, polygon) and visualizes the result as a filled choropleth map. In this case, the ACS estimate of total population is mapped to the yellow-orange-red color ramp in ggplot2, highlighting the most populated counties (such as Harris County) with darker reds.
If you would like to add a Google map as the background, there is an alternative way of mapping with ggplot2, using the package ggmap. You will need to request your own Google API key (instructions can be found here).
More details about visualization with ggmap can be found here.
############################
### Using ggplot2 Method 2
############################
#install.packages('rgeos', type='source')
library(gpclib)
# first convert the SpatialPolygonDataFrame to ggplot object
ggSHP <- fortify(TX_County, region="GEOID")
# Merge data with shapefiles
ggSHP2 <- left_join(ggSHP, TX_county_dt, by=c("id"="GEOID"))
# Optional: if you would like to add Google map as the background
#library(ggmap)
#register_google(key="YOUR GOOGLE API KEY", write=TRUE)
# you will need to insert your own Google API key
geocode("Texas") #get center coordinates of Texas
# Get Google map - road map
GoogleMap <- ggmap(get_googlemap(center=c(-99.9, 32.0), scale=2 ,zoom=6,
maptype = "roadmap"), extent="panel")
# mapping county level population
map2 <- GoogleMap + #optional
geom_polygon(data=ggSHP2, aes(x=long, y=lat, group=group, fill=Unemployment),
color="grey60", alpha=0.8)+ # specify variable to map
scale_fill_distiller(palette = "YlOrRd",na.value = "grey50",
breaks = pretty_breaks(n = 3), label=comma,
values = c(1,0), guide=guide_colorbar(reverse = TRUE))+
labs(fill="% Unemployed")+xlab("Longitude")+ylab("Latitude")+
ggtitle("Map of Texas County level Unemployment Rate")
map2 #check
# Put several maps together
library(ggpubr)
figure1 <- ggarrange(map1, map2, nrow = 1)
figure1
#save the maps
#ggsave("county_tmap_ggplot.png", figure1)
For ggplot2 users, geom_sf() offers a familiar interface for mapping data obtained from the US Census Bureau. However, ggplot2 is far from the only option for cartographic visualization in R. The tmap package (Tennekes 2018) is an excellent alternative for mapping in R that includes a wide range of functionality for custom cartography. The section that follows is an overview of several cartographic techniques implemented with tmap for visualizing US Census data. A full treatment of best practices in cartographic design is beyond the scope of this section; recommended resources for learning more include Peterson (2020) and Brewer (2016).
########################
### Using tmap
#######################
library(tmap)
# more tmap styles info here: https://geocompr.github.io/post/2019/tmap-color-scales/
library(ggmap)
#register_google(key="YOUR GOOGLE API KEY", write=TRUE)
# geocode to get some city
coordinates <- geocode(c("Houston, TX", "Austin, TX", "Dallas, TX")) %>%
mutate(city=c("Houston", "Austin", "Dallas"))
coor_sf = st_as_sf(coordinates, coords = c('lon', 'lat'), crs = st_crs(TX_county_dt)$proj4string)
coordinates
# map
map1 <- tm_shape(TX_county_dt)+
tm_polygons(col=c("Unemployment","Per_capita_income"), style="quantile",n=4,
# specify variables and styles of mapping
title=c("Percentage","Dollars"), palette="seq", lwd=0.5,
border.alpha = 0.8, midpoint=NA)+
tm_facets(ncol=2)+ # number of columns
tm_layout(aes.palette = list(seq = "YlOrRd"))+ # color scheme
tm_layout(inner.margin=c(0,0.01,0,0), # layout of the map
legend.text.size = 1,
legend.title.size = 1,
legend.position=c("left", "bottom"),
panel.labels=c("Unemployment","Per capita income"),
panel.label.bg.color = "White")+
tm_shape(coor_sf)+ # you can also add point data - mark locations of the big cities
tm_symbols(size=0.5, col="black", border.col="black")+tm_text("city", size=1.2, just="left", ymod=0.6)
map1
# save the map
#tmap_save(map1, filename="county_map_tmap", dpi=500)
An alternative commonly used to visualize count data is the graduated symbol map. Graduated symbol maps use shapes referenced to geographic units that are sized relative to a data attribute. The example below uses tmap's tm_bubbles() function to create a graduated symbol map of the total population in Texas counties:
tm_shape(TX_county_dt) +
tm_polygons() +
tm_bubbles(size = "Population",
alpha = 0.5,
col = "navy",
title.size = "County population size in Texas, 2020")
There are other functions in tmap that can be used to customize your maps: tm_scale_bar() adds a scale bar; tm_compass() adds a north arrow; and tm_credits() helps cartographers give credit for the basemap, which is required when using Mapbox and OpenStreetMap tiles. The legend.hist argument in tm_polygons() can be set to TRUE, adding a histogram to the map with bars colored by the values used on the map.
tm_shape(TX_county_dt)+
tm_polygons(col=c("Unemployment"), style="quantile",n=4,
title=c("Percentage"), palette="seq", lwd=0.5,
border.alpha = 0.3, midpoint=NA,
legend.hist=T,legend.hist.z=4)+
tm_layout(aes.palette = list(seq = "YlOrRd"))+ # color scheme
tm_layout(inner.margin=c(0,0.2,0,0), # layout of the map
legend.text.size = 1,
legend.title.size = 1,
legend.position=c("left", "bottom"),
panel.labels=c("Unemployment"),
panel.label.bg.color = "White")+
tm_compass(position = c("left", "top"))+
tm_scale_bar(position = c("right", "top"))
We can quickly visualize geographic data obtained with tigris on an interactive map by using the mapview() function in the mapview package. The mapview() function also includes a parameter zcol that takes a column in the dataset as an argument and visualizes that column with an interactive choropleth map.
library(mapview)
mapview(TX_county_dt, zcol = "Per_capita_income")
With over 31,000 GitHub stars as of July 2021, the Leaflet JavaScript library (Agafonkin 2020) is one of the most popular frameworks worldwide for interactive mapping. The RStudio team brought Leaflet to R with the leaflet R package (Cheng, Karambelkar, and Xie 2021), which now powers several approaches to interactive mapping in R. The following examples cover how to visualize Census data on an interactive Leaflet map using approaches from mapview, tmap, and the core leaflet package. Details on leaflet mapping in R can be found here.
Below is an example of an interactive map created using leaflet, visualizing county-level population data in Texas:
######################
## Interactive maps
####################
library(leaflet)
TX_county_dt2 <- as(TX_county_dt, "Spatial")
l <- leaflet(TX_county_dt2) %>% addTiles()
# color scheme
pal <- colorNumeric(palette = "YlOrRd", domain=TX_county_dt2$Population)
# add popup labels
labels <- sprintf("<strong> County: %s </strong> <br/>
Population: %s <br/> Unemployment rate: %s <br/>
Per capita income, $: %s",
TX_county_dt2$NAME, TX_county_dt2$Population,
TX_county_dt2$Unemployment, TX_county_dt2$Per_capita_income) %>%
lapply(htmltools::HTML)
# add to the map
interactive_map <- l %>%
addPolygons(
color = "grey", weight = 1,
fillColor = ~ pal(TX_county_dt2$Population), fillOpacity = 0.5,
highlightOptions = highlightOptions(weight = 4),
label = labels,
labelOptions = labelOptions(
style = list(
"font-weight" = "normal",
padding = "3px 8px"
),
textsize = "15px", direction = "auto"
)
) %>%
addLegend(
pal = pal, values = ~TX_county_dt2$Population, opacity = 0.5,
title = "Population", position = "bottomright"
)
interactive_map
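The popup labels in the chunk above come from a vectorized sprintf() call; on toy values (invented for illustration), the mechanics look like this:

```r
# sprintf() recycles the template across the input vectors, producing one
# HTML label string per county; lapply(htmltools::HTML) would then mark
# each string as HTML for leaflet to render.
name <- c("Harris County", "Travis County")
pop  <- c(4700000L, 1300000L)  # toy population values
labels <- sprintf("<strong> County: %s </strong> <br/> Population: %s",
                  name, pop)
labels[1]
```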
There are many other ways to create interactive maps in R:
R package ggiraph: linked maps and charts
R package gganimate: create and save maps as gif or mp4
R package plotly: interactive mapping
R shiny: reactive mapping