Overview
When using this package to identify clusters in a set of
date-location-count data, one of the key inputs is the geographical
information about the units of clustering. Specifically, the
find_clusters() function requires an object that contains pairwise
distances between locations (see vignette("basic_demo") for an
overview of finding clusters).
Here we describe how the package provides utilities to generate these objects.
Types of Distance Objects:
There are two types of distance objects that can be passed to the
find_clusters() function: square pairwise distance matrices, and
distance lists, a constrained version that keeps, for each location,
only the locations within a threshold distance.
Square pairwise distance matrices
If we have a set of locations among which we would like to identify clusters (say all the counties in Minnesota, or all the zip codes in Maryland), we can create a matrix containing all the pairwise distances between the centroids of these locations. The package provides functions to do this for census tracts, zip codes, counties, and states.
Each of these functions must be called with a state-level parameter
(st) to constrain the output to a particular state (to combine
states, see the section on “Custom Distance Matrices”). There is also a
unit parameter, which takes a string: "miles"
(default), "kilometers", or "meters", to
obtain the distance estimates in alternative units.
All of these functions return a two-element named list containing:
- loc_vec: a vector containing the names of the N locations in the geographic target (e.g. “Minnesota”)
- distance_matrix: an N x N matrix containing the pairwise distances between all locations
For all of these functions, locations in the rows and columns of the
distance_matrix will have the same ordering as in the
loc_vec vector.
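Because the row and column order matches loc_vec, a distance can be looked up either by name or by position. The sketch below uses a small hand-built object with the same two-element structure; the labels "A", "B", "C" and the distances are made up for illustration and do not come from the package.

```r
# Toy object mimicking the list returned by the *_distance_matrix() functions.
# Labels and distances are invented for illustration only.
labels <- c("A", "B", "C")
toy <- list(
  loc_vec = labels,
  distance_matrix = matrix(
    c(0, 10, 25,
      10, 0, 18,
      25, 18, 0),
    nrow = 3, byrow = TRUE,
    dimnames = list(labels, labels)
  )
)

# Rows and columns follow the ordering of loc_vec
stopifnot(identical(rownames(toy$distance_matrix), toy$loc_vec))

# Look up one pairwise distance by name
d_ab <- toy$distance_matrix["A", "B"]

# Find the closest other location to "C" by position
i <- match("C", toy$loc_vec)
nearest_to_c <- toy$loc_vec[order(toy$distance_matrix[i, ])][2]
```

The second element of the ordered row is used because the first is always the location itself, at distance zero.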
Example: county distance matrix:
minnesota_counties <- county_distance_matrix(st = "MN")
# show the length and first 10 values of the `loc_vec`
length(minnesota_counties$loc_vec)
#> [1] 87
minnesota_counties$loc_vec |> head(10)
#> [1] "27095" "27045" "27073" "27085" "27153" "27105" "27001" "27057" "27063"
#> [10] "27039"
# show the dimension and upper left section of the `distance_matrix`
dim(minnesota_counties$distance_matrix)
#> [1] 87 87
minnesota_counties$distance_matrix[1:5, 1:5]
#> 27095 27045 27073 27085 27153
#> 27095 0.00000 172.7927 138.99740 82.58382 61.58768
#> 27045 172.79267 0.0000 221.40230 133.61217 214.66847
#> 27073 138.99740 221.4023 0.00000 94.00058 96.16435
#> 27085 82.58382 133.6122 94.00058 0.00000 91.24692
#> 27153 61.58768 214.6685 96.16435 91.24692 0.00000
Example: zip code distance matrix:
maryland_zips <- zip_distance_matrix(st = "MD")
# show the length and first 10 values of the `loc_vec`
length(maryland_zips$loc_vec)
#> [1] 621
maryland_zips$loc_vec |> head(10)
#> [1] "00152" "20058" "20207" "20233" "20389" "20395" "20409" "20599" "20601"
#> [10] "20602"
# show the dimension and upper left section of the `distance_matrix`
dim(maryland_zips$distance_matrix)
#> [1] 621 621
maryland_zips$distance_matrix[1:5, 1:5]
#> 00152 20058 20207 20233 20389
#> 00152 0.0000 121.718302 117.987122 105.9661465 106.0287922
#> 20058 121.7183 0.000000 3.896055 16.8089272 16.7956074
#> 20207 117.9871 3.896055 0.000000 12.9178321 12.9064482
#> 20233 105.9661 16.808927 12.917832 0.0000000 0.1337462
#> 20389 106.0288 16.795607 12.906448 0.1337462 0.0000000
Example: tract distance matrix:
The package also provides a function to generate distance matrices
between census tracts. However, unlike at the zip code and county
levels, there is no built-in data set that contains the centroids of
these tracts across the United States. Therefore, to use the
tract_distance_matrix() function, the user will need to
have the tigris package installed. In addition to the
st and unit parameters, the user can
restrict the tracts within a state by providing a vector of 3-character
county FIPS codes in the county parameter. The tigris package can use
a local cache, which speeds up repeated calls; to prevent this, set
use_cache = FALSE.
cook_county_tracts <- tract_distance_matrix(
st = "IL",
county = "031" # the full 5-digit code for Cook County, IL is 17031
)
# show the length and first 10 values of the `loc_vec`
length(cook_county_tracts$loc_vec)
#> [1] 1332
cook_county_tracts$loc_vec |> head(10)
#> [1] "17031030101" "17031030701" "17031070103" "17031807100" "17031807200"
#> [6] "17031807300" "17031807400" "17031807500" "17031807700" "17031807900"
# show the dimension and upper left section of the `distance_matrix`
dim(cook_county_tracts$distance_matrix)
#> [1] 1332 1332
cook_county_tracts$distance_matrix[1:5, 1:5]
#> 17031030101 17031030701 17031070103 17031807100 17031807200
#> 17031030101 0.0000000 0.6128501 4.496586 5.0124266 4.9426298
#> 17031030701 0.6128501 0.0000000 3.944467 5.3549625 5.2210581
#> 17031070103 4.4965858 3.9444667 0.000000 9.0726214 8.8099097
#> 17031807100 5.0124266 5.3549625 9.072621 0.0000000 0.6311266
#> 17031807200 4.9426298 5.2210581 8.809910 0.6311266 0.0000000
As described above, the default unit for these objects is “miles”.
However, for all of these functions, one can also obtain distances in
other units by passing “kilometers” or “meters” to the unit
parameter.
maryland_zip <- zip_distance_matrix(st = "MD", unit = "kilometers")
Custom Distance Matrices
Notice that all of the above functions require the user to indicate a
state parameter, st, thus restricting the output to a
single state. We also provide a function
custom_distance_matrix() to allow users to provide any type
of geo-spatial unit. The function has a different set of parameters than
the other functions described above:
- df: a data.frame with one row per unit, a unique label, and columns containing the latitude and longitude of the centroid of each unit
- unit: as before, this allows the user to obtain distances in miles, kilometers, or meters
- label_var: the string name of the column containing the name/label of the unit
- lat_var: the name of the column containing the latitude of the centroid (required)
- long_var: the name of the column containing the longitude of the centroid (required)
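To give a sense of what a centroid-based distance matrix involves, here is a minimal base-R sketch that builds an object with the same loc_vec / distance_matrix structure using the haversine (great-circle) formula. This is an assumption for illustration: the vignette does not say which distance formula custom_distance_matrix() uses, and the helper name haversine_matrix and the Earth radius of 3958.8 miles are choices made here, not part of the package.

```r
# Hypothetical helper (not part of the package): great-circle distances
# between centroids, in miles, via the haversine formula.
haversine_matrix <- function(df, label_var, lat_var, long_var,
                             radius = 3958.8) {
  lat <- df[[lat_var]] * pi / 180   # degrees to radians
  lon <- df[[long_var]] * pi / 180
  dlat <- outer(lat, lat, "-")      # all pairwise latitude differences
  dlon <- outer(lon, lon, "-")      # all pairwise longitude differences
  a <- sin(dlat / 2)^2 + outer(cos(lat), cos(lat)) * sin(dlon / 2)^2
  d <- 2 * radius * asin(sqrt(a))
  labels <- as.character(df[[label_var]])
  dimnames(d) <- list(labels, labels)
  list(loc_vec = labels, distance_matrix = d)
}

# Three Delaware county centroids, as listed in the built-in counties data
de <- data.frame(
  fips = c("10003", "10005", "10001"),
  longitude = c(-75.64413, -75.33702, -75.50298),
  latitude = c(39.57592, 38.67323, 39.09709)
)
de_dm <- haversine_matrix(de, "fips", "latitude", "longitude")
round(de_dm$distance_matrix, 1)
```

The result is symmetric with a zero diagonal. Values from this sketch may differ slightly from the package's output, depending on the Earth model and formula the package actually uses.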
Below, we demonstrate how this function could be used to get a pairwise distance matrix for a collection of contiguous states, in this case Maryland, Delaware, and Virginia.
# Use the built-in `counties` dataset to get a data frame of
# counties in the states of interest
states <- c("Delaware", "Maryland", "Virginia")
delmarva_counties <- counties[state_name %in% states]
head(delmarva_counties, 3)
#> state_name state state_fips fips longitude latitude
#> <char> <char> <char> <char> <num> <num>
#> 1: Delaware DE 10 10003 -75.64413 39.57592
#> 2: Delaware DE 10 10005 -75.33702 38.67323
#> 3: Delaware DE 10 10001 -75.50298 39.09709
# Use the custom function to get the distance matrix
delmarva_dm <- custom_distance_matrix(
df = delmarva_counties,
label_var = "fips",
lat_var = "latitude",
long_var = "longitude"
)
The output structure of the custom_distance_matrix()
function is the same as the other distance matrix functions.
# show the length and first 10 values of the `loc_vec`
length(delmarva_dm$loc_vec)
#> [1] 160
delmarva_dm$loc_vec |> head(10)
#> [1] "10003" "10005" "10001" "24047" "24001" "24510" "24015" "24005" "24013"
#> [10] "24009"
# show the dimension and upper left section of the `distance_matrix`
dim(delmarva_dm$distance_matrix)
#> [1] 160 160
delmarva_dm$distance_matrix[1:5, 1:5]
#> 10003 10005 10001 24047 24001
#> 10003 0.00000 64.50542 33.93277 95.24787 162.8771
#> 10005 64.50542 0.00000 30.61596 31.20211 191.6783
#> 10001 33.93277 30.61596 0.00000 61.34431 174.6244
#> 24047 95.24787 31.20211 61.34431 0.00000 206.1344
#> 24001 162.87708 191.67834 174.62439 206.13442 0.0000
Distance Lists
Notice that the square matrices in all the above examples are of size
N x N where N is the number of unique
locations in your target geography. These matrices can become very
large, and as they become large, they also become slower to calculate.
Furthermore, many of the pairs are never really used in subsequent
analyses, because the distance between them is larger than any typical
radius constraint that a user might want to place on their
cluster-finding technique (Again, see
vignette("basic_demo") for more information on using the
find_clusters() function and setting the maximum distance
to consider when constructing clusters).
For example, we typically recommend starting at 50 miles as the
radius for county-level cluster finding, 15 miles for zip-level cluster
finding, and 3 miles for tract-level cluster finding. Whatever the
radius used, say r, in the square distance matrices above,
none of the pairs where the distance between centroids exceeds
r are needed for the calculation.
Therefore, we also provide a way to construct distance objects for
find_clusters() that include only pairs where the
centroid-to-centroid distance is within some threshold. These objects
can be returned using the function create_dist_list(),
which has the following parameters:
- level: the geographic level, one of “tract”, “zip”, “county”, or “state”
- threshold: the threshold distance, expressed in the units given by the unit argument, used to constrain paired locations; for each location x, the function returns a vector with only those locations y1, y2, y3, ... where the distance between x and y_i is less than threshold
- st: a two-character state abbreviation, which is required for level = "tract", but is otherwise optional. In the latter case (i.e. when st is not specified), the function will return distances of pairs within threshold for all zips or counties in the entire United States
- county: as with the tract_distance_matrix() function, the user can restrict the estimation of tract-to-tract distances to a vector of 3-character FIPS codes
- unit: like other functions discussed above, the distances can be returned in miles (default), kilometers, or meters. The function assumes that threshold is always given in terms of unit (e.g. if unit is set to “kilometers”, be sure to express threshold in kilometers).
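To make the role of threshold concrete, the following base-R sketch derives a thresholded list from a small square matrix. It is only an illustration of the idea: the helper name matrix_to_dist_list and the toy positions are made up here, and the package's own implementation may work differently (for example, by never computing the long distances at all).

```r
# Hypothetical helper (not part of the package): keep, for each location,
# only the distances strictly below `threshold`, sorted nearest first.
matrix_to_dist_list <- function(m, threshold) {
  stopifnot(is.matrix(m), nrow(m) == ncol(m), !is.null(rownames(m)))
  out <- lapply(seq_len(nrow(m)), function(i) {
    row <- m[i, ]                 # named vector of distances from location i
    sort(row[row < threshold])    # drop faraway locations, nearest first
  })
  names(out) <- rownames(m)
  out
}

# Toy example: four made-up locations placed along a line
pos <- c(A = 0, B = 3, C = 5, D = 20)
m <- abs(outer(pos, pos, "-"))    # 4 x 4 square distance matrix

dl <- matrix_to_dist_list(m, threshold = 6)
dl$B   # B itself, then C and A, which are within 6 units
dl$D   # only D itself; every other location is too far away
```

Each list element always contains the location itself at distance zero, and isolated locations such as D carry no other entries, which is where the savings over an N x N matrix come from.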
The output structure of this function is a list of vectors. The list
is of length N where N is the number of unique
locations in the target geography. Each element of the list is a named
vector of distances to the location(s) (i.e. the y_i) that
are within threshold units of the current location. An
example will help illustrate:
maryland_zip_list <- create_dist_list(
level = "zip",
# using a small distance for zip code clustering for demo purposes
threshold = 7,
st = "MD"
)
# this returns a list
class(maryland_zip_list)
#> [1] "list"
# the list is of length equal to the number of unique zip codes.
# Recall from above that we produced a 621 x 621 matrix; in this
# case, our list is of length 621
length(maryland_zip_list)
#> [1] 621
# the names of the list are the locations
names(maryland_zip_list) |> head(10)
#> [1] "00152" "20058" "20207" "20233" "20389" "20395" "20409" "20599" "20601"
#> [10] "20602"
# each element of the list is a named vector with distances to
# those locations within threshold units
maryland_zip_list |> tail(3)
#> $`21921`
#> 21921 21916 21922 21920 21901
#> 0.000000 1.415658 2.044843 2.762370 6.917458
#>
#> $`21922`
#> 21922 21921 21916 21920
#> 0.000000 2.044843 3.443627 4.097526
#>
#> $`21930`
#> 21930 21913 21635 21919 21912 21645
#> 0.000000 2.691130 3.465579 4.221354 5.411272 5.925401
This approach is much faster than the
<geolevel>_distance_matrix() functions described
above when the number of locations is large but only a small share of
the pairs fall within the threshold. A good example of this is obtaining
the zip code level distance information for Texas.
Custom Distance Lists
As above, we provide a way to generate distance lists for custom
location units. In this case, we use the function
create_custom_dist_list(). Again, we can use the
example of creating a distance list for the joint Maryland, Delaware,
and Virginia region, but any user-defined data set can be used for this.
The requirements are essentially the same as for the
custom_distance_matrix() function, except that we
additionally require a threshold.
Requiring a threshold has another advantage for custom lists: if a user specifies non-contiguous states that span large distances, a distance list limited by a reasonable threshold avoids calculating many long distances between faraway locations that would be excluded from clusters by the maximum cluster radius anyway.
# As before, we use the built-in `counties` dataset to get a data frame of counties
states <- c("Delaware", "Maryland", "Virginia")
delmarva_counties <- counties[state_name %in% states]
head(delmarva_counties, 3)
#> state_name state state_fips fips longitude latitude
#> <char> <char> <char> <char> <num> <num>
#> 1: Delaware DE 10 10003 -75.64413 39.57592
#> 2: Delaware DE 10 10005 -75.33702 38.67323
#> 3: Delaware DE 10 10001 -75.50298 39.09709
# Use the custom function to get the distance list
delmarva_dl <- create_custom_dist_list(
df = delmarva_counties,
label_var = "fips",
lat_var = "latitude",
long_var = "longitude",
threshold = 50
)
# this is a list
class(delmarva_dl)
#> [1] "list"
# with length equal to all the counties in the Delmarva region
length(delmarva_dl)
#> [1] 160
# first three elements (i.e. locations) in this list
delmarva_dl[1:3]
#> $`10003`
#> 10003 24015 10001 24029 24025 24035 24011
#> 0.00000 15.87035 33.93277 34.58720 35.02830 43.77808 49.69264
#>
#> $`10005`
#> 10005 24045 24011 10001 24047 24019 24041 24035
#> 0.00000 26.47662 29.96149 30.61596 31.20211 41.92635 45.66223 47.46551
#> 24039
#> 49.93719
#>
#> $`10001`
#> 10001 24011 10005 24035 10003 24029 24015 24041
#> 0.00000 23.54765 30.61596 31.32545 33.93277 34.82778 39.78547 43.57806
Other Functions
We also provide us_distance_matrix(), which is similar to the other <resolution>_distance_matrix() functions except that it takes only the unit parameter. It returns a county-level distance matrix (and vector) for all the counties in the US. In general, unless all pairwise distances are truly desired, we strongly recommend that users who want national-level distance information choose a threshold in advance and use the create_dist_list() function, leaving the st parameter as NULL, as that approach will be substantially faster.
We also provide state_distance_matrix(), which acts identically to the other <location>_distance_matrix() functions, except that it does not take the st parameter. It returns a distance matrix for all states in the US, and is included for completeness and convenience, even if its practical usage is limited.