Multiple batches via Census API #192

Open
Chris-Larkin opened this issue May 29, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@Chris-Larkin
Chris-Larkin commented May 29, 2023

Hi! Just stumbled on your package and love it! Really really cool stuff. I've got a few thoughts on possible improvements (see below), and apologies in advance for not being able to write PRs for these myself. I'm afraid my R dev skills are nowhere near the required level...

Breaking up large address lists into smaller batches

The Census batch geocoder currently has a limit of 10k addresses per batch. For large address lists, this means the user has to split the data up and feed each chunk of at most 10k addresses to the geocoder in some kind of loop or iterative function.

It would be great if tidygeocoder handled this on the fly. censusxy has a fix for this, see lines 177-182 of this script for their solution in a parallelised implementation, and line 201 for a non-parallelised implementation, which should be portable to tidygeocoder.

Also, my understanding from reading censusxy's documentation is that while the Census batch limit is 10k, running smaller batches is actually quicker. So even if a user passes through 10k addresses, it would still be optimal to split it into ~10 batches of ~1k or fewer.

Progress bar for multi-batch implementation

I see you have a progress bar for single-address geocoding. A progress bar is most useful, though, when the geocoding is going to take many hours, i.e. in a multi-batch implementation (as described above). I'm currently using censusxy, which sadly does not have a progress bar of any kind; but if you were to implement multi-batch processing, an indicator of how many batches have been made and how many have been geocoded would be incredibly helpful for the user. At the moment, I've been running my programme for ~18 hours and have no idea if that's 10% done or 99% done.
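A per-batch progress indicator could look roughly like the sketch below. It uses base R's `utils::txtProgressBar()` and a made-up `fake_geocode()` stand-in so that it runs offline; in real use that stand-in would be replaced by a call such as `geo(addresses, method = 'census')`. This is only an illustration, not how tidygeocoder or censusxy implement anything.

```r
# Hypothetical stand-in for a real geocoding call such as
# tidygeocoder::geo(addresses, method = "census"); it returns NA coordinates.
fake_geocode <- function(addresses) {
  data.frame(address = addresses, lat = NA_real_, long = NA_real_)
}

addresses <- paste(1:45, "Main St")  # made-up addresses
batch_size <- 20
batches <- split(addresses, (seq_along(addresses) - 1) %/% batch_size)

# One tick of the bar per completed batch
pb <- utils::txtProgressBar(min = 0, max = length(batches), style = 3)
results <- vector("list", length(batches))
for (i in seq_along(batches)) {
  results[[i]] <- fake_geocode(batches[[i]])
  utils::setTxtProgressBar(pb, i)
}
close(pb)

geocoded <- do.call(rbind, results)
```

Because the bar advances once per batch, smaller batches give finer-grained feedback.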

Parallelized implementation

Consider letting users implement parallelized geocoding.
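As a rough sketch of what user-side parallelism over batches might look like, the example below uses `parallel::mclapply()` from base R's parallel package. `fake_geocode()` is a made-up stand-in for the real geocoding call, and the choice of two workers is arbitrary. Whether the Census service tolerates concurrent requests is not addressed in this thread, so treat this as an illustration of the pattern only.

```r
library(parallel)

# Hypothetical stand-in for a real geocoding call; returns NA coordinates.
fake_geocode <- function(addresses) {
  data.frame(address = addresses, lat = NA_real_, long = NA_real_)
}

addresses <- paste(1:45, "Main St")  # made-up addresses
batches <- split(addresses, (seq_along(addresses) - 1) %/% 20)

# mclapply() forks worker processes, which Windows does not support,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Results come back as a list in the same order as `batches`
results <- mclapply(batches, fake_geocode, mc.cores = n_cores)
geocoded <- do.call(rbind, results)
```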

Caching

I agree with the other issues around caching. I would expose this as an argument, e.g. cache = FALSE, with the cache written to a local .csv file of geocoded addresses. With large batches (or a large number of batches if you decide to implement multi-batch processing), if the user loses internet connection after X hours of geocoding, they would currently have to start from scratch.
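Until something like this exists in the package, a per-batch CSV cache can be approximated in user code. The sketch below is an assumption-laden illustration: `geocode_with_cache()`, `fake_geocode()`, the file naming scheme, and the cache directory are all made up here, and the stand-in geocoder returns NA coordinates so the example runs offline.

```r
# Hypothetical stand-in for a real geocoding call; returns NA coordinates.
fake_geocode <- function(addresses) {
  data.frame(address = addresses, lat = NA_real_, long = NA_real_)
}

# Geocode each batch once, writing its result to a CSV file. On a rerun,
# batches whose file already exists are read back instead of re-queried.
geocode_with_cache <- function(batches, cache_dir) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  results <- lapply(seq_along(batches), function(i) {
    cache_file <- file.path(cache_dir, sprintf("batch_%04d.csv", i))
    if (file.exists(cache_file)) {
      return(read.csv(cache_file, stringsAsFactors = FALSE))
    }
    out <- fake_geocode(batches[[i]])
    write.csv(out, cache_file, row.names = FALSE)
    out
  })
  do.call(rbind, results)
}

addresses <- paste(1:45, "Main St")  # made-up addresses
batches <- split(addresses, (seq_along(addresses) - 1) %/% 20)
cache_dir <- file.path(tempdir(), "geocode_cache")

first_run  <- geocode_with_cache(batches, cache_dir)
second_run <- geocode_with_cache(batches, cache_dir)  # read from the CSV files
```

If the connection drops partway through, rerunning the same call only has to geocode the batches that have no file yet.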

Let me know if you like any of these ideas or you want more info on use cases etc. Thanks so much for writing, developing, and maintaining tidygeocoder! It fills a big gap.

@Chris-Larkin Chris-Larkin added the enhancement New feature or request label May 29, 2023
@jessecambon
Owner

Thanks, I agree it could make sense to handle this under-the-hood or through some sort of convenience function. I can take a look at it when I have some bandwidth to work on the next release.

In the meantime, you could use some code like below to accomplish this:

library(tidyverse)
library(tidygeocoder)

batch_size <- 20 # how many addresses we want per batch

# vector of 50 addresses
address_list <- louisville |>
  mutate(combi_address = str_c(str_to_title(street), ' ', city, ', ', state)) |>
  pull(combi_address)

# batch the addresses (code from censusxy: https://github.com/chris-prener/censusxy/blob/main/R/batch.R#L201)
batches <- split(address_list, (seq(length(address_list))-1) %/% batch_size )

geocoded <- batches |>
  map_dfr( \(address) geo(address, method = 'census'))
#> Passing 20 addresses to the US Census batch geocoder
#> Query completed in: 0.5 seconds
#> Passing 20 addresses to the US Census batch geocoder
#> Query completed in: 0.3 seconds
#> Passing 10 addresses to the US Census batch geocoder
#> Query completed in: 0.2 seconds

geocoded
#> # A tibble: 50 × 3
#>    address                                       lat  long
#>    <chr>                                       <dbl> <dbl>
#>  1 2722 Elliott Ave Louisville, Kentucky        38.3 -85.8
#>  2 850 Washburn Ave Louisville, Kentucky        38.3 -85.6
#>  3 1449 St James Ct Louisville, Kentucky        38.2 -85.8
#>  4 9007 Sagebrush Ct Louisville, Kentucky       38.1 -85.6
#>  5 376 Flirtation Walk Louisville, Kentucky     38.2 -85.7
#>  6 3429 Cathe Dykstra Way Louisville, Kentucky  NA    NA  
#>  7 1406 Mill Race Rd Louisville, Kentucky       38.3 -85.6
#>  8 7511 Cane Run Rd Louisville, Kentucky        38.1 -85.9
#>  9 9605 W Manslick Rd Louisville, Kentucky      38.1 -85.8
#> 10 2310 Crittenden Dr Louisville, Kentucky      38.2 -85.8
#> # … with 40 more rows

Created on 2023-05-31 with reprex v2.0.2

@Chris-Larkin
Author

Wow, thanks Jesse!

@Chris-Larkin
Author

I'm actually struggling to make this work with my own data...

Say I have this data:

df <- tibble::tribble(
  ~num_street,           ~city, ~state, ~zip_code,
  "976 FAIRVIEW DR",   "SPRINGFIELD",  "OR",    97477L,
  "19843 HWY 213",   "OREGON CITY",  "OR",    97045L,
  "402 CARL ST",         "DRAIN",  "OR",    97435L,
  "304 WATER ST",        "WESTON",  "OR",    97886L,
  "5054 TECHNOLOGY LOOP",     "CORVALLIS",  "OR",    97333L,
  "3401 YACHT AVE",  "LINCOLN CITY",  "OR",    97367L,
  "135 ROOSEVELT AVE",          "BEND",  "OR",    97702L,
  "3631 FENWAY ST",  "FOREST GROVE",  "OR",    97116L,
  "92250 HILLTOP LN",      "COQUILLE",  "OR",    97423L,
  "6920 92ND AVE",        "TIGARD",  "OR",    97223L,
  "591 LAUREL ST", "JUNCTION CITY",  "OR",    97448L,
  "32035 LYNX HOLLOW RD",      "CRESWELL",  "OR",    97426L,
  "6280 ASTER ST",   "SPRINGFIELD",  "OR",    97478L,
  "17533 VANGUARD LN",     "BEAVERTON",  "OR",    97007L,
  "59937 CHEYENNE RD",          "BEND",  "OR",    97702L,
  "2232 42ND AVE",         "SALEM",  "OR",    97317L,
  "3100 TURNER RD",         "SALEM",  "OR",    97302L,
  "3495 CHAMBERS ST",        "EUGENE",  "OR",    97405L,
  "585 WINTER ST",         "SALEM",  "OR",    97301L,
  "23985 VAUGHN RD",        "VENETA",  "OR",    97487L
)

I'm then running this code (adapted from above):


batch_size <- 10 # how many addresses we want per batch

batches <- split(df, (seq(length(df))-1) %/% batch_size )

batches %>%
  map_dfr( \(address) geocode(street = num_street, city = city, state = state, postalcode = zip_code, 
          method = "census", full_results = TRUE, api_options = list(census_return_type = 'geographies')))

But I'm getting this error:

Error in is.data.frame(.tbl) :
argument ".tbl" is missing, with no default

I'm not entirely sure where \(address) comes from, but I'm keeping it in.

@jessecambon
Owner

\(x) is a shorthand way to define a function where the input argument is x (scroll down to the section on anonymous functions here: https://www.jumpingrivers.com/blog/new-features-r410-pipe-anonymous-functions/).

So in your example the input argument to the function is a dataframe (since you are using the geocode function).

Try something like this:

batches |>
  map_dfr( \(df) geocode(df, street = num_street, city = city, state = state, postalcode = zip_code))
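One further detail worth checking when batching a data frame rather than a vector: `length()` on a data frame returns the number of columns, so a grouping index built from it does not have one entry per row. The base R sketch below builds the index from `nrow()` instead. The rows are made up for illustration and only the column names follow the example data in this thread.

```r
# Made-up example rows; column names follow the thread's example data.
df <- data.frame(
  num_street = paste(101:125, "EXAMPLE ST"),
  city = "SALEM",
  state = "OR",
  zip_code = 97301L
)

batch_size <- 10

# length(df) is the number of columns (4 here), so the grouping index
# is built from nrow(df) to get one entry per address.
batches <- split(df, (seq_len(nrow(df)) - 1) %/% batch_size)

# Number of rows in each batch
batch_rows <- sapply(batches, nrow)
```

Each element of `batches` is itself a data frame, so it can be passed as the first argument to `geocode()` in the way the reply above describes.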

@Chris-Larkin
Author

One more thing I just thought about. censusxy has a feature to use historical 'vintages' when calling the Census API. This is quite helpful for increasing matches on historical data (e.g. if I have address data from 20 years ago, some of the addresses may no longer exist, or zip codes may have merged, etc.). Using the latest vintage, which I believe tidygeocoder does, means that these addresses will likely not get a match. Just food for thought if it's easy to integrate for your next release! Thank you again for all your help.

@jessecambon
Owner

jessecambon commented Jun 1, 2023

@Chris-Larkin vintage is a parameter in the Census geocoder API. You can pass your own vintage using the custom_query parameter. Also, setting verbose = TRUE will show you which parameters are being passed.

library(tidygeocoder)

geo('10 Wall St New York, NY', method = 'census', 
    custom_query = list(vintage = 'Public_AR_Census2020'),
    full_results = TRUE, verbose = TRUE)
#> 
#> Number of Unique Addresses: 1
#> Passing 1 address to the US Census single address geocoder
#> 
#> Number of Unique Addresses: 1
#> Querying API URL: https://geocoding.geo.census.gov/geocoder/locations/onelineaddress
#> Passing the following parameters to the API:
#> address : "10 Wall St New York, NY"
#> vintage : "Public_AR_Census2020"
#> format : "json"
#> benchmark : "Public_AR_Current"
#> HTTP Status Code: 200
#> Query completed in: 0.4 seconds
#> 
#> Query completed in: 0.6 seconds
#> # A tibble: 1 × 18
#>   address      lat  long match…¹ tiger…² tiger…³ addre…⁴ addre…⁵ addre…⁶ addre…⁷
#>   <chr>      <dbl> <dbl> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 10 Wall S…  40.7 -74.0 10 WAL… L       596596… 10005   WALL    ""      NEW YO…
#> # … with 8 more variables: addressComponents.preDirection <chr>,
#> #   addressComponents.suffixDirection <chr>,
#> #   addressComponents.fromAddress <chr>, addressComponents.state <chr>,
#> #   addressComponents.suffixType <chr>, addressComponents.toAddress <chr>,
#> #   addressComponents.suffixQualifier <chr>,
#> #   addressComponents.preQualifier <chr>, and abbreviated variable names
#> #   ¹​matchedAddress, ²​tigerLine.side, ³​tigerLine.tigerLineId, …

Created on 2023-06-01 with reprex v2.0.2

@Chris-Larkin
Author

Chris-Larkin commented Jun 1, 2023 via email
