Multiple batches via Census API #192
Thanks, I agree it could make sense to handle this under the hood or through some sort of convenience function. I can take a look at it when I have some bandwidth to work on the next release. In the meantime, you could use some code like the below to accomplish this:
library(tidyverse)
library(tidygeocoder)
batch_size <- 20 # how many addresses we want per batch
# vector of 50 addresses
address_list <- louisville |>
mutate(combi_address = str_c(str_to_title(street), ' ', city, ', ', state)) |>
pull(combi_address)
# batch the addresses (code from censusxy: https://github.com/chris-prener/censusxy/blob/main/R/batch.R#L201)
batches <- split(address_list, (seq(length(address_list))-1) %/% batch_size )
geocoded <- batches |>
map_dfr( \(address) geo(address, method = 'census'))
#> Passing 20 addresses to the US Census batch geocoder
#> Query completed in: 0.5 seconds
#> Passing 20 addresses to the US Census batch geocoder
#> Query completed in: 0.3 seconds
#> Passing 10 addresses to the US Census batch geocoder
#> Query completed in: 0.2 seconds
geocoded
#> # A tibble: 50 × 3
#> address lat long
#> <chr> <dbl> <dbl>
#> 1 2722 Elliott Ave Louisville, Kentucky 38.3 -85.8
#> 2 850 Washburn Ave Louisville, Kentucky 38.3 -85.6
#> 3 1449 St James Ct Louisville, Kentucky 38.2 -85.8
#> 4 9007 Sagebrush Ct Louisville, Kentucky 38.1 -85.6
#> 5 376 Flirtation Walk Louisville, Kentucky 38.2 -85.7
#> 6 3429 Cathe Dykstra Way Louisville, Kentucky NA NA
#> 7 1406 Mill Race Rd Louisville, Kentucky 38.3 -85.6
#> 8 7511 Cane Run Rd Louisville, Kentucky 38.1 -85.9
#> 9 9605 W Manslick Rd Louisville, Kentucky 38.1 -85.8
#> 10 2310 Crittenden Dr Louisville, Kentucky 38.2 -85.8
#> # … with 40 more rows
Created on 2023-05-31 with reprex v2.0.2
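The same split-and-bind pattern can be wrapped in a small reusable function. The following is only a sketch: `geocode_in_batches()` and its `geocode_fn` argument are hypothetical names, not part of tidygeocoder, and the default simply forwards each batch to `geo()` with `method = "census"`.

```r
# Hypothetical helper (not part of tidygeocoder): geocode a character vector
# in fixed-size batches and row-bind the per-batch results.
geocode_in_batches <- function(addresses, batch_size = 1000,
                               geocode_fn = function(x) {
                                 tidygeocoder::geo(x, method = "census")
                               }) {
  stopifnot(is.character(addresses), batch_size >= 1)
  if (length(addresses) == 0) return(data.frame())

  # Batch index per address: 0 for the first batch_size addresses, 1 for the next, ...
  batch_id <- (seq_along(addresses) - 1) %/% batch_size
  batches <- unname(split(addresses, batch_id))

  # One geocoder call per batch, then stack the results in input order.
  results <- lapply(batches, geocode_fn)
  do.call(rbind, results)
}
```

Passing the geocoder as an argument keeps the batching logic separate from the network call, so the splitting can be checked with a stand-in function before any requests are sent.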
Wow, thanks Jesse!
I'm actually struggling to make this work with my own data... Say I have this data:
I'm then running this code (adapted from above):
But I'm getting this error:
I'm not entirely sure where
So in your example the input argument to the function is a dataframe (since you are using the geocode function). Try something like this:
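For data frame input, one possible pattern is to split the rows into batches and call `geocode()` on each piece. This is a sketch under assumptions: the helper name `geocode_df_in_batches()` and the column name `full_address` are hypothetical and stand in for whatever the real data uses.

```r
# Hypothetical helper: batch a data frame by rows and geocode each piece.
# Assumes the data frame has a single-line address column named `full_address`.
geocode_df_in_batches <- function(df, batch_size = 1000,
                                  geocode_fn = function(d) {
                                    tidygeocoder::geocode(d, address = full_address,
                                                          method = "census")
                                  }) {
  stopifnot(is.data.frame(df), nrow(df) > 0)

  # Same integer-division trick as for vectors, applied to row numbers.
  batch_id <- (seq_len(nrow(df)) - 1) %/% batch_size
  pieces <- unname(split(df, batch_id))

  # geocode() returns the input rows with lat/long columns appended.
  do.call(rbind, lapply(pieces, geocode_fn))
}
```

The difference from the vector version is that `geo()` takes the addresses themselves, while `geocode()` takes a data frame plus the name of the column that holds them.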
One more thing I just thought about. censusxy has a feature to use historical 'vintages' when calling the Census API. This is quite helpful for increasing match rates on historical data (e.g. if I have address data from 20 years ago, some of the addresses may no longer exist, or zip codes may have merged, etc.). Using the latest vintage, which I believe tidygeocoder does, means that these addresses will likely not get a match. Just food for thought if it's easy to integrate for your next release! Thank you again for all your help.
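@Chris-Larkin vintage is a parameter in the Census geocoder API. You can pass your own vintage using the custom_query parameter. Also, setting verbose = TRUE will show you which parameters are being passed.
library(tidygeocoder)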
geo('10 Wall St New York, NY', method = 'census',
custom_query = list(vintage = 'Public_AR_Census2020'),
full_results = TRUE, verbose = TRUE)
#>
#> Number of Unique Addresses: 1
#> Passing 1 address to the US Census single address geocoder
#>
#> Number of Unique Addresses: 1
#> Querying API URL: https://geocoding.geo.census.gov/geocoder/locations/onelineaddress
#> Passing the following parameters to the API:
#> address : "10 Wall St New York, NY"
#> vintage : "Public_AR_Census2020"
#> format : "json"
#> benchmark : "Public_AR_Current"
#> HTTP Status Code: 200
#> Query completed in: 0.4 seconds
#>
#> Query completed in: 0.6 seconds
#> # A tibble: 1 × 18
#> address lat long match…¹ tiger…² tiger…³ addre…⁴ addre…⁵ addre…⁶ addre…⁷
#> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 10 Wall S… 40.7 -74.0 10 WAL… L 596596… 10005 WALL "" NEW YO…
#> # … with 8 more variables: addressComponents.preDirection <chr>,
#> # addressComponents.suffixDirection <chr>,
#> # addressComponents.fromAddress <chr>, addressComponents.state <chr>,
#> # addressComponents.suffixType <chr>, addressComponents.toAddress <chr>,
#> # addressComponents.suffixQualifier <chr>,
#> # addressComponents.preQualifier <chr>, and abbreviated variable names
#> # ¹matchedAddress, ²tigerLine.side, ³tigerLine.tigerLineId, …
Created on 2023-06-01 with reprex v2.0.2
Oh fantastic. Thank you!
Hi! Just stumbled on your package and love it! Really really cool stuff. I've got a few thoughts on possible improvements (see below), and apologies in advance for not being able to write PRs for these myself. I'm afraid my R dev skills are nowhere near the required level...
Breaking up large address lists into smaller batches
The Census API currently has a limit of 10k addresses per batch. For large address lists, this means the user has to split the data up and feed each segment of fewer than 10k addresses into the geocoder in some kind of loop or iterative function.
It would be great if tidygeocoder handled this on the fly. censusxy has a fix for this, see lines 177-182 of this script for their solution in a parallelised implementation, and line 201 for a non-parallelised implementation, which should be portable to tidygeocoder.
Also, my understanding from reading censusxy's documentation is that while the Census batch limit is 10k, running smaller batches is actually quicker. So even if a user passes through 10k addresses, it would still be optimal to split it into ~10 batches of ~1k or fewer.
Progress bar for multi-batch implementation
I see you have a progress bar for single-address geocoding. Actually, a progress bar is most useful when the geocoding is going to take many hours, i.e. in a multi-batch implementation (as described above). I'm currently using censusxy, which sadly does not have a progress bar of any kind; but if you were to implement multi-batch processing, an indicator of how many batches have been made and how many have been geocoded would be incredibly helpful for the user. At the moment, I've been running my programme for ~18 hours and have no idea if that's 10% done or 99% done.
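To illustrate what batch-level progress reporting could look like, here is a minimal sketch using base R's `utils::txtProgressBar()`. The function name `geocode_with_progress()` is hypothetical. The default geocoder passes `quiet = TRUE` to `geo()` on the assumption that this keeps the per-batch messages from interrupting the bar.

```r
# Hypothetical helper: geocode in batches and advance a text progress bar
# after each completed batch.
geocode_with_progress <- function(addresses, batch_size = 1000,
                                  geocode_fn = function(x) {
                                    tidygeocoder::geo(x, method = "census", quiet = TRUE)
                                  }) {
  stopifnot(is.character(addresses), length(addresses) > 0)
  batches <- unname(split(addresses, (seq_along(addresses) - 1) %/% batch_size))

  # The bar runs from 0 to the number of batches.
  pb <- utils::txtProgressBar(min = 0, max = length(batches), style = 3)
  on.exit(close(pb), add = TRUE)

  results <- vector("list", length(batches))
  for (i in seq_along(batches)) {
    results[[i]] <- geocode_fn(batches[[i]])
    utils::setTxtProgressBar(pb, i)  # i of length(batches) batches done
  }
  do.call(rbind, results)
}
```

Because the bar counts batches rather than addresses, smaller batches give finer-grained feedback at the cost of more requests.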
Parallelized implementation
Consider letting users implement parallelized geocoding.
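A rough sketch of how batches could be geocoded in parallel with the base `parallel` package follows. The names `census_geo()` and `geocode_parallel()` are hypothetical. It assumes tidygeocoder is installed for every worker process, and it is worth checking the service's usage terms before sending many concurrent requests.

```r
# Default geocoder, defined at top level so it can be sent to worker processes.
# Assumes tidygeocoder is installed wherever the workers run.
census_geo <- function(x) tidygeocoder::geo(x, method = "census", quiet = TRUE)

# Hypothetical helper: spread the batches across a local cluster of R workers.
geocode_parallel <- function(addresses, batch_size = 1000, workers = 2,
                             geocode_fn = census_geo) {
  stopifnot(is.character(addresses), length(addresses) > 0)
  batches <- unname(split(addresses, (seq_along(addresses) - 1) %/% batch_size))

  cl <- parallel::makeCluster(workers)
  on.exit(parallel::stopCluster(cl), add = TRUE)

  # parLapply() returns results in the same order as the input batches.
  results <- parallel::parLapply(cl, batches, geocode_fn)
  do.call(rbind, results)
}
```

`parLapply()` is used here rather than `mclapply()` because it does not rely on forking, so the same code should also work on Windows.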
Caching
I agree with other issues around caching. I would have this as an argument, e.g. cache = FALSE, with the output written to a local .csv file of geocoded addresses. With large batches (or a large number of batches if you decide to implement multi-batch processing), if the user loses internet connection after X hours of geocoding, they would currently have to start from scratch.
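The resume-after-failure idea can be sketched as follows. This is a hypothetical helper, not an existing tidygeocoder feature: `geocode_with_cache()` and `cache_file` are illustrative names, and the sketch assumes the geocoder returns a data frame with an `address` column, as `geo()` does in the examples in this thread.

```r
# Hypothetical helper: append each finished batch to a CSV file so that an
# interrupted run can be restarted without re-geocoding completed addresses.
geocode_with_cache <- function(addresses, cache_file, batch_size = 1000,
                               geocode_fn = function(x) {
                                 tidygeocoder::geo(x, method = "census")
                               }) {
  stopifnot(is.character(addresses))
  read_cache <- function() {
    if (!file.exists(cache_file)) return(NULL)
    utils::read.csv(cache_file, stringsAsFactors = FALSE)
  }

  # Only geocode addresses that are not already in the cache file.
  cached <- read_cache()
  todo <- setdiff(unique(addresses), cached$address)
  batches <- split(todo, (seq_along(todo) - 1) %/% batch_size)

  for (batch in batches) {
    res <- as.data.frame(geocode_fn(batch))
    has_file <- file.exists(cache_file)
    # Write the header only when creating the file; append afterwards.
    utils::write.table(res, cache_file, sep = ",", qmethod = "double",
                       row.names = FALSE, col.names = !has_file, append = has_file)
  }

  out <- read_cache()
  if (is.null(out)) return(data.frame())
  out[out$address %in% addresses, , drop = FALSE]
}
```

Note that this sketch also caches unmatched rows (those with NA coordinates), so a rerun would not retry them; whether that is desirable depends on the use case.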
Let me know if you like any of these ideas or you want more info on use cases etc. Thanks so much for writing, developing, and maintaining tidygeocoder! It fills a big gap