Add webscraping tutorial (OR: arbitrary number of returns from nursery) #1483

h-vetinari · 2020-04-30T12:27:03Z

I'm new to trio, but it seems to me to be the cleanest approach to async programming in python. :)
So when I had a little task of grabbing a bunch of things from the web I automatically thought I'd try it, but ran into problems straight away. Even if the solution to my problem ends up being trivial, I'm maybe a good example of someone looking at the tutorial and trying to build their first toy example (the issue title can be adapted accordingly).

Let's say I have my function:

async def get_shiny_thing(url):
    some_shiny_thing = await asks.get(url)
    # some processing
    return some_shiny_thing

All I really want to do is (knowing that the order is indeterminate):

with trio.open_nursery() as nursery:
    my_treasure = [nursery.start_soon(get_shiny_thing, url) for url in list_of_urls]

This fails with RuntimeError: use 'async with open_nursery(...)', not 'with open_nursery(...)'

Next step: a wrapper function:

async def get_treasure(generic_list):
    async with trio.open_nursery() as nursery:
        generic_treasure = [nursery.start_soon(get_shiny_thing, url)
                            for url in generic_list]
    return generic_treasure

But then - gasp! - my_treasure is empty:

>>> my_treasure = get_treasure(list_of_urls)
>>> my_treasure 
[None, None, None, None, None]

I tested that get_shiny_thing actually does what it should. I then found this SO answer from 2018 by @njsmith, which explains that what I want to do is not really possible (yet?). But the workaround of creating separate functions that each store the result for their own url (in a dict?) and then get passed to start_soon seems cumbersome, even if I built a "function factory".

In short: one of the most generic & popular async examples (a little web scraping) should IMO be one of the things in a tutorial. The tutorial even notes this absence:

(Probably a more relevant example these days would be an application that does lots of concurrent HTTP requests, but for that you need an HTTP library such as asks, so we’ll stick with the echo server tradition.)

If hip isn't ready yet, then it's maybe worth considering just doing that example with asks.

pquentin · 2020-04-30T12:49:46Z

Hi! Glad you're enjoying Trio.

Can you try https://github.com/python-trio/trimeter and tell us if that helps? Also, the recommended HTTP client right now is https://www.python-httpx.org/

alexchamberlain · 2020-04-30T12:50:23Z

Give Synchronizing and communicating between tasks a read; I'd use a channel to send back the results from the other tasks.

h-vetinari · 2020-04-30T13:13:45Z

Thanks for the quick responses!

@pquentin: Can you try https://github.com/python-trio/trimeter and tell us if that helps?

I don't have time to install it from source right now, but this seems like an excellent solution in principle (only that the last commit was Feb '19)?

@pquentin: Also, the recommended HTTP client right now is https://www.python-httpx.org/

Thanks for the tip! So hip is dead?

@alexchamberlain: Give Synchronizing and communicating between tasks a read; I'd use a channel to send back the results from the other tasks.

I'm sure the task can be implemented. But that seems (at first glance) like an unreasonably high amount of complexity/effort just to process some requests.

smurfix · 2020-04-30T15:56:10Z

the recommended HTTP client right now is https://www.python-httpx.org/

… except when you want to use a websocket …

h-vetinari · 2020-04-30T23:45:57Z

I managed to solve it by writing to a sort of global dict (which is not a pattern I like), but at least it works:

import asks
import trio

result = {}
async def get_shiny_thing(key, url, session):
    # abort if we've done the lookup already
    if key in result:
        return

    some_shiny_thing = await session.get(url)
    # whatever (process the response as needed)
    result[key] = some_shiny_thing

async def get_treasure(urls, max_concurrent=10):
    session = asks.Session(connections=max_concurrent)
    async with trio.open_nursery() as nursery:
        for key, url in enumerate(urls):
            nursery.start_soon(get_shiny_thing, key, url, session)

trio.run(get_treasure, list_of_urls)

# ... continue processing `result`

smurfix · 2020-05-01T09:00:47Z

You don't need a global dict. Just create it in get_treasure, pass it as an additional argument to get_shiny_thing, and return it at the end.

oremanj · 2020-05-12T09:31:03Z

I think the remaining action item here is a duplicate of #421.

pquentin · 2020-05-14T09:11:46Z

I don't have time to install [trimeter] from source right now, but this seems like an excellent solution in principle (only that the last commit was Feb '19)?

It does need some love (and packaging!), but it's quite small and I think it still works with latest Trio.

So hip is dead?

@h-vetinari It's not! We sometimes have periods without activity, and sometimes I work on urllib3 before merging the work in hip. We still believe the idea of hip is sound.

oremanj added the docs label May 12, 2020

oremanj closed this as completed May 12, 2020
