Add webscraping tutorial (OR: arbitrary number of returns from nursery) #1483

Closed
h-vetinari opened this issue Apr 30, 2020 · 8 comments

@h-vetinari

h-vetinari commented Apr 30, 2020

I'm new to trio, but it seems to me to be the cleanest approach to async programming in python. :)
So when I had a little task of grabbing a bunch of things from the web I automatically thought I'd try it, but ran into problems straight away. Even if the solution to my problem ends up being trivial, I'm maybe a good example of someone looking at the tutorial and trying to build their first toy example (the issue title can be adapted accordingly).

Let's say I have my function:

async def get_shiny_thing(url):
    some_shiny_thing = await asks.get(url)
    # some processing
    return some_shiny_thing

All I really want to do is (knowing that the order is indeterminate):

with trio.open_nursery() as nursery:
    my_treasure = [nursery.start_soon(get_shiny_thing, url) for url in list_of_urls]

This fails with RuntimeError: use 'async with open_nursery(...)', not 'with open_nursery(...)'

Next step: a wrapper function:

async def get_treasure(generic_list):
    async with trio.open_nursery() as nursery:
        generic_treasure = [nursery.start_soon(get_shiny_thing, url)
                            for url in generic_list]
    return generic_treasure

But then - gasp! - my_treasure contains nothing but None:

>>> my_treasure = trio.run(get_treasure, list_of_urls)
>>> my_treasure
[None, None, None, None, None]

I tested that get_shiny_thing actually does what it should. I then found this SO answer from 2018 by @njsmith, which explains that what I want to do is not really possible (yet?). But the workaround of creating separate functions that each store their own url's result (in a dict?) and then get passed to start_soon seems cumbersome, even if I built a "function factory".

In short: one of the most generic & popular async examples (a little web scraping) should IMO be covered in the tutorial. The tutorial even notes this absence:

(Probably a more relevant example these days would be an application that does lots of concurrent HTTP requests, but for that you need an HTTP library such as asks, so we’ll stick with the echo server tradition.)

If hip isn't ready yet, then it's maybe worth considering just doing that example with asks.

@pquentin
Member

Hi! Glad you're enjoying Trio.

Can you try https://github.com/python-trio/trimeter and tell us if that helps? Also, the recommended HTTP client right now is https://www.python-httpx.org/

@alexchamberlain
Contributor

Give Synchronizing and communicating between tasks a read; I'd use a channel to send back the results from the other tasks.

@h-vetinari
Author

Thanks for the quick responses!

@pquentin: Can you try https://github.com/python-trio/trimeter and tell us if that helps?

I don't have time to install it from source right now, but this seems like an excellent solution in principle (only that the last commit was Feb '19)?

@pquentin: Also, the recommended HTTP client right now is https://www.python-httpx.org/

Thanks for the tip! So hip is dead?

@alexchamberlain: Give Synchronizing and communicating between tasks a read; I'd use a channel to send back the results from the other tasks.

I'm sure the task can be implemented. But that seems (at first glance) like an unreasonably high amount of complexity/effort just to process some requests.

@smurfix
Contributor

smurfix commented Apr 30, 2020

the recommended HTTP client right now is https://www.python-httpx.org/

… except when you want to use a websocket …

@h-vetinari
Author

I managed to solve it by writing to a sort of global dict (which is not a pattern I like), but at least it works:

import asks
import trio

result = {}

async def get_shiny_thing(key, url, session):
    # abort if we've done the lookup already
    if key in result:
        return

    r = await session.get(url)
    # whatever processing is needed; as a placeholder the response is stored as-is
    some_shiny_thing = r
    result[key] = some_shiny_thing

async def get_treasure(urls, max_concurrent=10):
    # the session limits the number of simultaneous connections
    session = asks.Session(connections=max_concurrent)
    async with trio.open_nursery() as nursery:
        for key, url in enumerate(urls):
            nursery.start_soon(get_shiny_thing, key, url, session)

trio.run(get_treasure, list_of_urls)

# ... continue processing `result`

@smurfix
Contributor

smurfix commented May 1, 2020

You don't need a global dict. Just create it in get_treasure, pass it as an additional argument to get_shiny_thing, and return it at the end.

@oremanj oremanj added the docs label May 12, 2020
@oremanj
Member

oremanj commented May 12, 2020

I think the remaining action item here is a duplicate of #421.

@oremanj oremanj closed this as completed May 12, 2020
@pquentin
Member

I don't have time to install [trimeter] from source right now, but this seems like an excellent solution in principle (only that the last commit was Feb '19)?

It does need some love (and packaging!), but it's quite small and I think it still works with latest Trio.

So hip is dead?

@h-vetinari It's not! We sometimes have periods without activity, and sometimes I work on urllib3 before merging the work in hip. We still believe the idea of hip is sound.
