-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discuss use cases about skipping cells during execution #429
Comments
We are developing a machine learning platform @ valohai.com One of our features is a Jupyter notebook plug-in, which will allow our customers to run the notebook from start to finish on a cloud instance asynchronously. We use Papermill for the server-side execution. One of the feature requests from customers is an ability to tag certain cells as "local only", which means they will not be executed on the server. Here is a direct quote from our customer:
We could programmatically remove the tagged cells before we send the notebook to the server, BUT this would ruin the version control side of things. When our customers come back 6 months later and download the resulting notebook, they still want the disabled cells to be there, even if they were not executed. |
Thanks @JuhaKiili. A few thoughts on notebook functionality that appears to address your use case today: If you look in the keyboard shortcuts for the notebook, you can disable a cell from running by changing it to a raw cell ( Changing a cell to a raw cell provides the same functionality as skipping a cell during execution. cc/ @MSeal |
Didn't know about raw cells. Thanks for the info! |
The raw cells would work. Though I can see that being a bit annoying to manage. I also noticed the cell that you likely want skipped in that gif is a series of magics which would make conditional execution of the logic within the cell irritating to manage by the user. While we're exploring options. A way to manage how a notebook executes more precisely can be to use a registered engine: https://papermill.readthedocs.io/en/latest/extending-entry-points.html#developing-a-new-engine. |
We discussed the topic again today in our weekly meetup for nteract and are still of the opinion that it doesn't fit the project to support as a built-in mechanism, especially with some known work-around patterns and the ability to extend papermill for more custom execution patterns. Adding this feature would increase the complexity of describing how papermill operates and why the outcome ran the way it did. It would also break the construct we rely on heavily that rerunning a papermill output in any notebook interface will produce the same result as when it ran in papermill. Of the solutions available, converting cells to raw that you don't want to have executed if probably the best option. Though if people wish to go the engines route we can help improve the engines abstraction where there are gaps to using the plugin system. e.g. @akx If we need to make a way for plugins to be able to accept plugin kwargs from CLI calls we could open a separate issue on that topic. I think that'd benefit customization in general. Other maintainers please feel free to add comments or suggestions to this thread as it goes on, since I'm echoing a wider conversation that we've had over the past few months. |
could you specify an explicit word/character limit for describing new features? We can then see if it's possible to concisely describe.
#403 fixes this by removing cells altogether hence producing notebooks that are runnable in any interface with the same results
not sure
sounds complicated but still welcome |
Sorry for not being highly responsive on this thread, it's been a busy couple weeks. I'm going to try to post the nteract groups response here, though it's flavored a bit in my fashion of wording. I asked the group to also comment on this thread so it's not just myself speaking for the group as a whole.
The complexity of describing is not in the number of words, it's in the fact that we no longer have a linear execution model. There's a lot of users of papermill that this would confuse. Many users it wouldn't confuse at all but in the past when we've had to describe non-linear behavior in other jupyter tooling it put a burden on the maintainers and community to constantly explain why execution happened the way it did. We'd like to not have base papermill have the same explanation and support burden.
That's fair, though then we have to explain to users that they need to flip cell types when they want to turn them on / off. The current UIs don't lend themselves well to this effort, as it won't tell you how to execute a raw cell, it'll just do nothing when you execute it. For a specialized usecase this is probably fine, but there's many users that don't understand the cell type concept so the change has questionable value to them. If there were a more first-class "disabled" concept that guided the users in the interface I think my objection here would be mute for a general purpose tool.
Agreed, just trying to make extensibility for other solutions to problems such as these not be explicitly blocked by the tooling in place. |
Adding an additional point that I find important for the linear execution model: reproducibility of research notebooks. A non-linear execution model with more variability makes it more difficult to account for which cells were run or should be run to reproduce. I agree @MSeal that a future jupyter/nbformat spec would be a better place for a first-class "disabled" concept. |
The issue here is that it makes the job of maintainability really difficult for us in terms of support costs. Not the literal code, but the expectation that we have to respond to issues related to this not just in papermill but also in the classic notebook UI, jupyterlab, etc. This creates notebooks that run differently in different systems, which goes against our core tenant of reproducibility. |
Today, papermill adds one cell and then runs a notebook. It's simple to explain to people. From the perspective of papermill, skipping cells while leaving them in the notebook is definitely a violation of the "core tenant of reproducibility". While #403 does not do that (by removing the cells altogether), it still introduce an additional level of complexity to the project. As the person who wrote the tag exclude rules in nbconvert, I can speak to how much complexity they can introduce to how you think about the flow of a notebook (especially when executing it). It's a lot of complexity. The complexity shows up when conveying how to use these features to users. Getting folks to understand these kinds of features & how to use them has been really challenging. All that said, it should be possible to use So if you wanted to remove cells with nbconvert with the tag "remove_me" and then pipe it into papermill as follows:
|
There are use cases involving scripted / programmatic usage and ad-hoc command line usage. A compact, intuitive CLI can be helped by some conventions, like the one discussed - with cell tagging. Think about C preprocessor, with some predefined symbols. I was thinking about suggesting something like:
Of course, you could always have a parameter for that:
And run your batch processing with The UX criteria for such a feature would be:
|
as was requested - here are my use cases: Use case 2: Problems that I encounter now, with copies that I create is when I start changing smth in these "extracts" from the original notebook - I then sometimes forget to propagate this changes back to the full notebook. |
Another use case request: I need to gracefully exit a papermill run of a notebook, error free, when, for example, the datasource being imported is invalid or empty. But the immediate Python options like |
@ctivanovich You can |
In our case, we are trying to hang the execution of the flow waiting for an event. This means we need to sav execution state, terminate the execution and after some time start where we finished. For now, we didn't find any way to accomplish it easily, but I assume skipping cells would help. |
Could you explain more what this means? And why you think skipping cells would help? |
Sure, we are experimenting with using jupyter notebooks for business process automation. In such an environment sooner or later you have to wait for some kind of event to happen. This could be either a timer event (e.g. wait 5 days) or user input (send a form to the user and wait for events) or something else. Anyway, this means the process (which in this experiment is expressed as a notebook) can take weeks or months to execute and we cannot rely on an os process to survive. And so we need a way to suspend a notebook and then resume it once the event arrives. This brings two major problems: persisting notebook state and starting process in a given place. I believe skipping cells could be used as a solution to the second one, we would just skip all cells up to the point where we finished previous execution. If you think using notebooks for business process automation is crazy then you're probably right. But as I said, its an experiment and we have quite good reasons to hope it will succeed. |
@Krever crazy idea indeed. I think you should write your own |
@Krever I hear you on taking crazy ideas and proving them out. We did something similarly crazy at the time with https://netflixtechblog.com/scheduling-notebooks-348e6c14cfd6. In our case we used an external scheduler to manage timer and event triggers to release a notebook. We also tended to split notebooks up to single execution units that can be run end-to-end because it's difficult to rehydrate the full execution state part-way through and error prone to not do so at arbitrary cells. If you are headed in that direction and have a plan for how you want to handle execution patterns within a single notebook, I think you're best bet is going to be to use engine extensions to better manage when you execute what. You can even manipulate when the kernel is up or if you need to run it remotely. This has been successfully applied for domain specific executions where you want a lot more control over execution patterns. Skipping cells would be a way to achieve part of what you need, but if you're investing in complex patterns I think you'd want more control over other aspects as well. |
Guys, I am really surprised that somebody needs to explain why skiping cells is useful! When you do your research you have a lot of cells where you render part of the data for explaratory analysis (i.e. dataframe.head(30)) Such type of cells are totally useless for any machine-learning workflows but are very helpful for everybody who uses the code |
What you're not seeing in this thread @antonkulaga is the several thousand users we interface with in a daily basis across our respective companies that use this tool. Making it easier to conditional execute cells does helps some users do more complex tasks. It also make novice developers and non-professional programmers more likely to set themselves up for failure. Papermill is an opinionated tool. One of those opinions is that notebooks when run in a non-iterative environment should be run from beginning to end. If we add deviations from this it complicates support patterns and leads to needing to encode what cells should and shouldn't be used in offline systems, or when passing to another person / toolchain those tools now have to be aware of more configuration and complexity. That being said, we're not saying we won't do it in papermill directly. We've been delaying making a reversal decision here for two reasons: A) the maintainers of papermill have a bunch of these projects we work on in our spare time so we've been busy and B) we wanted more time to collect usecases and impact it might have if we added the option. You can also add this with your own extension if you really want it today while we sit on the topic. Plus we presented a few alternatives that are available to combine tooling or patterns to achieve skipped cells. Saying "I am really surprised that somebody needs to explain why skiping cells is useful!" is a little inflammatory when we're keeping things open to discuss the ideas and tradeoffs here despite initially saying we didn't want the feature to be added in the first place. After nbconvert 6.0 and the new nbclient are in a solid place I'll personally have more time to think through this issue given the context in the thread and in our interactions with the other maintainer's development ecosystems. |
sigh.
This statement should have been made years ago and featured prominently in the docs. It is perfectly reasonable to be opinionated (e.g. https://github.com/psf/black) provided you say so clearly and in advance. This issue could have been closed a long time ago, or at least a lot of debate avoided.
You imply helpful contributors are being deliberately inflammatory. This is a worrying allegation you seem to resort to every time someone expresses passion equal to your own. It's not in any way helpful, especially since it's abundantly clear that said contributors are genuinely looking to solve a problem, not create one. Furthermore, let us assume some contributors are deliberately being inflammatory. Someone who claims to manage several thousand users daily should be able to accept harsh criticism. If you genuinely believe any conduct to be inappropriate it should be reported. |
Do we need to sigh?
Fair that it's not documented. I've said this in almost all of the papermill talks I have done, and other like Kyle, M, and Carol have said it different settings but it's not mentioned in the docs.
I never said it was deliberate. Sometimes it's a language barrier or just not carefully worded. We get a lot of negativity thrown at maintainers on some threads across the projects. I try to mention when it's headed in a bad direction and not jump to reporting people for miscommunication alone.
I'm fine with passion. But implying we don't understand and that it's so simply anyone whose reasonable would see, after many responses to the topic here IS frustrating and doesn't help motivate us to meet in the middle. As to my response, your comments in #425. the one place where I feel we pushed back strongly on an idea were called out for you violating our conduct (this for was for the same feature being discussed here). Another primary maintainer responded in that thread and I had the remaining nteract maintainers review my response before I made it. I didn't report it, as I wanted to make clear the issue we were seeing and ask for a correction in behavior because we value contributors and I don't want a hostile project space. This here is nowhere near that level of issue, so I'd ask to not compare the two. I do not wish to continue this line of discussion in this thread, and would prefer we talk about the subject. On that I will make a concerted effort to work towards a better solution in this tool or in combination with other tooling to enable to intended usecase, but I don't currently have bandwidth to engage on the topic and I worry about adding cell skipping needs clear direction and tooling decisions or can lead to the concerns mentioned above removing value from the tool's simplicity. If another nteract maintainer has more bandwidth to tackle the topic I'll help review and promote changes that are agreed upon. |
If folks have proposals to help mitigate the risks of adding tooling complexity or making the outputs clear about what executed in the different interfaces for viewing notebooks that is likely the next step to helping get support for the capability more natively available. |
Thank you, @MSeal. I want to remind everyone of the Code of Conduct. By participating in the issue discussion here and in any other nteract project, you agree to the guidelines outlined in this Code of Conduct. Please acknowledge the combined hundreds of hours of time that maintainers invest in this project, despite diminutive comments, and lend everyone kindness in the conversation. |
As a maintainer of black and papermill, I will go ahead and add that papermill is also opinionated. As for the comments, I'm disappointed that folks post disrespectful, arrogant, and entitled statements. I'm sincerely sorry that you disagree with our decisions and approach. You are welcome to fork the project, but bullying the maintainers is rarely a road to success in open source. |
In discussing the topic in today's weekly nteract meeting, I proposed that we could mark cells to be commented instead of strictly skipped by tag. This would preserve the execution end-to-end without downstream systems needing to understand, leave the code which was skipped in the document, and would mimic how a user would temporarily skip some extra code in a notebook before hitting Run All. The other option would be to remove the cells being marked, though some of the earlier comments indicated that this would be non-ideal for their use-case. Thoughts? |
Why not to have both options simultaneously: with one tag cell code will be commented out, with another - it will be removed? |
Noted. Probably possible to implement both if it's done. |
@MSeal In my use cases commenting out and removal will both be fine, so I don't have a preference. I would like to see this implemented though as I think it provides good value, especially in data science. |
Ok we're going to start with |
Fantastic project guys, I tried to write a simple script and hit all sorts of problems with running code as part of tests but papermill made it very easy. We're using this in fastaudio at the moment to verify the tutorial notebooks aren't broken. The problem is, these notebooks have long running executing code in them that we don't want running in the CI. For example we train models and download large datasets and we don't want that happening in the CI. What we do need to happen, is a quick check of syntax and that the code runs. We've created a method for this: I feel that this feature belongs as a part of papermill. A way of skipping cells based on a state. |
+1 |
Any update on this? I see a message from June 2020 about implementing |
@ianhellstrom @mogwai @knkumar @amardeep you could try the implementation from #403
which will give you:
|
Yes, I just had a child and got really distracted on Open Source issues. If someone wants to add a PR with the described approach I'd approve and merge. Otherwise I will put this back on my radar to tackle. |
@MSeal Does |
No I have not implemented it. Covid + first child killed a lot of my available time. Happy to help merge a PR if someone opens it up though |
my team works on a similar project, and we implemented the pip install ploomber-engine
ploomber-engine in.ipynb out.ipynb --remove-tagged-cells remove or from ploomber_engine import execute_notebook
execute_notebook("in.ipynb", "out.ipynb", remove_tagged_cells="remove") |
Thanks for the quick implementation Eduardo. Here is the relevant issue if others would like to chime in with their use cases: |
From #425 by @akx:
I'm opening this issue to gather feedback on real world use cases which may benefit from the ability to skip cells during execution. Let's focus on documenting the use cases before moving toward a technical implementation.
The text was updated successfully, but these errors were encountered: