data/config path entry_points with minimal examples #209
I see this as a good alternative to using data_files without overhauling the config system. I am a bit worried that it's hard to debug when things go wrong (if 15 directories will be scanned). Could we maybe provide a richer debug facility to see a particular config key, and how each directory is changing it. Grepping in 15 directories will not be fun. Or do I see a problem that does not exist, and are the debug options sufficient? |
Yep, there will be a lot of directories beyond the Big Four. No doubt some combination of A
$> jupyter foo --show-config
environment variables:
  - JUPYTER_PREFER_ENV_PATH: not set
  - ...
paths:
  - /etc/jupyter/jupyter_config.json: not found
  ...
  - ~/my-project/src/my_project/etc/jupyter_foo_config.d/my-project.json:
    + SomeHasTraits:
    +   foo: bar
  ...
  - ~/my-project/src/my_project/.venv/etc/jupyter_config.d/someone-elses-project.json:
      SomeHasTraits:
    -   foo: bar
    +   foo: baz
  ...
  - ./jupyter_foo_config.json: not found
final:
  SomeHasTraits:
    foo: baz
sprinkle in some pygments (if available) and it would be pretty usable. |
Indeed, exactly what I had in mind, that would help a lot |
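To make the `--show-config` idea above a bit more concrete, here is a minimal sketch of tracing a single key through an ordered list of JSON config files. It is not jupyter_core code: the function name `trace_key` and its return shape are hypothetical, and it assumes each file holds a plain `{section: {key: value}}` mapping like the mock-up.

```python
import json
import pathlib

_MISSING = object()  # sentinel: the key has not been set by any file yet


def trace_key(paths, section, key):
    """Report how each file in `paths` (lowest priority first) affects section.key.

    Returns (history, final) where history is a list of (path, status, value).
    """
    history = []
    current = _MISSING
    for path in paths:
        p = pathlib.Path(path)
        if not p.is_file():
            history.append((str(p), "not found", None))
            continue
        data = json.loads(p.read_text(encoding="utf-8"))
        values = data.get(section, {})
        if key not in values:
            history.append((str(p), "no change", None))
            continue
        new = values[key]
        if current is _MISSING:
            status = "set"
        elif new != current:
            status = "changed"
        else:
            status = "unchanged"
        current = new
        history.append((str(p), status, new))
    final = None if current is _MISSING else current
    return history, final
```

Calling `trace_key(list_of_files, "SomeHasTraits", "foo")` would give one row per scanned file, which is roughly the information the mock-up prints as a diff.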
Gah, looking at it: a lot of the complexity is duplicated between Perhaps the better short-term approach would be to invert it, with a separate package/command, e.g. offered Because of that complexity, this could probably not land here, unless the ConfigManager pattern was brought upstream, which sounds hard to coordinate. |
I have an unshareably bad version of this, but it kinda works with
getting jupyter_server_config from /etc/jupyter
got {}
getting jupyter_server_config from /usr/local/etc/jupyter
got {}
getting jupyter_server_config from /home/weg/projects/jupyter_showconfig_/envs/default/etc/jupyter
Reading file /home/weg/projects/jupyter_showconfig_/envs/default/etc/jupyter/jupyter_server_config.d/jupyterlab.json
Reading file /home/weg/projects/jupyter_showconfig_/envs/default/etc/jupyter/jupyter_server_config.d/nbclassic.json
Reading file /home/weg/projects/jupyter_showconfig_/envs/default/etc/jupyter/jupyter_server_config.d/voila.json
got {'ServerApp': {'jpserver_extensions': {'jupyterlab': True, 'nbclassic': True, 'voila.server_extension': True}}}
getting jupyter_server_config from /home/weg/.jupyter
got {}
getting page_config from /etc/jupyter/labconfig
got {}
getting page_config from /usr/local/etc/jupyter/labconfig
got {}
getting page_config from /home/weg/projects/jupyter_showconfig_/envs/default/etc/jupyter/labconfig
got {}
getting page_config from /home/weg/.jupyter/labconfig
got {}
[I 2020-11-22 17:50:37.177 ServerApp] jupyterlab | extension was successfully linked.
getting jupyter_notebook_config from /home/weg/.jupyter
got {}
getting jupyter_notebook_config from /etc/jupyter
got {}
getting jupyter_notebook_config from /usr/local/etc/jupyter
got {}
getting jupyter_notebook_config from /home/weg/projects/jupyter_showconfig_/envs/default/etc/jupyter
Reading file /home/weg/projects/jupyter_showconfig_/envs/default/etc/jupyter/jupyter_notebook_config.d/jupyterlab.json
Reading file /home/weg/projects/jupyter_showconfig_/envs/default/etc/jupyter/jupyter_notebook_config.d/voila.json
got {'NotebookApp': {'nbserver_extensions': {'jupyterlab': True, 'voila.server_extension': True}}}
getting jupyter_notebook_config from /home/weg/.jupyter
got {}
[I 2020-11-22 17:50:37.322 ServerApp] nbclassic | extension was successfully linked.
[I 2020-11-22 17:50:37.322 ServerApp] voila.server_extension | extension was successfully linked.
[I 2020-11-22 17:50:37.339 LabApp] JupyterLab extension loaded from /home/weg/projects/jupyter_showconfig_/envs/default/lib/python3.7/site-packages/jupyterlab
[I 2020-11-22 17:50:37.339 LabApp] JupyterLab application directory is /home/weg/projects/jupyter_showconfig_/envs/default/share/jupyter/lab
[I 2020-11-22 17:50:37.342 ServerApp] jupyterlab | extension was successfully loaded.
[I 2020-11-22 17:50:37.345 ServerApp] nbclassic | extension was successfully loaded.
[I 2020-11-22 17:50:37.347 ServerApp] voila.server_extension | extension was successfully loaded.
Update: here's some better stuff, generated with
|
This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there: |
Since entry points come from packages installed in the environment, I think it makes sense that they are treated like the environment paths
@bollwyvl - I made a PR to your PR with a few changes I thought would be good: bollwyvl#1. What do you think? |
Entry point paths treated like environment paths
I'll see if i can get that together. It doesn't add any non-stdlib dependencies, and it gives us some wiggle room for the future. One issue with having dotted notation to the left of the
So I don't know yet how we might avoid the import behavior... i suppose tossing a
That'll be more fun 😝 |
With 1000 packages (so 2000 entry_points):
|
Starting lab:
a minute to first pixels isn't too pretty 😢 |
throwing in a little bit of cache helps immeasurably... well, measurably... but i haven't measured it.
import functools
import math
import time

def _entry_point_paths(ep_group):
    # the second argument changes every 100 seconds, so cached results expire
    return _cached_entry_point_paths(ep_group, math.floor(time.time() / 100))

@functools.lru_cache(maxsize=10)
def _cached_entry_point_paths(ep_group, epoch):
    ...
|
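The trick in the snippet above (passing a coarse time bucket as an extra cache key) can be wrapped in a small decorator. This is a sketch only: `ttl_cache` and its `clock` parameter are hypothetical names introduced here so the behavior can be exercised without waiting for real time to pass.

```python
import functools
import math
import time


def ttl_cache(seconds, maxsize=10, clock=time.time):
    """Cache results per time bucket of `seconds`; `clock` is injectable for tests."""

    def decorator(func):
        @functools.lru_cache(maxsize=maxsize)
        def cached(bucket, *args):
            # `bucket` is unused except as part of the cache key
            return func(*args)

        @functools.wraps(func)
        def wrapper(*args):
            return cached(math.floor(clock() / seconds), *args)

        wrapper.cache_clear = cached.cache_clear
        return wrapper

    return decorator
```

Note that entries expire at bucket boundaries rather than a fixed interval after they were stored, so a value cached late in a bucket is recomputed sooner than `seconds` later. That matches the original snippet's behavior.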
Ouch. I suppose it does have to open lots of files, which is going to be an even bigger pain on NFS and slower filesystems. |
For completeness in documenting discussions in Jupyter around entry points, see also jupyter/notebook#2894. |
Unfortunately, as far as I can tell, conda does not support general entry points, just |
Put back your pitchforks, No worries here! A number of ecosystems (like pytest) would fall entirely apart. The reason |
Also: tried the PEP 420 namespace package thing... might be a non-starter as totally unsurprisingly the files wouldn't be in place with a |
I was just testing things to see if what conda recipes call "entry points" are in reality just "console_script entry points", and if it just left all other entry points alone that were already in the dist_info directories. ...and yes, installing a conda environment with Pitchfork being sheathed :). |
FYI @bollwyvl, it looks like to me that if you have many entry points with the same name, the Edit: oh, never mind, you just have to use the |
Here are my timings for Using the entrypoints package:
Using
Also, it seems that JupyterLab is slowed down by about a second if the entrypoint paths are cached:
|
I should perhaps clarify this in the importlib.resources docs. The access to resources on the file system is meant to be for the duration of the context manager and that any expectation of use outside of that should be implemented downstream. In other words, if having a copy after the interpreter exits is a goal, I'd recommend to build a routine that manages that lifecycle and copies the content to the more permanent location. The Python import system has little control over the state of the system between interpreter runs (including pip uninstalls) and there's no proposed spec that I'm aware of that would enable management of resources across runs.
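A minimal sketch of the "copy it to a more permanent location" routine described above, assuming Python 3.9+ (for `importlib.resources.files` and `as_file`). The helper name `copy_resource` is hypothetical, and it only handles a single file directly inside the package.

```python
import importlib.resources
import pathlib
import shutil


def copy_resource(package, resource_name, dest_dir):
    """Copy one packaged resource file into dest_dir and return the new path."""
    dest_dir = pathlib.Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / resource_name
    resource = importlib.resources.files(package).joinpath(resource_name)
    # The path yielded by as_file() is only guaranteed to exist inside the
    # `with` block, so the copy has to happen before the block exits.
    with importlib.resources.as_file(resource) as src:
        shutil.copyfile(src, target)
    return target
```

The caller then owns the copied file: nothing in the import system will update or remove it when the package is upgraded or uninstalled.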
It does seem like
There is a definition for entry points and that definition does state that the value should be an importable module and optional name inside that module :/. It does feel like mild abuse to violate this stated intention. If there were a clear and obvious way for a package to expose another form of arbitrary metadata, that would be my recommendation, but I'm not sure if such an approach is readily feasible in the current metadata design, as I've not seen it before. But I just tested it, and I think this could work. Instead of using Then, in the
You'd still need a way to solicit the exact hooks for each project. I'd recommend soliciting the hooks from a pyproject.toml, something like
In this way, you're following the same principles as setuptools uses to solicit and expose entry points, but you're defining a custom format for a distinct purpose. You would have to design and implement the syntax for the file and parse it yourselves, but you probably want that anyway. The advantage is you have imminent control over the syntax and experience and you're still using the same metadata mechanism as entry points and other packaging patterns. I'd be willing to help guide this implementation if it sounds attractive.
Yes, and an intended one with importlib_metadata 3.5. Essentially, in order to deduplicate distributions correctly, the metadata for each distribution needs to be loaded. There are plans in python/importlib_metadata#283 to improve performance in light of that concern. |
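For readers unfamiliar with the custom-metadata-file approach discussed in this comment, the lookup side might look roughly like the sketch below. The file name passed in is up to the caller and is purely hypothetical here; no such file format was agreed on in this thread.

```python
from importlib import metadata


def read_custom_metadata(filename):
    """Map distribution name -> text of `filename` from its metadata directory.

    Distributions that do not ship the file are skipped.
    """
    found = {}
    for dist in metadata.distributions():
        text = dist.read_text(filename)  # None when the file is absent
        if text is None:
            continue
        try:
            name = dist.metadata["Name"]
        except Exception:
            name = None  # tolerate distributions with broken metadata
        found[name or repr(dist)] = text
    return found
```

Parsing the returned text (JSON, INI, or something else) would be a separate, project-defined step, as the comment notes.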
Thanks for weighing in on this. Interestingly, one of the primary reasons for us to move to entry points over using data_files is that Python will manage the lifecycle of these files. Perhaps we're chasing a pipe dream if we need to build something generic enough to support any way a python module might be loaded, but also need the resources to be available outside of Python.
Nice, thanks! This looks like the approach I was attempting in the "Alternative Solutions" section in the issue description, in commit jasongrout@66351b0 (however, I was really fumbling to get the metadata out of the distributions, and I'm sure I made some inaccurate assumptions involving top_level.txt, for example). We decided to abandon this approach in favor of entry_points since various packagers like poetry, flit, etc., don't seem to support arbitrary metadata files, and having broad packager support was one of our design goals. By the way, I've been thinking over the past few days about how to make finding a specific group of entry points potentially faster (I haven't benchmarked any experiments, so of course this should be treated with appropriate skepticism). It seems that getting a specific group of entry points requires reading in and parsing all entry point metadata files in the entire python installation, then filtering for the group I want. My hypothesis is that checking if a file exists is much faster than opening and parsing a file. If each group of entry points was stored in a separate file inside the dist_info/egg directory (for example, as files named by the group in a new |
Worth a try, but my guess is that the largest proportion of the slowness comes from the disk list operation 🤔 but would be great to see some benchmark numbers on discovery vs parsing overhead. |
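One way to get rough numbers on discovery versus parsing is a throwaway benchmark like the following. It is a sketch under simplifying assumptions: it fabricates a flat directory of fake `*.dist-info` folders in a temp directory, runs on a warm local filesystem cache, and says nothing about NFS or real environments. All names are made up for the illustration.

```python
import configparser
import os
import tempfile
import timeit


def benchmark(n_files=50, repeat=20):
    """Time listing, existence checks, and parsing over fake entry_points.txt files."""
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for i in range(n_files):
            dist_info = os.path.join(root, "pkg%d.dist-info" % i)
            os.mkdir(dist_info)
            path = os.path.join(dist_info, "entry_points.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write("[console_scripts]\ntool%d = pkg%d:main\n" % (i, i))
            paths.append(path)

        def list_dirs():
            return [entry.name for entry in os.scandir(root)]

        def check_exists():
            return [os.path.exists(p) for p in paths]

        def parse_all():
            sections = []
            for p in paths:
                parser = configparser.ConfigParser(delimiters=("=",))
                parser.read(p, encoding="utf-8")
                sections.append(parser.sections())
            return sections

        return {
            "list": timeit.timeit(list_dirs, number=repeat),
            "exists": timeit.timeit(check_exists, number=repeat),
            "parse": timeit.timeit(parse_all, number=repeat),
        }
```

Printing `benchmark()` gives total seconds per strategy. Comparing the three on the filesystem of interest would show whether listing or parsing dominates there.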
Had some other thoughts about our scale issue. And, for reference, a quick look revealed that we are talking about a rough venn diagram of:
so these scale concerns are not entirely academic bikeshedding. Regarding benchmarking: yeah, the above were all with |
This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/how-could-data-files-be-improved/8972/2 |
"""
spec = importlib.util.find_spec(ep.module_name)
module = importlib.util.module_from_spec(spec)
origin = pathlib.Path(module.__file__).parent.resolve()
module.__file__ is None if there's no top-level __init__.py file in the module
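One possible way around the problem raised in this review comment is to look at the spec instead of `module.__file__`. This is a sketch, not the fix adopted in the PR: `package_dir` is a hypothetical helper, and for a namespace package spread over several directories it simply picks the first location.

```python
import importlib.util
import pathlib


def package_dir(module_name):
    """Return the directory of a module or package, including namespace packages."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        raise ModuleNotFoundError(module_name)
    if spec.has_location and spec.origin is not None:
        # regular module or package: origin is the path of the .py file
        return pathlib.Path(spec.origin).parent.resolve()
    # namespace package (no __init__.py): there is no origin file,
    # but the search locations list the directories that make it up
    locations = list(spec.submodule_search_locations or [])
    if not locations:
        raise ValueError("%s has no file system location" % module_name)
    return pathlib.Path(locations[0]).resolve()
```

Built-in modules such as `sys` have neither an origin file nor search locations, so they raise `ValueError` here.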
I brought this up in the jlab dev call today: https://hackmd.io/Y7fBMQPSQ1C08SDGI-fwtg?both#5-May-2021-Weekly-Meeting @bollwyvl @jasongrout What's the status of the work on |
This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/package-managers-extension-paths/11723/2 |
Closing in favor of using |
Background
Jupyter relies on a hierarchy of directories (user-level, environment-level, system-level, etc.) to store configuration and data. These directories are used by a number of Jupyter programs, for example:
Problem
Currently the environment level of this directory hierarchy is a fixed location based on sys.prefix. This means that packages need to copy their files into this directory at install time, which has several issues:
data_files feature of Python packages, which is deprecated in setuptools and is not supported in non-setuptools-based packagers like flit, poetry (see here), etc.
site-packages). For some extensions, this is huge (like megabytes or tens of megabytes).
pip -e) do not update data files when the source files change, so when developing a package, if something changes to the data files, you either have to copy them over again, or you have to run a command to make the appropriate data directory a symbolic link (not available on some platforms) to the source files.
(Also, it seems that sometimes these data file directories are not deleted. For example, in JupyterLab we actually create files at runtime in the data directory, and I think they don't get deleted when JupyterLab is uninstalled)
Proposed solution
Python has another mechanism that is explicitly designed for plugin systems called entry points. An entry point is a piece of metadata in a package that points to an arbitrary import from the package. This PR changes jupyter_core to look for two specific entry points in any installed package, each pointing to a list of paths, to augment the environment-level Jupyter config directories (the jupyter_config_paths entry point) and data directories (the jupyter_data_paths entry point). The result is:
site-packages directory, and can use the entry point to point Jupyter to that internal directory. Since this directory is internal to the package:
jupyter --paths --json
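As a rough illustration of the lookup described in this section, the sketch below collects paths from every entry point in a group, where each entry point loads to a list of paths. It is not the PR's implementation: the helper names are hypothetical, only the group names `jupyter_config_paths` and `jupyter_data_paths` come from the description, and the `hasattr` check is there because `importlib.metadata.entry_points()` returns different types across Python versions.

```python
import os
from importlib import metadata


def collect_paths(entry_points):
    """Load each entry point and flatten the resulting path lists."""
    paths = []
    for ep in entry_points:
        try:
            value = ep.load()  # imports the target module
        except Exception:
            continue  # a broken plugin should not break path discovery
        if isinstance(value, (str, os.PathLike)):
            value = [value]
        paths.extend(os.fspath(p) for p in value)
    return paths


def entry_point_paths(group):
    """Return all paths advertised by installed packages for `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):  # Python 3.10+
        selected = eps.select(group=group)
    else:  # Python 3.8 and 3.9 return a plain dict of groups
        selected = eps.get(group, [])
    return collect_paths(selected)
```

Because `ep.load()` imports the package, this also shows where the import cost discussed in the thread comes from.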
Problems with the proposed solution
attr handler for setup.cfg values).
entry_point group is cached
pip install or conda install would be able to update the search path, provided the application isn't doing its own caching...
data_files
python_packages entry for these static assets, to avoid bringing in otherwise-unused runtime dependencies, e.g. pandas
log=None argument to the various calls
JUPYTER_CORE_LOGLEVEL
entry_point is added (or its target is changed) in a package with an editable install, it must be reinstalled
entry_point is changed, no re-install is required
jupyter_*paths()
jupyter_core itself: if one of the example packages is installed, the tests break
JUPYTER_PREFER_ENV is set
Alternative solutions
setuptools also provides a way for a package to have custom metadata files in the egg or dist_info directories. This avoids the problems of importing or parsing an arbitrary python file to get the few strings that we need. However, it appears that this arbitrary metadata is not well supported outside of setuptools. See below for some experiments around this approach.
Example
See the setuptools example, specifically jupyter_core/examples/jupyter_path_entrypoint_setuptools/setup.cfg (lines 35 to 39 in 38e3acd)
MANIFEST.in and a setup.py in order to be installed from source
and the flit example, specifically jupyter_core/examples/jupyter_path_entrypoint_flit/pyproject.toml (lines 11 to 15 in 38e3acd)
for examples of how to use these entry points.
pyproject.toml is the only boilerplate file needed, and generates a setup.py
flit can also generate binary reproducible whl files (for python >=3.7) given the same version of flit_core
Original issue description
Hey folks! Thanks for keeping this foundational technology working.
data_files are making me sad enough that I'm willing to bring this up again.
This is a low-downstream-impact way we could allow python packages to not require the ill-supported data_files technique.
To test:
I don't know if it really works yet, down to the n-th downstream, but seems it should if they are relying on jupyter_*_dir, and handling multiple paths already.