ENH: native support for universal_pathlib (upath) IO #60618

Open
1 of 3 tasks
zkurtz opened this issue Dec 29, 2024 · 4 comments
Labels
Enhancement, IO Data, Needs Discussion

Comments


zkurtz commented Dec 29, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

universal_pathlib makes it considerably easier to read and write data frames directly against cloud paths such as "s3://test_bucket/example.txt" by absorbing authentication and cloud-specific implementation details into the construction of the path itself. IO methods can then work as close to normally as possible, regardless of the kind of path being used (local vs. GCS vs. S3, etc.).

So, ideally, this would just work:

import pandas as pd
from upath import UPath

path = UPath("s3://test_bucket/example.txt")
df = pd.DataFrame({"A": [1, 2, 3]})  # stand-in for any data frame
df.to_parquet(path)

But it does not quite work. However, this thin wrapper does seem to work, simply by detecting whether the input path is a UPath and, if so, passing its storage options along to the pandas IO calls.

Proposal: Extend the allowable types of paths in pandas dataframe IO methods to include UPath, and automatically detect storage options in that case.
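A minimal sketch of what such a thin wrapper might look like. It is not the linked wrapper and not pandas API: the names `path_and_storage_options` and `to_parquet_any` are hypothetical, and the detection is duck-typed on a `storage_options` attribute (which UPath exposes) so the sketch does not need to import upath itself.

```python
import os
from typing import Any


def path_and_storage_options(path: Any) -> tuple[str, dict[str, Any]]:
    """Split a path-like object into (url, storage_options).

    Hypothetical helper: anything exposing a ``storage_options`` mapping
    (as UPath does) is treated as an fsspec-style path; other path-likes
    are passed through with empty options.
    """
    options = getattr(path, "storage_options", None)
    if options is None:
        # Plain str / os.PathLike input, e.g. a local pathlib.Path.
        return os.fspath(path), {}
    return str(path), dict(options)


def to_parquet_any(df: Any, path: Any, **kwargs: Any) -> None:
    """Hypothetical wrapper around ``DataFrame.to_parquet``."""
    url, options = path_and_storage_options(path)
    # Forward storage options only when there are any, so plain local
    # paths keep the default pandas behaviour.
    if options:
        kwargs.setdefault("storage_options", options)
    df.to_parquet(url, **kwargs)
```

With this in place, `to_parquet_any(df, UPath("s3://test_bucket/example.txt"))` would forward whatever options the path carries, while `to_parquet_any(df, "local.parquet")` would behave like a plain `df.to_parquet` call.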

Feature Description

Nothing to add ...

Alternative Solutions

Nothing to add ...

Additional Context

No response

zkurtz added the Enhancement and Needs Triage labels Dec 29, 2024
WillAyd (Member) commented Jan 2, 2025

pandas already supports fsspec - what does upath offer that isn't covered by that?

zkurtz (Author) commented Jan 2, 2025

Possibly only what upath offers intrinsically, which is the ability to work with cloud paths the same as pathlib.Path paths. So it would be a convenience, allowing users to work with fsspec file systems without worrying about fsspec syntax. Indeed the first line of the upath readme says that it "extends the pathlib.Path API to support a variety of backend filesystems via filesystem_spec." I could be missing something though, so I'll invite a couple of upath contributors to chime in here as well.


ap-- commented Jan 3, 2025

Thanks @zkurtz for notifying me in the repository. And hello @WillAyd, I'm the current universal-pathlib maintainer.

> what does upath offer that isn't covered by that

There are two reasons for using universal-pathlib instead of fsspec directly, and both are basically convenience features.

  1. A UPath instance combines the path string (fsspec URI) and the storage_options mapping in one container.
  2. universal-pathlib users usually prefer the pathlib interface for constructing paths (URIs).

The current feature request basically asks for simplifying the following pattern:

from upath import UPath
import pandas as pd

pth = UPath("s3://bucket/file.csv", some_option=True, other_option=123)

pd.DataFrame({"A": [1, 2, 3]}).to_csv(pth, storage_options=pth.storage_options)
pd.read_csv(pth, storage_options=pth.storage_options)

Currently, I would recommend against directly depending on universal-pathlib and against adding direct support for UPath instances via checking for the UPath interface.

For historical reasons, all UPath subclasses incorrectly pretend to be a local Path (they implement __fspath__, which makes them os.PathLike). In a future Python version, a pathlib.PathBase class for concrete (virtual) paths should become available, and it will be the correct base class to test for in pandas.io.common._get_filepath_or_buffer. Once UPath inherits from pathlib.PathBase (or a backport of PathBase), the logic in _get_filepath_or_buffer can be tweaked so that when filepath_or_buffer is detected as a pathlib.PathBase subclass, a buffer instance is created by calling filepath_or_buffer.open(mode=mode, encoding=encoding).
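The dispatch described above could be sketched roughly as follows. This is an illustration only, not pandas' actual `_get_filepath_or_buffer`: since `pathlib.PathBase` is not assumed to be importable here, a hypothetical `OpenablePath` protocol stands in for that check, and `open_path_or_buffer` is a made-up name.

```python
import io
from typing import IO, Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class OpenablePath(Protocol):
    """Hypothetical stand-in for a future ``pathlib.PathBase`` check."""

    def open(self, mode: str = "r", encoding: Optional[str] = None) -> IO[Any]: ...


def open_path_or_buffer(
    obj: Any, mode: str = "r", encoding: Optional[str] = None
) -> IO[Any]:
    """Return a buffer for ``obj`` (a sketch, not pandas' implementation)."""
    if isinstance(obj, io.IOBase):
        # Already a buffer: pass it through unchanged.
        return obj
    if isinstance(obj, OpenablePath):
        # Let the path object open itself, so it can apply its own
        # filesystem and credentials instead of pandas resolving them.
        return obj.open(mode=mode, encoding=encoding)
    raise TypeError(f"unsupported path or buffer type: {type(obj).__name__}")
```

Because the path opens itself, the caller never needs to see the storage options at all. A plain string deliberately raises here, since this sketch covers only the path-object branch and not the existing handling of URL strings.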

This refactor in universal-pathlib is tracked in fsspec/universal_pathlib#193, and work on pathlib.PathBase is ongoing in CPython.

As soon as the refactor in universal-pathlib is completed, I would be happy to contribute a PR to have a concrete example for discussion.

Cheers,
Andreas

WillAyd added the IO Data label and removed the Needs Triage label Jan 3, 2025
mroeschke added the Needs Discussion label Jan 3, 2025
zkurtz (Author) commented Jan 10, 2025

Contrary to my original post, I'm observing empirically that the naive approach

import pandas as pd
from upath import UPath

path = UPath("s3://test_bucket/example.txt")
df = pd.DataFrame({"A": [1, 2, 3]})  # stand-in for any data frame
df.to_parquet(path)

actually does work for both S3 and GCS. I encountered an issue only with Azure. Maybe the issue has more to do with the specific type of authentication being used.
