PDEP-10: Add pyarrow as a required dependency #52711

Merged: 40 commits, Jul 30, 2023
Changes from 5 commits
89a3a3b
Start pdep 10
mroeschke Apr 14, 2023
cf88b43
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke Apr 17, 2023
dafa709
finish drawbacks, fix other sections
mroeschke Apr 17, 2023
5e1fbd1
Add number
mroeschke Apr 17, 2023
44a3321
our current version is 7 not 6
mroeschke Apr 17, 2023
ea9f5e3
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke Apr 18, 2023
fbd1aa0
Clarify and fix typo
mroeschke Apr 18, 2023
6d667b4
Update web/pandas/pdeps/0010-required-pyarrow-dependency.md
phofl Apr 21, 2023
bed5f0b
Update web/pandas/pdeps/0010-required-pyarrow-dependency.md
phofl Apr 21, 2023
12622bb
Update web/pandas/pdeps/0010-required-pyarrow-dependency.md
phofl Apr 21, 2023
864b8d1
Add string as a preferential pyarrow type
mroeschke Apr 21, 2023
2d4f4fd
Add metric about number of pyarrow import checks
mroeschke Apr 21, 2023
bb332ca
Clarify with actual call
mroeschke Apr 21, 2023
a8275fa
Clarify with actual call
mroeschke Apr 21, 2023
1148007
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke Apr 28, 2023
b406dc1
Address some comments
mroeschke Apr 28, 2023
ecc4d5b
Update 0010-required-pyarrow-dependency.md
phofl Apr 28, 2023
ec1c0e3
Update 0010-required-pyarrow-dependency.md
phofl Apr 28, 2023
23eb251
add Patrick as an author, remove constraint on only bumping during ma…
mroeschke Apr 28, 2023
dd7c62a
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke May 9, 2023
2ddd82a
Change required proposal for 3.0 to be version requiring pyarrow & st…
mroeschke May 9, 2023
3c54d22
Merge remote-tracking branch 'upstream/main' into pdep/pyarrow
mroeschke May 9, 2023
1b60fbb
Address typos
mroeschke May 9, 2023
70cdf74
Merge branch 'main' into pdep/pyarrow
mroeschke May 24, 2023
14602a6
Merge branch 'main' into pdep/pyarrow
mroeschke Jun 1, 2023
2cfb92f
Merge branch 'main' into pdep/pyarrow
mroeschke Jun 9, 2023
e0e406c
Merge branch 'main' into pdep/pyarrow
mroeschke Jun 20, 2023
f047032
Update 0010-required-pyarrow-dependency.md
phofl Jul 2, 2023
ed28c04
Update web/pandas/pdeps/0010-required-pyarrow-dependency.md
phofl Jul 3, 2023
99de932
Update 0010-required-pyarrow-dependency.md
phofl Jul 4, 2023
99fd739
Update 0010-required-pyarrow-dependency.md
phofl Jul 4, 2023
9384bc7
Update 0010-required-pyarrow-dependency.md
phofl Jul 4, 2023
c3beeb3
Update 0010-required-pyarrow-dependency.md
phofl Jul 4, 2023
8347e83
improve structure, list user benefits more clearly, add faq
MarcoGorelli Jul 5, 2023
d740403
restore little demo
MarcoGorelli Jul 5, 2023
959873e
remove masked part, note that pyarrow dtyeps will likely be ready by 3
MarcoGorelli Jul 5, 2023
f936280
Merge pull request #26 from MarcoGorelli/pdep10-amendments
mroeschke Jul 6, 2023
2db0037
Update 0010-required-pyarrow-dependency.md
phofl Jul 13, 2023
c2b8cfe
Merge branch 'main' into pdep/pyarrow
mroeschke Jul 25, 2023
4e05151
Update 0010-required-pyarrow-dependency.md
phofl Jul 30, 2023
66 changes: 66 additions & 0 deletions web/pandas/pdeps/0010-required-pyarrow-dependency.md
@@ -0,0 +1,66 @@
# PDEP-10: PyArrow as a required dependency

- Created: 17 April 2023
- Status: Under discussion
- Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711)
[#52509](https://github.com/pandas-dev/pandas/issues/52509)
- Author: [Matthew Roeschke](https://github.com/mroeschke)
- Revision: 1

## Abstract

This PDEP proposes that:

- PyArrow becomes a runtime dependency starting pandas 2.1
- The minimum version of PyArrow supported starting pandas 2.1 is version 7.
Member

Would this version be consistent across the entire pandas API?

e.g. If I wanted to bump the pyarrow version for just the CSV parser to something higher, would I be able to do it?

Member Author

The minimum version would be consistent across the library, but IMO that shouldn't stop development of features that exist in newer versions of pyarrow (we already do this with version checking or try/except)
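The version-checking approach mentioned in this reply can be sketched roughly as follows. This is a minimal illustration, not pandas' actual implementation: `parse_version`, `pyarrow_at_least`, and `pick_code_path` are hypothetical helpers, and the simplified parser only compares the leading numeric components (a library such as `packaging` handles pre-release tags more carefully).

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '7.0.0' or '12.0.1.dev3' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at the first non-numeric component (e.g. 'dev3')
        parts.append(int(piece))
    return tuple(parts)


def pyarrow_at_least(minimum: str) -> bool:
    """Return True if an installed pyarrow is at least `minimum`."""
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= parse_version(minimum)


def pick_code_path(feature_minimum: str) -> str:
    """Hypothetical dispatch: use a newer pyarrow feature only when available."""
    return "new-feature" if pyarrow_at_least(feature_minimum) else "fallback"
```

With a single library-wide minimum, only features that need something newer than that minimum would require a guard like this.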

- The minimum version of PyArrow will be bumped every major pandas release to the highest
Member

Hmm.. This might be too aggressive and might also make it hard to predict what the minimum version will be.

I'd recommend following what we do for numpy, which is according to NEP 29, support
"all minor versions of NumPy released in the prior 24 months from the anticipated release date with a minimum of 3 minor versions of NumPy", for arrow as well.

Member

I think the challenge with offering a similar support window for the two libraries is that NumPy has a very stable ABI whereas PyArrow does not

Member

Not sure if I'm missing something, but it sounds like what @lithomas1 is proposing is pretty much the same as what's written in the proposal, just phrased differently. Has the proposal been updated, or am I wrong in thinking that supporting the releases of the last 24 months is the same as supporting the highest version that is at least two years old?

Member

I read this as every major release we will bump the min required version of pyarrow to the latest version, but might be misreading here.

Note that my proposed change would be different in that we would drop Arrow versions in both major/minor versions (as opposed to every major version), just like we do with numpy (once we reach the end of the NEP support window).

> I think the challenge with offering a similar support window for the two libraries is that NumPy has a very stable ABI whereas PyArrow does not

I might have missed some more discussion on this, but I thought we were going to restrict current usage of pyarrow to just what's exposed through Python.

Member

> I read this as every major release we will bump the min required version of pyarrow to the latest version, but might be misreading here.

To the latest version that has been released for at least 2 years. So, the minimum PyArrow version we support will be around 24 months old, and we should be supporting all the versions since that one, so more or less the same policy as NumPy. @mroeschke not sure if it's easy to rephrase in a way that it's more obvious what's the policy.

About bumping in major or minor releases, I don't have a preference, either is fine for me.

Member Author

I can rephrase this to make it more clear but @datapythonista has it correct. The only distinction here, compared to what we do with numpy today, is that pyarrow would be bumped only during a pandas major release.

> I think the challenge with offering a similar support window for the two libraries is that NumPy has a very stable ABI whereas PyArrow does not

Under this proposal, PyArrow will only be used as a runtime dependency

Member

Can you consider changing this to upgrade pyarrow in both major and minor releases, to be consistent with numpy?

I ask because it is probably tricky for downstream to predict the length of our major release cycle (for 2.0 I think we delayed it twice. IIRC 1.4 was supposed to be 2.0).

Member Author

> Can you consider changing this to upgrade pyarrow in both major and minor releases, to be consistent with numpy?

Sure that would be okay with me too

PyArrow version that has been released for at least 2 years.
Member
@simonjayhawkins, Apr 18, 2023

using the major.minor.patch terminology, major could be 2-3 years (ignoring for now the proposal by some to make this more frequent) and minor is 6-9 months.

It is not clear here, is the minimum supported version kept for all minor releases in this proposal?

Near the tail end of the major release cycle, the minimum supported version of pyarrow could be 5 years old?

Member Author

> It is not clear here, is the minimum supported version kept for all minor releases in this proposal?

Correct
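The policy discussed in this thread (the minimum is the highest PyArrow version that has been out for at least two years at the time of a pandas release) can be sketched as below. The function names and the release dates are hypothetical and chosen only for illustration; they are not PyArrow's actual release dates.

```python
from datetime import date


def years_before(day: date, years: int) -> date:
    """Shift a date back by whole years, mapping Feb 29 to Feb 28 if needed."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def minimum_supported(releases: dict, pandas_release: date, years: int = 2) -> str:
    """Pick the highest version released at least `years` before `pandas_release`."""
    cutoff = years_before(pandas_release, years)
    eligible = [v for v, released in releases.items() if released <= cutoff]
    if not eligible:
        raise ValueError("no release is old enough")
    # Compare versions numerically rather than as strings.
    return max(eligible, key=lambda v: tuple(int(p) for p in v.split(".")))


# Hypothetical release dates, for illustration only.
example_releases = {
    "7.0.0": date(2022, 2, 1),
    "8.0.0": date(2022, 5, 1),
    "9.0.0": date(2022, 8, 1),
}
print(minimum_supported(example_releases, pandas_release=date(2024, 6, 1)))  # 8.0.0
```

Because the minimum is only recomputed at a release that bumps it, a long gap between such releases lets the supported minimum age well past two years, which is the concern raised above.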


## Background

PyArrow is an optional dependency of pandas that provides a wide range of supplemental features to pandas:

- Since pandas 0.21.0, PyArrow has provided I/O reading functionality for Parquet
- Since pandas 1.2.0, pandas has integrated PyArrow into the `ExtensionArray` interface to provide an optional string data type backed by PyArrow
- Since pandas 1.4.0, PyArrow has provided I/O reading functionality for CSV
- Since pandas 1.5.0, pandas has provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow data types within the `ExtensionArray` interface
- Since pandas 2.0.0, all I/O readers have the option to return PyArrow-backed data types, and many methods now utilize PyArrow compute functions to
accelerate PyArrow-backed data in pandas, notably string and datetime types.

As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy with advantages such as:

1. Consistent ``NA`` support for all data types
2. Broader support of data types such as ``decimal``, ``date`` and nested types

## Motivation

While all the functionality described in the previous paragraph is currently optional, PyArrow has significant integration into many areas
of pandas. With our roadmap noting that pandas strives for better Apache Arrow interoperability [^1] and many projects [^2], within or beyond the Python ecosystem, adopting or interacting with the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow
ecosystem to pandas users.

Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve functionality currently implemented with NumPy that would be better served
by PyArrow, including:

- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations
Member

Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?

Member
@MarcoGorelli, Jul 3, 2023

not sure this comment by Will has been addressed (unless I missed it?)

to make it easier to find: the link is here, and says:

> Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?

- Avoiding NumPy object data types by default for types that have native PyArrow support, such as decimal, binary, and nested types
Member

I might be too optimistic, but having pyarrow as a required dependency has the potential to make the c/cython-code for read_csv and read_json obsolete (if they are on par and similarly fast).

Contributor

that would be a compile time dependency which we are not contemplating at the current time; possibly could propose in the future
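As a rough illustration of the first bullet above, and of the review question about code samples, the sketch below contrasts the two styles. It is an assumption-laden toy rather than pandas code: `pyarrow_installed`, `choose_backend_optional`, and `choose_backend_required` are hypothetical names, and the "backend" is just a string.

```python
import importlib.util
from typing import Optional


def pyarrow_installed() -> bool:
    """Check whether pyarrow can be found, without importing it."""
    return importlib.util.find_spec("pyarrow") is not None


def choose_backend_optional(requested: Optional[str] = None) -> str:
    """Optional-dependency style: each call site guards on availability."""
    if requested == "pyarrow":
        if not pyarrow_installed():
            raise ImportError("pyarrow is required for the 'pyarrow' backend")
        return "pyarrow"
    return "numpy"


def choose_backend_required(requested: Optional[str] = None) -> str:
    """Required-dependency style: the availability guard disappears, but the
    choice between a pyarrow-backed and a numpy-backed result remains."""
    return "pyarrow" if requested == "pyarrow" else "numpy"
```

As the reviewer notes, a runtime decision about which representation to return would still exist; what a required dependency removes is the availability check and its error path.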


## Drawbacks

Including PyArrow would naturally increase the installation size of pandas. For example, when installing pandas and PyArrow using pip from wheels, numpy and pandas
are about `70MB`, and PyArrow is around `120MB`. An increase in installation size would have negative implications for using pandas in space-constrained development
or deployment environments such as AWS Lambda.
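To put the quoted figures in perspective, the sketch below adds them up and also shows one way a reader might measure installed sizes locally. `installed_size_mb` is a hypothetical helper built on the standard library's `importlib.metadata`; measured numbers will vary by platform, Python version, and package version.

```python
import os
from importlib import metadata
from typing import Optional


def installed_size_mb(dist_name: str) -> Optional[float]:
    """Sum the on-disk size of the files recorded for an installed distribution.

    Returns None if the distribution is not installed or has no file record.
    """
    try:
        files = metadata.files(dist_name)
    except metadata.PackageNotFoundError:
        return None
    if files is None:
        return None
    total_bytes = 0
    for entry in files:
        path = entry.locate()
        if os.path.isfile(path):
            total_bytes += os.path.getsize(path)
    return total_bytes / 1e6


# Approximate figures quoted in the text above, in MB.
numpy_and_pandas_mb = 70
pyarrow_mb = 120
combined_mb = numpy_and_pandas_mb + pyarrow_mb
growth = combined_mb / numpy_and_pandas_mb
print(f"combined: {combined_mb} MB ({growth:.1f}x the current footprint)")

# Measure whatever happens to be installed in the current environment.
for name in ("numpy", "pandas", "pyarrow"):
    size = installed_size_mb(name)
    print(name, "not installed" if size is None else f"{size:.0f} MB")
```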

Additionally, if a user is installing pandas in an environment where wheels are not available and needs to build from source, the user will need to build Arrow C++ and related dependencies. These environments include
Member
@jorisvandenbossche, Apr 28, 2023

I think it would be good to more explicitly say that you need to do this (installing Arrow C++) manually and this is not possible through pip install pyarrow
(there are other python packages that also have C/C++ code but that do that build automatically (if you have the dependencies such as a compiler) when installing from source)

Member

thanks for providing details here - is this (pip install pyarrow) much of a hurdle for these cases where pandas wheels aren't available?

Member

Whoops, sorry, there was a "not" missing in "not possible through pip install pyarrow". Corrected now.

But on the actual question how much of a hurdle this is: I would say, try it out yourself :) That's the best way to get an idea of how difficult it is, otherwise you can only take (or not) my words in saying that: yes, this is a huge hurdle. Installing pyarrow and Arrow C++ from source is far from trivial.

Member

thanks - from the installation instructions, it certainly seems tricky https://arrow.apache.org/docs/developers/cpp/building.html#


- Alpine Linux (commonly used as a base for Docker containers)
- WASM (Pyodide and PyScript)
- Python development versions

Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadence. For example, when supporting a newly released Python version, pandas will also need to be mindful of PyArrow's wheel support for that Python version before releasing a new pandas version.

### PDEP-10 History

- 17 April 2023: Initial version

[^1]: <https://pandas.pydata.org/docs/development/roadmap.html#apache-arrow-interoperability>
[^2]: <https://arrow.apache.org/powered_by/>