Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-10: Add pyarrow as a required dependency #52711
PDEP-10: Add pyarrow as a required dependency #52711
Changes from 5 commits
89a3a3b
cf88b43
dafa709
5e1fbd1
44a3321
ea9f5e3
fbd1aa0
6d667b4
bed5f0b
12622bb
864b8d1
2d4f4fd
bb332ca
a8275fa
1148007
b406dc1
ecc4d5b
ec1c0e3
23eb251
dd7c62a
2ddd82a
3c54d22
1b60fbb
70cdf74
14602a6
2cfb92f
e0e406c
f047032
ed28c04
99de932
99fd739
9384bc7
c3beeb3
8347e83
d740403
959873e
f936280
2db0037
c2b8cfe
4e05151
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this version be consistent across the entire pandas API?
e.g. If I wanted to bump the pyarrow version for just the CSV parser to something higher, would I be able to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The minimum version would be consistent across the library, but IMO that shouldn't stop development of features that exist in newer versions of pyarrow (we already do this with version checking or try/except)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. This might be too aggressive and might also make it hard to predict what the minimum version will be.
I'd recommend following what we do for numpy, which is according to NEP 29, support
"all minor versions of NumPy released in the prior 24 months from the anticipated release date with a minimum of 3 minor versions of NumPy", for arrow as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the challenge with offering a similar support window for the two libraries is that NumPy has a very stable ABI whereas PyArrow does not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I'm missing something, but sounds like what @lithomas1 is proposing is pretty much the same as what's written in the proposal but phrased in a different way. Has the proposal been updated, or am I misunderstanding that supporting the releases of the last 24 months, and supporting the highest/oldest version two years old?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read this as every major release we will bump the min required version of pyarrow to the latest version, but might be misreading here.
Note that my proposed change would be different in that we would drop Arrow versions in both major/minor versions (as opposed to every major version), just like we do with numpy (once we reach the end of the NEP support window).
I might have missed some more discussion on this, but I thought we were going to restrict current usage of pyarrow to just what's exposed through Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To the latest version
that has been released for at least 2 years
. So, the minimum PyArrow version we support will be around 24 months old, and we should be supporting all the versions since that one, so more or less the same policy as NumPy. @mroeschke not sure if it's easy to rephrase in a way that it's more obvious what's the policy.About bumping in major or minor releases, I don't have a preference, either is fine for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can rephrase this to make it more clear but @datapythonista has it correct. The only distinction here, compared to what we do with numpy today, is that pyarrow would be bumped only during a pandas major release.
Under this proposal, PyArrow will only be used as a runtime dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you consider this to upgrading pyarrow in both major and minor versions to be consistent with numpy?
I ask because it is probably tricky for downstream to predict the length of our major release cycle (for 2.0 I think we delayed it twice. IIRC 1.4 was supposed to be 2.0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure that would be okay with me too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using the major.minor.patch terminology, major could be 2-3 years (ignoring for now the proposal by some to make this more frequent) and minor is 6-9 months.
It is not clear here, is the minimum supported version kept for all minor releases in this proposal?
Near the tail end of the major release cycle, the minimum supported version of pyarrow could be 5 years old?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure this comment by Will has been addressed (unless I missed it?)
to make it easier to find: the link is here, and says:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be too optimistic, but having pyarrow as a required dependency has the potential to make the c/cython-code for read_csv and read_json obsolete (if they are on par and similarly fast).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would be a compile time dependency which we are not contemplating at the current time; possibly could propose in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to more explicitly say that you need to do this (installing Arrow C++) manually and this is not possible through
pip install pyarrow
(there are other python packages that also have C/C++ code but that do that build automatically (if you have the dependencies such as a compiler) when installing from source)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for providing details here - is this (
pip install pyarrow
) much of a hurdle for these cases where pandas wheels aren't available?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, sorry there was a "not" missing in "not possible through
pip install pyarrow
. Corrected now.But on the actual question how much of a hurdle this is: I would say, try it out yourself :) That's the best way to get an idea of how difficult it is, otherwise you can only take (or not) my words in saying that: yes, this is a huge hurdle. Installing pyarrow and Arrow C++ from source is far from trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks - from the installation instructions, it certainly seems tricky https://arrow.apache.org/docs/developers/cpp/building.html#