[Python] to_pandas() fails when pandas option 'future.infer_string' is True #45296

Open
stephen-a-stc opened this issue Jan 17, 2025 · 6 comments

Comments

@stephen-a-stc

stephen-a-stc commented Jan 17, 2025

Describe the bug, including details regarding any error messages, version, and platform.

Summary

Using the pyarrow table method to_pandas() results in an exception if pandas.set_option('future.infer_string', True) has been set. This seems related to handling of string data.

Environment

Environments: Windows 11, and Linux (docker image "python")
Python versions tested: 3.10, 3.11, 3.12

Python packages: pyarrow==19.0.0, pandas==2.2.3

Example

import pandas
import pyarrow

print("Create pyarrow table")
pat = pyarrow.Table.from_pydict({"foo": ["bar", "baz"]})
print(pat)

print("convert to pandas")
df1 = pat.to_pandas()
print(df1)

print("Set 'future.infer_string' to True")
pandas.set_option('future.infer_string', True)

print("exception during convert to pandas")
df2 = pat.to_pandas()
print(df2)

Example's output

Create pyarrow table
pyarrow.Table
foo: string
----
foo: [["bar","baz"]]
convert to pandas
   foo
0  bar
1  baz
Set 'future.infer_string' to True
exception during convert to pandas
Traceback (most recent call last):
  File "C:\temp\foo.py", line 16, in <module>
    df2 = pat.to_pandas()
  File "pyarrow\\array.pxi", line 889, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow\\table.pxi", line 5132, in pyarrow.lib.Table._to_pandas
  File "C:\penv\brd_test\lib\site-packages\pyarrow\pandas_compat.py", line 800, in table_to_dataframe
    ext_columns_dtypes = _get_extension_dtypes(
  File "C:\penv\brd_test\lib\site-packages\pyarrow\pandas_compat.py", line 925, in _get_extension_dtypes
    ext_columns[field.name] = _pandas_api.pd.StringDtype(na_value=np.nan)
TypeError: StringDtype.__init__() got an unexpected keyword argument 'na_value'

Component(s)

Python

@raulcd
Member

raulcd commented Jan 17, 2025

This is related to:

It seems the na_value keyword of StringDtype was merged on pandas main but is not available on pandas 2.2 (which already has the future.infer_string option).

@jorisvandenbossche @WillAyd do we have to check this is specifically for Pandas 3 or should we specialize the call based on pandas 2 vs 3?

# for pandas 3.0+, use pandas' new default string dtype
if _pandas_api.uses_string_dtype() and not strings_to_categorical:
    for field in table.schema:
        if field.name not in ext_columns and (
            pa.types.is_string(field.type)
            or pa.types.is_large_string(field.type)
            or pa.types.is_string_view(field.type)
        ) and field.name not in categories:
            ext_columns[field.name] = _pandas_api.pd.StringDtype(na_value=np.nan)

@WillAyd
Contributor

WillAyd commented Jan 17, 2025

Yea this is not expected to work with that option in pandas 2.2. The 2.3 release should resolve it

@WillAyd
Contributor

WillAyd commented Jan 17, 2025

In the meantime, @stephen-a-stc, you can either downgrade Arrow to anything less than version 19 or use a development version of pandas with Arrow 19+, and I think you will get the result you are after.
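
To decide whether an environment needs one of these workarounds, the conditions reported in this thread can be put into a small check. This is only a sketch: `hits_na_value_error` and `_major_minor` are hypothetical helper names, not part of pyarrow or pandas, and the version cutoffs (pyarrow 19+, pandas older than 2.3) are taken from the comments here rather than from release notes, so a later pyarrow patch release may change them.

```python
import re


def _major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings like '19.0.0' or '3.0.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def hits_na_value_error(
    pyarrow_version: str, pandas_version: str, infer_string: bool
) -> bool:
    """Return True for the combination reported as failing in this issue.

    Assumption (from the thread only): the failure needs all three of
    pyarrow >= 19, pandas < 2.3, and future.infer_string set to True.
    """
    pyarrow_major, _ = _major_minor(pyarrow_version)
    pandas_mm = _major_minor(pandas_version)
    return bool(infer_string and pyarrow_major >= 19 and pandas_mm < (2, 3))
```

The installed versions can be read with `importlib.metadata.version("pyarrow")` and `importlib.metadata.version("pandas")` from the standard library, and the option value with `pandas.get_option("future.infer_string")`.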

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 17, 2025

Ah, yes, the current code assumes that the user only sets the 'future.infer_string' option when running pandas 2.3+. I should have included that in the check at:

def uses_string_dtype(self):
    if self.is_ge_v3_strict():
        return True
    try:
        if self.pd.options.future.infer_string:
            return True

(checking for that option and pd.__version__ >= 2.3.0)
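
As a rough illustration of that combined check, here is a standalone sketch. It is not the pyarrow implementation: the function takes the pandas version and the option value as plain arguments (an assumption made so it runs without pandas installed), `parse_major_minor` is a hypothetical helper, and treating a `3.0.0.dev0` version as "pandas 3" is likewise an assumption about how development builds should be handled.

```python
import re


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings like '2.2.3' or '3.0.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def uses_string_dtype(pandas_version: str, infer_string: bool) -> bool:
    """Hypothetical standalone variant of the check quoted above."""
    major_minor = parse_major_minor(pandas_version)
    # pandas 3.0+ uses the new string dtype by default.
    if major_minor >= (3, 0):
        return True
    # On pandas 2.x, honor the option only from 2.3 on, so that
    # pandas 2.2 with infer_string=True falls back to the old path.
    return bool(infer_string and major_minor >= (2, 3))
```

With this gating, pandas 2.2.3 with the option enabled would take the pre-19 conversion path instead of calling `StringDtype(na_value=np.nan)`.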

@stephen-a-stc
Author

Okay, thanks for the information.
I can confirm that this wasn't an issue with earlier versions of pyarrow, for instance it works fine with pyarrow==18.1.0.

@raulcd
Member

raulcd commented Jan 17, 2025

> Yea this is not expected to work with that option in pandas 2.2. The 2.3 release should resolve it

Ah, ok, I wasn't sure this was going to be added to pandas 2.3. @jorisvandenbossche's solution makes sense to me for future cases of people using pandas 2.2 and setting infer_string, but I don't think we are currently planning a patch release for Arrow. So, as @WillAyd suggests, using pyarrow < 19 until pandas 2.3 is released, or using dev pandas, is the only solution.

I'll add the backport-candidate label in case we do a patch release for pyarrow 19, so we can include this.
