Add writing functionality for dataframes #45
base: develop
Conversation
@vnmabus Thank you for the review. I have pushed the changes and left some comments/questions for discussion.
rdata/conversion/_conversion.py
@@ -820,6 +872,9 @@ def _convert_next(  # noqa: C901, PLR0912, PLR0915
         value = None
+
+    elif obj.info.type == parser.RObjectType.ALTREP:
+        value = convert_altrep_to_range(obj)
This relates to the support for writing dataframes with range indices like `pd.DataFrame(..., index=range(2, 5))`.
The reason that this change is in this PR is that it allows testing the R-to-Python-to-R roundtrip for such a dataframe (in R, `data.frame(..., row.names=2:4)`; this is `test_dataframe_range_rownames.rds`). Here the row names are a compact_intseq altrep, which would by default get converted to a regular array and would then fail the roundtrip.
But, other than enabling that test, this change doesn't really belong in this PR, so it could be removed if you think that's better.
In general, I feel it would be nicer if altreps were expanded in the conversion stage instead of the parsing stage. Then the RData object would be a one-to-one representation of the file contents, which is not the case for altreps at the moment (unless `expand_altrep=False`). I have interpreted the compact_intseq altrep as implementing the same idea as Python's range object, so it would be useful if users could choose whether this altrep is converted to a numpy array or a range object, which does not seem possible (or easy) if altreps are already expanded during parsing. That would be a larger change, though (beyond this PR). What do you think in general?
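For illustration, a minimal sketch of the roundtrip this enables (the file name and the exact `write_rds`/`read_rds` call shapes here are assumptions, not taken from the test suite):

import pandas as pd
import rdata

# R's data.frame(..., row.names=2:4) stores the row names as a compact_intseq
# altrep; in pandas this corresponds to a range index.
df = pd.DataFrame({"a": [1, 2, 3]}, index=range(2, 5))

rdata.write_rds("df_range_index.rds", df)
df2 = rdata.read_rds("df_range_index.rds")

# With the altrep kept as a range during conversion, the index can survive the
# roundtrip instead of coming back as a plain integer array.
print(df2.index)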
@@ -430,7 +482,7 @@ def dataframe_constructor(
             and isinstance(row_names, np.ma.MaskedArray)
             and row_names.mask[0]
         )
-        else tuple(row_names)
+        else row_names
The reason was to keep a range object as a range object instead of expanding it into a tuple of values (this relates to the altrep comment above).
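For context, a small illustration of the difference (plain pandas behavior, not code from this PR):

import pandas as pd

# A range is kept as a lazy RangeIndex...
print(pd.DataFrame({"a": [1, 2, 3]}, index=range(2, 5)).index)
# RangeIndex(start=2, stop=5, step=1)

# ...while a tuple of the same values becomes a materialized integer index.
print(pd.DataFrame({"a": [1, 2, 3]}, index=tuple(range(2, 5))).index)
# Index([2, 3, 4], dtype='int64') (exact repr depends on the pandas version)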
    Returns:
        Numpy array.
    """
    if isinstance(pd_array, pd.arrays.StringArray):
That could be done. I think it would clarify many things, but it would require changes on the parsing side too.
The main issue here could be string arrays, where NA values are `None` at the moment: should parsing of string arrays return a possibly masked array instead? Also in writing, if the user has `strings = ["hello", None]`, this can currently be converted to a valid "string" array with `array = np.array(strings)` (which is of object dtype, requiring special treatment of object arrays). With a masked array the user would instead need to do e.g. `array = np.ma.array(data=["" if s is None else s for s in strings], mask=[s is None for s in strings])`, which might be more cumbersome to work with. A benefit, though, would be that such an array is of unicode dtype, which would simplify the numpy array handling.
Do you think it would be useful to start using masked arrays in all these cases? (I feel it could fit better in a different PR, though, because of the changes in parsing.)
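To make the trade-off concrete, here is a small sketch (plain numpy, mirroring the two options above):

import numpy as np

strings = ["hello", None]

# Object-dtype array: None stays as-is, so later handling must special-case object arrays.
obj_arr = np.array(strings)
print(obj_arr.dtype)  # object

# Masked-array alternative: unicode dtype, with missingness tracked in the mask.
masked_arr = np.ma.array(
    data=["" if s is None else s for s in strings],
    mask=[s is None for s in strings],
)
print(masked_arr.dtype)  # <U5
print(masked_arr)        # ['hello' --]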
        if self.df_attr_order is not None:
            attributes = {k: attributes[k] for k in self.df_attr_order}
        else:
Good idea! I restructured these conversions into user-definable constructor functions, similarly to the R-to-Python conversion. I also separated some other conversion logic into functions.
For `pd.Categorical` this was simple, but `pd.DataFrame` is a bit more complex, as it may contain `pd.Categorical` or other more complex objects. For that reason, the constructor functions also take a parameter `convert_to_r_object` that points back to `ConverterFromPythonToR.convert_to_r_object` (which in turn can keep track of references etc.).
Do you have a suggestion for a naming convention? Currently there is a `dataframe_constructor` for both the R-to-Python and Python-to-R conversions.
rdata/conversion/_conversion.py
@@ -820,6 +872,9 @@ def _convert_next(  # noqa: C901, PLR0912, PLR0915
         value = None
+
+    elif obj.info.type == parser.RObjectType.ALTREP:
+        value = convert_altrep_to_range(obj)
Do you think this is ok?
In principle it is ok to write altreps when possible by default. I am not sure if we should add an option for the `Converter` not to use altreps, for compatibility with tools that do not understand them.
@@ -430,7 +482,7 @@ def dataframe_constructor(
             and isinstance(row_names, np.ma.MaskedArray)
             and row_names.mask[0]
         )
-        else tuple(row_names)
+        else row_names
I would say that such a mapping is an error. However, I do not have anything against this particular change, unless the `tuple` call was necessary for some reason.
    Returns:
        Numpy array.
    """
    if isinstance(pd_array, pd.arrays.StringArray):
I think that it makes sense to do it for float arrays (in a different PR).
rdata/_write.py
@@ -40,6 +42,7 @@ def write_rds(
         compression: Compression.
         encoding: Encoding to be used for strings within data.
         format_version: File format version.
+        constructor_dict: Dictionary mapping Python types to R classes.
That is not really true, right? It maps Python classes to functions to convert them to R classes (which is more powerful, as it can choose a different R class depending on the attributes of the object).
True; changed to "Dictionary mapping Python classes to functions converting them to R classes." here and in other locations.
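For reference, a rough sketch of the intended shape of such a dictionary (`categorical_constructor` is a hypothetical name here; only `dataframe_constructor` and `rangeindex_constructor` are mentioned in this thread):

# Maps a Python class to a function converting instances of it to an R object.
constructor_dict = {
    pd.DataFrame: dataframe_constructor,
    pd.Categorical: categorical_constructor,  # hypothetical name
    pd.RangeIndex: rangeindex_constructor,
}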
        if self.df_attr_order is not None:
            attributes = {k: attributes[k] for k in self.df_attr_order}
        else:
I would add that as mandatory (at least for writing, as there is no existing code that requires backwards compatibility).
This reverts commit 8a269ae.
@vnmabus Sorry for the long delay with this PR - my past months were pretty busy. I have pushed fixes based on your suggestions.
rdata/conversion/__init__.py
    build_r_data as build_r_data,
    convert_to_r_object as convert_to_r_object,
    convert_to_r_object_for_rda as convert_to_r_object_for_rda,
    DEFAULT_CONSTRUCTOR_DICT as DEFAULT_CONSTRUCTOR_DICT,
I agree it is clearer to import this directly from `to_r`; removed it from here. Regarding symmetric naming, do you mean we should rename it to `DEFAULT_CLASS_MAP` (the same as in reading, so the user imports the correct name from `to_r` / `to_python`), or something more explicit (but longer) like `DEFAULT_PYTHON_TO_R_CLASS_MAP`?
    Returns:
        Numpy array.
    """
    if isinstance(pd_array, pd.arrays.StringArray):
Sounds good. Maybe an issue could be opened about it as a reminder? (This relates to the comment about R_FLOAT_NA, np.nan, and other NaNs too.)
    elif np.issubdtype(source.dtype, np.floating):
        # We return the numpy array here, which keeps
        # R_FLOAT_NA, np.nan, and other NaNs as they were originally in the file.
        # Users can then decide if they prefer to interpret
Here is a demo of changing how R's NA vs NaN values are interpreted on the pandas side.
It is for a file created with `Rscript -e 'df = data.frame(a=c(1., NA, NaN)); saveRDS(df, file="na_nan.rds")'`.
import numpy as np
import pandas as pd
import rdata
from rdata.missing import is_na
df = rdata.read_rds('na_nan.rds')
print('Dataframe df:')
print(df)
print()
print('df.isna():')
print(df.isna())
print()
print('np.isnan(df):')
print(np.isnan(df))
print()
print("Update NA mask of column")
a = df["a"].to_numpy()
df["a"] = pd.arrays.FloatingArray(a, is_na(a))
print()
print('New dataframe df:')
print(df)
print()
print('df.isna():')
print(df.isna())
print()
print('np.isnan(df):')
print(np.isnan(df))
print()
Output:
Dataframe df:
a
1 1.0
2 NaN
3 NaN
df.isna():
a
1 False
2 True
3 True
np.isnan(df):
a
1 False
2 True
3 True
Update NA mask of column
New dataframe df:
a
1 1.0
2 <NA>
3 NaN
df.isna():
a
1 False
2 True
3 False
np.isnan(df):
a
1 False
2 <NA>
3 True
It is a good question, though, what the default behavior should be in this regard. The current parsing behavior is the same as before this PR.
@@ -430,7 +482,7 @@ def dataframe_constructor(
             and isinstance(row_names, np.ma.MaskedArray)
             and row_names.mask[0]
         )
-        else tuple(row_names)
+        else row_names
The tests and mypy pass, so I kept this change.
rdata/conversion/_conversion.py
@@ -820,6 +872,9 @@ def _convert_next(  # noqa: C901, PLR0912, PLR0915
         value = None
+
+    elif obj.info.type == parser.RObjectType.ALTREP:
+        value = convert_altrep_to_range(obj)
Now an altrep is created only for suitable `pd.RangeIndex` objects. Users could avoid altreps here altogether by overriding the default constructor with:

def rangeindex_constructor(data: pd.RangeIndex, converter: Converter) -> RObject:
    return build_r_object(RObjectType.INT, value=np.array(data))
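For example (a sketch, not code from this PR: the exact import location of the default constructor dictionary and the `write_rds` call shape are assumptions), the override could then be passed via the `constructor_dict` argument:

import numpy as np
import pandas as pd
import rdata
from rdata.conversion.to_r import DEFAULT_CONSTRUCTOR_DICT  # import path assumed

constructor_dict = {
    **DEFAULT_CONSTRUCTOR_DICT,
    pd.RangeIndex: rangeindex_constructor,  # expand ranges to plain integer vectors
}

df = pd.DataFrame({"a": [1, 2, 3]}, index=range(2, 5))
rdata.write_rds("df.rds", df, constructor_dict=constructor_dict)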
        if self.df_attr_order is not None:
            attributes = {k: attributes[k] for k in self.df_attr_order}
        else:
Done.
References to issues or other PRs
Closes #20.
Describe the proposed changes
This PR will add support for writing pandas dataframes. This turned out to be a quite large change to the current Python-to-R conversion functionality. Overview of changes:
- `rdata.conversion.to_r` reorganized; a class `ConverterFromPythonToR` added to simplify keeping track of references in RData files.
- New module `rdata.missing`.
- Function `convert_altrep_to_range()` (`rdata.conversion._conversion`) that enables converting a compact intseq to a range object (e.g. as a dataframe index).
Additional information
The functions in `rdata.missing` could be useful for users to handle NA values in the desired way: e.g., `pd.arrays.FloatingArray(array)` would set all NaNs (including R's NA value) as "missing", but `pd.arrays.FloatingArray(array, is_na(array))` can be used to set only R's NA values as "missing".
Checklist before requesting a review