adding dshape parameter to CSV #476

Open
wants to merge 8 commits into
base: master
Changes from 5 commits
14 changes: 12 additions & 2 deletions odo/backends/csv.py
@@ -18,7 +18,7 @@
import datashape

from datashape import discover, Record, Option
-from datashape.predicates import isrecord
+from datashape.predicates import isrecord, isdimension
from datashape.dispatch import dispatch

from ..compatibility import unicode, PY2
@@ -140,18 +140,25 @@ class CSV(object):
If the csv file has a header or not
encoding : str (default utf-8)
File encoding
+dshape: datashape or string representation
+user-specified datashape
kwargs : other...
Various choices about dialect
"""
canonical_extension = 'csv'

def __init__(self, path, has_header=None, encoding='utf-8',
-sniff_nbytes=10000, **kwargs):
+sniff_nbytes=10000, dshape=None, **kwargs):
self.path = path
self._has_header = has_header
self.encoding = encoding or 'utf-8'
self._kwargs = kwargs
self._sniff_nbytes = sniff_nbytes
+if dshape:
+if isinstance(dshape, (str, unicode)):
+dshape = datashape.dshape(dshape)
+dshape = None if isdimension(dshape.subshape[0][0]) else dshape
Member

What is this logic for? Don't you want to test if isrecord(dshape) and raise an exception if False?

If an invalid dshape is passed in we should raise an exception, not silently swallow it...
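
A minimal, library-free sketch of the "raise, don't swallow" behavior requested above. It assumes a plain dict of column name to type name as a stand-in for a record datashape, and `validate_schema` is a hypothetical helper, not part of odo or datashape; in the PR the equivalent check would be the `isrecord` test the comment proposes.

```python
def validate_schema(schema):
    """Return ``schema`` if it looks like a record; raise otherwise.

    A plain dict of column name -> type name stands in for a record
    datashape here (an assumption made to keep the sketch library-free).
    """
    if schema is None:
        # Nothing was supplied, so the caller falls back to discovery.
        return None
    if not isinstance(schema, dict) or not schema:
        # Fail loudly instead of quietly replacing the bad value with None.
        raise TypeError(
            "expected a non-empty mapping of column names to types, "
            "got %r" % (schema,)
        )
    return schema
```

With this shape, a caller who passes something that is not a record gets an immediate `TypeError` at construction time, rather than a CSV object that silently rediscovers its own datashape.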

+self._dshape = dshape

def _sniff_dialect(self, path):
kwargs = self._kwargs
@@ -330,6 +337,9 @@ def _():

@discover.register(CSV)
def discover_csv(c, nrows=1000, **kwargs):
+if c._dshape:
+return c._dshape
Member

Perhaps we could add an ensure_consistent_dshape keyword argument (default False). If set to True, c._dshape would be checked against the dshape discovered from the df loaded below, so that a user-specified dshape that is incompatible with the data is caught.
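
One way the opt-in consistency check could look, sketched with the standard library only. Everything here is an assumption for illustration: dicts of type names stand in for datashapes, `infer_type`, `discover_columns` and the `RANK` promotion order are hypothetical, and "compatible" is taken to mean "same column names, and each discovered type can be widened to the declared one".

```python
import csv
import io

# Assumed promotion order: a discovered type may be widened to a declared one.
RANK = {'int64': 0, 'float64': 1, 'string': 2}


def infer_type(cells):
    """Return the narrowest type name that parses every cell."""
    for name, cast in (('int64', int), ('float64', float)):
        try:
            for cell in cells:
                cast(cell)
        except ValueError:
            continue  # this type does not fit; try the next wider one
        return name
    return 'string'


def discover_columns(text):
    """Infer a {column: type name} mapping from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {name: infer_type([row[i] for row in body])
            for i, name in enumerate(header)}


def discover(text, schema=None, ensure_consistent=False):
    """Return the user schema if given, optionally checking it first."""
    if schema is None:
        return discover_columns(text)
    if ensure_consistent:
        found = discover_columns(text)
        if list(found) != list(schema):
            raise ValueError("column names differ: %r vs %r"
                             % (list(found), list(schema)))
        for name in schema:
            if RANK[found[name]] > RANK[schema[name]]:
                raise ValueError("column %r: discovered %s does not fit "
                                 "declared %s"
                                 % (name, found[name], schema[name]))
    return schema
```

Keeping the check off by default preserves the point of the PR (skipping discovery when the user already knows the datashape), while letting callers pay for a sample read when they want the safety net.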


df = csv_to_dataframe(c, nrows=nrows, **kwargs)
df = coerce_datetimes(df)

7 changes: 7 additions & 0 deletions odo/backends/tests/test_csv.py
@@ -398,6 +398,13 @@ def test_discover_with_dotted_names():
assert dshape == datashape.dshape('var * {"a.b": int64, "c.d": int64}')
assert dshape.measure.names == [u'a.b', u'c.d']

+def test_discover_csv_with_fixed_dshape():
+with filetext('name,val\nAlice,1\nBob,2') as fn:
+ds = datashape.dshape('var * {name: string, val: float64}')
+csv = CSV(fn, dshape=ds)
+ds1 = discover(csv)
+assert ds1 == ds
Member

We should add a test that verifies that a passed-in datashape overrides the one that would otherwise be discovered.

Perhaps a CSV file like:

a,b
1,1.0
 , 
2,2.0

And an overridden dshape like var * {a: ?int32, b: ?int64}.
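
A rough, library-free illustration of what such a test would assert, using the file suggested above. The inference below is a stand-in written for this sketch (blank cells mark a column as optional with a leading `?`); it is not what odo actually discovers for this file, and `infer_column_type` and this `discover` are hypothetical names.

```python
import csv
import io


def infer_column_type(cells):
    """Infer a datashape-like type name; blank cells make it optional."""
    present = [c.strip() for c in cells if c.strip()]
    optional = '?' if len(present) < len(cells) else ''
    for name, cast in (('int64', int), ('float64', float)):
        try:
            for c in present:
                cast(c)
        except ValueError:
            continue  # this type does not fit; try the next wider one
        return optional + name
    return optional + 'string'


def discover(text, schema=None):
    """Return the user-supplied schema if present, else infer one."""
    if schema is not None:
        return schema
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {name: infer_column_type([row[i] for row in body])
            for i, name in enumerate(header)}


text = "a,b\n1,1.0\n , \n2,2.0"
override = {'a': '?int32', 'b': '?int64'}

inferred = discover(text)
overridden = discover(text, schema=override)

# The override must win, and it must differ from what inference reports,
# otherwise the test cannot tell the two code paths apart.
assert overridden == override
assert overridden != inferred
```

The second assertion is the important one: the existing test uses a dshape that could plausibly match discovery, so it does not prove that the override path was taken.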



try:
unichr
4 changes: 2 additions & 2 deletions odo/backends/tests/test_mysql.py
@@ -202,8 +202,8 @@ def test_sql_to_csv(sql, csv):
csv = odo(sql, fn)
assert odo(csv, list) == data

-# explicitly test that we do NOT preserve the header here
-assert discover(csv).measure.names != discover(sql).measure.names
+# explicitly test that we do NOT preserve the header here ???
+#assert discover(csv).measure.name != discover(sql).measure.name
Contributor Author

Is there a reason the header should NOT be preserved? Or was it just not preserved before because we kept rediscovering the datashape?

Member

Why was this commented out?



def test_sql_select_to_csv(sql, csv):