Support for Views #87
Thanks! Quick question. Is it the direct support for views that is important, or would you be satisfied with in-place versions of each of the macros? I've been planning to add |
Ah, I think I see what you mean, which serves a different purpose. Let me look and see how to go about doing this. As an aside, I've also been planning to add logging support, which is automated printing of the changes that each step in the chain achieved. That's also a bit different than what you requested. We will look into views and how best to support them. |
You're awesome :) In-place operations are great, glad to hear it's being worked on. I'll look into this view thing as well, but I imagine y'all will have much better ideas about how to proceed. The DataFramesMeta
Or
It seems like the current, very tidy, Tidier syntax for passing args would have this looking something like:
Maybe passing the arg a single time within the
Maybe extending the
It would be very nice if it were possible to do something like this, but I have no idea how it would even work:
This is interesting! To be clear, I'm storing intermediate views in order to have a count of unique records at each step of joining and transforming data from two databases, and saving them to CSVs if those counts aren't what is expected to later audit the databases. For example, if the count of records with mismatched values in some status field which should be identical between two databases > 0 the view is saved and written. |
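A minimal sketch of the audit workflow described above, using plain DataFrames.jl and CSV.jl rather than TidierData. The table, the column names (`status_a`, `status_b`), and the output file name are hypothetical placeholders, not taken from the actual project.

```julia
using DataFrames, CSV

# Hypothetical joined table: one status column per source database
joined = DataFrame(id = 1:4,
                   status_a = ["open", "closed", "open", "closed"],
                   status_b = ["open", "open", "open", "closed"])

# Rows where the two status fields disagree, kept as a view (SubDataFrame)
mismatched = subset(joined, [:status_a, :status_b] => ByRow(!=); view = true)

# Write the view out for auditing only when mismatches exist
if nrow(mismatched) > 0
    CSV.write("status_mismatches.csv", mismatched)  # hypothetical file name
end
```

Because `mismatched` is a view, counting its rows does not copy the underlying data; the copy only happens when the rows are serialized to CSV.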
Thanks for raising this issue. I've been reading through the documentation for DataFrames.jl and trying to understand what I would need to implement to make this work as intended. I think this might get you what you're looking for. Can you see if this works as you'd expect?

    @chain df begin
        @filter col1 == 1
        @distinct col2
        view(:,:)
    end

I'm happy to make a macro for this if this turns out to be what you want. Also, note that you can use the For example:

    @chain df begin
        @filter col1 == 1
        @distinct col2
        @aside temp = view(_, :,:) # Creates a variable named `temp` that you can access afterwards
        # ... rest of piped functions continued here...
    end

Thoughts? |
Thinking more on this, I'm not sure this will quite give you what you're looking for. The example I shared will return a view that you can work with further for logging purposes. However, if you feed that view back into TidierData, then it will get instantiated as a copy (a new data frame). Looking at DataFramesMeta documentation, it looks to me like only the subset macros support views. Because the select macros can create new columns, I don't think they can operate purely on views (unless you only point to existing columns). The big question: is it sufficient to return a view, or would you want TidierData to also be able to operate on that view without making a copy? |
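The distinction between selecting existing columns and creating new ones can be sketched in plain DataFrames.jl. This assumes DataFrames.jl 1.x, where `select` on a `SubDataFrame` with `copycols = false` is documented to return a view when it only selects columns; the data and column names are illustrative only.

```julia
using DataFrames

df = DataFrame(col1 = [1, 1, 2], col2 = ["a", "b", "c"])

# A view over the rows where col1 == 1; no data is copied
sdf = view(df, df.col1 .== 1, :)

# Selecting only existing columns with copycols = false keeps the result a view
only_existing = select(sdf, :col2; copycols = false)

# Creating a new column forces a fresh DataFrame to be allocated
with_new = select(sdf, :col2, :col1 => ByRow(x -> x + 1) => :col3)
```

Here `with_new` is a regular `DataFrame`, which is the copy behaviour described above.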
Your second reply is exactly right. Sorry I should have clarified I'd already experimented with what you mentioned in the earlier reply. This returns a view:
but internally (I think this is the right line) the
which creates an internal copy which has to get GCd, which is often slower and uses more memory than operating on views. You are correct that only
To answer the big question: Yes it would be great if
I didn't know about
which would be especially nice if extended to support begin and end blocks as in the other issue I opened #88 . Something like:
When combined with views something like this should be faster than how it would currently work:
|
Adding view support to @filter, @distinct, and @select would be great. It's far faster and more memory efficient. Right now my code for a project is split between massive Tidier @chains for reading in and wrangling data (which is shockingly fast), and DataFramesMeta view_df = @subset(df, :col [condition]; view = true) expressions to get SubDataFrames for logging and validation. It would be nice to have a cleaner namespace and use the same syntax throughout.
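For comparison, a minimal sketch of the underlying DataFrames.jl calls that already accept `view = true`, assuming DataFrames.jl 1.x. It shows what a view-returning filter and distinct step look like without any macros; it is not how TidierData implements these steps, and the data is made up.

```julia
using DataFrames

df = DataFrame(col1 = [1, 1, 2, 1], col2 = ["a", "a", "b", "c"])

# subset with view = true returns a SubDataFrame instead of a new DataFrame
filtered = subset(df, :col1 => ByRow(==(1)); view = true)

# unique also accepts view = true, so the deduplicated result is still a view
deduped = unique(filtered, :col2; view = true)
```

Both `filtered` and `deduped` are `SubDataFrame`s backed by `df`, so no intermediate copies need to be garbage collected.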