Can I fit a sklearn classifier using lazy? #390
Hi! 👋

My use-case is the following: I have lots of files (>100) which add up to ~45GB of data, and I want to fit a sklearn classifier with them, which require a […]

Not all of the events in my dataset are useful, so I would like to apply some cuts before fitting my classifier, and as it is a lot of data, I can't read it all in one go. Not sure if this is possible, but I am trying to do it like this:

    data = uproot.lazy(
        "/data/*.root:my_branch",
        filter_name=vars_to_use,
        step_size=step_size,
        num_workers=workers,
    )
    # apply some cuts and store the result in "clean". For example:
    clean = data[data.mode == 4]
    # create the "target" array for the classifier
    target = (clean.charge == +1)

At this point, I have one question: does […]

I am not sure if the inner workings of uproot and awkward-arrays make this possible, but from the point of view of the user, I think it could be solved by allowing […]

Anyway, let's say I figure that out. I am testing with a small subset and when I try to fit my sklearn classifier with […]

Which is because […]

Is there another way to achieve what I want? Is it even possible with uproot? Is there an alternative to sklearn that's recommended to use with uproot?

Please let me know what you think or if I am misunderstanding something 😃
Replies: 1 comment 2 replies
uproot.lazy defers reading until a slice of an array is requested, but Scikit-Learn's […]

On the specific point of slicing a lazy array, many kinds of slices are deferred, so I think your cut doesn't cause the array to be read. Passing it to Scikit-Learn's […]

On reshaping, an Awkward Array can't be arbitrarily reshaped because that presupposes that it's rectangular. However, this slice:

    array[:, np.newaxis]

would do the same job, regardless of whether it's a rectangular NumPy array or an arbitrarily nested Awkward Array. (Scikit-Learn's recommendation could have been more general, but the authors of Scikit-Learn weren't thinking about nested data structures.)

As for your main problem, you'll have to manually slice up the array and feed it to […]

A […]

Finally, if you go the route of […]
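To make the "manually slice up the array" idea concrete, here is a minimal sketch, not a definitive recipe. It assumes an estimator that supports incremental learning; Scikit-Learn's SGDClassifier with its partial_fit method is my choice for illustration, not something taken from the thread. The batches are synthetic NumPy arrays produced by a hypothetical iter_batches helper, and the field names mode, charge, and feature are invented; in practice each batch would come from reading the files one chunk at a time. The cut and the [:, np.newaxis] slice are applied per batch, so only one batch is in memory at once.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def iter_batches(n_batches=20, batch_size=500):
    """Hypothetical stand-in for reading one chunk of events at a time."""
    for _ in range(n_batches):
        mode = rng.integers(0, 8, size=batch_size)
        charge = rng.choice([-1, 1], size=batch_size)
        # One made-up feature, correlated with the charge plus some noise.
        feature = charge + rng.normal(scale=0.5, size=batch_size)
        yield {"mode": mode, "charge": charge, "feature": feature}


clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # every label the classifier will ever see
n_seen = 0

for batch in iter_batches():
    keep = batch["mode"] == 4  # the cut, applied to this batch only
    if not keep.any():
        continue
    # Scikit-Learn wants a 2-D feature matrix: (n,) -> (n, 1).
    X = batch["feature"][keep][:, np.newaxis]
    y = (batch["charge"][keep] == 1).astype(int)
    # partial_fit updates the model with this batch and then forgets it;
    # the full list of classes must be supplied on the first call.
    clf.partial_fit(X, y, classes=classes)
    n_seen += len(y)

print(f"trained on {n_seen} events that passed the cut")
```

Only estimators that implement partial_fit can be trained this way, so whether this is suitable depends on which classifier you need.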
uproot.lazy defers reading until a slice of an array is requested, but Scikit-Learn's fit function asks for the entire sliced array. For Scikit-Learn to pull only one chunk of the array at a time, it would have to be knowledgeable about chunking: the fitting algorithm would need to be able to deal with batches, and the interface would have to recognize that slicing the input and training on one slice at a time is a benefit. In other words, Scikit-Learn would have to be "in on it [the batching of data]." There might be some interface between Scikit-Learn and Dask, but Scikit-Learn doesn't know about Awkward Arrays. (Which is why we want to add interfaces between Awkward and Dask, so that third p…