Can I fit a sklearn classifier using lazy? #390

peguerosdc · 2021-07-07T22:41:07Z


Hi! 👋 My use-case is the following:

I have lots of files (>100) which add up to ~45GB of data and I want to fit a sklearn classifier with them, which requires a data array (2D) and a target array (1D).

Not all of the events in my dataset are useful, so I would like to apply some cuts before fitting my classifier and as it is a lot of data, I can't read it all in one go.

Not sure if this is possible, but I am trying to do it like this:

import uproot

# vars_to_use, step_size, and workers are defined elsewhere
data = uproot.lazy(
    "/data/*.root:my_branch",
    filter_name=vars_to_use,
    step_size=step_size,
    num_workers=workers,
)
# apply some cuts and store the result in "clean". For example:
clean = data[data.mode == 4]
# create the "target" array for the classifier
target = clean.charge == +1

At this point, I have one question: does clean store in memory all the events that pass the cuts (sorry, I didn't quite understand this part of the docs, but I think the answer is yes)? I wanted to use lazy to avoid doing precisely that. I was expecting that by passing it to the classifier as the data, it would read the events only as they are required in the fitting process, but the need to apply cuts is then forcing me to still read all the data and store a large percentage of it.

I am not sure if the inner workings of uproot and awkward-arrays make this possible, but from the point of view of the user, I think it could be solved by allowing lazy to apply cuts directly (the same way as iterate).

Anyway, let's say I figure that out. I am testing with a small subset and when I try to fit my sklearn classifier with model.fit(clean, target), I get the following error:

ValueError: Expected 2D array, got 1D array instead:
...
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

This happens because clean is not a 2D array but one long "awkward" 1D array. Is there a workaround for this, or is uproot not meant to work with sklearn and I am doing everything wrong? NumPy has reshape, but Awkward does not, and lazy forces me to use ak.

Is there another way to achieve what I want? Is it even possible with uproot? Is there an alternative to sklearn that's recommended to use with uproot?

Please let me know what you think or if I am misunderstanding something 😃

Answered by jpivarski

jpivarski · 2021-07-07T23:31:11Z

Maintainer

uproot.lazy defers reading until a slice of an array is requested, but Scikit-Learn's fit function asks for the entire sliced array. For Scikit-Learn to pull only one chunk of the array at a time, it would have to know about chunking: the fitting algorithm would need to be able to deal with batches, and the interface would have to recognize that slicing the input and training on one slice at a time is a benefit. In other words, Scikit-Learn would have to be "in on it [the batching of data]." There might be some interface between Scikit-Learn and Dask, but Scikit-Learn doesn't know about Awkward Arrays. (Which is why we want to add interfaces between Awkward and Dask, so that third-party libraries would recognize them as lazy.)

On the specific point of slicing a lazy array, many kinds of slices are deferred, so I think your cut doesn't cause the array to be read. Passing it to Scikit-Learn's fit function definitely does, though.

On reshaping, an Awkward Array can't be arbitrarily reshaped because that presupposes that it's rectangular. However, this slice:

array[:, np.newaxis]

would do the same job, regardless of whether it's a rectangular NumPy array or an arbitrarily nested Awkward Array. (Scikit-Learn's recommendation could have been more general, but the authors of Scikit-Learn weren't thinking about nested data structures.)
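To make the equivalence concrete for the rectangular case, here is a minimal sketch using plain NumPy only (the values are made up for illustration). It only checks that the np.newaxis slice matches the reshape Scikit-Learn suggests; it does not exercise an Awkward Array.

```python
import numpy as np

# A flat array of 5 samples with a single feature each (made-up values).
values = np.array([0.5, 1.5, 2.5, 3.5, 4.5])

# Slicing with np.newaxis adds a length-1 axis after the first one...
column_a = values[:, np.newaxis]

# ...which, for a rectangular array, matches Scikit-Learn's suggested reshape.
column_b = values.reshape(-1, 1)

print(column_a.shape)                      # (5, 1)
print(np.array_equal(column_a, column_b))  # True
```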

As for your main problem, you'll have to manually slice up the array and feed it to fit one batch at a time. Some machine learning models can't be trained in batches, or at least they're not guaranteed to get the same results when they are trained in batches, but often it can be done as an approximation. (I'm thinking of clustering, which needs all points for the exact answer, but would be close enough if iteratively trained on subsamples.) That's a fundamental issue that goes beyond the interface. (Even with an interface that accepts a lazy array and extracts batches from it, the algorithm must be capable of batching...)
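As a rough illustration of batch-wise training: in Scikit-Learn, calling fit again generally starts over rather than continuing, so estimators that can learn incrementally expose a separate partial_fit method. The sketch below assumes SGDClassifier is an acceptable model (the thread does not say which classifier is used) and replaces the real events with synthetic batches from a hypothetical make_batch helper.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

def make_batch(n_events=1000):
    """Synthetic stand-in for one batch of events (hypothetical data)."""
    features = rng.normal(size=(n_events, 3))   # 2D: (events, features)
    target = (features[:, 0] > 0).astype(int)   # 1D: one label per event
    return features, target

model = SGDClassifier(random_state=0)
all_classes = np.array([0, 1])  # partial_fit needs the full label set up front

for _ in range(20):
    X_batch, y_batch = make_batch()
    # partial_fit updates the model with this batch only; earlier batches
    # do not need to stay in memory.
    model.partial_fit(X_batch, y_batch, classes=all_classes)

X_test, y_test = make_batch()
accuracy = model.score(X_test, y_test)
print(f"accuracy on a fresh batch: {accuracy:.2f}")
```

Only some estimators provide partial_fit, and the result is not guaranteed to match a single fit over all the data at once, which is the approximation caveat described above.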

A for loop over range slices (array[start:stop]) would keep the whole array from being read at once. However, using iterate rather than lazy is equivalent and controls memory usage more carefully (because iterate knows it's being used in a sequential loop and can release the right memory after each step of the loop; lazy has to guess what memory to release based on least recently used).

Finally, if you go the route of iterate and your data are not jagged, nested, or anything like that, you could use iterate with library="np" and it would be faster. You pay for what you use, and if you're not using laziness or nested data structures, bypassing that infrastructure saves memory and CPU (not necessarily a lot in all cases, but nonzero).
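With library="np", each step of iterate yields a dict of NumPy arrays keyed by branch name, so the cut and the 2D/1D split can be done per batch with plain NumPy. Here is a minimal sketch of that step, using a hand-made dict in place of a real batch. The helper name batch_to_xy and the branches "px" and "py" are hypothetical; "mode" and "charge" are taken from the question.

```python
import numpy as np

def batch_to_xy(batch, feature_names):
    """Turn one dict-of-columns batch into (2D data, 1D target) arrays."""
    keep = batch["mode"] == 4                      # the cut from the question
    X = np.column_stack([batch[name][keep] for name in feature_names])
    y = (batch["charge"][keep] == 1).astype(int)   # target: charge == +1
    return X, y

# Tiny hand-made batch standing in for what iterate would yield.
example_batch = {
    "mode":   np.array([4, 3, 4, 4]),
    "charge": np.array([1, 1, -1, 1]),
    "px":     np.array([0.1, 0.2, 0.3, 0.4]),
    "py":     np.array([1.0, 2.0, 3.0, 4.0]),
}
X, y = batch_to_xy(example_batch, ["px", "py"])
print(X.shape, y)   # (3, 2) [1 0 1]
```

In real use the batches would presumably come from a loop such as for batch in uproot.iterate("/data/*.root:my_branch", ["mode", "charge", "px", "py"], step_size=step_size, library="np"), and each (X, y) pair would go to an estimator that can be trained incrementally.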

2 replies

peguerosdc Jul 13, 2021
Author

Thank you for the very detailed response! It makes a lot of sense that Scikit-Learn needs to be aware of the batching which I hadn't considered. I will mark my question as answered.

For the curious, I ended up using iterate, but couldn't use library="np" because of the structure of my data. I was interested in using pandas because I wanted to feed the data into an existing pandas flow, but I found it was faster to read the data using library="ak", applying the cuts and then using ak.to_pandas() (per batch) than to directly use library="pd".

jpivarski Jul 13, 2021
Maintainer

"I found it was faster to read the data using library="ak", applying the cuts and then using ak.to_pandas() (per batch) than to directly use library="pd"."

I'm not really surprised by that. It's hard to make categorical performance statements, but Pandas is generally not very fast. Doing the conversion to Pandas later, when there's less data to convert, could make a noticeable difference.

To be fair, part of the cost may be constructing the MultiIndex from the nested data: the "offsets" top-down description of nesting is almost immediately what you get out of the ROOT file and it's all that Awkward (and Arrow) needs, but a MultiIndex bottom-up description is a different way of representing that nesting (similar to what Parquet uses, by the way). ak.to_pandas also has to do that calculation, but later. Oh! And it also gets to use a specialized C function to do it. Uproot doesn't do a for loop over the data or anything, but it has to resort to NumPy tricks to avoid depending on Awkward Array. That could be it.
