Replies: 3 comments
-
Hi @ekourlit, whether we can improve things here in eager mode is not something I know much about. @jpivarski has discussed this before and, I'm sure, could drop a kernel of information. What I will say is that it sounds like you'd benefit from the dask-awkward integration, which makes it possible for us to read less, as an alternative approach to reading more quickly.
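To make "read less" concrete, here is a toy sketch of the idea in plain Python. It is not how dask-awkward is implemented; `TracingTable`, `read_columns`, and the `storage` dictionary are hypothetical names made up for illustration. The analysis function is first run against a stand-in object that only records which columns get touched, and then only those columns are loaded.

```python
# Toy illustration only: NOT the dask-awkward implementation.
# All class and function names here are hypothetical.

class TracingTable:
    """Stand-in for a lazily loaded table; records which columns are requested."""

    def __init__(self, columns):
        self._columns = set(columns)
        self.touched = set()

    def __getitem__(self, name):
        if name not in self._columns:
            raise KeyError(name)
        self.touched.add(name)
        return 0.0  # placeholder value; only the access pattern matters


def pt_squared(table):
    # An "analysis" that needs only two of the available columns.
    return table["Muon_Px"] ** 2 + table["Muon_Py"] ** 2


def read_columns(storage, names):
    # Hypothetical reader: load only the requested columns from storage.
    return {name: storage[name] for name in names}


# Pretend this dictionary is an expensive-to-read file with six branches.
storage = {
    "Muon_Px": 3.0,
    "Muon_Py": 4.0,
    "Muon_Pz": 12.0,
    "Muon_E": 13.0,
    "Muon_Charge": 1,
    "Muon_Iso": 0.0,
}

# Dry run: find out which columns the calculation touches.
tracer = TracingTable(storage.keys())
pt_squared(tracer)
needed = sorted(tracer.touched)

# Real run: read only what was touched, then compute.
result = pt_squared(read_columns(storage, needed))
print(needed, result)
```

In this sketch only two of the six columns are ever read, which is the kind of saving the delayed approach is after.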
-
This won't do anything for eager reading, as @agoose77 pointed out. In eager mode, it's equivalent to taking the output of Uproot and feeding it through ak.zip, like this:

>>> import skhep_testdata
>>> import uproot
>>> import awkward as ak
>>> import vector
>>> vector.register_awkward()
>>>
>>> tree = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]
>>> arrays = tree.arrays(filter_name=["Electron_*", "Muon_*"])
>>> arrays.type.show()
2421 * {
Muon_Px: var * float32,
Muon_Py: var * float32,
Muon_Pz: var * float32,
Muon_E: var * float32,
Muon_Charge: var * int32,
Muon_Iso: var * float32,
Electron_Px: var * float32,
Electron_Py: var * float32,
Electron_Pz: var * float32,
Electron_E: var * float32,
Electron_Charge: var * int32,
Electron_Iso: var * float32
}
>>>
>>> restructured = ak.zip({
... "muon": ak.zip({
... "px": arrays.Muon_Px,
... "py": arrays.Muon_Py,
... "pz": arrays.Muon_Pz,
... "E": arrays.Muon_E,
... "charge": arrays.Muon_Charge,
... "iso": arrays.Muon_Iso,
... }, with_name="Momentum4D"),
... "electron": ak.zip({
... "px": arrays.Electron_Px,
... "py": arrays.Electron_Py,
... "pz": arrays.Electron_Pz,
... "E": arrays.Electron_E,
... "charge": arrays.Electron_Charge,
... "iso": arrays.Electron_Iso,
... }, with_name="Momentum4D"),
... },
... depth_limit=1,
... )
>>>
>>> restructured.muon.pt
<Array [[54.2, 37.7], [24.4], ..., [63.6], [42.9]] type='2421 * var * float32'>
>>> restructured.muon.eta
<Array [[-0.15, -0.295], [0.754], ..., [1.06]] type='2421 * var * float32'>
>>> restructured.muon.phi
<Array [[-2.92, 0.0184], [-1.6], ..., [-0.98]] type='2421 * var * float32'>

by providing a Form that would do the restructuring, rather than doing it explicitly with Awkward functions.

The reason that's beneficial in the delayed case is that Dask identifies which input branches were actually used in the calculation (only …).

The files passed to uproot.dask still need to be fully opened and interpreted, though this is now done on Dask workers. Knowing the Form that the data will take is not enough to know where in the file to find the data (at which byte positions), so there aren't any shortcuts taken there. Even though you have thousands of files with the same branch names and titles, the metadata describing them has to be parsed to get at the arrays of TBasket locations (…).

On the other hand, that kind of shortcut can be achieved with a database full of byte positions in ROOT files at which to find the data, as well as their interpretations. This is something that we started exploring with tiled-uproot, which we talked about at this IRIS-HEP Topical meeting (video available). I still think this would be a good thing to pursue, since the metadata parsing is painful when it has to be done many times in pure Python, and this would be a way to skip that step every time after the first.
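As a rough sketch of what "a database full of byte positions" could look like, here is a minimal standard-library example. It does not reflect tiled-uproot's actual design: the table schema, the function names, the file path, and the byte offsets are all made up for illustration, and `parse_metadata_slowly` is a hypothetical stand-in for the expensive pure-Python metadata parsing.

```python
import sqlite3

parse_calls = 0  # counts how often the "expensive" parse runs


def parse_metadata_slowly(path):
    """Hypothetical stand-in for parsing a ROOT file's metadata.

    Returns {branch: [(byte_offset, num_bytes), ...]} with made-up values.
    """
    global parse_calls
    parse_calls += 1
    return {
        "Muon_Px": [(1024, 512), (4096, 480)],
        "Muon_Py": [(1536, 512), (4576, 480)],
    }


def open_cache(db_path=":memory:"):
    # Hypothetical schema: one row per basket of each branch of each file.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS baskets ("
        "path TEXT, branch TEXT, seq INTEGER, "
        "byte_offset INTEGER, num_bytes INTEGER, "
        "PRIMARY KEY (path, branch, seq))"
    )
    return conn


def _lookup(conn, path, branch):
    query = (
        "SELECT byte_offset, num_bytes FROM baskets "
        "WHERE path = ? AND branch = ? ORDER BY seq"
    )
    return conn.execute(query, (path, branch)).fetchall()


def basket_locations(conn, path, branch):
    """Return cached basket locations, parsing the file only on a cache miss."""
    rows = _lookup(conn, path, branch)
    if rows:
        return rows
    # Cache miss: parse once and store the locations of every branch.
    for name, baskets in parse_metadata_slowly(path).items():
        conn.executemany(
            "INSERT OR REPLACE INTO baskets VALUES (?, ?, ?, ?, ?)",
            [(path, name, i, off, size) for i, (off, size) in enumerate(baskets)],
        )
    conn.commit()
    return _lookup(conn, path, branch)


conn = open_cache()
first = basket_locations(conn, "file_0001.root", "Muon_Px")   # triggers the parse
second = basket_locations(conn, "file_0001.root", "Muon_Py")  # served from the cache
print(first, second, parse_calls)
```

The second lookup never touches the parser, which is the "skip that step every time after the first" part. A real version would also have to store the interpretations and decide when a cached entry is stale (for example, if a file is rewritten), neither of which this sketch attempts.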
-
I think this is a Discussion; I'm going to move it over there.
-
I recently realised there is the known_base_form argument for uproot.dask. Would it make sense to add it to uproot.open as well, in order to accelerate the opening of similar files?

Relevant documentation: https://uproot.readthedocs.io/en/latest/uproot._dask.dask.html#uproot._dask.dask
In ATLAS we have a common data format, PHYSLITE, and analysers usually need to open O(1000) files that are identical in metadata and structure. Thus, if we provide the unique form along with the files, we could potentially accelerate the I/O.
Tagging @jackharrison111, who is working on a project limited by I/O.