I/O benchmarks
For the compute benchmarks, we've been generating and persisting the data in memory for every combination of chunk_size and chunking_scheme prior to the computations.
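For context, here is a minimal, hypothetical sketch of that pattern using dask.array (the parameter values, shapes, and the make_dataset helper are illustrative only, not the repo's actual code):

```python
# Hypothetical sketch only; not the actual benchmark code.
import dask.array as da

# Illustrative parameter grid; the real benchmarks sweep their own values.
chunk_sizes = [64, 128, 256]
chunking_schemes = ["temporal", "spatial"]

def make_dataset(chunk_size, chunking_scheme, shape=(1024, 512, 512)):
    """Generate random data whose chunk layout follows the requested scheme."""
    if chunking_scheme == "temporal":
        chunks = (chunk_size, shape[1], shape[2])    # split along time only
    else:
        chunks = (shape[0], chunk_size, chunk_size)  # split along the spatial dims
    return da.random.random(shape, chunks=chunks)

for size in chunk_sizes:
    for scheme in chunking_schemes:
        # Pin the generated data in memory before timing the computations.
        data = make_dataset(size, scheme).persist()
        # ... run the compute benchmarks against `data` here ...
```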
Per discussions with @rabernat, @kmpaul, @tinaok, @guillaumeeb, it is crucial to have an I/O component that emulates real use cases: the data will almost always live on the filesystem and be bigger than what we can persist into memory.

A few months ago, @kmpaul and @halehawk conducted an IOR-based I/O scaling study (C/MPI-based code). In zarr-hdf-benchmarks (Python/mpi4py-based code), @rabernat compared both the write and read components.
How should we go about incorporating an I/O component into the compute benchmarks?
Should we focus on the read component by generating a dataset with the same chunking and compression in both netCDF4 and Zarr for every chunk_size and chunking_scheme combination, and then testing a variety of access approaches? (See the sketch after these questions.)
Should the write component be taken into consideration too?
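To make the first question concrete, here is a hedged sketch of writing the same dataset to both formats with xarray. The variable name, shape, chunk sizes, and encoding values are placeholders; chunking is matched exactly, while compression is configured per backend (zlib for netCDF4, Zarr's default Blosc compressor) and would need further tuning to be truly comparable.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"t2m": (("time", "lat", "lon"), np.random.rand(365, 180, 360).astype("float32"))}
)
chunks = {"time": 90, "lat": 180, "lon": 360}  # one chunk_size/chunking_scheme combination

# netCDF4 copy: on-disk chunks match the dask chunks, zlib compression.
ds.chunk(chunks).to_netcdf(
    "benchmark_data.nc",
    engine="netcdf4",
    encoding={"t2m": {"zlib": True, "complevel": 1, "chunksizes": (90, 180, 360)}},
)

# Zarr copy: same chunk layout, default Blosc compression.
ds.chunk(chunks).to_zarr(
    "benchmark_data.zarr",
    encoding={"t2m": {"chunks": (90, 180, 360)}},
    mode="w",
)
```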
One of our long-term goals for this repo is that the benchmarks should be runnable on different platforms (HPC, Cloud) and storage systems. Both https://github.com/rabernat/zarr_hdf_benchmarks and https://github.com/NCAR/ior_scaling are MPI-dependent, and I was wondering whether the I/O components for these benchmarks could be Python/Dask-based?
Note that https://github.com/pangeo-data/storage-benchmarks (as you are probably aware) was trying to answer the question of I/O performance. I've never had time to dig deeper into this repo (but would still like to), so I'm not sure how much of it can be kept...
To answer some of your questions, I think we want both read and write components, in an independent way at first (if possible). I also think we can make those benchmarks depend only on Dask, and be mostly independent of the infrastructure. Using the Zarr backend at first should make it pretty easy to run the benchmarks on an HPC system or in a public cloud. If we want to test with NetCDF, the cloud setup will be more tricky, but we already know this would lead to poor performance on an object store...
I think we should keep it simple at first:
parallel writing and scaling should be handled by Dask,
we should start with only the Zarr backend (and at first only on HPC systems).
First, implement a write test that generates random data in memory and just writes it to disk.
Then implement a read test that reads this data back (beware of storage solutions' cache effects), and maybe do some trivial operation like a mean or a count. (A sketch of both steps follows below.)
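A rough sketch of those two steps with dask.array and Zarr; the path, shape, and chunk sizes are placeholders, not agreed-upon benchmark settings:

```python
import time
import dask.array as da

store = "scratch/io_benchmark.zarr"   # placeholder path on the filesystem under test
shape, chunks = (4096, 4096, 64), (512, 512, 64)

# 1) Write test: generate random data in memory and write it straight to disk.
data = da.random.random(shape, chunks=chunks)
t0 = time.perf_counter()
data.to_zarr(store, overwrite=True)
write_seconds = time.perf_counter() - t0

# 2) Read test: read the data back and apply a trivial reduction (a mean).
#    Running this in a fresh process (or clearing caches between runs) helps avoid
#    measuring the storage system's cache instead of the storage itself.
t0 = time.perf_counter()
result = da.from_zarr(store).mean().compute()
read_seconds = time.perf_counter() - t0
```

Scaling would then come from running the same script against a larger Dask cluster and bigger shapes, with the store pointed at the HPC filesystem or, later, an object store.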