Incorporate I/O component in the compute benchmarks #26

Open
andersy005 opened this issue Sep 11, 2019 · 2 comments

@andersy005
Member

andersy005 commented Sep 11, 2019

For the compute benchmarks, we've been generating and persisting the data in memory for every combination of chunk_size and chunking_scheme prior to the computations (see the sketch after the config below):

  chunk_size:
    - 32MB
    - 64MB
    - 128MB
    - 256MB
  chunking_scheme:
    - spatial
    - temporal
    - auto
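
For reference, a minimal sketch of what this in-memory setup looks like with dask.array; the array shape, the chunking logic, and the make_dataset helper are illustrative only, not the exact code used by the benchmarks:

  import dask
  import dask.array as da

  def make_dataset(chunk_size="128MB", chunking_scheme="temporal"):
      # Illustrative (time, lat, lon) shape; the real benchmarks use their own sizes
      shape = (730, 1024, 1024)
      if chunking_scheme == "temporal":
          chunks = {0: "auto", 1: -1, 2: -1}      # chunk along time only
      elif chunking_scheme == "spatial":
          chunks = {0: -1, 1: "auto", 2: "auto"}  # chunk along lat/lon only
      else:
          chunks = "auto"                         # let dask decide everything
      # "auto" chunk sizes are resolved against the requested chunk_size
      with dask.config.set({"array.chunk-size": chunk_size}):
          data = da.random.random(shape, chunks=chunks)
      return data.persist()  # data stays in cluster memory for the whole run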

Per discussions with @rabernat, @kmpaul, @tinaok, @guillaumeeb, it is crucial to have an I/O component that emulates real use cases: the data will almost always live on the filesystem and be bigger than what we can persist into memory.

I/O benchmarks

A few months ago, @kmpaul and @halehawk conducted an IOR-based I/O scaling study (C/MPI-based code) that compared:

  • Z5
  • netCDF4
  • HDF5
  • PnetCDF
  • MPIIO
  • POSIX

In zarr-hdf-benchmarks (Python/mpi4py-based code), @rabernat compared both the write and read components.


How should we go about incorporating an I/O component in the compute benchmarks?

  • Should we focus on the read component by generating a dataset and writing it with the same chunking and compression to both netCDF4 and Zarr for every chunk_size and chunking_scheme combination, and then testing a variety of access approaches (see the sketch after this list)?
  • Should the write component be taken into consideration too?
  • One of our long-term goals for this repo is that the benchmarks should be runnable on different platforms (HPC, cloud) and storage systems. Both https://github.com/rabernat/zarr_hdf_benchmarks and https://github.com/NCAR/ior_scaling are MPI-dependent, and I was wondering whether the I/O components for these benchmarks could be Python/Dask-based instead?
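
To make the first question concrete, here is a hypothetical sketch of writing one synthetic dataset with matching chunking and compression to both Zarr and netCDF4; the file names, the Blosc/zlib compressor choice, and the compression level are assumptions, not decisions that have been made:

  import numpy as np
  import xarray as xr
  import zarr

  # One synthetic variable; the real benchmarks would sweep chunk_size and
  # chunking_scheme and generate one dataset per combination
  ds = xr.Dataset(
      {"tas": (("time", "lat", "lon"), np.random.rand(365, 192, 288).astype("f4"))}
  )
  chunks = {"time": 73, "lat": 192, "lon": 288}

  # Zarr store: chunk shape comes from the dask chunks, Blosc/zlib compressor
  ds.chunk(chunks).to_zarr(
      "benchmark_data.zarr",
      mode="w",
      encoding={"tas": {"compressor": zarr.Blosc(cname="zlib", clevel=1)}},
  )

  # netCDF4 file: same chunk shape, zlib compression at a comparable level
  ds.to_netcdf(
      "benchmark_data.nc",
      engine="netcdf4",
      encoding={"tas": {"chunksizes": (73, 192, 288), "zlib": True, "complevel": 1}},
  )

Read benchmarks could then open each store with xr.open_zarr / xr.open_dataset and time the same access patterns against both formats.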
@andersy005 pinned this issue Sep 11, 2019
@guillaumeeb
Member

Thanks @andersy005 for opening this issue!

Note that https://github.com/pangeo-data/storage-benchmarks (as you are probably aware) was trying to answer the question of I/O performance. I've never had time to dig deeper into that repo (but I would still like to), so I'm not sure how much of it can be kept...

To answer some of your questions, I think we want both read and write components, measured independently at first (if possible). I also think we can make those benchmarks depend only on Dask, and keep them mostly independent of the infrastructure. Using the Zarr backend at first should make it pretty easy to run the benchmarks on an HPC system or in a public cloud. If we want to test with NetCDF, the cloud setup will be trickier, but we already know this would lead to poor performance in an object store...

I think we should keep it simple at first:

  • parallel writing and scaling should be handled by Dask,
  • we should start with only the Zarr backend (and at first only an HPC system),
  • we should look at what we can save from https://github.com/pangeo-data/storage-benchmarks,
  • first, implement a write test that generates random data in memory and just writes it to disk,
  • then, implement a read test that reads this data back (beware of storage-cache effects), and maybe does some trivial operation like a mean or a count (a rough sketch of these two steps follows below).
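
A rough sketch of those last two steps with Dask and Zarr, assuming a Dask cluster is already running and a POSIX scratch directory is the target; the store path and array size are placeholders:

  import time
  import dask.array as da

  store = "scratch_io_benchmark.zarr"  # placeholder path on the target filesystem

  # Write test: random data generated and persisted in memory, then written to disk
  data = da.random.random((8192, 8192), chunks=(1024, 1024)).persist()
  start = time.perf_counter()
  data.to_zarr(store, overwrite=True)
  write_seconds = time.perf_counter() - start

  # Read test: read the data back and apply a trivial reduction
  # (beware of the storage cache: drop caches or use a fresh store between runs)
  start = time.perf_counter()
  mean = da.from_zarr(store).mean().compute()
  read_seconds = time.perf_counter() - start

  print(f"write: {data.nbytes / write_seconds / 1e6:.1f} MB/s, "
        f"read: {data.nbytes / read_seconds / 1e6:.1f} MB/s")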

@kmpaul unpinned this issue Jan 14, 2020
@kmpaul
Collaborator

kmpaul commented Feb 2, 2021

I believe some of this issue was addressed by #44.
