Incorporate I/O component in the compute benchmarks #26

Open
andersy005 opened this issue Sep 11, 2019 · 2 comments

@andersy005
Member

andersy005 commented Sep 11, 2019

For the compute benchmarks, we've been generating and persisting the data in memory for every combination of chunk_size and chunking_scheme prior to the computations (see the sketch after the config below):

  chunk_size:
    - 32MB
    - 64MB
    - 128MB
    - 256MB
  chunking_scheme:
    - spatial
    - temporal
    - auto
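
For reference, a minimal sketch of what this in-memory setup looks like with dask.array; the array shape, the chunking logic, and the make_dataset helper are illustrative only, not the exact code used by the benchmarks:

  import dask
  import dask.array as da

  def make_dataset(chunk_size="128MB", chunking_scheme="temporal"):
      # Illustrative (time, lat, lon) shape; the real benchmarks use their own sizes
      shape = (730, 1024, 1024)
      if chunking_scheme == "temporal":
          chunks = {0: "auto", 1: -1, 2: -1}      # chunk along time only
      elif chunking_scheme == "spatial":
          chunks = {0: -1, 1: "auto", 2: "auto"}  # chunk along lat/lon only
      else:
          chunks = "auto"                         # let dask decide everything
      # "auto" chunk sizes are resolved against the requested chunk_size
      with dask.config.set({"array.chunk-size": chunk_size}):
          data = da.random.random(shape, chunks=chunks)
      return data.persist()  # data stays in cluster memory for the whole run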

Per discussions with @rabernat, @kmpaul, @tinaok, @guillaumeeb, it is crucial to have an I/O component that emulates real use cases: the data will almost always live on the filesystem and be bigger than what we can persist into memory.

I/O benchmarks

A few months ago, @kmpaul and @halehawk conducted an IOR-based I/O scaling study (C/MPI-based code) that compared:

  • Z5
  • netCDF4
  • HDF5
  • PnetCDF
  • MPIIO
  • POSIX

In zarr-hdf-benchmarks (Python/mpi4py-based code), @rabernat compared both the write and read components.


How should we go about incorporating an I/O component in the compute benchmarks?

  • Should we focus on the read component by generating a dataset and writing it with the same chunking and compression to both netCDF4 and Zarr for every chunk_size and chunking_scheme combination, and then testing a variety of access approaches (see the sketch after this list)?
  • Should the write component be taken into consideration too?
  • One of our long-term goals for this repo is that the benchmarks should be runnable on different platforms (HPC, cloud) and storage systems. Both https://github.com/rabernat/zarr_hdf_benchmarks and https://github.com/NCAR/ior_scaling are MPI-dependent, and I was wondering whether the I/O components for these benchmarks could be Python/Dask-based instead?
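
To make the first question concrete, here is a hypothetical sketch of writing one synthetic dataset with matching chunking and compression to both Zarr and netCDF4; the file names, the Blosc/zlib compressor choice, and the compression level are assumptions, not decisions that have been made:

  import numpy as np
  import xarray as xr
  import zarr

  # One synthetic variable; the real benchmarks would sweep chunk_size and
  # chunking_scheme and generate one dataset per combination
  ds = xr.Dataset(
      {"tas": (("time", "lat", "lon"), np.random.rand(365, 192, 288).astype("f4"))}
  )
  chunks = {"time": 73, "lat": 192, "lon": 288}

  # Zarr store: chunk shape comes from the dask chunks, Blosc/zlib compressor
  ds.chunk(chunks).to_zarr(
      "benchmark_data.zarr",
      mode="w",
      encoding={"tas": {"compressor": zarr.Blosc(cname="zlib", clevel=1)}},
  )

  # netCDF4 file: same chunk shape, zlib compression at a comparable level
  ds.to_netcdf(
      "benchmark_data.nc",
      engine="netcdf4",
      encoding={"tas": {"chunksizes": (73, 192, 288), "zlib": True, "complevel": 1}},
  )

Read benchmarks could then open each store with xr.open_zarr / xr.open_dataset and time the same access patterns against both formats.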
@andersy005 pinned this issue Sep 11, 2019
@guillaumeeb
Member

Thanks @andersy005 for opening this issue!

Note that https://github.com/pangeo-data/storage-benchmarks (as you are probably aware) was trying to answer the question of I/O performance. I've never had time to dig deeper into that repo (but I would still like to), so I'm not sure how much of it can be kept...

To answer some of your questions, I think we want both read and write components, measured independently at first (if possible). I also think we can make those benchmarks depend only on Dask, and keep them mostly independent of the infrastructure. Using the Zarr backend at first should make it pretty easy to run the benchmarks on an HPC system or in a public cloud. If we want to test with NetCDF, the cloud setup will be trickier, but we already know this would lead to poor performance in an object store...

I think we should keep it simple at first:

  • parallel writing and scaling should be handled by Dask,
  • we should start with only the Zarr backend (and at first only an HPC system),
  • we should look at what we can save from https://github.com/pangeo-data/storage-benchmarks,
  • first, implement a write test that generates random data in memory and just writes it to disk,
  • then, implement a read test that reads this data back (beware of storage-cache effects), and maybe does some trivial operation like a mean or a count (a rough sketch of these two steps follows below).
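
A rough sketch of those last two steps with Dask and Zarr, assuming a Dask cluster is already running and a POSIX scratch directory is the target; the store path and array size are placeholders:

  import time
  import dask.array as da

  store = "scratch_io_benchmark.zarr"  # placeholder path on the target filesystem

  # Write test: random data generated and persisted in memory, then written to disk
  data = da.random.random((8192, 8192), chunks=(1024, 1024)).persist()
  start = time.perf_counter()
  data.to_zarr(store, overwrite=True)
  write_seconds = time.perf_counter() - start

  # Read test: read the data back and apply a trivial reduction
  # (beware of the storage cache: drop caches or use a fresh store between runs)
  start = time.perf_counter()
  mean = da.from_zarr(store).mean().compute()
  read_seconds = time.perf_counter() - start

  print(f"write: {data.nbytes / write_seconds / 1e6:.1f} MB/s, "
        f"read: {data.nbytes / read_seconds / 1e6:.1f} MB/s")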

@kmpaul unpinned this issue Jan 14, 2020
@kmpaul
Collaborator

kmpaul commented Feb 2, 2021

I believe some of this issue was addressed by #44.
