Roadmap and Milestones
Andre Merzky edited this page Oct 6, 2021
- env isolation [DONE]
- close some open branches
- scheduler lookup
- sandbox
- func executor
- COVID* / raptor
- nodes
- no-mongodb
- task descriptions / task types
- termination / no-heartbeats / destructors [RECONSIDER]
- tracer / logger service
- partitions
- asyncio as base for components?
- stand-alone components
- Python-3 [DONE]
- Pilot Partitioning
- scale (constant overhead)
- heterogeneous resources
- heterogeneous workloads
- API draft & partial implementation
- ZMQ Bridges
- separation of network overlay / communication overlay
- liberate tmgr / pmgr, decouple from agent
- towards disconnect / reconnect (see below)
- security implications?
- Draft Implementation
- -- how about stable communication services?
- decoupled agent
- decoupled components / bridges
- no Python multiprocessing
- provide communication / coordination layer to workload
- data coordination
- early experiment: NGE
- -- opens up to support different client types, also to use the client alone
- decoupled components / bridges
- resilience & fault tolerance
- component failure, (bridge failures), unit failures, node failures
- changing software stack
- fickle application environment [DONE]
- bootstrapper as single process root -> simplify?
- no preliminary work
- -- check if batch job survives node failures
- -- what are the failure modes of RP? Systematize error types
- -- what does failure recovery mean in each case?
- -- feasibility studies, prototypes for different failure modes / recoveries -> proposal
- Configuration management
- preliminary usage
RU support completed
- shell expansion in configs [DONE (variable expansion)]
- overloading by user configs [DONE]
- query with module like names (
)
- json-schema based validation is planned
- Application Communication
- application level communication channels (UC: Sebastian/FU) [DONE]
- data pipelines
- service tasks [DONE]
- no preliminary work, but ZMQ channels are now independent and live in RU
- -- stay focused on RP core capabilities, don't expand too much into userspace
- faster dicts -> RU [DONE]
- re-investigate usage modes
- UC: batch submission of agent + workload
- client side API - what else do we need?
- no early results, but agent decoupling and partitioning are steps
- connectivity management -> RU
- rename CU to Task [DONE]
- Tau / Monitoring
- split scheduler in Resolver and Scheduler
- re-evaluate RS attribute interface -> RU
- use the name space - implemented, awaits PR, review and merge
- json based configuration management
- see RP config management
- connectivity management -> RU
- implement data staging and state notification fallbacks on API level. This will significantly simplify the RP pilot launchers.
- capture state of completed batch jobs reliably
- separation of network overlay (ssh tunnels) from communication protocol (ZMQ)
- support basic communication patterns over above network overlay
- network overlay (not fully automated, independently usable)
- user-space micro-service as shell interface to remote hosts (persistent, reliable, stateful, transparent, secure, ZMQ-connected)
- incomplete prototype
- -- need to cover different login node policies
- -- ensure that only one instance is alive per user
- support for json-based configuration management
- see RP
- fast and lean dictionary implementation
- toward 10^7 unit descriptions
- json schema based?
- C extension?
- Python-3?
- performance bottlenecks implemented in C or other languages:
- scheduling (pattern searches)
- dictionary implementation (see above)
- profile mangling
- timestamps
- Event and Stats Dashboard (explored)
- Define the set of RP test targets
- Use cron/at on machines with multi-factor authentication
- Test supported launch methods for each resource
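The configuration management items above (variable expansion in configs, overloading by user configs, queries with module-like names) can be illustrated with a minimal sketch. This is not the RU implementation: the function names `overlay`, `expand` and `query`, the example keys, and the use of `string.Template` for expansion are all assumptions made for illustration.

```python
import string


def overlay(base, user):
    """Return a copy of `base` where values from `user` take precedence.

    Nested dicts are merged recursively; any other value is replaced.
    (Hypothetical helper, not an RU function.)
    """
    merged = dict(base)
    for key, val in user.items():
        if isinstance(val, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], val)
        else:
            merged[key] = val
    return merged


def expand(value, env):
    """Expand $VAR / ${VAR} references in all string values.

    Unknown variables are left untouched (safe_substitute).
    """
    if isinstance(value, str):
        return string.Template(value).safe_substitute(env)
    if isinstance(value, dict):
        return {k: expand(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand(v, env) for v in value]
    return value


def query(cfg, dotted_name):
    """Look up a value by a module-like name such as 'agent.sandbox'."""
    node = cfg
    for part in dotted_name.split("."):
        node = node[part]
    return node


# Example with made-up keys: a base config overloaded by a user config.
base_cfg = {"agent": {"sandbox": "$HOME/sandbox",
                      "scheduler": {"name": "default"}}}
user_cfg = {"agent": {"scheduler": {"name": "custom"}}}

cfg = expand(overlay(base_cfg, user_cfg), {"HOME": "/home/alice"})
print(query(cfg, "agent.sandbox"))
print(query(cfg, "agent.scheduler.name"))
```

Merging before expanding means a user config can itself contain variable references; a json-schema validation step, as planned above, would naturally run on the merged result.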
MS-Feature: 12, 2016
- Topic: feature
- Target: Winter '16
- Status: waiting for MS-Scale
- ?? - GPUs !
- ?? - disconnect / reconnect
- ?? - long running
- ?? - all MPI flavors
- ?? - easy extension (app schedulers, new clusters)
MS-Scale: 09, 2016
- Topic: scaling
- Target: Fall '16
- Status: waiting for MS-Refactor-2
- ?? - scheduler algorithm, data structure
- OK - agent partitions (rendered as partition scheduler in agent)
- ?? - possibly tailing cursors and/or zmq based client/pilot communication
- ?? - stability @ scale for data staging
- ?? - disconnect / reconnect
- ?? - long running
- ?? - ORTE-LIB in production
- ?? - routine of benchmarks and (stall-based) micro-benchmarks
MS-Refactor-2: 06, 2016
- Topic: Client Refactoring
- Target: Summer '16
- Status: delayed due to termination issues
- OK - code sharing between agent and RP module
- OK - code refactoring on RP module side
- OK - cleanup of state management, entity ownership, state transitions on RP application side
- !! - shutdown issues (temporary resolution?)
- OK - performance (~SAGA performance for spawning pilots, 1 roundtrip for CU submission)
- ?? - improve integration of app kernels
- ?? - improve performance of late binding scheduler (at least understand performance)
- OK - scale of pilot / CUs
- ?? - better error analysis / provenance / tooling
- see Performance Challenges
- see State Management in RADICAL-Pilot
MS-Data-2: ??, 201x
- Topic: Data Management
- Target: ??
- ?? - Pilot Data
- ?? - Agent staging to/from arbitrary locations
MS-Resources-2: ??, 2015
- Topic: Resource Support
- Target: April
- 'stable': tests work at demo-scale
- 'production': EnsembleMD folx can use the resources
- individual target dates per resource
- OK - Blue Waters stable (proposal in June)
- OK - Titan stable
- OK - Hopper stable
- OK - SuperMIC stable
- OK - OSG
- OK - conceptual clarity on ORTE based agent
- see Performance Challenges
MS-Analysis: 2015
- Topic: Testing and Analysis
- OK - April - move to Pandas Frames (PDF) (documentation)
- OK - April - backports from aimes-experiments branch
- WIP - rebase plotting on PDF
- student project ??
- repository of radical scripts OK
- OK - integrated profiling over RADICAL stack (post-mortem to PDF)
MS9: February 31, 2015
- ?? - Focus on
- OK - scalability (linear in #pilots, superlinear in #units)
- OK - scaling limits (O(100) pilots, O(10,000) units)
- OK - agent performance (20 unit ops/second)
- OK - agent adaptable to resource architecture / OS constraints
- OK - clean up of state management and entity ownership on agent level
- OK - agent ported to relevant (i.e. accessible) architectures, while maintaining scalability
- see Performance Challenges
- see State Management in RADICAL-Pilot
MS8: August 15, 2014
- OK - Focus on
- OK - documentation
- OK - tutorials
- OK - examples
- OK - packaging
MS7: July 17, 2014
- OK - Sinon can replace BigJob as research vehicle, within the Radical, i.e. it is deemed fit for current and upcoming research projects, such as
- OK - Mark's work on workflows and pilot data
- OK - Ashley's work on Scheduling
- OK - Matteo's work on Federation
- OK - Andre's work on Application Modeling
- the respective stakeholders decide when this milestone is met, and collaborate on the respective demos
MS6: March 15, 2014
- Focus on Data Capabilities, stability
- OK - Sinon has basic data management capabilities, short of PilotData
- CU level data staging
- support for $HOME or equivalent
- OK - ticket queue is under control
MS5: February 11, 2014
- OK - Sinon can replace BigJob-as-is, within the Radical, i.e. it provides the same functionality as BigJob, as used by:
- OK - Troy
- OK - Aimes
- ?? - Affinity Implementation
- the respective stakeholders decide when this milestone is met, and collaborate on the respective demos
- ability to replace BJ means that it is usable by the respective user
- required features:
- OK - reliable and simple deployment on local machines and on FutureGrid / XSEDE (alamo, sierra, india, hotel; possibly stampede, lonestar)
- OK - performance comparable to BigJob (or faster, obviously)
- OK - no major (show-stopping) tickets by 2 weeks after Tutorial / hand-over to Radical users
- OK - Vishal's use case can be run with BJS, where Sinon replaces BJ
- OK - Troy can reliably use multiple pilots, many CUs, for its demo-3 (https://github.com/saga-project/troy/wiki/Roadmap#final-demo-demo-3)
- OK - data staging is considered to be performed out-of-band, e.g. via SAGA-Python / BJS
- OK - supported functionality (as per above) is documented, and covered by unit tests.
MS4: December 15, 2013
- OK - documentation for API and Data Model
- OK - packaging / pypi
- OK - examples
- OK - performance of components is measured and understood
- OK - performs as well as BigJob
- OK - multiple UnitManagers, multiple Pilots, Pilots on several UMs
- OK - can replace BigJob as Troy backend
- OK - can run bag of tasks O(100) tasks
- OK - submit 4 Pilots to india and 4 to sierra
- OK - create 2 UnitManagers with 4 pilots each
- OK - run 20 bulks of 100 CUs (CUs vary in runtime)
- OK - after 10 bulks: disconnect / reconnect
- OK - state changes for pilots and CUs are delivered via notifications
- OK - performance for above is measured and reported routinely
- Addendum: This milestone is delayed until Sinon can support Troy demos 1 and 2 on FutureGrid. That will define Sinon's readiness for an RC1.
MS3: November 30, 2013
- OK - multiple agents get notifications from DB
- ?? - API layer gets notification from DB
- OK - non-dumb scheduling over multiple agents
- OK - reconnect to agents
- OK - integration with Troy
- OK - agent works on FG
- OK - perf measurements of bag-of-tasks use case
- OK - API documentation
- OK - submit 2 Pilots to india and 2 to sierra
- OK - run 10 bulks of 10 CUs (CUs vary in runtime)
- OK - after 5 bulks: disconnect / reconnect
- OK - state changes for pilots and CUs are delivered via notifications
- OK - performance for above is measured and reported routinely
MS2: November 15, 2013
- OK - API layer pushes to DB
- OK - agent works on one machine on FutureGrid
- OK - one agent pulls from DB
- OK - agent enacts
- OK - agent pushes state to DB
- OK - API pulls from DB
- OK - perf measurements of above
- OK - submit 1 Pilot to india
- OK - run 1 bulk of 10 CUs on that pilot (CUs have constant runtime)
- OK - state pulling from application reports CU and Pilot states truthfully
- OK - performance for above is measured and reported routinely
MS1: October 30, 2013
- OK - API agreed upon
- OK - incl. packaging and pypi
- OK - DB backend agreed upon
- OK - first version of data model
- OK - coding framework in place
- OK - api layer
- OK - plugin scheduler / structure
- OK - interface to DB backend (rudimentary)
- OK - dumb unit scheduler
- OK - config file support
- OK - demo:
- OK - a local
results in a 'functional' installation (see below)
- OK - UM, Pilot and CU representations can be created in the DB, from API calls.