[Evaluation] Add OSWorld Benchmark #6061

xingyaoww · 2025-01-06T02:51:55Z

What problem or use case are you trying to solve?

Add the OSWorld benchmark to the OpenHands evaluation harness (evaluation/benchmarks).

Describe the UX of the solution you'd like

Do you have thoughts on the technical implementation?

The primary challenge is emulating an OS inside the Docker-based runtime. Fortunately, the OSWorld authors have already figured out a way to do this by running QEMU inside Docker:

https://github.com/xlang-ai/OSWorld?tab=readme-ov-file#docker-server-with-kvm-support-for-the-better

This will likely need major work on our runtime, but once it is working, the OpenHands runtime will be much more capable, since we can optionally provide a full OS (with GUI) for both agents and humans to interact with.
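As a rough illustration of the QEMU-in-Docker idea, here is a minimal Python sketch that checks whether the host can offer KVM acceleration and builds a matching `docker run` command. It is not OSWorld's or OpenHands' actual launch code: the image name, the port number, and the helper names `kvm_available` and `build_docker_command` are all placeholders made up for this example. The only Docker features it relies on are the standard `--device`, `-p`, `--rm`, and `-d` flags.

```python
import os
import shlex


def kvm_available(cpuinfo_path="/proc/cpuinfo", kvm_device="/dev/kvm"):
    """Return True if the CPU advertises vmx/svm and the KVM device node exists."""
    try:
        with open(cpuinfo_path) as f:
            tokens = f.read().split()
    except OSError:
        # No readable cpuinfo (e.g. non-Linux host): assume no KVM.
        return False
    has_virt_flag = "vmx" in tokens or "svm" in tokens
    return has_virt_flag and os.path.exists(kvm_device)


def build_docker_command(image, use_kvm, gui_port=8006):
    """Build a `docker run` argument list for a container that runs QEMU inside it."""
    cmd = ["docker", "run", "--rm", "-d"]
    if use_kvm:
        # Pass the host KVM device through so QEMU can use hardware acceleration.
        cmd += ["--device", "/dev/kvm"]
    # Expose a port for reaching the guest (port number is a placeholder).
    cmd += ["-p", f"{gui_port}:{gui_port}", image]
    return cmd


if __name__ == "__main__":
    image = "example/qemu-desktop:latest"  # hypothetical image name
    print(shlex.join(build_docker_command(image, kvm_available())))
```

Without `/dev/kvm`, QEMU would fall back to software emulation, which is typically much slower, so a runtime would probably want to detect this up front rather than fail midway through an evaluation run.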

Describe alternatives you've considered

Additional context


xingyaoww added the "enhancement" (New feature or request) and "evaluation" (Related to running evaluations with OpenHands) labels on Jan 6, 2025
