[Bug]: OH fails to join existing conversations after an unclean exit #6148

Open
1 task done
kripper opened this issue Jan 8, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@kripper

kripper commented Jan 8, 2025

[EDIT] Skip this and go directly to #6148 (comment)

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

This happens because join_conversation() only calls self.maybe_start_agent_loop(), which starts the sandbox container, when no event stream exists. When event_stream is truthy, the call is skipped:

event_stream = await self._get_event_stream(sid)
if not event_stream:
    return await self.maybe_start_agent_loop(sid, settings)

OpenHands Installation

Development workflow

OpenHands Version

main branch from 2025-01-08

Operating System

WSL on Windows

Logs, Errors, Screenshots, and Additional Context

.

@kripper kripper added the bug Something isn't working label Jan 8, 2025
@kripper
Author

kripper commented Jan 8, 2025

There is another problem:

The port is obtained this way:

def _attach_to_container(self):
    self._container_port = 0
    self.container = self.docker_client.containers.get(self.container_name)
    for port in self.container.attrs['NetworkSettings']['Ports']:  # type: ignore
        self._container_port = int(port.split('/')[0])
        break

But the sandbox containers don't expose ports:

$ docker inspect fa3ff536487c | jq '.[0].NetworkSettings.Ports'
{}

But checking the /alive endpoint returns status: ok.
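
A minimal sketch of a more defensive port lookup, assuming the NetworkSettings.Ports mapping has the shape shown by docker inspect (keys such as "8080/tcp", and an empty mapping when nothing is exposed). The helper name first_exposed_port is hypothetical and not part of OpenHands. It returns None instead of silently leaving the port at 0:

```python
from typing import Any, Optional


def first_exposed_port(attrs: dict[str, Any]) -> Optional[int]:
    """Return the first container port listed in the inspect data, or None.

    `attrs` is the dict exposed as `container.attrs` (the `docker inspect` data).
    """
    ports = attrs.get('NetworkSettings', {}).get('Ports') or {}
    for key in ports:
        # Keys look like "8080/tcp"; keep only the numeric part.
        return int(key.split('/')[0])
    # Nothing is exposed, e.g. `Ports` is `{}` as in the output above.
    return None
```

A caller could then treat None as "port unknown" and fail early, rather than polling http://localhost:0.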

@kripper
Author

kripper commented Jan 8, 2025

Another problem is that exited containers are not restarted. Running "docker start" manually fixes the problem.
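
The manual workaround can be sketched in Python as follows. This assumes a container object with the status, start() and reload() members that the Docker SDK for Python provides; the helper name ensure_running is hypothetical, and this is not how OpenHands currently attaches to containers:

```python
def ensure_running(container) -> bool:
    """Start the container if it is stopped; return True if a start was issued."""
    container.reload()  # refresh the cached state from the Docker daemon
    if container.status in ('exited', 'created'):
        container.start()  # same effect as `docker start <name>`
        container.reload()
        return True
    return False
```

With the Docker SDK, this could be called as ensure_running(docker.from_env().containers.get(name)) before waiting on the /alive endpoint.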

@kripper kripper mentioned this issue Jan 8, 2025
1 task
@kripper
Author

kripper commented Jan 8, 2025

> But the sandbox containers don't expose ports

I confirmed this issue can be fixed with #6080

@mamoodi
Collaborator

mamoodi commented Jan 8, 2025

@kripper can you explain how you reproduce this? I tried this:

  1. Run OpenHands
  2. Fill out settings
  3. Prompt "4 + 5"
  4. Ctrl + C terminal to kill OpenHands
  5. Run OpenHands
  6. Press "Jump back to recent conversation"
  7. Prompt "add 3"
    [Screenshot 2025-01-08 at 2:15:54 PM]

@tofarr
Collaborator

tofarr commented Jan 8, 2025

@kripper The stack trace you posted on #6114 has a line that seems critical to me:

No such file or directory: '/home/codespace/openhands_file_store/sessions/e770430539174979bf2296e8c6d3fde5/agent_state.pkl'

Can you confirm that:

  • The sessions directory exists in your test environment?
  • There are sessions within it?
  • They contain files that look something like this:
    [image]

@enyst
Collaborator

enyst commented Jan 8, 2025

@tofarr

> Ctrl + C terminal to kill OpenHands

I think this means there was no agent_state.pkl created, because it's saved when the controller closes normally. The /events are saved, of course. I'm not sure when metadata.json is saved.

We recently solved an issue where the pickle doesn't exist (it can be missing due to runtime errors, etc.).

I think conversation loading should ideally not depend on the existence of this file either?
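
A minimal sketch of loading that tolerates a missing pickle, assuming the state lives at <sessions dir>/<sid>/agent_state.pkl as in the path quoted above. The function name load_agent_state and the None fallback are illustrative assumptions, not the OpenHands implementation:

```python
import pickle
from pathlib import Path
from typing import Any, Optional


def load_agent_state(sessions_dir: Path, sid: str) -> Optional[Any]:
    """Return the pickled state for `sid`, or None if it was never saved."""
    path = sessions_dir / sid / 'agent_state.pkl'
    try:
        with path.open('rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        # Unclean exit: the controller never got the chance to write the file.
        return None
```

A caller that receives None could fall back to rebuilding the state from the saved events instead of failing the join.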

@kripper
Author

kripper commented Jan 8, 2025

The previous issues were fixed by applying #6114.

But one last issue remains, which prevents re-joining conversations after an unclean exit or reboot of the box (I usually restart the OH container using Docker Desktop).

It's not critical, but it's worth reporting here for mental health.

I can reproduce this bug consistently.

It happens only the first time I execute make run after a forced (unclean) reboot of the box.
When I execute make run a second time, it works fine (and so on).
I compared the /tmp files and running processes before and after the first make run and found nothing suspicious.
Maybe OH creates a lock file or similar that must be cleaned up on exit.

If I interrupt make run using CTRL+C (clean exit), reboot, and execute make run again, I can join conversations without a problem.

Thus, this issue occurs only after OH was terminated uncleanly.

This is the stack trace generated when trying to join a conversation after the first make run:

21:31:01 - openhands:INFO: docker_runtime.py:147 - [runtime aa4cca868229460fb319c227be9c65db] Waiting for client to become ready at http://localhost:0...
21:31:01 - openhands:ERROR: agent_session.py:200 - Runtime initialization failed: Container openhands-runtime-aa4cca868229460fb319c227be9c65db has exited.
Traceback (most recent call last):
  File "/workspaces/OpenHands/openhands/server/session/agent_session.py", line 198, in _create_runtime
    await self.runtime.connect()
  File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 150, in connect
    await call_sync_from_async(self._wait_until_alive)
  File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 18, in call_sync_from_async
    result = await coro
             ^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/OpenHands/openhands/utils/async_utils.py", line 17, in <lambda>
    coro = loop.run_in_executor(None, lambda: fn(*args, **kwargs))
                                              ^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 398, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
                                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python/3.12.1/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/codespace/.cache/pypoetry/virtualenvs/openhands-ai-QLt0qIPP-py3.12/lib/python3.12/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/workspaces/OpenHands/openhands/runtime/impl/docker/docker_runtime.py", line 328, in _wait_until_alive
    raise AgentRuntimeDisconnectedError(
openhands.core.exceptions.AgentRuntimeDisconnectedError: Container openhands-runtime-aa4cca868229460fb319c227be9c65db has exited.
21:31:01 - openhands:INFO: agent_controller.py:388 - [Agent Controller aa4cca868229460fb319c227be9c65db] Setting agent(CodeActAgent) state from AgentState.LOADING to AgentState.ERROR
21:31:01 - openhands:INFO: agent_controller.py:388 - [Agent Controller aa4cca868229460fb319c227be9c65db] Setting agent(CodeActAgent) state from AgentState.ERROR to AgentState.INIT
21:31:01 - openhands:INFO: agent_controller.py:388 - [Agent Controller aa4cca868229460fb319c227be9c65db] Setting agent(CodeActAgent) state from AgentState.INIT to AgentState.FINISHED
21:31:02 - openhands:ERROR: manager.py:209 - Error connecting to conversation aa4cca868229460fb319c227be9c65db: Container openhands-runtime-aa4cca868229460fb319c227be9c65db has exited.

@kripper kripper changed the title from "[Bug]: After restarting OH, it fails to join existing conversations" to "[Bug]: OH fails to join existing conversations after an unclean exit" Jan 8, 2025