[RLlib; docs] Docs do-over (new API stack): Env pages vol 01. #49165
Conversation
@@ -0,0 +1,20 @@
from ray.rllib.env.multi_agent_env import make_multi_agent
Simply moved some example classes in here for order.
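For context on the utility imported in the moved file, here is a minimal usage sketch of make_multi_agent. The env name and agent count below are arbitrary picks for illustration, not taken from the PR:

from ray.rllib.env.multi_agent_env import make_multi_agent

# make_multi_agent wraps a single-agent env (given by its registered name or an
# env-maker callable) into a MultiAgentEnv class that runs n independent copies.
MultiAgentCartPole = make_multi_agent("CartPole-v1")

# The number of copies is set via the env config (2 is an arbitrary choice here).
env = MultiAgentCartPole({"num_agents": 2})

# Observations come back as a dict mapping agent IDs to per-copy observations.
obs, infos = env.reset()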
LGTM. Awesome PR!
In the long run, we should ask a professional designer to make these diagrams.
So, does this mean that the top agent acts whenever the lower ones don't, or could this happen simultaneously?
Both are possible. Our example script always has only one level acting at a time.
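To make the turn-taking variant concrete, here is a minimal sketch of a two-level env in which only one level receives an observation, and therefore acts, per timestep. All class and agent names, spaces, and the step schedule are made up for illustration; it only assumes the new API stack's MultiAgentEnv attributes (agents, possible_agents, observation_spaces, action_spaces):

import gymnasium as gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TwoLevelEnvSketch(MultiAgentEnv):
    """Hypothetical env in which the high-level agent acts only every 4th step."""

    def __init__(self, config=None):
        super().__init__()
        self.agents = self.possible_agents = ["high_level", "low_level"]
        self.observation_spaces = {aid: gym.spaces.Discrete(4) for aid in self.agents}
        self.action_spaces = {aid: gym.spaces.Discrete(2) for aid in self.agents}

    def reset(self, *, seed=None, options=None):
        self.t = 0
        # Only the high-level agent receives the first observation -> it acts first.
        return {"high_level": 0}, {}

    def step(self, action_dict):
        self.t += 1
        # Hand the next observation to exactly one level: the high level every
        # 4th step, the low level in between. An agent only acts on the step
        # after it received an observation.
        next_agent = "high_level" if self.t % 4 == 0 else "low_level"
        obs = {next_agent: self.t % 4}
        # Rewards are keyed by the agent(s) that just acted.
        rewards = {aid: 0.0 for aid in action_dict}
        terminateds = {"__all__": self.t >= 20}
        return obs, rewards, terminateds, {"__all__": False}, {}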
"""Two-player environment for the famous rock paper scissors game. | ||
|
||
# __sphinx_doc_1_end__ | ||
Optionally, the "Sheldon Cooper extension" can be activated by passing |
Hilarious! :D
# The observations are always the last taken actions. Hence observation- and
# action spaces are identical.
self.observation_spaces = self.action_spaces = {
Maybe simplify to:

self.sheldon_cooper_mode = self.config.get("sheldon_cooper_mode", False)
if self.sheldon_cooper_mode:
    num_actions = 5
else:
    num_actions = 3
self.action_spaces = self.observation_spaces = {
    "player1": gym.spaces.Discrete(num_actions),
    "player2": gym.spaces.Discrete(num_actions),
}
Sure, but I wanted to leave the Sheldon Cooper mode out of the docs entirely (to keep the docs as simple as possible). Therefore, I had to spatially separate these two pieces of logic in the file.
| 6| 7| 8|
----------
The action space is Discrete(9) and actions landing on an alredy occupied field |
"alredy" -> "already"
done
win_val = [-1, -1, -1]
if (
    # Horizontal win.
    self.board[:3] == win_val
Very cool!
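For readers skimming the excerpt: the slice trick generalizes to all eight winning lines of a flat 3x3 board. A self-contained paraphrase (not the PR's exact code), assuming the board is a flat Python list of 9 fields and the current player's marks are -1:

def has_won(board, mark=-1):
    """Return True if `mark` occupies a full row, column, or diagonal."""
    win_val = [mark] * 3
    return (
        # Rows.
        board[:3] == win_val or board[3:6] == win_val or board[6:] == win_val
        # Columns (stride-3 slices).
        or board[0::3] == win_val or board[1::3] == win_val or board[2::3] == win_val
        # Diagonals: indices (0, 4, 8) and (2, 4, 6).
        or board[0::4] == win_val or board[2:7:2] == win_val
    )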
):
    # Final reward is +5 for victory and -5 for a loss.
    rewards[self.current_player] += 5.0
    rewards[opponent] = -5.0
I wonder if it works better when win and loss are rewarded with a different amount than a wrong placement?
They are rewarded separately, with +1.0 and -1.0.
The misplacement penalty should be learnt pretty quickly by the agents (b/c it hurts a lot) and after that, they should be able to "focus" on the actual game, not misplacing any pieces anymore. 🤞
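A small sketch of keeping the two reward signals separate, as discussed. The +/-5.0 terminal rewards come from the diff above; the empty-field check (value 0) and the -1.0 penalty value are assumptions for illustration only:

def compute_rewards(board, action, current_player, opponent, won):
    """Sketch: illegal placements and game outcomes are rewarded independently."""
    rewards = {current_player: 0.0, opponent: 0.0}
    if board[action] != 0:
        # Illegal placement: a small, immediate penalty for the mover only
        # (the exact value and handling in the example env may differ).
        rewards[current_player] -= 1.0
    elif won:
        # Terminal rewards are much larger, so once the agents stop misplacing
        # pieces, winning or losing dominates the return.
        rewards[current_player] += 5.0
        rewards[opponent] = -5.0
    return rewards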
return (
    {self.current_player: np.array(self.board, np.float32)},
    rewards,
Maybe add a comment here that tells users how these rewards are handled in the MultiAgentEpisode: in there, each reward is treated as belonging to the last current player (the one that sent the action). This is counter-intuitive at first for new users.
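One possible phrasing for such a comment, placed right above the return (the wording paraphrases the reviewer's point and is not the PR's final text):

# NOTE: In the MultiAgentEpisode, the rewards returned here are credited to the
# agent they are keyed by, i.e. to the player that just sent the action, rather
# than to the player receiving the next observation. This is counter-intuitive
# at first for new users of turn-based, multi-agent envs.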
[RLlib] Docs do-over (new API stack): Env pages vol 01
examples/envs/classes/multi_agent/..
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.