Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log_reader: tighten parsing #438

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mtrofin
Copy link
Collaborator

@mtrofin mtrofin commented Feb 10, 2025

The log format has a few places where we insert \n for human readability. They should be checked they are indeed just that character. This will allow us, in a subsequent PR, to avoid infinitely-blocking cases due to the process producing the log exiting unexpectedtly, by inserting errors into the reader to unblock and exit it.

Similarly, checking that the tensor data received matches in size what was expected.

Refactored a bit the test utility for constructing examples.

Created using spr 1.3.5
@tvmarino
Copy link
Collaborator

Are the env.py timeout changes going to be in a different pr?

@mtrofin
Copy link
Collaborator Author

mtrofin commented Feb 11, 2025

Are the env.py timeout changes going to be in a different pr?

Yes. The commit message should suggest that.

Copy link
Collaborator

@tvmarino tvmarino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@boomanaiden154 boomanaiden154 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, some minor nits.


def __init__(
self,
*,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to force keyword arguments here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for clarity.

for i in range(1, len(LogTestExampleBuilder.ErrorMarkers)):
create_example(
logfile, introduce_errors_pos=LogTestExampleBuilder.ErrorMarkers(i))
with self.assertRaises(Exception):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more specific about the exception here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some are json, some are the log protocol. Listing them all would be tedious. We could wrap them, but how's that different from just saying "Exception"; finally they are individually not that interesting programmatically, insofar as I can tell, because regardless of the underlying error, programmatically there's little we can do. So the exceptions here matter as something to ultimately report to the user, but that's about it.

writer.write_observation_marker(0)
writer.write_buff([1], ctypes.c_int16)

with self.assertRaises(Exception):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same point

Copy link
Collaborator

@cclauss cclauss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a member of this team so please treat my suggestions as optional.

Comment on lines 37 to 38
nl = '\n'.encode('utf-8')
error_nl = 'hi there'.encode('utf-8')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more self-documenting to use newline and error_newline instead?

Also,

Suggested change
nl = '\n'.encode('utf-8')
error_nl = 'hi there'.encode('utf-8')
nl = b'\n'
error_nl = b'hi there'

% ruff rule UP012

unnecessary-encode-utf8 (UP012)

Derived from the pyupgrade linter.

Fix is always available.

What it does

Checks for unnecessary calls to encode as UTF-8.

Why is this bad?

UTF-8 is the default encoding in Python, so there is no need to call
encode when UTF-8 is the desired encoding. Instead, use a bytes literal.

Example

"foo".encode("utf-8")

Use instead:

b"foo"

References

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, also renamed write_nl to write_newline.

Btw, any idea why ruff didn't report UP012?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Ruff does not report because #436 is still in review.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh. well that explains that :)

@boomanaiden154 can #436 go in?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, should be good to go now.

Comment on lines 74 to 82
self.write_nl(LogTestExampleBuilder.ErrorMarkers.CTX_MARKER_POS)

def write_observation_marker(self, obs_idx: int):
self._opened_file.write(json_to_bytes({'observation': obs_idx}))
self.write_nl(LogTestExampleBuilder.ErrorMarkers.OBS_MARKER_POS)

def write_outcome_marker(self, obs_idx: int):
self._opened_file.write(json_to_bytes({'outcome': obs_idx}))
self.write_nl(LogTestExampleBuilder.ErrorMarkers.OUTCOME_MARKER_POS)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.write_nl(LogTestExampleBuilder.ErrorMarkers.CTX_MARKER_POS)
def write_observation_marker(self, obs_idx: int):
self._opened_file.write(json_to_bytes({'observation': obs_idx}))
self.write_nl(LogTestExampleBuilder.ErrorMarkers.OBS_MARKER_POS)
def write_outcome_marker(self, obs_idx: int):
self._opened_file.write(json_to_bytes({'outcome': obs_idx}))
self.write_nl(LogTestExampleBuilder.ErrorMarkers.OUTCOME_MARKER_POS)
self.write_nl(self.ErrorMarkers.CTX_MARKER_POS)
def write_observation_marker(self, obs_idx: int):
self._opened_file.write(json_to_bytes({'observation': obs_idx}))
self.write_nl(self.ErrorMarkers.OBS_MARKER_POS)
def write_outcome_marker(self, obs_idx: int):
self._opened_file.write(json_to_bytes({'outcome': obs_idx}))
self.write_nl(self.ErrorMarkers.OUTCOME_MARKER_POS)

Also, perhaps replace these with a single write_marker(self, field_value: dict, marker: ErrorMarkers) method.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why self instead of LogTestExampleBuilder, the class isn't really instance-variant. As in, is it just that it's shorter?

I don't follow the write_marker suggestion, how would that work?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Highly optional!

write_observation_marker(self, obs_idx=1)
    --> write_marker(self, field_value={'observation': 1}, marker=self.ErrorMarkers.OBS_MARKER_POS)

write_outcome_marker(self, obs_idx=2)
    --> write_marker(self, field_value={'outcome': 2}, marker=self.ErrorMarkers.OUTCOME_MARKER_POS)


def write_buff(self, buffer: list, ct):
# we should get the ctypes array to bytes for pytype to be happy.
if self._introduce_error_pos == \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 cautions against the use of backslash line continuation in Python code because any whitespace to the right of the backslash breaks the code on a change that is invisible to the reader.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 280 to 282
for i in range(1, len(LogTestExampleBuilder.ErrorMarkers)):
create_example(
logfile, introduce_errors_pos=LogTestExampleBuilder.ErrorMarkers(i))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for i in range(1, len(LogTestExampleBuilder.ErrorMarkers)):
create_example(
logfile, introduce_errors_pos=LogTestExampleBuilder.ErrorMarkers(i))
for error_marker in self.ErrorMarkers:
if error_marker: # Skip the None marker
create_example(logfile, introduce_errors_pos=error_marker)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 69 to 70
self._opened_file.write(LogTestExampleBuilder.error_nl if position == self
._introduce_error_pos else LogTestExampleBuilder.nl)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._opened_file.write(LogTestExampleBuilder.error_nl if position == self
._introduce_error_pos else LogTestExampleBuilder.nl)
self._opened_file.write(self.error_nl if position == self._introduce_error_pos else self.nl)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, depends on the motivation for using self rather than the class name (previous question)

Created using spr 1.3.5
Comment on lines +34 to +40
class LogTestExampleBuilder:
"""Construct a log."""

newline = b'\n'
error_newline = b'hi there'

class ErrorMarkers(enum.IntEnum):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline, error_newline, and ErrorMarkers are all defined inside LogTestExampleBuilder so any instance of LogTestExampleBuilder can access them via self. which is shorter to write/read but will automatically update if the outer class is renamed and they will also be accessible from subclasses of the outer class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... (being of a C++ mindset) not sure how to digest the auto-update part of the motivation. Is that (friendliness to opportunistic extensibility) the pythonic way though? (happy to follow that! basically asking to educate myself)

@cclauss cclauss self-requested a review February 12, 2025 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants