Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTEP: Recording exceptions as log based events #4333

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

lmolkova
Copy link
Contributor

@lmolkova lmolkova commented Dec 10, 2024

Related to open-telemetry/semantic-conventions#1536

Changes

Recording exceptions as span events is problematic since it

  • ties recording exceptions to tracing/sampling
  • duplicates exceptions recorded by instrumented libraries on logs
  • does not leverage log features such as typical log filtering based on severity

This OTEP provides guidance on how to record exceptions using OpenTelemetry logs focusing on minimizing duplication and providing context to reduce the noise.

If accepted, the follow-up spec changes are expected to replace existing (stable) documents:


@lmolkova lmolkova changed the title OTEP: Recording exceptions and errors with OpenTelemetry OTEP: Recording exceptions as log based events Dec 10, 2024
@pellared
Copy link
Member

pellared commented Dec 10, 2024

I think this is a related issue:

oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
@tedsuo tedsuo added the OTEP OpenTelemetry Enhancement Proposal (OTEP) label Dec 12, 2024
@lmolkova lmolkova force-pushed the exceptions-on-logs-otep branch 2 times, most recently from b06a09f to 76c7d85 Compare December 17, 2024 17:30
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
@lmolkova lmolkova force-pushed the exceptions-on-logs-otep branch from db27087 to e9f38aa Compare January 3, 2025 01:42
@carlosalberto
Copy link
Contributor

A small doubt:

If this instrumentation supports tracing, it should capture the error in the scope of the processing span.

Although (I think) it's not called out, I'm understanding exceptions should now be explicitly reported as both 1) Span.Event and 2) Log/Event? i.e. coding wise you should do this:

currentSpan.recordException(e);
logger.logRecordBuilder
    .addException(e);

Is this the case?

Copy link
Contributor

@jsuereth jsuereth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm very supportive. Just some nits and one mitigation I'd like to see called out/addressed.


5. An error should be logged with appropriate severity depending on the available context.

- Errors that don't indicate any issue should be recorded with severity not higher than `Info`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's an example of this? I'm struggling to understand when this would be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most popular is exception-driven logic - you check that something is available and it throws if it's not. You try to delete something and it throws if someone concurrently deleted it.
You send a request, it times-out, you retry and get 'already exists', etc.

It's not always the best development style, but there are plenty of real-life examples. Adding one to the text.

5. An error should be logged with appropriate severity depending on the available context.

- Errors that don't indicate any issue should be recorded with severity not higher than `Info`.
- Transient errors (even if it's the last try) should be recorded with severity not higher than `Warning`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know an error is transient when writing instrumentation? I think you mean errors that you KNOW the application will attempt to handle / retry, right?

I'd suggest rewording (or defining the meaning of transient).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to

  • Errors that are expected to be retried or handled by the caller or another
    layer of the component SHOULD be recorded with severity not higher than Warn.

with additional explanations and examples.

4. It's not recommended to record the same error as it propagates through the stack trace or
attach the same instance of exception to multiple log records.

5. An error should be logged with appropriate severity depending on the available context.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the goal of the taxonomy, but think we need to crisp up the language around Info/Warning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, updated each severity bullet with examples and some criteria related to error impact. PTAL

> OTel should provide API like `setException` when creating log record that will record only necessary information depending
> on the configuration and log severity.

It should not be an instrumentation library concern to decide whether exception stack trace should be recorded or not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things:

  1. SHOULD is normative, so please capatilize (and I think IS a normative statement here).
  2. This may not be language neutral, so I think SHOULD is the right guidance here. For example, in Rust, stack traces are something you can opt-in on an error. They leave some details to libraries (see Rust Backtrace/Source capabilities on https://docs.rs/thiserror/latest/thiserror/ e.g. or C++ [prior to 23] https://github.com/jeremy-rifkin/cpptrace).
    Additionally, in some highly green-thread/async APIs, I've seen custom stack traces created (e.g. Scala's ZIO where they try to preserve logical stack when physical stack is a confusing mess of work-stealing green-threads. We should allow these to interact with exception reporting in OTEL in some fashion.

I agree with the sentiment, I'd expand the wording though to allow languages like Rust/C++ (and Java ecosystem) to provide stack trace compatibility with their library ecosystem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to use normative language. Also added

The signature of the method is to be determined by each language
and can be overloaded as appropriate including ability to customize stack trace
collection.

It MUST be possible to efficiently set exception information on a log record based on configuration
and without using the setException method.

with appropriate severity (or stop reporting them).
- We should provide opt-in mechanism for existing instrumentations to switch to logs.

2. Recording exceptions as log-based events would result in UX degradation for users
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also call out the fact that now we have two channels of exporting/batching/recording information of exceptions and Traces. In this new world, you may see a trace before an exception or vice versa, and one may be dropped where the other is not.

We probably need some other mitigatioin should that requiring knowledge of an exception event under a Span is no longer needed (e.g. more aggressively using Span.status and attributes around "transient failures" as we discussed in Semconv SIG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a clarification in the OTEP that logs are intended to replace span events (gracefully). I.e. there should be just one channel of communication at a time.

oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
Copy link
Member

@pellared pellared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

CHANGELOG.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
oteps/4333-recording-exceptions-on-logs.md Outdated Show resolved Hide resolved
Comment on lines 88 to 89
Such exceptions can be used to control application logic and have a minor impact, if any,
on application functionality, availability, or performance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these indicate actual issues and could be reported with Warn severity. Minor impact is still some impact and multiple minor issues could multiply to be a major issue. Maybe better remove this sentence?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which ones?

this section talks about

  • Exceptions or errors that don't indicate actual issues SHOULD be recorded with
    severity not higher than Info.

    Such exceptions can be used to control application logic and have a minor impact, if any,
    on application functionality, availability, or performance.

    Examples:

    • exception is thrown when checking optional dependency or resource existence.
    • exception thrown when client disconnects before reading full response from the server

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 356 to 358
1. Deduplicate exception info by marking exception instances as logged.
This can potentially mitigate the problem for existing application when it logs exceptions extensively.
We should still provide optimal guidance for the greenfield applications and libraries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could mitigate it with https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/logdedupprocessor. The SDK all Contrib could also provide something similar for each language.

Copy link
Contributor Author

@lmolkova lmolkova Jan 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's less about identical log records and more about writing code that writes multiple different log records as exception propagates through the stack, but stamps exception info on all of them

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why it is in "Prior art and alternatives" and not in "Open questions" or "Future possibilities".
I also think that this description could be clarified as I would never guess what it was about. I thought this is more about deduplicating e.g. callstack information.

@lmolkova lmolkova force-pushed the exceptions-on-logs-otep branch from fd2d692 to 66dec5a Compare January 18, 2025 20:48
@lmolkova
Copy link
Contributor Author

Although (I think) it's not called out, I'm understanding exceptions should now be explicitly reported as both 1) Span.Event and 2) Log/Event? i.e. coding wise you should do this:

@carlosalberto thanks for the clarification! The intent here is to eventually replace span events with logs in the instrumentations (in non-breaking manner for existing ones).

I clarified it in the text, PTAL.

@trask
Copy link
Member

trask commented Jan 21, 2025

github u/x is showing a weird diff, have seen this before, may be worth trying to rebase and force push

image

@lmolkova lmolkova force-pushed the exceptions-on-logs-otep branch from 5a4b286 to 61d9551 Compare January 21, 2025 05:48
Comment on lines +108 to +157
- Errors that don't indicate actual issues SHOULD be recorded with
severity not higher than `Info`.

Such errors can be used to control application logic and have a minor impact, if any,
on application functionality, availability, or performance (beyond performance hit introduced
if exception is used to control application logic).

Examples:

- error is returned when checking optional dependency or resource existence.
- exception is thrown on the server when client disconnects before reading
full response from the server

- Errors that are expected to be retried or handled by the caller or another
layer of the component SHOULD be recorded with severity not higher than `Warn`.

Such errors represent transient failures that are common and expected in
distributed applications. They typically increase the latency of individual
operations and have a minor impact on overall application availability.

Examples:

- attempt to connect to the required remote dependency times out
- remote dependency returns 401 "Unauthorized" response code
- writing data to a file results in IO exception
- remote dependency returned 503 "Service Unavailable" response for 5 times in a row,
retry attempts are exhausted and the corresponding operation has failed.

- Unhandled (by the application code) errors that don't result in application
shutdown SHOULD be recorded with severity `Error`

These errors are not expected and may indicate a bug in the application logic
that this application instance was not able to recover from or a gap in the error
handling logic.

Examples:

- Background job terminates with an exception
- HTTP framework error handler catches exception thrown by the application code.

Note: some frameworks use exceptions as a communication mechanism when request fails. For example,
Spring users can throw [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html)
exception to return unsuccessful status code. Such exceptions represent errors already handled by the application code.
Application code, in this case, is expected to log this at appropriate severity.
General-purpose instrumentation MAY record such errors, but at severity not higher than `Warn`.

- Errors that result in application shutdown SHOULD be recorded with severity `Fatal`.

- The application detects an invalid configuration at startup and shuts down.
- The application encounters a (presumably) terminal error, such as an out-of-memory condition.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OTEP OpenTelemetry Enhancement Proposal (OTEP)
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

9 participants