-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OTEP: Recording exceptions as log based events #4333
base: main
Are you sure you want to change the base?
Conversation
I think this is a related issue: |
b06a09f
to
76c7d85
Compare
1a1ea49
to
5ddfd05
Compare
db27087
to
e9f38aa
Compare
A small doubt:
Although (I think) it's not called out, I'm understanding exceptions should now be explicitly reported as both 1) Span.Event and 2) Log/Event? i.e. coding wise you should do this: currentSpan.recordException(e);
logger.logRecordBuilder
.addException(e); Is this the case? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I'm very supportive. Just some nits and one mitigation I'd like to see called out/addressed.
|
||
5. An error should be logged with appropriate severity depending on the available context. | ||
|
||
- Errors that don't indicate any issue should be recorded with severity not higher than `Info`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's an example of this? I'm struggling to understand when this would be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most popular is exception-driven logic - you check that something is available and it throws if it's not. You try to delete something and it throws if someone concurrently deleted it.
You send a request, it times-out, you retry and get 'already exists', etc.
It's not always the best development style, but there are plenty of real-life examples. Adding one to the text.
5. An error should be logged with appropriate severity depending on the available context. | ||
|
||
- Errors that don't indicate any issue should be recorded with severity not higher than `Info`. | ||
- Transient errors (even if it's the last try) should be recorded with severity not higher than `Warning`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you know an error is transient when writing instrumentation? I think you mean errors that you KNOW the application will attempt to handle / retry, right?
I'd suggest rewording (or defining the meaning of transient).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to
- Errors that are expected to be retried or handled by the caller or another
layer of the component SHOULD be recorded with severity not higher thanWarn
.
with additional explanations and examples.
4. It's not recommended to record the same error as it propagates through the stack trace or | ||
attach the same instance of exception to multiple log records. | ||
|
||
5. An error should be logged with appropriate severity depending on the available context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the goal of the taxonomy, but think we need to crisp up the language around Info
/Warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, updated each severity bullet with examples and some criteria related to error impact. PTAL
> OTel should provide API like `setException` when creating log record that will record only necessary information depending | ||
> on the configuration and log severity. | ||
|
||
It should not be an instrumentation library concern to decide whether exception stack trace should be recorded or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things:
- SHOULD is normative, so please capatilize (and I think IS a normative statement here).
- This may not be language neutral, so I think SHOULD is the right guidance here. For example, in Rust, stack traces are something you can opt-in on an error. They leave some details to libraries (see Rust Backtrace/Source capabilities on https://docs.rs/thiserror/latest/thiserror/ e.g. or C++ [prior to 23] https://github.com/jeremy-rifkin/cpptrace).
Additionally, in some highly green-thread/async APIs, I've seen custom stack traces created (e.g. Scala's ZIO where they try to preserve logical stack when physical stack is a confusing mess of work-stealing green-threads. We should allow these to interact with exception reporting in OTEL in some fashion.
I agree with the sentiment, I'd expand the wording though to allow languages like Rust/C++ (and Java ecosystem) to provide stack trace compatibility with their library ecosystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to use normative language. Also added
The signature of the method is to be determined by each language
and can be overloaded as appropriate including ability to customize stack trace
collection.
It MUST be possible to efficiently set exception information on a log record based on configuration
and without using thesetException
method.
with appropriate severity (or stop reporting them). | ||
- We should provide opt-in mechanism for existing instrumentations to switch to logs. | ||
|
||
2. Recording exceptions as log-based events would result in UX degradation for users |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also call out the fact that now we have two channels of exporting/batching/recording information of exceptions and Traces. In this new world, you may see a trace before an exception or vice versa, and one may be dropped where the other is not.
We probably need some other mitigatioin should that requiring knowledge of an exception event under a Span is no longer needed (e.g. more aggressively using Span.status and attributes around "transient failures" as we discussed in Semconv SIG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a clarification in the OTEP that logs are intended to replace span events (gracefully). I.e. there should be just one channel of communication at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Such exceptions can be used to control application logic and have a minor impact, if any, | ||
on application functionality, availability, or performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these indicate actual issues and could be reported with Warn
severity. Minor impact is still some impact and multiple minor issues could multiply to be a major issue. Maybe better remove this sentence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which ones?
this section talks about
-
Exceptions or errors that don't indicate actual issues SHOULD be recorded with
severity not higher thanInfo
.Such exceptions can be used to control application logic and have a minor impact, if any,
on application functionality, availability, or performance.Examples:
- exception is thrown when checking optional dependency or resource existence.
- exception thrown when client disconnects before reading full response from the server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it has non-zero impact then in my opinion it can be Warn
as well.
References:
- https://www.crowdstrike.com/en-us/cybersecurity-101/next-gen-siem/logging-levels
- https://sematext.com/blog/logging-levels/
- https://github.com/serilog/serilog/wiki/configuration-basics
- https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#field-severitynumber
1. Deduplicate exception info by marking exception instances as logged. | ||
This can potentially mitigate the problem for existing application when it logs exceptions extensively. | ||
We should still provide optimal guidance for the greenfield applications and libraries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One could mitigate it with https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/logdedupprocessor. The SDK all Contrib could also provide something similar for each language.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's less about identical log records and more about writing code that writes multiple different log records as exception propagates through the stack, but stamps exception info on all of them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why it is in "Prior art and alternatives" and not in "Open questions" or "Future possibilities".
I also think that this description could be clarified as I would never guess what it was about. I thought this is more about deduplicating e.g. callstack information.
fd2d692
to
66dec5a
Compare
@carlosalberto thanks for the clarification! The intent here is to eventually replace span events with logs in the instrumentations (in non-breaking manner for existing ones). I clarified it in the text, PTAL. |
Co-authored-by: Joao Grassi <[email protected]>
Co-authored-by: Trask Stalnaker <[email protected]>
5a4b286
to
61d9551
Compare
- Errors that don't indicate actual issues SHOULD be recorded with | ||
severity not higher than `Info`. | ||
|
||
Such errors can be used to control application logic and have a minor impact, if any, | ||
on application functionality, availability, or performance (beyond performance hit introduced | ||
if exception is used to control application logic). | ||
|
||
Examples: | ||
|
||
- error is returned when checking optional dependency or resource existence. | ||
- exception is thrown on the server when client disconnects before reading | ||
full response from the server | ||
|
||
- Errors that are expected to be retried or handled by the caller or another | ||
layer of the component SHOULD be recorded with severity not higher than `Warn`. | ||
|
||
Such errors represent transient failures that are common and expected in | ||
distributed applications. They typically increase the latency of individual | ||
operations and have a minor impact on overall application availability. | ||
|
||
Examples: | ||
|
||
- attempt to connect to the required remote dependency times out | ||
- remote dependency returns 401 "Unauthorized" response code | ||
- writing data to a file results in IO exception | ||
- remote dependency returned 503 "Service Unavailable" response for 5 times in a row, | ||
retry attempts are exhausted and the corresponding operation has failed. | ||
|
||
- Unhandled (by the application code) errors that don't result in application | ||
shutdown SHOULD be recorded with severity `Error` | ||
|
||
These errors are not expected and may indicate a bug in the application logic | ||
that this application instance was not able to recover from or a gap in the error | ||
handling logic. | ||
|
||
Examples: | ||
|
||
- Background job terminates with an exception | ||
- HTTP framework error handler catches exception thrown by the application code. | ||
|
||
Note: some frameworks use exceptions as a communication mechanism when request fails. For example, | ||
Spring users can throw [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html) | ||
exception to return unsuccessful status code. Such exceptions represent errors already handled by the application code. | ||
Application code, in this case, is expected to log this at appropriate severity. | ||
General-purpose instrumentation MAY record such errors, but at severity not higher than `Warn`. | ||
|
||
- Errors that result in application shutdown SHOULD be recorded with severity `Fatal`. | ||
|
||
- The application detects an invalid configuration at startup and shuts down. | ||
- The application encounters a (presumably) terminal error, such as an out-of-memory condition. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the description here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#field-severitynumber good enough?
Related to open-telemetry/semantic-conventions#1536
Changes
Recording exceptions as span events is problematic since it
This OTEP provides guidance on how to record exceptions using OpenTelemetry logs focusing on minimizing duplication and providing context to reduce the noise.
If accepted, the follow-up spec changes are expected to replace existing (stable) documents:
Related OTEP(s) #CHANGELOG.md
file updated for non-trivial changesspec-compliance-matrix.md
updated if necessary