OTEP: Recording exceptions as log based events #4333

lmolkova · 2024-12-10T02:44:05Z

Related to open-telemetry/semantic-conventions#1536

Changes

Recording exceptions as span events is problematic since it:

- ties recording exceptions to tracing/sampling
- duplicates exceptions recorded by instrumented libraries on logs
- does not leverage log features such as typical log filtering based on severity

This OTEP provides guidance on how to record exceptions using OpenTelemetry logs, focusing on minimizing duplication and providing context to reduce noise.

If accepted, the follow-up spec changes are expected to replace existing (stable) documents:

Related issues Document generic approach for span status (code + description) and exception event when instrumented code throws semantic-conventions#1536
~~Related OTEP(s) #~~
Links to the prototypes (when adding or changing features)
CHANGELOG.md file updated for non-trivial changes
~~spec-compliance-matrix.md updated if necessary~~

pellared · 2024-12-10T05:27:14Z

I think this is a related issue:

Define whether instrumentations can emit non-event log records #4234

carlosalberto · 2025-01-07T00:01:12Z

A small doubt:

If this instrumentation supports tracing, it should capture the error in the scope of the processing span.

Although (I think) it's not called out, I'm understanding exceptions should now be explicitly reported as both 1) Span.Event and 2) Log/Event? i.e. coding wise you should do this:

currentSpan.recordException(e);
// addException is the hypothetical log-side counterpart being asked about
logger.logRecordBuilder()
    .addException(e)
    .emit();

Is this the case?

jsuereth

Overall I'm very supportive. Just some nits and one mitigation I'd like to see called out/addressed.

jsuereth · 2025-01-07T13:10:14Z

oteps/4333-recording-exceptions-on-logs.md

+
+5. An error should be logged with appropriate severity depending on the available context.
+
+   - Errors that don't indicate any issue should be recorded with severity not higher than `Info`.


What's an example of this? I'm struggling to understand when this would be used.

The most common case is exception-driven logic: you check that something is available and it throws if it's not. You try to delete something and it throws if someone concurrently deleted it.
You send a request, it times out, you retry and get 'already exists', etc.

It's not always the best development style, but there are plenty of real-life examples. Adding one to the text.

jsuereth · 2025-01-07T13:11:04Z

oteps/4333-recording-exceptions-on-logs.md

+5. An error should be logged with appropriate severity depending on the available context.
+
+   - Errors that don't indicate any issue should be recorded with severity not higher than `Info`.
+   - Transient errors (even if it's the last try) should be recorded with severity not higher than `Warning`.


How do you know an error is transient when writing instrumentation? I think you mean errors that you KNOW the application will attempt to handle / retry, right?

I'd suggest rewording (or defining the meaning of transient).

updated to

Errors that are expected to be retried or handled by the caller or another
layer of the component SHOULD be recorded with severity not higher than Warn.

with additional explanations and examples.

jsuereth · 2025-01-07T13:12:13Z

oteps/4333-recording-exceptions-on-logs.md

+4. It's not recommended to record the same error as it propagates through the stack trace or
+   attach the same instance of exception to multiple log records.
+
+5. An error should be logged with appropriate severity depending on the available context.


I like the goal of the taxonomy, but think we need to crisp up the language around Info/Warning

makes sense, updated each severity bullet with examples and some criteria related to error impact. PTAL

jsuereth · 2025-01-07T13:23:58Z

oteps/4333-recording-exceptions-on-logs.md

+> OTel should provide API like `setException` when creating log record that will record only necessary information depending
+> on the configuration and log severity.
+
+It should not be an instrumentation library concern to decide whether exception stack trace should be recorded or not.


Two things:

1. SHOULD is normative, so please capitalize it (and I think this IS a normative statement here).

2. This may not be language neutral, so I think SHOULD is the right guidance here. For example, in Rust, stack traces are something you opt in to on an error. They leave some details to libraries (see the Rust Backtrace/Source capabilities in https://docs.rs/thiserror/latest/thiserror/, or for C++ [prior to 23] https://github.com/jeremy-rifkin/cpptrace).
   Additionally, in some heavily green-thread/async APIs, I've seen custom stack traces created (e.g. Scala's ZIO, where they try to preserve the logical stack when the physical stack is a confusing mess of work-stealing green threads). We should allow these to interact with exception reporting in OTEL in some fashion.

I agree with the sentiment, I'd expand the wording though to allow languages like Rust/C++ (and Java ecosystem) to provide stack trace compatibility with their library ecosystem.

updated to use normative language. Also added

The signature of the method is to be determined by each language
and can be overloaded as appropriate including ability to customize stack trace
collection.

It MUST be possible to efficiently set exception information on a log record based on configuration
and without using the setException method.
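
To make the `setException` idea above concrete, here is a minimal Java sketch. It is not the OpenTelemetry API: `SketchLogRecordBuilder`, `SketchSeverity`, and the `stackTraceThreshold` setting are hypothetical names invented for illustration. Only the attribute keys `exception.type`, `exception.message`, and `exception.stacktrace` come from the existing exception semantic conventions. The point it tries to show is that the caller always passes the exception, and the builder (driven by configuration and severity) decides whether the stack trace is kept.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical severity scale, ordered from least to most severe.
enum SketchSeverity { DEBUG, INFO, WARN, ERROR, FATAL }

// Hypothetical builder for illustration only; this is NOT the OpenTelemetry API.
final class SketchLogRecordBuilder {
    // Assumed configuration knob: lowest severity at which stack traces are kept.
    private final SketchSeverity stackTraceThreshold;
    private SketchSeverity severity = SketchSeverity.INFO;
    private Throwable exception;

    SketchLogRecordBuilder(SketchSeverity stackTraceThreshold) {
        this.stackTraceThreshold = stackTraceThreshold;
    }

    SketchLogRecordBuilder setSeverity(SketchSeverity severity) {
        this.severity = severity;
        return this;
    }

    // Instrumentation only hands over the exception; it does not decide
    // whether the stack trace ends up on the record.
    SketchLogRecordBuilder setException(Throwable exception) {
        this.exception = exception;
        return this;
    }

    // Resolves exception attributes once severity and exception are both known.
    Map<String, String> build() {
        Map<String, String> attributes = new LinkedHashMap<>();
        if (exception == null) {
            return attributes;
        }
        attributes.put("exception.type", exception.getClass().getName());
        if (exception.getMessage() != null) {
            attributes.put("exception.message", exception.getMessage());
        }
        // The stack trace is recorded only at or above the configured threshold.
        if (severity.compareTo(stackTraceThreshold) >= 0) {
            StringWriter out = new StringWriter();
            exception.printStackTrace(new PrintWriter(out, true));
            attributes.put("exception.stacktrace", out.toString());
        }
        return attributes;
    }
}
```

With a threshold of `ERROR`, a record built at `INFO` would carry only the type and message, while the same exception at `ERROR` would also carry the stack trace. Resolving attributes in `build()` rather than in `setException` keeps the result independent of the order in which the setters are called.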

jsuereth · 2025-01-07T13:28:37Z

oteps/4333-recording-exceptions-on-logs.md

+     with appropriate severity (or stop reporting them).
+   - We should provide opt-in mechanism for existing instrumentations to switch to logs.
+
+2. Recording exceptions as log-based events would result in UX degradation for users


I think we should also call out the fact that now we have two channels of exporting/batching/recording information of exceptions and Traces. In this new world, you may see a trace before an exception or vice versa, and one may be dropped where the other is not.

We probably need some other mitigation so that knowledge of an exception event under a Span is no longer required (e.g. more aggressively using Span.status and attributes around "transient failures", as we discussed in the Semconv SIG).

Added a clarification in the OTEP that logs are intended to replace span events (gracefully). I.e. there should be just one channel of communication at a time.

pellared

👍

pellared · 2025-01-14T15:13:14Z

oteps/4333-recording-exceptions-on-logs.md

+     Such exceptions can be used to control application logic and have a minor impact, if any,
+     on application functionality, availability, or performance.


I think these indicate actual issues and could be reported with Warn severity. Minor impact is still some impact and multiple minor issues could multiply to be a major issue. Maybe better remove this sentence?

which ones?

this section talks about

Exceptions or errors that don't indicate actual issues SHOULD be recorded with
severity not higher than Info.

Such exceptions can be used to control application logic and have a minor impact, if any,
on application functionality, availability, or performance.

Examples:

exception is thrown when checking optional dependency or resource existence.

exception thrown when client disconnects before reading full response from the server

If it has non-zero impact then in my opinion it can be Warn as well.

References:

https://www.crowdstrike.com/en-us/cybersecurity-101/next-gen-siem/logging-levels

https://sematext.com/blog/logging-levels/

https://github.com/serilog/serilog/wiki/configuration-basics

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#field-severitynumber

pellared · 2025-01-14T15:19:52Z

oteps/4333-recording-exceptions-on-logs.md

+1. Deduplicate exception info by marking exception instances as logged.
+   This can potentially mitigate the problem for existing application when it logs exceptions extensively.
+   We should still provide optimal guidance for the greenfield applications and libraries.


One could mitigate it with https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/logdedupprocessor. The SDK or Contrib could also provide something similar for each language.

That's less about identical log records and more about code that writes multiple different log records as an exception propagates through the stack, but stamps the exception info on all of them.

I am not sure why it is in "Prior art and alternatives" and not in "Open questions" or "Future possibilities".
I also think that this description could be clarified as I would never guess what it was about. I thought this is more about deduplicating e.g. callstack information.
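
For readers trying to picture what "marking exception instances as logged" could mean, here is a minimal Java sketch under stated assumptions. `LoggedExceptionTracker` and `markLogged` are hypothetical names, not part of any OpenTelemetry SDK, and the OTEP does not prescribe this mechanism. The sketch remembers which exception instances already had their details attached to a log record, so later records written while the same instance propagates up the stack can skip the exception info.

```java
import java.util.Collections;
import java.util.Set;
import java.util.WeakHashMap;

// Hypothetical helper for illustration only; not an OpenTelemetry API.
final class LoggedExceptionTracker {
    // Weak keys let exceptions be garbage collected once nothing else holds them.
    // Throwable does not override equals/hashCode, so lookups behave as identity
    // checks unless a subclass overrides them (an assumption of this sketch).
    private final Set<Throwable> logged = Collections.synchronizedSet(
        Collections.newSetFromMap(new WeakHashMap<Throwable, Boolean>()));

    // Returns true the first time an instance is seen (attach exception info),
    // false on later calls for the same instance (skip it).
    boolean markLogged(Throwable exception) {
        return logged.add(exception);
    }
}
```

A caller would check `markLogged(e)` before stamping exception details on a record. A different exception instance with the same message is still treated as new, which matches the per-instance wording in the quoted text.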

lmolkova · 2025-01-18T20:58:35Z

Although (I think) it's not called out, I'm understanding exceptions should now be explicitly reported as both 1) Span.Event and 2) Log/Event? i.e. coding wise you should do this:

@carlosalberto thanks for the clarification! The intent here is to eventually replace span events with logs in the instrumentations (in a non-breaking manner for existing ones).

I clarified it in the text, PTAL.

trask · 2025-01-21T04:46:18Z

github u/x is showing a weird diff, have seen this before, may be worth trying to rebase and force push

pellared · 2025-01-21T08:22:15Z

oteps/4333-recording-exceptions-on-logs.md

+   - Errors that don't indicate actual issues SHOULD be recorded with
+     severity not higher than `Info`.
+
+     Such errors can be used to control application logic and have a minor impact, if any,
+     on application functionality, availability, or performance (beyond performance hit introduced
+     if exception is used to control application logic).
+
+     Examples:
+
+      - error is returned when checking optional dependency or resource existence.
+      - exception is thrown on the server when client disconnects before reading
+        full response from the server
+
+   - Errors that are expected to be retried or handled by the caller or another
+     layer of the component SHOULD be recorded with severity not higher than `Warn`.
+
+     Such errors represent transient failures that are common and expected in
+     distributed applications. They typically increase the latency of individual
+     operations and have a minor impact on overall application availability.
+
+     Examples:
+
+      - attempt to connect to the required remote dependency times out
+      - remote dependency returns 401 "Unauthorized" response code
+      - writing data to a file results in IO exception
+      - remote dependency returned 503 "Service Unavailable" response for 5 times in a row,
+        retry attempts are exhausted and the corresponding operation has failed.
+
+   - Unhandled (by the application code) errors that don't result in application
+     shutdown SHOULD be recorded with severity `Error`
+
+     These errors are not expected and may indicate a bug in the application logic
+     that this application instance was not able to recover from or a gap in the error
+     handling logic.
+
+     Examples:
+
+      - Background job terminates with an exception
+      - HTTP framework error handler catches exception thrown by the application code.
+
+        Note: some frameworks use exceptions as a communication mechanism when request fails. For example,
+        Spring users can throw [ResponseStatusException](https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/server/ResponseStatusException.html)
+        exception to return unsuccessful status code. Such exceptions represent errors already handled by the application code.
+        Application code, in this case, is expected to log this at appropriate severity.
+        General-purpose instrumentation MAY record such errors, but at severity not higher than `Warn`.
+
+   - Errors that result in application shutdown SHOULD be recorded with severity `Fatal`.
+
+      - The application detects an invalid configuration at startup and shuts down.
+      - The application encounters a (presumably) terminal error, such as an out-of-memory condition.


Isn't the description here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#field-severitynumber good enough?
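
As a reading aid for the severity bullets in the hunk above, the following Java sketch encodes them as a lookup. `ErrorCategory`, `Severity`, and `SeverityGuidance` are hypothetical names made up for this illustration. The hunk defines no such API, and deciding which category an error falls into remains a judgment call for the instrumentation or application author.

```java
// Hypothetical categories mirroring the four bullets in the quoted hunk.
enum ErrorCategory {
    NOT_AN_ACTUAL_ISSUE,     // e.g. optional dependency or resource does not exist
    EXPECTED_TO_BE_HANDLED,  // retried or handled by the caller or another layer
    UNHANDLED,               // unhandled by application code, no shutdown
    CAUSES_SHUTDOWN          // results in application shutdown
}

// Hypothetical severity scale, ordered from least to most severe.
enum Severity { INFO, WARN, ERROR, FATAL }

final class SeverityGuidance {
    private SeverityGuidance() {}

    // For the first two categories the returned value is an upper bound
    // ("not higher than"); for the last two it is the severity itself.
    static Severity severityFor(ErrorCategory category) {
        switch (category) {
            case NOT_AN_ACTUAL_ISSUE:
                return Severity.INFO;
            case EXPECTED_TO_BE_HANDLED:
                return Severity.WARN;
            case UNHANDLED:
                return Severity.ERROR;
            case CAUSES_SHUTDOWN:
                return Severity.FATAL;
            default:
                throw new IllegalArgumentException("Unknown category: " + category);
        }
    }
}
```

Under this reading, a 503 response that exhausts retries inside a client library still maps to `WARN`, because the caller is expected to handle it. The same failure would map to `ERROR` only where it goes unhandled by application code.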

lmolkova changed the title ~~OTEP: Recording exceptions and errors with OpenTelemetry~~ OTEP: Recording exceptions as log based events Dec 10, 2024

cijothomas reviewed Dec 10, 2024

pellared assigned lmolkova Dec 10, 2024

trask reviewed Dec 11, 2024

tedsuo added the OTEP OpenTelemetry Enhancement Proposal (OTEP) label Dec 12, 2024

lmolkova force-pushed the exceptions-on-logs-otep branch 2 times, most recently from b06a09f to 76c7d85 Compare December 17, 2024 17:30

reyang reviewed Dec 20, 2024

trask mentioned this pull request Dec 21, 2024

Allow user to enable/disable storing entire stack trace in spans open-telemetry/opentelemetry-java-instrumentation#431

Open

lmolkova mentioned this pull request Dec 27, 2024

What's the use case for exception.escaped? open-telemetry/semantic-conventions#1516

Closed

lmolkova marked this pull request as ready for review December 27, 2024 17:41

lmolkova requested review from a team as code owners December 27, 2024 17:41

lmolkova force-pushed the exceptions-on-logs-otep branch 4 times, most recently from 1a1ea49 to 5ddfd05 Compare December 27, 2024 18:28

lmolkova mentioned this pull request Dec 28, 2024

Add common guidance on recording errors on spans and metrics, clarify DB conventions open-telemetry/semantic-conventions#1716

Merged

3 tasks

joaopgrassi reviewed Jan 2, 2025

lmolkova force-pushed the exceptions-on-logs-otep branch from db27087 to e9f38aa Compare January 3, 2025 01:42

jsuereth reviewed Jan 7, 2025

pellared reviewed Jan 7, 2025

This was referenced Jan 7, 2025

Define whether instrumentations can emit non-event log records #4234

Closed

logs: Remove Events operations in favor of making whole Logger user-facing #4352

Merged

trask reviewed Jan 13, 2025

pellared reviewed Jan 14, 2025

JacksonWeber mentioned this pull request Jan 15, 2025

Monitor telemetry exporter: no way to report exception from consumer span. Azure/azure-sdk-for-js#32387

Open

lmolkova mentioned this pull request Jan 16, 2025

[WIP] Deprecate exception.escaped attribute, update exception example #4368

Draft

5 tasks

trask reviewed Jan 17, 2025

lmolkova force-pushed the exceptions-on-logs-otep branch from fd2d692 to 66dec5a Compare January 18, 2025 20:48

lmolkova and others added 19 commits January 20, 2025 21:48

OTEP: Recording exceptions and errors with OpenTelemetry

50ef640

filename

cfc75d1

feedback and cleanup

28b7590

cleanups

5473397

feedback: recording all exceptions by default is too noisy

39264f6

minor fixes

e33290f

minor fixes

91e4e3c

clean up

810c864

changelog and lint

23a723c

more cleanups and lint

5087efc

ore fixes

9d5a075

more cleanups and lint

d8ad5c0

Apply suggestions from code review

e7556b2

Co-authored-by: Joao Grassi <[email protected]>

feedback

26f599c

remove the note

81d17ab

Apply suggestions from code review

3285ea7

Co-authored-by: Trask Stalnaker <[email protected]>

feedback: more details on severity, language-specific, replacement fo…

b53cda0

…r span events

more feedback, define error/exception

f035e98

lint

61d9551

lmolkova force-pushed the exceptions-on-logs-otep branch from 5a4b286 to 61d9551 Compare January 21, 2025 05:48

pellared reviewed Jan 21, 2025
