Create new partitioned TaskEvent table, and switch to it gradually as new runs are created #1696

Merged: ericallam merged 3 commits into main from ea-branch-20 on Feb 12, 2025

Conversation

@ericallam (Member) commented on Feb 12, 2025

Summary by CodeRabbit

  • New Features
    • Introduced enhanced configuration options for task event partitioning, enabling optimized event management and improved performance.
    • Added new methods for managing task events, improving data retrieval and storage capabilities.
  • Refactor
    • Streamlined run and span inspection interfaces by removing extraneous link displays for a cleaner view.
    • Updated navigation and data retrieval workflows to deliver more accurate task and trace information.

changeset-bot (bot) commented on Feb 12, 2025

⚠️ No Changeset found

Latest commit: e353738

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


coderabbitai (bot, Contributor) commented on Feb 12, 2025

Walkthrough

This pull request removes the span link rendering functionality from several components by deleting the SpanLinkElement function and associated conditional logic in components like RunInspector.tsx, SpanInspector.tsx, and related route files. It also revises several presenters and services to incorporate a new task event store mechanism by using the getTaskEventStoreTableForRun function and additional time-based parameters. Furthermore, new environment configuration options are added, and database migrations/schema updates are introduced to support task event partitioning.

Changes

File(s) and change summary:

  • apps/webapp/app/components/runs/v3/RunInspector.tsx, apps/webapp/app/components/runs/v3/SpanInspector.tsx, apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx: Removed code for displaying span links and deleted the SpanLinkElement function, eliminating link handling in inspector components and routes.
  • apps/webapp/app/env.server.ts: Added new environment properties: TASK_EVENT_PARTITIONING_ENABLED (string, default "0") and TASK_EVENT_PARTITIONED_WINDOW_IN_SECONDS (integer, default 60).
  • apps/webapp/app/presenters/v3/RunPresenter.server.ts, apps/webapp/app/presenters/v3/SpanPresenter.server.ts: Modified query logic to use getTaskEventStoreTableForRun and to include additional parameters (createdAt, completedAt) when retrieving trace summaries and span data.
  • apps/webapp/app/utils/pathBuilder.ts, apps/webapp/app/utils/taskEvent.ts: Removed the v3TraceSpanPath function and eliminated the recordId property from span objects in the task event preparation functions.
  • apps/webapp/app/v3/eventRepository.server.ts, apps/webapp/app/v3/services/cancelAttempt.server.ts, apps/webapp/app/v3/services/cancelTaskRun.server.ts, apps/webapp/app/v3/services/completeAttempt.server.ts, apps/webapp/app/v3/services/crashTaskRun.server.ts, apps/webapp/app/v3/services/expireEnqueuedRun.server.ts, apps/webapp/app/v3/services/triggerTask.server.ts: Updated method signatures and calls to pass task run context by incorporating getTaskEventStoreTableForRun along with run timestamps, refactoring event querying, completion, and task run handling across multiple services.
  • apps/webapp/app/v3/taskEventStore.server.ts: New file that manages task events. It introduces types (CommonTaskEvent, TraceEvent, TaskEventStoreTable), helper functions (getTaskEventStoreTableForRun, getTaskEventStore), and the TaskEventStore class with methods to create and query task events (see the sketch after this list).
  • Database migrations and schema: Added a new migration script to create the partitioned table TaskEventPartitioned with related indexes, and modified the TaskRun table by adding the taskEventStore column. The Prisma schema is updated with a new TaskEventPartitioned model and a corresponding field on TaskRun.
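For orientation, here is a minimal sketch of what the two helpers in taskEventStore.server.ts might look like, assuming the behaviour described elsewhere in this review (partitioning toggled when TASK_EVENT_PARTITIONING_ENABLED is "1", and a taskEventStore column on TaskRun defaulting to "taskEvent"). It is illustrative, not the actual implementation:

// Illustrative sketch only; the real module is apps/webapp/app/v3/taskEventStore.server.ts
// and may differ. The import path for env is assumed.
import { env } from "~/env.server";

export type TaskEventStoreTable = "taskEvent" | "taskEventPartitioned";

// New runs are written to the partitioned store only when the flag is enabled.
export function getTaskEventStore(): TaskEventStoreTable {
  return env.TASK_EVENT_PARTITIONING_ENABLED === "1" ? "taskEventPartitioned" : "taskEvent";
}

// Existing runs keep reading from whichever store they were created with,
// so the switch-over happens gradually as new runs are created.
export function getTaskEventStoreTableForRun(run: {
  taskEventStore?: string | null;
}): TaskEventStoreTable {
  return run.taskEventStore === "taskEventPartitioned" ? "taskEventPartitioned" : "taskEvent";
}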

Sequence Diagram(s)

sequenceDiagram
    participant Service as Service/Presenter
    participant Resolver as getTaskEventStoreTableForRun
    participant Repo as EventRepository
    participant DB as Database

    Service->>Resolver: Determine event store table for run
    Resolver-->>Service: Return table name
    Service->>Repo: Call updated query/completion with (table, runId, createdAt, completedAt)
    Repo->>DB: Execute query/command on the specified table
    DB-->>Repo: Return events/result
    Repo-->>Service: Return processed data

Poem

I’m a rabbit hopping through the code,
Leaving link traces in the dust mode.
Span elements vanished in a blink,
Task events now have more time to think.
With schema and queries primed anew,
Celebrate the changes – for code, we grew! 🐰✨

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

apps/webapp/app/v3/taskEventStore.server.ts

Oops! Something went wrong! :(

ESLint: 8.45.0

ESLint couldn't find the config "custom" to extend from. Please check that the name of the config is correct.

The config "custom" was referenced from the config file in "/.eslintrc.js".

If you still have problems, please stop by https://eslint.org/chat/help to chat with the team.


@coderabbitai (bot, Contributor) left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (2)
apps/webapp/app/v3/eventRepository.server.ts (1)

231-285: ⚠️ Potential issue

Synchronize partial events resolution
When completeEvent queries incomplete events and then inserts a final record, a race condition could occur if multiple requests handle the same span in parallel. Consider adding an upsert or a guard to avoid duplicate finalization.

+ // Suggestion: Add a check to ensure that only one thread can finalize a span at a time
+ // through a transactional or upsert approach.
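One possible shape for such a guard, sketched with Prisma against the TaskEvent model (the transaction isolation level and function name are assumptions, not the repository's current code):

import { Prisma, PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Sketch: finalize a span at most once, even if two requests race on the same span.
export async function finalizeSpanOnce(spanId: string, finalEvent: Prisma.TaskEventCreateInput) {
  return prisma.$transaction(
    async (tx) => {
      // Check whether a non-partial (final) record already exists for this span.
      const existing = await tx.taskEvent.findFirst({
        where: { spanId, isPartial: false },
        select: { id: true },
      });
      if (existing) {
        return existing; // another request finalized this span first; skip the duplicate insert
      }
      return tx.taskEvent.create({ data: finalEvent });
    },
    // Serializable isolation makes the check-then-insert safe (retry on serialization failures);
    // a partial unique index on spanId WHERE "isPartial" = false plus an upsert would also work.
    { isolationLevel: Prisma.TransactionIsolationLevel.Serializable }
  );
}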
apps/webapp/app/v3/services/cancelTaskRun.server.ts (1)

87-104: 🛠️ Refactor suggestion

Deduplicate event cancellation logic.

This event cancellation logic is duplicated in cancelAttempt.server.ts. Consider extracting it into a shared service.

Create a new shared service CancelEventsService that both CancelTaskRunService and CancelAttemptService can use:

// cancelEvents.server.ts
export class CancelEventsService extends BaseService {
  public async call(
    taskRun: { friendlyId: string; createdAt: Date; completedAt: Date | null },
    cancelledAt: Date,
    reason: string
  ) {
    const inProgressEvents = await eventRepository.queryIncompleteEvents(
      getTaskEventStoreTableForRun(taskRun),
      { runId: taskRun.friendlyId },
      taskRun.createdAt,
      taskRun.completedAt ?? undefined
    );

    logger.debug("Cancelling in-progress events", {
      inProgressEvents: inProgressEvents.map((event) => event.id),
    });

    await Promise.all(
      inProgressEvents.map((event) =>
        eventRepository.cancelEvent(event, cancelledAt, reason)
      )
    );
  }
}
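If extracted, both callers could delegate along these lines (the constructor arguments and reason string here are illustrative):

// Inside CancelTaskRunService.call() or CancelAttemptService.call(); sketch only.
const cancelEvents = new CancelEventsService();
await cancelEvents.call(
  {
    friendlyId: taskRun.friendlyId,
    createdAt: taskRun.createdAt,
    completedAt: taskRun.completedAt,
  },
  new Date(),
  "Run cancelled"
);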
🧹 Nitpick comments (15)
apps/webapp/app/v3/services/triggerTask.server.ts (2)

398-398: Consider adding error handling for getTaskEventStore().

The getTaskEventStore() call could potentially fail. Consider wrapping it in a try-catch block or handling potential errors to ensure graceful degradation.

-                    taskEventStore: getTaskEventStore(),
+                    taskEventStore: await this.#handleTaskEventStore(),

Add this private method to handle the task event store:

async #handleTaskEventStore() {
  try {
    return getTaskEventStore();
  } catch (error) {
    logger.error("Failed to get task event store", { error });
    return undefined;
  }
}

62-649: Consider breaking down the call method into smaller, focused methods.

The call method is quite long (587 lines) and handles multiple concerns. Consider extracting the following responsibilities into separate private methods:

  1. Validation logic
  2. Task run creation
  3. Queue management
  4. Dependency handling

This would improve maintainability, testability, and readability.

Example refactor for validation logic:

private async validateTaskTrigger(
  taskId: string,
  environment: AuthenticatedEnvironment,
  body: TriggerTaskRequestBody,
  options: TriggerTaskServiceOptions
): Promise<void> {
  // Extract validation logic here
  if (environment.type !== "DEVELOPMENT" && !options.skipChecks) {
    const result = await getEntitlement(environment.organizationId);
    if (result && result.hasAccess === false) {
      throw new OutOfEntitlementError();
    }
  }
  // ... other validation logic
}
🧰 Tools
🪛 Biome (1.9.4)

[error] 254-255: Change to an optional chain.

Unsafe fix: Change to an optional chain.

(lint/complexity/useOptionalChain)

apps/webapp/app/v3/taskEventStore.server.ts (3)

1-35: Ensure clear distinction between standard and partitioned usage
Overall, the definitions and exported helpers for determining which table to use look good. Consider documenting the scenarios when each table is selected to prevent confusion among contributors about why or how the environment variable toggles partitioning.


72-115: Revisit readReplica usage and date-range conditions
When querying partitioned tables, reads are routed to the readReplica. With eventual consistency, there can be a small delay before data becomes visible. Confirm that consistency requirements are relaxed enough to allow reads from the replica, especially since date filtering might include or exclude rows right at the window boundaries.

Do you want help writing a verification script to confirm the replication lag or indexing overhead for these time-based queries?


117-176: Check for potential large result sets in findTraceEvents
The raw SQL query with LIMIT ${env.MAXIMUM_TRACE_SUMMARY_VIEW_COUNT} helps cap the result size, but if traceId references a high-traffic process, you could still have large partial sets to process. Consider implementing pagination or streaming if you foresee extremely large trace detail queries.
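If pagination becomes necessary, a keyset approach over startTime is one option. The sketch below assumes startTime is a BigInt column and that ties are rare (a real implementation would add a tie-breaker column); it is not the repository's API:

import { PrismaClient } from "@prisma/client";

// Stream trace events in pages instead of one capped query; illustrative only.
async function* streamTraceEvents(readReplica: PrismaClient, traceId: string, pageSize = 1000) {
  let cursor: bigint | null = null; // last startTime seen in the previous page
  for (;;) {
    const page = await readReplica.$queryRaw<Array<{ spanId: string; startTime: bigint }>>`
      SELECT "spanId", "startTime"
      FROM "TaskEvent"
      WHERE "traceId" = ${traceId}
        AND (${cursor}::bigint IS NULL OR "startTime" > ${cursor})
      ORDER BY "startTime" ASC
      LIMIT ${pageSize}
    `;
    if (page.length === 0) return;
    yield page;
    cursor = page[page.length - 1].startTime;
  }
}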

apps/webapp/app/v3/eventRepository.server.ts (3)

361-421: Clarify logic for “allowCompleteDuplicate”
The queryIncompleteEvents method filters out events that have a completed duplicate. This logic is beneficial, but it might be unclear to new contributors. Consider adding a brief comment explaining how “allowCompleteDuplicate” changes the filtering outcome.


423-646: Potential performance constraints in large trace retrieval
getTraceSummary, getRunEvents, and getSpan combine to build robust trace overviews. If used for extremely large runs, performance might degrade. Evaluate whether you need pagination or summarization for extremely large traces or runs beyond the current limit approach.


647-787: Prevent infinite ancestor walks
#createSpanFromEvent and #walkSpanAncestors rely on a level >= 8 safeguard to stop indefinite recursion. Confirm that 8 is the correct threshold or consider a configurable approach for more elaborate nested calls.
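If a configurable cap is preferred over the hard-coded 8, it could look roughly like this (the env var name and types are made up for illustration):

// Depth-capped ancestor walk; stops on a missing parent instead of walking indefinitely.
type SpanNode = { spanId: string; parentId: string | null };

const MAX_ANCESTOR_DEPTH = Number(process.env.MAXIMUM_SPAN_ANCESTOR_DEPTH ?? "8");

function walkAncestors(
  start: SpanNode,
  byId: Map<string, SpanNode>,
  visit: (span: SpanNode, level: number) => void
) {
  let current: SpanNode | undefined = start;
  for (let level = 1; current?.parentId && level <= MAX_ANCESTOR_DEPTH; level++) {
    current = byId.get(current.parentId);
    if (!current) return; // broken chain: bail out rather than loop forever
    visit(current, level);
  }
}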

apps/webapp/app/routes/resources.runs.$runParam.logs.download.ts (1)

35-40: Add error handling for undefined dates.

While the implementation is correct, it would be beneficial to add validation for the date parameters to ensure they are defined and in the correct order.

   const runEvents = await eventRepository.getRunEvents(
     getTaskEventStoreTableForRun(run),
     run.friendlyId,
+    // Ensure createdAt is defined
+    run.createdAt ?? new Date(0),
+    // Ensure completedAt is after createdAt
+    run.completedAt && run.completedAt > run.createdAt ? run.completedAt : undefined
-    run.createdAt,
-    run.completedAt ?? undefined
   );
apps/webapp/app/v3/services/cancelAttempt.server.ts (1)

76-93: Consider extracting event cancellation logic to a separate method.

The event querying and cancellation logic could be extracted into a dedicated method for better organization and reusability.

+  private async cancelInProgressEvents(
+    taskRun: { friendlyId: string; createdAt: Date; completedAt: Date | null },
+    cancelledAt: Date,
+    reason: string
+  ) {
+    const inProgressEvents = await eventRepository.queryIncompleteEvents(
+      getTaskEventStoreTableForRun(taskRun),
+      { runId: taskRun.friendlyId },
+      taskRun.createdAt,
+      taskRun.completedAt ?? undefined
+    );
+
+    logger.debug("Cancelling in-progress events", {
+      inProgressEvents: inProgressEvents.map((event) => event.id),
+    });
+
+    await Promise.all(
+      inProgressEvents.map((event) =>
+        eventRepository.cancelEvent(event, cancelledAt, reason)
+      )
+    );
+  }
apps/webapp/app/presenters/v3/RunPresenter.server.ts (1)

111-116: Consider adding type safety for taskEventStore field.

The taskEventStore field should be properly typed to ensure type safety throughout the application.

+type TaskEventStore = {
+  taskEventStore: string;
+  createdAt: Date;
+  completedAt: Date | null;
+};
+
 const traceSummary = await eventRepository.getTraceSummary(
   getTaskEventStoreTableForRun(run as TaskEventStore),
   run.traceId,
   run.createdAt,
   run.completedAt ?? undefined
 );
apps/webapp/app/env.server.ts (2)

485-486: Add documentation for the new environment variables.

Consider adding JSDoc comments to explain:

  • The purpose and impact of enabling task event partitioning
  • The significance of the partitioned window duration
  • Any performance implications or recommendations
+/**
+ * Controls whether task events are stored in partitioned tables.
+ * @default "0" (disabled)
+ */
 TASK_EVENT_PARTITIONING_ENABLED: z.string().default("0"),
+/**
+ * The time window in seconds for each task event partition.
+ * @default 60 (1 minute)
+ */
 TASK_EVENT_PARTITIONED_WINDOW_IN_SECONDS: z.coerce.number().int().default(60), // 1 minute

485-485: Consider using boolean coercion for the partitioning flag.

For better type safety and consistency with boolean values, consider using z.coerce.boolean() instead of string.

-TASK_EVENT_PARTITIONING_ENABLED: z.string().default("0"),
+TASK_EVENT_PARTITIONING_ENABLED: z.coerce.boolean().default(false),
apps/webapp/app/presenters/v3/SpanPresenter.server.ts (1)

75-75: Consider removing unused taskEventStore selection.

The taskEventStore field is selected in the query but not used in the subsequent code. Consider removing it if it's not needed for the current implementation.

internal-packages/database/prisma/migrations/20250212053026_create_task_event_partitioned_table/migration.sql (1)

1-54: Well-designed partitioned table for task events.

The table structure is well thought out:

  1. Range partitioning on createdAt is perfect for time-series data
  2. Composite primary key (id, createdAt) supports partitioning
  3. Fields comprehensively cover all event tracking needs

Consider creating partition maintenance procedures to:

  • Automatically create new partitions ahead of time
  • Archive/drop old partitions for data retention
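A maintenance job along these lines could run on a schedule; everything here (partition naming, daily granularity, retention handling) is an assumption for illustration, not part of this PR:

import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// UTC day boundaries offset from today; used to name and bound daily partitions.
function dayBounds(offsetDays: number) {
  const start = new Date();
  start.setUTCHours(0, 0, 0, 0);
  start.setUTCDate(start.getUTCDate() + offsetDays);
  const end = new Date(start);
  end.setUTCDate(end.getUTCDate() + 1);
  return { start, end };
}

function partitionName(day: Date) {
  return `TaskEventPartitioned_${day.toISOString().slice(0, 10).replace(/-/g, "_")}`;
}

// Create tomorrow's partition ahead of time so inserts never hit a missing range.
export async function createUpcomingPartition() {
  const { start, end } = dayBounds(1);
  await prisma.$executeRawUnsafe(
    `CREATE TABLE IF NOT EXISTS "${partitionName(start)}"
       PARTITION OF "TaskEventPartitioned"
       FOR VALUES FROM ('${start.toISOString()}') TO ('${end.toISOString()}')`
  );
}

// Detach and drop a partition that has aged out of the retention window.
export async function dropExpiredPartition(retentionDays: number) {
  const { start } = dayBounds(-retentionDays);
  const name = partitionName(start);
  await prisma.$executeRawUnsafe(`ALTER TABLE "TaskEventPartitioned" DETACH PARTITION "${name}"`);
  await prisma.$executeRawUnsafe(`DROP TABLE IF EXISTS "${name}"`);
}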
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0726af and b734ee2.

📒 Files selected for processing (21)
  • apps/webapp/app/components/runs/v3/RunInspector.tsx (0 hunks)
  • apps/webapp/app/components/runs/v3/SpanInspector.tsx (2 hunks)
  • apps/webapp/app/env.server.ts (1 hunks)
  • apps/webapp/app/presenters/v3/RunPresenter.server.ts (3 hunks)
  • apps/webapp/app/presenters/v3/SpanPresenter.server.ts (5 hunks)
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.traces.$traceId.spans.$spanId/route.tsx (0 hunks)
  • apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx (0 hunks)
  • apps/webapp/app/routes/resources.runs.$runParam.logs.download.ts (2 hunks)
  • apps/webapp/app/utils/pathBuilder.ts (0 hunks)
  • apps/webapp/app/utils/taskEvent.ts (0 hunks)
  • apps/webapp/app/v3/eventRepository.server.ts (17 hunks)
  • apps/webapp/app/v3/services/cancelAttempt.server.ts (2 hunks)
  • apps/webapp/app/v3/services/cancelTaskRun.server.ts (2 hunks)
  • apps/webapp/app/v3/services/completeAttempt.server.ts (5 hunks)
  • apps/webapp/app/v3/services/crashTaskRun.server.ts (2 hunks)
  • apps/webapp/app/v3/services/expireEnqueuedRun.server.ts (2 hunks)
  • apps/webapp/app/v3/services/triggerTask.server.ts (2 hunks)
  • apps/webapp/app/v3/taskEventStore.server.ts (1 hunks)
  • internal-packages/database/prisma/migrations/20250212053026_create_task_event_partitioned_table/migration.sql (1 hunks)
  • internal-packages/database/prisma/migrations/20250212075957_add_task_event_store_to_task_run/migration.sql (1 hunks)
  • internal-packages/database/prisma/schema.prisma (2 hunks)
💤 Files with no reviewable changes (5)
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.v3.$projectParam.traces.$traceId.spans.$spanId/route.tsx
  • apps/webapp/app/utils/taskEvent.ts
  • apps/webapp/app/utils/pathBuilder.ts
  • apps/webapp/app/components/runs/v3/RunInspector.tsx
  • apps/webapp/app/routes/resources.orgs.$organizationSlug.projects.v3.$projectParam.runs.$runParam.spans.$spanParam/route.tsx
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: typecheck / typecheck
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (16)
apps/webapp/app/v3/services/triggerTask.server.ts (1)

32-32: LGTM!

The import statement for getTaskEventStore is correctly placed and follows the project's import style.

apps/webapp/app/components/runs/v3/SpanInspector.tsx (3)

1-1: LGTM! Import changes align with component requirements.

The import changes correctly support the component's functionality:

  • Core utilities for duration formatting
  • Spinner component for pending state visualization
  • Simplified path utilities after link handling removal

Also applies to: 8-8, 20-20


219-264: LGTM! SpanTimeline improvements enhance user experience.

The SpanTimeline component effectively:

  • Provides clear visual feedback for pending state using the Spinner
  • Maintains consistent time formatting
  • Handles all states (pending, complete, error) appropriately

25-210: Verify the impact of link removal on user navigation.

While the component's core functionality remains intact, the removal of link handling might affect user navigation patterns. Please ensure that:

  1. Users can still effectively navigate between related spans
  2. The removal of links doesn't break any critical user journeys
  3. Alternative navigation paths are available where needed

Let's verify the impact with this script:

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Check for any remaining link-related code or dependencies

# Search for any remaining link-related patterns
echo "Checking for remaining link patterns..."
rg -i "link" --type tsx --type ts

# Search for navigation patterns to ensure alternatives exist
echo "Checking navigation patterns..."
rg -i "navigate|route|path" --type tsx --type ts

Length of output: 302


I'll now run an additional script to search for navigation-related components—specifically looking for the usage of the TextLink component and the v3RunsPath function—to confirm that navigation remains in place despite the removal of legacy link-handling code.


🏁 Script executed:

#!/bin/bash
echo "Searching for 'TextLink' usage..."
rg -i "TextLink"
echo "Searching for 'v3RunsPath' usage..."
rg -i "v3RunsPath"

Length of output: 16320


Navigation remains intact despite link removal changes.

Our analysis confirmed that the TextLink component and the v3RunsPath function are actively used in multiple areas—including in the SpanInspector component via the "Task" property—indicating that users can still navigate effectively. The current implementation supports navigation with updated patterns, and no critical issues were found.

apps/webapp/app/v3/taskEventStore.server.ts (1)

37-60: Validate inputs before inserting
The create and createMany methods rely on trusted data. Consider adding assertion checks or validations for critical fields (e.g., ensuring data or array length is not empty) before writing to the database to reduce the likelihood of invalid records.

apps/webapp/app/v3/eventRepository.server.ts (2)

37-106: Partitioning checks for partial events
Introducing the TaskEventStore property and the partitioning flag is a good approach. As you rely on the environment variable for toggling, confirm that higher-level code (such as migrations or data backfill routines) aligns with whether partitioning is expected.


1078-1096: Ensure consistent usage of taskEventStoreTable
The private get taskEventStoreTable() fallback and subsequent use in createMany is a neat approach. Double-check references in other code paths to confirm consistent usage across your entire codebase so the partitioning logic is never bypassed inadvertently.

apps/webapp/app/v3/services/expireEnqueuedRun.server.ts (2)

7-7: Confirm import location
The import of getTaskEventStoreTableForRun appears correct. If you rename or refactor the store logic in the future, ensure the path remains consistent across services like ExpireEnqueuedRunService or CancelTaskRunService.


81-103: Complete event logic
Using eventRepository.completeEvent with getTaskEventStoreTableForRun(run) is good for correctness under partitioned scenarios. However, ensure that if run.createdAt is far in the past, no relevant events get excluded by your date constraints in event repository queries.

apps/webapp/app/v3/services/crashTaskRun.server.ts (1)

123-131: LGTM! Proper integration of task event store partitioning.

The changes correctly integrate the new task event store functionality while maintaining existing error handling and logging. The addition of timestamps provides better context for event queries.

apps/webapp/app/presenters/v3/SpanPresenter.server.ts (1)

210-216: LGTM! Consistent implementation of task event store partitioning.

The changes correctly integrate the new task event store functionality while maintaining existing error handling and data transformation logic.

Also applies to: 366-372

apps/webapp/app/v3/services/completeAttempt.server.ts (1)

168-186: LGTM! Comprehensive integration of task event store partitioning.

The changes correctly integrate the new task event store functionality across all paths (success, failure, and system failure), while maintaining:

  • Existing error handling
  • Retry logic
  • Event completion semantics

Also applies to: 317-337, 379-386, 414-435

internal-packages/database/prisma/migrations/20250212075957_add_task_event_store_to_task_run/migration.sql (1)

1-5: LGTM! The migration safely adds the task event store column.

The migration correctly adds the taskEventStore column with a default value, ensuring backward compatibility with existing rows.

internal-packages/database/prisma/migrations/20250212053026_create_task_event_partitioned_table/migration.sql (1)

56-63: Appropriate indexes for common query patterns.

The indexes on traceId, spanId, and runId will support efficient:

  • Trace/span lookups for distributed tracing
  • Run-specific event queries
✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Verify if similar indexes exist on the original TaskEvent table for consistency
rg -A 5 'CREATE INDEX.*TaskEvent.*(traceId|spanId|runId)'

Length of output: 1618


Index Consistency Verified
The migration file for the partitioned table correctly creates indexes on traceId, spanId, and runId, matching the indexes previously established in the TaskEvent table migrations. This ensures that query patterns for distributed tracing and run-specific events are consistently optimized.

  • The partitioned table (TaskEventPartitioned) defines indexes analogous to those on the original table (TaskEvent).
  • Shell script results confirm that index creation for these columns exists in both places.
internal-packages/database/prisma/schema.prisma (2)

1732-1732: LGTM! TaskRun model updated with event store tracking.

The taskEventStore field is correctly added with a default value of "taskEvent", matching the migration.


2710-2805: Well-structured TaskEventPartitioned model.

The model correctly implements the partitioned table with:

  1. Composite primary key using @@id([id, createdAt])
  2. Matching indexes from the migration
  3. Field definitions consistent with the original TaskEvent model

Let's verify the field consistency between TaskEvent and TaskEventPartitioned models:

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Extract and compare field definitions between TaskEvent and TaskEventPartitioned
ast-grep --pattern $'model TaskEvent {
  $$$
}'

ast-grep --pattern $'model TaskEventPartitioned {
  $$$
}'

Length of output: 108


I'll generate a revised script to extract and display the full definitions of both models using rg. This should help us manually verify the field consistency between the TaskEvent and TaskEventPartitioned models.

Please run the following shell script:

Once we have the output from these commands, we can compare the field definitions to confirm consistency.


🏁 Script executed:

#!/bin/bash
echo "==== TaskEvent model definition ===="
rg "model TaskEvent {" -A 200 internal-packages/database/prisma/schema.prisma

echo "==== TaskEventPartitioned model definition ===="
rg "model TaskEventPartitioned {" -A 200 internal-packages/database/prisma/schema.prisma

Length of output: 698


Let's try a new script using fixed-string search with rg -F to avoid regex errors. Please run the following shell script to extract the full definitions of both models:

This will help us directly verify field consistency between the two models.


🏁 Script executed:

#!/bin/bash
echo "==== TaskEvent model definition ===="
rg -F "model TaskEvent {" -A 300 internal-packages/database/prisma/schema.prisma

echo "==== TaskEventPartitioned model definition ===="
rg -F "model TaskEventPartitioned {" -A 300 internal-packages/database/prisma/schema.prisma

Length of output: 11720


Field consistency verified between TaskEvent and TaskEventPartitioned

The field definitions align exactly between the two models except for the intentional change in the primary key—in TaskEvent, a single field is marked as the primary key, while in TaskEventPartitioned a composite primary key (id and createdAt) is used to support partitioning. This confirms that the partitioned model correctly mirrors the original TaskEvent model’s structure with the planned adjustments.

@coderabbitai (bot, Contributor) left a comment


Actionable comments posted: 1

🧹 Nitpick comments (5)
apps/webapp/app/v3/taskEventStore.server.ts (5)

25-31: Consider handling unknown or malformed taskEventStore values more explicitly.
If run.taskEventStore contains an unexpected string, it silently defaults to "taskEvent". Consider validating the field or logging a warning to flag data inconsistencies.


33-35: Ensure environment variable usage is consistent with boolean semantics.
Currently, partitioning is enabled by checking if TASK_EVENT_PARTITIONING_ENABLED === "1". Consider using a more standard boolean environment variable (true/false) for clarity.


43-49: Optional: Add error handling or logging in the create method.
If the database call fails, the error is not caught or logged. Consider wrapping this call in a try/catch or letting the caller handle the error with more context.


54-60: Optional: Add error handling or logging in the createMany method.
Similar to create, consider adding a mechanism to log or handle failures when inserting multiple records.


62-115: Update code comment to match the actual time offset logic.
The comment at line 83 references adding “1 minute” when computing the endCreatedAt boundary. However, the code uses env.TASK_EVENT_PARTITIONED_WINDOW_IN_SECONDS, which might exceed 60 seconds. Consider updating the comment for clarity.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b734ee2 and e353738.

📒 Files selected for processing (1)
  • apps/webapp/app/v3/taskEventStore.server.ts (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
apps/webapp/app/v3/taskEventStore.server.ts (1)

6-23: Use caution with partial and picked types for future schema extensions.
By Omit-ing and Pick-ing fields, additions to the underlying schema may need updates here. Ensure you maintain these types as the schema evolves.

Comment on lines +117 to +175
async findTraceEvents(
  table: TaskEventStoreTable,
  traceId: string,
  startCreatedAt: Date,
  endCreatedAt?: Date
) {
  if (table === "taskEventPartitioned") {
    return await this.readReplica.$queryRaw<TraceEvent[]>`
      SELECT
        "spanId",
        "parentId",
        "runId",
        "idempotencyKey",
        LEFT(message, 256) as message,
        style,
        "startTime",
        duration,
        "isError",
        "isPartial",
        "isCancelled",
        level,
        events,
        "environmentType"
      FROM "TaskEventPartitioned"
      WHERE
        "traceId" = ${traceId}
        AND "createdAt" >= ${startCreatedAt.toISOString()}::timestamp
        AND "createdAt" < ${(endCreatedAt
          ? new Date(endCreatedAt.getTime() + env.TASK_EVENT_PARTITIONED_WINDOW_IN_SECONDS * 1000)
          : new Date()
        ).toISOString()}::timestamp
      ORDER BY "startTime" ASC
      LIMIT ${env.MAXIMUM_TRACE_SUMMARY_VIEW_COUNT}
    `;
  } else {
    return await this.readReplica.$queryRaw<TraceEvent[]>`
      SELECT
        id,
        "spanId",
        "parentId",
        "runId",
        "idempotencyKey",
        LEFT(message, 256) as message,
        style,
        "startTime",
        duration,
        "isError",
        "isPartial",
        "isCancelled",
        level,
        events,
        "environmentType"
      FROM "TaskEvent"
      WHERE "traceId" = ${traceId}
      ORDER BY "startTime" ASC
      LIMIT ${env.MAXIMUM_TRACE_SUMMARY_VIEW_COUNT}
    `;
  }
}
@coderabbitai (bot, Contributor) left a review comment:

🛠️ Refactor suggestion

Align selected columns in partitioned vs. unpartitioned queries.
In partitioned mode, the id column is omitted, while the unpartitioned query includes it, even though TraceEvent type excludes id. This discrepancy could cause confusion. Consider returning identical columns or updating the type.
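One way to keep the two branches in lockstep is to share the projection, for example with Prisma.sql (sketch only, not the current code; the fragment name is made up):

import { Prisma } from "@prisma/client";

// Single source of truth for the columns both branches select, matching the TraceEvent type.
const traceEventColumns = Prisma.sql`
  "spanId", "parentId", "runId", "idempotencyKey",
  LEFT(message, 256) AS message, style, "startTime", duration,
  "isError", "isPartial", "isCancelled", level, events, "environmentType"
`;

// Both queries would then read: SELECT ${traceEventColumns} FROM "TaskEventPartitioned" ...
// and SELECT ${traceEventColumns} FROM "TaskEvent" ..., so the result shapes cannot drift.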

@ericallam merged commit bd0cc54 into main on Feb 12, 2025
11 checks passed
@ericallam deleted the ea-branch-20 branch on February 12, 2025 at 09:12
samejr pushed a commit that referenced this pull request Feb 13, 2025
… new runs are created (#1696)

* Create new partitioned TaskEvent table, and switch to it gradually as new runs are created

* Add env var for partition window in seconds

* Make startCreatedAt required in task event store
@coderabbitai (bot) mentioned this pull request on Feb 13, 2025
samejr added a commit that referenced this pull request Feb 13, 2025
* Uses image cards for the frameworks

* Removes old snippets

* New AI agents side menu section

* WIP adding new ai agent pages

* Better overview page

* More copy added to the agent example pages

* Copy improvements

* Removes “Creating a project” page and side menu section

* Fixes broken links

* Updates to the latest Mintlify version, fixes issues, changes theme

* Adds descriptions to the main dropdown menu items

* Reformatted Introduction docs ‘landing page’

* Retry heartbeat timeouts by putting back in the queue (#1689)

* If there’s a heartbeat error and no attempts we put it back in the queue to try again

* When nacking, return whether it was put back in the queue or not

* Try and nack, if it fails then fail the run

* Consolidated switch statement

* Fail executing/retrying runs

* OOM retrying on larger machines (#1691)

* OOM retrying on larger machines

* Create forty-windows-shop.md

* Update forty-windows-shop.md

* Only retry again if the machine is different from the original

* Kubernetes OOMs appear as non-zero sigkills, adding support for treating these as OOMs

* Complete the original attempt span if retrying due to an OOM

* Revert "Complete the original attempt span if retrying due to an OOM"

This reverts commit 5f652c6.

* chore: Update version for release (#1666)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Release 3.3.14

* Set machine when triggering docs

* Batch queue runs that are waiting for deploy (#1693)

* Detect ffmpeg OOM errors, added manual OutOfMemoryError (#1694)

* Detect ffmpeg OOM errors, added manual OutOfMemoryError

* Create eighty-spies-knock.md

* Improved the machines docs, including the new OutOfMemoryError

* chore: Update version for release (#1695)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Release 3.3.15

* Create new partitioned TaskEvent table, and switch to it gradually as new runs are created (#1696)

* Create new partitioned TaskEvent table, and switch to it gradually as new runs are created

* Add env var for partition window in seconds

* Make startCreatedAt required in task event store

* Don't create an attempt if the run is final, batchTriggerAndWait bad continue fix (#1698)

* WIP fix for ResumeAttemptService selecting the wrong attempt (which has no error or output)

* Don’t create an attempt if the run is already in a final status

* Don’t get all the columns for the query. Improved the logging.

* Added a log to the batch example

* Filter out the undefined values

* Fix missing logs on child runs by using the root task run createdAt if it exists (#1697)

* Provider changes to support image cache (#1700)

* add env var for additional pull secrets

* make static images configurable

* optional image prefixes

* optional labels with sample rates

* add missing core paths

* remove excessive logs

* Fix run container exits after OOM retries (#1701)

* remove unused imports

* tell run to exit before force requeue

* handle exit for case where we already retried after oom

* improve retry span and add machine props

* don't try to exit run in dev

* Upgrade local dev to use electric beta.15 (#1699)

* Text fixes

* Removed pnpm files

---------

Co-authored-by: Matt Aitken <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Eric Allam <[email protected]>
Co-authored-by: nicktrn <[email protected]>