Build failure dataset script #4617

benjaminmah · 2024-11-13T16:01:00Z

(STILL A WORK IN PROGRESS)

Script to create the build failure dataset. It collects the revision ID, the initial and fix commit hashes, the interdiff, and the error logs.

jmaher

I like this "in general". Some specifics: I think collecting this data and storing it longer term is good. Can you store it in .json artifacts so that we could ingest it in the future? Maybe use the fixed_by_commit revision or the bug ID as the filename. Then these exist and can be ingested over time.

The risk is that there are multiple backouts for a given bugid/original_changeset.
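A minimal sketch of the storage idea above, assuming one JSON file per key (bug ID or fixed_by_commit revision) that holds a list of records, so a second backout for the same key is appended rather than overwriting the first. The function name, file layout, and record fields are hypothetical and are not taken from the script under review.

```python
import json
import tempfile
from pathlib import Path


def write_backout_record(out_dir: Path, key: str, record: dict) -> Path:
    """Append `record` to the JSON artifact named after `key`.

    The file stores a list, so repeated backouts for the same key
    accumulate instead of replacing each other.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{key}.json"
    records = json.loads(path.read_text()) if path.exists() else []
    records.append(record)
    path.write_text(json.dumps(records, indent=2))
    return path


# Hypothetical values, for illustration only.
out_dir = Path(tempfile.mkdtemp())
write_backout_record(out_dir, "bug-1", {"initial": "aaa", "fix": "bbb"})
path = write_backout_record(out_dir, "bug-1", {"initial": "ccc", "fix": "ddd"})
records = json.loads(path.read_text())
```

Keeping a list per key is one way to handle the multiple-backout risk; an alternative would be to include the backout revision in the filename.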

It would be a bonus if we could collect test regressions as well; it is OK if these are in a separate location (db, filesystem, etc.). The primary use case I can see is determining whether we are finding regressions in code we ship, or just forgetting or dealing with old test cases. This is pretty easy to do by looking at the interdiff and determining which files were changed.
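As a rough illustration of the interdiff check described above, the sketch below splits the files touched by a diff into test files and shipped code. It assumes git-style `diff --git a/... b/...` headers and a simple path heuristic; the marker list and function names are assumptions, and paths containing spaces are not handled.

```python
import re

# Assumed heuristic: path fragments that usually indicate test code.
TEST_MARKERS = ("/test/", "/tests/")


def changed_files(diff_text: str) -> list:
    """Return the paths named in `diff --git a/... b/...` headers."""
    return re.findall(r"^diff --git a/(\S+) b/\S+$", diff_text, flags=re.MULTILINE)


def is_test_file(path: str) -> bool:
    """Guess whether a path is test code from its directory or file name."""
    name = path.rsplit("/", 1)[-1]
    return any(m in f"/{path}" for m in TEST_MARKERS) or name.startswith("test_")


def classify_interdiff(diff_text: str) -> dict:
    """Split the changed files into test files and shipped (product) code."""
    files = changed_files(diff_text)
    return {
        "test": [f for f in files if is_test_file(f)],
        "product": [f for f in files if not is_test_file(f)],
    }


# Hypothetical interdiff, for illustration only.
SAMPLE_DIFF = "\n".join(
    [
        "diff --git a/dom/base/Document.cpp b/dom/base/Document.cpp",
        "--- a/dom/base/Document.cpp",
        "+++ b/dom/base/Document.cpp",
        "@@ -1 +1 @@",
        "-old",
        "+new",
        "diff --git a/dom/base/test/test_doc.html b/dom/base/test/test_doc.html",
        "--- a/dom/base/test/test_doc.html",
        "+++ b/dom/base/test/test_doc.html",
        "@@ -1 +1 @@",
        "-old",
        "+new",
    ]
)
result = classify_interdiff(SAMPLE_DIFF)
```

A fix that only touches files in the "test" bucket would suggest a stale or broken test rather than a regression in shipped code.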

jmaher · 2025-01-16T22:00:18Z

scripts/build_failure_data_collection.py

+    main()
+
+# 0. Download databases
+# 1. Identify bugs in Bugzilla that have a backout due to build failures X


I normally get my regression (backed out) data from the Treeherder database. The main reason is that we can get all the tasks that failed, which could help determine whether a change is failing ALL builds or only a certain platform or kind of build failure. Not required, but the additional information is good to know.
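To make the "all builds or a certain platform" question concrete, here is a small sketch that groups build task results by platform. The task dictionaries and their `platform` and `result` keys are hypothetical stand-ins; they do not reflect the actual Treeherder or Taskcluster schema.

```python
from collections import defaultdict


def summarize_build_failures(tasks: list) -> tuple:
    """Return (scope, failed_platforms) for a list of build task results."""
    by_platform = defaultdict(list)
    for task in tasks:
        by_platform[task["platform"]].append(task["result"])

    failed = sorted(p for p, results in by_platform.items() if "failed" in results)
    if not failed:
        scope = "none"
    elif len(failed) == len(by_platform):
        scope = "all platforms"
    else:
        scope = "some platforms"
    return scope, failed


# Hypothetical task results, for illustration only.
tasks = [
    {"platform": "linux64", "result": "failed"},
    {"platform": "win64", "result": "completed"},
    {"platform": "macosx64", "result": "completed"},
]
scope, failed_platforms = summarize_build_failures(tasks)
```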

jmaher · 2025-01-16T22:06:20Z

scripts/build_failure_data_collection.py

+            for key in ["node", "bug_id", "pushdate", "backedoutby", "backsout", "desc"]
+        }
+
+        bug_commits.setdefault(commit["bug_id"], []).append(commit_data)


How will you make this work when there are multiple landings and backouts? In general, do you care about each transaction or the end state? Sometimes the end state is a secondary patch that hot-fixes the situation, or a rebase that changes other files/code blocks not originally edited.

I think any solution should account for >1 backout.
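One way to account for more than one backout is to treat each landing as its own cycle instead of assuming a single initial/fix pair per bug. The sketch below assumes the commit dictionaries for one bug are ordered by push date, that `backedoutby` holds the node of the backing-out commit (empty if the commit stuck), and that `backsout` lists the nodes a backout commit reverts. Those semantics are an assumption about the keys shown in the diff, not something confirmed here.

```python
def landing_cycles(commits: list) -> list:
    """Return one entry per landing, with the backout that reverted it (if any)."""
    cycles = []
    for commit in commits:
        if commit["backsout"]:
            continue  # this commit is itself a backout, not a landing
        cycles.append(
            {"landed": commit["node"], "backed_out_by": commit["backedoutby"] or None}
        )
    return cycles


def final_landing(cycles: list):
    """Return the last landing that was not backed out, or None."""
    surviving = [c["landed"] for c in cycles if c["backed_out_by"] is None]
    return surviving[-1] if surviving else None


# Hypothetical history for a single bug: two backouts, then a landing that stuck.
commits = [
    {"node": "a1", "backedoutby": "b1", "backsout": []},
    {"node": "b1", "backedoutby": "", "backsout": ["a1"]},
    {"node": "a2", "backedoutby": "b2", "backsout": []},
    {"node": "b2", "backedoutby": "", "backsout": ["a2"]},
    {"node": "a3", "backedoutby": "", "backsout": []},
]
cycles = landing_cycles(commits)
backout_count = sum(1 for c in cycles if c["backed_out_by"])
```

With the cycles in hand, the dataset can record either every transaction or only the end state, depending on which one matters.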

jmaher · 2025-01-16T22:08:59Z

scripts/build_failure_data_collection.py

+            and "for causing" in desc.lower()
+            and "build" in desc.lower()
+        ):
+            return commit


In this case, why do you need Bugzilla data?

benjaminmah · 2025-01-23

We use the Bugzilla entry to identify backed out commits. Is there a more accurate method that you may be aware of to identify backed out commits (aside from the Treeherder DB)?

jmaher · 2025-01-23

I thought commits via hg had metadata:
https://hg.mozilla.org/mozilla-central/rev/99a8f2b2b00d85148c743f16db75d9abefb33513

If you want a short list of commits that are backed out, then Bugzilla comment parsing would do.

One value-add method would be to propose a change to the backout process so that there is queryable metadata: every time there is a backout, data is stored in bugzilla/treeherder/some_random_db with the original commit, the backed out commit, related links, the list of failing tasks, and the related error messages. Thinking more, developers relanding would have to go through a process to document what they were relanding, and confirm the interdiff :)

jmaher · 2025-01-23

My original comment here was assuming you were doing a massive hg log to look at commit history and parse out "backed out" commit messages to build a list.
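For reference, parsing "backed out" commit messages could look roughly like the sketch below. The patterns are assumptions about how backout messages are usually worded (for example, "Backed out changeset <hash> (bug <id>) for causing ..."); the build-related check mirrors the `"for causing"` and `"build"` substring test shown in the diff.

```python
import re

BACKOUT_RE = re.compile(r"^backed out\b", re.IGNORECASE)
NODE_RE = re.compile(r"\b[0-9a-f]{12,40}\b")
BUG_RE = re.compile(r"\bbug (\d+)\b", re.IGNORECASE)


def parse_backout(desc: str):
    """Return details of a backout commit message, or None if it is not one."""
    first_line = desc.splitlines()[0] if desc else ""
    if not BACKOUT_RE.match(first_line):
        return None
    lowered = first_line.lower()
    return {
        # Changeset hashes mentioned anywhere in the message.
        "nodes": NODE_RE.findall(desc.lower()),
        # Bug numbers mentioned on the summary line.
        "bugs": BUG_RE.findall(first_line),
        "build_related": "for causing" in lowered and "build" in lowered,
    }


# Hypothetical commit messages, for illustration only.
backout = parse_backout(
    "Backed out changeset 0123456789ab (bug 1234567) "
    "for causing build bustages. CLOSED TREE"
)
regular = parse_backout("Bug 1234567 - Fix a thing. r=reviewer")
```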

jmaher · 2025-01-16T22:10:35Z

scripts/build_failure_data_collection.py

+        if not backing_out_commit:
+            continue
+
+        logger.info("Backing out commit found!")


It would be nice to mention the commit ID :)

benjaminmah · 2025-01-23

I've included the commit information when identifying the backing out commit: 460e669
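A minimal sketch of a log line that names the commit, assuming the commit dictionary carries the `node` and `bug_id` keys seen elsewhere in the diff. This is an illustration only, not the code from 460e669.

```python
import logging

logger = logging.getLogger(__name__)


def describe_backout(commit: dict) -> str:
    """Build a log message that names the backing-out commit."""
    return f"Backing out commit found: {commit['node']} (bug {commit['bug_id']})"


# Hypothetical commit record, for illustration only.
backing_out_commit = {"node": "0123456789ab", "bug_id": 1234567}
message = describe_backout(backing_out_commit)
logger.info(message)
```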

jmaher · 2025-01-16T22:12:36Z

scripts/build_failure_data_collection.py

+            failed_tasks.add(task["status"]["taskId"])
+
+    # 6. find intersection between build tasks and failed tasks
+    failed_build_tasks = list(build_tasks & failed_tasks)


If you use the Treeherder DB (search for fixed_by_commit) instead of Bugzilla to find backouts, you get most of this stuff for free.

jmaher · 2025-01-16T22:14:00Z

scripts/build_failure_data_collection.py

+            #     continue
+
+            commit_diff = repository.get_diff(
+                repo_path="hg_dir", original_hash=commits[0], fix_hash=commits[1]


Since many fixed revisions do not have the original code referenced, this might have more edge cases than you think. Also, for multiple cycles, it is probably best to compare commits[0] <-> commits[-1].
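The endpoint suggestion can be sketched as a small helper that picks the first and last commits of a bug's history and skips bugs with fewer than two. The helper name is hypothetical, and it deliberately does not call `repository.get_diff`, whose implementation is not shown here.

```python
def diff_endpoints(commits: list):
    """Return (original_hash, fix_hash) spanning the whole history, or None."""
    if len(commits) < 2:
        return None  # nothing to compare against
    return commits[0], commits[-1]


# Hypothetical hashes for a bug with two landing/backout cycles.
endpoints = diff_endpoints(["a1", "b1", "a2", "a3"])
```

The resulting pair could then be passed as the `original_hash` and `fix_hash` arguments in place of `commits[0]` and `commits[1]`.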

benjaminmah added 16 commits November 5, 2024 16:50

Preliminary dataset creation script

9c60d55

Added revision finder for backed out commits

c3c98c0

Added return bug array

cb31228

Added commit node finder, removed phabricator requirement

23ebc08

Added hg diff functions

1fce464

CSV creation

6a9830c

Fixed the revision collection script

2287975

Added matrix message reference for log collection

2ae3029

Fixed assertion error

9a298aa

Added TC API error line search

df779ea

Added error line retrieval in dataset creation

f837635

Replaced client ID with environment variable

dc5a83a

Fixed comments

bbfd499

Uncommented revision finder

5f4c2af

Refactored code

529067c

Included all revisions of a push

dc4544e

benjaminmah requested a review from suhaibmujahid December 4, 2024 22:26

benjaminmah requested a review from marco-c January 3, 2025 20:36

jmaher reviewed Jan 16, 2025

benjaminmah added 2 commits January 23, 2025 10:00

Added commit information when identifying backing out commit

460e669

Skipping bugs with multiple backouts

2131fe6
