Only about 1/7 of all Linux commits recorded in GHA #213
Comments
I've saved all single events JSONs which refer to |
Yes, I would expect some events to be missing, but not 6/7 of them. Can you share the query that you're running? Are you accounting for multiple commits in a single push?
This is the query to count distinct SHAs. I'm taking all commits from push events; otherwise Linux has only about 700 pushes/year, versus 75k commits for the same range, so this is essential:
-- Returns 9841 for a year range {[table}} -> And this returns list of distinct SHAs for a given table and condition (which can be for example filter by repo.name =
But I was also grepping all JSON files (all that refer to |
Hmm, this is an interesting puzzle. I don't see any obvious issues with the query. Curious, have you tried capturing the error instead of just returning an empty array? Wondering if we're swallowing any errors there that we ought to pay attention to. Re grepping JSON: do you see a difference in counts between the JSON archives and BigQuery?
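For the JSON side of that comparison, a minimal sketch of counting distinct SHAs might look like the following. It assumes the archive files are newline-delimited JSON in which each event has a `type`, a `repo.name`, and, for push events, a `payload.commits` list whose entries carry a `sha`. The function name `distinct_push_shas` is hypothetical and not part of any project tooling.

```python
import json


def distinct_push_shas(lines, repo_name):
    """Collect distinct commit SHAs from PushEvent records for one repository.

    `lines` is any iterable of JSON strings, one event per line
    (assumed archive layout; adjust the field names if yours differ).
    """
    shas = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "PushEvent":
            continue  # other event types have no commits list
        if event.get("repo", {}).get("name") != repo_name:
            continue
        # A single push can carry several commits, so walk the whole list.
        for commit in event.get("payload", {}).get("commits", []):
            shas.add(commit["sha"])
    return shas
```

Running this over every hourly file in the date range and taking `len()` of the resulting set gives a number that can be compared directly with the BigQuery count.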
The error returning [] was for event types other than 'PushEvent' (so there was no |
Gotcha, thanks for the context. Have you tried looking at some of the missing commits: are there any common patterns or distinctions for those vs. what's in the archives? Are there time gaps, or maybe other distinctions? |
No, I haven't done such an analysis, sorry.
The push event docs state:
If devs on this repo have a habit of pushing large numbers of commits at once, this might explain the discrepancy. You could do some spot checks through the commits API to compare push event payloads against actual commit counts.
The exact date, time, and hash of any missing commit would likely help diagnose this.
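One way to do a first spot check offline, before calling the commits API, is to look for pushes whose listed commits fall short of the count the event itself reports. The sketch below assumes each push event's `payload.size` holds the number of commits in the push while `payload.commits` may be capped (the events documentation has described a limit of 20 commits per push event). The helper name `truncated_pushes` is hypothetical.

```python
def truncated_pushes(events):
    """Return (event id, reported size, listed count) for every push event
    whose commits list is shorter than the size the payload reports.

    `events` is an iterable of already-parsed event dicts.
    Field names (`payload.size`, `payload.commits`) are assumptions
    about the archive layout.
    """
    truncated = []
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        payload = event.get("payload", {})
        listed = len(payload.get("commits", []))
        # Fall back to the listed count when no size field is present.
        reported = payload.get("size", listed)
        if reported > listed:
            truncated.append((event.get("id"), reported, listed))
    return truncated
```

If the sum of `reported - listed` over the year is close to the number of missing SHAs, large pushes would be a plausible explanation for most of the gap.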
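A rough sketch of how that list could be produced from a local clone: diff the SHAs from `git log` against the archived set and group the missing ones by month to look for time gaps. It assumes the log was generated with something like `git log --pretty=format:"%H %cI"` (hash, then committer date in ISO 8601), and the function names are hypothetical.

```python
from collections import Counter


def missing_commits(git_log_lines, archived_shas):
    """Return (sha, iso_date) pairs for commits absent from the archive set.

    `git_log_lines` holds lines of the form "<sha> <iso-8601 date>".
    """
    missing = []
    for line in git_log_lines:
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        sha, when = parts
        if sha not in archived_shas:
            missing.append((sha, when.strip()))
    return missing


def missing_by_month(missing):
    """Count missing commits per YYYY-MM to make gaps in time visible."""
    return Counter(when[:7] for _, when in missing)
```

A roughly even spread across months would point at something systematic, such as truncated push payloads, while a few empty months would point at gaps in collection.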
Hi, I have a question: should all commits stored in a git repository be present in the GitHub archives?
I have a local clone of the torvalds/linux repository.
I'm counting all distinct commit SHAs:
git log --pretty=format:"%H" --since="2018-04-01" --until="2019-04-01" | uniq | wc -l
It gives 75141 commits for the one-year period 2018-04-01 to 2019-04-01.
When analyzing all torvalds/linux commits stored in the GitHub archives, I can only get 10542 distinct SHAs. Maybe the problem is that the Linux GitHub repository is only a mirror, so most commits are not stored in the GitHub API?