Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only about 1/7 of all Linux commits recorded in GHA #213

Open
lukaszgryglicki opened this issue Apr 9, 2019 · 9 comments
Open

Only about 1/7 of all Linux commits recorded in GHA #213

lukaszgryglicki opened this issue Apr 9, 2019 · 9 comments

Comments

@lukaszgryglicki
Copy link
Contributor

Hi, I have a question:
Should all commits stored in git repository be present on the GitHub archives?
I have a local clone of torvalds/linux repository.
I'm counting all distinct commit's SHA's: git log --pretty=format:"%H" --since="2018-04-01" --until="2019-04-01" | uniq | wc -l. It gives 75141 commits for one year period 2018-04-01 - 2019-04-01.
When analyzing all torvalds/linux commits stored in GitHub archives I can only get 10542 distinct SHAs. Maybe the problem is that Linux GitHub repository is only a mirror and most commits are not stored in GitHub API then?

@lukaszgryglicki
Copy link
Contributor Author

I've saved all single events JSONs which refer to torvalds/linux into a big JSON file (it is about 600M in size). There are many SHAs that are not present in this JSON. Google BigQuery also don't return them - but it uses githubarchive data-source, so I would expect the rootcause is here.
I know that there were problems with getting all events in the past (GitHub API limits) - is it possible that only 1/7 of Linux commits are recorded in GHA due to this?

@igrigorik
Copy link
Owner

Yes, I would expect some events to be missing, but no.. not 6/7th of them.

Can you share the query that you're running? Are you accounting for multiple commits in a single push?

@lukaszgryglicki
Copy link
Contributor Author

This is the query to count distinct SHAs (I'm taking all commits from push events, otherwise Linux only has about 700 pushes/year (while 75k commits for the same range - this is essential):

#standardSQL
create temp function get_shas(json string)
returns array<string>
language js as """
  try {
    return JSON.parse(json).commits.map(x=>x.sha);
  }
  catch(error) {
    return []
  }
""";
with pushes as (
  select
    get_shas(payload) as shas
  from
    `githubarchive.{{table}}`
  where
    type = 'PushEvent'
    and repo.name = 'torvalds/linux'
)
select
  count(distinct sha) as commits
from (
  select
    sha
  from
    pushes
  cross join
    unnest(pushes.shas) as sha
  where
    sha is not null
)

-- Returns 9841 for a year range {[table}} -> year.2018.

And this returns list of distinct SHAs for a given table and condition (which can be for example filter by repo.name = torvalds/linux):

#standardSQL
create temp function get_shas(json string)
returns array<string>
language js as """
  try {
    return JSON.parse(json).commits.map(x=>x.sha);
  }
  catch(error) {
    return []
  }
""";
with pushes as (
  select
    get_shas(payload) as shas
  from
    `githubarchive.{{table}}`
  where
    type = 'PushEvent'
    and {{cond}}
)
select
  distinct sha
from
  pushes
cross join
  unnest(pushes.shas) as sha
order by
  sha

But I was also grepping all JSON files (all that refer to torvalds/linux) and many many SHAs were not found in full JSON body.

@igrigorik
Copy link
Owner

Hmm, this is an interesting puzzle.

I don't see any obvious issues with the query. Curious, have you tried capturing the error instead of just returning an empty array? Wondering if we're swallowing any errors there that we ought to pay attention to.

Re, grepping JSON: do you see a difference in counts between JSON archives and BigQuery?

@lukaszgryglicki
Copy link
Contributor Author

lukaszgryglicki commented Apr 11, 2019

The error returning [] was for event types other than 'PushEvent' (so there was no commits field).
The BQ number of commits and number of commits from DevStats (which parses all GHA JSON files) are "almost" equal (the difference is minute, less than 0.5% and that was probably due to slightly different date ranges I've checked recently). If needed I can check for the exact data later (it takes more than 1TB to process) but really DevStats returned about 10250 for a 4/2018-4/2019 and BQ returned 9840 for the year 2018 AFAIk.
git returns 70-80k so the difference is huge.

@igrigorik
Copy link
Owner

Gotcha, thanks for the context.

Have you tried looking at some of the missing commits: are there any common patterns or distinctions for those vs. what's in the archives? Are there time gaps, or maybe other distinctions?

@lukaszgryglicki
Copy link
Contributor Author

No I didn't make such an analysis, sorry.

@chillu
Copy link

chillu commented Sep 12, 2020

The push event docs state:

The array includes a maximum of 20 commits

If devs on this repo have a habit of pushing up larger amounts of commits at the same time, this might explain the discrepancy. You could do some spot checks through the commits API to compare push event payloads and actual commit counts.

@xloem
Copy link

xloem commented Mar 23, 2023

The precise date and time and hash of any missing commit would likely help diagnose this precisely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants