Extend lm_eval functionality #1729

Open · wants to merge 58 commits into main
Conversation

12010486 (Contributor)

What does this PR do?

Extends run_lm_eval.py to support recent tasks such as MMLU, MMLU-Pro, IFEval, and GSM8K.
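
For reference, a minimal sketch of how these tasks map onto the lm-eval harness Python API that run_lm_eval.py builds on; the model name, batch size, and task list below are illustrative and may differ from the flags the script actually exposes:

    # Hedged sketch: evaluate the newly supported tasks through the lm-eval
    # harness API. Model name and batch size are placeholders.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-3.2-1B",
        tasks=["mmlu", "mmlu_pro", "ifeval", "gsm8k"],
        batch_size=8,
    )
    print(results["results"])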

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@12010486 requested a review from regisss as a code owner on January 27, 2025 19:09
@12010486 (Contributor Author) commented Jan 27, 2025

@yafshar, @alexey-belyakov this is part II

@yafshar (Contributor) commented Jan 29, 2025

Can you add the examples you tried to the PR README?

@12010486 (Contributor Author) commented Feb 3, 2025

Related to the examples to add, @yafshar: I've added llama-3.2-1B. For this model (without Instruct) the accuracy we get matches what is reported at https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct (i.e., 32.2), but I've noticed that for other Llama models it is sometimes 2-3 points lower, so I'm checking whether everything is correct. Once that is done, I will also add an example for ifeval. I believe we should remove older examples rather than simply add another couple. What are your thoughts?

@yafshar (Contributor) commented Feb 4, 2025

@12010486 please let me know when the PR is ready for review! Thanks

@12010486 (Contributor Author) commented Feb 5, 2025

@yafshar ack, I will ping you. In the background I'm adding needed or useful parameters, but I'm blocked on the ifeval task, as its accuracies are still not reproducible with our code.

    logger.info(
        f"Model type is '{self._config.model_type}', part of the Gemma family--a BOS token will be used as Gemma underperforms without it."
    )

    self._max_length = options.max_length
yafshar (Contributor)

@12010486 is this correct? You already set this in the previous PR, so the superclass max_length uses it, but the current derived class has a different max_length:

    @property
    def max_length(self) -> int:
        return self.buckets[-1]

12010486 (Contributor Author)

Good catch @yafshar. We need the property to be defined that way, or else max_length would change every time, causing OOM. I can remove line 192, though.
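
For clarity, a minimal sketch of the override being discussed; class and argument names are hypothetical, and only buckets and the max_length property come from the diff:

    class BucketedWrapper:  # hypothetical class name
        def __init__(self, buckets, options_max_length=None):
            self.buckets = sorted(buckets)
            # Redundant with the property below: keeping both means code reading
            # _max_length and code reading max_length can disagree.
            # self._max_length = options_max_length

        @property
        def max_length(self) -> int:
            # Pin to the largest bucket so the effective sequence length stays
            # constant across calls instead of drifting (and causing OOM).
            return self.buckets[-1]

    wrapper = BucketedWrapper(buckets=[128, 256, 512])
    assert wrapper.max_length == 512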

@veritas9872

Hello! Thank you for the progress on this update!
Some information that may be helpful:
First, the scores on the Llama website differ from the scores obtained by the LM Eval harness because Meta uses its own internal code for measuring metrics. I have also found that scores drop relative to the ones reported by Meta.
A good way to check might be to compare the scores obtained through a vLLM API with the scores obtained from Hugging Face models.
Also, some variation in scores is inevitable; I have found that results from A100 and H100 devices differ as well.
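
As a rough illustration of that cross-check (a sketch; the model name and task are placeholders, and the vLLM backend requires the corresponding extra to be installed):

    import lm_eval

    # Run the same task through both backends; a large gap points at a
    # harness or runtime issue rather than ordinary run-to-run variation.
    for backend in ("hf", "vllm"):
        out = lm_eval.simple_evaluate(
            model=backend,
            model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",
            tasks=["mmlu"],
            batch_size=8,
        )
        print(backend, out["results"]["mmlu"])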

@veritas9872

One minor comment concerning GSM8K.
gsm8k_cot_llama should be used instead of gsm8k_cot to reproduce the Llama results, because Meta uses a slightly different method for GSM8K.
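
For example (a sketch; whether these keyword arguments are available depends on the harness version, and the model name is illustrative):

    import lm_eval

    # gsm8k_cot_llama follows Meta's prompting for GSM8K; plain gsm8k_cot
    # uses a different prompt format and may not reproduce the reported numbers.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",
        tasks=["gsm8k_cot_llama"],
        apply_chat_template=True,
        fewshot_as_multiturn=True,
    )
    print(results["results"]["gsm8k_cot_llama"])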

@12010486 (Contributor Author) commented Feb 7, 2025

@veritas9872, very useful comments! Indeed, using gsm8k_cot_llama really boosted the exact-match accuracy. As for the score differences, I'm fine with tiny variations (it is hard to control all sources of randomness, especially when changing the batch size and the number of cards), but it still seems a bit too much in our case, so I'm double-checking step by step.
