Extend lm_eval functionality #1729

Open · wants to merge 58 commits into main
Conversation

12010486 (Contributor)

What does this PR do?

Extends run_lm_eval.py to support recent tasks such as MMLU, MMLU-Pro, IFEval, and GSM8K.
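
For reference, a minimal sketch of how these tasks map onto the lm-eval harness Python API that run_lm_eval.py builds on; the model name, batch size, and task list below are illustrative and may differ from the flags the script actually exposes:

    # Hedged sketch: evaluate the newly supported tasks through the lm-eval
    # harness API. Model name and batch size are placeholders.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-3.2-1B",
        tasks=["mmlu", "mmlu_pro", "ifeval", "gsm8k"],
        batch_size=8,
    )
    print(results["results"])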

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@12010486 requested a review from regisss as a code owner on January 27, 2025 19:09
@12010486 (Contributor Author) commented Jan 27, 2025

@yafshar, @alexey-belyakov this is part II

@yafshar (Contributor) commented Jan 29, 2025

Can you add the examples you tried to the PR README?

@12010486 (Contributor Author) commented Feb 3, 2025

Related to the examples to add, @yafshar: I've added llama-3.2-1B. For this model (without Instruct) the accuracy we get matches what is reported at https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct (i.e., 32.2), but I've noticed that for other Llama models it is sometimes 2-3 points lower, so I'm checking whether everything is correct. Once that is done, I will also add an example for ifeval. I believe we should remove older examples rather than simply add another couple. What are your thoughts?

@yafshar (Contributor) commented Feb 4, 2025

@12010486 please let me know when the PR is ready for review! Thanks

@12010486 (Contributor Author) commented Feb 5, 2025

@yafshar ack, I will ping you. In the background I'm adding needed or useful parameters, but I'm blocked on the ifeval task, as its accuracies are still not reproducible with our code.

    logger.info(
        f"Model type is '{self._config.model_type}', part of the Gemma family--a BOS token will be used as Gemma underperforms without it."
    )

    self._max_length = options.max_length
yafshar (Contributor)

@12010486 is this correct? You already set this in the previous PR, so the superclass max_length uses it, but the current derived class has a different max_length:

    @property
    def max_length(self) -> int:
        return self.buckets[-1]

12010486 (Contributor Author)

Good catch @yafshar. We need the property to be defined that way, or else max_length would change every time, causing OOM. I can remove line 192, though.
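
For clarity, a minimal sketch of the override being discussed; class and argument names are hypothetical, and only buckets and the max_length property come from the diff:

    class BucketedWrapper:  # hypothetical class name
        def __init__(self, buckets, options_max_length=None):
            self.buckets = sorted(buckets)
            # Redundant with the property below: keeping both means code reading
            # _max_length and code reading max_length can disagree.
            # self._max_length = options_max_length

        @property
        def max_length(self) -> int:
            # Pin to the largest bucket so the effective sequence length stays
            # constant across calls instead of drifting (and causing OOM).
            return self.buckets[-1]

    wrapper = BucketedWrapper(buckets=[128, 256, 512])
    assert wrapper.max_length == 512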

@veritas9872

Hello! Thank you for the progress on this update!
Some information that may be helpful:
First, the scores on the Llama website differ from the scores obtained by the LM Eval harness because Meta uses its own internal code for measuring metrics. I have also found that scores drop relative to the ones reported by Meta.
A good way to check might be to compare the scores obtained through a vLLM API with the scores obtained from Hugging Face models.
Also, some variation in scores is inevitable; I have found that results from A100 and H100 devices differ as well.
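
As a rough illustration of that cross-check (a sketch; the model name and task are placeholders, and the vLLM backend requires the corresponding extra to be installed):

    import lm_eval

    # Run the same task through both backends; a large gap points at a
    # harness or runtime issue rather than ordinary run-to-run variation.
    for backend in ("hf", "vllm"):
        out = lm_eval.simple_evaluate(
            model=backend,
            model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",
            tasks=["mmlu"],
            batch_size=8,
        )
        print(backend, out["results"]["mmlu"])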

@veritas9872

One minor comment concerning GSM8K.
gsm8k_cot_llama should be used instead of gsm8k_cot to reproduce the Llama results, because Meta uses a slightly different method for GSM8K.
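
For example (a sketch; whether these keyword arguments are available depends on the harness version, and the model name is illustrative):

    import lm_eval

    # gsm8k_cot_llama follows Meta's prompting for GSM8K; plain gsm8k_cot
    # uses a different prompt format and may not reproduce the reported numbers.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-3.2-1B-Instruct",
        tasks=["gsm8k_cot_llama"],
        apply_chat_template=True,
        fewshot_as_multiturn=True,
    )
    print(results["results"]["gsm8k_cot_llama"])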

@12010486 (Contributor Author) commented Feb 7, 2025

@veritas9872, very useful comments! Indeed, using gsm8k_cot_llama really boosted the exact-match accuracy. As for the score differences, I'm fine with tiny variations (it is hard to control all sources of randomness, especially when changing the batch size and the number of cards), but it still seems a bit too much in our case, so I'm double-checking step by step.
