refactor(benchmarks) Update dependencies for LLM evaluation pipelines (#4910)
adap · Feb 5, 2025 · 7f14165
1 parent 0538cb6

Showing 8 changed files with 15 additions and 15 deletions.
diff --git a/benchmarks/flowertune-llm/evaluation/code/README.md b/benchmarks/flowertune-llm/evaluation/code/README.md
@@ -12,7 +12,7 @@ Three datasets have been selected for this evaluation: [MBPP](https://huggingfac
 git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/code ./flowertune-eval-code && rm -rf flower && cd flowertune-eval-code
 ```
 
-Create a new Python environment (we recommend Python 3.10), activate it, then install dependencies with:
+Create a new Python environment (we recommend Python 3.11), activate it, then install dependencies with:
 
 ```shell
 # From a new python environment, run:
@@ -40,7 +40,7 @@ sudo apt-get install g++
 Then, download the `main.py` script from `bigcode-evaluation-harness` repository.
 
 ```shell
-git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git && cd bigcode-evaluation-harness && git checkout 0f3e95f0806e78a4f432056cdb1be93604a51d69 && mv main.py ../ && cd .. && rm -rf bigcode-evaluation-harness
+git clone https://github.com/yan-gao-GY/bigcode-evaluation-harness.git && cd bigcode-evaluation-harness && mv main.py ../ && cd .. && rm -rf bigcode-evaluation-harness
 ```
 
 
@@ -51,7 +51,7 @@ git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git && c
 
 ```bash
 python main.py \
---model=mistralai/Mistral-7B-v0.3 \
+--model=your-base-model-name \ # e.g., mistralai/Mistral-7B-v0.3
 --peft_model=/path/to/fine-tuned-peft-model-dir/  \ # e.g., ./peft_1
 --max_length_generation=1024  \ # change to 2048 when running mbpp
 --batch_size=4 \

diff --git a/benchmarks/flowertune-llm/evaluation/code/requirements.txt b/benchmarks/flowertune-llm/evaluation/code/requirements.txt
@@ -1,8 +1,8 @@
-peft==0.6.2
+peft==0.14.0
 datasets==2.20.0
 evaluate==0.3.0
 sentencepiece==0.2.0
 protobuf==5.27.1
-bitsandbytes==0.45.0
+bitsandbytes==0.45.1
 hf_transfer==0.1.8
-git+https://github.com/bigcode-project/bigcode-evaluation-harness.git@6116c6a9a5672c69bd624373cfbc8938b7acc249
+git+https://github.com/yan-gao-GY/bigcode-evaluation-harness.git
diff --git a/benchmarks/flowertune-llm/evaluation/finance/README.md b/benchmarks/flowertune-llm/evaluation/finance/README.md
@@ -10,7 +10,7 @@ Three datasets have been selected for this evaluation: [FPB](https://huggingface
 git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/finance ./flowertune-eval-finance && rm -rf flower && cd flowertune-eval-finance
 ```
 
-Create a new Python environment (we recommend Python 3.10), activate it, then install dependencies with:
+Create a new Python environment (we recommend Python 3.11), activate it, then install dependencies with:
 
 ```shell
 # From a new python environment, run:

diff --git a/benchmarks/flowertune-llm/evaluation/finance/requirements.txt b/benchmarks/flowertune-llm/evaluation/finance/requirements.txt
@@ -1,7 +1,7 @@
-peft==0.6.2
+peft==0.14.0
 scikit-learn==1.5.0
 datasets==2.20.0
 sentencepiece==0.2.0
 protobuf==5.27.1
-bitsandbytes==0.45.0
+bitsandbytes==0.45.1
 hf_transfer==0.1.8
diff --git a/benchmarks/flowertune-llm/evaluation/general-nlp/README.md b/benchmarks/flowertune-llm/evaluation/general-nlp/README.md
@@ -10,7 +10,7 @@ The [MMLU](https://huggingface.co/datasets/lukaemon/mmlu) dataset is used for th
 git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/general-nlp ./flowertune-eval-general-nlp && rm -rf flower && cd flowertune-eval-general-nlp
 ```
 
-Create a new Python environment (we recommend Python 3.10), activate it, then install dependencies with:
+Create a new Python environment (we recommend Python 3.11), activate it, then install dependencies with:
 
 ```shell
 # From a new python environment, run:

diff --git a/benchmarks/flowertune-llm/evaluation/general-nlp/requirements.txt b/benchmarks/flowertune-llm/evaluation/general-nlp/requirements.txt
@@ -1,8 +1,8 @@
-peft==0.6.2
+peft==0.14.0
 pandas==2.2.2
 scikit-learn==1.5.0
 datasets==2.20.0
 sentencepiece==0.2.0
 protobuf==5.27.1
-bitsandbytes==0.45.0
+bitsandbytes==0.45.1
 hf_transfer==0.1.8
diff --git a/benchmarks/flowertune-llm/evaluation/medical/README.md b/benchmarks/flowertune-llm/evaluation/medical/README.md
@@ -10,7 +10,7 @@ Three datasets have been selected for this evaluation: [PubMedQA](https://huggin
 git clone --depth=1 https://github.com/adap/flower.git && mv flower/benchmarks/flowertune-llm/evaluation/medical ./flowertune-eval-medical && rm -rf flower && cd flowertune-eval-medical
 ```
 
-Create a new Python environment (we recommend Python 3.10), activate it, then install dependencies with:
+Create a new Python environment (we recommend Python 3.11), activate it, then install dependencies with:
 
 ```shell
 # From a new python environment, run:

diff --git a/benchmarks/flowertune-llm/evaluation/medical/requirements.txt b/benchmarks/flowertune-llm/evaluation/medical/requirements.txt
@@ -1,8 +1,8 @@
-peft==0.6.2
+peft==0.14.0
 pandas==2.2.2
 scikit-learn==1.5.0
 datasets==2.20.0
 sentencepiece==0.2.0
 protobuf==5.27.1
-bitsandbytes==0.45.0
+bitsandbytes==0.45.1
 hf_transfer==0.1.8
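All four `requirements.txt` files in this commit move to `peft==0.14.0` and `bitsandbytes==0.45.1`. The following is a minimal shell sketch for checking that such pins stay in sync across the evaluation pipelines. The helper names `pinned_version` and `pins_consistent` are hypothetical and not part of the Flower repository, and the script assumes each pin is written as a plain `name==version` line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the Flower repo) for comparing pinned
# versions across several requirements.txt files.
# Assumes pins are plain "name==version" lines at the start of a line.

# Print the version pinned for package $1 in requirements file $2.
# Prints nothing if the package is not pinned with "==".
pinned_version() {
  local pkg="$1" file="$2"
  # "peft==0.14.0" split on '=' gives "peft", "", "0.14.0", so the version is field 3.
  grep -E "^${pkg}==" "$file" | head -n 1 | cut -d '=' -f 3
}

# Check that package $1 has the same non-empty pin in every file listed after it.
# On success prints "pkg==version" and returns 0; otherwise reports to stderr and returns 1.
pins_consistent() {
  local pkg="$1"; shift
  local expected="" file version
  if [ "$#" -eq 0 ]; then
    echo "no requirements files given" >&2
    return 1
  fi
  for file in "$@"; do
    version="$(pinned_version "$pkg" "$file")"
    if [ -z "$version" ]; then
      echo "$file: $pkg is not pinned" >&2
      return 1
    fi
    if [ -z "$expected" ]; then
      # The first file sets the version every other file must match.
      expected="$version"
    elif [ "$version" != "$expected" ]; then
      echo "$file: $pkg==$version (expected $expected)" >&2
      return 1
    fi
  done
  echo "$pkg==$expected"
}
```

For example, after sourcing these functions from the repository root, `pins_consistent peft benchmarks/flowertune-llm/evaluation/{code,finance,general-nlp,medical}/requirements.txt` should print `peft==0.14.0`, based on the pins shown in the diff above. A failing check names the first file whose pin differs, which makes it easy to spot a pipeline that was missed in a dependency bump.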