Cannot fine-tune LLM without GPU - CUDA error and DDP initialization #2371

Open · thuytrang32 opened this issue Jan 7, 2025 · 16 comments

@thuytrang32

What happened?

I am trying to fine-tune an LLM using Kubeflow without GPU devices. However, I encountered two issues during the process:

  • When I removed the gpu key from resources_per_worker, the training job still attempted to allocate GPUs, resulting in CUDA error: invalid device ordinal:
    [screenshot]

  • To address this, I tried adding ddp_backend="gloo" to training_parameters. However, this led to another error:
    [screenshot]

I followed these instructions: https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/. This is the code I ran:

import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    storage_config={
        "size": "5Gi",
        "storage_class": "nfs-client",
    },
    # BERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use 100 samples from the Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we skip
    # evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
            ddp_backend="gloo",
        ),
        # Set LoRA config to reduce the number of trainable model parameters.
        lora_config=LoraConfig(
            r=8,
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    num_workers=2,  # nnodes parameter for the torchrun command.
    num_procs_per_worker=16,  # nproc-per-node parameter for the torchrun command.
    resources_per_worker={
        # "gpu": 0,  # removed, since no GPUs are available
        "cpu": 16,
        "memory": "16G",
    },
)

What did you expect to happen?

The training job should correctly initialize without attempting to allocate GPUs.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.29.6+k3s2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.6+k3s2

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:latest

Training Operator Python SDK version:

$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.8.1
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by: 


@thuytrang32 (Author)

I also get the error [rank0]: ValueError: Please specify target_modules in peft_config. I tried deleting the LoRA config, but the error persists.
[screenshots]

import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    storage_config={
        "size": "5Gi",
        "storage_class": "nfs-client",
    },
    # Model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://distilbert/distilbert-base-uncased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use 100 samples from the Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we skip
    # evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
            # ddp_backend="gloo",
        ),
        # LoRA config removed while debugging the target_modules error.
        # lora_config=LoraConfig(
        #     r=8,
        #     lora_alpha=8,
        #     lora_dropout=0.1,
        #     bias="none",
        #     target_modules=["encoder.layer.*.attention.self.query", "encoder.layer.*.attention.self.key"],
        # ),
    ),
    num_workers=2,  # nnodes parameter for the torchrun command.
    num_procs_per_worker=20,  # nproc-per-node parameter for the torchrun command.
    resources_per_worker={
        "cpu": 20,
        "memory": "20G",
    },
)

@andreyvelich (Member)

Thank you for creating this!
For the first error, can you please check the PyTorchJob? It should be created without GPU resources.

kubectl get pytorchjob -n <NAMESPACE> -o yaml

@thuytrang32 (Author)

Thank you for creating this! For the first error, can you please check the PyTorchJob? It should be created without GPU resources.

kubectl get pytorchjob -n <NAMESPACE> -o yaml

Hi, this is the output:
(base) jovyan@ex-0:~$ kubectl get pytorchjobs -n kubeflow-user-example-com -o yaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    creationTimestamp: "2025-01-07T14:10:47Z"
    generation: 1
    name: fine-tune-bert
    namespace: kubeflow-user-example-com
    resourceVersion: "580563"
    uid: 72ed106c-5299-4de3-9f27-8e5464d4e59b
  spec:
    nprocPerNode: "20"
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - args:
              - --model_uri
              - hf://distilbert/distilbert-base-uncased
              - --transformer_type
              - AutoModelForSequenceClassification
              - --num_labels
              - None
              - --model_dir
              - /workspace/model
              - --dataset_dir
              - /workspace/dataset
              - --lora_config
              - '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type":
                null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha":
                null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
                "modules_to_save": null, "init_lora_weights": true}'
              - --training_parameters
              - '{"output_dir": "test_trainer", "overwrite_output_dir": false, "do_train":
                false, "do_eval": false, "do_predict": false, "evaluation_strategy":
                "no", "prediction_loss_only": false, "per_device_train_batch_size":
                8, "per_device_eval_batch_size": 8, "per_gpu_train_batch_size": null,
                "per_gpu_eval_batch_size": null, "gradient_accumulation_steps": 1,
                "eval_accumulation_steps": null, "eval_delay": 0, "learning_rate":
                5e-05, "weight_decay": 0.0, "adam_beta1": 0.9, "adam_beta2": 0.999,
                "adam_epsilon": 1e-08, "max_grad_norm": 1.0, "num_train_epochs": 3.0,
                "max_steps": -1, "lr_scheduler_type": "linear", "lr_scheduler_kwargs":
                {}, "warmup_ratio": 0.0, "warmup_steps": 0, "log_level": "info", "log_level_replica":
                "warning", "log_on_each_node": true, "logging_dir": "test_trainer/runs/Jan07_14-10-47_ex-0",
                "logging_strategy": "steps", "logging_first_step": false, "logging_steps":
                500, "logging_nan_inf_filter": true, "save_strategy": "no", "save_steps":
                500, "save_total_limit": null, "save_safetensors": true, "save_on_each_node":
                false, "save_only_model": false, "no_cuda": false, "use_cpu": false,
                "use_mps_device": false, "seed": 42, "data_seed": null, "jit_mode_eval":
                false, "use_ipex": false, "bf16": false, "fp16": false, "fp16_opt_level":
                "O1", "half_precision_backend": "auto", "bf16_full_eval": false, "fp16_full_eval":
                false, "tf32": null, "local_rank": 0, "ddp_backend": null, "tpu_num_cores":
                null, "tpu_metrics_debug": false, "debug": [], "dataloader_drop_last":
                false, "eval_steps": null, "dataloader_num_workers": 0, "dataloader_prefetch_factor":
                null, "past_index": -1, "run_name": "test_trainer", "disable_tqdm":
                true, "remove_unused_columns": true, "label_names": null, "load_best_model_at_end":
                false, "metric_for_best_model": null, "greater_is_better": null, "ignore_data_skip":
                false, "fsdp": [], "fsdp_min_num_params": 0, "fsdp_config": {"min_num_params":
                0, "xla": false, "xla_fsdp_v2": false, "xla_fsdp_grad_ckpt": false},
                "fsdp_transformer_layer_cls_to_wrap": null, "accelerator_config":
                {"split_batches": false, "dispatch_batches": null, "even_batches":
                true, "use_seedable_sampler": true}, "deepspeed": null, "label_smoothing_factor":
                0.0, "optim": "adamw_torch", "optim_args": null, "adafactor": false,
                "group_by_length": false, "length_column_name": "length", "report_to":
                [], "ddp_find_unused_parameters": null, "ddp_bucket_cap_mb": null,
                "ddp_broadcast_buffers": null, "dataloader_pin_memory": true, "dataloader_persistent_workers":
                false, "skip_memory_metrics": true, "use_legacy_prediction_loop":
                false, "push_to_hub": false, "resume_from_checkpoint": null, "hub_model_id":
                null, "hub_strategy": "every_save", "hub_token": "<HUB_TOKEN>", "hub_private_repo":
                false, "hub_always_push": false, "gradient_checkpointing": false,
                "gradient_checkpointing_kwargs": null, "include_inputs_for_metrics":
                false, "fp16_backend": "auto", "push_to_hub_model_id": null, "push_to_hub_organization":
                null, "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>", "mp_parameters":
                "", "auto_find_batch_size": false, "full_determinism": false, "torchdynamo":
                null, "ray_scope": "last", "ddp_timeout": 1800, "torch_compile": false,
                "torch_compile_backend": null, "torch_compile_mode": null, "dispatch_batches":
                null, "split_batches": null, "include_tokens_per_second": false, "include_num_input_tokens_seen":
                false, "neftune_noise_alpha": null}'
              image: docker.io/kubeflow/trainer-huggingface
              name: pytorch
              resources:
                limits:
                  cpu: 20
                  memory: 20G
                requests:
                  cpu: 20
                  memory: 20G
              volumeMounts:
              - mountPath: /workspace
                name: storage-initializer
            initContainers:
            - args:
              - --model_provider
              - hf
              - --model_provider_parameters
              - '{"model_uri": "hf://distilbert/distilbert-base-uncased", "transformer_type":
                "AutoModelForSequenceClassification", "access_token": null, "num_labels":
                null}'
              - --dataset_provider
              - hf
              - --dataset_provider_parameters
              - '{"repo_id": "yelp_review_full", "access_token": null, "split": "train[:100]"}'
              image: docker.io/kubeflow/storage-initializer
              name: storage-initializer
              volumeMounts:
              - mountPath: /workspace
                name: storage-initializer
            volumes:
            - name: storage-initializer
              persistentVolumeClaim:
                claimName: storage-initializer
      Worker:
        replicas: 1
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - args:
              - --model_uri
              - hf://distilbert/distilbert-base-uncased
              - --transformer_type
              - AutoModelForSequenceClassification
              - --num_labels
              - None
              - --model_dir
              - /workspace/model
              - --dataset_dir
              - /workspace/dataset
              - --lora_config
              - '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type":
                null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha":
                null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
                "modules_to_save": null, "init_lora_weights": true}'
              - --training_parameters
              - '{"output_dir": "test_trainer", "overwrite_output_dir": false, "do_train":
                false, "do_eval": false, "do_predict": false, "evaluation_strategy":
                "no", "prediction_loss_only": false, "per_device_train_batch_size":
                8, "per_device_eval_batch_size": 8, "per_gpu_train_batch_size": null,
                "per_gpu_eval_batch_size": null, "gradient_accumulation_steps": 1,
                "eval_accumulation_steps": null, "eval_delay": 0, "learning_rate":
                5e-05, "weight_decay": 0.0, "adam_beta1": 0.9, "adam_beta2": 0.999,
                "adam_epsilon": 1e-08, "max_grad_norm": 1.0, "num_train_epochs": 3.0,
                "max_steps": -1, "lr_scheduler_type": "linear", "lr_scheduler_kwargs":
                {}, "warmup_ratio": 0.0, "warmup_steps": 0, "log_level": "info", "log_level_replica":
                "warning", "log_on_each_node": true, "logging_dir": "test_trainer/runs/Jan07_14-10-47_ex-0",
                "logging_strategy": "steps", "logging_first_step": false, "logging_steps":
                500, "logging_nan_inf_filter": true, "save_strategy": "no", "save_steps":
                500, "save_total_limit": null, "save_safetensors": true, "save_on_each_node":
                false, "save_only_model": false, "no_cuda": false, "use_cpu": false,
                "use_mps_device": false, "seed": 42, "data_seed": null, "jit_mode_eval":
                false, "use_ipex": false, "bf16": false, "fp16": false, "fp16_opt_level":
                "O1", "half_precision_backend": "auto", "bf16_full_eval": false, "fp16_full_eval":
                false, "tf32": null, "local_rank": 0, "ddp_backend": null, "tpu_num_cores":
                null, "tpu_metrics_debug": false, "debug": [], "dataloader_drop_last":
                false, "eval_steps": null, "dataloader_num_workers": 0, "dataloader_prefetch_factor":
                null, "past_index": -1, "run_name": "test_trainer", "disable_tqdm":
                true, "remove_unused_columns": true, "label_names": null, "load_best_model_at_end":
                false, "metric_for_best_model": null, "greater_is_better": null, "ignore_data_skip":
                false, "fsdp": [], "fsdp_min_num_params": 0, "fsdp_config": {"min_num_params":
                0, "xla": false, "xla_fsdp_v2": false, "xla_fsdp_grad_ckpt": false},
                "fsdp_transformer_layer_cls_to_wrap": null, "accelerator_config":
                {"split_batches": false, "dispatch_batches": null, "even_batches":
                true, "use_seedable_sampler": true}, "deepspeed": null, "label_smoothing_factor":
                0.0, "optim": "adamw_torch", "optim_args": null, "adafactor": false,
                "group_by_length": false, "length_column_name": "length", "report_to":
                [], "ddp_find_unused_parameters": null, "ddp_bucket_cap_mb": null,
                "ddp_broadcast_buffers": null, "dataloader_pin_memory": true, "dataloader_persistent_workers":
                false, "skip_memory_metrics": true, "use_legacy_prediction_loop":
                false, "push_to_hub": false, "resume_from_checkpoint": null, "hub_model_id":
                null, "hub_strategy": "every_save", "hub_token": "<HUB_TOKEN>", "hub_private_repo":
                false, "hub_always_push": false, "gradient_checkpointing": false,
                "gradient_checkpointing_kwargs": null, "include_inputs_for_metrics":
                false, "fp16_backend": "auto", "push_to_hub_model_id": null, "push_to_hub_organization":
                null, "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>", "mp_parameters":
                "", "auto_find_batch_size": false, "full_determinism": false, "torchdynamo":
                null, "ray_scope": "last", "ddp_timeout": 1800, "torch_compile": false,
                "torch_compile_backend": null, "torch_compile_mode": null, "dispatch_batches":
                null, "split_batches": null, "include_tokens_per_second": false, "include_num_input_tokens_seen":
                false, "neftune_noise_alpha": null}'
              image: docker.io/kubeflow/trainer-huggingface
              name: pytorch
              resources:
                limits:
                  cpu: 20
                  memory: 20G
                requests:
                  cpu: 20
                  memory: 20G
              volumeMounts:
              - mountPath: /workspace
                name: storage-initializer
            volumes:
            - name: storage-initializer
              persistentVolumeClaim:
                claimName: storage-initializer
    runPolicy:
      suspend: false
  status:
    conditions:
    - lastTransitionTime: "2025-01-07T14:10:47Z"
      lastUpdateTime: "2025-01-07T14:10:47Z"
      message: PyTorchJob fine-tune-bert is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2025-01-07T14:11:16Z"
      lastUpdateTime: "2025-01-07T14:11:16Z"
      message: PyTorchJob fine-tune-bert is running.
      reason: PyTorchJobRunning
      status: "True"
      type: Running
    replicaStatuses:
      Master:
        active: 1
        selector: training.kubeflow.org/job-name=fine-tune-bert,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=master
      Worker:
        active: 1
        selector: training.kubeflow.org/job-name=fine-tune-bert,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker
    startTime: "2025-01-07T14:10:47Z"
kind: List
metadata:
  resourceVersion: ""
(base) jovyan@ex-0:~$

@andreyvelich (Member)

So, as you can see, no GPU has been allocated to your PyTorch pods:

resources:
  limits:
    cpu: 20
    memory: 20G
  requests:
    cpu: 20
    memory: 20G

Locally, on Kind on macOS, I was able to run the example on CPU using the docker.io/kubeflow/trainer-huggingface image.

Where do you run your Kubernetes cluster?

@thuytrang32 (Author)

Yes, but it still shows this error when I check with kubectl logs fine-tune-bert-master-0 -n kubeflow-user-example-com:
[screenshot]

I ran this code in a Notebook in the Kubeflow UI.

@andreyvelich (Member)

Are you using a public cloud or on-prem to deploy the Kubeflow Control Plane?
@deepanker13 @helenxie-bit @johnugeorge @saileshd1402 Did you see these errors while running the train API on CPU-based instances?

@thuytrang32 (Author)

Are you using a public cloud or on-prem to deploy the Kubeflow Control Plane? @deepanker13 @helenxie-bit @johnugeorge @saileshd1402 Did you see these errors while running the train API on CPU-based instances?

I used Jarvice to create a Kubeflow instance.

@andreyvelich (Member)

Do you know which instances they run for the Kubernetes Nodes?
E.g., are they AMD Linux machines with CPUs?
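
(One way to check, assuming you have kubectl access to the cluster, is something like:

kubectl get nodes -o wide
kubectl describe nodes | grep -A 8 "System Info"

The first command prints each node's OS image and kernel version; the System Info section of the second includes the architecture.)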

@thuytrang32 (Author)

Sorry, I don't know.

@andreyvelich (Member)

@thuytrang32 Can you also try setting the use_cpu flag in the Trainer args?

trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        ddp_backend="gloo",
        use_cpu=True,
    ),
)

@thuytrang32 (Author)

@thuytrang32 Can you also try setting the use_cpu flag in the Trainer args?

trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        ddp_backend="gloo",
        use_cpu=True,
    ),
)

When I ran with both ddp_backend and use_cpu, it still had the old error:
[screenshot]

Then I tried running with use_cpu=True only, and the code passed. But when I checked with kubectl logs fine-tune-bert-master-0 -n kubeflow-user-example-com, it had these errors again:

[screenshot]

For kubectl describe pod fine-tune-bert-worker-0 -n kubeflow-user-example-com: because the worker has a GPU, it didn't have the CUDA error, but it still had this:

[screenshots]

@helenxie-bit (Contributor)

I ran the example from the tutorial (https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) on macOS with CPU only, and it worked as expected.

I suspect the CUDA error occurs because the trainer automatically detects the available device and tries to use CUDA when it's present. Can you try explicitly setting both the no_cuda and use_cpu flags, like this:

trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        no_cuda=True,
        use_cpu=True,
    ),
)

Regarding the error ValueError: Please specify 'target_modules' in 'peft_config': this likely occurs because the model you are using is not one of the standard architectures supported by PEFT (reference: huggingface/peft#2128 (comment)). You will need to define target_modules manually for your specific model; see the sketch below. Here's a relevant discussion that might help: https://stackoverflow.com/questions/76768226/target-modules-for-applying-peft-lora-on-different-models.
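
For instance, a minimal sketch for the DistilBERT model used above. The module names q_lin and v_lin are DistilBERT's attention query/value projections; verify the names against your own model before relying on them:

from peft import LoraConfig

# Sketch only: target DistilBERT's attention query/value projections.
# Other architectures use different module names, so adjust accordingly.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_lin", "v_lin"],
)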

@thuytrang32 (Author)

But target_modules is just a parameter of LoraConfig. Why does the error still exist even though I already tried removing lora_config=LoraConfig(...)?

@thuytrang32 (Author)

I ran the example from the tutorial (https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) on macOS with CPU only, and it worked as expected.

I suspect the CUDA error occurs because the trainer automatically detects the available device and tries to use CUDA when it's present. Can you try explicitly setting both the no_cuda and use_cpu flags, like this:

trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        no_cuda=True,
        use_cpu=True,
    ),
)

Regarding the error ValueError: Please specify 'target_modules' in 'peft_config': this likely occurs because the model you are using is not one of the standard architectures supported by PEFT (reference: huggingface/peft#2128 (comment)). You will need to define target_modules manually for your specific model. Here's a relevant discussion that might help: https://stackoverflow.com/questions/76768226/target-modules-for-applying-peft-lora-on-different-models.

It didn't work even though I set both no_cuda=True and use_cpu=True:
[screenshot]

And I saw that the BERT model is only 1.3GB. Why is the 20GB of memory I set per worker still not enough?
[screenshot]

@helenxie-bit (Contributor)

But target_modules is just a parameter of LoraConfig. Why does the error still exist even though I already tried removing lora_config=LoraConfig(...)?

Oh I see. It seems that when lora_config is not explicitly set, the API assigns its default values and passes them into the container, as shown in the output of kubectl get pytorchjob -n <NAMESPACE> -o yaml:

- --lora_config
- '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type": null,
  "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha": null,
  "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
  "modules_to_save": null, "init_lora_weights": true}'

As a result, the trainer still attempts to configure the PEFT model as indicated in the script:

def setup_peft_model(model, lora_config):
    # Set up the PEFT model
    lora_config = LoraConfig(**json.loads(lora_config))
    reference_lora_config = LoraConfig()
    for key, val in lora_config.__dict__.items():
        old_attr = getattr(reference_lora_config, key, None)
        if old_attr is not None:
            val = type(old_attr)(val)
        setattr(lora_config, key, val)
    model.enable_input_require_grads()
    model = get_peft_model(model, lora_config)
    return model

It seems that even without specifying lora_config, it is still included in the fine-tuning process. Could you try setting lora_config and explicitly defining target_modules to see if that resolves the issue? One way to pick candidate target_modules is sketched below.
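
A quick way to discover candidate target_modules (a sketch, assuming you can load the same model locally with torch and transformers installed) is to print the model's linear-layer names; the attention projections are the usual LoRA targets:

import torch
import transformers

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased"
)
# Print every Linear layer; for DistilBERT this includes q_lin, k_lin,
# v_lin, and out_lin inside each transformer block.
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name)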

Meanwhile, @andreyvelich, do you think this might be a bug?

@helenxie-bit (Contributor)

I ran the example from the tutorial (https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) on macOS with CPU only, and it worked as expected.
I suspect the CUDA error occurs because the trainer automatically detects the available device and tries to use CUDA when it's present. Can you try explicitly setting both the no_cuda and use_cpu flags, like this:

trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        no_cuda=True,
        use_cpu=True,
    ),
)

Regarding the error ValueError: Please specify 'target_modules' in 'peft_config': this likely occurs because the model you are using is not one of the standard architectures supported by PEFT (reference: huggingface/peft#2128 (comment)). You will need to define target_modules manually for your specific model. Here's a relevant discussion that might help: https://stackoverflow.com/questions/76768226/target-modules-for-applying-peft-lora-on-different-models.

It didn't work even though I set both no_cuda=True and use_cpu=True. [screenshot]

And I saw that the BERT model is only 1.3GB. Why is the 20GB of memory I set per worker still not enough? [screenshot]

For the CUDA issue, sorry, I don't have a solution at the moment. It might be related to the base image used in the trainer. @andreyvelich @deepanker13 @johnugeorge @saileshd1402 Do you have any insights on this?

Regarding the memory issue: the trainer image is quite large, so please ensure the device has at least 10GB of available memory; it could be a memory constraint on the device you're using. Could you confirm whether the device meets this requirement? One way to check is sketched below.
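
A sketch for checking a node's allocatable resources (substitute your actual node name; this assumes standard kubectl access):

kubectl describe node <node-name> | grep -A 7 Allocatable

The Allocatable section lists the CPU and memory that the scheduler can actually assign to pods on that node.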
