bug: crash on trying to process data when batched is set to true #450

Open
HarikrishnanBalagopal opened this issue Jan 25, 2025 · 1 comment

Comments

@HarikrishnanBalagopal
Contributor

This line in apply_dataset_formatting (tuning/data/data_handlers.py, line 96) fails when batched is set to true:

f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token

ERROR:sft_trainer.py:multiprocess.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/tuning/.local/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3558, in _map_single
    batch = apply_function_on_filtered_inputs(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3427, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/data/data_handlers.py", line 96, in apply_dataset_formatting
    f"{dataset_text_field}": element[f"{dataset_text_field}"] + tokenizer.eos_token
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
TypeError: can only concatenate list (not "str") to list
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 650, in main
    trainer, additional_train_info = train(
                                     ^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 317, in train
    ) = process_dataargs(
        ^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/data/setup_dataprocessor.py", line 348, in process_dataargs
    train_dataset, eval_dataset, dataset_text_field = _process_dataconfig_file(
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/data/setup_dataprocessor.py", line 71, in _process_dataconfig_file
    train_dataset = processor.process_dataset_configs(data_config.datasets)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/data/data_processors.py", line 322, in process_dataset_configs
    train_dataset = self._process_dataset_configs(dataset_configs, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/data/data_processors.py", line 273, in _process_dataset_configs
    raw_datasets = raw_datasets.map(handler, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/dataset_dict.py", line 869, in map
    {
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/dataset_dict.py", line 870, in <dictcomp>
    k: dataset.map(
       ^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3259, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 718, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/home/tuning/.local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 718, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
TypeError: can only concatenate list (not "str") to list
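
For context, the TypeError above is consistent with how batched mapping passes data: with batched set to true, each field of the element is a list of values rather than a single string, so appending the EOS token with `+` fails. Below is a minimal sketch of a handler that tolerates both shapes. The function name `append_eos_sketch`, the `"output"` field, and the `"</s>"` token are hypothetical choices for illustration; this is not the project's handler or its planned fix.

```python
from typing import Any, Dict


def append_eos_sketch(
    element: Dict[str, Any], text_field: str, eos_token: str
) -> Dict[str, Any]:
    """Append eos_token to the text field for a single example or a batch.

    Hypothetical helper: assumes the field holds either one string
    (non-batched) or a list of strings (batched).
    """
    value = element[text_field]
    if isinstance(value, list):
        # Batched case: append the token to every string in the list.
        return {text_field: [text + eos_token for text in value]}
    # Non-batched case: plain string concatenation works.
    return {text_field: value + eos_token}


# Reproduce the failure from the traceback with plain Python objects.
try:
    ["hello", "world"] + "</s>"
    failure_message = ""
except TypeError as err:
    failure_message = str(err)

single_result = append_eos_sketch({"output": "hello"}, "output", "</s>")
batched_result = append_eos_sketch({"output": ["hello", "world"]}, "output", "</s>")
```

The `isinstance` check keeps one function usable in both modes; an alternative would be separate handlers for batched and non-batched processing.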
@HarikrishnanBalagopal HarikrishnanBalagopal changed the title fails to process data when batched is set to true bug: crash on trying to process data when batched is set to true Jan 25, 2025
@dushyantbehl
Contributor

As discussed on Slack, this is not a bug: batched processing is not supported yet and will be part of a subsequent release.

Please refrain from using batched processing for now.
