Initial support for Pipeline Parallelism #279

Merged: 87 commits merged into main from initial_pp on Jan 23, 2024

Commits (87)
4bdf600
Refactor and creation of PipelineParallelismSpecs
michaelbenayoun Oct 30, 2023
92b8253
Refactoring
michaelbenayoun Oct 30, 2023
e394ec5
[WIP] initial support for pp
michaelbenayoun Oct 30, 2023
2920df7
[WIP] initial support for pp
michaelbenayoun Oct 31, 2023
1b82fbc
[WIP] initial support for pp
michaelbenayoun Nov 2, 2023
4712e95
[WIP] initial support for pp
michaelbenayoun Nov 2, 2023
0c55877
[WIP] initial support for pp
michaelbenayoun Nov 7, 2023
0acf510
[WIP] initial support for pp
michaelbenayoun Nov 7, 2023
3ea12dd
[WIP] initial support for pp
michaelbenayoun Nov 8, 2023
2fd6abf
Update examples
michaelbenayoun Nov 10, 2023
4fb51ee
[WIP] add tests
michaelbenayoun Nov 10, 2023
c74b724
Add PP to test_examples.py
michaelbenayoun Nov 14, 2023
6aac412
Merge branch 'main' into initial_pp
michaelbenayoun Nov 14, 2023
d0df211
[WIP] fix TP + PP training
michaelbenayoun Nov 15, 2023
a4cc66c
Merge branch 'main' into initial_pp
michaelbenayoun Nov 28, 2023
959b3b0
Style
michaelbenayoun Nov 28, 2023
1ef90b8
[WIP]
michaelbenayoun Nov 29, 2023
cbdf51f
Refactor Mistral for sequence parallelism
michaelbenayoun Nov 29, 2023
0571524
Add DistributedTest class
michaelbenayoun Nov 29, 2023
f57a210
[WIP] tests
michaelbenayoun Nov 29, 2023
017bbbd
Refactor
michaelbenayoun Nov 30, 2023
ce6e4ac
[WIP] tests
michaelbenayoun Nov 30, 2023
3e6586f
[WIP] tests
michaelbenayoun Nov 30, 2023
01cf4cd
DistributedTest works
michaelbenayoun Dec 1, 2023
ef25839
[WIP] tests
michaelbenayoun Dec 4, 2023
43550ba
[WIP] tests
michaelbenayoun Dec 5, 2023
db939b0
[WIP] tests
michaelbenayoun Dec 5, 2023
650771e
[WIP] tests
michaelbenayoun Dec 6, 2023
2ad63a0
test_common almost done
michaelbenayoun Dec 6, 2023
9f912be
[WIP] tests
michaelbenayoun Dec 7, 2023
0f7abd8
[WIP] tests
michaelbenayoun Dec 8, 2023
52d01af
[WIP] tests
michaelbenayoun Dec 12, 2023
1f9df87
[WIP] tests
michaelbenayoun Dec 12, 2023
269f17b
Small cleanup
michaelbenayoun Dec 12, 2023
f51ad74
Clean tests
michaelbenayoun Dec 12, 2023
ba1137f
Styling
michaelbenayoun Dec 12, 2023
5e889a2
[WIP] tests
michaelbenayoun Dec 13, 2023
730efb4
[WIP] tests
michaelbenayoun Dec 13, 2023
2905b05
Styling
michaelbenayoun Dec 13, 2023
b967840
[WIP] tests
michaelbenayoun Dec 14, 2023
54f2de7
Merge branch 'main' into initial_pp
michaelbenayoun Dec 15, 2023
cb9dbeb
[WIP] tests
michaelbenayoun Dec 18, 2023
0c9e053
[WIP] tests
michaelbenayoun Dec 19, 2023
fb98746
[WIP] tests
michaelbenayoun Dec 20, 2023
0679ade
[WIP] tests
michaelbenayoun Dec 20, 2023
05164dd
[WIP] tests
michaelbenayoun Dec 20, 2023
2d5db07
[WIP] tests
michaelbenayoun Dec 20, 2023
1bfac8a
Merge branch 'main' into initial_pp
michaelbenayoun Dec 20, 2023
f47ada5
Styling
michaelbenayoun Dec 20, 2023
ec39922
Fix test
michaelbenayoun Jan 4, 2024
c88fe86
Update workflow
michaelbenayoun Jan 4, 2024
ec7a8ad
fix test
michaelbenayoun Jan 4, 2024
5ded810
fix test
michaelbenayoun Jan 4, 2024
dade072
fix test
michaelbenayoun Jan 4, 2024
d2126df
clean test
michaelbenayoun Jan 4, 2024
30241d3
[WIP] tests
michaelbenayoun Jan 5, 2024
5ad63ec
Fix small issues
michaelbenayoun Jan 5, 2024
4904932
Fix doc
michaelbenayoun Jan 5, 2024
4d15239
[WIP] cache system support for PP
michaelbenayoun Jan 5, 2024
238cf88
[WIP] fix tests
michaelbenayoun Jan 5, 2024
a669b60
Fix save_and_load test
michaelbenayoun Jan 8, 2024
e7a4c13
Fix test_optimizer_parameters_match_models_parameters
michaelbenayoun Jan 8, 2024
9800a42
Fix GPTNeo(x) tests
michaelbenayoun Jan 8, 2024
c04fc68
[WIP] fix llama tests
michaelbenayoun Jan 9, 2024
d7e7b40
[WIP] test_training
michaelbenayoun Jan 9, 2024
e27d87b
[WIP] test_training
michaelbenayoun Jan 10, 2024
d272416
Fix cache add test
michaelbenayoun Jan 10, 2024
baba59a
Cleanup
michaelbenayoun Jan 10, 2024
de55c9d
Pin huggingface_hub version
michaelbenayoun Jan 10, 2024
4e3e7ab
Cleanup
michaelbenayoun Jan 10, 2024
a82e44a
Disable dp=4,tp=pp=2 for test_common for now
michaelbenayoun Jan 10, 2024
533ffce
Fix tests in test_common.py
michaelbenayoun Jan 11, 2024
109aa67
Merge branch 'main' into initial_pp
michaelbenayoun Jan 11, 2024
f1b18d7
Fix tests in test_common.py
michaelbenayoun Jan 11, 2024
cfa5288
Fix
michaelbenayoun Jan 12, 2024
d94057f
Fix test
michaelbenayoun Jan 12, 2024
dce046c
Fix test
michaelbenayoun Jan 12, 2024
189bea9
Fix
michaelbenayoun Jan 12, 2024
51f0a65
Update workflow
michaelbenayoun Jan 15, 2024
bce46b5
Merge branch 'main' into initial_pp
michaelbenayoun Jan 16, 2024
7bdad6a
Skip GPTNeo tests
michaelbenayoun Jan 16, 2024
410a77b
Move model to device by default
michaelbenayoun Jan 16, 2024
d7e85fb
Fix test
michaelbenayoun Jan 16, 2024
95499cf
Test without test_training
michaelbenayoun Jan 17, 2024
0adbab6
Apply David's suggestions
michaelbenayoun Jan 22, 2024
840ea9d
Apply Jingya's suggestion
michaelbenayoun Jan 22, 2024
e6fa03a
Move distributed test conftest
michaelbenayoun Jan 22, 2024
2 changes: 2 additions & 0 deletions .github/workflows/test_trainium_common.yml
@@ -32,6 +32,8 @@ jobs:
run: echo "/home/ubuntu/.local/bin" >> $GITHUB_PATH
- name: Set pip repository pointing to the Neuron repository
run: pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
- name: Update pip
run: pip install -U pip
- name: Install Python dependencies
run: pip install .[tests,neuronx]
- name: Run tests on Neuron cores
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/test_trainium_distributed.yml
@@ -35,5 +35,5 @@ jobs:
run: pip install .[tests,neuronx]
- name: Run tests on Neuron cores
run: |
HF_TOKEN=${{ secrets.HF_TOKEN_OPTIMUM_NEURON_CI }} pytest -m "is_trainium_test" tests/distributed/
HF_TOKEN=${{ secrets.HF_TOKEN_OPTIMUM_NEURON_CI }} pytest -m "is_trainium_test" tests/distributed/ -v --durations=0 -x --ignore tests/distributed/test_training.py
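(Note: the added flags tighten CI reporting: `-v` prints one line per test, `--durations=0` reports the runtime of every test, and `-x` aborts on the first failure. `tests/distributed/test_training.py` is excluded for now, consistent with the "Test without test_training" commit above.)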

6 changes: 3 additions & 3 deletions docs/source/guides/distributed_training.mdx
@@ -182,11 +182,11 @@ Just as for ZeRO-1, it is possible to wrap the optimizer class to make it lazy.
```python
from torch.optim import AdamW
from optimum.neuron import NeuronAccelerator
from optimum.neuron.accelerate.utils import TensorParallelismPlugin
from optimum.neuron.accelerate.utils import ModelParallelismPlugin
from optimum.neuron.distributed import lazy_load_for_parallelism

tensor_parallel_size = 8
tp_plugin = TensorParallelismPlugin(
mp_plugin = ModelParallelismPlugin(
tensor_parallel_size,
parallelize_embeddings=True,
sequence_parallel_enabled=True,
@@ -195,7 +195,7 @@ tp_plugin = TensorParallelismPlugin(

accelerator = NeuronAccelerator(
...
tp_plugin=tp_plugin,
mp_plugin=mp_plugin,
)

with lazy_load_for_parallelism(tensor_parallel_size=tensor_parallel_size):
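For reference, a minimal end-to-end sketch of the renamed API. The exact `ModelParallelismPlugin` signature, in particular the new `pipeline_parallel_size` knob, is assumed from the diffs in this PR rather than confirmed against a released API, and the checkpoint name is only an illustration:

```python
# Sketch only: pipeline_parallel_size on ModelParallelismPlugin is assumed
# from this PR's diffs, not confirmed against a released API.
from optimum.neuron import NeuronAccelerator
from optimum.neuron.accelerate.utils import ModelParallelismPlugin
from optimum.neuron.distributed import lazy_load_for_parallelism
from transformers import AutoModelForCausalLM

tensor_parallel_size = 8
pipeline_parallel_size = 2  # new in this PR

mp_plugin = ModelParallelismPlugin(
    tensor_parallel_size,
    parallelize_embeddings=True,
    sequence_parallel_enabled=True,
    pipeline_parallel_size=pipeline_parallel_size,
)

accelerator = NeuronAccelerator(
    mp_plugin=mp_plugin,
)

# Lazy loading avoids materializing the full model on every worker before
# it is sharded across the tensor-parallel and pipeline-parallel ranks.
with lazy_load_for_parallelism(
    tensor_parallel_size=tensor_parallel_size,
    pipeline_parallel_size=pipeline_parallel_size,
):
    model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative checkpoint
```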
2 changes: 1 addition & 1 deletion docs/source/package_reference/distributed.mdx
@@ -24,7 +24,7 @@ The [`~optimum.neuron.distributed.Parallelizer`] class is the base abstract class
[[autodoc]] distributed.Parallelizer
- _parallelize
- parallelize
- optimizer_for_tp
- optimizer_for_mp
- save_model_checkpoint
- load_model_checkpoint
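(The `optimizer_for_tp` to `optimizer_for_mp` rename mirrors the plugin rename above: with pipeline parallelism joining tensor parallelism, the optimizer wrapper now covers model parallelism in general rather than tensor parallelism alone.)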

59 changes: 48 additions & 11 deletions examples/image-classification/run_image_classification.py
100644 → 100755
@@ -16,6 +16,7 @@
import logging
import os
import sys
import warnings
from dataclasses import dataclass, field
from typing import Optional

@@ -28,6 +29,7 @@
from torchvision.transforms import (
CenterCrop,
Compose,
Lambda,
Normalize,
RandomHorizontalFlip,
RandomResizedCrop,
@@ -56,7 +58,7 @@
logger = logging.getLogger(__name__)

# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
check_min_version("4.31.0")
check_min_version("4.35.0")

require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/image-classification/requirements.txt")

@@ -143,12 +145,28 @@ class ModelArguments:
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
image_processor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."})
token: str = field(
default=None,
metadata={
"help": (
"The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
"generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
)
},
)
use_auth_token: bool = field(
default=None,
metadata={
"help": "The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead."
},
)
trust_remote_code: bool = field(
default=False,
metadata={
"help": (
"Will use the token generated when running `huggingface-cli login` (necessary to use this script "
"with private models)."
"Whether or not to allow for custom models defined on the Hub in their own modeling files. This option"
"should only be set to `True` for repositories you trust and in which you have read the code, as it will "
"execute code present on the Hub on your local machine."
)
},
)
@@ -177,6 +195,15 @@ def main():
else:
model_args, data_args, training_args = parser.parse_args_into_dataclasses()

if model_args.use_auth_token is not None:
warnings.warn(
"The `use_auth_token` argument is deprecated and will be removed in v4.34. Please use `token` instead.",

    [Review comment, Collaborator] which package? transformers?

    [Reply, michaelbenayoun (author)] Yes. But I'm not in favor of adding that. These files are updated automatically by cloning the examples from Transformers. This is a bad side effect, but I think that's ok.

FutureWarning,
)
if model_args.token is not None:
raise ValueError("`token` and `use_auth_token` are both specified. Please set only the argument `token`.")
model_args.token = model_args.use_auth_token

# Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The
# information sent is the one passed as arguments along with your Python/PyTorch versions.
send_example_telemetry("run_image_classification", model_args, data_args)
@@ -200,8 +227,8 @@ def main():

# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, 16-bits training: {training_args.fp16}"
)
logger.info(f"Training/evaluation parameters {training_args}")

@@ -230,7 +257,7 @@ def main():
data_args.dataset_config_name,
cache_dir=model_args.cache_dir,
task="image-classification",
use_auth_token=True if model_args.use_auth_token else None,
token=model_args.token,
)
else:
data_files = {}
@@ -277,32 +304,42 @@ def compute_metrics(p):
finetuning_task="image-classification",
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
token=model_args.token,
trust_remote_code=model_args.trust_remote_code,
)
with lazy_load_for_parallelism(tensor_parallel_size=training_args.tensor_parallel_size):
with lazy_load_for_parallelism(
tensor_parallel_size=training_args.tensor_parallel_size,
pipeline_parallel_size=training_args.pipeline_parallel_size,
):
model = AutoModelForImageClassification.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
token=model_args.token,
trust_remote_code=model_args.trust_remote_code,
ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
)

image_processor = AutoImageProcessor.from_pretrained(
model_args.image_processor_name or model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
token=model_args.token,
trust_remote_code=model_args.trust_remote_code,
)

# Define torchvision transforms to be applied to each image.
if "shortest_edge" in image_processor.size:
size = image_processor.size["shortest_edge"]
else:
size = (image_processor.size["height"], image_processor.size["width"])
normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
normalize = (
Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
if hasattr(image_processor, "image_mean") and hasattr(image_processor, "image_std")
else Lambda(lambda x: x)
)
_train_transforms = Compose(
[
RandomResizedCrop(size),
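(Note on the last hunk: the conditional around `Normalize` guards against image processors that do not define `image_mean`/`image_std`; for those, the `Lambda` identity transform leaves pixel values unchanged.)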