
Added Distributed(Tensor Parallel) Inference Recipe #2245

Merged: 34 commits merged into pytorch:main on Jan 18, 2025

Conversation

@acisseJZhong (Contributor) commented Jan 10, 2025

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?
Enabled TP for inference for llama3, 3.1, and 3.3 70B text-only models. Llama3.2 Vision 90B is still a work in progress; it is blocked by #2277 and will be enabled in a follow-up PR.

  • Copied dev/generate_v2.py and added TP to the recipe. The main changes are in __init__ and __setup__.
  • Added a TP plan for llama3, under the model folder's _parallelism.py file.
  • Added TP utilities in _distributed.py; note that the utilities are for now only designed to work with llama models.
  • Added distributed inference configs for llama3 70B, 3.1 70B, and 3.3 70B.
  • Generalized load_from_full_model_state_dict to general parallelism, not only FSDP.
  • Fixed a few typos.

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)
  • Compared distributed inference results for llama3 8B with non-distributed inference. Generations are at parity; below are the screenshot results for distributed and non-distributed inference. Distributed inference takes longer but has lower peak memory.
[screenshots: distributed vs. non-distributed inference generations]
  • Running llama3 70B distributed inference on 8 GPUs:
Time for inference: 9.03 sec total, 3.10 tokens/sec
Bandwidth achieved: 408.57 GiB/s
Max memory allocated: 18.70 GiB

We see that the max memory allocated is only 18.7 GiB, which indicates we could use fewer GPUs.
Running on 2 GPUs, we see that the max memory increases:

Time for inference: 10.42 sec total, 3.17 tokens/sec
Bandwidth achieved: 417.21 GiB/s
Max memory allocated: 67.98 GiB

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

pytorch-bot commented Jan 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2245

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 3 Pending

As of commit 1ad2f76 with merge base 7747db1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 10, 2025
@acisseJZhong acisseJZhong changed the title [TP] Added Distributed Inference Recipe Added Distributed(Tensor Parallel) Inference Recipe Jan 10, 2025
@codecov-commenter commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 28.94737% with 27 lines in your changes missing coverage. Please review.

Project coverage is 23.95%. Comparing base (baae232) to head (41941a9).
Report is 15 commits behind head on main.

Files with missing lines | Patch % | Missing lines
torchtune/training/_distributed.py | 25.00% | 24 ⚠️
torchtune/modules/model_fusion/_fusion_layers.py | 0.00% | 2 ⚠️
torchtune/modules/attention.py | 0.00% | 1 ⚠️

❗ There is a different number of reports uploaded between BASE (baae232) and HEAD (41941a9): HEAD has 2 fewer uploads than BASE (BASE: 3, HEAD: 1).
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2245       +/-   ##
===========================================
- Coverage   64.30%   23.95%   -40.35%     
===========================================
  Files         352      357        +5     
  Lines       20566    21174      +608     
===========================================
- Hits        13225     5073     -8152     
- Misses       7341    16101     +8760     


@acisseJZhong acisseJZhong requested review from RdoubleA, ebsmothers and joecummings and removed request for RdoubleA and ebsmothers January 10, 2025 07:21
@felipemello1 (Contributor) left a comment

Some minor comments/questions

@@ -45,6 +48,18 @@
"dev" not in torch_version and torch_version_ge("2.6.0")
) or ("dev" in torch_version and torch_version.split("dev")[1] >= "20241220")

BASE_LLAMA_TP_PLAN = {
Contributor

Do we need one for each family of models? If so, is this file the right place to store it?

@acisseJZhong (Contributor Author) commented Jan 16, 2025

Yeah I am also curious what's the best place to store this info. The plan should be shared within llama3, 3.1, and 3.2, but we should define unique plans for 3.2 vision and 4. Maybe it should be stored in _model_builders.py? What's a better place?

Contributor

I am not sure. I feel that _model_builders.py would be too scattered. If we had training/distributed/, i would put it there. Does TorchTitan have something like this for multiple models? maybe we could check how they do it.

Every model has a checkpoint mapping torchtune <-> hf. How do we handle it? Probably we should follow the same pattern.

Contributor Author

Torchtitan didn't define plan for each model, they just have one apply_tp function for llama3 in a parallelize_llama.py.

I saw each model has a convert_weights.py file for converting weights format between hf and torchtune. Maybe let me create a parallelism file for each model, to put all the plans.

Contributor

I don't think we need a parallelism file for every model. The vast majority of models that we support will be able to fall under the BASE_LLAMA_TP_PLAN. It doesn't have to live in training/ but it should live in somewhere centralized. Then, if there is a specific TP plan that we want to enable for, say, LLama3.2V, then we can define it either in the _model_builders.py file OR we can add a _parallelism.py file under the model directory where we define the TP plan.

Contributor

Call it TRANSFORMER_DECODER_TP_PLAN or similar so it's not llama-specific. maybe we'll finally need a distributed folder 👀

Contributor

I'd vote for something like BASIC_TP_PLAN

Contributor

BASED_TP_PLAN

(this is a joke)

Contributor

Hahahaha ... unless? 👀

Comment on lines 52 to 61
"tok_embeddings": RowwiseParallel(input_layouts=Replicate()),
"output": ColwiseParallel(output_layouts=Replicate()),
"layers.*.attn.q_proj": ColwiseParallel(),
"layers.*.attn.k_proj": ColwiseParallel(),
"layers.*.attn.v_proj": ColwiseParallel(),
"layers.*.attn.output_proj": RowwiseParallel(),
"layers.*.mlp.w1": ColwiseParallel(),
"layers.*.mlp.w2": RowwiseParallel(),
"layers.*.mlp.w3": ColwiseParallel(),
}
Contributor

n00b question: is this row/col the optimal setup? or is it somewhat arbitrary?

@acisseJZhong (Contributor Author) commented Jan 16, 2025

For matrix multiplication, we just need to make sure one matrix is Col and the other is Row. For example, because the math is mlp.w2(mlp.w1(x) * mlp.w3(x)), we just need to make sure that w1 and w3 are col and w2 is row, or the other way around.
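A quick numerical check of that pairing (not from the PR, just an illustration): column-sharding w1/w3 and row-sharding w2 reproduces the unsharded MLP output once the per-shard partial results are summed, which is the all-reduce that RowwiseParallel inserts.

import torch

dim, hidden, tp = 8, 16, 2
x = torch.randn(4, dim)
w1, w3 = torch.randn(dim, hidden), torch.randn(dim, hidden)
w2 = torch.randn(hidden, dim)

# Unsharded reference: w2(w1(x) * w3(x)), with each weight applied as x @ W
full = ((x @ w1) * (x @ w3)) @ w2

partials = []
for rank in range(tp):
    cols = slice(rank * hidden // tp, (rank + 1) * hidden // tp)
    # Each rank holds a column shard of w1/w3 and the matching row shard of w2
    partials.append(((x @ w1[:, cols]) * (x @ w3[:, cols])) @ w2[cols, :])

assert torch.allclose(full, sum(partials), atol=1e-4)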

Contributor

optional: Maybe adding this comment on top of it would be good

Comment on lines 566 to 577
def get_tp_plan(model_type: str) -> Dict[str, ParallelStyle]:
"""
Get the TP plan for a given model type.

Args:
model_type (str): The model type to get the TP plan for.

Returns:
Dict[str, str]: A dictionary mapping layer names to their corresponding TP plan.
"""
# For now, we only support base TP plan, will add more plan later
return BASE_LLAMA_TP_PLAN
Contributor

I understand that this is a v0, but should we add something like:

if model_type not in LLAMA_MODEL_TYPES:
    raise ValueError("TP is only supported for llama type models")

Returns:
nn.Module: Adjusted model.
"""
for transformer_block in model.layers:
Contributor

this will break for vision model, since we do model.decoder.layers, unless we call adjust_attention_for_tp(model=model.decoder)

@acisseJZhong (Contributor Author) commented Jan 16, 2025

thanks! yeah I had this in my local changes, trying to make vision 3.2 work. I did the following:

model = getattr(model, 'decoder', model)

Let me know if you have better ideas.

Contributor

not my proudest moment:

if isinstance(model, DeepFusionModel):
    model = model.decoder

"""
for transformer_block in model.layers:
# Adjust attention module to use the local number of heads
attn_layer = transformer_block.attn
Contributor

nit: this is ok, but maybe a more robust option would be to look for the module type == SelfAttentionLayer

Comment on lines +28 to +36
Expects the YAML to look like:
system: You are a helpful AI assistant.
user: What is the capital of France?

or if it includes an image:
system: You are a helpful AI assistant.
user:
image: url or path_to_image
text: Describe the image in detail.
Contributor

We should denote that it is a ::codeblock: yaml, ask some LLM for formatting

Contributor

Even the strongest LLMs cannot comprehend sphinx rst syntax

Contributor

Fwiw this isn't gonna show up in our live docs anyways, right? In that case I would lean away from Sphinx directives -- if people are just reading the code it'll needlessly clutter things up

Contributor

(But separately we should think about putting this somewhere besides the recipe file anyways, especially now that we're copying the same class to two different recipes)

self._dtype = training.get_dtype(dtype=cfg.dtype, device=self._device)
self._logger = utils.get_logger(cfg.log_level)
# Set up distributed env
dist.init_process_group("cuda:nccl")
Contributor

i have seen in other parts of the code this resulting in errors if we dont do init_process_group("cuda:nccl,cpu:gloo")

@acisseJZhong (Contributor Author) commented Jan 16, 2025

I did see that other files have cpu:gloo. Curious why, since we will not use CPU as a backend for inference?

Contributor

i have a vague memory that there may be some weight that is initialized using cpu, for some reason, and without cpu:gloo, it raises an error. But I dont remember exactly the issue. In any case, I dont think it hurts to add it.

Contributor Author

I tried adding cpu:gloo when initializing the process group. But I am getting some RMSNorm cuda cpu device mismatch. With just cuda:nccl and the exact same code, it works with no problem. https://www.internalfb.com/phabricator/paste/view/P1713611664

Contributor

Yeah I think gloo PG is only relevant for CPU offloading, which we aren't doing today anyways

# Set up tenosr parallel device mesh
tp_degree = dist.get_world_size() # Using all GPUs for TP
tp_mesh_shape = (tp_degree,)
tp_device_mesh = dist.init_device_mesh("cuda", tp_mesh_shape)
Contributor

n00b question: should we worry about other device types, e.g. npu?

Contributor

I think it's fine to leave that out in a first pass. Last I knew of we don't yet have distributed support for NPUs anyways (though @noemotiovon can inform me if my info is out of date here)
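If device generality ever becomes a requirement, one low-cost option is to derive the device type from the recipe's device instead of hardcoding "cuda". This is just a sketch, not part of this PR, and init_tp_mesh is a hypothetical helper name:

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh, init_device_mesh

def init_tp_mesh(device: torch.device) -> DeviceMesh:
    # Use the recipe's device type ("cuda", "npu", ...) rather than a literal "cuda",
    # with all ranks placed in a single tensor-parallel dimension.
    return init_device_mesh(device.type, (dist.get_world_size(),))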


# This method will convert the full model state dict into a sharded state
# dict and load into the model
training.load_from_full_model_state_dict(
@felipemello1 (Contributor) commented Jan 16, 2025

nit: as a rule of thumb, I think it's worth using keyword arguments for all args, not only strict and cpu_offload

Comment on lines 151 to 154
f"Bandwidth achieved: {model_size * tokens_per_second / 1e9:.02f} GB/s"
)
self._logger.info(
f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1e9:.02f} GB"
Contributor

nit: I think in general we prefer to use GiB, otherwise it may appear that we used more memory than the GPU has available. To change this, replace 1e9 with /1024/1024.

@joecummings (Contributor) left a comment

What kind of tok/sec are we seeing with TP and Llama3 for the distributed inference recipe?

@@ -546,3 +564,72 @@ def shard_model(

# Finally shard the entire model to account for any stragglers
fully_shard(model, **fsdp_kwargs)


def get_tp_plan(model_type: str) -> Dict[str, ParallelStyle]:
Contributor

I actually want to avoid something like this. It's very similar to how we did checkpointing where we would have complicated if/else logic that quickly got very confusing.

I'm imagining that the user would be able to pass in their TP plan directly from the config if they want to use tensor parallel.

Contributor

Agreed. I think just pointing directly to a function in the config with the plan they want gives the users the most flexibility

tensor_parallel_plan:
  _component_: torchtune.training.BASE_LLAMA_TP_PLAN

Contributor Author

I am concerned that things could become messier if we allow users to directly change the TP plan from the config, especially once we have more parallelism enabled. In most cases the default TP plan should suffice; it's unclear why users would need to change it. Ideally each model will have a default TP plan under _parallelism.py, and if advanced users really want to experiment with the plan, they can modify it there.

Contributor

if advanced users really want to experiment with the plan, they can modify it there.

This would necessitate users using git clone to interact with torchtune, which isn't the case for many of our users. It has to somehow be possible to override the TP plan via some builder or directly in the config. We could just call the value parallelism_plan, which could extend to other types as well.

Contributor

I'd say those are two separate problems - where to place the plans and how to specify them. I agree with placing the plans under each model's _parallelism.py or some central place for the default, but for accessing them you'll either need to do some weird dictionary mapping or specify it directly from the config. Since all the generation configs are model specific, we can just point to the default for each model.

The direct from config approach just lets you bypass the whole dict mapping, which is cleaner imo. Otherwise you will need to keep updating the mapping for every new model.
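For reference, a rough sketch of what the config-pointer approach could look like inside the recipe. This is hedged: apply_tp_from_config is a hypothetical helper, and the exact prepare_mha_for_tp signature and the torchtune.models.llama3.base_llama_tp_plan path are assumptions pieced together from other comments in this PR.

from torch.distributed.tensor.parallel import parallelize_module
from torchtune import config, training

def apply_tp_from_config(cfg, model, tp_device_mesh):
    # In the YAML, e.g.:
    #   tensor_parallel_plan:
    #     _component_: torchtune.models.llama3.base_llama_tp_plan
    tp_plan = config.instantiate(cfg.tensor_parallel_plan)
    # Make num_heads / num_kv_heads / embed_dim TP-aware before sharding
    model = training.prepare_mha_for_tp(model, tp_device_mesh)
    return parallelize_module(model, tp_device_mesh, parallelize_plan=tp_plan)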


tp_mesh: DeviceMesh,
) -> nn.Module:
"""
Adjusts the number of attention heads and dimension in the model to account for tensor parallelism.
Contributor

This description is too vague. We should communicate what exactly is happening to the users.

"""
# Consider the case of Early Fusion or Deep Fusion models
if isinstance(model, DeepFusionModel):
model = model.docoder
Contributor

Suggested change
model = model.docoder
model = model.decoder

Contributor

i think that docoder sounds nicer

assert attn.num_heads % tp_mesh.size() == 0
assert attn.num_kv_heads % tp_mesh.size() == 0
assert attn.embed_dim % tp_mesh.size() == 0
attn.num_heads = attn.num_heads // tp_mesh.size()
Contributor

nit: Don't need to use the floor division operator if you already determined that the tp_mesh.size() goes evenly into the num_heads, etc.

Contributor Author

it's expected to be an int, so I used //

Contributor

Right, but both num_heads and tp_mesh.size() are ints so it'll always be an int.

@acisseJZhong (Contributor Author) commented Jan 16, 2025

it's not in my case :( I printed it out in code; even though both numerator and denominator are ints, the result is a float.

Contributor

weird ... okay sounds good then!
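(For what it's worth, in Python 3 true division always returns a float even when both operands are ints, so the floor division really is needed to keep these attributes as ints:)

>>> heads, tp_size = 32, 4
>>> heads / tp_size
8.0
>>> heads // tp_size
8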

# Adjust attention module to use the local number of heads
attention_layers = ([layer.attn] if not isinstance(layer, FusionLayer) else [layer.fusion_layer.attn, layer.layer.attn])
for attn in attention_layers:
assert attn.num_heads % tp_mesh.size() == 0
Contributor

May as well pull the tp_mesh.size() call out to the top.

raise ValueError("TP is only supported for llama type models right now.")


def adjust_attention_for_tp(
Contributor

Should we be more precise here? Something like scale_attention_heads_by_tp_size?

The only reason I could think not to do this is if there might be other adjustments to the attention we might need to do.

Contributor Author

It changes the number of heads as well as the embed dim. Maybe shard_attention_params_for_tp?

Contributor

Just reviewed and I personally still found shard_attention_params_for_tp a bit confusing. Really you are just setting num_heads so that the reshapes are TP-aware, right? If that's the case I would even lean towards calling it prepare_mha_for_tp or something (the current name makes me think you are actually distributing the params across devices in this utility, which you are not). Have I mentioned I hate naming things

tp_device_mesh = dist.init_device_mesh("cuda", tp_mesh_shape)

# Get TP plan and apply TP
tp_plan = training.get_tp_plan(cfg.checkpointer.model_type)
Contributor

I would not count on model_type staying around.

Contributor

Agreed, we should find some other way to reliably get the TP plan depending on the model... maybe this could be a parameter in the config that points to a function?

@RdoubleA (Contributor) left a comment

Really like this, we just need to figure out how we expect users to use our built-in TP plans, whether they can supply their own, and how this can be specified from the config.


This *does not* currently support the following features:
- torch.compile
- quantization through torchao
- multi-GPU generation
Contributor

should update this and call out that this can be run distributed for larger models using TP. And could point to some pytorch docs on TP

self._dtype = training.get_dtype(dtype=cfg.dtype, device=self._device)
self._logger = utils.get_logger(cfg.log_level)
# Set up distributed env
dist.init_process_group("cuda:nccl,cpu:gloo")
Contributor

I think we have a utility for this that also sets the port, etc?

Contributor Author

I was following full_finetune_distributed.py and it seems like it doesn't use any util, just init_process_group().

with training.set_default_dtype(self._dtype), torch.device("meta"):
model = config.instantiate(cfg.model)

# Set up tenosr parallel device mesh
Contributor

Suggested change
# Set up tenosr parallel device mesh
# Set up tensor parallel device mesh

model = config.instantiate(cfg.model)

# Set up tenosr parallel device mesh
tp_degree = dist.get_world_size() # Using all GPUs for TP
Contributor

will this also work on multinode?

Contributor

Please don't make me worry about multi-node INFERENCE too

Contributor

then we should prevent users from trying multinode somewhere?

Contributor

Can drop a comment at the top of the recipe

Contributor Author

Added into the docstring for the InferenceRecipe class.

@@ -433,6 +433,17 @@ class Recipe:
],
supports_distributed=False,
),
Recipe(
Contributor

you need to add the 3.1 and 3.2 configs here, no?

Contributor Author

Added configs for 3, 3.1, and 3.2; for 3.2 I am still debugging.



# Define the parallelism plan for Llama3.2 vision model
LLAMA_DEEP_FUSION_VISION_TP_PLAN = {
Contributor

I would just call this LLAMA_3_2_VISION_TP_PLAN



for attn in attention_layers:
assert attn.num_heads % tp_mesh.size() == 0
assert attn.num_kv_heads % tp_mesh.size() == 0
assert attn.embed_dim % tp_mesh.size() == 0
Contributor

use if... raise instead. It is more descriptive and you can describe what needs to be changed to fix the error

@acisseJZhong (Contributor Author) commented Jan 16, 2025

What kind of tok/sec are we seeing with TP and Llama3 for the distributed inference recipe?

It takes ~10 seconds total, at ~3 tokens/sec, to run inference on a single prompt for llama3 and 3.1.

@ebsmothers (Contributor) left a comment

This is looking great! Really excited to see this landing in the library. Apart from inline comments, main questions are around testing -- it'd be great to update the PR summary to address the following questions.

  1. Are the generations at parity with the corresponding ones from generate_v2.py (obviously not feasible for 70B, but could test on e.g. an 8B model)?
  2. What's the throughput and peak memory? (On a related note, are 8 devices necessary to run 70B inference? If not, can maybe use a smaller number in the config) Edit: I see you answered this in a comment above, still might be good to include comprehensive results in the PR test plan
  3. Do the TP sharding utilities work with our other model families (Gemma, Mistral, Phi, etc)? Any models that they definitely do not work with? If that's the case we don't have to block on supporting everything, just want to be explicit about that.

@@ -109,10 +113,10 @@ def log_metrics(self, total_time: int, tokens_per_second: float) -> None:
f"Time for inference: {total_time:.02f} sec total, {tokens_per_second:.02f} tokens/sec"
)
self._logger.info(
f"Bandwidth achieved: {model_size * tokens_per_second / 1e9:.02f} GB/s"
f"Bandwidth achieved: {model_size * tokens_per_second / 1024 / 1024:.02f} GB/s"
Contributor

Maybe I'm missing something obvious, but shouldn't there be a 3rd 1024 here (and below)?

# tune download meta-llama/Meta-Llama-3-70B-Instruct --output-dir /tmp/Meta-Llama-3-70B-Instruct --ignore-patterns "original/consolidated*" --hf-token <HF_TOKEN>
#
# To launch, run the following command from root torchtune directory:
# tune run --nproc_per_node 8 dev/generate_v2_distributed --config llama3/70B_generation_distributed.yaml
Contributor

Should remove .yaml from these commands once the configs are added to the recipe registry

Comment on lines +15 to +16
"tok_embeddings": RowwiseParallel(input_layouts=Replicate()),
"output": ColwiseParallel(output_layouts=Replicate()),
Contributor

Curious about this comment in the torchchat code

@acisseJZhong (Contributor Author) commented Jan 17, 2025

yeah I am curious how they came to this conclusion. For torchtune inference, I tried commenting these two lines out and the inference speed doesn't change much (I ran it several times; sometimes it's slower and sometimes faster, always differing by 0.0x seconds).

Contributor

Thanks, that's good to know. cc @fduwjj as the author of the corresponding torchchat PR in case you have any insights

}


def base_llama_tp_plan() -> Dict[str, Any]:
Contributor

Can values in the return dict be typed as ParallelStyle? Or are there cases where the plan for a layer might fall outside these classes

"decoder.layers.*.layer.attn.k_proj": ColwiseParallel(),
"decoder.layers.*.layer.attn.v_proj": ColwiseParallel(),
"decoder.layers.*.layer.attn.output_proj": RowwiseParallel(),
"decoder.layers.*.layer.mlp.w1": ColwiseParallel(),
Contributor

Can we consolidate e.g. decoder.layers.*.layer.mlp.w1 and decoder.layers.*.fusion_layer.mlp.w1 -> decoder.layers.*.mlp.w1 or something like that? (Similar question for other layers)

Contributor Author

Good catch, I think it's the same; the wildcard should support this. But let me test it on 3.2 later, right now I'm hitting a distributed state dict problem with 3.2.


self._dtype = training.get_dtype(dtype=cfg.dtype, device=self._device)
self._logger = utils.get_logger(cfg.log_level)
# Set up distributed env
dist.init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl")
Contributor

Do we actually wanna support this on CPU?

Contributor Author

probably not, let me remove that.


Comment on lines 157 to 160
f"Bandwidth achieved: {model_size * tokens_per_second / 1024 / 1024:.02f} GB/s"
)
self._logger.info(
f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024 / 1024:.02f} GB"
Contributor

Same comment here as in generate_v2.py

Comment on lines 617 to 619
attn.num_heads = attn.num_heads // tp_size
attn.num_kv_heads = attn.num_kv_heads // tp_size
attn.embed_dim = attn.embed_dim // tp_size
Contributor

Just want to make sure I understand the purpose of this utility: the idea is to set the appropriate params on the attention module so that any reshapes etc performed there will result in the correctly-shaped input for sharded Q, K, V, output projections?

Contributor

Also wonder whether we could just iterate over e.g. model.modules() and check isinstance(m, MultiHeadAttentionLayer) to avoid any dependency on the actual higher-level model arch details (fusion vs not, etc). For training this might be risky (e.g. AC will wrap modules and then you'd need to decide whether to apply on the AC-wrapped module vs not), but I don't see an obvious case it would break in inference.
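A rough sketch of that modules()-based alternative, folding in the earlier if/raise suggestion (assumes torchtune's MultiHeadAttention class; not necessarily what the PR ends up doing):

from torch import nn
from torchtune.modules import MultiHeadAttention

def prepare_mha_for_tp(model: nn.Module, tp_size: int) -> nn.Module:
    # Walk every submodule so fusion vs. non-fusion architectures need no special casing
    for m in model.modules():
        if isinstance(m, MultiHeadAttention):
            if (m.num_heads % tp_size) or (m.num_kv_heads % tp_size) or (m.embed_dim % tp_size):
                raise ValueError(
                    "num_heads, num_kv_heads, and embed_dim must all be divisible by the TP size"
                )
            m.num_heads //= tp_size
            m.num_kv_heads //= tp_size
            m.embed_dim //= tp_size
    return model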

@@ -546,3 +550,71 @@ def shard_model(

# Finally shard the entire model to account for any stragglers
fully_shard(model, **fsdp_kwargs)


def shard_attention_params_for_tp(
Contributor

Awesome!

model:
_component_: torchtune.models.llama3.llama3_70b

tensor_parallel_plan:
Contributor

I think we might want to mirror the parallelize_module API and call this parallelize_plan. See https://pytorch.org/docs/main/distributed.tensor.parallel.html#torch.distributed.tensor.parallel.parallelize_module

@joecummings (Contributor) left a comment

Incredible work!

@@ -546,3 +548,66 @@ def shard_model(

# Finally shard the entire model to account for any stragglers
fully_shard(model, **fsdp_kwargs)


def prepare_mha_for_tp(
Contributor

This should probably have a test :)

Contributor Author

added! thanks for the reminder.

@@ -109,10 +113,10 @@ def log_metrics(self, total_time: int, tokens_per_second: float) -> None:
f"Time for inference: {total_time:.02f} sec total, {tokens_per_second:.02f} tokens/sec"
)
self._logger.info(
f"Bandwidth achieved: {model_size * tokens_per_second / 1e9:.02f} GB/s"
f"Bandwidth achieved: {model_size * tokens_per_second / 1024 / 1024 / 1024:.02f} GiB/s"
Contributor

nit nit nit: 1024 ** 3
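A small self-contained version of what that logging could look like (a sketch only; log_memory_and_bandwidth is a hypothetical helper, and the key point is the 1024 ** 3 divisor for GiB):

import torch

GIB = 1024 ** 3  # bytes per GiB

def log_memory_and_bandwidth(logger, model_size_bytes: int, tokens_per_second: float) -> None:
    # Bytes of weights read per second, reported in GiB/s
    logger.info(f"Bandwidth achieved: {model_size_bytes * tokens_per_second / GIB:.02f} GiB/s")
    # Peak CUDA memory over the run, reported in GiB
    logger.info(f"Max memory allocated: {torch.cuda.max_memory_allocated() / GIB:.02f} GiB")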

@acisseJZhong acisseJZhong force-pushed the distributed_inference branch from 1b4f781 to a80b7e5 Compare January 17, 2025 22:29
@acisseJZhong acisseJZhong merged commit 779569e into pytorch:main Jan 18, 2025
17 checks passed
@RdoubleA RdoubleA mentioned this pull request Jan 21, 2025