Support SegmentID when doing data parallel SPMD #8425
Merged
This is built on top of #8333.
When a sharding spec is provided, we also need to shard the segment IDs. The data parallel case is the easiest one.

In data parallel (or FSDP in this manner, since we will do an `all_gather` on all parameters, which makes the parameters full), the mesh is 1D, e.g. shape `(num_devices,)` with axis name `("data",)`, and the sharding spec we pass to `flash_attention` will be `("data", None, None, None)`. We just need to shard the `segment_ids` the same way, as sketched below.
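A minimal sketch of that setup, assuming `flash_attention` exposes the segment IDs as `q_segment_ids`/`kv_segment_ids` keyword arguments next to `partition_spec` and `mesh` (which is what this PR enables); shapes and dtypes here are illustrative:

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs
from torch_xla.experimental.custom_kernel import flash_attention

xr.use_spmd()
device = xm.xla_device()
num_devices = xr.global_runtime_device_count()

# 1D data parallel mesh: shape (num_devices,), single axis named "data".
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ("data",))

# q/k/v are [batch, num_heads, seq_len, head_dim]; only the batch dim is sharded.
q = torch.randn(8, 4, 128, 64, device=device)
k = torch.randn(8, 4, 128, 64, device=device)
v = torch.randn(8, 4, 128, 64, device=device)
# Segment ids are [batch, seq_len] and are sharded on the same "data" axis.
q_segment_ids = torch.zeros(8, 128, dtype=torch.int32, device=device)
kv_segment_ids = torch.zeros(8, 128, dtype=torch.int32, device=device)

o = flash_attention(
    q, k, v,
    q_segment_ids=q_segment_ids,
    kv_segment_ids=kv_segment_ids,
    partition_spec=("data", None, None, None),  # shard only the batch dim
    mesh=mesh)
```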
The tricky part is what we save for the backward pass. I think we need to save the sharded `segment_ids`. You can imagine that after `enable_manual_sharding`, all of the computation is based on local shapes. `segment_ids` is not an output of `flash_attention`, hence we don't have to bring it back to full. We saved the full `q/k/v`, but we also used `enable_manual_sharding` to shard them again; see the sketch below.
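A simplified sketch of the forward path that motivates this choice. It is not the actual autograd function from `custom_kernel.py`: `_pallas_flash_attention` is a plain-softmax stand-in for the real Pallas kernel call, the residual set is trimmed, and the exact signatures of the `torch_xla.distributed.spmd` manual-sharding helpers may differ.

```python
import torch
import torch_xla.distributed.spmd as xs


def _pallas_flash_attention(q, k, v, q_segment_ids, kv_segment_ids):
  # Stand-in for the Pallas kernel: ordinary attention with a cross-segment mask.
  scores = torch.einsum("bhqd,bhkd->bhqk", q, k)
  mask = q_segment_ids.unsqueeze(-1) != kv_segment_ids.unsqueeze(-2)  # [b, q, k]
  scores = scores.masked_fill(mask.unsqueeze(1), float("-inf"))
  return torch.einsum("bhqk,bhkd->bhqd", scores.softmax(dim=-1), v)


def flash_attention_forward_sketch(q, k, v, q_segment_ids, kv_segment_ids,
                                   partition_spec, mesh):
  # Keep the full (global) q/k/v; backward re-shards them with
  # enable_manual_sharding again.
  full_q, full_k, full_v = q, k, v
  full_shape = q.shape

  # From here on every tensor has its *local* shape.
  q = xs.enable_manual_sharding(q, partition_spec, mesh=mesh).global_tensor
  k = xs.enable_manual_sharding(k, partition_spec, mesh=mesh).global_tensor
  v = xs.enable_manual_sharding(v, partition_spec, mesh=mesh).global_tensor

  # Segment ids are [batch, seq_len]; in the data parallel case their spec
  # is just ("data", None) -- the same batch axis as q/k/v.
  seg_spec = (partition_spec[0], None)
  q_segment_ids = xs.enable_manual_sharding(
      q_segment_ids, seg_spec, mesh=mesh).global_tensor
  kv_segment_ids = xs.enable_manual_sharding(
      kv_segment_ids, seg_spec, mesh=mesh).global_tensor

  o = _pallas_flash_attention(q, k, v, q_segment_ids, kv_segment_ids)

  # Only the output is an output of flash_attention, so only it is brought
  # back to the full shape; the *sharded* segment ids are what gets saved.
  o = xs.disable_manual_sharding(o, partition_spec, full_shape,
                                 mesh=mesh).global_tensor
  residuals = (full_q, full_k, full_v, o, q_segment_ids, kv_segment_ids)
  return o, residuals
```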
Note that another tricky part is that `q_segment_ids` is not what we pass to the pallas kernel; we actually add one dimension to it. Check `xla/torch_xla/experimental/custom_kernel.py`, lines 219 to 224 at 20f5166, where the expanded tensor is named `q_segment_ids_fa` to make this more clear. A toy shape illustration follows.
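For illustration only (the exact expansion in `custom_kernel.py` may differ; see the lines referenced above): the caller-facing segment IDs are 2D, while the tensor the kernel receives carries one extra trailing dimension.

```python
import torch

batch, q_seq_len = 8, 128
# What the caller passes: [batch, q_seq_len].
q_segment_ids = torch.zeros(batch, q_seq_len, dtype=torch.int32)
# What reaches the Pallas kernel has one more (trailing) dimension;
# q_segment_ids_fa is the name used in custom_kernel.py.
q_segment_ids_fa = q_segment_ids.unsqueeze(-1)
print(q_segment_ids.shape)     # torch.Size([8, 128])
print(q_segment_ids_fa.shape)  # torch.Size([8, 128, 1])
```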