#16888: Fix Conv2D when output is in Row Major #16937
base: main
Conversation
force-pushed from bd63047 to ee50790
force-pushed from ee50790 to 364b76a
force-pushed from c5c06ec to 62b6161
@@ -42,7 +42,7 @@ def run_conv(
    padded_input_channels=None,
    fp32_accum=False,
    packer_l1_acc=False,
    output_layout=ttnn.TILE_LAYOUT,
    output_layout=ttnn.ROW_MAJOR_LAYOUT,
changing the default? we also need to test the TILE output layout more extensively.
Most tests use the default value, and to ensure that untilize_out works, I changed the default.
I would keep TILE as the default, since that is what models use, and add a row-major variant in the tests where needed.
if (
    shard_layout == ttnn.TensorMemoryLayout.HEIGHT_SHARDED
    and output_channels > 256
    and output_layout == ttnn.ROW_MAJOR_LAYOUT
Does this case work with TILE layout?
Yes.
This needs to be added to the validate function as well, as a TT_FATAL.
bump
    and output_channels > 256
    and output_layout == ttnn.ROW_MAJOR_LAYOUT
):
    pytest.skip("Skipping when out_block_w > 8 for untilize_out == True")
also create a ticket to add support for this case, and mention the ticket number here.
if fp32_accum and packer_l1_acc and output_layout == ttnn.ROW_MAJOR_LAYOUT:
    conv_config.output_layout = ttnn.TILE_LAYOUT
    logger.warning(
        "Forcing output_layout to TILE when act_block_h_override, fp32_accum and packer_l1_acc are enabled"
why TILE only for fp32_accum and packer_l1_acc?
It fails when untilize_out, packer_l1_acc, and fp32_accum are all enabled.
Can we add a TT_FATAL to the validate function for the combinations we know are not working?
bump
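For reference, a hedged sketch of what such a guard in the validate path might look like. The function name, the config fields, and the exact checks below are illustrative assumptions that paraphrase the failing combinations flagged in this thread and the height-sharded one above; they are not the real tt-metal signatures.

// Illustrative only: a validate-time guard for the combinations reported as
// failing in this PR discussion. Names and fields are assumptions.
void validate_untilize_out_support(
    const Conv2dConfig& conv_config,
    uint32_t output_channels,
    bool fp32_accum,
    bool packer_l1_acc) {
    const bool untilize_out = conv_config.output_layout == Layout::ROW_MAJOR;

    // untilize_out together with fp32 accumulation and packer L1 accumulation
    // is reported as failing above.
    TT_FATAL(
        !(untilize_out && fp32_accum && packer_l1_acc),
        "Row-major (untilized) conv output is not supported with fp32_accum and packer_l1_acc enabled");

    // Height sharding with wide outputs (out_block_w > 8) and untilize_out is
    // skipped in the tests above; guard it here as well.
    TT_FATAL(
        !(untilize_out &&
          conv_config.shard_layout == TensorMemoryLayout::HEIGHT_SHARDED &&
          output_channels > 256),
        "Row-major (untilized) conv output with height sharding requires output_channels <= 256");
}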
@@ -169,7 +182,8 @@ def run_conv(
    return_weights_and_bias=True,
    return_output_dim=True,
)

# import numpy as np
# np.save("ref.npy",weights_device.cpu().to_torch().numpy())
delete
@@ -1213,7 +1229,7 @@ def test_resnet50_conv_wh_fp32(
)
@pytest.mark.parametrize(
    "activations_dtype",
    [ttnn.bfloat8_b],
    [ttnn.bfloat16],
why change?
To test row major output.
@@ -86,7 +86,7 @@ bool check_non_tile_mul_width(
auto elem_size = conv_config.weights_dtype == DataType::BFLOAT8_B ? 1 : 2;
bool is_non_tile_mul_width =
    (conv_config.shard_layout.has_value() && conv_config.shard_layout == TensorMemoryLayout::BLOCK_SHARDED) &&
    conv_config.act_block_h_override == 0 &&
    // conv_config.act_block_h_override == 0 &&
delete
@@ -24,6 +24,9 @@
// SliceRange srr = SliceRange{.h0 = 0, .h1 = 1, .hs = 8, .w0 = 0, .w1 = 32, .ws = 1};
// SliceRange srr1 = SliceRange{.h0 = 1, .h1 = 2, .hs = 8, .w0 = 0, .w1 = 32, .ws = 1};
// SliceRange src = SliceRange{.h0 = 0, .h1 = 32, .hs = 1, .w0 = 0, .w1 = 1, .ws = 1};
// SliceRange row_range = SliceRange{.h0 = 0, .h1 = 1, .hs = 1, .w0 = 0, .w1 = 16, .ws = 1};
// SliceRange col_range = SliceRange{.h0 = 0, .h1 = 16, .hs = 1, .w0 = 0, .w1 = 1, .ws = 1};
// SliceRange sq_range = SliceRange{.h0 = 0, .h1 = 4, .hs = 1, .w0 = 0, .w1 = 4, .ws = 1};
Let's delete the whole commented-out block.
Sure, I will remove it once I've got all the tests passing.
Removed.
Can you describe in the PR what the issue in the compute kernel was and what was done to fix it? I can't follow the code.
force-pushed from 4894625 to 3afb00d
force-pushed from ad9d18e to ccb25e9
What is the impact of this change on test runtime?
if (curr_matmul_out_cb == matmul_partials_cb) {
    UNPACK(
        if (!use_partials_for_out) get_local_cb_interface(matmul_partials_cb).fifo_rd_ptr =
make constexpr
Done.
        if (!use_partials_for_out) get_local_cb_interface(matmul_partials_cb).fifo_rd_ptr =
            partials_cb_read_ptr);
    PACK(
        if (!use_partials_for_out) get_local_cb_interface(matmul_partials_cb).fifo_wr_ptr =
make constexpr
Done.
@@ -366,7 +375,9 @@ void MAIN {
    cb_pop_front(in1_cb_id, in1_block_num_tiles);
}  // for in0_num_blocks_w
if constexpr (matmul_partials_cb == mm_out_cb_id) {
    UNPACK(get_local_cb_interface(matmul_partials_cb).fifo_rd_ptr = partials_cb_read_ptr);
    UNPACK(
        if (use_partials_for_out) get_local_cb_interface(matmul_partials_cb).fifo_rd_ptr =
make constexpr
Done.
@@ -416,6 +427,10 @@ void MAIN {
        in1_index_subblock_offset += out_subblock_w;
    }  // for in1_num_subblocks
}  // in0_num_subblocks
if (untilize_out) {
make constexpr
Done.
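A standalone sketch of the constexpr form being asked for in these threads; it is not the actual kernel code, only an illustration (assuming the flag can be made a compile-time constant) of how if constexpr drops the dead branch instead of testing it at runtime:

// Illustrative, not the real kernel plumbing: UsePartialsForOut is assumed to
// be derivable at compile time (e.g. from a kernel compile-time argument).
#include <cstdint>

template <bool UsePartialsForOut>
inline void maybe_restore_rd_ptr(uint32_t& fifo_rd_ptr, uint32_t saved_rd_ptr) {
    if constexpr (!UsePartialsForOut) {
        // Only compiled in when the partials CB has its own allocation.
        fifo_rd_ptr = saved_rd_ptr;
    }
    // When UsePartialsForOut is true this compiles to a no-op.
}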
force-pushed from ccb25e9 to 474ed8c
@@ -226,17 +226,26 @@ std::vector<TensorSpec> OptimizedConvNew::compute_output_specs(const std::vector
    dtype, PageConfig(output_layout), mem_config, output_shape, padded_output_shape))};
} else if (this->memory_config.memory_layout == TensorMemoryLayout::WIDTH_SHARDED) {
    uint32_t total_height_tiles = padded_output_shape.volume() / padded_output_shape[-1] / TILE_HEIGHT;
    std::array<uint32_t, 2> shard_shape = {
I think it's tricky to use logical sharding for the conv output only for width sharding.
In general, most ops are not ready to consume logical sharding (like TMs), and a change like this could cause models to fail.
Now people will have to keep in mind that width sharding outputs logical shards, while HS and BS output physical shards.
What was the reason to go for logical sharding here?
With width sharding and the output as Row Major Layout, this condition was failing when the output height * width was not a multiple of 32.
When the output was in Tile Layout, it would correctly pad the shard. However, in Row Major Layout, the physical size of the shard would always be the logical size.
Should HS and BS also be changed to Logical Sharding?
Can you trigger the same problem with HS if you use a shallow conv (input_channel_alignment of 8 or 16) and untilize the output?
This problem doesn't sound specific to WS.
If you can make all models work with conv output as logical sharding, that would be great.
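A small standalone illustration of the mismatch described above, with made-up numbers: for a row-major shard the physical height equals the logical height, while a tiled shard is padded up to a multiple of TILE_HEIGHT, which is why an output height * width that is not a multiple of 32 only trips the row-major path:

#include <cstdint>
#include <iostream>

constexpr uint32_t TILE_HEIGHT = 32;

int main() {
    const uint32_t output_nhw = 17 * 17;  // example output height * width, not a multiple of 32

    const uint32_t row_major_shard_height = output_nhw;  // physical size == logical size
    const uint32_t tiled_shard_height =
        ((output_nhw + TILE_HEIGHT - 1) / TILE_HEIGHT) * TILE_HEIGHT;  // padded up to a tile boundary

    std::cout << "row-major shard height: " << row_major_shard_height << "\n"
              << "tiled shard height:     " << tiled_shard_height << "\n";
    return 0;
}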
There is around a 5% increase in kernel execution time when the output is in Row Major.
Sorry, I wasn't clear enough. I was asking whether the runtime of test_new_conv2d.py has increased now, and if so by how much (as we were trying to reduce its runtime to make post-commit faster).
Ticket
#16888
Problem description
Many test cases fail when the output is in Row Major.
What's changed
matmul_partials_cb is the input CB to the untilize block. It may be mapped directly to the output buffer, or have its own memory allocated.
If it is mapped to the output buffer (use_partials_for_out == True), the read and write pointers have to be incremented after every in0 height block.
If it has its own memory allocated (use_partials_for_out == False), the read and write pointers have to be reset to their original position.
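To make the description above concrete, here is a plain-C++ model of that pointer discipline. LocalCB, the page count, and the helper below are stand-ins for illustration only, not the tt-metal circular-buffer interface:

#include <cstdint>

// Stand-in for the kernel's local circular-buffer interface.
struct LocalCB {
    uint32_t fifo_rd_ptr = 0;
    uint32_t fifo_wr_ptr = 0;
};

// Pages produced per in0 height block (illustrative value).
constexpr uint32_t kBlockPages = 4;

// Called after each in0 height block.
void finish_height_block(LocalCB& partials_cb, bool use_partials_for_out,
                         uint32_t saved_rd_ptr, uint32_t saved_wr_ptr) {
    if (use_partials_for_out) {
        // matmul_partials_cb aliases the output buffer: advance past the block
        // just written so the next height block lands after it.
        partials_cb.fifo_rd_ptr += kBlockPages;
        partials_cb.fifo_wr_ptr += kBlockPages;
    } else {
        // matmul_partials_cb has its own allocation: rewind to the saved
        // position so the next height block reuses the same scratch region.
        partials_cb.fifo_rd_ptr = saved_rd_ptr;
        partials_cb.fifo_wr_ptr = saved_wr_ptr;
    }
}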