
#16888: Fix Conv2D when output is in Row Major #16937

Open · wants to merge 7 commits into main from smanoj/fix_conv2d_RM

Conversation

sankarmanoj-tt
Contributor

@sankarmanoj-tt commented Jan 21, 2025

Ticket

#16888

Problem description

Many Conv2D test cases fail when the output layout is Row Major.

What's changed

matmul_partials_cb is the input CB to the untilize block. It may be mapped directly onto the output buffer, or it may have its own memory allocated.

If it is mapped onto the output buffer (use_partials_for_out == True), the read and write pointers have to be incremented after every in0 height block.

If it has its own memory allocated (use_partials_for_out == False), the read and write pointers have to be reset to their original positions after every in0 height block.
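Below is a minimal sketch of the pointer handling described above, written in the style of the compute-kernel code touched by this PR. It assumes use_partials_for_out is a compile-time constant (a reviewer asks for exactly that below), and partials_cb_write_ptr is an assumed name mirroring the partials_cb_read_ptr that appears in the diff; it is an illustration, not the exact change.

// Per in0 height block: either let the partials CB pointers keep advancing into
// the output buffer, or rewind them so the next block reuses the same scratch CB.
if constexpr (!use_partials_for_out) {
    // The CB has its own allocation: reset the read/write pointers to the values
    // captured before the height-block loop.
    UNPACK(get_local_cb_interface(matmul_partials_cb).fifo_rd_ptr = partials_cb_read_ptr);
    PACK(get_local_cb_interface(matmul_partials_cb).fifo_wr_ptr = partials_cb_write_ptr);
} else {
    // The CB aliases the output buffer: leave fifo_rd_ptr / fifo_wr_ptr advanced so
    // each height block lands in the next region of the output.
}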

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable) passes
  • Model regression CI testing passes
  • Device performance regression CI testing passes (if applicable)
  • Nightly L2 passes
  • (For models and ops writers) Full new models tests pass
  • New/Existing tests provide coverage for changes

@sankarmanoj-tt marked this pull request as ready for review February 3, 2025 06:42
@sankarmanoj-tt requested a review from a team as a code owner February 3, 2025 06:42
@@ -42,7 +42,7 @@ def run_conv(
padded_input_channels=None,
fp32_accum=False,
packer_l1_acc=False,
- output_layout=ttnn.TILE_LAYOUT,
+ output_layout=ttnn.ROW_MAJOR_LAYOUT,
Contributor

Changing the default? We also need to test the TILE output layout more extensively.

Contributor Author

Most tests use the default value, so I changed the default to ensure that untilize_out is exercised.

Contributor

I would keep TILE as the default, since that is what models are using, and add a row-major variant in tests where needed.

if (
shard_layout == ttnn.TensorMemoryLayout.HEIGHT_SHARDED
and output_channels > 256
and output_layout == ttnn.ROW_MAJOR_LAYOUT
Contributor

Does this case work with TILE layout?

Contributor Author

Yes.

Contributor

This needs to be added to the validate function as a TT_FATAL as well.
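For illustration, a guard of the kind being requested might look like the following in the op's validation path; the member names and surrounding context here are assumptions mirroring the pytest skip condition above, not the actual validate code.

// Sketch: reject untilized (ROW_MAJOR) output for height-sharded convs with
// more than 256 output channels (i.e. out_block_w > 8), matching the skip above.
TT_FATAL(
    !(untilize_out &&
      this->memory_config.memory_layout == TensorMemoryLayout::HEIGHT_SHARDED &&
      output_channels > 256),
    "Conv2D with ROW_MAJOR (untilized) output is not supported for height-sharded outputs with more than 256 output channels");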

Contributor

bump

and output_channels > 256
and output_layout == ttnn.ROW_MAJOR_LAYOUT
):
pytest.skip("Skipping when out_block_w > 8 for untilize_out == True")
Contributor

also create a ticket to add support for this case, and mention the ticket number here.

if fp32_accum and packer_l1_acc and output_layout == ttnn.ROW_MAJOR_LAYOUT:
conv_config.output_layout = ttnn.TILE_LAYOUT
logger.warning(
"Forcing output_layout to TILE when act_block_h_override, fp32_accum and packer_l1_acc are enabled"
Contributor

why TILE only for fp32_accum and packer_l1_acc?

Contributor Author

It fails when untilize_out, packer_l1_acc, and fp32_accum are all enabled.

Contributor

Can we add TT_FATALs to the validate function for the combinations we know are not working?

Contributor

bump

@@ -169,7 +182,8 @@ def run_conv(
return_weights_and_bias=True,
return_output_dim=True,
)

# import numpy as np
# np.save("ref.npy",weights_device.cpu().to_torch().numpy())
Contributor

delete

@@ -1213,7 +1229,7 @@ def test_resnet50_conv_wh_fp32(
)
@pytest.mark.parametrize(
"activations_dtype",
- [ttnn.bfloat8_b],
+ [ttnn.bfloat16],
Contributor

why change?

Contributor Author

To test row major output.

@@ -86,7 +86,7 @@ bool check_non_tile_mul_width(
auto elem_size = conv_config.weights_dtype == DataType::BFLOAT8_B ? 1 : 2;
bool is_non_tile_mul_width =
(conv_config.shard_layout.has_value() && conv_config.shard_layout == TensorMemoryLayout::BLOCK_SHARDED) &&
- conv_config.act_block_h_override == 0 &&
+ // conv_config.act_block_h_override == 0 &&
Contributor

delete

@@ -24,6 +24,9 @@
// SliceRange srr = SliceRange{.h0 = 0, .h1 = 1, .hs = 8, .w0 = 0, .w1 = 32, .ws = 1};
// SliceRange srr1 = SliceRange{.h0 = 1, .h1 = 2, .hs = 8, .w0 = 0, .w1 = 32, .ws = 1};
// SliceRange src = SliceRange{.h0 = 0, .h1 = 32, .hs = 1, .w0 = 0, .w1 = 1, .ws = 1};
// SliceRange row_range = SliceRange{.h0 = 0, .h1 = 1, .hs = 1, .w0 = 0, .w1 = 16, .ws = 1};
// SliceRange col_range = SliceRange{.h0 = 0, .h1 = 16, .hs = 1, .w0 = 0, .w1 = 1, .ws = 1};
// SliceRange sq_range = SliceRange{.h0 = 0, .h1 = 4, .hs = 1, .w0 = 0, .w1 = 4, .ws = 1};
Contributor

Let's delete the whole commented-out block.

Contributor Author

Sure. I will remove once I've got all the tests passing.

Contributor Author

Removed.

Contributor

@pavlejosipovic left a comment

Can you describe in the PR what the issue in the compute kernel was and what was done to fix it? I can't follow the code.

@sankarmanoj-tt force-pushed the smanoj/fix_conv2d_RM branch 4 times, most recently from 4894625 to 3afb00d on February 21, 2025 19:11
Contributor

@pavlejosipovic left a comment

What is the impact of this change on test runtime?


if (curr_matmul_out_cb == matmul_partials_cb) {
UNPACK(
if (!use_partials_for_out) get_local_cb_interface(matmul_partials_cb).fifo_rd_ptr =
Contributor

make constexpr

Contributor Author

Done.

if (!use_partials_for_out) get_local_cb_interface(matmul_partials_cb).fifo_rd_ptr =
partials_cb_read_ptr);
PACK(
if (!use_partials_for_out) get_local_cb_interface(matmul_partials_cb).fifo_wr_ptr =
Contributor

make constexpr

Contributor Author

Done.

@@ -366,7 +375,9 @@ void MAIN {
cb_pop_front(in1_cb_id, in1_block_num_tiles);
} // for in0_num_blocks_w
if constexpr (matmul_partials_cb == mm_out_cb_id) {
UNPACK(get_local_cb_interface(matmul_partials_cb).fifo_rd_ptr = partials_cb_read_ptr);
UNPACK(
if (use_partials_for_out) get_local_cb_interface(matmul_partials_cb).fifo_rd_ptr =
Contributor

make constexpr

Contributor Author

Done.

@@ -416,6 +427,10 @@ void MAIN {
in1_index_subblock_offset += out_subblock_w;
} // for in1_num_subblocks
} // in0_num_subblocks
if (untilize_out) {
Contributor

make constexpr

Contributor Author

Done.

@@ -226,17 +226,26 @@ std::vector<TensorSpec> OptimizedConvNew::compute_output_specs(const std::vector
dtype, PageConfig(output_layout), mem_config, output_shape, padded_output_shape))};
} else if (this->memory_config.memory_layout == TensorMemoryLayout::WIDTH_SHARDED) {
uint32_t total_height_tiles = padded_output_shape.volume() / padded_output_shape[-1] / TILE_HEIGHT;
std::array<uint32_t, 2> shard_shape = {

Contributor

I think it's tricky to use logical sharding for the conv output only for width sharding.
In general, most ops (like TMs) are not ready to consume logical sharding, and a change like this could cause models to fail.
Now people will have to keep in mind that width sharding outputs logical shards while HS and BS output physical shards.

What was the reason to go for logical sharding here?

Contributor Author

With width sharding and the output in Row Major layout, this condition was failing when the output height * width was not a multiple of 32.

When the output was in Tile layout, the shard would be padded correctly. However, in Row Major layout, the physical size of the shard is always the logical size.
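A small numeric illustration of that mismatch (the shape below is made up for the example):

// 625 output rows (N * H * W = 1 * 25 * 25), width sharded.
constexpr uint32_t TILE_HEIGHT = 32;
uint32_t total_height = 1 * 25 * 25;  // 625 rows, not a multiple of 32
uint32_t tile_layout_shard_height =
    ((total_height + TILE_HEIGHT - 1) / TILE_HEIGHT) * TILE_HEIGHT;  // padded to 640
uint32_t row_major_shard_height = total_height;  // stays 625, so the old size check failed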

Contributor Author

Should HS and BS also be changed to Logical Sharding?

Contributor

Can you trigger the same problem with HS if you use a shallow conv (input_channel_alignment of 8 or 16) and untilize the output?
This problem doesn't sound specific to WS.
If you can make all models work with conv output as logical sharding, that would be great.

@sankarmanoj-tt
Contributor Author

There is around a 5% increase in kernel execution time when the output is in Row Major.
ops_perf_results_rm_comp_2025_02_27_07_57_07.csv

@pavlejosipovic
Contributor

There is around a 5% increase in kernel execution time when the output is in Row Major. ops_perf_results_rm_comp_2025_02_27_07_57_07.csv

Sorry, I wasn't clear enough. I was asking whether the runtime of test_new_conv2d.py has increased, and if so by how much (we were trying to reduce its runtime to make post-commit faster).
