Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UMD] Switch pci_cores and dram_cores to CoreCoord api. #17620

Merged
merged 9 commits into from
Feb 26, 2025

Conversation

broskoTT
Copy link
Contributor

@broskoTT broskoTT commented Feb 5, 2025

Ticket

Related to #17002

Problem description

Switched .pci_cores and .dram_cores to new API.

What's changed

  • Changed .pci_cores and .dram_cores to get_cores
  • Changed old get_core_for_dram_channel with new get_dram_core_for_channel

Checklist

@abhullar-tt
Copy link
Contributor

on WH the translated space does not modify pcie/dram cores but on BH we do so I think when noc translation is enabled Metal host will use the translated pcie/dram coords from Andrews spec

In my testing I was trying to pass in physical pcie coordinate and get translated but I saw it returns physical rather than (19,24). I think this might be a bug?

@abhullar-tt
Copy link
Contributor

on WH the translated space does not modify pcie/dram cores but on BH we do so I think when noc translation is enabled Metal host will use the translated pcie/dram coords from Andrews spec

In my testing I was trying to pass in physical pcie coordinate and get translated but I saw it returns physical rather than (19,24). I think this might be a bug?

The above mistranslation seems more like a machine issue since noc translation is being readback as not enabled

@abhullar-tt
Copy link
Contributor

on WH the translated space does not modify pcie/dram cores but on BH we do so I think when noc translation is enabled Metal host will use the translated pcie/dram coords from Andrews spec
In my testing I was trying to pass in physical pcie coordinate and get translated but I saw it returns physical rather than (19,24). I think this might be a bug?

The above mistranslation seems more like a machine issue since noc translation is being readback as not enabled

side note: after hardcoding UMD's Cluster to set noc translation enabled to true yyz-lab-125 there might be a bug because it fails to translate (11,0) from physical to translated

Copy link
Contributor

@abhullar-tt abhullar-tt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not gonna block approval on the pcie translation issue. can be debugged separately

broskoTT added a commit that referenced this pull request Feb 24, 2025
…17707)

### Ticket
Related to #17002

### Problem description
A couple of leftover usages of old soc descriptor API. After this and
other PRs from this set, tt-metal will finally build with harvesting
code completely removed from tt::umd::Cluster and members of
tt_SocDescriptor made private, so that all usages are forced through
get_cores() and other APIs.
Related PRs:
#17620
#17642
#17645
#17674
#17678

### What's changed
- .dram_cores changed with get_cores_for_dram_channel
- dram_cores.size() changed with get_grid_size
- replace .workers and .ethernet_cores from tlb_config with get_cores

### Checklist
- [x] All post-commit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13197621962
- [x] Newest All post-commit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13499189013
- [x] Blackhole post-commit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13197623746
- [ ] (Single-card) Model perf tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13197626137
- [ ] (Single-card) Device perf regressions :
https://github.com/tenstorrent/tt-metal/actions/runs/13197628487
- [ ] (T3K) T3000 unit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13197630092
- [ ] (T3K) T3000 demo tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13197632086
- [ ] (TG) TG unit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13197633394
- [ ] (TG) TG demo tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13197635275
- [x] (TGG) TGG unit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13197637219
- [x] (TGG) TGG demo tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13197639736
broskoTT added a commit to tenstorrent/tt-umd that referenced this pull request Feb 26, 2025
…onfigurations (#539)

### Issue
#279

### Description
This is how it is used by tt-metal, so this change will support the
metal's scenario for both boards.

### List of the changes
- Exposed the board type and "is_chip_remote" up to soc descriptor
- Don't use _type1 and _type2 soc descriptors. Add default bh soc
descriptor
- Fix router cores for _type2 soc descriptor (not needed)
- Introduced get_blackhole_chip_type to be used at a couple of places,
along with BlackholeChipType enum, for clarity.
- Added BoardBasedPCIE test
- test in test_cluster_bh that only single PCIE core exists in resulting
soc descriptor
- changed "is_chip_remote" to "asic_location" which communicates better
the intention. Although we still don't have p300 to confirm behavior
- Changed constructor of local_chip to accept sdesc_path, so that it
works on tt_metal. We still have a problem that our internal soc
descriptors are unused.
- Changed _no_eth soc descriptor to hold both pcie cores.


### Testing
Added tests to test new feature.
tt_metal test, which removes its own logic on choosing pcie_core,
passes:
- APC run which passes on tt_metal with this branch:
https://github.com/tenstorrent/tt-metal/actions/runs/13541164354
- BH run:
https://github.com/tenstorrent/tt-metal/actions/runs/13541162298/job/37843118101

### API Changes
There are no API changes in this PR.
Previous PR which changed P150A to P150 broke metal build. But
fortunately, exactly the changes which I'm doing in this PR and related
metal PR fix that (remove that dependancy altogether):
tenstorrent/tt-metal#17620
@broskoTT broskoTT merged commit da63d50 into main Feb 26, 2025
255 of 266 checks passed
@broskoTT broskoTT deleted the brosko/soc_new_api4 branch February 26, 2025 14:37
broskoTT added a commit that referenced this pull request Feb 26, 2025
…18352)

### Ticket
Related to #17002

### Problem description
A follow up from removing .dram_cores in
#17620 and removing
eth_cores in #17642

Not sure if I made mistake during merging changes, or if these changes
happened in the meantime. With this changes, tt_metal builds with this
UMD change which is the ultimate goal:
tenstorrent/tt-umd#509

### What's changed
- Changed .dram_cores with .get_dram_cores()
- Changed .ethernet_cores with .get_cores(CoreType::ETH)

### Checklist
All runs on brosko/umd_api_last_changes :
- [ ] All post-commit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13548653822
- [x] Blackhole post-commit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13548656443
- [ ] (Single-card) Model perf tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13548658812
- [ ] (Single-card) Device perf regressions :
https://github.com/tenstorrent/tt-metal/actions/runs/13548660915
- [ ] (T3K) T3000 unit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13548663044
- [ ] (T3K) T3000 demo tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13548664896
- [ ] (TG) TG unit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13548666720
- [ ] (TG) TG demo tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13548669182
- [x] (TGG) TGG unit tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13548671739
- [x] (TGG) TGG demo tests :
https://github.com/tenstorrent/tt-metal/actions/runs/13548674558
broskoTT added a commit to tenstorrent/tt-umd that referenced this pull request Feb 27, 2025
### Issue
Last one for #439 

### Description
Finally removing old harvesting code. All the tt_metal's usages have
been moved to the new soc_descriptor and core_coordinate_manager apis.

### List of the changes
- Remove using_harvested_soc_descriptors,
get_harvesting_masks_for_soc_descriptors and
translate_to_noc_table_coords functions, and performed_harvesting,
harvested_rows_per_target, translation_tables_en from the public cluster
API
- Remove harvesting code from Cluster with many internal functions
- move old soc_descriptor structures to be private.
- Adjust tests accordingly
- Added noc_translation public member to soc_descriptor. Verify that the
one passed through constructor and through soc_descriptor matches. This
code should be improved.

### Testing
Code builds

### API Changes
There are no breaking API changes in this PR.
Related metal PRs prior to this change:
tenstorrent/tt-metal#17003
tenstorrent/tt-metal#17306
tenstorrent/tt-metal#17543
tenstorrent/tt-metal#17620
tenstorrent/tt-metal#17642
tenstorrent/tt-metal#17645
tenstorrent/tt-metal#17674
tenstorrent/tt-metal#17678
tenstorrent/tt-metal#17707
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants