#753: Syncing device host times for tracy profiler #8101
Are you planning to turn it on by default in the future?
Perhaps it would be useful to show everyone the accuracy of your new approach with the data you showed me.
I was hoping to get some mileage on it before turning it on by default.
This data is only used by tracy GUI to align device and host zones. It is disabled by default. It is enabled by setting TT_METAL_PROFILER_SYNC=1.
This is the PR for syncing device and host time for tracy.
The data is only used by the tracy GUI. A tt_metal sync program is created that loads the sync kernel onto the device.
With the kernel running and waiting, the host writes its time to an L1 location, and the device reads it and tags it with its own wall clock time. This happens for 249 iterations, a count driven by the profiler L1 buffer size.
Each sync program takes ~1 s, i.e. 249 x 4 ms (the sleep time between host timestamps) ~= 1 s.
Multiple sync programs can run per device. By default, at least 2 run per device: one at init_device and one at dump.
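To make the cost of enabling sync concrete, here is a rough estimate of the time spent in sync programs. It uses only the numbers quoted above (249 iterations, 4 ms sleep, 2 sync programs per device by default). The function name is hypothetical, and the estimate assumes devices are synced one after another, which this description does not state.

```python
# Numbers taken from the PR description above.
ITERATIONS_PER_SYNC = 249         # driven by the profiler L1 buffer size
SLEEP_BETWEEN_STAMPS_S = 0.004    # 4 ms sleep between host timestamps
DEFAULT_SYNCS_PER_DEVICE = 2      # one at init_device, one at dump


def sync_overhead_s(num_devices, syncs_per_device=DEFAULT_SYNCS_PER_DEVICE):
    """Estimate seconds spent sleeping inside sync programs.

    Assumption (not from the PR): devices are synced serially, so the
    per-program cost simply adds up across devices.
    """
    per_program_s = ITERATIONS_PER_SYNC * SLEEP_BETWEEN_STAMPS_S  # ~0.996 s
    return num_devices * syncs_per_device * per_program_s


if __name__ == "__main__":
    print(f"1 device:  {sync_overhead_s(1):.3f} s")
    print(f"8 devices: {sync_overhead_s(8):.3f} s")
```

Under these assumptions a single device pays about 2 s in total for the default two sync programs.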
The host then post-processes all the paired host-device timestamps and calculates the delay and frequency of the device.
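The post-processing step can be pictured as a straight-line fit through the (host time, device counter) pairs: the slope is the device frequency and the intercept plays the role of the delay. The sketch below uses ordinary least squares as an assumed technique. The PR does not say which fitting method the profiler actually uses, and the function and variable names are hypothetical.

```python
def fit_device_clock(host_times_s, device_cycles):
    """Least-squares fit of device_cycles ~= offset + freq_hz * host_time.

    Returns (freq_hz, offset_cycles). This is an illustrative stand-in for
    the profiler's post-processing, not its actual implementation.
    """
    n = len(host_times_s)
    if n < 2 or n != len(device_cycles):
        raise ValueError("need at least two paired samples of equal length")

    mean_t = sum(host_times_s) / n
    mean_c = sum(device_cycles) / n

    # Centered sums keep the arithmetic well conditioned for large counters.
    s_tt = sum((t - mean_t) ** 2 for t in host_times_s)
    s_tc = sum((t - mean_t) * (c - mean_c)
               for t, c in zip(host_times_s, device_cycles))
    if s_tt == 0:
        raise ValueError("host timestamps must not all be identical")

    freq_hz = s_tc / s_tt                      # slope: device cycles per host second
    offset_cycles = mean_c - freq_hz * mean_t  # device counter at host time zero
    return freq_hz, offset_cycles


def device_cycles_to_host_s(cycles, freq_hz, offset_cycles):
    """Map a device counter value onto the host timeline using the fit."""
    return (cycles - offset_cycles) / freq_hz
```

With 249 pairs spaced 4 ms apart, the fit spans roughly one second of host time, so a longer window or more sync programs would tighten the frequency estimate.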
Syncing is off by default and can be turned on by setting TT_METAL_PROFILER_SYNC=1.
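When a profiled workload is launched from a Python script, one way to apply the setting is to build the child process environment with the variable set. This is only a sketch: the helper name is hypothetical, and only the variable name and value come from this PR.

```python
import os


def profiler_sync_env(base_env=None):
    """Return a copy of the environment with host-device sync enabled.

    Only TT_METAL_PROFILER_SYNC=1 comes from the PR; the helper itself is
    illustrative. The caller's environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TT_METAL_PROFILER_SYNC"] = "1"
    return env
```

The returned mapping can be passed as the env argument of subprocess.run when starting the workload, so the variable is in place before the child process initializes the device.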
The best way to evaluate the precision is to note that I am getting roughly 5, if not 6, significant digits on my frequency calculation. Separate runs produce frequencies that agree to 6 significant digits. That can be read as microsecond precision on the sync, and certainly sub-10 us.
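One way to read the step from significant digits to microseconds: a frequency known to d significant digits has a relative error of about 10^-d, and over a window of about 1 s (the length of one sync program) that relative error becomes a timing error of about 10^-d seconds. The helper below is a hypothetical illustration of that arithmetic, with the 1 s window as an assumption.

```python
def sync_drift_s(freq_significant_digits, window_s=1.0):
    """Approximate timing error accumulated over window_s seconds when the
    device frequency is only known to the given number of significant digits.

    Assumes the relative frequency error is about 10**-digits and that the
    error grows linearly across the window.
    """
    relative_error = 10.0 ** (-freq_significant_digits)
    return relative_error * window_s


if __name__ == "__main__":
    for digits in (5, 6):
        print(f"{digits} digits: ~{sync_drift_s(digits) * 1e6:.0f} us over 1 s")
```

On this reading, 6 matching digits correspond to about 1 us over a one-second window and 5 digits to about 10 us, which lines up with the microsecond and sub-10 us figures quoted above.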
Below is the FD1 dispatch core end compared to the host finish call end. We can see a diff of 2.46us. Part of this delay is real: it is the time for the message to travel. This shows ~1us accuracy.
Green CI 🟢
Post commit: https://github.com/tenstorrent/tt-metal/actions/runs/9388576940
Profiler with latest rebase: https://github.com/tenstorrent/tt-metal/actions/runs/9421179535
Device perf: https://github.com/tenstorrent/tt-metal/actions/runs/9389438009
T3K profiler: https://github.com/tenstorrent/tt-metal/actions/runs/9421174522
uBenchmark: https://github.com/tenstorrent/tt-metal/actions/runs/9401574470