Improving Sparta Scheduler Performance #329

klingaard · 2022-04-15T18:39:26Z

klingaard
Apr 15, 2022
Maintainer

This is a discussion sparked from issue #187

With my original statement in issue #187, the Sparta scheduler is required to stick to deterministic behavior to ensure that back-to-back simulations (even with parameter changes) are consistent. For example, the user should not be chasing performance differences due to an accidental re-ordering of events in simulation due to, for example, resource capacity changes or even topology changes. The framework does help avoid such issues with auto-precedence support from derived sparta::Unit components. But the rest of the ordering is completely controlled by the user (see Using Phases).

With that said, I definitely agree that the Sparta Scheduler could use a performance uplift. It's currently very robust and very close to deterministic (again, that determinism is mostly controlled by the user's indicating expected ordering), but the algorithm the Scheduler uses to execute scheduled events is rudimentary, essentially being a double-for loop "searching" for scheduled events. As Pete would probably say, "Yuck." I'd have to agree.

To understand the following description of how the scheduler works, understand a couple of points:

All events in simulation are static meaning that they are pre-defined by a modeler during unit construction and are registered with the framework at that time. Modelers cannot dynamically create new events during a running simulation -- all work must be done using existing registered events
All registered events in simulation are given a group index before simulation starts to indicate their "place in time" with respect to those events that either precede or follow that event. This ordering is assigned by a DAG during simulation startup

How the Scheduler Works

The Scheduler maintains a time quantum list. A time quantum represents those events to be fired on a given tick. Each time quantum points to the next time quantum that has scheduled events. To "skip time," one time quantum (say at tick 100) would point to the next time quantum (say at tick 150) so that when the first quantum finishes, time will "skip" to 150.

Each time quantum maintains a two dimensional vector of events that have been scheduled. The first dimension of this vector is the event group index, which was assigned by the DAG during startup. The size of this vector is fixed to the total number of groups identified by the DAG. Those events with a small group index must be executed before those events in a larger group index. This is how the Scheduler maintains ordering between events.

Within each group is a vector of events to be executed. These events could (in theory) be executed in parallel as they are in the same group and the modeler has not indicated any relationships between these events. But there's a catch...

As events in that group are executed, they might schedule additional events that also fall into that same group in the same time quantum (0-cycle events). This does happen quite frequently in simulation, and has prevented Scheduler designs from using threading to its full potential.

So, at this point, we have this layout per time quantum:

quantum->
   group[0] -> event1, event2, event3
   group[1] -> event4, event7, event6, event5    // <-- out of order, but there's no relationship, so no problem
   group[2] -> null
   group[3] -> null
   group[4] -> event20, event50
   ...
   group[N] -> eventN, eventN+1, eventN+2

To see this in action, run the sparta_core_example with a scheduler logger enabled:

./sparta_core_example -l scheduler debug 1 | less -SR

One optimization I did make a while ago is to start the event execution at the first group that had an event scheduled. It's impossible (supposed to be) for an event to be scheduled at an earlier grouping once the time quantum is put into motion -- that would be an out-of-order situation and the framework would assert on you. But unfortunately there are still checks (like in group[2] and group[3] above) where there are no events scheduled, but the scheduler checks for work anyway.

Ideas for Improvement

I've toyed with a few ideas, some of them similar to Pete's basic scheduler concept documented in the reference issue. But keeping a running list of chained events trades searching for the insertion point (during scheduling) against the speed of traversing the list during execution. Currently insertion into the scheduler is an O(1) operation: group[index].push_back(event). A running list of events is an O(n) operation to find the correct place "in time" to insert the event.

One idea I played around with is extending the method delegation class with a "next" operation that is returned when an event is finished. Instead of walking event lists, events would "know" the next event to be executed; the scheduler would be dramatically simplified to a simple while loop:

    next_event = first_event_;
    while(next_event && running_) {
        next_event = next_event->execute();
    }
    if(false == running_) {
        first_event_ = next_event;  // to start it back up
    }

To do this, event handling would need to be extended to know the next event to execute. Insertion of an event would be tricky, however, but one idea is to look at the DAG for relationships. If eventA comes before eventB then if eventA is already scheduled and eventB is to be scheduled, just point eventA's next pointer to eventB and eventB's pointer to eventA's original next pointer. But this might require ALL events to be ordered in some fashion. More 🤔 to be done.

petewilson-bsc · 2022-04-15T20:00:26Z

petewilson-bsc
Apr 15, 2022

Knute

Could you provide me with an email address? I'd like to go over a few things with you; I think the things will be of interest to a wider audience, but I'd rather get the canards in a row before sharing too much...

I'm peter.wilson@bsc.es.

2 replies

petewilson-bsc Apr 15, 2022

Follow-up

I'm happy to contribute to the discussion here, but I'd rather initially discuss offline to get to the nub of the matter.

Basically, looks like we in Barcelona have an existence proof of a scalable very high performance multiprocessor RISC-V simulator, courtesy Peter Hsu; and of the ability to modestly 'restructure' Spike to make direct integration into a Sparta-like framework much more efficient. I've also made useful progress on the scheduler thoughts we mentioned way back then, which I'd be happy to share; not sure it directly benefits Sparta, but all knowledge is useful.

We've had good experiences with Sparta - it works as advertised, and doesn't seem to break - and we're happy to position Coyote as successfully based on Sparta. Our primary goal right now is to much better integrate the guts of Spike into a Sparta model - reworking scheduler stuff for higher native speed would be good, but we think the integration would benefit us a great deal right now, and would be a massive boon to all other folk seeking to model RISC-V systems.

I'm currently working from home in Normandy, so 7 hours different (before) you

klingaard Apr 17, 2022
Maintainer Author

Sent you a PM with the account to use.

petewilson-bsc · 2022-05-04T09:50:30Z

petewilson-bsc
May 4, 2022

I'm wondering if a timing wheel might be a good basis for the majority of event insertion.
A timing wheel trades size for time. It is especially good if one knows the largest 'delta t' (time between now and a future insertion time) before beginning simulation.

Suppose you know this time, and it's (say) 105 ticks. Then you make a table of event queues which is the next largest power of 2; 128 in this case. You keep a 'now' pointer or index to the current event queue - that is, one of the 128 queues correspond to now, and the other 127 are times in the future. You place an event on a queue at time T by computing (T-now)model 128. Provided that T <= 128 all is good and you know where to add the event in.

Depending on exactly what the events represent, you may also find that many events run every clock. Handling these can be speeded up by having the 'schedule N ticks in the future' code check for N being next tick. If its next tick, do nothing. If it's not next tick, insert in future. Deschedule the event if appropriate (no future work). Then, before moving to next time slot, move the whole of the remaining events queue to the next time slot. This would result in each entry in the wheel having two queues - one for inserted events, one for the 'inherited' queue. The move to the next slot is quite cheap.

Note - have not tried this as described.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving Sparta Scheduler Performance #329

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Improving Sparta Scheduler Performance #329

klingaard Apr 15, 2022 Maintainer

How the Scheduler Works

Ideas for Improvement

Replies: 2 comments · 2 replies

petewilson-bsc Apr 15, 2022

petewilson-bsc Apr 15, 2022

klingaard Apr 17, 2022 Maintainer Author

petewilson-bsc May 4, 2022

klingaard
Apr 15, 2022
Maintainer

Replies: 2 comments 2 replies

petewilson-bsc
Apr 15, 2022

klingaard Apr 17, 2022
Maintainer Author

petewilson-bsc
May 4, 2022