
PARSA Scheduler #46

Open
pooriaPoorsarvi opened this issue Mar 2, 2025 · 1 comment
pooriaPoorsarvi commented Mar 2, 2025

We need a system that can schedule research jobs from all teams in PARSA and make use of every machine PARSA has available.

The current requirements are as follows:

  • Be able to use all our compute resources and integrate machines that we might add later
  • Have a shared file system between our machines, so different stages of a pipeline can share inputs and outputs
  • Be able to schedule jobs on machines based on timing and resource availability
  • Make pipelines easy to define, schedule, and manage
pooriaPoorsarvi commented Mar 2, 2025

We will see if we can find the old system; if not, we can use something like Airflow, with the requirements being:

  1. Easy to learn (both for current and future students)
  2. Can keep up with our work long term
  3. Can use all our resources

@branylagaffe was there any progress on the old system?

@pooriaPoorsarvi suggests Airflow.

It is very easy to use: it has logging, schedules are defined easily in Python, and it has an intuitive UI. It is maintained by Apache and widely used in industry for ETL pipelines.

Just one thing needs to be checked:
whether we can schedule jobs based on resource availability on the machines (i.e., not launch additional jobs if CPU or RAM usage would exceed a predefined limit).

Right now @pooriaPoorsarvi is checking that.
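A minimal sketch of the resource gate being checked for, in plain Python (the thresholds and function names are hypothetical; in practice the CPU/RAM readings would come from a probe such as psutil on the target machine):

```python
# Hypothetical resource-availability gate: refuse to launch another job
# if the machine's CPU or RAM usage is already above a predefined limit.

CPU_LIMIT = 0.80   # assumed limit: do not launch above 80% CPU usage
RAM_LIMIT = 0.75   # assumed limit: do not launch above 75% RAM usage

def can_launch(cpu_usage: float, ram_usage: float,
               cpu_limit: float = CPU_LIMIT,
               ram_limit: float = RAM_LIMIT) -> bool:
    """Return True only if both CPU and RAM usage are under their limits."""
    return cpu_usage < cpu_limit and ram_usage < ram_limit
```

If Airflow cannot express this natively, the same check could run as the first task of each job, deferring the job when `can_launch` returns False.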

@xusine also gave us an example of a current job that we can use to derive the requirements of the system:

Stage 1: Functional Warming
Input: QEMU image, initial snapshot, sampling interval, sample size, cache parameters, whether the stage is parallel or sequential, and the quantum size.
Output: A checkpoints folder containing the samples generated by functional warming. Depending on the sample size and the workload, the folder can vary between 500 GB and 3 TB; larger samples are also possible.
Requirement: This stage may run on different machines, depending on whether the simulation is sequential or parallel.
Stage 2: Timing Simulation
Input: The checkpoints folder produced by the previous stage, and a timing.cfg specifying the timing simulation parameters.
Output: A CSV file recording the results, aggregated from the input folder.
Requirement: This stage needs to check whether the machine has enough disk, and whether it has enough DRAM to enable parallel timing instances. Multiple sampling units can run in parallel, and each timing simulation instance requires at most one core.
The two stages may run on different machines, depending on whether the first stage runs in parallel or sequentially. The second stage cannot start until the first stage finishes.
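The dependency and placement constraints above can be sketched in plain Python (all names, sizes, and numbers here are illustrative, not taken from the actual toolchain): Stage 2 starts only after Stage 1 finishes, and only on a machine with enough disk for the checkpoints and enough DRAM for the parallel timing instances.

```python
# Illustrative sketch of the two-stage job's scheduling constraints.
from dataclasses import dataclass

@dataclass
class Machine:
    free_disk_gb: int
    free_dram_gb: int
    cores: int

def run_functional_warming() -> str:
    # Placeholder for Stage 1: would produce a checkpoints folder
    # (500 GB to 3 TB depending on sample size and workload).
    return "checkpoints/"

def max_timing_instances(m: Machine, checkpoint_size_gb: int,
                         dram_per_instance_gb: int) -> int:
    """How many parallel timing instances the machine supports
    (0 means the machine cannot host Stage 2 at all)."""
    if m.free_disk_gb < checkpoint_size_gb:
        return 0
    by_dram = m.free_dram_gb // dram_per_instance_gb
    return min(by_dram, m.cores)  # each instance needs at most one core

checkpoints = run_functional_warming()  # Stage 2 must wait for this
machine = Machine(free_disk_gb=4000, free_dram_gb=256, cores=64)
instances = max_timing_instances(machine, checkpoint_size_gb=3000,
                                 dram_per_instance_gb=8)
```

In Airflow terms, the two stages would be separate tasks with a dependency edge between them, and the disk/DRAM check would run at the start of the Stage 2 task to pick the instance count.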
