
PARSA Scheduler #46

Open
pooriaPoorsarvi opened this issue Mar 2, 2025 · 1 comment
pooriaPoorsarvi commented Mar 2, 2025

We need a system that can schedule research jobs from all teams in PARSA and make use of every machine PARSA has available.

The current requirements are as follows:

  • Be able to use all our compute resources and integrate machines that we might add later
  • Have a shared file system between our machines, so different stages of a pipeline can share inputs and outputs
  • Be able to schedule jobs on machines based on timing and resource availability
  • Make pipelines easy to define, schedule, and manage
pooriaPoorsarvi commented Mar 2, 2025

We will see if we can find the old system; if not, we can use something like Airflow, with the requirements being:

  1. Easy to learn (both for current and future students)
  2. Can keep up with our work long term
  3. Can use all our resources

@branylagaffe was there any progress on the old system?

@pooriaPoorsarvi suggests Airflow.

It is very easy to use: it has logging, schedules are defined easily in Python, and it has an intuitive UI. It is maintained by Apache and widely used in industry for ETL pipelines.

Just one thing needs to be checked:
whether we can schedule jobs based on resource availability on the machines (i.e., not launch additional jobs if CPU or RAM usage would exceed a predefined limit).

Right now @pooriaPoorsarvi is checking that.
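A minimal sketch of the resource gate being checked for, in plain Python (the thresholds and function names are hypothetical; in practice the CPU/RAM readings would come from a probe such as psutil on the target machine):

```python
# Hypothetical resource-availability gate: refuse to launch another job
# if the machine's CPU or RAM usage is already above a predefined limit.

CPU_LIMIT = 0.80   # assumed limit: do not launch above 80% CPU usage
RAM_LIMIT = 0.75   # assumed limit: do not launch above 75% RAM usage

def can_launch(cpu_usage: float, ram_usage: float,
               cpu_limit: float = CPU_LIMIT,
               ram_limit: float = RAM_LIMIT) -> bool:
    """Return True only if both CPU and RAM usage are under their limits."""
    return cpu_usage < cpu_limit and ram_usage < ram_limit
```

If Airflow cannot express this natively, the same check could run as the first task of each job, deferring the job when `can_launch` returns False.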

@xusine also gave us an example of a current job that we can use to derive the requirements of the system:

Stage 1: Functional Warming
Input: QEMU image, initial snapshot, sampling interval, sample size, cache parameters, whether the stage is parallel or sequential, and the quantum size.
Output: A checkpoints folder containing the samples generated by functional warming. Depending on the sample size and the workload, the folder can vary between 500 GB and 3 TB; larger samples are also possible.
Requirement: This stage may run on different machines, depending on whether the simulation is sequential or parallel.
Stage 2: Timing Simulation
Input: The checkpoints folder produced by the previous stage, and a timing.cfg specifying the timing simulation parameters.
Output: A CSV file recording the results, aggregated from the input folder.
Requirement: This stage needs to check whether the machine has enough disk, and whether it has enough DRAM to enable parallel timing instances. Multiple sampling units can run in parallel, and each timing simulation instance requires at most one core.
The two stages may run on different machines, depending on whether the first stage runs in parallel or sequentially. The second stage cannot start until the first stage finishes.
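The dependency and placement constraints above can be sketched in plain Python (all names, sizes, and numbers here are illustrative, not taken from the actual toolchain): Stage 2 starts only after Stage 1 finishes, and only on a machine with enough disk for the checkpoints and enough DRAM for the parallel timing instances.

```python
# Illustrative sketch of the two-stage job's scheduling constraints.
from dataclasses import dataclass

@dataclass
class Machine:
    free_disk_gb: int
    free_dram_gb: int
    cores: int

def run_functional_warming() -> str:
    # Placeholder for Stage 1: would produce a checkpoints folder
    # (500 GB to 3 TB depending on sample size and workload).
    return "checkpoints/"

def max_timing_instances(m: Machine, checkpoint_size_gb: int,
                         dram_per_instance_gb: int) -> int:
    """How many parallel timing instances the machine supports
    (0 means the machine cannot host Stage 2 at all)."""
    if m.free_disk_gb < checkpoint_size_gb:
        return 0
    by_dram = m.free_dram_gb // dram_per_instance_gb
    return min(by_dram, m.cores)  # each instance needs at most one core

checkpoints = run_functional_warming()  # Stage 2 must wait for this
machine = Machine(free_disk_gb=4000, free_dram_gb=256, cores=64)
instances = max_timing_instances(machine, checkpoint_size_gb=3000,
                                 dram_per_instance_gb=8)
```

In Airflow terms, the two stages would be separate tasks with a dependency edge between them, and the disk/DRAM check would run at the start of the Stage 2 task to pick the instance count.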
