Grid.md

View description: None

View type: List View


Projects:

Description: None

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

H * * * *

Description: None

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

Not periodically build

Description: Check whether there are jobs queued for condor nodes and, if so, kill the placeholder job.

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

H/2 * * * *

Description: Shuts down a condor job.

TODO: check whether manual deletion is working.

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

Not periodically build

Description: None

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

Not periodically build

Description: None

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

Not periodically build

Description: Update the SIF image for LUMI HPC.

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

Not periodically build

Description: None

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

H/10 * * * *

Description: None

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

Not periodically build

Description: This job is triggered by HTCondor pilot jobs. When we request an HTCondor node, a pilot job runs first and triggers this job at various stages, each with a different "STATUS" value:

  • STATUS=online
    • When the pilot script first runs on the HTCondor node, it sends an "online" event along with CONDOR_JOB_ID. This job then tries to add the new HTCondor node as a Jenkins agent. If a build with "online" status fails:
      1. First check whether the new HTCondor node has been added as a Jenkins agent (https://cmssdt.cern.ch/jenkins/computer/). The agent may have been created even though its connection failed. In that case, restarting the agent should be enough.
      2. If the node has not been added, first run the https://cmssdt.cern.ch/jenkins/job/grid-check-jobs/ job and check whether the HTCondor job with CONDOR_JOB_ID is still running. For example, look for output like:
        OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
        cmsbuild ID: CONDOR_JOB_ID DATE TIME _ 1 _ _ 1 CONDOR_JOB_ID
        If the HTCondor job is still in the "RUN" state, just retry this job. If it is in the "IDLE" state, do nothing. If it is in the "DONE" state, it is better to run https://cmssdt.cern.ch/jenkins/view/Grid/job/grid-shutdown-node/ to kill it.

  • STATUS=offline

    • This event is sent when the HTCondor pilot has used 90% of its maximum allocated time. On receiving it, this job marks the grid${CONDOR_JOB_ID} Jenkins agent as offline so that no new jobs can run on it. Once grid${CONDOR_JOB_ID} is offline and is not running any job, the agent is automatically deleted by https://cmssdt.cern.ch/jenkins/view/Grid/job/grid-check-nodes/. Depending on the agent "LABELS" (e.g. auto-recreate), this job can request a new HTCondor node to replace it.

  • STATUS=shutdown

    • This event is sent when a pilot has reached its maximum lifetime and is about to shut down. In this case, this job tries to delete the agent from Jenkins.
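
The triage for a failed "online" build and the offline/cleanup rules described above can be sketched in Python. This is only an illustration, not the code the Jenkins jobs actually run: the function names (`parse_counts`, `recommend_action`, `should_mark_offline`, `can_delete_agent`) are hypothetical, and the parser assumes a summary row with the same column layout as the sample output above, where `_` stands for zero.

```python
def parse_counts(row: str) -> dict:
    """Extract the DONE/RUN/IDLE/HOLD counts from one summary row.

    Assumption: the row ends with the columns
    DONE RUN IDLE HOLD TOTAL JOB_IDS, and "_" means zero.
    """
    tokens = row.split()
    if len(tokens) < 6:
        raise ValueError(f"unexpected summary row: {row!r}")
    names = ("DONE", "RUN", "IDLE", "HOLD")
    values = tokens[-6:-2]  # skip the trailing TOTAL and JOB_IDS columns
    return {n: 0 if v == "_" else int(v) for n, v in zip(names, values)}


def recommend_action(row: str) -> str:
    """Map the HTCondor job state to the manual follow-up step."""
    counts = parse_counts(row)
    if counts["RUN"] > 0:
        return "retry"      # re-run the failed "online" build
    if counts["IDLE"] > 0:
        return "wait"       # nothing to do yet
    if counts["DONE"] > 0:
        return "shutdown"   # run grid-shutdown-node to clean up
    return "unknown"        # e.g. held jobs: inspect by hand


def should_mark_offline(elapsed_s: int, max_s: int) -> bool:
    """True once the pilot has used 90% of its allocated time."""
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return elapsed_s * 10 >= max_s * 9


def can_delete_agent(is_offline: bool, running_jobs: int) -> bool:
    """An agent is only removed when it is offline and idle."""
    return is_offline and running_jobs == 0


if __name__ == "__main__":
    sample = "cmsbuild ID: 1234 1/1 12:00 _ 1 _ _ 1 1234.0"
    print(recommend_action(sample))
    print(should_mark_offline(elapsed_s=3300, max_s=3600))
    print(can_delete_agent(is_offline=True, running_jobs=0))
```

The job ID, date, and time in `sample` are made-up placeholders. In practice the row would be copied from the console output of the grid-check-jobs build.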

Project is enabled.

Upstream projects:

Downstream projects:

Sub-projects:

Triggers from: []

Periodic builds:

Not periodically build