GH-459: Fix agents failing to clean shm
dds-agent: Fixed: Ignore SIGTERM while performing cleaning procedures. (GH-459)
dds-slurm-plugin: Fixed: Make sure that scancel's SIGTERM is properly handled by all job steps and their scripts. (GH-459)
AnarManafov committed Jul 25, 2022
1 parent cb01993 commit 5efade2
Showing 8 changed files with 41 additions and 18 deletions.
2 changes: 2 additions & 0 deletions ReleaseNotes.md
@@ -17,6 +17,7 @@ Added: Support for Apple's arm64 architecture. (GH-393)
### dds-agent
Fixed: Address potential crash in the external process termination routines.
Fixed: Revised handling of the slots container.
Fixed: Ignore SIGTERM while performing cleaning procedures. (GH-459)

### dds\_intercom\_lib
Fixed: Stability improvements.
@@ -49,6 +50,7 @@ Fixed: ssh cfg parser is passing cfg files of all plug-ins. (GH-413)
Added: Support for SubmissionID (GH-411)

### dds-slurm-plugin
Fixed: Make sure that scancel's SIGTERM is properly handled by all job steps and their scripts. (GH-459)
Added: Support for SubmissionID (GH-411)
Added: Support of minimum number of agents to spawn. (GH-434)
Modified: Replace array job submission with nodes requirement. (GH-430)
8 changes: 7 additions & 1 deletion dds-agent/src/AgentConnectionManager.cpp
@@ -44,8 +44,14 @@ CAgentConnectionManager::~CAgentConnectionManager()
void CAgentConnectionManager::doAwaitStop()
{
m_signals.async_wait(
[this](boost::system::error_code /*ec*/, int /*signo*/)
[this](boost::system::error_code /*ec*/, int signo)
{
// The server is stopped by cancelling all outstanding asynchronous
// operations. Once all operations have finished the io_context::run()
// call will exit.
LOG(dds::misc::info) << "Received a signal: " << signo;
LOG(dds::misc::info) << "Stopping DDS connection manager...";

// Stop transport engine
stop();
});
5 changes: 5 additions & 0 deletions dds-agent/src/main.cpp
@@ -52,6 +52,11 @@ void clean()

int main(int argc, char* argv[])
{
// ignore SIGTERM
// This is mainly for the clean mode to be able to finish the clean process.
// Other agent modes reassign signal handlers via asio anyway.
std::signal(SIGTERM, SIG_IGN);

// Command line parser
SOptions_t options;
try
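A rough shell analogue of the `SIG_IGN` change above, as a sketch only and not part of the commit. Here `trap '' TERM` stands in for `std::signal(SIGTERM, SIG_IGN)`, and the child commands and the `cleanup finished` message are made up for the demo.

```shell
#!/usr/bin/env bash

# Child 1: ignores SIGTERM, signals itself (as scancel would), then finishes.
ignored_status=0
ignored_output=$(bash -c 'trap "" TERM; kill -TERM $$; echo "cleanup finished"') || ignored_status=$?

# Child 2: keeps the default SIGTERM disposition and signals itself.
default_status=0
default_output=$(bash -c 'kill -TERM $$; echo "cleanup finished"') || default_status=$?

echo "SIGTERM ignored: status=$ignored_status output='$ignored_output'"
echo "SIGTERM default: status=$default_status output='$default_output'"
```

The first child survives the signal and completes its "cleanup". The second one is normally terminated before it can print anything, which corresponds to the agent's clean mode being interrupted.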
28 changes: 16 additions & 12 deletions dds-commander/src/main.cpp
@@ -192,22 +192,26 @@ int main(int argc, char* argv[])
jobs.push_back(protoSlurmSubmitInfo.slurm_job_id(0));
}
}
const fs::path scancelPath{ bp::search_path("scancel") };

stringstream ssCmd;
ssCmd << scancelPath.string();
for (const auto& id : jobs)
if (!jobs.empty())
{
ssCmd << " " << id;
}
const fs::path scancelPath{ bp::search_path("scancel") };

LOG(log_stdout) << "SLURM JOB CANCEL: " << ssCmd.str();
string sout;
string serr;
execute(ssCmd.str(), chrono::seconds(30), &sout, &serr);
if (!serr.empty())
LOG(log_stderr) << "SLURM JOB CANCEL: " << serr;
stringstream ssCmd;
ssCmd << scancelPath.string();
ssCmd << " --full ";
for (const auto& id : jobs)
{
ssCmd << " " << id;
}

LOG(log_stdout) << "SLURM JOB CANCEL: " << ssCmd.str();
string sout;
string serr;
execute(ssCmd.str(), chrono::seconds(30), &sout, &serr);
if (!serr.empty())
LOG(log_stderr) << "SLURM JOB CANCEL: " << serr;
}
// <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

return EXIT_SUCCESS;
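The net effect of this hunk is that `scancel` is only invoked when there is at least one job ID, and that it is now called with `--full`, which asks `scancel` to signal the batch script and its child processes as well. A minimal shell sketch of the resulting command line follows. `build_scancel_cmd` and the job IDs are hypothetical, and nothing is actually submitted to SLURM.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the cancel command for the given job IDs.
# With no IDs it prints nothing and returns 1, mirroring the new
# `if (!jobs.empty())` guard in dds-commander.
build_scancel_cmd() {
    if [ "$#" -eq 0 ]; then
        return 1
    fi
    # --full: also signal the batch script and its children
    echo "scancel --full $*"
}

# Demo with made-up job IDs, then with an empty job list.
build_scancel_cmd 1001 1002
build_scancel_cmd || echo "no jobs to cancel, scancel is skipped"
```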
2 changes: 1 addition & 1 deletion dds-protocol-lib/src/ConnectionManagerImpl.h
@@ -67,7 +67,7 @@ namespace dds
// operations. Once all operations have finished the io_context::run()
// call will exit.
LOG(dds::misc::info) << "Received a signal: " << signo;
LOG(dds::misc::info) << "Stopping DDS transport server";
LOG(dds::misc::info) << "Stopping DDS transport server...";

stop();
});
2 changes: 1 addition & 1 deletion dds-topology-lib/src/TopoBase.h
@@ -7,10 +7,10 @@
#define __DDS__TopoBase__

// STD
#include <map>
#include <sstream>
#include <string>
#include <vector>
#include <map>
// BOOST
#include <boost/property_tree/ptree.hpp>

2 changes: 1 addition & 1 deletion etc/DDSWorker.sh.in
@@ -54,7 +54,7 @@ wait_and_kill()
kill -9 $1
break
fi
sleep 1
sleep 0.3
done
}
#=============================================================================
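Only the tail of `wait_and_kill` is visible in this hunk, so the following is a guess at the general shape of such a polling loop, not the actual DDSWorker.sh code. The function name, the `max_polls` parameter, and the demo process are assumptions. It illustrates why the shorter `sleep 0.3` matters: the loop notices that the process has exited, or reaches the hard kill, in finer steps.

```shell
#!/usr/bin/env bash

# Hypothetical reconstruction of a wait-then-kill loop.
# $1: PID to watch, $2: number of polls before SIGKILL is sent (assumed).
wait_and_kill_sketch() {
    local pid=$1 max_polls=$2 polls=0
    # kill -0 only checks whether the process can still be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        polls=$((polls + 1))
        if [ "$polls" -gt "$max_polls" ]; then
            kill -9 "$pid"
            break
        fi
        sleep 0.3
    done
}

# Demo: a process that would run for 30 s is killed after a few polls.
sleep 30 &
demo_pid=$!
wait_and_kill_sketch "$demo_pid" 3
demo_status=0
wait "$demo_pid" || demo_status=$?
echo "exit status: $demo_status"
```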
10 changes: 8 additions & 2 deletions plugins/dds-submit-slurm/src/job.slurm.in
@@ -12,7 +12,13 @@

#DDS_USER_OPTIONS

# execute DDS Scout
srun --no-kill --kill-on-bad-exit=0 --output=slurm-%j-%N.out /usr/bin/env bash -c 'eval JOB_WRK_DIR=%DDS_AGENT_ROOT_WRK_DIR%/${SLURM_JOB_NAME}_${SLURM_JOBID}_${SLURMD_NODENAME}; mkdir -p $JOB_WRK_DIR; cd $JOB_WRK_DIR; cp %DDS_SCOUT% $JOB_WRK_DIR/; ./DDSWorker.sh'
# ignore signals
# continue waiting for child processes by any means
trap -- '' SIGINT SIGTERM

# execute DDS Scout
srun --no-kill --kill-on-bad-exit=0 --output=slurm-%j-%N.out /usr/bin/env bash -c 'trap '"'"'kill $PID && wait'"'"' SIGINT SIGTERM; eval JOB_WRK_DIR=%DDS_AGENT_ROOT_WRK_DIR%/${SLURM_JOB_NAME}_${SLURM_JOBID}_${SLURMD_NODENAME}; mkdir -p $JOB_WRK_DIR; cd $JOB_WRK_DIR; cp %DDS_SCOUT% $JOB_WRK_DIR/; ./DDSWorker.sh & PID=$!; wait' &

wait

exit 0
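The inner `bash -c` body of the job step runs the worker in the background, forwards the signal to it from a trap, and waits until it has finished. That pattern can be tried locally with the sketch below. `worker` is a hypothetical stand-in for DDSWorker.sh, only SIGTERM is handled, and a plain `kill` plays the role of `scancel`. The outer `trap -- '' SIGINT SIGTERM` of the batch script is left out on purpose: a non-interactive shell cannot trap a signal that was already ignored when it started, and in this single-machine demo the step would inherit that disposition.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for DDSWorker.sh: cleans up on SIGTERM, then exits.
worker() {
    trap 'echo "worker: cleanup done"; exit 0' TERM
    while :; do sleep 0.1; done
}

# Mirrors the step body: start the worker in the background, forward
# SIGTERM to it, and do not return before the worker has finished.
step() {
    trap 'kill "$PID" && wait' TERM
    worker & PID=$!
    # wait returns a status > 128 when interrupted by the trapped signal
    wait || true
    echo "step: worker finished"
}

# Demo: run the step, send it SIGTERM, and collect what it printed.
out_file=$(mktemp)
step > "$out_file" &
step_pid=$!
sleep 0.5
kill -TERM "$step_pid"
wait "$step_pid"
demo_output=$(cat "$out_file")
rm -f "$out_file"
echo "$demo_output"
```

The step prints its final line only after the worker has run its own cleanup, which is the ordering the job script relies on.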
