Commit

add back missing stuff,
CallumWalley committed Sep 19, 2023
1 parent b2a7657 commit 17b4260
Showing 1 changed file with 10 additions and 7 deletions.
17 changes: 10 additions & 7 deletions _episodes/07-resources.md
@@ -26,8 +26,10 @@ As a reminder, our slurm script `example-job.sl` currently looks like this.
```
{% include example_scripts/example-job.sl.1 %}
```

{: .language-bash}

We will now submit the same job again with more CPUs.
We ask for more CPUs by adding `#SBATCH --cpus-per-task 4` to our script.
Your script should now look like this:

```
@@ -56,10 +58,12 @@ Note in squeue, the number under cpus, should be '4'.

Checking on our job with `sacct`.
Oh no!
{% include {{ site.snippets }}/scaling/OOM.snip %}

{% include {{ site.snippets }}/scaling/OOM.snip %}
{: .language-bash}

To understand why our job failed, we need to talk about the resources involved.

Understanding the resources you have available and how to use them most efficiently is a vital skill in high performance computing.

Below is a table of common resources and issues you may face if you do not request the correct amount.
@@ -98,6 +102,7 @@ Below is a table of common resources and issues you may face if you do not reque

## Measuring Resource Usage of a Finished Job

Since we have already run a job (successful or otherwise), this is the best source of info we currently have.
We can check the status of our finished job using the `sacct` command we learned earlier.

```
@@ -107,8 +112,6 @@ If we check the status of our finished job using the `sacct` command we learned

{% include {{ site.snippets }}/scheduler/basic-job-status-sacct.snip %}

<!-- Put big formulas here. -->

With this information, we may determine a couple of things.

Memory efficiency can be determined by comparing <strong style="color:#66cdaa">ReqMem</strong> (requested memory) with <strong style="color:#00e400">MaxRSS</strong> (maximum used memory). MaxRSS is given in KB, so a unit conversion is usually required.
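
The comparison between ReqMem and MaxRSS can be sketched as a small shell helper. This is only an illustration: the function name `mem_efficiency` is hypothetical, the input numbers are made up rather than taken from the job above, and the conversion assumes ReqMem is given in GB with 1 GB = 1024 × 1024 KB.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): memory efficiency as a percentage.
#   $1 = MaxRSS in KB, as read from the sacct output
#   $2 = ReqMem in GB (assumed unit; adjust the conversion if yours differs)
mem_efficiency() {
    awk -v rss_kb="$1" -v req_gb="$2" 'BEGIN {
        req_kb = req_gb * 1024 * 1024          # convert GB to KB
        printf "%.1f\n", 100 * rss_kb / req_kb # used / requested, in percent
    }'
}

# Illustrative values only: 524288 KB used out of 1 GB requested.
mem_efficiency 524288 1
```

With these example values the job used half of the memory it requested. A result close to or above 100 would suggest that the request was too small.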
@@ -201,7 +204,7 @@ We can do this with the command `squeue --me`, and looking under the 'NODELIST'

{% include {{ site.snippets }}/resources/get-job-node.snip %}

Now that we know the location of the job (wbn189) we can use SSH to run htop there.
Now that we know the location of the job (wbn189) we can use `ssh` to run `htop` _on that node_.

```
{{ site.remote.prompt }} ssh wbn189 -t htop -u $USER
@@ -211,8 +214,8 @@ Now that we know the location of the job (wbn189) we can use SSH to run htop the
You may get a message:

```
ECDSA key fingerprint is SHA256:Se1WKeayCfi3lAxDzS7fBlS83kBaBEvBgxHoAz2HVkM.
ECDSA key fingerprint is MD5:9d:03:fc:43:07:ac:ac:9b:78:85:45:52:ac:7a:ed:cd.
ECDSA key fingerprint is SHA256:############################################
ECDSA key fingerprint is MD5:9d:############################################
Are you sure you want to continue connecting (yes/no)?
```
{: .language-bash}