# Embedded Nimrod
## Background
The Nimrod portal is a well-established tool for high-throughput computing that provides a mechanism to process many jobs on HPC systems without the need for individual job submissions. The Nimrod portal sits outside the HPC and submits work to it on your behalf.
Embedded Nimrod is a version of the Nimrod high throughput computing tool that can be utilized _within_ batch jobs on the HPC.
Embedded Nimrod builds a miniature Nimrod environment _within_ your PBS job resources and starts processing the experiment plan file you included with the job submission.
## Embedded Nimrod Use Cases
Embedded Nimrod is best suited for the following use cases:
* workloads with tasks that are repeated over and over for statistical sampling,
* workloads with tasks that are run with different input parameters,
* workloads with tasks that have a relatively, or annoyingly, short walltime
(such that the time taken to set up and take down the job is comparable to the walltime of the actual task),
* workloads where the computing footprint of each task is relatively small (e.g. `ncpus=1:mem=1gb`).
## Advantages
1. Unlike a job array or a sequence of regular batch jobs (eek!), all the looping over parameter values or inputs can be contained within the one PBS job.
2. Once an embedded Nimrod job starts it should keep going until all the parameter combinations are finished.
3. Once your job starts, the node resources you requested are yours for the duration of the walltime (or until the Nimrod experiment has been completed).
4. You can request as many resources as you need. Ideally this would be an entire node, and you would get as many instances (agents) running as can fit within the computational resources.
   * If, for example, your tasks require 2 cores each (`ompthreads=2`) and you requested `ncpus=24` cores in total, you would automatically get 12 tasks running in parallel.
   * If, instead, your tasks require 1 core (`ompthreads=1`) and 10GB of RAM each, and you requested `ncpus=12` cores, then 12 agents running your processing would fit within the RAM of a Tinaroo HPC node.
5. Unlike job arrays, which are constrained to a single integer index value (and your imagination for how to use it), the Nimrod experiment plan file allows for a variety of parameter types and methods of specifying them.
6. The configuration of tasks within the job is primarily governed by two resource parameters, *`ncpus`* and *`ompthreads`*. The memory footprint also needs to be specified.
7. Embedded Nimrod can be run with multiple nodes (`select=4`). To extend our previous example, if your tasks require 2 cores each (`ompthreads=2`) and you request 24 cores (`ncpus=24`) on 4 nodes (`select=4`), then you would get 48 tasks running in parallel.
The formula used to determine the number of agents to spawn is:
```
select * (ncpus / ompthreads)
```
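As a quick sanity check before submitting, the formula can be evaluated in the shell. This is a minimal sketch and not part of Embedded Nimrod itself: `agent_count` is a hypothetical helper name, and integer division is assumed when `ncpus` is not a multiple of `ompthreads`.

```bash
#!/bin/bash
# Hypothetical helper (not provided by Embedded Nimrod): evaluates
# select * (ncpus / ompthreads) using integer arithmetic.
agent_count() {
    local chunks="$1" ncpus="$2" ompthreads="$3"
    echo $(( chunks * (ncpus / ompthreads) ))
}

# The multi-node example above: select=4, ncpus=24, ompthreads=2.
agent_count 4 24 2   # prints 48
```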
## Disadvantages
1. Unlike job arrays, you need to ensure that you request sufficient memory for the _total_ number of tasks that will run concurrently on the same node.
2. If you have a large number of parameter combinations to run through, you might have to break the sweeps up into more manageable chunks.
3. It kinda forces you to use TMPDIR ;-)
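
For the first point, the memory to request per node is the per-task memory multiplied by the number of tasks sharing that node. The sketch below assumes every task needs the same amount of memory, in whole gigabytes; `mem_per_node_gb` is a hypothetical helper name, not an Embedded Nimrod command.

```bash
#!/bin/bash
# Hypothetical helper: memory (in GB) to request per node so that all
# concurrent tasks on that node fit.
mem_per_node_gb() {
    local ncpus="$1" ompthreads="$2" task_mem_gb="$3"
    echo $(( (ncpus / ompthreads) * task_mem_gb ))
}

# 12 cores with 2 cores per task gives 6 tasks per node; at 4GB each:
echo "mem=$(mem_per_node_gb 12 2 4)GB"   # prints mem=24GB
```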
## Parameter Types and Declarations
The following parameter types and declarations are supported:
* ranges of integers
`parameter i integer range from 1 to 11 step 2`
* lists of integers
`parameter j integer select anyof 1 3 5 7 9`
* ranges and lists of float parameter values
`parameter x float range from 1.5 to 1.8 step 0.1`
`parameter y float range from 5.0E2 to 5.5E2 step 0.1E2`
`parameter z float select anyof 1.51 1.55 1.62 1.77 1.80`
* single or lists of text values
`parameter t text "someText"`
`parameter u text select anyof "AAA" "BBB" "CCC"`
* lists of filenames generated by globs (wildcards)
`parameter f files select anyof "NimrodDemo.e*"`
* random selections of values in a range
`parameter k integer random from 1 to 101 points 5`
* ~~even an empty parameter that gets updated later using a JSON file inputted to the Nimrod~~
- coming soon(tm)
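
To gauge how large a sweep will be, the number of values in an integer range can be worked out in the shell and the counts for each parameter multiplied together. This is a rough sketch that assumes ranges include both endpoints and that every combination of parameter values becomes one job; `range_count` is a hypothetical helper, not a Nimrod command.

```bash
#!/bin/bash
# Hypothetical helper: number of values in "range from FROM to TO step STEP",
# assuming both endpoints are included (an assumption, not confirmed here).
range_count() {
    local from="$1" to="$2" step="$3"
    echo $(( (to - from) / step + 1 ))
}

i_values=$(range_count 1 11 2)   # parameter i integer range from 1 to 11 step 2
j_values=5                       # parameter j integer select anyof 1 3 5 7 9
echo "i has ${i_values} values; i x j gives $(( i_values * j_values )) combinations"
```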
## How to Use Embedded Nimrod
Using Embedded Nimrod is as simple as:
* writing a PBS Pro job script that requests the computational resources necessary to host the concurrent tasks
* writing a Nimrod plan file for the "experiment"
* submitting your PBS job script
* waiting for the PBS job to start and for the _magic_ to happen
While the job is running, the batch system will deploy and run a personal Nimrod/G infrastructure to manage the combinations of parameters contained in the plan file.
### A Sample Job Script
```bash
#!/bin/bash
#
#PBS -N NimrodDemo
#PBS -A UQ-RCC
#PBS -l walltime=168:00:00
#PBS -l select=4:ncpus=12:ompthreads=2:mem=24GB
# In this demo I will run 24 tasks in parallel (with 2 cpus & 4GB per task) across 4 chunks, each half of an HPC node.
# The select statement is requesting ncpus=12 (i.e. 6x2cpus) and mem=24GB (i.e. 6x4GB) per chunk.
# The ompthreads value governs how many cpus each task is allocated, and ncpus / ompthreads governs how many parallel instances can fit.
# The walltime is the total for the entire job (i.e. a multiple of the single task walltime)
# Locate the Nimrod plan file
PLANFILE="NIMROD/plan"

# Load the embedded Nimrod module
module load embedded-nimrod

# Compile the plan file to check it is OK
if ! nimrod compile --no-out "$PLANFILE"; then
    echo "There was a problem with the compilation of your plan file. Better go check it!" >&2
    exit 1
fi

# Bail out here if you just want to check the plan file is good
#exit

# Now nimrun the plan file
nimrun "$PLANFILE"
```
### A Sample Plan File
```
// Sweep across 3 numerical parameters and write out a fake CSV line for each combination ... not very realistic!
parameter i integer range from 1 to 3 step 1
parameter x float range from 5.0E2 to 5.2E2 step 0.2E2
parameter y float range from 1.5 to 1.7 step 0.2
parameter t text "NimrodDemo"
task main
// 0. Nimrod Setup
onerror ignore // or fail ... is an option!
redirect stdout to out // like OU in PBS
redirect stderr to err // like ER in PBS
// Now down to business
exec echo "$i,$x,$y"
// Do not forget to copy back the stdout and stderr files for each data set.
copy node:out root:OE/$t.out.$jobindex.txt
copy node:err root:OE/$t.err.$jobindex.txt
endtask
```
### Plan file compilation
When you compile the plan file, it is parsed and checked for syntax errors.
Do not proceed with running the experiment if the plan file fails to parse correctly.
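
The same check can be wrapped in a small function so that a missing or broken plan file stops the job script early. This sketch uses only the `nimrod compile --no-out` command from the sample job script; `check_plan` is a hypothetical name, and it assumes the `embedded-nimrod` module has already been loaded.

```bash
#!/bin/bash
# Hypothetical helper: returns 0 if the plan file exists and compiles,
# 2 if it cannot be read, 1 if compilation fails.
check_plan() {
    local planfile="$1"
    if [ ! -r "$planfile" ]; then
        echo "Plan file not found or unreadable: $planfile" >&2
        return 2
    fi
    if ! nimrod compile --no-out "$planfile"; then
        echo "Plan file failed to compile: $planfile" >&2
        return 1
    fi
}

# Usage in a job script (after `module load embedded-nimrod`):
#   check_plan "NIMROD/plan" || exit 1
#   nimrun "NIMROD/plan"
```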
### Plan file gotchas
The plan file parser is Java code.
When it is not happy with your plan file it will spit out lots of dire-looking messages.
Do not despair!
Here are a couple of examples of how to avoid screens of Java error messages:
#### Parameter values in shexec commands
Basically, you need to wrap any reference to a Nimrod parameter in escaped quotes (`\"`) when you are within a quoted `shexec` string:
```
parameter t text "Wombat"
shexec "R CMD BATCH --no-save --no-restore '--args i=\"$i\"' MyRCode/I_Dig_\"$t\"_Burrows.R Rout"
```
#### Be careful near underscores
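You can use a plan file parameter in a file path that includes the `_` character as long as you are careful to delimit the variable name with braces (`${t}` rather than `$t`).
The problem arises because `_` is a valid character in a shell variable name, so `$t_Spines` would be read as a reference to a variable called `t_Spines`.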
```
parameter i integer range from 1 to 5 step 1
parameter t text "Echidna"
copy node:output.csv root:${t}_Spines_1_${i}.csv
```
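The same effect can be reproduced in plain bash, which may help when debugging a path that comes out shorter than expected. This only illustrates shell expansion; `t` and `i` are set by hand here to stand in for the plan file parameters.

```bash
#!/bin/bash
t="Echidna"
i=3

# Without braces the shell looks for a variable named t_Spines_1_, which is
# unset, so that part of the name expands to nothing.
unbraced="$t_Spines_1_$i.csv"

# With braces the variable names are unambiguous.
braced="${t}_Spines_1_${i}.csv"

echo "unbraced: $unbraced"
echo "braced:   $braced"
```

Run as written, the unbraced form yields `3.csv` while the braced form yields `Echidna_Spines_1_3.csv`.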
#### Beware of System Load
This is not really a parsing gotcha, but it is worth a mention nonetheless.
We have seen situations where the PBS Pro batch system intervenes when it detects excessive load
(your job is supposed to stay within the resources you requested).
The batch system allows a little bit of momentary leeway but will quickly kill off a job that appears to be running out of control.
If your Nimrod experiment involves strenuous computations, and/or uses Java or MATLAB compiler runtime environments, then you may need to decrease the density of tasks running by increasing the `ompthreads` setting.
So, for example, if your task could run on a single CPU, you might like to run your experiment with `ompthreads=2` to halve the number of tasks performed concurrently.
We are working on a better mechanism to handle this situation.
#### Environment variables and modules don't inherit across nodes
When using multi-node jobs, environment variables (and thus modules) aren't carried over to the other nodes. If a custom environment or modules are required, it is recommended that you write a wrapper shell script and `exec` that via the plan file.
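
A wrapper along these lines is one way to do that. It is a sketch only: the function name `run_with_env`, the variable `MY_TOOL_HOME` and the module name `R` are placeholders, and it assumes the `module` command is available in the task's shell on each node (it is skipped otherwise).

```bash
#!/bin/bash
# Hypothetical wrapper script: set up the environment, then run the command
# given as arguments. MY_TOOL_HOME and the module name are placeholders.
run_with_env() {
    # Load modules only if the module command exists in this shell.
    if type module >/dev/null 2>&1; then
        module load R
    fi

    # Environment variables the task needs; keep any value already set.
    export MY_TOOL_HOME="${MY_TOOL_HOME:-$PWD/CODE}"

    # Run the real command with its arguments.
    "$@"
}

run_with_env "$@"
```

The plan file would then copy the wrapper to the node alongside the code and `exec` it with the real command as its arguments.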
## Usage Scenarios
In the following scenarios you would need to create a PBS job script based on the sample provided above.
### Running a program against a set of input filenames
Say you need to run the same program against a list of filenames that are XML input files.
#### Plan file
```
// Sweep over a set of XML filenames in the INPUTS directory
parameter f files select anyof "INPUTS/*.xml"
parameter t text "ThisRun"
task main
// 0. Nimrod Setup
onerror fail //or stop right away
redirect stdout to out
redirect stderr to err
// Now down to business
// 1. Set up the work area on the Node
exec mkdir INPUTS
exec mkdir OUTPUTS
exec mkdir CODE
// 2. Copy over the XML file (note that $f includes the "INPUTS/" part of the name)
copy root:$f node:$f
// 3. Copy over the Code as a tar file and unpack it
copy root:CODE.tar node:CODE.tar
exec tar xf CODE.tar
exec rm -f CODE.tar
// 4. Take a selfie of what we are working with
exec ls -salR
// 5. Run the program with the current XML input file
exec CODE/myprogram.exe -i $f -o OUTPUTS/out.dat
// 6. Have a look at the OUTPUTS folder and copy the out.dat file back
exec ls -salR OUTPUTS
copy node:OUTPUTS/out.dat root:OUTPUTS/$t.$jobindex.dat
// Do not forget to copy back the stdout and stderr files for each data set.
copy node:out root:OE/$t.out.$jobindex.txt
copy node:err root:OE/$t.err.$jobindex.txt
endtask
```
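Before submitting, it can be worth checking which files the glob picks up, since the program is run once per matching filename. The sketch below counts matches using ordinary bash globbing; it assumes Nimrod expands `INPUTS/*.xml` the same way the shell does, which is worth confirming against your own plan file. `count_matches` is a hypothetical helper.

```bash
#!/bin/bash
# Hypothetical helper: count the files matching a glob pattern, relative to
# the current directory.
count_matches() {
    local pattern="$1"
    # The subshell keeps nullglob from leaking into the calling shell;
    # nullglob makes an unmatched pattern expand to nothing rather than itself.
    ( shopt -s nullglob; set -- $pattern; echo "$#" )
}

# Run from the directory that holds the INPUTS folder.
count_matches "INPUTS/*.xml"
```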
### Running an R program for different values of a single parameter
We see this a lot: R scripts that need to run for different values of an input parameter.
#### The R Code
```R
#Simple script to test our ability to pass in a specific value for the index
#https://www.r-bloggers.com/passing-arguments-to-an-r-script-from-command-lines/
#https://www.r-bloggers.com/including-arguments-in-r-cmd-batch-mode/
#
#call using
# R CMD BATCH --no-save --no-restore '--args i=4' test.R test.out
## We need to capture all the arguments provided at the command line
args <- commandArgs(TRUE)
## args is a character vector of "name=value" strings that needs to be decomposed.
if (length(args) == 0) {
    # If no arguments were provided ... probably should quit
    stop("No arguments supplied.")
} else {
    # Arguments were provided, so evaluate each "name=value" string as an assignment
    for (ii in seq_along(args)) {
        eval(parse(text = args[[ii]]))
    }
}
#i=3
x=c(1,2,3,4,5)
y=x*x
#Show us the power of the passed-in parameter!
i
x[i]
y[i]
```
#### The Plan File
```
parameter i integer range from 1 to 5 step 1
parameter t text "TestingR"
task main
// 0. Nimrod Setup
onerror ignore // or fail ... is an option!
redirect stdout to out // like OU in PBS
redirect stderr to err // like ER in PBS
// Now down to business
copy root:test.R node:Rscript
//must escape quote the values for plan file parameters
shexec "R CMD BATCH --no-save --no-restore '--args i=\"${i}\"' Rscript Rout"
// Do not forget to copy back the stdout and stderr files for each data set.
copy node:out root:OE/$t.out.$i.txt
copy node:err root:OE/$t.err.$i.txt
copy node:Rout root:OE/Rout.$i
endtask
```
## License
This project is licensed under the [Apache License, Version 2.0](https://opensource.org/licenses/Apache-2.0):
Copyright © 2019 [The University of Queensland](http://uq.edu.au/)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
* * *
This project uses the [`JSON for Modern C++`](https://github.com/nlohmann/json) library by Niels Lohmann which is licensed under the [MIT License](http://opensource.org/licenses/MIT) (see above). Copyright © 2013-2018 [Niels Lohmann](http://nlohmann.me/)
* * *
This project uses the [`parg`](https://github.com/jibsen/parg) library by Jørgen Ibsen which is licensed under [CC0](https://creativecommons.org/publicdomain/zero/1.0/).