-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathREADME
149 lines (113 loc) · 9.39 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
This project is a simple demonstration of using different threading technologies as the basis of a HEP multi-threaded framemework
--------------------------------------
COMPILING
1) You'll need to have available a version of boost which is used to parse the configuration files
http://www.boost.org
2) To use the Intel Thread Building Block (TBB) framework, you'll need a version of tbb
https://www.threadingbuildingblocks.org
3) create a subdirectory named build to hold the build products
4) in the build directory, run the command
cmake -DCMAKE_BUILD_TYPE=Release -DBOOST_ROOT=<boost install directory> -DTBB_ROOT_DIR=<tbb install directory> ..
5) Then just compile
make
You will wind up with four executables
BusyWaitCalibration/BusyWaitCalibrate:
Used to calibrate how long it takes to run a simple integration function. This is used for tests where you want a module to actually do work (as apposed to just sleep).
TBBProcessingDemo/TBBDemo:
Processes configuration files to run trivial modules via TBB in a way analogous to a full HEP framework. Useful for testing scalability of the system.
OpenMPProcessingDemo/OpenMPDemo:
Processes configuration files to run trivial modules via OpenMP in a way analogous to a full HEP framework. Useful for testing scalability of the system.
SingleThreadedDemo:
Similar to the other *Demos exception only uses one thread and is implemented in the most trivial (and fastest) way possible. Used for timing comparisons with the same configurations passed to *Demos.
NOTE: The DispatchProcessingDemo which made use of Apple's Grand Central Dispatch system is no longer supported. It is possible to build it on Linux using
https://www.heily.com/trac/libdispatch
which need libkqueue
http://mark.heily.com/book/export/html/52
--------------------------------------
TESTING
To see that the program works at all, you can run DispatchDemo using the simple test configuration
TBBProcessingDemo/TBBDemo ../TBBProcessingDemo/test_config.json
This should generate an output like
demo::PassFilter first 0x5ca9a0
demo::PassFilter second 0x5c6910
demo::PassFilter third 0x5c5fe0
demo::ThreadSaferGetterFilter fourth 0x5c6190
demo::BusyWaitProducer alpha 0x5c6470
demo::BusyWaitProducer beta 0x5c66e0
# threads:24 # simultaneous events:4 total # events:3000 cpu time:182.333 real time:26.2779 events/sec:114.164
--------------------------------------
UNIT TESTING
To build the unit tests, you need to tell cmake how to find the cppunit framework. The options to pass to cmake are
-DCPPUNIT_INCLUDE_DIR=<include directory> -DCPPUNIT_LIBRARY=<path to library>/libcppunit.so
with the last one requiring the full name of the library to be used. Then to run the tests
TBBProcessinDemo/TBBRunUnitTests
--------------------------------------
CALIBRATING
In order to simulate the effect of actually doing HEP type work, the code does a simple integration to eat up 100% of a CPU. However, how long a given loop takes needs to be calibrated for the machine being used. This is done by running 'BusyWaitCalibrate'
Running this program gives an output similar to the following
BusyWaitCalibration/BusyWaitCalibrate
43.3487
100000 0.004535 2.20507e+07
195.367
1000000 0.042997 2.32574e+07
199.928
10000000 0.441726 2.26385e+07
6.4598
100000000 5.10827 1.95761e+07
The grouping of numbers is the results from running 4 different iteration values, each differing by a factor of 10. The lone number on a line can be ignored. The the numbers on one line are
- the number of iterations used
- the length of time it took to do that many iterations
- the ratio of (# of iterations)/(length of time)
From experimentation, I've found that the first 3 iterations tend to be converging to a reasonable common number but the longest iteration tends to be slower, probably because the operating system is taking some time from the job in order to do some other work. Therefore I tend to take the largest value of the ration from the first three values.
The ratio is then used to set the value in the various configuration files. The value is
process.options.busyWaitScaleFactor
The easy way to change the value is to just search for 'busyWaitScaleFactor' in the configuration file since it only appears once.
--------------------------------------
CONFIGURATIONS
The configuration files are JSON format files which contain the information such as
- What is the busy wait scale factor to use (process.options.busyWaitScaleFactor)
- How many events should the job 'process' (process.source.iterations)
- How many events should be processed simultaneously (process.options.nSimultaneousEvents)
- What 'modules' to run and how the various 'modules' are related
The demo framework uses standard CMS concepts:
-Tasks are encapsulated in 'modules' where 'modules' only communicate by adding object to and reading objects from the Event
-Modules which only read data from the Event and decide if we should continue processing an Event are called Filters
-Filters are placed in sequence into a Path
-The job allows multiple Paths, where each Path is considered independent and can be run in parallel
-Modules which add data to an event are called Producers. Producers can depend on data in the Event. A Producer will be run the first time a data item is requested from a given Event.
-Data in the event is identified by the 'label' which was assigned to the Producer which makes the data.
The list of Filters is defined in the 'process.filters' value.
The list of Producers is defined in the 'process.producers' value.
The list of Paths is defined in the 'process.paths' value.
Filters and Producers are configured with the mandatory parameters
'@label': the label used to uniquely defined the module (and the data the module makes)
'@type': what C++ class to instantiate when creating the module
Each C++ class type can then have its own parameters, although most accept
'eventTimes': how long the module runs for a given event. Specifying time for each event allows one to measure effects of correlations of times between modules. The units are seconds.
'threadType': used by TBBDemo and OpenMPDemo to decide how thread safe to treat the module. The allowed values are
ThreadSafeBetweenInstances: the same instance of a module can be passed multiple events simultaneously
ThreadSafeBetweenModules: the same instance of a module can only handle one event at a time but it does not interact with any other module
ThreadUnsafe: the module can not be run at the same time as other 'ThreadUnsafe' modules
'toGet': A list of the data to acquire from the event. The list is pair of 'label' (which is the label of the Producer) and a 'product' which is not used now.
--------------------------------------
EXPERIMENTS
The experimental configurations are found in cms-data:
reco_minbias_busywait.config:
This simulates doing event Reconstruction on a common and quick to process event type where each module is using real CPU time. The job has one Filter which is simulating the behavior of the output module which would write a ROOT file. This Filter is set to be 'ThreadUnsafe' to simulate present ROOT behavior. The source is set to be ThreadSafeBetweenModules so only one request is processed at a time in the module. All other modules are configured with ThreadSafeBetweenInstances so are assumed to be perfectly thread safe.
reco_minbias_sleeping.config
Similar to reco_minbias_busywait.config but instead of having modules use real CPU time, they 'sleep' instead. The libdispatch is smart enough to see when a thread is sleeping and then assign more work to the system. This configuration allows one to push the nSimultaneousEvents to much larger levels than the actual number of CPUs on the machine. I was able to run 700 nSimultaneousEvents on a 2 core machine.
reco_minbias_sleeping_1000.config
The same as reco_minbias_sleeping.config, however instead of each module having specified timing for 100 events, this has timing for 1000 events. This is only here for historic reasons.
reco_minbias_sleeping_betweenmodules.config
The same as reco_minbias_sleeping.config except the Producers have been set to only be able to process one event at a time (ThreadSafeBetweenModules). This allows scaling test for what happens if modules are not fully thread safe.
reco_minbias_sleeping_perfectIO.config
The same as reco_minbias_sleeping.config except both the source and output (i.e. the Filter) are set to completely thread safe (ThreadSafeBetweenInstances). This tests the scaling limits of the framework itself.
reco_multijet_sleeping.config
The same as reco_minbias_sleeping.config except the events take much longer to process.
reco_multijet_sleeping_1000.config
The same as reco_multijet_sleeping.config exception each module specifies 1000 event times rather than just 100. This is only here for historic reasons.
So the suggested way to proceed with a experimenting is
1) Run reco_minbias_busywait.config while changing both 'nSimultaneousEvents' and 'iterations' (trying to keep the total job time about constant) until you see the events/sec results plateau
2) Run reco_minbias_sleeping.config while changing both 'nSimultaneousEvents' and 'iterations' (trying to keep the total job time about constant) until you see the events/sec results plateau. You should be able to go to much higher nSimultaneousEvents. Also, results from 1) with the same 'nSimultaneousEvents' should be the same as 2)
3) Same as 2) but using reco_minbias_sleeping_betweenmodules.config
4) Same as 2) but using reco_minbias_sleeping_perfectIO.config