The tracing procedure
Tracing involves five phases: preprocessing, clock sampling, execution,
a second clock sampling, and post mortem trace collection.
1) Preprocessing, compiling and linking
A program to be traced must be built with the tracing-specific Makefile and
its compiling and linking commands. The preprocessing performed during the
compilation step replaces each call to an Athapascan function with a call to
an instrumented function.
For instance, each call to a0Send is replaced by a call to _a0_perf_Send
(the instrumented function), which in turn calls _a0Send (the non-instrumented function).
Object files are then linked with the instrumented library containing
both normal and instrumented functions.
2) Clock sampling before execution (optional)
Parallel systems that provide a global clock on all nodes do not require the
sampling step.
Without a global clock, the local clocks of the nodes are not coherent.
In Athapascan-tr, a global clock is implemented in software by correcting
the local clocks of the trace files post mortem. This correction step uses
an estimate of the clock drifts computed during a sampling phase prior to
the program execution:
prompt% a0clock -a0procs=nb_nodes -of=clock_file1 -nbp=nb_points -ws=win_size -delay=time
clock_file1 is the name of the clock drift file to create.
nb_points and time define the number of samples and the delay (in microseconds)
between two sample points.
The smoothing window holds the last win_size samples (win_size is typically
2 to 5). To remove erratic samples, each sample is replaced by the median
value of the smoothing window.
Use a0decode -clock_file=clock_file1 in step 5.
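The median smoothing described above can be illustrated with a small shell sketch. It is not the a0clock implementation. The function name median_smooth is hypothetical, and two details are assumptions because the text does not specify them: the first samples are smoothed over a shorter window, and an even-sized window yields its lower median.

```shell
# median_smooth WIN_SIZE SAMPLE...
# Prints one smoothed value per input sample: the median of the last
# WIN_SIZE samples seen so far (fewer at the start of the series).
median_smooth() {
    win_size=$1
    shift
    [ "$#" -gt 0 ] || return 0
    printf '%s\n' "$@" | awk -v w="$win_size" '
    {
        buf[NR] = $1 + 0                 # force numeric comparison
        n = (NR < w) ? NR : w            # current window length
        for (k = 0; k < n; k++) tmp[k] = buf[NR - k]
        # insertion sort of the window copy
        for (a = 1; a < n; a++) {
            v = tmp[a]
            for (b = a - 1; b >= 0 && tmp[b] > v; b--) tmp[b + 1] = tmp[b]
            tmp[b + 1] = v
        }
        print tmp[int((n - 1) / 2)]      # lower median (assumption)
    }'
}

# The erratic sample 50 is filtered out with a window of 3:
median_smooth 3 10 11 50 12 13   # prints 10 10 11 12 13, one per line
```

A larger window rejects longer bursts of erratic samples but reacts more slowly to genuine changes, which may be why small values of win_size are suggested.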
3) Execution
Traced programs are executed similarly to the normal ones, with
some extra parameters:
prompt% a0run my_prog -a0procs=nb_nodes -a0trace_file=tfile -a0nbuf=nb_buf -a0sbuf=bufsize
nb_buf and bufsize are the number and size (in bytes) of the trace buffers,
and tfile is the name of the trace files to build (/TMP/perf_trace by
default).
Put your trace file on a local file system (such as /tmp)
rather than a remote one, to reduce network file system (NFS) daemon activity
during your program execution. Check that you have permission to write the
trace file, especially when using the default file name.
Use at least as many buffers as the maximum number of concurrent
threads (add about a dozen internal daemons to your own threads).
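As a rough illustration of the buffer-count rule above, the following sketch computes a lower bound for -a0nbuf. The function name suggest_nbuf is hypothetical, and the default allowance of 12 is only a reading of "a dozen" in the text, not a documented constant.

```shell
# suggest_nbuf MAX_THREADS [DAEMON_ALLOWANCE]
# Prints a lower bound for -a0nbuf: the maximum number of concurrent
# user threads plus an allowance for internal daemons.
suggest_nbuf() {
    max_threads=$1
    daemon_allowance=${2:-12}   # assumption: "a dozen" internal daemons
    echo $((max_threads + daemon_allowance))
}

suggest_nbuf 20   # prints 32
```

The result could then be passed as -a0nbuf=$(suggest_nbuf 20) on the a0run command line.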
4) Clock sampling after execution (optional)
A second clock sampling step after program execution produces more accurate
results. It takes the clock file produced in step 2 as input:
prompt% a0clock -a0procs=nb_nodes -of=clock_file2 -nbp=nb_points -ws=win_size -delay=time -if=clock_file1
Use a0decode -clock_file=clock_file2 in step 5.
5) Post mortem trace processing
Program execution creates a set of trace files tfile.x (where tfile
is the trace file name passed to a0run and x is the node number).
Collect all trace files:
prompt% rcp my_node_0:tfile.0 .
prompt% rcp my_node_1:tfile.1 .
...
prompt% rcp my_node_n-1:tfile.n-1 .
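With many nodes, the copy commands above can be generated by a loop. This is a minimal sketch that assumes the nodes really are named my_node_0 to my_node_n-1, which is only the placeholder naming used in this section. The function collect_traces and the COPY_CMD variable are hypothetical helpers, not part of Athapascan.

```shell
# collect_traces NB_NODES TFILE
# Copies TFILE.0 .. TFILE.(NB_NODES-1) from my_node_0 .. my_node_(NB_NODES-1)
# into the current directory. COPY_CMD defaults to rcp.
collect_traces() {
    nb_nodes=$1
    tfile=$2
    i=0
    while [ "$i" -lt "$nb_nodes" ]; do
        # stop at the first failed copy so a missing trace is noticed
        ${COPY_CMD:-rcp} "my_node_${i}:${tfile}.${i}" . || return 1
        i=$((i + 1))
    done
}
```

Setting COPY_CMD=echo before calling collect_traces 3 tfile prints the three copy commands' arguments instead of running them, which is a convenient way to check the host and file names first.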
Then decode, merge and sort the node trace files into a single ASCII trace file:
prompt% a0decode tfile.* > my_ascii_file.trace
If clock sampling was performed, add the -clock_file option given in step 2
or step 4.
Use my_ascii_file.trace as the input file for Paje.