1) History
While the current Athapascan-0 layer is built upon the MPI communication
kernel, early Athapascan designs were based on PVM. A software tracer
named TAPE/PVM was designed to monitor PVM applications.
TAPE/PVM generates event traces of PVM applications for post-mortem
performance analysis.
The design of TAPE/PVM focused on two points:
- precise, causally coherent event dating
- minimal perturbation of analyzed applications
Designed with the same goals, Athapascan-tr is a new software tracer for
Athapascan (built on top of MPI). It takes into account new programming model
extensions such as:
- non-blocking communications
- multithreading of computing nodes
Athapascan-tr retains the causally coherent dating scheme developed in
TAPE/PVM but performs preprocessing, trace encoding and buffer management
in a different way.
2) Description
In order for an Athapascan application to be traced, it has to be recompiled
using specific include files and an instrumented Athapascan-0 library.
Events are recorded in one binary trace file per Athapascan node.
Tools are provided to sort, merge and translate them into a single trace file
in the ASCII input format of the Paje visualization tool.
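
The sort and merge steps described above can be illustrated with a minimal
sketch. It is not the code of the provided tools: the Event fields, the
function names and the text layout are assumptions made for illustration, and
the output is not the actual Paje input format. Each per-node event list is
sorted by date, then all lists are merged into a single stream ordered by date.

```python
import heapq
from typing import Iterable, Iterator, List, NamedTuple


class Event(NamedTuple):
    date: float  # hypothetical field: event date on a common time scale
    node: int    # hypothetical field: node that recorded the event
    kind: str    # hypothetical field: event type label


def merge_traces(per_node: Iterable[List[Event]]) -> Iterator[Event]:
    """Sort each node's events by date, then merge them into one stream."""
    sorted_streams = [sorted(events, key=lambda e: e.date) for events in per_node]
    return heapq.merge(*sorted_streams, key=lambda e: e.date)


def to_text(events: Iterable[Event]) -> str:
    """Render one event per line (illustrative layout, not the Paje format)."""
    return "\n".join(f"{e.date:.6f} {e.node} {e.kind}" for e in events)
```

Merging already sorted streams keeps only one pending event per node in
memory at a time, which matters when per-node traces are large.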
Post-mortem trace analysis is performed by a simulator. This simulator takes
a trace file (a set of events) as input and reconstructs the successive global
states of the system on which the traces were generated, for instance by
matching send and receive communication events. The simulator is not
provided as a separate tool, but is included in the Paje visualization tool.
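
As an illustration of the matching of send and receive events mentioned above,
here is a minimal sketch. It does not reproduce the simulator included in
Paje: the CommEvent fields and the match_messages function are hypothetical,
and the sketch assumes that events are sorted by date and that messages
sharing the same source, destination and tag are received in the order in
which they were sent.

```python
from collections import defaultdict, deque
from typing import NamedTuple


class CommEvent(NamedTuple):
    date: float
    kind: str  # "send" or "recv"
    src: int   # sending node
    dst: int   # receiving node
    tag: int   # message tag


def match_messages(events):
    """Pair each receive with the oldest unmatched send on the same channel.

    Returns the list of (send, recv) pairs and the list of sends that were
    never received. A receive with no pending send is skipped.
    """
    pending = defaultdict(deque)  # (src, dst, tag) -> unmatched sends, oldest first
    pairs = []
    for ev in events:
        channel = (ev.src, ev.dst, ev.tag)
        if ev.kind == "send":
            pending[channel].append(ev)
        elif ev.kind == "recv" and pending[channel]:
            pairs.append((pending[channel].popleft(), ev))
    unmatched = [send for queue in pending.values() for send in queue]
    return pairs, unmatched
```

Each matched pair gives the endpoints of one message, which is the kind of
information needed to reconstruct a global state from per-node events.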
3) Intended usage
Athapascan-tr generates execution traces for post-mortem performance analysis
of error-free programs. Athapascan-tr is used to detect performance-related
problems such as load-balancing and scheduling problems.
In order to minimize perturbation of the traced applications, flushing trace
buffers to trace files is delayed until either a buffer overflows or a
thread or the application terminates.
Athapascan-tr will therefore be of little use in detecting program errors
that prevent a0Abort or a0Terminate from executing: events
generated before the error would be recorded in memory, but never written to
the trace file.
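
The delayed flushing policy and its consequence can be illustrated with a
minimal sketch. It is not the Athapascan-tr implementation: the BufferedTracer
class, its methods and the capacity expressed as a number of records are
assumptions made for illustration. Records stay in memory until the buffer is
full or close() is called, so anything still buffered when the program dies
without calling close() never reaches the sink.

```python
import io


class BufferedTracer:
    """Keep trace records in memory and write them only when necessary."""

    def __init__(self, sink, capacity):
        self.sink = sink          # any object with a write(bytes) method
        self.capacity = capacity  # maximum number of buffered records
        self.buffer = []

    def record(self, data: bytes):
        # Flush only when the buffer is full, to limit I/O while tracing.
        if len(self.buffer) >= self.capacity:
            self.flush()
        self.buffer.append(data)

    def flush(self):
        for data in self.buffer:
            self.sink.write(data)
        self.buffer.clear()

    def close(self):
        # Stands in for normal termination: write whatever is still buffered.
        self.flush()


sink = io.BytesIO()
tracer = BufferedTracer(sink, capacity=2)
tracer.record(b"a")
tracer.record(b"b")
written_before_overflow = sink.getvalue()  # nothing written yet
tracer.record(b"c")                        # overflow: "a" and "b" are flushed
written_after_overflow = sink.getvalue()
tracer.close()                             # normal termination flushes "c"
written_after_close = sink.getvalue()
```

If the program stopped right after recording "c" without reaching close(),
the sink would hold only "a" and "b".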