I am the main developer of rayon-logs, the tracing
extension for the rayon parallel computing library.
Rayon is a work-stealing library for shared-memory machines, developed in the Rust
programming language. Rayon provides a fantastic API for writing
complex parallel algorithms as clear code. Our research team (datamove) has strong experience in
fine-grained work-stealing engines, and I have been eager to transfer some of our techniques into rayon itself.
Before that, however, I needed to be able to profile parallel executions of different algorithms in a lightweight manner.
Since no real solution was available, I decided to develop one myself. While development is still ongoing,
it is now quite usable, and I decided to release it to the public before deciding whether to include it directly in rayon
or to keep it separate.
rayon-logs is built on top of rayon and allows easy profiling
of rayon applications. You basically need to import rayon_logs as an external crate in place of rayon.
The easiest way is to use extern crate rayon_logs as rayon;. All functions will have identical behavior
(though not all of rayon's functions are available yet), except install, which additionally returns a log.
After that you need to manually create a tracing ThreadPool. The install method on the pool will execute
its closure in the pool and return an execution trace together with its result.
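In short, the setup looks like this (a minimal sketch distilled from the full example below; the trivial closure is just a placeholder):

    extern crate rayon_logs as rayon;
    use rayon::ThreadPoolBuilder;

    fn main() {
        // build a tracing pool instead of a plain rayon pool
        let pool = ThreadPoolBuilder::new()
            .num_threads(2)
            .build()
            .expect("building pool failed");
        // install runs the closure in the pool and also returns the log
        let (result, log) = pool.install(|| 40 + 2);
        assert_eq!(result, 42);
        log.save_svg("trace.svg").expect("saving svg file failed");
    }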
Rayon allows you to describe parallelism as a set of tasks forming a fork-join graph. Rayon provides
a join function
which takes as arguments two closures and executes them, potentially in parallel. The execution of the calling function stops
and waits until both child closures have completed before continuing. By (virtually) dividing the calling closure into two parts
(pre-call and post-call), we create a set of tasks linked by dependencies, and these form our graph.
Here is an example that manually computes the max of an array:
the code:
    extern crate rayon_logs as rayon;
    use rayon::{join, ThreadPoolBuilder};

    fn manual_max(slice: &[u32]) -> u32 {
        if slice.len() < 200_000 {
            slice.iter().max().cloned().unwrap()
        } else {
            let middle = slice.len() / 2;
            let (left, right) = slice.split_at(middle);
            let (mleft, mright) = join(|| manual_max(left), || manual_max(right));
            std::cmp::max(mleft, mright)
        }
    }

    fn main() {
        let v: Vec<u32> = (0..2_000_000).collect();
        let pool = ThreadPoolBuilder::new()
            .num_threads(2)
            .build()
            .expect("building pool failed");
        let (max, log) = pool.install(|| manual_max(&v));
        assert_eq!(max, v.last().cloned().unwrap());
        log.save_svg("manual_max.svg").expect("saving svg file failed");
    }
the trace:
So the initial closure (the installed one) ends up being displayed as two tasks: the top-most and bottom-most ones.
You can see that only the sequential tasks really take some time, and they appear in color: red for the first thread
and green for the second one. At the very bottom, idle times are displayed, and you can see that they are in fact
very small. This is because task creation (which induces some overhead) stops once the limit size of 200,000 is reached.
You can hover over tasks to get additional information, and replay the animation by reloading the page.
If you look closely, you may also get the impression that blocks of the same size are not computed in similar times.
It is possible to put more information in the logs in order to get a better display.
In fact, we can replace the line slice.iter().max().cloned().unwrap() with
sequential_task(0, slice.len(), || slice.iter().max().cloned().unwrap()), thereby tagging the given closure
with a task category (I use the identifier 0) and a work size (here the length of the block, since the sequential
max algorithm is linear).
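The relevant part of manual_max then becomes the following (a sketch, assuming sequential_task is exported at the crate root alongside join):

    use rayon::{join, sequential_task};

    fn manual_max(slice: &[u32]) -> u32 {
        if slice.len() < 200_000 {
            // tag this sequential closure with category 0 and a work amount
            // equal to the block length (the sequential max is linear)
            sequential_task(0, slice.len(), || slice.iter().max().cloned().unwrap())
        } else {
            let middle = slice.len() / 2;
            let (left, right) = slice.split_at(middle);
            let (mleft, mright) = join(|| manual_max(left), || manual_max(right));
            std::cmp::max(mleft, mright)
        }
    }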
I then end up with the following logs, where bright colors indicate a fast computation speed and
dim colors a slow one:
You can see some gradual improvements during the execution, but keep in mind that executions can vary quite a lot
between runs, so one trace is not enough to draw conclusions. Additional information is needed to figure out
what is really going on, and we would surely need to use hardware performance counters.