I am the main developer of rayon-logs, the tracing extension for the rayon parallel computing library.

Rayon is a work-stealing library for shared-memory machines developed in the Rust programming language. Rayon provides a fantastic API for writing complex parallel algorithms as clear code. Our research team (datamove) has strong experience in fine-grained work-stealing engines and I've been eager to transfer some of our techniques into rayon itself. Before that, however, I needed to be able to profile parallel executions of different algorithms in a lightweight manner. Since no real solution was available, I decided to develop one myself. While development is still ongoing, it is now quite usable, and I decided to release it to the public before deciding whether to include it directly into rayon or to keep it separate.

rayon-logs is built on top of rayon and allows easy profiling of rayon applications. You basically need to import rayon_logs as an external crate in place of rayon; the easiest way is to use extern crate rayon_logs as rayon;. All provided functions (not everything from rayon is available yet) have behavior identical to their rayon counterparts, except install, which additionally returns a log. You then need to manually create a tracing ThreadPool. The pool's install method executes its closure in the pool and returns an execution trace together with the closure's result.

Rayon allows you to describe parallelism as a set of tasks forming a fork-join graph. Rayon provides a join function which takes two closures as arguments and executes them, potentially in parallel. The calling function stops and waits until both child closures have completed before continuing. By (virtually) dividing the calling closure into two parts (pre-call and post-call), we create a set of tasks linked by dependencies; these tasks form our graph, as the sketch below illustrates.
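To make this shape concrete, here is a minimal sketch (the closures are placeholders of my own) of the four linked tasks a single join call creates:

use rayon::join;

fn four_tasks() -> u32 {
    let x: u32 = 1; // pre-call part: everything before the join is one task
    // fork: each closure becomes a child task, potentially run in parallel
    let (a, b) = join(|| x + 1, || x + 2);
    // post-call part: a final task depending on both children
    a + b
}

fn main() {
    assert_eq!(four_tasks(), 5);
}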

Here is an example manually computing the max of an array:

  • the code:
extern crate rayon_logs as rayon;
use rayon::{join, ThreadPoolBuilder};

fn manual_max(slice: &[u32]) -> u32 {
    if slice.len() < 200_000 {
        // sequential cutoff: small blocks are scanned directly
        slice.iter().max().cloned().unwrap()
    } else {
        // split the slice in two halves and recurse, potentially in parallel
        let middle = slice.len() / 2;
        let (left, right) = slice.split_at(middle);
        let (mleft, mright) = join(|| manual_max(left), || manual_max(right));
        std::cmp::max(mleft, mright)
    }
}

fn main() {
    let v: Vec<u32> = (0..2_000_000).collect();

    let pool = ThreadPoolBuilder::new()
        .num_threads(2)
        .build()
        .expect("building pool failed");
    // install returns the closure's result together with the execution log
    let (max, log) = pool.install(|| manual_max(&v));
    assert_eq!(max, v.last().cloned().unwrap());

    log.save_svg("manual_max.svg")
        .expect("saving svg file failed");
}
  • the trace:
[interactive SVG trace: a fork-join graph of 46 tasks; the 16 sequential leaf tasks each last between 0.18 and 0.33 ms, while all fork/join bookkeeping tasks last under 0.013 ms]

So the initial closure (the installed one) ends up displayed as two tasks: the topmost and bottommost ones. You can see that only the sequential tasks really take time; they appear in color: red for the first thread and green for the second one. At the very bottom the idle times are displayed, and you can see they are in fact very small. This is because task creation (which induces some overhead) stops once blocks get smaller than the 200,000-element cutoff. You can hover over tasks to get additional information and replay the animation by reloading the page.

If you look closely you may also get the impression that blocks of the same size are not computed in similar times. It is possible to put more information in the logs in order to get a better display. In fact, we can replace the line slice.iter().max().cloned().unwrap() with sequential_task(0, slice.len(), || slice.iter().max().cloned().unwrap()) and in this way tag the given closure with a task category (I take the identifier 0) and a work size (here the length of the block, since the sequential max algorithm is linear), as in the sketch below.
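Concretely, the tagged version of manual_max then reads as follows (I am assuming here that sequential_task is exported at the root of rayon_logs, next to join):

extern crate rayon_logs as rayon;
use rayon::{join, sequential_task};

fn manual_max(slice: &[u32]) -> u32 {
    if slice.len() < 200_000 {
        // tag the sequential leaf: category 0, work = block length
        // (the sequential max is linear in the block length)
        sequential_task(0, slice.len(), || slice.iter().max().cloned().unwrap())
    } else {
        let middle = slice.len() / 2;
        let (left, right) = slice.split_at(middle);
        let (mleft, mright) = join(|| manual_max(left), || manual_max(right));
        std::cmp::max(mleft, mright)
    }
}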

I then end up with the following logs, where bright colors mean a fast computation speed and dim colors a slow one:

[interactive SVG trace: the 16 tagged leaf tasks all carry work: 125000 and type: 0; their durations range from 0.177 to 0.351 ms, with normalized speeds from 0.51 (slowest, dim) to 1 (fastest, bright)]

You can see some gradual improvement over the course of the execution, but keep in mind that executions can vary quite a lot between runs, so one trace is not enough to draw conclusions. Additional information is needed to figure out what is really going on; we would surely need to use hardware performance counters.

Let's continue nonetheless with more complex algorithms.