anomalies@exascale

anomalies@exascale is an INRIA associate team.

It brings together members of the MOAIS INRIA project-team and researchers from Argonne National Laboratory.

Current members of the team are:

Presentation

As we progress toward extreme-scale parallel computing, the community is gaining a better understanding of the factors that affect the performance and correctness of parallel executions. Until recently, only obvious performance degradations and process failures were taken into account, leading to new models and algorithms that improve time to solution through better scheduling and better fault-tolerance strategies. The degradations were so obvious that the detectors of factors affecting performance and correctness were essentially trusted: only easily detectable performance variations and fail-stop failures were considered.

Progress toward extreme-scale platforms shows a slightly different picture: even tiny performance degradations due to system noise or network congestion, or a single bit flip in petascale memory, can significantly impair the performance and correctness of parallel executions. Because these anomalies are very small, they are difficult to detect, and despite efforts to design accurate detectors, one must admit that these detectors are not perfect. More generally, the notions of perfect performance-degradation detectors and perfect failure/fault detectors are questionable even for larger phenomena: detectors are by nature external observers that focus on a specific factor, and they cannot account for the full complexity of an extreme-scale parallel execution.

In addition to anomaly detection, recent research is exploring anomaly prediction. The hope is to intervene and take proactive action, based on predictions, before a performance degradation or a failure happens. These predictors are by nature less accurate than anomaly detectors, because they forecast the behavior of the system from anomaly detectors and correlation methods, which fundamentally reduces forecast accuracy.

This general evolution of the context of parallel executions raises the challenging problems of performance optimization and resilience in the presence of imperfect anomaly detectors and predictors. The notion of unreliable fault detectors has already been investigated in the context of distributed systems. However, to our knowledge, two major problems have not yet been tackled:

  1. performance optimization of parallel executions in the presence of imperfect detectors/predictors, and

  2. resilience approaches that take into account imperfect local detectors of silent data corruption.

Ongoing Work

We currently consider networking anomalies caused by several jobs sharing networking resources.

Two orthogonal lines of research have already been identified:

  1. Enhanced allocation strategies:

  2. I/O pacing: Network use tends to be bursty, which leads to congestion. By pacing or delaying the I/O of some nodes, we may avoid congestion and thus improve performance. Instead of detecting such patterns reactively through packet drops, we could act proactively by using the gathered metrics.
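
To make the I/O pacing idea concrete, here is a minimal sketch in Python. It is not the team's method: it assumes a single shared link of known capacity and assumes that each node's burst size and ready time are known in advance. The names `Burst` and `pace_bursts` are hypothetical. The sketch simply delays bursts so that they cross the link one after another instead of all at once.

```python
from dataclasses import dataclass


@dataclass
class Burst:
    """One I/O burst from one node (hypothetical model, for illustration)."""
    node: str
    ready_at: float  # time (s) at which the node would like to start its I/O
    size: float      # amount of data to transfer (bytes)


def pace_bursts(bursts, link_capacity):
    """Delay bursts so that at most one uses the shared link at a time.

    Bursts are served in order of their ready time. Each burst starts when
    it is ready or when the link becomes free, whichever is later.
    Returns a dict mapping node -> (start_time, delay).
    Assumption: link_capacity is in bytes per second and is constant.
    """
    if link_capacity <= 0:
        raise ValueError("link_capacity must be positive")

    schedule = {}
    link_free_at = 0.0
    for burst in sorted(bursts, key=lambda b: b.ready_at):
        start = max(burst.ready_at, link_free_at)
        schedule[burst.node] = (start, start - burst.ready_at)
        # The link stays busy for the duration of this transfer.
        link_free_at = start + burst.size / link_capacity
    return schedule


if __name__ == "__main__":
    demo = [
        Burst("n0", ready_at=0.0, size=100.0),
        Burst("n1", ready_at=0.0, size=100.0),
        Burst("n2", ready_at=1.0, size=50.0),
    ]
    for node, (start, delay) in pace_bursts(demo, link_capacity=100.0).items():
        print(f"{node}: start={start:.1f}s delay={delay:.1f}s")
```

Full serialization is the simplest possible pacing policy. A realistic strategy would instead derive the delays from measured network metrics and allow several bursts to overlap as long as they stay below the congestion threshold.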

Publications

Mohamed-Slim Bouguerra, Derrick Kondo, Fernando Mendonca, Denis Trystram.
Fault-tolerant Scheduling on Parallel Systems with Non-memoryless Failure Distributions.
J. Parallel Distrib. Comput., 74(5):2411-2422, 2014.