“The Φ Accrual Failure Detector” by Hayashibara, Ishikawa, Defago, Yared, and Katayama describes a novel approach to fault tolerance that departs from traditional boolean failure detection and instead implements detection using a statistical probability that represents the confidence that a process has failed.
To understand the implications of this paper, we have to first examine the limitations that are inherent in a boolean model of failure detection. I will attempt to compare the boolean failure detection model with the Φ accrual failure model in light of these drawbacks to help frame the research presented in the paper.
The first drawback to the boolean model of failure detection is that the nodes in the system are limited to only two actions based on the return value of the failure detector. There can be no half-measures or safety checks performed, since the failure detector gives no gauge of failure.
Using an accrual failure detector, the detector returns a gauge of failure, which allows the nodes to implement safety checks or slowdowns for certain nodes. In the paper, the authors provide an example where a single master stops sending work to nodes whose confidence value indicates that they might be failing. This serves the purpose of not overloading the possibly failing node and limiting the damage caused if the node eventually fails.
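That graded response can be sketched as a simple dispatch policy. The function name and threshold values below are hypothetical examples for illustration, not values taken from the paper:

```python
def dispatch_policy(phi):
    """Map a continuous suspicion level to a graded action.
    Thresholds here are hypothetical; real values would be tuned per application."""
    if phi < 1.0:
        return "send_work"      # node looks healthy: dispatch normally
    elif phi < 3.0:
        return "reduce_load"    # suspicion rising: throttle rather than overload
    else:
        return "stop_sending"   # likely failed: stop dispatching work entirely
```

A boolean detector would force the master to choose between the first and last branch only; the middle branch is exactly the kind of half-measure the accrual model makes possible.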
A second drawback of the boolean model of failure detection is that the failure detector has to be closely tied to the underlying application, since tuning the detection requires some knowledge of the application's requirements. In a boolean model, the failure verdict is binary, with only two possible outputs: trust or suspect. Because the sensitivity tuning happens inside the detector, the detector cannot be a generally applied tool; it returns absolute verdicts.
Using an accrual failure detector, the failure detector can be generalized and separated from the underlying application. This is because the failure detector does not return an absolute value, but rather a scaled confidence predictor. This allows the decision on how to act to be tuned inside the application, using the output of the failure detector as an input. This benefits application designers in that they can use an out-of-the-box failure detector and do in-application tuning.
Given these benefits, the authors went about designing an algorithm for an accrual failure detector. They describe three functions that exist within all failure detection systems: Monitoring, Interpretation, and Action. In a boolean failure detector, the Monitoring and Interpretation are done by the failure detector, which produces a boolean value. In an accrual failure detector, only the Monitoring is done by the failure detector, which generates a continuous value; the Interpretation and Action are then implemented in the application.
The Φ value is created by using heartbeats and calculating the mean and variance of the heartbeat arrivals within a certain window. The authors then use these values to build a normal distribution of arrival times and map the arrival time of the current heartbeat into a probability model that returns a confidence value.
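A rough sketch of that calculation follows. The class name, the variance floor, and the use of the complementary error function to obtain the normal tail probability are my assumptions about a reasonable implementation, not the authors' exact code; the core idea (Φ as the negative log of the probability that a heartbeat would arrive this late) is from the paper:

```python
import math
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window_size=100):
        # Sliding window of heartbeat inter-arrival times.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Phi = -log10(P_later): suspicion that the monitored node has failed,
        given how long it has been since the last heartbeat."""
        if len(self.intervals) < 2:
            return 0.0  # not enough samples to fit a distribution yet
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-6)  # floor avoids division by zero
        # Tail probability of the fitted normal distribution, via erfc.
        z = (now - self.last_arrival - mean) / (std * math.sqrt(2.0))
        p_later = 0.5 * math.erfc(z)
        if p_later <= 0.0:
            return float("inf")  # tail underflowed: near-certain failure
        return -math.log10(p_later)
```

With regular heartbeats, Φ stays near zero just after an arrival and climbs rapidly as the silence stretches well past the observed mean interval.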
Given the structure of the two types of failure detectors, one could convert an accrual failure detector into a boolean failure detector by using a classification function. The authors demonstrate this in their paper and then go on to prove that this ensures strong completeness since the thresholds are finite. In addition, they can prove eventual strong accuracy by proving that there is a time after which the threshold is never crossed.
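The classification function itself is trivial: a node is suspected exactly when Φ crosses a finite threshold. The threshold value below is an arbitrary example, not one prescribed by the paper:

```python
def to_boolean(phi_value, threshold=8.0):
    """Collapse an accrual detector's continuous output to trust/suspect.
    threshold=8.0 is an assumed example value."""
    return phi_value >= threshold  # True = suspect, False = trust
```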
The change from a binary value to a continuous value opens up some interesting possibilities for how we can improve or expand on the interpretation part of failure detection. As I was reading the paper, it struck me that the Φ value reflects a probability based on a finite number of samples per node for a particular failure detector. Introducing multiple failure detectors would give the application the ability to apply another layer of statistical probability to correlate failure.
For example, a simple Bayesian classifier that learns from the results of multiple failure detectors would be a good first step. The application would maintain a table recording the past probabilities returned by the various detectors and would implement a Bayesian classifier based on these values.
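A minimal sketch of such a combination, assuming the detectors are independent (which real detectors generally are not) and treating 1 − 10^(−Φ) as each detector's failure belief; the function names and the flat prior are my own illustrative choices:

```python
import math

def failure_probability(phi_value):
    # Under the phi model, 10**(-phi) is the chance a live node would
    # produce a gap this long, so treat the complement as failure belief.
    return 1.0 - 10.0 ** (-phi_value)

def combine(phis, prior=0.5):
    """Naive-Bayes combination of independent detector outputs (a sketch:
    independence between detectors is an assumption, not a given)."""
    log_odds = math.log(prior / (1.0 - prior))
    for p in map(failure_probability, phis):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to keep logs finite
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))  # back to a probability
```

Two detectors each reporting moderate suspicion yield a stronger combined belief than either alone, which is exactly the extra layer of inference the accrual model enables.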
A more sophisticated approach might be to use a support vector machine (SVM) to determine where the boundaries between various thresholds should lie for multiple failure detectors. If the application requires multiple classifications (available, slow, failed), as in the authors' real-world example, an SVM would support that as well. This model would be interesting because the SVM could be trained over a sliding window, similar to the one the authors used in the calculation of Φ, and it would provide classifications that adapt continually to changing failure detector outputs.
In addition to the single-process approach, continuous values can be combined statistically to create accrual failure detectors based on process clustering. If the application designer wanted to sacrifice efficiency for increased reliability, the accrual failure detector allows the application to cluster processes and assign a probability that at least a single node has not failed.
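Assuming independent failures, the probability that at least one process in a cluster survives is one minus the product of the individual failure probabilities. A sketch (the function name is mine; the independence assumption is, again, an approximation):

```python
def prob_at_least_one_alive(failure_probs):
    """Probability that at least one replica in the cluster is still alive:
    1 minus the chance that every replica has failed (independence assumed)."""
    p_all_failed = 1.0
    for p in failure_probs:
        p_all_failed *= p
    return 1.0 - p_all_failed
```

For example, three replicas each believed failed with probability 0.1 give a 0.999 chance that the cluster as a whole is still serving, which is the reliability-for-efficiency trade the accrual model lets the designer quantify.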
In summary, the approach presented in this paper is novel and opens up many possibilities for further failure detection research. In addition, because the accrual failure detector can be converted easily into a boolean failure detector, it remains amenable to the same mathematical rigor and testing. The ability of an application to detect or classify multiple phases of failure can be very useful in designing robust systems that can respond to failure with a more granular approach than is possible with boolean models.