Category: Appendix I: High Availability Deep-Dive

Disk-Based Health Check

On the primary node, one of the HAMon threads (the “heartbeat” thread) periodically refreshes the reservation key on all 8 SSDs, and the standby checks the reservation holder status. When the HAMon on the standby detects that the primary is not responding to the network-based health check, it starts monitoring the key status for some…


January 25, 2018

Network-Based Health Check

The HAMon on the secondary node periodically pings the primary node over a Thrift transport and retrieves state information (Thrift sends and receives packets over the internal Ethernet links). If the ping fails, it checks the disk reservation holder status a set number of times (150), with a delay (haMonReleasePollIntervalUs, 0.01 s) between checks. If the Primary…
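
The ping-then-poll sequence described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the HAMon implementation: `check_primary`, `HealthObservation`, and the two callables `ping_primary` and `primary_holds_reservation` are hypothetical names, and only the count (150) and the interval (0.01 s) come from the text. Because the excerpt is cut off before it says what the standby does with the result, the sketch only collects observations and makes no failover decision.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Values quoted in the excerpt; the constant names are illustrative only.
RESERVATION_POLL_ATTEMPTS = 150
HA_MON_RELEASE_POLL_INTERVAL_US = 10_000  # 0.01 s expressed in microseconds


@dataclass
class HealthObservation:
    ping_ok: bool      # did the network ping succeed?
    polls_done: int    # how many reservation checks were made
    holder_seen: int   # checks in which the primary still held the reservation


def check_primary(
    ping_primary: Callable[[], bool],               # hypothetical: network ping
    primary_holds_reservation: Callable[[], bool],  # hypothetical: disk check
    attempts: int = RESERVATION_POLL_ATTEMPTS,
    interval_us: int = HA_MON_RELEASE_POLL_INTERVAL_US,
    sleep: Callable[[float], None] = time.sleep,
) -> HealthObservation:
    """Ping first; only if the ping fails, poll the reservation holder status."""
    if ping_primary():
        return HealthObservation(ping_ok=True, polls_done=0, holder_seen=0)

    holder_seen = 0
    for attempt in range(attempts):
        if primary_holds_reservation():
            holder_seen += 1
        # Delay between checks, so none is needed after the final one.
        if attempt < attempts - 1:
            sleep(interval_us / 1_000_000)
    return HealthObservation(ping_ok=False, polls_done=attempts, holder_seen=holder_seen)
```

With the quoted values, a full polling pass spends 149 delays of 0.01 s, roughly 1.5 seconds, between the first and last reservation check. Passing `sleep` as a parameter is a design choice made here so the loop can be exercised without real waiting.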


January 25, 2018

HAMon

Each node runs an instance of the HAMon daemon. Each HAMon independently assesses the state of its own node. The HAMons also communicate to exchange peer node state and manage the election of a Primary node. When the system is in transition, a simultaneous state query to both HAMons might return different results, but the…


January 25, 2018