Adaptive Heartbeats For Our Information Superhighway


By Deepanshu

In our previous posts from the Courier series, we had discussed about the Courier library for mobile clients. We had also briefly discussed the Adaptive Keepalive feature, used for optimising the number of ping requests (or Keepalive packets) sent over the network.

In this post, we’ll talk about the Adaptive Keepalive feature.

Background

When a long-running connection between a client and a server is idle for a long time, it may get torn down due to TCP binding timeout. In order to keep the connection alive, the client device needs to send keep-alive packets(or ping requests) through the connection when it is otherwise idle. The interval at which these packets are sent is the Keepalive Interval.

The Courier library uses AlarmPingSender for sending ping requests, which schedules alarms to wake up the device when it needs to send a PING REQUEST to the MQTT broker.

We can see that the number of alarms triggered is directly proportional to the number of ping requests sent. Also, each time a keep-alive packet is sent, there is the cost of bringing the radio to a high power state, if the device was otherwise idle. Both of these be reduced by minimising the number of ping requests sent.

Adaptive Keepalive

Adaptive Keepalive is a feature in the Courier library which tries to find the most optimal Keepalive interval for a client on a particular network. This helps in optimising the number of ping requests sent over the network and keeping the connection alive.

Optimal Keepalive interval is the farthest time possible within the TCP binding timeout, which means that an idle connection cannot be kept alive beyond this interval.

There are multiple possible approaches for finding the optimal Keepalive interval like Binary Search, Exponential Search, Composite Search, and Linear Search.

Approaches

All approaches use an iterative probing that repeatedly send keep-alive packets at progressively longer intervals, until the optimal keep-alive interval is found.

This flowchart explains how optimal keep-alive interval is found using probing. This process is continued till high > low.

All the approaches are evaluated on basis of following parameters:

  • Probe Count: Number of probes performed in finding the optimal keep-alive interval.
  • Convergence Time: Total time taken to find the optimal keep-alive interval.

In this approach, alpha is chosen as (low+high)/2 i.e., at each iteration, the mid-point of the search space is chosen as the next KA interval to be tested. After the test is completed, the search space gets halved.
If the test was successful, then we search the second half of the initial search space to see if a better interval can be obtained. If, on the other hand, the test failed, then we continue searching in the first half of the initial search space. And this process is repeated.

In this approach, the difference between successive tested intervals follows geometric progression with a ratio of 2. So, the first few test intervals are 2, 4, 8, 16, 32 minutes and so on. If the next interval to be tested goes beyond high, then there is no need to test that interval. Instead, low is increased to the last successfully tested interval; and the progression of the interval differences is restarted at 1. On the other hand, if the next tested interval overshoots the binding timeout, then the connection is lost. In this case, in addition to the updates mentioned above, the high is reduced to 1 minute less than the failed test interval. This process is repeated until the optimal KA interval is reached.

Composite search is a combination of binary and exponential searches. Initially, it acts like the exponential search. The difference between successive tested intervals follows geometric progression with ratio of 2.
When a tested interval overshoots the actual binding timeout, the connection will get terminated. At that point, low is increased to the last successfully tested interval; and high is reduced to 1 minute less than the failed test interval. Subsequently, binary search is performed in the new search range.

In linear search, the difference between successive tested intervals follows an arithmetic progression. So, the first few test intervals are 4, 8, 12, 16, 20 minutes and so on. If the next interval to be tested goes beyond high, then there is no need to test that interval. Instead, low is increased to the last successfully tested interval; and the progression of the interval differences is decreased by half. On the other hand, if the next tested interval overshoots the binding timeout, then the connection is lost. In this case, in addition to the updates mentioned above, the high is reduced to 1 minute less than the failed test interval. This process is repeated until the optimal KA interval is reached.

Conclusion

Comparison of multiple approaches
  • In the above experiment, all approaches are evaluated for a lower bound of 4 minutes and upper bound of 120 minutes.
  • For low optimal Keepalive values, exponential search and composite search takes least amount of time and minimum number of keepalive packets are sent.
  • For high optimal Keepalive values, binary search performs the best with minimum time and Keepalive packets being sent.
  • Linear search also works well when the difference between lower and upper bounds is not very wide and the value of d is chosen wisely. Also, its the most simple approach in terms of implementation.

Network Change Behaviour

TCP binding timeout can vary significantly for different networks. Therefore it is important to calculate the optimal keep-alive interval for every network change.

Also, finding the optimal keep-alive interval for a particular network can be a long running process. So the clients should persist the optimal keep-alive interval found for a particular network.

Impact of Keepalive packet failure

So far, when a Keepalive message fails over the connection, we have assumed that the middle-box has dropped the connection due to too long an idleness. However, a packet could also be dropped for some transient issues in the network.

In case of mobile devices, the sender can also go back in sleep mode after sending the keep-alive packet which would result into ping failure.

The sender has no way to differentiate between the different causes of failure. This can be solved by sending keep-alive packet multiple times whenever a failure happens. If it fails every time, it can be safely considered a valid case.

References

Using Adaptive Heartbeat Rate on Long-Lived TCP Connections

Find more stories from our vault, here.
Also, we’re hiring! Check out open job positions by clicking below: