By Dhanesh Arole
I work at GOJEK’s payments and financial arm GO-PAY. Our platform is composed of hundreds of microservices, and GO-PAY processes almost half of all transactions in GOJEK. At this scale, having a well-defined and deterministic mechanism of service discovery is of paramount importance.
A general solution to service discovery is to use a distributed key-value store like Consul, Zookeeper, or Etcd, unless you have completely architected (or say, re-architected) your system to deploy a service mesh. We built our own service discovery mechanism using Consul and Envoy proxy (the details and rationale of why we did this are a topic for another blog post).
One of the core features of Consul is the watch. In our time using Consul, we learned a fair bit about how watches work internally. This post is an attempt to share some of our learnings.
An overview of Service Discovery using Consul
In our system, whenever a new instance of a service spins up, it registers itself to Consul and adds a health check for its status. Similarly, whenever a service instance goes down or shuts down gracefully, it deregisters itself from Consul. Any service/consumer that needs to talk to another service discovers its current instances from the Consul catalog and adds a ‘watch’ on changes in the status of that service.
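To make the register/deregister step concrete, here is a minimal sketch of the two calls against the Consul agent HTTP API. It is not our production code: the agent address, service ID, IP, port, health URL, and check interval are all assumptions chosen for illustration, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: a local Consul agent listening on its default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def build_register_request(service_id, name, address, port, health_url):
    """Build (but do not send) a PUT to the agent's service-register endpoint."""
    payload = {
        "ID": service_id,
        "Name": name,
        "Address": address,
        "Port": port,
        # HTTP health check run by the agent; the interval is illustrative.
        "Check": {"HTTP": health_url, "Interval": "10s"},
    }
    return urllib.request.Request(
        f"{CONSUL_ADDR}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_deregister_request(service_id):
    """Build (but do not send) a PUT to the agent's service-deregister endpoint."""
    return urllib.request.Request(
        f"{CONSUL_ADDR}/v1/agent/service/deregister/{service_id}",
        method="PUT",
    )
```

Passing either request to `urllib.request.urlopen` sends it to the agent. Building the request separately from sending it keeps the payload easy to inspect without a running Consul.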
So, how do Consul watches actually work internally?
Consul doesn’t have a native Watch implementation the way Zookeeper has on nodes. So, most of the language SDKs implement watches outside Consul, on top of Consul’s blocking queries. Many endpoints, such as catalog, support blocking queries.
For example, whenever a watch loop of type ‘service’ starts for a target service foobar, its first API call is /v1/health/service/foobar?index=0. Health API calls are also forwarded to the Consul server (the health API is a special implementation of the raw /catalog API).
As part of this API response, the Consul server returns a header called X-Consul-Index. This value is represented in the internal raft log data structure by a field named modifyIndex. The header indicates the raft log index at which the last write to that particular service was successfully applied and gossiped. In other words, it is the current version of the requested resource in the Consul catalog.
With this in mind, let’s say that for our service foobar, Consul returned an X-Consul-Index header value of x1 the first time. Consul SDKs store this value and, on the next iteration of the watch loop, call /v1/health/service/foobar?index=x1 to detect any change in the health of service foobar.
The nature of these blocking queries is such that they don’t return a response until, for that particular endpoint, a modifyIndex greater than x1 is present or the blocking query timeout is reached. Usually, this timeout is very high (default: 5 minutes; refer to Consul’s documentation on blocking queries). If the query times out, Consul returns the response with the same value, x1, in the X-Consul-Index header.
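A single blocking query can be sketched as follows. This is an illustration rather than any SDK's actual implementation: the agent address and the function names are assumptions, and the client-side timeout of 330 seconds is simply a value picked to sit a little above the 5-minute server-side wait so the client does not give up first.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local Consul agent listening on its default HTTP port.
CONSUL_ADDR = "http://127.0.0.1:8500"


def blocking_query_url(service, index, wait="5m"):
    """URL for a health query that blocks until the index moves past `index`."""
    query = urllib.parse.urlencode({"index": index, "wait": wait})
    return f"{CONSUL_ADDR}/v1/health/service/{service}?{query}"


def fetch_health(service, index, wait="5m", timeout_s=330):
    """Issue one blocking query; return (X-Consul-Index as int, decoded body).

    Returns either when the service changes or when the server-side wait
    expires, in which case the returned index equals the one passed in.
    """
    url = blocking_query_url(service, index, wait)
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        new_index = int(resp.headers["X-Consul-Index"])
        return new_index, json.loads(resp.read())
```

The first call would be `fetch_health("foobar", 0)`, and each later call passes the index returned by the previous one.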
Whenever a response to such a blocking query is received, the watch implementation in the Consul SDK compares the previously known value (in this case x1) with the new value returned in the X-Consul-Index header. If the returned value is greater than the last known value, it assumes that something has changed at that endpoint and calls that particular watch’s handler function.
An important thing to note here is that these implementations don’t look at the response body at all when deciding whether the watch’s handler function should be triggered. The decision is based only on consecutive X-Consul-Index header values. So even if two consecutive calls to /v1/health/service/foobar return the same response, they are assumed to be different if the X-Consul-Index header value in the second response is greater than in the first.
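The decision logic described above fits in a few lines. The sketch below is a simplified model, not the code of any particular SDK: `run_watch`, `fetch`, and `handler` are hypothetical names, and `fetch` stands in for whatever performs the blocking query and returns the new index together with the response body.

```python
def run_watch(fetch, handler, rounds):
    """Call `fetch` `rounds` times; fire `handler` whenever the index increases.

    `fetch(index)` must return (new_index, payload). The payload is handed to
    the handler but is never inspected when deciding whether to fire.
    """
    last_index = 0
    for _ in range(rounds):
        new_index, payload = fetch(last_index)
        if new_index > last_index:
            # Index moved forward: treated as a change, whatever the payload is.
            last_index = new_index
            handler(new_index, payload)
        # Otherwise the query timed out with the same index, so ask again.
    return last_index
```

If three consecutive responses carry indexes 5, 5, and 7 with byte-for-byte identical bodies, the handler fires for 5 and again for 7, even though nothing in the body changed between them.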
This is an explanation of how watches are implemented in Consul. Interestingly, this particular way of handling X-Consul-Index header values caused us a couple of problems when using Consul watches at scale (we did mention GO-PAY is Indonesia’s largest payments platform, right?). That’s an intriguing story, and deserves a post of its own. Keep watching this space!
Working at GOJEK is a continuous learning process. Our entire on-demand platform is built by just 250+ engineers, so suffice to say we spend a lot of time understanding how things work. If you like to learn and apply your smarts to tough problems, we’re more than happy to provide the opportunities. Check out gojek.jobs and come learn with us 🙌