Go-routines going gaga!

Solving leaky go-routines when computing surge pricing at scale

By Shubham Saxena

GO-JEK’s ride services (Go-Car, Go-Ride and Go-Bluebird) all offer dynamic (or surge) pricing to ensure high availability of drivers.

Surge factors for every region we service are computed by approximately 100 go-routines running on a single server. At peak, this service handles ~90K concurrent orders.

Recently, after a release of a new feature on this service, we saw a sharp upward spike in the number of go-routines.

Our system was thrashing, with 3 million active go-routines spawned. Our best guess: a memory leak.

What actually happened

So, each of these 100 go-routines uses a persistence object to write surge data to our database. It looks something like this:

We were calling repository.NewSurgeRepository() every time we needed to perform a database transaction in the surge calculation operation, and every call initialised a new cache object.

But initialising a cache object should not, on its own, result in a spike in active go-routines.

We dug further.

Here is how the cache was initialised:

Note that the function getCleanupInterval() returns a non-zero value based on our configuration. The important thing here is to understand how the call cache.New(getDefaultExpiration(), getCleanupInterval()) works.

When we call cache.New, it initialises a new cache with a Janitor like so:

The Janitor is a utility that deletes expired elements from the cache on every tick of the cleanupInterval.
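To make the Janitor's job concrete, here is a minimal sketch of tick-driven expiry. It is not the cache library's code: the expiringMap type, its methods and the demoSweep helper are hypothetical names invented for illustration, and the clock is passed in explicitly so the behaviour is easy to follow.

```go
package main

import (
	"fmt"
	"time"
)

// item pairs a cached value with the moment it stops being valid.
type item struct {
	value     string
	expiresAt time.Time
}

// expiringMap is a hypothetical, single-goroutine cache used only for
// illustration. A cache shared between go-routines would need a mutex.
type expiringMap struct {
	items map[string]item
}

func newExpiringMap() *expiringMap {
	return &expiringMap{items: make(map[string]item)}
}

// set stores value under key and marks it as expiring ttl after now.
func (m *expiringMap) set(key, value string, now time.Time, ttl time.Duration) {
	m.items[key] = item{value: value, expiresAt: now.Add(ttl)}
}

// deleteExpired is the work a janitor would do on each tick: it removes
// every entry whose expiry lies before now and reports how many it removed.
func (m *expiringMap) deleteExpired(now time.Time) int {
	removed := 0
	for key, it := range m.items {
		if now.After(it.expiresAt) {
			delete(m.items, key) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// demoSweep stores one short-lived and one long-lived entry, then simulates
// a single janitor tick two minutes later.
func demoSweep() (removed, remaining int) {
	start := time.Now()
	m := newExpiringMap()
	m.set("region-a", "1.2", start, time.Minute)
	m.set("region-b", "1.5", start, time.Hour)
	removed = m.deleteExpired(start.Add(2 * time.Minute))
	return removed, len(m.items)
}

func main() {
	removed, remaining := demoSweep()
	fmt.Println("removed:", removed, "remaining:", remaining)
}
```

In this sketch the sweep is triggered by hand. A janitor simply repeats the same sweep from a go-routine of its own, once per cleanup interval.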

Here is how that works:

Note that it initialises a go-routine inside the runJanitor(c, ci) function as shown below.

This is how Janitor runs:

The problem here is that the Janitor runs an infinite for loop until its cache object is garbage collected. Because we were creating a new cache for every database transaction, thousands of Janitor go-routines had already piled up by the time the garbage collector next ran.

This, then, was why active go-routines were spiking into the millions on our production servers.
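The effect is easy to reproduce in miniature. The sketch below is a simplified model, not the library's implementation: toyCache, janitor and leakedGoroutines are made-up names, and the cleanup worker is stopped with an explicit Close rather than by garbage collection. It creates caches the way our hot path did, never stops them, and counts the go-routines left behind.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// janitor is a simplified stand-in for a cache cleanup worker.
type janitor struct {
	interval time.Duration
	stop     chan struct{}
}

// run loops forever, calling evict on every tick, until stop is closed.
func (j *janitor) run(evict func()) {
	ticker := time.NewTicker(j.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			evict()
		case <-j.stop:
			return
		}
	}
}

// toyCache is a hypothetical cache that owns one janitor go-routine.
type toyCache struct {
	j *janitor
}

// newToyCache starts a janitor go-routine as a side effect of construction.
func newToyCache(cleanupInterval time.Duration) *toyCache {
	c := &toyCache{j: &janitor{interval: cleanupInterval, stop: make(chan struct{})}}
	go c.j.run(func() {}) // eviction itself is irrelevant to the leak
	return c
}

// Close asks the janitor go-routine to exit.
func (c *toyCache) Close() {
	close(c.j.stop)
}

// leakedGoroutines builds n caches without ever closing them and returns
// how many more go-routines exist afterwards than before.
func leakedGoroutines(n int) int {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		_ = newToyCache(time.Hour) // dropped immediately, janitor keeps running
	}
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("extra go-routines after 1000 caches:", leakedGoroutines(1000))

	// A cache that is created once and closed when done leaves nothing behind.
	c := newToyCache(time.Hour)
	c.Close()
}
```

Every constructor call costs one long-lived go-routine in this model, so the count grows in step with the number of caches created.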

How we solved it

Creation of a Repository was originally handled within Perform like so:

We parameterized and moved this one level up like so:

What changed

We basically injected our dependencies (the surge repository) into Perform. Perform is now free of the responsibility of creating a repository for itself and using it (all hail the SRP).

How does that solve our problem? It eliminates the large number of repository objects that were being created, each of which kept a Janitor go-routine alive until it was garbage collected.
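The difference between the two shapes can be sketched as follows. This is an illustration under assumptions, not our production code: SurgeRepository here is an empty stand-in, and Save, performOwningRepo, performWithRepo and reposCreatedBy are hypothetical names. The counter only exists to show how many repositories each style constructs.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// created counts how many repositories have been constructed so far.
var created int64

// SurgeRepository is an empty stand-in for the real persistence object.
type SurgeRepository struct{}

// NewSurgeRepository records each construction so the two styles can be compared.
func NewSurgeRepository() *SurgeRepository {
	atomic.AddInt64(&created, 1)
	return &SurgeRepository{}
}

// Save is a placeholder for writing a surge factor to the database.
func (r *SurgeRepository) Save(region string, factor float64) error {
	return nil
}

// performOwningRepo mirrors the original shape: it builds a repository on every call.
func performOwningRepo(region string, factor float64) error {
	return NewSurgeRepository().Save(region, factor)
}

// performWithRepo mirrors the fixed shape: the caller supplies the repository.
func performWithRepo(repo *SurgeRepository, region string, factor float64) error {
	return repo.Save(region, factor)
}

// reposCreatedBy runs n operations in the chosen style and returns how many
// repositories were constructed along the way.
func reposCreatedBy(n int, injected bool) int64 {
	start := atomic.LoadInt64(&created)
	var shared *SurgeRepository
	if injected {
		shared = NewSurgeRepository() // built once, one level up
	}
	for i := 0; i < n; i++ {
		if injected {
			_ = performWithRepo(shared, "region-a", 1.5)
		} else {
			_ = performOwningRepo("region-a", 1.5)
		}
	}
	return atomic.LoadInt64(&created) - start
}

func main() {
	fmt.Println("created inside Perform:", reposCreatedBy(1000, false))
	fmt.Println("created when injected: ", reposCreatedBy(1000, true))
}
```

With injection, the number of repositories no longer depends on how many operations run, which is the property we were after.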

Major Takeaways

  • SRP matters.
  • Avoid running memory-hungry processes in an infinite loop. If you have to, ensure you have monitoring and alerting on available resources.
  • Git logs sometimes come in handy when debugging. You can always go back, find where the anomalies started, and look at the code that was checked in just before.
  • If you are using a third-party library, dig into it and understand how it works. It will help you write bug-free code.
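As a small illustration of the monitoring takeaway, the sketch below polls the go-routine count and calls an alert callback when it crosses a threshold. The names goroutinesOver and watchGoroutines, the polling interval and the threshold are all assumptions made for this example. A real service would more likely export the count to its existing metrics and alerting system.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// goroutinesOver returns the current go-routine count and whether it
// exceeds threshold.
func goroutinesOver(threshold int) (int, bool) {
	n := runtime.NumGoroutine()
	return n, n > threshold
}

// watchGoroutines checks the count once per interval and calls alert when
// it is above threshold. It returns when ctx is cancelled.
func watchGoroutines(ctx context.Context, interval time.Duration, threshold int, alert func(n int)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if n, over := goroutinesOver(threshold); over {
				alert(n)
			}
		}
	}
}

func main() {
	// Run the watcher briefly. A threshold of 0 is deliberately unrealistic
	// so that the alert fires in this demo.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	watchGoroutines(ctx, 10*time.Millisecond, 0, func(n int) {
		fmt.Println("alert: go-routine count is", n)
	})
}
```

The watcher takes a context so that it can itself be shut down cleanly instead of becoming one more go-routine that never exits.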

Resources

Further Reading

Here are some of the resources I gained insight from, which you might find worth reading.

With great power comes great responsibility. Image source: golangbridge.org

(Thank you, Chirag Aggarwal for the awesome explanation. 😃)