Embracing failures at GO-JEK — Part 1


By Bergas Bimo Branarto

In my previous post, we spoke about hacking our way to automate everything. But sometimes, our automation hacks fail. And failures are important. The more we fail, the faster we grow. This is how we learn: the hard way, through feedback, mistakes and references. Some of our mistakes:

Missing the wood for the trees

It all began with Stan Marsh, our legacy monolith codebase. It was designed and developed with the assumption that once the business process for a feature was decided, it would never change. Naturally, we designed it with a rigid OOP structure. What followed was unprecedented growth and a story of routine failures. Because of this, we’ve been on a constant journey of experimentation, which has come at some cost.

The way the business has developed, combined with a rigid code structure, has made this a continuous challenge. The code needs to be refactored on almost every iteration, making this initial mistake a big one.

We ended up hacking our way out by adding more code on top of the existing code, just to avoid large refactors.

In our defense, speed was the main focus back then. We needed an MVP and had a lot to get done. Only later did we realise that by adding more complexity, we were eventually going to face a mammoth problem that would take years to tame.

No YAGNI mindset and keeping a broken window

When we code, we also try to find other use cases which might be needed and incorporate them as well. We seldom realise how much this can over-complicate the code. We should also be able to track which parts of the code actually add value to the business and operations, something we failed at.

You Ain’t Gonna Need It! We need to be able to grow bigger through frequent feedback iterations, with as little effort as possible. We didn’t embrace this mindset in the initial stages.

The problems arose when we had to make minor changes: a new feature, or a fix which touched several domains. A small tweak here would bring down the entire system. The codebase became spaghetti, with more hacks than one can fathom.

Before we realised it, it was too late. We were now afraid to touch the code because it would break some other part. As a result, we had a broken window. When more features needed to be worked on, or when a new member joined the team and started to code, one broken window was prone to create another, and so on.

To avoid broken windows, we had to be more conscious about our experiments and upgrades to existing logic. But that’s easier said than done. Back then, it’s safe to say our code was a potpourri of ‘how not to code’. But hey, those mistakes have served us well today.

No dashboards and system monitoring

Another mistake we made was to lower the priority of setting up dashboards and alerts. Every system has limits. Things can go wrong when a system reaches its limits, and nothing tells you in advance that the system is about to fail because of them.

We had to make the machines tell us when things were about to go wrong, so we needed to set up alerts to prevent downtime. We also needed a dashboard when doing refactors, to get feedback on how the refactor affects the system. We should have defined which metric we were improving with the refactor and measured how much it actually improved. The metric could be the number of error statuses, HTTP responses, or the increase in throughput compared to the decrease in the number of queries.

Simply put, we didn’t measure how big our system was or how it behaved, and we couldn’t optimise what we couldn’t measure.

To measure the success of our optimization, we had to be able to compare a value before and after the changes. This was a critical lesson in our early days.
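
To make this concrete, here is a minimal sketch of the kind of measurement we were missing. It uses the Prometheus Go client purely as an illustration (the metric name, the /bookings route and the port are made up, not our actual setup): counting HTTP responses by status code gives you a number you can put on a dashboard and compare before and after a refactor.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Count HTTP responses by status code so a refactor can be compared
// before and after on the same dashboard.
var responses = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_responses_total",
		Help: "HTTP responses served, labelled by status code.",
	},
	[]string{"code"},
)

func main() {
	prometheus.MustRegister(responses)

	http.HandleFunc("/bookings", func(w http.ResponseWriter, r *http.Request) {
		// ... handle the request, then record the outcome.
		w.WriteHeader(http.StatusOK)
		responses.WithLabelValues("200").Inc()
	})

	// Expose the counters for the monitoring system to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```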

No configurable connection/thread pool

This is also related to system resource limits. When service A accesses something from service B, it needs to be served by some of B’s resources: file system, memory, disk, processor threads, I/O, and so on. Service A should also be mindful of B’s limits, given the nature of the work it is asking for. So every connection and thread needs to be pooled and made configurable. With good monitoring, we can measure the pools and tweak them in real time to stabilise the system when needed.
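
As a sketch of what “pooled and configurable” can look like, Go’s database/sql exposes these knobs directly; the driver, function name and numbers below are placeholders that would come from configuration and be tuned against the monitoring described above.

```go
package storage

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // Postgres driver, shown only as an example
)

// openDB makes the connection pool explicit and configurable; maxOpen and
// maxIdle would come from config rather than being hard-coded.
func openDB(dsn string, maxOpen, maxIdle int) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	// Cap how much of the database's capacity this one service may consume.
	db.SetMaxOpenConns(maxOpen)
	db.SetMaxIdleConns(maxIdle)
	db.SetConnMaxLifetime(30 * time.Minute)
	return db, nil
}
```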

No circuit breaker and fallback mechanism

Whatever can break, will break. Prepare for failures. We need to define how the system should behave when a failure occurs, whatever it may be.

In a distributed system, communication between services is one of the most common failure points. Downtime is normal considering system resource limits, the number of transactions that need to be processed, the number of downstream servers that need to be called, and so on. These factors can cause unexpected behaviour whenever a downstream service responds slowly. Every upstream service should define a maximum timeout that ensures its own stability. We also need a fast fallback mechanism for when a downstream service is not accessible.
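
Here is a minimal sketch of the timeout and fallback halves of this idea in Go; the pricing client, the 200 ms budget and the in-memory cache are illustrative assumptions, not our actual setup. A full circuit breaker would additionally stop calling the downstream service after repeated failures, which this sketch leaves out.

```go
package pricing

import (
	"context"
	"sync"
	"time"
)

// Client is a stand-in for any downstream dependency.
type Client interface {
	GetPrice(ctx context.Context, route string) (int64, error)
}

type Service struct {
	downstream Client

	mu       sync.RWMutex
	lastGood map[string]int64 // last successful responses, used as fallback
}

func (s *Service) Price(ctx context.Context, route string) (int64, error) {
	// Bound the downstream call so a slow dependency cannot stall us.
	ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
	defer cancel()

	price, err := s.downstream.GetPrice(ctx, route)
	if err != nil {
		// Fast fallback: serve the last known good value instead of failing.
		s.mu.RLock()
		cached, ok := s.lastGood[route]
		s.mu.RUnlock()
		if ok {
			return cached, nil
		}
		return 0, err
	}

	s.mu.Lock()
	if s.lastGood == nil {
		s.lastGood = make(map[string]int64)
	}
	s.lastGood[route] = price
	s.mu.Unlock()
	return price, nil
}
```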

We also need to define how to handle data when a failure occurs in the middle of a transaction. Should we roll back? Should we keep the data but mark it as a failed transaction?
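
For the rollback option, a minimal sketch with Go’s database/sql (the table names, the booking example and the Postgres-style placeholders are hypothetical):

```go
package booking

import "database/sql"

// recordBooking writes a booking and its side effects in one transaction;
// if any step fails, the deferred Rollback discards the partial writes.
func recordBooking(db *sql.DB, id string, amount int64) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // becomes a no-op once Commit succeeds

	if _, err := tx.Exec("INSERT INTO bookings (id, amount) VALUES ($1, $2)", id, amount); err != nil {
		return err
	}
	if _, err := tx.Exec("UPDATE daily_totals SET total = total + $1", amount); err != nil {
		return err
	}
	return tx.Commit()
}
```

The other option, keeping the record and marking it failed, is simply a different write inside the same transaction boundary; the point is to decide which behaviour you want before the failure happens.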

No data archival strategy or TTL, no clear usage definition

Data storage has its limits. We use databases and Redis in a lot of places. Each of them has its own failure characteristics if left unmanaged.

Databases will eventually hit their disk limit. And with large datasets, bad indexing causes slow queries. Redis stores data on disk (if persisted) and in RAM, so an unmanaged instance can leave the box with a full disk or Out Of Memory. The solutions for both are clear: we needed to manage how much data we keep in those systems and how to keep it consistent across systems.

There were cases where we abused a single Redis instance as a message queue, a cache (temporary data kept to reduce the number of calls to the database), and a database (data referenced by certain transactions, with no backup in any other datastore). What do we do when that box goes OOM? Clear the data? What about the data which has no backup elsewhere? What happens to our queued messages? Are they lost?

We need to define and separate things based on usage.
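
As a sketch of what that separation can look like, assuming the go-redis client (the key names and the ten-minute TTL are illustrative): cache entries carry an explicit TTL so they expire instead of piling up until the box runs out of memory, while anything that must survive belongs in a proper datastore rather than in the same Redis instance.

```go
package cache

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

// cacheSurgeFactor stores a value that is safe to lose: it has an explicit
// TTL, and the source of truth stays in the database, not in this Redis.
func cacheSurgeFactor(ctx context.Context, rdb *redis.Client, area, factor string) error {
	return rdb.Set(ctx, "cache:surge:"+area, factor, 10*time.Minute).Err()
}
```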

These are just some of the mistakes we made. In the next post, I will detail more. Needless to say, these mistakes and errors in judgement have helped us tremendously. We’re stronger, sharper and more willing to experiment because of these failures. Oh, and the beautiful thing about these mistakes: they’re enshrined in our Engineering principles. We don’t look back and bicker over decisions. From our Group CTO, Ajey Gore: “Every decision is correct at the time it is made”. This principle alone allows us to think boldly, take ownership and scale.

As you can see, we have no shortage of complex problems to solve. Managing 18 products and expanding into 4 different countries comes with its own myriad of problems. Join us. Help us. We want the best engineering talent to consider GO-JEK before any other org. Grab this chance; I can assure you it will be worth your time.

For the next part of the story, please click here

gojek.jobs