Failures make us stronger — Part 2
By Bergas Bimo Branarto
This is the second post detailing the many failures at GO-JEK Tech in the past 36 months when we grew 6600x. For the first one, click here
No unit and automated integration tests
There is a story about our legacy monolith service from the time we stopped writing unit tests. The first deployment of a code change was fine. When further changes or refactors were needed on top of it, we realised how convoluted the code was. One minor change could affect other domains, turning it into a huge mess.
And thus, Stan Marsh was born.
We had to start creating unit tests for the codebase again. By this time, the code was too complex and convoluted; it even needed refactoring just to make its dependencies mockable. We then started writing tests for the features and fixes we were currently working on. Running each test class on its own was fine, but running the whole suite still failed, because the mocks conflicted between tests. So we ended up skipping tests in our build pipeline.
As a result, we were still not capturing the dependencies between domains, and code failures were prone to happen. It seemed more logical to rewrite the entire codebase as a bunch of small services than to refactor it. But we took time to address the root cause.
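One common way to avoid mocks conflicting between tests is to build a fresh mock for every test instead of sharing one across the suite. The sketch below illustrates that idea in Python with the standard `unittest` library. It is not the code from our monolith: `PaymentGateway`, `BookingService` and `run_suite` are hypothetical names made up for the example.

```python
import unittest
from unittest import mock


class PaymentGateway:
    """Hypothetical dependency that would call a remote service."""

    def charge(self, amount):
        raise RuntimeError("real network call, not available in tests")


class BookingService:
    """Hypothetical class under test; the gateway is injected so it can be mocked."""

    def __init__(self, gateway):
        self.gateway = gateway

    def book(self, amount):
        return "confirmed" if self.gateway.charge(amount) else "rejected"


class BookingServiceTest(unittest.TestCase):
    def setUp(self):
        # A fresh mock per test: nothing configured in one test leaks into another.
        self.gateway = mock.create_autospec(PaymentGateway, instance=True)
        self.service = BookingService(self.gateway)

    def test_successful_charge_confirms_booking(self):
        self.gateway.charge.return_value = True
        self.assertEqual(self.service.book(100), "confirmed")
        self.gateway.charge.assert_called_once_with(100)

    def test_failed_charge_rejects_booking(self):
        self.gateway.charge.return_value = False
        self.assertEqual(self.service.book(100), "rejected")
        self.gateway.charge.assert_called_once_with(100)


def run_suite():
    """Run both tests together, the way a build pipeline would."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BookingServiceTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because each test configures its own mock in `setUp`, the tests pass whether they run alone or as a whole suite, and in any order.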
Sometimes we lower the priority of building unit tests with proper coverage and rely heavily on our QA team. One of the most popular reasons is, ‘Time is not enough, we need more time to create this’. The thing is: manual tests by humans won’t scale with our growth, and there will always be holes in our logic when we code.
First, repeating the same manual tests over and over while also testing new code is a boring and time-consuming process. We need the help of machines to test how our code behaves.
Second, we can measure the duration from when development starts until it is done. Done means the code starts giving value to systems, operations or the business; in other words, it is deployed to production. Then we can compare the time spent writing unit tests during development with the time spent in a ‘ping-pong’ of dev/fix and manual tests between QA and developers. Most of the time, good coverage of unit and automated integration tests makes the code faster or safer to get deployed to production.
Unit tests also show their power when there is a need to refactor. As everything can change, we need to ensure the availability and consistency of the system while refactoring code to make it scalable.
To achieve the expected speed and maintain our level of growth, we had to scale services and teams. Each team had its own focus and plans. There is a tendency for teams to eventually work in silos, limiting communication with other teams.
We had to realise that in a distributed system services depend on each other, through API contracts, process latency, domain boundaries, etc. There are so many moving parts, which requires every team to own its tasks. When splitting a monolith into microservices, there are many cases where refactors happen across services to get the ideal domain boundary. This communication between teams is a huge task.
Sometimes an upgrade to a service requires changes to other services. This could be just a field addition to an existing contract, or removing part of a flow so it can be migrated to another service. This happens from time to time between two or more teams maintaining services, each with its own deployment plan. This is part and parcel of being a Super App of 18 products. Some teams roll out deployments in a staggered manner, by traffic percentage or by canary machine instances; others go big bang. These dependencies are a concern for the other services that have to roll out related changes.
A rollout plan should be discussed between teams. Feature toggles usually help a lot in this scenario. Backward compatibility should always be considered, and there should be a dashboard to monitor the rollout. Once all service changes are fully rolled out safely (no rollback needed), we can start deleting the old implementations.
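To make the idea concrete, here is a minimal sketch of a feature toggle with a rollout by traffic percentage and a backward-compatible contract change. It is an illustration under assumptions, not our production setup: `FeatureToggle`, the `estimate_fare_*` functions, the fare formula and the `currency` field are all hypothetical.

```python
import hashlib


class FeatureToggle:
    """Hypothetical in-memory toggle; a real one would read from config or a flag service."""

    def __init__(self, name, rollout_percent=0):
        self.name = name
        # 0 means off for everyone, 100 means on for everyone.
        self.rollout_percent = rollout_percent

    def is_enabled(self, user_id):
        # Hash the user id so the same user always lands in the same bucket (0-99).
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


def estimate_fare_old(distance_km):
    # Old implementation, kept until the rollout is complete.
    return {"fare": distance_km * 2000}


def estimate_fare_new(distance_km):
    # Backward compatible: the existing "fare" field is unchanged,
    # a new field is only added, so old consumers keep working.
    return {"fare": distance_km * 2000, "currency": "IDR"}


def estimate_fare(toggle, user_id, distance_km):
    """Route each request to the new or old implementation based on the toggle."""
    if toggle.is_enabled(user_id):
        return estimate_fare_new(distance_km)
    return estimate_fare_old(distance_km)
```

Raising `rollout_percent` step by step moves more users onto the new path, and setting it back to 0 acts as a rollback without a redeploy. Once it has stayed at 100 safely, `estimate_fare_old` and the toggle itself can be deleted.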
Burning out and not knowing when to pause
This is more about me, but it is safe to say I’ve seen enough of us go through this cycle. We should always take some time to destress, introspect, pause and not think about engineering. It’s our own responsibility to know our own limits. I feel far more productive and get twice as much done after a good break. Most human errors are made by burnt-out people, and they can cost lots of money, human capital, potential business, and even loyal customers.
It’s not a race. If it were, we’d hire thousands and win. Solving complex problems requires intellectual rigour. To stimulate creativity, breaks are important. Take a breather every once in a while and you come back sharper.
And lots more!
These mistakes happen across teams. Most of them are fixed; some are still a work in progress. Slowly but surely, we’re witnessing improvements in our offerings. We’re far more stable now. We’re better connected in how we work, both personally and between teams. These mistakes make us better. This is not to say we have cracked the code. No one has. And we’re not perfect; we can always do better. We’re in a continuous exercise of striving for better.
These are just some of the mistakes we made and we will make more. The beautiful thing about these mistakes: It’s enshrined in our Engineering principles. We don’t look back and bicker around decisions. From our Group CTO, Ajey Gore: “Every decision is correct at the time it is made”. This principle alone allows us to think bold, take ownership and scale.
Working with GO-JEK is an amazing experience because of its culture. It’s fun, exciting and you should grab your chance to work with a company that’s expanding across South East Asia. Trust me on this, we have one of the best engineering cultures and unique problems that come with scale. Join us! Head to gojek.jobs for more.