How GOJEK Manages 1 Million Drivers With 12 Engineers (Part 2)
An overburdened codebase, an updated tech stack, a big rewrite: how a lean team built the foundation for a Super App
By Adithya Venkatesan
This is Part 2 of a feature story profiling the ‘Allocations’ team at GOJEK. For Part 1 of the story, please click here.
The Infinite Onion
Every onion layer you peel is accompanied by more tears. It seems like an endless problem. And just when you think it’s done, there’s another layer. For the next 3 months, it was onion after onion, layer after layer across teams at GOJEK. Downtimes were the new normal by the beginning of 2016.
Back to square one.
The ‘Broadcast algorithm’ the bid engine team was relying on was failing. But how?
Every driver was seeing the same order multiple times. The algorithm broadcast each order across its entire driver database. So if there were 100 orders in a specific area and 200 drivers, every driver would see every order, but not necessarily be able to fulfil it. The algorithm had a three-fold problem: accountability, high concurrency and unhealthy competition.
Accountability: How can we reward drivers who complete more orders with zero cancellations when they simply couldn’t accept an order in the first place? How can we deny a bonus when, by design, a driver might miss out on an order for a dozen reasons? There was no accountability for the driver, or for the business fundamentals.
High-concurrency: The sheer volume of orders meant drivers were missing out on orders because each one was blasted across every phone. Some orders were not being fulfilled because of multiple blasts and server load. More orders and fewer drivers meant some orders went unfulfilled, which resulted in a poor customer experience.
Note: The location-based orders are a peculiar problem for GOJEK.
Why? Within 20 metres, you’ll spot 30+ GO-RIDE scooters, as opposed to maybe a maximum of 10 cars.
Unhealthy competition: Once you’re blasting an order to all, you’re not factoring in quality drivers for customers. We were also not getting the nearest driver for an order. This breeds unhealthy competitiveness among drivers.
There was reasonable doubt about the way the algorithm was designed, on top of other constraints outside its control. Who got the order became a function of the phone: better GPS, hardware, Internet and software all played a critical part. And that was unfair. So zero accountability and high congestion of drivers meant things were going awry.
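To make the failure mode concrete, here is a minimal sketch of a broadcast model. It is not GOJEK’s code: the `Driver` fields, the latency figures and the "first accept to reach the server wins" rule are all assumptions made for illustration. It shows two things described above: notifications grow as orders multiplied by drivers, and the winner is decided by the phone rather than by proximity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Driver:
    name: str
    distance_m: float  # distance from the pickup point (hypothetical field)
    latency_ms: float  # phone + network delay before an accept arrives (hypothetical field)


def broadcast_notifications(num_orders: int, num_drivers: int) -> int:
    """Every order is pushed to every driver, so fan-out is multiplicative."""
    return num_orders * num_drivers


def broadcast_winner(drivers: list[Driver]) -> Driver:
    """Assumed rule: the first accept to reach the server wins the order."""
    return min(drivers, key=lambda d: d.latency_ms)


drivers = [
    Driver("near_slow_phone", distance_m=50, latency_ms=900),
    Driver("far_fast_phone", distance_m=1200, latency_ms=120),
]

print(broadcast_notifications(100, 200))  # 20000
print(broadcast_winner(drivers).name)     # far_fast_phone
```

With the article’s own numbers, 100 orders and 200 drivers produce 20,000 notifications, and in this toy setup the driver 1.2 km away wins simply because their accept arrives first.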
10x growth, 100% failure
When Niranjan pulled a couple of all-nighters and rewrote the code, the core portion was rewritten as a SPIKE. What is a spike? You break the rules and throw caution to the wind with the objective of shipping something out to keep the company afloat. The problem with the SPIKE was that it wasn’t the end solution. And that meant more downtimes and more failures. By late 2015, the team was in murky waters.
At this point, GOJEK was managing 300,000+ orders every day. Failures were routine. Again. Wherever Nadiem went, he was questioned on why the app was crashing or users could simply not find customers. At this point, the tech team was made up of around 10 people, who were firefighting every day. When Shobhit, one of our star programmers, went to a Domino’s store nearby to grab a quick bite, drivers started questioning him. Anyone who wore a GOJEK T-shirt became the unofficial complaint box. Something needed to change, and fast.
This was again an underestimation of how much Indonesians relied on GOJEK. Everyone wanted to use GOJEK. It made life easier in the traffic-congested glut that was Indonesia. Importantly, jobs and lives depended on it.
“No project has a budget and impact as big as this in GOJEK’s history”
The big rewrite — The Perfect Allocation
The team needed to work on a different algorithm: 1–1 personalisation, pin accountability on drivers, identify what a perfect driver looks like, and ideate on how to frame this persona. The big rewrite began in the middle of 2016. The ‘bid engine’ team was now rechristened as the ‘Allocations’ team. At this point, we were still losing customers. There were leaky faucets that were not sealed. After all, the work of the Allocations team criss-crossed all of GOJEK’s products and services. It was time to revisit the mothership.
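As an illustration of what one-to-one allocation with driver accountability could look like, the sketch below offers each order to exactly one free driver, chosen by a score. This is not the Allocations team’s algorithm: the score formula, its weighting and every field name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Driver:
    name: str
    distance_m: float       # distance to the pickup point (hypothetical field)
    completion_rate: float  # share of offered orders completed, 0.0 to 1.0 (hypothetical field)
    busy: bool = False


def score(d: Driver) -> float:
    """Hypothetical score, lower is better: penalise distance, reward reliability."""
    return d.distance_m * (2.0 - d.completion_rate)


def allocate(drivers: list[Driver]) -> Optional[Driver]:
    """Offer the order to exactly one free driver, or return None if nobody is free."""
    free = [d for d in drivers if not d.busy]
    if not free:
        return None
    chosen = min(free, key=score)
    chosen.busy = True  # the order is now this one driver's responsibility
    return chosen


fleet = [
    Driver("A", distance_m=100, completion_rate=0.50),
    Driver("B", distance_m=120, completion_rate=0.95),
    Driver("C", distance_m=80, completion_rate=0.90, busy=True),
]

first = allocate(fleet)
second = allocate(fleet)
third = allocate(fleet)
print(first.name, second.name, third)  # B A None
```

Because each order goes to a single driver, an accept or a cancellation can be attributed to that driver, which is what makes rewards and bonuses defensible. Driver B wins the first order despite being slightly farther than A because of the reliability term in this made-up score.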
Back to square one. Back to taking risks. By now, the core team was all too familiar with handling high-pressure timelines and live codebases. Clojure was an obvious choice because of the specific complexities it intended to solve.
“Only two in the team knew Clojure then, but it solved an important business problem. We went with it and we all had to learn. Back to school. Again.” — Niranjan Paranjape
The first task was to replicate the bid engine logic. A 6-member team got to work with Clojure. Why Clojure? Because the language offers better abstractions for the specific problem the team needed to solve. While Golang was the modern superbike that had it all, Clojure was the cruiser: really simple, yet capable of expressing complex logic. Clojure ushered in the idea of getting organised and ensuring good software development practices.
On the left, you see the Allocation code in Go. On the right, the exact same code in Clojure.
This is not to state one language is better than the other, though it’s tempting to arrive at that conclusion when you see the image above. There were trade-offs when the switch was made. While Go is superior in performance, making changes and adding features was hard. Language was traded for design.
The innate ability to sniff out what works, when, how and why is what makes lean engineering so special at GOJEK.
“The more boring a rewrite is, the sweeter the success,” says Shobhit. After the two-month-long big rewrite, a stable product was live. Pause. Breathe. Three days after the release, no one had noticed there was a new codebase and algorithm in place. That’s what success tastes like. Smooth as butter. No issues, and scale achieved.
Shaping a mindset
That’s half the story told. A million mistakes later, we’re still making mistakes. But that’s the good part. We fail fast. We build fast. No hierarchy. There’s an ingrained mentality of managing more with less. Anything that’s repetitive gets automated. One could argue this was born out of the desperation of GOJEK products being the arteries criss-crossing through the heart of Indonesia. Regardless, the engineering psyche was passed down and filters through our recruitment. Here’s a reckoner on why GOJEK is hard to get into and equally hard to abandon.
A simplified version of the story would merely state that the Allocations team allocates drivers to customers. But its genesis is filled with fascinating engineering insights. How do you factor in supply and demand, reward drivers, manage driver health by reducing the workload, figure out surge pricing, check for loopholes, and so on? At similar startups of GOJEK’s scale, each function has dozens of people. We are able to cut this down because of our emphasis on lean engineering. We don’t make compromises on our recruitment either. Leaders code. Everyone codes.
Engineers are running their own startups in a startup. GOJEK is creating a one-of-a-kind Super App with a platform for other startups to be part of.
Today, anyone within a 300-metre range can grab a ride. That’s only an average. In popular malls near Jakarta, there are drivers every 10 metres or less. Then came dashboards and data to crunch driver statistics, and daily research to tweak the algorithm. The last time I checked, GOJEK does 35+ orders each second across services like GO-FOOD, GO-SEND and GO-MASSAGE.
The ship of Theseus
You use a really old car to commute to the office. It breaks, stutters and sometimes refuses to move entirely. You can’t scrap it because it’s the only car you have. But you want a supercar. So you go about buying the steering wheel and fixing it to the old car. Then come the rims, the music system, the leather seats and, slowly, the car begins to take shape. But it still has components of nostalgia; it is the car that ferried you in dark times when nothing else would. Stan Marsh is that old car.
Remember Stan Marsh? The old legacy code on which GOJEK was being built?
10% of Stan Marsh survives, even to this day. (There is a plan to eventually put it to bed.) It’s there for legacy reasons. I suspect the team is also sentimental about it. Think of it as the ‘Ship of Theseus’ conundrum. No matter what engineers who join GOJEK think of Stan Marsh, it was the foundation on which GOJEK was built. Smart engineering is also about working with a legacy codebase and improving it. Fly with what you have and make it better. Everything else will follow. The team embraced that challenge.
It all boils down to the kind of people you let into the system. People are empowered to make decisions at GOJEK. As our India Head, Sidu Ponnappa, often repeats, “Don’t throw people at a problem.” Throwing people at a problem is a typical outsourcing mindset that Indian engineers have been cajoled into. More people does not mean better work. More people does not mean better code. If that were true, GOJEK simply wouldn’t exist today, doing 100 million+ orders a month with a paltry 200+ engineers.
Story credits: Shobhit Srivastava, Ranjeet Singh, Mehakdeep Singh, Bergas Bimo Branarto.
There are incredible, untold stories inside GOJEK and the aim is to cover them on this blog. Watch this space for more. We’re hiring engineers! If you want to know more, visit superapp.is. And we’re expanding across South East Asia. Grab this opportunity to work with folks who move the needle for an entire country. Just FYI: we do about half a million orders per engineer. 🖖