
A tale of two Lambdas — Solving Event Sourcing at GO-JEK

Gojek

May 16, 2018 • 3 min read

Team Lambda is trying to solve Event Sourcing at GO-JEK. We are building products that make the transition to event sourcing seamless for our developers. In a nutshell, we are building a Heroku-like PaaS framework that lets us quickly create and deploy units of logic.

Solving Events to Action
Event Sourcing, as we use the term, is a Pub-Sub model in which a service publishes its events onto a message bus (Kafka in our case, where we process ~6 billion events daily). Subscribers listen to the events relevant to them and act on them, translating Events into Actions.

We are abstracting this out into a framework with plug-and-play capability for custom actions, so each Lambda Actor can be an independent action. This decouples the applications from each other, which reduces cross-service dependencies and improves uptime.
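To make the plug-and-play idea concrete, here is a minimal sketch in Python. It is not the Lambda framework itself (which is written in Clojure on top of Kafka Streams); `InMemoryBus`, `reverse_reservation`, the topic name and the event fields are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# An actor is just a function from an event (a dict) to a result.
Actor = Callable[[dict], object]


class InMemoryBus:
    """Toy stand-in for a message bus such as Kafka (hypothetical, not the real framework)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Actor]] = defaultdict(list)

    def subscribe(self, topic: str, actor: Actor) -> None:
        # Plug an actor into a topic; the publisher never learns about it.
        self._subscribers[topic].append(actor)

    def publish(self, topic: str, event: dict) -> list:
        # Deliver the event to every actor subscribed to the topic.
        return [actor(event) for actor in self._subscribers[topic]]


def reverse_reservation(event: dict) -> str:
    # Hypothetical action: act only on cancellations, ignore everything else.
    if event.get("status") != "cancelled":
        return "ignored"
    return f"reversed:{event['booking_id']}"


bus = InMemoryBus()
bus.subscribe("booking-events", reverse_reservation)
results = bus.publish("booking-events", {"booking_id": "B-1", "status": "cancelled"})
```

The publisher only knows the topic name, so adding a second action means subscribing another function rather than changing the publishing service.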

Our aim is to improve our developers’ productivity and reduce the time-to-production for new services.

The Study
Since we are just starting up, we don’t yet have solid numbers on the improvement in developer efficiency. But the MVP gave us some interesting observations about two peripheral benefits: cost and response time. Watch this space for more as we delve deeper.

AWS Lambda is the most logically adjacent product to what we are building. So, we did a cost analysis of running the same workloads on AWS.

Where things stand

The base framework is implemented in Clojure. The actors use Kafka Streams in the background to read messages from Kafka. For stream support, we need a JVM based language. Also, since we wanted to treat each actor merely as a function, picking a functional language made sense. Clojure fits the bill. That we also ❤️ Clojure sealed the deal.

Let’s take ‘Payments Processor Worker’ as our example, which is a Transport Service worker process, implemented in JRuby + Sidekiq. It is used to reverse balance reservations in the user’s Go-Pay wallet for bookings that are cancelled.

We rewrote the processing function in Clojure and deployed it to production on top of the Lambda framework. It runs in parallel with the current Sidekiq jobs, processing the same cancellations. Because our implementation is idempotent, running both side by side is safe, and it gives us real production data to benchmark against.
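Idempotency is what makes the parallel run safe: the second attempt to reverse the same booking must be a no-op. A minimal sketch of that property in Python, assuming a hypothetical in-memory `Wallet` (the real Go-Pay wallet API is not shown in this post):

```python
class Wallet:
    """Hypothetical in-memory wallet used only to illustrate idempotency."""

    def __init__(self, balance: int, reserved: dict) -> None:
        self.balance = balance
        self.reserved = dict(reserved)   # booking_id -> reserved amount
        self.reversed_bookings = set()   # bookings already handled

    def reverse_reservation(self, booking_id: str) -> bool:
        # Return True only if this call actually changed the wallet.
        if booking_id in self.reversed_bookings or booking_id not in self.reserved:
            return False
        self.balance += self.reserved.pop(booking_id)
        self.reversed_bookings.add(booking_id)
        return True


wallet = Wallet(balance=100, reserved={"B-1": 25})
first = wallet.reverse_reservation("B-1")   # e.g. the Sidekiq job gets there first
second = wallet.reverse_reservation("B-1")  # e.g. the Lambda Actor sees the same booking
```

Whichever worker arrives first performs the reversal; the other one finds nothing left to do, so the balance is credited exactly once.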

Analysis:

1. For Sidekiq Worker
(assuming we run 2 AWS Lambdas with 3 GB RAM each)
The data collected here covers one week (from 5th April to 12th April).

Total # of Reversal Jobs weekly = ~500,000
Avg. # of Reversal Jobs per day = ~70,000
Avg. Response Time = 90 ms

Total Bill per month = $37.41

Unit economics (cost per reversal) = $0.0000187

2. For Lambda Actor

In the Event Sourcing model, the actor will process all messages in the topic rather than specific bookings assigned to it.

Total # of transactions processed by Actor weekly = ~35,000,000 (⬆70X)
Avg. Response Time = 1 ms (⬇90X)

Total Bill per month = $13.15

Unit economics (cost per reversal) = $0.0000066
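The two unit-economics figures can be reproduced with a few lines of Python. This is a sketch that assumes a month is counted as 4 weeks of ~500,000 reversals; that assumption is ours, chosen because it reproduces the numbers quoted above.

```python
WEEKS_PER_MONTH = 4          # assumption: reproduces the per-reversal figures above
WEEKLY_REVERSALS = 500_000   # approximate weekly reversal jobs from the analysis


def cost_per_reversal(monthly_bill_usd: float) -> float:
    # Monthly bill divided by the number of reversals handled in that month.
    return monthly_bill_usd / (WEEKLY_REVERSALS * WEEKS_PER_MONTH)


sidekiq_cost = cost_per_reversal(37.41)  # Sidekiq Worker estimate
actor_cost = cost_per_reversal(13.15)    # Lambda Actor estimate

print(f"Sidekiq Worker: ${sidekiq_cost:.7f} per reversal")
print(f"Lambda Actor:   ${actor_cost:.7f} per reversal")
```

Both figures divide by the number of reversals rather than by the number of events, so they stay comparable even though the actor reads far more messages.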

What happened there?

1. The response time dropped from 90 ms to 1 ms.
But that is a bit deceiving, as the actor is processing far more events than before (almost 70x more).

Even if we charge all 70 events to a single reversal, 70 × 1 ms = 70 ms, which is still about 22% faster than the 90 ms the Sidekiq worker takes to process the same number of reversals.

2. The monthly bill dropped from $37.41 to $13.15, a reduction of almost two-thirds.
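Both comparisons are the same percentage-reduction calculation applied to different inputs. A short Python sketch, using only the figures from the analysis above (`percent_reduction` is a helper name introduced here for illustration):

```python
def percent_reduction(before: float, after: float) -> float:
    # How much smaller `after` is than `before`, as a percentage of `before`.
    return (before - after) / before * 100


# Worst case: all ~70 events read per reversal are charged to that reversal.
actor_ms_per_reversal = 70 * 1   # 70 events at 1 ms each
sidekiq_ms_per_reversal = 90

latency_gain = percent_reduction(sidekiq_ms_per_reversal, actor_ms_per_reversal)
cost_gain = percent_reduction(37.41, 13.15)

print(f"Response time: {latency_gain:.1f}% better")
print(f"Monthly bill:  {cost_gain:.1f}% lower")
```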

There is a demonstrable improvement in both cost and efficiency on moving to the Event Sourcing model.

Scalability

The Lambda Actor (Payments Processor Worker) in its MVP phase is processing 5 million events a day on two VMs, each with 4 cores and 4 GB RAM.

Moving ahead, we will be incorporating our container initiative. This will further improve resource utilisation, scalability and reliability, and make it a true infrastructure-on-demand product.

All these things combined make us really optimistic about what lies ahead for our product.

Other Value Additions
Some other features that we are building in the framework:

- A Retry-on-Failure Service with backoffs and a Dead Letter Queue, which queues a message for retry whenever the function returns :retry. It includes an API to view and retry messages in the dead-set.
- A drastic reduction in Time-to-Production, from days to hours.
- Time saved in building the pipelines around services for logs, metrics and alerts.
- A CLI to provision and configure actors.
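The retry-on-failure behaviour in the first item can be sketched as follows. This is an illustrative Python model, not the framework's implementation: the string `"retry"` stands in for the Clojure keyword :retry, and `MAX_ATTEMPTS`, `BASE_DELAY_S` and the exponential backoff schedule are assumptions, since the post does not state the actual limits or backoff policy.

```python
from collections import deque

RETRY = "retry"      # stands in for the Clojure keyword :retry
MAX_ATTEMPTS = 3     # assumed limit; not stated in the post
BASE_DELAY_S = 1.0   # assumed base delay; not stated in the post


def backoff_delay(attempt: int) -> float:
    # Exponential backoff: 1s, 2s, 4s, ... for attempts 1, 2, 3, ...
    return BASE_DELAY_S * 2 ** (attempt - 1)


def run_with_retries(actor, messages):
    """Process messages; return (retry schedule, dead letter queue)."""
    queue = deque((msg, 1) for msg in messages)
    schedule, dead_letters = [], []
    while queue:
        msg, attempt = queue.popleft()
        if actor(msg) != RETRY:
            continue  # success: nothing more to do for this message
        if attempt >= MAX_ATTEMPTS:
            dead_letters.append(msg)  # give up; keep for inspection and manual retry
        else:
            # The delay is recorded rather than slept, to keep the sketch fast.
            schedule.append((msg, backoff_delay(attempt)))
            queue.append((msg, attempt + 1))
    return schedule, dead_letters


def flaky_actor(msg):
    # Hypothetical actor that always asks for a retry on "bad" messages.
    return RETRY if msg == "bad" else "ok"


schedule, dead_letters = run_with_retries(flaky_actor, ["good", "bad"])
```

Here the "bad" message is retried with growing delays and then parked in `dead_letters`, which is the list an API for viewing and retrying dead messages would expose.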

References:
- AWS Lambda Pricing

We’re scaling rapidly, from 15k to 100+ million monthly completed orders — that’s 6600x — in 36 months. Consequently, we have no dearth of Hard Technical Problems™ and are looking for hackers with strong, hands-on engineering skills to join our team. Check out gojek.jobs for more.
