Asgard: A case study to envision data infrastructure automation at GO-JEK

By Ravi Suhag

Our internal services generate billions of events a day, so our data infrastructure must be highly scalable, available, and flexible enough to keep up with rapid product iteration and exponential data growth. As a result, we spend a substantial portion of our time on infrastructure provisioning, load testing, and system recovery. This article describes how we envisioned Asgard, our toolbox for automating our data infrastructure end to end.

Product Goal

Every product needs a vision. It can come from many sources, the most important being the need to solve existing pain points and fulfil a real, deeply felt human need. The goal is to create an experience that respects users’ time and effort and makes the product the no-brainer solution.

It’s always tempting to set a prescriptive goal, which ends up being too macro or vague. To align efforts towards a common goal, gather rich information and evidence about who the customer or user is and what their experience is like. For us, Asgard has clear goals:

  1. Reduce the man hours spent on data infrastructure provisioning to zero.
  2. Make load testing an integral part of provisioning.
  3. Run disaster simulations in an automated, controlled environment.
  4. Build an internal auto-healing playbook/service to empower teams and live in a pager-free world.

User Personas

A user persona is a character that represents a potential user of your product. It’s a representation of the goals and behavior of a hypothesized group of users. To introduce persona development into your product thinking, frame the problems and opportunities as they relate to the most important people — your users.

John Doe: Data Engineer with privileges for infra provisioning.
John Roe: Product Engineer from an internal team who needs data infrastructure.
Richard Roe: Data Analyst who consumes data through different data products.
Jane Roe: Management stakeholder who wants to ensure systems don’t fail.

Product Narrative

The power of storytelling has an immense impact on human culture. We are wired to respond to stories, so it makes perfect sense to use narrative to create the emotional, human connection to a product.

The product narrative answers the question of why this product exists. Answering that ‘why’ provides context for the vision of the product. It creates the world in which the vision exists.

Asgard Architecture

This product narrative then translated to five services:

Odin — The Infrastructure orchestrator

Odin helps to safely and predictably create, change, and improve infrastructure. It allows us to define infrastructure as code to increase operator productivity and transparency.
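
To make the idea of infrastructure as code concrete, here is a minimal sketch of how an orchestrator could compare a declared (desired) state with the current state and derive the steps to reconcile them. This is not Odin's actual implementation; the `Resource` type, the `plan` function, and the resource names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Resource:
    """A hypothetical piece of infrastructure declared as code."""
    name: str
    kind: str
    size: int  # number of nodes


def plan(current: Dict[str, Resource],
         desired: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, resource name) steps needed to reach `desired`."""
    steps = []
    # Create what is missing and update what differs from the declaration.
    for name in sorted(desired):
        if name not in current:
            steps.append(("create", name))
        elif current[name] != desired[name]:
            steps.append(("update", name))
    # Delete what is no longer declared.
    for name in sorted(current):
        if name not in desired:
            steps.append(("delete", name))
    return steps


# Illustrative state only; these resources are made up.
current = {
    "events-kafka": Resource("events-kafka", "kafka", 3),
    "old-cache": Resource("old-cache", "redis", 1),
}
desired = {
    "events-kafka": Resource("events-kafka", "kafka", 5),
    "metrics-db": Resource("metrics-db", "postgres", 1),
}

for action, name in plan(current, desired):
    print(action, name)
```

Because the plan is computed before anything is changed, an operator can review it first, which is one way to get the predictability and transparency described above.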

Loki — Infrastructure disaster simulation

Loki is a disaster simulation tool that helps ensure our infrastructure can tolerate random instance failures. Loki randomly terminates compute instances and containers that run inside of a production environment. Exposing engineers to failures more frequently incentivises them to build resilient services.
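
The random-termination idea can be sketched in a few lines. The example below only shows how victims might be chosen; it is an assumption-laden illustration, not Loki's code. The `pick_victims` function, the `protected` set, and the `max_fraction` safety cap are hypothetical, and the actual termination call is deliberately left out.

```python
import random
from typing import List, Set


def pick_victims(instances: List[str],
                 protected: Set[str],
                 max_fraction: float,
                 rng: random.Random) -> List[str]:
    """Randomly choose instances to terminate.

    Never picks a protected instance, and never picks more than
    `max_fraction` of the whole fleet in a single run.
    """
    candidates = [i for i in instances if i not in protected]
    limit = int(len(instances) * max_fraction)
    count = min(limit, len(candidates))
    return rng.sample(candidates, count)


# Illustrative fleet; the instance names are made up.
fleet = [f"broker-{n}" for n in range(1, 7)]
victims = pick_victims(fleet, protected={"broker-1"}, max_fraction=0.5,
                       rng=random.Random())

for victim in victims:
    # A real tool would call the cloud or container API here.
    print("would terminate", victim)
```

Passing the random number generator in as a parameter keeps the selection testable: a seeded `random.Random` makes a simulation run repeatable when a failure needs to be reproduced.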

Thor - Infrastructure auto healing

Thor is a service for infrastructure auto healing and workload balancing. Thor can automatically detect broker failure and reassign the workload on the failed nodes to other nodes. Thor can also perform load balancing and make sure broker usage does not exceed the defined settings.
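
As a rough illustration of reassigning work from a failed node, the sketch below moves each unit of work from the failed broker to whichever healthy broker currently holds the least. It is not Thor's algorithm; the `reassign` function, the use of partition names as the unit of work, and the least-loaded strategy are assumptions made for the example.

```python
from typing import Dict, List


def reassign(assignments: Dict[str, List[str]],
             failed: str) -> Dict[str, List[str]]:
    """Move the failed broker's partitions to the least-loaded survivors.

    Returns a new mapping; the input is left unchanged.
    """
    survivors = {b: list(p) for b, p in assignments.items() if b != failed}
    if not survivors:
        raise ValueError("no healthy brokers left")
    for partition in assignments.get(failed, []):
        # Least-loaded broker first; ties are broken by broker name.
        target = min(survivors, key=lambda b: (len(survivors[b]), b))
        survivors[target].append(partition)
    return survivors


# Illustrative cluster; broker and partition names are made up.
cluster = {
    "broker-1": ["p0", "p1"],
    "broker-2": ["p2"],
    "broker-3": ["p3", "p4", "p5"],
}

healed = reassign(cluster, failed="broker-3")
for broker, partitions in healed.items():
    print(broker, partitions)
```

Placing each partition on the currently least-loaded broker also keeps the surviving nodes roughly balanced, which touches on the load balancing role mentioned above. A production system would additionally have to respect capacity limits and replica placement.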

Heimdall - Data monitoring

Heimdall is a data monitoring and tracing service. It collects and visualises the data and events generated by our data engineering infrastructure. Heimdall builds reporting dashboards for daily monitoring of the state of data collection at GO-JEK.

Bifrost - Infrastructure access

Bifrost allows internal teams to request data infrastructure without the intervention of the data team. It also lets users take a closer look at the data services dedicated to their own teams.

In the coming posts, I will be discussing each service in much more detail.

How it turned out and where we’re going next

Odin has helped us reduce provisioning time by 99% despite an increasing number of requests. We can now load test and run disaster simulations with Loki on our performance infrastructure with full confidence. We can run recipes on Thor to replay old data and prevent data loss.

At the time of writing this post, we’re working on a model that allows these services to communicate and coordinate their actions with each other.

Leading a product from conception to completion is no easy task: it involves prioritizing features, organizing requirements, and creating and maintaining product roadmaps. Over the last few years, product narrative has proved to be a very effective technique for designing interfaces and UI/UX flows, but it’s not common to use this approach for envisioning complex, distributed technology architecture and service interactions. We believe that personas and narratives are useful not only for consumer-facing products, but for every product. This approach has helped us shape our products for the better.

If you like what you’re reading and are interested in taking on some of these challenges with our team, do check out our engineering openings at gojek.jobs. If none fits the bill, mail us. We always have room for talented folks. As always, we would love to hear what y’all think 🖖