An Introduction to Gojek’s Machine Learning Platform

Common problems related to Data Science and how our Machine Learning Platform aims to solve them.

By Yu-Xi Lim

Gojek’s Data Science teams work on some of the most interesting problems in transport, logistics, and economics. We use machine learning to build data products for ride-hailing, logistics, food delivery, and payments. These products select the right driver to dispatch, set prices dynamically, serve food recommendations, forecast real-world events, detect fraud, and preserve trust. In total, we process hundreds of millions of orders per month, across more than 20 products, in 4 countries. All of this is driven by machine learning.

The problem

With our experience in developing and operating machine learning systems in production, we observe the following problems in the way that they are typically developed:

  • The data science development experience can be painful: Data scientists are expected to be full-stack and to be able to take projects end-to-end, but some of the systems and tools they are provided are either painful to use or immature.
  • No standardisation of the ML life cycle: In principle, most data science projects should follow a very similar life cycle. In practice, teams diverge at various stages of the ML life cycle and define their own approaches to solving the same problems, which leads to a lot of duplicated effort.
  • Difficult to get data science systems into production: The project life cycle for ML systems is typically on the order of months. The majority of this time is spent on engineering (infrastructure and integration), and very little on the data science or machine learning itself.
  • Hard to maintain data science systems once in production: Historically these systems have been built as proofs-of-concept (POCs) or minimum viable products (MVPs) in order to ascertain impact first. As a result, scaling to large numbers of model variants, environments, or markets becomes challenging. Because these systems are relatively brittle (especially data pipelines!), improvements are not made as frequently as they need to be.
  • Difficult to measure impact: Data science systems are typically optimisations and improvements on existing products. They require measurement systems in order to prove impact, and these typically come in the form of experimentation systems. The status quo is that most teams have built their own experimentation systems and have yet to standardise processes around measuring impact. In addition, these systems are often brittle and require many manual steps in order to run experiments.

The solution

Our vision of the Machine Learning Platform (ML Platform) is to empower data scientists, analysts, and other ML developers to create ML solutions that drive direct business impact. These solutions can range from simple analyses all the way to production ML systems that serve millions of customers. The ML Platform aims to provide these users with a unified set of tools they can use to rapidly develop and confidently deploy their ML solutions.

We achieve this with the following design principles:

  • Easy to compose ML solutions out of parts of the platform: New projects should be able to compose solutions out of existing products on the ML Platform, instead of building from scratch. With the infrastructure complexity abstracted away, the barrier to using ML to drive business impact is lowered, which allows a lightweight data science team, or even non-data scientists, to put ML to work.
  • Best practices are enforced and unified at each stage of the machine learning life cycle: Data scientists should have a clear understanding of all the stages of the ML life cycle, the tools that exist at each stage, and how to apply them to their use cases in a self-service manner with minimal support from engineers. This extends the capabilities of data scientists, who can now deploy intelligent systems into production quickly, run experiments with small slices of traffic confidently, and scale their systems to multiple environments, markets, and experiments easily.
  • Integration into the existing Gojek tech stack: The ML Platform is built with the existing Gojek tech stack in mind, and either abstracts away any integration points or makes these integrations easy. Data scientists should not have to be concerned with how their solutions will be consumed. Furthermore, the platform leverages many of the existing products and tools provided by other teams within Gojek.
  • Bottom-up innovation: The platform is built in a modular fashion, in layers from the ground up. Given the diversity of use cases and applications that need to be supported, it is necessary not only to support the “happy path”, but also to provide flexibility when edge cases arise.

Machine Learning Life Cycle

One way to categorise the capabilities of a machine learning platform is through the stages of the machine learning life cycle. The typical ML life cycle can be viewed through the following nine stages:

A data scientist starts by sourcing data, then explores and analyses it. The raw data is transformed into useful features, which typically involves scheduling and automation so that this happens on a regular basis. The resulting features are stored and managed, so that they are available to the various models and to other data scientists. As part of the exploration, the data scientist will also build, train, and evaluate various models. Promising models are stored and deployed into production. The production models are then served and monitored for a period of time. Typically, there are multiple competing models in production, and choosing between them or evaluating them is done via experimentation. Using what was learned from the production models, the data scientist iterates on new features and models.
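
To make the experimentation step more concrete, here is a minimal Python sketch of one common way to send a small, stable slice of traffic to a candidate model. It is an illustration only and not a description of how Gojek's platform does it: the function name assign_variant, its parameters, and the hash-based bucketing scheme are all assumptions made for this example.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float) -> str:
    """Deterministically assign a unit (e.g. a customer ID) to a variant.

    Hypothetical helper for illustration; not part of any Gojek API.
    The same (experiment, unit_id) pair always maps to the same variant,
    and roughly `treatment_share` of all units land in "treatment".
    """
    if not 0.0 <= treatment_share <= 1.0:
        raise ValueError("treatment_share must be between 0 and 1")

    # Hash the experiment name together with the unit ID so that
    # different experiments split the same population independently.
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).digest()

    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64

    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    # Route about 5% of customers to the candidate model.
    for customer in ("customer-1", "customer-2", "customer-3"):
        print(customer, assign_variant(customer, "pricing-model-v2", 0.05))
```

Because the assignment depends only on the IDs, it needs no stored state and stays consistent across requests, which keeps an experiment's populations stable while the competing models are compared.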

In subsequent articles, we will talk more about our solutions for the various stages of the ML life cycle:

  • Merlin: Model management, deployment, and serving
  • Clockwork: Scheduling and automation
  • Feast: Feature management, storage, and serving
  • Turing: Experimentation
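
The relationship between life cycle stages and the components listed above can be sketched as a simple lookup table. This is only a reading aid: the nine stage names below are our own split of the life cycle paragraph (an assumption, since the article does not name the stages individually), and the stage-to-component assignment is inferred from the one-line descriptions in the list.

```python
from enum import Enum
from typing import Dict, List, Optional


class Stage(Enum):
    # Stage names are an assumed reading of the life cycle description.
    SOURCE_DATA = "source data"
    EXPLORE_DATA = "explore and analyse data"
    TRANSFORM_FEATURES = "transform data into features"
    STORE_FEATURES = "store and manage features"
    TRAIN_MODELS = "build, train, and evaluate models"
    DEPLOY_MODELS = "store and deploy models"
    SERVE_MODELS = "serve models"
    MONITOR_MODELS = "monitor models"
    EXPERIMENT = "experiment"


# Mapping inferred from the component descriptions in the list above.
# Stages without an entry have no component named in this article.
COMPONENT_FOR_STAGE: Dict[Stage, str] = {
    Stage.TRANSFORM_FEATURES: "Clockwork",  # scheduling and automation
    Stage.STORE_FEATURES: "Feast",          # feature management, storage, serving
    Stage.DEPLOY_MODELS: "Merlin",          # model management and deployment
    Stage.SERVE_MODELS: "Merlin",           # model serving
    Stage.EXPERIMENT: "Turing",             # experimentation
}


def component_for(stage: Stage) -> Optional[str]:
    """Return the component covering a stage, or None if none is listed."""
    return COMPONENT_FOR_STAGE.get(stage)


def uncovered_stages() -> List[Stage]:
    """Return the stages for which this article names no component."""
    return [stage for stage in Stage if stage not in COMPONENT_FOR_STAGE]


if __name__ == "__main__":
    for stage in Stage:
        print(f"{stage.value}: {component_for(stage) or 'not covered here'}")
```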

Stay tuned for more updates about Gojek’s ML Platform, or sign up for our newsletter to have updates delivered straight to your inbox.

If you are excited by the idea of developing such tools for data scientists, please consider joining Gojek’s Data Science Platform team.

Contributions from:
Willem Pienaar, Luo Shushu.