How Gojek Allocates Personalised Vouchers At Scale
Authors: Praveen Prashant, Kelvin Heng, and Deepesh Naini
How we built an ML-driven voucher allocation engine to serve millions of customers across multiple geographies.
The Idea 💡
How can we use different vouchers to get more business from our customer base while keeping our costs low?
Gojek uses vouchers to achieve multiple business objectives.
For example, the objective in a given week can be to maximise food orders in Indonesia, while in another week it could be to maximise the resurrection of churned users in Singapore.
An ML Driven Multi-Objective Solution 💻 📊
Identifying the Persuadables
Based on how customers respond to being targeted with a voucher, they can be segmented into four groups:
- The Persuadables: customers who make more transactions than they would if they were not targeted with the voucher ✅
- The Sure Things: customers who make the same number of transactions whether targeted or not (i.e. zero incremental response) 😐
- The Lost Causes: customers who do not make transactions irrespective of whether or not they are targeted (i.e. zero incremental response) 💀
- The Do Not Disturbs: customers who make fewer transactions because they were targeted 🔻
We use customers' historical data to observe the past effects of vouchers on them. This is a typical causal inference problem: measuring the effect of a treatment (a voucher) on our customers, for example in terms of incremental transactions.
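To make this concrete, here is a minimal, illustrative sketch (not our production pipeline) of measuring the average incremental effect of a voucher from a past campaign; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical export of a past campaign: one row per customer, with a flag
# for whether they were targeted and their transaction count for the period.
history = pd.read_csv("past_campaign.csv")

# Average transactions for targeted (1) vs. non-targeted (0) customers
avg_by_group = history.groupby("treated")["transactions"].mean()

# Naive average treatment effect: incremental transactions per targeted customer.
# In practice the effect is estimated per customer (a CATE), because the four
# segments above respond very differently to the same voucher.
ate = avg_by_group[1] - avg_by_group[0]
print(f"Incremental transactions per targeted customer: {ate:.3f}")
```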
However, the complexity increases because there are two response variables: the predicted uplift in the business objective and the predicted cost. So we use a deep-learning based causal inference algorithm that produces both predictions simultaneously for all customers, given a voucher. The "objective" in the problem formulation depends on the business use case (hence, multi-objective).
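As a rough illustration of the "two predictions at once" idea, the sketch below builds a small Keras network with a shared trunk and two output heads, one for uplift and one for cost. It is only a toy, not our actual architecture: the feature count, layer sizes, and losses are placeholders, and constructing proper causal targets is deliberately omitted.

```python
import tensorflow as tf

n_features = 2000  # illustrative; the post mentions thousands of features

inputs = tf.keras.Input(shape=(n_features,))
shared = tf.keras.layers.Dense(256, activation="relu")(inputs)
shared = tf.keras.layers.Dense(128, activation="relu")(shared)

# Head 1: predicted uplift in the business objective (e.g. incremental orders)
uplift_head = tf.keras.layers.Dense(1, name="uplift")(shared)
# Head 2: predicted cost of giving this customer the voucher
cost_head = tf.keras.layers.Dense(1, name="cost")(shared)

model = tf.keras.Model(inputs=inputs, outputs=[uplift_head, cost_head])
model.compile(optimizer="adam", loss={"uplift": "mse", "cost": "mse"})
# model.fit(X, {"uplift": y_uplift, "cost": y_cost}) on suitably constructed
# causal targets; building those targets is the hard part and is omitted here.
```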
These predictions are then fed into a knapsack optimiser that recommends a treatment (voucher) for each customer so as to maximise the business objective while adhering to the budget constraint. We chose a simple knapsack optimiser because it is fast to run, which matters when there are millions of predictions.
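The snippet below is a minimal greedy sketch of this allocation step, assuming one voucher per customer and a single global budget; it uses a simple uplift-per-cost ranking heuristic for illustration and is not the exact optimiser we run in production.

```python
from collections import namedtuple

Candidate = namedtuple("Candidate", "customer voucher uplift cost")

def allocate(candidates, budget):
    # Rank by "bang per buck": predicted uplift per unit of predicted cost.
    ranked = sorted(candidates, key=lambda c: c.uplift / max(c.cost, 1e-9), reverse=True)
    chosen, spent, seen = [], 0.0, set()
    for c in ranked:
        if c.customer in seen or c.uplift <= 0:
            continue  # at most one voucher per customer; skip customers it would not help
        if spent + c.cost <= budget:
            chosen.append(c)
            spent += c.cost
            seen.add(c.customer)
    return chosen

# Hypothetical predictions for two customers and two vouchers
preds = [
    Candidate("A", "10%_off", uplift=3.0, cost=2.0),
    Candidate("A", "free_delivery", uplift=2.5, cost=1.0),
    Candidate("B", "10%_off", uplift=0.5, cost=2.0),
]
print(allocate(preds, budget=2.0))  # -> customer A gets free_delivery
```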
Building At Scale 📈
Data Transformation: How we leverage dbt
We use thousands of features per customer (and there are hundreds of millions of customers) to predict their transactional behaviour. Extracting these features depends on hundreds of source tables.
To do all this scalably, we use data build tool (dbt) extensively for efficient data transformation. dbt is an open-source command line tool that helps analysts and engineers transform data in their warehouse more effectively.
dbt helps us proactively monitor upstream source tables for staleness, so stale data can be caught before it is wrongly ingested into downstream tables.
dbt and its packages (special mention to dbt-expectations) provide tonnes of easy built-in tests to ensure data reliability in our feature tables (for example, checking for nulls in the data). dbt also provides easy support for customised tests.
Other dbt packages, like dbt-date, are packed with awesome features, for example date manipulations, fuzzy text matching, etc. dbt 1.3 and above also supports Python models for advanced data transformation!
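For instance, a dbt Python model is just a function that returns a DataFrame. The sketch below is illustrative (the model and column names are made up) and assumes a PySpark-backed adapter; the exact DataFrame API depends on which warehouse adapter you run.

```python
# models/customer_order_features.py -- a minimal dbt Python model (dbt >= 1.3)
import pyspark.sql.functions as F

def model(dbt, session):
    # Configure materialisation the same way a SQL model would via config()
    dbt.config(materialized="table")

    # dbt.ref() works like {{ ref(...) }} in SQL models and returns a DataFrame
    orders = dbt.ref("stg_orders")

    # Simple per-customer aggregation used as model features
    features = (
        orders.groupBy("customer_id")
              .agg(
                  F.count("order_id").alias("order_count"),
                  F.max("order_date").alias("last_order_date"),
              )
    )
    return features  # dbt materialises the returned DataFrame as a table
```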
Additionally, dbt is a Jinja-based tool, which means code reusability and standardisation, simplifying code development and maintenance considerably (see dbt macros).
Data Observability: How we leverage elementary
We are power users of dbt and elementary. With elementary, it is extremely easy to stay on top of any data anomalies or test failures (for example, elementary can notify users of an unexpected increase in the percentage of null rows).
Elementary reports these anomalies and test failures via Slack notifications, and it also provides a one-stop dashboard for model runtimes and detailed errors.
On top of this, we use CI/CD to ensure proper linting, test coverage, and dry-run testing of any newly introduced SQL code.
Configuration Management: How we leverage Hydra
As a team, we manage voucher allocations for a multitude of on-demand services across multiple geographies. Hence, it is essential to manage customised hyperparameter configurations for these ML models in a scalable way. We use Hydra to do this with ease.
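A minimal sketch of the pattern, assuming a hypothetical conf/config.yaml with per-geography model settings; the config keys and values below are illustrative, not our actual hyperparameters.

```python
# conf/config.yaml (illustrative):
#   country: indonesia
#   model:
#     learning_rate: 0.001
#     hidden_units: 128
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="conf", config_name="config", version_base=None)
def train(cfg: DictConfig) -> None:
    # Print the fully composed config for this run
    print(OmegaConf.to_yaml(cfg))
    # e.g. build_model(lr=cfg.model.learning_rate, units=cfg.model.hidden_units)

if __name__ == "__main__":
    # CLI overrides make per-geography, per-service runs easy, e.g.:
    #   python train.py country=singapore model.learning_rate=0.01
    train()
```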
Scalable and Better Code
Other practices that the data science team follows to write scalable code are:
- Enforce code testing with pytest (a small example follows this list)
- Track test coverage
- Pre-commit hooks for code style and formatting
- Automated testing with GitLab CI
- CI/CD using GitLab pipelines for efficient collaboration and smooth releases
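As an example, here is a tiny pytest test in the style we enforce; the helper being tested is hypothetical, not a function from our codebase.

```python
# tests/test_budget.py -- a tiny, illustrative pytest example
import pytest

def remaining_budget(budget: float, spent: float) -> float:
    """Hypothetical helper: budget left after spending on vouchers."""
    if spent > budget:
        raise ValueError("overspent the voucher budget")
    return budget - spent

def test_remaining_budget():
    assert remaining_budget(100.0, 40.0) == pytest.approx(60.0)

def test_overspend_raises():
    with pytest.raises(ValueError):
        remaining_budget(100.0, 150.0)
```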
In-house Platforms at Gojek 👨🚒
The data science team at Gojek is also fortunate to have access to some world-class in-house tools. Some notable mentions are:
- Merlin: This machine learning platform truly makes ML model deployments magical! ✨
- Campaign Portal: This behemoth engineering platform developed by our extremely skilled engineering team allocates vouchers to millions of customers in minutes! ⏰
Acronyms & Definitions:
- ML: Machine Learning
- Causal Inference: The process of determining whether an observed association truly reflects a cause-and-effect relationship
- Knapsack Optimiser: A solver for the knapsack problem, a combinatorial optimisation problem of selecting items to maximise total value under a capacity (here, budget) constraint
- CI/CD: Continuous integration and continuous delivery/deployment
- SQL: Structured Query Language
References:
- https://www.getdbt.com/product/what-is-dbt
- https://www.elementary-data.com/
- https://pypi.org/project/dbt-dry-run/
- https://dbt-labs.github.io/dbt-project-evaluator/0.8/
- https://hydra.cc/docs/intro/
- https://docs.pytest.org/en/8.2.x/
- https://pypi.org/project/coverage/
- https://pre-commit.com/
- https://www.gojek.io/blog/merlin-making-ml-model-deployments-magical
- https://www.bradyneal.com/causal-inference-course