Olympus: Terraforming repeatable and extensible infrastructure at GO-JEK


By Ravi Suhag

GO-JEK has grown, in almost no time, into a community of more than one million drivers handling more than 3 million orders every day. To support this growth, hundreds of microservices run and communicate across multiple data centers to give our customers the best experience.

In this post, we’ll talk about our approach to Infrastructure As Code (IAC), which simplifies the maintenance of our increasingly complex microservices architecture.

Motivation

Building infrastructure is, without a doubt, a complex problem that evolves over time. Maintainability, scalability, observability, fault-tolerance, and performance are some of the aspects that demand improvement over and over again.

One of the reasons it is so complex is the need for high availability. Most components are deployed as clusters, with hundreds of microservices and thousands of machines running. As a result, no one knew what the managed infrastructure looked like, how the running machines were configured, what changes had been made, or how networks were connected. In a nutshell, we lacked observability into our infrastructure, and when there was a failure, it was hard to tell what could’ve brought the system down.

Goals

We had been using Terraform for our IAC in bits and pieces for a while, but we lacked structure and consistency. Different teams had different repositories. Modules were scattered all over the place or buried inside the projects themselves. They were complex, and there were lots and lots of bash scripts.

Creating and maintaining infrastructure manually was challenging and error-prone. We needed to stop updating our infrastructure by hand and embrace Infrastructure As Code.

Infrastructure As Code allows you to take advantage of all software development best practices. You can safely and predictably create, change, and improve infrastructure. Every network, every server, every database, every open network port can be written in code, committed to version control, peer-reviewed, and then updated as often as necessary.

Project Olympus is the GO-JEK Infrastructure Engineering team’s initiative to solve these problems and achieve the following goals.

Declarative style infrastructure

You write code that specifies your desired end state, and the IAC tool itself is responsible for figuring out how to achieve that state. With declarative style, the code always represents the latest state of your infrastructure. At a glance, you can tell what’s currently deployed and configured without worrying about the past state.
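As a minimal sketch of what declarative style looks like in Terraform, the block below describes a network that should exist rather than the steps to create it. The resource label and network name are illustrative assumptions, not taken from the Olympus codebase.

```hcl
# Desired end state: one VPC network with no auto-created subnets.
# "godata_core" and "godata-core" are illustrative names only.
resource "google_compute_network" "godata_core" {
  name                    = "godata-core"
  auto_create_subnetworks = false
}
```

Once the deployed network matches this description, running terraform plan again should report no changes, because the code and the real infrastructure already agree.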

Structure and consistency

A consistent workflow and code structure lets developers provision resources on any provider, and a central module registry lets them publish and discover reusable modules for provisioning the infrastructure of their choice.

Self-serve and On-demand infrastructure

A self-service model allows developers to pick the infrastructure best suited to run their application and provision it on-demand in a predictable and consistent way.

Safety and security

For security, safety, and human-error reasons, the effects of any IAC operation should be limited to a specific environment or component of the infrastructure.

Observability

With many moving pieces and large-scale infrastructure comes a vital need for both ops teams and service owners to maintain observability of running applications and key infrastructure.

Architecture

Figure: Olympus architecture

Module Registry

Modules in Terraform are self-contained packages of Terraform configurations used to create reusable components. Currently, we have more than 30 core base modules and a few abstracted modules built by composing base modules. For example, the gcloud-kubernetes-foundation module comprises four base modules that set up monitoring, deployment, and other core Kubernetes services.

# Base module; its nginx_ingress_ip output is consumed by Prometheus below
module "kubernetes_base" {
  source = "<path>/terraform/gcloud-kubernetes-base?ref=v1.0.1"
}

# Telegraf metrics agent
module "kubernetes_telegraf" {
  source       = "<path>/terraform/gcloud-kubernetes-telegraf?ref=v1.1.1"
  cluster_name = "${var.cluster_name}"
  teams        = "${var.teams}"
}

# Prometheus monitoring, wired to the base module's ingress IP
module "kubernetes_prometheus" {
  source           = "<path>/terraform/gcloud-kubernetes-prometheus?ref=v1.0"
  project_name     = "${var.project_name}"
  cluster_name     = "${var.cluster_name}"
  nginx_ingress_ip = "${module.kubernetes_base.nginx_ingress_ip}"
}

module "kubernetes_whiterabbit" {
  source = "<path>/terraform/gcloud-kubernetes-whiterabbit?ref=v1.0"
}
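
From a consuming project's point of view, the composition above collapses into a single module call. The sketch below assumes the foundation module's inputs are exactly the variables referenced above (project_name, cluster_name, teams). The <version> placeholder stands for a tag this post does not specify, and <path> is the same placeholder used above.

```hcl
# Hypothetical caller of the composed module. Inputs are inferred from the
# variables used inside the composition and may not match the real module.
module "kubernetes_foundation" {
  source = "<path>/terraform/gcloud-kubernetes-foundation?ref=<version>"

  project_name = "${var.project_name}"
  cluster_name = "${var.cluster_name}"
  teams        = "${var.teams}"
}
```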

Terraform allows you to load private modules directly from Git version control. Git repository URLs support the ref query parameter, which can be used to check out any branch, tag, or commit. Using ref makes it effortless to lock the module versions in your IAC project.

module "godata_core_vpc" {
  source        = "<path>/terraform/gcloud-vpc.git?ref=v2.0.0"
  project_name  = "${var.project_name}"
  vpc_name      = "${var.landscape_name}"
  subnet_region = "${var.subnet_region}"
  gana_endpoint = "${var.gana_endpoint}"
}
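
To illustrate the three kinds of ref side by side, here is a hedged sketch that reuses the <path> placeholder from the example above. The module labels, the branch name, and the <commit-sha> placeholder are assumptions for illustration, and the module inputs are omitted for brevity.

```hcl
# Pin to a tag: the same code is fetched on every run.
module "vpc_by_tag" {
  source = "<path>/terraform/gcloud-vpc.git?ref=v2.0.0"
}

# Track a branch: picks up whatever the branch points to at init time.
module "vpc_by_branch" {
  source = "<path>/terraform/gcloud-vpc.git?ref=master"
}

# Pin to an exact commit.
module "vpc_by_commit" {
  source = "<path>/terraform/gcloud-vpc.git?ref=<commit-sha>"
}
```

Tags and commit hashes keep a project locked to a known module version, while a branch ref trades that repeatability for always getting the latest changes.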

We host our modules in a GitLab group, which serves as a central hub for all Terraform modules. The Infrastructure Engineering team maintains all base modules. Having a central module registry also allows us to enforce governance, compliance, and security against the modules made available to teams.

Address allocation

GANA (GO-JEK Assigned Numbers Authority) is responsible for coordinating address allocation to resources such as VPCs, gateways, and clusters. GANA is our internal HTTP REST service with a custom-written Terraform provider.

GANA provides a Terraform resource gana_allocate_subnet to allow allocating a subnet range for any resource type.

resource "gana_allocate_subnet" "subnet_allocate" {
  project      = "${var.project_name}"
  category     = "vpc"
  network_name = "${var.subnet_name}"
  endpoint     = "${var.gana_endpoint}"
}

GANA also provides a Terraform data source that lets other projects read the subnet ranges of already allocated resources.

data "gana_subnet" "godata_core" {
  project      = "${var.project_name}"
  category     = "vpc"
  network_name = "godata-core"
  endpoint     = "${var.gana_endpoint}"
}
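
A value read this way can then feed other resources. The sketch below is only an illustration: the post does not show which attributes gana_subnet exports, so the attribute name cidr is hypothetical, as are the firewall rule and the vpc_name variable.

```hcl
# Assumed input: the network this rule is attached to.
variable "vpc_name" {}

# Hypothetical: "cidr" stands in for whatever attribute the gana_subnet
# data source actually exports; the post does not name it.
resource "google_compute_firewall" "allow_from_godata_core" {
  name    = "allow-from-godata-core"
  network = "${var.vpc_name}"

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }

  # Only admit traffic from the range GANA allocated to godata-core.
  source_ranges = ["${data.gana_subnet.godata_core.cidr}"]
}
```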

GANA allows us to:

  • Prevent IP ranges from clashing across different data centers and cloud projects.
  • Provide a central place for accessing allocated resource information for use across different projects.

Code structure

Structuring the code in Terraform is important because it determines which files Terraform has access to when Terraform is executed.

Olympus represents the entire GO-JEK live cloud infrastructure as IAC. Each repository in the Olympus GitLab group represents one cloud project, owned by its respective team.

Olympus
├── ProjectA
├── ProjectB
├── ProjectC
├── ProjectD
├── ProjectE
└── ProjectF

Having each project as a separate repository allows us to

  • Version control each project’s infrastructure separately. This, in turn, gives us infrastructure versioning: for example, goviet-staging-1.2.3 maps to systems-1.2.3.
  • Fall back to an older version of the infrastructure, since this mapping always gives us a working state of the last stable infrastructure.

The code structure within each project is layered: each layer represents one section of the infrastructure and groups similar components together.

ProjectA
├── networks
│   ├── network_a
│   ├── network_b
│   └── network_c
├── clusters
│   ├── cluster_a
│   ├── cluster_b
│   └── cluster_c
├── connectivity
│   ├── cgnats
│   │   ├── cgnat_a
│   │   ├── cgnat_b
│   │   └── cgnat_c
│   ├── gateways
│   │   ├── gateway_a
│   │   ├── gateway_b
│   │   └── gateway_c
│   └── tunnels
│       ├── tunnel_a
│       ├── tunnel_b
│       └── tunnel_c
├── CHANGELOG.md
├── .gitlab-ci.yml
└── README.md

An example folder structure for a component looks like this:

component
├── backend.tf
├── data.tf
├── input.tf
├── main.tf
├── provider.tf
├── output.tf
└── README.md
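
To make the split concrete, here is a hedged sketch of what a few of these files might contain for a small network component. The file names come from the layout above. The contents are assumptions for illustration and are shown in one listing for brevity.

```hcl
# input.tf: variables the component accepts
variable "project_name" {
  description = "Cloud project that owns this component"
}

variable "subnet_region" {
  description = "Region the component is deployed to"
}

# provider.tf: provider configuration scoped to this component
provider "google" {
  project = "${var.project_name}"
  region  = "${var.subnet_region}"
}

# main.tf: the resources or module calls that make up the component
# (illustrative resource, not from the Olympus codebase)
resource "google_compute_network" "network_a" {
  name                    = "network-a"
  auto_create_subnetworks = false
}

# output.tf: values other components may need
output "network_name" {
  value = "${google_compute_network.network_a.name}"
}
```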

Blast radius

Terraform can “feel scary” since it’s easy to destroy infrastructure with a single command, terraform destroy, which makes basic state management very important. One key rule is to use remote state with a remote backend. Terraform stores the state of the infrastructure in a JSON file, and it is recommended to store that file on an external backend such as Amazon S3 or a Google Cloud Storage bucket.
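
A minimal sketch of a remote backend on Google Cloud Storage follows. The bucket name and prefix are hypothetical. Backend blocks cannot use variables, so the values are written as literals.

```hcl
terraform {
  backend "gcs" {
    bucket = "example-terraform-state"      # hypothetical bucket name
    prefix = "projecta/networks/network_a"  # hypothetical path to this state
  }
}
```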

We utilize our Terraform code structure to control the blast radius of our IAC state. To limit the scope of terraform destroy, our approach breaks the state down into smaller components, such as using different states for different projects, environments, and components.
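
One way to sketch this split, assuming a GCS backend and hypothetical bucket and prefix names: each component keeps its own state under its own prefix, and reads what it needs from a neighbouring component through a read-only terraform_remote_state data source. The .outputs accessor mentioned in the comment assumes Terraform 0.12 or later.

```hcl
# clusters/cluster_a/backend.tf: this component's own, separate state.
# A terraform destroy run here can only touch what this state tracks.
terraform {
  backend "gcs" {
    bucket = "example-terraform-state"
    prefix = "projecta/clusters/cluster_a"
  }
}

# clusters/cluster_a/data.tf: read-only view of the network component's state.
data "terraform_remote_state" "network_a" {
  backend = "gcs"

  config = {
    bucket = "example-terraform-state"
    prefix = "projecta/networks/network_a"
  }
}

# In Terraform 0.12+, the other component's outputs are available as
# data.terraform_remote_state.network_a.outputs.<output_name>
```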

Manager

The manager provides scaffolding as a foundation to jump-start your development. Users can run commands to scaffold complete projects or useful parts. We use Proctor as our IAC scaffold tool.

Conclusion

Olympus allowed us to cut the time to provision new infrastructure from weeks to minutes. Breaking our infrastructure down into modules allowed operations and infra teams to be more efficient and more organized, and thus to provide more business value. It made adopting Infrastructure As Code at GO-JEK less risky and less complex.

Olympus helped us set up the foundational organizational shift to say, “Here’s a wiki to tell you how to provision it yourself. Don’t file a ticket, don’t wait for the Infra engineering team”.

If you like what you’re reading and are interested in building large-scale infrastructure that excites you, do check out our engineering openings at gojek.jobs. As always, I would love to hear what you guys think.
