Compliance As Code: How We Automate CIS Compliance For GCP


By Aseem Shrey

Compliance can mean different things for different organisations. It's usually the process of conforming to a specification, policy, standard, or law. But one thing is common: it's tedious and involves a lot of operational work. Engineers either look for hacks to get such tasks done or hope they just happen.

For example, if you're making your cloud infrastructure compliant with certain standards, you would follow the best practices created by the committee that came up with those standards. If you adhere to a widely acknowledged benchmark like the Center for Internet Security (CIS) benchmark, it's bound to improve your security posture by a few notches, because these benchmarks are written by recognised experts in the community and are reviewed and revised continuously.

A technical standard is an established norm or requirement for a repeatable technical task.

It is usually a formal document that establishes uniform engineering or technical criteria, methods, processes, and practices.

Some well-known standards are ISO/IEC 27001, NIST, and PCI DSS.

Cyber-Security Standards, Benchmarking & Best Practices Overview — European Union

A benchmark is a standard or point of reference against which things may be compared. For Android phones we have AnTuTu benchmark, for GPU benchmarking we have 3DMark, GFXBench and for mobile cameras we have DXOMARK.

Likewise, for cybersecurity we have CIS Benchmarks. The Center for Internet Security (CIS) is a community-driven nonprofit responsible for the CIS Controls® and CIS Benchmarks™, globally recognised best practices for securing IT systems and data.

There are 7 core categories of CIS Benchmarks:

  1. Operating systems benchmarks cover security configurations of core operating systems, such as Microsoft Windows, Linux, and Apple OSX.
  2. Server software benchmarks cover security configurations of widely used server software, including Microsoft Windows Server, SQL Server, VMware, Docker, and Kubernetes.
  3. Cloud provider benchmarks address security configurations for Amazon Web Services (AWS), Microsoft Azure, Google, IBM, and other popular public clouds.
  4. Mobile device benchmarks address mobile operating systems, including iOS and Android, and focus on areas such as developer options and settings, OS privacy configurations, browser settings, and app permissions.
  5. Network device benchmarks offer general and vendor-specific security configuration guidelines for network devices and applicable hardware from Cisco, Palo Alto Networks, Juniper, and others.
  6. Desktop software benchmarks cover security configurations for some of the most commonly used desktop software applications, including Microsoft Office and Exchange Server, Google Chrome, Mozilla Firefox, and Safari Browser. These benchmarks focus on email privacy and server settings, mobile device management, default browser settings, and third-party software blocking.
  7. Multi-function print device benchmarks outline security best practices for configuring multi-function printers in office settings and cover such topics as firmware updating, TCP/IP configurations, wireless access configuration, user management, and file sharing.

An example check from the CIS Benchmark for GCP:

Benchmark For Cloud Storage Bucket Access

There are 57 checks in the CIS 1.1 benchmark, categorised into Level 1 and Level 2 for the GCP platform.

Level 1 benchmark profiles cover base-level configurations that are easier to implement and have minimal impact on business functionality.
Level 2 benchmark profiles are intended for high-security environments and require more coordination and planning to implement with minimal business disruption.

Now that we have an idea of what compliance, benchmarks and CIS compliance are, let's talk about the problem at hand.

The Gojek Scale


At Gojek, our GCP footprint spans:

  1. More than 350 active projects excluding the sys- projects¹
  2. Firewall Rules > 4000
  3. Storage Buckets > 1000

All of these change constantly, and the changes come from multiple teams, since each project is owned by its team.

An Ideal State

Let’s see what an ideal state for GCP with respect to compliance would look like:

  1. No non-compliant resource
  2. Auto remediate any non-compliant resource
  3. Ability to whitelist resources
  4. Accountability for non-compliant resources: we should know the business justification behind each one and have a process to manage it
  5. Temporary whitelisting of resources
  6. Easy to maintain

But are we able to achieve all of this? Yes.

How are we able to do this?

There are multiple parts to the project:

  1. Checker — CloudFunctions
  2. Remediators — CloudFunctions
  3. Whitelist — whitelist.yaml
  4. Accountability — Gitlab/Any other version control system

Architecture of the system

Compliance-As-Code Architecture
  1. Cloud Scheduler: Used as a cron job to schedule messages sent to Pub/Sub
  2. Cloud Pub/Sub: Acts as the trigger for the cloud functions
  3. Cloud Functions: Execute the checker and remediator code for the different CIS checks
  4. Slack: Receives the status updates

First, Cloud Scheduler sends a message to Cloud Pub/Sub at a fixed time, like a cron job.

This message triggers a cloud function, which runs the check and the remediation.

After the check has finished running, the function posts a summary on Slack.
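
The flow above can be sketched as a Pub/Sub-triggered function. This is a minimal illustration, not the production code. The (event, context) arguments and the base64-encoded data field follow the Python background-function convention for Cloud Functions; the "check" field in the message and the run_check, remediate and post_to_slack callables are hypothetical stand-ins.

```python
import base64
import json


def decode_pubsub_message(event):
    """Return the JSON payload of a Pub/Sub event, or {} if it has no data."""
    raw = event.get("data")
    if not raw:
        return {}
    return json.loads(base64.b64decode(raw).decode("utf-8"))


def handle(event, context, run_check, remediate, post_to_slack):
    """Run one check end to end: find violations, remediate, report.

    run_check, remediate and post_to_slack are hypothetical callables,
    injected here so the sketch stays self-contained.
    """
    payload = decode_pubsub_message(event)
    check_name = payload.get("check", "unknown_check")

    remediated, failed = [], []
    for resource in run_check(payload):
        try:
            remediate(resource)
            remediated.append(resource)
        except Exception:
            # Keep going so one failed remediation does not stop the run.
            failed.append(resource)

    summary = {"check": check_name, "remediated": remediated, "failed": failed}
    post_to_slack(summary)
    return summary
```

Passing the three steps in as callables keeps the entry point identical across checks; only the check-specific functions would differ.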

Walkthrough of the code


Here's a benchmark that recommends restricting SSH access from the internet. It's a Level 2 benchmark.

CIS Benchmark to ensure restricted ssh access

Let me walk you through the code structure and some sample code of the cloud function check for the above benchmark.

Every check, together with its remediator, is a module and has its own folder.

Directory structure for one of these checks looks like this — it’s the same pattern repeated for all the checks.

Folder Structure for the check

main.py is where the whole magic happens.

main.py for excessively_open_firewall_monitoring

It creates a backup of the current config, takes into account the whitelist.yaml and then goes on to make changes.

It’s the checker and remediator packed into one.
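
To make the checker half concrete, here is a minimal sketch of how such a check might decide that a firewall rule exposes SSH to the internet. It assumes rules are plain dictionaries shaped like the Compute Engine firewall resource (name, direction, disabled, sourceRanges, allowed). The function names are hypothetical, and the backup and remediation steps are left out.

```python
def _covers_port(port_spec, port):
    """True if a port spec such as "22" or "20-25" includes the given port."""
    low, _, high = port_spec.partition("-")
    return int(low) <= port <= int(high or low)


def allows_ssh_from_internet(rule):
    """Check a single firewall rule for SSH (tcp/22) open to 0.0.0.0/0."""
    if rule.get("disabled") or rule.get("direction", "INGRESS") != "INGRESS":
        return False
    if "0.0.0.0/0" not in rule.get("sourceRanges", []):
        return False
    for allowed in rule.get("allowed", []):
        protocol = allowed.get("IPProtocol")
        ports = allowed.get("ports")
        if protocol == "all":
            return True
        # A tcp entry without a ports list allows every port.
        if protocol == "tcp" and (not ports or any(_covers_port(p, 22) for p in ports)):
            return True
    return False


def find_violations(rules, whitelisted_names):
    """Return names of rules that expose SSH and are not whitelisted."""
    return [
        rule["name"]
        for rule in rules
        if allows_ssh_from_internet(rule) and rule["name"] not in whitelisted_names
    ]
```

Handling port ranges matters here: a rule that allows tcp ports 20-25 opens SSH just as much as one that lists 22 explicitly.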

config.py — Contains config specific to the check

config.py

The following config values are common in each of these checks :

  1. Check Metadata — Some info about the check itself
  2. Backup bucket name
  3. Backup filename
  4. Google API Scope for the specific cloudfunction to work

Apart from these four values, it contains check-specific configuration as well.
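
As an illustration only, a config.py with these four common values plus a couple of check-specific ones might look like the sketch below. Every variable name and value is a hypothetical placeholder, apart from the check name, which is taken from the caption above.

```python
# Hypothetical config.py sketch; names and values are placeholders.

# 1. Check metadata: some info about the check itself
CHECK_METADATA = {
    "name": "excessively_open_firewall_monitoring",
    "description": "Ensure that SSH access is restricted from the internet",
    "level": 2,
}

# 2. and 3. Where the backup of the current config is stored
BACKUP_BUCKET_NAME = "example-cac-backups"
BACKUP_FILENAME = "firewall_rules_backup.json"

# 4. Google API scope the cloud function needs
API_SCOPES = ["https://www.googleapis.com/auth/compute"]

# Check-specific configuration
RESTRICTED_PORT = 22
OPEN_SOURCE_RANGE = "0.0.0.0/0"
```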

whitelist.yaml — The whitelist file

whitelist.yaml for get_metadata_ssh_keys cloudfunction
<project_name>:
  <key_name>:
    MR-Link: https://<gitlab_instance_url>/security/cis-benchmark-work/-/merge_requests/1
    business-justification: For the test MR
    data: prod
    username: aseem.shrey
    owner: <user_email>
    type: development
    value: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIL+hIwK2q8/NtDuvzeOZ330JUPMFPYd2iKSzZx1R5zOc aseem.shrey
    expiry: <till_when_the_whitelist_is_valid>

  • project_name: The name of the GCP project for which the whitelist entry is being added
  • key_name: Key for this resource. Multiple resources whitelisted in the same project are added under different key_name values
  • MR-Link: Link to the MR that added the entry. To be consumed later in a dashboard to find out which MR is responsible for a whitelist entry
  • business-justification: Business justification for adding the resource to the whitelist
  • data: The kind of data this resource will be able to access once whitelisted. In this case, adding the ssh-key to the project would give the user access to production data
  • owner: Who is accountable for this whitelist entry
  • username & value: These keys are specific to this function, as ssh requires a username and an ssh-key (which is the value here)
  • expiry: Useful for temporary whitelisting. The value can be an epoch timestamp or never
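
The expiry field is what makes temporary whitelisting possible. Here is a minimal sketch of how a check might honour it, assuming the YAML file has already been loaded into nested dictionaries. The function names are hypothetical, and treating an entry without an expiry as inactive is an assumption of this sketch, not something the whitelist format specifies.

```python
import time


def is_entry_active(entry, now=None):
    """True if a whitelist entry has not expired.

    expiry is either the string "never" or an epoch timestamp in seconds.
    """
    expiry = entry.get("expiry")
    if expiry is None:
        return False  # assumption: entries without an expiry are ignored
    if str(expiry).strip().lower() == "never":
        return True
    now = time.time() if now is None else now
    return float(expiry) > now


def active_keys(whitelist, project_name, now=None):
    """Return the key_names in a project whose whitelist entry is still valid."""
    entries = whitelist.get(project_name) or {}
    return sorted(key for key, entry in entries.items() if is_entry_active(entry, now))
```

Once an entry's timestamp has passed, the resource is treated like any other non-compliant resource on the next scheduled run.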

Accountability

  • Maintained through a version control system, GitLab in our case. Every whitelisted rule is attributed to the person who raised the MR (merge request).
  • The MR also requires approval from that person's manager.

The following is an MR to whitelist one of the resources (here, a firewall rule) so that the SSH port can be opened to the world.

Opening a firewall port to the world

Maintenance

The whole codebase is deployed as cloud functions, and deployment is automated through gitlab-ci.yml, the configuration for GitLab's CI automation, which is similar to GitHub Actions.

Every time there's a change in one of these checks (like adding a resource to whitelist.yaml), only that particular cloud function is redeployed.
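
One way to picture the "redeploy only what changed" rule is to map the files touched by a commit to their check folders. The sketch below assumes each check lives in its own top-level folder; the function name is hypothetical. GitLab CI can express the same idea declaratively with rules:changes, and the pipeline described here may well do that instead.

```python
from pathlib import PurePosixPath


def functions_to_redeploy(changed_paths, check_folders):
    """Return the check folders touched by a set of changed file paths."""
    touched = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Only the first path component identifies the check.
        if parts and parts[0] in check_folders:
            touched.add(parts[0])
    return sorted(touched)
```

For example, a commit that only edits get_metadata_ssh_keys/whitelist.yaml would select just the get_metadata_ssh_keys function.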

Different cloudfunctions for different checks

Slack Notifications

After every run, the functions send a status update to Slack.
The first line mentions which CIS check the Slack alert is for.
Remediated: The projects where auto-remediation resolved the pending issue

A remediated Compliance As Code ( CaC ) Bot Slack Alert

A green visual indicator further shows that the operation was successful.

Failed: The projects where auto-remediation failed to resolve the issue

A red visual indicator showing which projects the bot failed to remediate
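
A message with this shape can be sketched as a Slack incoming-webhook payload. This is a minimal illustration under assumptions: it uses Slack's attachment format, where the color values "good" and "danger" render as green and red bars, and the function name is hypothetical. The actual bot may build its messages differently.

```python
def build_slack_payload(check_title, remediated, failed):
    """Build a webhook payload summarising one run of a check."""
    attachments = []
    if remediated:
        attachments.append(
            {"color": "good", "title": "Remediated", "text": "\n".join(remediated)}
        )
    if failed:
        attachments.append(
            {"color": "danger", "title": "Failed", "text": "\n".join(failed)}
        )
    # The first line names the CIS check the alert is for.
    return {"text": check_title, "attachments": attachments}
```

Sending it is then a single HTTP POST of the JSON-encoded payload to the webhook URL.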

Further Improvements

  1. Overall dashboard of the current state of compliance of our cloud — Push data to ELK

Ending notes

  1. Projects with a project_id starting with sys- are created by default by GCP for Apps Script projects. Full doc here.
    When a new Apps Script project is created, a default GCP project is also created behind the scenes.
    Also check : The Quirks of Apps Script and Google Cloud

References

  1. https://www.cisecurity.org/cis-benchmarks/
  2. https://ec.europa.eu/research/participants/documents/downloadPublic?documentIds=080166e5bab62342&appId=PPGMS
  3. https://www.ibm.com/cloud/learn/cis-benchmarks#toc-how-are-ci-rBnWtt0u
