By Sourabh Ghorpade
If you have a write-heavy table in your Postgres database and you’re using the default autovacuum settings, writes may consume Transaction IDs faster than autovacuum can reclaim them. This can eventually force a database shutdown. To avoid it, consider tuning these settings using PG Tune.
Why would the database shut down?
Following is a high-level overview of the ‘why’:
- Postgres uses an MVCC mechanism to handle concurrency, so reads and writes do not block each other. To manage row visibility across transactions, it assigns a unique Transaction ID (XID) to each transaction performing a write. This XID is stored on the row when the transaction commits. A query reading from this table can only see rows whose XID is older than the query’s own transaction ID.
- Transaction IDs use modulo-2³² arithmetic and are compared circularly: of the roughly 4 billion possible IDs, each ID has about 2 billion IDs “before” it and 2 billion “after” it. If the XID counter wraps around, older rows suddenly appear to be in the future and become invisible to all new queries. This amounts to immediate data loss.
- To prevent this from happening, Postgres has a background process called autovacuum. Autovacuum assigns a special XID (called the FrozenXID) to all sufficiently old rows. Rows carrying the FrozenXID are always treated as in the past, regardless of wraparound. The age of the oldest unfrozen XID is the “age of unfrozen rows”; if it reaches 2 billion, a wraparound occurs.
- Normally, autovacuum freezes rows automatically in the background. However, if it cannot catch up with the writes, the consumed Transaction IDs can creep perilously close to the 2 billion mark. Postgres preempts this scenario by shutting down for maintenance once fewer than 1 million Transaction IDs remain. No writes are possible during this window, and the maintenance vacuum can take an unpredictable amount of time, effectively causing a major unplanned outage.
- The Sentry blog, the Heroku blog and the Postgres docs cover this in detail, and we recommend going through them as well.
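You can check how close a running instance is to this limit yourself. A minimal sketch, using only the standard Postgres catalogs:

```sql
-- age(datfrozenxid) = XIDs consumed since the database's oldest
-- unfrozen transaction; trouble starts as this approaches 2 billion
SELECT datname,
       age(datfrozenxid) AS xid_age,
       2^31 - age(datfrozenxid) AS xids_remaining
FROM pg_database
ORDER BY xid_age DESC;
```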
With that background in place, let us see what happened to us.
What happened to us?
GO-JEK is the market leader in Indonesia across multiple sectors including Ride Hailing and Food. We do more than 100 million bookings a month. A majority of these bookings are created and managed in a single Order Management System and stored in a table called bookings. These bookings are updated multiple times with actions like pickup, complete, etc. As a result, millions of transaction IDs are consumed each day. With the default settings, the autovacuum process was able to reclaim only 70% of them and took up to 11 days to complete. This reclamation is committed only at the end of the process. As a result, the count of unfrozen rows reached more than a billion, with only a few weeks left until the database shut down.
What did we do?
We had the following beefy configuration on the database VM:
- Memory: 64 GB
- CPUs: 16
- Disk: 3 TB, SSD
Despite that, the performance of the autovacuum was poor, which indicated that the autovacuum itself was under-provisioned. After some digging around, we decided to upgrade the following configurations:
- `autovacuum_work_mem = '2GB'`: the amount of memory the autovacuum process can use. Higher values allow it to load more rows in memory and hence process them faster. It is not set by default and falls back to maintenance_work_mem, which is 128 MB by default.
- `autovacuum_vacuum_cost_limit = '2000'`: to prevent excessive load on the DB due to the autovacuum, Postgres imposes an I/O quota on it. Every read/write depletes this quota, and once it is exhausted the autovacuum sleeps for a fixed time. This setting raises the quota limit (default 200), increasing the amount of I/O the vacuum can do. More details in references #7.
- `maintenance_work_mem = '1GB'`: similar to autovacuum_work_mem; the memory a maintenance process can use. DDL commands and indexing use this setting. Autovacuum falls back to it when autovacuum_work_mem is -1. Given that we set autovacuum_work_mem, it does not matter here, but it is useful to have. The default is 128 MB.
- `work_mem = '16MB'`: enables sort queries to load more rows into memory, increasing their speed. This is unrelated to vacuum.
These changes only need a Postgres reload, but any already-running autovacuum process has to be manually restarted to pick up the new configuration. There are more settings one could tune, but we limited ourselves to the above. Why? Because we wanted to limit changes, lest we impact the main reads/writes.
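Applying the changes might look like the following sketch (assuming superuser access; `ALTER SYSTEM` needs Postgres 9.4+, and the `query LIKE 'autovacuum:%'` filter relies on how autovacuum workers report themselves in `pg_stat_activity`):

```sql
ALTER SYSTEM SET autovacuum_work_mem = '2GB';
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;
ALTER SYSTEM SET maintenance_work_mem = '1GB';
ALTER SYSTEM SET work_mem = '16MB';

-- pick up the new values without a restart
SELECT pg_reload_conf();

-- terminate running autovacuum workers; the launcher respawns them
-- shortly, and the new workers use the updated settings
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';
```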
We did the above changes, and as a result, the autovacuum went from processing only 70% of daily consumption of IDs (in 11 days) to processing…
1 billion IDs in only 2 days.
That was a massive performance improvement and saved our DB. The autovacuum finally realised its full power!
Soon, we plan to upgrade the following configs:
- `autovacuum_max_workers = 5`: the number of workers that autovacuum has. It is currently 3 and requires a DB restart to be updated. Note that each table can have only one worker working on it, so increasing workers only helps with parallel and more frequent vacuuming across tables.
- `autovacuum_vacuum_cost_limit = 3000`: as noted above, this is the I/O quota. It is shared among the workers, so it will need to be raised to account for more of them.
- `autovacuum_vacuum_scale_factor = 0.05`: essentially the fraction of a table’s rows the database waits to become reclaimable before it triggers a vacuum. For extremely large tables, the default of 20% can mean tens or hundreds of millions of rows, which increases the amount of work, and therefore the time, per vacuum.
- `shared_buffers = 16GB`: the amount of memory the DB can use to cache pages. In terms of the read cost charged against the quota, a cache read is ten times cheaper than an OS read (memory/disk). Hence increasing this enables the autovacuum to load more pages before the quota is exhausted.
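If autovacuum still cannot catch up, a one-off manual freeze can also be run in a maintenance window. A sketch, using our bookings table as the example:

```sql
-- aggressively freeze all eligible rows in a single pass;
-- FREEZE is equivalent to running with vacuum_freeze_min_age = 0
VACUUM (FREEZE, VERBOSE) bookings;
```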
What you can do:
We highly recommend you set up monitoring on your Postgres instance. We had a Telegraf agent run the queries mentioned below and send the stats to an InfluxDB instance. We then had dashboards set up in Grafana. The queries themselves are inexpensive, as they run on statistics tables which Postgres populates separately, so they should not impact business queries. For unfrozen rows, we run the following query:
```sql
SELECT c.oid::regclass AS table,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS unfrozen_row_age
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
INNER JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind IN ('r', 'm') AND n.nspname = 'public';
```

(The trailing `version=10 withdbname=false tagvalue="table"` in our setup are Telegraf query parameters, not part of the SQL.)
The good folks at Nilenso have documented handy monitoring queries here.
Review Postgres Configurations
We would recommend reviewing the Postgres default configuration, since the defaults are meant to work across a wide set of platforms and may not be optimal at scale. For that, you can use PG Tune, a useful tool that suggests optimal values for your database depending on the type of load and the available resources.
Since the XID limit is per table, distributing this data reduces the risk of a wraparound. Some options are normalisation, partitioning and clustering. We currently have the following approaches deployed across various services.
- Partitioning: Our Customer Locations Service (CLS) uses the customer’s history to predict the pickup and drop locations for future trips. This algorithm looks at the last few months only, so we partitioned the table per quarter. Data older than a few quarters is not used and can be dropped, saving disk space and improving query performance. We used Partman for this purpose. Another service which stores ephemeral data employs Partman with daily partitions, and a script automatically drops older partitions. Dropping partitions has the advantages of saving disk space, removing bloat and reducing the risk of transaction wraparound (as the XID limit is per table). Partitioning has a drawback, in that queries which lead to cross-partition reads, or queries on non-partition columns, are slow. (Note: Postgres 10 has native support for partitioning, so if you’re starting a new service, we highly recommend Postgres 10.)
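As an illustration, native range partitioning in Postgres 10 might look like the sketch below. The table and column names are hypothetical, not CLS’s actual schema:

```sql
CREATE TABLE customer_locations (
    customer_id bigint NOT NULL,
    location    text,
    recorded_at timestamptz NOT NULL
) PARTITION BY RANGE (recorded_at);

-- one partition per quarter
CREATE TABLE customer_locations_2018q1 PARTITION OF customer_locations
    FOR VALUES FROM ('2018-01-01') TO ('2018-04-01');
CREATE TABLE customer_locations_2018q2 PARTITION OF customer_locations
    FOR VALUES FROM ('2018-04-01') TO ('2018-07-01');

-- retiring an old quarter reclaims disk space immediately and removes
-- that table's XID age from the wraparound equation
DROP TABLE customer_locations_2018q1;
```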
- Periodic Data Deletion: Ensuring optimal conditions for Partman is not always possible with a legacy database which has grown in size and has non-trivial queries being run against it. In our case, the bookings OMS was such an example. To reduce the effective row count within these constraints, we run a job to delete data which is not being used. This is possible because the data is already backed up in cold storage. We also had to ensure that the autovacuum runs frequently enough to handle the extra writes.
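A deletion job along these lines (the column names are hypothetical, and the batch size would be tuned to your write load) keeps each transaction small so that autovacuum and any replicas can keep up:

```sql
-- delete old rows in small batches rather than one huge transaction
DELETE FROM bookings
WHERE id IN (
    SELECT id FROM bookings
    WHERE created_at < now() - interval '90 days'
    LIMIT 10000
);
```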
- Dividing Responsibilities Across Services: Data can be published through an event bus and shared across services. Each of these services can then have their own data storage mechanisms and retention policies based on their use case. For example, the OMS does not get updates on an order after about a week, but Customer Locations Service needs a few months of data. By dividing this data, the write heavy OMS avoids bloat, Transaction wraparound and improves write performance. Meanwhile, the read heavy CLS can still store sufficient data to make predictions.
Aaand sekian (that’s all)!
Please do leave a comment if you find this interesting or have any recommendations/suggestions.
We’re hiring developers. Working at GO-JEK can be immensely rewarding. Grab this chance to write beautiful code and take on some of our most pressing challenges. Check out gojek.jobs for more.
- PG Tune
- Postgres Monitoring Queries