By Akshat Shah
Anyone who has been on production support can testify that the time taken to resolve issues is inversely proportional to the effectiveness of application logs. To understand what to log and, more importantly, what not to log, read ahead.
Logging is to engineers as X-ray is to doctors. With respect to an application, logging brings visibility under the hood.
Understanding what to log and what not to log is of paramount importance. One application at GO-JEK, handling a throughput of 250k rpm, was logging every request. Converging on the root cause of an issue from those logs was akin to finding a needle in a haystack.
To log or not to log is an objective question when looked at from a production support standpoint. Log levels help us maintain sanity. Let’s understand the role of levels in logging.
When an application logs, it’s sending a message to its owners. Messages have different meanings and warrant different actions under different levels.
The severity of log levels in increasing order is:
debug < info < warn < error < fatal < panic
A level can be configured for an application to keep its logs manageable. The application will then only log messages whose severity is at or above the configured level.
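To make the threshold rule concrete, here is a minimal sketch in Python. The `Level` enum and the helper names are hypothetical and exist only for illustration; real logging libraries ship their own level types, and the numeric values below are an assumption that simply preserves the ordering listed above.

```python
from enum import IntEnum
from typing import List


class Level(IntEnum):
    """The six levels listed above, in increasing order of severity."""
    DEBUG = 0
    INFO = 1
    WARN = 2
    ERROR = 3
    FATAL = 4
    PANIC = 5


def should_log(message_level: Level, configured_level: Level) -> bool:
    """Return True when a message is severe enough to be written."""
    return message_level >= configured_level


def emitted_levels(configured_level: Level) -> List[str]:
    """Names of all levels that pass the configured threshold."""
    return [lvl.name.lower() for lvl in Level if should_log(lvl, configured_level)]


if __name__ == "__main__":
    # With the threshold at warn, debug and info messages are dropped.
    print(emitted_levels(Level.WARN))  # ['warn', 'error', 'fatal', 'panic']
```

Raising the configured level is therefore the cheapest way to cut log volume without touching any of the call sites.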
We’ll take the example of an Order Management Service (OMS) to understand what should be logged. OMS handles orders for a ride-hailing service. The flow of OMS is:
- Customer requests a ride
- OMS assigns a driver to the ride
- Ride reaches completion
- Driver gets payment for the ride
Info messages are lifecycle events. They highlight the progress of an application.
Example: application start/stop; order completion in OMS
Debug messages help in figuring out unexpected behaviour. Logging at debug level means tracking state changes at every step of an application.
Example: Let’s say a driver accepts order abc, but OMS assigns order def. To find the root cause of such issues, log all events: creation of the order, the list of eligible drivers for the order, acceptance of the booking by a driver, and saving of state in OMS.
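A debug trail of this kind could look roughly like the sketch below. It uses Python's standard `logging` module; the logger name `oms`, the function `assign_driver` and its arguments are all hypothetical stand-ins, not the actual OMS code.

```python
import logging

logger = logging.getLogger("oms")  # hypothetical logger name


def assign_driver(order_id, eligible_drivers, accepted_by):
    """Assign a driver to an order, logging each state change at debug level.

    All arguments are illustrative; a real service would read them from
    its own data store.
    """
    logger.debug("order created: order_id=%s", order_id)
    logger.debug("eligible drivers: order_id=%s drivers=%s", order_id, eligible_drivers)
    logger.debug("booking accepted: order_id=%s driver_id=%s", order_id, accepted_by)

    if accepted_by not in eligible_drivers:
        # The kind of mismatch that the debug trail helps to explain later.
        logger.debug("accepting driver not eligible: order_id=%s driver_id=%s",
                     order_id, accepted_by)
        return None

    logger.debug("state saved: order_id=%s driver_id=%s", order_id, accepted_by)
    return accepted_by
```

Putting the order and driver identifiers in every line is a deliberate choice here: it lets all events for order abc be pulled out with a single search.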
Warn messages flag potentially harmful situations. These issues need to be fixed, but may not require immediate intervention.
Example: Let’s say OMS talks to a payment service for paying drivers. If the payment service breaches its SLA, OMS will time out and retry the request. Log at warn level: payment service SLA breached. Retrying.
Error messages are of considerable importance. Log at error level when the normal flow of execution is blocked and human intervention is required. Keep an eye out for error logs so that issues can be fixed manually and the system stays consistent.
Example: If OMS has exhausted its driver payment retries to the payment service, log at error level: Error paying driver XYZ for order abc. Later, pay the driver manually.
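The warn and error examples for the payment service can be put together in one sketch: warn while a retry is still coming, error once retries are exhausted. This is an illustration under assumptions, not the real OMS implementation. `pay` is a hypothetical callable standing in for the payment service client, it is assumed to raise `TimeoutError` on an SLA breach, and `max_attempts=3` is an arbitrary choice.

```python
import logging

logger = logging.getLogger("oms")  # hypothetical logger name


def pay_driver_with_retries(pay, driver_id, order_id, max_attempts=3):
    """Call `pay` until it succeeds or the attempts run out.

    Returns True on success and False when the driver has to be paid manually.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            pay(driver_id, order_id)
            return True
        except TimeoutError:
            if attempt < max_attempts:
                # Still recoverable: worth tracking, no immediate action needed.
                logger.warning(
                    "payment service SLA breached, retrying: order_id=%s attempt=%d/%d",
                    order_id, attempt, max_attempts,
                )
    # Retries exhausted: the normal flow is blocked and a human has to step in.
    logger.error("error paying driver %s for order %s", driver_id, order_id)
    return False
```

A caller can use the return value to queue the order for manual payment, while the error line gives support engineers the driver and order to look for.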
Fatal messages are severe events that might cause the application to terminate. Immediate action should be taken when an application logs at fatal level.
Example: running out of disk space, total loss of network connectivity
Panic messages are exceptional scenarios which should be fixed immediately. Log at this level when an application panics and then either terminates or recovers.
Example: the database wasn’t configured, a number was divided by 0
Plan of Action
In the production environment, an application should ideally log only at warn level and above. An application should log at info level only in staging, integration and UAT environments, for the benefit of engineers and QAs.
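One way to apply this policy is to pick the threshold from the environment name at start-up. The sketch below uses Python's standard `logging` module; the environment names, the fallback to warn for unknown environments, and the helper names are assumptions made for illustration.

```python
import logging

# Hypothetical environment names; adjust to whatever the deployment uses.
LEVEL_BY_ENVIRONMENT = {
    "production": logging.WARNING,
    "staging": logging.INFO,
    "integration": logging.INFO,
    "uat": logging.INFO,
}


def level_for(environment):
    """Return the log level for an environment, defaulting to WARNING."""
    return LEVEL_BY_ENVIRONMENT.get(environment.lower(), logging.WARNING)


def configure_logger(environment, name="oms"):
    """Fetch a logger and set its threshold for the given environment."""
    logger = logging.getLogger(name)
    logger.setLevel(level_for(environment))
    return logger
```

Defaulting unknown environments to warn is the conservative choice: a misconfigured deployment ends up quiet rather than flooding the logs.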
warn level issues should be tracked and fixed.
error level issues should be fixed ASAP.
panic level issues warrant immediate attention.
Retaining logs beyond 3 days defeats the purpose of logs. Ideally, logs shouldn’t be retained beyond 24 hours, but since issues raised over a weekend may not be looked at until Monday, adding 48 hours brings the limit to 72 hours, which should be more than enough.
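If the application writes its own log files, a 3-day retention window can be approximated with daily rotation. The sketch below uses `TimedRotatingFileHandler` from Python's standard library; the helper name, the file path passed in and the log format are assumptions. Many teams handle retention outside the application instead (for example in a log aggregator), so treat this as one possible approach.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_rotating_handler(path, retention_days=3):
    """Rotate the log file at midnight and keep `retention_days` old files."""
    handler = TimedRotatingFileHandler(
        path,
        when="midnight",             # start a new file each day
        backupCount=retention_days,  # rotated files beyond this count are deleted
        delay=True,                  # do not open the file until the first write
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    return handler
```

The handler is attached with `logging.getLogger("oms").addHandler(build_rotating_handler("oms.log"))`, where both the logger name and the file name are placeholders.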
I hope this blog was helpful 😀. Please leave your thoughts in the comments section below. If you’re interested in working with GO-JEK, check out gojek.jobs.