Knowing your system goes far beyond writing and reviewing the code. Chances are, you are the one maintaining it, taking feature requests from the product owner and tweaking some documentation from time to time.
Also, you take all the blame when something goes bad.
Let’s say you are responsible for a complex system that needs to be connected to an external, real-time service such as Facebook Messenger.
One day your product owner messages you on Slack with the words “production doesn’t work, some customers can’t reply to our chatbot. We lose money every minute! Do something about it”.
I know it better than I would like to admit. You know the drill. It is time to check if requests are getting proper responses, if the database is accessible and why the hell your unit test suite is still passing.
Everything looks fine. But the issue is confirmed to be the system’s fault. You need more info, and then you realize.
You yet again forgot to add logs to your software.
It sounds like a nightmare, doesn’t it? Fortunately, this doesn’t have to be your reality.
As the example above shows, you never know what happens to your system in the wild. Logging is a form of teaching the service that it needs to talk to you and the users. Let’s say that in the example above, the external API changed slightly, and it still works with your syntax but gives you different results. It doesn’t produce errors in your backend, but the app is useless business-wise. It would be best if you kept a log of responses from the service, so you know when something works as expected and when it silently breaks. It can be especially life-saving if you deal with asynchronous tasks, where the order of execution can make or break your business as well.
Another often underrated use case for logging is an audit. There are many reasons why a long-running service needs to be audited. If that happens, a proper log system is invaluable. The data science team can extract all the statistics about the system. Software engineers (even ones new to the project) can spot some bottlenecks. And in the case of a merger with a different company - the legal team will have an easier time checking if the system is compliant with all the new laws it has to obey.
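A minimal sketch of that idea, using only the Python standard library: every response from the external service is logged, and a warning is raised when its shape drifts from what the app expects. The logger name, the `check_response` helper and the field names in `EXPECTED_KEYS` are assumptions made up for illustration, not the real Messenger API.

```python
import json
import logging
import sys

logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("messenger_client")  # hypothetical logger name

# Hypothetical fields we expect in every response from the external service.
EXPECTED_KEYS = {"recipient_id", "message_id"}


def check_response(payload: dict) -> bool:
    """Log the raw response and warn when expected fields are missing."""
    logger.info("External service response: %s", json.dumps(payload, sort_keys=True))
    missing = EXPECTED_KEYS - set(payload)
    if missing:
        # No exception was raised anywhere, but the contract silently changed.
        logger.warning("Response is missing expected fields: %s", sorted(missing))
        return False
    return True
```

With something like this in place, a silent API change shows up as a stream of warnings in the log instead of a confused message from the product owner.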
The developer that you pass the codebase to will also appreciate some higher verbosity.
We teach a lot about clean code, good practices and design patterns, but less about what actually happens once the system hits production. Let’s fill this space.
The best strategy for rigorous logging is incrementally adding logs with each code change. Just as you would click “request changes” on a PR without unit tests, you should stop a PR that implements an important business logic function without logging it:
logger.info("User 1239876130 has been billed $30 for the Premium account renewal.")
As always, this comes down to team cooperation. Some may forget, some may oppose because they don’t see the value. Some may not care, but the benefits will show eventually.
If one wants to implement logging without knowing any guidelines, they can be overwhelmed by the number of options. You can save to text files, print to STDOUT or STDERR, or send logs directly to external services. However, this complexity is not required.
The Twelve-Factor App, a set of industry-standard guidelines, treats logging as a first-class citizen of a system - on the same level as dependency management, app configuration and the codebase itself.
The general advice is straightforward:
A twelve-factor app never concerns itself with routing or storage of its output stream. It should not attempt to write to or manage logfiles. Instead, each running process writes its event stream, unbuffered, to stdout. During local development, the developer will view this stream in the foreground of their terminal to observe the app’s behaviour. — The Twelve-Factor App
So fear not! Your whole logging system can be as simple as using print statements in a structure of your choice.
There are plenty of tools working out of the box that can enhance your logging system. In this article, I will try to list the ones that I have personally used, heard good things about or consider trying out in the future. If you know of software that should make the list, please let me know at @wkulikowski1!
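Here is one way this could look in Python, assuming the standard `logging` module is enough for your needs. The `build_logger` helper and the "billing" logger name are hypothetical. The app only writes to stdout and leaves routing and storage to whatever runs it.

```python
import logging
import sys


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes its event stream to stdout only."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        # StreamHandler flushes after every record, so lines appear immediately.
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing the line twice
    return logger


logger = build_logger("billing")
logger.info("User %s has been billed $%d for the Premium account renewal.", 1239876130, 30)
```

No log files, no rotation, no external client in the app itself. If you later decide to ship logs somewhere else, that is a deployment concern and the code above does not have to change.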
Each major public cloud provider has its own logging solution, well-integrated with the rest of its products. The list goes as follows:
If your stack relies heavily on one of those platforms, sticking inside the ecosystem may seem like the simplest solution. In AWS, for example, it is trivial to trigger a Lambda function on seeing a certain log in CloudWatch. If you use Firebase, Google Logging is a couple of clicks away on your dashboard. As much as vendor lock-in can become a problem in the future, using logging solutions from the same provider can cut your costs and speed up development.
Logstash is a part of the Elastic Stack. It focuses on gathering log data from any number of sources and then categorizing, sorting and transforming it in the desired way. Although in theory Logstash (and the whole Elastic Stack) is independent and open source, all cloud vendors have some sort of ready out-of-the-box solution for hosting the stack. It is pricey and demands a lot of computing power, but it remains extremely valuable for many companies which run it in production every day.
I am a personal fan of Heroku and the ease of adding new services to your system on it. For example, you can have a whole set of logging tools just by choosing a proper add-on. Among those currently available are:
Although every solution serves a different need, each of them introduces a valuable enhancement for your log management.
If logging is just a stream of plain text words, monitoring is grouping, categorizing and visualizing insights about your system. Usually, there are some charts, dashboards and frankly, whatever is required by the product manager at the time. Do we need to acquire as many new customers as possible? Probably the most useful and needed metric will be landing page visits and user conversion. Do we need to have a stable payment system? Let’s track the number of errors our app experiences while pinging the /payment endpoint.
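To make the last example concrete, here is a rough sketch of turning raw log lines into a metric. The log format ("method path status") and the sample lines are made up for illustration. A real setup would read the stream from your log store instead of a hard-coded list.

```python
from collections import Counter

# Hypothetical access-log lines in the form "<method> <path> <status>".
LOG_LINES = [
    "POST /payment 200",
    "POST /payment 500",
    "GET /landing 200",
    "POST /payment 502",
    "GET /landing 200",
]


def count_server_errors(lines):
    """Count 5xx responses per endpoint."""
    errors = Counter()
    for line in lines:
        _method, path, status = line.split()
        if int(status) >= 500:
            errors[path] += 1
    return errors


errors = count_server_errors(LOG_LINES)
print(errors["/payment"])  # prints 2
```

The number itself is the monitoring part. Plot it over time and you have the chart your product manager asked for.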
It doesn’t take a data scientist to set up and interpret the dashboard, but it should be built & read under some important assumptions.
You are not the only person reading the monitoring dashboard, and most likely, you will not be around forever. Your role is to make the monitoring system accessible for everybody. Show critical charts at the beginning. Draw additional lines showing norms/averages for better context. Learn data visualization.
A healthy baseline could be the line mentioned in the previous paragraph. If I am a new maintainer of the project, I need to know what our targets are as soon as possible. If there is a business requirement for at least a 5% conversion rate, there should be a line on the chart indicating how close the danger zone is.
Finally, your graphs should cover the whole system, picturing it exactly as it is. Some endpoints will be under a heavier load than others, and some statistics are caused by events different than you expect. If you solve problem X and look only at stat Y, you assume that Y causes it. But what if it is caused by Z? You don’t see Z in your metrics. Not only will it be harder to measure now, but you can also miss it during your problem solving in general.
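The same baseline can be written down in code, so the chart and any automated check share one number. This is only a sketch: the 5% target comes from the example above, while the `WARNING_MARGIN` value and the `conversion_status` helper are assumptions for illustration.

```python
TARGET_CONVERSION_RATE = 0.05  # the 5% business requirement from the example
WARNING_MARGIN = 0.01  # hypothetical buffer that marks the "getting close" area


def conversion_status(visits: int, signups: int) -> str:
    """Classify the current conversion rate against the baseline."""
    if visits == 0:
        return "no data"
    rate = signups / visits
    if rate < TARGET_CONVERSION_RATE:
        return "below target"
    if rate < TARGET_CONVERSION_RATE + WARNING_MARGIN:
        return "close to target"
    return "healthy"
```

For instance, 40 signups out of 1000 visits would come back as "below target", while 80 out of 1000 would be "healthy".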
The list is probably super incomplete, but again, please let me know!
Kibana will visualize the logs that Logstash has collected and stored in Elasticsearch. It is a part of the Elastic Stack - which is really cool, and you should check it out!
Grafana is an “open observability platform” which focuses more on visualizing data from your databases than logs. It is perfect for displaying ratios, business logic and more “static” data. Many companies use Grafana in production every day.
The last part of a truly healthy system is alerting. Stuff breaks. Your server will spill out 500 errors sooner or later, and you should be the first person to know about it.
We use Sentry extensively at 10Clouds, and it works wonders for us. Sentry is open source as well; however, the company will let you buy a managed solution in a SaaS package. After integration with your app, Sentry will catch all the errors, group them and notify you through the channel of your choice. Nothing happens unnoticed.
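To show the general idea behind alerting without any external dependency, here is a tiny sketch built on the standard `logging` module. It is not Sentry's API or implementation. The `AlertHandler` class and the `notify` callback are hypothetical, and in a real system the callback would post to Slack, send an e-mail or page whoever is on call.

```python
import logging


class AlertHandler(logging.Handler):
    """Forward every ERROR (or worse) log record to a notification callback."""

    def __init__(self, notify):
        super().__init__(level=logging.ERROR)
        self.notify = notify  # hypothetical callable, e.g. a Slack webhook sender

    def emit(self, record):
        try:
            self.notify(self.format(record))
        except Exception:
            # Never let a broken alert channel crash the application.
            self.handleError(record)


sent_alerts = []  # stands in for a real notification channel
logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(AlertHandler(sent_alerts.append))

logger.info("Payment accepted")  # below ERROR, so no alert
logger.error("Payment provider returned HTTP 500")  # triggers an alert
```

A dedicated tool adds grouping, deduplication and context on top of this, which is exactly why it is usually worth using one instead of rolling your own.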
Nobody wants to maintain a black box system, whether it works perfectly or fails mysteriously every second Thursday. Logging & monitoring saved me more times than I would like to admit, and I encourage you to introduce the observability culture to your development team as well. If you need further help or just spot a mistake - as always - please message me on Twitter.