Knowing your system goes far beyond writing and reviewing the code. Chances are you are the one maintaining it. You are taking feature requests from the product owner and tweaking some documentation from time to time.
Also, you take all the blame when something goes bad.
Let’s say you are responsible for a complex system that needs to be connected to an external, real-time service (e.g., Facebook Messenger). One day your product owner messages you on Slack with the words:
“Production doesn’t work and customers can’t reply to our chatbot. We are losing money every minute! Do something about it!”
I know it better than I would like to admit.
You know the drill. It is time to check if requests are getting proper responses, if the database is accessible and why your unit test suite is still passing. Everything looks fine. But the issue is confirmed to be the system’s fault. You need more info, and then you realize you yet again forgot to add logs to your software.
It sounds like a nightmare, doesn’t it?
Fortunately, this doesn’t have to be your reality.
When reading online, especially when searching for valuable information, it is crucial to choose the right sources. Before we start, we would like to take a minute to introduce ourselves. The information we provide is derived from real-life experience with the topics covered. However, we can hardly cover all the valuable information you need to fully optimize a given process. If you feel that a comprehensive consultation would make a difference, don't hesitate to contact us.
As stated in the example above, you never know what happens to your system “in the wild”. Logging is a form of teaching the service that it needs to talk to you and the users.
Let’s say that in the example above, the external API changed slightly. It still works with your syntax but gives you different results. It doesn’t produce errors in your backend, but the app is useless business-wise. It would be best to log responses from the service. This way you know when something works as expected and when it silently breaks. This can be life-saving. When dealing with asynchronous tasks, the order of executions can make or break your business.
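As a minimal sketch of the idea above, the function below logs every reply from the external service and warns when the reply no longer has the shape the code relies on. The logger name, the `check_response` helper, and the field names in `EXPECTED_FIELDS` are hypothetical and chosen only for illustration; a real integration would use the fields of the actual API.

```python
import logging

logger = logging.getLogger("chatbot.external")

# Hypothetical: the fields our code relies on in every reply from the service.
EXPECTED_FIELDS = {"recipient_id", "message_id"}


def check_response(status_code, payload):
    """Log an external API reply and flag replies whose shape has changed."""
    logger.info("External API replied: status=%s payload=%s", status_code, payload)
    missing = EXPECTED_FIELDS - set(payload)
    if missing:
        # No exception was raised, but the reply is no longer what we expect.
        logger.warning("External API reply is missing fields: %s", sorted(missing))
        return False
    return True
```

With something like this in place, a silent change in the external API shows up as a warning in the logs instead of a message from the product owner.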
Another often underrated use case for logging is auditing. There are many reasons why a long-running service may need to be audited. When that happens, a proper log system is invaluable.
The data science team can extract all the statistics about the system. Software engineers (even ones new to the project) can spot some bottlenecks. In the case of a merger with a different company, the legal team will have an easier time checking if the system is compliant with all the new laws it has to obey. The developer that you pass the codebase to will also appreciate some higher verbosity.
We teach a lot about clean code, good practices, and design patterns, but less about what happens once the system hits production.
Let’s fill this space.
The best strategy for rigorous logging is to add logs incrementally with each code change. Just as you would click “Request changes” on a PR without unit tests, you should also stop a PR that implements an important piece of business logic without logging it:
logger.info("User 1239876130 has been billed $30 for the Premium account renewal.")
As always, this comes down to team cooperation. Some may forget, and some may oppose it because they don’t see the value. Others may not care but the benefits will show eventually.
If you want to implement logging without knowing any specific guidelines, you can be overwhelmed by the number of options. You can save to text files, print to STDOUT or STDERR, or send logs directly to external services. However, this complexity is not required.
The Twelve-Factor App methodology puts logging on the same level as dependency management, app configuration, and the codebase itself.
The general advice is straightforward:
A twelve-factor app never concerns itself with routing or storage of its output stream. It should not attempt to write to or manage logfiles. Instead, each running process writes its event stream, unbuffered, to stdout. During local development, the developer will view this stream in the foreground of their terminal to observe the app’s behavior. — The Twelve-Factor App.
So, fear not! Your whole logging system can be as simple as using print statements in your structure of choice.
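A minimal sketch of this advice in Python, assuming the standard `logging` module is used instead of bare print statements: the process writes its events to stdout and leaves routing and storage to the environment. The `build_logger` helper and the log format are illustrative choices, not a prescribed setup.

```python
import logging
import sys


def build_logger(name):
    """Return a logger that writes its event stream to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = build_logger("billing")
logger.info(
    "User %s has been billed $%s for the Premium account renewal.", 1239876130, 30
)
```

Because the app only writes to stdout, the same code works in a local terminal and behind whatever log collector the production environment provides.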
There are plenty of tools working out of the box that can enhance your logging system. In this article, I will try to list the ones that I have personally used, heard positive things about, or am considering trying out in the future.
Each major public cloud provider has its own logging solution, well integrated with the rest of its products.
If your stack relies heavily on one of those platforms, staying inside the ecosystem may seem like the simplest solution. In AWS, for example, it is trivial to trigger a Lambda function when a certain log appears in CloudWatch. If you use Firebase, Google Logging is a couple of clicks away on your dashboard. Using logging solutions from the same provider can cut your costs and speed up development. Keep in mind, however, that vendor lock-in can become a problem in the future.
Logstash is a part of the Elastic Stack. It focuses on gathering log data from any number of sources and then categorizing, sorting, and transforming it in the desired way. In theory, Logstash (and the whole Elastic Stack) is independent and open source. In practice, all cloud vendors have some sort of ready out-of-the-box solution for hosting the stack. It is pricey and demands a lot of computing power, but it remains extremely valuable for the many companies that run it in production every day.
I am personally a fan of Heroku and of how easy it makes adding new services to your system. For example, you can get a whole set of logging tools just by choosing the right add-on.
Although every solution serves a different need, each of them introduces a valuable enhancement to your log management.
If logging is just a stream of plain text words, monitoring is grouping, categorizing, and visualizing insights about your system. Usually, there are some charts, dashboards, and frankly, whatever is required by the product manager at the time. Do we need to acquire as many new customers as possible? Probably the most useful and needed metric will be landing page visits and user conversion. Do we need to have a stable payment system? Let’s track the number of errors our app experiences while pinging the payment endpoint.
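To illustrate how a log stream turns into a monitoring metric, the sketch below counts server errors per endpoint. The `events` list is made-up sample data standing in for parsed log entries, and the `error_counts` helper is hypothetical; a real dashboard tool would do this aggregation for you.

```python
from collections import Counter

# Made-up sample events, as they might look after parsing the log stream.
events = [
    {"endpoint": "/payments", "status": 200},
    {"endpoint": "/payments", "status": 502},
    {"endpoint": "/landing", "status": 200},
    {"endpoint": "/payments", "status": 200},
]


def error_counts(events):
    """Count 5xx responses per endpoint."""
    counts = Counter()
    for event in events:
        if event["status"] >= 500:
            counts[event["endpoint"]] += 1
    return counts


payment_errors = error_counts(events)["/payments"]
```

The number of errors on the payment endpoint is exactly the kind of value you would plot over time on a dashboard.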
It doesn’t take a data scientist to set up and interpret the dashboard, but it should be built & read under some important assumptions.
You are not the only person reading the monitoring dashboard, and most likely, you will not be around forever. Your role is to make the monitoring system accessible for everybody. Show critical charts at the beginning. Draw additional lines showing norms/averages for better context. Learn data visualization.
A healthy baseline could be the line mentioned in the previous paragraph. If I am a new maintainer of the project, I need to know what our targets are as soon as possible. If there is a business requirement for at least a 5% conversion rate, there should be a line on the chart showing how close we are to the danger zone.
Finally, your graphs should cover all systems and picture them exactly as they are. Some endpoints will be under a heavier load than others, and some statistics are driven by events other than the ones you might expect. If you solve problem X and look only at stat Y, you assume that Y causes it. But what if it is caused by Z? You don’t see Z in your metrics. Not only will it be harder to measure now, but you may also miss it during your problem-solving in general.
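The baseline idea can be sketched as a simple comparison of a measured value against the target. The 5% target comes from the example above; the one-percentage-point warning margin and the `conversion_status` helper are assumptions made up for this illustration.

```python
TARGET_CONVERSION = 0.05  # the 5% business requirement from the example
WARNING_MARGIN = 0.01     # hypothetical margin marking "close to the target"


def conversion_status(visits, signups):
    """Classify the current conversion rate relative to the baseline."""
    rate = signups / visits if visits else 0.0
    if rate < TARGET_CONVERSION:
        return "below target"
    if rate < TARGET_CONVERSION + WARNING_MARGIN:
        return "close to target"
    return "healthy"
```

On a chart, the same two constants would become the horizontal lines that tell a new maintainer at a glance where the targets are.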
I know this very short list may be a little incomplete, but again, please let me know if you can think of some other software solutions!
Kibana visualizes the logs that Logstash has collected and stored in Elasticsearch. It is a part of the Elastic Stack, which is cool, and you should check it out!
Grafana is an “open observability platform” that focuses more on visualizing data from databases than on logs. It is perfect for displaying ratios, business logic, and more “static” data. Many companies use Grafana in production every day.
The last part of a truly healthy system is alerting. Stuff breaks. Your server will spill out 500 errors sooner or later, and you should be the first person to know about it.
We use Sentry extensively at 10Clouds, and it works wonders for us. Sentry is open source as well; however, the company also offers a managed solution as a SaaS package. After integration with your app, Sentry will catch all the errors, group them, and notify you through the channel of your choice. Nothing happens unnoticed.
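Independent of any particular tool, the core of alerting can be sketched with the standard library: a handler that forwards every error to a notification callback. This is not Sentry's API. The `AlertHandler` class is hypothetical, and the `sent_alerts` list stands in for a real channel such as Slack or e-mail.

```python
import logging


class AlertHandler(logging.Handler):
    """Forward records at ERROR level and above to a notification callback."""

    def __init__(self, notify):
        super().__init__(level=logging.ERROR)
        self.notify = notify

    def emit(self, record):
        self.notify(self.format(record))


sent_alerts = []  # stand-in for a real notification channel
logger = logging.getLogger("payments")
logger.addHandler(AlertHandler(sent_alerts.append))

logger.warning("Payment endpoint is slow")        # below ERROR, no alert
logger.error("Payment endpoint returned a 500")   # forwarded to the callback
```

A dedicated tool adds grouping, deduplication, and context on top of this, which is why a hand-rolled handler is only a starting point.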
Nobody wants to maintain a black box system, whether it works perfectly or fails mysteriously every second Thursday. Logging & monitoring saved me more times than I would like to admit. I strongly encourage you to introduce the observability culture to your development team as well.
If you need further help or just spotted a mistake, as always, please message me on Twitter or simply click the button below.