2018-11-20 JÉFERSON MACHADO
Creating cloud-native applications is challenging. The cloud environment consists of machines running all over the world. It is a fragile environment where many things can go wrong: network partitions, instances that go down, or even entire regions that disappear.
One thing I've learned over the last few years of building applications for the cloud is that I needed to change my mindset about how to write software. The mindset I have now is: everything can and will fail in the cloud.
Once I changed my mindset, I started to embrace failure so that my applications would be ready to deal with any kind of failure scenario. In the end, this provides a better experience for the user.
I've learned some best practices for creating cloud-native applications over these years, and I would like to share some of them in this post.
Recovering quickly from errors is essential for keeping the application highly available.
To be resilient, there are two important aspects that need to be taken into account: design for failure and graceful degradation.
Design for failure
One service must not bring down others. If an error happens, it should be isolated.
So when developing your applications, think about what could go wrong and build a mechanism to recover from the possible failure. For example, when calling remote services, add a timeout and a retry to the code to deal with the possibility of the other service being unresponsive or malfunctioning, so that your application isn't impacted by it.
I’ve created a code example of retry and timeout that can be found here: https://github.com/jefersonm/cloud-native-utilities
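To make the idea concrete, here is a minimal sketch in Python. It is independent of the repository linked above and is not taken from it. The names retry and fetch, the default of three attempts and the doubling delay are all assumptions I picked for illustration.

```python
import time
import urllib.request


def retry(operation, attempts=3, base_delay=0.2, retry_on=(OSError,), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the last error
            # Wait longer after each failure: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


def fetch(url, timeout_seconds=2.0):
    """Read a URL, giving up if the remote service takes too long to respond."""
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return response.read()
```

A call would look like retry(lambda: fetch(some_url)), where some_url is whatever remote endpoint you depend on. Note that OSError also covers HTTP error responses raised by urllib, so a real implementation would probably be more selective about which errors are worth retrying.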
Netflix Hystrix (https://github.com/Netflix/Hystrix) is a good tool for adding resiliency to the application: it helps to isolate failures, stops cascading errors and provides good metrics about the calls that are being made.
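One of the techniques Hystrix uses to stop cascading errors is the circuit breaker pattern. The sketch below shows the idea in plain Python. It is not Hystrix's API (Hystrix is a Java library), and the class name, threshold and cooling-off period are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before trying again
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on a broken service.
                raise CircuitOpenError("circuit is open; call skipped")
            # Cooling-off period is over: let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

The point of failing fast is that callers get an immediate error they can handle, instead of waiting on a service that is already known to be in trouble.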
Graceful degradation
When a failure is inevitable, one way to deal with it is to degrade a specific feature. For example, suppose a video streaming application is having issues with its recommendation service, which is responding slowly. The recommendation service should not impact the whole video streaming application. Instead, the application could show the user a cached list of the latest movies.
The drawback is that the user won't see the most up-to-date movie recommendations while the service is degraded, but the main functionality of watching movies is not impacted.
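The recommendation example could be sketched like this. The function and parameter names are hypothetical, and I am assuming the recommendation client signals trouble by raising TimeoutError or ConnectionError.

```python
def recommendations_with_fallback(fetch_recommendations, cached_latest_movies):
    """Return (movies, degraded), falling back to a cached list on failure.

    fetch_recommendations is any callable that asks the recommendation
    service for personalized movies (a hypothetical client function).
    """
    try:
        return fetch_recommendations(), False
    except (TimeoutError, ConnectionError):
        # The feature is degraded, but the page still has something to show.
        return list(cached_latest_movies), True
```

Returning the degraded flag alongside the movies lets the caller log or count how often the fallback is used, which is useful to know when the recommendation service keeps misbehaving.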
Chaos Monkey (https://github.com/Netflix/chaosmonkey) is a resilience tool that helps us simulate random failure scenarios in order to ensure that the application is resilient.
What happens if a network interface fails? If a server or even an entire region goes down? Chaos Monkey covers the server case by randomly terminating instances, and the same idea of deliberately injecting failures extends to the other scenarios.
Simulating these failures helps us understand how our application behaves in different situations, so that we can anticipate issues.
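The same idea can be tried on a much smaller scale inside a test suite. The sketch below is not how Chaos Monkey works internally. It is a hypothetical wrapper that makes a call fail at random, so you can check that the calling code copes with it.

```python
import random


def with_random_failures(operation, failure_rate, rng=None):
    """Wrap operation so that it randomly raises ConnectionError.

    failure_rate is the probability of an injected failure (0.0 to 1.0).
    rng can be a seeded random.Random to make test runs repeatable.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return operation(*args, **kwargs)

    return wrapper
```

Wrapping a service client with a failure rate of, say, 0.3 in a test environment is a cheap way to find out whether timeouts, retries and fallbacks actually kick in.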
Running a distributed application in the cloud is challenging, so having good information about what is happening is essential. Observability helps you predict and analyze problems, giving you the information you need to improve the application.
Logging, tracing and monitoring tools can all help to find information on different levels for different purposes.
Good, informative logs are really important: they help you understand in more detail what is happening in the application.
When developing applications, think about what is important to log and include just the necessary information in a way that helps you find useful information in the future. Adding too much unnecessary information to the logs will make it more difficult to debug and troubleshoot.
Finding all logs for a distributed application isn’t a simple task, because they could be residing on different machines. Having a centralized logging infrastructure will help with this. So instead of going into each different machine, we could see information for all services in a single place, making it easier to debug and see correlation information between the services.
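One thing that makes centralized logs easier to search and correlate is writing them in a structured format, with the same fields in every service. Here is a minimal sketch using Python's standard logging module. The field names service and request_id are my own choice, not a standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            # These two fields arrive through logging's `extra` argument.
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.propagate = False
    return logger
```

A service would then log with something like logger.info("payment authorized", extra={"service": "checkout", "request_id": request_id}). If every service passes along the same request_id, a single search in the centralized tool shows everything that happened for that request.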
Example of centralized logging tools:
Logs help in finding useful information, but sometimes they aren't enough. Distributed tracing tools allow you to see more, such as following a request through the system from start to end, seeing how much time is spent in each call and finding which service returned an error.
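To give a feeling for what a tracing tool records, here is a toy sketch. Real tracing tools also propagate the trace ID across service boundaries and ship the data to a backend, which this does not do. The names new_trace and span are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager


def new_trace():
    """Start a trace: one ID shared by every step of a request."""
    return {"trace_id": uuid.uuid4().hex, "spans": []}


@contextmanager
def span(trace, name, clock=time.perf_counter):
    """Record how long a step took and whether it failed."""
    start = clock()
    error = None
    try:
        yield
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        trace["spans"].append(
            {"name": name, "seconds": clock() - start, "error": error}
        )
```

Wrapping each step of a request in with span(trace, "step name"): gives you, per request, a list of steps with their durations and errors. Inner steps finish first, so they appear in the list before the step that contains them.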
Example of distributed tracing tools:
Having good monitoring in place helps you visualize your application's metrics, such as CPU consumption, memory usage, health of API endpoints and application throughput.
With metrics in place, alerts can be set up to highlight issues, so that teams can act on them, fix them or prevent them from happening.
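As a small illustration of an alert rule, the sketch below tracks the error rate over the most recent requests and reports when it crosses a threshold. In practice this logic usually lives in the monitoring tool rather than in application code. The class name, window size and threshold are assumptions.

```python
from collections import deque


class ErrorRateAlert:
    """Track the outcomes of recent requests and flag a high error rate."""

    def __init__(self, window_size=100, threshold=0.05):
        # Only the most recent window_size outcomes are kept.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, success):
        self.outcomes.append(bool(success))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self):
        return self.error_rate() > self.threshold
```

Using a sliding window instead of a running total means that old failures eventually stop counting, so the alert clears on its own once the service recovers.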
Example of monitoring tools:
Always keep in mind that failures happen in a cloud environment, so embrace them and create resilient applications that can deal with them.
Embracing observability will provide insights into your applications, so that you can predict incidents and act before they happen.
Creating fault-tolerant and highly available services is key for providing the best possible user experience, free from disruptions and unexpected downtime.