How Niyuj architected and deployed highly available, auto-healing infrastructure using Kubernetes, Docker and Terraform to achieve an SLA of five 9s for software that runs as an OEM component on every endpoint of the world's largest ATM/PoS manufacturer.

Executive Summary

In today's world, no business can afford downtime of its infrastructure and services. Advancements in technology have enabled the vision of "always-on" businesses, which run 24x7 with hardly any downtime. Commonly known as "High Availability", such an infrastructure setup adheres to 99.999% availability, which translates to a maximum of 5.26 minutes of downtime per year. Achieving this is no easy task. However, with the help of the latest tools and technologies, Niyuj developed a highly available architecture for one of our esteemed customers in the security domain, catering to some of the world's largest financial institutions.

This case study explains how Niyuj helped the customer build infrastructure that adheres to five 9s of uptime and is highly scalable, catering to millions of endpoints. It also covers how Niyuj approached the problem and architected the solution with the help of the latest tools and technologies.

Key Business Challenges

Our customer has a sophisticated security product that creates encrypted channels between agents and the server after authentication by the product platform. This relies heavily on setting up and tearing down encrypted channels at a very fast rate, which is extremely compute intensive. Moreover, the product is deployed in the core network infrastructure of some of the world's largest financial institutions and banks, which means it must scale with the load requirements and auto-heal whenever a problem is detected.

Architecting a highly scalable, auto-healing system to achieve 99.999% availability was our biggest challenge. Apart from this, we also had to develop a disaster recovery plan and provide a fault-tolerant setup to avoid outages. This had to be achieved using deployments across multiple zones/regions of the given cloud provider.

The architecture had to be tested thoroughly to prove that it was resilient and scalable enough to support 5 million clients (or more!).

Kubernetes gives you the freedom to take advantage of on-premises, hybrid, or public cloud infrastructure, letting you effortlessly move workloads to where they matter to you.

The Background

The product platform included multiple microservices running inside Docker containers, communicating with each other over encrypted channels. These microservices were designed to be stateless. The base version was deployed on a single VM, with no capability for handling bursts of load or tolerating faults. The customer wanted Niyuj to understand the product architecture, then propose and implement a deployment architecture fulfilling the following requirements:

1. High Availability (99.999% uptime)

2. Auto-scaling / Fault-Tolerant

3. Auto-healing

4. Health monitoring

5. Alerting

The proposed architecture had to support 500,000 endpoints continuously communicating with the product platform over encrypted channels. The sessions would not necessarily be long-lived, and would need to be set up and torn down at a very high rate.

Analysis

Niyuj studied the overall product architecture carefully and noted the components that could become bottlenecks and would therefore need autoscaling support. We found that, given the massive size of the deployment, all the components would need autoscaling; however, some components were expected to be heavily loaded while others would carry medium load. This observation highlighted the need to scale different components up or down at different rates. Since each microservice runs in its own container, we needed to monitor the health of each container as well as each VM on which the containers would be deployed. All collected metrics had to be made available in an actionable format for the NOC (Network Operations Centre) team to act upon.

The deployment also needed to be "repeatable" across different zones/regions to support high availability.

Solution

Analysis revealed that the architecture would need orchestration at two levels: one for the infrastructure itself and another for the containers. For infrastructure, we chose Terraform after evaluating other options such as Ansible and Puppet; the primary reasons were its support for multiple cloud providers, its philosophy of immutable infrastructure and its client-only architecture. For containers, we chose Kubernetes over Docker Swarm because of its open-source nature and its origin at Google.

Terraform Scripts

Terraform scripts were created to set up the infrastructure, viz. virtual machines for the Kubernetes masters and nodes, software-defined networks (AWS VPC), load balancers, security groups and the database. The setup was spread across multiple availability zones to ensure redundancy and fault tolerance. Launch configurations were created for the VMs to associate them with the cloud provider's autoscaling functionality.
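
The sketch below illustrates this pattern: a launch configuration tied to an autoscaling group spanning multiple availability zones. The resource names, AMI ID, instance type and sizing are hypothetical placeholders, and the referenced security group and subnets are assumed to be defined elsewhere in the configuration; this is not the customer's actual setup.

    # Launch configuration describing how each Kubernetes node VM is built.
    resource "aws_launch_configuration" "k8s_node" {
      name_prefix     = "k8s-node-"               # hypothetical name
      image_id        = "ami-0123456789abcdef0"   # placeholder AMI
      instance_type   = "m5.xlarge"               # assumed instance size
      security_groups = [aws_security_group.k8s_node.id]   # defined elsewhere

      lifecycle {
        create_before_destroy = true   # replace, never mutate: immutable infrastructure
      }
    }

    # Autoscaling group spread across availability zones for redundancy.
    resource "aws_autoscaling_group" "k8s_nodes" {
      name_prefix          = "k8s-nodes-"
      launch_configuration = aws_launch_configuration.k8s_node.name
      min_size             = 3
      max_size             = 20
      vpc_zone_identifier  = [aws_subnet.az1.id, aws_subnet.az2.id]   # one subnet per AZ

      tag {
        key                 = "role"
        value               = "k8s-node"
        propagate_at_launch = true
      }
    }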

Kubernetes

Kubernetes masters were configured on two hosts across availability zones. The microservices were deployed as pods in the Kubernetes cluster. Most of the microservices were stateless; however, a few needed permanent storage and therefore had to be deployed on the specific nodes where that storage was accessible. This was achieved by defining node groups and creating affinity between those services and the group, all through the YAML configuration files for the service deployments.
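
A minimal sketch of this pattern, assuming a hypothetical storage-bound microservice and a node label such as storage=persistent applied to the nodes that can reach the permanent storage (the actual service names and labels are not disclosed in the case study):

    # Deployment pinned to nodes labelled storage=persistent (hypothetical label).
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: stateful-svc            # hypothetical microservice name
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: stateful-svc
      template:
        metadata:
          labels:
            app: stateful-svc
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: storage
                    operator: In
                    values: ["persistent"]
          containers:
          - name: stateful-svc
            image: registry.example.com/stateful-svc:1.0   # placeholder image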

Autoscaling

Autoscaling was supported at two levels. Kubernetes itself was configured with the HPA (Horizontal Pod Autoscaler) along with the metrics server; the metrics server enabled the master to scale specific pods whenever they crossed a configured threshold, such as CPU or memory utilization. On top of this, we also leveraged the cloud provider's autoscaling functionality to add VMs when the already-provisioned VMs were utilized beyond a threshold. This dual scaling gave us the synchronized autoscaling much needed for the HA setup. Both levels also take unresponsive resources out of service and replace them with new ones, providing the auto-healing functionality.
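
A minimal HPA sketch for one such microservice, assuming a hypothetical Deployment named auth-gateway and a 70% CPU target; the real thresholds and replica bounds were tuned per component:

    # HPA scaling a hypothetical Deployment between 2 and 10 replicas on CPU load.
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: auth-gateway-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: auth-gateway          # hypothetical service
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70   # scale out when average CPU exceeds 70%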

Health Monitoring and Alerting

Health monitoring and alerting were supported with the help of Prometheus, Grafana and a custom alert manager, which enabled the NOC team to identify problems and take manual action when auto-healing did not resolve them.
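
As an illustration, a Prometheus alerting rule of the kind the NOC team might act on; the metric source (node_exporter), threshold and labels here are hypothetical, not the customer's actual rules:

    # Alert when a node's CPU stays above 85% for 10 minutes (hypothetical rule).
    groups:
    - name: node-health
      rules:
      - alert: NodeCpuHigh
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"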

Testing

Using custom scripts written in Python and Bash, we simulated realistic load on the system to test all the functionality described above. Identified bottlenecks were fixed and retested. This rigorous testing enabled us to achieve the high availability desired by the customer.

Terraform makes the infrastructure immutable, thus avoiding "configuration drift". Its client-only architecture spares enterprises from managing an additional server.

Terraform

Terraform is an open-source infrastructure-as-code software tool. It enables users to define and provision datacentre infrastructure using a high-level configuration language or, optionally, JSON. It supports multiple cloud providers, including AWS, Azure and Google Cloud.

Kubernetes

Kubernetes is an open-source container orchestration system for automating application deployment, scaling, and management. It was originally designed by Google and is now maintained by the Cloud Native Computing Foundation. With the help of several plug-ins, it can manage thousands of nodes and containers with ease.

Docker

Docker increases productivity and reduces the time it takes to bring applications to market, freeing resources to invest in key digitization projects that cut across the entire value chain, such as application modernization, cloud migration and server consolidation. Docker images can be deployed as Kubernetes services.

Prometheus

Prometheus is an open-source software project written in Go that records real-time metrics in a time-series database built using an HTTP pull model, with flexible queries and real-time alerting. It exposes metrics at both the pod level and the node level, allowing system administrators to take preventive or corrective action.
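
A minimal sketch of that pull model, assuming Prometheus runs inside the cluster and discovers scrape targets via the Kubernetes API (job names are illustrative, and node scraping typically needs additional TLS/auth settings omitted here):

    # Prometheus scrape configuration: pull metrics from targets discovered
    # through the Kubernetes API, at both pod and node level.
    scrape_configs:
    - job_name: kubernetes-pods      # illustrative job name
      kubernetes_sd_configs:
      - role: pod                    # discover every pod as a scrape target
    - job_name: kubernetes-nodes
      kubernetes_sd_configs:
      - role: node                   # node-level metrics (e.g. kubelet)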

Grafana

Grafana is an open-source metric analytics and visualization suite. It is most commonly used for visualizing time-series data for infrastructure and application analytics, but many use it in other domains, including industrial sensors, home automation, weather, and process control.

Client's Perspective