Governance in the Cloud World

David Das Neves
Cloud Computing
October 1, 2020

More and more customers are migrating large parts of their existing IT infrastructure to the Cloud. As a result, more and more challenges are becoming visible.

For example, misconfigurations.

Misconfigurations are the number one issue with present Cloud implementations and are usually followed by significant data breaches, which you can read about frequently in the news.

But is this really necessary? 

Before we directly jump into the recommended approach for Cloud Governance, we will dive into the different areas of Governance to get a common understanding of what Governance is: 

[Image: overview of the Governance areas]

Take some time now to think about the areas and try to find answers to the following questions:

  • Did you already consider all of the mentioned areas?
  • Can you find good reasons why you would like to implement each of them?
  • Can you name some good examples for each area?
  • Do you already know how you want to address these examples?

But - Automation as a Governance area?

I have added Automation as a Governance area on purpose. When you look at Cloud-Native implementations and increasing Cloud Maturity, you will keep hearing about topics like DevOps, Infrastructure as Code, Release Pipelines, Immutable Infrastructure, and Automation. The area "Automation" includes all of these and is, therefore, a Governance area that you should aim for and enforce. There is almost no added value in treating the Cloud as a datacenter extension and focusing only on IaaS. You will need to increase your maturity over time to benefit from what the Cloud offers, and that is why Automation should be part of your Governance strategy from the moment you start to establish it.

In the next table, you can find the typical motivations, as well as the primary focus scope for each area.

[Image: table of typical motivations and primary focus scope for each Governance area]

A typical mistake I see is that some of those areas are entirely or partially missed. The reason is very often a lack of role diversity in the leading/Architecture teams.

A good recommendation is to run cross-functional ideation sessions to identify all tasks and requirements for all compliance areas. It is essential to have a good understanding and a broad overview of each of those areas, but also to understand the opportunities that could be leveraged, for example by increasing the level of Automation. After that, you will quickly recognize that the complexity of the whole field is relatively high and that there are far too many topics to address at once.

But do you need to have the whole Governance implemented from the very beginning?

I will say:

"No, but you should define a strategy to set up the right controls at the right time."

This means that you need to identify which Governance implementations are essential, which are supplemental, and whether they depend on each other. As a result, you should come up with a roadmap, e.g., starting with urgent Technical Security controls, Standardization, and, if needed, Regulatory Compliance, and then subsequently addressing all other areas based on demand.
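As a minimal sketch of such a roadmap, the following Python snippet orders controls so that dependencies come first and essential controls are listed first within each phase. The control names, the essential/supplemental flags, and the dependencies are all hypothetical and only serve as an illustration; your own list will look different.

```python
from graphlib import TopologicalSorter

# Hypothetical controls: name -> (is_essential, controls it depends on)
CONTROLS = {
    "naming-convention": (True, set()),
    "required-tags": (True, {"naming-convention"}),
    "deny-public-storage": (True, set()),
    "cost-reports": (False, {"required-tags"}),
    "dr-templates": (False, set()),
}


def build_roadmap(controls):
    """Group controls into phases.

    A control lands in the earliest phase in which all of its dependencies
    are already done; within a phase, essential controls are listed first.
    """
    sorter = TopologicalSorter({name: deps for name, (_, deps) in controls.items()})
    sorter.prepare()
    phases = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda n: (not controls[n][0], n))
        phases.append(ready)
        sorter.done(*ready)
    return phases


for number, phase in enumerate(build_roadmap(CONTROLS), start=1):
    print(f"Phase {number}: {', '.join(phase)}")
```

The phases only express what could start when. Deciding what to defer "based on demand" remains a planning decision that the sketch does not make for you.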

But what is Governance exactly?

Establishing Governance is nothing less than defining a number of quality gates. These quality gates are either strict or informal and can be categorized into the following four types:

[Image: the four types of quality gates]

The main idea is to define a Governance lifecycle and capture all stages:

  1. Proactive Governance catches non-compliant configurations before implementation. It is a quality gate that forces your users and teams to adhere to defined standards and blocks non-compliant settings technically or procedurally. e.g., input validations, deny policies, privileged rights request process, guidelines
  2. Implemented Governance prevents misconfigurations by using Automation or templates that already integrate the centrally managed Governance requirements. e.g., Cloud policies with modify actions, automated deployments, solution templates, configuration baseline, DR guidance
  3. Continuous Governance, as the name suggests, continuously checks various settings and definitions and prevents wrong configurations after the initial setup. e.g., policies, quality gates in the release pipelines such as unit tests and best practice analysers, DSC, Monitoring and Alerting
  4. Reactive Governance tries to identify non-compliant states that were not caught by the previous quality gates and either highlights the issues or fixes them automatically. e.g., frequent pentests or scans, regular reports
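To make the first type more tangible, here is a minimal sketch of a proactive quality gate in Python. It rejects a resource request before deployment if required tags are missing or the name violates a naming convention. The tag names, the naming pattern, and the `ResourceRequest` type are assumptions made up for this illustration and do not correspond to any specific Cloud provider's policy engine.

```python
import re
from dataclasses import dataclass, field

# Hypothetical rules, chosen only for illustration.
REQUIRED_TAGS = {"owner", "cost-center"}
NAME_PATTERN = re.compile(r"^[a-z]+-(dev|test|prod)-[a-z0-9]+$")


@dataclass
class ResourceRequest:
    name: str
    tags: dict = field(default_factory=dict)


def proactive_gate(request):
    """Return a list of violations; an empty list means the request may proceed."""
    violations = []
    missing = REQUIRED_TAGS - set(request.tags)
    if missing:
        violations.append(f"missing required tags: {sorted(missing)}")
    if not NAME_PATTERN.fullmatch(request.name):
        violations.append(f"name '{request.name}' violates the naming convention")
    return violations


def deploy(request):
    """Block the deployment (deny) as soon as the gate reports a violation."""
    violations = proactive_gate(request)
    if violations:
        raise ValueError("; ".join(violations))
    return f"deployed {request.name}"
```

The same `proactive_gate` function could also be run against already deployed resources, which would turn it into a reactive check. That is one way to keep all four types aligned on a single rule definition.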

The higher your Governance maturity is from the very beginning, the less friction you will have with newly and manually created solutions/resources afterward. However, you should always cross-check that you are not blocking product teams with some of those rules and delaying their ramp-up.

But what actually is Governance Maturity?

Governance Maturity

You will start your Governance path with many blind spots, not much Automation, and a lack of technical quality gates. Applied to our previous model, this adds one additional type, which I call "Missed". The reactive approach means that you search for and find issues on a regular basis. The "missed" ones, however, are those that you did not even consider as requirements and would not always identify with the currently established reactive approaches. You will always have some missed topics that somehow come up reactively, but their number and frequency should decrease significantly over time.

Let us have a look at an exemplary evolution of Governance Maturity:

Low Maturity:

You start with some well-defined rules like required tags and naming conventions and build up some shared services for Networking, but the majority of topics are missed or come up on a reactive basis. Your security team complains about insecure workloads and privileged rights issues that seem a bit out of control.

Medium Maturity:

Over time, you start to build a Policy-Framework, increase the level of Automation, and create documentation, guidelines, and templates that are used by more and more teams. There are still far too many topics that are highlighted on a reactive basis. For example, your security team now complains regularly about mostly minor but sometimes still significant issues. The significant issues come up less frequently, but more often than you would like. Topics like Cost Management now come up repeatedly, as the costs are increasing rapidly. Also, more and more teams are evaluating whether to bring PII or other sensitive data to the Cloud and ask for more robust Regulatory Compliance controls.

High Maturity:

The majority of all Governance areas are captured across the whole lifecycle. Violations of the essential rules are blocked directly, and these rules are continuously revalidated. The level of Automation is high, with most of the other requirements implemented directly in it. You have a good resource base with standard patterns and a knowledge base with guidelines and templates that are approved by the necessary teams and ready for use by your product teams. In addition, you have Monitoring and Alerting established to identify any deviations almost instantly. And even if your rules do not catch all deviations, you have a strategy to frequently run reactive scans and reports to identify Governance violations very quickly.

So, this was one example of how maturity can evolve. Let us now break this down into the key tasks to reach a mature Governance strategy:

  • Increase proactive and continuous implementations to mitigate deviations from the very beginning.

Make use of policies, desired state implementations, Monitoring, and Alerting, and prefer continuously checking policies over single quality gates. A good recommendation is to define a technical Policy-Framework. Your documentation and guidelines should be centralized and well communicated. Usually, proactive rules are hard requirements that should not be circumvented by anyone. Don't set up too many strict rules, and leave your development teams room for flexibility.

  • Increase the implemented Governance by using templates and an increasing level of Automation. IT should become a partner for your business and enable it for a fast and secure ramp-up instead of being a service provider with strict rules.

Processes and patterns (e.g., for DR) should be standardized and easily reusable. You have centralized repositories available with these templates, including documentation, ready to use for your product teams. By doing so, you can ramp up the teams quickly and securely and decrease the overhead of redesigning similar requirements from scratch. Repetitive standard requests are fully automated and implemented in the Cloud Architecture.

  • Run reactive checks frequently and try to transfer the findings into the other areas to prevent the same issues from happening in the future.

You have a detailed schedule that defines when to run which reactive tasks to validate the current implementation. Dedicated people own reactive tasks, and transparency is recognized as something positive in your environment. After identifying issues, you check whether the findings can be covered by non-reactive approaches to prevent these deviations from recurring.
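The desired state checks mentioned in the tasks above can be reduced to one simple idea: compare a centrally defined baseline with what is actually configured and report every deviation. The following Python sketch shows this under the assumption that both the baseline and the actual configuration are available as plain dictionaries; the setting names are hypothetical.

```python
def detect_drift(desired, actual):
    """Return {setting: (desired_value, actual_value)} for every deviating setting.

    Settings that exist only in `actual` are ignored, because the baseline
    makes no statement about them.
    """
    return {
        key: (value, actual.get(key))
        for key, value in desired.items()
        if actual.get(key) != value
    }


# Hypothetical baseline and current state of one resource.
baseline = {"https_only": True, "min_tls": "1.2"}
current = {"https_only": False, "min_tls": "1.2", "sku": "standard"}

for setting, (want, have) in detect_drift(baseline, current).items():
    print(f"{setting}: expected {want!r}, found {have!r}")
```

Run on a schedule, a check like this is reactive. Run on every change, it becomes continuous. If the same baseline is also used to reject deployments up front, the finding has been transferred into a proactive rule.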

And that's it. Now it is up to you to implement it. Get an overview of all necessary tasks and controls, classify them as proactive, continuous, implemented, or reactive, and plan when to implement which of them. And don't miss too many! ;-)

I hope this helped, and I am happy to hear your feedback!

All the best,

David das Neves

Is it feasible to run HPC in the cloud? How different is it from running a local HPC cluster? What are some of the common alternatives for running HPC in the cloud?

Introduction

Before beginning our discussion about HPC (High Performance Computing) in the cloud, let us talk about what HPC really means.

"High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering, or business." (https://www.usgs.gov/core-science-systems/sas/arc/about/what-high-performance-computing)

 

In more technical terms – it refers to a cluster of machines composed of multiple cores (either physical or virtual cores), a lot of memory, fast parallel storage (for read/write) and fast network connectivity between cluster nodes.

 

HPC is useful when you need a lot of compute resources, from image or video rendering (in batch mode) to weather forecasting (which requires fast connectivity between the cluster nodes).

 

The world of HPC is divided into two categories:

·        Loosely coupled – In this scenario you might need a lot of compute resources; however, each task can run in parallel and does not depend on other tasks being completed.

Common examples of loosely coupled scenarios: Image processing, genomic analysis, etc.

·        Tightly coupled – In this scenario you need fast connectivity between cluster resources (such as memory and CPU), and each cluster node depends on other nodes for the completion of the task.

Common examples of tightly coupled scenarios: Computational fluid dynamics, weather prediction, etc.
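To illustrate the loosely coupled case, here is a minimal Python sketch in which every task depends only on its own input, so the tasks can be distributed freely and run in any order. The worker threads merely stand in for cluster nodes, and `process_tile` is a made-up placeholder for a real unit of work such as rendering one image tile.

```python
from concurrent.futures import ThreadPoolExecutor


def process_tile(tile_id):
    """Placeholder for an independent unit of work (hypothetical workload)."""
    return sum(i * i for i in range(tile_id * 1000))


def run_loosely_coupled(tile_ids, workers=4):
    # No task needs the result of another task, so no communication between
    # workers is required. map() returns the results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_tile, tile_ids))


results = run_loosely_coupled(range(8))
print(len(results), "tiles processed")
```

A tightly coupled job does not fit this pattern, because its nodes have to exchange intermediate results while the computation is running. That is where the fast interconnect between the cluster nodes becomes the deciding factor.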

 

Pricing considerations

Deploying an HPC cluster on premise requires significant resources. This includes a large investment in hardware (multiple machines connected in the cluster, with many CPUs or GPUs, with parallel storage and sometimes even RDMA connectivity between the cluster nodes), manpower with the knowledge to support the platform, a lot of electric power, and more.

 

Deploying an HPC cluster in the cloud is also costly. The price of a virtual machine with multiple CPUs, GPUs or large amounts of RAM can be very high, as compared to purchasing the same hardware on premise and using it 24x7 for 3-5 years.

The cost of parallel storage, as compared to other types of storage, is another consideration.
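The comparison between both options comes down to simple arithmetic, sketched below in Python. The function computes how busy a cluster would need to be over its lifetime before owning the hardware becomes cheaper than renting equivalent capacity by the hour. All figures in the example call are invented for illustration and are not real prices; storage, networking, and staffing details are deliberately left out of this simplified model.

```python
HOURS_PER_YEAR = 24 * 365


def on_prem_cost(hardware, yearly_operations, years):
    """Total cost of ownership: one-time hardware plus yearly power, staff, etc."""
    return hardware + yearly_operations * years


def cloud_cost(hourly_rate, hours_used):
    """Pay-per-use cost: you only pay for the hours the cluster exists."""
    return hourly_rate * hours_used


def break_even_utilization(hardware, yearly_operations, years, hourly_rate):
    """Fraction of the lifetime the cluster must be busy for on premise to be cheaper."""
    lifetime_hours = HOURS_PER_YEAR * years
    return on_prem_cost(hardware, yearly_operations, years) / cloud_cost(hourly_rate, lifetime_hours)


# Purely hypothetical numbers.
utilization = break_even_utilization(
    hardware=200_000, yearly_operations=40_000, years=4, hourly_rate=30.0
)
print(f"On premise pays off above {utilization:.0%} utilization")
```

If your real utilization stays below the break-even value, paying only for the hours you use is the cheaper option in this model.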

 

The magic formula is to build dynamic clusters: run HPC in the cloud only for as long as a job needs it and still benefit from (virtually) unlimited compute, memory, and storage resources.

We do this by building the cluster for a specific job, according to the customer’s requirements (in terms of number of CPUs, amount of RAM, storage capacity size, network connectivity between the cluster nodes, required software, etc.). Once the job is completed, we copy the job output data and take down the entire HPC cluster in order to avoid unnecessary hardware cost.
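The lifecycle described above (build the cluster, run the job, copy the output, take the cluster down) can be sketched in Python with a context manager that guarantees the teardown even if the job fails. `FakeCloud` is an in-memory stand-in invented for this example; a real implementation would call the provisioning tooling of your cloud provider instead.

```python
from contextlib import contextmanager


class FakeCloud:
    """In-memory stand-in for a cloud provider API (hypothetical)."""

    def __init__(self):
        self.running_nodes = 0
        self.log = []

    def create_nodes(self, count):
        self.running_nodes += count
        self.log.append(f"create {count}")

    def delete_nodes(self, count):
        self.running_nodes -= count
        self.log.append(f"delete {count}")


@contextmanager
def ephemeral_cluster(cloud, node_count):
    cloud.create_nodes(node_count)
    try:
        yield cloud
    finally:
        # Always tear down, even if the job raised, so no idle nodes are billed.
        cloud.delete_nodes(node_count)


def run_job(cloud, node_count, job):
    with ephemeral_cluster(cloud, node_count) as cluster:
        output = job(cluster)   # run the workload on the cluster
        saved = dict(output)    # copy the results out before the teardown
    return saved
```

For example, `run_job(FakeCloud(), 8, lambda c: {"nodes_seen": c.running_nodes})` returns the job output while leaving no nodes running afterwards.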

 

Alternatives for running HPC in the cloud

Summary

As you can see, running HPC in the public cloud is a viable option. But you need to carefully plan the specific solution, after gathering the customer’s exact requirements in terms of required compute resources, required software and of course budget estimation.

 

Product documentation

·        Azure Batch

https://azure.microsoft.com/en-us/services/batch/

·        Azure CycleCloud

https://azure.microsoft.com/en-us/features/azure-cyclecloud/

·        AWS ParallelCluster

https://aws.amazon.com/hpc/parallelcluster/

·        Slurm on Google Cloud Platform

https://github.com/SchedMD/slurm-gcp

·        HPC on Oracle Cloud Infrastructure

https://www.oracle.com/cloud/solutions/hpc.html
