Published on March 28, 2023
Imagine this: an organisation’s ERP system hosted on a managed Kubernetes cluster (GKE) on GCP, using Active Directory (AD) from Microsoft Azure and ingesting data into Snowflake using StreamSets. That data is then prepared for training, testing, and validating an ML model for revenue predictions. Cool, right? It’s amazing to be able to pick from the services available across the various cloud platforms and simply choose the best one for each use case. So it’s not surprising that more and more CTOs and other IT decision-makers are expected to pursue multi-cloud strategies in the near future to reap those benefits.
But of course, decision-makers have apprehensions: not only that costs will spiral, but also about maintaining visibility over a solution stitched together from services offered by multiple cloud providers. This Forbes piece on multi-cloud visibility, security, and governance challenges does a great job of explaining the issue succinctly.
The major cloud providers all offer cost management and visibility services, but unfortunately each is currently geared only towards that provider's own services. For instance, Amazon Web Services offers AWS Cloud Financial Management, a suite of tools for planning, monitoring, and reporting on cloud expenditure. Google Cloud and Microsoft Azure provide similar capabilities through their own billing and cost management tooling.
For a business adopting multi-cloud, where many services and tools must interact seamlessly to deliver value, cost management and visibility can therefore add a great deal of complexity: the company needs to establish separate budgets, monitoring, and alerting mechanisms for each provider.
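As a flavour of what that per-provider setup involves, here is a minimal, illustrative sketch of a scripted budget alert on GCP, using the Terraform google provider's google_billing_budget resource; the account, project, amount, and threshold values are all placeholder assumptions rather than recommendations.

# Illustrative sketch: a monthly GCP budget that alerts billing admins at 90% of spend
resource "google_billing_budget" "monthly_budget" {
  billing_account = "<your billing account ID>"
  display_name    = "multi-cloud-demo-budget"

  budget_filter {
    # Scope the budget to a single project (expects the project number, not the project ID)
    projects = ["projects/<your project number>"]
  }

  amount {
    specified_amount {
      currency_code = "GBP"
      units         = "1000"   # illustrative monthly limit
    }
  }

  threshold_rules {
    threshold_percent = 0.9    # notify at 90% of the budgeted amount
  }
}

The catch, of course, is that AWS and Azure would each need their own equivalent of this, written against their own billing models and tooling.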
Due to this added complexity, many firms are hesitant to embrace cloud computing, let alone deploy multi-cloud solutions.
SO HOW CAN WE LOOK AT LOWERING COSTS?
Unexpected costs in the cloud can often be traced back to unnecessary or underutilised resources: idle compute, underused virtual private clouds (VPCs), and resources that were never cleaned up. Choosing the right set of resources for the task at hand helps avoid these costs, and in the cloud there are usually several ways to accomplish the same task.

For example, Apache Airflow can orchestrate the deployment, execution, and monitoring of a machine learning workflow in the cloud, using the platform's compute services to preprocess data (with EMR, Dataproc, or Databricks, for instance), train and validate a model, deploy it, and monitor and visualise its performance. Cloud-specific services such as AWS SageMaker, Azure ML, and Vertex AI on GCP can also be used for these tasks. SageMaker, for instance, provides a range of fully managed tools and libraries for preprocessing data, training ML models, deploying them, and monitoring their performance, and the process can be orchestrated with Lambda and Step Functions. GCP offers similar capabilities through Dataproc, Cloud Run, and Cloud Functions (alongside open-source tools such as MLflow), while Azure offers comparable resources through its Data and ML pipelines.
Regardless of the approach taken, keeping tabs on resource creation, updates across all resources, and the deletion and replacement of resources is crucial.
A FEW COST-SAVING HACKS
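To give one illustrative example (a sketch rather than the full list): compute that cleans itself up. The Terraform google provider lets you attach a lifecycle_config to a Dataproc cluster so it is deleted automatically after a period of inactivity; the cluster name, region, and TTL below are placeholder assumptions.

# Illustrative sketch: a Dataproc cluster that deletes itself after an hour of inactivity,
# so a forgotten cluster doesn't keep billing for compute nobody is using
resource "google_dataproc_cluster" "ephemeral" {
  name   = "ephemeral-ml-prep"
  region = "us-central1"

  cluster_config {
    lifecycle_config {
      idle_delete_ttl = "3600s"   # tear the cluster down after 60 idle minutes
    }
  }
}

The same principle, making clean-up the default rather than an afterthought, applies equally to the idle compute, underused VPCs, and leftover resources mentioned above.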
THE ANSWER? A SKILLED ENGINEER
A multi-cloud approach is becoming increasingly necessary as we see more and more applications designed to take advantage of the variety of resources out there. Businesses simply need to have skilled engineers on hand to help them navigate the complexities of cloud engineering. A smart engineer will be able to suggest cost-saving tools and put strategies in place to help make the most of the incredible technical benefits of the cloud. The more clouds the merrier, I say.
SOME THINGS TO HELP YOU ON YOUR WAY!
Using Terraform and the HashiCorp Configuration Language (HCL), you can manage a Dataproc cluster on GCP with just a few commands: create, update, or delete the cluster and its resources without any manual actions in the GCP console. This streamlines resource management, reduces the need for manual intervention, improves efficiency, lowers the risk of errors, and cuts down on engineering toil.
First, create a working directory for the project and move into it:

mkdir -p ~/code/github/<github username>/dataproc-terraform-example
cd ~/code/github/<github username>/dataproc-terraform-example
In provider.tf, pin the Google provider and point it at your project and credentials:

/*
Setting up the provider
*/
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.33.0"
    }
  }
}

provider "google" {
  credentials = file(var.credentials)
  project     = var.project
  region      = var.region
  zone        = var.zone
}
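The credentials variable should point at a service account key file for an account with permission to create Dataproc and Compute resources in your project. If you work with key files, one way to export a key for an existing service account is with gcloud; the account name and output path below are placeholders.

# Export a JSON key for an existing service account (placeholder names)
gcloud iam service-accounts keys create ~/keys/terraform-sa.json \
  --iam-account=<service account name>@<project id>.iam.gserviceaccount.com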
The cluster itself is defined in main.tf:

/*
Create the Dataproc cluster
*/
resource "google_dataproc_cluster" "dataproc-terraform" {
  name    = var.cluster_name
  region  = var.region
  project = var.project

  cluster_config {
    # For adding metadata, initialization actions, and configs that apply to all instances in the cluster
    gce_cluster_config {
      zone = var.zone
    }

    # Allow HTTP port access to components inside the cluster
    endpoint_config {
      enable_http_port_access = "true"
    }

    # Configure the master nodes
    master_config {
      num_instances = var.master_num_instances
      machine_type  = var.master_machine_type
      disk_config {
        boot_disk_size_gb = var.master_disk_size
      }
    }

    # Configure the worker nodes
    worker_config {
      num_instances = var.node_num_instances
      machine_type  = var.node_machine_type
      disk_config {
        boot_disk_size_gb = var.node_disk_size
        num_local_ssds    = var.node_num_local_ssds
      }
    }

    software_config {
      override_properties = {
        # Add the spark-bigquery connector to save data in the BigQuery data warehouse
        "spark:spark.jars.packages"            = "com.google.cloud.spark:spark-3.1-bigquery:0.26.0-preview"
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
      # Add optional components like Zeppelin, Jupyter, Druid, HBase, etc.
      optional_components = var.additional_components
    }
  }
}
1variable "project" { 2 type = string 3 description = "The project indicates the default GCP project all of your resources will be created in." 4} 5variable "region" { 6 type = string 7 description = "The region will be used to choose the default location for regional resources. Regional resources are spread across several zones." 8} 9variable "zone" { 10 type = string 11 description = "The zone will be used to choose the default location for zonal resources. Zonal resources exist in a single zone. All zones are a part of a region." 12} 13variable "cluster_name" { 14 type = string 15 description = "cluster name" 16} 17variable "master_machine_type" { 18 type = string 19 description = "The compute type(CPU+Memory+etc.) to assign to the master)" 20} 21variable "node_machine_type" { 22 type = string 23 description = "The compute type(CPU+Memory+etc.) to assign to the master)" 24} 25variable "credentials" { 26 type = string 27 description = "The path to the credentials" 28 sensitive = true 29} 30variable "additional_components" { 31 type = list(string) 32 description = "Additional Components like Zeppelin, Hive etc." 33} 34variable "node_num_instances" { 35 type = number 36 description = "The number of worker instances in the cluster" 37} 38variable "master_num_instances" { 39 type = number 40 description = "The number of master instances in the cluster" 41} 42variable "master_disk_size" { 43 type = number 44 description = "The boot disk size of the master node" 45} 46variable "node_disk_size" { 47 type = number 48 description = "The boot disk size of the worker nodes" 49} 50variable "node_num_local_ssds" { 51 type = number 52 description = "This can help in temporary storing any data to disk locally" 53} 54
1output "jupyter_url" { 2 value = google_dataproc_cluster.dataproc-terraform.cluster_config[0].endpoint_config[0].http_ports["Jupyter"] 3}
1project = "<locate this on your GCP console>" 2region = "<choose a region>" eg. "us-central1" 3zone = "<choose a zone>" "us-central1-a" 4cluster_name = "<name of the cluster>" eg. test-dataproc" 5master_machine_type = "n1-standard-2" 6node_machine_type = "n1-standard-2" 7credentials = "<the path to your credentials file>" 8additional_components = ["JUPYTER"] 9node_num_instances = 2 10master_num_instances = 1 11master_disk_size = 30 12node_disk_size = 50 13node_num_local_ssds = 0
At this point the project directory should look like this:

.
└── dataproc-terraform-example
    ├── main.tf
    ├── outputs.tf
    ├── provider.tf
    ├── terraform.tfvars
    └── variables.tf
With the configuration in place, initialise the working directory, generate a plan, and apply it:

terraform init

terraform plan -out latest-plan.tfplan

terraform apply "latest-plan.tfplan"
std-logic@dev-01:~/code/github/aastom/dataproc-terraform-example$ terraform apply "latest-plan.tfplan"
google_dataproc_cluster.dataproc-terraform: Creating...
google_dataproc_cluster.dataproc-terraform: Still creating... [10s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [20s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [30s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [40s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [50s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [1m0s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [1m10s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [1m20s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [1m30s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [1m40s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [1m50s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [2m0s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [2m10s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [2m20s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [2m30s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [2m40s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [2m50s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [3m0s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [3m10s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [3m20s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [3m30s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [3m40s elapsed]
google_dataproc_cluster.dataproc-terraform: Still creating... [3m50s elapsed]
google_dataproc_cluster.dataproc-terraform: Creation complete after 3m52s [id=projects/spark-372815/regions/us-central1/clusters/test-dataproc]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

jupyter_url = "https://studup2gzreqlikar6m3jeuyoa-dot-us-central1.dataproc.googleusercontent.com/gateway/default/jupyter/"
std-logic@dev-01:~/code/github/aastom/dataproc-terraform-example$
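The jupyter_url output is the component gateway link to the cluster's Jupyter UI; you can print it again at any time without re-running the apply:

terraform output jupyter_url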
When you've finished with the cluster, tear everything down so it doesn't sit idle and accrue costs:

terraform destroy
To find out how bigspark can help your organisation successfully manage a multi-cloud approach, get in touch today.