Data engineering in the cloud

With the rise of cloud services, who are the major providers and what are their tools?

Posted by Allan Situma on June 10, 2020 · 6 mins read
"Cloud computing is the on-demand availability of computer system resources, especially data storage and computing power, without direct active management by the user." (Wikipedia)

The public cloud lets users gain new capabilities without investing heavily in hardware or software. Cloud computing has grown steadily in recent years as more companies move from on-premises infrastructure to the cloud in order to reduce initial capital costs. This article looks at the rise of cloud computing in relation to data engineering tasks.

Agenda

  • Basics of cloud computing
  • Top 3 cloud providers
  • Google Cloud Platform
  • Amazon Web Services
  • Microsoft Azure

Basics of Cloud Computing

Cloud computing is the delivery of computing services over the internet ("the cloud"), including storage, processing power, databases, networking, software, and analytics. Instead of owning and maintaining physical data centers and servers, businesses and individuals can rent computing resources on demand from a cloud provider. This model offers several benefits: lower upfront costs, scalability, flexibility, and access to a wide range of services and tools.
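
To make the cost argument concrete, here is a minimal sketch that compares buying a server upfront with renting capacity by the hour. Every figure in it (purchase price, monthly running cost, hourly rate) is a hypothetical placeholder, not real provider pricing, and the function names are made up for illustration.

```python
from typing import Optional

# Hypothetical figures for illustration only; not real provider pricing.
ON_PREM_UPFRONT = 12_000.0  # assumed one-time hardware purchase (USD)
ON_PREM_MONTHLY = 150.0     # assumed power, space, and maintenance per month (USD)
CLOUD_HOURLY = 0.50         # assumed on-demand rate per server-hour (USD)


def on_prem_cost(months: int) -> float:
    """Total cost of owning the server for the given number of months."""
    return ON_PREM_UPFRONT + ON_PREM_MONTHLY * months


def cloud_cost(months: int, hours_per_month: float) -> float:
    """Total cost of renting capacity only for the hours actually used."""
    return CLOUD_HOURLY * hours_per_month * months


def break_even_month(hours_per_month: float, horizon: int = 120) -> Optional[int]:
    """First month in which renting has cost more than owning.

    Returns None if renting stays cheaper for the whole horizon.
    """
    for month in range(1, horizon + 1):
        if cloud_cost(month, hours_per_month) > on_prem_cost(month):
            return month
    return None
```

Under these assumed numbers, a workload that runs only part of each month may never reach the break-even point, while an always-on workload eventually does. Which side wins in practice depends entirely on real prices and usage patterns.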

Top 3 Cloud Providers

Many companies offer cloud services, but three are most often treated as the market leaders: Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure. Each of these providers offers a comprehensive suite of cloud services, including computing, storage, databases, machine learning, and more.

Google Cloud Platform (GCP)

Google Cloud Platform is a suite of cloud computing services provided by Google. It offers a range of services for compute, storage, machine learning, and networking, along with tools for data analytics, AI, and the Internet of Things (IoT). Key features include:

  • Compute Engine: Virtual machines running in Google's data centers.
  • Cloud Storage: Scalable and secure object storage.
  • BigQuery: A serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility.
  • App Engine: A fully managed platform for building and deploying applications.
  • Kubernetes Engine: Managed Kubernetes service for running containerized applications.

Data Infrastructure Tools

  • BigQuery: A powerful data warehousing solution for running fast SQL queries on large datasets.
  • Cloud Dataflow: A unified stream and batch data processing service.
  • Cloud Dataproc: A fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters.
  • Cloud Composer: A managed workflow orchestration service built on Apache Airflow.
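
Workflow orchestration, as offered by Cloud Composer through Apache Airflow, comes down to running tasks in an order that respects their dependencies. The sketch below illustrates that idea using only Python's standard library. It is not the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_tables"},
    "refresh_dashboard": {"load_warehouse"},
}


def run_order(dag: dict) -> list:
    """Return one valid execution order where every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step, task in enumerate(run_order(PIPELINE), start=1):
        print(f"{step}. {task}")
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, but the dependency graph is the core of the model.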

Amazon Web Services (AWS)

Amazon Web Services describes itself as the most comprehensive and widely adopted cloud platform, offering over 200 fully featured services from data centers around the world. AWS provides services for computing, storage, databases, analytics, networking, mobile, developer tools, management tools, IoT, security, and enterprise applications. Key features include:

  • EC2 (Elastic Compute Cloud): Scalable virtual servers for compute capacity.
  • S3 (Simple Storage Service): Object storage built to store and retrieve any amount of data from anywhere.
  • RDS (Relational Database Service): Managed relational database services for MySQL, PostgreSQL, MariaDB, Oracle, and SQL Server.
  • Lambda: Run code without provisioning or managing servers.
  • Redshift: A fully managed data warehouse service.

Data Infrastructure Tools

  • Redshift: A fast, scalable data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and existing business intelligence tools.
  • Glue: A fully managed ETL service that makes it easy to prepare and load data for analytics.
  • Data Pipeline: A web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.
  • EMR (Elastic MapReduce): A cloud big data platform for processing vast amounts of data using open-source tools such as Apache Hadoop, Spark, and HBase.
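
The extract, transform, load (ETL) pattern behind a service like Glue, followed by SQL analysis of the kind a warehouse like Redshift supports, can be sketched locally. This example uses Python's built-in csv and sqlite3 modules as stand-ins; it does not call any AWS API, and the sample data, table name, and function names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might be a file in object storage.
RAW_CSV = """order_id,country,amount
1,ke,120.50
2,KE,79.50
3,ug,
4,UG,40.00
"""


def extract(text: str) -> list:
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list) -> list:
    """Transform: drop rows with no amount, upper-case country codes, cast types."""
    return [
        (int(r["order_id"]), r["country"].upper(), float(r["amount"]))
        for r in rows
        if r["amount"]
    ]


def load(conn: sqlite3.Connection, rows: list) -> None:
    """Load: write the cleaned rows into a table ready for SQL analysis."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def revenue_by_country(conn: sqlite3.Connection) -> list:
    """Analyze: total order amount per country using plain SQL."""
    query = "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # in-memory database as a stand-in warehouse
    load(conn, transform(extract(RAW_CSV)))
    print(revenue_by_country(conn))
```

The managed services take over the parts this sketch ignores, such as scaling to large datasets, scheduling, and schema discovery.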

Microsoft Azure

Microsoft Azure is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers. Azure offers a wide range of services, including those for computing, analytics, storage, and networking. Key features include:

  • Azure Virtual Machines: On-demand scalable computing resources.
  • Azure Blob Storage: Massively scalable object storage for any type of unstructured data.
  • Azure SQL Database: Fully managed relational database with built-in intelligence.
  • Azure Functions: Event-driven serverless compute service.
  • Azure Kubernetes Service (AKS): Managed Kubernetes service for deploying and managing containerized applications.

Data Infrastructure Tools

  • Azure Synapse Analytics: An analytics service that brings together data integration, enterprise data warehousing, and big data analytics.
  • Azure Data Factory: A cloud-based data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale.
  • Azure Databricks: A fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure.
  • Azure HDInsight: A fully managed cloud service for running open-source analytics frameworks such as Apache Hadoop and Apache Spark.
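
As a quick recap, the data tools listed for the three providers can be lined up by the job they do. The grouping below is an assumption made for illustration: the services overlap and are not exact one-to-one equivalents, and the category labels and function name are my own.

```python
# Grouping is an assumption for illustration; the services overlap and are
# not exact one-to-one equivalents.
TOOLS = {
    "data warehouse": {
        "GCP": "BigQuery",
        "AWS": "Redshift",
        "Azure": "Azure Synapse Analytics",
    },
    "managed Spark/Hadoop": {
        "GCP": "Cloud Dataproc",
        "AWS": "EMR",
        "Azure": "Azure HDInsight",
    },
    "workflow orchestration": {
        "GCP": "Cloud Composer",
        "AWS": "Data Pipeline",
        "Azure": "Azure Data Factory",
    },
}


def tools_for(provider: str) -> dict:
    """Return {category: tool name} for one provider ("GCP", "AWS", or "Azure")."""
    return {category: by_provider[provider] for category, by_provider in TOOLS.items()}


if __name__ == "__main__":
    for category, tool in tools_for("GCP").items():
        print(f"{category}: {tool}")
```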

These cloud providers offer a robust and flexible environment to support the needs of modern businesses. They help organizations reduce costs, increase agility, and innovate faster by leveraging the latest cloud technologies.
