Data Engineering on Google Cloud Platform

GO-DEGCP 3.0 (Google) Objectives: <ul> <li>Design and build data processing systems on Google Cloud.</li><li>Process batch and streaming data by implementing autoscaling data pipelines on Dataflow.</li><li>Derive business insights from extremely large datasets using BigQuery.</li><li>Leverage unstructured data using Spark and ML APIs on Dataproc.</li><li>Enable instant insights from streaming data.</li></ul>Prerequisites: <ul> <li>Prior Google Cloud experience using Cloud Shell and accessing products from the Google Cloud console.</li><li>Basic proficiency with a common query language such as SQL.</li><li>Experience with data modeling and ETL (extract, transform, load) activities.</li><li>Experience developing applications using a common programming language such as Python.</li></ul>Audience: <ul> <li>Data engineers</li><li>Database administrators</li><li>System administrators</li></ul><h4>Module 01 - Data engineering tasks and components</h4> Topics: <ul> <li>The role of a data engineer</li><li>Data sources versus data sinks</li><li>Data formats</li><li>Storage solution options on Google Cloud</li><li>Metadata management options on Google Cloud</li><li>Share datasets using Analytics Hub</li></ul>Objectives: <ul> <li>Explain the role of a data engineer.</li><li>Understand the differences between a data source and a data sink.</li><li>Explain the different types of data formats.</li><li>Explain the storage solution options on Google Cloud.</li><li>Learn about the metadata management options on Google Cloud.</li><li>Understand how to share datasets with ease using Analytics Hub.</li><li>Understand how to load data into BigQuery using the Google Cloud console and/or the gcloud CLI.</li></ul>Activities: <ul> <li>Lab: Loading Data into BigQuery</li></ul><h4>Module 02 - Data replication and migration</h4> Topics: <ul> <li>Replication and migration architecture</li><li>The gcloud command line tool</li><li>Moving datasets</li><li>Datastream</li></ul>Objectives: <ul> 
<li>Explain the baseline Google Cloud data replication and migration architecture.</li><li>Understand the options and use cases for the gcloud command line tool.</li><li>Explain the functionality and use cases for the Storage Transfer Service.</li><li>Explain the functionality and use cases for the Transfer Appliance.</li><li>Understand the features and deployment of Datastream.</li></ul>Activities: <ul> <li>Lab: Datastream: PostgreSQL Replication to BigQuery</li></ul><h4>Module 03 - The extract and load data pipeline pattern</h4> Topics: <ul> <li>Extract and load architecture</li><li>The bq command line tool</li><li>BigQuery Data Transfer Service</li><li>BigLake</li></ul>Objectives: <ul> <li>Explain the baseline extract and load architecture diagram.</li><li>Understand the options of the bq command line tool.</li><li>Explain the functionality and use cases for the BigQuery Data Transfer Service.</li><li>Explain the functionality and use cases for BigLake as a non-extract-load pattern.</li></ul>Activities: <ul> <li>Lab: BigLake: Qwik Start</li></ul><h4>Module 04 - The extract, load, and transform data pipeline pattern</h4> Topics: <ul> <li>Extract, load, and transform (ELT) architecture</li><li>SQL scripting and scheduling with BigQuery</li><li>Dataform</li></ul>Objectives: <ul> <li>Explain the baseline extract, load, and transform architecture diagram.</li><li>Understand a common ELT pipeline on Google Cloud.</li><li>Learn about BigQuery’s SQL scripting and scheduling capabilities.</li><li>Explain the functionality and use cases for Dataform.</li></ul>Activities: <ul> <li>Lab: Create and Execute a SQL Workflow in Dataform</li></ul><h4>Module 05 - The extract, transform, and load data pipeline pattern</h4> Topics: <ul> <li>Extract, transform, and load (ETL) architecture</li><li>Google Cloud GUI tools for ETL data pipelines</li><li>Batch data processing using Dataproc</li><li>Streaming data processing options</li><li>Bigtable and data pipelines</li></ul>Objectives: 
<ul> <li>Explain the baseline extract, transform, and load architecture diagram.</li><li>Learn about the GUI tools on Google Cloud used for ETL data pipelines.</li><li>Explain batch data processing using Dataproc.</li><li>Learn to use Dataproc Serverless for Spark for ETL.</li><li>Explain streaming data processing options.</li><li>Explain the role Bigtable plays in data pipelines.</li></ul>Activities: <ul> <li>Lab: Use Dataproc Serverless for Spark to Load BigQuery</li><li>Lab: Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow</li></ul><h4>Module 06 - Automation techniques</h4> Topics: <ul> <li>Automation patterns and options for pipelines</li><li>Cloud Scheduler and Workflows</li><li>Cloud Composer</li><li>Cloud Run functions</li><li>Eventarc</li></ul>Objectives: <ul> <li>Explain the automation patterns and options available for pipelines.</li><li>Learn about Cloud Scheduler and Workflows.</li><li>Learn about Cloud Composer.</li><li>Learn about Cloud Run functions.</li><li>Explain the functionality and automation use cases for Eventarc.</li></ul>Activities: <ul> <li>Lab: Use Cloud Run Functions to Load BigQuery</li></ul><h4>Module 07 - Introduction to data engineering</h4> Topics: <ul> <li>Data engineer’s role</li><li>Data engineering challenges</li><li>Introduction to BigQuery</li><li>Data lakes and data warehouses</li><li>Transactional databases versus data warehouses</li><li>Effective partnership with other data teams</li><li>Management of data access and governance</li><li>Building of production-ready pipelines</li><li>Google Cloud customer case study</li></ul>Objectives: <ul> <li>Discuss the challenges of data engineering, and how building data pipelines in the cloud helps to address these.</li><li>Review and understand the purpose of a data lake versus a data warehouse, and when to use which.</li></ul>Activities: <ul> <li>Lab: Using BigQuery to Do Analysis</li></ul><h4>Module 08 - Build a Data Lake</h4> Topics: <ul> <li>Introduction 
to data lakes</li><li>Data storage and ETL options on Google Cloud</li><li>Building of a data lake using Cloud Storage</li><li>Secure Cloud Storage</li><li>Store all sorts of data types</li><li>Cloud SQL as your OLTP system</li></ul>Objectives: <ul> <li>Discuss why Cloud Storage is a great option for building a data lake on Google Cloud.</li><li>Explain how to use Cloud SQL for a relational data lake.</li></ul>Activities: <ul> <li>Lab: Loading Taxi Data into Cloud SQL</li></ul><h4>Module 09 - Build a data warehouse</h4> Topics: <ul> <li>The modern data warehouse</li><li>Introduction to BigQuery</li><li>Get started with BigQuery</li><li>Loading of data into BigQuery</li><li>Exploration of schemas</li><li>Schema design</li><li>Nested and repeated fields</li><li>Optimization with partitioning and clustering</li></ul>Objectives: <ul> <li>Discuss requirements of a modern warehouse.</li><li>Explain why BigQuery is the scalable data warehousing solution on Google Cloud.</li><li>Discuss the core concepts of BigQuery and review options of loading data into BigQuery.</li></ul>Activities: <ul> <li>Lab: Working with JSON and Array Data in BigQuery</li><li>Lab: Partitioned Tables in BigQuery</li></ul><h4>Module 10 - Introduction to building batch data pipelines</h4> Topics: <ul> <li>EL, ELT, ETL</li><li>Quality considerations</li><li>Ways of executing operations in BigQuery</li><li>Shortcomings</li><li>ETL to solve data quality issues</li></ul>Objectives: <ul> <li>Review different methods of loading data into your data lakes and warehouses: EL, ELT, and ETL.</li></ul><h4>Module 11 - Execute Spark on Dataproc</h4> Topics: <ul> <li>The Hadoop ecosystem</li><li>Run Hadoop on Dataproc</li><li>Cloud Storage instead of HDFS</li><li>Optimize Dataproc</li></ul>Objectives: <ul> <li>Review the Hadoop ecosystem.</li><li>Discuss how to lift and shift your existing Hadoop workloads to the cloud using Dataproc.</li><li>Explain when you would use Cloud Storage instead of HDFS 
storage.</li><li>Explain how to optimize Dataproc jobs.</li></ul>Activities: <ul> <li>Lab: Running Apache Spark Jobs on Dataproc</li></ul><h4>Module 12 - Serverless data processing with Dataflow</h4> Topics: <ul> <li>Introduction to Dataflow</li><li>Reasons why customers value Dataflow</li><li>Dataflow pipelines</li><li>Aggregating with GroupByKey and Combine</li><li>Side inputs and windows</li><li>Dataflow templates</li></ul>Objectives: <ul> <li>Identify features customers value in Dataflow.</li><li>Discuss core concepts in Dataflow.</li><li>Review the use of Dataflow templates and SQL.</li><li>Write a simple Dataflow pipeline and run it both locally and on the cloud.</li><li>Identify Map and Reduce operations, execute the pipeline, and use command line parameters.</li><li>Read data from BigQuery into Dataflow and use the output of a pipeline as a side-input to another pipeline.</li></ul>Activities: <ul> <li>Lab: A Simple Dataflow Pipeline (Python/Java)</li><li>Lab: MapReduce in Beam (Python/Java)</li><li>Lab: Side Inputs (Python/Java)</li></ul><h4>Module 13 - Manage data pipelines with Cloud Data Fusion and Cloud Composer</h4> Topics: <ul> <li>Build batch data pipelines visually with Cloud Data Fusion<ul> <li>Components</li><li>UI overview</li><li>Building a pipeline</li><li>Exploring data using Wrangler</li></ul></li><li>Orchestrate work between Google Cloud services with Cloud Composer<ul> <li>Apache Airflow environment</li><li>DAGs and operators</li><li>Workflow scheduling</li><li>Monitoring and logging</li></ul></li></ul>Objectives: <ul> <li>Discuss how to manage your data pipelines with Cloud Data Fusion and Cloud Composer.</li><li>Summarize how Cloud Data Fusion allows data analysts and ETL developers to wrangle data and build pipelines in a visual way.</li><li>Describe how Cloud Composer can help to orchestrate the work across multiple Google Cloud services.</li></ul>Activities: <ul> <li>Lab: Building and Executing a Pipeline Graph in Data 
Fusion</li><li>Lab: An Introduction to Cloud Composer</li></ul><h4>Module 14 - Introduction to processing streaming data</h4> Topics: <ul> <li>Process streaming data</li></ul>Objectives: <ul> <li>Explain streaming data processing.</li><li>Identify the Google Cloud products and tools that can help address streaming data challenges.</li></ul><h4>Module 15 - Serverless messaging with Pub/Sub</h4> Topics: <ul> <li>Introduction to Pub/Sub</li><li>Pub/Sub push versus pull</li><li>Publishing with Pub/Sub code</li></ul>Objectives: <ul> <li>Describe the Pub/Sub service.</li><li>Explain how Pub/Sub works.</li><li>Simulate real-time streaming sensor data using Pub/Sub.</li></ul>Activities: <ul> <li>Lab: Publish Streaming Data into Pub/Sub</li></ul><h4>Module 16 - Dataflow streaming features</h4> Topics: <ul> <li>Streaming data challenges</li><li>Dataflow windowing</li></ul>Objectives: <ul> <li>Describe the Dataflow service.</li><li>Build a stream processing pipeline for live traffic data.</li><li>Demonstrate how to handle late data using watermarks, triggers, and accumulation.</li></ul>Activities: <ul> <li>Lab: Streaming Data Pipelines</li></ul><h4>Module 17 - High-throughput BigQuery and Bigtable streaming features</h4> Topics: <ul> <li>Streaming into BigQuery and visualizing results</li><li>High-throughput streaming with Bigtable</li><li>Optimizing Bigtable performance</li></ul>Objectives: <ul> <li>Describe how to perform ad-hoc analysis on streaming data using BigQuery and dashboards.</li><li>Discuss Bigtable as a low-latency solution.</li><li>Describe how to architect for Bigtable and how to ingest data into Bigtable.</li><li>Highlight performance considerations for the relevant services.</li></ul>Activities: <ul> <li>Lab: Streaming Analytics and Dashboards</li><li>Lab: Generate Personalized Email Content with BigQuery Continuous Queries and Gemini</li><li>Lab: Streaming Data Pipelines into Bigtable</li></ul><h4>Module 18 - Advanced BigQuery functionality and 
performance</h4> Topics: <ul> <li>Analytic window functions</li><li>GIS functions</li><li>Performance considerations</li></ul>Objectives: <ul> <li>Review some of BigQuery’s advanced analysis capabilities.</li><li>Discuss ways to improve query performance.</li></ul>Activities: <ul> <li>Lab: Optimizing Your BigQuery Queries for Performance</li></ul>
Duration: 4 days