Modern Data Warehousing / Data Engineering with Databricks on AWS Hackathon (MDWDE-HACK-AWS)

 

Course Overview

This hackathon challenges attendees to design, implement, and optimize robust data processing and transformation pipelines on the Databricks platform within the Amazon Web Services (AWS) ecosystem. Participants will tackle real-world scenarios involving diverse data sources, focusing on efficient ingestion, sophisticated transformations, and preparing data for analytical consumption. The emphasis will be on leveraging Databricks' capabilities for scalable data manipulation and ensuring data quality.

By participating, attendees will gain hands-on expertise in building high-performance, maintainable data pipelines crucial for deriving actionable insights. This practical experience directly translates into increased proficiency in modern data engineering practices, enabling them to unlock the full potential of their data assets and drive data-driven decision-making within their organizations.

Who should attend

  • Data Engineers: This is the core target audience. The challenges directly align with their daily tasks, such as building and optimizing ETL/ELT pipelines, managing data lakes, and ensuring data quality.
  • Data Analysts: Those who are looking to move beyond simple analysis and gain a deeper understanding of the data pipeline "plumbing." This hackathon will help them understand how data is prepared, leading to more effective and robust analysis.
  • Data Scientists: While their primary focus is on modeling, many data scientists are involved in data preparation. This event will help them build more efficient pipelines for feature engineering and data preprocessing, which is a significant part of their workflow.

Prerequisites

To be successful and get the most out of the event, participants should have:

  • Relational database knowledge: Understanding of concepts like tables, joins, and SQL.
  • Programming experience: Proficiency in a programming language such as Python or Scala.
  • ETL pipeline concepts: Familiarity with data transformation and data pipeline concepts is recommended.
  • Cloud fundamentals: Familiarity with AWS is recommended.

Course Objectives

This hackathon embodies the modern approach to data processing: the Databricks platform, powered by the Apache Spark ecosystem and available on all major cloud hyperscalers. By the end of this hackathon, participants will gain practical skills and a deeper understanding of modern data engineering on Databricks within AWS, specifically to:

  • Master data ingestion from diverse sources and establish robust multi-layered Delta Lake architectures (Bronze, Silver, Gold).
  • Proficiently perform complex cleaning, standardization, and feature engineering using Spark to prepare high-quality data for analytical consumption.
  • Design and implement automated, incremental data loading and transformation jobs to ensure efficiency and data freshness.
  • Integrate robust error handling and basic telemetry to build reliable pipelines and effectively monitor their health and performance.

Outline: Modern Data Warehousing / Data Engineering with Databricks on AWS Hackathon (MDWDE-HACK-AWS)

This hackathon is structured as a progressive journey, designed to immerse participants in the practical application of cutting-edge data processing and transformation techniques on Databricks within the AWS ecosystem. The challenges participants are about to embark upon are interconnected and build directly upon one another. They will step into the shoes of a newly established digital department within a rapidly growing retail corporation, tasked with transforming raw data into valuable assets.

Participants will gradually gain access to the necessary Databricks features and AWS resources to address each challenge, fostering a hands-on learning environment. They will prepare to handle real-world data scenarios, from initial ingestion to complex transformations, culminating in reliable, analytical-ready data. Each step will reinforce their understanding and practical skills, equipping them to fully utilize Databricks for effective data solutions.

  • Challenge 1: Data Ingestion and Bronze Layer Creation
    • In this initial challenge, participants will focus on establishing the foundation of their data lake. They will ingest raw data from various AWS sources (e.g., S3, Kinesis) into Databricks. The primary goal is to land the data in its original, untransformed format, creating a "Bronze" layer in Delta Lake. This challenge emphasizes robust data loading and schema inference (see the Bronze ingestion sketch after this outline).
  • Challenge 2: Data Cleaning and Standardization (Silver Layer)
    • Building upon the Bronze layer, this challenge focuses on cleaning and standardizing the ingested data. Participants will identify and address common data quality issues such as missing values, inconsistencies, and incorrect data types. The cleaned and standardized data will form the "Silver" layer in Delta Lake, ready for more complex transformations (see the Silver cleaning sketch after this outline).
  • Challenge 3: Advanced Data Transformation and Harmonization (Gold Layer Refinement)
    • Building on the previous challenges, participants will now transform the standardized data from the Silver layer into a "Gold" layer that adheres to a common data warehouse schema. This layer is specifically designed to support the generation of key business reports and analytical insights. The emphasis will be on complex transformations, aggregations, and shaping data for consumption, while orchestrating this dataflow in an automated manner using Databricks notebooks and jobs (see the Gold aggregation sketch after this outline).
  • Challenge 4: Incremental Processing and Pipeline Robustness
    • As data volumes grow and new data arrives continuously, it becomes critical to process only the changes rather than reloading entire datasets. In this challenge, participants will enhance their existing pipelines to support differential (incremental) data loads into the Bronze, Silver, and Gold layers. They will also make their data pipelines more robust by implementing comprehensive error handling and basic telemetry to monitor the health and performance of the data flow within Databricks (see the incremental merge sketch after this outline).
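
The sketches below illustrate, in simplified form, the kind of work each challenge involves. All bucket paths, table names, and column names are hypothetical placeholders, and each snippet assumes a Databricks notebook where the `spark` session is predefined. For Challenge 1, a minimal Bronze ingestion might use Auto Loader to land raw JSON from S3 into a Delta table with an inferred schema:

```python
# Minimal Bronze-ingestion sketch (Challenge 1). Paths and table names are
# placeholders; Auto Loader (cloudFiles) infers the schema and lands the raw data.
from pyspark.sql import functions as F

raw_path = "s3://example-retail-raw/orders/"                          # placeholder source bucket
checkpoint = "s3://example-retail-meta/_checkpoints/orders_bronze/"   # placeholder checkpoint/schema location

bronze_stream = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint)    # schema inference and evolution
        .load(raw_path)
        .withColumn("_ingested_at", F.current_timestamp())  # ingestion metadata, useful for later incremental loads
)

(bronze_stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)        # process all available files, then stop
    .toTable("retail.bronze_orders"))  # placeholder Bronze table
```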
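
For Challenge 2, a minimal Silver-layer pass might cast types, standardize values, handle missing data, and deduplicate on a business key (again with assumed column and table names):

```python
# Minimal Silver-cleaning sketch (Challenge 2); column and table names are illustrative.
from pyspark.sql import functions as F

bronze = spark.read.table("retail.bronze_orders")   # placeholder Bronze table

silver = (
    bronze
    .withColumn("order_ts", F.to_timestamp("order_ts"))          # fix incorrect data types
    .withColumn("amount", F.col("amount").cast("decimal(10,2)"))
    .withColumn("country", F.upper(F.trim(F.col("country"))))    # standardize categorical values
    .fillna({"channel": "UNKNOWN"})                               # handle missing values
    .filter(F.col("order_id").isNotNull())                        # drop rows without a business key
    .dropDuplicates(["order_id"])                                 # remove duplicates
)

(silver.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("retail.silver_orders"))   # placeholder Silver table
```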
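
For Challenge 3, a minimal Gold-layer refinement might join Silver tables and aggregate them into a report-ready table; a notebook like this would typically be scheduled as a task in a Databricks job:

```python
# Minimal Gold-aggregation sketch (Challenge 3); schema, keys, and measures are assumptions.
from pyspark.sql import functions as F

orders = spark.read.table("retail.silver_orders")         # placeholder Silver tables
customers = spark.read.table("retail.silver_customers")

daily_sales = (
    orders.join(customers, "customer_id", "left")
          .groupBy(F.to_date("order_ts").alias("order_date"), "country", "channel")
          .agg(F.sum("amount").alias("total_revenue"),
               F.countDistinct("order_id").alias("order_count"),
               F.countDistinct("customer_id").alias("unique_customers"))
)

(daily_sales.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("retail.gold_daily_sales"))   # placeholder Gold table, consumed by BI reports
```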
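
For Challenge 4, a minimal incremental load might upsert only changed rows into the Silver table with a Delta MERGE, wrapped in basic error handling and logging so that a failed run is visible in the Databricks job; the watermark logic and table names are assumptions:

```python
# Minimal incremental-load sketch (Challenge 4); tables, keys, and the watermark are assumptions.
import logging
from delta.tables import DeltaTable
from pyspark.sql import functions as F

log = logging.getLogger("silver_incremental")

try:
    # Only rows ingested since the last successful run (placeholder watermark value)
    changes = (spark.read.table("retail.bronze_orders")
                    .filter(F.col("_ingested_at") > F.lit("2024-01-01 00:00:00")))

    target = DeltaTable.forName(spark, "retail.silver_orders")
    (target.alias("t")
        .merge(changes.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()         # update changed rows
        .whenNotMatchedInsertAll()      # insert new rows
        .execute())

    log.info("Silver incremental load succeeded")
except Exception as exc:
    log.error("Silver incremental load failed: %s", exc)
    raise  # re-raise so the Databricks job run is marked as failed
```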

Prices & Delivery methods

Online Training

  • Duration: 3 days
  • Price: on request

Classroom Training

  • Duration: 3 days
  • Price: on request

Schedule

Currently there are no training dates scheduled for this course.