Spark for Developers (SPARK-DEV)


About this Course

Spark for Developers is a hands-on, three-day course on developing with Apache Spark. You will learn how Spark fits into the Big Data ecosystem and how to use it for data analysis. The course covers the Spark shell for interactive data analysis, Spark internals, the Spark APIs, Spark SQL, Spark Streaming, machine learning with MLlib, and GraphX.

Why you should take this course:

  • Taught by real Big Data experts (not professional slide readers)
  • Vendor-neutral – we teach Apache Spark
  • Hands-on, with small class sizes
  • Fresh labs with real-world data (not rehashed lab examples from AMPLab)
  • High-quality materials

Why are IT professionals using Apache Spark?

  • It's a unified platform that provides a variety of compute models (batch, ad hoc, streaming, machine learning, and graph processing) within a well-integrated ecosystem
  • It's typically faster than Hadoop MapReduce
  • It's very expressive and flexible
  • It's easier to use than many alternative solutions

Who should attend

  • Developers
  • Data analysts

Class Prerequisites

  • Familiarity with at least one of the Java, Scala, or Python programming languages
  • Basic understanding of a Linux development environment, such as command-line navigation and editing files with vi or nano

Outline: Spark for Developers (SPARK-DEV)

Module 1: Scala Introduction

Module 2: Spark Basics

  • Background and history
  • Spark and Hadoop
  • Spark concepts and architecture
  • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)

Module 3: RDDs

  • Running Spark in local mode
  • Spark web UI
  • Spark shell
  • Analyzing a dataset – part 1
  • Inspecting RDDs

Module 4: RDDs In Depth

  • Partitions
  • RDD Operations / transformations
  • RDD types
  • Key-Value pair RDDs
  • MapReduce on RDD
  • Caching and persistence

Module 5: Spark and Hadoop

  • Hadoop Intro (HDFS / YARN)
  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark

Module 6: Spark API programming

  • Introduction to Spark API / RDD API
  • Submitting the first program to Spark
  • Debugging / logging
  • Configuration properties

Module 7: Spark SQL

  • SQL context
  • Defining tables and importing datasets
  • Querying

Module 8: Spark Streaming

  • Streaming overview
  • Streaming operations
  • Sliding window operations
  • Writing Spark Streaming applications

Module 9: Spark MLlib

  • MLlib intro
  • MLlib algorithms
  • Writing MLlib applications

Module 10: Spark GraphX

  • GraphX library overview
  • GraphX APIs
  • Processing graph data using Spark

Module 11: Spark Performance and Tuning

  • Broadcast variables
  • Accumulators
  • Memory management

Bonus Lab: Running Spark in cluster mode

  • Inspecting the master and workers in the web UIs
  • Configurations
  • Distributed processing of large data sets

Classroom Training

Duration 3 days

  • United States: US$ 2,500
Online Training

Duration 3 days

  • United States: US$ 2,500