Hadoop for Administrators (HA)

Course Content

Apache Hadoop is the most popular framework for processing Big Data on clusters of servers. In this three-day course, attendees learn about the business benefits and use cases for Hadoop and its ecosystem, how to plan cluster deployment and growth, and how to install, maintain, monitor, troubleshoot and optimize Hadoop. They also practice bulk-loading data into a cluster, become familiar with various Hadoop distributions, and practice installing and managing Hadoop ecosystem tools. The course closes with a discussion of securing a cluster with Kerberos.

Who should attend

Hadoop Administrators

Prerequisites

Learners should meet the following prerequisites before class:

  • Comfort with basic Linux system administration
  • Basic scripting skills

Prior knowledge of Hadoop and distributed computing is not required; both are introduced and explained in the course.

Course Objectives

By the end of this course, you should be able to:

  • Understand the business benefits and use cases for Hadoop and its ecosystem
  • Plan for cluster deployment and growth
  • Install, maintain, monitor, troubleshoot and optimize Hadoop
  • Compare the various Hadoop distributions
  • Install and manage Hadoop ecosystem tools
  • Secure a cluster with Kerberos

Detailed Course Outline

Introduction
  • Hadoop history, concepts
  • Ecosystem
  • Distributions
  • High-level architecture
  • Hadoop myths
  • Hadoop challenges (hardware / software)
Planning and installation
  • Selecting software, Hadoop distributions
  • Sizing the cluster, planning for growth
  • Selecting hardware and network
  • Rack topology
  • Installation
  • Multi-tenancy
  • Directory structure, logs
  • Benchmarking
HDFS operations
  • Concepts (horizontal scaling, replication, data locality, rack awareness)
  • Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
  • Health monitoring
  • Command-line and browser-based administration
  • Adding storage, replacing defective drives
Data ingestion
  • Flume for logs and other data ingestion into HDFS
  • Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
  • Hadoop data warehousing with Hive
  • Copying data between clusters (distcp)
  • Using S3 as a complement to HDFS
  • Data ingestion best practices and architectures
MapReduce operations and administration
  • Parallel computing before MapReduce: comparing HPC and Hadoop administration
  • MapReduce cluster loads
  • Nodes and daemons (JobTracker, TaskTracker)
  • MapReduce UI walkthrough
  • MapReduce configuration
  • Job config
  • Optimizing MapReduce
  • Fool-proofing MR: what to tell your programmers
YARN: new architecture and new capabilities
  • YARN design goals and implementation architecture
  • New actors: ResourceManager, NodeManager, ApplicationMaster
  • Installing YARN
  • Job scheduling under YARN
Advanced topics
  • Hardware monitoring
  • Cluster monitoring
  • Adding and removing servers, upgrading Hadoop
  • Backup, recovery and business continuity planning
  • Oozie job workflows
  • Hadoop high availability (HA)
  • Hadoop Federation
  • Securing your cluster with Kerberos
Optional tracks
  • Cloudera Manager for cluster administration, monitoring, and routine tasks, including its installation and use. In this track, all exercises and labs are performed in the Cloudera distribution environment (CDH5)
  • Ambari for cluster administration, monitoring, and routine tasks, including its installation and use. In this track, all exercises and labs are performed with the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0)
Classroom Training

Duration 3 days

Price
  • United States: US$ 2,500
Online Training

Duration 3 days

Price
  • United States: US$ 2,500