Hadoop Course Content


I. Introduction to Big Data and Hadoop

  •  What is Big Data?
  •  What are the challenges for processing big data?
  •  What technologies support big data?
  •  The 3 V’s of Big Data and its growth
  •  What is Hadoop?
  •  Why Hadoop, and its use cases
  •  History of Hadoop
  •  The Hadoop ecosystem and its components
  •  Advantages and Disadvantages of Hadoop
  •  Real Life Use Cases

II. HDFS (Hadoop Distributed File System)
  •  HDFS architecture
  •  Features of HDFS
  •  Where does it fit and where doesn't it fit?
  •  HDFS daemons and their functionality
  •  NameNode and its functionality
  •  DataNode and its functionality
  •  Secondary NameNode and its functionality
  •  Data storage in HDFS
  •  Introduction to blocks
  • Data replication
  • Accessing HDFS
  • CLI (Command Line Interface) and admin commands
  • Java-based approach
  • Hadoop Administration
  • Hadoop Configuration Files
  • Configuring Hadoop Domains
  • Precedence of Hadoop Configuration
  • Diving into Hadoop Configuration
  • Scheduler
  • Rack Awareness
  • Cluster Administration Utilities
  • Rebalancing HDFS data
  • Copying large amounts of data from HDFS
  • FsImage and edit log files, in theory and in practice

III. MAPREDUCE

  •  MapReduce architecture
  •  JobTracker and TaskTracker and their functionality
  •  Job execution flow
  •  Configuring development environment using Eclipse
  •  MapReduce programming model
  •  How to write a basic MapReduce job
  •  Running MapReduce jobs in local mode and distributed mode
  •  Different data types in MapReduce
  •  How to use Input Formats and Output Formats in MapReduce jobs
  •  Input Formats and their associated Record Readers, with examples
  •  Text Input Format
  •  Key Value Text Input Format
  •  Sequence File Input Format
  •  How to write custom Input Formats and their Record Readers
  •  Output Formats and their associated Record Writers, with examples
  •  Text Output Format
  •  Sequence File Output Format
  •  How to write custom Output Formats and their Record Writers
  •  How to write Combiners and Partitioners, and when to use them
  •  Importance of Distributed Cache
  •  Importance of Counters and how to use them
Advanced MapReduce Programming

  • Joins - Map Side and Reduce Side
  • Use of Secondary Sorting
  • Importance of the Writable and WritableComparable APIs
  • How to write MapReduce keys and values
  • Use of compression techniques
  • Snappy, LZO and Zip
  • How to debug MapReduce jobs in local and pseudo-distributed mode
  • Introduction to MapReduce Streaming and Pipes, with examples
  • Job Submission
  • Job Initialization
  • Task Assignment
  • Task Execution
  • Progress and status updates
  • Job Completion
  • Failures
  • Task Failure
  • TaskTracker failure
  • JobTracker failure
  • Job Scheduling
  • Shuffle & Sort in depth
  • Diving into Shuffle and Sort
  • Dive into Input Splits
  • Dive into Buffer Concepts
  • Dive into Configuration Tuning
  • Dive into Task Execution
  • The task execution environment
  • Speculative Execution
  • Output Committers
  • Task JVM Reuse
  • Multiple Inputs & Multiple Outputs
  • Built-in Counters
  • Dive into Counters – Job Counters & User Defined Counters
  • SQL operations using Java MapReduce
  • Introduction to YARN (Next Generation MapReduce)

IV. Apache HIVE
  • Hive Introduction
  • Hive architecture
  • Driver
  • Compiler
  • Semantic Analyzer
  • Hive Integration with Hadoop
  • Hive Query Language (HiveQL)
  • SQL vs. HiveQL
  • Hive Installation and Configuration
  • Hive, Map-Reduce and Local-Mode
  • Hive DDL and DML Operations
  • Hive Services
  • CLI
  • Schema Design
  • Views
  • Indexes
  • HiveServer

  • Metastore
  • Embedded metastore configuration
  • External metastore configuration
  • Transformations in Hive
  • UDFs in Hive
  • How to write simple Hive queries
  • Usage
  • Tuning
  • Hive integration with HBase
  • Additional topics from the instructor's own R&D (to be added)


V. Apache PIG

  • Introduction to Apache Pig
  • MapReduce vs. Apache Pig
  • SQL vs. Apache Pig
  • Different data types in Pig
  • Modes of execution in Pig
  • Local Mode
  • MapReduce Mode
  • Execution Mechanism
  • Grunt Shell
  • Script
  • Embedded
  • Transformations in Pig
  • How to write a simple Pig script
  • UDFs in Pig
  • Pig integration with HBase
  • Additional topics from the instructor's own R&D (to be added)

VI. Apache SQOOP
  • Introduction to Sqoop
  • MySQL client and Server Installation
  • How to connect to a relational database using Sqoop
  • Sqoop commands, with examples of import and export
  • Transferring an Entire Table
  • Specifying a Target Directory
  • Importing only a Subset of data
  • Protecting your password
  • Using a file format other than CSV
  • Compressing Imported Data
  • Speeding up Transfers
  • Overriding Type Mapping
  • Controlling Parallelism
  • Encoding Null Values
  • Importing all your tables
  • Incremental Import
  • Importing only new data
  • Incrementally importing mutable data
  • Preserving the last imported value
  • Storing Password in the Metastore
  • Overriding arguments to a saved job
  • Sharing the metastore between Sqoop clients
  • Importing data from two tables
  • Using Custom Boundary Queries
  • Renaming Sqoop job instances
  • Importing queries with duplicated columns
  • Transferring data from Hadoop
  • Inserting Data in Batches
  • Exporting with All or Nothing Semantics
  • Updating an Existing Data Set
  • Updating or Inserting at the same time
  • Using Stored Procedures
  • Exporting into a subset of columns
  • Encoding the Null Value
  • Encoding the Null Value Differently
  • Exporting Corrupted Data

VII. Apache FLUME
  • Introduction to Flume
  • Flume agent usage

VIII. Apache HBase
  • HBase introduction
  • HBase basics
  • Column families
  • Scans
  • HBase installation
  • HBase architecture
  • Storage
  • Write-Ahead Log
  • Log-Structured Merge-Trees
  • MapReduce integration
  • MapReduce over HBase
  • HBase usage
  • Key design
  • Bloom Filters
  • Versioning
  • Filters
  • HBase clients
  • REST
  • Thrift
  • Hive
  • Web Based UI
  • HBase admin
  • Schema definition
  • Basic CRUD operations

IX. Apache OOZIE
  • Introduction to Oozie
  •  Executing workflow jobs

X. Hadoop installation on Linux, and installation of all other ecosystem components on Linux

XI. Cluster setup (200-node cluster): knowledge sharing, with a setup document

XII. Cloudera & Hortonworks