Hadoop Developer

Duration

22 hours

Course Price

$ 299.00

4.5 (23)

Overview

Hadoop Developer covers how to work with large datasets stored in a distributed file system and how to run scripts on a Hadoop cluster.

Participants will learn how to use Spark SQL to query structured data and Spark Streaming to perform real-time processing on streaming data from a variety of sources. Developers will also practice writing applications that use core Spark to perform ETL processing and iterative algorithms.

This course is designed for developers and engineers who have programming experience; prior knowledge of Hadoop or Spark is not required.

The ability to program in Scala or Python is required
Basic knowledge of SQL is required
Hands-on exercises in Scala and Python will be provided
The Linux command line will be used

Course Content

1. Introduction to Apache Hadoop and the Hadoop Ecosystem

Apache Hadoop Overview
Data Ingestion and Storage
Data Processing
Data Analysis and Exploration
Other Ecosystem Tools
Introduction to the Hands-On Exercises

2. Apache Hadoop File Storage

Apache Hadoop Cluster Components
HDFS Architecture
Using HDFS

3. Distributed Processing on an Apache Hadoop Cluster

YARN Architecture
Working With YARN

4. Apache Spark Basics

What is Apache Spark?
Starting the Spark Shell
Using the Spark Shell
Getting Started with Datasets and DataFrames
DataFrame Operations

5. Working with DataFrames and Schemas

Creating DataFrames from Data Sources
Saving DataFrames to Data Sources
DataFrame Schemas
Eager and Lazy Execution

6. Analyzing Data with DataFrame Queries

Querying DataFrames Using Column Expressions
Grouping and Aggregation Queries
Joining DataFrames

7. RDD Overview

RDD Overview
RDD Data Sources
Creating and Saving RDDs
RDD Operations

8. Transforming Data with RDDs

Writing and Passing Transformation Functions
Transformation Execution
Converting Between RDDs and DataFrames

9. Aggregating Data with Pair RDDs

Key-Value Pair RDDs
Map-Reduce
Other Pair RDD Operations

10. Querying Tables and Views with Apache Spark SQL

Querying Tables in Spark Using SQL
Querying Files and Views
The Catalog API
Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark

11. Working with Datasets in Scala

Datasets and DataFrames
Creating Datasets
Loading and Saving Datasets
Dataset Operations

12. Writing, Configuring, and Running Apache Spark Applications

Writing a Spark Application
Building and Running an Application
Application Deployment Mode
The Spark Application Web UI
Configuring Application Properties

13. Distributed Processing

Review: Apache Spark on a Cluster
RDD Partitions
Example: Partitioning in Queries
Stages and Tasks
Job Execution Planning
Example: Catalyst Execution Plan
Example: RDD Execution Plan

14. Distributed Data Persistence

DataFrame and Dataset Persistence
Persistence Storage Levels
Viewing Persisted RDDs

15. Common Patterns in Apache Spark Data Processing

Common Apache Spark Use Cases
Iterative Algorithms in Apache Spark
Machine Learning
Example: k-means

16. Apache Spark Streaming: Introduction to DStreams

Apache Spark Streaming Overview
Example: Streaming Request Count
DStreams
Developing Streaming Applications

17. Apache Spark Streaming: Processing Multiple Batches

Multi-Batch Operations
Time Slicing
State Operations
Sliding Window Operations
Preview: Structured Streaming

18. Apache Spark Streaming: Data Sources

Streaming Data Source Overview
Apache Flume and Apache Kafka Data Sources
Example: Using a Kafka Direct Data Source

Trainer Profile

Pankaj Kumar Pathak

15+ years of experience in Big Data and Hadoop
Successfully implemented and migrated data on demand from traditional RDBMSs (Oracle, SQL Server) to NoSQL stores (Cassandra, MongoDB, etc.) on Hadoop clusters, and delivered extensive training in India and overseas. Major corporate clients over the last four and a half years include:
Times Internet: Implemented Apache Spark with Cassandra on a Hadoop RAC server for collecting multiple log files.
Amar Ujala: Hadoop cluster planning and sizing, with data migration from SQL Server to Cassandra.
TCS: 3 corporate batches on Hadoop administration and data warehousing with Cassandra and MongoDB (Cloudera, Hortonworks).
HCL Info System: Hadoop cluster implementation and migration from DB2.
HCL Technologies: Hadoop, Spark-Scala, Flume, Cassandra NoSQL.
IBM: 2 corporate batches on Hadoop clustering, Cloudera Manager, and others.
Dish TV: Implemented warehousing on a Hadoop cluster.
UHG: Implemented a 20-node Hadoop cluster for warehousing using Hive/Impala and MapReduce.
Genpact: Hadoop, Spark-Scala, Flume, and R.
Nucleus Software: Hadoop cluster planning and sizing for data warehouses through Cassandra.
Tech Mahindra: Implemented Spark with Cassandra on a RAC server for collecting multiple log files. Migrated DB2 data.
BARC Mumbai: Hadoop clustering with Spark and Cassandra.
Providing consultancy to 2 UK-based clients for Data Science implementation.

Interview Questions & Answers

1) What do you know about the term “Big Data”?

Big Data is a term for datasets so large and complex that traditional relational databases struggle to handle them, which is why special tools and methods are used to work with such vast collections of data. Big data helps companies understand their business better and derive meaningful information from the unstructured, raw data they collect on a regular basis. It also allows companies to make better business decisions backed by data.

2) What are the five V’s of Big Data?

The five V’s of Big Data are as follows:

Volume – The amount of data, which is growing at a high rate, e.g. data volumes measured in petabytes.
Velocity – The rate at which data grows. Social media plays a major role in the velocity of growing data.
Variety – The different data types, i.e. various data formats such as text, audio, and video.
Veracity – The uncertainty of available data. Veracity issues arise because the high volume of data brings incompleteness and inconsistency.
Value – Turning data into value. By turning the big data they access into value, businesses may generate revenue.

3) Tell us how big data and Hadoop are related to each other.

Big data and Hadoop are closely linked terms: big data is the problem, and Hadoop is one solution to it. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. Professionals can use the framework to analyze big data and help businesses make decisions.

4) Explain the steps to be followed to deploy a Big Data solution.

The three steps followed to deploy a Big Data solution are:

i. Data Ingestion

The first step in deploying a big data solution is data ingestion, i.e. the extraction of data from various sources. The data source may be a CRM like Salesforce, an Enterprise Resource Planning system like SAP, an RDBMS like MySQL, or log files, documents, social media feeds, etc. The data can be ingested either through batch jobs or real-time streaming. The extracted data is then stored in HDFS.

ii. Data Storage

After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g. HBase). HDFS storage works well for sequential access, whereas HBase suits random read/write access.

iii. Data Processing

The final step in deploying a big data solution is data processing. The data is processed through one of the processing frameworks, such as Spark, MapReduce, or Pig.
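
The three steps above can be sketched in miniature with plain Python. This is only an illustration under stated assumptions: a temporary local file stands in for HDFS, the two sample sources and the function names ingest, store, and process are hypothetical, and a real deployment would use cluster tools rather than a single process.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def ingest(sources):
    """Step 1: pull records from every source into one flat list (batch style)."""
    records = []
    for source_name, rows in sources.items():
        for row in rows:
            records.append({"source": source_name, **row})
    return records


def store(records, directory):
    """Step 2: persist records as JSON lines; a local file stands in for HDFS."""
    path = Path(directory) / "events.jsonl"
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path


def process(path):
    """Step 3: scan the stored file sequentially and count events per user."""
    counts = Counter()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            counts[json.loads(line)["user"]] += 1
    return dict(counts)


# Hypothetical sample data standing in for a CRM export and a web log.
SOURCES = {
    "crm": [{"user": "alice"}, {"user": "bob"}],
    "weblog": [{"user": "alice"}, {"user": "alice"}],
}

with tempfile.TemporaryDirectory() as workdir:
    stored_path = store(ingest(SOURCES), workdir)
    user_counts = process(stored_path)

print(user_counts)
```

The processing step reads the file from start to finish, which mirrors the sequential access pattern that HDFS is suited to.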

5) What are the two main parts of the Hadoop framework?

The two main parts of the Hadoop framework are:

Hadoop Distributed File System (HDFS), a distributed file system with high throughput
Hadoop MapReduce, a software framework for processing large data sets

6) What is HDFS?

HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

7) What is YARN?

YARN (Yet Another Resource Negotiator) is the next-generation MapReduce, also known as MapReduce 2 or MRv2. It was introduced in the Hadoop 0.23 release to overcome the scalability issues of the classic MapReduce framework by splitting the functionality of the JobTracker into a global Resource Manager and a per-application Application Master.

8) Define the respective components of HDFS and YARN.

The two main components of HDFS are:

Name Node – The master node, which processes metadata information for the data blocks within HDFS.
Data Node/Slave node – The node that acts as a slave node, storing the data for processing and for use by the Name Node.

In addition to serving client requests, the Name Node can execute either of the following two roles:

Checkpoint Node – It runs on a different host from the Name Node.
Backup Node – It is a read-only Name Node that contains file system metadata information, excluding the block locations.

The two main components of YARN are:

Resource Manager – This component receives processing requests and allocates them to the respective Node Managers depending on processing needs.
Node Manager – It executes tasks on each individual Data Node.
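
To make the Name Node's metadata role more concrete, here is a toy model in Python. It is a sketch, not HDFS code: the ToyNameNode class is hypothetical, the 128 MB block size and replication factor of 3 are assumed values, and the round-robin placement is a simplification of the rack-aware placement a real cluster uses.

```python
import math
from itertools import cycle, islice

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


class ToyNameNode:
    """Keeps only metadata: which blocks make up a file and where replicas live."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        self.block_map = {}  # file path -> list of (block id, [data node names])

    def add_file(self, path, size_bytes):
        block_count = max(1, math.ceil(size_bytes / BLOCK_SIZE))
        blocks = []
        for index in range(block_count):
            # Round-robin placement; real HDFS uses rack-aware placement instead.
            start = index % len(self.data_nodes)
            replicas = list(islice(cycle(self.data_nodes), start, start + REPLICATION))
            blocks.append((f"{path}#blk_{index}", replicas))
        self.block_map[path] = blocks
        return blocks

    def locate(self, path):
        """Answer a client's metadata request; the data itself stays on Data Nodes."""
        return self.block_map[path]


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
blocks = name_node.add_file("/logs/app.log", 300 * 1024 * 1024)  # a 300 MB file
for block_id, replicas in blocks:
    print(block_id, replicas)
```

The point of the sketch is the division of labor: the Name Node answers "where are the blocks?", while the bytes themselves would be read from the Data Nodes it names.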

9) What is a heartbeat in HDFS?

A heartbeat is a signal a node sends to indicate that it is alive. A DataNode sends heartbeats to the NameNode, and a TaskTracker sends heartbeats to the JobTracker. If the NameNode or JobTracker does not receive a heartbeat, it concludes that there is a problem with the DataNode, or that the TaskTracker is unable to perform its assigned task.
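
The heartbeat idea can be illustrated with a small Python sketch. The HeartbeatMonitor class, the node names, and the 30-second timeout are all hypothetical; the actual intervals and timeouts in a Hadoop cluster are configuration-dependent and are not modeled here.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and flags nodes that went quiet."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of the most recent heartbeat

    def record_heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        # A node is presumed dead once its last heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)  # hypothetical timeout
monitor.record_heartbeat("datanode-1", now=0)
monitor.record_heartbeat("datanode-2", now=0)
monitor.record_heartbeat("datanode-1", now=25)  # datanode-2 stays silent

print(monitor.dead_nodes(now=40))  # prints ['datanode-2']
```

Times are passed in explicitly rather than read from the clock so that the behavior is easy to follow and to test.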

10) What is Apache Hive?

Apache Hive is data warehouse software. It is used to manage and query large data sets stored in distributed storage. Hive also lets MapReduce programmers plug in custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
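
One way custom logic is commonly plugged into Hive is a streaming script invoked through the TRANSFORM clause, which passes rows to the script as tab-separated lines and reads rows back the same way. The Python sketch below shows what such a script might look like; the two-column input (user_id, url) and the function names are hypothetical, and an in-memory stream stands in for the rows Hive would supply.

```python
import io


def transform_line(line):
    """Map one tab-separated input row (user_id, url) to (user_id, domain)."""
    user_id, url = line.rstrip("\n").split("\t")
    domain = url.split("//")[-1].split("/")[0].lower()
    return f"{user_id}\t{domain}"


def run(stream_in, stream_out):
    """Read rows from stream_in and write one transformed row per input row."""
    for line in stream_in:
        if line.strip():
            stream_out.write(transform_line(line) + "\n")


# Under Hive the script would call run(sys.stdin, sys.stdout).
# Here an in-memory stream stands in for Hive's row stream.
sample_rows = io.StringIO(
    "42\thttp://Example.com/page\n"
    "7\thttps://hadoop.apache.org/docs/\n"
)
result = io.StringIO()
run(sample_rows, result)
print(result.getvalue(), end="")
```

Assuming the script were saved as extract_domain.py and a table named page_views existed (both names are made up for illustration), the HiveQL side would look roughly like: ADD FILE extract_domain.py; SELECT TRANSFORM (user_id, url) USING 'python extract_domain.py' AS (user_id, domain) FROM page_views;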

11) What are the key components of Job flow in YARN architecture?

The MapReduce job flow in the YARN architecture has the following components:

A Client node, which submits the MapReduce job.
YARN Node Managers, which launch and monitor the tasks of jobs.
MapReduce Application Master, which coordinates the tasks running in the MapReduce job.
YARN Resource Manager, which allocates the cluster resources to jobs.
HDFS, which is used for sharing job files between the above entities.

12) What is the importance of Application Master in YARN architecture?

The Application Master negotiates resources from the Resource Manager and works with the Node Manager(s) to run and monitor the tasks. It requests containers from the Resource Manager for all map and reduce tasks. As containers are assigned to tasks, it starts them by notifying the relevant Node Manager. It collects progress information from all the tasks and propagates the values to the user or client node.
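
The negotiation described above can be sketched as a toy simulation in Python. Nothing here is the YARN API: ToyResourceManager and ToyApplicationMaster are hypothetical classes, a container is reduced to "one slot on a named node", and launching a task is reduced to recording where it runs.

```python
class ToyResourceManager:
    """Hands out containers from the free capacity reported by Node Managers."""

    def __init__(self, node_capacity):
        self.free = dict(node_capacity)  # node name -> free container slots

    def allocate(self, count):
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        return granted  # may be shorter than count if the cluster is full


class ToyApplicationMaster:
    """Requests one container per task and tracks task progress."""

    def __init__(self, resource_manager, tasks):
        self.resource_manager = resource_manager
        self.pending = list(tasks)
        self.progress = {}

    def run(self):
        containers = self.resource_manager.allocate(len(self.pending))
        # Pair each granted container with a pending task; in real YARN the
        # Node Manager on that host would launch the container.
        for node, task in zip(containers, list(self.pending)):
            self.pending.remove(task)
            self.progress[task] = f"running on {node}"
        return self.progress


resource_manager = ToyResourceManager({"node-a": 2, "node-b": 1})
app_master = ToyApplicationMaster(
    resource_manager, ["map-0", "map-1", "map-2", "reduce-0"]
)
status = app_master.run()
print(status)
print("still pending:", app_master.pending)
```

With only three free slots in this example cluster, the fourth task stays pending until capacity frees up, which is the kind of situation the Application Master has to track and report.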

13) What do you mean by MapReduce in Hadoop?

MapReduce is a framework for processing huge raw data sets using a large number of computers. It processes the raw data in two phases: Map and Reduce. The MapReduce programming model scales easily to large data sets, and it is integrated with HDFS so that processing is distributed across the data nodes of a cluster.
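
The two phases can be illustrated with the classic word count, written here in plain Python rather than against the Hadoop API. The function names and sample lines are made up for the example, and the shuffle step, which the framework performs between the phases, is written out explicitly so the data flow is visible.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["Hadoop stores data", "Spark processes data", "Hadoop runs Spark"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
print(word_counts)
```

On a cluster, each map call would run next to the block of input it reads and the reduce calls would run in parallel per key; this single-process version only shows the shape of the computation.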

14) What are key/value pairs in the MapReduce framework?

The MapReduce framework implements a data model in which data is represented as key/value pairs. Both the input to and the output from the MapReduce framework must be in key/value pairs.
