Hadoop Data Analyst

Duration

20 hours

Course Price

INR 11,999

4.5 (23)

Overview

This course is designed for data analysts, business intelligence specialists, developers, system architects, and database administrators. It teaches you how to:

Acquire, store, and analyze data using features in Pig, Hive, and Impala
Perform fundamental ETL (extract, transform, and load) tasks with Hadoop tools
Use Pig, Hive, and Impala to improve productivity for typical analysis tasks
Join diverse datasets to gain valuable business insight
Perform interactive, complex queries on datasets

Course Content

1. Introduction to Apache Hadoop Fundamentals

The Motivation for Hadoop
Hadoop Overview
Data Storage: HDFS
Distributed Data Processing: YARN, MapReduce, and Spark
Data Processing and Analysis: Pig, Hive, and Impala
Database Integration: Sqoop
Other Hadoop Data Tools
Exercise Scenarios

2. Introduction to Apache Pig

What is Pig?
Pig’s Features
Pig Use Cases
Interacting with Pig

3. Basic Data Analysis with Apache Pig

Pig Latin Syntax
Loading Data
Simple Data Types
Field Definitions
Data Output
Viewing the Schema
Filtering and Sorting Data
Commonly Used Functions

4. Processing Complex Data with Apache Pig

Storage Formats
Complex/Nested Data Types
Grouping
Built-In Functions for Complex Data
Iterating Grouped Data

5. Multi-Dataset Operations with Apache Pig

Techniques for Combining Datasets
Joining Datasets in Pig
Set Operations
Splitting Datasets

6. Apache Pig Troubleshooting and Optimization

Troubleshooting Pig
Logging
Using Hadoop’s Web UI
Data Sampling and Debugging
Performance Overview
Understanding the Execution Plan
Tips for Improving the Performance of Pig Jobs

7. Introduction to Apache Hive and Impala

What is Hive?
What is Impala?
Why Use Hive and Impala?
Schema and Data Storage
Comparing Hive and Impala to Traditional Databases
Use Cases

8. Querying with Apache Hive and Impala

Databases and Tables
Basic Hive and Impala Query Language Syntax
Data Types
Using Hue to Execute Queries
Using Beeline (Hive’s Shell)
Using the Impala Shell

9. Apache Hive and Impala Data Management

Data Storage
Creating Databases and Tables
Loading Data
Altering Databases and Tables
Simplifying Queries with Views
Storing Query Results

10. Data Storage and Performance

Partitioning Tables
Loading Data into Partitioned Tables
When to Use Partitioning
Choosing a File Format
Using Avro and Parquet File Formats

11. Relational Data Analysis with Apache Hive and Impala

Joining Datasets
Common Built-In Functions
Aggregation and Windowing

12. Complex Data with Apache Hive and Impala

Complex Data with Hive
Complex Data with Impala

13. Analyzing Text with Apache Hive and Impala

Using Regular Expressions with Hive and Impala
Processing Text Data with SerDes in Hive
Sentiment Analysis and n-grams in Hive

14. Apache Hive Optimization

Understanding Query Performance
Bucketing
Indexing Data
Hive on Spark

15. Apache Impala Optimization

How Impala Executes Queries
Improving Impala Performance

16. Extending Apache Hive and Impala

Custom SerDes and File Formats in Hive
Data Transformation with Custom Scripts in Hive
User-Defined Functions
Parameterized Queries

17. Choosing the Best Tool for the Job

Comparing Pig, Hive, Impala and Relational Databases
Which to Choose?

Trainer Profile

Pankaj Kumar Pathak

15+ years of experience in Big Data and Hadoop.
Successfully implemented and migrated data on demand from traditional RDBMS (Oracle, SQL Server) to NoSQL stores (Cassandra, MongoDB, etc.) on Hadoop clusters, and provided extensive training in India and overseas. The leading corporate houses where I have delivered this work over the last four and a half years are:
Times Internet: Implemented Apache Spark with Cassandra on a Hadoop RAC server for collecting multiple log files.
Amar Ujala: Hadoop cluster planning and sizing, with data migration from SQL Server to Cassandra.
TCS: Three corporate batches for Hadoop administration and data warehousing with Cassandra and MongoDB (Cloudera, Hortonworks).
HCL Infosystems: Hadoop cluster implementation and migration from DB2.
HCL Technologies: Hadoop, Spark-Scala, Flume, and Cassandra NoSQL.
IBM: Two corporate batches for Hadoop clustering, Cloudera Manager, and others.
Dish TV: Implemented data warehousing on a Hadoop cluster.
UHG: Implemented a 20-node Hadoop cluster for warehousing using Hive/Impala and MapReduce.
Genpact: Hadoop, Spark-Scala, Flume, and R.
Nucleus Software: Hadoop cluster planning and sizing for data warehouses using Cassandra.
Tech Mahindra: Implemented Spark with Cassandra on a RAC server for collecting multiple log files. Migrated DB2 data.
BARC Mumbai: Hadoop clustering with Spark and Cassandra.
Providing consultancy to two UK-based clients for data science implementation.

Interview Questions & Answers

1) What is Hadoop and list its components?

Hadoop is an open-source framework for storing large data sets and running applications across clusters of commodity hardware.

It offers large-scale storage for any type of data and can handle a very large number of parallel tasks.

Core components of Hadoop:

Storage unit - HDFS (NameNode, DataNode)
Processing framework - YARN (ResourceManager, NodeManager)

2) What is YARN and explain its components?

Yet Another Resource Negotiator (YARN) is one of the core components of Hadoop. It is responsible for managing resources for the various applications operating in a Hadoop cluster, and also schedules tasks on different cluster nodes.

YARN components:

Resource Manager - The master daemon. It controls resource allocation in the cluster.
Node Manager - The slave (worker) daemon. It runs on each Data Node and is responsible for executing tasks on that node.
Application Master - It maintains the user job lifecycle and the resource requirements of an individual application. It works with the Node Manager to control the execution of tasks.
Container - A bundle of resources such as network, HDD, RAM, and CPU on a single node.

3) Explain HDFS and its components?

HDFS (Hadoop Distributed File System) is the primary data storage unit of Hadoop.
It stores various types of data as blocks in a distributed environment and follows a master/slave topology: the NameNode is the master and the DataNodes are the slaves.
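
To make the idea of block storage concrete, here is a minimal sketch in Python that works out how many blocks a file occupies and how much raw disk its replicas consume. The 128 MB block size and replication factor of 3 are assumed here as the common Hadoop defaults; the function name is hypothetical and not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128   # assumption: typical default HDFS block size
REPLICATION = 3       # assumption: typical default replication factor


def hdfs_block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, size of the last block in MB, raw storage in MB)."""
    if file_size_mb <= 0:
        return 0, 0, 0
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only holds the remaining data, not a full block.
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    # Every block is stored `replication` times across the DataNodes.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, last_block_mb, raw_storage_mb


print(hdfs_block_layout(500))  # (4, 116, 1500)
```

Under these assumptions a 500 MB file is stored as three full blocks plus one 116 MB block, and takes about 1500 MB of raw cluster storage.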

4) What is MapReduce and list its features?

MapReduce is a programming model for processing and generating large datasets on a cluster using parallel, distributed algorithms.

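The syntax for running a MapReduce program is shown below (this form assumes the JAR's manifest names the main class; otherwise pass the class name after the JAR file):

`hadoop jar hadoop_jar_file.jar /input_path /output_path`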
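
The programming model itself can be illustrated without a cluster. The sketch below is plain Python, not Hadoop code: it imitates the map, shuffle, and reduce steps for a word count on a couple of in-memory lines. All function names are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["Hadoop stores data", "Hadoop processes data"]))
# {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real job the mappers and reducers run in parallel on different nodes; here they run sequentially only to show the data flow.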

5) What is Apache Pig?

Apache Pig is a high-level scripting platform for creating programs that run on Apache Hadoop. It is used to analyze huge amounts of structured and semi-structured data.

The language used in this platform is called Pig Latin.
It executes its jobs on Hadoop engines such as MapReduce and Apache Spark.

6) What is Pig Latin?

Pig Latin is the scripting language used in Apache Pig to define data flows for analyzing data.

7) List the benefits of Apache Pig over MapReduce?

Pig Latin is a high-level scripting language, while MapReduce is a low-level data processing paradigm.
With Pig Latin, programmers can easily achieve what would otherwise require complex Java implementations in MapReduce.
Apache Pig reduces the length of the code by approximately 20 times (according to Yahoo), which in turn reduces development time by almost 16 times.
Pig offers built-in operators for data operations such as filtering, joining, sorting, and ordering, while performing the same operations in MapReduce takes considerable effort.

8) List the various relational operators used in “Pig Latin”?

SPLIT
LIMIT
CROSS
COGROUP
GROUP
STORE
DISTINCT
ORDER BY
JOIN
FILTER
FOREACH
LOAD

9) What are the different data types in Pig Latin?

Pig Latin can handle both atomic data types like int, float, long, double etc. and complex data types like tuple, bag and map.

Atomic data types: Atomic or scalar data types are the basic data types found in most languages. In Pig they are int, long, float, double, chararray (string), and bytearray (byte[]).

Complex Data Types: Complex data types are Tuple, Map and Bag.
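
As a loose analogy only, Pig's complex types can be pictured with ordinary Python structures: a tuple is an ordered set of fields, a bag is a collection of tuples, and a map is a set of key-value pairs. The sample student records below are made up for illustration.

```python
# Tuple: an ordered set of fields, like one row.
student_tuple = ("Asha", 101, "Delhi")

# Bag: a collection of tuples; duplicates are allowed.
student_bag = [
    ("Asha", 101, "Delhi"),
    ("Ravi", 102, "Pune"),
    ("Asha", 101, "Delhi"),
]

# Map: key-value pairs looked up by key.
student_map = {"name": "Asha", "city": "Delhi"}

# Complex types can nest: grouping yields a tuple of (group key, bag of matching tuples).
grouped = ("Delhi", [t for t in student_bag if t[2] == "Delhi"])

print(len(grouped[1]))  # 2
```

The Python list and dict here only stand in for the Pig types; a real Pig bag is unordered and can be far larger than memory.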

10) How to load data in pig?

A = LOAD '/home/training/simple.txt' USING PigStorage('|') AS (sname:chararray, sid:int, address:chararray);
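
To show what such a load statement does conceptually, the following Python sketch splits pipe-delimited lines and casts each field according to a schema. It is an illustration, not Pig itself: the helper name and sample rows are hypothetical, and the choice to turn a failed cast into None broadly mirrors how Pig yields null for a field it cannot convert.

```python
def load_pipe_delimited(lines, schema, delimiter="|"):
    """Split each line on the delimiter and cast fields according to the schema.

    A missing field, or one that cannot be cast, becomes None.
    """
    records = []
    for line in lines:
        parts = line.rstrip("\n").split(delimiter)
        record = []
        for index, (_name, cast) in enumerate(schema):
            raw = parts[index] if index < len(parts) else None
            try:
                record.append(cast(raw) if raw is not None else None)
            except ValueError:
                record.append(None)
        records.append(tuple(record))
    return records


# Schema matching the statement above: (sname:chararray, sid:int, address:chararray)
SCHEMA = [("sname", str), ("sid", int), ("address", str)]
sample = ["Asha|101|Delhi", "Ravi|abc|Pune"]

print(load_pipe_delimited(sample, SCHEMA))
# [('Asha', 101, 'Delhi'), ('Ravi', None, 'Pune')]
```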

11) What is Apache Hive?

Apache Hive offers a database query interface to Apache Hadoop. It reads, writes, and manages large datasets residing in distributed storage, and queries them using SQL syntax. In other words, Hive is data warehouse software that runs on top of Hadoop. It is a tool for querying and processing data, and it mostly stores structured data.

12) What is the Use of Hive?

Hive works as a storage layer for structured data. It is a useful and convenient tool for SQL users because Hive uses HQL (Hive Query Language), which closely resembles SQL.

13) How to create a managed table in Hive?

hive> CREATE TABLE student (sname STRING, sid INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- hands on
hive> DESCRIBE student;

14) What is Sqoop and what is the use of Sqoop?

Sqoop is short for "SQL to Hadoop". It is a command-line tool used to transfer data between relational databases (RDBMS) and Hadoop, in either direction.

15) List some features of sqoop?

Full Load: Sqoop can load a single table or all the tables in a database using a single sqoop command.
Incremental Load: Sqoop can perform an incremental load, which retrieves only the rows newer than a previously imported set of rows.
Parallel import/export: Sqoop uses the YARN framework to import and export data. YARN provides parallelism by reading and writing on multiple nodes at once, and fault tolerance comes from the replication that happens by default.
Import results of SQL query: Sqoop can import the result of a SQL query into HDFS.
Compression: Sqoop can compress the data it imports from a database, and it offers several options for doing so. If you specify --compress while importing, Sqoop compresses the output files in gzip format by default and gives them a .gz extension. If you provide --compression-codec, Sqoop compresses the output with the Hadoop codec you specify, for example bzip2.
Connectors for all major RDBMS databases: Sqoop has connectors for almost all major relational databases.
Kerberos Security Integration: Sqoop supports Kerberos authentication, a protocol based on tickets or keytabs that authenticates users and services before they connect to services such as HDFS and Hive.
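
The incremental load feature above can be pictured with a small Python sketch. It is not Sqoop code: it only imitates the idea behind Sqoop's append-mode import, where a check column is compared with the last imported value (in Sqoop these are supplied with --incremental append, --check-column, and --last-value). The function name and sample table are hypothetical.

```python
def incremental_import(source_rows, check_column, last_value):
    """Return the rows whose check column exceeds last_value, plus the new last value."""
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    # If nothing new arrived, the stored last value stays unchanged.
    new_last_value = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last_value


table = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]

# A previous run already imported everything up to id 1.
rows, last = incremental_import(table, "id", 1)
print([row["id"] for row in rows], last)  # [2, 3] 3
```

Running it again with the returned last value imports nothing, which is the behaviour that keeps repeated imports from duplicating rows.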

16) What is Apache Spark?

Apache Spark is a framework for real-time data analytics in a distributed computing environment. It executes in-memory computations to increase the speed of data processing.

It can be up to 100x faster than MapReduce for large-scale data processing, because it exploits in-memory computation and other optimizations.
