Sqoop, Hive and Impala for Data Analysts (Formerly CCA 159)

Hands on Sqoop, Hive and Impala for Data Analysts

Overview of Big Data ecosystem such as Hadoop HDFS, YARN, Map Reduce, Sqoop, Hive, etc

Overview of HDFS Commands such as put or copyFromLocal, get or copyToLocal, cat, etc along with concepts such as block size, replication factor, etc

Managing Tables in Hive Metastore using DDL Commands

Load or Insert data into Hive Metastore Tables using commands such as LOAD and INSERT

Overview of Functions in Hive to manipulate strings, dates, etc

Writing Basic Hive QL Queries using WHERE, JOIN, GROUP BY, etc

Analytical or Windowing Functions in Hive

Overview of Impala and understanding similarities and differences between Hive and Impala

Getting Started with Sqoop by reviewing official documentation and also exploring commands such as Sqoop eval

Importing Data from RDBMS tables into HDFS using Sqoop Import

Importing Data from RDBMS tables into Hive tables using Sqoop Import

Exporting Data from Hive or HDFS to RDBMS tables using Sqoop Export

Incremental Imports using Sqoop Import into HDFS or Hive Tables

As part of Sqoop, Hive, and Impala for Data Analysts (Formerly CCA 159), you will learn key skills such as Sqoop, Hive, and Impala.

This comprehensive course covers all aspects of the certification with real-world examples and data sets.

Overview of Big Data ecosystem

Overview Of Distributions and Management Tools
Properties and Properties Files – General Guidelines
Hadoop Distributed File System
YARN and Map Reduce2
Submitting Map ReduceJob
Determining Number of Mappers and Reducers
Understanding YARN and Map Reduce Configuration Properties
Review and Override Job Properties
Reviewing Map Reduce Job Logs
Map Reduce Job Counters
Overview of Hive
Databases and Query Engines
Overview of Data Ingestion in Big Data
Data Processing using Spark

HDFS Commands to manage files

Introduction to HDFS for Certification Exams
Overview of HDFS and PropertiesFiles
Overview of Hadoop CLI
Listing Files in HDFS
User Spaces or Home Directories in HDFS
Creating Directories in HDFS
Copying Files and Directories into HDFS
File and Directory Permissions Overview
Getting Files and Directories from HDFS
Previewing Text Files in HDFS
Copying or Moving Files and Directories within HDFS
Understanding Size of File System and Files
Overview of Block Size and ReplicationFactor
Getting File Metadata using hdfs fsck
Resources and Exercises

Getting Started with Hive

Overview of Hive Language Manual
Launching and using Hive CLI
Overview of Hive Properties
Hive CLI History and hiverc
Running HDFS Commands in Hive CLI
Understanding Warehouse Directory
Creating and Using Hive Databases
Creating and Describing Hive Tables
Retrieve Matadata of Tables using DESCRIBE
Role of Hive Metastore Database
Overview of beeline
Running Hive Commands and Queries using beeline

Creating Tables in Hive using Hive QL

Creating Tables in Hive – orders
Overview of Basic Data Types in Hive
Adding Comments to Columns and Tables
Loading Data into Hive Tables from Local File System
Loading Data into Hive Tables from HDFS
Loading Data – Overwrite vs Append
Creating External tables in Hive
Specifying Location for Hive Tables
Difference between Managed Table and External Table
Default Delimiters in Hive Tables using Text File
Overview of File Formats in Hive
Differences between Hive and RDBMS
Truncate and Drop tables in Hive
Resources and Exercises

Loading/Inserting data into Hive tables using Hive QL

Introduction to Partitioning and Bucketing
Creating Tables using Orc Format – order_items
Inserting Data into Tables using Stage Tables
Load vs. Insert in Hive
Creating Partitioned Tables in Hive
Adding Partitions to Tables in Hive
Loading into Partitions in Hive Tables
Inserting Data Into Partitions in Hive Tables
Insert Using Dynamic Partition Mode
Creating Bucketed Tables in Hive
Inserting Data into Bucketed Tables
Bucketing with Sorting
Overview of ACID Transactions
Create Tables for Transactions
Inserting Individual Records into Hive Tables
Update and Delete Data in Hive Tables

Overview of functions in Hive

Overview of Functions
Validating Functions
String Manipulation – Case Conversion and Length
String Manipulation – substr and split
String Manipulation – Trimming and Padding Functions
String Manipulation – Reverse and Concatenating Multiple Strings
Date Manipulation – Current Date and Timestamp
Date Manipulation – Date Arithmetic
Date Manipulation – trunc
Date Manipulation – Using date format
Date Manipulation – Extract Functions
Date Manipulation – Dealing with Unix Timestamp
Overview of Numeric Functions
Data Type Conversion Using Cast
Handling Null Values
Query Example – Get Word Count

Writing Basic Queries in Hive

Overview of SQL or Hive QL
Execution Life Cycle of Hive Query
Reviewing Logs of Hive Queries
Projecting Data using Select and Overview of From
Derive Conditional Values using CASE and WHEN
Projecting Distinct Values
Filtering Data using Where Clause
Boolean Operations in Where Clause
Boolean OR vs IN Operator
Filtering Data using LIKE Operator
Performing Basic Aggregations using Aggregate Functions
Performing Aggregations using GROUP BY
Filtering Aggregated Data Using HAVING
Global Sorting using ORDER BY
Overview of DISTRIBUTE BY
Sorting Data within Groups using SORT BY
Using CLUSTERED BY

Joining Data Sets and Set Operations in Hive

Overview of Nested Sub Queries
Nested Sub Queries – Using IN Operator
Nested Sub Queries – Using EXISTS Operator
Overview of Joins in Hive
Performing Inner Joins using Hive
Performing Outer Joins using Hive
Performing Full Outer Joins using Hive
Map Side Join and Reduce Side Join in Hive
Joining in Hive using Legacy Syntax
Cross Joins in Hive
Overview of Set Operations in Hive
Perform Set Union between two Hive Query Results
Set Operations – Intersect and Minus Not Supported

Windowing or Analytics Functions in Hive

Prepare HR Database in Hive with Employees Table
Overview of Analytics or Windowing Functions in Hive
Performing Aggregations using Hive Queries
Create Tables to Get Daily Revenue using CTAS in Hive
Getting Lead and Lag using Windowing Functions in Hive
Getting First and Last Values using Windowing Functions in Hive
Applying Rank using Windowing Functions in Hive
Applying Dense Rank using Windowing Functions in Hive
Applying Row Number using Windowing Functions in Hive
Difference Between rank, dense_rank, and row_number in Hive
Understanding the order of execution of Hive Queries
Overview of Nested Sub Queries in Hive
Filtering Data on Top of Window Functions in Hive
Getting Top 5 Products by Revenue for Each Day using Windowing Functions in Hive – Recap

Running Queries using Impala

Introduction to Impala
Role of Impala Daemons
Impala State Store and Catalog Server
Overview of Impala Shell
Relationship between Hive and Impala
Overview of Creating Databases and Tables using Impala
Loading and Inserting Data into Tables using Impala
Running Queries using Impala Shell
Reviewing Logs of Impala Queries
Synching Hive and Impala – Using Invalidate Metadata
Running Scripts using Impala Shell
Assignment – Using NYSE Data
Assignment – Solution

Getting Started with Sqoop

Introduction to Sqoop
Validate Source Database – MySQL
Review JDBC Jar to Connect to MySQL
Getting Help using Sqoop CLI
Overview of Sqoop User Guide
Validate Sqoop and MySQL Integration using Sqoop List Databases
Listing Tables in Database using Sqoop
Run Queries in MySQL using Sqoop Eval
Understanding Logs in Sqoop
Redirecting Sqoop Job Logs into Log Files

Importing data from MySQL to HDFS using Sqoop Import

Overview of Sqoop Import Command
Import Orders using target-dir
Import Order Items using warehouse-dir
Managing HDFS Directories
Sqoop Import Execution Flow
Reviewing Logs of Sqoop Import
Sqoop Import Specifying Number of Mappers
Review the Output Files generated by Sqoop Import
Sqoop Import Supported File Formats
Validating avro files using Avro Tools
Sqoop Import Using Compression

Apache Sqoop – Importing Data into HDFS – Customizing

Introduction to customizing Sqoop Import
Sqoop Import by Specifying Columns
Sqoop import Using Boundary Query
Sqoop import while filtering Unnecessary Data
Sqoop Import Using Split By to distribute import using non default column
Getting Query Results using Sqoop eval
Dealing with tables with Composite Keys while using Sqoop Import
Dealing with tables with Non Numeric Key Fields while using Sqoop Import
Dealing with tables with No Key Fields while using Sqoop Import
Using autoreset-to-one-mapper to use only one mapper while importing data using Sqoop from tables with no key fields
Default Delimiters used by Sqoop Import for Text File Format
Specifying Delimiters for Sqoop Import using Text File Format
Dealing with Null Values using Sqoop Import
Import Mulitple Tables from source database using Sqoop Import

Importing data from MySQL to Hive Tables using Sqoop Import

Quick Overview of Hive
Create Hive Database for Sqoop Import
Create Empty Hive Table for Sqoop Import
Import Data into Hive Table from source database table using Sqoop Import
Managing Hive Tables while importing data using Sqoop Import using Overwrite
Managing Hive Tables while importing data using Sqoop Import – Errors Out If Table Already Exists
Understanding Execution Flow of Sqoop Import into Hive tables
Review Files generated by Sqoop Import in Hive Tables
Sqoop Delimiters vs Hive Delimiters
Different File Formats supported by Sqoop Import while importing into Hive Tables
Sqoop Import all Tables into Hive from source database

Exporting Data from HDFS/Hive to MySQL using Sqoop Export

Introduction to Sqoop Export
Prepare Data for Sqoop Export
Create Table in MySQL for Sqoop Export
Perform Simple Sqoop Export from HDFS to MySQL table
Understanding Execution Flow of Sqoop Export
Specifying Number of Mappers for Sqoop Export
Troubleshooting the Issues related to Sqoop Export
Merging or Upserting Data using Sqoop Export – Overview
Quick Overview of MySQL – Upsert using Sqoop Export
Update Data using Update Key using Sqoop Export
Merging Data using allowInsert in Sqoop Export
Specifying Columns using Sqoop Export
Specifying Delimiters using Sqoop Export
Using Stage Table for Sqoop Export

Submitting Sqoop Jobs and Incremental Sqoop Imports

Introduction to Sqoop Jobs
Adding Password File for Sqoop Jobs
Creating Sqoop Job
Run Sqoop Job
Overview of Incremental Loads using Sqoop
Incremental Sqoop Import – Using Where
Incremental Sqoop Import – Using Append Mode
Incremental Sqoop Import – Create Table
Incremental Sqoop Import – Create Sqoop Job
Incremental Sqoop Import – Execute Job
Incremental Sqoop Import – Add Additional Data
Incremental Sqoop Import – Rerun Job
Incremental Sqoop Import – Using Last Modified

Here are the objectives for this course.

Provide Structure to the Data

Use Data Definition Language (DDL) statements to create or alter structures in the metastore for use by Hive and Impala.

Create tables using a variety of data types, delimiters, and file formats
Create new tables using existing tables to define the schema
Improve query performance by creating partitioned tables in the metastore
Alter tables to modify the existing schema
Create views in order to simplify queries

Data Analysis

Use Query Language (QL) statements in Hive and Impala to analyze data on the cluster.

Prepare reports using SELECT commands including unions and subqueries
Calculate aggregate statistics, such as sums and averages, during a query
Create queries against multiple data sources by using join commands
Transform the output format of queries by using built-in functions
Perform queries across a group of rows using windowing functions

Exercises will be provided to have enough practice to get better at Sqoop as well as writing queries using Hive and Impala.

All the demos are given on our state-of-the-art Big Data cluster. If you do not have multi-node cluster, you can sign up for our labs and practice on our multi-node cluster. You will be able to practice Sqoop and Hive on the cluster.

Sqoop, Hive and Impala for Data Analysts (Formerly CCA 159)

Setup Big Data Development Environment for Spark and Hadoop

Linux Fundamentals for IT Professionals using Ubuntu 20x

Learn To Create Advance AI Assistant (JARVIS 2.0)With Python

Zero to Hero with GPT-3 & Python : Building Cutting-Edge AI

Arduino UNO Bootcamp For Beginners – 2024 Complete Course

ChatGPT : The Complete Bootcamp for Beginner With Practical