2017/05/23

Apache Sqoop Import - Import data from Relational database to HDFS

Apache Sqoop is used to import data from relational databases (MySQL, Oracle, SQL Server, etc.) into the Hadoop ecosystem.

This post details different variants of using the Sqoop tool for importing data.

#1 Transfer Entire table content


[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table categories

The above command will import the entire contents of the "categories" table into HDFS as a comma-delimited text file.

The parameters to the sqoop command are -
import - specifies that data is to be imported into HDFS from a relational db
--connect - the JDBC URL for the source relational db
--username & --password - credentials to connect to the relational db
--table - name of the table to be transferred to HDFS

When the above command is executed, the contents of the table are imported into HDFS and files are created under the directory - /user/$user - as below:
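If you want to verify the result, list the import directory on HDFS (the path assumes the import was run as root, so the table directory lands under /user/root):

[root@ODIGettingStarted ~]# hdfs dfs -ls /user/root/categories

Sqoop writes one part-m-* file per map task, so with the default of four mappers you should see four data files plus a _SUCCESS marker.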



#2 Specifying a Target Directory


Sqoop offers two options for importing the contents of a table into a specific HDFS directory.  These are:
  • --target-dir - lets you specify the final directory into which the data will be imported.
[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table categories \
--target-dir /user/root/output/categories

With this variation, the categories folder will be created under /user/root/output and the data will be imported into it.  The --target-dir parameter needs to be modified every time a new table is imported.

  • --warehouse-dir - lets you specify a root directory under which a new folder will be created for each import
[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--warehouse-dir /user/root/output

With this variation a parent directory is specified on HDFS.  Rather than writing the contents into the parent directory itself, sqoop import creates a new folder matching the name of the table and imports the data there.

In both cases, sqoop will fail the import if the output directory already exists.

#3 Importing Only a Subset of Data


Sqoop allows you to import only a portion of a table's contents into HDFS by supplying a where clause.

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--where "substr(product_name,1,2) = 'Sn'" \
--target-dir /user/root/output/products-Sn

The above command imports only the products whose names start with "Sn" from the "products" table into HDFS and stores them under the specified directory.

#4 File Formats


Sqoop can import data in 3 formats:
  • CSV  - text - default
  • SequenceFile - binary
  • Avro - binary
To invoke sqoop import so that it stores the data in a binary format, use the syntax below:

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--as-sequencefile \
--warehouse-dir /user/root/output

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--as-avrodatafile \
--warehouse-dir /user/root/output


#5 Importing all tables


Sqoop supports importing all tables in a database, rather than importing them one by one.  To import all tables, use the import-all-tables tool.

[root@ODIGettingStarted ~]# sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--warehouse-dir /user/root/output

When executed, this command creates one folder for each table in the database and transfers the data into it.  The folder structure at the end of the command execution is as below:

mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


mysql> 

After import, the HDFS directory structure is:
[root@ODIGettingStarted conf.dist]# hdfs dfs -ls /user/root/output
Found 6 items
drwxr-xr-x   - root supergroup          0 2017-05-23 22:36 /user/root/output/categories
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/customers
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/departments
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/order_items
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/orders
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/products
[root@ODIGettingStarted conf.dist]# 


2017/05/06

Hive Getting Started - Tutorial

In this post, I am going to demonstrate a very simple tutorial to get started with Hive.

The following assumptions are made:

  • you have set up a Hadoop cluster
  • Hive is installed & configured on the cluster
  • all environment variables are set accordingly
  • the Hive warehouse default location is /user/hive/warehouse

In this tutorial, we will do the following

  • copy the data set from the local system to HDFS
  • create a Hive database
  • create a Hive table
  • load data into the Hive table
  • verify the data in the Hive table
  • run queries to answer a few business questions

Dataset

The dataset that I will be using is a movies dataset.  You can download the data from movies.  This is a simple CSV file and contains the following information, in the order specified, related to movies:
  1. Movie ID - Integer
  2. Movie Name - String
  3. Year of Release - Integer ( only year )
  4. Movie  Rating - Float ( decimal )
  5. Duration in Seconds - Integer
The contents of the file look as below:

1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126

Ok, now you have the file and you know its contents.  Let's get started with the tutorial.


File Transfer

When you download the file it lands on your local drive.  Again, I am assuming that you downloaded it from your Linux VM.  If you downloaded it on your Windows host, FTP the file to the Linux VM.

I have set up my directories on Linux VM as below:

/root/samples => local system where the downloaded file exists


/user/root/samples => HDFS Directory where the data file will be made available to Hive

Hadoop allows you to copy a file from the local file system to HDFS using the copyFromLocal command.
Syntax:  hdfs dfs -copyFromLocal <local source> <HDFS destination>
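For example, assuming the downloaded data file is named movies.csv (the actual file name may differ):

[root@ODIGettingStarted ~]# hdfs dfs -copyFromLocal /root/samples/movies.csv /user/root/samples   # file name assumed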
Verify the file's presence and its contents on HDFS.  Again, most unix commands, like ls, cat, chmod, rm etc., can be used with hadoop.

To list the contents of an HDFS directory, use the command below:
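For example:

[root@ODIGettingStarted ~]# hdfs dfs -ls /user/root/samples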

To view the contents of the file we have copied, use the command below.  Certain unix commands are not recognized within hadoop, for example the head and tail commands.  But we can pipe the output of hadoop commands into unix shell commands, as demonstrated below:
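A sketch, again assuming the data file is named movies.csv:

[root@ODIGettingStarted ~]# hdfs dfs -cat /user/root/samples/movies.csv | head -5   # file name assumed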

So now we have copied the file from the local file system to HDFS and verified its contents.  Now on to the Hive operations.


Hive - DDL & DML 

If you are a DWH professional like me, with very limited exposure to the Java world, Hive and Impala offer a great escape from writing MapReduce jobs to process data in Hadoop.

Hive supports most of ANSI SQL, which makes it very convenient for people who are comfortable with ANSI SQL to venture into Hadoop technologies.  Also, Hive translates the DML into a series of MapReduce jobs, so there is no need to write MapReduce jobs in Java.

So, first step: starting Hive on the cluster.  It is very easy, just type "hive" at the command prompt.  Assuming that you have set up all environment variables, you should see the Hive shell as below:
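The invocation itself is just the following (startup log messages omitted):

[root@ODIGettingStarted ~]# hive
hive>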


Depending on your configuration, hive will start in either local, pseudo-distributed, or distributed mode.  If you want to know more about Hive, I would start with the Programming Hive book, which is a great resource.

Database & Table Creation

First we will create a database in which to create our tables.  Database and table creation commands are similar to Oracle's.

Within Hive, when you create a new database it actually creates a directory on HDFS, under the default Hive warehouse directory from your configuration.  As I mentioned before, my default Hive warehouse location is /user/hive/warehouse; you need to explicitly create the relevant directories as part of your Hive installation & configuration.

On HDFS, before creating the database, the available directories are:

We will be creating a database named "practice".  To create the database from the hive shell, use the create database command.  The listing below shows the available databases before and after creating the "practice" database.
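The original listing was a screenshot; a minimal sketch of the same steps from the hive shell is:

hive> show databases;
hive> create database practice;
hive> show databases;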



You can verify that, after successful creation of the database, a directory with the database name is created on HDFS, as below:
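Hive names the directory after the database with a .db suffix, so a listing of the warehouse directory should now show practice.db:

[root@ODIGettingStarted ~]# hdfs dfs -ls /user/hive/warehouse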


Next, we will create a table named "movies" that matches the structure of the "movies" dataset as described in the section Dataset. 

Within Hive, we need to point to the database where we want our tables to be created, otherwise Hive will use the "default" database.
As you can see below, the Hive prompt changes to show the specified database name.
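Switching databases is a one-liner (the prompt displays the current database name only if hive.cli.print.current.db is enabled in your configuration):

hive> use practice;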

The table creation is as shown below:
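The screenshot with the original statement is gone, so the column names below are my own choice; the structure follows the dataset described in the Dataset section:

-- column names are illustrative; adjust to your preference
CREATE TABLE movies (
  movie_id      INT,
  movie_name    STRING,
  release_year  INT,
  rating        FLOAT,
  duration_secs INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;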


The CREATE TABLE statement is more or less similar to Oracle's.  The difference is that in Hive you need to specify how the underlying data file is organized, which is done by the three lines starting with ROW FORMAT ... LINES TERMINATED BY.  These lines are self-explanatory.

The LOCATION parameter is optional; if specified, it tells Hive which HDFS directory path to use for the table's data.  Please note that this is the target directory path, not the source file path.

After successful creation of the table, it will show up under the "practice" database.

Loading Data into Hive Table

Hive allows you to load data into a table from source files using the LOAD DATA command.  The syntax for this command is as below:
LOAD DATA [LOCAL] INPATH '<source path>' OVERWRITE INTO TABLE <table name>

When LOCAL is omitted, Hive looks for the source file on HDFS; otherwise Hive looks for it on the local file system.  Loading the data in Hive looks as below:
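Assuming the file was copied to /user/root/samples earlier, the load looks like this:

-- file name assumed; use the HDFS path from the copyFromLocal step
LOAD DATA INPATH '/user/root/samples/movies.csv' OVERWRITE INTO TABLE movies;

Note that a non-LOCAL load moves the file from the source HDFS path into the table's directory, so the copy under /user/root/samples will be gone afterwards.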

Verifying Hive Table Contents

To describe the table metadata and to get the row count of the table:
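For example:

hive> describe movies;
hive> select count(*) from movies;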


As you can see, the "select count(*) from movies" has initiated MapReduce job by Hive.

To view a limited row set, 5 lines, use the LIMIT clause as below:
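For example:

hive> select * from movies limit 5;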


In this tutorial, we have shown how to load data into HDFS from the local system, how to create a Hive database and table, how to load data into a Hive table, and how to run some simple HiveQL queries against the data in the table.







2017/04/22

Apache Pig 0.16.0 Error 2998 Unhandled internal Error

After installing Pig 0.16.0, when I tried to invoke Pig on Hadoop I started getting a few errors -

Cannot locate pig-withouthadoop.jar

After setting PIG_HOME to point to the 0.16.0 version, pig was throwing the below error when invoked from the command line:


This is easily resolved by explicitly invoking Pig with the full path to where Pig is installed on the machine, as below:
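A sketch, assuming Pig 0.16.0 was unpacked under /opt/pig-0.16.0 (adjust to your install path):

[root@ODIGettingStarted ~]# /opt/pig-0.16.0/bin/pig   # install path assumed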

Error 2998 Unhandled internal Error. Jline.Terminal

When Pig was initiated, either to run a Pig Latin script or to invoke the grunt shell, it started throwing another error - jline.Terminal.getTerminal()Ljline/Terminal - as below:

This error is resolved by setting the environment variable HADOOP_USER_CLASSPATH_FIRST to true, as below.  After setting up the environment variable, the jline error went away.
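For example (add it to ~/.bashrc to make it permanent):

[root@ODIGettingStarted ~]# export HADOOP_USER_CLASSPATH_FIRST=true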



2017/04/18

HDFS Corrupted File System

If the HDFS file system is corrupted, follow these steps to identify the corrupted files and to manually delete them.

The first check is the status of the file system.

The output of the "fsck" command will list lots of information and is difficult to read it on the screen, so redirect the output to a file.  Going through the fsck output  reveals that my file system is CORRUPT.

The output of "fsck" has several messages related to certain file blocks being corrupted as shown below:


In order to fix this issue, I manually deleted the corrupted files using the command below:
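The screenshot with the exact command is gone; one way to do this (a sketch) is to list the corrupt files and remove them one by one:

[root@ODIGettingStarted ~]# hdfs fsck / -list-corruptfileblocks
[root@ODIGettingStarted ~]# hdfs dfs -rm /path/to/corrupt/file   # placeholder path, repeat per corrupt file

Alternatively, hdfs fsck / -delete removes all corrupted files in one go.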
After all the corrupted files are deleted, running "fsck" again reveals that HDFS is healthy.






Apache Hadoop - HDFS NameNode in Safemode

I noticed today that I am unable to perform any operations on the cluster, not even simple ones like creating a directory on HDFS.  Checking the status of the NameNode reveals that it is in Safe mode.
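The status can be checked with:

[root@ODIGettingStarted ~]# hdfs dfsadmin -safemode get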

To take the NameNode out of Safe mode, use the command below:
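The command is:

[root@ODIGettingStarted ~]# hdfs dfsadmin -safemode leave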



2017/03/11

Cloudera CDH5 Tutorial - Sqoop import - retail_db into HDFS

Starting to import "retail_db" mysql databse tables into HDFS.  Hit another error!!😕😕

retail_dba user creation 

I initially created the "retail_dba" user using the below syntax:

The user is created successfully, as listed below (entries from the user_privileges table).
But when I attempted to log in to mysql, I got the error "Access denied for user", as below:


Dropped the user and recreated it by explicitly specifying '@localhost', as below:
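Again a sketch of what that looks like:

mysql> DROP USER 'retail_dba';
mysql> CREATE USER 'retail_dba'@'localhost' IDENTIFIED BY 'oracle';
mysql> GRANT ALL PRIVILEGES ON retail_db.* TO 'retail_dba'@'localhost';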


After recreating the user I am able to login as "retail_dba" and list the available tables.

Sqoop Import Error

Ok, now that the issue with "retail_dba" user connectivity has been sorted, moving on to the original task of importing the "retail_db" mysql db tables into HDFS using sqoop.

So I first issued the sqoop import command from the CLI:
importing all available tables  [--import-all-tables option]
from the mysql db  [--connect jdbc:mysql://localhost:3306/retail_db]
using the credentials created earlier  [--username retail_dba and --password]
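Putting those options together, the command was along these lines (the warehouse directory is assumed from the listing at the end of this post):

[root@ODIGettingStarted ~]# sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--warehouse-dir /user/hive/warehouse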

Nothing seemed to work the first time; I hit another error.
This time complaining about missing JSON jar files.



Checked the Sqoop lib directory - /usr/lib/sqoop/lib - and found no JSON jar file.

Downloaded the file from the java2s web site - http://www.java2s.com/Code/Jar/j/Downloadjavajsonjar.htm - as below.
Downloaded the zip file and unpacked it.
Copied the file "java-json.jar" to the Sqoop lib path as below:

Fingers Crossed.  Time to test the Sqoop import again.

This time the attempt to import data from the mysql db to HDFS completed successfully.  Though, I faced some data issues while importing the "customers" table and had to import one table at a time into HDFS, using the syntax below:
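A sketch of the per-table import ("orders" shown here; the other tables follow the same pattern, and the warehouse directory matches the listing below):

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table orders \
--warehouse-dir /user/hive/warehouse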

When all the tables are imported, we should see one directory for each table under the Hive warehouse path - /user/hive/warehouse - as below:

2017/03/09

Cloudera Hadoop Tutorial - Retail DB Installation

The Cloudera getting started tutorial requires the "retail_db" mysql database for learning the concepts.
If you are using the "Cloudera QuickStart VM" then you are fine.  But if you have set up the Hadoop cluster manually, like me, you will find the "retail_db" mysql database missing.

This post is for detailing how to install the "retail_db" mysql database.

1.  Download the attached "retail_db" script.
Retail_db MySQL database

Or it can also be downloaded from the command prompt using a tool such as wget.
2.  Log in to MySQL as below:

3.  Create a database "retail_db" and a database user as below.
4.  Grant all permissions to the user "retail_dba" and populate the database using the script downloaded in step 1.
5.  After the loading script completes successfully, verify the tables as below:
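Since the original screenshots are gone, here is a minimal sketch of steps 2 to 5 (the script file name and the password are assumptions; the password matches the one used in the sqoop posts above):

[root@ODIGettingStarted ~]# mysql -u root -p
mysql> CREATE DATABASE retail_db;
mysql> CREATE USER 'retail_dba'@'localhost' IDENTIFIED BY 'oracle';
mysql> GRANT ALL PRIVILEGES ON retail_db.* TO 'retail_dba'@'localhost';
mysql> exit
[root@ODIGettingStarted ~]# mysql -u retail_dba -p retail_db < retail_db.sql   # script file name assumed
[root@ODIGettingStarted ~]# mysql -u retail_dba -p retail_db -e "show tables;"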

This will give you a sample database in MySQL to complete the Cloudera tutorial.
