2017/09/11

AWS CLI - Error - AWS was not able to validate credentials

I set up an AWS EC2 instance through the AWS Management Console and was able to launch it.  While using the AWS CLI to connect to the instance, I encountered the error "AWS was not able to validate credentials".

I carried out the below steps:


  • Set up the AWS EC2 instance through the Console
  • Created Access Keys and downloaded the rootkey.csv file
  • Installed Python 3.6 on a Linux machine
  • Installed the AWS CLI tool
  • Configured the AWS CLI on the Linux machine using the command - aws configure
After all these steps, when I ran the command - aws ec2 describe-instances - I kept getting the error "AWS was not able to validate credentials".

I verified the AWS credentials file; the AWS Access Key ID and AWS Secret Access Key were both present.

To resolve this error, I had to explicitly export the access key and secret key as environment variables.
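A minimal sketch of the exports, using the standard variable names the AWS CLI reads (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) and placeholder values taken from the downloaded key file:

export AWS_ACCESS_KEY_ID=<your access key id>          # from the downloaded key file
export AWS_SECRET_ACCESS_KEY=<your secret access key>  # from the downloaded key file
export AWS_DEFAULT_REGION=us-east-1                    # assumption: use the region of your EC2 instance

# retry the command that previously failed
aws ec2 describe-instances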




After this, I was able to connect to the running instances from the AWS CLI.




2017/08/28

Data Visualization with R ggplot2 - Part 2

In my previous post, Data Visualization with R ggplot2 - Part 1, I detailed the pre-requisites for getting started with ggplot2 in R.

In this post, I will focus more on the usage of R package - ggplot2 and various visualizations that can be generated using this package.

Within R, a ggplot2 graph is initiated by calling the function "ggplot()".  The basic syntax of ggplot is -

ggplot(data = <data set>) +
<geom_function>(mapping = aes(<MAPPINGS>))

Provide a data set to ggplot through the "data" argument;
Specify what type of graph you need through the geom function, e.g. geom_bar, geom_point, etc.;
And specify how your graph should look through a combination of "mapping" and "aes" (short for aesthetics).

Data Set

For illustrating the usage of "ggplot2" I am using the "mpg" data set available through the "tidyverse" package.  The "mpg" data set provides the fuel efficiency of vehicles for the 1999 and 2008 model years and has the below variables:
manufacturer, model, displ (engine displacement in litres), year of manufacture, number of cylinders, type of transmission, type of drive, city efficiency, highway efficiency, type of fuel, vehicle class

Usage

Blank Graph - No plotting

ggplot(data = mpg, mapping = aes(x=displ, y=hwy))

If you omit the geom function in the call to "ggplot", R will produce a blank graph with no plotting of variables.
As you can see here, R produced a graph with "displ" along the X-axis and "hwy" along the Y-axis, but nothing is plotted because we have not specified what type of graph we need.

geom_bar: Basic Bar Chart

To produce bar charts use the "geom_bar" function.  This function by default accepts one variable for the x position and plots the count of observations for each x value.

ggplot(data = mpg) + geom_bar(mapping = aes(x=displ))

This will generate a vertical bar chart showing the number of vehicles by engine size, as below.
To add some color to the graph:
ggplot(data = mpg) + geom_bar(mapping = aes(x=displ), color = "orange" )

This will generate a graph with the chosen color, as below (note that "color" sets the bar outline; use "fill" to color the bars themselves):
To show the number of vehicles by their class, and to distinguish each vehicle class by color:

ggplot(data = mpg) + geom_bar(mapping = aes(x=class, fill = class ))

Horizontal Bar Chart

By default, the "geom_bar" generates a vertical bar chart.  To display horizontal bar chart add "coord_flip() function to the command as below:

ggplot(data = mpg) + geom_bar(mapping = aes(x=class, fill = class )) + coord_flip()

geom_point: Scatter Chart

To generate scatter charts use the "geom_point" function.  This function requires two variables, one for the x position and one for the y position, and plots one point per observation.

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy))

This will generate a scatter chart of highway fuel efficiency against engine size, as below.
We can further enhance this chart by adding color or changing the shape by the "vehicle class" attribute, as below:

Scatter Chart - Color attribute

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, color = class))

In this chart each vehicle class is color coded.

Scatter Chart - Shape attribute

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, shape = class))

In this chart each vehicle class is given a different shape.


Scatter Chart - Size attribute

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy, size = class))

In this chart the points for each vehicle class are drawn with a different size.  (Note that mapping a discrete variable such as class to size is not generally advised, and ggplot2 will warn about it.)

geom_point: Facets / Subplots

If you notice, in the above scatter charts we have displayed all "vehicle classes" in a single chart but distinguished each by a shape or size or color.

What if we want to produce one chart for each vehicle class, but still want to display them together?  ggplot2 has an option for this: facets (subplots), which are equivalent to "trellis" views.

Facets/Subplots - single Variables

We will reproduce the scatter chart, but split it by vehicle class, one panel for each class.  Use "facet_wrap" when the plot is split on a single variable, in this case "class".  You can control how many rows of subplots are produced with "nrow".

ggplot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy)) + facet_wrap( ~ class, nrow = 2)


As you can see, the scatter plot is split by "Vehicle Class" one for each class.

We can change the color of the points by setting the "color" attribute to the required color (outside aes, since it is a fixed value rather than a mapping).

gglpot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy), color = "blue") + facet_wrap( ~ class, nrow = 2)


Facets/Subplots - two Variables

To facet the plot by two variables, use the "facet_grid" function as below:

gglpot(data = mpg) + geom_point(mapping = aes(x=displ, y=hwy), color = "blue") + facet_grid( drv ~ class)

We are asking for a scatter plot, with one subplot for each combination of the "drv" and "class" variables.


As you can see, each subplot corresponds to a combination of the two variables.

geom_smooth: Line Charts

To generate a smoothed line chart in R, use the geom_smooth function as below (it fits a smoothed line, a loess fit by default, through the points):

ggplot(data = mpg) + geom_smooth(mapping = aes(x=displ, y=hwy))

The above chart shows the relationship between highway fuel efficiency and engine size in litres.

Line Types

We can plot the relationship by vehicle drive type, with one line for each drive type, using the "linetype" attribute as below:

ggplot(data = mpg) + geom_smooth(mapping = aes(x=displ, y=hwy, linetype = drv))


Multiple Charts in same plot

ggplot2 supports combining multiple chart types (i.e. scatter and line charts) in the same plot.  The below code example shows scatter and line charts in the same plot:

ggplot(data = mpg) +
geom_smooth(mapping = aes(x=displ, y=hwy, linetype = drv, color = drv)) +
geom_point(mapping = aes(x=displ, y=hwy, color = drv))

geom_boxplot: Box Plot

A box plot can be generated in R using the below syntax:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()


There are several other options available to plot various charts using R.  See the cheatsheet for all the  available options:  Data Visualization with R

2017/07/15

Data Visualization with R - ggplot2 - Part 1

R has several systems for visualizing data; ggplot2 is one of them. ggplot2 is pretty easy to use and offers various options to generate impressive graphs to wow your users. This post showcases the usage of ggplot2 and its impressive features.
What you need: 
  • R System installed on your PC.
  • R-Studio, or the R console if you prefer
  • tidyverse package
What is tidyverse?  tidyverse is a collection of the most commonly used packages for data manipulation and visualization. It contains the below packages:
  • data visualization - ggplot2
  • data manipulation - dplyr
  • data tidying - tidyr
  • data import - readr
  • functional programming - purrr
Usage - To start using tidyverse, one needs to load the package into R first. Loading the package is very easy: just fire up R-Studio and issue the below command:
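Something along these lines (the install step is only needed the first time, if tidyverse is not already installed):

install.packages("tidyverse")   # one-time installation
library(tidyverse)              # load the package for the current session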



The "ggplot2" package has several data sets which we can use for plotting graphs. To view details about the data sets ( obeservations => rows, variables => columns) , type the data set name at command prompt , as below. This will lists a sample data and the variable names.

To know what each of the variables (columns) in a data set means, just use the help function:
?datasetname

The help function can be used with any of the R commands / packages to know more details about them.

How to use "ggplot2" in next part... soon.

2017/05/23

Apache Sqoop Import - Import data from Relational database to HDFS

Apache Sqoop is used to import data from relational databases (MySQL, Oracle, SQL Server, etc.) into the Hadoop ecosystem.

This post details different variants of using the Sqoop tool for importing data.

#1 Transfer Entire table content


[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table categories

The above command will import the entire contents of the "categories" table into HDFS as a comma-delimited text file.

The parameters to the sqoop command are -
import - specifies that data is to be imported into HDFS from a relational db
--connect - the JDBC URL for the source relational db
--username & --password - credentials to connect to the relational db
--table - name of the table to be transferred to HDFS

When the above command is executed, the contents of the table are imported into HDFS and the output files are created under the directory - /user/$user/<table name> - as below:
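For example, for the root user the import can be confirmed as below (part-m-00000 is the typical name of the first output file; the number of part files depends on the number of mappers, 4 by default):

[root@ODIGettingStarted ~]# hdfs dfs -ls /user/root/categories
[root@ODIGettingStarted ~]# hdfs dfs -cat /user/root/categories/part-m-00000 | head -5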



#2 Specifying a Target Directory


Sqoop offers two options for importing the contents of a table into a specific HDFS directory.  These are:
  • --target-dir - allows you to specify the final directory into which the data will be imported.
[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table categories \
--target-dir /user/root/output/categories

With this variation, the categories folder will be created under /user/root/output and the data will be imported into it.  The parameter --target-dir needs to be modified every time a new table is imported.

  • --warehouse-dir - allows you to specify the root directory under which a new folder will be created during the import
[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--warehouse-dir /user/root/output

With this variation a parent directory is specified on HDFS.  When executed, rather than writing the contents directly into the parent directory, Sqoop creates a new folder matching the name of the table and imports the data there.

In both cases, sqoop will fail the import if the output directory already exists.

#3 Importing only subset of data


Sqoop allows you to import only a portion of a table's contents into HDFS with a where clause.

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--where "substr(product_name,1,2) = 'Sn'" \
--target-dir /user/root/output/products-Sn

The above command will import only the products whose names start with "Sn" from the "products" table into HDFS and store them under the specified directory.

#4 File Formats


Sqoop can import data in 3 formats:
  • CSV  - text - default
  • Sequencefile - binary
  • Avro - binary
To invoke a sqoop import that stores the data in a binary format, use the below syntax:

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--as-sequencefile \
--warehouse-dir /user/root/output

[root@ODIGettingStarted ~]# sqoop import \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--table products \
--as-avrodatafile \
--warehouse-dir /user/root/output


#5 Importing all tables


Sqoop supports importing all tables in a database, rather than importing tables one by one.  To import all tables use the tool - import-all-tables.

[root@ODIGettingStarted ~]# sqoop import-all-tables \
--connect jdbc:mysql://localhost:3306/retail_db \
--username retail_dba \
--password oracle \
--warehouse-dir /user/root/output

When executed, this command will create one folder for each table in the database and transfer the data into it.  The folder structure at the end of the command execution is as below:

mysql> show tables;
+---------------------+
| Tables_in_retail_db |
+---------------------+
| categories          |
| customers           |
| departments         |
| order_items         |
| orders              |
| products            |
+---------------------+
6 rows in set (0.00 sec)


mysql> 

After import, the HDFS directory structure is:
[root@ODIGettingStarted conf.dist]# hdfs dfs -ls /user/root/output
Found 6 items
drwxr-xr-x   - root supergroup          0 2017-05-23 22:36 /user/root/output/categories
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/customers
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/departments
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/order_items
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/orders
drwxr-xr-x   - root supergroup          0 2017-05-23 22:37 /user/root/output/products
[root@ODIGettingStarted conf.dist]# 


2017/05/06

Hive Getting Started - Tutorial

In this post, I am going to demonstrate a very simple tutorial to get started with Hive.

The following assumptions are made:

  • you have set up a Hadoop cluster
  • Hive is installed & configured on the cluster
  • All environment variables are set accordingly
  • The Hive warehouse default location is - /user/hive/warehouse

In this tutorial, we will do the following

  • Copy the data set from local system to HDFS
  • create hive database
  • create hive table 
  • load data into hive table
  • verify the data in hive table
  • run queries to answer few business questions

Dataset

The dataset that I will be using is a movies dataset.  You can download the data from movies.  This is a simple CSV file and contains the following information, in the order specified, for each movie:
  1. Movie ID - Integer
  2. Movie Name - String
  3. Year of Release - Integer ( only year )
  4. Movie  Rating - Float ( decimal )
  5. Duration in Seconds - Integer
Contents of the file looks as below:

1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126

Ok, now you got the file and you know the contents of it.  Let's get started with the tutorial.


File Transfer

The downloaded file will be on your local drive.  I am assuming that you downloaded it on your Linux VM; if you downloaded it on your Windows host, FTP the file to the Linux VM.

I have set up my directories on Linux VM as below:

/root/samples => local system where the downloaded file exists


/user/root/samples => HDFS Directory where the data file will be made available to Hive

Hadoop allows you to copy a file from the local file system to HDFS using the command - copyFromLocal.
Syntax:  hdfs dfs -copyFromLocal <local source path> <HDFS target path>
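For example, assuming the downloaded file was saved as movies.csv (a placeholder name - use whatever name your download produced):

hdfs dfs -copyFromLocal /root/samples/movies.csv /user/root/samples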
Verify the file's presence and its contents on HDFS.  Most of the familiar unix commands, like ls, cat, chmod, rm etc., can be used with hadoop.

To list the contents of an HDFS directory, use the command below:
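For example, continuing with the directory used above:

hdfs dfs -ls /user/root/samples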

To view the contents of the file we have copied, use the command below.  Certain unix commands, for example head or tail, are not recognized within hadoop, but we can pipe the output of a hadoop command to unix shell commands, as demonstrated below:
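Again assuming the placeholder file name movies.csv:

hdfs dfs -cat /user/root/samples/movies.csv | head -5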

So now, we have copied the file from Local File System to HDFS and verified the contents of the file as well. Now onto Hive operations.


Hive - DDL & DML 

If you are a DWH professional like me, with very limited exposure to the Java world, Hive and Impala offer a great escape from writing MapReduce jobs to process data in Hadoop.

Hive supports most of ANSI SQL, which makes it very convenient for people who are comfortable with ANSI SQL to venture into Hadoop technologies.  Also, Hive translates the DML into a series of MapReduce jobs, so there is no need to write MapReduce jobs in Java.

So, first step: initiating Hive on the cluster.  It is very easy: just type "hive" at the command prompt.  Assuming that you have set up all the environment variables, you should see the Hive shell as below:


Depending on your configuration, hive will start in either local, pseudo-distributed or distributed mode.  If you want to know more about Hive, I would start with the Programming Hive book, which is a great source.

Database & Table Creation

First we will create a database where we will be creating tables.  Database and Table creation commands are similar to Oracle.  

Within Hive, when you create a new database it actually creates a directory on HDFS, under the default Hive warehouse directory as per your configuration. As I mentioned before, my default Hive warehouse location is /user/hive/warehouse; you need to explicitly create the relevant directories as part of your Hive installation & configuration.

On HDFS, before creating the database, the available directories are:

We will be creating a database named practice. To create the database, from the hive shell use the command - create database.  The below listing shows the available databases before and after creating the "practice" database.
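A sketch of that sequence from the hive shell:

show databases;            -- "practice" is not yet in the list
create database practice;
show databases;            -- "practice" now appears alongside "default"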



You can verify that, after successful creation of the database, a directory with the db name (suffixed with .db) is created on HDFS, as below:
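Roughly as below:

hdfs dfs -ls /user/hive/warehouse
# a directory named practice.db should now appear in the listing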


Next, we will create a table named "movies" that matches the structure of the "movies" dataset as described in the section Dataset. 

Within Hive, we need to point to the database where we want our tables to be created (with the "use" command), otherwise Hive will use the "default" database.
As you can see here, the Hive shell has changed to the specified database name.

The table creation is as shown below:
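A sketch of the DDL matching the movies file layout described in the Dataset section (the column names here are my own choice; a LOCATION clause could be appended if a specific target directory is required):

use practice;

create table movies (
  movie_id     INT,
  movie_name   STRING,
  release_year INT,
  rating       FLOAT,
  duration_sec INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;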


The create table statement is more or less similar to Oracle's command.  The difference is that for Hive you need to specify how your underlying data file is organized, which is done by the three lines starting from ROW FORMAT through LINES TERMINATED BY.  These lines are self-explanatory.

The LOCATION parameter is optional; if specified, it tells Hive which HDFS directory to use for the table's data.  Please note that this is the target directory path, not the source file path.

After successful creation of the table, it will be listed by the "show tables" command and a corresponding directory will appear on HDFS (under the database directory, or at the specified LOCATION).

Loading Data into Hive Table

Hive allows you to load data into a table from source files using the command LOAD DATA.  The syntax for this command is as below:
LOAD DATA [LOCAL] INPATH '<source path>' OVERWRITE INTO TABLE <table name>;

When LOCAL is omitted, Hive looks for the file on HDFS; otherwise Hive looks for the source file on the local file system.  The data loading in Hive is as below:
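Assuming the file was copied to /user/root/samples/movies.csv as in the File Transfer section (movies.csv again being a placeholder name):

LOAD DATA INPATH '/user/root/samples/movies.csv' OVERWRITE INTO TABLE movies;
-- note: without LOCAL, the source file is moved (not copied) from its HDFS location into the table's directory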

Verifying Hive Table Contents

To describe the table metadata and to get the row count of the table:
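For example:

describe movies;
select count(*) from movies;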


As you can see, the "select count(*) from movies" has initiated MapReduce job by Hive.

To view a limited row set, say 5 rows, use the LIMIT clause as below:
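For instance:

select * from movies limit 5;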


In this tutorial, we have shown how to load data into HDFS from the local system, create a Hive database and table, load data into the Hive table, and run some simple HiveQL queries against the table.







2017/04/22

Apache Pig 0.16.0 Error 2998 Unhandled internal Error

After installing Pig 0.16.0, when I tried to invoke Pig on Hadoop I started getting a few errors -

Cannot locate pig-withouthadoop.jar

After setting PIG_HOME to point to the 0.16.0 version, invoking pig from the command line threw the below error:


This could be easily resolved by explicitly invoking Pig with the full path where Pig is installed on the machine, as below:
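Something along these lines, assuming Pig is installed under /opt/pig-0.16.0 (a hypothetical path - adjust to your installation directory):

/opt/pig-0.16.0/bin/pig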

Error 2998 Unhandled internal Error. Jline.Terminal

When Pig was initiated, either to run a Pig Latin script or to invoke grunt, it started throwing another error - jline.Terminal.getTerminal()Ljline/Terminal - as below:

This error is resolved by setting the environment variable HADOOP_USER_CLASSPATH_FIRST=true, as below.  After setting the environment variable, the jline error went away.
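A sketch of the fix (using the same hypothetical install path as above; add the export to ~/.bashrc to make it permanent):

export HADOOP_USER_CLASSPATH_FIRST=true
/opt/pig-0.16.0/bin/pig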


