Tuesday 28 November 2017

Using Sqoop - I

We take a brief look at Sqoop in this post. For all the work in this post, we will use the Oracle Big Data Lite 4.7.0 Virtual Machine.

Sqoop is a tool for transferring bulk data between Apache Hadoop and structured datastores such as relational databases. More details about this project are here.

The migration of data is key to some processes as output from a Map Reduce process in Hadoop may be ported to a relational database for further analysis and vice versa. In that sense, Sqoop is an important tool.

After installing Sqoop, we have to place the JDBC driver for the database that we propose to establish a connection to. Sqoop is already installed and the JDBC driver to connect to Oracle is located at /var/lib/sqoop/ as shown below:

















We can see two links there: MySQL JDBC driver and Oracle JDBC driver. ojdbc6.jar is a link to /u01/app/oracle/product/12.1.0.2/dbhome_1/jdbc/lib/ojdbc6.jar. If one does not have it, then, one can download it from here and copy it to /var/lib/sqoop/ directory.

We can then check the version of Sqoop using below command on the terminal window:

sqoop version










To see the different available commands, we can use below command:

sqoop help


















Once we are done with checking that the JDBC driver is in place and the version of Sqoop, let us start the database so that we can try out some data transfers between Oracle and HDFS. Click on Start/Stop Services as shown below. This will bring up the window that can be used to start the Oracle database.















Once Oracle Database 12c is selected, press the spacebar to select it and hit OK button to start it. You are lead to below window indicating that the database is starting:


















The first command will be list all the databases in the database using below command:

sqoop list-databases --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser 

The results returned are shown below:


















The command itself is quite simple. The command starts with Sqoop followed by list_databases. username and password are the credentials for the user to connect to the database. The complete results can be seen by scrolling.

The next command lists the tables:

sqoop list-tables --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser

The results returned are shown below:












There is only one table called TICKER.

Open SQL Developer to check out the table called TICKER under pmuser as shown below:

















Let us see the command for reading TICKER is shown below:

sqoop eval --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --query "select * from  ticker where rownum <= 5"

The results returned are shown below:


















We can compare the above results with the results in SQL Developer. They compare well. eval will evaluate the SQL query that follows and displays results. connect will specify the jdbc driver. query specifies the SQL query.

We can modify the query to reflect only the date in TSTAMP as shown below:

sqoop eval --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --query "select symbol,to_char(tstamp,'mm-dd-yyyy'),price from ticker where rownum <= 5"

The results returned are shown below:








The rest of the Sqoop details are in the following post.