Tuesday, 28 November 2017

Using Sqoop - II

In the first part of this Sqoop series, we saw how Sqoop is invoked and how it can be used to check the data in an Oracle database. In this post, we will attempt data migration. For all the work in this post, we will use the Oracle Big Data Lite 4.7.0 Virtual Machine.

The first command we will see imports data from Oracle:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir/

Here, --target-dir specifies the target directory in HDFS. The command fails with an error stating that no primary key could be found on the TICKER table: without a primary key, Sqoop has no column on which to split the table among parallel mappers. We will take the option of adding -m 1; the number after -m specifies the number of mappers to be used in parallel for the import, and with a single mapper no split column is needed.

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir/ -m 1

The results returned are shown below:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir/ -m 1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/11/28 09:07:53 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.9.0
17/11/28 09:07:53 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/11/28 09:07:53 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
17/11/28 09:07:53 INFO manager.SqlManager: Using default fetchSize of 1000
17/11/28 09:07:53 INFO tool.CodeGenTool: Beginning code generation
17/11/28 09:07:54 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "TICKER" t WHERE 1=0
17/11/28 09:07:54 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-oracle/compile/a17fd6bd0c3377aa29a6891087a863e2/TICKER.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/11/28 09:07:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/a17fd6bd0c3377aa29a6891087a863e2/TICKER.jar
17/11/28 09:07:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:56 INFO mapreduce.ImportJobBase: Beginning import of TICKER
17/11/28 09:07:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/11/28 09:07:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
17/11/28 09:07:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/11/28 09:08:01 INFO db.DBInputFormat: Using read commited transaction isolation
17/11/28 09:08:01 INFO mapreduce.JobSubmitter: number of splits:1
17/11/28 09:08:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511848368976_0003
17/11/28 09:08:02 INFO impl.YarnClientImpl: Submitted application application_1511848368976_0003
17/11/28 09:08:02 INFO mapreduce.Job: The url to track the job: http://bigdatalite.localdomain:8088/proxy/application_1511848368976_0003/
17/11/28 09:08:02 INFO mapreduce.Job: Running job: job_1511848368976_0003
17/11/28 09:08:09 INFO mapreduce.Job: Job job_1511848368976_0003 running in uber mode : false
17/11/28 09:08:09 INFO mapreduce.Job:  map 0% reduce 0%
17/11/28 09:08:15 INFO mapreduce.Job:  map 100% reduce 0%
17/11/28 09:08:15 INFO mapreduce.Job: Job job_1511848368976_0003 completed successfully
17/11/28 09:08:16 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=143601
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=1859
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3850
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=3850
        Total vcore-seconds taken by all map tasks=3850
        Total megabyte-seconds taken by all map tasks=3942400
    Map-Reduce Framework
        Map input records=60
        Map output records=60
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=78
        CPU time spent (ms)=1690
        Physical memory (bytes) snapshot=184446976
        Virtual memory (bytes) snapshot=2101907456
        Total committed heap usage (bytes)=115867648
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=1859
17/11/28 09:08:16 INFO mapreduce.ImportJobBase: Transferred 1.8154 KB in 18.6901 seconds (99.4642 bytes/sec)
17/11/28 09:08:16 INFO mapreduce.ImportJobBase: Retrieved 60 records.
[oracle@bigdatalite ~]$
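
Sqoop's error message itself suggests an alternative to the single-mapper route: name a split column explicitly with --split-by, which keeps the import parallel. A sketch under assumptions (the target directory here is a fresh, hypothetical path, and SYMBOL is used as the split column; splitting on a text column works but Sqoop warns that an integral column is a better choice):

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir-split/ --split-by SYMBOL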

The following two commands verify the records that were imported into files in HDFS. Note that the values are comma-separated.


hadoop fs -ls /user/oracle/import-dir

hadoop fs -cat /user/oracle/import-dir/part-m-00000 | head
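
Since the job log above reported "Retrieved 60 records", a quick sanity check is to count the lines in the imported file, which should come back as 60:

hadoop fs -cat /user/oracle/import-dir/part-m-00000 | wc -l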

We can also specify a WHERE condition, as shown below:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir_1/ -m 1 --where "symbol like 'GLOBEX'"
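
The --where value can be any predicate the source database accepts, so conditions can be combined. A hypothetical sketch (the PRICE column and the threshold are illustrative assumptions, not taken from the VM's schema):

# PRICE and the threshold below are hypothetical, for illustration only
sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir_1b/ -m 1 --where "symbol = 'GLOBEX' and price > 100"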
 

Instead of a table, we can import the results of a free-form query, as shown below. The query must contain the literal token $CONDITIONS (Sqoop replaces it with its split predicate at run time), and --split-by names the column used to divide the work among the mappers:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --query 'select * from customer_segment where $CONDITIONS' --target-dir /user/oracle/import-dir_2/ --split-by segment_id 
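
With a single mapper, --split-by can be dropped, but $CONDITIONS must still be present; Sqoop then substitutes a predicate that is always true. A minimal sketch writing to a fresh, hypothetical directory:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --query 'select * from customer_segment where $CONDITIONS' --target-dir /user/oracle/import-dir_2b/ -m 1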
  
The imported data can be checked with the commands below:

hadoop fs -ls /user/oracle/import-dir_2

hadoop fs -cat /user/oracle/import-dir_2/part-m-0000* | head
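
Because this import ran with multiple mappers, the rows are spread across several part-m files, which is why the glob is used above. Counting the concatenated lines gives the total number of imported rows:

hadoop fs -cat /user/oracle/import-dir_2/part-m-0000* | wc -l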

In the last step, we will see a data export from HDFS back to Oracle. Let us first create an empty target table in Oracle using the command below:

create table customer_segment_1 as select * from customer_segment where 1=2;

Because the predicate 1=2 is never true, this creates the table with the same structure as CUSTOMER_SEGMENT but no rows in it.

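We can confirm the table is empty without leaving the shell by using sqoop eval, which runs a single SQL statement against the database and prints the result:

sqoop eval --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --query "select count(*) from customer_segment_1"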

We can then export the same data that we imported earlier using the command below:

sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CUSTOMER_SEGMENT_1 --export-dir /user/oracle/import-dir_2/part-m-0000*

The output is shown below:

[oracle@bigdatalite ~]$ sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CUSTOMER_SEGMENT_1 --export-dir /user/oracle/import-dir_2/part-m-0000*
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/11/28 13:22:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.9.0
17/11/28 13:22:51 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/11/28 13:22:51 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
17/11/28 13:22:51 INFO manager.SqlManager: Using default fetchSize of 1000
17/11/28 13:22:51 INFO tool.CodeGenTool: Beginning code generation
17/11/28 13:22:52 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 13:22:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "CUSTOMER_SEGMENT_1" t WHERE 1=0
17/11/28 13:22:52 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-oracle/compile/1110053d6710a6e6fd21b138ca20cc29/CUSTOMER_SEGMENT_1.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/11/28 13:22:54 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/1110053d6710a6e6fd21b138ca20cc29/CUSTOMER_SEGMENT_1.jar
17/11/28 13:22:54 INFO mapreduce.ExportJobBase: Beginning export of CUSTOMER_SEGMENT_1
17/11/28 13:22:55 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/11/28 13:22:55 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
17/11/28 13:22:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 13:22:56 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
17/11/28 13:22:56 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
17/11/28 13:22:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
17/11/28 13:22:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/11/28 13:23:00 INFO input.FileInputFormat: Total input paths to process : 4
17/11/28 13:23:00 INFO input.FileInputFormat: Total input paths to process : 4
17/11/28 13:23:00 INFO mapreduce.JobSubmitter: number of splits:3
17/11/28 13:23:00 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
17/11/28 13:23:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511886050393_0008
17/11/28 13:23:00 INFO impl.YarnClientImpl: Submitted application application_1511886050393_0008
17/11/28 13:23:00 INFO mapreduce.Job: The url to track the job: http://bigdatalite.localdomain:8088/proxy/application_1511886050393_0008/
17/11/28 13:23:00 INFO mapreduce.Job: Running job: job_1511886050393_0008
17/11/28 13:23:06 INFO mapreduce.Job: Job job_1511886050393_0008 running in uber mode : false
17/11/28 13:23:06 INFO mapreduce.Job:  map 0% reduce 0%
17/11/28 13:23:16 INFO mapreduce.Job:  map 33% reduce 0%
17/11/28 13:23:18 INFO mapreduce.Job:  map 100% reduce 0%
17/11/28 13:23:18 INFO mapreduce.Job: Job job_1511886050393_0008 completed successfully
17/11/28 13:23:18 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=430611
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1430
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=21
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Launched map tasks=3
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=22352
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=22352
        Total vcore-seconds taken by all map tasks=22352
        Total megabyte-seconds taken by all map tasks=22888448
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Input split bytes=732
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=425
        CPU time spent (ms)=3920
        Physical memory (bytes) snapshot=548945920
        Virtual memory (bytes) snapshot=6293712896
        Total committed heap usage (bytes)=354418688
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
17/11/28 13:23:18 INFO mapreduce.ExportJobBase: Transferred 1.3965 KB in 22.3071 seconds (64.1051 bytes/sec)
17/11/28 13:23:18 INFO mapreduce.ExportJobBase: Exported 10 records.
[oracle@bigdatalite ~]$ 
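
As the warning near the top of the log points out, passing --password on the command line is insecure; with -P, Sqoop prompts for the password instead. Note also that --export-dir can point at the directory itself rather than a glob of part files. The same export as a sketch (re-running it would, of course, insert the rows a second time):

sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO -P --table CUSTOMER_SEGMENT_1 --export-dir /user/oracle/import-dir_2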


We can then check the data in the table in the database; the row count should match the 10 records the export job reported.

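One way to check from the VM's shell is a quick sqlplus one-liner (assuming sqlplus is on the PATH; the connect string mirrors the JDBC URL used throughout this post):

echo "select count(*) from customer_segment_1;" | sqlplus -s MOVIEDEMO/welcome1@localhost:1521/orcl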

This concludes the post on Sqoop.