Showing posts with label oracle. Show all posts
Showing posts with label oracle. Show all posts

Tuesday, 28 November 2017

Using Sqoop - II

In the earlier part of  Sqoop post, we saw how it is invoked and how it can be used to check the data on Oracle database. In this post, we will attempt data migration. For all the work in this post, we will use the Oracle Big Data Lite 4.7.0 Virtual Machine.

The first command we will see is that of importing data from Oracle. The command is:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir/

The results returned are shown below:



















target_dir specifies the target directory. An error is returned stating that no primary key could be found on TICKER table. We will take the option of adding -m 1. the number after -m specifies the number of mappers to be used in parallel for the import process.

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir/ -m 1

The results returned are shown below:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir/ -m 1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/11/28 09:07:53 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.9.0
17/11/28 09:07:53 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/11/28 09:07:53 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
17/11/28 09:07:53 INFO manager.SqlManager: Using default fetchSize of 1000
17/11/28 09:07:53 INFO tool.CodeGenTool: Beginning code generation
17/11/28 09:07:54 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "TICKER" t WHERE 1=0
17/11/28 09:07:54 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-oracle/compile/a17fd6bd0c3377aa29a6891087a863e2/TICKER.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/11/28 09:07:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/a17fd6bd0c3377aa29a6891087a863e2/TICKER.jar
17/11/28 09:07:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:56 INFO mapreduce.ImportJobBase: Beginning import of TICKER
17/11/28 09:07:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/11/28 09:07:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
17/11/28 09:07:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/11/28 09:08:01 INFO db.DBInputFormat: Using read commited transaction isolation
17/11/28 09:08:01 INFO mapreduce.JobSubmitter: number of splits:1
17/11/28 09:08:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511848368976_0003
17/11/28 09:08:02 INFO impl.YarnClientImpl: Submitted application application_1511848368976_0003
17/11/28 09:08:02 INFO mapreduce.Job: The url to track the job: http://bigdatalite.localdomain:8088/proxy/application_1511848368976_0003/
17/11/28 09:08:02 INFO mapreduce.Job: Running job: job_1511848368976_0003
17/11/28 09:08:09 INFO mapreduce.Job: Job job_1511848368976_0003 running in uber mode : false
17/11/28 09:08:09 INFO mapreduce.Job:  map 0% reduce 0%
17/11/28 09:08:15 INFO mapreduce.Job:  map 100% reduce 0%
17/11/28 09:08:15 INFO mapreduce.Job: Job job_1511848368976_0003 completed successfully
17/11/28 09:08:16 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=143601
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=1859
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3850
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=3850
        Total vcore-seconds taken by all map tasks=3850
        Total megabyte-seconds taken by all map tasks=3942400
    Map-Reduce Framework
        Map input records=60
        Map output records=60
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=78
        CPU time spent (ms)=1690
        Physical memory (bytes) snapshot=184446976
        Virtual memory (bytes) snapshot=2101907456
        Total committed heap usage (bytes)=115867648
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=1859
17/11/28 09:08:16 INFO mapreduce.ImportJobBase: Transferred 1.8154 KB in 18.6901 seconds (99.4642 bytes/sec)
17/11/28 09:08:16 INFO mapreduce.ImportJobBase: Retrieved 60 records.
[oracle@bigdatalite ~]$

Following two commands will validate the records that were imported into files in HDFS. Note that the values are separated by commas.


hadoop fs -ls /user/oracle/import-dir

hadoop fs -cat /user/oracle/import-dir/part-m-00000 | head













We can specify a where condition too as shown below:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir_1/ -m 1 --where "symbol like 'GLOBEX'"
 

The results returned are shown below:










Instead of a table,we can specify the results of a query as shown below:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --query 'select * from customer_segment where $CONDITIONS' --target-dir /user/oracle/import-dir_2/ --split-by segment_id 
  
The results returned can be checked with below commands:

hadoop fs -ls /user/oracle/import-dir_2

hadoop fs -cat /user/oracle/import-dir_2/part-m-0000* | head



















In the last step, we will see a data export from HDFS to Oracle. Let us create a table in Oracle using below command:

create table customer_segment_1 as select * from customer_segment where 1=2;

The results returned are shown below:













This table will be created with no data in it as shown below:

















We can then export the same data that we imported earlier using below command:

sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CUSTOMER_SEGMENT_1 --export-dir /user/oracle/import-dir_2/part-m-0000*

The output is shown below:

 [oracle@bigdatalite ~]$ sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CUSTOMER_SEGMENT_1 --export-dir /user/oracle/import-dir_2/part-m-0000*
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/11/28 13:22:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.9.0
17/11/28 13:22:51 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/11/28 13:22:51 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
17/11/28 13:22:51 INFO manager.SqlManager: Using default fetchSize of 1000
17/11/28 13:22:51 INFO tool.CodeGenTool: Beginning code generation
17/11/28 13:22:52 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 13:22:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "CUSTOMER_SEGMENT_1" t WHERE 1=0
17/11/28 13:22:52 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-oracle/compile/1110053d6710a6e6fd21b138ca20cc29/CUSTOMER_SEGMENT_1.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/11/28 13:22:54 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/1110053d6710a6e6fd21b138ca20cc29/CUSTOMER_SEGMENT_1.jar
17/11/28 13:22:54 INFO mapreduce.ExportJobBase: Beginning export of CUSTOMER_SEGMENT_1
17/11/28 13:22:55 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/11/28 13:22:55 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
17/11/28 13:22:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 13:22:56 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
17/11/28 13:22:56 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
17/11/28 13:22:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
17/11/28 13:22:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/11/28 13:23:00 INFO input.FileInputFormat: Total input paths to process : 4
17/11/28 13:23:00 INFO input.FileInputFormat: Total input paths to process : 4
17/11/28 13:23:00 INFO mapreduce.JobSubmitter: number of splits:3
17/11/28 13:23:00 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
17/11/28 13:23:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511886050393_0008
17/11/28 13:23:00 INFO impl.YarnClientImpl: Submitted application application_1511886050393_0008
17/11/28 13:23:00 INFO mapreduce.Job: The url to track the job: http://bigdatalite.localdomain:8088/proxy/application_1511886050393_0008/
17/11/28 13:23:00 INFO mapreduce.Job: Running job: job_1511886050393_0008
17/11/28 13:23:06 INFO mapreduce.Job: Job job_1511886050393_0008 running in uber mode : false
17/11/28 13:23:06 INFO mapreduce.Job:  map 0% reduce 0%
17/11/28 13:23:16 INFO mapreduce.Job:  map 33% reduce 0%
17/11/28 13:23:18 INFO mapreduce.Job:  map 100% reduce 0%
17/11/28 13:23:18 INFO mapreduce.Job: Job job_1511886050393_0008 completed successfully
17/11/28 13:23:18 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=430611
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1430
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=21
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Launched map tasks=3
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=22352
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=22352
        Total vcore-seconds taken by all map tasks=22352
        Total megabyte-seconds taken by all map tasks=22888448
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Input split bytes=732
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=425
        CPU time spent (ms)=3920
        Physical memory (bytes) snapshot=548945920
        Virtual memory (bytes) snapshot=6293712896
        Total committed heap usage (bytes)=354418688
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
17/11/28 13:23:18 INFO mapreduce.ExportJobBase: Transferred 1.3965 KB in 22.3071 seconds (64.1051 bytes/sec)
17/11/28 13:23:18 INFO mapreduce.ExportJobBase: Exported 10 records.
[oracle@bigdatalite ~]$ 


We can then check the data in the table in the database as has been shown below:












This concludes tbe post on Sqoop.

Friday, 27 October 2017

SQL in R - II

As a continuation of SQL in R, we wade into into the final and conclusive part after we saw glimpses of this topic in Part 1. For starters, let us begin with the group by clause, two examples of which are shown below:

> sqldf("select am,sum(wt) from mtcars group by am")
  am sum(wt)
1  0  71.609
2  1  31.343
> sqldf("select carb,avg(wt) from mtcars group by carb")
  carb avg(wt)
1    1  2.4900
2    2  2.8628
3    3  3.8600
4    4  3.8974
5    6  2.7700
6    8  3.5700

Group by clause usage is shown in the next query:

> sqldf("select carb,avg(wt) from mtcars group by carb having avg(wt) >= 3.0")
  carb avg(wt)

1    3  3.8600
2    4  3.8974
3    8  3.5700

The next query shows a subquery:


> sqldf("select * from mtcars where disp = (select max(disp) from mtcars)")
   mpg cyl disp  hp drat   wt  qsec vs am gear carb
1 10.4   8  472 205 2.93 5.25 17.98  0  0    3    4

Set operations like union can be performed and an example is shown below:

> sqldf("select * from mtcars where disp = (select max(disp) from mtcars) union select * from mtcars where disp = (select min(disp) from mtcars)", row.names = TRUE)
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Toyota Corolla     33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

The last query gives the details of the vehicle having the maximum and minimum displacement in cubic inches.

The next query shows a correlated subquery:


> sqldf("select * from employee_data as e1 where salary > (select avg(salary) from employee_data where department_id = e1.department_id)")
   EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL       PHONE_NUMBER  HIRE_DATE     JOB_ID SALARY
1          100     Steven      King    SKING       515.123.4567 2003-06-17    AD_PRES  24000
2          103  Alexander    Hunold  AHUNOLD       590.423.4567 2006-01-03    IT_PROG   9000
3          104      Bruce     Ernst   BERNST       590.423.4568 2007-05-21    IT_PROG   6000
4          108      Nancy Greenberg NGREENBE       515.124.4569 2002-08-17     FI_MGR  12008
5          109     Daniel    Faviet  DFAVIET       515.124.4169 2002-08-16 FI_ACCOUNT   9000
6          114        Den  Raphaely DRAPHEAL       515.127.4561 2002-12-07     PU_MAN  11000
7          120    Matthew     Weiss   MWEISS       650.123.1234 2004-07-18     ST_MAN   8000
8          121       Adam     Fripp   AFRIPP       650.123.2234 2005-04-10     ST_MAN   8200
9          122      Payam  Kaufling PKAUFLIN       650.123.3234 2003-05-01     ST_MAN   7900
10         123     Shanta   Vollman SVOLLMAN       650.123.4234 2005-10-10     ST_MAN   6500
11         124      Kevin   Mourgos KMOURGOS       650.123.5234 2007-11-16     ST_MAN   5800
12         137     Renske    Ladwig  RLADWIG       650.121.1234 2003-07-14   ST_CLERK   3600
13         141     Trenna      Rajs    TRAJS       650.121.8009 2003-10-17   ST_CLERK   3500
14         145       John   Russell  JRUSSEL 011.44.1344.429268 2004-10-01     SA_MAN  14000
15         146      Karen  Partners KPARTNER 011.44.1344.467268 2005-01-05     SA_MAN  13500
16         147    Alberto Errazuriz AERRAZUR 011.44.1344.429278 2005-03-10     SA_MAN  12000
17         148     Gerald Cambrault GCAMBRAU 011.44.1344.619268 2007-10-15     SA_MAN  11000
18         149      Eleni   Zlotkey EZLOTKEY 011.44.1344.429018 2008-01-29     SA_MAN  10500
19         150      Peter    Tucker  PTUCKER 011.44.1344.129268 2005-01-30     SA_REP  10000
20         151      David Bernstein DBERNSTE 011.44.1344.345268 2005-03-24     SA_REP   9500
21         152      Peter      Hall    PHALL 011.44.1344.478968 2005-08-20     SA_REP   9000
22         156    Janette      King    JKING 011.44.1345.429268 2004-01-30     SA_REP  10000
23         157    Patrick     Sully   PSULLY 011.44.1345.929268 2004-03-04     SA_REP   9500
24         158      Allan    McEwen  AMCEWEN 011.44.1345.829268 2004-08-01     SA_REP   9000
25         162      Clara   Vishney CVISHNEY 011.44.1346.129268 2005-11-11     SA_REP  10500
26         163   Danielle    Greene  DGREENE 011.44.1346.229268 2007-03-19     SA_REP   9500
27         168       Lisa      Ozer    LOZER 011.44.1343.929268 2005-03-11     SA_REP  11500
28         169   Harrison     Bloom   HBLOOM 011.44.1343.829268 2006-03-23     SA_REP  10000
29         170     Tayler       Fox     TFOX 011.44.1343.729268 2006-01-24     SA_REP   9600
30         174      Ellen      Abel    EABEL 011.44.1644.429267 2004-05-11     SA_REP  11000
31         184    Nandita  Sarchand NSARCHAN       650.509.1876 2004-01-27   SH_CLERK   4200
32         185     Alexis      Bull    ABULL       650.509.2876 2005-02-20   SH_CLERK   4100
33         188      Kelly     Chung   KCHUNG       650.505.1876 2005-06-14   SH_CLERK   3800
34         189   Jennifer     Dilly   JDILLY       650.505.2876 2005-08-13   SH_CLERK   3600
35         192      Sarah      Bell    SBELL       650.501.1876 2004-02-04   SH_CLERK   4000
36         193    Britney   Everett BEVERETT       650.501.2876 2005-03-03   SH_CLERK   3900
37         201    Michael Hartstein MHARTSTE       515.123.5555 2004-02-17     MK_MAN  13000
38         205    Shelley   Higgins SHIGGINS       515.123.8080 2002-06-07     AC_MGR  12008
   COMMISSION_PCT MANAGER_ID DEPARTMENT_ID
1              NA         NA            90
2              NA        102            60
3              NA        103            60
4              NA        101           100
5              NA        108           100
6              NA        100            30
7              NA        100            50
8              NA        100            50
9              NA        100            50
10             NA        100            50
11             NA        100            50
12             NA        123            50
13             NA        124            50
14           0.40        100            80
15           0.30        100            80
16           0.30        100            80
17           0.30        100            80
18           0.20        100            80
19           0.30        145            80
20           0.25        145            80
21           0.25        145            80
22           0.35        146            80
23           0.35        146            80
24           0.35        146            80
25           0.25        147            80
26           0.15        147            80
27           0.25        148            80
28           0.20        148            80
29           0.20        148            80
30           0.30        149            80
31             NA        121            50
32             NA        121            50
33             NA        122            50
34             NA        122            50
35             NA        123            50
36             NA        123            50
37             NA        100            20
38             NA        101           110

The above query returns those employees that have salaries above the average salaries in their respective departments. In the next query we see how we can create a temporary table and use that temporary table for querying.

> sqldf(c("create temp table temp_1 as select * from mtcars where qsec > 17.0", "select gear,avg(disp) from temp_1 group by gear"))
  gear avg(disp)
1    3  322.2083
2    4  119.6545

In the above query, a temporary table called temp_1 that only exists for querying in this command . The next query uses temp_1 to calculate the average of displacement by gear.

To illustrate update feature, we a dataframe called update_sample and update the missing values in second column from the values in first column.

> update_sample <- data.frame(a = 1:5, b = c(3, NA, 5, NA, 7))
> update_sample

  a  b
1 1  3
2 2 NA
3 3  5
4 4 NA
5 5  7
> sqldf(c("update update_sample set b = a where b is null", "select * from update_sample"))
  a b
1 1 3
2 2 2
3 3 5
4 4 4
5 5 7

It is to be remembered that update_sample is unchanged and the updated record set is only to be used in the second query as shown above. This behavior is the same for the insert and delete statements. The insert as shown below inserts a new row with values 15 and NA:

> sqldf(c("insert into update_sample (a,b) values (15,NULL)", "select * from update_sample"))
   a  b
1  1  3
2  2 NA
3  3  5
4  4 NA
5  5  7
6 15 NA

The delete keyword deletes a row that has a equals 4 as shown below:

 update_sample <- data.frame(a = 1:5, b = c(3, NA, 5, NA, 7))
> update_sample
  a  b
1 1  3
2 2 NA
3 3  5
4 4 NA
5 5  7
> sqldf(c("delete from update_sample where a = 4", 
+ "select * from update_sample"))
  a  b
1 1  3
2 2 NA
3 3  5
4 5  7
 
To delete all the records in the table, we can use the command shown below:

> sqldf(c("delete from update_sample", "select * from update_sample"))
[1] a b
<0 rows> (or 0-length row.names)

In the last section, we explore joins. Let us import some data from Oracle into R using RODBC. The package RODBC needs to be installed and loaded as shown below:

> library(RODBC)
Error in library(RODBC) : there is no package called ‘RODBC’
> install.packages("RODBC")
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RODBC_1.3-15.zip'
Content type 'application/zip' length 831635 bytes (812 KB)
downloaded 812 KB

package ‘RODBC’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\Administrator\AppData\Local\Temp\RtmpU3bkd9\downloaded_packages
> library("RODBC", lib.loc="C:/Program Files/R/R-3.4.1/library")

Once above steps are carried out, we can establish a connection to Oracle as shown below:

> connection <- odbcConnect("Oracle", uid="hr", pwd="hr", rows_at_time = 1000)

The command is self-explanatory. Oracle is a DSN to facilitate this connection. Once the connection is established, we can start issuing queries as shown below:

> sqlQuery(connection,"select count(*) from employees")
  COUNT(*)
1      107

We can fetch the data from Oracle as shown below where the data from employee table in Oracle is imported into employee_data.

> employee_data <- sqlQuery(connection,"select * from employees")
>
> class(employee_data)
[1] "data.frame"
>
> colnames(employee_data)
 [1] "EMPLOYEE_ID"    "FIRST_NAME"     "LAST_NAME"      "EMAIL"          "PHONE_NUMBER" 
 [6] "HIRE_DATE"      "JOB_ID"         "SALARY"         "COMMISSION_PCT" "MANAGER_ID"   
[11] "DEPARTMENT_ID"
>
> str(employee_data)
'data.frame':    107 obs. of  11 variables:
 $ EMPLOYEE_ID   : int  100 101 102 103 104 105 106 107 108 109 ...
 $ FIRST_NAME    : Factor w/ 91 levels "Adam","Alana",..: 80 64 51 4 11 18 88 20 61 16 ...
 $ LAST_NAME     : Factor w/ 102 levels "Abel","Ande",..: 50 51 20 44 24 4 74 57 38 27 ...
 $ EMAIL         : Factor w/ 107 levels "ABANDA","ABULL",..: 90 74 61 6 11 17 104 23 73 19 ...
 $ PHONE_NUMBER  : Factor w/ 107 levels "011.44.1343.329268",..: 37 38 39 58 59 60 57 61 50 45 ...
 $ HIRE_DATE     : POSIXct, format: "2003-06-17" "2005-09-21" "2001-01-13" ...
 $ JOB_ID        : Factor w/ 19 levels "AC_ACCOUNT","AC_MGR",..: 4 5 5 9 9 9 9 9 7 6 ...
 $ SALARY        : int  24000 17000 17000 9000 6000 4800 4800 4200 12008 9000 ...
 $ COMMISSION_PCT: num  NA NA NA NA NA NA NA NA NA NA ...
 $ MANAGER_ID    : int  NA 100 100 102 103 103 103 103 101 108 ...
 $ DEPARTMENT_ID : int  90 90 90 60 60 60 60 60 100 100 ...
>

In the same manner, we import the department data as shown below:

> department_data <- sqlQuery(connection,"select * from departments")
> department_data
   DEPARTMENT_ID      DEPARTMENT_NAME MANAGER_ID LOCATION_ID
1             10       Administration        200        1700
2             20            Marketing        201        1800
3             30           Purchasing        114        1700
4             40      Human Resources        203        2400
5             50             Shipping        121        1500
6             60                   IT        103        1400
7             70     Public Relations        204        2700
8             80                Sales        145        2500
9             90            Executive        100        1700
10           100              Finance        108        1700
11           110           Accounting        205        1700
12           120             Treasury         NA        1700
13           130        Corporate Tax         NA        1700
14           140   Control And Credit         NA        1700
15           150 Shareholder Services         NA        1700
16           160             Benefits         NA        1700
17           170        Manufacturing         NA        1700
18           180         Construction         NA        1700
19           190          Contracting         NA        1700
20           200           Operations         NA        1700
21           210           IT Support         NA        1700
22           220                  NOC         NA        1700
23           230          IT Helpdesk         NA        1700
24           240     Government Sales         NA        1700
25           250         Retail Sales         NA        1700
26           260           Recruiting         NA        1700
27           270              Payroll         NA        1700

Now that we have the employee data and department data, we can perform a inner join operation in ANSI syntax as shown below (only the first five records are shown):

> sqldf("select employee_data.*,department_name from employee_data inner join department_data on employee_data.department_id = department_data.department_id limit 5")
  EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL PHONE_NUMBER  HIRE_DATE  JOB_ID SALARY
1         100     Steven      King    SKING 515.123.4567 2003-06-17 AD_PRES  24000
2         101      Neena   Kochhar NKOCHHAR 515.123.4568 2005-09-21   AD_VP  17000
3         102        Lex   De Haan  LDEHAAN 515.123.4569 2001-01-13   AD_VP  17000
4         103  Alexander    Hunold  AHUNOLD 590.423.4567 2006-01-03 IT_PROG   9000
5         104      Bruce     Ernst   BERNST 590.423.4568 2007-05-21 IT_PROG   6000
  COMMISSION_PCT MANAGER_ID DEPARTMENT_ID DEPARTMENT_NAME
1             NA         NA            90       Executive
2             NA        100            90       Executive
3             NA        100            90       Executive
4             NA        102            60              IT
5             NA        103            60              IT

The same inner join in non ANSI syntax is shown below:


> sqldf("select e1.*,d1.department_name from employee_data e1, department_data d1 where e1.department_id = d1.department_id limit 5")
  EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL PHONE_NUMBER  HIRE_DATE  JOB_ID SALARY
1         100     Steven      King    SKING 515.123.4567 2003-06-17 AD_PRES  24000
2         101      Neena   Kochhar NKOCHHAR 515.123.4568 2005-09-21   AD_VP  17000
3         102        Lex   De Haan  LDEHAAN 515.123.4569 2001-01-13   AD_VP  17000
4         103  Alexander    Hunold  AHUNOLD 590.423.4567 2006-01-03 IT_PROG   9000
5         104      Bruce     Ernst   BERNST 590.423.4568 2007-05-21 IT_PROG   6000
  COMMISSION_PCT MANAGER_ID DEPARTMENT_ID DEPARTMENT_NAME
1             NA         NA            90       Executive
2             NA        100            90       Executive
3             NA        100            90       Executive
4             NA        102            60              IT
5             NA        103            60              IT

Left outer join is shown below:

> sqldf("select employee_data.*,department_name from employee_data left join department_data on employee_data.department_id = department_data.department_id order by department_name asc limit 5")
  EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL       PHONE_NUMBER  HIRE_DATE     JOB_ID SALARY
1         178  Kimberely     Grant   KGRANT 011.44.1644.429263 2007-05-24     SA_REP   7000
2         205    Shelley   Higgins SHIGGINS       515.123.8080 2002-06-07     AC_MGR  12008
3         206    William     Gietz   WGIETZ       515.123.8181 2002-06-07 AC_ACCOUNT   8300
4         200   Jennifer    Whalen  JWHALEN       515.123.4444 2003-09-17    AD_ASST   4400
5         100     Steven      King    SKING       515.123.4567 2003-06-17    AD_PRES  24000
  COMMISSION_PCT MANAGER_ID DEPARTMENT_ID DEPARTMENT_NAME
1           0.15        149            NA            <NA>
2             NA        101           110      Accounting
3             NA        205           110      Accounting
4             NA        101            10  Administration
5             NA         NA            90       Executive

Once all the queries against Oracle are complete, we can close the connection as shown below:

> close(connection)

Thus, we have seen how SQL can be used in R environment. More details about the available functions can be seen at:

https://www.sqlite.org/lang_corefunc.html

https://www.sqlite.org/lang_aggfunc.html

https://www.sqlite.org/lang_datefunc.html

Wednesday, 19 April 2017

Analytic Functions in Oracle - IX

In this post, we will take a look at MODEL keyword. The MODEL or SPREADSHEET keyword expands the functionality of SQL in that it gives the user the ability to treat the result set of a query as a multidimensional array and also apply calculations or formulas on the multidimensional array. These calculations or formulas that can be applied are quite flexible and can be applied, for example, on a column to be based on values in other columns.

We use Oracle 12c as the basis for all our queries. For this post, we will create a new view called sales_model that we will use for our queries related to this post. Sales_model view is based on tables in sample SH schema. The statement for creation of sales_model view is given below:

create or replace view sales_model 
as
select distinct extract(year from c.time_id) sale_year, a.prod_name product, b.channel_desc channel, sum(c.amount_sold) over (partition by extract (year from c.time_id), a.prod_name, b.channel_desc) amount from sh.products a, sh.channels b, sh.sales c
where a.prod_id = c.prod_id and b.channel_id = c.channel_id and prod_name like '%6%MB Memory Card' and  extract(year from c.time_id) in (1999,2000) order by a.prod_name;


returns

view SALES_MODEL created.

Below query will give an idea of  the data in the result that we will play with in the various queries using MODEL:
 
select * from sales_model;
















There are twelve rows having sale amount about two products over two years via different channels. 

Let us take up the first query on SALES_MODEL. The query is shown below:

select sale_year, product,channel, amount, new_amount
from sales_model
spreadsheet
partition by (sale_year, product)
dimension by (channel,amount)
measures (amount new_amount) ignore nav
rules
(new_amount['Internet',any]  = 100)
order by sale_year, product;


The results returned are shown below: 














To understand the results of the above query, we need to understand the different parts of the query:

a) SPREADSHEET keyword is used instead of MODEL.

b)  The query returns five columns: four from the original view and the fifth is calculated as an alias of amount.

c) The result set is partitioned by sale_year and product. The rules that will apply later will be only on the partitioned result set defined by the partition.

d) The columns mentioned in the dimension clause, channel and amount, are the ones used by the rules.

e) The measures clause contains the columns that will be used for the calculation. Here, it is column amount aliased to new_amount.

f) Rules use the columns in the dimension clause to act on the columns in measures clause for calculations defined in the rule. The rule here means that the new_amount column is an alias of amount column with the rule that if the channel column has value 'Internet' and for any value in amount column, the aliased amount column should have a default value of 100. So, rows 5, 8 and 12 contain 100 in new_amount column.

The second example has two rules as shown below:

select sale_year, product,channel, amount, new_amount
from sales_model
spreadsheet
partition by (sale_year, product)
dimension by (channel,amount)
measures (amount new_amount) ignore nav
rules
(new_amount['Internet',any]  = 100,
new_amount['Partners',any]  = 50
)
order by sale_year, product;


The results returned are shown below:














The above query just extends the rule to include 'Partners' also in channel column but, should have a value of 50 and the result is evident in the result set.

The next example shows how we can use calculations on existing column value by using cv() function. "cv" stands for current value.

select sale_year, product,channel, amount, new_amount
from sales_model where sale_year = 2000
spreadsheet
partition by (sale_year, product)
dimension by (channel,amount)
measures (amount new_amount) ignore nav
rules
(new_amount['Internet',any]  = 2*new_amount['Internet',cv()]
)
order by sale_year, product;


The results returned are shown below:








The records returned are fewer as there is a filter clause. The rules are defined such that in the new_amount column, the current values get doubled only when the value in channel column is 'Internet'.

One can also use values in another row for replacement as shown in below query:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
spreadsheet
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card','Partners']  = new_amount['64MB Memory Card','Internet'])
order by sale_year, product;


The results returned are shown below:


  






 In the results above, when the product is '64MB Memory Card' and channel is 'Partners', the new_amount is set to the value when the product is '64 MP Memory Card' and channel is 'Internet'. This is as per the rule that is set in the query.

Like the result above, we can also use calculations to arrive at a value for new_amount as shown below:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
spreadsheet
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card','Partners']  = new_amount['64MB Memory Card','Internet'] + new_amount['256MB Memory Card','Internet'])
order by sale_year, product;


The results returned are shown below:









As per the rule, when the product is '64MB Memory Card' and channel is 'Partners' , new_amount value is set to sum of new_amount values when products are '64MB Memory Card' and '256MB Memory Card' and the channel is 'Internet'. 4348.69 is a sum of 2814.4 and 1534.29.

The next query creates new records based on the rules as shown below:

select sale_year, product,channel, new_amount
from sales_model
model
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['1024MB Memory Card','Partners']  = new_amount['64MB Memory Card','Partners'] + new_amount['256MB Memory Card','Partners'])
order by sale_year, product;


The results returned are shown below:
















In the rules, we have created a new product called '1024MB Memory Card' with a channel of  'Partners'. Two records are created corresponding to the two sales_year values and  new_amount value is set to sum of new_amount values of the same channel but corresponding to existing two products.

To return only the new rows, use return updated rows clause as shown below:

select sale_year, product,channel, new_amount
from sales_model
model return updated rows
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['1024MB Memory Card','Partners']  = new_amount['64MB Memory Card','Partners'] + new_amount['256MB Memory Card','Partners'])
order by sale_year, product;


The results returned are shown below:

 


The return updated rows clause after the model keyword does the trick.

In the below query, we create a record based on a different channel:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
model
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card','Road Show']  = new_amount['64MB Memory Card','Partners'] + new_amount['256MB Memory Card','Partners'])
order by sale_year, product;


The results returned are shown below:










The newly created record is seen in row 6 with new_amount value as a sum of the two products but having channel value, 'Partners'.

We can also use aggregate functions across rows to set value on records as shown below:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
model
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card','Road Show']  = 2*max(new_amount)['256MB Memory Card',channel in ('Internet','Direct Sales')])
order by sale_year, product;


The results returned are shown below:









 
The row 6 contains 208919.82 in new_amount that is twice the maximum value,104459.91, between the new_amount values of product, '256MB Memory Card' in two channels, 'Internet' and 'Direct Sales'. 

The last example shows use of the for clause as shown below:

select sale_year, product,channel, new_amount
from sales_model where sale_year = 2000
spreadsheet
partition by (sale_year)
dimension by (product,channel)
measures (amount new_amount) ignore nav
rules
(new_amount['64MB Memory Card',for channel in ('Internet','Partners','Direct Sales')]  = 3*new_amount['64MB Memory Card',cv()])
order by sale_year, product;


The results returned are shown below:









For clause has been used to set values in new_amount with three times the existing value only for product, '64MB Memory Card'.

This concludes the post on MODEL keyword.