Friday 27 October 2017

SQL in R - II

As a continuation of SQL in R, we wade into into the final and conclusive part after we saw glimpses of this topic in Part 1. For starters, let us begin with the group by clause, two examples of which are shown below:

> sqldf("select am,sum(wt) from mtcars group by am")
  am sum(wt)
1  0  71.609
2  1  31.343
> sqldf("select carb,avg(wt) from mtcars group by carb")
  carb avg(wt)
1    1  2.4900
2    2  2.8628
3    3  3.8600
4    4  3.8974
5    6  2.7700
6    8  3.5700

Group by clause usage is shown in the next query:

> sqldf("select carb,avg(wt) from mtcars group by carb having avg(wt) >= 3.0")
  carb avg(wt)

1    3  3.8600
2    4  3.8974
3    8  3.5700

The next query shows a subquery:


> sqldf("select * from mtcars where disp = (select max(disp) from mtcars)")
   mpg cyl disp  hp drat   wt  qsec vs am gear carb
1 10.4   8  472 205 2.93 5.25 17.98  0  0    3    4

Set operations like union can be performed and an example is shown below:

> sqldf("select * from mtcars where disp = (select max(disp) from mtcars) union select * from mtcars where disp = (select min(disp) from mtcars)", row.names = TRUE)
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Toyota Corolla     33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

The last query gives the details of the vehicle having the maximum and minimum displacement in cubic inches.

The next query shows a correlated subquery:


> sqldf("select * from employee_data as e1 where salary > (select avg(salary) from employee_data where department_id = e1.department_id)")
   EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL       PHONE_NUMBER  HIRE_DATE     JOB_ID SALARY
1          100     Steven      King    SKING       515.123.4567 2003-06-17    AD_PRES  24000
2          103  Alexander    Hunold  AHUNOLD       590.423.4567 2006-01-03    IT_PROG   9000
3          104      Bruce     Ernst   BERNST       590.423.4568 2007-05-21    IT_PROG   6000
4          108      Nancy Greenberg NGREENBE       515.124.4569 2002-08-17     FI_MGR  12008
5          109     Daniel    Faviet  DFAVIET       515.124.4169 2002-08-16 FI_ACCOUNT   9000
6          114        Den  Raphaely DRAPHEAL       515.127.4561 2002-12-07     PU_MAN  11000
7          120    Matthew     Weiss   MWEISS       650.123.1234 2004-07-18     ST_MAN   8000
8          121       Adam     Fripp   AFRIPP       650.123.2234 2005-04-10     ST_MAN   8200
9          122      Payam  Kaufling PKAUFLIN       650.123.3234 2003-05-01     ST_MAN   7900
10         123     Shanta   Vollman SVOLLMAN       650.123.4234 2005-10-10     ST_MAN   6500
11         124      Kevin   Mourgos KMOURGOS       650.123.5234 2007-11-16     ST_MAN   5800
12         137     Renske    Ladwig  RLADWIG       650.121.1234 2003-07-14   ST_CLERK   3600
13         141     Trenna      Rajs    TRAJS       650.121.8009 2003-10-17   ST_CLERK   3500
14         145       John   Russell  JRUSSEL 011.44.1344.429268 2004-10-01     SA_MAN  14000
15         146      Karen  Partners KPARTNER 011.44.1344.467268 2005-01-05     SA_MAN  13500
16         147    Alberto Errazuriz AERRAZUR 011.44.1344.429278 2005-03-10     SA_MAN  12000
17         148     Gerald Cambrault GCAMBRAU 011.44.1344.619268 2007-10-15     SA_MAN  11000
18         149      Eleni   Zlotkey EZLOTKEY 011.44.1344.429018 2008-01-29     SA_MAN  10500
19         150      Peter    Tucker  PTUCKER 011.44.1344.129268 2005-01-30     SA_REP  10000
20         151      David Bernstein DBERNSTE 011.44.1344.345268 2005-03-24     SA_REP   9500
21         152      Peter      Hall    PHALL 011.44.1344.478968 2005-08-20     SA_REP   9000
22         156    Janette      King    JKING 011.44.1345.429268 2004-01-30     SA_REP  10000
23         157    Patrick     Sully   PSULLY 011.44.1345.929268 2004-03-04     SA_REP   9500
24         158      Allan    McEwen  AMCEWEN 011.44.1345.829268 2004-08-01     SA_REP   9000
25         162      Clara   Vishney CVISHNEY 011.44.1346.129268 2005-11-11     SA_REP  10500
26         163   Danielle    Greene  DGREENE 011.44.1346.229268 2007-03-19     SA_REP   9500
27         168       Lisa      Ozer    LOZER 011.44.1343.929268 2005-03-11     SA_REP  11500
28         169   Harrison     Bloom   HBLOOM 011.44.1343.829268 2006-03-23     SA_REP  10000
29         170     Tayler       Fox     TFOX 011.44.1343.729268 2006-01-24     SA_REP   9600
30         174      Ellen      Abel    EABEL 011.44.1644.429267 2004-05-11     SA_REP  11000
31         184    Nandita  Sarchand NSARCHAN       650.509.1876 2004-01-27   SH_CLERK   4200
32         185     Alexis      Bull    ABULL       650.509.2876 2005-02-20   SH_CLERK   4100
33         188      Kelly     Chung   KCHUNG       650.505.1876 2005-06-14   SH_CLERK   3800
34         189   Jennifer     Dilly   JDILLY       650.505.2876 2005-08-13   SH_CLERK   3600
35         192      Sarah      Bell    SBELL       650.501.1876 2004-02-04   SH_CLERK   4000
36         193    Britney   Everett BEVERETT       650.501.2876 2005-03-03   SH_CLERK   3900
37         201    Michael Hartstein MHARTSTE       515.123.5555 2004-02-17     MK_MAN  13000
38         205    Shelley   Higgins SHIGGINS       515.123.8080 2002-06-07     AC_MGR  12008
   COMMISSION_PCT MANAGER_ID DEPARTMENT_ID
1              NA         NA            90
2              NA        102            60
3              NA        103            60
4              NA        101           100
5              NA        108           100
6              NA        100            30
7              NA        100            50
8              NA        100            50
9              NA        100            50
10             NA        100            50
11             NA        100            50
12             NA        123            50
13             NA        124            50
14           0.40        100            80
15           0.30        100            80
16           0.30        100            80
17           0.30        100            80
18           0.20        100            80
19           0.30        145            80
20           0.25        145            80
21           0.25        145            80
22           0.35        146            80
23           0.35        146            80
24           0.35        146            80
25           0.25        147            80
26           0.15        147            80
27           0.25        148            80
28           0.20        148            80
29           0.20        148            80
30           0.30        149            80
31             NA        121            50
32             NA        121            50
33             NA        122            50
34             NA        122            50
35             NA        123            50
36             NA        123            50
37             NA        100            20
38             NA        101           110

The above query returns those employees that have salaries above the average salaries in their respective departments. In the next query we see how we can create a temporary table and use that temporary table for querying.

> sqldf(c("create temp table temp_1 as select * from mtcars where qsec > 17.0", "select gear,avg(disp) from temp_1 group by gear"))
  gear avg(disp)
1    3  322.2083
2    4  119.6545

In the above query, a temporary table called temp_1 that only exists for querying in this command . The next query uses temp_1 to calculate the average of displacement by gear.

To illustrate update feature, we a dataframe called update_sample and update the missing values in second column from the values in first column.

> update_sample <- data.frame(a = 1:5, b = c(3, NA, 5, NA, 7))
> update_sample

  a  b
1 1  3
2 2 NA
3 3  5
4 4 NA
5 5  7
> sqldf(c("update update_sample set b = a where b is null", "select * from update_sample"))
  a b
1 1 3
2 2 2
3 3 5
4 4 4
5 5 7

It is to be remembered that update_sample is unchanged and the updated record set is only to be used in the second query as shown above. This behavior is the same for the insert and delete statements. The insert as shown below inserts a new row with values 15 and NA:

> sqldf(c("insert into update_sample (a,b) values (15,NULL)", "select * from update_sample"))
   a  b
1  1  3
2  2 NA
3  3  5
4  4 NA
5  5  7
6 15 NA

The delete keyword deletes a row that has a equals 4 as shown below:

 update_sample <- data.frame(a = 1:5, b = c(3, NA, 5, NA, 7))
> update_sample
  a  b
1 1  3
2 2 NA
3 3  5
4 4 NA
5 5  7
> sqldf(c("delete from update_sample where a = 4", 
+ "select * from update_sample"))
  a  b
1 1  3
2 2 NA
3 3  5
4 5  7
 
To delete all the records in the table, we can use the command shown below:

> sqldf(c("delete from update_sample", "select * from update_sample"))
[1] a b
<0 rows> (or 0-length row.names)

In the last section, we explore joins. Let us import some data from Oracle into R using RODBC. The package RODBC needs to be installed and loaded as shown below:

> library(RODBC)
Error in library(RODBC) : there is no package called ‘RODBC’
> install.packages("RODBC")
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RODBC_1.3-15.zip'
Content type 'application/zip' length 831635 bytes (812 KB)
downloaded 812 KB

package ‘RODBC’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\Administrator\AppData\Local\Temp\RtmpU3bkd9\downloaded_packages
> library("RODBC", lib.loc="C:/Program Files/R/R-3.4.1/library")

Once above steps are carried out, we can establish a connection to Oracle as shown below:

> connection <- odbcConnect("Oracle", uid="hr", pwd="hr", rows_at_time = 1000)

The command is self-explanatory. Oracle is a DSN to facilitate this connection. Once the connection is established, we can start issuing queries as shown below:

> sqlQuery(connection,"select count(*) from employees")
  COUNT(*)
1      107

We can fetch the data from Oracle as shown below where the data from employee table in Oracle is imported into employee_data.

> employee_data <- sqlQuery(connection,"select * from employees")
>
> class(employee_data)
[1] "data.frame"
>
> colnames(employee_data)
 [1] "EMPLOYEE_ID"    "FIRST_NAME"     "LAST_NAME"      "EMAIL"          "PHONE_NUMBER" 
 [6] "HIRE_DATE"      "JOB_ID"         "SALARY"         "COMMISSION_PCT" "MANAGER_ID"   
[11] "DEPARTMENT_ID"
>
> str(employee_data)
'data.frame':    107 obs. of  11 variables:
 $ EMPLOYEE_ID   : int  100 101 102 103 104 105 106 107 108 109 ...
 $ FIRST_NAME    : Factor w/ 91 levels "Adam","Alana",..: 80 64 51 4 11 18 88 20 61 16 ...
 $ LAST_NAME     : Factor w/ 102 levels "Abel","Ande",..: 50 51 20 44 24 4 74 57 38 27 ...
 $ EMAIL         : Factor w/ 107 levels "ABANDA","ABULL",..: 90 74 61 6 11 17 104 23 73 19 ...
 $ PHONE_NUMBER  : Factor w/ 107 levels "011.44.1343.329268",..: 37 38 39 58 59 60 57 61 50 45 ...
 $ HIRE_DATE     : POSIXct, format: "2003-06-17" "2005-09-21" "2001-01-13" ...
 $ JOB_ID        : Factor w/ 19 levels "AC_ACCOUNT","AC_MGR",..: 4 5 5 9 9 9 9 9 7 6 ...
 $ SALARY        : int  24000 17000 17000 9000 6000 4800 4800 4200 12008 9000 ...
 $ COMMISSION_PCT: num  NA NA NA NA NA NA NA NA NA NA ...
 $ MANAGER_ID    : int  NA 100 100 102 103 103 103 103 101 108 ...
 $ DEPARTMENT_ID : int  90 90 90 60 60 60 60 60 100 100 ...
>

In the same manner, we import the department data as shown below:

> department_data <- sqlQuery(connection,"select * from departments")
> department_data
   DEPARTMENT_ID      DEPARTMENT_NAME MANAGER_ID LOCATION_ID
1             10       Administration        200        1700
2             20            Marketing        201        1800
3             30           Purchasing        114        1700
4             40      Human Resources        203        2400
5             50             Shipping        121        1500
6             60                   IT        103        1400
7             70     Public Relations        204        2700
8             80                Sales        145        2500
9             90            Executive        100        1700
10           100              Finance        108        1700
11           110           Accounting        205        1700
12           120             Treasury         NA        1700
13           130        Corporate Tax         NA        1700
14           140   Control And Credit         NA        1700
15           150 Shareholder Services         NA        1700
16           160             Benefits         NA        1700
17           170        Manufacturing         NA        1700
18           180         Construction         NA        1700
19           190          Contracting         NA        1700
20           200           Operations         NA        1700
21           210           IT Support         NA        1700
22           220                  NOC         NA        1700
23           230          IT Helpdesk         NA        1700
24           240     Government Sales         NA        1700
25           250         Retail Sales         NA        1700
26           260           Recruiting         NA        1700
27           270              Payroll         NA        1700

Now that we have the employee data and department data, we can perform a inner join operation in ANSI syntax as shown below (only the first five records are shown):

> sqldf("select employee_data.*,department_name from employee_data inner join department_data on employee_data.department_id = department_data.department_id limit 5")
  EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL PHONE_NUMBER  HIRE_DATE  JOB_ID SALARY
1         100     Steven      King    SKING 515.123.4567 2003-06-17 AD_PRES  24000
2         101      Neena   Kochhar NKOCHHAR 515.123.4568 2005-09-21   AD_VP  17000
3         102        Lex   De Haan  LDEHAAN 515.123.4569 2001-01-13   AD_VP  17000
4         103  Alexander    Hunold  AHUNOLD 590.423.4567 2006-01-03 IT_PROG   9000
5         104      Bruce     Ernst   BERNST 590.423.4568 2007-05-21 IT_PROG   6000
  COMMISSION_PCT MANAGER_ID DEPARTMENT_ID DEPARTMENT_NAME
1             NA         NA            90       Executive
2             NA        100            90       Executive
3             NA        100            90       Executive
4             NA        102            60              IT
5             NA        103            60              IT

The same inner join in non ANSI syntax is shown below:


> sqldf("select e1.*,d1.department_name from employee_data e1, department_data d1 where e1.department_id = d1.department_id limit 5")
  EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL PHONE_NUMBER  HIRE_DATE  JOB_ID SALARY
1         100     Steven      King    SKING 515.123.4567 2003-06-17 AD_PRES  24000
2         101      Neena   Kochhar NKOCHHAR 515.123.4568 2005-09-21   AD_VP  17000
3         102        Lex   De Haan  LDEHAAN 515.123.4569 2001-01-13   AD_VP  17000
4         103  Alexander    Hunold  AHUNOLD 590.423.4567 2006-01-03 IT_PROG   9000
5         104      Bruce     Ernst   BERNST 590.423.4568 2007-05-21 IT_PROG   6000
  COMMISSION_PCT MANAGER_ID DEPARTMENT_ID DEPARTMENT_NAME
1             NA         NA            90       Executive
2             NA        100            90       Executive
3             NA        100            90       Executive
4             NA        102            60              IT
5             NA        103            60              IT

Left outer join is shown below:

> sqldf("select employee_data.*,department_name from employee_data left join department_data on employee_data.department_id = department_data.department_id order by department_name asc limit 5")
  EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL       PHONE_NUMBER  HIRE_DATE     JOB_ID SALARY
1         178  Kimberely     Grant   KGRANT 011.44.1644.429263 2007-05-24     SA_REP   7000
2         205    Shelley   Higgins SHIGGINS       515.123.8080 2002-06-07     AC_MGR  12008
3         206    William     Gietz   WGIETZ       515.123.8181 2002-06-07 AC_ACCOUNT   8300
4         200   Jennifer    Whalen  JWHALEN       515.123.4444 2003-09-17    AD_ASST   4400
5         100     Steven      King    SKING       515.123.4567 2003-06-17    AD_PRES  24000
  COMMISSION_PCT MANAGER_ID DEPARTMENT_ID DEPARTMENT_NAME
1           0.15        149            NA            <NA>
2             NA        101           110      Accounting
3             NA        205           110      Accounting
4             NA        101            10  Administration
5             NA         NA            90       Executive

Once all the queries against Oracle are complete, we can close the connection as shown below:

> close(connection)

Thus, we have seen how SQL can be used in R environment. More details about the available functions can be seen at:

https://www.sqlite.org/lang_corefunc.html

https://www.sqlite.org/lang_aggfunc.html

https://www.sqlite.org/lang_datefunc.html