Let's Refresh: October 2017

Friday 27 October 2017

SQL in R - II

As a continuation of SQL in R, we wade into into the final and conclusive part after we saw glimpses of this topic in Part 1. For starters, let us begin with the group by clause, two examples of which are shown below:

> sqldf("select am,sum(wt) from mtcars group by am")
am sum(wt)
1 0 71.609
2 1 31.343
> sqldf("select carb,avg(wt) from mtcars group by carb")
carb avg(wt)
1    1 2.4900
2    2 2.8628
3    3 3.8600
4    4 3.8974
5    6 2.7700
6    8 3.5700

Group by clause usage is shown in the next query:

> sqldf("select carb,avg(wt) from mtcars group by carb having avg(wt) >= 3.0")
carb avg(wt)
1    3 3.8600
2    4 3.8974
3    8 3.5700

The next query shows a subquery:

> sqldf("select * from mtcars where disp = (select max(disp) from mtcars)")
   mpg cyl disp hp drat   wt qsec vs am gear carb
1 10.4   8 472 205 2.93 5.25 17.98 0 0    3    4

Set operations like union can be performed and an example is shown below:

> sqldf("select * from mtcars where disp = (select max(disp) from mtcars) union select * from mtcars where disp = (select min(disp) from mtcars)", row.names = TRUE)
                    mpg cyl disp hp drat    wt qsec vs am gear carb
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98 0 0    3    4
Toyota Corolla     33.9   4 71.1 65 4.22 1.835 19.90 1 1    4    1

The last query gives the details of the vehicle having the maximum and minimum displacement in cubic inches.

The next query shows a correlated subquery:

> sqldf("select * from employee_data as e1 where salary > (select avg(salary) from employee_data where department_id = e1.department_id)")
   EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL       PHONE_NUMBER HIRE_DATE     JOB_ID SALARY
1          100     Steven      King    SKING       515.123.4567 2003-06-17    AD_PRES 24000
2          103 Alexander    Hunold AHUNOLD       590.423.4567 2006-01-03    IT_PROG   9000
3          104      Bruce     Ernst   BERNST       590.423.4568 2007-05-21    IT_PROG   6000
4          108      Nancy Greenberg NGREENBE       515.124.4569 2002-08-17     FI_MGR 12008
5          109     Daniel    Faviet DFAVIET       515.124.4169 2002-08-16 FI_ACCOUNT   9000
6          114        Den Raphaely DRAPHEAL       515.127.4561 2002-12-07     PU_MAN 11000
7          120    Matthew     Weiss   MWEISS       650.123.1234 2004-07-18     ST_MAN   8000
8          121       Adam     Fripp   AFRIPP       650.123.2234 2005-04-10     ST_MAN   8200
9          122      Payam Kaufling PKAUFLIN       650.123.3234 2003-05-01     ST_MAN   7900
10         123     Shanta   Vollman SVOLLMAN       650.123.4234 2005-10-10     ST_MAN   6500
11         124      Kevin   Mourgos KMOURGOS       650.123.5234 2007-11-16     ST_MAN   5800
12         137     Renske    Ladwig RLADWIG       650.121.1234 2003-07-14   ST_CLERK   3600
13         141     Trenna      Rajs    TRAJS       650.121.8009 2003-10-17   ST_CLERK   3500
14         145       John   Russell JRUSSEL 011.44.1344.429268 2004-10-01     SA_MAN 14000
15         146      Karen Partners KPARTNER 011.44.1344.467268 2005-01-05     SA_MAN 13500
16         147    Alberto Errazuriz AERRAZUR 011.44.1344.429278 2005-03-10     SA_MAN 12000
17         148     Gerald Cambrault GCAMBRAU 011.44.1344.619268 2007-10-15     SA_MAN 11000
18         149      Eleni   Zlotkey EZLOTKEY 011.44.1344.429018 2008-01-29     SA_MAN 10500
19         150      Peter    Tucker PTUCKER 011.44.1344.129268 2005-01-30     SA_REP 10000
20         151      David Bernstein DBERNSTE 011.44.1344.345268 2005-03-24     SA_REP   9500
21         152      Peter      Hall    PHALL 011.44.1344.478968 2005-08-20     SA_REP   9000
22         156    Janette      King    JKING 011.44.1345.429268 2004-01-30     SA_REP 10000
23         157    Patrick     Sully   PSULLY 011.44.1345.929268 2004-03-04     SA_REP   9500
24         158      Allan    McEwen AMCEWEN 011.44.1345.829268 2004-08-01     SA_REP   9000
25         162      Clara   Vishney CVISHNEY 011.44.1346.129268 2005-11-11     SA_REP 10500
26         163   Danielle    Greene DGREENE 011.44.1346.229268 2007-03-19     SA_REP   9500
27         168       Lisa      Ozer    LOZER 011.44.1343.929268 2005-03-11     SA_REP 11500
28         169   Harrison     Bloom   HBLOOM 011.44.1343.829268 2006-03-23     SA_REP 10000
29         170     Tayler       Fox     TFOX 011.44.1343.729268 2006-01-24     SA_REP   9600
30         174      Ellen      Abel    EABEL 011.44.1644.429267 2004-05-11     SA_REP 11000
31         184    Nandita Sarchand NSARCHAN       650.509.1876 2004-01-27   SH_CLERK   4200
32         185     Alexis      Bull    ABULL       650.509.2876 2005-02-20   SH_CLERK   4100
33         188      Kelly     Chung   KCHUNG       650.505.1876 2005-06-14   SH_CLERK   3800
34         189   Jennifer     Dilly   JDILLY       650.505.2876 2005-08-13   SH_CLERK   3600
35         192      Sarah      Bell    SBELL       650.501.1876 2004-02-04   SH_CLERK   4000
36         193    Britney   Everett BEVERETT       650.501.2876 2005-03-03   SH_CLERK   3900
37         201    Michael Hartstein MHARTSTE       515.123.5555 2004-02-17     MK_MAN 13000
38         205    Shelley   Higgins SHIGGINS       515.123.8080 2002-06-07     AC_MGR 12008
   COMMISSION_PCT MANAGER_ID DEPARTMENT_ID
1              NA         NA            90
2              NA        102            60
3              NA        103            60
4              NA        101           100
5              NA        108           100
6              NA        100            30
7              NA        100            50
8              NA        100            50
9              NA        100            50
10             NA        100            50
11             NA        100            50
12             NA        123            50
13             NA        124            50
14           0.40        100            80
15           0.30        100            80
16           0.30        100            80
17           0.30        100            80
18           0.20        100            80
19           0.30        145            80
20           0.25        145            80
21           0.25        145            80
22           0.35        146            80
23           0.35        146            80
24           0.35        146            80
25           0.25        147            80
26           0.15        147            80
27           0.25        148            80
28           0.20        148            80
29           0.20        148            80
30           0.30        149            80
31             NA        121            50
32             NA        121            50
33             NA        122            50
34             NA        122            50
35             NA        123            50
36             NA        123            50
37             NA        100            20
38             NA        101           110

The above query returns those employees that have salaries above the average salaries in their respective departments. In the next query we see how we can create a temporary table and use that temporary table for querying.

> sqldf(c("create temp table temp_1 as select * from mtcars where qsec > 17.0", "select gear,avg(disp) from temp_1 group by gear"))
gear avg(disp)
1    3 322.2083
2    4 119.6545

In the above query, a temporary table called temp_1 that only exists for querying in this command . The next query uses temp_1 to calculate the average of displacement by gear.

To illustrate update feature, we a dataframe called update_sample and update the missing values in second column from the values in first column.

> update_sample <- data.frame(a = 1:5, b = c(3, NA, 5, NA, 7))
> update_sample
a b
1 1 3
2 2 NA
3 3 5
4 4 NA
5 5 7
> sqldf(c("update update_sample set b = a where b is null", "select * from update_sample"))
a b
1 1 3
2 2 2
3 3 5
4 4 4
5 5 7

It is to be remembered that update_sample is unchanged and the updated record set is only to be used in the second query as shown above. This behavior is the same for the insert and delete statements. The insert as shown below inserts a new row with values 15 and NA:

> sqldf(c("insert into update_sample (a,b) values (15,NULL)", "select * from update_sample"))
   a b
1 1 3
2 2 NA
3 3 5
4 4 NA
5 5 7
6 15 NA

The delete keyword deletes a row that has a equals 4 as shown below:

 update_sample <- data.frame(a = 1:5, b = c(3, NA, 5, NA, 7))
> update_sample
  a  b
1 1  3
2 2 NA
3 3  5
4 4 NA
5 5  7
> sqldf(c("delete from update_sample where a = 4",

+ "select * from update_sample"))
  a  b
1 1  3
2 2 NA
3 3  5
4 5  7

To delete all the records in the table, we can use the command shown below:

> sqldf(c("delete from update_sample", "select * from update_sample"))
[1] a b
<0 rows> (or 0-length row.names)

In the last section, we explore joins. Let us import some data from Oracle into R using RODBC. The package RODBC needs to be installed and loaded as shown below:

> library(RODBC)
Error in library(RODBC) : there is no package called ‘RODBC’
> install.packages("RODBC")
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.4/RODBC_1.3-15.zip'
Content type 'application/zip' length 831635 bytes (812 KB)
downloaded 812 KB

package ‘RODBC’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\Administrator\AppData\Local\Temp\RtmpU3bkd9\downloaded_packages
> library("RODBC", lib.loc="C:/Program Files/R/R-3.4.1/library")

Once above steps are carried out, we can establish a connection to Oracle as shown below:

> connection <- odbcConnect("Oracle", uid="hr", pwd="hr", rows_at_time = 1000)

The command is self-explanatory. Oracle is a DSN to facilitate this connection. Once the connection is established, we can start issuing queries as shown below:

> sqlQuery(connection,"select count(*) from employees")
COUNT(*)
1      107

We can fetch the data from Oracle as shown below where the data from employee table in Oracle is imported into employee_data.

> employee_data <- sqlQuery(connection,"select * from employees")
>
> class(employee_data)
[1] "data.frame"
>
> colnames(employee_data)
[1] "EMPLOYEE_ID"    "FIRST_NAME"     "LAST_NAME"      "EMAIL"          "PHONE_NUMBER"
[6] "HIRE_DATE"      "JOB_ID"         "SALARY"         "COMMISSION_PCT" "MANAGER_ID"
[11] "DEPARTMENT_ID"
>
> str(employee_data)
'data.frame':    107 obs. of 11 variables:
$ EMPLOYEE_ID   : int 100 101 102 103 104 105 106 107 108 109 ...
$ FIRST_NAME    : Factor w/ 91 levels "Adam","Alana",..: 80 64 51 4 11 18 88 20 61 16 ...
$ LAST_NAME     : Factor w/ 102 levels "Abel","Ande",..: 50 51 20 44 24 4 74 57 38 27 ...
$ EMAIL         : Factor w/ 107 levels "ABANDA","ABULL",..: 90 74 61 6 11 17 104 23 73 19 ...
$ PHONE_NUMBER : Factor w/ 107 levels "011.44.1343.329268",..: 37 38 39 58 59 60 57 61 50 45 ...
$ HIRE_DATE     : POSIXct, format: "2003-06-17" "2005-09-21" "2001-01-13" ...
$ JOB_ID        : Factor w/ 19 levels "AC_ACCOUNT","AC_MGR",..: 4 5 5 9 9 9 9 9 7 6 ...
$ SALARY        : int 24000 17000 17000 9000 6000 4800 4800 4200 12008 9000 ...
$ COMMISSION_PCT: num NA NA NA NA NA NA NA NA NA NA ...
$ MANAGER_ID    : int NA 100 100 102 103 103 103 103 101 108 ...
$ DEPARTMENT_ID : int 90 90 90 60 60 60 60 60 100 100 ...
>

In the same manner, we import the department data as shown below:

> department_data <- sqlQuery(connection,"select * from departments")
> department_data
   DEPARTMENT_ID      DEPARTMENT_NAME MANAGER_ID LOCATION_ID
1             10       Administration        200        1700
2             20            Marketing        201        1800
3             30           Purchasing        114        1700
4             40      Human Resources        203        2400
5             50             Shipping        121        1500
6             60                   IT        103        1400
7             70     Public Relations        204        2700
8             80                Sales        145        2500
9             90            Executive        100        1700
10           100              Finance        108        1700
11           110           Accounting        205        1700
12           120             Treasury         NA        1700
13           130        Corporate Tax         NA        1700
14           140   Control And Credit         NA        1700
15           150 Shareholder Services         NA        1700
16           160             Benefits         NA        1700
17           170        Manufacturing         NA        1700
18           180         Construction         NA        1700
19           190          Contracting         NA        1700
20           200           Operations         NA        1700
21           210           IT Support         NA        1700
22           220                  NOC         NA        1700
23           230          IT Helpdesk         NA        1700
24           240     Government Sales         NA        1700
25           250         Retail Sales         NA        1700
26           260           Recruiting         NA        1700
27           270              Payroll         NA        1700

Now that we have the employee data and department data, we can perform a inner join operation in ANSI syntax as shown below (only the first five records are shown):

> sqldf("select employee_data.*,department_name from employee_data inner join department_data on employee_data.department_id = department_data.department_id limit 5")
EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL PHONE_NUMBER HIRE_DATE JOB_ID SALARY
1         100     Steven      King    SKING 515.123.4567 2003-06-17 AD_PRES 24000
2         101      Neena   Kochhar NKOCHHAR 515.123.4568 2005-09-21   AD_VP 17000
3         102        Lex   De Haan LDEHAAN 515.123.4569 2001-01-13   AD_VP 17000
4         103 Alexander    Hunold AHUNOLD 590.423.4567 2006-01-03 IT_PROG   9000
5         104      Bruce     Ernst   BERNST 590.423.4568 2007-05-21 IT_PROG   6000
COMMISSION_PCT MANAGER_ID DEPARTMENT_ID DEPARTMENT_NAME
1             NA         NA            90       Executive
2             NA        100            90       Executive
3             NA        100            90       Executive
4             NA        102            60              IT
5             NA        103            60              IT

The same inner join in non ANSI syntax is shown below:

> sqldf("select e1.*,d1.department_name from employee_data e1, department_data d1 where e1.department_id = d1.department_id limit 5")
EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL PHONE_NUMBER HIRE_DATE JOB_ID SALARY
1         100     Steven      King    SKING 515.123.4567 2003-06-17 AD_PRES 24000
2         101      Neena   Kochhar NKOCHHAR 515.123.4568 2005-09-21   AD_VP 17000
3         102        Lex   De Haan LDEHAAN 515.123.4569 2001-01-13   AD_VP 17000
4         103 Alexander    Hunold AHUNOLD 590.423.4567 2006-01-03 IT_PROG   9000
5         104      Bruce     Ernst   BERNST 590.423.4568 2007-05-21 IT_PROG   6000
COMMISSION_PCT MANAGER_ID DEPARTMENT_ID DEPARTMENT_NAME
1             NA         NA            90       Executive
2             NA        100            90       Executive
3             NA        100            90       Executive
4             NA        102            60              IT
5             NA        103            60              IT

Left outer join is shown below:

> sqldf("select employee_data.*,department_name from employee_data left join department_data on employee_data.department_id = department_data.department_id order by department_name asc limit 5")
EMPLOYEE_ID FIRST_NAME LAST_NAME    EMAIL       PHONE_NUMBER HIRE_DATE     JOB_ID SALARY
1         178 Kimberely     Grant   KGRANT 011.44.1644.429263 2007-05-24     SA_REP   7000
2         205    Shelley   Higgins SHIGGINS       515.123.8080 2002-06-07     AC_MGR 12008
3         206    William     Gietz   WGIETZ       515.123.8181 2002-06-07 AC_ACCOUNT   8300
4         200   Jennifer    Whalen JWHALEN       515.123.4444 2003-09-17    AD_ASST   4400
5         100     Steven      King    SKING       515.123.4567 2003-06-17    AD_PRES 24000
COMMISSION_PCT MANAGER_ID DEPARTMENT_ID DEPARTMENT_NAME
1           0.15        149            NA            <NA>
2             NA        101           110      Accounting
3             NA        205           110      Accounting
4             NA        101            10 Administration
5             NA         NA            90       Executive

Once all the queries against Oracle are complete, we can close the connection as shown below:

> close(connection)

Thus, we have seen how SQL can be used in R environment. More details about the available functions can be seen at:

https://www.sqlite.org/lang_corefunc.html

https://www.sqlite.org/lang_aggfunc.html

https://www.sqlite.org/lang_datefunc.html

Wednesday 25 October 2017

SQL in R - I

SQL was invented in the 1970s. Over the years, it has become quite popular among developers interacting with databases for its power while at the same time offering ease of use abilities. Though initially SQL was confined to the realm of databases, it has slowly and steadily pervaded various technologies. Examples of SQL usage are seen in below cases to name a few:

a) HiveQL

b) Spark SQL

c) Google BigQuery

d) Apache Drill

It is, therefore, not surprising that R does not lag behind and also supports SQL via sqldf package. So, let's explore how SQL can be used in R.

Details of sqldf package gives a fair idea about the manner in which SQL can be used to query dataframes in R. The first step is, of course, to install the sqldf package itself in R Studio as shown below:

Click on Install text under Packages tab in R Studio. This will bring up the Install Packages dialog in which we enter sqldf as shown below:

Once we select sqldf from the prompted values, click on Install button. Once the installation of the package is complete, we can see that the following commands run on the console.

Then, we load the sqldf package using library command as shown below to use this package.

Running data() command shows all the default datasets available in R environment.

We will use these default datasets to play with SQL in R. First, let us prepare a dataframe based on precip dataset as shown below:

> precip_data <- data.frame(attr(precip,"names"),precip)
> colnames(precip_data) <- c("Cities","Precipitation")
> head(precip_data)
       Cities Precipitation
1      Mobile          67.0
2      Juneau          54.7
3     Phoenix           7.0
4 Little Rock          48.5
5 Los Angeles          14.0
6 Sacramento          17.2

A quick look at the structure of this dataframe shows that the dataframe has two columns: one consisting of factors and the other numeric.

> str(precip_data)
'data.frame':   70 obs. of 2 variables:
$ Cities       : Factor w/ 69 levels "Albany","Albuquerque",..: 42 33 51 35 36 58 60 21 27 69 ...
$ Precipitation: num 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
>

We can a simple select on the dataframe as shown below to see the complete data:

> sqldf("select * from precip_data")
                Cities Precipitation
1               Mobile          67.0
2               Juneau          54.7
3              Phoenix           7.0
4          Little Rock          48.5
5          Los Angeles          14.0
6           Sacramento          17.2
7        San Francisco          20.7
8               Denver          13.0
9             Hartford          43.4
10          Wilmington          40.2
11          Washington          38.9
12        Jacksonville          54.5
13               Miami          59.8
14             Atlanta          48.3
15            Honolulu          22.9
16               Boise          11.5
17             Chicago          34.4
18              Peoria          35.1
19        Indianapolis          38.7
20          Des Moines          30.8
21             Wichita          30.6
22          Louisville          43.1
23         New Orleans          56.8
24            Portland          40.8
25           Baltimore          41.8
26              Boston          42.5
27             Detroit          31.0
28    Sault Ste. Marie          31.7
29              Duluth          30.2
30 Minneapolis/St Paul          25.9
31             Jackson          49.2
32         Kansas City          37.0
33            St Louis          35.9
34         Great Falls          15.0
35               Omaha          30.2
36                Reno           7.2
37             Concord          36.2
38       Atlantic City          45.5
39         Albuquerque           7.8
40              Albany          33.4
41             Buffalo          36.1
42            New York          40.2
43           Charlotte          42.7
44             Raleigh          42.5
45             Bismark          16.2
46          Cincinnati          39.0
47           Cleveland          35.0
48            Columbus          37.0
49       Oklahoma City          31.4
50            Portland          37.6
51        Philadelphia          39.9
52           Pittsburg          36.2
53          Providence          42.8
54            Columbia          46.4
55         Sioux Falls          24.7
56             Memphis          49.1
57           Nashville          46.0
58              Dallas          35.9
59             El Paso           7.8
60             Houston          48.2
61      Salt Lake City          15.2
62          Burlington          32.5
63             Norfolk          44.7
64            Richmond          42.6
65      Seattle Tacoma          38.8
66             Spokane          17.4
67          Charleston          40.8
68           Milwaukee          29.1
69            Cheyenne          14.6
70            San Juan          59.2

In order to select just one column, we can rewrite the SQL to as shown below:

> sqldf("select cities from precip_data")
                Cities
1               Mobile
2               Juneau
3              Phoenix
4          Little Rock
5          Los Angeles
6           Sacramento
7        San Francisco
8               Denver
9             Hartford
10          Wilmington
11          Washington
12        Jacksonville
13               Miami
14             Atlanta
15            Honolulu
16               Boise
17             Chicago
18              Peoria
19        Indianapolis
20          Des Moines
21             Wichita
22          Louisville
23         New Orleans
24            Portland
25           Baltimore
26              Boston
27             Detroit
28    Sault Ste. Marie
29              Duluth
30 Minneapolis/St Paul
31             Jackson
32         Kansas City
33            St Louis
34         Great Falls
35               Omaha
36                Reno
37             Concord
38       Atlantic City
39         Albuquerque
40              Albany
41             Buffalo
42            New York
43           Charlotte
44             Raleigh
45             Bismark
46          Cincinnati
47           Cleveland
48            Columbus
49       Oklahoma City
50            Portland
51        Philadelphia
52           Pittsburg
53          Providence
54            Columbia
55         Sioux Falls
56             Memphis
57           Nashville
58              Dallas
59             El Paso
60             Houston
61      Salt Lake City
62          Burlington
63             Norfolk
64            Richmond
65      Seattle Tacoma
66             Spokane
67          Charleston
68           Milwaukee
69            Cheyenne
70            San Juan

To limit the number of rows in the returned dataset, limit can be used as shown below:

> sqldf("select * from precip_data limit 5")
       Cities Precipitation
1      Mobile          67.0
2      Juneau          54.7
3     Phoenix           7.0
4 Little Rock          48.5
5 Los Angeles          14.0

Two forms of column alias are show below.

> sqldf("select cities as USCities from precip_data limit 5")
     USCities
1      Mobile
2      Juneau
3     Phoenix
4 Little Rock
5 Los Angeles

> sqldf("select cities 'US Cities' from precip_data limit 5")
    US Cities
1      Mobile
2      Juneau
3     Phoenix
4 Little Rock
5 Los Angeles
>

Calculations are allowed on columns containing numeric values. So, a query like the one below is entirely plausible:

> sqldf("select cities,precipitation*2 from precip_data limit 5")
       Cities precipitation*2
1      Mobile           134.0
2      Juneau           109.4
3     Phoenix            14.0
4 Little Rock            97.0
5 Los Angeles            28.0

To return limited rows based on a search criterion, we can use the where clause as shown below:

> sqldf("select cities,precipitation from precip_data where cities like 'New%'")
       Cities Precipitation
1 New Orleans          56.8
2    New York          40.2

The next query uses a string function, instr to return the same results as shown below:

> sqldf("select cities,precipitation from precip_data where instr(cities,'New')")
       Cities Precipitation
1 New Orleans          56.8
2    New York          40.2

We can put a where clause on the numeric column as shown below:

> sqldf("select cities,precipitation from precip_data where precipitation >= 50")
        Cities Precipitation
1       Mobile          67.0
2       Juneau          54.7
3 Jacksonville          54.5
4        Miami          59.8
5 New Orleans          56.8
6     San Juan          59.2

Usage of between is shown in the below query:

> sqldf("select cities,precipitation from precip_data where precipitation between 55 and 60")
       Cities Precipitation
1       Miami          59.8
2 New Orleans          56.8
3    San Juan          59.2

For discontinuous values, we can use in as shown below:

> sqldf("select cities,precipitation from precip_data where precipitation in (30.2,35.9)")
    Cities Precipitation
1   Duluth          30.2
2 St Louis          35.9
3    Omaha          30.2
4   Dallas          35.9

Now, let us use mtcars dataframe to see the other features of SQL in R.

> str(mtcars)
'data.frame':    32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...

To find out unique values, we can use distinct as shown below:

> sqldf("select distinct am from mtcars")
am
1 1
2 0

Support for aggregate functions and nesting of functions are shown below:

> sqldf("select count(*) from mtcars")
count(*)
1       32

> sqldf("select count(am) from mtcars")
count(am)
1        32

> sqldf("select count(distinct am) from mtcars")
count(distinct am)
1                  2

count(*) returns the number of rows.

max() gives the maximum, min() gives the minimum, round() rounds the numeric value, and sum() calculates the sum of the values. These are shown below:

> sqldf("select min(disp), max(disp) from mtcars")
min(disp) max(disp)
1      71.1       472
>
> sqldf("select round(min(disp)), round(max(disp)) from mtcars")
round(min(disp)) round(max(disp))
1               71              472
>
> sqldf("select sum(wt) from mtcars")
sum(wt)
1 102.952

The rest of the SQL in R discussion is carried out in Part II ...

Pages

Friday 27 October 2017

SQL in R - II

Wednesday 25 October 2017

SQL in R - I