Wednesday, 25 October 2017

SQL in R - I

SQL was invented in the 1970s. Over the years, it has become quite popular among developers interacting with databases for its power while at the same time offering ease of use abilities. Though initially SQL was confined to the realm of databases, it has slowly and steadily pervaded various technologies. Examples of  SQL usage are seen in below cases to name a few:

a) HiveQL

b) Spark SQL

c) Google BigQuery

d) Apache Drill

It is, therefore, not surprising that R does not lag behind  and also supports SQL via sqldf package. So, let's explore how SQL can be used in R.

Details of sqldf package gives a fair idea about the manner in which SQL can be used to query dataframes in R. The first step is, of course, to install the sqldf package itself in R Studio as shown below:


Click on Install text under Packages tab in R Studio. This will bring up the Install Packages dialog in which we enter sqldf as shown below:


















Once we select sqldf from the prompted values, click on Install button. Once the installation of the package is complete, we can see that the following commands run on the console.







Then, we load the sqldf package using library command as shown below to use this package.









Running data() command shows all the default datasets available in R environment.

























We will use these default datasets to play with SQL in R. First, let us prepare a dataframe based on precip dataset as shown below:

> precip_data <- data.frame(attr(precip,"names"),precip)
> colnames(precip_data) <- c("Cities","Precipitation")
> head(precip_data)

       Cities Precipitation
1      Mobile          67.0
2      Juneau          54.7
3     Phoenix           7.0
4 Little Rock          48.5
5 Los Angeles          14.0
6  Sacramento          17.2

A quick look at the structure of this dataframe shows that the dataframe has two columns: one consisting of factors and the other numeric.

> str(precip_data)
'data.frame':    70 obs. of  2 variables:
 $ Cities       : Factor w/ 69 levels "Albany","Albuquerque",..: 42 33 51 35 36 58 60 21 27 69 ...
 $ Precipitation: num  67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...


We can a simple select on the dataframe as shown below to see the complete data:

> sqldf("select * from precip_data")
                Cities Precipitation
1               Mobile          67.0
2               Juneau          54.7
3              Phoenix           7.0
4          Little Rock          48.5
5          Los Angeles          14.0
6           Sacramento          17.2
7        San Francisco          20.7
8               Denver          13.0
9             Hartford          43.4
10          Wilmington          40.2
11          Washington          38.9
12        Jacksonville          54.5
13               Miami          59.8
14             Atlanta          48.3
15            Honolulu          22.9
16               Boise          11.5
17             Chicago          34.4
18              Peoria          35.1
19        Indianapolis          38.7
20          Des Moines          30.8
21             Wichita          30.6
22          Louisville          43.1
23         New Orleans          56.8
24            Portland          40.8
25           Baltimore          41.8
26              Boston          42.5
27             Detroit          31.0
28    Sault Ste. Marie          31.7
29              Duluth          30.2
30 Minneapolis/St Paul          25.9
31             Jackson          49.2
32         Kansas City          37.0
33            St Louis          35.9
34         Great Falls          15.0
35               Omaha          30.2
36                Reno           7.2
37             Concord          36.2
38       Atlantic City          45.5
39         Albuquerque           7.8
40              Albany          33.4
41             Buffalo          36.1
42            New York          40.2
43           Charlotte          42.7
44             Raleigh          42.5
45             Bismark          16.2
46          Cincinnati          39.0
47           Cleveland          35.0
48            Columbus          37.0
49       Oklahoma City          31.4
50            Portland          37.6
51        Philadelphia          39.9
52           Pittsburg          36.2
53          Providence          42.8
54            Columbia          46.4
55         Sioux Falls          24.7
56             Memphis          49.1
57           Nashville          46.0
58              Dallas          35.9
59             El Paso           7.8
60             Houston          48.2
61      Salt Lake City          15.2
62          Burlington          32.5
63             Norfolk          44.7
64            Richmond          42.6
65      Seattle Tacoma          38.8
66             Spokane          17.4
67          Charleston          40.8
68           Milwaukee          29.1
69            Cheyenne          14.6
70            San Juan          59.2

In order to select just one column, we can rewrite the SQL to as shown below:

> sqldf("select cities from precip_data")
                Cities
1               Mobile
2               Juneau
3              Phoenix
4          Little Rock
5          Los Angeles
6           Sacramento
7        San Francisco
8               Denver
9             Hartford
10          Wilmington
11          Washington
12        Jacksonville
13               Miami
14             Atlanta
15            Honolulu
16               Boise
17             Chicago
18              Peoria
19        Indianapolis
20          Des Moines
21             Wichita
22          Louisville
23         New Orleans
24            Portland
25           Baltimore
26              Boston
27             Detroit
28    Sault Ste. Marie
29              Duluth
30 Minneapolis/St Paul
31             Jackson
32         Kansas City
33            St Louis
34         Great Falls
35               Omaha
36                Reno
37             Concord
38       Atlantic City
39         Albuquerque
40              Albany
41             Buffalo
42            New York
43           Charlotte
44             Raleigh
45             Bismark
46          Cincinnati
47           Cleveland
48            Columbus
49       Oklahoma City
50            Portland
51        Philadelphia
52           Pittsburg
53          Providence
54            Columbia
55         Sioux Falls
56             Memphis
57           Nashville
58              Dallas
59             El Paso
60             Houston
61      Salt Lake City
62          Burlington
63             Norfolk
64            Richmond
65      Seattle Tacoma
66             Spokane
67          Charleston
68           Milwaukee
69            Cheyenne
70            San Juan

To limit the number of rows in the returned dataset, limit can be used as shown below:

> sqldf("select * from precip_data limit 5")
       Cities Precipitation
1      Mobile          67.0
2      Juneau          54.7
3     Phoenix           7.0
4 Little Rock          48.5
5 Los Angeles          14.0

Two forms of column alias are show below.

> sqldf("select cities as USCities from precip_data limit 5")
     USCities
1      Mobile
2      Juneau
3     Phoenix
4 Little Rock
5 Los Angeles

> sqldf("select cities 'US Cities' from precip_data limit 5")
    US Cities
1      Mobile
2      Juneau
3     Phoenix
4 Little Rock
5 Los Angeles
>

Calculations are allowed on columns containing numeric values. So, a query like the one below is entirely plausible:

> sqldf("select cities,precipitation*2 from precip_data limit 5")
       Cities precipitation*2
1      Mobile           134.0
2      Juneau           109.4
3     Phoenix            14.0
4 Little Rock            97.0
5 Los Angeles            28.0

To return limited rows based on a search criterion, we can use the where clause as shown below:

> sqldf("select cities,precipitation from precip_data where cities like 'New%'")
       Cities Precipitation
1 New Orleans          56.8
2    New York          40.2

The next query uses a string function, instr to return the same results as shown below:

> sqldf("select cities,precipitation from precip_data where instr(cities,'New')")
       Cities Precipitation
1 New Orleans          56.8
2    New York          40.2

We can put a where clause on the numeric column as shown below:

 > sqldf("select cities,precipitation from precip_data where precipitation >= 50")
        Cities Precipitation
1       Mobile          67.0
2       Juneau          54.7
3 Jacksonville          54.5
4        Miami          59.8
5  New Orleans          56.8
6     San Juan          59.2

Usage of between is shown in the below query:

> sqldf("select cities,precipitation from precip_data where precipitation between 55 and 60")
       Cities Precipitation
1       Miami          59.8
2 New Orleans          56.8
3    San Juan          59.2

For discontinuous values, we can use in as shown below:

> sqldf("select cities,precipitation from precip_data where precipitation in (30.2,35.9)")
    Cities Precipitation
1   Duluth          30.2
2 St Louis          35.9
3    Omaha          30.2
4   Dallas          35.9

Now, let us use mtcars dataframe to see the other features of SQL in R.

> str(mtcars)
'data.frame':    32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

To find out unique values, we can use distinct as shown below:

> sqldf("select distinct am from mtcars")
  am
1  1
2  0

Support for aggregate functions and nesting of functions are shown below:

> sqldf("select count(*) from mtcars")
  count(*)
1       32

> sqldf("select count(am) from mtcars")
  count(am)
1        32

> sqldf("select count(distinct am) from mtcars")
  count(distinct am)
1                  2

count(*) returns the number of rows.

max() gives the maximum, min() gives the minimum, round() rounds the numeric value, and sum() calculates the sum of the values. These are shown below:

> sqldf("select min(disp), max(disp) from mtcars")
  min(disp) max(disp)
1      71.1       472
>
> sqldf("select round(min(disp)), round(max(disp)) from mtcars")
  round(min(disp)) round(max(disp))
1               71              472
>
> sqldf("select sum(wt) from mtcars")
  sum(wt)
1 102.952

The rest of the SQL in R discussion is carried out in Part II ...