SQL was invented in the 1970s. Over the years, it has become quite popular among developers interacting with databases for its power while at the same time offering ease of use abilities. Though initially SQL was confined to the realm of databases, it has slowly and steadily pervaded various technologies. Examples of SQL usage are seen in below cases to name a few:
a) HiveQL
b) Spark SQL
c) Google BigQuery
d) Apache Drill
It is, therefore, not surprising that R does not lag behind and also supports SQL via sqldf package. So, let's explore how SQL can be used in R.
Details of sqldf package gives a fair idea about the manner in which SQL can be used to query dataframes in R. The first step is, of course, to install the sqldf package itself in R Studio as shown below:
Click on Install text under Packages tab in R Studio. This will bring up the Install Packages dialog in which we enter sqldf as shown below:
Once we select sqldf from the prompted values, click on Install button. Once the installation of the package is complete, we can see that the following commands run on the console.
Then, we load the sqldf package using library command as shown below to use this package.
Running data() command shows all the default datasets available in R environment.
We will use these default datasets to play with SQL in R. First, let us prepare a dataframe based on precip dataset as shown below:
> precip_data <- data.frame(attr(precip,"names"),precip)
> colnames(precip_data) <- c("Cities","Precipitation")
> head(precip_data)
Cities Precipitation
1 Mobile 67.0
2 Juneau 54.7
3 Phoenix 7.0
4 Little Rock 48.5
5 Los Angeles 14.0
6 Sacramento 17.2
A quick look at the structure of this dataframe shows that the dataframe has two columns: one consisting of factors and the other numeric.
> str(precip_data)
'data.frame': 70 obs. of 2 variables:
$ Cities : Factor w/ 69 levels "Albany","Albuquerque",..: 42 33 51 35 36 58 60 21 27 69 ...
$ Precipitation: num 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
>
We can a simple select on the dataframe as shown below to see the complete data:
> sqldf("select * from precip_data")
Cities Precipitation
1 Mobile 67.0
2 Juneau 54.7
3 Phoenix 7.0
4 Little Rock 48.5
5 Los Angeles 14.0
6 Sacramento 17.2
7 San Francisco 20.7
8 Denver 13.0
9 Hartford 43.4
10 Wilmington 40.2
11 Washington 38.9
12 Jacksonville 54.5
13 Miami 59.8
14 Atlanta 48.3
15 Honolulu 22.9
16 Boise 11.5
17 Chicago 34.4
18 Peoria 35.1
19 Indianapolis 38.7
20 Des Moines 30.8
21 Wichita 30.6
22 Louisville 43.1
23 New Orleans 56.8
24 Portland 40.8
25 Baltimore 41.8
26 Boston 42.5
27 Detroit 31.0
28 Sault Ste. Marie 31.7
29 Duluth 30.2
30 Minneapolis/St Paul 25.9
31 Jackson 49.2
32 Kansas City 37.0
33 St Louis 35.9
34 Great Falls 15.0
35 Omaha 30.2
36 Reno 7.2
37 Concord 36.2
38 Atlantic City 45.5
39 Albuquerque 7.8
40 Albany 33.4
41 Buffalo 36.1
42 New York 40.2
43 Charlotte 42.7
44 Raleigh 42.5
45 Bismark 16.2
46 Cincinnati 39.0
47 Cleveland 35.0
48 Columbus 37.0
49 Oklahoma City 31.4
50 Portland 37.6
51 Philadelphia 39.9
52 Pittsburg 36.2
53 Providence 42.8
54 Columbia 46.4
55 Sioux Falls 24.7
56 Memphis 49.1
57 Nashville 46.0
58 Dallas 35.9
59 El Paso 7.8
60 Houston 48.2
61 Salt Lake City 15.2
62 Burlington 32.5
63 Norfolk 44.7
64 Richmond 42.6
65 Seattle Tacoma 38.8
66 Spokane 17.4
67 Charleston 40.8
68 Milwaukee 29.1
69 Cheyenne 14.6
70 San Juan 59.2
In order to select just one column, we can rewrite the SQL to as shown below:
> sqldf("select cities from precip_data")
Cities
1 Mobile
2 Juneau
3 Phoenix
4 Little Rock
5 Los Angeles
6 Sacramento
7 San Francisco
8 Denver
9 Hartford
10 Wilmington
11 Washington
12 Jacksonville
13 Miami
14 Atlanta
15 Honolulu
16 Boise
17 Chicago
18 Peoria
19 Indianapolis
20 Des Moines
21 Wichita
22 Louisville
23 New Orleans
24 Portland
25 Baltimore
26 Boston
27 Detroit
28 Sault Ste. Marie
29 Duluth
30 Minneapolis/St Paul
31 Jackson
32 Kansas City
33 St Louis
34 Great Falls
35 Omaha
36 Reno
37 Concord
38 Atlantic City
39 Albuquerque
40 Albany
41 Buffalo
42 New York
43 Charlotte
44 Raleigh
45 Bismark
46 Cincinnati
47 Cleveland
48 Columbus
49 Oklahoma City
50 Portland
51 Philadelphia
52 Pittsburg
53 Providence
54 Columbia
55 Sioux Falls
56 Memphis
57 Nashville
58 Dallas
59 El Paso
60 Houston
61 Salt Lake City
62 Burlington
63 Norfolk
64 Richmond
65 Seattle Tacoma
66 Spokane
67 Charleston
68 Milwaukee
69 Cheyenne
70 San Juan
To limit the number of rows in the returned dataset, limit can be used as shown below:
> sqldf("select * from precip_data limit 5")
Cities Precipitation
1 Mobile 67.0
2 Juneau 54.7
3 Phoenix 7.0
4 Little Rock 48.5
5 Los Angeles 14.0
Two forms of column alias are show below.
> sqldf("select cities as USCities from precip_data limit 5")
USCities
1 Mobile
2 Juneau
3 Phoenix
4 Little Rock
5 Los Angeles
> sqldf("select cities 'US Cities' from precip_data limit 5")
US Cities
1 Mobile
2 Juneau
3 Phoenix
4 Little Rock
5 Los Angeles
>
Calculations are allowed on columns containing numeric values. So, a query like the one below is entirely plausible:
> sqldf("select cities,precipitation*2 from precip_data limit 5")
Cities precipitation*2
1 Mobile 134.0
2 Juneau 109.4
3 Phoenix 14.0
4 Little Rock 97.0
5 Los Angeles 28.0
To return limited rows based on a search criterion, we can use the where clause as shown below:
> sqldf("select cities,precipitation from precip_data where cities like 'New%'")
Cities Precipitation
1 New Orleans 56.8
2 New York 40.2
The next query uses a string function, instr to return the same results as shown below:
> sqldf("select cities,precipitation from precip_data where instr(cities,'New')")
Cities Precipitation
1 New Orleans 56.8
2 New York 40.2
We can put a where clause on the numeric column as shown below:
> sqldf("select cities,precipitation from precip_data where precipitation >= 50")
Cities Precipitation
1 Mobile 67.0
2 Juneau 54.7
3 Jacksonville 54.5
4 Miami 59.8
5 New Orleans 56.8
6 San Juan 59.2
Usage of between is shown in the below query:
> sqldf("select cities,precipitation from precip_data where precipitation between 55 and 60")
Cities Precipitation
1 Miami 59.8
2 New Orleans 56.8
3 San Juan 59.2
For discontinuous values, we can use in as shown below:
> sqldf("select cities,precipitation from precip_data where precipitation in (30.2,35.9)")
Cities Precipitation
1 Duluth 30.2
2 St Louis 35.9
3 Omaha 30.2
4 Dallas 35.9
Now, let us use mtcars dataframe to see the other features of SQL in R.
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
To find out unique values, we can use distinct as shown below:
> sqldf("select distinct am from mtcars")
am
1 1
2 0
Support for aggregate functions and nesting of functions are shown below:
> sqldf("select count(*) from mtcars")
count(*)
1 32
> sqldf("select count(am) from mtcars")
count(am)
1 32
> sqldf("select count(distinct am) from mtcars")
count(distinct am)
1 2
count(*) returns the number of rows.
max() gives the maximum, min() gives the minimum, round() rounds the numeric value, and sum() calculates the sum of the values. These are shown below:
> sqldf("select min(disp), max(disp) from mtcars")
min(disp) max(disp)
1 71.1 472
>
> sqldf("select round(min(disp)), round(max(disp)) from mtcars")
round(min(disp)) round(max(disp))
1 71 472
>
> sqldf("select sum(wt) from mtcars")
sum(wt)
1 102.952
The rest of the SQL in R discussion is carried out in Part II ...
a) HiveQL
b) Spark SQL
c) Google BigQuery
d) Apache Drill
It is, therefore, not surprising that R does not lag behind and also supports SQL via sqldf package. So, let's explore how SQL can be used in R.
Details of sqldf package gives a fair idea about the manner in which SQL can be used to query dataframes in R. The first step is, of course, to install the sqldf package itself in R Studio as shown below:
Once we select sqldf from the prompted values, click on Install button. Once the installation of the package is complete, we can see that the following commands run on the console.
Then, we load the sqldf package using library command as shown below to use this package.
Running data() command shows all the default datasets available in R environment.
We will use these default datasets to play with SQL in R. First, let us prepare a dataframe based on precip dataset as shown below:
> precip_data <- data.frame(attr(precip,"names"),precip)
> colnames(precip_data) <- c("Cities","Precipitation")
> head(precip_data)
Cities Precipitation
1 Mobile 67.0
2 Juneau 54.7
3 Phoenix 7.0
4 Little Rock 48.5
5 Los Angeles 14.0
6 Sacramento 17.2
A quick look at the structure of this dataframe shows that the dataframe has two columns: one consisting of factors and the other numeric.
> str(precip_data)
'data.frame': 70 obs. of 2 variables:
$ Cities : Factor w/ 69 levels "Albany","Albuquerque",..: 42 33 51 35 36 58 60 21 27 69 ...
$ Precipitation: num 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
>
We can a simple select on the dataframe as shown below to see the complete data:
> sqldf("select * from precip_data")
Cities Precipitation
1 Mobile 67.0
2 Juneau 54.7
3 Phoenix 7.0
4 Little Rock 48.5
5 Los Angeles 14.0
6 Sacramento 17.2
7 San Francisco 20.7
8 Denver 13.0
9 Hartford 43.4
10 Wilmington 40.2
11 Washington 38.9
12 Jacksonville 54.5
13 Miami 59.8
14 Atlanta 48.3
15 Honolulu 22.9
16 Boise 11.5
17 Chicago 34.4
18 Peoria 35.1
19 Indianapolis 38.7
20 Des Moines 30.8
21 Wichita 30.6
22 Louisville 43.1
23 New Orleans 56.8
24 Portland 40.8
25 Baltimore 41.8
26 Boston 42.5
27 Detroit 31.0
28 Sault Ste. Marie 31.7
29 Duluth 30.2
30 Minneapolis/St Paul 25.9
31 Jackson 49.2
32 Kansas City 37.0
33 St Louis 35.9
34 Great Falls 15.0
35 Omaha 30.2
36 Reno 7.2
37 Concord 36.2
38 Atlantic City 45.5
39 Albuquerque 7.8
40 Albany 33.4
41 Buffalo 36.1
42 New York 40.2
43 Charlotte 42.7
44 Raleigh 42.5
45 Bismark 16.2
46 Cincinnati 39.0
47 Cleveland 35.0
48 Columbus 37.0
49 Oklahoma City 31.4
50 Portland 37.6
51 Philadelphia 39.9
52 Pittsburg 36.2
53 Providence 42.8
54 Columbia 46.4
55 Sioux Falls 24.7
56 Memphis 49.1
57 Nashville 46.0
58 Dallas 35.9
59 El Paso 7.8
60 Houston 48.2
61 Salt Lake City 15.2
62 Burlington 32.5
63 Norfolk 44.7
64 Richmond 42.6
65 Seattle Tacoma 38.8
66 Spokane 17.4
67 Charleston 40.8
68 Milwaukee 29.1
69 Cheyenne 14.6
70 San Juan 59.2
In order to select just one column, we can rewrite the SQL to as shown below:
> sqldf("select cities from precip_data")
Cities
1 Mobile
2 Juneau
3 Phoenix
4 Little Rock
5 Los Angeles
6 Sacramento
7 San Francisco
8 Denver
9 Hartford
10 Wilmington
11 Washington
12 Jacksonville
13 Miami
14 Atlanta
15 Honolulu
16 Boise
17 Chicago
18 Peoria
19 Indianapolis
20 Des Moines
21 Wichita
22 Louisville
23 New Orleans
24 Portland
25 Baltimore
26 Boston
27 Detroit
28 Sault Ste. Marie
29 Duluth
30 Minneapolis/St Paul
31 Jackson
32 Kansas City
33 St Louis
34 Great Falls
35 Omaha
36 Reno
37 Concord
38 Atlantic City
39 Albuquerque
40 Albany
41 Buffalo
42 New York
43 Charlotte
44 Raleigh
45 Bismark
46 Cincinnati
47 Cleveland
48 Columbus
49 Oklahoma City
50 Portland
51 Philadelphia
52 Pittsburg
53 Providence
54 Columbia
55 Sioux Falls
56 Memphis
57 Nashville
58 Dallas
59 El Paso
60 Houston
61 Salt Lake City
62 Burlington
63 Norfolk
64 Richmond
65 Seattle Tacoma
66 Spokane
67 Charleston
68 Milwaukee
69 Cheyenne
70 San Juan
To limit the number of rows in the returned dataset, limit can be used as shown below:
> sqldf("select * from precip_data limit 5")
Cities Precipitation
1 Mobile 67.0
2 Juneau 54.7
3 Phoenix 7.0
4 Little Rock 48.5
5 Los Angeles 14.0
Two forms of column alias are show below.
> sqldf("select cities as USCities from precip_data limit 5")
USCities
1 Mobile
2 Juneau
3 Phoenix
4 Little Rock
5 Los Angeles
> sqldf("select cities 'US Cities' from precip_data limit 5")
US Cities
1 Mobile
2 Juneau
3 Phoenix
4 Little Rock
5 Los Angeles
>
Calculations are allowed on columns containing numeric values. So, a query like the one below is entirely plausible:
> sqldf("select cities,precipitation*2 from precip_data limit 5")
Cities precipitation*2
1 Mobile 134.0
2 Juneau 109.4
3 Phoenix 14.0
4 Little Rock 97.0
5 Los Angeles 28.0
To return limited rows based on a search criterion, we can use the where clause as shown below:
> sqldf("select cities,precipitation from precip_data where cities like 'New%'")
Cities Precipitation
1 New Orleans 56.8
2 New York 40.2
The next query uses a string function, instr to return the same results as shown below:
> sqldf("select cities,precipitation from precip_data where instr(cities,'New')")
Cities Precipitation
1 New Orleans 56.8
2 New York 40.2
We can put a where clause on the numeric column as shown below:
> sqldf("select cities,precipitation from precip_data where precipitation >= 50")
Cities Precipitation
1 Mobile 67.0
2 Juneau 54.7
3 Jacksonville 54.5
4 Miami 59.8
5 New Orleans 56.8
6 San Juan 59.2
Usage of between is shown in the below query:
> sqldf("select cities,precipitation from precip_data where precipitation between 55 and 60")
Cities Precipitation
1 Miami 59.8
2 New Orleans 56.8
3 San Juan 59.2
For discontinuous values, we can use in as shown below:
> sqldf("select cities,precipitation from precip_data where precipitation in (30.2,35.9)")
Cities Precipitation
1 Duluth 30.2
2 St Louis 35.9
3 Omaha 30.2
4 Dallas 35.9
Now, let us use mtcars dataframe to see the other features of SQL in R.
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
To find out unique values, we can use distinct as shown below:
> sqldf("select distinct am from mtcars")
am
1 1
2 0
Support for aggregate functions and nesting of functions are shown below:
> sqldf("select count(*) from mtcars")
count(*)
1 32
> sqldf("select count(am) from mtcars")
count(am)
1 32
> sqldf("select count(distinct am) from mtcars")
count(distinct am)
1 2
count(*) returns the number of rows.
max() gives the maximum, min() gives the minimum, round() rounds the numeric value, and sum() calculates the sum of the values. These are shown below:
> sqldf("select min(disp), max(disp) from mtcars")
min(disp) max(disp)
1 71.1 472
>
> sqldf("select round(min(disp)), round(max(disp)) from mtcars")
round(min(disp)) round(max(disp))
1 71 472
>
> sqldf("select sum(wt) from mtcars")
sum(wt)
1 102.952
The rest of the SQL in R discussion is carried out in Part II ...