Thursday 28 September 2017

Set Operations in R

After a fairly long break, we are back to writing about data and related areas. This short post focuses on set operations in R. Set operations are useful when we need to find, say, the elements common to two sets of data, or the elements of one set that are not present in another. We will consider a few functions in R that can be of immense help in data analysis.

To illustrate the functions, we use two vectors called p and q, defined as follows:
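
The original definitions of p and q are not reproduced in this text, so the values below are our own illustrative assumption, chosen so that each vector contains duplicates and the two vectors overlap:

```r
# Illustrative vectors (assumed values, not necessarily those of the original post)
p <- c(1, 2, 2, 3, 4, 4, 5)
q <- c(4, 5, 5, 6, 7)

p
# [1] 1 2 2 3 4 4 5
q
# [1] 4 5 5 6 7
```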









The unique() function returns the unique elements of a vector. Let us use this function on p and q to fetch the unique elements as shown below:
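
A minimal sketch of unique(), assuming illustrative values for p and q (the values are our assumption, not taken from the original post):

```r
# Assumed example vectors
p <- c(1, 2, 2, 3, 4, 4, 5)
q <- c(4, 5, 5, 6, 7)

# unique() keeps the first occurrence of each value and drops later repeats
up <- unique(p)
uq <- unique(q)

up
# [1] 1 2 3 4 5
uq
# [1] 4 5 6 7
```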







As can be seen, applying the unique function to p and q returns only the distinct elements; duplicates are discarded. To get the union of the elements in p and q, we use the union function as below:
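
A minimal sketch of union(), again assuming illustrative values for p and q:

```r
# Assumed example vectors
p <- c(1, 2, 2, 3, 4, 4, 5)
q <- c(4, 5, 5, 6, 7)

# union() combines both vectors and removes duplicates
u <- union(p, q)

u
# [1] 1 2 3 4 5 6 7
```

For these vectors the result is the same as unique(c(p, q)).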






Note that the union of p and q contains only the distinct elements from p and q. Next, we look at the intersect function, which returns the elements common to the two sets under consideration.
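
A minimal sketch of intersect(), assuming illustrative values for p and q:

```r
# Assumed example vectors
p <- c(1, 2, 2, 3, 4, 4, 5)
q <- c(4, 5, 5, 6, 7)

# intersect() returns the distinct elements present in both vectors
common <- intersect(p, q)

common
# [1] 4 5
```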







The setdiff function returns the elements of the first set that are not present in the second set. It is like the minus operation in a relational database. Unlike union or intersect, where swapping the arguments does not change which elements are returned, the order of the arguments matters for setdiff.
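
A minimal sketch of setdiff() with the arguments in both orders, assuming illustrative values for p and q:

```r
# Assumed example vectors
p <- c(1, 2, 2, 3, 4, 4, 5)
q <- c(4, 5, 5, 6, 7)

# Elements of p that are not in q
p_minus_q <- setdiff(p, q)
# Elements of q that are not in p
q_minus_p <- setdiff(q, p)

p_minus_q
# [1] 1 2 3
q_minus_p
# [1] 6 7
```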







Note again that setdiff returns only distinct elements.

We can create a custom setdiff function that retains duplicates as shown below:
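
The original custom function is not reproduced in this text. One possible version, under our own assumptions, filters the first vector with %in%; the name setdiff_dup and the example values are hypothetical:

```r
# Assumed example vectors
p <- c(1, 2, 2, 3, 4, 4, 5)
q <- c(4, 5, 5, 6, 7)

# Hypothetical helper: keep every element of x that is not found in y,
# including repeated values (unlike base setdiff(), which deduplicates)
setdiff_dup <- function(x, y) {
  x[!(x %in% y)]
}

with_dups <- setdiff_dup(p, q)

with_dups
# [1] 1 2 2 3
setdiff(p, q)
# [1] 1 2 3
```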






To check whether two vectors contain the same set of elements, regardless of order and duplicates, we can use setequal as shown below:
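
A minimal sketch of setequal(), assuming illustrative values for p and q. The third call is our own addition to show that order and repeats are ignored:

```r
# Assumed example vectors
p <- c(1, 2, 2, 3, 4, 4, 5)
q <- c(4, 5, 5, 6, 7)

# Different sets of elements
r1 <- setequal(p, q)
# A vector always has the same elements as itself
r2 <- setequal(p, p)
# Same elements in a different order, with a repeat
r3 <- setequal(c(1, 2, 2), c(2, 1))

r1
# [1] FALSE
r2
# [1] TRUE
r3
# [1] TRUE
```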







setequal returns FALSE in the first case, as p and q do not contain the same elements, and TRUE in the second case, as p is compared with itself.

is.element(x, y) returns TRUE or FALSE for each element of the first vector, depending on whether that element is present in the second vector. It is equivalent to x %in% y. The use of is.element is shown below:
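
A minimal sketch of is.element() next to %in%, assuming illustrative values for p and q:

```r
# Assumed example vectors
p <- c(1, 2, 2, 3, 4, 4, 5)
q <- c(4, 5, 5, 6, 7)

# Single value: is 3 an element of p?
single <- is.element(3, p)
# Vector input: one logical result per element of q
each_q <- is.element(q, p)

single
# [1] TRUE
each_q
# [1]  TRUE  TRUE  TRUE FALSE FALSE

# The %in% operator gives the same answer
q %in% p
# [1]  TRUE  TRUE  TRUE FALSE FALSE
```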






The last function we discuss is choose. choose(n, r) is the same as nCr in combinatorics: it returns the binomial coefficient, that is, the number of possible subsets of size r that can be chosen from n elements. Examples are shown below:
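
A few illustrative calls to choose(); the inputs are our own examples. The last lines cross-check one result by enumerating the subsets with combn():

```r
# Number of 2-element subsets of a 5-element set: 5! / (2! * 3!)
c52 <- choose(5, 2)
# Number of 3-element subsets of a 10-element set
c103 <- choose(10, 3)
# Exactly one way to choose nothing, and one way to choose everything
c50 <- choose(5, 0)
c44 <- choose(4, 4)

c52
# [1] 10
c103
# [1] 120
c50
# [1] 1
c44
# [1] 1

# combn() lists the subsets themselves, one per column
n_subsets <- ncol(combn(5, 2))
n_subsets
# [1] 10
```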