Wednesday 6 December 2017

Analytic Functions in Impala - V

In the following post, we will take a look at Regular Expressions. We will follow the post that we wrote before. For all our work, we will use the Cloudera sandbox, Cloudera QuickStart VM 5.12.

A Regular Expression is a special text used as a search pattern. In other words, a Regular Expression describes a pattern that can be used for searching text in strings. They are specified as a combination of Metacharacters and Literals. These are described below:

1) Metacharacters: These are operators like *,?,[], etc, that are used to create the algorithm to be used in the search process

2) Literals: These are the actual characters or content that are searched for

The syntax followed by Hive is described here.

Hive has the following functions to support Regular Expressions: REGEXP_LIKE, REGEXP_EXTRACT, and REGEXP_REPLACE.

In the first query, we will use REGEXP_LIKE as seen in the example below:

Use of '.':

select first_name, last_name from employees where regexp_like(last_name, 'A....n')

The results returned are shown below: 












The '.' metacharacter matches any character. So, since there are 4 of those, the query matches any word that starts with A and ends with n and has 4 characters in between.

Use of '+':


select first_name, last_name from employees where
regexp_like(last_name, '(rr)+')

The results returned are shown below: 

 













'+' matches one or more of the preceding subexpression. 

Use of '?':

select first_name, last_name from employees where
regexp_like(last_name, '(at?a)')

The results returned are shown below: 














'*' matches zero or more of the preceding subexpression.

Use of '{}':

select first_name, last_name from employees where regexp_like(last_name, 'l{2}')


The results returned are shown below:
























This query returns last_name containing a character two times.

select first_name, last_name from employees where regexp_like(last_name, 'a{2,}')

The results returned are shown below:












This query returns all employees having last_names having 'a' two times or more.

select first_name, last_name from employees where regexp_like(last_name, 'h{1,2}')

The results returned are shown below:

























The last query returns all employees having last_names having at least one h but not more than two h.

Use of '[]':

select first_name, last_name from employees where regexp_like(last_name, '[zx]')

The results returned are shown below:
















The last query returns first_name, last_name from all employees having last_name containing x, or z.

Use of '[ ]':

select first_name, last_name from employees where regexp_like(last_name, '[ ]')

The results returned are shown below:












The last query returns first_name, last_name from all employees having last_name containing a space.

The same result can be obtained by below query:

select first_name, last_name from employees where regexp_like(last_name, '\\s')

Use of '\1':

'\1' is called back reference as it will call the first search criteria that has been called before. In this case, it is a vowel. So, a repeat is sought.

select last_name from employees where regexp_like(last_name, '([aeiou])\\1')

The results returned are shown below:






The back reference does not look to be supported in Impala.

Next, we look at REGEXP_EXTRACT that helps with extracting parts from a field that meet certain criteria as shown in below example:

select phone_number, regexp_extract(phone_number, '590.423.(.*)',0) from employees order by regexp_extract(phone_number, '590.423.(.*)',0) desc limit 7

The results returned are shown below:


















The first parameter is the field itself, phone_number. The second argument is the pattern that provides the search expression. The third argument has 0 and so results that match entire search criteria are returned. If the search criteria is not met, nulls are returned. If the third argument is set to 1, then, only results that pertain to the group, (.*) in our case is returned back. This is shown in following query:

select phone_number, regexp_extract(phone_number, '590.423.(.*)',1) from employees order by regexp_extract(phone_number, '590.423.(.*)',1) desc limit 7

The results returned are shown below:


















The last function, REGEXP_REPLACE, can be used to replace characters that satisfy certain pattern matches as shown below:

select last_name, phone_number, regexp_replace(phone_number, '([/.])', ' ') as modified_phone_number from employees where regexp_like(last_name, 'K')
 
The results returned are shown below:














REGEXP_REPLACE function replaces '.' character in phone_number with a space character.

With the query on REGEXP_REPLACE, we conclude the topic of Regular Expressions in Impala