Tuesday 28 November 2017

Using Sqoop - II

In the earlier part of the Sqoop post, we saw how it is invoked and how it can be used to check the data in an Oracle database. In this post, we will attempt data migration. For all the work in this post, we will use the Oracle Big Data Lite 4.7.0 Virtual Machine.

The first command we will see is that of importing data from Oracle. The command is:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir/

The results returned are shown below:



















--target-dir specifies the target directory in HDFS. An error is returned stating that no primary key could be found on the TICKER table. We will take the option of adding -m 1: the number after -m specifies the number of mappers to be used in parallel for the import process.

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir/ -m 1

The results returned are shown below:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir/ -m 1
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/11/28 09:07:53 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.9.0
17/11/28 09:07:53 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/11/28 09:07:53 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
17/11/28 09:07:53 INFO manager.SqlManager: Using default fetchSize of 1000
17/11/28 09:07:53 INFO tool.CodeGenTool: Beginning code generation
17/11/28 09:07:54 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:54 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "TICKER" t WHERE 1=0
17/11/28 09:07:54 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-oracle/compile/a17fd6bd0c3377aa29a6891087a863e2/TICKER.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/11/28 09:07:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/a17fd6bd0c3377aa29a6891087a863e2/TICKER.jar
17/11/28 09:07:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:56 INFO mapreduce.ImportJobBase: Beginning import of TICKER
17/11/28 09:07:56 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/11/28 09:07:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 09:07:57 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
17/11/28 09:07:57 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/11/28 09:08:01 INFO db.DBInputFormat: Using read commited transaction isolation
17/11/28 09:08:01 INFO mapreduce.JobSubmitter: number of splits:1
17/11/28 09:08:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511848368976_0003
17/11/28 09:08:02 INFO impl.YarnClientImpl: Submitted application application_1511848368976_0003
17/11/28 09:08:02 INFO mapreduce.Job: The url to track the job: http://bigdatalite.localdomain:8088/proxy/application_1511848368976_0003/
17/11/28 09:08:02 INFO mapreduce.Job: Running job: job_1511848368976_0003
17/11/28 09:08:09 INFO mapreduce.Job: Job job_1511848368976_0003 running in uber mode : false
17/11/28 09:08:09 INFO mapreduce.Job:  map 0% reduce 0%
17/11/28 09:08:15 INFO mapreduce.Job:  map 100% reduce 0%
17/11/28 09:08:15 INFO mapreduce.Job: Job job_1511848368976_0003 completed successfully
17/11/28 09:08:16 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=143601
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=1859
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3850
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=3850
        Total vcore-seconds taken by all map tasks=3850
        Total megabyte-seconds taken by all map tasks=3942400
    Map-Reduce Framework
        Map input records=60
        Map output records=60
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=78
        CPU time spent (ms)=1690
        Physical memory (bytes) snapshot=184446976
        Virtual memory (bytes) snapshot=2101907456
        Total committed heap usage (bytes)=115867648
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=1859
17/11/28 09:08:16 INFO mapreduce.ImportJobBase: Transferred 1.8154 KB in 18.6901 seconds (99.4642 bytes/sec)
17/11/28 09:08:16 INFO mapreduce.ImportJobBase: Retrieved 60 records.
[oracle@bigdatalite ~]$
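With -m 1 the entire table is pulled by a single mapper. For a larger table, one could instead keep several mappers and nominate a split column explicitly with --split-by, which also satisfies Sqoop's requirement for a key on parallel imports. A hedged sketch, assuming PRICE is a reasonably distributed numeric column of TICKER and using a hypothetical target directory:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir-split/ --split-by PRICE -m 4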

The following two commands validate the records that were imported into files in HDFS. Note that the values are separated by commas.


hadoop fs -ls /user/oracle/import-dir

hadoop fs -cat /user/oracle/import-dir/part-m-00000 | head
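As an additional sanity check, the number of lines in the part file can be compared with the 60 records reported by the import job:

hadoop fs -cat /user/oracle/import-dir/part-m-00000 | wc -l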













We can specify a WHERE condition too, as shown below:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --target-dir /user/oracle/import-dir_1/ -m 1 --where "symbol like 'GLOBEX'"
 

The results returned are shown below:










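A subset of columns can also be imported with the --columns option. A hedged sketch, assuming the SYMBOL and PRICE columns of TICKER (as seen in the eval queries of the earlier post) and a hypothetical target directory:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --table TICKER --columns "SYMBOL,PRICE" --target-dir /user/oracle/import-dir_cols/ -m 1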
Instead of a table, we can specify the results of a query, as shown below:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --query 'select * from customer_segment where $CONDITIONS' --target-dir /user/oracle/import-dir_2/ --split-by segment_id 
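The literal token $CONDITIONS must appear in the WHERE clause of a free-form query; Sqoop replaces it with the range predicate for each mapper, which is why a --split-by column is required here. (If the query is wrapped in double quotes, it must be escaped as \$CONDITIONS so the shell does not expand it.) With a single mapper the split column can be dropped; a minimal variant with a hypothetical target directory:

sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --query 'select * from customer_segment where $CONDITIONS' --target-dir /user/oracle/import-dir_3/ -m 1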
  
The results returned can be checked with below commands:

hadoop fs -ls /user/oracle/import-dir_2

hadoop fs -cat /user/oracle/import-dir_2/part-m-0000* | head



















In the last step, we will see a data export from HDFS to Oracle. Let us create a table in Oracle using below command:

create table customer_segment_1 as select * from customer_segment where 1=2;

The results returned are shown below:













This table will be created with no data in it as shown below:

















We can then export the same data that we imported earlier using below command:

sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CUSTOMER_SEGMENT_1 --export-dir /user/oracle/import-dir_2/part-m-0000*

The output is shown below:

 [oracle@bigdatalite ~]$ sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CUSTOMER_SEGMENT_1 --export-dir /user/oracle/import-dir_2/part-m-0000*
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
17/11/28 13:22:51 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.9.0
17/11/28 13:22:51 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
17/11/28 13:22:51 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
17/11/28 13:22:51 INFO manager.SqlManager: Using default fetchSize of 1000
17/11/28 13:22:51 INFO tool.CodeGenTool: Beginning code generation
17/11/28 13:22:52 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 13:22:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "CUSTOMER_SEGMENT_1" t WHERE 1=0
17/11/28 13:22:52 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-oracle/compile/1110053d6710a6e6fd21b138ca20cc29/CUSTOMER_SEGMENT_1.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
17/11/28 13:22:54 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/1110053d6710a6e6fd21b138ca20cc29/CUSTOMER_SEGMENT_1.jar
17/11/28 13:22:54 INFO mapreduce.ExportJobBase: Beginning export of CUSTOMER_SEGMENT_1
17/11/28 13:22:55 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
17/11/28 13:22:55 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
17/11/28 13:22:56 INFO manager.OracleManager: Time zone has been set to GMT
17/11/28 13:22:56 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
17/11/28 13:22:56 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
17/11/28 13:22:56 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
17/11/28 13:22:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/11/28 13:23:00 INFO input.FileInputFormat: Total input paths to process : 4
17/11/28 13:23:00 INFO input.FileInputFormat: Total input paths to process : 4
17/11/28 13:23:00 INFO mapreduce.JobSubmitter: number of splits:3
17/11/28 13:23:00 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
17/11/28 13:23:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1511886050393_0008
17/11/28 13:23:00 INFO impl.YarnClientImpl: Submitted application application_1511886050393_0008
17/11/28 13:23:00 INFO mapreduce.Job: The url to track the job: http://bigdatalite.localdomain:8088/proxy/application_1511886050393_0008/
17/11/28 13:23:00 INFO mapreduce.Job: Running job: job_1511886050393_0008
17/11/28 13:23:06 INFO mapreduce.Job: Job job_1511886050393_0008 running in uber mode : false
17/11/28 13:23:06 INFO mapreduce.Job:  map 0% reduce 0%
17/11/28 13:23:16 INFO mapreduce.Job:  map 33% reduce 0%
17/11/28 13:23:18 INFO mapreduce.Job:  map 100% reduce 0%
17/11/28 13:23:18 INFO mapreduce.Job: Job job_1511886050393_0008 completed successfully
17/11/28 13:23:18 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=430611
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1430
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=21
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Launched map tasks=3
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=22352
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=22352
        Total vcore-seconds taken by all map tasks=22352
        Total megabyte-seconds taken by all map tasks=22888448
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Input split bytes=732
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=425
        CPU time spent (ms)=3920
        Physical memory (bytes) snapshot=548945920
        Virtual memory (bytes) snapshot=6293712896
        Total committed heap usage (bytes)=354418688
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
17/11/28 13:23:18 INFO mapreduce.ExportJobBase: Transferred 1.3965 KB in 22.3071 seconds (64.1051 bytes/sec)
17/11/28 13:23:18 INFO mapreduce.ExportJobBase: Exported 10 records.
[oracle@bigdatalite ~]$ 
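Since all the part files under the directory belong to the same data set, --export-dir can also point at the directory itself instead of a part-file glob; a hedged equivalent of the command above:

sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CUSTOMER_SEGMENT_1 --export-dir /user/oracle/import-dir_2/

Note that export issues plain INSERT statements by default, so re-running it would load the same ten rows into the table a second time.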


We can then check the data in the table in the database, as shown below:
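For instance, a simple count run in SQL Developer should report the ten rows mentioned in the "Exported 10 records" line above:

select count(*) from customer_segment_1;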












This concludes the post on Sqoop.

Using Sqoop - I

We take a brief look at Sqoop in this post. For all the work in this post, we will use the Oracle Big Data Lite 4.7.0 Virtual Machine.

Sqoop is a tool for transferring bulk data between Apache Hadoop and structured datastores such as relational databases. More details about this project are here.

The migration of data is key to some processes as output from a Map Reduce process in Hadoop may be ported to a relational database for further analysis and vice versa. In that sense, Sqoop is an important tool.

After installing Sqoop, we have to put in place the JDBC driver for the database that we propose to connect to. On this VM, Sqoop is already installed, and the JDBC driver to connect to Oracle is located at /var/lib/sqoop/ as shown below:

















We can see two links there: the MySQL JDBC driver and the Oracle JDBC driver. ojdbc6.jar is a link to /u01/app/oracle/product/12.1.0.2/dbhome_1/jdbc/lib/ojdbc6.jar. If the driver is not present, it can be downloaded from here and copied to the /var/lib/sqoop/ directory.

We can then check the version of Sqoop using below command on the terminal window:

sqoop version










To see the different available commands, we can use below command:

sqoop help


















Once we have checked that the JDBC driver is in place and noted the Sqoop version, let us start the database so that we can try out some data transfers between Oracle and HDFS. Click on Start/Stop Services as shown below; this brings up the window that can be used to start the Oracle database.















Highlight Oracle Database 12c, press the spacebar to select it, and hit the OK button to start it. You are led to the below window indicating that the database is starting:


















The first command will list all the databases visible to this connection, using the below command:

sqoop list-databases --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser 

The results returned are shown below:


















The command itself is quite simple: it starts with sqoop followed by list-databases. --username and --password are the credentials for the user connecting to the database. The complete results can be seen by scrolling.
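As the warning in the Sqoop output notes, supplying --password on the command line is insecure; the -P option prompts for the password interactively instead, for example:

sqoop list-databases --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER -P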

The next command lists the tables:

sqoop list-tables --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser

The results returned are shown below:












There is only one table called TICKER.

Open SQL Developer to check out the table called TICKER under pmuser as shown below:

















The command for reading TICKER is shown below:

sqoop eval --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --query "select * from  ticker where rownum <= 5"

The results returned are shown below:


















We can compare the above results with the results in SQL Developer; they match. eval evaluates the SQL query that follows and displays the results, --connect specifies the JDBC connection string, and --query supplies the SQL query.
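eval is not restricted to SELECT statements; the Sqoop documentation notes it can also run simple DML, which is handy for quick test inserts. A hedged sketch with made-up values, using the SYMBOL, TSTAMP and PRICE columns of TICKER:

sqoop eval --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --query "insert into ticker (symbol, tstamp, price) values ('TEST', sysdate, 10)"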

We can modify the earlier SELECT query to display only the date portion of TSTAMP, as shown below:

sqoop eval --connect jdbc:oracle:thin:@localhost:1521/orcl --username PMUSER --password pmuser --query "select symbol,to_char(tstamp,'mm-dd-yyyy'),price from ticker where rownum <= 5"

The results returned are shown below:








The rest of the Sqoop details are in the following post.

Sunday 26 November 2017

Word Count in Pig

After the word count in Hive, we will proceed to word count in Pig. For all queries in this post, we will use the Cloudera sandbox, Cloudera QuickStart VM 5.12.

We will use the same file that we used in the earlier word count exercise. The path to the file is /user/hive/warehouse/text/blue_carbuncle.txt

Navigate to the Pig Editor as shown below:






















The first step is the below command:

data = LOAD '/user/hive/warehouse/text/blue_carbuncle.txt' as (text:CHARARRAY);

The above command loads the data in the file into a field called text that is of type CHARARRAY. Once the data is loaded, we can run below command to describe the schema.

DESCRIBE data;

The results returned are shown below on line 17:







The next line is shown below:

words = FOREACH data GENERATE (TOKENIZE(text)) AS word;

FOREACH applies the operation that follows to every element of the input, data in our case.

GENERATE produces the output records from the input on the left.

TOKENIZE splits a string into a bag of words based on word separators, space in our case.

The output at the end of this operation is shown below from line 93:











We can see that the words are separated based on spaces, but each record still holds a whole bag of words. We need to put the words on different rows so that they can be counted, as we did in the case of Hive. So, we use FLATTEN as shown below:

words = FOREACH data GENERATE FLATTEN(TOKENIZE(text)) AS word;

The output at the end of this operation is shown below from line 93:











The next task is to group the words using below command:

grpd  = GROUP words BY word;

The output at the end of this operation is shown below from line 96:










Lastly, we need to add the count as shown below:

cntd  = FOREACH grpd GENERATE group, COUNT(words);

The output at the end of this operation is shown below from line 96:














The overall code is:

data = LOAD '/user/hive/warehouse/text/blue_carbuncle.txt' as (text:CHARARRAY);

words = FOREACH data GENERATE FLATTEN(TOKENIZE(text)) AS word;

grpd  = GROUP words BY word;

cntd  = FOREACH grpd GENERATE group, COUNT(words);

DUMP cntd;


The DUMP command outputs the result.
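As an aside, the counts can also be ordered and written back to HDFS rather than dumped to the console; a hedged sketch with a hypothetical output path:

ordered = ORDER cntd BY $1 DESC;
top3 = LIMIT ordered 3;
DUMP top3;
STORE cntd INTO '/user/hive/warehouse/text/pig_wordcount' USING PigStorage(',');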

Saturday 25 November 2017

Word Count in Hive

In this post, we will take a look at word count in Hive. This post is the culmination of the posts that we have written earlier on Analytic Functions in Hive. For all queries in this post, we will use the Cloudera sandbox, Cloudera QuickStart VM 5.12.

We begin by creating an external table that will hold the text data on which we wish to do a word count. The table will have just one column, of data type STRING. The command to create the table is shown below:

create external table text (line string);

The screenshot is shown below:












Once the table is created, we can import data into this table. We will import a .txt file that has the first few paragraphs of the short story, The Adventure of the Blue Carbuncle, by Sir Arthur Conan Doyle.
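The Hue import wizard is used below; equivalently, the file could be attached to the table from the Hive editor with a LOAD DATA statement. A minimal sketch, assuming the text file has been copied to a hypothetical local path on the sandbox:

load data local inpath '/home/cloudera/blue_carbuncle.txt' into table text;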

To port data into this external table, right-click on the table name on the left as shown below:
















In the next screen, click on Import Data as shown below:












In the dialog box, click Upload a file as shown below:
























Select the file once uploaded and click on Submit button:















Click on Sample tab to see the data loaded as shown below:




















Navigate back to Hive Editor as shown below:

























Run the below query to see the position of the word on each line and the word itself:

select posexplode(split(line, '\\s')) as (position, word) from text; 

The results returned are shown below:
























Running below query returns the words in rows:

select explode(split(line,'\\s')) as word from text; 

The results returned are shown below:

























Adding a count as shown below returns the word count:

select word, count(*) from(select explode(split(line,'\\s')) as word from text) z group by word;

The results returned are shown below:



















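The same word count can be written without a subquery by pairing explode with LATERAL VIEW, which is the more common Hive idiom; a minimal equivalent:

select t.word, count(*) from text lateral view explode(split(line, '\\s')) t as word group by t.word;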
Counting only those words that begin with a capital letter:

select word, count(*) from(select explode(split(line,'\\s')) as word from text) z where regexp(word, '^[A-Z]') group by word; 

The results returned are shown below:
















The last query returns the top three words that begin with a capital letter occurring in the text:

select word, count(*) as word_count from(select explode(split(line,'\\s')) as word from text) z where regexp(word, '^[A-Z]') group by word order by word_count desc limit 3;

The results returned are shown below:



Tuesday 21 November 2017

Analytic Functions in Hive - VI

In this post, we will take a look at a case study similar to the word count in Hive. For all our work, we will use the MapR sandbox, MapR-Sandbox-For-Hadoop-5.2.0.

Before we proceed to the case study, let us take a look at EXPLODE and POSEXPLODE.

We will run below query to understand EXPLODE:

select explode(array('Hello','Big','Data'));

The results returned are shown below:














As the name suggests, EXPLODE decomposes the elements of an array into multiple rows. POSEXPLODE is the same as EXPLODE, but it also returns a column with the positions of the elements, starting from 0.

select posexplode(array('Hello','Big','Data')); 

The results returned are shown below:











For the case study, the data is described by the query below:

select last_name, phone_number from employees where rlike(last_name, 'K');

The results returned are shown below: 













The goal is to count the number of elements in the phone numbers after they are split into different elements using '.' as a delimiter. As an example, '515' should have a count of 3 while '615' should have a count of 1 only.

As a first step, we remove the double quotes in the numbers with below query:

select regexp_replace(phone_number, '([/"])', '') as part_number from employees where rlike(last_name, 'K');

The results returned are shown below: 













We then split the result into an array having different phone number parts based on delimiter "." as shown below:

select split(regexp_replace(phone_number, '([/"])', ''),'\\.') as part_number from employees where rlike(last_name, 'K');

The results returned are shown below: 













We then use EXPLODE to move the different parts of the array to multiple rows as shown below: 

select explode(split(regexp_replace(phone_number, '([/"])', ''),'\\.')) as part_number from employees where rlike(last_name, 'K');

The results returned are shown below:


























Now, the part numbers are in different rows. We just need to do a GROUP BY to complete the part number count, as shown below:

select part_number, count(*) as count_number from (select explode(split(regexp_replace(phone_number, '([/"])', ''),'\\.')) as part_number from employees where rlike(last_name, 'K')) w group by part_number;

The results returned are shown below: 















This shows the count of the different part numbers. To get the total number of part numbers, use below query:

select sum(count_number) from (select part_number, count(*) as count_number from (select explode(split(regexp_replace(phone_number, '([/"])', ''),'\\.')) as part_number from employees where rlike(last_name, 'K')) w group by part_number) y;

The results returned are shown below:










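The same total can be reached more directly by summing the size() of each split array, skipping the explode and group-by steps; a minimal equivalent:

select sum(size(split(regexp_replace(phone_number, '([/"])', ''), '\\.'))) as total_parts from employees where rlike(last_name, 'K');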

This concludes the Analytic Functions series on Hive.

Monday 20 November 2017

Analytic Functions in Hive - V

In this post, we will take a look at Regular Expressions as handled by Hive. We will follow the structure of the post that we wrote earlier. For all our work, we will use the MapR sandbox, MapR-Sandbox-For-Hadoop-5.2.0.


A Regular Expression is a special text string used as a search pattern. In other words, a Regular Expression describes a pattern that can be used for searching text in strings. They are specified as a combination of Metacharacters and Literals. These are described below:

1) Metacharacters: These are operators like *, ?, [], etc., that define the matching rules used in the search process

2) Literals: These are the actual characters or content that are searched for

The syntax followed by Hive is described here.

Hive has the following functions to support Regular Expressions: RLIKE, REGEXP, REGEXP_EXTRACT, and REGEXP_REPLACE.

In the first query, we will use RLIKE as seen in the example below:

Use of '.':

select first_name, last_name from employees where rlike (last_name, 'A....n');

The results returned are shown below:











The '.' metacharacter matches any character. So, since there are 4 of those, the query matches any word that starts with A and ends with n and has 4 characters in between.

Use of '+':


The second query makes use of REGEXP. REGEXP is synonymous with RLIKE. 

select first_name, last_name from employees where regexp(last_name, '(rr)+');

The results returned are shown below: 











'+' matches one or more of the preceding subexpression. 

Use of '?':

select first_name, last_name from employees where rlike(last_name, '(at?a)');

The results returned are shown below:  

 










'?' matches zero or one occurrence of the preceding subexpression. De Haan is selected because there is no t between the two a's, while Pataballa is selected because there is one t between the a's.

Use of '*':

select first_name, last_name from employees where rlike (last_name, 'al*a');


The results returned are shown below: 













'*' matches zero or more of the preceding subexpression.

Use of '{}':

select first_name, last_name from employees where regexp(last_name, 'l{2}');

The results returned are shown below:



















This query returns last names containing the character l two times in a row.

select first_name, last_name from employees where regexp(last_name, 'a{2,}');

The results returned are shown below:









This query returns all employees having last names that contain 'a' two or more times in a row.

select first_name, last_name from employees where rlike(last_name, 'h{1,2}');

The results returned are shown below:

















The last query returns all employees having last names that contain one or two consecutive occurrences of h; since the match is unanchored, any last name with at least one h qualifies.

Use of '[]':

select first_name, last_name from employees where rlike(last_name, '[zx]');

The results returned are shown below:












The last query returns first_name, last_name for all employees having a last_name containing x or z.


Use of '[ ]':

select first_name, last_name from employees where rlike(last_name, '[ ]');

The results returned are shown below:










The last query returns first_name, last_name from all employees having last_name containing a space.

The same result can be obtained by below query:

select first_name, last_name from employees where rlike(last_name, '\\s');

Use of '\1':

'\1' is called a back reference, as it refers back to the first captured group in the pattern. In this case, the group is a vowel, so a repeated vowel is sought.

select last_name from employees where rlike(last_name, '([aeiou])\\1');

The results returned are shown below:
















The last query returns last names containing the same vowel from [] repeated twice in a row.

Next, we look at REGEXP_EXTRACT, which helps with extracting parts of a field that meet certain criteria, as shown in the below example:

select phone_number, regexp_extract(phone_number, '590.423.(.*)',0) from employees order by regexp_extract(phone_number, '590.423.(.*)',0) desc limit 7;

The results returned are shown below:











The first argument is the field itself, phone_number. The second argument is the pattern that provides the search expression. The third argument is 0, so the text matching the entire pattern is returned; if the pattern is not matched, nothing is extracted. If the third argument is set to 1, then only the text matching the group, (.*) in our case, is returned. This is shown in the following query:

select phone_number, regexp_extract(phone_number, '590.423.(.*)',1) from employees order by regexp_extract(phone_number, '590.423.(.*)',1) desc limit 7;

The results returned are shown below:











The last function, REGEXP_REPLACE, can be used to replace characters that satisfy certain pattern matches as shown below:

select last_name, phone_number, regexp_replace(phone_number, '([/.])', ' ') as modified_phone_number from employees where rlike(last_name, 'K');
 
The results returned are shown below:










The REGEXP_REPLACE function replaces the '.' character in phone_number with a space character.

With the query on REGEXP_REPLACE, we conclude the topic of Regular Expressions in Hive.