Let's Refresh: More Regular Expressions

Tuesday 10 April 2018

More Regular Expressions

In an earlier post we looked at Regular Expressions. This post extends that discussion with a few more topics: inline modifiers, capturing groups, non-capturing groups and look arounds. Some of these topics may seem a bit involved at first glance. To demonstrate examples for each topic, we will use the test harness mentioned here. The Java program is saved to a folder called RegularExpressions. Let us use it to try a few simple examples before we take up the topics mentioned above. The commands to compile and run the test harness are below:

F:\>cd RegularExpressions

F:\RegularExpressions>javac RegexTestHarness.java

F:\RegularExpressions>java -classpath . RegexTestHarness

Once the harness starts, it prompts for the regular expression that we intend to search for. After we enter the pattern, we get a second prompt where we can enter the text to be searched. The program then reports each match it finds. This cycle repeats until we exit the program with CTRL+C.

Let us look for vowels in foobar. The results are shown below:

Enter your regex: [aeiou]

Enter input string to search: foobar

I found the text "o" starting at index 1 and ending at index 2.

I found the text "o" starting at index 2 and ending at index 3.

I found the text "a" starting at index 4 and ending at index 5.

Note that three results are returned, one for each vowel in the input. We can restrict the search to two vowels appearing together, as shown below:

Enter your regex: [aeiou]{2}

Enter input string to search: foobar

I found the text "oo" starting at index 1 and ending at index 3.
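The two searches above can also be reproduced without the interactive prompts. The following is a minimal sketch, not the harness itself: the class name FindAllSketch and the method findAll are hypothetical, and the report wording is simply copied from the harness output shown above. It relies only on java.util.regex.Pattern and Matcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; it only imitates the harness's report format.
class FindAllSketch {
    // Returns one report line per match of regex in input.
    public static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> lines = new ArrayList<>();
        while (matcher.find()) {
            // start() is inclusive, end() is exclusive
            lines.add(String.format(
                "I found the text \"%s\" starting at index %d and ending at index %d.",
                matcher.group(), matcher.start(), matcher.end()));
        }
        return lines;
    }

    public static void main(String[] args) {
        findAll("[aeiou]", "foobar").forEach(System.out::println);
        findAll("[aeiou]{2}", "foobar").forEach(System.out::println);
    }
}
```

Note that a backslash in a regex has to be doubled inside a Java string literal, which is not needed when typing the pattern at the harness prompt.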

Inline modifiers have the syntax (?z), where z is a letter such as i or s. The i modifier makes the match case insensitive. An example of its usage and the result is shown below:

Enter your regex: (?i)ms\.

Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.

I found the text "Ms." starting at index 0 and ending at index 3.

I found the text "MS." starting at index 11 and ending at index 14.

I found the text "ms." starting at index 26 and ending at index 29.

If we want case insensitivity to apply to the M only, we can scope the modifier to a group with (?i:M). The s that follows must then be lowercase, so MS. is no longer matched:

Enter your regex: (?i:M)s\.

Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.

I found the text "Ms." starting at index 0 and ending at index 3.

I found the text "ms." starting at index 26 and ending at index 29.
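In Java code the same behaviour is available either through the inline modifier or through a flag passed to Pattern.compile. The sketch below compares the two forms and the scoped variant on the input used above; the class name CaseFlagSketch and the helper matches are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class comparing inline modifiers with compile-time flags.
class CaseFlagSketch {
    // Collects the text of every match of pattern in input.
    public static List<String> matches(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Ms. Jones, MS. Parker and ms. White were at the party.";
        // Inline modifier: case insensitive for the whole pattern
        System.out.println(matches(Pattern.compile("(?i)ms\\."), input));
        // Same effect with a flag instead of the inline modifier
        System.out.println(matches(Pattern.compile("ms\\.", Pattern.CASE_INSENSITIVE), input));
        // Scoped modifier: only the M is case insensitive, so "MS." is skipped
        System.out.println(matches(Pattern.compile("(?i:M)s\\."), input));
    }
}
```

The first two calls should report Ms., MS. and ms., while the scoped form should report only Ms. and ms., in line with the harness results above.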

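(?s) enables the metacharacter . to match every character, including line terminators such as newline, so that the whole input is treated as a single line. Three examples are shown below. Because .* can also match zero characters, an empty match is reported at the end of the input in the first and third examples; the ^ anchor in the second example rules it out: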

Enter your regex: (?s).*

Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.

I found the text "Ms. Jones, MS. Parker and ms. White were at the party." starting at index 0 and ending at index 54.

I found the text "" starting at index 54 and ending at index 54.

Enter your regex: (?s)^.*

Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.

I found the text "Ms. Jones, MS. Parker and ms. White were at the party." starting at index 0 and ending at index 54.

Enter your regex: (?s).*$

Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.

I found the text "Ms. Jones, MS. Parker and ms. White were at the party." starting at index 0 and ending at index 54.

I found the text "" starting at index 54 and ending at index 54.
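The input typed into the harness above is a single line, so the effect of (?s) is not visible there. The sketch below uses a two-line input (an assumption made purely for illustration) to show the difference, and also shows the equivalent Pattern.DOTALL flag. The class name DotAllSketch and the helper firstMatch are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class showing the effect of (?s) on input that contains a newline.
class DotAllSketch {
    // Returns the text of the first match, or null when nothing matches.
    public static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "Ms. Jones\nMS. Parker"; // two lines

        // Without (?s), the dot stops at the line terminator
        String plain = firstMatch(Pattern.compile(".*"), input);
        // With (?s), or with the DOTALL flag, the dot also matches the newline
        String inline = firstMatch(Pattern.compile("(?s).*"), input);
        String flag = firstMatch(Pattern.compile(".*", Pattern.DOTALL), input);

        System.out.println(plain.equals("Ms. Jones")); // true
        System.out.println(inline.equals(input));      // true
        System.out.println(flag.equals(input));        // true
    }
}
```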

Capturing groups are useful because they let us pick out selected parts of a match and keep them for later processing. Any part of a search pattern enclosed in parentheses forms a capturing group, as shown below:

Enter your regex: (\w{3})

Enter input string to search: a bc def gh ijk

I found the text "def" starting at index 5 and ending at index 8.

I found the text "ijk" starting at index 12 and ending at index 15.

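Every run of three consecutive word characters (letters, digits or underscore) is returned. In the next example, we use back references in conjunction with two capturing groups. \1 and \2 must match the same text that the first and second groups captured: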

Enter your regex: (\w{1})(\w{1})\2\1\2

Enter input string to search: abcde fghij abbab kllkl lmnop

I found the text "abbab" starting at index 12 and ending at index 17.

I found the text "kllkl" starting at index 18 and ending at index 23.
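The harness prints only the overall match. In code, the text captured by each group can be read with Matcher.group(n), where group 0 is the whole match. The sketch below applies the back reference pattern from above and lists what each group captured; the class name BackReferenceSketch and the method describe are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class that reports the contents of two capturing groups.
class BackReferenceSketch {
    // Assumes regex defines at least two capturing groups.
    public static List<String> describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group() + " -> group 1 = " + m.group(1) + ", group 2 = " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // \2\1\2 must repeat what groups 2, 1 and 2 captured, in that order
        String regex = "(\\w{1})(\\w{1})\\2\\1\\2";
        describe(regex, "abcde fghij abbab kllkl lmnop").forEach(System.out::println);
    }
}
```

For abbab, group 1 holds a and group 2 holds b, so the pattern reads as a, b, then b, a, b again.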

Non-capturing groups are very similar to capturing groups, but the text they match is not stored. The syntax is also very similar, except that we use ?: at the start of the parentheses, as shown below:

Enter your regex: (?:\w{1})(\w{1})(\w{1})\1\1\2

Enter input string to search: abcbbc defghi klmllm

I found the text "abcbbc" starting at index 0 and ending at index 6.

I found the text "klmllm" starting at index 14 and ending at index 20.

Note that there are three groups: the first one is non-capturing and the next two are capturing groups. Since only the capturing groups are numbered, \1 and \2 refer to the second and third sets of parentheses, and these are the only two back references available.
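The group numbering can be checked in code with Matcher.groupCount(), which counts capturing groups only. The following sketch uses the pattern from the example above; the class name NonCapturingSketch and its two methods are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class showing that (?:...) does not take a group number.
class NonCapturingSketch {
    // Number of capturing groups defined by the pattern.
    public static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Assumes regex defines at least two capturing groups.
    public static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group() + ": group 1 = " + m.group(1) + ", group 2 = " + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String regex = "(?:\\w{1})(\\w{1})(\\w{1})\\1\\1\\2";
        System.out.println(capturingGroups(regex)); // 2, the (?:...) group is not counted
        groups(regex, "abcbbc defghi klmllm").forEach(System.out::println);
    }
}
```

In abcbbc the leading a is matched by the non-capturing group, so group 1 holds b and group 2 holds c.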

Next we take a look at look arounds. A look around checks for a search pattern in either the forward or the backward direction, but the text it checks is not included in the match. If the check is made in the forward direction, it is called a Lookahead, and if it is made in the backward direction, it is called a Lookbehind. There are two types of Lookahead: Positive Lookahead and Negative Lookahead. The syntax for Positive Lookahead is (?=). An example of Positive Lookahead is shown below:

Enter your regex: foo(?=bar)

Enter input string to search: One can often see foobar used in software code.

I found the text "foo" starting at index 18 and ending at index 21.

The check is made for foo followed by bar, but only foo is part of the match, not bar. This is evident in the next example:

Enter your regex: foo(?=bar)

Enter input string to search: foo in foobar is used separately also.

I found the text "foo" starting at index 7 and ending at index 10.

There are two occurrences of foo in the input string, but only the foo that is followed by bar is picked. Negative Lookahead has the syntax (?!) and works like Positive Lookahead, except that it matches only when the lookahead pattern is not found, as in the example below:

Enter your regex: foo(?!bar)

Enter input string to search: foo in foobar is used separately also.

I found the text "foo" starting at index 0 and ending at index 3.
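The two kinds of Lookahead can be compared side by side in code. The sketch below records each match together with its start and end index, which makes it visible that bar is checked but never included in the match. The class name LookaheadSketch and the method spans are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class comparing positive and negative lookahead.
class LookaheadSketch {
    // Returns each match as text@start-end (end index is exclusive).
    public static List<String> spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "foo in foobar is used separately also.";
        System.out.println(spans("foo", input));        // [foo@0-3, foo@7-10]
        System.out.println(spans("foo(?=bar)", input)); // [foo@7-10]
        System.out.println(spans("foo(?!bar)", input)); // [foo@0-3]
    }
}
```

The positive lookahead match ends at index 10, not 13, because the lookahead consumes no characters.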

Again there are two occurrences of foo in the input string, but Negative Lookahead picks only the foo that is not followed by bar. The next look around is Lookbehind. The syntax for Positive Lookbehind is (?<=). It is similar to Positive Lookahead, except that the search pattern must appear before the matched text. An example is shown below:

Enter your regex: (?<=foo)bar

Enter input string to search: foo, bar and foobar are often used in software examples

I found the text "bar" starting at index 16 and ending at index 19.

In the above example, of the standalone bar and the bar that is part of foobar, only the bar that is preceded by foo is picked, because only it satisfies the Lookbehind search pattern. Lastly, we take a look at Negative Lookbehind, which has the syntax (?<!). Negative Lookbehind is similar to Lookbehind, but it matches only when the search pattern is not found before the matched text, as seen in the example below:

Enter your regex: (?<!foo)bar

Enter input string to search: foo, bar and foobar are often used in software examples

I found the text "bar" starting at index 5 and ending at index 8.

We have used the same input text as in the Lookbehind example. Note that the bar that is not preceded by foo is picked and the bar that is preceded by foo is skipped.
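One practical use of Lookbehind is replacing text based on what precedes it, without touching the preceding text itself. The sketch below uses String.replaceAll with both Lookbehind patterns from above; the class name LookbehindSketch, the method upperCaseBar and the replacement text BAR are hypothetical choices for illustration.

```java
// Hypothetical class using lookbehind to decide which "bar" gets replaced.
class LookbehindSketch {
    // Replaces every match of regex with the literal text BAR.
    public static String upperCaseBar(String regex, String input) {
        return input.replaceAll(regex, "BAR");
    }

    public static void main(String[] args) {
        String input = "foo, bar and foobar are often used in software examples";
        // Positive lookbehind: only the bar preceded by foo is replaced
        System.out.println(upperCaseBar("(?<=foo)bar", input));
        // Negative lookbehind: only the bar not preceded by foo is replaced
        System.out.println(upperCaseBar("(?<!foo)bar", input));
    }
}
```

In both cases foo itself is left unchanged, because the Lookbehind only checks the preceding text and does not make it part of the match.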

This concludes the discussion on inline modifiers, capturing groups, non-capturing groups and look arounds in Regular Expressions.