Monday, 5 December 2016

Chapter 6.1: Rules of writing regular expressions-Common matching symbols, Metacharacters, Quantifier, Grouping and Backreference, Grouping and Backreference, Backslashes in Java


The following description is an overview of available meta characters which can be used in regular expressions. This chapter is supposed to be a references for the different regex elements.

3.1. Common matching symbols

Table 1. 
Regular Expression
Description
.
Matches any character
^regex
Finds regex that must match at the beginning of the line.
regex$
Finds regex that must match at the end of the line.
[abc]
Set definition, can match the letter a or b or c.
[abc][vz]
Set definition, can match a or b or c followed by either v or z.
[^abc]
When a caret appears as the first character inside square brackets, it negates the pattern. This can match any character except a or b or c.
[a-d1-7]
Ranges: matches a letter between a and d and figures from 1 to 7, but not d1.
X|Z
Finds X or Z.
XZ
Finds X directly followed by Z.
$
Checks if a line end follows.

3.2. Metacharacters

The following metacharacters have a pre-defined meaning and make certain common patterns easier to use, e.g., \dinstead of [0..9].
Table 2. 
Regular Expression
Description
\d
Any digit, short for [0-9]
\D
A non-digit, short for [^0-9]
\s
A whitespace character, short for [ \t\n\x0b\r\f]
\S
A non-whitespace character, short for [^\s]
\w
A word character, short for [a-zA-Z_0-9]
\W
A non-word character [^\w]
\S+
Several non-whitespace characters
\b
Matches a word boundary where a word character is [a-zA-Z0-9_].

3.3. Quantifier

A quantifier defines how often an element can occur. The symbols ?, *, + and {} define the quantity of the regular expressions
Table 3. 
Regular Expression
Description
Examples
*
Occurs zero or more times, is short for {0,}
X* finds no or several letter X,
.* finds any character sequence
+
Occurs one or more times, is short for {1,}
X+ - Finds one or several letter X
?
Occurs no or one times, ? is short for {0,1}.
X? finds no or exactly one letter X
{X}
Occurs X number of times, {} describes the order of the preceding liberal
\d{3} searches for three digits, .{10} for any character sequence of length 10.
{X,Y}
Occurs between X and Y times,
\d{1,4} means \d must occur at least once and at a maximum of four.
*?
? after a quantifier makes it a reluctant quantifier. It tries to find the smallest match.


3.4. Grouping and Backreference

You can group parts of your regular expression. In your pattern you group elements with round brackets, e.g., (). This allows you to assign a repetition operator to a complete group.
In addition these groups also create a backreference to the part of the regular expression. This captures the group. A backreference stores the part of the String which matched the group. This allows you to use this part in the replacement.
Via the $ you can refer to a group. $1 is the first group, $2 the second, etc.
Let's, for example, assume you want to replace all whitespace between a letter followed by a point or a comma. This would involve that the point or the comma is part of the pattern. Still it should be included in the result.
// Removes whitespace between a word character and . or ,
String pattern = "(\\w)(\\s+)([\\.,])";
System.out.println(EXAMPLE_TEST.replaceAll(pattern, "$1$3")); 
This example extracts the text between a title tag.
// Extract the text between the two title elements
pattern = "(?i)(<title.*?>)(.+?)(</title>)";
String updated = EXAMPLE_TEST.replaceAll(pattern, "$2"); 

3.5. Negative Lookahead

Negative Lookahead provides the possibility to exclude a pattern. With this you can say that a string should not be followed by another string.
Negative Lookaheads are defined via (?!pattern). For example, the following will match "a" if "a" is not followed by "b".
a(?!b) 

3.6. Backslashes in Java


The backslash \ is an escape character in Java Strings. That means backslash has a predefined meaning in Java. You have to use double backslash \\ to define a single backslash. If you want to define \w, then you must be using \\w in your regex. If you want to use backslash as a literal, you have to type \\\\ as \ is also an escape character in regular expressions.

No comments:

Post a Comment