The following description is an overview of the available metacharacters that can be
used in regular expressions. This chapter is intended as a reference for
the different regex elements.
3.1. Common matching symbols
Regular Expression | Description |
. | Matches any single character (by default, line terminators are excluded). |
^regex | Finds regex that must match at the beginning of the line. |
regex$ | Finds regex that must match at the end of the line. |
[abc] | Set definition: matches the letter a, b, or c. |
[abc][vz] | Set definition: matches a, b, or c followed by either v or z. |
[^abc] | When a caret appears as the first character inside square brackets, it negates the set. This matches any character except a, b, or c. |
[a-d1-7] | Ranges: matches a single character that is a letter from a to d or a digit from 1 to 7; it does not match the two-character sequence d1. |
X|Z | Finds X or Z. |
XZ | Finds X directly followed by Z. |
$ | Checks whether a line end follows. |
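The symbols in this table can be tried out with a few lines of Java. The following is a minimal sketch, not part of the original tutorial; the class name CommonSymbolsDemo and the helper found are hypothetical, and the helper uses Matcher.find(), so it reports a match anywhere in the input rather than a full-string match.

```java
import java.util.regex.Pattern;

class CommonSymbolsDemo {

    // Returns true if the regex matches somewhere inside the input
    // (Matcher.find semantics, not a full-string match).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^abc", "abcdef"));      // true: anchored at the start
        System.out.println(found("def$", "abcdef"));      // true: anchored at the end
        System.out.println(found("[abc][vz]", "xxbz"));   // true: b followed by z
        System.out.println(found("[^abc]", "abc"));       // false: only a, b, c present
        System.out.println(found("[a-d1-7]", "8e3"));     // true: 3 lies in the range 1-7
        System.out.println(found("X|Z", "only Z here"));  // true: alternation
    }
}
```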
3.2. Metacharacters
The following metacharacters have a pre-defined meaning and make certain common
patterns easier to use, e.g., \d instead of [0-9].
Regular Expression | Description |
\d | Any digit, short for [0-9] |
\D | A non-digit, short for [^0-9] |
\s | A whitespace character, short for [ \t\n\x0b\r\f] |
\S | A non-whitespace character, short for [^\s] |
\w | A word character, short for [a-zA-Z_0-9] |
\W | A non-word character, short for [^\w] |
\S+ | One or more non-whitespace characters |
\b | Matches a word boundary, where a word character is [a-zA-Z0-9_] |
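To illustrate the metacharacters, here is a small sketch in Java. The class MetacharacterDemo and its helper methods are hypothetical names chosen for this example; note that each backslash is doubled inside the Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetacharacterDemo {

    // Replaces every digit (\d) with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    // Removes every non-digit (\D), keeping only the digits.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits on runs of whitespace (\s+).
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    // Replaces a word only where it is delimited by word boundaries (\b).
    static String replaceWholeWord(String s, String word, String replacement) {
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return s.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Room 42, floor 3";
        System.out.println(maskDigits(text));                               // Room ##, floor #
        System.out.println(digitsOnly(text));                               // 423
        System.out.println(words(text).length);                             // 4
        System.out.println(replaceWholeWord("floor floors", "floor", "level")); // level floors
    }
}
```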
3.3. Quantifier
A quantifier defines how often an element can occur. The symbols ?, *, + and {}
are quantifiers; each specifies how many times the preceding element may repeat.
Regular Expression | Description | Examples |
* | Occurs zero or more times, short for {0,} | X* finds zero or more letters X; .* finds any character sequence |
+ | Occurs one or more times, short for {1,} | X+ finds one or more letters X |
? | Occurs zero or one time, short for {0,1} | X? finds zero or exactly one letter X |
{X} | Occurs exactly X times; {} specifies the repetition count of the preceding element | \d{3} searches for three digits; .{10} for any character sequence of length 10 |
{X,Y} | Occurs between X and Y times | \d{1,4} means \d must occur at least once and at most four times |
*? | ? after a quantifier makes it a reluctant quantifier; it tries to find the smallest match | |
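The difference between the default (greedy) quantifiers and their reluctant variants is easiest to see on a concrete input. The sketch below is an illustration only; QuantifierDemo and firstMatch are hypothetical names, and firstMatch simply returns the first substring that the pattern finds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    // Returns the first substring matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\\d{3}", "id 12345"));    // 123
        System.out.println(firstMatch("\\d{1,4}", "id 12345"));  // 1234
        System.out.println(firstMatch("X+", "aXXXb"));           // XXX

        // Greedy: .+ grabs as much as possible before the last '>'.
        System.out.println(firstMatch("<.+>", "<b>bold</b>"));   // <b>bold</b>
        // Reluctant: .+? stops at the first '>' that completes a match.
        System.out.println(firstMatch("<.+?>", "<b>bold</b>"));  // <b>
    }
}
```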
3.4. Grouping and Backreference
You can group parts of your regular expression. In your pattern you group elements with
parentheses, e.g., (). This allows you to assign a repetition operator to a complete group.
In addition, these groups also create a backreference to the part of the regular
expression. This captures the group. A backreference stores the part of the
String that matched the group, which allows you to use this part in the replacement.
You refer to a group via $: $1 is the first group, $2 the second, etc.
For example, assume you want to remove all whitespace between a word character and a
following period or comma. The period or comma has to be part of the pattern,
yet it should still be included in the result.
// Removes whitespace between a word character and . or ,
String pattern = "(\\w)(\\s+)([\\.,])";
System.out.println(EXAMPLE_TEST.replaceAll(pattern, "$1$3"));
This example extracts the text between the opening and closing title tags by
replacing the whole title element with its content.
// Extract the text between the two title elements
pattern = "(?i)(<title.*?>)(.+?)(</title>)";
String updated = EXAMPLE_TEST.replaceAll(pattern, "$2");
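The two snippets above rely on an EXAMPLE_TEST constant that is not shown in this section. The following self-contained sketch uses hypothetical sample inputs instead; the class name GroupingDemo and its method names are also assumptions made for illustration. It additionally shows Matcher.group(2) as one way to read a captured group directly rather than through a replacement.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupingDemo {

    static final String TITLE_PATTERN = "(?i)(<title.*?>)(.+?)(</title>)";

    // Drops group 2 (the whitespace) and keeps groups 1 and 3.
    static String removeSpaceBeforePunctuation(String s) {
        return s.replaceAll("(\\w)(\\s+)([.,])", "$1$3");
    }

    // Replaces each title element with the text it encloses (group 2).
    static String unwrapTitle(String html) {
        return html.replaceAll(TITLE_PATTERN, "$2");
    }

    // Returns only the text captured by group 2, or null if there is no title.
    static String extractTitle(String html) {
        Matcher m = Pattern.compile(TITLE_PATTERN).matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs, not the original EXAMPLE_TEST.
        String text = "Hello , world . Bye";
        String html = "<html><TITLE id=\"t\">My page</TITLE></html>";

        System.out.println(removeSpaceBeforePunctuation(text)); // Hello, world. Bye
        System.out.println(unwrapTitle(html));                  // <html>My page</html>
        System.out.println(extractTitle(html));                 // My page

        // A quantifier applied to a whole group.
        System.out.println("ababab".matches("(ab)+"));          // true
    }
}
```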
3.5. Negative Lookahead
A negative lookahead lets you exclude a pattern: with it you can state
that a string must not be followed by another string.
Negative lookaheads are defined via (?!pattern). For example, a(?!b) matches "a" only if
that "a" is not followed by "b".
3.6. Backslashes in Java
The backslash \ is an escape character in Java Strings, which means it has a
predefined meaning in Java. You have to write a double backslash \\
to define a single backslash. If you want to define \w, you must write \\w
in your regex. If you want to match a literal backslash, you have
to type \\\\, because \ is also an escape character in regular expressions.
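The two levels of escaping can be checked with a few statements. This is a minimal sketch; BackslashDemo and toForwardSlashes are hypothetical names, and the Windows-style path is made-up sample data.

```java
class BackslashDemo {

    // The Java literal "\\\\" is the two-character regex \\,
    // which matches one literal backslash in the input.
    static String toForwardSlashes(String path) {
        return path.replaceAll("\\\\", "/");
    }

    public static void main(String[] args) {
        // The Java literal "\\w" contains two characters: a backslash and 'w'.
        System.out.println("\\w".length());               // 2
        System.out.println("a1".matches("\\w\\d"));       // true

        String path = "C:\\temp\\file.txt";               // the string C:\temp\file.txt
        System.out.println(toForwardSlashes(path));       // C:/temp/file.txt
    }
}
```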