The
following description is an overview of available meta characters which can be
used in regular expressions. This chapter is supposed to be a references for
the different regex elements.
3.1. Common
matching symbols
|
Regular Expression
|
Description
|
. |
Matches any character
|
^regex |
Finds regex that must match at the beginning of the
line.
|
regex$ |
Finds regex that must match at the end of the line.
|
[abc] |
Set definition, can match the letter a or b or c.
|
[abc][vz] |
Set definition, can match a or b or c followed by
either v or z.
|
[^abc] |
When a caret appears as the first character inside
square brackets, it negates the pattern. This can match any character except
a or b or c.
|
[a-d1-7] |
Ranges: matches a letter between a and d and figures
from 1 to 7, but not d1.
|
X|Z |
Finds X or Z.
|
XZ |
Finds X directly followed by Z.
|
$ |
Checks if a line end follows.
|
3.2. Metacharacters
The
following metacharacters have a pre-defined meaning and make certain common
patterns easier to use, e.g.,
\dinstead
of [0..9].|
Regular Expression
|
Description
|
\d |
Any digit, short for
[0-9] |
\D |
A non-digit, short for
[^0-9] |
\s |
A whitespace character, short for
[ \t\n\x0b\r\f] |
\S |
A non-whitespace character, short for
[^\s] |
\w |
A word character, short for
[a-zA-Z_0-9] |
\W |
A non-word character
[^\w] |
\S+ |
Several non-whitespace characters
|
\b |
Matches a word boundary where a word character is
[a-zA-Z0-9_]. |
3.3. Quantifier
A
quantifier defines how often an element can occur. The symbols ?, *, + and {}
define the quantity of the regular expressions
|
Regular Expression
|
Description
|
Examples
|
* |
Occurs zero or more times, is short for
{0,} |
X* finds no or several letter X,.* finds any character sequence |
+ |
Occurs one or more times, is short for
{1,} |
X+ - Finds one or several letter X |
? |
Occurs no or one times,
? is short for {0,1}. |
X? finds no or exactly one letter X |
{X} |
Occurs X number of times,
{} describes the order of the preceding liberal |
\d{3} searches for three digits, .{10} for any character sequence of length 10. |
{X,Y} |
Occurs between X and Y times,
|
\d{1,4} means \d must occur at least once and at a maximum of four. |
*? |
? after a quantifier makes it a reluctant quantifier.
It tries to find the smallest match. |
|
3.4. Grouping
and Backreference
You can
group parts of your regular expression. In your pattern you group elements with
round brackets, e.g.,
().
This allows you to assign a repetition operator to a complete group.
In
addition these groups also create a backreference to the part of the regular
expression. This captures the group. A backreference stores the part of the
String which matched the group. This allows you
to use this part in the replacement.
Via the
$ you can refer to a group. $1 is the first group, $2 the second, etc.
Let's, for
example, assume you want to replace all whitespace between a letter followed by
a point or a comma. This would involve that the point or the comma is part of
the pattern. Still it should be included in the result.
// Removes whitespace between a word character and . or ,
String pattern = "(\\w)(\\s+)([\\.,])";
System.out.println(EXAMPLE_TEST.replaceAll(pattern, "$1$3"));
This
example extracts the text between a title tag.
// Extract the text between the two title elements
pattern = "(?i)(<title.*?>)(.+?)(</title>)";
String updated = EXAMPLE_TEST.replaceAll(pattern, "$2");
3.5. Negative
Lookahead
Negative
Lookahead provides the possibility to exclude a pattern. With this you can say
that a string should not be followed by another string.
Negative
Lookaheads are defined via
(?!pattern). For example, the following will match "a" if
"a" is not followed by "b".a(?!b)
3.6. Backslashes
in Java
The
backslash
\ is an escape character in Java Strings. That means backslash has a
predefined meaning in Java. You have to use double backslash \\ to define a single backslash. If you want
to define \w, then
you must be using \\w in your regex. If you want to use backslash as a literal, you have
to type \\\\ as \ is also an escape character in regular expressions.
No comments:
Post a Comment