The following
lists typical examples for the usage of regular expressions. I hope you find
similarities to your real-world problems.
6.1. Or
Task:
Write a regular expression which matches a text line if this text line contains
either the word "Joe" or the word "Jim" or both.
Create a
project
de.vogella.regex.eitheror
and
the following class.package de.vogella.regex.eitheror;
import org.junit.Test;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
public class EitherOrCheck {
@Test
public void testSimpleTrue() {
String s = "humbapumpa jim";
assertTrue(s.matches(".*(jim|joe).*"));
s = "humbapumpa jom";
assertFalse(s.matches(".*(jim|joe).*"));
s = "humbaPumpa joe";
assertTrue(s.matches(".*(jim|joe).*"));
s = "humbapumpa joe jim";
assertTrue(s.matches(".*(jim|joe).*"));
}
}
6.2. Phone
number
Task:
Write a regular expression which matches any phone number.
A phone
number in this example consists either out of 7 numbers in a row or out of 3
number, a (white)space or a dash and then 4 numbers.
package de.vogella.regex.phonenumber;
import org.junit.Test;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
public class CheckPhone {
@Test
public void testSimpleTrue() {
String pattern = "\\d\\d\\d([,\\s])?\\d\\d\\d\\d";
String s= "1233323322";
assertFalse(s.matches(pattern));
s = "1233323";
assertTrue(s.matches(pattern));
s = "123 3323";
assertTrue(s.matches(pattern));
}
}
6.3. Check
for a certain number range
The
following example will check if a text contains a number with 3 digits.
Create the
Java project
de.vogella.regex.numbermatch
and
the following class.package de.vogella.regex.numbermatch;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.junit.Test;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
public class CheckNumber {
@Test
public void testSimpleTrue() {
String s= "1233";
assertTrue(test(s));
s= "0";
assertFalse(test(s));
s = "29 Kasdkf 2300 Kdsdf";
assertTrue(test(s));
s = "99900234";
assertTrue(test(s));
}
public static boolean test (String s){
Pattern pattern = Pattern.compile("\\d{3}");
Matcher matcher = pattern.matcher(s);
if (matcher.find()){
return true;
}
return false;
}
}
6.4. Building
a link checker
The
following example allows you to extract all valid links from a webpage. It does
not consider links which start with "javascript:" or
"mailto:".
Create a
Java project called de.vogella.regex.weblinks and the following class:
package de.vogella.regex.weblinks;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LinkGetter {
private Pattern htmltag;
private Pattern link;
public LinkGetter() {
htmltag = Pattern.compile("<a\\b[^>]*href=\"[^>]*>(.*?)</a>");
link = Pattern.compile("href=\"[^>]*\">");
}
public List<String> getLinks(String url) {
List<String> links = new ArrayList<String>();
try {
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
String s;
StringBuilder builder = new StringBuilder();
while ((s = bufferedReader.readLine()) != null) {
builder.append(s);
}
Matcher tagmatch = htmltag.matcher(builder.toString());
while (tagmatch.find()) {
Matcher matcher = link.matcher(tagmatch.group());
matcher.find();
String link = matcher.group().replaceFirst("href=\"", "")
.replaceFirst("\">", "")
.replaceFirst("\"[\\s]?target=\"[a-zA-Z_0-9]*", "");
if (valid(link)) {
links.add(makeAbsolute(url, link));
}
}
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return links;
}
private boolean valid(String s) {
if (s.matches("javascript:.*|mailto:.*")) {
return false;
}
return true;
}
private String makeAbsolute(String url, String link) {
if (link.matches("http://.*")) {
return link;
}
if (link.matches("/.*") && url.matches(".*$[^/]")) {
return url + "/" + link;
}
if (link.matches("[^/].*") && url.matches(".*[^/]")) {
return url + "/" + link;
}
if (link.matches("/.*") && url.matches(".*[/]")) {
return url + link;
}
if (link.matches("/.*") && url.matches(".*[^/]")) {
return url + link;
}
throw new RuntimeException("Cannot make the link absolute. Url: " + url
+ " Link " + link);
}
}
6.5. Finding duplicated words
The
following regular expression matches duplicated words.
\b(\w+)\s+\1\b
\b
is a word boundary and \1
references to the captured match of the
first group, i.e., the first word.
The
(?!-in)\b(\w+) \1\b
finds duplicate words if they do not start
with "-in".
Tip
Add
(?s)
to search across multiple lines.
6.6. Finding an
elements which start in a new line
The following
regular expression allows you to find the "title" word, in case it
starts in a new line, potentially with leading spaces.
(\n\s*)title
No comments:
Post a Comment