Monday, 5 December 2016

Chapter 6.4: Java Regex Examples - Phone number, Check for a certain number range, Building a link checker, Finding elements which start on a new line


The following sections list typical examples of regular expression usage. I hope you find similarities to your real-world problems.

6.1. Or

Task: Write a regular expression which matches a line of text if it contains the word "joe", the word "jim", or both. Matching is case-sensitive by default, which is why the test uses lowercase names.
Create a project de.vogella.regex.eitheror and the following class.
package de.vogella.regex.eitheror;
 
import org.junit.Test;
 
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
 
public class EitherOrCheck {
  @Test
  public void testSimpleTrue() {
    String s = "humbapumpa jim";
    assertTrue(s.matches(".*(jim|joe).*"));
    s = "humbapumpa jom";
    assertFalse(s.matches(".*(jim|joe).*"));
    s = "humbaPumpa joe";
    assertTrue(s.matches(".*(jim|joe).*"));
    s = "humbapumpa joe jim";
    assertTrue(s.matches(".*(jim|joe).*"));
  }
} 
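
The pattern above is case-sensitive and also accepts longer words that merely contain a name, such as "jimmy". A minimal sketch of a stricter check, assuming whole-word and case-insensitive matching is what you want (the class and method names are illustrative and not part of the project above):

```java
import java.util.regex.Pattern;

class EitherOrSketch {
  // \b anchors both alternatives to word boundaries;
  // CASE_INSENSITIVE accepts "Joe" as well as "joe".
  private static final Pattern NAME =
      Pattern.compile("\\b(jim|joe)\\b", Pattern.CASE_INSENSITIVE);

  // find() searches anywhere in the line, so no surrounding ".*" is needed.
  static boolean containsName(String line) {
    return NAME.matcher(line).find();
  }

  public static void main(String[] args) {
    System.out.println(containsName("humbapumpa Joe"));   // true
    System.out.println(containsName("humbapumpa jimmy")); // false
    System.out.println(containsName("humbapumpa jom"));   // false
  }
}
```

Compiling the pattern once and reusing it avoids recompiling the expression on every call, which is what String.matches does internally.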

6.2. Phone number

Task: Write a regular expression which matches any phone number.
A phone number in this example consists either of 7 digits in a row, or of 3 digits, a whitespace character or a dash, and then 4 digits.
package de.vogella.regex.phonenumber;
 
import org.junit.Test;
 
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
 
 
public class CheckPhone {
  
  @Test
  public void testSimpleTrue() {
    // three digits, an optional dash or whitespace character, then four digits
    String pattern = "\\d\\d\\d([-\\s])?\\d\\d\\d\\d";
    String s = "1233323322";
    assertFalse(s.matches(pattern));
    s = "1233323";
    assertTrue(s.matches(pattern));
    s = "123 3323";
    assertTrue(s.matches(pattern));
  }
} 
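
The same check can be written more compactly with counted quantifiers. The following is a sketch under the phone number definition given above; the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

class PhoneSketch {
  // \d{3} and \d{4} replace the repeated \d tokens;
  // [-\s]? allows one optional dash or whitespace character in between.
  private static final Pattern PHONE = Pattern.compile("\\d{3}[-\\s]?\\d{4}");

  static boolean isPhone(String s) {
    // matches() requires the whole input to fit the pattern.
    return PHONE.matcher(s).matches();
  }

  public static void main(String[] args) {
    System.out.println(isPhone("1233323"));    // true
    System.out.println(isPhone("123-3323"));   // true
    System.out.println(isPhone("123 3323"));   // true
    System.out.println(isPhone("1233323322")); // false
  }
}
```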

6.3. Check for a certain number range

The following example checks whether a text contains at least three consecutive digits.
Create the Java project de.vogella.regex.numbermatch and the following class.
package de.vogella.regex.numbermatch;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
import org.junit.Test;
 
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
 
public class CheckNumber {
 
  
  @Test
  public void testSimpleTrue() {
    String s = "1233";
    assertTrue(test(s));
    s = "0";
    assertFalse(test(s));
    s = "29 Kasdkf 2300 Kdsdf";
    assertTrue(test(s));
    s = "99900234";
    assertTrue(test(s));
  }
  
 
  
  
  // Returns true if the text contains at least three consecutive digits.
  public static boolean test(String s) {
    Pattern pattern = Pattern.compile("\\d{3}");
    Matcher matcher = pattern.matcher(s);
    // find() looks for the pattern anywhere in the input
    return matcher.find();
  }
 
} 
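
Note that \d{3} with find() also accepts longer digit runs such as "1233". If the goal is a number with exactly three digits, one option is to add lookarounds that forbid a digit on either side. This is a sketch of that variant, not the check used in the class above, and the names are illustrative:

```java
import java.util.regex.Pattern;

class ThreeDigitSketch {
  // (?<!\d) : no digit directly before the match
  // (?!\d)  : no digit directly after the match
  private static final Pattern THREE_DIGITS =
      Pattern.compile("(?<!\\d)\\d{3}(?!\\d)");

  static boolean hasThreeDigitNumber(String s) {
    return THREE_DIGITS.matcher(s).find();
  }

  public static void main(String[] args) {
    System.out.println(hasThreeDigitNumber("Room 101"));             // true
    System.out.println(hasThreeDigitNumber("1233"));                 // false
    System.out.println(hasThreeDigitNumber("29 Kasdkf 2300 Kdsdf")); // false
  }
}
```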

6.4. Building a link checker

The following example allows you to extract all valid links from a webpage. It does not consider links which start with "javascript:" or "mailto:".
Create a Java project called de.vogella.regex.weblinks and the following class:
package de.vogella.regex.weblinks;
 
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class LinkGetter {
  private Pattern htmltag;
  private Pattern link;
 
  public LinkGetter() {
    htmltag = Pattern.compile("<a\\b[^>]*href=\"[^>]*>(.*?)</a>");
    link = Pattern.compile("href=\"[^>]*\">");
  }
 
  public List<String> getLinks(String url) {
    List<String> links = new ArrayList<String>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(new URL(url).openStream()))) {
      String s;
      StringBuilder builder = new StringBuilder();
      while ((s = bufferedReader.readLine()) != null) {
        builder.append(s);
      }
 
      Matcher tagmatch = htmltag.matcher(builder.toString());
      while (tagmatch.find()) {
        Matcher matcher = link.matcher(tagmatch.group());
        // skip anchors whose href does not fit the link pattern;
        // calling group() without a successful find() would throw
        if (!matcher.find()) {
          continue;
        }
        // strip the surrounding href="..." and an optional target attribute
        String href = matcher.group().replaceFirst("href=\"", "")
            .replaceFirst("\">", "")
            .replaceFirst("\"[\\s]?target=\"[a-zA-Z_0-9]*", "");
        if (valid(href)) {
          links.add(makeAbsolute(url, href));
        }
      }
    } catch (MalformedURLException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
    return links;
  }
 
  private boolean valid(String s) {
    if (s.matches("javascript:.*|mailto:.*")) {
      return false;
    }
    return true;
  }
 
  private String makeAbsolute(String url, String link) {
    // already absolute (http or https)
    if (link.matches("https?://.*")) {
      return link;
    }
    // exactly one slash at the join: concatenate as-is
    if (link.matches("/.*") && url.matches(".*[^/]")) {
      return url + link;
    }
    if (link.matches("[^/].*") && url.matches(".*/")) {
      return url + link;
    }
    // no slash at the join: insert one
    if (link.matches("[^/].*") && url.matches(".*[^/]")) {
      return url + "/" + link;
    }
    // a slash on both sides: drop the duplicate
    if (link.matches("/.*") && url.matches(".*/")) {
      return url + link.substring(1);
    }
    throw new RuntimeException("Cannot make the link absolute. Url: " + url
        + " Link " + link);
  }
} 
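
The makeAbsolute method above joins the page URL and the link by string concatenation. As an alternative, and not what the class above does, the JDK's java.net.URI can resolve a link against the address of the page the way a browser would. A minimal sketch, with an illustrative class name and example.com addresses chosen only for the demonstration:

```java
import java.net.URI;

class ResolveSketch {
  // Resolves a link found on a page against the URL of that page.
  // URI.create throws IllegalArgumentException for malformed input,
  // for example a link that contains unescaped spaces.
  static String makeAbsolute(String pageUrl, String link) {
    return URI.create(pageUrl).resolve(link).toString();
  }

  public static void main(String[] args) {
    String page = "http://example.com/docs/index.html";
    // relative link: resolved against the directory of the page
    System.out.println(makeAbsolute(page, "page.html"));
    // prints http://example.com/docs/page.html

    // link starting with "/": resolved against the host
    System.out.println(makeAbsolute(page, "/about"));
    // prints http://example.com/about

    // absolute link: returned unchanged
    System.out.println(makeAbsolute(page, "https://example.org/x"));
    // prints https://example.org/x
  }
}
```

The results differ from plain concatenation: a relative link replaces the last path segment of the page URL instead of being appended to it.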

6.5. Finding duplicated words

The following regular expression matches duplicated words.
\b(\w+)\s+\1\b 
\b is a word boundary and \1 refers to the text captured by the first group, i.e., the first word.
The expression (?!-in)\b(\w+) \1\b finds duplicate words if they do not start with "-in".

Tip

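Add (?s) if a pattern contains a . that should also match line breaks. The duplicated-words expression itself needs no flag to search across lines, because \s+ already matches line breaks.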
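
In Java the duplicated-words expression can be applied with a Matcher, reading the repeated word from group 1. The following is a small sketch; the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWordsSketch {
  // Backslashes are doubled because the pattern is a Java string literal.
  private static final Pattern DUPLICATE =
      Pattern.compile("\\b(\\w+)\\s+\\1\\b");

  // Collects every word that is immediately followed by itself.
  static List<String> findDuplicates(String text) {
    List<String> words = new ArrayList<>();
    Matcher m = DUPLICATE.matcher(text);
    while (m.find()) {
      words.add(m.group(1)); // group 1 holds the repeated word
    }
    return words;
  }

  public static void main(String[] args) {
    String text = "This is is a test\ntest of the the regex";
    System.out.println(findDuplicates(text)); // [is, test, the]
  }
}
```

The pair "test\ntest" is found although the two words sit on different lines, since \s+ matches the line break.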

6.6. Finding elements which start on a new line

The following regular expression finds the word "title" when it starts on a new line, possibly preceded by whitespace.
(\n\s*)title 
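
Because the expression requires a preceding \n, it cannot match a "title" on the very first line of the input. The sketch below counts matches for the expression above and for a variant that uses the MULTILINE flag, where ^ matches at the start of every line. The variant and all names are assumptions added for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleLineSketch {
  // The expression from the text: a line break, optional whitespace, "title".
  static final Pattern AFTER_NEWLINE = Pattern.compile("(\\n\\s*)title");

  // Variant: with MULTILINE, ^ also matches at the start of the input,
  // so a "title" on the first line is found as well.
  static final Pattern LINE_START =
      Pattern.compile("^[ \\t]*title", Pattern.MULTILINE);

  static int count(Pattern pattern, String text) {
    Matcher m = pattern.matcher(text);
    int n = 0;
    while (m.find()) {
      n++;
    }
    return n;
  }

  public static void main(String[] args) {
    String text = "title one\n  title two\nsubtitle\nno title here";
    System.out.println(count(AFTER_NEWLINE, text)); // 1
    System.out.println(count(LINE_START, text));    // 2
  }
}
```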
                                                            

 
