Friday, October 9, 2015

Using Java regular expressions with multi-line strings

Two points are worth noting when using Java regular expressions to parse a multi-line string.

  • The period "." does NOT match a line break unless you set an extra flag on your Pattern instance.
  • The "^" and "$" anchors mean different things in single-line (default) and multi-line mode.

1. period "." does not match line break

The period "." is usually used to match anything, but by default it doesn't match line terminators such as "\r" and "\n". The Pattern.DOTALL flag must be set explicitly to make it match them. This matters when you use matches(). The String, Pattern and Matcher classes all have a matches() method, which tries to match the whole input string against the regex pattern. Since a pattern like ".*" stops at "\n", the whole-string match fails and you may get an unexpected result.

String input = "11abc22\n33abc44";  // multi-line input
String reg = ".*abc.*";

Pattern p0 = Pattern.compile(reg);
Matcher m0 = p0.matcher(input);
System.out.println(m0.matches()); // print false

Pattern p1 = Pattern.compile(reg,Pattern.DOTALL); // set DOTALL flag
Matcher m1 = p1.matcher(input);
System.out.println(m1.matches()); // print true

The input is a multi-line string. Since "\n" is not matched by ".*" by default, the first call, m0.matches(), returns false.

To make the period "." also match line breaks, add the Pattern.DOTALL flag to the pattern.
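As a small sketch of the point about matches(), the class below runs the same whole-input check through String.matches(), Pattern.matches() and Matcher.matches(). It also uses the embedded flag "(?s)", which is the inline form of Pattern.DOTALL. The class name DotallMatchesDemo and its helper methods are made up for illustration.

```java
import java.util.regex.Pattern;

public class DotallMatchesDemo {
    // Same multi-line input as in the snippet above
    static final String INPUT = "11abc22\n33abc44";

    // All three helpers require the WHOLE input to match the regex
    static boolean viaString(String regex) {
        return INPUT.matches(regex);
    }

    static boolean viaPattern(String regex) {
        return Pattern.matches(regex, INPUT);
    }

    static boolean viaMatcher(String regex) {
        return Pattern.compile(regex).matcher(INPUT).matches();
    }

    public static void main(String[] args) {
        String plain = ".*abc.*";       // "." stops at "\n"
        String dotall = "(?s).*abc.*";  // "(?s)" turns on DOTALL inside the regex

        System.out.println(viaString(plain));    // false
        System.out.println(viaPattern(plain));   // false
        System.out.println(viaMatcher(plain));   // false

        System.out.println(viaString(dotall));   // true
        System.out.println(viaPattern(dotall));  // true
        System.out.println(viaMatcher(dotall));  // true
    }
}
```

The embedded flag is handy with String.matches() and Pattern.matches(), which take the regex as a plain string and offer no parameter for flags.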

2. Meaning of "^" and "$" in single and multi line mode


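By default, "^" and "$" treat the whole input as a single line; this post calls that single-line mode (not to be confused with Pattern.DOTALL, which Perl calls single-line mode). Use Pattern.MULTILINE to turn on multi-line mode.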

The meaning of "^" and "$" in single-line and multi-line mode:

     In default mode                          In multi-line mode
^    The beginning of the whole input string  The beginning of every line
$    The end of the whole input string        The end of every line
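The table can be checked with a short sketch that inserts a marker wherever "^" or "$" matches. The class LineAnchorDemo, its two helpers and the sample text "a\nb" are hypothetical and only serve this illustration.

```java
import java.util.regex.Pattern;

public class LineAnchorDemo {
    // Insert "> " at every position where "^" matches under the given flags
    static String markStarts(String text, int flags) {
        return Pattern.compile("^", flags).matcher(text).replaceAll("> ");
    }

    // Insert ";" at every position where "$" matches under the given flags
    static String markEnds(String text, int flags) {
        return Pattern.compile("$", flags).matcher(text).replaceAll(";");
    }

    public static void main(String[] args) {
        String text = "a\nb"; // two lines, no trailing line break

        // Default mode (flags = 0): anchors apply to the whole input
        System.out.println(markStarts(text, 0)); // "> a\nb"
        System.out.println(markEnds(text, 0));   // "a\nb;"

        // Multi-line mode: anchors apply to every line
        System.out.println(markStarts(text, Pattern.MULTILINE)); // "> a\n> b"
        System.out.println(markEnds(text, Pattern.MULTILINE));   // "a;\nb;"
    }
}
```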

Let's look at the following demo code, RegexMultiline.java, for a better understanding.

package com.shengwang.demo;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMultiline {
    // ##x=6##
    // ##x=8##
    static String input = "##x=6##\n##x=8##";

    public static void main(String[] args) {

        String reg = "^.*x=(\\d+).*$";

        // case 1
        System.out.println("--> single line mode"); // default single-line
        Pattern p1 = Pattern.compile(reg);
        searchPatternInMultiLineText(p1);

        // case 2
        System.out.println("--> single line mode with DOTALL"); // single-line + dotall
        Pattern p2 = Pattern.compile(reg, Pattern.DOTALL);
        searchPatternInMultiLineText(p2);

        // case 3
        System.out.println("--> multi line mode");
        Pattern p3 = Pattern.compile(reg, Pattern.MULTILINE); // multi-line mode
        searchPatternInMultiLineText(p3);
    }

    // Print group 1 of every match found in the input, or a notice if there is none
    public static void searchPatternInMultiLineText(Pattern p) {
        boolean isFound = false;

        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println("x=" + m.group(1));
            isFound = true;
        }

        if (!isFound) {
            System.out.println("No pattern found");
        }
    }
}

Run this demo and the result is:

--> single line mode
No pattern found
--> single line mode with DOTALL
x=8
--> multi line mode
x=6
x=8

Let's go through the demo code. The three cases try to match the same input text with the same pattern, "^.*x=(\\d+).*$". The only difference among them is the flags on the pattern instances.


Case 1: the pattern is in single-line mode (the default, with no extra flags). The ^ and $ match the beginning and end of the whole input text. Since "\n" is not matched by ".*", and in fact nothing in the pattern matches "\n", the match fails.


Case 2: single-line mode with the DOTALL flag. ".*" can now cover "\n". The ^ and $ still match the beginning and end of the whole input text. Because matching is greedy by default, the first ".*" consumes as much as possible, so the whole input yields only one match, and the group captures the last value, 8.


Case 3: multi-line mode. The ^ and $ match the beginning and end of every line. Since there are 2 lines in the input string, there are 2 matches, one for each line.
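The greedy matching behind Case 2 can be seen by swapping the first ".*" for the reluctant ".*?". This is only a sketch: the class GreedyVsReluctantDemo and the helper firstCapture are invented names, and the reluctant pattern is not part of the original demo.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctantDemo {
    // Same input as RegexMultiline
    static final String INPUT = "##x=6##\n##x=8##";

    // Compile with DOTALL (as in Case 2) and return group 1 of the first match, or null
    static String firstCapture(String regex) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(INPUT);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy ".*" runs to the end and backtracks to the LAST "x="
        System.out.println(firstCapture("^.*x=(\\d+).*$"));  // 8

        // Reluctant ".*?" grows one character at a time and stops at the FIRST "x="
        System.out.println(firstCapture("^.*?x=(\\d+).*$")); // 6
    }
}
```

Either way there is still a single match in this mode, because "^" can only match at the start of the input.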


Be careful when matching multi-line input text with Java regular expressions: the same input and the same pattern can give different results in different modes.
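To see how much the flags matter, the sketch below collects the captured values for the demo pattern under each flag combination, including MULTILINE and DOTALL together, which the demo does not cover. The class FlagComparisonDemo, the helper captures and the anchor-free pattern "x=(\\d+)" are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagComparisonDemo {
    // Same input as RegexMultiline
    static final String INPUT = "##x=6##\n##x=8##";

    // Return group 1 of every match found with the given regex and flags
    static List<String> captures(String regex, int flags) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(INPUT);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String reg = "^.*x=(\\d+).*$";

        System.out.println(captures(reg, 0));                 // []
        System.out.println(captures(reg, Pattern.DOTALL));    // [8]
        System.out.println(captures(reg, Pattern.MULTILINE)); // [6, 8]

        // With both flags, the greedy ".*" crosses the line break again
        System.out.println(captures(reg, Pattern.MULTILINE | Pattern.DOTALL)); // [8]

        // Without anchors or ".*", the flags make no difference here
        System.out.println(captures("x=(\\d+)", 0)); // [6, 8]
    }
}
```

When only the captured values are needed, a pattern without "^", "$" and "." avoids the mode question entirely.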

