String string = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<Request>\n" +
" <Item>\n" +
" <Type>C0401</Type>\n" +
" <InvDate>20150301</InvDate>\n" +
" <No>PK1000000</No>\n" +
" </Item>\n" +
" <Item>\n" +
" <Type>C0401</Type>\n" +
" <InvDate>20150301</InvDate>\n" +
" <No>PK1000002</No>\n" +
" </Item>\n" +
"</Request>";
Pattern pattern = Pattern.compile("(<Item>)(.*)(</Item>)");
Matcher matcher = pattern.matcher(string);
List<String> listMatches = new ArrayList<String>();
while(matcher.find())
{
listMatches.add(matcher.group(2));
}
If I replace Item with Type or InvDate or No, I can get the content.
I am looking for an answer. Thanks.
You have to use the option Pattern.DOTALL for multiline matches:
Pattern pattern = Pattern.compile("(<Item>)(.*?)(</Item>)",Pattern.DOTALL);
The quantifier *? is lazy: it matches between zero and unlimited times, as few times as possible, expanding as needed.
You have to use the option Pattern.DOTALL for multiline matches:
Pattern pattern = Pattern.compile("(<Item>)(.*)(</Item>)",Pattern.DOTALL);
But it is better to use an XML parser.
You need to use the DOTALL flag to make the dot match any character, including newlines:
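As a minimal sketch of the parser route, assuming the JDK's built-in DOM API (javax.xml.parsers) is acceptable and that the element names match the sample in the question. The class ItemParser and the helper readItemNumbers are hypothetical names used only for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ItemParser {
    // Hypothetical helper: returns the text of the first <No> child of every <Item>.
    static List<String> readItemNumbers(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList items = doc.getElementsByTagName("Item");
            List<String> numbers = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                numbers.add(item.getElementsByTagName("No").item(0).getTextContent().trim());
            }
            return numbers;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<Request>\n"
                + " <Item><Type>C0401</Type><InvDate>20150301</InvDate><No>PK1000000</No></Item>\n"
                + " <Item><Type>C0401</Type><InvDate>20150301</InvDate><No>PK1000002</No></Item>\n"
                + "</Request>";
        System.out.println(readItemNumbers(xml));
    }
}
```

Unlike the regex, this does not depend on whitespace or line breaks inside the elements.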
Pattern pattern = Pattern.compile("(?s)(<Item>)(.*)(</Item>)");
Or else:
Pattern pattern = Pattern.compile("(<Item>)(.*)(</Item>)", Pattern.DOTALL);
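Note that with DOTALL a greedy .* runs from the first <Item> to the last </Item>, so both items come back as one match, while the lazy .*? stops at the nearest closing tag. A minimal sketch of the difference follows; the class GreedyVsLazy and the helper itemBodies are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Collects group 2 of every match of the given regex, with DOTALL enabled.
    static List<String> itemBodies(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        List<String> bodies = new ArrayList<>();
        while (m.find()) {
            bodies.add(m.group(2));
        }
        return bodies;
    }

    public static void main(String[] args) {
        String input = "<Item>\nA\n</Item>\n<Item>\nB\n</Item>";
        // Greedy: a single match spanning both items.
        System.out.println(itemBodies("(<Item>)(.*)(</Item>)", input).size());
        // Lazy: one match per item.
        System.out.println(itemBodies("(<Item>)(.*?)(</Item>)", input).size());
    }
}
```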
I need to match sequence='999' inside a <noteinfo> tag in an XML document using Java RegEx (an XML parser is not an option).
Snippet of the xml:
<xmltag sequence='11'>
<noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='3'/>
</xmltag>
I am using this: (?<=<noteinfo.*)sequence='[0-9999]'(?=/>)
I am expecting a match on this: sequence='3'
Getting error: java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length
I understand the issue is with the .* in the look-behind part. Any alternatives to avoid the error?
Never use a look-behind unless it is absolutely necessary.
You can bound the length of a look-behind with curly braces, e.g. {1,255}.
Your problem is solvable without the use of a lookbehind:
static final Pattern seqpat = Pattern.compile( "<noteinfo[^>]+(?<seq>sequence\\s*=\\s*'[\\d]*')", Pattern.MULTILINE );
read through the file with:
Matcher m = seqpat.matcher( s );
while( m.find() )
System.err.println( m.group( "seq" ) );
Pattern.MULTILINE has no effect here, because it only changes how ^ and $ behave and the pattern uses neither; a noteinfo tag wrapped across lines is still handled because [^>]+ also matches newlines
seqpat finds (not matches!) any <noteinfo tag that contains a sequence attribute before its closing >
the requested sequence is captured in group( "seq" )
you may have to deal with spaces or newlines between sequence, = and the sequence id '3', hence \\s*=\\s*
the above Pattern finds each sequence id (even an empty one)
to find only the '999' sequence id, use this Pattern:
Pattern.compile( "<noteinfo[^>]+(?<seq>sequence\\s*=\\s*'999')", Pattern.MULTILINE );
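For completeness, the bounded look-behind mentioned at the start of this answer can be sketched as follows. It assumes that the attributes between <noteinfo and sequence never exceed 255 characters, which is an arbitrary limit chosen for the example; the class BoundedLookbehind and the helper findSequences are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehind {
    // {1,255} gives the look-behind a maximum length, which Java requires.
    // [^>] keeps the look-behind inside a single tag.
    static final Pattern SEQ =
            Pattern.compile("(?<=<noteinfo[^>]{1,255})sequence='\\d+'");

    // Hypothetical helper: returns every sequence attribute found inside a noteinfo tag.
    static List<String> findSequences(String xml) {
        Matcher m = SEQ.matcher(xml);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<xmltag sequence='11'>\n"
                + "<noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='3'/>\n"
                + "</xmltag>";
        System.out.println(findSequences(xml));
    }
}
```

The sequence attribute on xmltag is skipped because no <noteinfo precedes it within the same tag.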
My guess is that you might want to design an expression similar to:
(?=<noteinfo).*(sequence='[0-9]'|sequence='[1-9][0-9]{0,3}')
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "(?=<noteinfo).*(sequence='[0-9]'|sequence='[1-9][0-9]{0,3}')";
final String string = "<xmltag sequence='11'>\n"
+ " <noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='3'/>\n"
+ "</xmltag>\n"
+ "<xmltag sequence='11'>\n"
+ " <noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='9999'/>\n"
+ "</xmltag>\n"
+ "<xmltag sequence='11'>\n"
+ " <noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='10000'/>\n"
+ "</xmltag>\n"
+ "<xmltag sequence='11'>\n"
+ " <noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='-1'/>\n"
+ "</xmltag>";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
I am trying to parse names out of the following samples:
++++++++++++++++++SELIZABETH+COLLAZO+++++++++++++++++++
+++++++++++++++++++PALOMA+CORREA+++++++++++++++++++++++
+++++++++++++++++++NOAH+BLAKEMORE++++++++++++++++++++++
I've tried
//++(.*?)+(.*?)//++
but that's way off.
Would like to parse out the first and last name to two strings.
You can use the regex (\w+)\+(\w+) or \+{2,}(.*?)\+(.*?)\+{2,} with Pattern like this:
String str = "++++++++++++++++++SELIZABETH+COLLAZO+++++++++++++++++++\n"
+ "+++++++++++++++++++PALOMA+CORREA+++++++++++++++++++++++\n"
+ "+++++++++++++++++++NOAH+BLAKEMORE++++++++++++++++++++++";
Pattern pattern = Pattern.compile("(\\w+)\\+(\\w+)");// or instead "\\+{2,}(.*?)\\+(.*?)\\+{2,}"
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1) + " " + matcher.group(2));
}
Outputs
SELIZABETH COLLAZO
PALOMA CORREA
NOAH BLAKEMORE
I need to extract the values after :70: in the following text file using RegEx. Value may contain line breaks as well.
My current solution is to extract the string between :70: and : but this always returns only one match, the whole text between the first :70: and last :.
:32B:xxx,
:59:yyy
something
:70:ACK1
ACK2
:21:something
:71A:something
:23E:something
value
:70:ACK2
ACK3
:71A:something
How can I achieve this using Java? Ideally I want to iterate through all values, i.e.
ACK1\nACK2,
ACK2\nACK3
Thanks :)
Edit: What I'm doing right now,
Pattern pattern = Pattern.compile("(?<=:70:)(.*)(?=\n)", Pattern.DOTALL);
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
System.out.println(matcher.group());
}
Try this.
String data = ""
+ ":32B:xxx,\n"
+ ":59:yyy\n"
+ "something\n"
+ ":70:ACK1\n"
+ "ACK2\n"
+ ":21:something\n"
+ ":71A:something\n"
+ ":23E:something\n"
+ "value\n"
+ ":70:ACK2\n"
+ "ACK3\n"
+ ":71A:something\n";
Pattern pattern = Pattern.compile(":70:(.*?)\\s*:", Pattern.DOTALL);
Matcher matcher = pattern.matcher(data);
while (matcher.find())
System.out.println("found="+ matcher.group(1));
result:
found=ACK1
ACK2
found=ACK2
ACK3
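The pattern above stops at the first colon it sees, so a value that itself contains a colon would be cut short. A variant that only treats a colon at the start of a line as the next tag could look like the sketch below. It assumes every tag begins at the start of a line, as in the sample; the class Field70 and the helper values are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Field70 {
    // DOTALL lets . cross line breaks; MULTILINE lets ^ match at each line start.
    // The value ends before the next line that starts with ':' or at end of input.
    static final Pattern FIELD_70 = Pattern.compile(
            "^:70:(.*?)\\s*(?=^:|\\z)", Pattern.DOTALL | Pattern.MULTILINE);

    // Hypothetical helper: returns the value of every :70: field.
    static List<String> values(String data) {
        Matcher m = FIELD_70.matcher(data);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String data = ":59:yyy\nsomething\n:70:ACK1\nACK2\n:21:something\n"
                + ":70:ACK2\nACK3\n:71A:something\n";
        for (String value : values(data)) {
            System.out.println("found=" + value);
        }
    }
}
```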
You need a loop to do this.
Pattern p = Pattern.compile(regexPattern);
List<String> list = new ArrayList<String>();
Matcher m = p.matcher(input);
while (m.find()) {
list.add(m.group());
}
As seen in "Create array of regex matches".
For some reason this piece of Java code is giving me overlapping matches:
Pattern pat = Pattern.compile("(" + leftContext + ")" + ".*" + "(" + rightContext + ")", Pattern.DOTALL);
Is there any way or option to avoid detecting overlaps? For example,
leftContext rightContext rightContext
should be 1 match instead of 2.
Here's the complete code:
public static String replaceWithContext(String input, String leftContext, String rightContext, String newString){
Pattern pat = Pattern.compile("(" + leftContext + ")" + ".*" + "(" + rightContext + ")", Pattern.DOTALL);
Matcher matcher = pat.matcher(input);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(buffer, "");
buffer.append(matcher.group(1) + newString + matcher.group(2));
}
matcher.appendTail(buffer);
return buffer.toString();
}
So here's the final answer using a negative lookahead, my bad for not realizing * was greedy:
Pattern pat = Pattern.compile("(" +
leftContext + ")" + "(?:(?!" +
rightContext + ").)*" + "(" +
rightContext + ")", Pattern.DOTALL);
Your use of the word "overlapping" is confusing. Apparently, what you meant was that the regex is too greedy, matching everything from the first leftContext to the last rightContext. It seems you figured that out already--and came up with a better approach as well--but there's still at least one potential problem.
You said leftContext and rightContext are "plain Strings", by which I assume you meant they aren't supposed to be interpreted as regexes, but they will be. You need to escape them, or any regex metacharacters they contain will cause incorrect results or run-time exceptions. The same goes for your replacement string, although only $ and the backslash have special meanings there. Here's an example (notice the non-greedy .*?, too):
public static String replaceWithContext(String input, String leftContext, String rightContext, String newString){
String lcRegex = Pattern.quote(leftContext);
String rcRegex = Pattern.quote(rightContext);
String replace = Matcher.quoteReplacement(newString);
Pattern pat = Pattern.compile("(" + lcRegex + ").*?(" + rcRegex + ")", Pattern.DOTALL);
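Filled out into a complete method, the escaped version might look like the following sketch. It keeps the appendReplacement/appendTail loop from the question and assumes the contexts and the replacement are all meant literally; the wrapper class ContextReplace is a hypothetical name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextReplace {
    public static String replaceWithContext(String input, String leftContext,
                                            String rightContext, String newString) {
        // Quote the contexts so regex metacharacters in them are taken literally.
        String lcRegex = Pattern.quote(leftContext);
        String rcRegex = Pattern.quote(rightContext);
        // Quote the replacement so $ and \ in it are taken literally.
        String replace = Matcher.quoteReplacement(newString);
        Pattern pat = Pattern.compile("(" + lcRegex + ").*?(" + rcRegex + ")", Pattern.DOTALL);
        Matcher matcher = pat.matcher(input);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            // Keep both contexts, swap only the text between them.
            matcher.appendReplacement(buffer, "$1" + replace + "$2");
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        // "[" and "]" would be invalid or misread as regex without Pattern.quote.
        System.out.println(replaceWithContext("a [x] b [y] c", "[", "]", "$1"));
    }
}
```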
One other thing: if you aren't doing any post-match processing on the matched text, you can use replaceAll instead of rolling your own with appendReplacement and appendTail:
return input.replaceAll("(?s)(" + lcRegex + ")" +
"(?:(?!" + rcRegex + ").)*" +
"(" + rcRegex + ")",
"$1" + replace + "$2");
There are a few possibilities, depending on what you really need.
You can append $ at the end of your regex, like this:
"(" + leftContext + ")" + ".*" + "(" + rightContext + ")$"
so if rightContext isn't the last thing, your regex won't match.
Next, you can capture everything after rightContext:
"(" + leftContext + ")" + ".*" + "(" + rightContext + ")(.*)"
and after that discard everything in your third matching group.
But, since we don't know what leftContext and rightContext really are, maybe your problem lies within them.
I am trying to match a pattern like '#(a-zA-Z0-9)+' but not one like 'abc#test'.
So this is what I tried:
Pattern MY_PATTERN
= Pattern.compile("\\s#(\\w)+\\s?");
String data = "abc#gere.com #gogasig #jytaz #tibuage";
Matcher m = MY_PATTERN.matcher(data);
StringBuffer sb = new StringBuffer();
boolean result = m.find();
while(result) {
System.out.println (" group " + m.group());
result = m.find();
}
But I only see '#jytaz', not '#tibuage'.
How can I fix my problem? Thank you.
This pattern should work: \B(#\w+)
The \B requires a non-word boundary in front of the #. The \w+ already excludes the trailing space. I've also moved the parentheses so that the # and the + quantifier fall inside the group. You should preferably use m.group(1) to get it.
Here's the rewrite:
Pattern pattern = Pattern.compile("\\B(#\\w+)");
String data = "abc#gere.com #gogasig #jytaz #tibuage";
Matcher m = pattern.matcher(data);
while (m.find()) {
System.out.println(" group " + m.group(1));
}
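One reason the original pattern can skip neighbouring tags is that its optional trailing \s? consumes the space which the next tag needs for its leading \s. The sketch below runs both patterns over the same data so the results can be compared side by side; the class HashTags and the helper find are hypothetical names used only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashTags {
    // Returns every full match of the regex, with surrounding whitespace trimmed.
    static List<String> find(String regex, String data) {
        Matcher m = Pattern.compile(regex).matcher(data);
        List<String> tags = new ArrayList<>();
        while (m.find()) {
            tags.add(m.group().trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        String data = "abc#gere.com #gogasig #jytaz #tibuage";
        // Original pattern: consumes whitespace on both sides of a tag.
        System.out.println(find("\\s#(\\w)+\\s?", data));
        // \B pattern: consumes only the tag itself.
        System.out.println(find("\\B(#\\w+)", data));
    }
}
```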