Pattern:
"(([^",\n ]*[,\n ])*([^",\n ]*"{2})*)*[^",\n ]*"[ ]*,[ ]*|[^",\n]*[ ]*,[ ]*|"(([^",\n ]*[,\n ])*([^",\n ]*"{2})*)*[^",\n ]*"[ ]*|[^",\n]*[ ]*
This regex is for parsing a CSV file, but when it is passed to Pattern.matcher I get a hung-thread warning. I'd appreciate it if someone could help fine-tune this pattern.
[7/1/13 16:45:26:745 GMT+08:00] 00000029 ThreadMonitor W WSVR0605W: Thread "MessageListenerThreadPool : 0" (00000035) has been active for 691836 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
at java.util.regex.Pattern$Curly.match(Pattern.java:4233)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4606)
at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4752)
at java.util.regex.Pattern$Prolog.match(Pattern.java:4689)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4606)
at java.util.regex.Pattern$Loop.match(Pattern.java:4733)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4665)
at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4754)
at java.util.regex.Pattern$Prolog.match(Pattern.java:4689)
at java.util.regex.Pattern$Loop.match(Pattern.java:4742)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4665)
at java.util.regex.Pattern$BitClass.match(Pattern.java:2912)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4278)
at java.util.regex.Pattern$Curly.match(Pattern.java:4233)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4606)
at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4752)
at java.util.regex.Pattern$Prolog.match(Pattern.java:4689)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4606)
Description
The problem appears to be the sheer amount of backtracking needed to accomplish the match.
If your CSV is well formed, you could use a simpler regex to parse each line. Note that this only separates the quoted and unquoted comma-delimited values within a single string, so you'd need to pass each line through .matcher with this regex and iterate over the matches.
regex: (?:^|,)"?((?<=")[^"]*|[^,"]*)"?(?=,|$)
Java Code Example:
Live example: http://ideone.com/NBmzrk
Sample Text
"root",test1,1111,"22,22",,fdsa
Code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Module1 {
    public static void main(String[] args) {
        // The sample CSV line from above.
        String sourcestring = "\"root\",test1,1111,\"22,22\",,fdsa";
        // Group 1 captures each field without its surrounding quotes.
        Pattern re = Pattern.compile("(?:^|,)\"?((?<=\")[^\"]*|[^,\"]*)\"?(?=,|$)");
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()) {
            // Group 0 is the whole match; group 1 is the field value.
            for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}
Capture Group 1
[0] => root
[1] => test1
[2] => 1111
[3] => 22,22
[4] =>
[5] => fdsa
Related
grammar Hello;
@parser::header {
import java.util.*;
}
@parser::members {
Map<Integer, String> map = new HashMap<Integer, String>();
int A = 0;
int B = 0;
int max = 0;
}
prog
@after {
List<Integer> msgs = new ArrayList<>(map.keySet());
Collections.sort(msgs);
for (int i=0; i< msgs.size(); i++){
System.out.println(map.get(msgs.get(i)));
}
System.out.println("Alice: "+ A +", Bob: "+B);
System.out.println(max);
}
: stat+;
stat: message NL | message;
message: h=T_NUM ':' m=T_NUM ':' s=T_NUM 'A' ':' T_MSG {
String msg = $h.getText() + ":" + $m.getText() + ":" + $s.getText() + " A: " + $T_MSG.getText();
int len = $T_MSG.getText().length();
if (len > max) max = len;
A++;
int id = Integer.parseInt($h.getText()) * 3600 + Integer.parseInt($m.getText()) * 60 + Integer.parseInt($s.getText());
map.put(id, msg);
} | h=T_NUM ':' m=T_NUM ':' s=T_NUM 'B' ':' T_MSG {
String msg = $h.getText() + ":" + $m.getText() + ":" + $s.getText() + " B: " + $T_MSG.getText();
int len = $T_MSG.getText().length();
if (len > max) max = len;
B++;
int id = Integer.parseInt($h.getText()) * 3600 + Integer.parseInt($m.getText()) * 60 + Integer.parseInt($s.getText());
map.put(id, msg);
};
T_NUM: [0-9][0-9];
T_MSG: [A-Za-z0-9.,!? ]+;
NL: [\n]+;
WS : [ \t\r]+ -> skip ; // skip spaces, tabs, carriage returns
Hello! So I have a task to write grammar and parser in ANTLR4 which recognizes this kind of input:
00:10:11 A: Message 1
23:12:12 B: Message 5
11:12:13 A: Message 2
12:21:12 B: Message 4
11:12:15 A: Message 3
and, as output, it has to print the messages sorted by time. My problem is with spaces. I want to be able to recognize spaces in messages, but I get an error:
line 1:6 no viable alternative at input '00:10:11 A'
Alice: 0, Bob: 0
0
When I remove the space from the T_MSG token (and, obviously, from the input), it works. But I don't know how to make it recognize spaces in messages.
Always dump your token stream to see what the Lexer produces for the Parser to consume.
For the first line of your test input (using grun Hello prog -tokens < Hello.txt), I get:
[@0,0:1='00',<T_NUM>,1:0]
[@1,2:2=':',<':'>,1:2]
[@2,3:4='10',<T_NUM>,1:3]
[@3,5:5=':',<':'>,1:5]
[@4,6:9='11 A',<T_MSG>,1:6]
[@5,10:10=':',<':'>,1:10]
[@6,11:21=' Message 1 ',<T_MSG>,1:11]
[@7,22:22='\n',<NL>,1:22]
[@8,23:22='<EOF>',<EOF>,2:0]
line 1:6 no viable alternative at input '00:10:11 A'
Alice: 0, Bob: 0
0
In particular, notice the line
[@4,6:9='11 A',<T_MSG>,1:6]
Your parser isn't seeing the stream of tokens that your parser rules assume it will see.
This is because "11 A" matches the T_MSG Lexer rule. Note: even though the T_NUM rule matches the "11" input, ANTLR's Lexer will use the Lexer rule that consumes the most input, so ANTLR will produce a T_MSG token.
That's why you're getting the observed error.
There are ways around this: using Lexer modes, not skipping WS (which means accounting for all the places where WS can occur in your parser rules), or maybe a couple of other techniques.
That said, you're really applying the wrong tool to the job. Reading this input in line by line and applying a Regex with capture groups will be MUCH simpler. There's nothing about your input that requires a full-fledged parser.
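A minimal sketch of the line-by-line regex approach suggested above, assuming every line has the shape HH:MM:SS, a sender letter A or B, a colon, and then the message text. The class name ChatLog and the method sortByTime are illustrative names, not something from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChatLog {
    // Assumed line shape: "HH:MM:SS A: text" or "HH:MM:SS B: text".
    private static final Pattern LINE =
        Pattern.compile("^(\\d{2}):(\\d{2}):(\\d{2})\\s+([AB]):\\s*(.*)$");

    /** Returns the well-formed lines ordered by timestamp; other lines are skipped. */
    static List<String> sortByTime(List<String> lines) {
        // TreeMap iterates its keys (seconds since midnight) in ascending order.
        // As with the map in the grammar, a later line with the same timestamp
        // replaces an earlier one.
        Map<Integer, String> byTime = new TreeMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue;
            }
            int seconds = Integer.parseInt(m.group(1)) * 3600
                        + Integer.parseInt(m.group(2)) * 60
                        + Integer.parseInt(m.group(3));
            byTime.put(seconds, line);
        }
        return new ArrayList<>(byTime.values());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "00:10:11 A: Message 1",
            "23:12:12 B: Message 5",
            "11:12:13 A: Message 2",
            "12:21:12 B: Message 4",
            "11:12:15 A: Message 3");
        for (String line : sortByTime(input)) {
            System.out.println(line);
        }
    }
}
```

The sender and message text are available as m.group(4) and m.group(5) if the per-sender counts or the longest message length are needed as well.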
If you push forward with ANTLR, you're probably also much better off first working out the grammar to get the correct parse tree and then using a listener to handle the results. All of the @parser::* blocks, prog {...}, and the parser rule actions are distractions at best if you're not yet building the correct parse tree.
It may be very simple, but I am extremely new to regex and have a requirement where I need to match a pattern in a string and extract a number from it. Below is my code with sample input and the required output. I tried to construct the Pattern by referring to https://www.freeformatter.com/java-regex-tester.html, but the regex match itself returns false.
Pattern pattern = Pattern.compile(".*/(a-b|c-d|e-f)/([0-9])+(#[0-9]?)");
String str = "foo/bar/Samsung-Galaxy/a-b/1"; // need to extract 1.
String str1 = "foo/bar/Samsung-Galaxy/c-d/1#P2";// need to extract 2.
String str2 = "foo.com/Samsung-Galaxy/9090/c-d/69"; // need to extract 69
System.out.println("result " + pattern.matcher(str).matches());
System.out.println("result " + pattern.matcher(str1).matches());
System.out.println("result " + pattern.matcher(str2).matches());
All of the above print statements return false. I am using Java 8. Is there any way to match the pattern and extract the digits from the string in a single statement?
It would be great if somebody could point me to how to debug and develop the regex. Please feel free to let me know if something in my question is not clear.
You may use
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
See the regex demo
When used with matches(), the pattern above does not require explicit anchors, ^ and $.
Details
.* - any 0+ chars other than line break chars, as many as possible
/ - the rightmost / that is followed with the subsequent subpatterns
(?:a-b|c-d|e-f) - a non-capturing group matching any of the alternatives inside: a-b, c-d or e-f
/ - a / char
[^/]*? - any chars other than /, as few as possible
([0-9]+) - Group 1: one or more digits.
Java demo:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
Pattern pattern = Pattern.compile(".*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)");
for (String s : strs) {
Matcher m = pattern.matcher(s);
if (m.matches()) {
System.out.println(s + ": \"" + m.group(1) + "\"");
}
}
A replacing approach using the same regex with anchors added:
List<String> strs = Arrays.asList("foo/bar/Samsung-Galaxy/a-b/1","foo/bar/Samsung-Galaxy/c-d/1#P2","foo.com/Samsung-Galaxy/9090/c-d/69");
String pattern = "^.*/(?:a-b|c-d|e-f)/[^/]*?([0-9]+)$";
for (String s : strs) {
System.out.println(s + ": \"" + s.replaceFirst(pattern, "$1") + "\"");
}
See another Java demo.
Output:
foo/bar/Samsung-Galaxy/a-b/1: "1"
foo/bar/Samsung-Galaxy/c-d/1#P2: "2"
foo.com/Samsung-Galaxy/9090/c-d/69: "69"
Because you always match the last number in the string, I would just use replaceAll with the regex .*?(\d+)$ :
String regex = ".*?(\\d+)$";
String strResult1 = str.replaceAll(regex, "$1");
System.out.println(!strResult1.isEmpty() ? "result " + strResult1 : "no result");
String strResult2 = str1.replaceAll(regex, "$1");
System.out.println(!strResult2.isEmpty() ? "result " + strResult2 : "no result");
String strResult3 = str2.replaceAll(regex, "$1");
System.out.println(!strResult3.isEmpty() ? "result " + strResult3 : "no result");
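Note that replaceAll returns the input unchanged when the regex does not match, so a string that does not end in digits comes back as-is rather than empty; the isEmpty() check only catches an empty input.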
Outputs
result 1
result 2
result 69
Here is a one-liner using String#replaceAll:
public String getDigits(String input) {
String number = input.replaceAll(".*/(?:a-b|c-d|e-f)/[^/]*?(\\d+)$", "$1");
return number.matches("\\d+") ? number : "no match";
}
System.out.println(getDigits("foo.com/Samsung-Galaxy/9090/c-d/69"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/a-b/some other text/1"));
System.out.println(getDigits("foo/bar/Samsung-Galaxy/9090/a-b/69ace"));
69
no match
no match
This works on the sample inputs you provided. Note that I added logic which will display no match for the case where ending digits could not be matched fitting your pattern. In the case of a non-match, we would typically be left with the original input string, which would not be all digits.
Is there a way to match the start and end of a sentence in Java? The easiest case is a sentence ending with a simple dot (.). In some other cases it could end with a colon (:) or with an abbreviation followed by a colon (.:).
For example some random news text:
Cliffs have collapsed in New Zealand during an earthquake in the city
of Christchurch on the South Island. No serious damage or fatalities
were reported in the Valentine's Day quake that struck at 13:13 local
time. Based on the med. report everybody were ok.
My goal is to get a word (possibly an abbreviation) plus its context, if possible only the sentence to which the word belongs.
So a successful output for me would be something like this:
selected word -> collapsed
context -> Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island.
selected word -> med.
context -> Based on the med. report everybody were ok.
Thanks
You can spot a sentence easily. It starts with a capital letter and ends with one of the .:!? characters, followed either by a space and another capital letter or by the end of the whole string.
Compare "time. Based" with "med. report": only the first is followed by a capital letter.
So the regex capturing the whole sentence should look like this:
([A-Z][a-z].*?[.:!?](?=$| [A-Z]))
Take a look! Regex101
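As a rough sketch, the sentence regex above can be used from Java to pull out the sentence that contains a given word. The class SentenceFinder and the method sentenceContaining are hypothetical names, and the sketch assumes the text is on a single line, because . does not match line breaks by default.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SentenceFinder {
    // The sentence regex from above, without the outer capture group.
    private static final Pattern SENTENCE =
        Pattern.compile("[A-Z][a-z].*?[.:!?](?=$| [A-Z])");

    /** Returns the first sentence that contains the word, or null if none does. */
    static String sentenceContaining(String text, String word) {
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            if (m.group().contains(word)) {
                return m.group();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text =
            "Cliffs have collapsed in New Zealand during an earthquake in the "
          + "city of Christchurch on the South Island. No serious damage or "
          + "fatalities were reported in the Valentine's Day quake that struck "
          + "at 13:13 local time. Based on the med. report everybody were ok.";
        System.out.println(sentenceContaining(text, "collapsed"));
        System.out.println(sentenceContaining(text, "med."));
    }
}
```

The heuristic still breaks when an abbreviation is followed by a capitalised word (for example "Dr. Smith"), which is where an NLP toolkit does better.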
What you are looking for is a natural language processing toolkit. For Java you can use CoreNLP;
they already have some example cases on their tutorials page.
You can certainly write a regex that looks for all the characters between the sentence-ending characters (. : ? etc.), and it would look something like this:
.*?(?=[.:])
Then you would have to loop through the matches and find the sentences that contain your words. But I recommend using an NLP library for this.
The code:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main( String[] args ) {
final Map<String, String> dict = new HashMap<>();
dict.put( "med", "medical" );
final String text =
"Cliffs have collapsed in New Zealand during an earthquake in the "
+ "city of Christchurch on the South Island. No serious damage or "
+ "fatalities were reported in the Valentine's Day quake that struck "
+ "at 13:13 local time. Based on the med. report everybody were ok.";
final Pattern p = Pattern.compile( "[^\\.]+\\W+(\\w+)\\." );
final Matcher m = p.matcher( text );
int pos = 0;
while(( pos < text.length()) && m.find( pos )) {
pos = m.end() + 1;
final String word = m.group( 1 );
if( dict.containsKey( word )) {
final String repl = dict.get( word );
final String beginOfSentence = text.substring( m.start(), m.end());
final String endOfSentence;
if( m.find( pos )) {
endOfSentence = text.substring( m.start() - 1, m.end());
}
else {
endOfSentence = text.substring( m.start() - 1);
}
System.err.printf( "Replace '%s.' in '%s%s' with '%s'\n",
word, beginOfSentence, endOfSentence, repl );
final String sentence =
// replace() treats its argument literally; replaceAll() would read '.' as "any character"
( beginOfSentence + endOfSentence ).replace( word + ".", repl );
System.err.println( sentence );
}
}
}
}
The execution:
Replace 'med.' in 'Based on the med. report everybody were ok.' with 'medical'
Based on the medical report everybody were ok.
This is the regex for finding the session ID: "(?<=( ))([0-9]*)(?=(.*ABC.DEEP. [1-9] s))" and the output is:
ID TYPE USER IDLE
63494 ABC DEEP 3 s
-> 70403 ABC DEEAP 0 s
82446 ABC DEEOP 52 min 27 s
In myregexp.com/signedJar.html, this regex works fine. But when I try to find using Java, it is not able to get the output. Please find the snippet:
FrameworkControls.regularExpressionPattern = Pattern.compile("(?<=( ))([0-9]*)(?=(.*ABC.*DEEP.*[1-9] s))");
String deepak = "\n" +
"\n" +
" ID TYPE USER IDLE\n" +
"\n" +
" 63494 ABC DEEP 3 s\n" +
" -> 70403 ABC DEEAP 0 s\n" +
" 82446 ABC DEEOP 52 min 27 s\n";
FrameworkControls.regularExpressionMatcher = FrameworkControls.regularExpressionPattern.matcher(deepak);
if (FrameworkControls.regularExpressionMatcher.find()) {
String h = FrameworkControls.regularExpressionMatcher.group().trim();
System.err.println(h);
}
"FrameworkControls.regularExpressionMatcher.find()" returns true, but the h variable is always empty. Can anyone tell me where I might be going wrong?
Expected Output: 63494
I think you're trying to print the ID of the USER DEEPAK. If so, your code would be:
Pattern p = Pattern.compile("(?<= )[0-9]+(?=\\s*ABC\\s*DEEP\\s*[0-9]\\s*s)");
Matcher m = p.matcher(deepak);
while (m.find()) {
System.out.println(m.group());
}
IDEONE
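One likely reason the original attempt printed an empty string is that [0-9]* is allowed to match zero digits, so find() can succeed right after a space without consuming anything. The sketch below illustrates this on a single line, assuming the ID is preceded by more than one space; the class StarVsPlus and the simplified lookahead are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StarVsPlus {
    /** Returns the text of the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Two leading spaces (an assumption about the real input).
        String line = "  63494 ABC DEEP 3 s";

        // '*' accepts zero digits: the match succeeds between the two spaces and is empty.
        System.out.println("[" + firstMatch("(?<= )[0-9]*(?=.*ABC)", line) + "]");

        // '+' needs at least one digit, so the first match is the ID itself.
        System.out.println("[" + firstMatch("(?<= )[0-9]+(?=.*ABC)", line) + "]");
    }
}
```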
I would use the following expression:
"^\\s+(\\d+)\\s+(\\w+)\\s+(\\w+).+$"
then
group(1) is ID
group(2) is TYPE
group(3) is USER
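The groups are independent of one another, so you can drop the parentheses around the last two if you don't need them. With multi-line input such as the table above, compile the pattern with Pattern.MULTILINE so that ^ and $ match at every line.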
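A short sketch of how the three-group expression could be applied to the whole table. It assumes Pattern.MULTILINE is set so that ^ and $ anchor at each line, and the class SessionTable and method idsForUser are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SessionTable {
    // Group 1 = ID, group 2 = TYPE, group 3 = USER.
    private static final Pattern ROW =
        Pattern.compile("^\\s+(\\d+)\\s+(\\w+)\\s+(\\w+).+$", Pattern.MULTILINE);

    /** Collects the IDs of all rows whose USER column equals the given user. */
    static List<String> idsForUser(String table, String user) {
        List<String> ids = new ArrayList<>();
        Matcher m = ROW.matcher(table);
        while (m.find()) {
            if (m.group(3).equals(user)) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String table = "\n"
            + "\n"
            + " ID TYPE USER IDLE\n"
            + "\n"
            + " 63494 ABC DEEP 3 s\n"
            + " -> 70403 ABC DEEAP 0 s\n"
            + " 82446 ABC DEEOP 52 min 27 s\n";
        System.out.println(idsForUser(table, "DEEP"));
    }
}
```

Rows that start with "->" are not matched by this expression, because it requires digits immediately after the leading whitespace.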
I'm working on strings like "[ro.multiboot]: [1]". How do I just select 1(it can also be 0) out of this string?
I am looking for a regex in Java.
Usually, you would do something like (assuming 0 and 1 were the only options):
^.*\[([01])\].*$
If you only wanted the value for ro.multiboot, you could change it to something like:
^.*\[ro.multiboot\].*\[([01])\].*$
(depending on how complex any of the non-bracketed stuff is allowed to be).
These would both basically only extract the value between square brackets if it were zero or one, and capture it into a capture variable so you could use it.
Of course, regex is not a world-wide standard, nor are the environments in which you use it. That means it depends a lot on your actual environment how you will actually code this up.
For Java, the following sample program may help:
import java.util.regex.*;
class Test {
public static void main(String args[]) {
Pattern p = Pattern.compile("^.*\\[ro.multiboot\\].*\\[([01])\\].*$");
String str;
Matcher m;
str = "[ro.multiboot]: [0]";
m = p.matcher (str);
if (m.find()) {
System.out.println ("str0 has " + m.group(1));
}
str = "[ro.multiboot]: [1]";
m = p.matcher (str);
if (m.find()) {
System.out.println ("str1 has " + m.group(1));
}
str = "[ro.multiboot]: [2]";
m = p.matcher (str);
if (m.find()) {
System.out.println ("str2 has " + m.group(1));
}
}
}
This results in (as expected):
str0 has 0
str1 has 1
@paxdiablo's regexps are correct, but a complete answer to "How do I just select 1(it can also be 0) out of this string?" is:
1. A very simple solution
String input = "[ro.multiboot]: [1]";
String matched = input.replaceFirst( "^.*\\[ro.multiboot\\].*\\[([01])\\].*$", "$1" );
2. The same functionality, more complicated but with better performance
String input = "[ro.multiboot]: [1]";
Pattern p = Pattern.compile( "^.*\\[ro.multiboot\\].*\\[([01])\\].*$" );
Matcher m = p.matcher( input );
String matched = null;
if ( m.matches() ) matched = m.group( 1 );
Performance is better because the pattern is compiled just once (for example, when you are matching an array of such strings).
Notes:
In both examples, the group is the part of the regexp between ( and ) (when they are not escaped).
In Java source you have to write \\[, because \[ is not a valid escape sequence in a string literal and causes a compile error.
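A related detail: the unescaped dot in ro.multiboot matches any character, not only a literal dot. One way to avoid escaping the key by hand is Pattern.quote, sketched below. The helper name readFlag and the stricter "[key]: [value]" line layout are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PropFlag {
    /**
     * Returns "0" or "1" for a line of the assumed form "[key]: [0]" or "[key]: [1]",
     * or null if the line does not have that form.
     */
    static String readFlag(String line, String key) {
        // Pattern.quote makes every character of the key literal, including the dot.
        Pattern p = Pattern.compile("^\\[" + Pattern.quote(key) + "\\]:\\s*\\[([01])\\]$");
        Matcher m = p.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(readFlag("[ro.multiboot]: [1]", "ro.multiboot"));
        System.out.println(readFlag("[ro.multiboot]: [2]", "ro.multiboot"));
        System.out.println(readFlag("[roXmultiboot]: [1]", "ro.multiboot"));
    }
}
```

If many lines are checked against the same key, the compiled Pattern can be built once and reused, as in the second solution above.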