I am writing a string parser to extract all strings from a text file. The strings can be inside single or double quotes. Pretty simple, right? Well, not really. I wrote a regex that matches strings the way I want, but it throws a StackOverflowError on big strings (I am aware Java isn't really good with regex on big strings). This is the regex pattern: (['"])(?:(?!\1|\\).|\\.)*\1
This works well for all the string inputs that I need, but as soon as there's a big string it throws a StackOverflowError. I have read similar questions, such as this one, which suggests using StringUtils.substringsBetween, but that fails on strings like '""' and "\\\""
So my question is: what should I do to solve this issue? I can provide more context if needed, just comment.
Edit: After testing the answer
Code:
public static void main(String[] args) {
final String regex = "'([^']*)'|\"(.*)\"";
final String string = "local b = { [\"\\\\\"] = \"\\\\\\\\\", [\"\\\"\"] = \"\\\\\\\"\", [\"\\b\"] = \"\\\\b\", [\"\\f\"] = \"\\\\f\", [\"\\n\"] = \"\\\\n\", [\"\\r\"] = \"\\\\r\", [\"\\t\"] = \"\\\\t\" }\n" +
"local c = { [\"\\\\/\"] = \"/\" }";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
Output:
Full match: "\\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t"
Group 1: null
Group 2: \\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t
Full match: "\\/"] = "/"
Group 1: null
Group 2: \\/"] = "/
It's not handling the escaped quotes correctly.
I would try it without capturing the quote type and without the lookahead and backreference, to improve performance. See this question for escaped characters in quoted strings. It contains a nice answer that uses an unrolled pattern. Try something like
'[^\\']*(?:\\.[^\\']*)*'|"[^\\"]*(?:\\.[^\\"]*)*"
As a Java String:
String regex = "'[^\\\\']*(?:\\\\.[^\\\\']*)*'|\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"";
The left side handles single-quoted strings, the right side double-quoted strings. If one kind is more common than the other in your source, put that one on the left side of the pipe.
See this demo at regex101 (if you need to capture what's inside the quotes, use groups).
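As a rough sketch of how the unrolled pattern above could be used to collect every quoted string, assuming the class name QuotedStringFinder and the method findAll (both made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedStringFinder {
    // Unrolled pattern from the answer: single-quoted on the left, double-quoted on the right.
    private static final Pattern QUOTED = Pattern.compile(
            "'[^\\\\']*(?:\\\\.[^\\\\']*)*'|\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"");

    // Returns every quoted string found in the input, surrounding quotes included.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Raw text of the input: '""' on the first line, "\\\"" on the second.
        String input = "'\"\"'\n\"\\\\\\\"\"";
        for (String s : findAll(input)) {
            System.out.println(s);
        }
    }
}
```

As far as I understand the engine, the long runs of ordinary characters are consumed by the character classes in a loop, so the recursion depth should depend on the number of escape sequences rather than on the length of the string. A string with a very large number of escapes could still be a problem, so it is worth testing against your real inputs.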
For the overflow itself, you would probably need to give the JVM whatever resources the match requires (for example, a larger thread stack). Small benchmark tests on representative inputs would show what is actually needed to complete your task.
Another option would be to look for other strategies, or maybe other languages, to solve your problem. For instance, you could classify your strings into two categories, ' wrapped and " wrapped, and look for a better solution for each.
Otherwise, you might want to try designing simple expressions and avoid back-referencing, such as with:
'([^']*)'|"(.*)"
which would probably fail for some other inputs that you might have and we don't know of.
Or maybe present your question in slightly more technical terms, so that experienced users can provide better answers, such as this answer.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegularExpression{
public static void main(String[] args){
final String regex = "'([^']*)'|\"(.*)\"";
final String string = "'\"\"'\n"
+ "\"\\\\\\\"\"";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
}
Output
Full match: '""'
Group 1: ""
Group 2: null
Full match: "\\\""
Group 1: null
Group 2: \\\"
If you wish to simplify, modify, or explore the expression, it is explained in the top right panel of regex101.com. The same link also lets you watch how it matches against some sample inputs.
Basically, my desired outcome is to split a string on known keywords regardless of whether whitespace separates the keyword from its neighbours. Below is my current implementation; the expected parameter is String line = "sum:=5;":
private static String[] nextLineAsToken(String line) {
return line.split("\\s+(?=(:=|<|>|=))");
}
Expected:
String[] {"sum", ":=", "5;"};
Actual:
String[] {"sum:=5;"};
I have a feeling this isn't possible, but it would be great to hear from you guys.
Thanks.
Here is some example code that you can use to split your input into groups. Whitespace characters such as a regular space are ignored. The groups are then printed in the for loop:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Example {
public static void main(String[] args) {
final String regex = "(\\w*)\\s*(:=)\\s*(\\d*;)";
final String string = "sum:=5;";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
}
And this is the output:
Full match: sum:=5;
Group 1: sum
Group 2: :=
Group 3: 5;
Your main problem is that you coded \s+ instead of \s*, which requires spaces to be present at the split point instead of making them optional. The other problem is that your regex only splits before operators.
Use this regex:
\s*(?=(:=|<|>|(?<!:)=))|(?<=(=|<|>))\s*
See live demo.
Or as Java:
return line.split("\\s*(?=(:=|<|>|(?<!:)=))|(?<=(=|<|>))\\s*");
Which uses a look ahead to split before operators and a look behind to split after operators.
\s* has been added to consume any spaces between terms.
Note also the negative look behind (?<!:) within the look ahead to prevent splitting between : and =.
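To try the split regex above in isolation, here is a small self-contained sketch. The class name OperatorSplitter is made up for illustration; nextLineAsToken is the method name from the question.

```java
import java.util.Arrays;

final class OperatorSplitter {
    // Split before operators (look-ahead) and after them (look-behind),
    // consuming optional whitespace; (?<!:) keeps ":=" in one piece.
    private static final String SPLIT_REGEX =
            "\\s*(?=(:=|<|>|(?<!:)=))|(?<=(=|<|>))\\s*";

    static String[] nextLineAsToken(String line) {
        return line.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(nextLineAsToken("sum:=5;")));
        System.out.println(Arrays.toString(nextLineAsToken("a<b")));
    }
}
```

For "sum:=5;" this yields the three tokens the question expects: sum, := and 5;.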
I'm building a calculator that can solve formulas as a project of mine, and I ran into the problem that a string such as 2x+7 gets tokenized as "2x", "+", "7".
I need to split it properly into constants and variables, which means 2x should become "2", "x". How do I do this without breaking more complex formulas that include functions such as sin and cos?
For example, I want 16x + cos(y) to be tokenized as "16", "x", "+", "cos", "(", "y", ")"
This problem can get pretty complicated, and this answer is just an example.
We would first want to figure out what types of equations we might have, and then start designing some expressions. For instance, we can have a look at this:
([a-z]+)|([-]?\d+)|[-+*\/]
Demo 1
Or:
([a-z]+)|([-]?\d+)|([-+*\/])|(\(|\))
Demo 2
Example
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "([a-z]+)|([-]?\\d+)|([-+*\\/])";
final String string = "2x+7\n"
+ "2sin(2x + 2y) = 2sin(x)*cos(2y) + 2cos 2x * 2sin 2y\n"
+ "2sin(2x - 2y) = -2tan 2x / cot -2y + -2cos -2x / 2sin 2y\n";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
I really don't have any suggestion as to how it would be best to architect a solution for this problem. But, maybe you would want to categorize your equations first, then design some classes/methods to process each category of interest, and where regex was necessary, you can likely design one/multiple expressions for desired purposes that you wish to accomplish.
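As a sketch of how the second expression above (the one that also matches parentheses) could collect tokens into a list: the class name FormulaTokenizer and the method tokenize are made up for illustration, and the pattern is only the example from this answer, not a complete formula grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FormulaTokenizer {
    // Names | (optionally negative) integers | arithmetic operators | parentheses.
    private static final Pattern TOKEN =
            Pattern.compile("([a-z]+)|([-]?\\d+)|([-+*/])|(\\(|\\))");

    static List<String> tokenize(String formula) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(formula);
        // Characters that fit no alternative (for example whitespace) are skipped by find().
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("16x + cos(y)"));
        System.out.println(tokenize("2x+7"));
    }
}
```

Two limitations to keep in mind: because the number alternative is tried before the operator alternative, an input like 2-7 is split into "2" and "-7" rather than "2", "-", "7", and adjacent letters such as xy come out as a single name token.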
I have a string /subscription/ffcc218c-985c-4ec8-82d7-751fdcac93f0/subscribe from which I want to extract the middle string /subscription/<....>/subscribe. I have written the below code to get the string
String subscriber = subscriberDestination.substring(1);
int startPos = subscriber.indexOf("/") + 2;
int destPos = startPos + subscriber.substring(startPos + 2).indexOf("/");
return subscriberDestination.substring(startPos, destPos + 2);
Gives back ffcc218c-985c-4ec8-82d7-751fdcac93f0
How can I use java Pattern library to write better code?
If you want to use a regular expression, a simple way would be:
return subscriberDestination.replaceAll("/.*/([^/]*)/.*", "$1");
/.*/ matches the /subscription/ part
([^/]*) is a capturing group that matches all characters up to the next /
/.* matches the /subscribe part
The second argument of replaceAll says that we want to keep only the first group. Note that the input has to be the full path, including the leading /.
You can use a Pattern to improve efficiency by compiling the expression once:
Pattern p = Pattern.compile("/.*/([^/]*)/.*"); // store it outside the method to reuse it
Matcher m = p.matcher(subscriberDestination);
if (m.find()) return m.group(1);
else return "not found";
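A self-contained variation on the same idea, as a sketch: the class name SubscriptionPath and the method middleSegment are made up for illustration, and the anchored pattern assumes the path always has exactly three segments.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubscriptionPath {
    // Compiled once and reused. Assumes the shape /first/middle/last.
    private static final Pattern MIDDLE = Pattern.compile("^/[^/]+/([^/]+)/[^/]+$");

    // Returns the middle segment, or an empty Optional if the path has another shape.
    static Optional<String> middleSegment(String path) {
        Matcher m = MIDDLE.matcher(path);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(
                middleSegment("/subscription/ffcc218c-985c-4ec8-82d7-751fdcac93f0/subscribe"));
    }
}
```

Returning an Optional instead of a "not found" string keeps a failed match from being mistaken for a real value.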
My 5 cents: I recommend using a Pattern to extract a substring with a known format:
public final class Foo {
private static final Pattern PATTERN = Pattern.compile(".*subscription\\/(?<uuid>[\\w-]+)\\/subscribe");
public static String getUuid(String url) {
Matcher matcher = PATTERN.matcher(url);
return matcher.matches() ? matcher.group("uuid") : null;
}
}
RegEx Demo
Performance can be improved by:
not creating any intermediate substrings;
using indexOf(..) with a char, which should be faster than with a String.
final int startPos = subscriberDestination.indexOf('/',1) + 1 ;
final int destPos = subscriberDestination.indexOf('/',startPos+1);
return subscriberDestination.substring(startPos, destPos );
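The same indexOf approach wrapped in a method with guards for unexpected input, as a sketch. The class name SubscriptionPathIndexOf and the null return for a malformed path are assumptions for illustration.

```java
final class SubscriptionPathIndexOf {
    // Returns the text between the second and third '/',
    // or null if the path does not contain three slashes.
    static String middleSegment(String path) {
        int start = path.indexOf('/', 1) + 1;   // index just past the second '/'
        if (start == 0) {
            return null;                        // there is no second '/'
        }
        int end = path.indexOf('/', start);     // index of the third '/'
        return end < 0 ? null : path.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(
                middleSegment("/subscription/ffcc218c-985c-4ec8-82d7-751fdcac93f0/subscribe"));
    }
}
```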
About using the Java Pattern library:
Do you expect any performance gain? I doubt you will get one by using the Pattern library, but I recommend profiling it to be absolutely sure.
I am trying to use Regex to extract the values from a string and use them for the further processing.
The string I have is :
String tring =Format_FRMT: <<<$gen>>>(((valu e))) <<<$gen>>>(((value 13231)))
<<<$gen>>>(((value 13231)))
Regex pattern I have made is :
Pattern p = Pattern.compile("\\<{3}\\$([\\w ]+)\\>{3}\\s?\\({3}([\\w ]+)\\){3}");
When I am running the whole program
Matcher m = p.matcher(tring);
String[] try1 = new String[m.groupCount()];
for(int i = 1 ; i<= m.groupCount();i++)
{
try1[i] = m.group(i);
//System.out.println("group - i" +try1[i]+"\n");
}
I am getting
No match found
Can anybody help me with this? Where exactly is this going wrong?
My first aim is just to see whether I can get the values into the corresponding groups. If that works, I would like to use them for further processing.
Thanks
Here is an example of how to get all the values you need with find():
String tring = "CHARDATA_FRMT: <<<$gen>>>(((valu e))) <<<$gen>>>(((value 13231)))\n<<<$gen>>>(((value 13231)))";
Pattern p = Pattern.compile("<{3}\\$([\\w ]+)>{3}\\s?\\({3}([\\w ]+)\\){3}");
Matcher m = p.matcher(tring);
while (m.find()){
System.out.println("Gen: " + m.group(1) + ", and value: " + m.group(2));
}
See IDEONE demo
Note that you do not have to escape < and > in Java regex.
After you create the Matcher and before you reference its groups, you must call one of the methods that attempts the actual match, like find, matches, or lookingAt. For example:
Matcher m = p.matcher(tring);
if (!m.find()) return; // <---- Add something like this
String[] try1 = new String[m.groupCount()];
You should read the javadocs on the Matcher class to decide which of the above methods makes sense for your data and application. http://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html
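Putting both answers together, a sketch of collecting the groups of every match into arrays. The class name GenValueExtractor and the method extract are made up for illustration. The group index is shifted by one when writing into the array, because an array of length groupCount() has no slot at index groupCount().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GenValueExtractor {
    private static final Pattern P =
            Pattern.compile("<{3}\\$([\\w ]+)>{3}\\s?\\({3}([\\w ]+)\\){3}");

    // Returns one {gen, value} pair per match found in the input.
    static List<String[]> extract(String input) {
        List<String[]> result = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {                                 // groups are only available after a successful find()
            String[] groups = new String[m.groupCount()];  // group 0 (the whole match) is not counted
            for (int i = 1; i <= m.groupCount(); i++) {
                groups[i - 1] = m.group(i);                // shift by one to stay inside the array
            }
            result.add(groups);
        }
        return result;
    }

    public static void main(String[] args) {
        String tring = "CHARDATA_FRMT: <<<$gen>>>(((valu e))) <<<$gen>>>(((value 13231)))\n"
                + "<<<$gen>>>(((value 13231)))";
        for (String[] pair : extract(tring)) {
            System.out.println("Gen: " + pair[0] + ", and value: " + pair[1]);
        }
    }
}
```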
I have two different sources feeding input files to my application. Their filename patterns differ, yet they contain common information that I want to retrieve.
Using regex named groups seemed convenient, as it allows for maximum code factorization, however it has its limits, as I cannot concat the two patterns if they use the same group names.
Example:
In other words, this:
String PATTERN_GROUP_NAME = "name";
String PATTERN_GROUP_DATE = "date";
String PATTERN_IMPORT_1 = "(?<" + PATTERN_GROUP_NAME + ">[a-z]{3})_(?<" + PATTERN_GROUP_DATE + ">[0-9]{14})_(stuff stuf)\\.xml";
String PATTERN_IMPORT_2 = "(stuff stuf)_(?<" + PATTERN_GROUP_DATE + ">[0-9]{14})_(?<" + PATTERN_GROUP_NAME + ">[a-z]{3})_(other stuff stuf)\\.xml";
Pattern universalPattern = Pattern.compile(PATTERN_IMPORT_1 + "|" + PATTERN_IMPORT_2);
try {
DirectoryStream<Path> list = Files.newDirectoryStream(workDirectory);
for (Path file : list) {
Matcher matcher = universalPattern.matcher(file.getFileName().toString());
name = matcher.group(PATTERN_GROUP_NAME);
fileDate = dateFormatter.parseDateTime(matcher.group(PATTERN_GROUP_DATE));
(...)
will fail with a java.util.regex.PatternSyntaxException because the named capturing groups are already defined.
What would be the most efficient / elegant way of solving this problem?
Edits:
It goes without saying, but the two patterns I can match my input files against are different enough so no input file can match both.
Use two patterns - then group names can be equal.
You asked for efficient and elegant. Theoretically, a single pattern could be more efficient, but the difference is irrelevant here.
The code will be slightly longer but more readable (readability being a weakness of regex), which makes it easier to maintain.
In pseudo-code:
Matcher m = firstPattern.matcher ...
if (!m.matches()) {
m = secondPattern.matcher ...
if (!m.matches()) {
continue;
}
}
name = m.group(NAME_GROUP);
...
(Everyone wants to write clever code, but simplicity may be called for.)
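A runnable sketch of the two-pattern fallback, using the two expressions from the question. The class name ImportFileNames, the method parse, and the null return for unmatched names are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImportFileNames {
    // Each pattern is compiled on its own, so both may use the same group names.
    private static final Pattern IMPORT_1 = Pattern.compile(
            "(?<name>[a-z]{3})_(?<date>[0-9]{14})_(stuff stuf)\\.xml");
    private static final Pattern IMPORT_2 = Pattern.compile(
            "(stuff stuf)_(?<date>[0-9]{14})_(?<name>[a-z]{3})_(other stuff stuf)\\.xml");

    // Returns {name, date}, or null when the file name fits neither pattern.
    static String[] parse(String fileName) {
        Matcher m = IMPORT_1.matcher(fileName);
        if (!m.matches()) {
            m = IMPORT_2.matcher(fileName);
            if (!m.matches()) {
                return null;
            }
        }
        return new String[] { m.group("name"), m.group("date") };
    }

    public static void main(String[] args) {
        String[] parsed = parse("stuff stuf_20111130121212_abc_other stuff stuf.xml");
        System.out.println("name : " + parsed[0] + " fileDate: " + parsed[1]);
    }
}
```

The code after the fallback only ever sees one Matcher, so reading the name and date groups stays identical for both file name formats.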
I agree with Joop Eggen's opinion: two patterns are simple and easy to maintain.
Just for fun, here is a one-pattern implementation for your specific case (a little bit longer and uglier).
String[] inputs = {
"stuff stuf_20111130121212_abc_other stuff stuf.xml",
"stuff stuf_20111130151212_def_other stuff stuf.xml",
"abc_20141220202020_stuff stuf.xml",
"def_20140820202020_stuff stuf.xml"
};
String lookAhead = "(?=([a-z]{3}_[0-9]{14}_stuff stuf\\.xml)|(stuff stuf_[0-9]{14}_[a-z]{3}_other stuff stuf\\.xml))";
String onePattern = lookAhead
+ "((?<name>[a-z]{3})_(other stuff stuf)?|(stuff stuf_)?(?<date>[0-9]{14})_(stuff stuf)?){2}\\.xml";
Pattern universalPattern = Pattern.compile(onePattern);
for (String input : inputs) {
Matcher matcher = universalPattern.matcher(input);
if (matcher.find()) {
//System.out.println(matcher.group());
String name = matcher.group("name");
String fileDate = matcher.group("date");
System.out.println("name : " + name + " fileDate: "
+ fileDate);
}
}
The output:
name : abc fileDate: 20111130121212
name : def fileDate: 20111130151212
name : abc fileDate: 20141220202020
name : def fileDate: 20140820202020
Strictly speaking, the "lookAhead" is not necessary in your case. The underlying point is that one pattern cannot define two groups with the same name, so the pattern normally has to be restructured:
From AB|BA ---> (A|B){2}