How to design and split tokens from the string tokenizer function?

I'm building a calculator that can solve formulas as a personal project, and I ran into a problem: a string such as 2x+7 gets tokenized as "2x", "+", "7".
I need to split it properly into constants and variables, which means 2x should become "2", "x". How do I do this without breaking more complex formulas that include functions such as sin and cos?
For example, I want 16x + cos(y) to be tokenized as "16", "x", "+", "cos", "(", "y", ")".

This problem can get fairly complicated, so treat this answer as just an example.
A reasonable first step is to work out which kinds of equations you need to support, and then design expressions for them. For instance, this one matches identifiers, optionally negative integers, and the four arithmetic operators:
([a-z]+)|([-]?\d+)|[-+*\/]
Or:
([a-z]+)|([-]?\d+)|([-+*\/])|(\(|\))
Example
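import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerDemo {
    public static void main(String[] args) {
        // Group 1: identifiers, group 2: optionally negative integers, group 3: operators.
        // Note: [-]?\d+ also takes a minus sign directly before a digit,
        // so 2x-7 yields "-7" rather than "-", "7".
        final String regex = "([a-z]+)|([-]?\\d+)|([-+*\\/])";
        final String string = "2x+7\n"
                + "2sin(2x + 2y) = 2sin(x)*cos(2y) + 2cos 2x * 2sin 2y\n"
                + "2sin(2x - 2y) = -2tan 2x / cot -2y + -2cos -2x / 2sin 2y\n";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            // Exactly one of the groups is non-null for each token.
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}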
I don't have a firm suggestion for how best to architect a full solution. One approach is to categorize your equations first, then design classes or methods to process each category of interest, and, where regex is needed, write one or more expressions for each specific purpose.
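As a rough illustration of the tokenization asked about in the question, here is a minimal sketch. It assumes a fixed list of function names (sin, cos, tan), single-letter variables, and unsigned numbers; the class name FormulaTokenizer and the method tokenize are hypothetical and not part of any answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormulaTokenizer {
    // Alternatives are tried left to right, so function names win over
    // single-letter variables. The function list is an assumption.
    private static final Pattern TOKEN = Pattern.compile(
            "(?:sin|cos|tan)"          // known function names
            + "|\\d+(?:\\.\\d+)?"      // unsigned integer or decimal
            + "|[a-z]"                 // single-letter variable
            + "|[-+*/^()=]");          // operators and parentheses

    public static List<String> tokenize(String formula) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(formula);
        // find() skips anything that is not a token, e.g. whitespace.
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("16x + cos(y)")); // [16, x, +, cos, (, y, )]
        System.out.println(tokenize("2x+7"));         // [2, x, +, 7]
    }
}
```

Because the minus sign is always emitted as its own token here, deciding whether it is unary or binary is left to the parser. Unknown characters are silently skipped, which a real calculator would probably want to report as an error.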

Related

Getting data between single and double quotes (special case)

I am writing a string parser that extracts every string literal from a text file. The strings can be inside single or double quotes. Pretty simple, right? Well, not really. I wrote a regex that matches strings the way I want, but it throws a StackOverflowError on big strings (I am aware that Java's regex engine struggles with very large inputs). This is the regex pattern: (['"])(?:(?!\1|\\).|\\.)*\1
This works well for all the inputs I need, but as soon as there's a big string it throws a StackOverflowError. I have read similar questions, one of which suggests using StringUtils.substringsBetween, but that fails on strings like '""', "\\\""
So my question is: what should I do to solve this issue? I can provide more context if needed; just comment.
Edit: After testing the answer
Code:
public static void main(String[] args) {
final String regex = "'([^']*)'|\"(.*)\"";
final String string = "local b = { [\"\\\\\"] = \"\\\\\\\\\", [\"\\\"\"] = \"\\\\\\\"\", [\"\\b\"] = \"\\\\b\", [\"\\f\"] = \"\\\\f\", [\"\\n\"] = \"\\\\n\", [\"\\r\"] = \"\\\\r\", [\"\\t\"] = \"\\\\t\" }\n" +
"local c = { [\"\\\\/\"] = \"/\" }";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
Output:
Full match: "\\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t"
Group 1: null
Group 2: \\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t
Full match: "\\/"] = "/"
Group 1: null
Group 2: \\/"] = "/
It's not handling the escaped quotes correctly.

I would try it without capturing the quote type, and without the lookahead and backreference, to improve performance. See this question on escaped characters in quoted strings; it contains a nice answer that uses an unrolled pattern. Try something like:
'[^\\']*(?:\\.[^\\']*)*'|"[^\\"]*(?:\\.[^\\"]*)*"
As a Java String:
String regex = "'[^\\\\']*(?:\\\\.[^\\\\']*)*'|\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"";
The left side handles single-quoted strings, the right side double-quoted ones. If one kind is more common than the other in your source, put it on the left side of the pipe.
See the demo at regex101 (if you need to capture what's inside the quotes, use groups).
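To see how the unrolled pattern behaves on the inputs from the question, here is a small sketch. The class name QuotedStrings and the method find are hypothetical; the pattern is the one given above, and the sample text is only an assumption about what the input file might contain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedStrings {
    // Unrolled pattern from the answer above: a run of ordinary characters,
    // then any number of (escape pair + run of ordinary characters).
    private static final Pattern QUOTED = Pattern.compile(
            "'[^\\\\']*(?:\\\\.[^\\\\']*)*'|\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"");

    public static List<String> find(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED.matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Raw text being scanned: '""' "\\\"" 'it\'s'
        String source = "'\"\"' \"\\\\\\\"\" 'it\\'s'";
        for (String s : find(source)) {
            System.out.println(s);
        }
    }
}
```

Run on that sample, this prints the three quoted strings unchanged, one per line, including the two cases that StringUtils.substringsBetween reportedly fails on. Note that . does not match a line terminator by default, so a backslash followed by a newline inside a string would need Pattern.DOTALL.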

To deal with the overflow itself, you could allocate more resources to the process. Small benchmark tests would show how much is actually needed in practice to complete your task.
Another option would be a different strategy, or even a different language. For instance, you could split your strings into two categories, those wrapped in ' and those wrapped in ", and look for a simpler solution for each.
Otherwise, you might want to try simple expressions that avoid back-references, such as:
'([^']*)'|"(.*)"
which would probably fail for some other inputs that you have and that we don't know of.
Or you could state your question in more technical detail so that experienced users can provide better answers, such as this answer.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    public static void main(String[] args) {
        final String regex = "'([^']*)'|\"(.*)\"";
        final String string = "'\"\"'\n"
                + "\"\\\\\\\"\"";
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}
Output
Full match: '""'
Group 1: ""
Group 2: null
Full match: "\\\""
Group 1: null
Group 2: \\\"

Negative lookbehind: How can I stop matches when the suffix is repeated?

I have some suffix that I want to match when some prefix is absent. However, the suffix might be repeated.
Some examples:
1. prefixsuffix - should not match
2. prefixsuffixsuffix - should not match
3. prefixsuffixsuffixsuffix - should not match
4. suffix - should match
5. suffixsuffix - should match
6. suffixsuffixsuffix - should match
I have tried this regex: (?<!prefix)suffix, which fails on examples 2 and 3, since the later suffix occurrences are matched.
So I tried this regex: (?<!prefix)(suffix)*, hoping it would allow suffix to be repeated, but it seems to have the same issue.
So I want a regex which fulfils the above examples.

In your negative lookbehind, alternate prefix with suffix, and when matching suffix for real, use + instead of * (because * can match zero occurrences, which is not desirable):
(?<!prefix|suffix)(suffix)+
https://regex101.com/r/pEoYRA/1
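A quick way to check this expression against the six examples from the question is sketched below. The class name SuffixCheck and the method matches are hypothetical; find() is used because the question asks whether the suffix is matched anywhere in the string.

```java
import java.util.regex.Pattern;

public class SuffixCheck {
    // A run of "suffix" may only start where neither "prefix" nor another
    // "suffix" comes directly before it.
    private static final Pattern P = Pattern.compile("(?<!prefix|suffix)(suffix)+");

    public static boolean matches(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] examples = {
            "prefixsuffix", "prefixsuffixsuffix", "prefixsuffixsuffixsuffix",
            "suffix", "suffixsuffix", "suffixsuffixsuffix"
        };
        for (String s : examples) {
            System.out.println(s + " -> " + matches(s));
        }
    }
}
```

The first three examples print false and the last three print true. Both lookbehind alternatives have a fixed length, which Java's regex engine accepts.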

Another possible starting point, assuming each candidate sits on a line of its own, is this expression:
^(?=(?!prefix)(suffix))\1+$
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixLineDemo {
    public static void main(String[] args) {
        // The lookahead captures "suffix" at the start of a line; \1+ then
        // requires the rest of the line to consist of repetitions of it.
        final String regex = "^(?=(?!prefix)(suffix))\\1+$";
        final String string = "prefixsuffix\n"
                + "prefixsuffixsuffix\n"
                + "prefixsuffixsuffixsuffix\n\n"
                + "suffix\n"
                + "suffixsuffix\n"
                + "suffixsuffixsuffix";
        // MULTILINE makes ^ and $ match at every line boundary.
        final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Full match: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}

You can add a word boundary assertion before your lookbehind to make sure the match starts at the beginning of a word:
\b(?<!prefix)(?:suffix)+
However, looking at your data, even \b(?:suffix)+ may work for you.

Java 7 Regex and named groups with multiple patterns

I have two different sources feeding input files to my application. Their filename patterns differ, yet they contain common information that I want to retrieve.
Using regex named groups seemed convenient, as it allows for maximum code factorization. However, it has its limits: I cannot concatenate the two patterns if they use the same group names.
For example, this:
String PATTERN_GROUP_NAME = "name";
String PATTERN_GROUP_DATE = "date";
String PATTERN_IMPORT_1 = "(?<" + PATTERN_GROUP_NAME + ">[a-z]{3})_(?<" + PATTERN_GROUP_DATE + ">[0-9]{14})_(stuff stuf)\\.xml";
String PATTERN_IMPORT_2 = "(stuff stuf)_(?<" + PATTERN_GROUP_DATE + ">[0-9]{14})_(?<" + PATTERN_GROUP_NAME + ">[a-z]{3})_(other stuff stuf)\\.xml";
Pattern universalPattern = Pattern.compile(PATTERN_IMPORT_1 + "|" + PATTERN_IMPORT_2);
try {
DirectoryStream<Path> list = Files.newDirectoryStream(workDirectory);
for (Path file : list) {
Matcher matcher = universalPattern.matcher(file.getFileName().toString());
name = matcher.group(PATTERN_GROUP_NAME);
fileDate = dateFormatter.parseDateTime(matcher.group(PATTERN_GROUP_DATE));
(...)
will fail with a java.util.regex.PatternSyntaxException because the named capturing groups are already defined.
What would be the most efficient / elegant way of solving this problem?
Edits:
It goes without saying, but the two patterns I can match my input files against are different enough so no input file can match both.

Use two patterns; then the group names can be the same.
You asked for efficient and elegant. Theoretically, one pattern could be more efficient, but that is irrelevant here.
The code will be slightly longer but more readable, which is a weak point of regex, and that makes it easier to maintain.
In outline, inside the loop over the files:
String fileName = file.getFileName().toString();
Matcher m = firstPattern.matcher(fileName);
if (!m.matches()) {
    m = secondPattern.matcher(fileName);
    if (!m.matches()) {
        continue; // neither pattern applies to this file
    }
}
name = m.group(NAME_GROUP);
...
(Everyone wants to write clever code, but simplicity may be called for.)
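The two-pattern approach can be sketched as a small self-contained class. The two expressions are the ones from the question; the class name TwoPatterns, the method parse, and the choice to return a String array are assumptions made only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoPatterns {
    // Each pattern is compiled on its own, so both may use the group
    // names "name" and "date" without a PatternSyntaxException.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("(?<name>[a-z]{3})_(?<date>[0-9]{14})_(stuff stuf)\\.xml"),
        Pattern.compile("(stuff stuf)_(?<date>[0-9]{14})_(?<name>[a-z]{3})_(other stuff stuf)\\.xml")
    };

    /** Returns {name, date}, or null when no pattern matches the whole file name. */
    public static String[] parse(String fileName) {
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(fileName);
            if (m.matches()) {
                return new String[] { m.group("name"), m.group("date") };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "abc_20141220202020_stuff stuf.xml",
            "stuff stuf_20111130121212_def_other stuff stuf.xml"
        };
        for (String input : inputs) {
            String[] parts = parse(input);
            System.out.println("name : " + parts[0] + " fileDate: " + parts[1]);
        }
    }
}
```

Adding a third file name format then only means adding one more entry to the array, and the extraction code stays shared.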

I agree with Joop Eggen: two patterns are simple and easy to maintain.
Just for fun, here is a one-pattern implementation for your specific case (a little bit longer and uglier).
String[] inputs = {
"stuff stuf_20111130121212_abc_other stuff stuf.xml",
"stuff stuf_20111130151212_def_other stuff stuf.xml",
"abc_20141220202020_stuff stuf.xml",
"def_20140820202020_stuff stuf.xml"
};
String lookAhead = "(?=([a-z]{3}_[0-9]{14}_stuff stuf\\.xml)|(stuff stuf_[0-9]{14}_[a-z]{3}_other stuff stuf\\.xml))";
String onePattern = lookAhead
+ "((?<name>[a-z]{3})_(other stuff stuf)?|(stuff stuf_)?(?<date>[0-9]{14})_(stuff stuf)?){2}\\.xml";
Pattern universalPattern = Pattern.compile(onePattern);
for (String input : inputs) {
Matcher matcher = universalPattern.matcher(input);
if (matcher.find()) {
//System.out.println(matcher.group());
String name = matcher.group("name");
String fileDate = matcher.group("date");
System.out.println("name : " + name + " fileDate: "
+ fileDate);
}
}
The output:
name : abc fileDate: 20111130121212
name : def fileDate: 20111130151212
name : abc fileDate: 20141220202020
name : def fileDate: 20140820202020
Actually, in your case the lookAhead is not necessary. Because a single pattern cannot contain two groups with the same name, the pattern normally has to be restructured,
from AB|BA ---> (A|B){2}

Matching Urls Inside Strings

I am trying to write a regex that will match urls inside strings of text that may be html-encoded. I am having a considerable amount of trouble with lookaround though. I need something that would correctly match both links in the string below:
some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" "http://www.notarealwebsite.com" some other text
A verbose description of what I want would be: "http://" followed by any number of characters that are not spaces, quotes, or the string "&quot[semicolon]" (I don't care about accepting other non-url-safe characters as delimiters)
I have tried a few regexes using lookahead to check for &'s followed by q's followed by u's and so on, but as soon as I put one into the [^...] negation it just completely breaks down and evaluates more like: "http:// followed by any number of characters that are not spaces, quotes, ampersands, q's, u's, o's, t's, or semicolons" which is obviously not what I am looking for.
This will correctly match the &'s at the beginning of the &quot[semicolon]:
&(?=q(?=u(?=o(?=t(?=;)))))
But this does not work:
http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*
I know just enough about regexes to get into trouble, and that includes not knowing why this won't work the way I want it to. I understand to some extent positive and negative lookaround, but I don't understand why it breaks down inside the [^...]. Is it possible to do this with regexes? Or am I wasting my time trying to make it work?

If your regex implementation supports it, use a positive lookahead and a backreference, with a non-greedy expression in the body.
Here is one with your conditions: (["\s]|&quot;)(http://.*?)(?=\1)
For example, in Python:
import re
p = re.compile(r'(["\s]|&quot;)(https?://.*?)(?=\1)', re.IGNORECASE)
url = "http://test.url/here.php?var1=val&var2=val2"
formatstr = 'text "{0}" more text {0} and more &quot;{0}&quot; test greed&quot;'
data = formatstr.format(url)
for m in p.finditer(data):
    print("Found:", m.group(2))
Produces:
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Or in Java:
@Test
public void testRegex() {
    // Group 1 is the opening delimiter; the lookahead requires the same
    // delimiter again right after the URL.
    Pattern p = Pattern.compile("([\"\\s]|&quot;)(https?://.*?)(?=\\1)",
            Pattern.CASE_INSENSITIVE);
    final String URL = "http://test.url/here.php?var1=val&var2=val2";
    final String INPUT = "some text " + URL + " more text + \"" + URL +
            "\" more then &quot;" + URL + "&quot; testing greed &quot;";
    Matcher m = p.matcher(INPUT);
    while (m.find()) {
        System.out.println("Found: " + m.group(2));
    }
}
Produces the same output.
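On the question of why the lookahead stops working inside [^...]: within a character class, characters such as (, ?, and = are ordinary literals, so the class just excludes each of those characters individually. One alternative, sketched here, keeps the lookahead outside the class and checks it before every character. The class name UrlFinder, the method findUrls, and the encoded sample text are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlFinder {
    // "http://" or "https://", then characters that are neither whitespace
    // nor a double quote; (?!&quot;) is tested before each one, so the run
    // stops as soon as the entity &quot; begins.
    private static final Pattern URL =
            Pattern.compile("https?://(?:(?!&quot;)[^\\s\"])*");

    public static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "some text &quot;http://www.notarealwebsite.com/?q=asdf&amp;searchOrder=1&quot; "
                + "\"http://www.notarealwebsite.com\" some other text";
        for (String url : findUrls(text)) {
            System.out.println(url);
        }
    }
}
```

Unlike the backreference version, this does not require the URL to be closed by the same delimiter that opened it, so it also picks up URLs at the very start or end of the text.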

finding text until end of line regex

I'm trying to use regex to find a particular starting character and then getting the rest of the text on that particular line.
For example the text can be like ...
V: mid-voice
T: tempo
I want to use regex to grab "V:" and the text after it.
Is there any good, quick way to do this using regular expressions?

If your starting character is fixed, you can create a pattern like:
Pattern vToEndOfLine = Pattern.compile("(V:[^\\n]*)");
and use find() rather than matches().
If your starting character is dynamic, you can write a method that returns the desired pattern:
Pattern getTailOfLinePatternFor(String start) {
    // Pattern.quote keeps any regex metacharacters in start literal.
    return Pattern.compile("(" + Pattern.quote(start) + "[^\\n]*)");
}
These can be refined depending on your needs.

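For your example, try this pattern (in Java, compile it with Pattern.MULTILINE so that $ matches at the end of every line, not only at the end of the input):
V:.*$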
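A minimal sketch of that pattern in use, generalized to any line marker, is shown below. The class name LineTail and the method linesStartingWith are hypothetical, and the added ^ anchor is an assumption that the marker always sits at the start of a line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineTail {
    // Returns every line of text that begins with the given marker.
    public static List<String> linesStartingWith(String text, String marker) {
        // MULTILINE lets ^ and $ match at each line boundary;
        // Pattern.quote treats the marker as literal text.
        Pattern p = Pattern.compile("^" + Pattern.quote(marker) + ".*$", Pattern.MULTILINE);
        List<String> lines = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            lines.add(m.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "V: mid-voice\nT: tempo";
        System.out.println(linesStartingWith(text, "V:")); // [V: mid-voice]
    }
}
```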

Here's the best, cleanest and easiest (i.e. one-line) way:
String[] parts = str.split("(?<=^\\w+): ");
Explanation:
The regex uses a positive lookbehind to split on the ": " that follows the first word (in this case "V"), which yields both halves.
Here's a test:
String str = "V: mid-voice T: tempo";
String[] parts = str.split("(?<=^\\w+): ");
System.out.println("key = " + parts[0] + "\nvalue = " + parts[1]);
Output:
key = V
value = mid-voice T: tempo
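For comparison, here is a sketch that gets the same two halves without any lookbehind, using the limit argument of String.split so that only the first ": " is used as a separator. The class name KeyValueSplit and the method splitKeyValue are hypothetical.

```java
public class KeyValueSplit {
    // A limit of 2 means at most two parts: everything after the
    // first ": " stays together in the second element.
    public static String[] splitKeyValue(String line) {
        return line.split(": ", 2);
    }

    public static void main(String[] args) {
        String[] parts = splitKeyValue("V: mid-voice T: tempo");
        System.out.println("key = " + parts[0] + "\nvalue = " + parts[1]);
    }
}
```

This assumes the line actually contains ": "; if it does not, the returned array has a single element and parts[1] would throw an ArrayIndexOutOfBoundsException.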
