How would you improve the efficiency of this regex - java

I think the regex pattern I have used could be tidied up and made a little neater, but my knowledge of regular expressions is limited. I would like to scan and match a series of letters and numbers on new lines from an input file.
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;
public class App {
public static void main(String[] args) throws Exception {
if (args.length == 1) {
String fileName = args[0];
String fileContent = new Scanner(new File(fileName))
.useDelimiter("\\Z").next();
ArrayList<Integer> parsedContent = new ArrayList<>();
parsedContent = parseContentFromFileContent(fileContent);
int firstInt = parsedContent.get(0);
int secondInt = parsedContent.get(1);
int thirdInt = parsedContent.get(2);
int fourthInt = parsedContent.get(3);
int fifthInt = parsedContent.get(4);
System.out.println("First: " + firstInt);
System.out.println("Second: " + secondInt);
System.out.println("Third: " + thirdInt);
System.out.println("Fourth: " + fourthInt);
System.out.println("Fifth: " + fifthInt);
return;
}
}
public static ArrayList<Integer> parseContentFromFileContent(String fileContent) {
ArrayList<Integer> parsedInts = new ArrayList<>();
String pattern = "(.+?).((?:\\d*\\.)?\\d+)?\\n..((?:\\d*\\.)?\\d+)?\\n(.+?).((?:\\d*\\.)?\\d+)";
Pattern p = Pattern.compile(pattern, Pattern.DOTALL);
Matcher m = p.matcher(fileContent);
if (m.matches()) {
// Group 1: Has to match two letters
switch (m.group(1)) {
case "ab":
parsedInts.add(1);
break;
case "cd":
parsedInts.add(2);
break;
case "ef":
parsedInts.add(3);
break;
}
// Group 2: Has to match a number
parsedInts.add(Integer.parseInt(m.group(2)));
// Group 3: Has to match a number (the value after "A=")
parsedInts.add(Integer.parseInt(m.group(3)));
// Group 4: Has to match a single letter
switch (m.group(4)) {
case "a":
parsedInts.add(1);
break;
case "b":
parsedInts.add(2);
break;
case "c":
parsedInts.add(3);
break;
}
// Group 5: Has to match a number
parsedInts.add(Integer.parseInt(m.group(5)));
}
return parsedInts;
}
}
Input file:
ab-123 // Group 1 - Two letters a-z and Group 2 - Number
A=1 // Group 3 - Always A= [number]
a-1 // Group 4 - Letter a-z and Group 5 - Number
cd-1234
A=2
b-2
ef-12345
a=4
c-3
gh-123456
a=4
d-4
Is there a better (cleaner) regex pattern I could use to capture the data from the file above?
pattern = (.+?).((?:\\d*\\.)?\\d+)?\\n..((?:\\d*\\.)?\\d+)?\\n(.+?).((?:\\d*\\.)?\\d+)

Your pattern at the moment isn't very precise, contrary to the description you gave. It contains a lot of .+?, but your description quite clearly says two letters, or always A=, so you could use exactly that in your pattern. Your pattern also accounts for decimal numbers, while there are none in the shown input, so you might be able to drop (?:\\d*\\.)?. Furthermore, all your number-matching groups are optional, but according to your description they shouldn't be.
If one takes your pattern quite literally, a possible pattern would be
([a-z]{2})-(\\d+)\\n[Aa]=(\\d+)\\n([a-z])-(\\d+)
See https://regex101.com/r/WNxUQa/1
Note that you might have to adjust your pattern a bit (e.g. anchoring it with ^ and $) if the input might be malformed or malicious.
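A minimal sketch of how the stricter pattern could be applied to the whole file content, using find() so that every three-line record is picked up rather than only one. The class name RecordParser and the parse method are made up for this illustration, and the sketch assumes the real file does not contain the "// Group ..." annotations shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordParser {
    // One record = three consecutive lines, e.g. "ab-123", "A=1", "a-1".
    private static final Pattern RECORD = Pattern.compile(
            "([a-z]{2})-(\\d+)\\n[Aa]=(\\d+)\\n([a-z])-(\\d+)");

    // Returns the five captured groups of every record found in the content.
    static List<String[]> parse(String content) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(content);
        while (m.find()) {
            String[] groups = new String[m.groupCount()];
            for (int i = 1; i <= m.groupCount(); i++) {
                groups[i - 1] = m.group(i);
            }
            records.add(groups);
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "ab-123\nA=1\na-1\ncd-1234\nA=2\nb-2\n";
        for (String[] record : parse(content)) {
            System.out.println(String.join(" ", record));
        }
    }
}
```

If the file may use Windows line endings, replacing \\n with \\R (available since Java 8) is one option.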

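There is little to optimise in a regular expression's performance unless it causes backtracking that you can remove. You can optimise the way it looks. In theory, regular expressions that describe the same language correspond to equivalent DFAs; note, however, that java.util.regex is a backtracking engine rather than a DFA, so in practice it is mostly the amount of backtracking that makes one equivalent pattern slower than another.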

Related

Regex to capture unpaired brackets or parentheses

As the title indicates, how do I capture unpaired brackets or parentheses with a regex in Java? I am new to Java. For instance, suppose I have the string below:
Programming is productive, (achieving a lot, and getting good results), it is often 1) demanding and 2) costly.
How do I capture 1) and 2).
I have tried:
([^\(\)][\)])
But, the result I am getting includes s) as below, instead of 1) and 2):
s), 1) and 2)
I have checked the link: Regular expression to match balanced parentheses, but that question seems to refer to recursive or nested structures, which is quite different from my situation.
My situation is to match the right parenthesis or right bracket, along with any associated text that does not have an associated left parenthesis or bracket.
Maybe,
\b\d+\)
might simply return the desired output, I guess.
Another way is to see what left boundary you might have, which in this case appears to be a digit, then what other chars we'd have prior to the closing parenthesis, and then we can design some other simple expression similar to:
\b\d[^)]*\)
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegularExpression{
public static void main(String[] args){
final String regex = "\\b\\d[^)]*\\)";
final String string = "Programming is productive, (achieving a lot, and getting good results), it is often 1) demanding and 2) costly.\n\n"
+ "Programming is productive, (achieving a lot, and getting good results), it is often 1a b) demanding and 2a a) costly.\n\n\n"
+ "Programming is productive, (achieving a lot, and getting good results), it is often 1b) demanding and 2b) costly.\n\n"
+ "It is not supposed to match ( s s 1) \n";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
}
Output
Full match: 1)
Full match: 2)
Full match: 1a b)
Full match: 2a a)
Full match: 1b)
Full match: 2b)
Full match: 1)
This is not a regex solution (obviously), but I can't think of a good way to do it with a regex. This simply uses a counter to keep track of open parens and a stack to collect the characters in front of an unmatched one.
For the input String "(*(**)**) first) second) (**) (*ksks*) third) ** fourth)( **)"
It prints out
first)
second)
third)
fourth)
All other parentheses are ignored because they are matched.
String s =
"(*(**)**) first) second) (**) (*ksks*) third) ** fourth)( **)";
Pattern p;
List<String> found = new ArrayList<>();
Stack<Character> tokens = new Stack<>();
int pcount = 0;
for (char c : s.toCharArray()) {
switch (c) {
case ' ':
tokens.clear();
break;
case '(':
pcount++;
break;
case ')':
pcount--;
if (pcount == -1) {
String v = ")";
while (!tokens.isEmpty()) {
v = tokens.pop() + v;
}
found.add(v);
pcount = 0;
}
break;
default:
tokens.push(c);
}
}
found.forEach(System.out::println);
Note: Integrating brackets (]) into the above would be a challenge (though not impossible) because one would need to check constructs like ( [ ) ] where it is unclear how to interpret it. That's why when specifying requirements of this sort they need to be spelled out precisely.
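One possible way to bring brackets in, sketched under a loud assumption: parentheses and square brackets are counted independently of each other, so an interleaved construct such as ( [ ) ] is treated as fully paired. That is only one of several possible interpretations, not a definitive answer to the ambiguity noted above. The class name UnpairedClosers is made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class UnpairedClosers {
    // Collects every ')' or ']' that has no earlier matching opener,
    // together with the characters seen since the last space.
    // Assumption: the two bracket kinds are tracked independently.
    static List<String> find(String s) {
        List<String> found = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        int parens = 0;
        int brackets = 0;
        for (char c : s.toCharArray()) {
            switch (c) {
                case ' ':
                    token.setLength(0);
                    break;
                case '(':
                    parens++;
                    break;
                case '[':
                    brackets++;
                    break;
                case ')':
                    if (parens > 0) {
                        parens--;
                    } else {
                        found.add(token.toString() + c);
                        token.setLength(0);
                    }
                    break;
                case ']':
                    if (brackets > 0) {
                        brackets--;
                    } else {
                        found.add(token.toString() + c);
                        token.setLength(0);
                    }
                    break;
                default:
                    token.append(c);
            }
        }
        return found;
    }
}
```

For example, UnpairedClosers.find("it is often 1) demanding and 2] costly (ok) [fine]") returns the two entries 1) and 2].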

Filter a list by a separated letters search term

I have some sort of a list like the example below which contains (Car & Racoon) words. If I want to search for "c r" I want it to return me "Car" not "Racoon". Below is my current approach but it doesn't consider the order of letters and returns me "Racoon" as well. I want a solution that is as flexible as possible for any spaces separated search terms/letters.
String[] words_list = {"Car", "Racoon"};
String search_input = "c r";
String[] input_parts = search_input.trim().toLowerCase().split(" ");
for (String word : words_list){
int matches= 0;
for (String letter : input_parts) {
if (word.toLowerCase().contains(letter)) {
++matches;
}
}
if (matches == input_parts.length) {
Log.d("Result : ", word);
}
}
This is a good use case for Regular Expression.
Try something like this:
String[] wordsList = {"Car", "Racoon"};
String searchInput = "c r";
String searchRegEx = searchInput.replace(" ", ".{1}");
Pattern pattern = Pattern.compile(searchRegEx, Pattern.CASE_INSENSITIVE);
System.out.println("RegEx (case-insensitive) is: " + searchRegEx);
for (String word : wordsList){
Matcher matcher = pattern.matcher(word);
boolean match = matcher.matches();
System.out.println("Test word '"+word + "' and match was: " + match);
}
As you can see, I find each space and replace it with .{1}, which means exactly one instance of any character. If you want to be open to matching one or more characters, you can use something like .+ instead. Or you can be more specific and specify that only characters a-z in upper and lower case should be matched: [a-zA-Z]{1} or [a-zA-Z]+. Note that matches() requires the whole word to match the pattern. The Pattern.CASE_INSENSITIVE flag is important because otherwise your word list having a capital C for Car will not match the lowercase input.
In this case, compiling the Pattern is an important optimisation. As you know RegEx can be slow and if you in-line this in your for-loop, it will compile your regular expression for each test which will be slow and inefficient.
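For the "as flexible as possible" part of the question, a looser variant might look like the sketch below. It is an illustration rather than the answer's exact method: each space-separated part is quoted with Pattern.quote, the parts are joined with .* (any number of characters in between), and find() is used instead of matches() so the letters may appear anywhere in the word as long as they keep their order. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class OrderedLetterFilter {
    // Builds a pattern such as \Qc\E.*\Qr\E from the search term "c r".
    static Pattern toPattern(String searchInput) {
        StringBuilder regex = new StringBuilder();
        for (String part : searchInput.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            if (regex.length() > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(part));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Keeps the words that contain all parts in the given order.
    static List<String> filter(List<String> words, String searchInput) {
        Pattern pattern = toPattern(searchInput); // compiled once, outside the loop
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (pattern.matcher(word).find()) {
                result.add(word);
            }
        }
        return result;
    }
}
```

With the words Car and Racoon and the search term "c r", only Car is kept, because Racoon has no r after its c.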

Java regex input from txt file

I have a text file that includes some mathematical expressions.
I need to parse the text into components (words, sentences, punctuation, numbers and arithmetic signs) using regular expressions, calculate mathematical expressions and return the text in the original form with the calculated numbers expressions.
I have done this without regular expressions (and without the calculation). Now I am trying to do it using regular expressions.
I do not fully understand how to do this correctly. The input text is like this:
Pete like mathematic 5+3 and jesica too sin(3).
In the output I need:
Pete like mathematic 8 and jesica too 0,14.
I need some advice with regex and calculation from people who know how to do this.
My code:
final static Pattern PUNCTUATION = Pattern.compile("([\\s.,!?;:]){1,}");
final static Pattern LETTER = Pattern.compile("([а-яА-Яa-zA-Z&&[^sin]]){1,}");
List<Sentence> sentences = new ArrayList<Sentence>();
List<PartOfSentence> parts = new ArrayList<PartOfSentence>();
StringTokenizer st = new StringTokenizer(text, " \t\n\r:;.!?,/\\|\"\'",
true);
The code with regex (not working):
while (st.hasMoreTokens()) {
String s = st.nextToken().trim();
int size = s.length();
for (int i=0; i<s.length();i++){
//with regex. not working variant
Matcher m = LETTER.matcher(s);
if (m.matches()){
parts.add(new Word(s.toCharArray()));
}
m = PUNCTUATION.matcher(s);
if (m.matches()){
parts.add(new Punctuation(s.charAt(0)));
}
Sentence buf = new Sentence(parts);
if (buf.getWords().size() != 0) {
sentences.add(buf);
parts = new ArrayList<PartOfSentence>();
} else
parts.add(new Punctuation(s.charAt(0)));
Without regex (working):
if (size < 1)
continue;
if (size == 1) {
switch (s.charAt(0)) {
case ' ':
continue;
case ',':
case ';':
case ':':
case '\'':
case '\"':
parts.add(new Punctuation(s.charAt(0)));
break;
case '.':
case '?':
case '!':
parts.add(new Punctuation(s.charAt(0)));
Sentence buf = new Sentence(parts);
if (buf.getWords().size() != 0) {
sentences.add(buf);
parts = new ArrayList<PartOfSentence>();
} else
parts.add(new Punctuation(s.charAt(0)));
break;
default:
parts.add(new Word(s.toCharArray()));
}
} else {
parts.add(new Word(s.toCharArray()));
}
}
This is not a trivial problem to solve as even matching numbers can become quite involved.
Firstly, a number can be matched by the regular expression "(\\d*(\\.\\d*)?\\d(e\\d+)?)" to account for decimal places and exponent formats.
Secondly, there are (at least) three types of expressions that you want to solve: binary, unary and functions. For each one, we create a pattern to match in the solve method.
Thirdly, there are numerous libraries that can implement the reduce method; the code below uses exp4j.
The implementation below does not handle nested expressions e.g., sin(5) + cos(3) or spaces in expressions.
private static final String NUM = "(\\d*(\\.\\d*)?\\d(e\\d+)?)";
public String solve(String expr) {
expr = solve(expr, "(" + NUM + "(!|\\+\\+|--))"); //unary operators
// '-' comes first in the class so it is a literal, not a range from '+' to '/'
expr = solve(expr, "(" + NUM + "([-+/*]" + NUM + ")+)"); // binary operators
expr = solve(expr, "((sin|cos|tan)\\(" + NUM + "\\))"); // functions
return expr;
}
private String solve(String expr, String pattern) {
Matcher m = Pattern.compile(pattern).matcher(expr);
// assume a reduce method :String -> String that solves expressions
StringBuffer sb = new StringBuffer();
// replace each match with its own result (replaceAll would reuse the first result for all)
while(m.find()){
m.appendReplacement(sb, Matcher.quoteReplacement(reduce(m.group())));
}
m.appendTail(sb);
return sb.toString();
}
//evaluate expression using exp4j, format to 2 decimal places,
//remove trailing 0s and dangling decimal point
private String reduce(String expr){
double res = new ExpressionBuilder(expr).build().evaluate();
return String.format("%.2f",res).replaceAll("0*$", "").replaceAll("\\.$", "");
}
I think you could start by looking for "Function" matches in your input String. Anything that is not matched by a Function is simply returned unchanged.
For example, this short code does, I hope, what you are seeking:
Class with main method:
import java.util.ArrayList;
import java.util.StringTokenizer;
public class App {
StringTokenizer st = new StringTokenizer("Pete likes Mathematics 3+3 and Jessica too 6+3.", " \t\n\r:;.!?,/\\|\"\'", true);
public static void main(String[] args) {
new App();
}
public App(){
ArrayList<String> renderedStrings = new ArrayList<String>();
while(st.hasMoreTokens()){
String s = st.nextToken();
if(!AdditionPatternFuntion.render(s, renderedStrings)){
renderedStrings.add(s);
}
}
for(String s : renderedStrings){
System.out.print(s);
}
}
}
Class "AdditionPatternFuntion" that does the real job:
import java.util.ArrayList;
import java.util.StringTokenizer;
import java.util.regex.Pattern;
class AdditionPatternFuntion{
public static boolean render(String s, ArrayList<String> renderedStrings){
Pattern pattern = Pattern.compile("(\\d\\+\\d)");
boolean match = pattern.matcher(s).matches();
if(match){
StringTokenizer additionTokenizer = new StringTokenizer(s, "+", false);
int firstOperand = Integer.parseInt(additionTokenizer.nextToken());
int secondOperand = Integer.parseInt(additionTokenizer.nextToken());
renderedStrings.add(Integer.toString(firstOperand + secondOperand));
}
return match;
}
}
When I run with this input :
Pete likes Mathematics 3+3 and Jessica too 6+3.
I get this output:
Pete likes Mathematics 6 and Jessica too 9.
To handle the "sin()" function you can do the same: create a new class, "SinPatternFunction" for instance, with its own pattern.
I think you should even create an abstract class "FunctionPattern" with an abstract method "render", which the AdditionPatternFunction and SinPatternFunction classes implement.
Finally, you could create a class, let's call it "PatternFunctionHandler", which creates a list of PatternFunctions (a SinPatternFunction, an AdditionPatternFunction and so on), then calls render on each one and returns the result.
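A rough sketch of how that class layout might look. The class names follow the ones suggested in this answer; the method signatures, the patterns and the two-decimal formatting are assumptions made for the illustration. Note that it formats with a dot ("0.14"); producing "0,14" as in the question would need a locale that uses a decimal comma.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

abstract class FunctionPattern {
    private final Pattern pattern;

    protected FunctionPattern(String regex) {
        this.pattern = Pattern.compile(regex); // compiled once per function
    }

    // Returns the evaluated text if the whole token matches, otherwise null.
    final String render(String token) {
        Matcher m = pattern.matcher(token);
        return m.matches() ? evaluate(m) : null;
    }

    protected abstract String evaluate(Matcher m);
}

class AdditionPatternFunction extends FunctionPattern {
    AdditionPatternFunction() {
        super("(\\d+)\\+(\\d+)");
    }

    @Override
    protected String evaluate(Matcher m) {
        return Integer.toString(Integer.parseInt(m.group(1)) + Integer.parseInt(m.group(2)));
    }
}

class SinPatternFunction extends FunctionPattern {
    SinPatternFunction() {
        super("sin\\((\\d+)\\)");
    }

    @Override
    protected String evaluate(Matcher m) {
        return String.format(Locale.ROOT, "%.2f", Math.sin(Double.parseDouble(m.group(1))));
    }
}

class PatternFunctionHandler {
    private final List<FunctionPattern> functions =
            Arrays.asList(new AdditionPatternFunction(), new SinPatternFunction());

    // Tries each function in turn; tokens nobody recognises are returned unchanged.
    String render(String token) {
        for (FunctionPattern function : functions) {
            String rendered = function.render(token);
            if (rendered != null) {
                return rendered;
            }
        }
        return token;
    }
}
```

The loop in the App class above could then call new PatternFunctionHandler().render(s) for each token and print the result.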
Your specified requirement is to use regular expressions to:
Divide text into components (words, ...)
Return text with inner arithmetic expressions evaluated
You have started with the first step using regular expressions, but have not quite completed it -- after completing it, there remains to:
Recognize and parse components that form arithmetic (sub)expressions.
Evaluate recognized (sub)expression components and produce a value. For evaluating (sub)expressions in infix notation, there exists a very helpful answer.
Substituting value replacements back into original string -- should be simple.
For text division into components defined strictly enough to allow later unambiguous evaluation of the subexpression, I coded a sample, trying out named capturing groups in Java. This sample handles only integer numbers, but floating point should be simple to add.
Sample output on some test inputs was as follows:
Matching 'Pete like mathematic 5+3 and jesica too sin(3).'
WORD('Pete'),WS(' '),WORD('like'),WS(' '),WORD('mathematic'),WS(' '),NUM('5'),OP('+'),NUM('3'),WS(' '),WORD('and'),WS(' '),WORD('jesica'),WS(' '),WORD('too'),WS(' '),FUNC('sin'),FOPENP('('),NUM('3'),CLOSEP(')'),DOT('.')
Matching 'How about solving sin(3 + cos(x)).'
WORD('How'),WS(' '),WORD('about'),WS(' '),WORD('solving'),WS(' '),FUNC('sin'),FOPENP('('),NUM('3'),WS(' '),OP('+'),WS(' '),FUNC('cos'),FOPENP('('),WORD('x'),CLOSEP(')'),CLOSEP(')'),DOT('.')
Matching 'Or arcsin(4.2) we do not know about?'
WORD('Or'),WS(' '),WORD('arcsin'),OPENP('('),NUM('4'),DOT('.'),NUM('2'),CLOSEP(')'),WS(' '),WORD('we'),WS(' '),WORD('do'),WS(' '),WORD('not'),WS(' '),WORD('know'),WS(' '),WORD('about'),PUNCT('?')
Matching ''sin sin sin' the catholic priest has said...'
PUNCT('''),WORD('sin'),WS(' '),WORD('sin'),WS(' '),WORD('sin'),PUNCT('''),WS(' '),WORD('the'),WS(' '),WORD('catholic'),WS(' '),WORD('priest'),WS(' '),WORD('has'),WS(' '),WORD('said'),DOT('.'),DOT('.'),DOT('.')
On named capturing group usage, I found it inconvenient that compiled Pattern or acquired Matcher APIs do not provide access to present group names. Sample code below.
import java.util.*;
import java.util.regex.*;
import static java.util.stream.Collectors.joining;
public class Lexer {
// differentiating _function call opening parentheses_ from expressions one
static final String S_FOPENP = "(?<fopenp>\\()";
static final String S_FUNC = "(?<func>(sin|cos|tan))" + S_FOPENP;
// expression or text opening parentheses
static final String S_OPENP = "(?<openp>\\()";
// expression or text closing parentheses
static final String S_CLOSEP = "(?<closep>\\))";
// separate dot, should help with introducing floating-point support
static final String S_DOT = "(?<dot>\\.)";
// other recognized punctuation
static final String S_PUNCT = "(?<punct>[,!?;:'\"])";
// whitespace
static final String S_WS = "(?<ws>\\s+)";
// integer number pattern
static final String S_NUM = "(?<num>\\d+)";
// treat '* / + -' as mathematical operators. Can be in dashed text.
static final String S_OP = "(?<op>\\*|/|\\+|-)";
// word -- refrain from using \w character class that also includes digits
static final String S_WORD = "(?<word>[a-zA-Z]+)";
// put the predefined components together into single regular expression
private static final String S_ALL = "(" +
S_OPENP + "|" + S_CLOSEP + "|" + S_FUNC + "|" + S_DOT + "|" +
S_PUNCT + "|" + S_WS + "|" + S_NUM + "|" + S_OP + "|" + S_WORD +
")";
static final Pattern ALL = Pattern.compile(S_ALL); // ... & form Pattern
// named capturing groups defined in regular expressions
static final List<String> GROUPS = Arrays.asList(
"func", "fopenp",
"openp", "closep",
"dot", "punct", "ws",
"num", "op",
"word"
);
// divide match into components according to capturing groups
static final List<String> tokenize(Matcher m) {
List<String> tokens = new LinkedList<>();
while (m.find()){
for (String group : GROUPS) {
String grResult = m.group(group);
if (grResult != null)
tokens.add(group.toUpperCase() + "('" + grResult + "')");
}
}
return tokens;
}
// some sample inputs to test
static final List<String> INPUTS = Arrays.asList(
"Pete like mathematic 5+3 and jesica too sin(3).",
"How about solving sin(3 + cos(x)).",
"Or arcsin(4.2) we do not know about?",
"'sin sin sin' the catholic priest has said..."
);
// test
public static void main(String[] args) {
for (String input: INPUTS) {
Matcher m = ALL.matcher(input);
System.out.println("Matching '" + input + "'");
System.out.println(tokenize(m).stream().collect(joining(",")));
}
}
}

pattern matching using regular expressions replace by digits

My program takes a long string from the user, such as aaaabaaaaaba.
The output should replace aaa with 0 and aba with 1 in the given string.
Sequences must not overlap: each one is an individual block of three characters.
For example, in aaaabaaabaaaaba the blocks are aaa-aba-aab-aaa-aba.
Please help me write this program.
Example: the input aaaabaaaaaba gives the output 0101.
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Pattern1 {
Scanner sc =new Scanner(System.in);
public void m1()
{ String s;
System.out.println("enter a string");
s=sc.nextLine();
assertTrue(s!=null);
Pattern p = Pattern.compile(s);
Matcher m =p.matcher(".(aaa");
Matcher m1 =p.matcher("aba");
while(m.find())
{
s.replaceAll(s, "1");
}
while(m1.find())
{
s.replaceAll(s, "0");
}
System.out.println(s);
}
private boolean assertTrue(boolean b) {
return b;
// TODO Auto-generated method stub
}
public static void main(String[] args) {
Pattern1 p = new Pattern1();
p.m1();
}
}
With regex and find you can search for each successive match and then add a 0 or 1 depending on the characters to the output.
String test = "aaaabaaaaabaaaa";
Pattern compile = Pattern.compile("(?<triplet>(aaa)|(aba))");
Matcher matcher = compile.matcher(test);
StringBuilder out = new StringBuilder();
int start = 0;
while (matcher.find(start)) {
String triplet = matcher.group("triplet");
switch (triplet) {
case "aaa":
out.append("0");
break;
case "aba":
out.append("1");
break;
}
start = matcher.end();
}
System.out.println(out.toString());
If you have "aaaaaba" (one a too many after the first triplet) as input, it will ignore the extra "a" and output "01". So any invalid characters between valid triplets will be ignored.
If you want to go through the string in blocks of 3, you can use a for-loop and the substring() function like this:
String test = "aaaabaaaaabaaaa";
StringBuilder out = new StringBuilder();
for (int i = 0; i < test.length() - 2; i += 3) {
String triplet = test.substring(i, i + 3);
switch (triplet) {
case "aaa":
out.append("0");
break;
case "aba":
out.append("1");
break;
}
}
System.out.println(out.toString());
In this case, if a triplet is invalid, it will just be ignored and neither a "0" nor a "1" will be added to the output. If you want to do something in this case, just add a default clause to the switch statement.
Here's what I understand from your question:
The user string will be some sequence of the tokens "aaa" and "aba"
There will be no other combinations of 'a' and 'b'. For example, you will not get "aaabaa" as an input string, as "baa" is invalid.
For each consecutive 3 character string, replace "aaa" with 0 and "aba" with 1.
I'm guessing that this is a homework assignment designed to teach you about the dangers of catastrophic backtracking and how to carefully use quantifiers.
My suggestion would be to do this in two parts:
Identify and replace each 3-letter segment with a single character.
Replace those characters with the appropriate value. ('1' or '0')
For example, first construct a pattern like a([ab])a to capture the character ('a' or 'b') between two 'a's. Then, use the Matcher class' replaceAll method to replace each match with the captured character. So, for input aaaabaaaaaba you get abab as a result. Finally, replace all 'a' with '0' and all 'b' with '1'.
In Java:
// Create the matcher to identify triplets in the form "aaa" or "aba"
Matcher tripletMatcher = Pattern.compile("a([ab])a").matcher(inputString);
// Replace each triplet with the middle letter, then replace 'a' and 'b' properly.
String result = tripletMatcher.replaceAll("$1").replace('a', '0').replace('b', '1');
There are better ways of doing this, of course, but this should work. I've left the code intentionally dense and hard to read quickly. So, if this is a homework assignment, make sure you understand it fully and then rewrite it yourself.
Also, keep in mind that this will not work if the input string isn't a sequence of "aaa" and "aba". Any other combination, such as "baa" or "abb", will cause errors. For example, ababaa, aababa, and aaabab will all result in unexpected and potentially incorrect results.
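A self-contained sketch of the two-step replacement, with one addition that is not part of the answer above: the input is first checked against (?:aaa|aba)*, so strings that are not a pure sequence of the two triplets are rejected instead of silently producing a wrong result. The class name TripletTranslator is made up for this illustration.

```java
import java.util.regex.Pattern;

class TripletTranslator {
    // Added validation (an assumption of this sketch): only aaa/aba sequences are accepted.
    private static final Pattern VALID = Pattern.compile("(?:aaa|aba)*");
    // Captures the middle letter of each triplet.
    private static final Pattern TRIPLET = Pattern.compile("a([ab])a");

    static String translate(String input) {
        if (!VALID.matcher(input).matches()) {
            throw new IllegalArgumentException("Not a sequence of aaa/aba: " + input);
        }
        // Step 1: aaa -> a, aba -> b. Step 2: a -> 0, b -> 1.
        return TRIPLET.matcher(input).replaceAll("$1")
                .replace('a', '0')
                .replace('b', '1');
    }

    public static void main(String[] args) {
        System.out.println(translate("aaaabaaaaaba")); // prints 0101
    }
}
```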

Tokenizing a String but ignoring delimiters within quotes

I wish to have the following String
!cmd 45 90 "An argument" Another AndAnother "Another one in quotes"
to become an array of the following
{ "!cmd", "45", "90", "An argument", "Another", "AndAnother", "Another one in quotes" }
I tried
new StringTokenizer(cmd, "\"")
but this would return "Another" and "AndAnother" as "Another AndAnother", which is not the desired effect.
Thanks.
EDIT:
I have changed the example yet again, this time I believe it explains the situation best although it is no different than the second example.
It's much easier to use a java.util.regex.Matcher and do a find() rather than any kind of split in these kinds of scenarios.
That is, instead of defining the pattern for the delimiter between the tokens, you define the pattern for the tokens themselves.
Here's an example:
String text = "1 2 \"333 4\" 55 6 \"77\" 8 999";
// 1 2 "333 4" 55 6 "77" 8 999
String regex = "\"([^\"]*)\"|(\\S+)";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
if (m.group(1) != null) {
System.out.println("Quoted [" + m.group(1) + "]");
} else {
System.out.println("Plain [" + m.group(2) + "]");
}
}
The above prints (as seen on ideone.com):
Plain [1]
Plain [2]
Quoted [333 4]
Plain [55]
Plain [6]
Quoted [77]
Plain [8]
Plain [999]
The pattern is essentially:
"([^"]*)"|(\S+)
 \_____/  \___/
    1       2
There are 2 alternates:
The first alternate matches the opening double quote, a sequence of anything but double quote (captured in group 1), then the closing double quote
The second alternate matches any sequence of non-whitespace characters, captured in group 2
The order of the alternates matters in this pattern
Note that this does not handle escaped double quotes within quoted segments. If you need to do this, then the pattern becomes more complicated, but the Matcher solution still works.
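As an illustration of what "more complicated" could mean here, the sketch below assumes that a backslash escapes the following character inside a quoted segment (so \" is a literal quote). The first alternate then becomes "((?:[^"\\]|\\.)*)". The escaping convention, the class name and the unescaping step are assumptions of this sketch, not part of the original pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareTokenizer {
    // Alternate 1: a quoted segment in which a backslash escapes the next character.
    // Alternate 2: a run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"|(\\S+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // Drop the escaping backslashes: \" becomes " and \\ becomes \
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2));
            }
        }
        return tokens;
    }
}
```

For the input !cmd 45 "An argument" Another "say \"hi\"" this yields the tokens !cmd, 45, An argument, Another and say "hi".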
References
regular-expressions.info/Brackets for Grouping and Capturing, Alternation with Vertical Bar, Character Class, Repetition with Star and Plus
See also
regular-expressions.info/Examples - Programmer - Strings - for pattern with escaped quotes
Appendix
Note that StringTokenizer is a legacy class. It's recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for most flexibility.
Related questions
Difference between a Deprecated and Legacy API?
Scanner vs. StringTokenizer vs. String.Split
Validating input using java.util.Scanner - has many examples
Do it the old fashioned way. Make a function that looks at each character in a for loop. If the character is a space, take everything up to that (excluding the space) and add it as an entry to the array. Note the position, and do the same again, adding that next part to the array after a space. When a double quote is encountered, mark a boolean named 'inQuote' as true, and ignore spaces when inQuote is true. When you hit quotes when inQuote is true, flag it as false and go back to breaking things up when a space is encountered. You can then extend this as necessary to support escape chars, etc.
Could this be done with a regex? I don't know; I guess so. But the whole function would take less time to write than this reply did.
Apache Commons to the rescue!
import org.apache.commons.text.StringTokenizer
import org.apache.commons.text.matcher.StringMatcher
import org.apache.commons.text.matcher.StringMatcherFactory
@Grab(group='org.apache.commons', module='commons-text', version='1.3')
def str = /is this 'completely "impossible"' or """slightly"" impossible" to parse?/
StringTokenizer st = new StringTokenizer( str )
StringMatcher sm = StringMatcherFactory.INSTANCE.quoteMatcher()
st.setQuoteMatcher( sm )
println st.tokenList
Output:
[is, this, completely "impossible", or, "slightly" impossible, to, parse?]
A few notes:
this is written in Groovy... it is in fact a Groovy script. The @Grab line gives a clue to the sort of dependency line you need (e.g. in build.gradle) ... or just include the .jar in your classpath of course
StringTokenizer here is NOT java.util.StringTokenizer ... as the import line shows it is org.apache.commons.text.StringTokenizer
the def str = ... line is a way to produce a String in Groovy which contains both single quotes and double quotes without having to go in for escaping
StringMatcherFactory is part of Apache commons-text 1.3: its INSTANCE can provide you with a bunch of different StringMatchers. You could even roll your own, but you'd need to examine the StringMatcherFactory source code to see how it's done.
YES! You can not only include the "other type of quote" and it is correctly interpreted as not being a token boundary ... but you can even escape the actual quote which is being used to turn off tokenising, by doubling the quote within the tokenisation-protected bit of the String! Try implementing that with a few lines of code ... or rather don't!
PS why is it better to use Apache Commons than any other solution?
Apart from the fact that there is no point re-inventing the wheel, I can think of at least two reasons:
The Apache engineers can be counted on to have anticipated all the gotchas and developed robust, comprehensively tested, reliable code
It means you don't clutter up your beautiful code with stoopid utility methods - you just have a nice, clean bit of code which does exactly what it says on the tin, leaving you to get on with the, um, interesting stuff...
PPS Nothing obliges you to look on the Apache code as mysterious "black boxes". The source is open, and written in usually perfectly "accessible" Java. Consequently you are free to examine how things are done to your heart's content. It's often quite instructive to do so.
later
Sufficiently intrigued by ArtB's question I had a look at the source:
in StringMatcherFactory.java we see:
private static final AbstractStringMatcher.CharSetMatcher QUOTE_MATCHER = new AbstractStringMatcher.CharSetMatcher(
"'\"".toCharArray());
... rather dull ...
so that leads one to look at StringTokenizer.java:
public StringTokenizer setQuoteMatcher(final StringMatcher quote) {
if (quote != null) {
this.quoteMatcher = quote;
}
return this;
}
OK... and then, in the same java file:
private int readWithQuotes(final char[] srcChars ...
which contains the comment:
// If we've found a quote character, see if it's followed by a second quote. If so, then we need to actually put the quote character into the token rather than end the token.
... I can't be bothered to follow the clues any further. You have a choice: either your "hackish" solution, where you systematically pre-process your strings before submitting them for tokenising, turning |\\"|s into |""|s... (i.e. where you replace each |"| with |""|)...
Or... you examine org.apache.commons.text.StringTokenizer.java to figure out how to tweak the code. It's a small file. I don't think it would be that difficult. Then you compile, essentially making a fork of the Apache code.
I don't think it can be configured. But if you found a code-tweak solution which made sense you might submit it to Apache and then it might be accepted for the next iteration of the code, and your name would figure at least in the "features request" part of Apache: this could be a form of kleos through which you achieve programming immortality...
In an old fashioned way:
public static String[] split(String str) {
str += " "; // To detect last token when not quoted...
ArrayList<String> strings = new ArrayList<String>();
boolean inQuote = false;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
if (c == '"' || c == ' ' && !inQuote) {
if (c == '"')
inQuote = !inQuote;
if (!inQuote && sb.length() > 0) {
strings.add(sb.toString());
sb.delete(0, sb.length());
}
} else
sb.append(c);
}
return strings.toArray(new String[strings.size()]);
}
I assume that nested quotes are illegal, and also that empty tokens can be omitted.
This is an old question, however this was my solution as a finite state machine.
Efficient, predictable and no fancy tricks.
100% coverage on tests.
Drag and drop into your code.
/**
* Splits a command on whitespaces. Preserves whitespace in quotes. Trims excess whitespace between chunks. Supports quote
* escape within quotes. Failed escape will preserve escape char.
*
* @return List of split commands
*/
static List<String> splitCommand(String inputString) {
List<String> matchList = new LinkedList<>();
LinkedList<Character> charList = inputString.chars()
.mapToObj(i -> (char) i)
.collect(Collectors.toCollection(LinkedList::new));
// Finite-State Automaton for parsing.
CommandSplitterState state = CommandSplitterState.BeginningChunk;
LinkedList<Character> chunkBuffer = new LinkedList<>();
for (Character currentChar : charList) {
switch (state) {
case BeginningChunk:
switch (currentChar) {
case '"':
state = CommandSplitterState.ParsingQuote;
break;
case ' ':
break;
default:
state = CommandSplitterState.ParsingWord;
chunkBuffer.add(currentChar);
}
break;
case ParsingWord:
switch (currentChar) {
case ' ':
state = CommandSplitterState.BeginningChunk;
String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
matchList.add(newWord);
chunkBuffer = new LinkedList<>();
break;
default:
chunkBuffer.add(currentChar);
}
break;
case ParsingQuote:
switch (currentChar) {
case '"':
state = CommandSplitterState.BeginningChunk;
String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
matchList.add(newWord);
chunkBuffer = new LinkedList<>();
break;
case '\\':
state = CommandSplitterState.EscapeChar;
break;
default:
chunkBuffer.add(currentChar);
}
break;
case EscapeChar:
switch (currentChar) {
case '"': // Intentional fall through
case '\\':
state = CommandSplitterState.ParsingQuote;
chunkBuffer.add(currentChar);
break;
default:
state = CommandSplitterState.ParsingQuote;
chunkBuffer.add('\\');
chunkBuffer.add(currentChar);
}
}
}
if (state != CommandSplitterState.BeginningChunk) {
String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
matchList.add(newWord);
}
return matchList;
}
private enum CommandSplitterState {
BeginningChunk, ParsingWord, ParsingQuote, EscapeChar
}
I recently faced a similar problem, where command-line arguments had to be split on spaces while keeping quoted parts intact.
One possible case:
"/opt/jboss-eap/bin/jboss-cli.sh --connect --controller=localhost:9990 -c command=\"deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force\""
This had to be split to
/opt/jboss-eap/bin/jboss-cli.sh
--connect
--controller=localhost:9990
-c
command="deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force"
Just to add to @polygenelubricants's answer: allowing any non-space characters before and after the quoted part of the pattern makes this work.
"\\S*\"([^\"]*)\"\\S*|(\\S+)"
Example:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tokenizer {
public static void main(String[] args){
String a = "/opt/jboss-eap/bin/jboss-cli.sh --connect --controller=localhost:9990 -c command=\"deploy " +
"/app/jboss-eap-7.1/standalone/updates/sample.war --force\"";
String b = "Hello \"Stack Overflow\"";
String c = "cmd=\"abcd efgh ijkl mnop\" \"apple\" banana mango";
String d = "abcd ef=\"ghij klmn\"op qrst";
String e = "1 2 \"333 4\" 55 6 \"77\" 8 999";
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\\S*\"([^\"]*)\"\\S*|(\\S+)");
Matcher regexMatcher = regex.matcher(a);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
System.out.println("matchList="+matchList);
}
}
Output:
matchList=[/opt/jboss-eap/bin/jboss-cli.sh, --connect, --controller=localhost:9990, -c, command="deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force"]
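The example above only runs the pattern on string a. As a rough sketch, the other sample strings can be pushed through the same pattern with a small helper; the class name TokenizerSamples and the helper tokenize are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSamples {
    // Same pattern as in the example above.
    private static final Pattern TOKEN = Pattern.compile("\\S*\"([^\"]*)\"\\S*|(\\S+)");

    // Hypothetical helper: collects every match of the pattern as one token.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Hello \"Stack Overflow\"",
            "cmd=\"abcd efgh ijkl mnop\" \"apple\" banana mango",
            "abcd ef=\"ghij klmn\"op qrst",
            "1 2 \"333 4\" 55 6 \"77\" 8 999"
        };
        for (String sample : samples) {
            System.out.println(tokenize(sample));
        }
    }
}
```

Note that the quotes stay in the tokens, and text glued to a quoted part (as in ef="ghij klmn"op) ends up in the same token. Escaped quotes inside a quoted part are not handled by this pattern.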
This is what I myself use for splitting arguments in command line and things like that.
It's easily adjustable for multiple delimiters and quotes, it can process quotes within words (like al' 'pha), it supports escaping (of quotes as well as spaces), and it's really lenient.
public final class StringUtilities {
private static final List<Character> WORD_DELIMITERS = Arrays.asList(' ', '\t');
private static final List<Character> QUOTE_CHARACTERS = Arrays.asList('"', '\'');
private static final char ESCAPE_CHARACTER = '\\';
private StringUtilities() {
}
public static String[] splitWords(String string) {
StringBuilder wordBuilder = new StringBuilder();
List<String> words = new ArrayList<>();
char quote = 0;
for (int i = 0; i < string.length(); i++) {
char c = string.charAt(i);
if (c == ESCAPE_CHARACTER && i + 1 < string.length()) {
wordBuilder.append(string.charAt(++i));
} else if (WORD_DELIMITERS.contains(c) && quote == 0) {
words.add(wordBuilder.toString());
wordBuilder.setLength(0);
} else if (quote == 0 && QUOTE_CHARACTERS.contains(c)) {
quote = c;
} else if (quote == c) {
quote = 0;
} else {
wordBuilder.append(c);
}
}
if (wordBuilder.length() > 0) {
words.add(wordBuilder.toString());
}
return words.toArray(new String[0]);
}
}
The example you have here would just have to be split by the double quote character.
Another old-school way is:
public static void main(String[] args) {
String text = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] splits = text.split(" ");
List<String> list = new ArrayList<>();
String token = null;
for(String s : splits) {
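if (s.length() > 1 && s.startsWith("\"") && s.endsWith("\"")) {
// A single quoted word such as "ten" both opens and closes the quote.
list.add(s);
} else if (s.startsWith("\"")) {
token = s;
} else if (s.endsWith("\"")) {
token = token + " " + s;
list.add(token);
token = null;
} else {
if (token != null) {
token = token + " " + s;
} else {
list.add(s);
}
}
}
System.out.println(list);
}
Output: [One, two, "three four", five, "six seven eight", nine, "ten"]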
private static void findWords(String str) {
boolean flag = false;
StringBuilder sb = new StringBuilder();
for(int i=0;i<str.length();i++) {
if(str.charAt(i)!=' ' && str.charAt(i)!='"') {
sb.append(str.charAt(i));
}
else {
if (sb.length() > 0) {
System.out.println(sb.toString()); // skip empty tokens between delimiters
}
sb = new StringBuilder();
if(str.charAt(i)==' ' && !flag)
continue;
else if(str.charAt(i)=='"') {
if(!flag) {
flag=true;
}
i++;
while(i<str.length() && str.charAt(i)!='"') {
sb.append(str.charAt(i));
i++;
}
flag=false;
System.out.println(sb.toString());
sb = new StringBuilder();
}
}
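}
// Flush the last word when the string does not end with a space or a quote.
if (sb.length() > 0) {
System.out.println(sb.toString());
}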
}
In my case I had a string that includes key="value" pairs. Check this out:
String perfLogString = "2022-11-10 08:35:00,470 PLV=REQ CIP=902.68.5.11 CMID=canonaustr CMN=\"Yanon Australia Pty Ltd\"";
// and this came to my rescue :
String[] str1= perfLogString.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println(Arrays.toString(str1));
This regex matches whitespace only where it is followed by an even number of double quotes, that is, where the whitespace lies outside a quoted value.
After the split I get:
[2022-11-10, 08:35:00,470, PLV=REQ, CIP=902.68.5.11, CMID=canonaustr, CMN="Yanon Australia Pty Ltd"]
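To make the "even number of quotes" rule a bit more concrete, here is a minimal sketch that wraps the same regex in a helper and runs it on a line where the quoted value sits in the middle. The class name LookaheadSplitDemo and the method splitOutsideQuotes are made up for illustration, and the sketch assumes the quotes are balanced and never escaped.

```java
import java.util.Arrays;

public class LookaheadSplitDemo {
    // Same regex as above: a whitespace character is a split point only if
    // the rest of the line contains an even number of double quotes.
    static String[] splitOutsideQuotes(String input) {
        return input.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
    }

    public static void main(String[] args) {
        String line = "PLV=REQ CMN=\"Yanon Australia Pty Ltd\" CIP=902.68.5.11";
        // The spaces inside the quoted value have one quote to their right
        // (an odd count), so the lookahead fails there and no split happens.
        System.out.println(Arrays.toString(splitOutsideQuotes(line)));
        // prints [PLV=REQ, CMN="Yanon Australia Pty Ltd", CIP=902.68.5.11]
    }
}
```

Because the lookahead rescans the rest of the line at every whitespace character, this is probably best kept for short inputs such as single log lines.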
Try this:
String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] strings = str.split("[ ]?\"[ ]?");
I don't know the context of what you're trying to do, but it looks like you're trying to parse command-line arguments. In general, this is pretty tricky with all the escaping issues; if this is your goal, I'd personally look at something like JCommander.
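The quote-based split above leaves unquoted runs such as "One two" as a single element. One possible follow-up, sketched here under the assumption that quotes are balanced and never escaped, is to split on the quote character alone and then break the even-indexed pieces (the ones outside quotes) on whitespace. The class name QuoteSplitSketch and the method tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteSplitSketch {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String[] pieces = input.split("\"");
        for (int i = 0; i < pieces.length; i++) {
            if (i % 2 == 1) {
                // Odd index: the piece was between two quotes, keep it whole.
                tokens.add(pieces[i]);
            } else {
                // Even index: outside quotes, so split it into words.
                for (String word : pieces[i].trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        tokens.add(word);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
        System.out.println(tokenize(str));
        // prints [One, two, three four, five, six seven eight, nine, ten]
    }
}
```

Unlike the regex-based answers, this drops the quote characters from the result.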
Try this:
String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] strArr = str.split("\"|\\s");
It's kind of tricky because you need to escape the double quotes. This regular expression should tokenize the string using either a whitespace (\s) or a double quote.
You should use String's split method because it accepts regular expressions, whereas the constructor argument for delimiter in StringTokenizer doesn't. At the end of what I provided above, you can just add the following:
String s = "";
for (String k : strArr) {
s += k + " "; // keep a space between pieces so StringTokenizer can separate them
}
StringTokenizer strTok = new StringTokenizer(s);
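To see what splitting on "either a double quote or a whitespace character" returns for the sample string, a small check like the following can help. It is only a sketch: the class name QuoteOrSpaceSplit and the helper splitAndDropEmpty are made up for illustration, and dropping the empty strings is my addition (the raw split produces an empty element wherever a space and a quote are adjacent).

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteOrSpaceSplit {
    // Splits on every double quote or whitespace character, then drops the
    // empty strings that appear where two delimiters are adjacent.
    static List<String> splitAndDropEmpty(String input) {
        List<String> parts = new ArrayList<>();
        for (String part : input.split("\"|\\s")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
        System.out.println(splitAndDropEmpty(str));
        // prints [One, two, three, four, five, six, seven, eight, nine, ten]
    }
}
```

Note that this treats the quote as just another delimiter, so the words of a quoted phrase come back as separate elements.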
