I want to write a regex in Java. The valid strings are:
yyyyyy$
<t>yy<\t>$
<t><t>yyyyy<\t><\t>$
<t><t>y<\t>y<\t><t>yyyyy<\t>yy$
And the strings NOT allowed or possible are:
<t><\t>$ (no “y” in the string)
<t>yy<t><\t>$ (one extra <t> ).
Some Specifications are:
There is exactly one $ in any correct string, and
this is always the last symbol in the string. The
string before the $ must be non-empty, and we call
it an expression. An expression is defined recursively
as:
the letter ‘y’
an expression bracketed by <t> and <\t>
a sequence of expressions.
The regex I have built is : y+|y*(<t>y+(<t>y*<\t>)*<\t>)
Now I am coding this regex in java as: "d+|(d*(<s>d+(<s>d*<\\s>)*<\\s>))$"
Code:
private static void checkForPattern(String input) {
Pattern p = Pattern.compile(" d+ | (d*(<s>d+(<s>d*<\\s>)*<\\s>)) $");
//Pattern p= Pattern.compile("d+|d*<s>dd<\\s>$");
Matcher m = p.matcher(input);
if (m.matches()) {
System.out.println("Correct string");
} else {
System.out.println("Wrong string");
}
}
What is the error in the syntax? It prints "Wrong string" for every string that I test.
I would suggest not using regex for this, since Java's regex engine cannot effectively balance the number of <t> vs <\t> occurrences the way some other regex engines can (e.g. .NET). Even in those engines this is fairly complex, and there are likely better ways to approach your problem. The code below takes a simpler route: it counts the occurrences of <t> and ensures that the same number of <\t> exists. Similarly, it counts the occurrences of y and ensures there is at least one. Note that this compares counts only; it does not check the order or nesting of the tags. The logic for the countOccurrences method was adapted from this answer on the question Occurrences of substring in a string.
class Main {
public static void main(String[] args) {
String[] strings = {
"yyyyyy$",
"<t>yy<\\t>$",
"<t><t>yyyyy<\\t><\\t>$",
"<t><t>y<\\t>y<\\t><t>yyyyy<\\t>yy$",
"<t><\\t>$",
"<t>yy<t><\\t>$"
};
for(String s : strings) {
if (countOccurrences("<t>", s) == countOccurrences("<\\t>", s) && countOccurrences("y", s) > 0) {
System.out.println("Good: " + s);
} else {
System.out.println("Bad: " + s);
}
}
}
private static int countOccurrences(String needle, String haystack) {
int lastIndex = 0;
int count = 0;
while (lastIndex != -1) {
lastIndex = haystack.indexOf(needle, lastIndex);
if (lastIndex != -1) {
count++;
lastIndex += needle.length();
}
}
return count;
}
}
Result:
Good: yyyyyy$
Good: <t>yy<\t>$
Good: <t><t>yyyyy<\t><\t>$
Good: <t><t>y<\t>y<\t><t>yyyyy<\t>yy$
Bad: <t><\t>$
Bad: <t>yy<t><\t>$
After thorough research and reading, I have concluded that a regex for this kind of language cannot be created: the tags can nest to any depth, so recognizing it would need an automaton with infinitely many states, and a regex can only describe a finite automaton. So to solve this problem we have to write the CFG directly.
CFG for the above mentioned problem is below:
R --> <t>S<\t>$   (production 1.1)
R --> SS$         (production 1.2)
R --> y$          (production 1.3)
S --> <t>S<\t>    (production 2.1)
S --> SS          (production 2.2)
S --> y           (production 2.3)
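A grammar like this maps quite directly onto a small recursive-descent check. The following is only a sketch and not part of the original answers: the class name ExpressionValidator and its method names are made up for illustration, and it assumes the closing tag is written with a literal backslash, exactly as in the test strings above.

```java
class ExpressionValidator {
    private final String s;   // the input without the trailing '$'
    private int pos = 0;      // current read position in s

    private ExpressionValidator(String s) {
        this.s = s;
    }

    /** True if input is a non-empty expression followed by one trailing '$'. */
    static boolean isValid(String input) {
        if (!input.endsWith("$")) {
            return false;
        }
        String body = input.substring(0, input.length() - 1);
        ExpressionValidator v = new ExpressionValidator(body);
        // The whole body must be consumed, so a stray '$' or tag is rejected.
        return v.parseSequence() && v.pos == body.length();
    }

    // sequence := item+   (one or more items in a row)
    private boolean parseSequence() {
        int items = 0;
        while (parseItem()) {
            items++;
        }
        return items > 0;
    }

    // item := 'y' | "<t>" sequence "<\t>"
    private boolean parseItem() {
        if (s.startsWith("y", pos)) {
            pos++;
            return true;
        }
        if (s.startsWith("<t>", pos)) {
            int start = pos;
            pos += 3;
            if (parseSequence() && s.startsWith("<\\t>", pos)) {
                pos += 4;
                return true;
            }
            pos = start; // no well-formed bracket here: undo and report failure
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isValid("<t><t>yyyyy<\\t><\\t>$"));
        System.out.println(isValid("<t>yy<t><\\t>$"));
    }
}
```

Unlike a pure count comparison, this also rejects inputs where the tags are balanced in number but appear in the wrong order or enclose nothing.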
I have come across regular expressions for different problems, but I could not find a regex to balance characters in a string.
I came across a problem: determine whether a string is balanced.
Example: aabbccdd is balanced, because every character occurs an even number of times,
but aabbccddd is not balanced, because d occurs an odd number of times. This applies to every character in the input, not just a, b, c and d. If I give 12344321 or 123454321 as input, it should return balanced and unbalanced respectively.
How can I check the balance using a regex? What kind of regular expression should be used to determine whether the string is balanced?
Edit:
I tried to find a regex-only solution because the problem explicitly demands a regex pattern. I would have implemented it another way if regex had not been mentioned explicitly.
I don't think you can do it with regex. Why do you need to use them?
I tried this: it works and it's pretty simple
static boolean isBalanced(String str) {
ArrayList<Character> odds = new ArrayList<>(); //Will contain the characters read until now an odd number of times
for (char x : str.toCharArray()) { //Reads each char of the string
if (odds.contains(x)) { //If x was in the arraylist we found x an even number of times so let's remove it
odds.remove(odds.indexOf(x));
}
else {
odds.add(x);
}
}
return odds.isEmpty();
}
A regular expression for this problem exists, but it doesn't speed anything up and will be totally messy. It's easier to build the NFA first and then convert it to a regex. Still, it's not the proper tool.
public static void main(String args[]) {
String s = args[0];
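int[] counter = new int[Character.MAX_VALUE + 1]; // one slot per possible char value, so characters above 255 don't overflow the array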
for (int i = 0; i < s.length(); i++) {
counter[s.charAt(i)]++;
}
if (validate(counter)) {
System.out.println("valid");
} else {
System.out.println("invalid");
}
}
public static boolean validate(int[] tab) {
for (int i : tab) {
if (i%2 == 1) {
return false;
}
}
return true;
}
Edit: to show that the regex exists
Reference: a finite automaton for just two characters. Start at the far left; the double circle is the accepting state. Each state is named by the set of characters that have an odd count so far.
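For an alphabet of just two characters, say a and b, the four-state automaton (one state per set of characters with an odd count so far) corresponds to a well-known regex. The sketch below is an illustration only: the class and method names are made up, and the pattern covers strings over a and b exclusively. With more characters the equivalent regex grows very quickly, which is why counting is the practical choice.

```java
import java.util.regex.Pattern;

class EvenParityRegex {
    // Strings over {a, b} in which both letters occur an even number of times.
    // Either consume a same-letter pair (aa, bb), or leave the "all even" state
    // with a mixed pair (ab, ba), loop on same-letter pairs, and come back
    // with another mixed pair.
    private static final Pattern EVEN_AB =
            Pattern.compile("(?:aa|bb|(?:ab|ba)(?:aa|bb)*(?:ab|ba))*");

    static boolean isBalanced(String s) {
        return EVEN_AB.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("abba")); // both counts are even
        System.out.println(isBalanced("aab"));  // b occurs once
    }
}
```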
A string contains many patterns of the form 1(0+)1 where (0+) represents any non-empty consecutive sequence of 0's. The patterns are allowed to overlap.
For example, consider string "1101001", we can see there are two consecutive sequences "1(0)1" and "1(00)1" which are of the form 1(0+)1.
public class Solution {
static int patternCount(String s){
String[] sArray = s.split("1");
for(String str : sArray) {
if(Pattern.matches("[0]+", str)) {
count++;
}
}
return count;
}
public static void main(String[] args) {
int result = patternCount("1001010001");
System.out.println(result);//3
}
}
Sample Input
100001abc101
1001ab010abc01001
1001010001
Sample Output
2
2
3
But I still feel something might fail in the future. Could you please help me optimize my code as per the requirements?
First: you did not declare the count variable.
Anyway, I think a better method is:
static int patternCount(String s){
Pattern pattern = Pattern.compile("(?<=1)[0]+(?=1)");
Matcher matcher = pattern.matcher(s);
int count = 0;
while (matcher.find())
count++;
return count;
}
This uses more regex and less logic and, from what I could see, it is even faster (see test).
In case you didn't know, the trick used in regex is called lookaround. More precisely, (?<=1) is positive lookbehind and (?=1) is positive lookahead.
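To see why the lookarounds matter for overlapping patterns, here is a small comparison. It is only an illustrative sketch; the class OverlapDemo and the helper countMatches are names invented for this example. A plain 10+1 consumes the closing 1, so that 1 can no longer open the next pattern, whereas the lookaround version matches only the zeros and leaves every 1 available.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Counts how many times find() succeeds for the given regex on s.
    static int countMatches(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "1001010001";
        // The 1s are consumed by each match, so overlapping patterns are missed.
        System.out.println(countMatches("10+1", s));          // prints 2
        // Only the zeros are consumed; the surrounding 1s are merely checked.
        System.out.println(countMatches("(?<=1)0+(?=1)", s)); // prints 3
    }
}
```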
I have text as a String and need to calculate the number of syllables in each word. I tried splitting the text into an array of words and then processing each word separately, using regular expressions. But the pattern for syllables doesn't work as it should. Please advise how to change it to calculate the correct number of syllables. My initial code:
public int getNumSyllables()
{
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
int count=0;
List <String> tokens = new ArrayList<String>();
for(String word: words){
tokens = Arrays.asList(word.split("[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*"));
count+= tokens.size();
}
return count;
}
This question is from a UCSD Java course, am I right?
I think you should provide more information in the question so that it doesn't confuse people who want to help. Here is my own solution, which has been tested against the local test cases and also the UCSD OJ.
You left out some important information about how a syllable is defined in this assignment. I think the key point is how to deal with the letter e. For example, take the combination te. In the middle of a word it should of course be counted as a syllable. At the end of a word, however, the e should be treated as a silent e, so it should not be counted as a syllable.
That's it. Here is my idea in pseudocode:
if(last character is e) {
    if(it is silent e at the end of this word) {
        remove the silent e;
        count the rest part as regular;
    } else {
        count++;
    }
} else {
    count it as regular;
}
You may notice that I am not using regex alone to solve this problem. I did think about it: can this really be done using only a regex? My answer is no, I don't think so. At least with what the UCSD course has taught us so far, it's too difficult. Regex is a powerful tool that can match the desired characters very quickly, but it is missing some functionality. Take te again as an example: a regex can't treat the two occurrences differently in a word like teate (I made this word up just as an example). If our pattern counts the first te as a syllable, why wouldn't it count the last te too?
Meanwhile, UCSD actually talks about this in the assignment write-up:
If you find yourself doing mental gymnastics to come up with a single regex to count syllables directly, that's usually an indication that there's a simpler solution (hint: consider a loop over characters--see the next hint below). Just because a piece of code (e.g. a regex) is shorter does not mean it is always better.
The hint here is that you should approach this problem with a loop combined with regex.
OK, here is my code:
protected int countSyllables(String word)
{
// TODO: Implement this method so that you can call it from the
// getNumSyllables method in BasicDocument (module 1) and
// EfficientDocument (module 2).
int count = 0;
word = word.toLowerCase();
if (word.charAt(word.length()-1) == 'e') {
if (silente(word)){
String newword = word.substring(0, word.length()-1);
count = count + countit(newword);
} else {
count++;
}
} else {
count = count + countit(word);
}
return count;
}
private int countit(String word) {
int count = 0;
Pattern splitter = Pattern.compile("[^aeiouy]*[aeiouy]+");
Matcher m = splitter.matcher(word);
while (m.find()) {
count++;
}
return count;
}
private boolean silente(String word) {
word = word.substring(0, word.length()-1);
Pattern yup = Pattern.compile("[aeiouy]");
Matcher m = yup.matcher(word);
if (m.find()) {
return true;
} else
return false;
}
Besides the given method countSyllables, I also created two additional methods, countit and silente. countit counts the syllables inside the word, and silente tries to work out whether the word ends with a silent e. Note how a non-silent e is defined: for example, the e in "the" is considered not silent, while the e in "ate" is considered silent.
My code passes both the local test cases and the UCSD OJ.
P.S.: It should be fine to use something like [^aeiouy] directly, because the word has already been parsed before this method is called. Converting to lowercase is also necessary; it saves a lot of work dealing with uppercase letters. All we want is the number of syllables.
Speaking of the count, an elegant way would be to define count as static, so that the private methods could use count++ directly. But it's fine as it is.
Feel free to contact me if you still don't get the method :)
Using the concept of user5500105, I have developed the following method to calculate the number of Syllables in a word. The rules are:
consecutive vowels are counted as 1 syllable. eg. "ae" "ou" are 1 syllable
Y is considered as a vowel
e at the end is counted as syllable if e is the only vowel: eg: "the" is one syllable, since "e" at the end is the only vowel while "there" is also 1 syllable because "e" is at the end and there is another vowel in the word.
public int countSyllables(String word) {
ArrayList<String> tokens = new ArrayList<String>();
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
while (m.find()) {
tokens.add(m.group());
}
//check if e is at last and e is not the only vowel or not
if( tokens.size() > 1 && tokens.get(tokens.size()-1).equals("e") )
return tokens.size()-1; // e is at last and not the only vowel so total syllable -1
return tokens.size();
}
This gives you the number of vowel groups (runs of consecutive vowels) in a word:
public int getNumVowels(String word) {
String regexp = "[bcdfghjklmnpqrstvwxz]*[aeiouy]+[bcdfghjklmnpqrstvwxz]*";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(word.toLowerCase());
int count = 0;
while (m.find()) {
count++;
}
return count;
}
You can call it on every word in your string array:
String[] words = getText().split("\\s+");
for (String word : words ) {
System.out.println("Word: " + word + ", vowels: " + getNumVowels(word));
}
Update: as freerunner noted, calculating the number of syllables is more complicated than just counting vowels. One needs to take into account combinations like ou, ui, oo, the final silent e and possibly something else. As I am not a native English speaker, I am not sure what the correct algorithm would be.
This is how I do it. This is about as simple an algorithm I could come up with.
public static int syllables(String s) {
final Pattern p = Pattern.compile("([ayeiou]+)");
final String lowerCase = s.toLowerCase();
final Matcher m = p.matcher(lowerCase);
int count = 0;
while (m.find())
count++;
if (lowerCase.endsWith("e"))
count--;
return count < 1 ? 1 : count; // a word such as "the" would otherwise end up with 0 syllables
}
I use this in combination with a soundex function to determine if words sound alike. The syllable count improves accuracy of my soundex function.
Note: This is strictly for counting the syllables in a word. I assume you can parse your input for words using something like java.util.StringTokenizer.
Your line
String[] words = getText().toLowerCase().split("[a-zA-Z]+");
is splitting ON words, and returning only the space between words! You want to split on the space between words, as follows:
String[] words = getText().toLowerCase().split("\\s+");
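The difference between the two split calls is easy to check on a tiny input. The following is a sketch for illustration only; SplitDemo and its two helper methods are names invented here, and the sample text is arbitrary.

```java
import java.util.Arrays;

class SplitDemo {
    // Uses the words themselves as separators, so only the gaps survive.
    static String[] splitOnWords(String text) {
        return text.split("[a-zA-Z]+");
    }

    // Uses runs of whitespace as separators, so the words survive.
    static String[] splitOnWhitespace(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "hello world";
        // Two pieces: an empty string before "hello" and the single space
        // between the two words.
        System.out.println(Arrays.toString(splitOnWords(text)));
        // Two pieces: "hello" and "world".
        System.out.println(Arrays.toString(splitOnWhitespace(text)));
    }
}
```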
You can do it as follows:
public int getNumSyllables()
{
return getSyllables(getTokens("[a-zA-Z]+"));
}
protected List<String> getWordTokens(String word,String pattern)
{
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile(pattern);
Matcher m = tokSplitter.matcher(word);
while (m.find()) {
tokens.add(m.group());
}
return tokens;
}
private int getSyllables(List<String> tokens)
{
int count=0;
for(String word : tokens)
if(word.toLowerCase().endsWith("e") && getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size() > 0)
count+=getWordTokens(word.toLowerCase().substring(0, word.length()-1), "[aeiouy]+").size();
else
count+=getWordTokens(word.toLowerCase(), "[aeiouy]+").size();
return count;
}
I count the word "the" separately, then split the text on words that end with e.
Then I count the syllables. Here is my implementation:
int syllables = 0;
word = word.toLowerCase();
if(word.contains("the ")){
syllables ++;
}
String[] split = word.split("e!$|e[?]$|e,|e |e[),]|e$");
ArrayList<String> tokens = new ArrayList<String>();
Pattern tokSplitter = Pattern.compile("[aeiouy]+");
for (int i = 0; i < split.length; i++) {
String s = split[i];
Matcher m = tokSplitter.matcher(s);
while (m.find()) {
tokens.add(m.group());
}
}
syllables += tokens.size();
I've tested it and all test cases pass.
You are using the split method incorrectly. It takes the separator as its argument. You need to write something like this:
String[] words = getText().toLowerCase().split(" ");
But if you want to count the number of syllables, it is enough to count the number of vowels:
String input = "text";
Set<Character> vowel = new HashSet<>();
vowel.add('a');
vowel.add('e');
vowel.add('i');
vowel.add('o');
vowel.add('u');
int count = 0;
for (char c : input.toLowerCase().toCharArray()) {
if (vowel.contains(c)){
count++;
}
}
System.out.println("count = " + count);
I need to find whole words in a sentence, but without using regular expressions. So if I wanted to find the word "the" in this sentence: "The quick brown fox jumps over the lazy dog", I'm currently using:
String text = "the, quick brown fox jumps over the lazy dog";
String keyword = "the";
Matcher matcher = Pattern.compile("\\b"+keyword+"\\b").matcher(text);
Boolean contains = matcher.find();
but if I used:
Boolean contains = text.contains(keyword);
and padded the keyword with spaces, it wouldn't find the first "the" in the sentence, because it isn't surrounded by whitespace and is followed by punctuation.
To be clear, I'm building an Android app, and I'm getting memory leaks and it might be because I'm using a regular-expression in a ListView, so it's performing a regular-expression match X number of times, depending on the items in the Listview.
If you needed to check for multiple words and do it without regular expressions you could use StringTokenizer with a space as the delimiter.
You could then build a custom search method. Otherwise, the other solutions using String.contains() or String.indexOf() qualify.
What you do is search for "the". Then for each match you test to see if the surrounding characters are white space (or punctuation), or if the match is at the beginning / end of the string respectively.
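The search-then-check-neighbours idea could be sketched like this. This is not the answerer's code: the class WholeWordSearch and the method containsWholeWord are hypothetical names, and the sketch assumes that anything other than a letter or digit (or the start or end of the text) counts as a word boundary. It is case-sensitive, so lowercase both arguments first if "The" should match "the".

```java
class WholeWordSearch {
    // True if word occurs in text with no letter or digit directly before or after it.
    static boolean containsWholeWord(String text, String word) {
        if (word.isEmpty()) {
            return false;
        }
        int from = 0;
        while (true) {
            int start = text.indexOf(word, from);
            if (start < 0) {
                return false; // no further occurrences
            }
            int end = start + word.length();
            boolean leftOk = start == 0
                    || !Character.isLetterOrDigit(text.charAt(start - 1));
            boolean rightOk = end == text.length()
                    || !Character.isLetterOrDigit(text.charAt(end));
            if (leftOk && rightOk) {
                return true;
            }
            from = start + 1; // keep looking after this partial hit
        }
    }

    public static void main(String[] args) {
        String text = "the, quick brown fox jumps over the lazy dog";
        System.out.println(containsWholeWord(text, "the"));
        System.out.println(containsWholeWord("other things", "the"));
    }
}
```

No Pattern or Matcher objects are created, so it can be called once per list row without compiling a regex each time.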
public int findWholeWord(final String text, final String searchString) {
return (" " + text + " ").indexOf(" " + searchString + " ");
}
This will give you the index of the first occurrence of the word "the" or -1 if the word "the" doesn't exist.
Split the string on space, and then see if the resulting array contains your word.
Simply iterate over the characters and keep storing them in a char buffer. Every time you see a whitespace, empty the buffer into a list of words and go on till you reach the end.
In the comments of the StringTokenizer.class:
StringTokenizer is a legacy class that is retained for
compatibility reasons although its use is discouraged in new code. It is
recommended that anyone seeking this functionality use the split
method of String or the java.util.regex package instead.
The following example illustrates how the String.split
method can be used to break up a string into its basic tokens:
String[] result = "this is a test".split("\\s");
for (int x=0; x<result.length; x++)
System.out.println(result[x]);
prints the following output:
this
is
a
test
Iterate through your resulting string array and test for equality and keep a count.
int count = 0;
for (String s : result)
{
    if (s.equals(keyword)) // keyword is the word being searched for
        count++;
}
If this is a homework assignment, tell your lecturer to read up on Java, times have changed. I remember having the exact same stupid questions during school and it does nothing to prepare you for the real world.
I have a project that requires whole-word matching, but I can't use regular expressions (because regular expressions escape some keywords), so I tried to write my own code that simulates \bxxx\b without regular expressions. I only know C#, and it worked fine.
public static class Finder
{
public static bool Find(string? input, string? pattern, bool isMatchCase = false, bool isMatchWholeWord = false, bool isMatchRegex = false)
{
if (String.IsNullOrWhiteSpace(input) || String.IsNullOrWhiteSpace(pattern))
{
return false;
}
if (!isMatchCase && !isMatchRegex)
{
input = input.ToLower();
pattern = pattern.ToLower();
}
if (isMatchWholeWord && !isMatchRegex)
{
int len = pattern.Length;
int suffix = 0;
while (true)
{
int start = input.IndexOf(pattern, suffix);
if (start == -1)
{
return false;
}
int end = start + len - 1;
int prefix = start - 1;
suffix = end + 1;
bool isPrefixMatched, isSuffixMatched;
if (start == 0)
{
isPrefixMatched = true;
}
else
{
isPrefixMatched = IsWord(input[prefix]) != IsWord(input[start]);
}
if (end == input.Length - 1)
{
isSuffixMatched = true;
}
else
{
isSuffixMatched = IsWord(input[suffix]) != IsWord(input[end]);
}
if (isPrefixMatched && isSuffixMatched)
{
return true;
}
}
}
if (isMatchRegex)
{
if (isMatchWholeWord)
{
if (!pattern.StartsWith(@"\b"))
{
pattern = $@"\b{pattern}";
}
if (!pattern.EndsWith(@"\b"))
{
pattern = $@"{pattern}\b";
}
}
return Regex.IsMatch(input, pattern, isMatchCase ? RegexOptions.None : RegexOptions.IgnoreCase);
}
return input.Contains(pattern);
}
private static bool IsWord(char ch)
{
return Char.IsLetterOrDigit(ch) || ch == '_';
}
}
I wish to have the following String
!cmd 45 90 "An argument" Another AndAnother "Another one in quotes"
to become an array of the following
{ "!cmd", "45", "90", "An argument", "Another", "AndAnother", "Another one in quotes" }
I tried
new StringTokenizer(cmd, "\"")
but this would return "Another" and "AndAnother" as "Another AndAnother", which is not the desired effect.
Thanks.
EDIT:
I have changed the example yet again, this time I believe it explains the situation best although it is no different than the second example.
It's much easier to use a java.util.regex.Matcher and do a find() rather than any kind of split in these kinds of scenario.
That is, instead of defining the pattern for the delimiter between the tokens, you define the pattern for the tokens themselves.
Here's an example:
String text = "1 2 \"333 4\" 55 6 \"77\" 8 999";
// 1 2 "333 4" 55 6 "77" 8 999
String regex = "\"([^\"]*)\"|(\\S+)";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
if (m.group(1) != null) {
System.out.println("Quoted [" + m.group(1) + "]");
} else {
System.out.println("Plain [" + m.group(2) + "]");
}
}
The above prints (as seen on ideone.com):
Plain [1]
Plain [2]
Quoted [333 4]
Plain [55]
Plain [6]
Quoted [77]
Plain [8]
Plain [999]
The pattern is essentially:
"([^"]*)"|(\S+)
 \_____/  \___/
    1       2
There are 2 alternates:
The first alternate matches the opening double quote, a sequence of anything but double quote (captured in group 1), then the closing double quote
The second alternate matches any sequence of non-whitespace characters, captured in group 2
The order of the alternates matters in this pattern
Note that this does not handle escaped double quotes within quoted segments. If you need to do this, then the pattern becomes more complicated, but the Matcher solution still works.
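One way the pattern might be extended to cope with escaped quotes is sketched below. This is an assumption-laden illustration rather than part of the original answer: it assumes a backslash escapes whatever character follows it inside a quoted segment, and the class name EscapedQuoteTokenizer and the method tokenize are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteTokenizer {
    // Group 1: body of a quoted token. Inside the quotes, either a character
    //          that is neither a quote nor a backslash, or a backslash
    //          followed by any character (an escape sequence).
    // Group 2: a plain run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"|(\\S+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // Drop the backslash of every escape sequence: \" becomes "
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The raw input is: say "a \"quoted\" word" end
        System.out.println(tokenize("say \"a \\\"quoted\\\" word\" end"));
    }
}
```

The Matcher loop itself is unchanged from the simpler pattern; only the first alternate and the unescaping step are new.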
References
regular-expressions.info/Brackets for Grouping and Capturing, Alternation with Vertical Bar, Character Class, Repetition with Star and Plus
See also
regular-expressions.info/Examples - Programmer - Strings - for pattern with escaped quotes
Appendix
Note that StringTokenizer is a legacy class. It's recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for most flexibility.
Related questions
Difference between a Deprecated and Legacy API?
Scanner vs. StringTokenizer vs. String.Split
Validating input using java.util.Scanner - has many examples
Do it the old fashioned way. Make a function that looks at each character in a for loop. If the character is a space, take everything up to that (excluding the space) and add it as an entry to the array. Note the position, and do the same again, adding that next part to the array after a space. When a double quote is encountered, mark a boolean named 'inQuote' as true, and ignore spaces when inQuote is true. When you hit quotes when inQuote is true, flag it as false and go back to breaking things up when a space is encountered. You can then extend this as necessary to support escape chars, etc.
Could this be done with a regex? I dont know, I guess. But the whole function would take less to write than this reply did.
Apache Commons to the rescue!
import org.apache.commons.text.StringTokenizer
import org.apache.commons.text.matcher.StringMatcher
import org.apache.commons.text.matcher.StringMatcherFactory
@Grab(group='org.apache.commons', module='commons-text', version='1.3')
def str = /is this 'completely "impossible"' or """slightly"" impossible" to parse?/
StringTokenizer st = new StringTokenizer( str )
StringMatcher sm = StringMatcherFactory.INSTANCE.quoteMatcher()
st.setQuoteMatcher( sm )
println st.tokenList
Output:
[is, this, completely "impossible", or, "slightly" impossible, to, parse?]
A few notes:
this is written in Groovy... it is in fact a Groovy script. The @Grab line gives a clue to the sort of dependency line you need (e.g. in build.gradle) ... or just include the .jar in your classpath of course
StringTokenizer here is NOT java.util.StringTokenizer ... as the import line shows it is org.apache.commons.text.StringTokenizer
the def str = ... line is a way to produce a String in Groovy which contains both single quotes and double quotes without having to go in for escaping
StringMatcherFactory in apache commons-text 1.3 can be found here: as you can see, the INSTANCE can provide you with a bunch of different StringMatchers. You could even roll your own: but you'd need to examine the StringMatcherFactory source code to see how it's done.
YES! You can not only include the "other type of quote" and it is correctly interpreted as not being a token boundary ... but you can even escape the actual quote which is being used to turn off tokenising, by doubling the quote within the tokenisation-protected bit of the String! Try implementing that with a few lines of code ... or rather don't!
PS why is it better to use Apache Commons than any other solution?
Apart from the fact that there is no point re-inventing the wheel, I can think of at least two reasons:
The Apache engineers can be counted on to have anticipated all the gotchas and developed robust, comprehensively tested, reliable code
It means you don't clutter up your beautiful code with stoopid utility methods - you just have a nice, clean bit of code which does exactly what it says on the tin, leaving you to get on with the, um, interesting stuff...
PPS Nothing obliges you to look on the Apache code as mysterious "black boxes". The source is open, and written in usually perfectly "accessible" Java. Consequently you are free to examine how things are done to your heart's content. It's often quite instructive to do so.
later
Sufficiently intrigued by ArtB's question I had a look at the source:
in StringMatcherFactory.java we see:
private static final AbstractStringMatcher.CharSetMatcher QUOTE_MATCHER = new AbstractStringMatcher.CharSetMatcher(
"'\"".toCharArray());
... rather dull ...
so that leads one to look at StringTokenizer.java:
public StringTokenizer setQuoteMatcher(final StringMatcher quote) {
if (quote != null) {
this.quoteMatcher = quote;
}
return this;
}
OK... and then, in the same java file:
private int readWithQuotes(final char[] srcChars ...
which contains the comment:
// If we've found a quote character, see if it's followed by a second quote. If so, then we need to actually put the quote character into the token rather than end the token.
... I can't be bothered to follow the clues any further. You have a choice: either your "hackish" solution, where you systematically pre-process your strings before submitting them for tokenising, turning |\"|s into |""|s... (i.e. where you replace each |\"| with |""|)...
Or... you examine org.apache.commons.text.StringTokenizer.java to figure out how to tweak the code. It's a small file. I don't think it would be that difficult. Then you compile, essentially making a fork of the Apache code.
I don't think it can be configured. But if you found a code-tweak solution which made sense you might submit it to Apache and then it might be accepted for the next iteration of the code, and your name would figure at least in the "features request" part of Apache: this could be a form of kleos through which you achieve programming immortality...
In an old fashioned way:
public static String[] split(String str) {
str += " "; // To detect last token when not quoted...
ArrayList<String> strings = new ArrayList<String>();
boolean inQuote = false;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
if (c == '"' || c == ' ' && !inQuote) {
if (c == '"')
inQuote = !inQuote;
if (!inQuote && sb.length() > 0) {
strings.add(sb.toString());
sb.delete(0, sb.length());
}
} else
sb.append(c);
}
return strings.toArray(new String[strings.size()]);
}
I assume that nested quotes are illegal, and also that empty tokens can be omitted.
This is an old question, however this was my solution as a finite state machine.
Efficient, predictable and no fancy tricks.
100% coverage on tests.
Drag and drop into your code.
/**
* Splits a command on whitespaces. Preserves whitespace in quotes. Trims excess whitespace between chunks. Supports quote
* escape within quotes. Failed escape will preserve escape char.
*
* @return List of split commands
*/
static List<String> splitCommand(String inputString) {
List<String> matchList = new LinkedList<>();
LinkedList<Character> charList = inputString.chars()
.mapToObj(i -> (char) i)
.collect(Collectors.toCollection(LinkedList::new));
// Finite-State Automaton for parsing.
CommandSplitterState state = CommandSplitterState.BeginningChunk;
LinkedList<Character> chunkBuffer = new LinkedList<>();
for (Character currentChar : charList) {
switch (state) {
case BeginningChunk:
switch (currentChar) {
case '"':
state = CommandSplitterState.ParsingQuote;
break;
case ' ':
break;
default:
state = CommandSplitterState.ParsingWord;
chunkBuffer.add(currentChar);
}
break;
case ParsingWord:
switch (currentChar) {
case ' ':
state = CommandSplitterState.BeginningChunk;
String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
matchList.add(newWord);
chunkBuffer = new LinkedList<>();
break;
default:
chunkBuffer.add(currentChar);
}
break;
case ParsingQuote:
switch (currentChar) {
case '"':
state = CommandSplitterState.BeginningChunk;
String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
matchList.add(newWord);
chunkBuffer = new LinkedList<>();
break;
case '\\':
state = CommandSplitterState.EscapeChar;
break;
default:
chunkBuffer.add(currentChar);
}
break;
case EscapeChar:
switch (currentChar) {
case '"': // Intentional fall through
case '\\':
state = CommandSplitterState.ParsingQuote;
chunkBuffer.add(currentChar);
break;
default:
state = CommandSplitterState.ParsingQuote;
chunkBuffer.add('\\');
chunkBuffer.add(currentChar);
}
}
}
if (state != CommandSplitterState.BeginningChunk) {
String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
matchList.add(newWord);
}
return matchList;
}
private enum CommandSplitterState {
BeginningChunk, ParsingWord, ParsingQuote, EscapeChar
}
I recently faced a similar question, where command-line arguments must be split while ignoring quotes.
One possible case:
"/opt/jboss-eap/bin/jboss-cli.sh --connect --controller=localhost:9990 -c command=\"deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force\""
This had to be split to
/opt/jboss-eap/bin/jboss-cli.sh
--connect
--controller=localhost:9990
-c
command="deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force"
Just to add to @polygenelubricants's answer, allowing any non-space characters before and after the quoted part makes this work.
"\\S*\"([^\"]*)\"\\S*|(\\S+)"
Example:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tokenizer {
public static void main(String[] args){
String a = "/opt/jboss-eap/bin/jboss-cli.sh --connect --controller=localhost:9990 -c command=\"deploy " +
"/app/jboss-eap-7.1/standalone/updates/sample.war --force\"";
String b = "Hello \"Stack Overflow\"";
String c = "cmd=\"abcd efgh ijkl mnop\" \"apple\" banana mango";
String d = "abcd ef=\"ghij klmn\"op qrst";
String e = "1 2 \"333 4\" 55 6 \"77\" 8 999";
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\\S*\"([^\"]*)\"\\S*|(\\S+)");
Matcher regexMatcher = regex.matcher(a);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
System.out.println("matchList="+matchList);
}
}
Output:
matchList=[/opt/jboss-eap/bin/jboss-cli.sh, --connect, --controller=localhost:9990, -c, command="deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force"]
This is what I myself use for splitting arguments in command line and things like that.
It's easily adjustable for multiple delimiters and quotes, it can process quotes within the words (like al' 'pha), it supports escaping (quotes as well as spaces) and it's really lenient.
public final class StringUtilities {
private static final List<Character> WORD_DELIMITERS = Arrays.asList(' ', '\t');
private static final List<Character> QUOTE_CHARACTERS = Arrays.asList('"', '\'');
private static final char ESCAPE_CHARACTER = '\\';
private StringUtilities() {
}
public static String[] splitWords(String string) {
StringBuilder wordBuilder = new StringBuilder();
List<String> words = new ArrayList<>();
char quote = 0;
for (int i = 0; i < string.length(); i++) {
char c = string.charAt(i);
if (c == ESCAPE_CHARACTER && i + 1 < string.length()) {
wordBuilder.append(string.charAt(++i));
} else if (WORD_DELIMITERS.contains(c) && quote == 0) {
words.add(wordBuilder.toString());
wordBuilder.setLength(0);
} else if (quote == 0 && QUOTE_CHARACTERS.contains(c)) {
quote = c;
} else if (quote == c) {
quote = 0;
} else {
wordBuilder.append(c);
}
}
if (wordBuilder.length() > 0) {
words.add(wordBuilder.toString());
}
return words.toArray(new String[0]);
}
}
The example you have here would just have to be split by the double quote character.
Another old school way is :
public static void main(String[] args) {
    String text = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
    String[] splits = text.split(" ");
    List<String> list = new ArrayList<>();
    String token = null;
    for (String s : splits) {
        if (token == null && s.length() > 1 && s.startsWith("\"") && s.endsWith("\"")) {
            // A quoted token without spaces, e.g. "ten", opens and closes in one piece
            list.add(s);
        } else if (s.startsWith("\"")) {
            token = "" + s;
        } else if (s.endsWith("\"")) {
            token = token + " " + s;
            list.add(token);
            token = null;
        } else {
            if (token != null) {
                token = token + " " + s;
            } else {
                list.add(s);
            }
        }
    }
    System.out.println(list);
}
Output : - [One, two, "three four", five, "six seven eight", nine, "ten"]
private static void findWords(String str) {
boolean flag = false;
StringBuilder sb = new StringBuilder();
for(int i=0;i<str.length();i++) {
if(str.charAt(i)!=' ' && str.charAt(i)!='"') {
sb.append(str.charAt(i));
}
else {
System.out.println(sb.toString());
sb = new StringBuilder();
if(str.charAt(i)==' ' && !flag)
continue;
else if(str.charAt(i)=='"') {
if(!flag) {
flag=true;
}
i++;
while(i<str.length() && str.charAt(i)!='"') {
sb.append(str.charAt(i));
i++;
}
flag=false;
System.out.println(sb.toString());
sb = new StringBuilder();
}
}
}
}
In my case I had a string that includes key="value" . Check this out:
String perfLogString = "2022-11-10 08:35:00,470 PLV=REQ CIP=902.68.5.11 CMID=canonaustr CMN=\"Yanon Australia Pty Ltd\"";
// and this came to my rescue :
String[] str1= perfLogString.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println(Arrays.toString(str1));
This regex matches whitespace ONLY if it is followed by an even number of double quotes.
On split I get :
[2022-11-10, 08:35:00,470, PLV=REQ, CIP=902.68.5.11, CMID=canonaustr, CMN="Yanon Australia Pty Ltd"]
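The same even-number-of-quotes idea can be tried on the command string from the original question. The sketch below is illustrative only: the class QuoteAwareSplit and its split method are invented names, the lookahead is a slightly simplified variant of the one above, and stripping the surrounding quotes afterwards is an extra step that the question's expected output calls for.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplit {
    // Whitespace followed by an even number of double quotes up to the end
    // of the input, i.e. whitespace that is not inside a quoted section.
    private static final String OUTSIDE_QUOTES =
            "\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> split(String line) {
        List<String> result = new ArrayList<>();
        for (String token : line.trim().split(OUTSIDE_QUOTES)) {
            // Remove one pair of surrounding quotes, if present.
            if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
                token = token.substring(1, token.length() - 1);
            }
            if (!token.isEmpty()) { // empty tokens are skipped
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        System.out.println(split(cmd));
    }
}
```

Because the lookahead rescans the rest of the input for every whitespace run, this is best kept to short strings such as single command lines.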
try this:
String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] strings = str.split("[ ]?\"[ ]?");
I don't know the context of what you're trying to do, but it looks like you're trying to parse command line arguments. In general, this is pretty tricky with all the escaping issues; if this is your goal I'd personally look at something like JCommander.
Try this:
String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String strArr[] = str.split("\"|\\s");
It's kind of tricky because you need to escape the double quotes. This regular expression should tokenize the string using either a whitespace (\s) or a double quote.
You should use String's split method because it accepts regular expressions, whereas the constructor argument for delimiter in StringTokenizer doesn't. At the end of what I provided above, you can just add the following:
String s = "";
for(String k : strArr) {
s += k;
}
StringTokenizer strTok = new StringTokenizer(s);