I have a relatively simple Java question. I have a string that looks like this:
"Anderson,T",CWS,SS
I need to parse it in a way that I have
Anderson,T
CWS
SS
all as separate strings.
Thanks!
Here's a solution that will capture quoted strings, remove spaces, and match empty items:
public static void main(String[] args) {
String quoted = "\"(.*?(?<!\\\\)(?:\\\\\\\\)*)\"";
Pattern regex = Pattern.compile(
"(?:^|(?<=,))\\s*(" + quoted + "|[^,]*?)\\s*(?:$|,)");
String line = "\"Anderson,T\",CWS,\"single quote\\\"\", SS ,,hello,,";
Matcher m = regex.matcher(line);
int count = 0;
while (m.find()) {
String s = m.group(2) == null ? m.group(1) : m.group(2);
System.out.println(s);
count++;
}
System.out.printf("(%d matches found)%n", count);
}
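I split out the quoted part of the pattern to make it a bit easier to follow. Capturing group 1 is the whole field (quotes included, if present); group 2 is the content between the quotes and is null for unquoted fields, which is why the loop prefers group 2 when it is set.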
To break down the overall pattern:
Look for start of line or previous comma (?:^|(?<=,)) (don't capture)
Ignore 0+ spaces \\s*
Look for quoted string or string without comma (" + quoted + "|[^,]*?)
(The non-comma match is non-greedy so it doesn't grab any following spaces)
Ignore 0+ spaces again \\s*
Look for end of line, or comma (?:$|,) (don't capture)
To break down the quote pattern:
Look for opening quote \"
Start group capture (
Get the minimum match of any character .*?
Match 0+ even number of backslashes (?<!\\\\)(?:\\\\\\\\)* (to avoid matching escaped quotes with or without preceding escaped backslashes)
Close capturing group )
Match closing quote \"
Assuming your string looks like this
String input = "\"Anderson,T\",CWS,SS";
You can use this solution found for a similar scenario.
String input = "\"Anderson,T\",CWS,SS";
List<String> result = new ArrayList<String>();
int start = 0; //start index. Used to determine where the word starts
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) { //iterate through characters
if (input.charAt(current) == '\"') //if found a quote
inQuotes = !inQuotes; // toggle state
if(current == (input.length() - 1))//if it is the last character
result.add(input.substring(start)); //add last word
else if (input.charAt(current) == ',' && !inQuotes) { //if found a comma not inside quotes
result.add(input.substring(start, current)); //add everything between the start index and the current character. (add a word)
start = current + 1; //update start index
}
}
System.out.println(result);
I have modified it a bit to improve readability. This code stores the strings you want in the list result.
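Note that the loop above keeps the surrounding quotes in the first element. A minimal sketch of the same toggle idea that also drops the quotes, assuming quotes are never escaped inside a field (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedCsvSketch {
    // Hypothetical helper: splits on commas outside double quotes and drops the quotes.
    static List<String> parse(String input) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;          // toggle state; the quote itself is not kept
            } else if (c == ',' && !inQuotes) {
                fields.add(field.toString());  // comma outside quotes ends the field
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());          // last field
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("\"Anderson,T\",CWS,SS")); // [Anderson,T, CWS, SS]
    }
}
```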
first;snd;3rd;4th;5th;6th;...
How can I split the above after the third occurrence of the ; separator? Especially without having to value.split(";") the whole string into an array, as I won't need the values separated. I just need the first part of the string, up to the nth occurrence.
Desired output would be:
first;snd;3rd.
I just need that as a string substring, not as split separated values.
Use StringUtils.ordinalIndexOf() from Apache Commons Lang:
Finds the n-th index within a String, handling null. This method uses String.indexOf(String).
Parameters:
str - the String to check, may be null
searchStr - the String to find, may be null
ordinal - the n-th searchStr to find
Returns:
the n-th index of the search String, -1 (INDEX_NOT_FOUND) if no match or null string input
Or this way, no libraries required:
public static int ordinalIndexOf(String str, String substr, int n) {
int pos = str.indexOf(substr);
while (--n > 0 && pos != -1)
pos = str.indexOf(substr, pos + 1);
return pos;
}
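To get the substring the question asks for, the returned index can be fed to substring. A small sketch, where upToNth is a hypothetical helper name and returning the whole string when there are fewer than n separators is an assumption:

```java
public class NthOccurrenceSketch {
    // Same loop as above: index of the n-th occurrence of substr, or -1.
    static int ordinalIndexOf(String str, String substr, int n) {
        int pos = str.indexOf(substr);
        while (--n > 0 && pos != -1)
            pos = str.indexOf(substr, pos + 1);
        return pos;
    }

    // Hypothetical helper: text before the n-th separator,
    // or the whole string if there are fewer than n separators (assumption).
    static String upToNth(String str, String sep, int n) {
        int idx = ordinalIndexOf(str, sep, n);
        return idx == -1 ? str : str.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(upToNth("first;snd;3rd;4th;5th;6th;...", ";", 3)); // first;snd;3rd
    }
}
```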
I would go with this, easy and basic:
String test = "first;snd;3rd;4th;5th;6th;";
int result = 0;
for (int i = 0; i < 3; i++) {
result = test.indexOf(";", result) +1;
}
System.out.println(test.substring(0, result-1));
Output:
first;snd;3rd
You can of course replace the 3 in the loop with the number of elements you need.
If you want to use regular expressions, it is pretty straightforward:
import re
value = "first;snd;3rd;4th;5th;6th;"
reg = r'^([\w]+;[\w]+;[\w]+)'
re.match(reg, value).group()
Outputs:
"first;snd;3rd"
You could use a regex that uses a negated character class to match from the start of the string not a semicolon.
Then repeat a grouping structure 2 times that matches a semicolon followed by not a semicolon 1+ times.
^[^;]+(?:;[^;]+){2}
Explanation
^ Assert the start of the string
[^;]+ Negated character class to match not a semicolon 1+ times
(?: Start non capturing group
;[^;]+ Match a semicolon and 1+ times not a semi colon
){2} Close non capturing group and repeat 2 times
For example:
String regex = "^[^;]+(?:;[^;]+){2}";
String string = "first;snd;3rd;4th;5th;6th;...";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(matcher.group(0)); // first;snd;3rd
}
See the Java demo
If you don't want to use split, just use indexOf in a for loop to find the index of the 3rd ";" and then take the substring from the start of the string up to that index.
Also you can do a split with a regex that match the 3rd ; but it's probably not the best solution.
If you need to do this frequently it is best to compile the regex upfront in a static Pattern instance:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class NthOccurance {
static Pattern pattern=Pattern.compile("^(([^;]*;){3}).*");
public static void main(String[] args) {
String in="first;snd;3rd;4th;5th;6th;";
Matcher m=pattern.matcher(in);
if (m.matches())
System.out.println(m.group(1));
}
}
Replace the '3' with the number of elements you want. Note that group(1) includes the trailing ';'.
The code below finds the index of the 3rd occurrence of the ';' character and takes the substring up to it.
String s = "first;snd;3rd;4th;5th;6th;";
String splitted = s.substring(0, s.indexOf(";", s.indexOf(";", s.indexOf(";") + 1) + 1));
I have a string "'GLO', FLO". Now, I want a regular expression that will check each word in the string, and:
-if a word begins and ends with a single quote, replace the single quotes with spaces
-if a comma is encountered between words, split both words using a space.
so, in the end, I should get GLO FLO.
Any help on how to do this using replaceAll() method on the string?
This regex didn't do it for me : "'([^' ]+)|\\s+'"
public static void displaySplitString(final String str) {
String pattern1 = "^'?(\\w+)'?,\\s+(\\w+)$";
StringTokenizer strTok = new StringTokenizer(str, " , ");
while (strTok.hasMoreTokens()) {
String delim = (strTok.nextToken());
delim.replaceAll(pattern1, "$1$2");
System.out.println(delim);
}
} //in main method displaySplitString("'GLO', FLO");
Here is the snippet that should get you going:
public static void displaySplitString(String str)
{
String pattern1 = "^'?(\\w+)'?(?=\\S)";
str = str.replaceAll(pattern1, " $1 ");
StringTokenizer strTok = new StringTokenizer(str, " , ");
while (strTok.hasMoreTokens())
{
String delim = (strTok.nextToken());
System.out.println(delim);
}
}
Here,
I changed the str argument declaration to non-final (so that we can reassign str inside the method)
I am using the first regex ^'?(\\w+)'?(?=\\S) to remove potential single quotes from around the first word
Since you use a StringTokenizer, just 2 lines inside the while block are enough.
The regex means:
^ - Start looking for the match at the very start of the string
'? - match 0 or 1 single quote
(\\w+) - match and capture 1 or more alphanumeric symbols (we'll refer to them as $1 in the replacement pattern)
'? - match 0 or 1 single quote
(?=\\S) - match only if there is no space after the optional single quote. Perhaps, you can even replace this lookahead with a mere , if you always have it there, after the first word.
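Since the question asks about replaceAll() specifically, here is a minimal sketch that uses only replaceAll, assuming that quotes always wrap a whole word and that a comma (with optional surrounding spaces) is the only separator. The class and method names are hypothetical:

```java
public class QuoteCommaSketch {
    // Hypothetical helper: strips single quotes around words, then turns commas into single spaces.
    static String normalize(String str) {
        return str
                .replaceAll("'(\\w+)'", "$1")   // 'GLO' -> GLO
                .replaceAll("\\s*,\\s*", " ");  // "GLO, FLO" -> "GLO FLO"
    }

    public static void main(String[] args) {
        System.out.println(normalize("'GLO', FLO")); // GLO FLO
    }
}
```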
My program reads a line from a file. This line contains comma-separated text like:
123,test,444,"don't split, this",more test,1
I would like the result of a split to be this:
123
test
444
"don't split, this"
more test
1
If I use the String.split(","), I would get this:
123
test
444
"don't split
this"
more test
1
In other words: The comma in the substring "don't split, this" is not a separator. How to deal with this?
You can try out this regex:
str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
This splits the string on , that is followed by an even number of double quotes. In other words, it splits on comma outside the double quotes. This will work provided you have balanced quotes in your string.
Explanation:
, // Split on comma
(?= // Followed by
(?: // Start a non-capture group
[^"]* // 0 or more non-quote characters
" // 1 quote
[^"]* // 0 or more non-quote characters
" // 1 quote
)* // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
[^"]* // Finally 0 or more non-quotes
$ // Till the end (This is necessary, else every comma will satisfy the condition)
)
You can even write it like this in your code, using the (?x) modifier with your regex. The modifier makes the regex engine ignore whitespace in the pattern, so it becomes easier to read a regex broken into multiple lines, like so:
String[] arr = str.split("(?x) " +
", " + // Split on comma
"(?= " + // Followed by
" (?: " + // Start a non-capture group
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" )* " + // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
" [^\"]* " + // Finally 0 or more non-quotes
" $ " + // Till the end (This is necessary, else every comma will satisfy the condition)
") " // End look-ahead
);
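For reference, a small self-contained sketch that applies the single-line form of this regex to the line from the question. The class and method names are hypothetical, and balanced quotes are assumed, as noted above:

```java
public class SplitOutsideQuotes {
    // Splits on commas that are followed by an even number of double quotes.
    static String[] split(String str) {
        return str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
    }

    public static void main(String[] args) {
        String line = "123,test,444,\"don't split, this\",more test,1";
        for (String part : split(line)) {
            System.out.println(part); // the quoted field is printed as one piece, quotes included
        }
    }
}
```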
Why Split when you can Match?
Resurrecting this question because for some reason, the easy solution wasn't mentioned. Here is our beautifully compact regex:
"[^"]*"|[^,]+
This will match all the desired fragments (see demo).
Explanation
With "[^"]*", we match complete "double-quoted strings"
or |
we match [^,]+ any characters that are not a comma.
A possible refinement is to improve the string side of the alternation to allow the quoted strings to include escaped quotes.
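One way that refinement might look, as a sketch only: the quoted side of the alternation accepts either a character that is neither a quote nor a backslash, or a backslash followed by any character. Backslash as the escape character is an assumption, and the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchFieldsSketch {
    // "(?:[^"\\]|\\.)*"  quoted string that may contain backslash-escaped characters
    // [^,]+              otherwise, any run of non-comma characters
    private static final Pattern FIELD =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|[^,]+");

    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Input text: a,"say \"hi\", ok",b
        System.out.println(fields("a,\"say \\\"hi\\\", ok\",b"));
    }
}
```

Like the original pattern, this skips empty fields and keeps the surrounding quotes in the match.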
Building upon @zx81's answer, because the matching idea is really nice, I've added the Java 9 results() call, which returns a Stream. Since the OP wanted to use split, I've collected to String[], as split does.
Caution if you have spaces after your comma-separators (a, b, "c,d"). Then you need to change the pattern.
Jshell demo
$ jshell
-> String so = "123,test,444,\"don't split, this\",more test,1";
| Added variable so of type String with initial value "123,test,444,"don't split, this",more test,1"
-> Pattern.compile("\"[^\"]*\"|[^,]+").matcher(so).results();
| Expression value is: java.util.stream.ReferencePipeline$Head@2038ae61
| assigned to temporary variable $68 of type java.util.stream.Stream<MatchResult>
-> $68.map(MatchResult::group).toArray(String[]::new);
| Expression value is: [Ljava.lang.String;@6b09bb57
| assigned to temporary variable $69 of type String[]
-> Arrays.stream($69).forEach(System.out::println);
123
test
444
"don't split, this"
more test
1
Code
String so = "123,test,444,\"don't split, this\",more test,1";
Pattern.compile("\"[^\"]*\"|[^,]+")
.matcher(so)
.results()
.map(MatchResult::group)
.toArray(String[]::new);
Explanation
Regex "[^"]" matches: a quote, anything but a quote, a quote.
Regex "[^"]*" matches: a quote, anything but a quote 0 (or more) times, a quote.
That regex needs to go first to "win", otherwise matching anything but a comma 1 or more times - that is: [^,]+ - would "win".
results() requires Java 9 or higher.
It returns Stream<MatchResult>, which I map using group() call and collect to array of Strings. Parameterless toArray() call would return Object[].
You can do this very easily without complex regular expression:
Split on the character ". You get a list of Strings
Process each string in the list: Split every string that is on an even position in the List (starting indexing with zero) on "," (you get a list inside a list), leave every odd positioned string alone (directly putting it in a list inside the list).
Join the list of lists, so you get only a list.
If you want to handle quoting of '"', you have to adapt the algorithm a little bit (joining some parts that you have incorrectly split off, or changing the splitting to a simple regexp), but the basic structure stays.
So basically it is something like this:
public class SplitTest {
public static void main(String[] args) {
final String splitMe="123,test,444,\"don't split, this\",more test,1";
final String[] splitByQuote=splitMe.split("\"");
final String[][] splitByComma=new String[splitByQuote.length][];
for(int i=0;i<splitByQuote.length;i++) {
String part=splitByQuote[i];
if (i % 2 == 0){
splitByComma[i]=part.split(",");
}else{
splitByComma[i]=new String[1];
splitByComma[i][0]=part;
}
}
for (String parts[] : splitByComma) {
for (String part : parts) {
System.out.println(part);
}
}
}
}
This will be much cleaner with lambdas, promised!
Please see the code snippet below. This code only considers the happy flow. Change it according to your requirements.
public static String[] splitWithEscape(final String str, char split,
char escapeCharacter) {
final List<String> list = new LinkedList<String>();
char[] cArr = str.toCharArray();
boolean isEscape = false;
StringBuilder sb = new StringBuilder();
for (char c : cArr) {
if (isEscape && c != escapeCharacter) {
sb.append(c);
} else if (c != split && c != escapeCharacter) {
sb.append(c);
} else if (c == escapeCharacter) {
if (!isEscape) {
isEscape = true;
if (sb.length() > 0) {
list.add(sb.toString());
sb = new StringBuilder();
}
} else {
isEscape = false;
}
} else if (c == split) {
list.add(sb.toString());
sb = new StringBuilder();
}
}
if (sb.length() > 0) {
list.add(sb.toString());
}
String[] strArr = new String[list.size()];
return list.toArray(strArr);
}
I want to remove the leading and trailing whitespace from string:
String s = " Hello World ";
I want the result to be like:
s == "Hello World";
s.trim()
see String#trim()
Without any internal method, use regex like
s.replaceAll("^\\s+", "").replaceAll("\\s+$", "")
or
s.replaceAll("^\\s+|\\s+$", "")
or just use pattern in pure form
String s=" Hello World ";
Pattern trimmer = Pattern.compile("^\\s+|\\s+$");
Matcher m = trimmer.matcher(s);
StringBuffer out = new StringBuffer();
while(m.find())
m.appendReplacement(out, "");
m.appendTail(out);
System.out.println(out+"!");
String s="Test ";
s= s.trim();
I prefer not to use regular expressions for trivial problems. This would be a simple option:
public static String trim(final String s) {
final StringBuilder sb = new StringBuilder(s);
while (sb.length() > 0 && Character.isWhitespace(sb.charAt(0)))
sb.deleteCharAt(0); // delete from the beginning
while (sb.length() > 0 && Character.isWhitespace(sb.charAt(sb.length() - 1)))
sb.deleteCharAt(sb.length() - 1); // delete from the end
return sb.toString();
}
Use the String class trim method. It will remove all leading and trailing whitespace.
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
String s=" Hello World ";
s = s.trim();
For more information See This
Simply use trim(). It only eliminates the excess whitespace at the start and end of a string.
String fav = " I like apple ";
fav = fav.trim();
System.out.println(fav);
Output:
I like apple //no extra space at start and end of the string
String.trim() answers the question but was not an option for me.
As stated here :
it simply regards anything up to and including U+0020 (the usual space character) as whitespace, and anything above that as non-whitespace.
This results in it trimming the U+0020 space character and all “control code” characters below U+0020 (including the U+0009 tab character), but not the control codes or Unicode space characters that are above that.
I am working with Japanese, where we have full-width characters; the full-width space (U+3000) would not be trimmed by String.trim().
I therefore made a function which, like xehpuk's snippet, uses Character.isWhitespace().
However, this version is not using a StringBuilder and instead of deleting characters, finds the 2 indexes it needs to take a trimmed substring out of the original String.
public static String trimWhitespace(final String stringToTrim) {
int endIndex = stringToTrim.length();
// Return the string if it's empty
if (endIndex == 0) return stringToTrim;
int firstIndex = -1;
// Find first character which is not a whitespace, if any
// (increment from beginning until either first non whitespace character or end of string)
while (++firstIndex < endIndex && Character.isWhitespace(stringToTrim.charAt(firstIndex))) { }
// If firstIndex did not reach end of string, Find last character which is not a whitespace,
// (decrement from end until last non whitespace character)
while (--endIndex > firstIndex && Character.isWhitespace(stringToTrim.charAt(endIndex))) { }
// Return substring using indexes
return stringToTrim.substring(firstIndex, endIndex + 1);
}
s = s.trim();
More info:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim()
Why do you not want to use predefined methods? They are usually most efficient.
See String#trim() method
Since Java 11, the String class has a strip() method, which returns a string whose value is this string, with all leading and trailing white space removed. It was introduced to overcome the limitations of the trim method.
Docs: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#strip()
Example:
String str = " abc ";
// public String strip()
str = str.strip(); // Returns abc
There are two more useful methods in Java 11+ String class:
stripLeading() : Returns a string whose value is this string,
with all leading white space removed.
stripTrailing() : Returns a string whose value is this string,
with all trailing white space removed.
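A short sketch of the difference, using the full-width (ideographic) space U+3000 mentioned in an earlier answer as the test character. It needs Java 11 or later, and the class and method names are hypothetical:

```java
public class StripVsTrim {
    // Hypothetical helper: returns { trim() result, strip() result } for comparison.
    static String[] compare(String s) {
        return new String[] { s.trim(), s.strip() };
    }

    public static void main(String[] args) {
        String s = "\u3000Hello World\u3000";   // wrapped in full-width spaces
        String[] r = compare(s);
        System.out.println(r[0].length());      // 13: trim() leaves U+3000 in place
        System.out.println(r[1].length());      // 11: strip() removes it
    }
}
```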
While @xehpuk's method is good if you want to avoid using regex, it has O(n^2) time complexity. The following solution also avoids regex, but is O(n):
if(s.length() == 0)
return "";
char left = s.charAt(0);
char right = s.charAt(s.length() - 1);
int leftWhitespace = 0;
int rightWhitespace = 0;
boolean leftBeforeRight = leftWhitespace < s.length() - 1 - rightWhitespace;
while ((left == ' ' || right == ' ') && leftBeforeRight) {
if(left == ' ') {
leftWhitespace++;
left = s.charAt(leftWhitespace);
}
if(right == ' ') {
rightWhitespace++;
right = s.charAt(s.length() - 1 - rightWhitespace);
}
leftBeforeRight = leftWhitespace < s.length() - 1 - rightWhitespace;
}
String result = s.substring(leftWhitespace, s.length() - rightWhitespace);
return result.equals(" ") ? "" : result;
This counts the whitespace characters at the beginning and end of the string, until either the "left" and "right" indices obtained from the whitespace counts meet, or both indices have reached a non-whitespace character. Afterwards, we either return the substring obtained using the whitespace counts, or the empty string if the result is a single whitespace (needed to account for all-whitespace strings with an odd number of characters).
I need to write an extended version of the StringUtils.commaDelimitedListToStringArray function which gets an additional parameter: the escape char.
so calling my:
commaDelimitedListToStringArray("test,test\\,test\\,test,test", "\\")
should return:
["test", "test,test,test", "test"]
My current attempt is to use String.split() to split the String using regular expressions:
String[] array = str.split("[^\\\\],");
But the returned array is:
["tes", "test\,test\,tes", "test"]
Any ideas?
The regular expression
[^\\],
means "match a character which is not a backslash followed by a comma" - this is why patterns such as t, are matching, because t is a character which is not a backslash.
I think you need to use some sort of negative lookbehind, to capture a , which is not preceded by a \ without capturing the preceding character, something like
(?<!\\),
(BTW, note that I have purposefully not doubly-escaped the backslashes to make this more readable)
Try:
String array[] = str.split("(?<!\\\\),");
Basically this is saying: split on a comma, except where that comma is preceded by a backslash (the four backslashes in the Java source denote a single literal backslash in the regex). This is called a zero-width negative lookbehind assertion.
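A quick sketch that runs this split on the string from the question and then removes the escape character in front of the commas that were kept. The class and method names are hypothetical:

```java
import java.util.Arrays;

public class LookbehindSplitDemo {
    // Splits on commas not preceded by a backslash, then unescapes "\," to ",".
    static String[] split(String str) {
        String[] parts = str.split("(?<!\\\\),");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace("\\,", ",");
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("test,test\\,test\\,test,test")));
        // [test, test,test,test, test]
    }
}
```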
For future reference, here is the complete method i ended up with:
public static String[] commaDelimitedListToStringArray(String str, String escapeChar) {
// these characters need to be escaped in a regular expression
String regularExpressionSpecialChars = "/.*+?|()[]{}\\";
String escapedEscapeChar = escapeChar;
// if the escape char for our comma separated list needs to be escaped
// for the regular expression, escape it using the \ char
if(regularExpressionSpecialChars.indexOf(escapeChar) != -1)
escapedEscapeChar = "\\" + escapeChar;
// see http://stackoverflow.com/questions/820172/how-to-split-a-comma-separated-string-while-ignoring-escaped-commas
String[] temp = str.split("(?<!" + escapedEscapeChar + "),", -1);
// remove the escapeChar for the end result
String[] result = new String[temp.length];
for(int i=0; i<temp.length; i++) {
result[i] = temp[i].replaceAll(escapedEscapeChar + ",", ",");
}
return result;
}
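The hand-maintained list of special characters above omits some regex metacharacters (for example ^ and $). A possible alternative, sketched here under the assumption that the escape sequence is a fixed string, is to let Pattern.quote do the escaping. The class and method names are hypothetical:

```java
import java.util.regex.Pattern;

public class QuotedEscapeSplit {
    static String[] split(String str, String escapeChar) {
        // Pattern.quote turns the escape string into a literal, whatever characters it contains
        String quoted = Pattern.quote(escapeChar);
        String[] parts = str.split("(?<!" + quoted + "),", -1);
        for (int i = 0; i < parts.length; i++) {
            // String.replace works on literal text, so no regex escaping is needed here
            parts[i] = parts[i].replace(escapeChar + ",", ",");
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String s : split("test,test\\,test\\,test,test", "\\")) {
            System.out.println(s);
        }
    }
}
```

Like the method above, this sketch does not handle an escaped escape character.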
As matt b said, [^\\], will interpret the character preceding the comma as a part of the delimiter.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,tes" , "test"]
As drvdijk said, (?<!\\), will misinterpret escaped backslashes.
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\,test\\,test" , "test"]
-(unescape commas)->
["test\\\\,test\\,test,test" , "test"]
I would expect being able to escape backslashes as well...
"test\\\\\\,test\\\\,test\\,test,test"
-(split)->
["test\\\\\\,test\\\\" , "test\\,test" , "test"]
-(unescape commas and backslashes)->
["test\\,test\\" , "test,test" , "test"]
drvdijk suggested (?<=(?<!\\\\)(\\\\\\\\){0,100}), which works well for lists with elements ending with up to 100 backslashes. This is far enough... but why a limit? Is there a more efficient way (isn't lookbehind greedy)? What about invalid strings?
I searched for a while for a generic solution, then I wrote the thing myself... The idea is to split following a pattern that matches the list elements (instead of matching the delimiter).
My answer does not take the escape character as a parameter.
public static List<String> commaDelimitedListStringToStringList(String list) {
// Check the validity of the list
// ex: "te\\st" is not valid, backslash should be escaped
if (!list.matches("^(([^\\\\,]|\\\\,|\\\\\\\\)*(,|$))+")) {
// Could also raise an exception
return null;
}
// Matcher for the list elements
Matcher matcher = Pattern
.compile("(?<=(^|,))([^\\\\,]|\\\\,|\\\\\\\\)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
// Unescape the list element
result.add(matcher.group().replaceAll("\\\\([\\\\,])", "$1"));
}
return result;
}
Description for the pattern (unescaped):
(?<=(^|,)) lookbehind: preceded by the start of the string or a ,
([^\\,]|\\,|\\\\)* the element, composed of \,, \\ or characters which are neither \ nor ,
(?=(,|$)) lookahead: followed by the end of the string or a ,
The pattern may be simplified.
Even with the 3 parsings (matches + find + replaceAll), this method seems faster than the one suggested by drvdijk. It can still be optimized by writing a specific parser.
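As a rough idea of what such a specific parser could look like, here is a single-pass sketch. It assumes backslash is the escape character and that only \\ and \, are valid escapes; throwing an exception on invalid input (instead of returning null) is a choice made for this sketch, and the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class EscapedListParser {
    static List<String> parse(String list) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < list.length(); i++) {
            char c = list.charAt(i);
            if (c == '\\') {
                // An escape must be followed by a backslash or a comma
                if (i + 1 >= list.length()) {
                    throw new IllegalArgumentException("Dangling escape at index " + i);
                }
                char next = list.charAt(++i);
                if (next != '\\' && next != ',') {
                    throw new IllegalArgumentException("Invalid escape at index " + (i - 1));
                }
                current.append(next);           // keep the escaped character only
            } else if (c == ',') {
                result.add(current.toString()); // unescaped comma ends the element
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        result.add(current.toString());         // last element
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("test\\\\\\,test\\\\,test\\,test,test"));
    }
}
```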
Also, what is the need of having an escape character if only one character is special, it could simply be doubled...
public static List<String> commaDelimitedListStringToStringList2(String list) {
if (!list.matches("^(([^,]|,,)*(,|$))+")) {
return null;
}
Matcher matcher = Pattern.compile("(?<=(^|,))([^,]|,,)*(?=(,|$))")
.matcher(list);
ArrayList<String> result = new ArrayList<String>();
while (matcher.find()) {
result.add(matcher.group().replaceAll(",,", ","));
}
return result;
}
split(/(?<!\\),/g) worked for me, but the accepted answer did not
> var x = "test,test\,test\,test,test"
undefined
> x.split(/(?<!\\),/g)
[ 'test', 'test\\,test\\,test', 'test' ]
> x.split("(?<!\\\\),")
[ 'test,test\\,test\\,test,test' ]
It's probably not a "super fancy" solution, but possibly a more time-efficient one. Escaping the escape character is also supported, and it works in browsers that do not support lookbehinds.
function splitByDelimiterIfItIsNotEscaped (text, delimiter, escapeCharacter) {
const splittedText = []
let numberOfDelimitersBeforeOtherCharacter = 0
let nextSplittedTextPartIndex = 0
for (let characterIndex = 0, character = text[0]; characterIndex < text.length; characterIndex++, character = text[characterIndex]) {
if (character === escapeCharacter) {
numberOfDelimitersBeforeOtherCharacter++
} else if (character === delimiter && (!numberOfDelimitersBeforeOtherCharacter || !(numberOfDelimitersBeforeOtherCharacter % 2))) {
splittedText.push(text.substring(nextSplittedTextPartIndex, characterIndex))
nextSplittedTextPartIndex = characterIndex + 1
} else {
numberOfDelimitersBeforeOtherCharacter = 0
}
}
if (nextSplittedTextPartIndex <= text.length) {
splittedText.push(text.substring(nextSplittedTextPartIndex, text.length))
}
return splittedText
}
function onChange () {
console.log(splitByDelimiterIfItIsNotEscaped(inputBox.value, ',', '\\'))
}
addEventListener('change', onChange)
onChange()
After making a change unfocus the input box (use tab for example).
<input id="inputBox" value="test,test\,test\,test,test"/>