Java parsing a string with lots of whitespace - java

I have a string with multiple spaces, but when I use the tokenizer it breaks it apart at all of those spaces. I need the tokens to contain those spaces. How can I utilize the StringTokenizer to return the values with the tokens I am splitting on?

You'll note in the docs for StringTokenizer that its use is discouraged in new code, and that String.split(regex) is recommended instead:
String foo = "this is some data in a string";
String[] bar = foo.split("\\s+");
Edit to add: Or, if you have greater needs than a simple split, then use the Pattern and Matcher classes for more complex regular expression matching and extracting.
Edit again: If you want to preserve your space, actually knowing a bit about regular expressions really helps:
String[] bar = foo.split("\\b");
This will split on word boundaries, preserving the space between each word as a String:
public static void main(String[] args)
{
    // Spacing between the words matches the counts reported in the output below
    String foo = "this is      some  data      in   a string";
    String[] bar = foo.split("\\b");
    for (String s : bar)
    {
        System.out.print(s);
        if (s.matches("^\\s+$"))
        {
            // This token is a run of whitespace: report its length
            System.out.println("\t<< " + s.length() + " spaces");
        }
        else
        {
            System.out.println();
        }
    }
}
Output:
this
<< 1 spaces
is
<< 6 spaces
some
<< 2 spaces
data
<< 6 spaces
in
<< 3 spaces
a
<< 1 spaces
string

Sounds like you may need to use regular expressions (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/package-summary.html) instead of StringTokenizer.

Use String.split("\\s+") instead of StringTokenizer.
Note that this extracts only the non-whitespace tokens, which must be separated by at least one whitespace character. If you want the leading/trailing whitespace kept together with the non-whitespace characters, that calls for a completely different solution!
This requirement isn't clear from your original question, and there is an edit pending that tries to clarify it.
StringTokenizer in almost every non-contrived case is the wrong tool for the job.
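If the whitespace runs themselves have to survive as tokens, one possible approach is to scan with Pattern and Matcher rather than split. The sketch below is only an illustration: the class name WhitespaceTokens and the method tokenize are made up, and it assumes a token is either a run of whitespace or a run of non-whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceTokens {
    // A token is either a run of whitespace or a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\\s+|\\S+");

    // Hypothetical helper: returns every token, whitespace runs included.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Brackets make the whitespace tokens visible when printed.
        for (String t : tokenize("this  is   data")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

Because nothing is discarded, joining the tokens back together reproduces the original string.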

I think it would be good to first use the replaceAll function to replace each run of multiple spaces with a single space, and then tokenize using the split function.
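A minimal sketch of that suggestion, assuming only plain spaces need collapsing. The class and method names (CollapseThenSplit, words) are invented for the example. Note that this throws the extra spaces away, so it does not help if the tokens must keep them.

```java
import java.util.Arrays;

class CollapseThenSplit {
    // Hypothetical helper: collapse runs of spaces, drop the ends, then split.
    static String[] words(String input) {
        String collapsed = input.replaceAll(" +", " ").trim();
        // "".split(" ") would return one empty string, so handle that case here.
        return collapsed.isEmpty() ? new String[0] : collapsed.split(" ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("  this is      some  data ")));
    }
}
```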

Related

Remove empty Strings after splitting a StringBuilder into Array Java

Sorry if this question has already been asked, but I could only find results of c#.
So I have this StringBuilder:
StringBuilder sb = new StringBuilder(" 111 11 ");
and I want to split it into an array using this method:
String[] ar = sb.toString().split(" ");
As expected the result array has some empty entries. My question is if I can remove these empty spaces directly when I split the StringBuilder or I have to do it afterwards.
split takes a regex. So:
String[] ar = sb.toString().split("\\s+");
The string \\s is regexp-ese for 'any whitespace', and the + is: 1 or more of it. If you want to split on spaces only (and not on newlines, tabs, etc), try: String[] ar = sb.toString().split(" +"); which is literally: "split on one or more spaces".
This trick works for just about any separator. For example, split on commas? Try: .split("\\s*,\\s*"), which is: 0 or more whitespace, a comma, followed by 0 or more whitespace (and regexes take as much as they can).
Note that this trick does NOT handle leading whitespace: a string that starts with whitespace yields an empty first element (split drops trailing empty strings on its own). To fix that, use trim. Putting it all together:
String[] ar = sb.toString().trim().split("\\s+");
and for commas:
String[] ar = sb.toString().trim().split("\\s*,\\s*");
I would use guava for this:
String t = " 111 11 ";
Splitter.on(Pattern.compile("\\s+"))
.omitEmptyStrings()
.split(t)
.forEach(System.out::println);
If you do not want to depend on any third-party libraries and do not want to filter with a regex, you can do it in one line with the Java 8 Streams API:
Arrays.stream(sb.toString().trim().split(" ")).filter(s-> !s.equals("")).map(s -> s.trim()).toArray();
For a detailed multiline version of the previous:
Arrays.stream(sb.toString()
        .trim()                     // Trim leading and trailing whitespace from the string
        .split(" "))                // Split on single spaces
    .filter(s -> !s.equals(""))     // Keep only the non-empty elements
    .map(s -> s.trim())             // Trim leading and trailing whitespace from each element
    .toArray();                     // Collect the elements into an Object array
Here is the working code for demonstration:
StringBuilder sb = new StringBuilder(" 111 11 ");
Object[] array = Arrays.stream(sb.toString().trim().split(" ")).filter(s-> !s.equals("")).map(s -> s.trim()).toArray();
System.out.println("(" + array[0] + ")");
System.out.println("(" + array[1] + ")");
There are a couple of regexes that deal with this. I would also prefer @rzwitserloot's method, but if you would like to see more, check here: How do I split a string with any whitespace chars as delimiters?
glenatron has explained it:
In most regex dialects there are a set of convenient character summaries you can use for this kind of thing - these are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.
Thanks to glenatron
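As a rough illustration of those character classes in Java (where each backslash is doubled inside a string literal), here is a small sketch. The class CharClassDemo and its two methods are hypothetical names chosen for the example.

```java
import java.util.Arrays;

class CharClassDemo {
    // Replace every digit (\d) with '#' and every whitespace character (\s) with '_'.
    static String mask(String input) {
        return input.replaceAll("\\d", "#").replaceAll("\\s", "_");
    }

    // Split on runs of non-word characters (\W); '_' counts as a word character.
    static String[] splitOnNonWord(String input) {
        return input.split("\\W+");
    }

    public static void main(String[] args) {
        System.out.println(mask("room 42\tfloor 7"));                    // room_##_floor_#
        System.out.println(Arrays.toString(splitOnNonWord("a-b_c, d"))); // [a, b_c, d]
    }
}
```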
You can use a turnkey solution from Apache Commons.
Here is an example:
StringBuilder sb = new StringBuilder(" 111 11 ");
String trimmedString = StringUtils.normalizeSpace(sb.toString());
String[] trimmedAr = trimmedString.split(" ");
System.out.println(Arrays.toString(trimmedAr));
Output: [111, 11].

String split regex [duplicate]

I'm new to regular expressions and would appreciate your help. I'm trying to put together an expression that will split the example string using all spaces that are not surrounded by single or double quotes. My last attempt looks like this: (?!") and isn't quite working. It's splitting on the space before the quote.
Example input:
This is a string that "will be" highlighted when your 'regular expression' matches something.
Desired output:
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something.
Note that "will be" and 'regular expression' retain the space between the words.
I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:
[^\s"']+|"([^"]*)"|'([^']*)'
I added the capturing groups because you don't want the quotes in the list.
This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
if (regexMatcher.group(1) != null) {
// Add double-quoted string without the quotes
matchList.add(regexMatcher.group(1));
} else if (regexMatcher.group(2) != null) {
// Add single-quoted string without the quotes
matchList.add(regexMatcher.group(2));
} else {
// Add unquoted word
matchList.add(regexMatcher.group());
}
}
If you don't mind having the quotes in the returned list, you can use much simpler code:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
There are several questions on StackOverflow that cover this same question in various contexts using regular expressions. For instance:
parsings strings: extracting words and phrases
Best way to parse Space Separated Text
UPDATE: Sample regex to handle single and double quoted strings. Ref: How can I split on a string except when inside quotes?
m/('.*?'|".*?"|\S+)/g
Tested this with a quick Perl snippet and the output was as reproduced below. Also works for empty strings or whitespace-only strings if they are between quotes (not sure if that's desired or not).
This
is
a
string
that
"will be"
highlighted
when
your
'regular expression'
matches
something.
Note that this does include the quote characters themselves in the matched values, though you can remove that with a string replace, or modify the regex to not include them. I'll leave that as an exercise for the reader or another poster for now, as 2am is way too late to be messing with regular expressions anymore ;)
If you want to allow escaped quotes inside the string, you can use something like this:
(?:(['"])(.*?)(?<!\\)(?>\\\\)*\1|([^\s]+))
Quoted strings will be group 2, single unquoted words will be group 3.
You can try it on various strings here: http://www.fileformat.info/tool/regex.htm or http://gskinner.com/RegExr/
The regex from Jan Goyvaerts is the best solution I have found so far, but it also creates empty (null) matches, which he excludes in his program. These empty matches also appear in regex testers (e.g. rubular.com).
If you turn the searches around (first look for the quoted parts and then for the space-separated words), you can do it in one go with:
("[^"]*"|'[^']*'|[\S]+)+
(?<!\G".{0,99999})\s|(?<=\G".{0,99999}")\s
This will match the spaces not surrounded by double quotes.
I have to use min,max {0,99999} because Java doesn't support * and + in lookbehind.
It'll probably be easier to search the string, grabbing each part, vs. split it.
The reason is that you can have it split at the spaces before and after "will be", but I can't think of any way to make split ignore the space inside the quotes.
(not actual Java)
string = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
regex = "\"(\\\"|(?!\\\").)+\"|[^ ]+"; // search for a quoted or non-spaced group
final = new Array();
while (string.length > 0) {
string = string.trim();
if (Regex(regex).test(string)) {
final.push(Regex(regex).match(string)[0]);
string = string.replace(regex, ""); // progress to next "word"
}
}
Also, capturing single quotes could lead to issues:
"Foo's Bar 'n Grill"
//=>
"Foo"
"s Bar "
"n"
"Grill"
String.split() is not helpful here because there is no way to distinguish between spaces within quotes (don't split) and those outside (split). Matcher.lookingAt() is probably what you need:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
str = str + " "; // add trailing space
int len = str.length();
Matcher m = Pattern.compile("((\"[^\"]+?\")|('[^']+?')|([^\\s]+?))\\s++").matcher(str);
for (int i = 0; i < len; i++)
{
m.region(i, len);
if (m.lookingAt())
{
String s = m.group(1);
if ((s.startsWith("\"") && s.endsWith("\"")) ||
(s.startsWith("'") && s.endsWith("'")))
{
s = s.substring(1, s.length() - 1);
}
System.out.println(i + ": \"" + s + "\"");
i += (m.group(0).length() - 1);
}
}
which produces the following output:
0: "This"
5: "is"
8: "a"
10: "string"
17: "that"
22: "will be"
32: "highlighted"
44: "when"
49: "your"
54: "regular expression"
75: "matches"
83: "something."
I liked Marcus's approach, however, I modified it so that I could allow text near the quotes, and support both " and ' quote characters. For example, I needed a="some value" to not split it into [a=, "some value"].
"(?<!\\G\\S{0,99999}[\"'].{0,99999})\\s|(?<=\\G\\S{0,99999}\".{0,99999}\"\\S{0,99999})\\s|(?<=\\G\\S{0,99999}'.{0,99999}'\\S{0,99999})\\s"
Jan's approach is great but here's another one for the record.
If you actually wanted to split as mentioned in the title, keeping the quotes in "will be" and 'regular expression', then you could use this method which is straight out of Match (or replace) a pattern except in situations s1, s2, s3 etc
The regex:
'[^']*'|\"[^\"]*\"|( )
The two left alternations match complete 'quoted strings' and "double-quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expressions on the left. We replace those with SplitHere then split on SplitHere. Again, this is for a true split case where you want "will be", not will be.
Here is a full working implementation (see the results on the online demo).
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
Pattern regex = Pattern.compile("\'[^']*'|\"[^\"]*\"|( )");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0))); // quote so $ and \ in the match are taken literally
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits) System.out.println(split);
} // end main
} // end Program
If you are using c#, you can use
string input= "This is a string that \"will be\" highlighted when your 'regular expression' matches <something random>";
List<string> list1 =
Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""|'(?<match>[\w\s]*)'|<(?<match>[\w\s]*)>").Cast<Match>().Select(m => m.Groups["match"].Value).ToList();
foreach(var v in list1)
Console.WriteLine(v);
I have specifically added "|<(?<match>[\w\s]*)>" to highlight that you can specify any char to group phrases. (In this case I am using < > to group.)
Output is :
This
is
a
string
that
will be
highlighted
when
your
regular expression
matches
something random
1st one-liner using String.split()
String s = "This is a string that \"will be\" highlighted when your 'regular expression' matches something.";
String[] split = s.split( "(?<!(\"|').{0,255}) | (?!.*\\1.*)" );
[This, is, a, string, that, "will be", highlighted, when, your, 'regular expression', matches, something.]
don't split at the blank, if the blank is surrounded by single or double quotes
split at the blank when the 255 characters to the left and all characters to the right of the blank are neither single nor double quotes
adapted from original post (handles only double quotes)
I'm reasonably certain this is not possible using regular expressions alone. Checking whether something is contained inside some other tag is a parsing operation. This seems like the same problem as trying to parse XML with a regex -- it can't be done correctly. You may be able to get your desired outcome by repeatedly applying a non-greedy, non-global regex that matches the quoted strings, then once you can't find anything else, split it at the spaces... that has a number of problems, including keeping track of the original order of all the substrings. Your best bet is to just write a really simple function that iterates over the string and pulls out the tokens you want.
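A hand-written scanner along those lines might look like the sketch below. It is not taken from any answer here: the class QuoteAwareTokenizer and its tokenize method are invented names, and it assumes that only plain spaces separate tokens, that quotes are stripped from the result, and that there are no escaped quotes. A stray apostrophe such as the one in Foo's would be treated as an opening quote.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareTokenizer {
    // Hypothetical helper: split on spaces, except inside '...' or "...".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0;          // 0 means "not inside quotes"
        boolean inToken = false; // true once the current token has started

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;             // closing quote: leave quoted mode
                } else {
                    current.append(c);     // spaces inside quotes are kept
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                 // opening quote: enter quoted mode
                inToken = true;
            } else if (c == ' ') {
                if (inToken) {             // a space outside quotes ends the token
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c);
                inToken = true;
            }
        }
        if (inToken) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "This is a string that \"will be\" highlighted when your "
                 + "'regular expression' matches something.";
        tokenize(s).forEach(System.out::println);
    }
}
```

Keeping the state in two plain variables makes the order of tokens trivially correct, which is the bookkeeping problem the repeated-regex approach runs into.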
A couple hopefully helpful tweaks on Jan's accepted answer:
(['"])((?:\\\1|.)+?)\1|([^\s"']+)
Allows escaped quotes within quoted strings
Avoids repeating the pattern for the single and double quote; this also simplifies adding more quoting symbols if needed (at the expense of one more capturing group)
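For reference, this is roughly how that pattern could be used from Java, where every backslash in the regex has to be doubled in the string literal. The class name BackrefTokenizer and the method tokenize are assumptions for the sketch. Escaped quotes are kept in the token together with their backslash, and an empty quoted string is not handled because the pattern requires at least one character between the quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefTokenizer {
    // Regex: (['"])((?:\\\1|.)+?)\1|([^\s"']+)
    // group 1 = opening quote, group 2 = quoted contents, group 3 = unquoted word
    private static final Pattern TOKEN =
        Pattern.compile("(['\"])((?:\\\\\\1|.)+?)\\1|([^\\s\"']+)");

    // Hypothetical helper: returns the tokens without their surrounding quotes.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return out;
    }

    public static void main(String[] args) {
        tokenize("that \"will be\" highlighted 'regular expression'")
            .forEach(System.out::println);
    }
}
```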
You can also try this:
String str = "This is a string that \"will be\" highlighted when your 'regular expression' matches something";
String ss[] = str.split("\"|\'");
for (int i = 0; i < ss.length; i++) {
if ((i % 2) == 0) {//even
String[] part1 = ss[i].split(" ");
for (String pp1 : part1) {
System.out.println("" + pp1);
}
} else {//odd
System.out.println("" + ss[i]);
}
}
The following returns an array of arguments. Arguments are the variable 'command' split on spaces, unless included in single or double quotes. The matches are then modified to remove the single and double quotes.
using System.Text.RegularExpressions;
var args = Regex.Matches(command, "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'").Cast<Match>
().Select(iMatch => iMatch.Value.Replace("\"", "").Replace("'", "")).ToArray();
When you come across a pattern like this:
String str = "2022-11-10 08:35:00,470 RAV=REQ YIP=02.8.5.1 CMID=caonaustr CMN=\"Some Value Pyt Ltd\"";
//this helped
String[] str1= str.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println("Value of split string is "+ Arrays.toString(str1));
This results in: [2022-11-10, 08:35:00,470, RAV=REQ, YIP=02.8.5.1, CMID=caonaustr, CMN="Some Value Pyt Ltd"]
This regex matches a space only if it is followed by an even number of double quotes.

Regex matching unescaped commas in Java

Problem description
I am trying to split a string into separate strings with the split() method that the String class provides. The documentation tells me that it will split around matches of the argument, which is a regular expression. The delimiter that I use is a comma, but commas can also be escaped. The escape character that I use is a forward slash / (just to make things easier by not using a backslash, because that requires additional escaping in string literals in both Java and the regular expressions).
For instance, the input might be this:
a,b/,b//,c///,//,d///,
And the output should be:
a
b,b/
c/,/
d/,
So, the string should be split at each comma, unless that comma is preceded by an odd number of slashes (1, 3, 5, 7, ..., ∞) because that would mean that the comma is escaped.
Possible solutions
My initial guess would be to split it like this:
String[] strings = longString.split("(?<![^/](//)*/),");
but that is not allowed because Java doesn't allow infinite look-behind groups. I could limit the recurrence to, say, 2000 by replacing the * with {0,2000}:
String[] strings = longString.split("(?<![^/](//){0,2000}/),");
but that still puts constraints on the input. So I decided to take the recurrence out of the look-behind group, and came up with this:
String[] strings = longString.split("(?<!/)(?:(//)*),");
However, its output is the following list of strings:
a
b,b (the final slash is lacking in the output)
c/, (the final slash is lacking in the output)
d/,
Why are those slashes omitted in the 2nd and 3rd string, and how can I solve it (in Java)?
You are pretty close. To overcome lookbehind error you can use this workaround:
String[] strings = longString.split("(?<![^/](//){0,99}/),");
You can achieve the split using a positive look behind for an even number of slashes preceding the comma:
String[] strings = longString.split("(?<=[^/](//){0,999999999}),");
But to display the output you want, you need a further step of removing the remaining escapes:
String longString = "a,b/,b//,c///,//,d///,";
String[] strings = longString.split("(?<=[^/](//){0,999999999}),");
for (String s : strings)
System.out.println(s.replaceAll("/(.)", "$1"));
Output:
a
b,b/
c/,/
d/,
If you don't mind another method with regex, I suggest using .matcher:
Pattern pattern = Pattern.compile("(?:[^,/]+|/.)+");
String test = "a,b/,b//,c///,//,d///,";
Matcher matcher = pattern.matcher(test);
while (matcher.find()) {
System.out.println(matcher.group().replaceAll("/(.)", "$1"));
}
Output:
a
b,b/
c/,/
d/,
ideone demo
This method will match everything except the delimiting commas (kind of the reverse). The advantage is that it doesn't rely on lookarounds.
I love regexes, but wouldn't it be easy to write the code manually here, i.e.
boolean escaped = false;
for (int i = 0, len = s.length(); i < len; i++) {
    switch (s.charAt(i)) {           // charAt returns a char, so the labels must be char literals
        case '/': escaped = !escaped; break;   // a slash toggles the escape state
        case ',':
            if (!escaped) {
                // found a segment, do something with it
            }
            // Fallthrough!
        default:
            escaped = false;
    }
}
// handle last segment
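Filled in, the manual approach could look like the following sketch. It is one possible completion, not the answerer's code: the names EscapedCommaSplitter and splitUnescaped are made up, it unescapes while it splits, and it assumes a dangling slash at the very end should be kept as a literal slash.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedCommaSplitter {
    // Hypothetical helper: split on ',' unless it is escaped with '/', and
    // resolve the escapes ("/," becomes ",", "//" becomes "/") along the way.
    static List<String> splitUnescaped(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (escaped) {
                current.append(c);       // previous char was '/': take this one literally
                escaped = false;
            } else if (c == '/') {
                escaped = true;          // start of an escape sequence
            } else if (c == ',') {
                parts.add(current.toString()); // unescaped comma ends the segment
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (escaped) {
            current.append('/');         // assumption: keep a trailing lone slash
        }
        parts.add(current.toString());   // handle the last segment
        return parts;
    }

    public static void main(String[] args) {
        splitUnescaped("a,b/,b//,c///,//,d///,").forEach(System.out::println);
    }
}
```

Run on the example input from the question, this yields the four strings listed there as the desired output.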

Error when splitting a string in java

I am trying to split a string according to a certain set of delimiters.
My delimiters are: ,"():;.!? single spaces or multiple spaces.
This is the code I'm currently using:
String[] arrayOfWords= inputString.split("[\\s{2,}\\,\"\\(\\)\\:\\;\\.\\!\\?-]+");
which works fine for most cases, but I have a problem when the first word is surrounded by quotation marks. For example:
String inputString = "\"Word\" some more text.";
Is giving me this output
arrayOfWords[0] = ""
arrayOfWords[1] = "Word"
arrayOfWords[2] = "some"
arrayOfWords[3] = "more"
arrayOfWords[4] = "text"
I want the output to give me an array with
arrayOfWords[0] = "Word"
arrayOfWords[1] = "some"
arrayOfWords[2] = "more"
arrayOfWords[3] = "text"
This code has been working fine when quotation marks are used in the middle of the sentence, I'm not sure what the trouble is when it's at the beginning.
EDIT: I just realized I have same problem when any of the delimiters are used as the first character of the string
Unfortunately, you won't be able to remove this empty first element using only split. You should probably first remove any leading characters that match your delimiters and then split. Also, your regex seems to be incorrect because:
by adding {2,} inside [...] you are making {, 2, the comma, and } delimiter characters,
you don't need to escape the rest of your delimiters (note that you don't have to escape - only because it is at the end of the character class [], so it can't be used as a range operator).
Try it this way:
String regexDelimiters = "[\\s,\"():;.!?\\-]+";
String inputString = "\"Word\" some more text.";
String[] arrayOfWords = inputString.replaceAll(
"^" + regexDelimiters,"").split(regexDelimiters);
for (String s : arrayOfWords)
System.out.println("'" + s + "'");
output:
'Word'
'some'
'more'
'text'
A delimiter is interpreted as separating the strings on either side of it, thus the empty string on its left is added to the result as well as the string to its right ("Word"). To prevent this, you should first strip any leading delimiters, as described here:
How to prevent java.lang.String.split() from creating a leading empty string?
So in short form you would have:
String delim = "[\\s,\"():;.!?\\-]+";
String[] arrayOfWords = inputString.replaceFirst("^" + delim, "").split(delim);
Edit: Looking at Pshemo's answer, I realize he is correct regarding your regex. Inside the brackets it's unnecessary to specify the number of space characters, as they will be caught by the + operator.
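A different way to sidestep the leading empty string, sketched here as an alternative to split rather than as either answerer's method, is to match the words themselves with Matcher.find(). The names WordFinder and words are hypothetical, and the pattern is simply the negation of the delimiter class used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFinder {
    // A word is a run of characters that are NOT delimiters.
    private static final Pattern WORD = Pattern.compile("[^\\s,\"():;.!?\\-]+");

    // Hypothetical helper: collect every run of non-delimiter characters.
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("\"Word\" some more text."));
    }
}
```

Since only non-empty matches are collected, leading or trailing delimiters never produce empty entries.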

Python split semantics in Java

When I split a string in python, adjacent space delimiters are merged:
>>> str = "hi there"
>>> str.split()
['hi', 'there']
In Java, the delimiters are not merged:
$ cat Split.java
class Split {
public static void main(String args[]) {
String str = "hi there";
String result = "";
for (String tok : str.split(" "))
result += tok + ",";
System.out.println(result);
}
}
$ javac Split.java ; java Split
hi,,,,,,,,,,,,,,there,
Is there a straightforward way to get python space split semantics in java?
String.split accepts a regular expression, so provide it with one that matches adjacent whitespace:
str.split("\\s+")
If you want to emulate the exact behaviour of Python's str.split(), you'd need to trim as well:
str.trim().split("\\s+")
Quote from the Python docs on str.split():
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
So the above is still not an exact equivalent, because it will return [''] for the empty string, but it's probably okay for your purposes :)
Use str.split("\\s+") instead. This will do what you need.
Java uses a regex to split, so splitting on a single space will give you many array elements.
When no parameter is passed, Python's split strips leading and trailing whitespace and then treats each run of whitespace as a single separator.
So the closer equivalent would be:
"my string".trim().split("\\s+");
The problem with Niklas B.'s answer is that trim has its own definition of whitespace, i.e., anything with code up to '\u0020'. The following should get close enough to the Python version, including the fix for the empty string:
class TestSplit {
private static final String[] EMPTY = {};
private static String[] pySplit(String s) {
s = s.replaceAll("^\\s+", "").replaceAll("\\s+$", "");
if (s.isEmpty()) return EMPTY;
return s.split("\\s+");
}
}
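To check the edge cases, here is a self-contained sketch that repeats the pySplit logic from above and adds a main method. The class name PySplitDemo is invented, and the helper is made non-private only so it can be called from outside the class.

```java
import java.util.Arrays;

class PySplitDemo {
    // Same logic as pySplit above: strip using the regex notion of \s, then split.
    static String[] pySplit(String s) {
        String stripped = s.replaceAll("^\\s+", "").replaceAll("\\s+$", "");
        if (stripped.isEmpty()) {
            return new String[0];      // Python returns [] for blank input
        }
        return stripped.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(pySplit("  hi      there ")));
        System.out.println(pySplit("   ").length);
        // For comparison: plain split on an empty string returns one empty element.
        System.out.println("".split("\\s+").length);
    }
}
```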
In java, String.split takes a regex. So you can do str.split(" +") to get python semantics.
