Reassemble split string based on previous split in JAVA? - java

If I split a string, say like this:
List<String> words = Arrays.asList(input.split("\\s+"));
And I then wanted to modify those words in various ways, then reassemble them using the same logic, assuming no word lengths have changed, is there a way to do that easily? Humor me in that there's a reason I'm doing this.
Note: I need to match all whitespace, not just spaces. Hence the regex.
i.e.:
"Beautiful Country" -> ["Beautiful", "Country"] -> ["BEAUTIFUL", "COUNTRY"] -> "BEAUTIFUL COUNTRY"

If you use String.split, there is no way to be sure that the reassembled strings will be the same as the original ones.
In general (and in your case) there is no way to capture what the actual separators used were. In your example, "\\s+" will match one or more whitespace characters, but you don't know which characters were used, or how many there were.
When you use split, the information about the separators is lost. Period.
(On the other hand, if you don't care that the reassembled string may be a different length or may have different separators to the original, use the Joiner class ...)
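If the separators do matter, one option is to skip split entirely and walk the input with a Matcher, recording each whitespace run as it is found. The following is only a sketch of that idea; the class and method names (WhitespacePreservingSplit, tokenize, upperCaseWords) are made up for illustration, and upper-casing stands in for whatever transformation you actually need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespacePreservingSplit {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns words and whitespace runs in their original order.
    // Words sit at even indexes, separators at odd indexes; the first
    // or last word is an empty string when the input starts or ends
    // with whitespace.
    static List<String> tokenize(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = WHITESPACE.matcher(input);
        int last = 0;
        while (m.find()) {
            parts.add(input.substring(last, m.start())); // word (possibly empty)
            parts.add(m.group());                        // the exact separator
            last = m.end();
        }
        parts.add(input.substring(last));                // trailing word (possibly empty)
        return parts;
    }

    // Example transformation: upper-case the words, keep separators as they were.
    static String upperCaseWords(String input) {
        List<String> parts = tokenize(input);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            String part = parts.get(i);
            out.append(i % 2 == 0 ? part.toUpperCase(Locale.ROOT) : part);
        }
        return out.toString();
    }
}
```

Because every separator is stored verbatim, concatenating the parts always reproduces the original spacing, tabs and newlines included.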

Assuming you have a limit on how many words you can expect, you could try writing a regular expression like
(\S+)(\s+)?(\S+)?(\s+)?(\S+)?
(for the case in which you expect up to three words). You could then use the Matcher API methods groupCount(), group(n) to pull the individual words (the odd groups) or whitespace separators (the even groups >0), do what you needed with the words, and re-assemble them once again...
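A rough sketch of how that could look for the three-word pattern, assuming the input does not start with whitespace. The class name ThreeWordReassembler is hypothetical, and upper-casing is just a placeholder transformation.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreeWordReassembler {
    // Up to three words: odd groups are words, even groups are whitespace runs.
    private static final Pattern UP_TO_THREE =
            Pattern.compile("(\\S+)(\\s+)?(\\S+)?(\\s+)?(\\S+)?");

    static String upperCaseWords(String input) {
        Matcher m = UP_TO_THREE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    "expected one to three words without leading whitespace");
        }
        StringBuilder out = new StringBuilder();
        for (int g = 1; g <= m.groupCount(); g++) {
            String text = m.group(g);
            if (text == null) {
                continue; // optional group that did not take part in the match
            }
            // Transform the words, copy the separators unchanged.
            out.append(g % 2 == 1 ? text.toUpperCase(Locale.ROOT) : text);
        }
        return out.toString();
    }
}
```

The obvious drawback is that the pattern has to grow with the maximum word count, so this only seems practical for small, fixed limits.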

I tried this:
import java.util.*;
import java.util.stream.*;
public class StringSplits {
private static List<String> whitespaceWords = new ArrayList<>();
public static void main(String [] args) {
String input = "What a Wonderful World! ...";
List<String> words = processInput(input);
// First transformation: ["What", "a", "Wonderful", "World!", "..."]
String first = words.stream()
.collect(Collectors.joining("\", \"", "[\"", "\"]"));
System.out.println(first);
// Second transformation: ["WHAT", "A", "WONDERFUL", "WORLD!", "..."]
String second = words.stream()
.map(String::toUpperCase)
.collect(Collectors.joining("\", \"", "[\"", "\"]"));
System.out.println(second);
// Final transformation: WHAT A WONDERFUL WORLD! ...
String last = IntStream.range(0, words.size())
.mapToObj(i -> words.get(i) + whitespaceWords.get(i))
.map(String::toUpperCase)
.collect(Collectors.joining());
System.out.println(last);
}
/*
* Accepts input string of words containing character words and
* whitespace(s) (as defined in the method Character#isWhitespace).
* Processes and returns only the character strings. Stores the
* whitespace 'words' (a single or multiple whitespaces) in a List<String>.
* NOTE: This method uses String concatenation in a loop. For processing
* large inputs consider using a StringBuilder.
*/
private static List<String> processInput(String input) {
List<String> words = new ArrayList<>();
String word = "";
String whitespaceWord = "";
boolean wordFlag = true;
for (char c : input.toCharArray()) {
if (! Character.isWhitespace(c)) {
if (! wordFlag) {
wordFlag = true;
whitespaceWords.add(whitespaceWord);
word = whitespaceWord = "";
}
word = word + String.valueOf(c);
}
else {
if (wordFlag) {
wordFlag = false;
words.add(word);
word = whitespaceWord = "";
}
whitespaceWord = whitespaceWord + String.valueOf(c);
}
} // end-for
whitespaceWords.add(whitespaceWord);
if (! word.isEmpty()) {
words.add(word);
}
return words;
}
}

Related

Java regex - how to chop String into parts [duplicate]

I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts, using String.split, but it seems that I can't get the actual string, which matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?
You can use lookahead and lookbehind, which are features of regular expressions.
System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));
And you will get:
[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]
The last one is what you want.
((?<=;)|(?=;)) matches the empty (zero-width) position just before or just after a ;.
EDIT: Fabian Steeg's comment on readability is valid. Readability is always a problem with regular expressions. One thing I do to make regular expressions more readable is to create a variable whose name represents what the regular expression does. You can even put placeholders (e.g. %1$s) in it and use Java's String.format to replace the placeholders with the actual string you need to use; for example:
static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
public void someMethod() {
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
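One caveat with the placeholder approach is that the delimiter is pasted into the regex as-is, so a delimiter such as + or . would be read as a metacharacter. A possible way around that, sketched here with a hypothetical helper named splitKeeping, is to pass the delimiter through Pattern.quote first:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class KeepDelimiterSplit {
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    // Splits around a literal delimiter and keeps the delimiter as its own element.
    static String[] splitKeeping(String text, String literalDelimiter) {
        // Pattern.quote makes regex metacharacters in the delimiter literal.
        String regex = String.format(WITH_DELIMITER, Pattern.quote(literalDelimiter));
        return text.split(regex);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a;b;c;d", ";"))); // [a, ;, b, ;, c, ;, d]
        System.out.println(Arrays.toString(splitKeeping("1.5+2", "+")));   // [1.5, +, 2]
    }
}
```

This only covers literal delimiters; for a delimiter that is itself a regex, the lookbehind part requires a pattern of bounded length.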
You want to use lookarounds, and split on zero-width matches. Here are some examples:
public class SplitNDump {
static void dump(String[] arr) {
for (String s : arr) {
System.out.format("[%s]", s);
}
System.out.println();
}
public static void main(String[] args) {
dump("1,234,567,890".split(","));
// "[1][234][567][890]"
dump("1,234,567,890".split("(?=,)"));
// "[1][,234][,567][,890]"
dump("1,234,567,890".split("(?<=,)"));
// "[1,][234,][567,][890]"
dump("1,234,567,890".split("(?<=,)|(?=,)"));
// "[1][,][234][,][567][,][890]"
dump(":a:bb::c:".split("(?=:)|(?<=:)"));
// "[][:][a][:][bb][:][:][c][:]"
dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
// "[:][a][:][bb][:][:][c][:]"
dump(":::a::::b b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
// "[:::][a][::::][b b][::][c][:]"
dump("a,bb:::c d..e".split("(?!^)\\b"));
// "[a][,][bb][:::][c][ ][d][..][e]"
dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
// "[Array][Index][Out][Of][Bounds][Exception]"
dump("1234567890".split("(?<=\\G.{4})"));
// "[1234][5678][90]"
// Split at the end of each run of letters
dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
// "[Booo][yaaaa][h! Yipp][ieeee][!!]"
}
}
And yes, that is a triply-nested assertion there in the last pattern.
Related questions
Java split is eating my characters.
Can you use zero-width matching regex in String split?
How do I convert CamelCase into human-readable names in Java?
Backreferences in lookbehind
See also
regular-expressions.info/Lookarounds
A very naive solution that doesn't involve lookarounds would be to perform a string replace on your delimiter, along these lines (assuming a comma as the delimiter):
String marked = fullString.replace(",", "~,~");
Here the tilde (~) stands for any marker string that cannot occur in the input.
If you then split on that marker, I believe you will get the desired result.
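A small sketch of that replace-then-split idea, assuming "~" never occurs in the input (the class name MarkerSplit and the explicit guard are additions for illustration):

```java
import java.util.Arrays;

class MarkerSplit {
    // Assumed marker; it must not appear anywhere in the text being split.
    private static final String MARKER = "~";

    static String[] splitKeepingCommas(String text) {
        if (text.contains(MARKER)) {
            throw new IllegalArgumentException("input already contains the marker " + MARKER);
        }
        // Surround every comma with markers, then split on the marker.
        return text.replace(",", MARKER + "," + MARKER).split(MARKER);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeepingCommas("a,b,c"))); // [a, ,, b, ,, c]
    }
}
```

Note that two adjacent commas produce an empty string between them, and that split still drops trailing empty strings as usual.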
import java.util.regex.*;
import java.util.LinkedList;
public class Splitter {
private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");
private Pattern pattern;
private boolean keep_delimiters;
public Splitter(Pattern pattern, boolean keep_delimiters) {
this.pattern = pattern;
this.keep_delimiters = keep_delimiters;
}
public Splitter(String pattern, boolean keep_delimiters) {
this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
}
public Splitter(Pattern pattern) { this(pattern, true); }
public Splitter(String pattern) { this(pattern, true); }
public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
public Splitter() { this(DEFAULT_PATTERN); }
public String[] split(String text) {
if (text == null) {
text = "";
}
int last_match = 0;
LinkedList<String> splitted = new LinkedList<String>();
Matcher m = this.pattern.matcher(text);
while (m.find()) {
splitted.add(text.substring(last_match,m.start()));
if (this.keep_delimiters) {
splitted.add(m.group());
}
last_match = m.end();
}
splitted.add(text.substring(last_match));
return splitted.toArray(new String[splitted.size()]);
}
public static void main(String[] argv) {
if (argv.length != 2) {
System.err.println("Syntax: java Splitter <pattern> <text>");
return;
}
Pattern pattern = null;
try {
pattern = Pattern.compile(argv[0]);
}
catch (PatternSyntaxException e) {
System.err.println(e);
return;
}
Splitter splitter = new Splitter(pattern);
String text = argv[1];
int counter = 1;
for (String part : splitter.split(text)) {
System.out.printf("Part %d: \"%s\"\n", counter++, part);
}
}
}
/*
Example:
> java Splitter "\W+" "Hello World!"
Part 1: "Hello"
Part 2: " "
Part 3: "World"
Part 4: "!"
Part 5: ""
*/
I don't really like the other way, where you get an empty element in front and back. A delimiter is usually not at the beginning or at the end of the string, thus you most often end up wasting two good array slots.
Edit: Fixed limit cases. Commented source with test cases can be found here: http://snippets.dzone.com/posts/show/6453
Pass true as the third argument (returnDelims). The tokenizer will then return the delimiters as well.
StringTokenizer tokenizer = new StringTokenizer(str, delimiters, true);
I know this is a very old question and an answer has already been accepted, but I would still like to submit a very simple answer to the original question. Consider this code:
String str = "Hello-World:How\nAre You&doing";
String[] inputs = str.split("(?!^)\\b");
for (int i=0; i<inputs.length; i++) {
System.out.println("a[" + i + "] = \"" + inputs[i] + '"');
}
OUTPUT:
a[0] = "Hello"
a[1] = "-"
a[2] = "World"
a[3] = ":"
a[4] = "How"
a[5] = "
"
a[6] = "Are"
a[7] = " "
a[8] = "You"
a[9] = "&"
a[10] = "doing"
I am just using word boundary \b to delimit the words except when it is start of text.
I got here late, but returning to the original question, why not just use lookarounds?
Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));
output:
[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]
EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code:
{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }
I hope that's easier to read. Thanks for the heads-up, @finnw.
I had a look at the above answers and honestly I find none of them satisfactory. What you want to do is essentially mimic the Perl split functionality. Why Java doesn't allow this and have a join() method somewhere is beyond me, but I digress. You don't even need a class for this really. It's just a function. Run the sample program below.
Some of the earlier answers have excessive null-checking, a topic I recently wrote a response about here:
https://stackoverflow.com/users/18393/cletus
Anyway, the code:
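import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Split {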
public static List<String> split(String s, String pattern) {
assert s != null;
assert pattern != null;
return split(s, Pattern.compile(pattern));
}
public static List<String> split(String s, Pattern pattern) {
assert s != null;
assert pattern != null;
Matcher m = pattern.matcher(s);
List<String> ret = new ArrayList<String>();
int start = 0;
while (m.find()) {
ret.add(s.substring(start, m.start()));
ret.add(m.group());
start = m.end();
}
ret.add(start >= s.length() ? "" : s.substring(start));
return ret;
}
private static void testSplit(String s, String pattern) {
System.out.printf("Splitting '%s' with pattern '%s'%n", s, pattern);
List<String> tokens = split(s, pattern);
System.out.printf("Found %d matches%n", tokens.size());
int i = 0;
for (String token : tokens) {
System.out.printf(" %d/%d: '%s'%n", ++i, tokens.size(), token);
}
System.out.println();
}
public static void main(String args[]) {
testSplit("abcdefghij", "z"); // "abcdefghij"
testSplit("abcdefghij", "f"); // "abcde", "f", "ghij"
testSplit("abcdefghij", "j"); // "abcdefghi", "j", ""
testSplit("abcdefghij", "a"); // "", "a", "bcdefghij"
testSplit("abcdefghij", "[bdfh]"); // "a", "b", "c", "d", "e", "f", "g", "h", "ij"
}
}
I like the idea of StringTokenizer because it is an Enumeration.
But it is also a legacy class, superseded by String.split, which returns a boring String[] (and does not include the delimiters).
So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.
A true regexp means it is not a 'character sequence' repeated to form the delimiter:
'o' will only match 'o', and split 'ooo' into three delimiters, with two empty strings in between:
[o], '', [o], '', [o]
But the regexp o+ will return the expected result when splitting "aooob"
[], 'a', [ooo], 'b', []
To use this StringTokenizerEx:
final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
// uses the split String detected and memorized in 'aString'
final String nextDelimiter = aStringTokenizerEx.getDelimiter();
}
The code of this class is available at DZone Snippets.
As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.
Note: (late 2009 edit)
The article Final Thoughts: Java Puzzler: Splitting Hairs does a good job of explaining the bizarre behavior in String.split().
Josh Bloch even commented in response to that article:
Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).
The Google common library Guava also contains a Splitter which is:
simpler to use
maintained by Google (and not by you)
So it may be worth checking out. From their initial rough documentation (pdf):
JDK has this:
String[] pieces = "foo.bar".split("\\.");
It's fine to use this if you want exactly what it does:
- regular expression
- result as an array
- its way of handling empty pieces
Mini-puzzler: ",a,,b,".split(",") returns...
(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above
Answer: (e) None of the above.
",a,,b,".split(",")
returns
"", "a", "", "b"
Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)
In any case, our Splitter is simply more flexible: The default behavior is simplistic:
Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]
If you want extra features, ask for them!
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]
Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.
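As for the mini-puzzler's "workaround to prevent the skipping": with the plain JDK, the two-argument split with a negative limit keeps trailing empty strings. A minimal demonstration (the class and method names are invented for the example):

```java
import java.util.Arrays;

class TrailingEmptiesDemo {
    // Default split (limit 0): trailing empty strings are removed.
    static String[] splitDefault(String input) {
        return input.split(",");
    }

    // Negative limit: the pattern is applied as often as possible
    // and trailing empty strings are kept.
    static String[] splitKeepTrailing(String input) {
        return input.split(",", -1);
    }

    public static void main(String[] args) {
        String input = ",a,,b,";
        System.out.println(Arrays.toString(splitDefault(input)));      // [, a, , b]
        System.out.println(Arrays.toString(splitKeepTrailing(input))); // [, a, , b, ]
    }
}
```

The first call yields four elements and the second five; the leading empty string is present in both.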
Here is a simple, clean implementation which is consistent with Pattern#split and works with variable-length patterns, which lookbehind cannot support, and it is easier to use. It is similar to the solution provided by @cletus.
public static String[] split(CharSequence input, String pattern) {
return split(input, Pattern.compile(pattern));
}
public static String[] split(CharSequence input, Pattern pattern) {
Matcher matcher = pattern.matcher(input);
int start = 0;
List<String> result = new ArrayList<>();
while (matcher.find()) {
result.add(input.subSequence(start, matcher.start()).toString());
result.add(matcher.group());
start = matcher.end();
}
if (start != input.length()) result.add(input.subSequence(start, input.length()).toString());
return result.toArray(new String[0]);
}
I don't do null checks here; Pattern#split doesn't, so why should I? I don't like the if at the end, but it is required for consistency with Pattern#split. Otherwise I would unconditionally append, resulting in an empty string as the last element of the result if the input string ends with the pattern.
I convert to String[] for consistency with Pattern#split, I use new String[0] rather than new String[result.size()], see here for why.
Here are my tests:
@Test
public void splitsVariableLengthPattern() {
String[] result = Split.split("/foo/$bar/bas", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar", "/bas" }, result);
}
@Test
public void splitsEndingWithPattern() {
String[] result = Split.split("/foo/$bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar" }, result);
}
@Test
public void splitsStartingWithPattern() {
String[] result = Split.split("$foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "", "$foo", "/bar" }, result);
}
@Test
public void splitsNoMatchesPattern() {
String[] result = Split.split("/foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/bar" }, result);
}
I will post my working versions as well (the first is really similar to Markus's).
public static String[] splitIncludeDelimeter(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
int now, old = 0;
while(matcher.find()){
now = matcher.end();
list.add(text.substring(old, now));
old = now;
}
if(list.size() == 0)
return new String[]{text};
//adding rest of a text as last element
String finalElement = text.substring(old);
list.add(finalElement);
return list.toArray(new String[list.size()]);
}
And here is the second solution, which is around 50% faster than the first one:
public static String[] splitIncludeDelimeter2(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
StringBuffer stringBuffer = new StringBuffer();
while(matcher.find()){
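// quoteReplacement keeps '$' and '\' in the matched text from being read as group references
matcher.appendReplacement(stringBuffer, Matcher.quoteReplacement(matcher.group()));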
list.add(stringBuffer.toString());
stringBuffer.setLength(0); //clear buffer
}
matcher.appendTail(stringBuffer); // append the rest of the string
list.add(stringBuffer.toString());
return list.toArray(new String[list.size()]);
}
Another candidate solution using a regex. Retains token order, correctly matches multiple tokens of the same type in a row. The downside is that the regex is kind of nasty.
package javaapplication2;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaApplication2 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
String num = "58.5+variable-+98*78/96+a/78.7-3443*12-3";
// Terrifying regex:
// (a)|(b)|(c) match a or b or c
// where
// (a) is one or more digits optionally followed by a decimal point
// followed by one or more digits: (\d+(\.\d+)?)
// (b) is one of the set + * / - occurring once: ([+*/-])
// (c) is a sequence of one or more lowercase latin letter: ([a-z]+)
Pattern tokenPattern = Pattern.compile("(\\d+(\\.\\d+)?)|([+*/-])|([a-z]+)");
Matcher tokenMatcher = tokenPattern.matcher(num);
List<String> tokens = new ArrayList<>();
while (!tokenMatcher.hitEnd()) {
if (tokenMatcher.find()) {
tokens.add(tokenMatcher.group());
} else {
// report error
break;
}
}
System.out.println(tokens);
}
}
Sample output:
[58.5, +, variable, -, +, 98, *, 78, /, 96, +, a, /, 78.7, -, 3443, *, 12, -, 3]
I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):
static String[] splitWithDelimiters(String s) {
if (s == null || s.length() == 0) {
return new String[0];
}
LinkedList<String> result = new LinkedList<String>();
StringBuilder sb = null;
boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
for (char c : s.toCharArray()) {
if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
if (sb != null) {
result.add(sb.toString());
}
sb = new StringBuilder();
wasLetterOrDigit = !wasLetterOrDigit;
}
sb.append(c);
}
result.add(sb.toString());
return result.toArray(new String[0]);
}
I suggest using Pattern and Matcher, which will almost certainly achieve what you want. Your regular expression will need to be somewhat more complicated than what you are using in String.split.
I don't think it is possible with String#split, but you can use a StringTokenizer, though that won't allow you to define your delimiter as a regex, only as a set of single characters:
new StringTokenizer("Hello, world. Hi!", ",.!", true); // true for returnDelims
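To see what the tokenizer actually hands back with returnDelims set to true, here is a short sketch that collects the tokens into a list (the wrapper class and method are hypothetical; only StringTokenizer itself comes from the JDK):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerWithDelims {
    // Collects all tokens, including each delimiter character as its own token.
    static List<String> tokens(String text, String delimiterChars) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiterChars, true);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [Hello, ,,  world, .,  Hi, !]
        System.out.println(tokens("Hello, world. Hi!", ",.!"));
    }
}
```

Each delimiter comes back as a one-character token, and the spaces stay attached to the following word because a space is not in the delimiter set.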
If you can afford, use Java's replace(CharSequence target, CharSequence replacement) method and fill in another delimiter to split with.
Example:
I want to split the string "boo:and:foo" and keep ':' at its righthand String.
String str = "boo:and:foo";
str = str.replace(":","newdelimiter:");
String[] tokens = str.split("newdelimiter");
Important note: This only works if you have no further "newdelimiter" in your String! Thus, it is not a general solution.
But if you know a CharSequence of which you can be sure that it will never appear in the String, this is a very simple solution.
Fast answer: split on zero-width boundaries such as \b. I will try and experiment to see if it works (I have used that in PHP and JS).
It is possible, and kind of work, but might split too much. Actually, it depends on the string you want to split and the result you need. Give more details, we will help you better.
Another way is to do your own split, capturing the delimiter (supposing it is variable) and adding it afterward to the result.
My quick test:
String str = "'ab','cd','eg'";
String[] stra = str.split("\\b");
for (String s : stra) System.out.print(s + "|");
System.out.println();
Result:
'|ab|','|cd|','|eg|'|
A bit too much... :-)
Tweaked Pattern.split() to include matched pattern to the list
Added
// add match to the list
matchList.add(input.subSequence(start, end).toString());
Full source
public static String[] inclusiveSplit(String input, String re, int limit) {
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<String>();
Pattern pattern = Pattern.compile(re);
Matcher m = pattern.matcher(input);
// Add segments before each match found
while (m.find()) {
int end = m.end();
if (!matchLimited || matchList.size() < limit - 1) {
int start = m.start();
String match = input.subSequence(index, start).toString();
matchList.add(match);
// add match to the list
matchList.add(input.subSequence(start, end).toString());
index = end;
} else if (matchList.size() == limit - 1) { // last one
String match = input.subSequence(index, input.length())
.toString();
matchList.add(match);
index = end;
}
}
// If no match was found, return this
if (index == 0)
return new String[] { input.toString() };
// Add remaining segment
if (!matchLimited || matchList.size() < limit)
matchList.add(input.subSequence(index, input.length()).toString());
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize - 1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
}
Here's a groovy version based on some of the code above, in case it helps. It's short, anyway. Conditionally includes the head and tail (if they are not empty). The last part is a demo/test case.
List splitWithTokens(str, pat) {
def tokens=[]
def lastMatch=0
def m = str=~pat
while (m.find()) {
if (m.start() > 0) tokens << str[lastMatch..<m.start()]
tokens << m.group()
lastMatch=m.end()
}
if (lastMatch < str.length()) tokens << str[lastMatch..<str.length()]
tokens
}
[['<html><head><title>this is the title</title></head>',/<[^>]+>/],
['before<html><head><title>this is the title</title></head>after',/<[^>]+>/]
].each {
println splitWithTokens(*it)
}
An extremely naive and inefficient solution which works nevertheless. Use split twice on the string and then interleave the two arrays:
String temp[]=str.split("\\W");
String temp2[]=str.split("\\w||\\s");
int i=0;
for(String string:temp)
System.out.println(string);
String temp3[]=new String[temp.length-1];
for(String string:temp2)
{
System.out.println(string);
if((string.equals("")!=true)&&(string.equals("\\s")!=true))
{
temp3[i]=string;
i++;
}
// System.out.println(temp.length);
// System.out.println(temp2.length);
}
System.out.println(temp3.length);
String[] temp4=new String[temp.length+temp3.length];
int j=0;
for(i=0;i<temp.length;i++)
{
temp4[j]=temp[i];
j=j+2;
}
j=1;
for(i=0;i<temp3.length;i++)
{
temp4[j]=temp3[i];
j+=2;
}
for(String s:temp4)
System.out.println(s);
String expression = "((A+B)*C-D)*E";
expression = expression.replaceAll("\\+", "~+~");
expression = expression.replaceAll("\\*", "~*~");
expression = expression.replaceAll("-", "~-~");
expression = expression.replaceAll("/+", "~/~");
expression = expression.replaceAll("\\(", "~(~"); //also you can use [(] instead of \\(
expression = expression.replaceAll("\\)", "~)~"); //also you can use [)] instead of \\)
expression = expression.replaceAll("~~", "~");
if(expression.startsWith("~")) {
expression = expression.substring(1);
}
String[] expressionArray = expression.split("~");
System.out.println(Arrays.toString(expressionArray));
One of the subtleties in this question involves the "leading delimiter" question: if you are going to have a combined array of tokens and delimiters you have to know whether it starts with a token or a delimiter. You could of course just assume that a leading delim should be discarded but this seems an unjustified assumption. You might also want to know whether you have a trailing delim or not. This sets two boolean flags accordingly.
Written in Groovy but a Java version should be fairly obvious:
String tokenRegex = /[\p{L}\p{N}]+/ // a String in Groovy, Unicode alphanumeric
def finder = phraseForTokenising =~ tokenRegex
// NB in Groovy the variable 'finder' is then of class java.util.regex.Matcher
def finderIt = finder.iterator() // extra method added to Matcher by Groovy magic
int start = 0
boolean leadingDelim, trailingDelim
def combinedTokensAndDelims = [] // create an array in Groovy
while( finderIt.hasNext() )
{
def token = finderIt.next()
int finderStart = finder.start()
String delim = phraseForTokenising[ start .. finderStart - 1 ]
// Groovy: above gets slice of String/array
if( start == 0 ) leadingDelim = finderStart != 0
if( start > 0 || leadingDelim ) combinedTokensAndDelims << delim
combinedTokensAndDelims << token // add element to end of array
start = finder.end()
}
// start == 0 indicates no tokens found
if( start > 0 ) {
// finish by seeing whether there is a trailing delim
trailingDelim = start < phraseForTokenising.length()
if( trailingDelim ) combinedTokensAndDelims << phraseForTokenising[ start .. -1 ]
println( "leading delim? $leadingDelim, trailing delim? $trailingDelim, combined array:\n $combinedTokensAndDelims" )
}
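If you want to keep the delimiter character, you can split with a limit and append the character back to each piece yourself, or insert a line break before each delimiter with replaceAll.
See this example: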
public class SplitExample {
public static void main(String[] args) {
String str = "Javathomettt";
System.out.println("method 1");
System.out.println("Returning words:");
String[] arr = str.split("t", 40);
for (String w : arr) {
System.out.println(w+"t");
}
System.out.println("Split array length: "+arr.length);
System.out.println("method 2");
System.out.println(str.replaceAll("t", "\n"+"t"));
}
}
I don't know Java too well, but if you can't find a Split method that does that, I suggest you just make your own.
string[] mySplit(string s,string delimiter)
{
string[] result = s.Split(delimiter);
for(int i=0;i<result.Length-1;i++)
{
result[i] += delimiter; //this one would add the delimiter to each items end except the last item,
//you can modify it however you want
}
return result;
}
string[] res = mySplit(myString,myDelimiter);
It's not too elegant, but it'll do.

convert to capital letter using java 8 stream

I have a List of strings like "Taxi or bus driver". I need to capitalize the first letter of each word except the word "or". Is there an easy way to achieve this using a Java stream?
I tried Pattern.compile(...).splitAsStream(...), but I could not concatenate the split tokens back to form the original string.
Any help will be appreciated. If anybody needs it, I can post my code here.
You need the right pattern to identify the locations where a change has to be made, and it has to be a zero-width pattern when you want to use splitAsStream. Match locations which are
a word start
looking at a lower case character
not looking at the word “or”
Declare it like 
static final Pattern WORD_START_BUT_NOT_OR = Pattern.compile("\\b(?=\\p{Ll})(?!or\\b)");
Then, using it to process the tokens is straight-forward with a stream and map. Getting a string back works via .collect(Collectors.joining()):
List<String> input = Arrays.asList("Taxi or bus driver", "apples or oranges");
List<String> result = input.stream()
.map(s -> WORD_START_BUT_NOT_OR.splitAsStream(s)
.map(w -> Character.toUpperCase(w.charAt(0))+w.substring(1))
.collect(Collectors.joining()))
.collect(Collectors.toList());
result.forEach(System.out::println);
Taxi or Bus Driver
Apples or Oranges
Note that when splitting, there will always be a first token, regardless of whether it matched the criteria. Since the word “or” usually never appears at the beginning of a phrase and the transformation is transparent to non-lowercase letter characters, this should not be a problem here. Otherwise, treating the first element specially with a stream would make the code too complicated. If that’s an issue, a loop would be preferable.
A loop based solution could look like
private static final Pattern FIRST_WORD_CHAR_BUT_NOT_OR
= Pattern.compile("\\b(?!or\\b)\\p{Ll}");
(now using a pattern that matches the character rather than looking at it)
public static String capitalizeWords(String phrase) {
Matcher m = FIRST_WORD_CHAR_BUT_NOT_OR.matcher(phrase);
if(!m.find()) return phrase;
StringBuffer sb = new StringBuffer();
do m.appendReplacement(sb, m.group().toUpperCase()); while(m.find());
return m.appendTail(sb).toString();
}
which, as a bonus, is also capable of handling characters which span multiple char units. Starting with Java 9, the StringBuffer can be replaced with StringBuilder to increase the efficiency. This method can be used like
List<String> result = input.stream()
.map(s -> capitalizeWords(s))
.collect(Collectors.toList());
Replacing the lambda expression s -> capitalizeWords(s) with a method reference of the form ContainingClass::capitalizeWords is also possible.
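On Java 9 or newer, the explicit find/appendReplacement loop can probably be shortened further with Matcher.replaceAll(Function). The sketch below assumes Java 9+ and uses a made-up containing class, CapitalizeDemo, together with the method-reference form mentioned above:

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CapitalizeDemo {
    private static final Pattern FIRST_WORD_CHAR_BUT_NOT_OR =
            Pattern.compile("\\b(?!or\\b)\\p{Ll}");

    // Requires Java 9+: replaceAll accepts a function from MatchResult to replacement.
    static String capitalizeWords(String phrase) {
        return FIRST_WORD_CHAR_BUT_NOT_OR.matcher(phrase)
                .replaceAll(mr -> Matcher.quoteReplacement(
                        mr.group().toUpperCase(Locale.ROOT)));
    }

    // Method reference instead of the lambda s -> capitalizeWords(s).
    static List<String> capitalizeAll(List<String> phrases) {
        return phrases.stream()
                .map(CapitalizeDemo::capitalizeWords)
                .collect(Collectors.toList());
    }
}
```

For example, capitalizeWords("Taxi or bus driver") gives "Taxi or Bus Driver". The quoteReplacement call is defensive: the function's result is interpreted as a replacement string, so a literal $ or \ would otherwise be treated specially.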
Here is my code:
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
public class ConvertToCapitalUsingStreams {
// collection holds all the words that are not to be capitalized
private static final List<String> EXCLUSION_LIST = Arrays.asList(new String[]{"or"});
public String convertToInitCase(final String data) {
String[] words = data.split("\\s+");
List<String> initUpperWords = Arrays.stream(words).map(word -> {
//first make it lowercase
return word.toLowerCase();
}).map(word -> {
//if word present in EXCLUSION_LIST return the words as is
if (EXCLUSION_LIST.contains(word)) {
return word;
}
//if the word not present in EXCLUSION_LIST, Change the case of
//first letter of the word and return
return Character.toUpperCase(word.charAt(0)) + word.substring(1);
}).collect(Collectors.toList());
// convert back the list of words into a single string
String finalWord = String.join(" ", initUpperWords);
return finalWord;
}
public static void main(String[] a) {
System.out.println(new ConvertToCapitalUsingStreams().convertToInitCase("Taxi or bus driver"));
}
}
Note:
You may also want to look at this SO post about using apache commons-text library to do this job.
Split your string into words, convert the first character of each word to uppercase, then join the words back into a single string:
String input = "Taxi or bus driver";
String output = Stream.of(input.split(" "))
.map(w -> {
if (w.equals("or") || w.length() == 0) {
return w;
}
return Character.toUpperCase(w.charAt(0)) + w.substring(1);
})
.collect(Collectors.joining(" "));

How to get two letter country code from url? [duplicate]

I want to split the string "004-034556" into two strings by the delimiter "-":
part1 = "004";
part2 = "034556";
That means the first string will contain the characters before '-', and the second string will contain the characters after '-'.
I also want to check if the string has '-' in it.
Use the appropriately named method String#split().
String string = "004-034556";
String[] parts = string.split("-");
String part1 = parts[0]; // 004
String part2 = parts[1]; // 034556
Note that split's argument is assumed to be a regular expression, so remember to escape special characters if necessary.
There are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called "metacharacters".
For instance, to split on a period/dot . (which means "any character" in regex), use either backslash \ to escape the individual special character like so split("\\."), or use character class [] to represent literal character(s) like so split("[.]"), or use Pattern#quote() to escape the entire string like so split(Pattern.quote(".")).
String[] parts = string.split(Pattern.quote(".")); // Split on the exact string.
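To make the escaping point concrete, here is a small sketch comparing an unescaped dot with the two escaped forms. The class and method names (DotSplitSketch, unescaped, escaped, quoted) are invented for this example.

```java
import java.util.regex.Pattern;

class DotSplitSketch {
    // "." is a regex metacharacter here, so every character acts as a delimiter.
    static String[] unescaped(String s) {
        return s.split(".");
    }

    // Escaping the dot makes it a literal period.
    static String[] escaped(String s) {
        return s.split("\\.");
    }

    // Pattern.quote turns the whole argument into a literal.
    static String[] quoted(String s) {
        return s.split(Pattern.quote("."));
    }
}
```

For the input "a.b.c", the unescaped version returns an empty array, because every piece is an empty string and trailing empty strings are discarded. The other two return the three letters.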
To test beforehand if the string contains certain character(s), just use String#contains().
if (string.contains("-")) {
// Split it.
} else {
throw new IllegalArgumentException("String " + string + " does not contain -");
}
Note, this does not take a regular expression. For that, use String#matches() instead.
If you'd like to retain the split character in the resulting parts, use a lookaround assertion. To have the split character end up on the left-hand side, use a positive lookbehind by wrapping the pattern in a (?<=...) group.
String string = "004-034556";
String[] parts = string.split("(?<=-)");
String part1 = parts[0]; // 004-
String part2 = parts[1]; // 034556
To have the split character end up on the right-hand side, use a positive lookahead by wrapping the pattern in a (?=...) group.
String string = "004-034556";
String[] parts = string.split("(?=-)");
String part1 = parts[0]; // 004
String part2 = parts[1]; // -034556
If you'd like to limit the number of resulting parts, then you can supply the desired number as 2nd argument of split() method.
String string = "004-034556-42";
String[] parts = string.split("-", 2);
String part1 = parts[0]; // 004
String part2 = parts[1]; // 034556-42
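The limit argument also controls what happens to trailing empty strings, which matters for inputs that end in the delimiter. A minimal sketch follows; LimitSketch and parts are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class LimitSketch {
    // Splits on "-" with the given limit and returns the parts as a list.
    // limit == 0 behaves like split("-"): trailing empty strings are dropped.
    // limit < 0 keeps every part, including trailing empty strings.
    // limit > 0 caps the number of parts; the last part holds the remainder.
    static List<String> parts(String s, int limit) {
        return Arrays.asList(s.split("-", limit));
    }
}
```

For "a-b--", a limit of 0 gives two parts, a limit of -1 gives four (the last two empty), and a limit of 2 gives "a" and "b--".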
An alternative to processing the string directly would be to use a regular expression with capturing groups. This has the advantage that it makes it straightforward to impose more sophisticated constraints on the input. For example, the following splits the string into two parts, and ensures that both consist only of digits:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class SplitExample
{
private static Pattern twopart = Pattern.compile("(\\d+)-(\\d+)");
public static void checkString(String s)
{
Matcher m = twopart.matcher(s);
if (m.matches()) {
System.out.println(s + " matches; first part is " + m.group(1) +
", second part is " + m.group(2) + ".");
} else {
System.out.println(s + " does not match.");
}
}
public static void main(String[] args) {
checkString("123-4567");
checkString("foo-bar");
checkString("123-");
checkString("-4567");
checkString("123-4567-890");
}
}
As the pattern is fixed in this instance, it can be compiled in advance and stored as a static member (initialised at class load time in the example). The regular expression is:
(\d+)-(\d+)
The parentheses denote the capturing groups; the string that matched that part of the regexp can be accessed by the Matcher.group() method, as shown. The \d matches a single decimal digit, and the + means "match one or more of the previous expression". The - has no special meaning, so just matches that character in the input. Note that you need to double-escape the backslashes when writing this as a Java string. Some other examples:
([A-Z]+)-([A-Z]+) // Each part consists of only capital letters
([^-]+)-([^-]+) // Each part consists of characters other than -
([A-Z]{2})-(\d+) // The first part is exactly two capital letters,
// the second consists of digits
Use:
String[] result = yourString.split("-");
if (result.length != 2)
throw new IllegalArgumentException("String not in correct format");
This will split your string into two parts. The first element in the array will be the part containing the stuff before the -, and the second element in the array will contain the part of your string after the -.
If the array length is not 2, then the string was not in the format: string-string.
Check out the split() method in the String class.
This:
String[] out = string.split("-");
should do the thing you want. The String class has many methods for operating on strings.
// This leaves the regexes issue out of question
// But we must remember that each character in the Delimiter String is treated
// like a single delimiter
public static String[] SplitUsingTokenizer(String subject, String delimiters) {
StringTokenizer strTkn = new StringTokenizer(subject, delimiters);
ArrayList<String> arrLis = new ArrayList<String>(subject.length());
while(strTkn.hasMoreTokens())
arrLis.add(strTkn.nextToken());
return arrLis.toArray(new String[0]);
}
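The comment above about each delimiter character being treated separately can be illustrated with a short comparison against a regex split. This is only a sketch; TokenizerVsSplit, tokenize and regexSplit are names made up here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerVsSplit {
    // Every character of `delimiters` is a separate delimiter; empty tokens are skipped.
    static List<String> tokenize(String subject, String delimiters) {
        StringTokenizer st = new StringTokenizer(subject, delimiters);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // String.split keeps empty strings between consecutive delimiters.
    static List<String> regexSplit(String subject, String regex) {
        return Arrays.asList(subject.split(regex));
    }
}
```

With the input "a--b_c", tokenize(subject, "-_") yields a, b, c, while regexSplit(subject, "[-_]") also reports the empty string between the two hyphens.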
With Java 8:
List<String> stringList = Pattern.compile("-")
.splitAsStream("004-034556")
.collect(Collectors.toList());
stringList.forEach(s -> System.out.println(s));
Use org.apache.commons.lang.StringUtils' split method which can split strings based on the character or string you want to split.
Method signature:
public static String[] split(String str, String separatorChars);
In your case, you want to split a string when there is a "-".
You can simply do as follows:
String str = "004-034556";
String split[] = StringUtils.split(str,"-");
Output:
004
034556
If - does not exist in your string, the result is an array containing only the given string, and you will not get any exception.
The requirements left room for interpretation. I recommend writing a method,
public final static String[] mySplit(final String s)
which encapsulates this function. Of course you can use String.split(..) as mentioned in the other answers for the implementation.
You should write some unit-tests for input strings and the desired results and behaviour.
Good test candidates should include:
- "0022-3333"
- "-"
- "5555-"
- "-333"
- "3344-"
- "--"
- ""
- "553535"
- "333-333-33"
- "222--222"
- "222--"
- "--4555"
By defining the expected results for these tests, you specify the behaviour.
For example, should "-333" return [,333], or is it an error?
Can "333-333-33" be separated in [333,333-33] or [333-333,33] or is it an error? And so on.
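As one possible reading of those requirements, here is a sketch of a strict mySplit that accepts only inputs of the form part-part. The strictness is an assumption made for this example, not something the question specifies; the class name StrictSplit is also invented.

```java
class StrictSplit {
    // Assumed interpretation: exactly one '-', with a non-empty part on each side.
    // Anything else is treated as an error.
    static String[] mySplit(final String s) {
        int p = s.indexOf('-');
        boolean missingOrLeading = p <= 0;              // no '-' at all, or empty left part
        boolean trailing = p == s.length() - 1;         // empty right part
        boolean secondDash = s.indexOf('-', p + 1) >= 0; // more than one '-'
        if (missingOrLeading || trailing || secondDash) {
            throw new IllegalArgumentException("Expected <part>-<part> but got: \"" + s + "\"");
        }
        return new String[] { s.substring(0, p), s.substring(p + 1) };
    }
}
```

Under this interpretation, only "0022-3333" from the candidate list is accepted. A more lenient method would make different choices for inputs such as "333-333-33", which is exactly why writing the tests first is useful.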
To summarize: there are at least five ways to split a string in Java:
String.split():
String[] parts ="10,20".split(",");
Pattern.compile(regexp).splitAsStream(input):
List<String> strings = Pattern.compile("\\|")
.splitAsStream("010|020202")
.collect(Collectors.toList());
StringTokenizer (legacy class):
StringTokenizer strings = new StringTokenizer("Welcome to EXPLAINJAVA.COM!", ".");
while(strings.hasMoreTokens()){
String substring = strings.nextToken();
System.out.println(substring);
}
Google Guava Splitter:
Iterable<String> result = Splitter.on(",").split("1,2,3,4");
Apache Commons StringUtils:
String[] strings = StringUtils.split("1,2,3,4", ",");
So you can choose the best option for you depending on what you need, e.g. return type (array, list, or iterable).
Here is a big overview of these methods and the most common examples (how to split by dot, slash, question mark, etc.)
You can try like this also
String concatenated_String="hi^Hello";
String split_string_array[]=concatenated_String.split("\\^");
Assuming, that
you don't really need regular expressions for your split
you happen to already use apache commons lang in your app
The easiest way is to use StringUtils#split(java.lang.String, char). That's more convenient than the one provided by Java out of the box if you don't need regular expressions. Like its manual says, it works like this:
A null input String returns null.
StringUtils.split(null, *) = null
StringUtils.split("", *) = []
StringUtils.split("a.b.c", '.') = ["a", "b", "c"]
StringUtils.split("a..b.c", '.') = ["a", "b", "c"]
StringUtils.split("a:b:c", '.') = ["a:b:c"]
StringUtils.split("a b c", ' ') = ["a", "b", "c"]
I would recommend using commons-lang, since it usually contains a lot of useful functionality. However, if you don't need it for anything other than splitting, then implementing the split yourself or escaping the regex is a better option.
For simple use cases String.split() should do the job. If you use guava, there is also a Splitter class which allows chaining of different string operations and supports CharMatcher:
Splitter.on('-')
.trimResults()
.omitEmptyStrings()
.split(string);
The fastest way, which also consumes the fewest resources, could be:
String s = "abc-def";
int p = s.indexOf('-');
if (p >= 0) {
String left = s.substring(0, p);
String right = s.substring(p + 1);
} else {
// s does not contain '-'
}
String Split with multiple characters using Regex
public class StringSplitTest {
public static void main(String args[]) {
String s = " ;String; String; String; String, String; String;;String;String; String; String; ;String;String;String;String";
//String[] strs = s.split("[,\\s\\;]");
String[] strs = s.split("[,\\;]");
System.out.println("Substrings length:"+strs.length);
for (int i=0; i < strs.length; i++) {
System.out.println("Str["+i+"]:"+strs[i]);
}
}
}
Output:
Substrings length:17
Str[0]:
Str[1]:String
Str[2]: String
Str[3]: String
Str[4]: String
Str[5]: String
Str[6]: String
Str[7]:
Str[8]:String
Str[9]:String
Str[10]: String
Str[11]: String
Str[12]:
Str[13]:String
Str[14]:String
Str[15]:String
Str[16]:String
But do not expect the same output across all JDK versions. I have seen a bug in some JDK versions where the first empty string is dropped. This bug is not present in the latest JDK version, but it exists in some versions between late JDK 1.7 releases and early 1.8 releases.
There are only two methods you really need to consider.
Use String.split for a one-character delimiter, or if you don't care about performance
If performance is not an issue, or if the delimiter is a single character that is not a regular expression special character (i.e., not one of .$|()[{^?*+\) then you can use String.split.
String[] results = input.split(",");
The split method has an optimization to avoid using a regular expression if the delimiter is a single character and not in the above list. Otherwise, it has to compile a regular expression, and this is not ideal.
Use Pattern.split and precompile the pattern if using a complex delimiter and you care about performance.
If performance is an issue, and your delimiter is not one of the above, you should pre-compile a regular expression pattern which you can then reuse.
// Save this somewhere
Pattern pattern = Pattern.compile("[,;:]");
/// ... later
String[] results = pattern.split(input);
This last option still creates a new Matcher object. You can also cache this object and reset it for each input for maximum performance, but that is somewhat more complicated and not thread-safe.
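A minimal sketch of the precompiled approach follows, keeping the Pattern in a static final field so it is compiled once. The class name PrecompiledSplit is hypothetical. Pattern instances are immutable and safe to share between threads, unlike a cached Matcher.

```java
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Compiled once at class initialization and reused for every call.
    private static final Pattern DELIMITERS = Pattern.compile("[,;:]");

    // Splits on any one of ',', ';' or ':'.
    static String[] split(String input) {
        return DELIMITERS.split(input);
    }
}
```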
You can split a string by a line break by using the following statement:
String textStr[] = yourString.split("\\r?\\n");
You can split a string by a hyphen/character by using the following statement:
String textStr[] = yourString.split("-");
public class SplitTest {
public static String[] split(String text, String delimiter) {
java.util.List<String> parts = new java.util.ArrayList<String>();
text += delimiter;
for (int i = text.indexOf(delimiter), j=0; i != -1;) {
String temp = text.substring(j,i);
if(temp.trim().length() != 0) {
parts.add(temp);
}
j = i + delimiter.length();
i = text.indexOf(delimiter,j);
}
return parts.toArray(new String[0]);
}
public static void main(String[] args) {
String str = "004-034556";
String delimiter = "-";
String result[] = split(str, delimiter);
for(String s:result)
System.out.println(s);
}
}
Please don't use StringTokenizer class as it is a legacy class that is retained for compatibility reasons, and its use is discouraged in new code. And we can make use of the split method as suggested by others as well.
String[] sampleTokens = "004-034556".split("-");
System.out.println(Arrays.toString(sampleTokens));
And as expected it will print:
[004, 034556]
In this answer I also want to point out one change that has taken place for the split method in Java 8. The String#split() method makes use of Pattern.split, and it no longer produces an empty leading string when there is a zero-width match at the beginning of the input. Notice this change in the documentation for Java 8:
When there is a positive-width match at the beginning of the input
sequence then an empty leading substring is included at the beginning
of the resulting array. A zero-width match at the beginning however
never produces such empty leading substring.
It means for the following example:
String[] sampleTokensAgain = "004".split("");
System.out.println(Arrays.toString(sampleTokensAgain));
we will get three strings: [0, 0, 4] and not four as was the case in Java 7 and before. Also check this similar question.
One way to do this is to run through the String in a for-each loop and use the required split character.
public class StringSplitTest {
public static void main(String[] arg){
String str = "004-034556";
String split[] = str.split("-");
System.out.println("The split parts of the String are");
for(String s:split)
System.out.println(s);
}
}
Output:
The split parts of the String are
004
034556
import java.io.*;
public class BreakString {
public static void main(String args[]) {
String string = "004-034556-1234-2341";
String[] parts = string.split("-");
for(int i=0;i<parts.length;i++) {
System.out.println(parts[i]);
}
}
}
You can use split():
public class Splitting
{
public static void main(String args[])
{
String str = "004-034556";
String[] splitToArray = str.split("-");
String string1 = splitToArray[0]; // 004
String string2 = splitToArray[1]; // 034556
}
}
Else, you can use StringTokenizer:
import java.util.*;
public class Splitting
{
public static void main(String[] args)
{
StringTokenizer Str = new StringTokenizer("004-034556");
String string1 = Str.nextToken("-");
String string2 = Str.nextToken("-");
}
}
Here are two ways to achieve it.
WAY 1: As you have to split two numbers by a special character you can use regex
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TrialClass
{
public static void main(String[] args)
{
Pattern p = Pattern.compile("[0-9]+");
Matcher m = p.matcher("004-034556");
while(m.find())
{
System.out.println(m.group());
}
}
}
WAY 2: Using the string split method
public class TrialClass
{
public static void main(String[] args)
{
String temp = "004-034556";
String [] arrString = temp.split("-");
for(String splitString:arrString)
{
System.out.println(splitString);
}
}
}
You can simply use StringTokenizer to split a string into two or more parts, whatever the delimiters are:
StringTokenizer st = new StringTokenizer("004-034556", "-");
while(st.hasMoreTokens())
{
System.out.println(st.nextToken());
}
Check out the split() method in the String class on javadoc.
https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)
String data = "004-034556-1212-232-232";
int cnt = 1;
for (String item : data.split("-")) {
System.out.println("string "+cnt+" = "+item);
cnt++;
}
There are many examples of splitting a string here, but this is a slightly more compact version:
String str = "004-034556";
String[] sTemp = str.split("-"); // '-' is the delimiter
String string1 = sTemp[0]; // 004
String string2 = sTemp[1]; // 034556
I just wanted to write an algorithm instead of using Java built-in functions:
public static List<String> split(String str, char c){
List<String> list = new ArrayList<>();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++){
if(str.charAt(i) != c){
sb.append(str.charAt(i));
}
else{
if(sb.length() > 0){
list.add(sb.toString());
sb = new StringBuilder();
}
}
}
if(sb.length() >0){
list.add(sb.toString());
}
return list;
}
You can use the method split:
public class Demo {
public static void main(String args[]) {
String str = "004-034556";
if ((str.contains("-"))) {
String[] temp = str.split("-");
for (String part:temp) {
System.out.println(part);
}
}
else {
System.out.println(str + " does not contain \"-\".");
}
}
}
To split a string, use String.split(regex). Review the following examples:
String data = "004-034556";
String[] output = data.split("-");
System.out.println(output[0]);
System.out.println(output[1]);
Output
004
034556
Note:
This split (regex) takes a regex as an argument. Remember to escape the regex special characters, like period/dot.
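String s = "TnGeneral|DOMESTIC";
String[] a = s.split("\\|"); // the pipe is a regex metacharacter, so escape it
System.out.println(Arrays.toString(a)); // needs import java.util.Arrays
System.out.println(a[0]);
System.out.println(a[1]);
Output:
[TnGeneral, DOMESTIC]
TnGeneral
DOMESTIC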
String s="004-034556";
for(int i=0;i<s.length();i++)
{
if(s.charAt(i)=='-')
{
System.out.println(s.substring(0,i));
System.out.println(s.substring(i+1));
}
}
As mentioned by everyone, split() is the best option which may be used in your case. An alternative method can be using substring().

Converting a Single Line String to Another Single Line [duplicate]

I want to split the string "004-034556" into two strings by the delimiter "-":
part1 = "004";
part2 = "034556";
That means the first string will contain the characters before '-', and the second string will contain the characters after '-'.
I also want to check if the string has '-' in it.
Use the appropriately named method String#split().
String string = "004-034556";
String[] parts = string.split("-");
String part1 = parts[0]; // 004
String part2 = parts[1]; // 034556
Note that split's argument is assumed to be a regular expression, so remember to escape special characters if necessary.
There are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called "metacharacters".
For instance, to split on a period/dot . (which means "any character" in regex), use either backslash \ to escape the individual special character like so split("\\."), or use character class [] to represent literal character(s) like so split("[.]"), or use Pattern#quote() to escape the entire string like so split(Pattern.quote(".")).
String[] parts = string.split(Pattern.quote(".")); // Split on the exact string.
To test beforehand if the string contains certain character(s), just use String#contains().
if (string.contains("-")) {
// Split it.
} else {
throw new IllegalArgumentException("String " + string + " does not contain -");
}
Note, this does not take a regular expression. For that, use String#matches() instead.
If you'd like to retain the split character in the resulting parts, use a lookaround assertion. To have the split character end up on the left-hand side, use a positive lookbehind by wrapping the pattern in a (?<=...) group.
String string = "004-034556";
String[] parts = string.split("(?<=-)");
String part1 = parts[0]; // 004-
String part2 = parts[1]; // 034556
To have the split character end up on the right-hand side, use a positive lookahead by wrapping the pattern in a (?=...) group.
String string = "004-034556";
String[] parts = string.split("(?=-)");
String part1 = parts[0]; // 004
String part2 = parts[1]; // -034556
If you'd like to limit the number of resulting parts, then you can supply the desired number as 2nd argument of split() method.
String string = "004-034556-42";
String[] parts = string.split("-", 2);
String part1 = parts[0]; // 004
String part2 = parts[1]; // 034556-42
An alternative to processing the string directly would be to use a regular expression with capturing groups. This has the advantage that it makes it straightforward to impose more sophisticated constraints on the input. For example, the following splits the string into two parts, and ensures that both consist only of digits:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class SplitExample
{
private static Pattern twopart = Pattern.compile("(\\d+)-(\\d+)");
public static void checkString(String s)
{
Matcher m = twopart.matcher(s);
if (m.matches()) {
System.out.println(s + " matches; first part is " + m.group(1) +
", second part is " + m.group(2) + ".");
} else {
System.out.println(s + " does not match.");
}
}
public static void main(String[] args) {
checkString("123-4567");
checkString("foo-bar");
checkString("123-");
checkString("-4567");
checkString("123-4567-890");
}
}
As the pattern is fixed in this instance, it can be compiled in advance and stored as a static member (initialised at class load time in the example). The regular expression is:
(\d+)-(\d+)
The parentheses denote the capturing groups; the string that matched that part of the regexp can be accessed by the Matcher.group() method, as shown. The \d matches a single decimal digit, and the + means "match one or more of the previous expression". The - has no special meaning, so just matches that character in the input. Note that you need to double-escape the backslashes when writing this as a Java string. Some other examples:
([A-Z]+)-([A-Z]+) // Each part consists of only capital letters
([^-]+)-([^-]+) // Each part consists of characters other than -
([A-Z]{2})-(\d+) // The first part is exactly two capital letters,
// the second consists of digits
Use:
String[] result = yourString.split("-");
if (result.length != 2)
throw new IllegalArgumentException("String not in correct format");
This will split your string into two parts. The first element in the array will be the part containing the stuff before the -, and the second element in the array will contain the part of your string after the -.
If the array length is not 2, then the string was not in the format: string-string.
Check out the split() method in the String class.
This:
String[] out = string.split("-");
should do the thing you want. The String class has many methods for operating on strings.
// This leaves the regexes issue out of question
// But we must remember that each character in the Delimiter String is treated
// like a single delimiter
public static String[] SplitUsingTokenizer(String subject, String delimiters) {
StringTokenizer strTkn = new StringTokenizer(subject, delimiters);
ArrayList<String> arrLis = new ArrayList<String>(subject.length());
while(strTkn.hasMoreTokens())
arrLis.add(strTkn.nextToken());
return arrLis.toArray(new String[0]);
}
With Java 8:
List<String> stringList = Pattern.compile("-")
.splitAsStream("004-034556")
.collect(Collectors.toList());
stringList.forEach(s -> System.out.println(s));
Use org.apache.commons.lang.StringUtils' split method which can split strings based on the character or string you want to split.
Method signature:
public static String[] split(String str, String separatorChars);
In your case, you want to split a string when there is a "-".
You can simply do as follows:
String str = "004-034556";
String split[] = StringUtils.split(str,"-");
Output:
004
034556
If - does not exist in your string, the result is an array containing only the given string, and you will not get any exception.
The requirements left room for interpretation. I recommend writing a method,
public final static String[] mySplit(final String s)
which encapsulates this function. Of course you can use String.split(..) as mentioned in the other answers for the implementation.
You should write some unit-tests for input strings and the desired results and behaviour.
Good test candidates should include:
- "0022-3333"
- "-"
- "5555-"
- "-333"
- "3344-"
- "--"
- ""
- "553535"
- "333-333-33"
- "222--222"
- "222--"
- "--4555"
By defining the expected results for these tests, you specify the behaviour.
For example, should "-333" return [,333], or is it an error?
Can "333-333-33" be separated in [333,333-33] or [333-333,33] or is it an error? And so on.
To summarize: there are at least five ways to split a string in Java:
String.split():
String[] parts ="10,20".split(",");
Pattern.compile(regexp).splitAsStream(input):
List<String> strings = Pattern.compile("\\|")
.splitAsStream("010|020202")
.collect(Collectors.toList());
StringTokenizer (legacy class):
StringTokenizer strings = new StringTokenizer("Welcome to EXPLAINJAVA.COM!", ".");
while(strings.hasMoreTokens()){
String substring = strings.nextToken();
System.out.println(substring);
}
Google Guava Splitter:
Iterable<String> result = Splitter.on(",").split("1,2,3,4");
Apache Commons StringUtils:
String[] strings = StringUtils.split("1,2,3,4", ",");
So you can choose the best option for you depending on what you need, e.g. return type (array, list, or iterable).
Here is a big overview of these methods and the most common examples (how to split by dot, slash, question mark, etc.)
You can try like this also
String concatenated_String="hi^Hello";
String split_string_array[]=concatenated_String.split("\\^");
Assuming, that
you don't really need regular expressions for your split
you happen to already use apache commons lang in your app
The easiest way is to use StringUtils#split(java.lang.String, char). That's more convenient than the one provided by Java out of the box if you don't need regular expressions. Like its manual says, it works like this:
A null input String returns null.
StringUtils.split(null, *) = null
StringUtils.split("", *) = []
StringUtils.split("a.b.c", '.') = ["a", "b", "c"]
StringUtils.split("a..b.c", '.') = ["a", "b", "c"]
StringUtils.split("a:b:c", '.') = ["a:b:c"]
StringUtils.split("a b c", ' ') = ["a", "b", "c"]
I would recommend using commons-lang, since it usually contains a lot of useful functionality. However, if you don't need it for anything other than splitting, then implementing the split yourself or escaping the regex is a better option.
For simple use cases String.split() should do the job. If you use guava, there is also a Splitter class which allows chaining of different string operations and supports CharMatcher:
Splitter.on('-')
.trimResults()
.omitEmptyStrings()
.split(string);
The fastest way, which also consumes the fewest resources, could be:
String s = "abc-def";
int p = s.indexOf('-');
if (p >= 0) {
String left = s.substring(0, p);
String right = s.substring(p + 1);
} else {
// s does not contain '-'
}
String Split with multiple characters using Regex
public class StringSplitTest {
public static void main(String args[]) {
String s = " ;String; String; String; String, String; String;;String;String; String; String; ;String;String;String;String";
//String[] strs = s.split("[,\\s\\;]");
String[] strs = s.split("[,\\;]");
System.out.println("Substrings length:"+strs.length);
for (int i=0; i < strs.length; i++) {
System.out.println("Str["+i+"]:"+strs[i]);
}
}
}
Output:
Substrings length:17
Str[0]:
Str[1]:String
Str[2]: String
Str[3]: String
Str[4]: String
Str[5]: String
Str[6]: String
Str[7]:
Str[8]:String
Str[9]:String
Str[10]: String
Str[11]: String
Str[12]:
Str[13]:String
Str[14]:String
Str[15]:String
Str[16]:String
But do not expect the same output across all JDK versions. I have seen a bug in some JDK versions where the first empty string is dropped. This bug is not present in the latest JDK version, but it exists in some versions between late JDK 1.7 releases and early 1.8 releases.
There are only two methods you really need to consider.
Use String.split for a one-character delimiter, or if you don't care about performance
If performance is not an issue, or if the delimiter is a single character that is not a regular expression special character (i.e., not one of .$|()[{^?*+\) then you can use String.split.
String[] results = input.split(",");
The split method has an optimization to avoid using a regular expression if the delimiter is a single character and not in the above list. Otherwise, it has to compile a regular expression, and this is not ideal.
Use Pattern.split and precompile the pattern if using a complex delimiter and you care about performance.
If performance is an issue, and your delimiter is not one of the above, you should pre-compile a regular expression pattern which you can then reuse.
// Save this somewhere
Pattern pattern = Pattern.compile("[,;:]");
/// ... later
String[] results = pattern.split(input);
This last option still creates a new Matcher object. You can also cache this object and reset it for each input for maximum performance, but that is somewhat more complicated and not thread-safe.
You can split a string by a line break by using the following statement:
String textStr[] = yourString.split("\\r?\\n");
You can split a string by a hyphen/character by using the following statement:
String textStr[] = yourString.split("-");
public class SplitTest {
public static String[] split(String text, String delimiter) {
java.util.List<String> parts = new java.util.ArrayList<String>();
text += delimiter;
for (int i = text.indexOf(delimiter), j=0; i != -1;) {
String temp = text.substring(j,i);
if(temp.trim().length() != 0) {
parts.add(temp);
}
j = i + delimiter.length();
i = text.indexOf(delimiter,j);
}
return parts.toArray(new String[0]);
}
public static void main(String[] args) {
String str = "004-034556";
String delimiter = "-";
String result[] = split(str, delimiter);
for(String s:result)
System.out.println(s);
}
}
Please don't use StringTokenizer class as it is a legacy class that is retained for compatibility reasons, and its use is discouraged in new code. And we can make use of the split method as suggested by others as well.
String[] sampleTokens = "004-034556".split("-");
System.out.println(Arrays.toString(sampleTokens));
And as expected it will print:
[004, 034556]
In this answer I also want to point out one change that has taken place for the split method in Java 8. The String#split() method makes use of Pattern.split, and it no longer produces an empty leading string when there is a zero-width match at the beginning of the input. Notice this change in the documentation for Java 8:
When there is a positive-width match at the beginning of the input
sequence then an empty leading substring is included at the beginning
of the resulting array. A zero-width match at the beginning however
never produces such empty leading substring.
It means for the following example:
String[] sampleTokensAgain = "004".split("");
System.out.println(Arrays.toString(sampleTokensAgain));
we will get three strings: [0, 0, 4] and not four as was the case in Java 7 and before. Also check this similar question.
One way to do this is to run through the String in a for-each loop and use the required split character.
public class StringSplitTest {
public static void main(String[] arg){
String str = "004-034556";
String split[] = str.split("-");
System.out.println("The split parts of the String are");
for(String s:split)
System.out.println(s);
}
}
Output:
The split parts of the String are
004
034556
import java.io.*;
public class BreakString {
public static void main(String args[]) {
String string = "004-034556-1234-2341";
String[] parts = string.split("-");
for(int i=0;i<parts.length;i++) {
System.out.println(parts[i]);
}
}
}
You can use split():
import java.io.*;
public class Splitting
{
public static void main(String args[])
{
String Str = new String("004-034556");
String[] SplittoArray = Str.split("-");
String string1 = SplittoArray[0];
String string2 = SplittoArray[1];
}
}
Else, you can use StringTokenizer:
import java.util.*;
public class Splitting
{
public static void main(String[] args)
{
StringTokenizer Str = new StringTokenizer("004-034556");
String string1 = Str.nextToken("-");
String string2 = Str.nextToken("-");
}
}
Here are two ways to achieve it.
WAY 1: As you have to split two numbers by a special character you can use regex
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TrialClass
{
public static void main(String[] args)
{
Pattern p = Pattern.compile("[0-9]+");
Matcher m = p.matcher("004-034556");
while(m.find())
{
System.out.println(m.group());
}
}
}
WAY 2: Using the string split method
public class TrialClass
{
public static void main(String[] args)
{
String temp = "004-034556";
String [] arrString = temp.split("-");
for(String splitString:arrString)
{
System.out.println(splitString);
}
}
}
You can simply use StringTokenizer to split a string into two or more parts, whatever the delimiters are:
StringTokenizer st = new StringTokenizer("004-034556", "-");
while(st.hasMoreTokens())
{
System.out.println(st.nextToken());
}
Check out the split() method in the String class on javadoc.
https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)
String data = "004-034556-1212-232-232";
int cnt = 1;
for (String item : data.split("-")) {
System.out.println("string "+cnt+" = "+item);
cnt++;
}
There are many examples here for splitting a string, but I have tidied the code a little.
String str = "004-034556";
String[] sTemp = str.split("-"); // '-' is the delimiter
String string1 = sTemp[0]; // "004"
String string2 = sTemp[1]; // "034556"
I just wanted to write an algorithm instead of using Java built-in functions:
public static List<String> split(String str, char c){
List<String> list = new ArrayList<>();
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++){
if(str.charAt(i) != c){
sb.append(str.charAt(i));
}
else{
if(sb.length() > 0){
list.add(sb.toString());
sb = new StringBuilder();
}
}
}
if(sb.length() >0){
list.add(sb.toString());
}
return list;
}
You can use the method split:
public class Demo {
public static void main(String args[]) {
String str = "004-034556";
if ((str.contains("-"))) {
String[] temp = str.split("-");
for (String part:temp) {
System.out.println(part);
}
}
else {
System.out.println(str + " does not contain \"-\".");
}
}
}
To split a string, use String.split(regex). Review the following examples:
String data = "004-034556";
String[] output = data.split("-");
System.out.println(output[0]);
System.out.println(output[1]);
Output
004
034556
Note:
This split (regex) takes a regex as an argument. Remember to escape the regex special characters, like period/dot.
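A short sketch of why the escaping matters. An unescaped dot matches every character, so every piece is empty and split drops them all. Pattern.quote is a standard way to treat a delimiter as literal text; the class and helper names here are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class EscapeDelimiterDemo {
    // Hypothetical helper: splits on the delimiter taken literally, not as a regex.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // "." is a regex wildcard, so every character is a delimiter.
        System.out.println("a.b.c".split(".").length);                  // 0
        // Escaped by hand.
        System.out.println(Arrays.toString("a.b.c".split("\\.")));      // [a, b, c]
        // Escaped with Pattern.quote.
        System.out.println(Arrays.toString(splitLiteral("a.b.c", "."))); // [a, b, c]
    }
}
```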
String s = "TnGeneral|DOMESTIC";
String[] a = s.split("\\|");
System.out.println(a[0]);
System.out.println(a[1]);
Output:
TnGeneral
DOMESTIC
String s="004-034556";
for(int i=0;i<s.length();i++)
{
if(s.charAt(i)=='-')
{
System.out.println(s.substring(0,i));
System.out.println(s.substring(i+1));
}
}
As mentioned by everyone, split() is the best option which may be used in your case. An alternative method can be using substring().

How to split a string, but also keep the delimiters?

I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts, using String.split, but it seems that I can't get the actual string, which matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?
You can use lookahead and lookbehind, which are features of regular expressions.
System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));
And you will get:
[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]
The last one is what you want.
((?<=;)|(?=;)) matches the empty string (a zero-width position) immediately before or after a ;.
EDIT: Fabian Steeg's comments on readability are valid. Readability is always a problem with regular expressions. One thing I do to make regular expressions more readable is to create a variable whose name represents what the regular expression does. You can even put placeholders (e.g. %1$s) in it and use Java's String.format to replace the placeholders with the actual string you need to use; for example:
static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
public void someMethod() {
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
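One caveat with the placeholder approach is that the delimiter is inserted into the regex as-is, so a delimiter such as "." or "|" would be read as regex syntax. The sketch below assumes the delimiter is meant literally and wraps it in Pattern.quote before formatting. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class KeepDelimiterDemo {
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    // Splits before and after each literal occurrence of the delimiter,
    // so the delimiter ends up as its own array element.
    static String[] splitKeeping(String text, String literalDelimiter) {
        String quoted = Pattern.quote(literalDelimiter);
        return text.split(String.format(WITH_DELIMITER, quoted));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeeping("a.b.c", "."))); // [a, ., b, ., c]
        System.out.println(Arrays.toString(splitKeeping("1|2", "|")));   // [1, |, 2]
    }
}
```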
You want to use lookarounds, and split on zero-width matches. Here are some examples:
public class SplitNDump {
static void dump(String[] arr) {
for (String s : arr) {
System.out.format("[%s]", s);
}
System.out.println();
}
public static void main(String[] args) {
dump("1,234,567,890".split(","));
// "[1][234][567][890]"
dump("1,234,567,890".split("(?=,)"));
// "[1][,234][,567][,890]"
dump("1,234,567,890".split("(?<=,)"));
// "[1,][234,][567,][890]"
dump("1,234,567,890".split("(?<=,)|(?=,)"));
// "[1][,][234][,][567][,][890]"
dump(":a:bb::c:".split("(?=:)|(?<=:)"));
// "[][:][a][:][bb][:][:][c][:]"
dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
// "[:][a][:][bb][:][:][c][:]"
dump(":::a::::b b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
// "[:::][a][::::][b b][::][c][:]"
dump("a,bb:::c d..e".split("(?!^)\\b"));
// "[a][,][bb][:::][c][ ][d][..][e]"
dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
// "[Array][Index][Out][Of][Bounds][Exception]"
dump("1234567890".split("(?<=\\G.{4})"));
// "[1234][5678][90]"
// Split at the end of each run of letters
dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
// "[Booo][yaaaa][h! Yipp][ieeee][!!]"
}
}
And yes, that is a triply nested assertion there in the last pattern.
Related questions
Java split is eating my characters.
Can you use zero-width matching regex in String split?
How do I convert CamelCase into human-readable names in Java?
Backreferences in lookbehind
See also
regular-expressions.info/Lookarounds
A very naive solution that doesn't involve lookarounds would be to perform a string replace on your delimiter, along the lines of (assuming a comma as the delimiter):
fullString.replace(",", "~,~")
Here you can replace the tilde (~) with any appropriate unique delimiter that never occurs in your text.
If you then split on your new delimiter, I believe you will get the desired result.
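The sentinel idea above could look roughly like this in Java. The tilde sentinel and the names are assumptions for illustration; the guard reflects the requirement that the sentinel must not already appear in the input.

```java
import java.util.Arrays;

class SentinelSplitDemo {
    // Assumed sentinel; any string that cannot occur in the input would do.
    static final String SENTINEL = "~";

    static String[] splitKeepingComma(String fullString) {
        if (fullString.contains(SENTINEL)) {
            throw new IllegalArgumentException("input already contains the sentinel");
        }
        // Surround every comma with the sentinel, then split on the sentinel.
        String marked = fullString.replace(",", SENTINEL + "," + SENTINEL);
        return marked.split(SENTINEL);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeepingComma("a,b,c"))); // [a, ,, b, ,, c]
    }
}
```

Note that two adjacent commas produce an empty string between them, because two sentinels end up side by side.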
import java.util.regex.*;
import java.util.LinkedList;
public class Splitter {
private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");
private Pattern pattern;
private boolean keep_delimiters;
public Splitter(Pattern pattern, boolean keep_delimiters) {
this.pattern = pattern;
this.keep_delimiters = keep_delimiters;
}
public Splitter(String pattern, boolean keep_delimiters) {
this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
}
public Splitter(Pattern pattern) { this(pattern, true); }
public Splitter(String pattern) { this(pattern, true); }
public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
public Splitter() { this(DEFAULT_PATTERN); }
public String[] split(String text) {
if (text == null) {
text = "";
}
int last_match = 0;
LinkedList<String> splitted = new LinkedList<String>();
Matcher m = this.pattern.matcher(text);
while (m.find()) {
splitted.add(text.substring(last_match,m.start()));
if (this.keep_delimiters) {
splitted.add(m.group());
}
last_match = m.end();
}
splitted.add(text.substring(last_match));
return splitted.toArray(new String[splitted.size()]);
}
public static void main(String[] argv) {
if (argv.length != 2) {
System.err.println("Syntax: java Splitter <pattern> <text>");
return;
}
Pattern pattern = null;
try {
pattern = Pattern.compile(argv[0]);
}
catch (PatternSyntaxException e) {
System.err.println(e);
return;
}
Splitter splitter = new Splitter(pattern);
String text = argv[1];
int counter = 1;
for (String part : splitter.split(text)) {
System.out.printf("Part %d: \"%s\"\n", counter++, part);
}
}
}
/*
Example:
> java Splitter "\W+" "Hello World!"
Part 1: "Hello"
Part 2: " "
Part 3: "World"
Part 4: "!"
Part 5: ""
*/
I don't really like the other way, where you get an empty element in front and back. A delimiter is usually not at the beginning or at the end of the string, thus you most often end up wasting two good array slots.
Edit: Fixed limit cases. Commented source with test cases can be found here: http://snippets.dzone.com/posts/show/6453
Pass the third argument (returnDelims) as true. The tokenizer will then return the delimiters as tokens as well.
new StringTokenizer(str, delimiters, true);
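A small sketch of that constructor in use. With returnDelims set to true, each delimiter character comes back as its own one-character token. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDelimsDemo {
    // Collects tokens and delimiter characters in the order they appear.
    static List<String> tokensWithDelims(String str, String delimiters) {
        StringTokenizer st = new StringTokenizer(str, delimiters, true);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokensWithDelims("004-034556", "-")); // [004, -, 034556]
        System.out.println(tokensWithDelims("a,b;c", ",;"));     // [a, ,, b, ;, c]
    }
}
```

The delimiters argument is a set of characters, not a regex, so multi-character or pattern-based delimiters need one of the Matcher-based approaches instead.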
I know this is a very-very old question and answer has also been accepted. But still I would like to submit a very simple answer to original question. Consider this code:
String str = "Hello-World:How\nAre You&doing";
String[] inputs = str.split("(?!^)\\b");
for (int i=0; i<inputs.length; i++) {
System.out.println("a[" + i + "] = \"" + inputs[i] + '"');
}
OUTPUT:
a[0] = "Hello"
a[1] = "-"
a[2] = "World"
a[3] = ":"
a[4] = "How"
a[5] = "
"
a[6] = "Are"
a[7] = " "
a[8] = "You"
a[9] = "&"
a[10] = "doing"
I am just using word boundary \b to delimit the words except when it is start of text.
I got here late, but returning to the original question, why not just use lookarounds?
Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));
output:
[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]
EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code:
{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }
I hope that's easier to read. Thanks for the heads-up, @finnw.
I had a look at the above answers and honestly I find none of them satisfactory. What you want to do is essentially mimic the Perl split functionality. Why Java doesn't allow this, or have a join() method somewhere, is beyond me, but I digress. You don't even need a class for this, really. It's just a function. Run this sample program:
Some of the earlier answers have excessive null-checking, which I recently wrote about in response to a question here:
https://stackoverflow.com/users/18393/cletus
Anyway, the code:
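import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Split {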
public static List<String> split(String s, String pattern) {
assert s != null;
assert pattern != null;
return split(s, Pattern.compile(pattern));
}
public static List<String> split(String s, Pattern pattern) {
assert s != null;
assert pattern != null;
Matcher m = pattern.matcher(s);
List<String> ret = new ArrayList<String>();
int start = 0;
while (m.find()) {
ret.add(s.substring(start, m.start()));
ret.add(m.group());
start = m.end();
}
ret.add(start >= s.length() ? "" : s.substring(start));
return ret;
}
private static void testSplit(String s, String pattern) {
System.out.printf("Splitting '%s' with pattern '%s'%n", s, pattern);
List<String> tokens = split(s, pattern);
System.out.printf("Found %d matches%n", tokens.size());
int i = 0;
for (String token : tokens) {
System.out.printf(" %d/%d: '%s'%n", ++i, tokens.size(), token);
}
System.out.println();
}
public static void main(String args[]) {
testSplit("abcdefghij", "z"); // "abcdefghij"
testSplit("abcdefghij", "f"); // "abcde", "f", "ghij"
testSplit("abcdefghij", "j"); // "abcdefghi", "j", ""
testSplit("abcdefghij", "a"); // "", "a", "bcdefghij"
testSplit("abcdefghij", "[bdfh]"); // "a", "b", "c", "d", "e", "f", "g", "h", "ij"
}
}
I like the idea of StringTokenizer because it is an Enumeration.
But it is also obsolete, and has been replaced by String.split, which returns a boring String[] (and does not include the delimiters).
So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.
A true regexp means it is not a 'character sequence' repeated to form the delimiter:
'o' will only match 'o', and split 'ooo' into three delimiters, with two empty strings inside:
[o], '', [o], '', [o]
But the regexp o+ will return the expected result when splitting "aooob"
[], 'a', [ooo], 'b', []
To use this StringTokenizerEx:
final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
// uses the split String detected and memorized in 'aString'
final String nextDelimiter = aStringTokenizerEx.getDelimiter();
}
The code of this class is available at DZone Snippets.
As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.
Note: (late 2009 edit)
The article Final Thoughts: Java Puzzler: Splitting Hairs does a good job of explaining the bizarre behavior in String.split().
Josh Bloch even commented in response to that article:
Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).
Google's common library Guava also contains a Splitter, which is:
simpler to use
maintained by Google (and not by you)
So it may be worth checking out. From their initial rough documentation (pdf):
JDK has this:
String[] pieces = "foo.bar".split("\\.");
It's fine to use this if you want exactly what it does:
- regular expression
- result as an array
- its way of handling empty pieces
Mini-puzzler: ",a,,b,".split(",") returns...
(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above
Answer: (e) None of the above.
",a,,b,".split(",")
returns
"", "a", "", "b"
Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)
In any case, our Splitter is simply more flexible: The default behavior is simplistic:
Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]
If you want extra features, ask for them!
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]
Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.
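On the JDK side of the mini-puzzler quoted above, one documented way to keep trailing empty strings is to pass a negative limit to String.split. The sketch below only illustrates that JDK behavior; whether it is the workaround the Guava authors had in mind is an assumption. The class name is made up.

```java
class TrailingEmptiesDemo {
    // Default split: trailing empty strings are dropped.
    static String[] splitDefault(String s) {
        return s.split(",");
    }

    // Negative limit: the pattern is applied as often as possible
    // and trailing empty strings are kept.
    static String[] splitKeepTrailing(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(splitDefault(",a,,b,").length);      // 4
        System.out.println(splitKeepTrailing(",a,,b,").length); // 5
    }
}
```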
Here is a simple, clean implementation which is consistent with Pattern#split and works with variable-length patterns, which lookbehind cannot support, and it is easier to use. It is similar to the solution provided by @cletus.
public static String[] split(CharSequence input, String pattern) {
return split(input, Pattern.compile(pattern));
}
public static String[] split(CharSequence input, Pattern pattern) {
Matcher matcher = pattern.matcher(input);
int start = 0;
List<String> result = new ArrayList<>();
while (matcher.find()) {
result.add(input.subSequence(start, matcher.start()).toString());
result.add(matcher.group());
start = matcher.end();
}
if (start != input.length()) result.add(input.subSequence(start, input.length()).toString());
return result.toArray(new String[0]);
}
I don't do null checks here; Pattern#split doesn't, so why should I? I don't like the if at the end, but it is required for consistency with Pattern#split. Otherwise I would unconditionally append, resulting in an empty string as the last element of the result if the input string ends with the pattern.
I convert to String[] for consistency with Pattern#split, I use new String[0] rather than new String[result.size()], see here for why.
Here are my tests:
@Test
public void splitsVariableLengthPattern() {
String[] result = Split.split("/foo/$bar/bas", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar", "/bas" }, result);
}
@Test
public void splitsEndingWithPattern() {
String[] result = Split.split("/foo/$bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar" }, result);
}
@Test
public void splitsStartingWithPattern() {
String[] result = Split.split("$foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "", "$foo", "/bar" }, result);
}
@Test
public void splitsNoMatchesPattern() {
String[] result = Split.split("/foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/bar" }, result);
}
I will post my working versions as well (the first is really similar to Markus's).
public static String[] splitIncludeDelimeter(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
int now, old = 0;
while(matcher.find()){
now = matcher.end();
list.add(text.substring(old, now));
old = now;
}
if(list.size() == 0)
return new String[]{text};
//adding rest of a text as last element
String finalElement = text.substring(old);
list.add(finalElement);
return list.toArray(new String[list.size()]);
}
And here is the second solution, which is around 50% faster than the first one:
public static String[] splitIncludeDelimeter2(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
StringBuffer stringBuffer = new StringBuffer();
while(matcher.find()){
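// quoteReplacement keeps '$' and '\' in the matched text from being read as group references
matcher.appendReplacement(stringBuffer, Matcher.quoteReplacement(matcher.group()));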
list.add(stringBuffer.toString());
stringBuffer.setLength(0); //clear buffer
}
matcher.appendTail(stringBuffer); // append the rest of the string
list.add(stringBuffer.toString());
return list.toArray(new String[list.size()]);
}
Another candidate solution using a regex. Retains token order, correctly matches multiple tokens of the same type in a row. The downside is that the regex is kind of nasty.
package javaapplication2;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaApplication2 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
String num = "58.5+variable-+98*78/96+a/78.7-3443*12-3";
// Terrifying regex:
// (a)|(b)|(c) match a or b or c
// where
// (a) is one or more digits optionally followed by a decimal point
// followed by one or more digits: (\d+(\.\d+)?)
// (b) is one of the set + * / - occurring once: ([+*/-])
// (c) is a sequence of one or more lowercase latin letter: ([a-z]+)
Pattern tokenPattern = Pattern.compile("(\\d+(\\.\\d+)?)|([+*/-])|([a-z]+)");
Matcher tokenMatcher = tokenPattern.matcher(num);
List<String> tokens = new ArrayList<>();
while (!tokenMatcher.hitEnd()) {
if (tokenMatcher.find()) {
tokens.add(tokenMatcher.group());
} else {
// report error
break;
}
}
System.out.println(tokens);
}
}
Sample output:
[58.5, +, variable, -, +, 98, *, 78, /, 96, +, a, /, 78.7, -, 3443, *, 12, -, 3]
I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):
static String[] splitWithDelimiters(String s) {
if (s == null || s.length() == 0) {
return new String[0];
}
LinkedList<String> result = new LinkedList<String>();
StringBuilder sb = null;
boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
for (char c : s.toCharArray()) {
if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
if (sb != null) {
result.add(sb.toString());
}
sb = new StringBuilder();
wasLetterOrDigit = !wasLetterOrDigit;
}
sb.append(c);
}
result.add(sb.toString());
return result.toArray(new String[0]);
}
I suggest using Pattern and Matcher, which will almost certainly achieve what you want. Your regular expression will need to be somewhat more complicated than what you are using in String.split.
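As a rough illustration of the Pattern and Matcher route, the sketch below matches either a run of word characters or a run of non-word characters, so every character of the input lands in exactly one token and joining the tokens restores the original string. The regex \w+|\W+ and the names are assumptions chosen for the example, not the only way to define tokens and separators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherTokensDemo {
    // Either a run of word characters or a run of everything else.
    private static final Pattern TOKEN_OR_SEPARATOR = Pattern.compile("\\w+|\\W+");

    static List<String> tokenize(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN_OR_SEPARATOR.matcher(text);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = tokenize("Hello, world!");
        System.out.println(parts);                  // [Hello, , , world, !]
        System.out.println(String.join("", parts)); // Hello, world!
    }
}
```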
I don't think it is possible with String#split, but you can use a StringTokenizer, though that won't allow you to define your delimiter as a regex, only as a set of single-character delimiters:
new StringTokenizer("Hello, world. Hi!", ",.!", true); // true for returnDelims
If you can afford it, use Java's replace(CharSequence target, CharSequence replacement) method to insert another delimiter to split on.
Example:
I want to split the string "boo:and:foo" and keep ':' at its righthand String.
String str = "boo:and:foo";
str = str.replace(":","newdelimiter:");
String[] tokens = str.split("newdelimiter");
Important note: This only works if you have no further "newdelimiter" in your String! Thus, it is not a general solution.
But if you know a CharSequence of which you can be sure that it will never appear in the String, this is a very simple solution.
Fast answer: use zero-width boundaries like \b to split. I will try and experiment to see if it works (I have used that in PHP and JS).
It is possible, and it kind of works, but it might split too much. Actually, it depends on the string you want to split and the result you need. Give more details and we can help you better.
Another way is to do your own split, capturing the delimiter (supposing it is variable) and adding it afterward to the result.
My quick test:
String str = "'ab','cd','eg'";
String[] stra = str.split("\\b");
for (String s : stra) System.out.print(s + "|");
System.out.println();
Result:
'|ab|','|cd|','|eg|'|
A bit too much... :-)
I tweaked Pattern.split() to include the matched pattern in the list.
Added
// add match to the list
matchList.add(input.subSequence(start, end).toString());
Full source
public static String[] inclusiveSplit(String input, String re, int limit) {
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<String>();
Pattern pattern = Pattern.compile(re);
Matcher m = pattern.matcher(input);
// Add segments before each match found
while (m.find()) {
int end = m.end();
if (!matchLimited || matchList.size() < limit - 1) {
int start = m.start();
String match = input.subSequence(index, start).toString();
matchList.add(match);
// add match to the list
matchList.add(input.subSequence(start, end).toString());
index = end;
} else if (matchList.size() == limit - 1) { // last one
String match = input.subSequence(index, input.length())
.toString();
matchList.add(match);
index = end;
}
}
// If no match was found, return this
if (index == 0)
return new String[] { input.toString() };
// Add remaining segment
if (!matchLimited || matchList.size() < limit)
matchList.add(input.subSequence(index, input.length()).toString());
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize - 1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
}
Here's a groovy version based on some of the code above, in case it helps. It's short, anyway. Conditionally includes the head and tail (if they are not empty). The last part is a demo/test case.
List splitWithTokens(str, pat) {
def tokens=[]
def lastMatch=0
def m = str=~pat
while (m.find()) {
if (m.start() > 0) tokens << str[lastMatch..<m.start()]
tokens << m.group()
lastMatch=m.end()
}
if (lastMatch < str.length()) tokens << str[lastMatch..<str.length()]
tokens
}
[['<html><head><title>this is the title</title></head>',/<[^>]+>/],
['before<html><head><title>this is the title</title></head>after',/<[^>]+>/]
].each {
println splitWithTokens(*it)
}
An extremely naive and inefficient solution which works nevertheless. Use split twice on the string and then concatenate the two arrays:
String temp[]=str.split("\\W");
String temp2[]=str.split("\\w||\\s");
int i=0;
for(String string:temp)
System.out.println(string);
String temp3[]=new String[temp.length-1];
for(String string:temp2)
{
System.out.println(string);
if((string.equals("")!=true)&&(string.equals("\\s")!=true))
{
temp3[i]=string;
i++;
}
// System.out.println(temp.length);
// System.out.println(temp2.length);
}
System.out.println(temp3.length);
String[] temp4=new String[temp.length+temp3.length];
int j=0;
for(i=0;i<temp.length;i++)
{
temp4[j]=temp[i];
j=j+2;
}
j=1;
for(i=0;i<temp3.length;i++)
{
temp4[j]=temp3[i];
j+=2;
}
for(String s:temp4)
System.out.println(s);
String expression = "((A+B)*C-D)*E";
expression = expression.replaceAll("\\+", "~+~");
expression = expression.replaceAll("\\*", "~*~");
expression = expression.replaceAll("-", "~-~");
expression = expression.replaceAll("/+", "~/~");
expression = expression.replaceAll("\\(", "~(~"); //also you can use [(] instead of \\(
expression = expression.replaceAll("\\)", "~)~"); //also you can use [)] instead of \\)
expression = expression.replaceAll("~~", "~");
if(expression.startsWith("~")) {
expression = expression.substring(1);
}
String[] expressionArray = expression.split("~");
System.out.println(Arrays.toString(expressionArray));
One of the subtleties in this question involves the "leading delimiter" question: if you are going to have a combined array of tokens and delimiters you have to know whether it starts with a token or a delimiter. You could of course just assume that a leading delim should be discarded but this seems an unjustified assumption. You might also want to know whether you have a trailing delim or not. This sets two boolean flags accordingly.
Written in Groovy but a Java version should be fairly obvious:
String tokenRegex = /[\p{L}\p{N}]+/ // a String in Groovy, Unicode alphanumeric
def finder = phraseForTokenising =~ tokenRegex
// NB in Groovy the variable 'finder' is then of class java.util.regex.Matcher
def finderIt = finder.iterator() // extra method added to Matcher by Groovy magic
int start = 0
boolean leadingDelim, trailingDelim
def combinedTokensAndDelims = [] // create an array in Groovy
while( finderIt.hasNext() )
{
def token = finderIt.next()
int finderStart = finder.start()
String delim = phraseForTokenising[ start .. finderStart - 1 ]
// Groovy: above gets slice of String/array
if( start == 0 ) leadingDelim = finderStart != 0
if( start > 0 || leadingDelim ) combinedTokensAndDelims << delim
combinedTokensAndDelims << token // add element to end of array
start = finder.end()
}
// start == 0 indicates no tokens found
if( start > 0 ) {
// finish by seeing whether there is a trailing delim
trailingDelim = start < phraseForTokenising.length()
if( trailingDelim ) combinedTokensAndDelims << phraseForTokenising[ start .. -1 ]
println( "leading delim? $leadingDelim, trailing delim? $trailingDelim, combined array:\n $combinedTokensAndDelims" )
}
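A possible Java adaptation of the Groovy code above, as a sketch and not a line-by-line port. The token regex is taken from the answer; the Result holder and the other names are made up for the example. When no token is found, the list stays empty and both flags stay false, mirroring the "start == 0" case in the Groovy version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokensAndDelimsDemo {
    // Unicode letters and digits, as in the Groovy version.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Hypothetical holder for the combined list and the two flags.
    static final class Result {
        final List<String> combined = new ArrayList<>();
        boolean leadingDelim;
        boolean trailingDelim;
    }

    static Result tokenise(String phrase) {
        Result r = new Result();
        Matcher m = TOKEN.matcher(phrase);
        int start = 0;
        boolean foundToken = false;
        while (m.find()) {
            if (!foundToken) {
                // Text before the first token means a leading delimiter.
                r.leadingDelim = m.start() != 0;
                foundToken = true;
            }
            if (m.start() > start) {
                r.combined.add(phrase.substring(start, m.start()));
            }
            r.combined.add(m.group());
            start = m.end();
        }
        if (foundToken) {
            // Text after the last token means a trailing delimiter.
            r.trailingDelim = start < phrase.length();
            if (r.trailingDelim) {
                r.combined.add(phrase.substring(start));
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Result r = tokenise("  hello, world!");
        System.out.println(r.leadingDelim + " " + r.trailingDelim + " " + r.combined);
    }
}
```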
If you want to keep the character, you can use the split method and add the character back yourself.
See this example:
public class SplitExample {
public static void main(String[] args) {
String str = "Javathomettt";
System.out.println("method 1");
System.out.println("Returning words:");
String[] arr = str.split("t", 40);
for (String w : arr) {
System.out.println(w+"t");
}
System.out.println("Split array length: "+arr.length);
System.out.println("method 2");
System.out.println(str.replaceAll("t", "\n"+"t"));
}
}
I don't know Java too well, but if you can't find a split method that does that, I suggest you just make your own.
static String[] mySplit(String s, String delimiter)
{
// Pattern.quote (java.util.regex.Pattern) makes split treat the delimiter as literal text
String[] result = s.split(Pattern.quote(delimiter));
for(int i=0;i<result.length-1;i++)
{
result[i] += delimiter; //this adds the delimiter to the end of each item except the last,
//you can modify it however you want
}
return result;
}
String[] res = mySplit(myString,myDelimiter);
It's not too elegant, but it'll do.
