Java regex input from txt file - java

I have a text file that includes some mathematical expressions.
I need to parse the text into components (words, sentences, punctuation, numbers and arithmetic signs) using regular expressions, evaluate the mathematical expressions, and return the text in its original form with each expression replaced by its calculated value.
I have already done this without regular expressions (and without the calculation). Now I am trying to do it using regular expressions.
I do not fully understand how to do this correctly. The input text looks like this:
Pete like mathematic 5+3 and jesica too sin(3).
In the output I need:
Pete like mathematic 8 and jesica too 0,14.
I need some advice with regex and calculation from people who know how to do this.
My code:
final static Pattern PUNCTUATION = Pattern.compile("([\\s.,!?;:]){1,}");
final static Pattern LETTER = Pattern.compile("([а-яА-Яa-zA-Z&&[^sin]]){1,}");
List<Sentence> sentences = new ArrayList<Sentence>();
List<PartOfSentence> parts = new ArrayList<PartOfSentence>();
StringTokenizer st = new StringTokenizer(text, " \t\n\r:;.!?,/\\|\"\'",
true);
The code with regex (not working):
while (st.hasMoreTokens()) {
String s = st.nextToken().trim();
int size = s.length();
for (int i=0; i<s.length();i++){
//with regex. not working variant
Matcher m = LETTER.matcher(s);
if (m.matches()){
parts.add(new Word(s.toCharArray()));
}
m = PUNCTUATION.matcher(s);
if (m.matches()){
parts.add(new Punctuation(s.charAt(0)));
}
Sentence buf = new Sentence(parts);
if (buf.getWords().size() != 0) {
sentences.add(buf);
parts = new ArrayList<PartOfSentence>();
} else
parts.add(new Punctuation(s.charAt(0)));
Without regex (working):
if (size < 1)
continue;
if (size == 1) {
switch (s.charAt(0)) {
case ' ':
continue;
case ',':
case ';':
case ':':
case '\'':
case '\"':
parts.add(new Punctuation(s.charAt(0)));
break;
case '.':
case '?':
case '!':
parts.add(new Punctuation(s.charAt(0)));
Sentence buf = new Sentence(parts);
if (buf.getWords().size() != 0) {
sentences.add(buf);
parts = new ArrayList<PartOfSentence>();
} else
parts.add(new Punctuation(s.charAt(0)));
break;
default:
parts.add(new Word(s.toCharArray()));
}
} else {
parts.add(new Word(s.toCharArray()));
}
}

This is not a trivial problem to solve as even matching numbers can become quite involved.
Firstly, a number can be matched by the regular expression "(\\d*(\\.\\d*)?\\d(e\\d+)?)" to account for decimal places and exponent formats.
Secondly, there are (at least) three types of expressions that you want to solve: binary, unary and functions. For each one, we create a pattern to match in the solve method.
Thirdly, there are numerous expression-evaluation libraries that can implement the reduce method; the code below uses exp4j.
The implementation below does not handle nested or compound expressions, e.g. sin(5) + cos(3), or spaces inside expressions.
private static final String NUM = "(\\d*(\\.\\d*)?\\d(e\\d+)?)";
public String solve(String expr) {
expr = solve(expr, "(" + NUM + "(!|\\+\\+|--))"); //unary operators
expr = solve(expr, "(" + NUM + "([-+/*]" + NUM + ")+)"); // binary operators ('-' comes first so it is a literal, not a range)
expr = solve(expr, "((sin|cos|tan)\\(" + NUM + "\\))"); // functions
return expr;
}
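private String solve(String expr, String pattern) {
Matcher m = Pattern.compile(pattern).matcher(expr);
StringBuffer sb = new StringBuffer();
// assume a reduce method :String -> String that solves expressions;
// each match is replaced by its own reduced value
while(m.find()){
m.appendReplacement(sb, Matcher.quoteReplacement(reduce(m.group())));
}
m.appendTail(sb);
return sb.toString();
}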
//evaluate expression using exp4j, format to 2 decimal places,
//remove trailing 0s and dangling decimal point
private String reduce(String expr){
double res = new ExpressionBuilder(expr).build().evaluate();
return String.format("%.2f",res).replaceAll("0*$", "").replaceAll("\\.$", "");
}
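If pulling in an expression library is not an option, the same match-and-replace idea can be sketched with the JDK alone. This is only a minimal sketch under some assumptions: Java 9 or later (for Matcher.replaceAll with a function), unsigned operands, a single binary operator per expression, and no nesting. The names InlineMath, evaluate, applyFunction and applyOperator are made up for this illustration.

```java
import java.util.Locale;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class InlineMath {
    // unsigned decimal number such as 5, 3.25 or .5 (simplification: no exponent)
    private static final String NUM = "(\\d*\\.?\\d+)";
    private static final Pattern FUNC = Pattern.compile("(sin|cos|tan)\\(" + NUM + "\\)");
    private static final Pattern BINARY = Pattern.compile(NUM + "([-+*/])" + NUM);

    static String evaluate(String text) {
        // functions first, so that their arguments are not mistaken for operands
        String out = FUNC.matcher(text).replaceAll(InlineMath::applyFunction);
        return BINARY.matcher(out).replaceAll(InlineMath::applyOperator);
    }

    private static String applyFunction(MatchResult m) {
        double x = Double.parseDouble(m.group(2));
        switch (m.group(1)) {
            case "sin": return format(Math.sin(x));
            case "cos": return format(Math.cos(x));
            default:    return format(Math.tan(x));
        }
    }

    private static String applyOperator(MatchResult m) {
        double a = Double.parseDouble(m.group(1));
        double b = Double.parseDouble(m.group(3));
        switch (m.group(2)) {
            case "+": return format(a + b);
            case "-": return format(a - b);
            case "*": return format(a * b);
            default:  return format(a / b);
        }
    }

    // two decimal places, then strip trailing zeros and a dangling decimal point
    private static String format(double v) {
        return String.format(Locale.ROOT, "%.2f", v)
                .replaceAll("0+$", "")
                .replaceAll("\\.$", "");
    }
}
```

Calling InlineMath.evaluate on the sentence from the question should turn 5+3 into 8 and sin(3) into 0.14 while leaving the rest of the text alone. Locale.ROOT is used so that the decimal separator is always a dot; use a different locale if you need a comma.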

I think you could start by looking for "function" matches in your input string. Anything that does not match a function is simply returned unchanged.
For example, this short code does, I hope, what you are looking for:
Class with Main method.
public class App {
StringTokenizer st = new StringTokenizer("Pete likes Mathematics 3+3 and Jessica too 6+3.", " \t\n\r:;.!?,/\\|\"\'", true);
public static void main(String[] args) {
new App();
}
public App(){
ArrayList<String> renderedStrings = new ArrayList<String>();
while(st.hasMoreTokens()){
String s = st.nextToken();
if(!AdditionPatternFuntion.render(s, renderedStrings)){
renderedStrings.add(s);
}
}
for(String s : renderedStrings){
System.out.print(s);
}
}
}
Class "AdditionPatternFuntion" that does the real job
import java.util.ArrayList;
import java.util.StringTokenizer;
import java.util.regex.Pattern;
class AdditionPatternFuntion{
public static boolean render(String s, ArrayList<String> renderedStrings){
// one or more digits on each side of the plus sign
Pattern pattern = Pattern.compile("(\\d+\\+\\d+)");
boolean match = pattern.matcher(s).matches();
if(match){
StringTokenizer additionTokenier = new StringTokenizer(s, "+", false);
int firstOperand = Integer.parseInt(additionTokenier.nextToken());
int secondOperand = Integer.parseInt(additionTokenier.nextToken());
renderedStrings.add(String.valueOf(firstOperand + secondOperand));
}
return match;
}
}
When I run it with this input:
Pete likes Mathematics 3+3 and Jessica too 6+3.
I get this output:
Pete likes Mathematics 6 and Jessica too 9.
To handle the "sin()" function you can do the same: create a new class, "SinPatternFunction" for instance, and implement it in the same way.
I think you should even create an abstract class "FunctionPattern" with an abstract method "render", which the addition class and the SinPatternFunction class would implement.
Finally, you could create a class, let's call it "PatternFunctionHandler", which creates a list of FunctionPattern instances (a SinPatternFunction, an AdditionPatternFunction and so on), then calls render on each one and returns the result.
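The class layout described above could look roughly like the following. It is only a sketch of one possible design, not tested against the asker's code: the method signatures are my own guess, render returns null for a token it does not handle, and every token is assumed to arrive as a single string such as 3+3 or sin(3).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Base class: each subclass recognises one kind of expression token.
abstract class FunctionPattern {
    private final Pattern pattern;

    FunctionPattern(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    // Returns the evaluated text, or null if this function does not handle the token.
    final String render(String token) {
        Matcher m = pattern.matcher(token);
        return m.matches() ? evaluate(m) : null;
    }

    abstract String evaluate(Matcher m);
}

class AdditionPatternFunction extends FunctionPattern {
    AdditionPatternFunction() { super("(\\d+)\\+(\\d+)"); }

    @Override
    String evaluate(Matcher m) {
        return String.valueOf(Integer.parseInt(m.group(1)) + Integer.parseInt(m.group(2)));
    }
}

class SinPatternFunction extends FunctionPattern {
    SinPatternFunction() { super("sin\\((\\d+(?:\\.\\d+)?)\\)"); }

    @Override
    String evaluate(Matcher m) {
        return String.format(Locale.ROOT, "%.2f", Math.sin(Double.parseDouble(m.group(1))));
    }
}

class PatternFunctionHandler {
    private final List<FunctionPattern> functions =
            Arrays.asList(new AdditionPatternFunction(), new SinPatternFunction());

    // Tries every registered function; unmatched tokens are returned unchanged.
    String render(String token) {
        for (FunctionPattern f : functions) {
            String result = f.render(token);
            if (result != null) {
                return result;
            }
        }
        return token;
    }
}
```

Adding a new function then means adding one subclass and registering it in the handler's list. Note that the StringTokenizer used in App treats '.' as a delimiter, so an argument like sin(3.5) would be cut in two before it ever reaches the handler.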

Your specified requirement is to use regular expressions to:
Divide text into components (words, ...)
Return text with inner arithmetic expressions evaluated
You have started on the first step using regular expressions, but have not quite completed it -- after completing it, there remains to:
Recognize and parse components that form arithmetic (sub)expressions.
Evaluate recognized (sub)expression components and produce a value. For evaluating (sub)expressions in infix notation, there exists a very helpful answer.
Substitute the values back into the original string -- should be simple.
For dividing the text into components defined strictly enough to allow later unambiguous evaluation of the subexpressions, I coded a sample, trying out named capturing groups in Java. This sample handles only integer numbers, but floating point should be simple to add.
Sample output on some test inputs was as follows:
Matching 'Pete like mathematic 5+3 and jesica too sin(3).'
WORD('Pete'),WS(' '),WORD('like'),WS(' '),WORD('mathematic'),WS(' '),NUM('5'),OP('+'),NUM('3'),WS(' '),WORD('and'),WS(' '),WORD('jesica'),WS(' '),WORD('too'),WS(' '),FUNC('sin'),FOPENP('('),NUM('3'),CLOSEP(')'),DOT('.')
Matching 'How about solving sin(3 + cos(x)).'
WORD('How'),WS(' '),WORD('about'),WS(' '),WORD('solving'),WS(' '),FUNC('sin'),FOPENP('('),NUM('3'),WS(' '),OP('+'),WS(' '),FUNC('cos'),FOPENP('('),WORD('x'),CLOSEP(')'),CLOSEP(')'),DOT('.')
Matching 'Or arcsin(4.2) we do not know about?'
WORD('Or'),WS(' '),WORD('arcsin'),OPENP('('),NUM('4'),DOT('.'),NUM('2'),CLOSEP(')'),WS(' '),WORD('we'),WS(' '),WORD('do'),WS(' '),WORD('not'),WS(' '),WORD('know'),WS(' '),WORD('about'),PUNCT('?')
Matching ''sin sin sin' the catholic priest has said...'
PUNCT('''),WORD('sin'),WS(' '),WORD('sin'),WS(' '),WORD('sin'),PUNCT('''),WS(' '),WORD('the'),WS(' '),WORD('catholic'),WS(' '),WORD('priest'),WS(' '),WORD('has'),WS(' '),WORD('said'),DOT('.'),DOT('.'),DOT('.')
On named capturing group usage, I found it inconvenient that compiled Pattern or acquired Matcher APIs do not provide access to present group names. Sample code below.
import java.util.*;
import java.util.regex.*;
import static java.util.stream.Collectors.joining;
public class Lexer {
// differentiating _function call opening parentheses_ from expressions one
static final String S_FOPENP = "(?<fopenp>\\()";
static final String S_FUNC = "(?<func>(sin|cos|tan))" + S_FOPENP;
// expression or text opening parentheses
static final String S_OPENP = "(?<openp>\\()";
// expression or text closing parentheses
static final String S_CLOSEP = "(?<closep>\\))";
// separate dot, should help with introducing floating-point support
static final String S_DOT = "(?<dot>\\.)";
// other recognized punctuation
static final String S_PUNCT = "(?<punct>[,!?;:'\"])";
// whitespace
static final String S_WS = "(?<ws>\\s+)";
// integer number pattern
static final String S_NUM = "(?<num>\\d+)";
// treat '* / + -' as mathematical operators. Can be in dashed text.
static final String S_OP = "(?<op>\\*|/|\\+|-)";
// word -- refrain from using \w character class that also includes digits
static final String S_WORD = "(?<word>[a-zA-Z]+)";
// put the predefined components together into single regular expression
private static final String S_ALL = "(" +
S_OPENP + "|" + S_CLOSEP + "|" + S_FUNC + "|" + S_DOT + "|" +
S_PUNCT + "|" + S_WS + "|" + S_NUM + "|" + S_OP + "|" + S_WORD +
")";
static final Pattern ALL = Pattern.compile(S_ALL); // ... & form Pattern
// named capturing groups defined in regular expressions
static final List<String> GROUPS = Arrays.asList(
"func", "fopenp",
"openp", "closep",
"dot", "punct", "ws",
"num", "op",
"word"
);
// divide match into components according to capturing groups
static final List<String> tokenize(Matcher m) {
List<String> tokens = new LinkedList<>();
while (m.find()){
for (String group : GROUPS) {
String grResult = m.group(group);
if (grResult != null)
tokens.add(group.toUpperCase() + "('" + grResult + "')");
}
}
return tokens;
}
// some sample inputs to test
static final List<String> INPUTS = Arrays.asList(
"Pete like mathematic 5+3 and jesica too sin(3).",
"How about solving sin(3 + cos(x)).",
"Or arcsin(4.2) we do not know about?",
"'sin sin sin' the catholic priest has said..."
);
// test
public static void main(String[] args) {
for (String input: INPUTS) {
Matcher m = ALL.matcher(input);
System.out.println("Matching '" + input + "'");
System.out.println(tokenize(m).stream().collect(joining(",")));
}
}
}
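On the remark that floating point should be simple to add: one way, sketched here as an assumption rather than as part of the sample above, is to let the number group accept an optional fractional part and to list it before the dot group in the alternation. The class name FloatLexerSketch and the reduced set of groups are made up to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FloatLexerSketch {
    // digits with an optional fractional part, e.g. 3 or 4.2 (no sign, no exponent)
    static final String S_NUM = "(?<num>\\d+(?:\\.\\d+)?)";
    static final String S_DOT = "(?<dot>\\.)";
    static final String S_WORD = "(?<word>[a-zA-Z]+)";
    static final String S_WS = "(?<ws>\\s+)";
    static final String S_PAREN = "(?<paren>[()])";

    // NUM is listed before DOT so that "4.2" is consumed as one number
    static final Pattern ALL = Pattern.compile(
            S_NUM + "|" + S_DOT + "|" + S_WORD + "|" + S_WS + "|" + S_PAREN);

    static final String[] GROUPS = {"num", "dot", "word", "ws", "paren"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = ALL.matcher(input);
        while (m.find()) {
            // exactly one named group is non-null for each match
            for (String group : GROUPS) {
                String text = m.group(group);
                if (text != null) {
                    tokens.add(group.toUpperCase() + "('" + text + "')");
                }
            }
        }
        return tokens;
    }
}
```

With this pattern a sentence-ending dot is still reported as DOT, because the fractional part requires at least one digit after the dot.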

Related

What RegEx separates terms of Polynomial

I have a String 5x^3-2x^2+5x
I want a regex which splits this string as
5x^3,
-2x^2,
5x
I tried "(-)|(\\+)",
but this did not work, as it did not consider negative power terms.
You can split your string using this regex,
\+|(?=-)
The way this works: it splits on +, consuming the + character, and it also splits at the position just before each -, without consuming the -, because (?=-) is a zero-width lookahead.
Check out this Java code,
String s = "5x^3-2x^2+5x";
System.out.println(Arrays.toString(s.split("\\+|(?=-)")));
Gives your expected output below,
[5x^3, -2x^2, 5x]
Edit:
Although the OP said in a comment that there won't be negative powers, in case you do have them you can use this regex, which handles negative powers as well:
\+|(?<!\^)(?=-)
Check this updated Java code,
List<String> list = Arrays.asList("5x^3-2x^2+5x", "5x^3-2x^-2+5x");
for (String s : list) {
System.out.println(s + " --> " +Arrays.toString(s.split("\\+|(?<!\\^)(?=-)")));
}
New output,
5x^3-2x^2+5x --> [5x^3, -2x^2, 5x]
5x^3-2x^-2+5x --> [5x^3, -2x^-2, 5x]
Maybe,
-?[^\r\n+-]+(?=[+-]|$)
or a similar expression would work too, in case you have constants in the equations.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegularExpression{
public static void main(String[] args){
final String regex = "-?[^\r\n+-]+(?=[+-]|$)";
final String string = "5x^3-2x^2+5x\n"
+ "5x^3-2x^2+5x-5\n"
+ "-5x^3-2x^2+5x+5";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
}
If you wish to simplify, modify or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
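The program below breaks the input into its individual components (coefficient, variable, caret and exponent), one capturing group each. Debug it and combine the sub-patterns as you need. Note that the pattern is hard-wired to the three-term shape of this particular input.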
import java.util.regex.*;
class Main
{
public static void main(String[] args)
{
String txt="5x^3-2x^2+5x";
String re1="([-+]?\\d+)"; // Coefficient 1 (sign optional for the leading term)
String re2="((?:[a-z][a-z0-9_]*))"; // Variable Name 1
String re3="(\\^)"; // Caret 1
String re4="([-+]?\\d+)"; // Exponent 1
String re5="([-+]\\d+)"; // Signed coefficient 2
String re6="((?:[a-z][a-z0-9_]*))"; // Variable Name 2
String re7="(\\^)"; // Caret 2
String re8="([-+]?\\d+)"; // Exponent 2
String re9="([-+]\\d+)"; // Signed coefficient 3
String re10="((?:[a-z][a-z0-9_]*))"; // Variable Name 3
Pattern p = Pattern.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
String int1=m.group(1);
String var1=m.group(2);
String c1=m.group(3);
String int2=m.group(4);
String signed_int1=m.group(5);
String var2=m.group(6);
String c2=m.group(7);
String int3=m.group(8);
String signed_int2=m.group(9);
String var3=m.group(10);
System.out.print("("+int1.toString()+")"+"("+var1.toString()+")"+"("+c1.toString()+")"+"("+int2.toString()+")"+"("+signed_int1.toString()+")"+"("+var2.toString()+")"+"("+c2.toString()+")"+"("+int3.toString()+")"+"("+signed_int2.toString()+")"+"("+var3.toString()+")"+"\n");
}
}
}
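A fixed chain of ten groups only fits one shape of polynomial. As an alternative sketch (my own assumption about what is wanted: integer coefficients, a single-letter variable, optional signed exponents), one term pattern can be applied repeatedly with find(). The class name PolynomialTerms and the int[] {coefficient, exponent} representation are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PolynomialTerms {
    // First alternative: [coefficient] variable [^exponent]  -> groups 1, 2, 3
    // Second alternative: a plain signed constant            -> group 4
    private static final Pattern TERM = Pattern.compile(
            "([-+]?\\d*)([a-z])(?:\\^([-+]?\\d+))?|([-+]?\\d+)");

    // Returns one {coefficient, exponent} pair per term, in input order.
    static List<int[]> parse(String polynomial) {
        List<int[]> terms = new ArrayList<>();
        Matcher m = TERM.matcher(polynomial);
        while (m.find()) {
            if (m.group(4) != null) {
                // constant term such as -7: exponent 0
                terms.add(new int[] {Integer.parseInt(m.group(4)), 0});
                continue;
            }
            int coefficient = toCoefficient(m.group(1));
            // a variable without '^' has exponent 1
            int exponent = m.group(3) == null ? 1 : Integer.parseInt(m.group(3));
            terms.add(new int[] {coefficient, exponent});
        }
        return terms;
    }

    // "" and "+" mean 1, "-" means -1, anything else is an ordinary integer
    private static int toCoefficient(String s) {
        if (s.isEmpty() || s.equals("+")) return 1;
        if (s.equals("-")) return -1;
        return Integer.parseInt(s);
    }
}
```

Because the pattern is matched term by term, the number of terms no longer has to be known in advance, and a negative exponent such as x^-2 stays attached to its term.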

Java regex - how to chop String into parts [duplicate]

I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts, using String.split, but it seems that I can't get the actual string, which matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?
You can use lookahead and lookbehind, which are features of regular expressions.
System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));
And you will get:
[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]
The last one is what you want.
((?<=;)|(?=;)) matches the empty (zero-width) position either just after a ; or just before a ;.
EDIT: Fabian Steeg's comment on readability is valid. Readability is always a problem with regular expressions. One thing I do to make regular expressions more readable is to create a variable whose name describes what the regular expression does. You can even put placeholders (e.g. %1$s) in it and use Java's String.format to replace the placeholders with the actual string you need to use; for example:
static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
public void someMethod() {
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
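One caveat with the placeholder approach: the delimiter is pasted into the regex as-is, so a delimiter such as . or | would be interpreted as a metacharacter. A small sketch of how that could be guarded against, assuming the delimiter is meant literally (the class and method names are made up here):

```java
import java.util.regex.Pattern;

class SplitKeepingDelimiter {
    // %1$s is replaced by the (already regex-safe) delimiter
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    static String[] split(String text, String literalDelimiter) {
        // Pattern.quote lets characters such as '.' or '|' be used literally
        String regex = String.format(WITH_DELIMITER, Pattern.quote(literalDelimiter));
        return text.split(regex);
    }
}
```

For example, SplitKeepingDelimiter.split("a.b.c", ".") yields the five pieces a, ., b, ., c, whereas the unquoted dot would match before and after every character.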
You want to use lookarounds, and split on zero-width matches. Here are some examples:
public class SplitNDump {
static void dump(String[] arr) {
for (String s : arr) {
System.out.format("[%s]", s);
}
System.out.println();
}
public static void main(String[] args) {
dump("1,234,567,890".split(","));
// "[1][234][567][890]"
dump("1,234,567,890".split("(?=,)"));
// "[1][,234][,567][,890]"
dump("1,234,567,890".split("(?<=,)"));
// "[1,][234,][567,][890]"
dump("1,234,567,890".split("(?<=,)|(?=,)"));
// "[1][,][234][,][567][,][890]"
dump(":a:bb::c:".split("(?=:)|(?<=:)"));
// "[][:][a][:][bb][:][:][c][:]"
dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
// "[:][a][:][bb][:][:][c][:]"
dump(":::a::::b b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
// "[:::][a][::::][b b][::][c][:]"
dump("a,bb:::c d..e".split("(?!^)\\b"));
// "[a][,][bb][:::][c][ ][d][..][e]"
dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
// "[Array][Index][Out][Of][Bounds][Exception]"
dump("1234567890".split("(?<=\\G.{4})"));
// "[1234][5678][90]"
// Split at the end of each run of a repeated letter
dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
// "[Booo][yaaaa][h! Yipp][ieeee][!!]"
}
}
And yes, that is triply-nested assertion there in the last pattern.
Related questions
Java split is eating my characters.
Can you use zero-width matching regex in String split?
How do I convert CamelCase into human-readable names in Java?
Backreferences in lookbehind
See also
regular-expressions.info/Lookarounds
A very naive solution that doesn't involve regex would be to perform a string replace on your delimiter, along these lines (assuming a comma as the delimiter):
fullString.replace(",", "~,~")
Here you can replace the tilde (~) with any unique marker that cannot occur in the text.
If you then split on that marker, I believe you will get the desired result.
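Spelled out in Java, the idea might look like this. It is a sketch that assumes the marker string never occurs in the input; the class name ReplaceThenSplit is made up.

```java
import java.util.regex.Pattern;

class ReplaceThenSplit {
    static String[] split(String text, String delimiter, String marker) {
        // surround every delimiter with the marker (marker must not occur in text)
        String marked = text.replace(delimiter, marker + delimiter + marker);
        // split on the marker; quote it in case it contains regex metacharacters
        return marked.split(Pattern.quote(marker));
    }
}
```

ReplaceThenSplit.split("a,b,c", ",", "~") gives a, ",", b, ",", c. Two adjacent delimiters produce an empty string between them, which may or may not be what you want.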
import java.util.regex.*;
import java.util.LinkedList;
public class Splitter {
private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");
private Pattern pattern;
private boolean keep_delimiters;
public Splitter(Pattern pattern, boolean keep_delimiters) {
this.pattern = pattern;
this.keep_delimiters = keep_delimiters;
}
public Splitter(String pattern, boolean keep_delimiters) {
this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
}
public Splitter(Pattern pattern) { this(pattern, true); }
public Splitter(String pattern) { this(pattern, true); }
public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
public Splitter() { this(DEFAULT_PATTERN); }
public String[] split(String text) {
if (text == null) {
text = "";
}
int last_match = 0;
LinkedList<String> splitted = new LinkedList<String>();
Matcher m = this.pattern.matcher(text);
while (m.find()) {
splitted.add(text.substring(last_match,m.start()));
if (this.keep_delimiters) {
splitted.add(m.group());
}
last_match = m.end();
}
splitted.add(text.substring(last_match));
return splitted.toArray(new String[splitted.size()]);
}
public static void main(String[] argv) {
if (argv.length != 2) {
System.err.println("Syntax: java Splitter <pattern> <text>");
return;
}
Pattern pattern = null;
try {
pattern = Pattern.compile(argv[0]);
}
catch (PatternSyntaxException e) {
System.err.println(e);
return;
}
Splitter splitter = new Splitter(pattern);
String text = argv[1];
int counter = 1;
for (String part : splitter.split(text)) {
System.out.printf("Part %d: \"%s\"\n", counter++, part);
}
}
}
/*
Example:
> java Splitter "\W+" "Hello World!"
Part 1: "Hello"
Part 2: " "
Part 3: "World"
Part 4: "!"
Part 5: ""
*/
I don't really like the other way, where you get an empty element in front and back. A delimiter is usually not at the beginning or at the end of the string, thus you most often end up wasting two good array slots.
Edit: Fixed limit cases. Commented source with test cases can be found here: http://snippets.dzone.com/posts/show/6453
Pass true as the third argument. The delimiters will then be returned as well.
new StringTokenizer(str, delimiters, true);
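A short sketch of what that looks like in use (the class and method names are invented for the example). Keep in mind that StringTokenizer treats the delimiter argument as a set of single characters, not as a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerWithDelimiters {
    static List<String> tokens(String text, String delimiterChars) {
        List<String> result = new ArrayList<>();
        // third argument true: each delimiter character is returned as its own token
        StringTokenizer st = new StringTokenizer(text, delimiterChars, true);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }
}
```

For "Hello, world. Hi!" with the delimiters ",.!" this returns Hello, then the comma, then " world", the dot, " Hi" and the exclamation mark, in that order.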
I know this is a very old question and an answer has already been accepted, but I would still like to submit a very simple answer to the original question. Consider this code:
String str = "Hello-World:How\nAre You&doing";
String[] inputs = str.split("(?!^)\\b");
for (int i=0; i<inputs.length; i++) {
System.out.println("a[" + i + "] = \"" + inputs[i] + '"');
}
OUTPUT:
a[0] = "Hello"
a[1] = "-"
a[2] = "World"
a[3] = ":"
a[4] = "How"
a[5] = "
"
a[6] = "Are"
a[7] = " "
a[8] = "You"
a[9] = "&"
a[10] = "doing"
I am just using word boundary \b to delimit the words except when it is start of text.
I got here late, but returning to the original question, why not just use lookarounds?
Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));
output:
[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]
EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code:
{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }
I hope that's easier to read. Thanks for the heads-up, @finnw.
I had a look at the above answers and honestly I find none of them satisfactory. What you want to do is essentially mimic the Perl split functionality. Why Java doesn't allow this and have a join() method somewhere is beyond me, but I digress. You don't even need a class for this really. It's just a function. Run this sample program:
Some of the earlier answers have excessive null-checking, which I recently wrote about in a response to a question here:
https://stackoverflow.com/users/18393/cletus
Anyway, the code:
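import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Split {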
public static List<String> split(String s, String pattern) {
assert s != null;
assert pattern != null;
return split(s, Pattern.compile(pattern));
}
public static List<String> split(String s, Pattern pattern) {
assert s != null;
assert pattern != null;
Matcher m = pattern.matcher(s);
List<String> ret = new ArrayList<String>();
int start = 0;
while (m.find()) {
ret.add(s.substring(start, m.start()));
ret.add(m.group());
start = m.end();
}
ret.add(start >= s.length() ? "" : s.substring(start));
return ret;
}
private static void testSplit(String s, String pattern) {
System.out.printf("Splitting '%s' with pattern '%s'%n", s, pattern);
List<String> tokens = split(s, pattern);
System.out.printf("Found %d matches%n", tokens.size());
int i = 0;
for (String token : tokens) {
System.out.printf(" %d/%d: '%s'%n", ++i, tokens.size(), token);
}
System.out.println();
}
public static void main(String args[]) {
testSplit("abcdefghij", "z"); // "abcdefghij"
testSplit("abcdefghij", "f"); // "abcde", "f", "ghij"
testSplit("abcdefghij", "j"); // "abcdefghi", "j", ""
testSplit("abcdefghij", "a"); // "", "a", "bcdefghij"
testSplit("abcdefghij", "[bdfh]"); // "a", "b", "c", "d", "e", "f", "g", "h", "ij"
}
}
I like the idea of StringTokenizer because it is an Enumeration.
But it is also a legacy class, superseded by String.split, which returns a boring String[] (and does not include the delimiters).
So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.
A true regexp means it is not a 'character sequence' repeated to form the delimiter:
'o' will only match 'o', and split 'ooo' into three delimiters, with two empty strings in between:
[o], '', [o], '', [o]
But the regexp o+ will return the expected result when splitting "aooob"
[], 'a', [ooo], 'b', []
To use this StringTokenizerEx:
final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
// uses the split String detected and memorized in 'aString'
final String nextDelimiter = aStringTokenizerEx.getDelimiter();
}
The code of this class is available at DZone Snippets.
As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.
Note: (late 2009 edit)
The article Final Thoughts: Java Puzzler: Splitting Hairs does a good job of explaining the bizarre behavior of String.split().
Josh Bloch even commented in response to that article:
Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).
The Google common library Guava also contains a Splitter which is:
simpler to use
maintained by Google (and not by you)
So it may be worth checking out. From their initial rough documentation (pdf):
JDK has this:
String[] pieces = "foo.bar".split("\\.");
It's fine to use this if you want exactly what it does:
- regular expression
- result as an array
- its way of handling empty pieces
Mini-puzzler: ",a,,b,".split(",") returns...
(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above
Answer: (e) None of the above.
",a,,b,".split(",")
returns
"", "a", "", "b"
Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)
In any case, our Splitter is simply more flexible: The default behavior is simplistic:
Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]
If you want extra features, ask for them!
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]
Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.
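As for the JDK workaround hinted at in the mini-puzzler above: the two-argument form of String.split with a negative limit keeps trailing empty strings. A small sketch to check it (class and method names are made up):

```java
import java.util.Arrays;

class TrailingEmpties {
    // limit 0 (the default): trailing empty strings are removed
    static String[] dropTrailing(String s, String regex) {
        return s.split(regex);
    }

    // negative limit: the pattern is applied as often as possible and nothing is removed
    static String[] keepTrailing(String s, String regex) {
        return s.split(regex, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(dropTrailing(",a,,b,", ","))); // 4 elements
        System.out.println(Arrays.toString(keepTrailing(",a,,b,", ","))); // 5 elements
    }
}
```

The leading empty string is kept in both cases; only the trailing one differs.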
Here is a simple, clean implementation which is consistent with Pattern#split and works with variable-length patterns, which lookbehind cannot support, and it is easier to use. It is similar to the solution provided by @cletus.
public static String[] split(CharSequence input, String pattern) {
return split(input, Pattern.compile(pattern));
}
public static String[] split(CharSequence input, Pattern pattern) {
Matcher matcher = pattern.matcher(input);
int start = 0;
List<String> result = new ArrayList<>();
while (matcher.find()) {
result.add(input.subSequence(start, matcher.start()).toString());
result.add(matcher.group());
start = matcher.end();
}
if (start != input.length()) result.add(input.subSequence(start, input.length()).toString());
return result.toArray(new String[0]);
}
I don't do null checks here, Pattern#split doesn't, why should I. I don't like the if at the end but it is required for consistency with the Pattern#split . Otherwise I would unconditionally append, resulting in an empty string as the last element of the result if the input string ends with the pattern.
I convert to String[] for consistency with Pattern#split, I use new String[0] rather than new String[result.size()], see here for why.
Here are my tests:
@Test
public void splitsVariableLengthPattern() {
String[] result = Split.split("/foo/$bar/bas", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar", "/bas" }, result);
}
@Test
public void splitsEndingWithPattern() {
String[] result = Split.split("/foo/$bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar" }, result);
}
@Test
public void splitsStartingWithPattern() {
String[] result = Split.split("$foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "", "$foo", "/bar" }, result);
}
@Test
public void splitsNoMatchesPattern() {
String[] result = Split.split("/foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/bar" }, result);
}
I will post my working versions as well (the first is really similar to Markus's).
public static String[] splitIncludeDelimeter(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
int now, old = 0;
while(matcher.find()){
now = matcher.end();
list.add(text.substring(old, now));
old = now;
}
if(list.size() == 0)
return new String[]{text};
//adding rest of a text as last element
String finalElement = text.substring(old);
list.add(finalElement);
return list.toArray(new String[list.size()]);
}
And here is the second solution, which is around 50% faster than the first one:
public static String[] splitIncludeDelimeter2(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
StringBuffer stringBuffer = new StringBuffer();
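while(matcher.find()){
// quoteReplacement keeps '$' and '\' in the match from being treated as group references
matcher.appendReplacement(stringBuffer, Matcher.quoteReplacement(matcher.group()));
list.add(stringBuffer.toString());
stringBuffer.setLength(0); //clear buffer
}
matcher.appendTail(stringBuffer); // append the rest of the string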
list.add(stringBuffer.toString());
return list.toArray(new String[list.size()]);
}
Another candidate solution using a regex. Retains token order, correctly matches multiple tokens of the same type in a row. The downside is that the regex is kind of nasty.
package javaapplication2;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaApplication2 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
String num = "58.5+variable-+98*78/96+a/78.7-3443*12-3";
// Terrifying regex:
// (a)|(b)|(c) match a or b or c
// where
// (a) is one or more digits optionally followed by a decimal point
// followed by one or more digits: (\d+(\.\d+)?)
// (b) is one of the set + * / - occurring once: ([+*/-])
// (c) is a sequence of one or more lowercase latin letter: ([a-z]+)
Pattern tokenPattern = Pattern.compile("(\\d+(\\.\\d+)?)|([+*/-])|([a-z]+)");
Matcher tokenMatcher = tokenPattern.matcher(num);
List<String> tokens = new ArrayList<>();
while (!tokenMatcher.hitEnd()) {
if (tokenMatcher.find()) {
tokens.add(tokenMatcher.group());
} else {
// report error
break;
}
}
System.out.println(tokens);
}
}
Sample output:
[58.5, +, variable, -, +, 98, *, 78, /, 96, +, a, /, 78.7, -, 3443, *, 12, -, 3]
I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):
static String[] splitWithDelimiters(String s) {
if (s == null || s.length() == 0) {
return new String[0];
}
LinkedList<String> result = new LinkedList<String>();
StringBuilder sb = null;
boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
for (char c : s.toCharArray()) {
if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
if (sb != null) {
result.add(sb.toString());
}
sb = new StringBuilder();
wasLetterOrDigit = !wasLetterOrDigit;
}
sb.append(c);
}
result.add(sb.toString());
return result.toArray(new String[0]);
}
I suggest using Pattern and Matcher, which will almost certainly achieve what you want. Your regular expression will need to be somewhat more complicated than what you are using in String.split.
I don't think it is possible with String#split, but you can use a StringTokenizer, though that won't allow you to define your delimiter as a regex, only as a set of single characters:
new StringTokenizer("Hello, world. Hi!", ",.!", true); // true for returnDelims
If you can afford it, use Java's replace(CharSequence target, CharSequence replacement) method to insert another delimiter to split on.
Example:
I want to split the string "boo:and:foo" and keep each ':' attached to the string on its right.
String str = "boo:and:foo";
str = str.replace(":","newdelimiter:");
String[] tokens = str.split("newdelimiter");
Important note: This only works if you have no further "newdelimiter" in your String! Thus, it is not a general solution.
But if you know a CharSequence of which you can be sure that it will never appear in the String, this is a very simple solution.
Quick answer: use zero-width boundaries like \b to split. I will experiment to see if it works (I have used that in PHP and JS).
It is possible, and it kind of works, but it might split too much. Actually, it depends on the string you want to split and the result you need. Give more details and we will be able to help you better.
Another way is to do your own split, capturing the delimiter (supposing it is variable) and adding it afterward to the result.
My quick test:
String str = "'ab','cd','eg'";
String[] stra = str.split("\\b");
for (String s : stra) System.out.print(s + "|");
System.out.println();
Result:
'|ab|','|cd|','|eg|'|
A bit too much... :-)
Tweaked Pattern.split() to include matched pattern to the list
Added
// add match to the list
matchList.add(input.subSequence(start, end).toString());
Full source
public static String[] inclusiveSplit(String input, String re, int limit) {
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<String>();
Pattern pattern = Pattern.compile(re);
Matcher m = pattern.matcher(input);
// Add segments before each match found
while (m.find()) {
int end = m.end();
if (!matchLimited || matchList.size() < limit - 1) {
int start = m.start();
String match = input.subSequence(index, start).toString();
matchList.add(match);
// add match to the list
matchList.add(input.subSequence(start, end).toString());
index = end;
} else if (matchList.size() == limit - 1) { // last one
String match = input.subSequence(index, input.length())
.toString();
matchList.add(match);
index = end;
}
}
// If no match was found, return this
if (index == 0)
return new String[] { input.toString() };
// Add remaining segment
if (!matchLimited || matchList.size() < limit)
matchList.add(input.subSequence(index, input.length()).toString());
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize - 1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
}
Here's a groovy version based on some of the code above, in case it helps. It's short, anyway. Conditionally includes the head and tail (if they are not empty). The last part is a demo/test case.
List splitWithTokens(str, pat) {
def tokens=[]
def lastMatch=0
def m = str=~pat
while (m.find()) {
if (m.start() > 0) tokens << str[lastMatch..<m.start()]
tokens << m.group()
lastMatch=m.end()
}
if (lastMatch < str.length()) tokens << str[lastMatch..<str.length()]
tokens
}
[['<html><head><title>this is the title</title></head>',/<[^>]+>/],
['before<html><head><title>this is the title</title></head>after',/<[^>]+>/]
].each {
println splitWithTokens(*it)
}
An extremely naive and inefficient solution which works nevertheless. Use split twice on the string and then merge the two arrays, alternating their elements:
String temp[]=str.split("\\W");
String temp2[]=str.split("\\w||\\s");
int i=0;
for(String string:temp)
System.out.println(string);
String temp3[]=new String[temp.length-1];
for(String string:temp2)
{
System.out.println(string);
if((string.equals("")!=true)&&(string.equals("\\s")!=true))
{
temp3[i]=string;
i++;
}
// System.out.println(temp.length);
// System.out.println(temp2.length);
}
System.out.println(temp3.length);
String[] temp4=new String[temp.length+temp3.length];
int j=0;
for(i=0;i<temp.length;i++)
{
temp4[j]=temp[i];
j=j+2;
}
j=1;
for(i=0;i<temp3.length;i++)
{
temp4[j]=temp3[i];
j+=2;
}
for(String s:temp4)
System.out.println(s);
String expression = "((A+B)*C-D)*E";
expression = expression.replaceAll("\\+", "~+~");
expression = expression.replaceAll("\\*", "~*~");
expression = expression.replaceAll("-", "~-~");
expression = expression.replaceAll("/+", "~/~");
expression = expression.replaceAll("\\(", "~(~"); //also you can use [(] instead of \\(
expression = expression.replaceAll("\\)", "~)~"); //also you can use [)] instead of \\)
expression = expression.replaceAll("~~", "~");
if(expression.startsWith("~")) {
expression = expression.substring(1);
}
String[] expressionArray = expression.split("~");
System.out.println(Arrays.toString(expressionArray));
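The same token list can probably be produced without a placeholder character by splitting on zero-width lookarounds. This is only a sketch: the class name OperatorSplit is made up, and it assumes Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string.

```java
import java.util.Arrays;

// Hypothetical helper: splits an expression and keeps operators and parentheses as tokens.
final class OperatorSplit {
    // Split points: just after or just before any of - + * / ( )
    private static final String AROUND_OPERATORS = "(?<=[-+*/()])|(?=[-+*/()])";

    static String[] tokens(String expression) {
        return expression.split(AROUND_OPERATORS);
    }

    public static void main(String[] args) {
        // prints [(, (, A, +, B, ), *, C, -, D, ), *, E]
        System.out.println(Arrays.toString(tokens("((A+B)*C-D)*E")));
    }
}
```

Unlike the replaceAll chain, this does not depend on "~" being absent from the expression.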
One of the subtleties in this question involves the "leading delimiter" question: if you are going to have a combined array of tokens and delimiters you have to know whether it starts with a token or a delimiter. You could of course just assume that a leading delim should be discarded but this seems an unjustified assumption. You might also want to know whether you have a trailing delim or not. This sets two boolean flags accordingly.
Written in Groovy but a Java version should be fairly obvious:
String tokenRegex = /[\p{L}\p{N}]+/ // a String in Groovy, Unicode alphanumeric
def finder = phraseForTokenising =~ tokenRegex
// NB in Groovy the variable 'finder' is then of class java.util.regex.Matcher
def finderIt = finder.iterator() // extra method added to Matcher by Groovy magic
int start = 0
boolean leadingDelim, trailingDelim
def combinedTokensAndDelims = [] // create an array in Groovy
while( finderIt.hasNext() )
{
def token = finderIt.next()
int finderStart = finder.start()
String delim = phraseForTokenising[ start .. finderStart - 1 ]
// Groovy: above gets slice of String/array
if( start == 0 ) leadingDelim = finderStart != 0
if( start > 0 || leadingDelim ) combinedTokensAndDelims << delim
combinedTokensAndDelims << token // add element to end of array
start = finder.end()
}
// start == 0 indicates no tokens found
if( start > 0 ) {
// finish by seeing whether there is a trailing delim
trailingDelim = start < phraseForTokenising.length()
if( trailingDelim ) combinedTokensAndDelims << phraseForTokenising[ start .. -1 ]
println( "leading delim? $leadingDelim, trailing delim? $trailingDelim, combined array:\n $combinedTokensAndDelims" )
}
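For anyone who wants the Java version, a rough translation of the Groovy above might look like this. The class and field names (TokensAndDelims, parts, leadingDelim, trailingDelim) are my own choices, and the token pattern is the same Unicode letter/digit class used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical result holder: tokens and delimiters interleaved, plus the two flags.
final class TokensAndDelims {
    final List<String> parts = new ArrayList<>();
    boolean leadingDelim;
    boolean trailingDelim;

    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    static TokensAndDelims of(String phrase) {
        TokensAndDelims result = new TokensAndDelims();
        Matcher m = TOKEN.matcher(phrase);
        int start = 0;               // end of the previous token
        boolean foundToken = false;
        while (m.find()) {
            if (!foundToken) {
                result.leadingDelim = m.start() > 0;
            }
            if (m.start() > start) { // text between two tokens (or before the first)
                result.parts.add(phrase.substring(start, m.start()));
            }
            result.parts.add(m.group());
            start = m.end();
            foundToken = true;
        }
        // As in the Groovy version, nothing is recorded if no token was found.
        if (foundToken && start < phrase.length()) {
            result.trailingDelim = true;
            result.parts.add(phrase.substring(start));
        }
        return result;
    }
}
```

For the input ", hello world!" this should give the parts ", ", "hello", " ", "world", "!" with both flags set.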
If you want to keep the delimiter character, you can work around split() by re-appending the character to each piece, or insert a line break before it with replaceAll() instead.
See this example:
public class SplitExample {
public static void main(String[] args) {
String str = "Javathomettt";
System.out.println("method 1");
System.out.println("Returning words:");
String[] arr = str.split("t", 40);
for (String w : arr) {
System.out.println(w+"t");
}
System.out.println("Split array length: "+arr.length);
System.out.println("method 2");
System.out.println(str.replaceAll("t", "\n"+"t"));
}
}
I don't know Java too well, but if you can't find a Split method that does that, I suggest you just make your own.
string[] mySplit(string s,string delimiter)
{
string[] result = s.Split(delimiter);
for(int i=0;i<result.Length-1;i++)
{
result[i] += delimiter; // appends the delimiter to every item except the last one;
// you can modify it however you want
}
return result;
}
string[] res = mySplit(myString,myDelimiter);
It's not too elegant, but it'll do.

How would you improve the efficiency of this regex

I think my regex pattern I have used could be tidied up and look a little neater but my knowledge of regular expressions is limited. I would like to scan and match a series of letters and numbers on new lines from an input file.
import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.*;
public class App {
public static void main(String[] args) throws Exception {
if (args.length == 1) {
String fileName = args[0];
String fileContent = new Scanner(new File(fileName))
.useDelimiter("\\Z").next();
ArrayList<Integer> parsedContent = new ArrayList<>();
parsedContent = parseContentFromFileContent(fileContent);
int firstInt = parsedContent.get(0);
int secondInt = parsedContent.get(1);
int thirdInt = parsedContent.get(2);
int fourthInt = parsedContent.get(3);
int fifthInt = parsedContent.get(4);
System.out.println("First: " + firstInt);
System.out.println("Second: " + secondInt);
System.out.println("Third: " + thirdInt);
System.out.println("Fourth: " + fourthInt);
System.out.println("Fifth: " + fifthInt);
return;
}
}
public static ArrayList<Integer> parseContentFromFileContent(String fileContent) {
ArrayList<Integer> parsedInts = new ArrayList<>();
String pattern = "(.+?).((?:\\d*\\.)?\\d+)?\\n..((?:\\d*\\.)?\\d+)?\\n(.+?).((?:\\d*\\.)?\\d+)";
Pattern p = Pattern.compile(pattern, Pattern.DOTALL);
Matcher m = p.matcher(fileContent);
if (m.matches()) {
// Group 1: Has to match two letters
switch (m.group(1)) {
case "ab":
parsedInts.add(1);
break;
case "cd":
parsedInts.add(2);
break;
case "ef":
parsedInts.add(3);
break;
}
// Group 2: Has to match a number
parsedInts.add(Integer.parseInt(m.group(2)));
// Group 3: Has to match a number (the value after "A=")
parsedInts.add(Integer.parseInt(m.group(3)));
// Group 4: Has to match a single letter
switch (m.group(4)) {
case "a":
parsedInts.add(1);
break;
case "b":
parsedInts.add(2);
break;
case "c":
parsedInts.add(3);
break;
}
// Group 5: Has to match a number
parsedInts.add(Integer.parseInt(m.group(5)));
}
return parsedInts;
}
}
Input file:
ab-123 // Group 1 - Two letters a-z and Group 2 - Number
A=1 // Group 3 - Always A= [number]
a-1 // Group 4 - Letter a-z and Group 5 - Number
cd-1234
A=2
b-2
ef-12345
a=4
c-3
gh-123456
a=4
d-4
Is there a better (cleaner) regex pattern I could use to capture the data from the file above?
pattern = (.+?).((?:\\d*\\.)?\\d+)?\\n..((?:\\d*\\.)?\\d+)?\\n(.+?).((?:\\d*\\.)?\\d+)
Your pattern at the moment isn't very precise compared with the description you gave. It uses .+? in several places, but your description quite clearly says "two letters" or "always A=", so you could use exactly that in your pattern. Your pattern also accounts for decimal numbers, although there are none in the input shown, so you might be able to drop (?:\\d*\\.)?. Furthermore, all your number-matching groups are optional, but according to your description they shouldn't be.
If one takes your pattern quite literally, a possible pattern would be
([a-z]{2})-(\\d+)\\n[Aa]=(\\d+)\\n([a-z])-(\\d+)
See https://regex101.com/r/WNxUQa/1
Note that you might have to adjust your pattern a bit (e.g. using ^ and $), if there might be malicious input.
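As a rough sketch of that adjustment, the pattern can be anchored with ^ and $ under Pattern.MULTILINE and applied with find() so that every three-line record in the file is picked up. The class name RecordParser is hypothetical, and the sketch assumes plain "\n" line endings and that the "// Group ..." annotations are not part of the real file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the three-line records shown in the question.
final class RecordParser {
    // MULTILINE lets ^ and $ match at line boundaries, so a record must
    // start at the beginning of a line and end at the end of one.
    private static final Pattern RECORD = Pattern.compile(
            "^([a-z]{2})-(\\d+)\\n[Aa]=(\\d+)\\n([a-z])-(\\d+)$",
            Pattern.MULTILINE);

    // Returns the five captured groups of each record, in file order.
    static List<String[]> parse(String content) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(content);
        while (m.find()) {
            records.add(new String[] {
                m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)});
        }
        return records;
    }
}
```

The pattern is compiled once as a constant, so repeated calls to parse() do not pay the compilation cost again.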
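There is really not much to optimise in a regular expression unless it backtracks heavily and you can remove that backtracking. You can tidy up the way it looks, but for a simple pattern like this one, equivalent expressions will perform about the same. (In theory, regular expressions that do the same thing correspond to equivalent DFAs. java.util.regex is a backtracking matcher rather than a DFA, which is why unnecessary backtracking is the main thing worth removing.)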

Get certain data from text - Java

I am creating a Bukkit plugin for Minecraft and I need to know a few things before I move on.
I want to check whether a text has this layout: "B:10 S:5", for example.
It stands for Buy:amount and Sell:amount.
What is the easiest way to check that it follows the syntax?
The amounts can be any number that is 0 or greater.
Another problem is getting this data out of the text. How can I read the text after B: and S: and return it as an integer?
I have not tried anything yet because I have no idea where to start.
Thanks for the help!
In the simple problem you gave, you can get away with a simple answer. Otherwise, see the regex answer below.
boolean test(String str){
try{
//String str = "B:10 S:5";
String[] arr = str.split(" ");//split to left and right of space = [B:10,S:5]
String[] bArr = arr[0].split(":");//split ...first colon = [B,10]
String[] sArr = arr[1].split(":");//... second colon = [S,5]
//need to use try/catch here in case the string is not an int value.
String labelB = bArr[0];
Integer b = Integer.parseInt(bArr[1]);
String labelS = sArr[0];
Integer s = Integer.parseInt(sArr[1]);
}catch( Exception e){return false;}
return true;
}
See my answer here for a related task. More related details below.
How can I parse a string for a set?
Essentially, you need to use regex and iterate through the groups. In case the grammar is not always B and S, I made this more abstract. I also made it more tolerant of extra white space in the middle, should there be any. The pattern says there are 4 groups (indicated by parentheses): label1, number1, label2, and number2. + means 1 or more. [] means a set of characters. a-z is a range of characters (don't put anything between A-Z and a-z). There are other ways of writing alphabetic and numeric patterns, but these are easier to read.
//this is expensive
Pattern p=Pattern.compile("([A-Za-z]+):([0-9]+)[ ]+([A-Za-z]+):([0-9]+)");
boolean test(String txt){
Matcher m=p.matcher(txt);
if(!m.matches())return false;
int groups=m.groupCount();//should equal 4 here: groupCount() does not count group 0 (the whole match), but you can test this
System.out.println("Matched: " + m.group(0));
//Label1 = m.group(1);
//val1 = m.group(2);
//Label2 = m.group(3);
//val2 = m.group(4);
return true;
}
Use a regular expression.
In your case, ^B:(\d+) S:(\d+)$ is enough.
In Java, to use a regular expression:
import java.util.regex.Pattern;
public class RegExExample {
public static void main(String[] args) {
// In a Java string literal the backslash must be doubled: \\d
Pattern p = Pattern.compile("^B:(\\d+) S:(\\d+)$");
for (int i = 0; i < args.length; i++)
if (p.matcher(args[i]).matches())
System.out.println("ARGUMENT #" + i + " IS VALID!");
else
System.out.println("ARGUMENT #" + i + " IS INVALID!");
}
}
This sample program takes its inputs from the command line, validates each one against the pattern and prints the result to STDOUT.
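The question also asks how to get the two amounts out as integers. A minimal sketch of that step, reading the capturing groups after a successful match, could look like the following. The class name PriceSign and the choice to return null for invalid text are my own assumptions, not anything Bukkit requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: validates "B:<n> S:<n>" and extracts both numbers.
final class PriceSign {
    // matches() already requires the whole input to match, so no ^ or $ is needed.
    private static final Pattern SIGN = Pattern.compile("B:(\\d+) S:(\\d+)");

    /** Returns {buy, sell}, or null if the text does not follow the layout. */
    static int[] parse(String text) {
        Matcher m = SIGN.matcher(text);
        if (!m.matches()) {
            return null;
        }
        try {
            return new int[] {Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))};
        } catch (NumberFormatException e) {
            return null; // only digits, but too large to fit in an int
        }
    }
}
```

A call such as PriceSign.parse("B:10 S:5") should then yield 10 and 5, while anything that deviates from the layout yields null.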

How to split a string, but also keep the delimiters?

I have a multiline string which is delimited by a set of different delimiters:
(Text1)(DelimiterA)(Text2)(DelimiterC)(Text3)(DelimiterB)(Text4)
I can split this string into its parts, using String.split, but it seems that I can't get the actual string, which matched the delimiter regex.
In other words, this is what I get:
Text1
Text2
Text3
Text4
This is what I want:
Text1
DelimiterA
Text2
DelimiterC
Text3
DelimiterB
Text4
Is there any JDK way to split the string using a delimiter regex but also keep the delimiters?
You can use lookahead and lookbehind, which are features of regular expressions.
System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));
And you will get:
[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]
The last one is what you want.
((?<=;)|(?=;)) matches the empty string immediately before or immediately after a ;.
EDIT: Fabian Steeg's comments on readability are valid. Readability is always a problem with regular expressions. One thing I do to make regular expressions more readable is to create a variable whose name describes what the regular expression does. You can even put placeholders (e.g. %1$s) in it and use Java's String.format to replace the placeholders with the actual string you need to use; for example:
static public final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";
public void someMethod() {
final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
...
}
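One caveat with the placeholder approach: the delimiter is pasted into the regex as-is, so a delimiter such as "." or "+" would be interpreted as a metacharacter. A possible way around this, sketched here under the assumption that the delimiter is meant literally, is to pass it through Pattern.quote first. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper: splits on a literal delimiter and keeps it as its own element.
final class KeepDelimiter {
    static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

    static String[] splitKeeping(String text, String literalDelimiter) {
        // Pattern.quote wraps the delimiter in \Q...\E so metacharacters are taken literally.
        String quoted = Pattern.quote(literalDelimiter);
        return text.split(String.format(WITH_DELIMITER, quoted));
    }

    public static void main(String[] args) {
        // prints [a, ., b, ., c]
        System.out.println(Arrays.toString(splitKeeping("a.b.c", ".")));
    }
}
```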
You want to use lookarounds, and split on zero-width matches. Here are some examples:
public class SplitNDump {
static void dump(String[] arr) {
for (String s : arr) {
System.out.format("[%s]", s);
}
System.out.println();
}
public static void main(String[] args) {
dump("1,234,567,890".split(","));
// "[1][234][567][890]"
dump("1,234,567,890".split("(?=,)"));
// "[1][,234][,567][,890]"
dump("1,234,567,890".split("(?<=,)"));
// "[1,][234,][567,][890]"
dump("1,234,567,890".split("(?<=,)|(?=,)"));
// "[1][,][234][,][567][,][890]"
dump(":a:bb::c:".split("(?=:)|(?<=:)"));
// "[][:][a][:][bb][:][:][c][:]"
dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
// "[:][a][:][bb][:][:][c][:]"
dump(":::a::::b b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
// "[:::][a][::::][b b][::][c][:]"
dump("a,bb:::c d..e".split("(?!^)\\b"));
// "[a][,][bb][:::][c][ ][d][..][e]"
dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
// "[Array][Index][Out][Of][Bounds][Exception]"
dump("1234567890".split("(?<=\\G.{4})"));
// "[1234][5678][90]"
// Split at the end of each run of letters
dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
// "[Booo][yaaaa][h! Yipp][ieeee][!!]"
}
}
And yes, that is a triply-nested assertion in the last pattern.
Related questions
Java split is eating my characters.
Can you use zero-width matching regex in String split?
How do I convert CamelCase into human-readable names in Java?
Backreferences in lookbehind
See also
regular-expressions.info/Lookarounds
A very naive solution that doesn't involve regex would be to perform a string replace on your delimiter, along these lines (assuming a comma as the delimiter):
fullString.replace(",", "~,~")
where you can replace the tilde (~) with any suitable delimiter that is guaranteed not to occur in the text.
If you then split on the new delimiter, I believe you will get the desired result.
import java.util.regex.*;
import java.util.LinkedList;
public class Splitter {
private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");
private Pattern pattern;
private boolean keep_delimiters;
public Splitter(Pattern pattern, boolean keep_delimiters) {
this.pattern = pattern;
this.keep_delimiters = keep_delimiters;
}
public Splitter(String pattern, boolean keep_delimiters) {
this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
}
public Splitter(Pattern pattern) { this(pattern, true); }
public Splitter(String pattern) { this(pattern, true); }
public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
public Splitter() { this(DEFAULT_PATTERN); }
public String[] split(String text) {
if (text == null) {
text = "";
}
int last_match = 0;
LinkedList<String> splitted = new LinkedList<String>();
Matcher m = this.pattern.matcher(text);
while (m.find()) {
splitted.add(text.substring(last_match,m.start()));
if (this.keep_delimiters) {
splitted.add(m.group());
}
last_match = m.end();
}
splitted.add(text.substring(last_match));
return splitted.toArray(new String[splitted.size()]);
}
public static void main(String[] argv) {
if (argv.length != 2) {
System.err.println("Syntax: java Splitter <pattern> <text>");
return;
}
Pattern pattern = null;
try {
pattern = Pattern.compile(argv[0]);
}
catch (PatternSyntaxException e) {
System.err.println(e);
return;
}
Splitter splitter = new Splitter(pattern);
String text = argv[1];
int counter = 1;
for (String part : splitter.split(text)) {
System.out.printf("Part %d: \"%s\"\n", counter++, part);
}
}
}
/*
Example:
> java Splitter "\W+" "Hello World!"
Part 1: "Hello"
Part 2: " "
Part 3: "World"
Part 4: "!"
Part 5: ""
*/
I don't really like the other way, where you get an empty element in front and back. A delimiter is usually not at the beginning or at the end of the string, thus you most often end up wasting two good array slots.
Edit: Fixed limit cases. Commented source with test cases can be found here: http://snippets.dzone.com/posts/show/6453
Pass "true" as the third argument. It will return the delimiters as well.
StringTokenizer(String str, String delimiters, true);
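A small usage sketch of that constructor, with a made-up wrapper class so it can be run on its own. Note that each delimiter character comes back as a separate one-character token, and spaces stay attached to the words unless you list ' ' as a delimiter too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical wrapper: collects tokens and delimiters in order.
final class TokenizerWithDelims {
    static List<String> tokens(String text, String delimiterChars) {
        List<String> result = new ArrayList<>();
        // Third argument true = return the delimiters as tokens as well.
        StringTokenizer st = new StringTokenizer(text, delimiterChars, true);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hello, ,,  world, .,  Hi, !]
        System.out.println(tokens("Hello, world. Hi!", ",.!"));
    }
}
```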
I know this is a very old question and an answer has already been accepted, but I would still like to submit a very simple answer to the original question. Consider this code:
String str = "Hello-World:How\nAre You&doing";
String[] inputs = str.split("(?!^)\\b");
for (int i=0; i<inputs.length; i++) {
System.out.println("a[" + i + "] = \"" + inputs[i] + '"');
}
OUTPUT:
a[0] = "Hello"
a[1] = "-"
a[2] = "World"
a[3] = ":"
a[4] = "How"
a[5] = "
"
a[6] = "Are"
a[7] = " "
a[8] = "You"
a[9] = "&"
a[10] = "doing"
I am just using word boundary \b to delimit the words except when it is start of text.
I got here late, but returning to the original question, why not just use lookarounds?
Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));
output:
[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]
EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look if I were declaring them in source code:
{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }
I hope that's easier to read. Thanks for the heads-up, @finnw.
I had a look at the above answers and honestly I don't find any of them satisfactory. What you want to do is essentially mimic the Perl split functionality. Why Java doesn't allow this and have a join() method somewhere is beyond me, but I digress. You don't even need a class for this really. It's just a function. Run this sample program:
Some of the earlier answers have excessive null-checking, a topic I recently wrote about in response to a question here:
https://stackoverflow.com/users/18393/cletus
Anyway, the code:
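import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Split {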
public static List<String> split(String s, String pattern) {
assert s != null;
assert pattern != null;
return split(s, Pattern.compile(pattern));
}
public static List<String> split(String s, Pattern pattern) {
assert s != null;
assert pattern != null;
Matcher m = pattern.matcher(s);
List<String> ret = new ArrayList<String>();
int start = 0;
while (m.find()) {
ret.add(s.substring(start, m.start()));
ret.add(m.group());
start = m.end();
}
ret.add(start >= s.length() ? "" : s.substring(start));
return ret;
}
private static void testSplit(String s, String pattern) {
System.out.printf("Splitting '%s' with pattern '%s'%n", s, pattern);
List<String> tokens = split(s, pattern);
System.out.printf("Found %d matches%n", tokens.size());
int i = 0;
for (String token : tokens) {
System.out.printf(" %d/%d: '%s'%n", ++i, tokens.size(), token);
}
System.out.println();
}
public static void main(String args[]) {
testSplit("abcdefghij", "z"); // "abcdefghij"
testSplit("abcdefghij", "f"); // "abcde", "f", "ghij"
testSplit("abcdefghij", "j"); // "abcdefghi", "j", ""
testSplit("abcdefghij", "a"); // "", "a", "bcdefghij"
testSplit("abcdefghij", "[bdfh]"); // "a", "b", "c", "d", "e", "f", "g", "h", "ij"
}
}
I like the idea of StringTokenizer because it is an Enumeration.
But it is also a legacy class, superseded by String.split, which returns a boring String[] (and does not include the delimiters).
So I implemented a StringTokenizerEx, which is an Iterable and takes a true regexp to split a string.
A true regexp means the delimiter is not a set of characters that may repeat to form one delimiter:
'o' will only match a single 'o', and will split 'ooo' into three delimiters with two empty strings between them:
[o], '', [o], '', [o]
But the regexp o+ will return the expected result when splitting "aooob":
[], 'a', [ooo], 'b', []
To use this StringTokenizerEx:
final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
// uses the split String detected and memorized in 'aString'
final String nextDelimiter = aStringTokenizerEx.getDelimiter();
}
The code of this class is available at DZone Snippets.
As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.
Note: (late 2009 edit)
The article Final Thoughts: Java Puzzler: Splitting Hairs does a good job of explaining the bizarre behavior of String.split().
Josh Bloch even commented in response to that article:
Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).
The Google common library Guava also contains a Splitter, which is:
simpler to use
maintained by Google (and not by you)
So it may be worth checking out. From their initial rough documentation (pdf):
JDK has this:
String[] pieces = "foo.bar".split("\\.");
It's fine to use this if you want exactly what it does:
- regular expression
- result as an array
- its way of handling empty pieces
Mini-puzzler: ",a,,b,".split(",") returns...
(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above
Answer: (e) None of the above.
",a,,b,".split(",")
returns
"", "a", "", "b"
Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)
In any case, our Splitter is simply more flexible: The default behavior is simplistic:
Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]
If you want extra features, ask for them!
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]
Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.
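As for the workaround the mini-puzzler hints at: with the plain JDK, passing a negative limit to String.split keeps the trailing empty strings. A tiny sketch (the class name is made up) that shows both behaviours side by side:

```java
import java.util.Arrays;

// Hypothetical demo class for the trailing-empties behaviour of String.split.
final class TrailingEmpties {
    // Default behaviour (limit 0): trailing empty strings are removed.
    static String[] defaultSplit(String s) {
        return s.split(",");
    }

    // Negative limit: the pattern is applied as often as possible and nothing is discarded.
    static String[] keepTrailing(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(defaultSplit(",a,,b,"))); // [, a, , b]
        System.out.println(Arrays.toString(keepTrailing(",a,,b,"))); // [, a, , b, ]
    }
}
```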
Here is a simple, clean implementation which is consistent with Pattern#split and works with variable-length patterns, which lookbehind cannot support, and it is easier to use. It is similar to the solution provided by @cletus.
public static String[] split(CharSequence input, String pattern) {
return split(input, Pattern.compile(pattern));
}
public static String[] split(CharSequence input, Pattern pattern) {
Matcher matcher = pattern.matcher(input);
int start = 0;
List<String> result = new ArrayList<>();
while (matcher.find()) {
result.add(input.subSequence(start, matcher.start()).toString());
result.add(matcher.group());
start = matcher.end();
}
if (start != input.length()) result.add(input.subSequence(start, input.length()).toString());
return result.toArray(new String[0]);
}
I don't do null checks here; Pattern#split doesn't, so why should I? I don't like the if at the end, but it is required for consistency with Pattern#split. Otherwise I would unconditionally append, resulting in an empty string as the last element of the result if the input string ends with the pattern.
I convert to String[] for consistency with Pattern#split, I use new String[0] rather than new String[result.size()], see here for why.
Here are my tests:
@Test
public void splitsVariableLengthPattern() {
String[] result = Split.split("/foo/$bar/bas", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar", "/bas" }, result);
}
@Test
public void splitsEndingWithPattern() {
String[] result = Split.split("/foo/$bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/", "$bar" }, result);
}
@Test
public void splitsStartingWithPattern() {
String[] result = Split.split("$foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "", "$foo", "/bar" }, result);
}
@Test
public void splitsNoMatchesPattern() {
String[] result = Split.split("/foo/bar", "\\$\\w+");
Assert.assertArrayEquals(new String[] { "/foo/bar" }, result);
}
I will post my working versions too (the first is really similar to Markus's).
public static String[] splitIncludeDelimeter(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
int now, old = 0;
while(matcher.find()){
now = matcher.end();
list.add(text.substring(old, now));
old = now;
}
if(list.size() == 0)
return new String[]{text};
//adding rest of a text as last element
String finalElement = text.substring(old);
list.add(finalElement);
return list.toArray(new String[list.size()]);
}
And here is the second solution, which is around 50% faster than the first one:
public static String[] splitIncludeDelimeter2(String regex, String text){
List<String> list = new LinkedList<>();
Matcher matcher = Pattern.compile(regex).matcher(text);
StringBuffer stringBuffer = new StringBuffer();
while(matcher.find()){
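// quoteReplacement keeps '$' and '\' in the matched text from being read as group references
matcher.appendReplacement(stringBuffer, Matcher.quoteReplacement(matcher.group()));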
list.add(stringBuffer.toString());
stringBuffer.setLength(0); //clear buffer
}
matcher.appendTail(stringBuffer); // append the rest of the string
list.add(stringBuffer.toString());
return list.toArray(new String[list.size()]);
}
Another candidate solution using a regex. Retains token order, correctly matches multiple tokens of the same type in a row. The downside is that the regex is kind of nasty.
package javaapplication2;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaApplication2 {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
String num = "58.5+variable-+98*78/96+a/78.7-3443*12-3";
// Terrifying regex:
// (a)|(b)|(c) match a or b or c
// where
// (a) is one or more digits optionally followed by a decimal point
// followed by one or more digits: (\d+(\.\d+)?)
// (b) is one of the set + * / - occurring once: ([+*/-])
// (c) is a sequence of one or more lowercase latin letter: ([a-z]+)
Pattern tokenPattern = Pattern.compile("(\\d+(\\.\\d+)?)|([+*/-])|([a-z]+)");
Matcher tokenMatcher = tokenPattern.matcher(num);
List<String> tokens = new ArrayList<>();
while (!tokenMatcher.hitEnd()) {
if (tokenMatcher.find()) {
tokens.add(tokenMatcher.group());
} else {
// report error
break;
}
}
System.out.println(tokens);
}
}
Sample output:
[58.5, +, variable, -, +, 98, *, 78, /, 96, +, a, /, 78.7, -, 3443, *, 12, -, 3]
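If you also need to know what kind of token was matched (for instance, before evaluating the expression), named capturing groups (Java 7 and later) can make that check more readable. This is only a sketch built on the same three alternatives; the class name TypedTokens and the "KIND:text" output format are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical classifier: labels each token as NUMBER, OPERATOR or NAME.
final class TypedTokens {
    private static final Pattern TOKEN = Pattern.compile(
            "(?<number>\\d+(?:\\.\\d+)?)|(?<operator>[+*/-])|(?<name>[a-z]+)");

    static List<String> classify(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the named groups is non-null for each match.
            if (m.group("number") != null) {
                result.add("NUMBER:" + m.group());
            } else if (m.group("operator") != null) {
                result.add("OPERATOR:" + m.group());
            } else {
                result.add("NAME:" + m.group());
            }
        }
        return result;
    }
}
```

Be aware that a plain find() loop like this silently skips characters that match none of the alternatives, so add your own error reporting if unknown characters must be rejected.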
I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):
static String[] splitWithDelimiters(String s) {
if (s == null || s.length() == 0) {
return new String[0];
}
LinkedList<String> result = new LinkedList<String>();
StringBuilder sb = null;
boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
for (char c : s.toCharArray()) {
if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
if (sb != null) {
result.add(sb.toString());
}
sb = new StringBuilder();
wasLetterOrDigit = !wasLetterOrDigit;
}
sb.append(c);
}
result.add(sb.toString());
return result.toArray(new String[0]);
}
I suggest using Pattern and Matcher, which will almost certainly achieve what you want. Your regular expression will need to be somewhat more complicated than what you are using in String.split.
I don't think it is possible with String#split, but you can use a StringTokenizer. It won't let you define the delimiter as a regex, only as a set of single characters:
new StringTokenizer("Hello, world. Hi!", ",.!", true); // true for returnDelims
If you can afford it, use Java's replace(CharSequence target, CharSequence replacement) method to insert an extra marker next to each delimiter, and then split on that marker.
Example:
I want to split the string "boo:and:foo" and keep each ':' attached to the string on its right.
String str = "boo:and:foo";
str = str.replace(":","newdelimiter:");
String[] tokens = str.split("newdelimiter");
Important note: This only works if you have no further "newdelimiter" in your String! Thus, it is not a general solution.
But if you know a CharSequence of which you can be sure that it will never appear in the String, this is a very simple solution.
Quick answer: split on zero-width boundaries such as \b. I will experiment to see whether it works in Java (I have used it in PHP and JS).
It is possible and it sort of works, but it might split too much. It really depends on the string you want to split and the result you need. Give more details and we can help you better.
Another way is to write your own split that captures the delimiter (supposing it is variable) and adds it to the result afterwards.
My quick test:
String str = "'ab','cd','eg'";
String[] stra = str.split("\\b");
for (String s : stra) System.out.print(s + "|");
System.out.println();
Result:
'|ab|','|cd|','|eg|'|
A bit too much... :-)
Tweaked Pattern.split() to include matched pattern to the list
Added
// add match to the list
matchList.add(input.subSequence(start, end).toString());
Full source
public static String[] inclusiveSplit(String input, String re, int limit) {
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<String>();
Pattern pattern = Pattern.compile(re);
Matcher m = pattern.matcher(input);
// Add segments before each match found
while (m.find()) {
int end = m.end();
if (!matchLimited || matchList.size() < limit - 1) {
int start = m.start();
String match = input.subSequence(index, start).toString();
matchList.add(match);
// add match to the list
matchList.add(input.subSequence(start, end).toString());
index = end;
} else if (matchList.size() == limit - 1) { // last one
String match = input.subSequence(index, input.length())
.toString();
matchList.add(match);
index = end;
}
}
// If no match was found, return this
if (index == 0)
return new String[] { input.toString() };
// Add remaining segment
if (!matchLimited || matchList.size() < limit)
matchList.add(input.subSequence(index, input.length()).toString());
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize - 1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
}
Here's a Groovy version based on some of the code above, in case it helps. It's short, anyway. It conditionally includes the head and tail (if they are not empty). The last part is a demo/test case.
List splitWithTokens(str, pat) {
def tokens=[]
def lastMatch=0
def m = str=~pat
while (m.find()) {
if (m.start() > 0) tokens << str[lastMatch..<m.start()]
tokens << m.group()
lastMatch=m.end()
}
if (lastMatch < str.length()) tokens << str[lastMatch..<str.length()]
tokens
}
[['<html><head><title>this is the title</title></head>',/<[^>]+>/],
['before<html><head><title>this is the title</title></head>after',/<[^>]+>/]
].each {
println splitWithTokens(*it)
}
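For readers who want plain Java, here is a rough port of the Groovy function above. It is a sketch, not a line-for-line translation: it skips every empty segment between matches (the Groovy code only checks m.start() > 0), and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitWithTokens {
    public static List<String> splitWithTokens(String str, String pat) {
        List<String> tokens = new ArrayList<>();
        int lastMatch = 0; // end index of the previous match
        Matcher m = Pattern.compile(pat).matcher(str);
        while (m.find()) {
            // text between the previous match and this one, skipped when empty
            if (m.start() > lastMatch) {
                tokens.add(str.substring(lastMatch, m.start()));
            }
            tokens.add(m.group()); // the delimiter itself
            lastMatch = m.end();
        }
        // tail after the last match, if any
        if (lastMatch < str.length()) {
            tokens.add(str.substring(lastMatch));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitWithTokens(
                "before<html><head><title>this is the title</title></head>after", "<[^>]+>"));
        // prints [before, <html>, <head>, <title>, this is the title, </title>, </head>, after]
    }
}
```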
An extremely naive and inefficient solution which works nevertheless. Use split twice on the string (once for the words, once for the delimiters) and then interleave the two arrays:
String temp[]=str.split("\\W");
String temp2[]=str.split("\\w||\\s");
int i=0;
for(String string:temp)
System.out.println(string);
String temp3[]=new String[temp.length-1];
for(String string:temp2)
{
System.out.println(string);
if((string.equals("")!=true)&&(string.equals("\\s")!=true))
{
temp3[i]=string;
i++;
}
// System.out.println(temp.length);
// System.out.println(temp2.length);
}
System.out.println(temp3.length);
String[] temp4=new String[temp.length+temp3.length];
int j=0;
for(i=0;i<temp.length;i++)
{
temp4[j]=temp[i];
j=j+2;
}
j=1;
for(i=0;i<temp3.length;i++)
{
temp4[j]=temp3[i];
j+=2;
}
for(String s:temp4)
System.out.println(s);
String expression = "((A+B)*C-D)*E";
expression = expression.replaceAll("\\+", "~+~");
expression = expression.replaceAll("\\*", "~*~");
expression = expression.replaceAll("-", "~-~");
expression = expression.replaceAll("/", "~/~");
expression = expression.replaceAll("\\(", "~(~"); //also you can use [(] instead of \\(
expression = expression.replaceAll("\\)", "~)~"); //also you can use [)] instead of \\)
expression = expression.replaceAll("~~", "~");
if(expression.startsWith("~")) {
expression = expression.substring(1);
}
String[] expressionArray = expression.split("~");
System.out.println(Arrays.toString(expressionArray));
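The same tokens can probably be produced in one step by splitting on zero-width lookarounds around the operator characters instead of inserting ~ markers. A sketch, assuming Java 8 or later and single-character operators; the class and method names are made up:

```java
import java.util.Arrays;

public class ExpressionTokens {
    public static String[] tokenize(String expression) {
        // split at every empty position that has an operator or parenthesis on either side
        return expression.split("(?<=[-+*/()])|(?=[-+*/()])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("((A+B)*C-D)*E")));
        // prints [(, (, A, +, B, ), *, C, -, D, ), *, E]
    }
}
```

Operands longer than one character stay together, for example "(12+3)*x" gives (, 12, +, 3, ), *, x.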
One of the subtleties in this question involves the "leading delimiter" question: if you are going to have a combined array of tokens and delimiters, you have to know whether it starts with a token or a delimiter. You could of course just assume that a leading delimiter should be discarded, but this seems an unjustified assumption. You might also want to know whether you have a trailing delimiter or not. The code below sets two boolean flags accordingly.
Written in Groovy but a Java version should be fairly obvious:
String tokenRegex = /[\p{L}\p{N}]+/ // a String in Groovy, Unicode alphanumeric
def finder = phraseForTokenising =~ tokenRegex
// NB in Groovy the variable 'finder' is then of class java.util.regex.Matcher
def finderIt = finder.iterator() // extra method added to Matcher by Groovy magic
int start = 0
boolean leadingDelim, trailingDelim
def combinedTokensAndDelims = [] // create an array in Groovy
while( finderIt.hasNext() )
{
def token = finderIt.next()
int finderStart = finder.start()
String delim = phraseForTokenising[ start .. finderStart - 1 ]
// Groovy: above gets slice of String/array
if( start == 0 ) leadingDelim = finderStart != 0
if( start > 0 || leadingDelim ) combinedTokensAndDelims << delim
combinedTokensAndDelims << token // add element to end of array
start = finder.end()
}
// start == 0 indicates no tokens found
if( start > 0 ) {
// finish by seeing whether there is a trailing delim
trailingDelim = start < phraseForTokenising.length()
if( trailingDelim ) combinedTokensAndDelims << phraseForTokenising[ start .. -1 ]
println( "leading delim? $leadingDelim, trailing delim? $trailingDelim, combined array:\n $combinedTokensAndDelims" )
}
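Since the Java version is only described as "fairly obvious", here is one possible sketch of it. The class and field names are made up, and the result is exposed as fields rather than printed, which is a design choice of this sketch and not part of the Groovy code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokensAndDelims {
    // Unicode letters and digits, same token definition as the Groovy version
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    public final List<String> combined = new ArrayList<>();
    public boolean leadingDelim;
    public boolean trailingDelim;

    public TokensAndDelims(String phrase) {
        Matcher finder = TOKEN.matcher(phrase);
        int start = 0; // end index of the previous token
        while (finder.find()) {
            if (finder.start() > start) {
                // non-empty text before the first token means a leading delimiter
                if (start == 0) leadingDelim = true;
                combined.add(phrase.substring(start, finder.start()));
            }
            combined.add(finder.group());
            start = finder.end();
        }
        // start == 0 means no token was found; like the Groovy code, do nothing then
        if (start > 0 && start < phrase.length()) {
            trailingDelim = true;
            combined.add(phrase.substring(start));
        }
    }

    public static void main(String[] args) {
        TokensAndDelims t = new TokensAndDelims("...Pete like 5+3.");
        System.out.println("leading delim? " + t.leadingDelim
                + ", trailing delim? " + t.trailingDelim
                + ", combined: " + t.combined);
    }
}
```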
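If you want to keep the delimiter character, use the split method with a limit argument, a loophole in .split(): a positive limit keeps the trailing empty strings, so the delimiter can be appended back to each piece.
See this example: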
public class SplitExample {
public static void main(String[] args) {
String str = "Javathomettt";
System.out.println("method 1");
System.out.println("Returning words:");
String[] arr = str.split("t", 40);
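for (int i = 0; i < arr.length; i++) {
// re-append the delimiter to every piece except the last, which has none after it
System.out.println(i < arr.length - 1 ? arr[i] + "t" : arr[i]);
}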
System.out.println("Split array length: "+arr.length);
System.out.println("method 2");
System.out.println(str.replaceAll("t", "\n"+"t"));
}
}
I don't know Java too well, but if you can't find a split method that does that, I suggest you just write your own (C#-style code):
string[] mySplit(string s,string delimiter)
{
string[] result = s.Split(delimiter);
for(int i=0;i<result.Length-1;i++)
{
result[i] += delimiter; // appends the delimiter to every item except the last;
// you can modify it however you want
}
return result;
}
string[] res = mySplit(myString,myDelimiter);
It's not too elegant, but it'll do.
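A Java rendering of the same idea might look like the sketch below. It assumes the delimiter is literal text rather than a regex (hence Pattern.quote) and passes a limit of -1 so that trailing empty pieces are kept and no delimiter is lost; the class name is made up.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MySplit {
    public static String[] mySplit(String s, String delimiter) {
        // limit -1 keeps trailing empty strings, so joining the pieces rebuilds the input
        String[] result = s.split(Pattern.quote(delimiter), -1);
        for (int i = 0; i < result.length - 1; i++) {
            result[i] += delimiter; // every piece except the last gets its delimiter back
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mySplit("boo:and:foo", ":")));
        // prints [boo:, and:, foo]
    }
}
```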
