Passing multiple delimiters to StringTokenizer constructor - java

I have seen that the syntax for passing multiple delimiters (eg. '.' , '?', '!') to the StringTokenizer constructor is:
StringTokenizer obj=new StringTokenizer(str,".?!");
What I don't understand is this: I have enclosed all the delimiters together in double quotes, so doesn't that make it a String rather than individual characters? How does the StringTokenizer class identify them as separate characters? Why is ".?!" not treated as a single delimiter?

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code.
So forget about it.
It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
So use String#split instead.
String[] elements = str.split("\\.\\?!"); // treats ".?!" as a single delimiter
String[] elements2 = str.split("[.?!]"); // three delimiters
If you miss StringTokenizer's Enumeration nature, get an Iterator.
Iterator<String> iterator = Arrays.asList(elements).iterator();
while (iterator.hasNext()) {
String next = iterator.next();
// ...
}
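As a rough illustration of the difference between the two split calls above, here is a minimal sketch. The class name SplitRegexDemo and its two helper methods are made up for this example.

```java
import java.util.Arrays;

class SplitRegexDemo {
    // Character class: splits on any single '.', '?' or '!'.
    static String[] splitOnAny(String text) {
        return text.split("[.?!]");
    }

    // Escaped sequence: splits only where the three characters ".?!" appear together.
    static String[] splitOnSequence(String text) {
        return text.split("\\.\\?!");
    }

    public static void main(String[] args) {
        // prints [Hi, How are you, Fine]
        System.out.println(Arrays.toString(splitOnAny("Hi.How are you?Fine!")));
        // prints [one, two.three]
        System.out.println(Arrays.toString(splitOnSequence("one.?!two.three")));
    }
}
```

Note that split drops trailing empty strings, which is why the final "!" does not produce an extra empty element.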
How does the StringTokenizer class identify them as separate characters?
It's an implementation detail and it shouldn't be your concern. There are a couple of ways to do that. They use String#charAt(int) and String#codePointAt(int).
Why is ".?!" not treated as a single delimiter?
That's the choice they've made: "We will take a String and we will be looking for delimiters there." The Javadoc makes it clear.
/**
* @param str a string to be parsed.
* @param delim the delimiters.
* @param returnDelims flag indicating whether to return the delimiters
* as tokens.
* @exception NullPointerException if str is <CODE>null</CODE>
*/
public StringTokenizer(String str, String delim, boolean returnDelims) {

That's just how StringTokenizer is defined. Take a look at the Javadoc:
Constructs a string tokenizer for the specified string. All characters in the delim argument are the delimiters for separating tokens.
Also, in the source code you will find the delimiterCodePoints field, described as follows:
/**
* When hasSurrogates is true, delimiters are converted to code
* points and isDelimiter(int) is used to determine if the given
* codepoint is a delimiter.
*/
private int[] delimiterCodePoints;
So basically, when the delimiters contain surrogates, each delimiter character is converted to its int code point and stored in the array; the array is then used to decide whether a given character is a delimiter.

It's true that you pass a single string rather than individual characters, but what is done with that string is up to the StringTokenizer. The StringTokenizer takes each character from your delimiter string and uses each one as a delimiter. This way, you can split the string on multiple different delimiters without having to run the tokenizer more than once.
You can see the documentation for this function here where it states:
The characters in the delim argument are the delimiters for separating tokens.
If you don't pass anything in for this parameter, it defaults to " \t\n\r\f", which is basically just whitespace.
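A small sketch of that behaviour, using only the standard java.util.StringTokenizer API. The class name TokenizerDelimiterDemo and the helper tokens are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDelimiterDemo {
    // Collects every token produced when each character of delims is a delimiter.
    static List<String> tokens(String text, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // '.', '?' and '!' each end a token on their own.
        // prints [Hi, How are you, Fine]
        System.out.println(tokens("Hi.How are you?Fine!", ".?!"));
    }
}
```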

How does the StringTokenizer class identify them as separate characters?
There are methods in String called charAt and codePointAt, which return the character or code point at an index:
"abc".charAt(0) // 'a'
The StringTokenizer's implementation will use both of these methods on the delimiters passed in at some point. In my version of the JDK, the code points of the delimiters string are extracted and added to an array delimiterCodePoints in a method called setMaxDelimCodePoint, which is called by the constructor:
private void setMaxDelimCodePoint() {
// ...
if (hasSurrogates) {
delimiterCodePoints = new int[count];
for (int i = 0, j = 0; i < count; i++, j += Character.charCount(c)) {
c = delimiters.codePointAt(j); // <--- notice this line
delimiterCodePoints[i] = c;
}
}
}
And then this array is accessed in the isDelimiter method, which decides whether a character is a delimiter:
private boolean isDelimiter(int codePoint) {
for (int i = 0; i < delimiterCodePoints.length; i++) {
if (delimiterCodePoints[i] == codePoint) {
return true;
}
}
return false;
}
Of course, this is not the only way that the API could be designed. The constructor could have accepted an array of char as delimiters instead, but I am not qualified to say why the designers did it this way.
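To make the idea concrete, here is a simplified sketch of how a delimiter string can be treated as a set of characters. This is not the JDK's implementation; the class CharSetSplitSketch and its split method are hypothetical, and the lookup uses String.indexOf rather than a code point array.

```java
import java.util.ArrayList;
import java.util.List;

class CharSetSplitSketch {
    // Treats every code point in delims as a separate delimiter.
    static List<String> split(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (delims.indexOf(cp) >= 0) {
                // A delimiter ends the current token (empty tokens are skipped).
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

The point is only that the String argument is a convenient container: the code never compares against ".?!" as a whole, only against its individual characters.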
Why is ".?!" not treated as a single delimiter?
StringTokenizer only supports single character delimiters. If you want a string as a delimiter, you can use Scanner or String.split instead. For both of these, the delimiter is represented as a regular expression, so you have to use "\\.\\?!" instead. You can learn more about regular expressions here

Related

Java String Split with multiple delimiter using pipe '|'

I am trying to break a string b = "x+yi" into two integers x and y.
This is my original answer.
Here I removed trailing 'i' character with substring method:
int Integerpart = (int)(new Integer(b.split("\\+")[0]));
int Imaginary = (int)(new Integer((b.split("\\+")[1]).
substring(0, b.split("\\+")[1].length() - 1)));
But I found that the code below works just the same:
int x = (int)(new Integer(b.split("\\+|i")[0]));
int y = (int)(new Integer(b.split("\\+|i")[1]));
Is there something special with '|'? I looked up documentation and many other questions but I couldn't find the answer.
The split() method takes a regular expression that controls the split. In a regular expression, '|' separates alternatives, so "\\+|i" matches either a "+" or an "i". You can also try
"[+i]". The square brackets mark a character class, in this case "+" and "i".
Because split() drops trailing empty strings, "x+yi" ends up as just "x" and "y". Regular expressions also offer search and capture capabilities. Look at String.matches(String regex).
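For the capture route, a minimal sketch could look like the following. The class name ComplexParseSketch, the method parse, and the assumption that the input is always of the form x+yi with integer parts are all mine, not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ComplexParseSketch {
    // Two capturing groups: the real part and the imaginary part before the trailing 'i'.
    private static final Pattern COMPLEX = Pattern.compile("(-?\\d+)\\+(-?\\d+)i");

    // Returns {x, y} for an input of the form "x+yi".
    static int[] parse(String text) {
        Matcher m = COMPLEX.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not of the form x+yi: " + text);
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }
}
```

Unlike the split versions, this rejects malformed input instead of silently returning odd pieces.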
The linked question explains how delimiters work:
How do I use a delimiter in Java Scanner?
Another alternative
You can use the useDelimiter(String pattern) method of the Scanner class. In the example below, the slash (/) is the delimiter used to tokenize the String passed to the Scanner constructor.
There are three tokens in the String "Anna Mills/Female/18": name, gender, and age. The Scanner splits the String and prints the tokens to the console.
import java.util.Scanner;
/*
* This is a Java example that shows how to use the useDelimiter(String pattern)
* method of the Scanner class. We use the string / as the delimiter
* when tokenizing a String input passed to the Scanner constructor
*/
public class ScannerUseDelimiterDemo {
public static void main(String[] args) {
// Initialize Scanner object
Scanner scan = new Scanner("Anna Mills/Female/18");
// initialize the string delimiter
scan.useDelimiter("/");
// Printing the delimiter used
System.out.println("The delimiter used is "+scan.delimiter());
// Printing the tokenized Strings
while(scan.hasNext()){
System.out.println(scan.next());
}
// closing the scanner stream
scan.close();
}
}

How can I split string with string array delimiters, keeping delimiter in JAVA

What I wanna do is
str = "POW(MIN(100.21,123)*34,2)";
customSplit(str, new String[] {"POW","MIN","(",")",",","*","+"});
result :
POW
(
MIN
(
100.21
123
)
*
34
2
)
delimiter is not a char but string
has multiple delimiter
retain delimiter as token too
Why split on POW and MIN? Aren't they just operation names that can change? If so you might try to split on the special characters only, e.g. like this: str.split("(?=[()*,+])|(?<=[()*,+])") (of course there probably are a lot more special characters so feel free to expand those classes)
This makes use of zero-length look-aheads and look-behinds, i.e. it splits before or after one of the characters in those character classes but doesn't consume the characters.
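Here is a short sketch of that lookaround split applied to the example string. The class name LookaroundSplitDemo and the method tokenize are invented for illustration.

```java
import java.util.Arrays;

class LookaroundSplitDemo {
    // Splits before or after any of ( ) * , + without consuming the character itself.
    static String[] tokenize(String expression) {
        return expression.split("(?=[()*,+])|(?<=[()*,+])");
    }

    public static void main(String[] args) {
        // prints [POW, (, MIN, (, 100.21, ,, 123, ), *, 34, ,, 2, )]
        System.out.println(Arrays.toString(tokenize("POW(MIN(100.21,123)*34,2)")));
    }
}
```

Each special character comes back as its own token, including the commas, and names such as POW and MIN fall out as whatever text sits between the special characters.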
public static String[] customSplitter(String str, String[] delimiters) {
StringBuilder multisplitRegexBuff = new StringBuilder("(");
int c;
for (c = 0; c < delimiters.length - 1; c++) {
multisplitRegexBuff.append(Pattern.quote(delimiters[c])).append("|");
}
multisplitRegexBuff.append(Pattern.quote(delimiters[c])).append(")");
return str.replaceAll(multisplitRegexBuff.toString(), "\n$1\n").replaceAll("\n\n", "\n").split("\n");
}
Explanation:
customSplitter accepts a String and a String[] of delimiters, and returns a String[] of tokens in which every delimiter is kept as its own token.
For the example case, the method ends up doing the equivalent of:
str.replaceAll("(POW|MIN|\\(|\\)|\\,|\\*|\\+)", "\n$1\n").replaceAll("\n\n", "\n").split("\n")
The algorithm first builds the pattern from the entries of the delimiter array (quoting each one with Pattern.quote), and then applies replaceAll and split.
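An alternative sketch that keeps the same quoted alternation but walks the matches with a Matcher instead of inserting newline placeholders, so input that itself contains newlines is not affected. The class KeepDelimiterSplitSketch and the method splitKeeping are hypothetical names, and the sketch assumes at least one non-empty delimiter is passed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepDelimiterSplitSketch {
    // Returns the text between delimiters and the delimiters themselves, in order.
    static List<String> splitKeeping(String text, String... delimiters) {
        StringJoiner alternation = new StringJoiner("|");
        for (String d : delimiters) {
            alternation.add(Pattern.quote(d)); // quote so "(" or "*" are taken literally
        }
        Matcher m = Pattern.compile(alternation.toString()).matcher(text);
        List<String> tokens = new ArrayList<>();
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                tokens.add(text.substring(last, m.start())); // text before the delimiter
            }
            tokens.add(m.group()); // the delimiter itself
            last = m.end();
        }
        if (last < text.length()) {
            tokens.add(text.substring(last)); // trailing text after the last delimiter
        }
        return tokens;
    }
}
```

Called as splitKeeping("POW(MIN(100.21,123)*34,2)", "POW", "MIN", "(", ")", ",", "*", "+"), it yields the tokens in order with no empty strings.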

Replace parts of a string in Java

I need to replace parts of a string by looking up the System properties.
For example, consider the string It was {var1} beauty killed {var2}
I need to parse the string and replace every word enclosed in braces with its value from the System properties. If System.getProperty() returns null, the word should simply be replaced with an empty string. This is pretty straightforward when I know the variables ahead of time, but the string I need to parse is not defined in advance: I don't know how many variables it contains or what their names are. Assuming a simple, well-formatted string (no nested braces, every opening brace has a matching close), what is the simplest or most elegant way to parse the string and replace all the character sequences enclosed in braces?
The only solution I could come up with is to traverse the string from the first character, note the start and end positions of each pair of braces, replace the string between them, and continue until reaching the end of the string. Is there a simpler way to do this?
You can use the braces to break the initial string into substrings, and then replace every other substring.
String[] substituteValues = {"the", "str", "other", "another"};
int substituteValuesIndex = 0;
String test = "Here is {var1} string called {var2}";
// split the string up into substrings
test = test.replaceAll("\\}", "\\{");
String[] splitString = test.split("\\{");
// now sub in your values
for (int k=1; k < splitString.length; k = k+2) {
splitString[k] = substituteValues[substituteValuesIndex];
substituteValuesIndex++;
}
String result = "";
for (String s : splitString) {
result = result + s;
}
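Another possible approach, sketched here under my own assumptions, is to let a regex Matcher find each {name} and look the name up directly. The class PlaceholderExpandSketch and the method expand are invented names; the lookup is passed in as a function so that System::getProperty can be supplied by the caller.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderExpandSketch {
    // Matches {name}, where name is one or more characters other than braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^{}]+)\\}");

    // Replaces each {name} with lookup.apply(name), or with "" when the lookup returns null.
    static String expand(String template, Function<String, String> lookup) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = lookup.apply(m.group(1));
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(value == null ? "" : value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("It was {var1} beauty killed {var2}", System::getProperty));
    }
}
```

This handles any number of variables without tracking positions by hand, and names that are not defined simply disappear from the output.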

Splitting strings based on a delimiter

I am trying to break apart a very simple collection of strings that come in the forms of
0|0
10|15
30|55
etc. Essentially, numbers that are separated by pipes.
When I use Java's String split function with .split("|"), I get somewhat unpredictable results: white space in the first slot, and sometimes the number itself isn't where I thought it should be.
Can anybody please help and give me advice on how I can use a regexp to keep ONLY the integers?
I was asked to give the code that does the actual split, so allow me to do that in the hope of clarifying my problem further :)
String temp = "0|0";
String[] splitString = temp.split("|");
results
\n
0
|
0
I am trying to get
0
0
only. Forever grateful for any help ahead of time :)
I still suggest using split(); it drops trailing empty tokens by default. If you get rid of the non-numeric characters in the string and keep only pipes and numbers, you can easily use split() to get what you want. Or you can pass multiple delimiters to split (in the form of a regex), and this should work:
String[] splited = yourString.split("[\\|\\s]+");
and the regex approach:
import java.util.regex.*;
Pattern pattern = Pattern.compile("\\d+"); // each run of digits is one number, including the last one in the input
Matcher matcher = pattern.matcher(yourString);
while (matcher.find()) {
System.out.println(matcher.group());
}
The pipe symbol is special in a regexp (it marks alternatives), you need to escape it. Depending on the java version you are using this could well explain your unpredictable results.
class t {
public static void main(String[] args)
{
String temp = "0|0";
String[] splitString = temp.split("\\|");
for (int i=0; i<splitString.length; i++)
System.out.println("splitString["+i+"] is " + splitString[i]);
}
}
outputs
splitString[0] is 0
splitString[1] is 0
Note that one backslash is the regexp escape character, but because a backslash is also the escape character in java source you need two of them to push the backslash into the regexp.
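If the double backslash feels error-prone, Pattern.quote produces a literal pattern for you. The following is a small sketch; the class PipeSplitDemo and the method parsePair are names made up for this example, and it assumes every piece between pipes is a valid integer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PipeSplitDemo {
    // Splits on a literal '|' and converts each piece to an int.
    static int[] parsePair(String text) {
        String[] parts = text.split(Pattern.quote("|")); // same effect as split("\\|")
        int[] numbers = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            numbers[i] = Integer.parseInt(parts[i].trim());
        }
        return numbers;
    }

    public static void main(String[] args) {
        // prints [10, 15]
        System.out.println(Arrays.toString(parsePair("10|15")));
    }
}
```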
You can replace the white space with pipes and then split on the (escaped) pipe.
String test = "0|0 10|15 30|55";
test = test.replace(" ", "|");
String[] result = test.split("\\|"); // the pipe must be escaped in a regex
Hope this helps.
You can use StringTokenizer.
String test = "0|0";
StringTokenizer st = new StringTokenizer(test, "|"); // pass "|" explicitly; the default delimiters are whitespace only
int firstNumber = Integer.parseInt(st.nextToken()); //will parse out the first number
int secondNumber = Integer.parseInt(st.nextToken()); //will parse out the second number
Of course you can always nest this inside of a while loop if you have multiple strings.
Also, you need to import java.util.* for this to work.
The pipe ('|') is a special character in regular expressions. It needs to be "escaped" with a '\' character if you want to use it as a regular character, unfortunately '\' is a special character in Java so you need to do a kind of double escape maneuver e.g.
String temp = "0|0";
String[] splitStrings = temp.split("\\|");
The Guava library has a nice class Splitter which is a much more convenient alternative to String.split(). The advantages are that you can choose to split the string on specific characters (like '|'), or on specific strings, or with regexps, and you can choose what to do with the resulting parts (trim them, throw away empty parts etc.).
For example you can call
Iterable<String> parts = Splitter.on('|').trimResults().omitEmptyStrings().split("0|0");
This should work for you:
([0-9]+)
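One way to apply that pattern, sketched with the standard java.util.regex classes: find every run of digits and ignore whatever separates them. The class IntegerExtractSketch and the method extractIntegers are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntegerExtractSketch {
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // Collects every run of digits in the text as an Integer, in order of appearance.
    static List<Integer> extractIntegers(String text) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        // prints [0, 0, 10, 15, 30, 55]
        System.out.println(extractIntegers("0|0 10|15 30|55"));
    }
}
```

Because it searches for the numbers rather than splitting on the separators, pipes, spaces and newlines need no special handling.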
Consider a scenario where we have read a line from a CSV or XLS file as a string and need to separate the columns into an array of strings based on delimiters.
Below is a code snippet that does this.
{ ...
....
String stringLine = new BufferedReader(new FileReader("your file")).readLine();
String[] splittedString = StringSplitToArray(stringLine,"\"");
...
....
}
public static String[] StringSplitToArray(String stringToSplit, String delimiter)
{
StringBuffer token = new StringBuffer();
Vector tokens = new Vector();
char[] chars = stringToSplit.toCharArray();
for (int i=0; i < chars.length; i++) {
if (delimiter.indexOf(chars[i]) != -1) {
if (token.length() > 0) {
tokens.addElement(token.toString());
token.setLength(0);
i++;
}
} else {
token.append(chars[i]);
}
}
if (token.length() > 0) {
tokens.addElement(token.toString());
}
// convert the vector into an array
String[] preparedArray = new String[tokens.size()];
for (int i=0; i < preparedArray.length; i++) {
preparedArray[i] = (String)tokens.elementAt(i);
}
return preparedArray;
}
The snippet above calls StringSplitToArray, which converts the line into a String array by splitting it on the delimiter passed to the method. The delimiter can be a comma (,) or a double quote (").
For more on this, follow this link : http://scrapillars.blogspot.in

String splitting

I have a string. What is the best way to put the things in between the $ signs into a list in Java?
String temp = "$abc$and$xyz$";
How can I get all the variables within the $ signs as a list in Java?
[abc, xyz]
i can do using stringtokenizer but want to avoid using it if possible.
thx
Maybe you could think about calling String.split(String regex) ...
The pattern is simple enough that String.split should work here, but in the more general case, one alternative for StringTokenizer is the much more powerful java.util.Scanner.
String text = "$abc$and$xyz$";
Scanner sc = new Scanner(text);
while (sc.findInLine("\\$([^$]*)\\$") != null) {
System.out.println(sc.match().group(1));
} // abc, xyz
The pattern to find is:
\$([^$]*)\$
  \_____/     i.e. literal $, a sequence of anything but $ (captured in group 1)
     1        and another literal $
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
(…) is used for grouping. (pattern) is a capturing group and creates a backreference.
The backslash preceding the $ (outside of a character class definition) is used to escape the $, which has a special meaning as the end-of-line anchor. That backslash is doubled in a String literal ("\\" is a String of length one containing a backslash).
This is not a typical usage of Scanner (usually the delimiter pattern is set, and tokens are extracted using next), but it does show how you'd use findInLine to find an arbitrary pattern (ignoring delimiters), and then use match() to access the MatchResult, from which you can get individual group captures.
You can also use this Pattern in a Matcher find() loop directly.
Matcher m = Pattern.compile("\\$([^$]*)\\$").matcher(text);
while (m.find()) {
System.out.println(m.group(1));
} // abc, xyz
Related questions
Validating input using java.util.Scanner
Scanner vs. StringTokenizer vs. String.Split
Just try this one:temp.split("\\$");
I would go for a regex myself, like Riduidel said.
This special case is, however, simple enough that you can just treat the String as a character sequence, and iterate over it char by char, and detect the $ sign. And so grab the strings yourself.
On a side note, I would try to go for different demarcation characters, to make it more readable to humans. Use $ as start-of-sequence and something else as end-of-sequence, for instance. Or something like what I think the Bash shell uses: ${some_value}. As said, the computer doesn't care, but you debugging your string just might :)
As for an appropriate regex, something like (\\$.*\\$)* or so should do. Though I'm no expert on regexes (see http://www.regular-expressions.info for nice info on regexes).
Basically I'd ditto Khotyn as the easiest solution. I see you post on his answer that you don't want zero-length tokens at beginning and end.
That brings up the question: What happens if the string does not begin and end with $'s? Is that an error, or are they optional?
If it's an error, then just start with:
if (!text.startsWith("$") || !text.endsWith("$"))
return "Missing $'s"; // or whatever you do on error
If that passes, fall into the split.
If the $'s are optional, I'd just strip them out before splitting. i.e.:
if (text.startsWith("$"))
text=text.substring(1);
if (text.endsWith("$"))
text=text.substring(0,text.length()-1);
Then do the split.
Sure, you could make more sophisticated regex's or use StringTokenizer or no doubt come up with dozens of other complicated solutions. But why bother? When there's a simple solution, use it.
PS There's also the question of what result you want to see if there are two $'s in a row, e.g. "$foo$$bar$". Should that give ["foo","bar"], or ["foo","","bar"]? Khotyn's split will give the second result, with zero-length strings. If you want the first result, you should split("\\$+").
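As a sketch of that last variant: splitting on "\\$+" still leaves one empty string at the front when the text starts with $, so this version filters empty pieces instead of stripping the ends first. The class DollarSplitSketch and the method betweenDollars are names invented here.

```java
import java.util.ArrayList;
import java.util.List;

class DollarSplitSketch {
    // Splits on runs of '$' and keeps only the non-empty pieces.
    static List<String> betweenDollars(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split("\\$+")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [foo, bar]
        System.out.println(betweenDollars("$foo$$bar$"));
    }
}
```

Note that this treats every piece between $ signs as a token, so "$abc$and$xyz$" gives abc, and, xyz.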
If you want a simple split function then use Apache Commons Lang which has StringUtils.split. The java one uses a regex which can be overkill/confusing.
You can do it in simple manner writing your own code.
Just use the following code and it will do the job for you
import java.util.ArrayList;
import java.util.List;
public class MyStringTokenizer {
/**
* @param args
*/
public static void main(String[] args) {
List <String> result = getTokenizedStringsList("$abc$efg$hij$");
for(String token : result)
{
System.out.println(token);
}
}
private static List<String> getTokenizedStringsList(String string) {
List <String> tokenList = new ArrayList <String> ();
char [] in = string.toCharArray();
StringBuilder myBuilder = null;
int stringLength = in.length;
int start = -1;
int end = -1;
{
for(int i=0; i<stringLength;)
{
myBuilder = new StringBuilder();
while(i<stringLength && in[i] != '$')
i++;
i++;
while((i)<stringLength && in[i] != '$')
{
myBuilder.append(in[i]);
i++;
}
tokenList.add(myBuilder.toString());
}
}
return tokenList;
}
}
You can use
String temp = "$abc$and$xyz$";
String array[]=temp.split(Pattern.quote("$"));
List<String> list=new ArrayList<String>();
for(int i=0;i<array.length;i++){
list.add(array[i]);
}
Now the list has what you want.
