Equivalent to StringTokenizer with multiple characters delimiters

I am trying to split a String into tokens.
The token delimiters are not single characters, some delimiters are contained in others (for example, & and &&), and the delimiters must be returned as tokens too.
StringTokenizer cannot deal with multi-character delimiters. I presume it is possible with String.split, but I cannot work out the regular expression that suits my needs.
Any ideas?
Example:
Token delimiters: "&", "&&", "=", "=>", " "
String to tokenize: a & b&&c=>d
Expected result: a string array containing "a", " ", "&", " ", "b", "&&", "c", "=>", "d"
--- Edit ---
Thanks to all for your help; Dasblinkenlight gave me the solution. Here is the "ready to use" code I wrote with his help:
import java.util.*;
import java.util.regex.*;

private static String[] wonderfulTokenizer(String string, String[] delimiters) {
    // First, create a regular expression that matches the union of the delimiters.
    // Be aware that, when one delimiter starts with another (for example && and &),
    // the longer one must come before the shorter one (&& before &), or the regex
    // engine will recognize && as two &.
    // Reverse lexicographic order puts every delimiter before its own prefixes.
    // Note: this sorts the caller's array in place.
    Arrays.sort(delimiters, new Comparator<String>() {
        @Override
        public int compare(String o1, String o2) {
            return -o1.compareTo(o2);
        }
    });
    // Build a string that will contain the regular expression
    StringBuilder regexpr = new StringBuilder();
    regexpr.append('(');
    for (String delim : delimiters) { // For each delimiter
        if (regexpr.length() != 1) regexpr.append('|'); // Add union separator if needed
        // Quote the delimiter so that regex metacharacters are matched literally.
        // (Prefixing every character with a backslash would break on letters and
        // digits, because sequences such as \d or \b have a special meaning.)
        regexpr.append(Pattern.quote(delim));
    }
    regexpr.append(')'); // Close the union
    Pattern p = Pattern.compile(regexpr.toString());
    // Now, search for the tokens
    List<String> res = new ArrayList<String>();
    Matcher m = p.matcher(string);
    int pos = 0;
    while (m.find()) { // While there's a delimiter in the string
        if (pos != m.start()) {
            // If there's something between the current and the previous delimiter,
            // add it to the token list
            res.add(string.substring(pos, m.start()));
        }
        res.add(m.group()); // Add the delimiter
        pos = m.end(); // Remember the end of the delimiter
    }
    if (pos != string.length()) {
        // If some characters remain in the string after the last delimiter,
        // add them to the token list
        res.add(string.substring(pos));
    }
    // Return the result
    return res.toArray(new String[res.size()]);
}
If you have many strings to tokenize, this could be optimized by creating the Pattern only once.
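A minimal sketch of that optimization, assuming the delimiter set is fixed for the lifetime of the tokenizer. The class name DelimiterTokenizer and its methods are hypothetical and not part of the original post; it also assumes every delimiter is non-empty. The Pattern is compiled once in the constructor and reused for every call:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class: name and API are illustrative only.
final class DelimiterTokenizer {
    private final Pattern pattern;

    DelimiterTokenizer(String... delimiters) {
        // Work on a copy so the caller's array is not reordered.
        String[] sorted = delimiters.clone();
        // Longest first, so that "&&" is tried before "&".
        Arrays.sort(sorted, (a, b) -> Integer.compare(b.length(), a.length()));
        // Pattern.quote turns each delimiter into a literal.
        String regex = Arrays.stream(sorted)
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        this.pattern = Pattern.compile(regex); // compiled once, reused below
    }

    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int pos = 0;
        while (m.find()) {
            if (pos != m.start()) {
                tokens.add(input.substring(pos, m.start())); // text before the delimiter
            }
            tokens.add(m.group()); // the delimiter itself
            pos = m.end();
        }
        if (pos != input.length()) {
            tokens.add(input.substring(pos)); // trailing text after the last delimiter
        }
        return tokens;
    }
}
```

Usage would be along the lines of new DelimiterTokenizer("&", "&&", "=", "=>", " ").tokenize("a & b&&c=>d"), calling tokenize repeatedly on the same instance.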

You can use a Pattern and a simple loop to achieve the result you are looking for:
List<String> res = new ArrayList<String>();
Pattern p = Pattern.compile("([&]{1,2}|=>?| +)");
String s = "s=a&=>b";
Matcher m = p.matcher(s);
int pos = 0;
while (m.find()) {
if (pos != m.start()) {
res.add(s.substring(pos, m.start()));
}
res.add(m.group());
pos = m.end();
}
if (pos != s.length()) {
res.add(s.substring(pos));
}
for (String t : res) {
System.out.println("'"+t+"'");
}
This produces the result below:
's'
'='
'a'
'&'
'=>'
'b'

Split won't do it for you, as it removes the delimiters. You probably need to tokenize the string on your own (i.e. with a for-loop) or use a framework like
http://www.antlr.org/

Try this:
String test = "a & b&&c=>d=A";
String regEx = "(&[&]?|=[>]?)";
String[] res = test.split(regEx);
for(String s : res){
System.out.println("Token: "+s);
}
I added the '=A' at the end to show that it is parsed as well.
As mentioned in another answer, if you need the atypical behaviour of keeping the delimiters in the result, you will probably need to write your own parser. In that case you really have to think about what a "delimiter" is in your code.
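For reference, here is a small sketch of what the split call in this answer returns for the test string. The wrapper class and method name are hypothetical; the regular expression is the one from the answer. The delimiters are consumed and the surrounding spaces stay attached to the tokens:

```java
// Hypothetical wrapper around the split call from the answer.
final class SplitDropsDelimiters {
    static String[] splitOnOperators(String input) {
        // Same regular expression as in the answer: & or &&, = or =>
        return input.split("(&[&]?|=[>]?)");
    }
}
```

For "a & b&&c=>d=A" this yields the five strings "a ", " b", "c", "d" and "A", so none of the delimiters survive in the result.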

Related

N-th indexOf in String?

I need to extract a sub-string of a URL.
URLs
/service1/api/v1.0/foo -> foo
/service1/api/v1.0/foo/{fooId} -> foo/{fooId}
/service1/api/v1.0/foo/{fooId}/boo -> foo/{fooId}/boo
And some of those URLs may have request parameters.
Code
String str = request.getRequestURI();
str = str.substring(str.indexOf("/") + 1);
str = str.substring(str.indexOf("/") + 1);
str = str.substring(str.indexOf("/") + 1);
str = str.substring(str.indexOf("/") + 1, str.indexOf("?"));
Is there a better way to extract the sub-string instead of recurrent usage of indexOf method?

There are many alternative ways:
Use the Java Stream API on the String split on the / delimiter:
String str = "/service1/api/v1.0/foo/{fooId}/boo";
String[] split = str.split("\\/");
String url = Arrays.stream(split).skip(4).collect(Collectors.joining("/"));
System.out.println(url);
With the elimination of the parameter, the Stream would be like:
String url = Arrays.stream(split)
.skip(4)
.map(i -> i.replaceAll("\\?.+", ""))
.collect(Collectors.joining("/"));
This is also a case where a regex fits well. Use the classes Pattern and Matcher.
String str = "/service1/api/v1.0/foo/{fooId}/boo";
Pattern pattern = Pattern.compile("\\/.*?\\/api\\/v\\d+\\.\\d+\\/(.+)");
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
If you prefer to stay with indexOf(..), you might want to use a while-loop.
String str = "/service1/api/v1.0/foo/{fooId}/boo?parameter=value";
String string = str;
while(!string.startsWith("v1.0")) {
string = string.substring(string.indexOf("/") + 1);
}
System.out.println(string.substring(string.indexOf("/") + 1, string.indexOf("?")));
Other answers point out that, if the prefix never changes, a single call to indexOf(..) is enough (@JB Nizet):
string.substring("/service1/api/v1.0/".length(), string.indexOf("?"));
All these solutions rely on your input and on the fact that the pattern is known, or at least on the number of leading sections delimited with / or on the version v1.0 as a checkpoint. The best solution might not appear here, since there are unlimited combinations of URLs. You have to know all the possible forms of the input URL to find the best way to handle it.
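Several of the snippets above call indexOf("?") directly, which returns -1 when the URL has no query string and makes substring throw. A minimal sketch of a guard, assuming the fixed prefix from the question's examples (the class UrlTail, the method tail and the PREFIX constant are hypothetical names):

```java
// Hypothetical helper: names are illustrative, the prefix is taken from the question's examples.
final class UrlTail {
    private static final String PREFIX = "/service1/api/v1.0/";

    static String tail(String uri) {
        if (!uri.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Unexpected prefix: " + uri);
        }
        int query = uri.indexOf('?');
        // indexOf returns -1 when there is no query string; fall back to the full length.
        int end = (query == -1) ? uri.length() : query;
        return uri.substring(PREFIX.length(), end);
    }
}
```

With this guard the same call works for URLs with and without request parameters.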

Path is quite useful for that:
public static void main(String[] args) {
Path root = Paths.get("/service1/api/v1.0/foo");
Path relativize = root.relativize(Paths.get("/service1/api/v1.0/foo/{fooId}/boo"));
System.out.println(relativize);
}
Output :
{fooId}/boo

How about this:
String s = "/service1/api/v1.0/foo/{fooId}/boo";
String[] sArray = s.split("/");
StringBuilder sb = new StringBuilder();
for (int i = 4; i < sArray.length; i++) {
sb.append(sArray[i]).append("/");
}
sb.deleteCharAt(sb.length() - 1);
System.out.println(sb.toString());
Output:
foo/{fooId}/boo
If the url prefix is always /service1/api/v1.0/, you just need to do s.substring("/service1/api/v1.0/".length()).

There are a few good options here.
1) If you know "foo" will always be the 4th token, then you have the right idea already. The only issue with your way is that you have the information you need to be efficient, but you aren't using it. Instead of copying the String multiple times and looping anew from the beginning of the new String, you could just continue from where you left off, 4 times, to find the starting point of what you want.
String str = "/service1/api/v1.0/foo/{fooId}/boo";
// start at the beginning
int start = 0;
// get the 4th index of '/' in the string
for (int i = 0; i != 4; i++) {
// get the next index of '/' after the index 'start'
start = str.indexOf('/',start);
// increase the pointer to the next character after this slash
start++;
}
// get the substring
str = str.substring(start);
This will be far, far more efficient than any regex pattern.
2) Regex (java.util.regex.*): this will work if what you want is always preceded by "service1/api/v1.0/". There may be other directories before it, e.g. "one/two/three/service1/api/v1.0/".
// the '.' in v1.0 is escaped as \\. so that it matches a literal dot
// (.+) will capture the matched text at that position
// $ marks the end of the string (technically it also matches just before a final '\n')
Pattern pattern = Pattern.compile("/service1/api/v1\\.0/(.+)$");
// get a matcher for it
Matcher matcher = pattern.matcher(str);
// if there is a match
if (matcher.find()) {
// get the captured text
str = matcher.group(1);
}
If your path can vary somewhat, the regex can account for it. For example, "/service/api/v3/foo/{bar}/baz/" (note the varying number formats and the trailing '/') could be matched as well by changing the regex to "/service\\d*/api/v\\d+(?:\\.\\d+)?/(.+?)/?$", where the lazy (.+?) followed by /?$ keeps an optional trailing '/' out of the captured group.
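The loop from option 1 can be generalized into a small helper that answers the title question directly. This is only a sketch; the class and method name nthIndexOf are hypothetical, and n is counted from 1:

```java
// Hypothetical helper: not a JDK method.
final class NthIndexOf {
    /** Returns the index of the n-th occurrence (1-based) of ch in s, or -1 if there are fewer than n. */
    static int nthIndexOf(String s, char ch, int n) {
        int index = -1;
        for (int i = 0; i < n; i++) {
            // Continue searching right after the previous hit instead of copying the string.
            index = s.indexOf(ch, index + 1);
            if (index == -1) {
                return -1; // fewer than n occurrences
            }
        }
        return index;
    }
}
```

For the example path, str.substring(NthIndexOf.nthIndexOf(str, '/', 4) + 1) gives the part after the fourth slash; the caller should check for -1 first.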

JSP JSTL function fn:split is not working properly

Today I came across an issue and need your help to fix it.
I am trying to split a string using the JSTL fn:split function, like this:
<c:set var="stringArrayName" value="${fn:split(element, '~$')}" />
Actual String :- "abc~$pqr$xyz"
Expected Result :-
abc
pqr$xyz
I expect only 2 parts, but it gives
abc
pqr
xyz
That is 3 parts in total, which is wrong.
NOTE: I have added <%@ taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions" %> at the top of the JSP.
Any help is really appreciated!

JSTL split does not work like the Java split. You can see the difference in the source code:
org.apache.taglibs.standard.functions.Functions.split
public static String[] split(String input, String delimiters) {
String[] array;
if (input == null) {
input = "";
}
if (input.length() == 0) {
array = new String[1];
array[0] = "";
return array;
}
if (delimiters == null) {
delimiters = "";
}
StringTokenizer tok = new StringTokenizer(input, delimiters);
int count = tok.countTokens();
array = new String[count];
int i = 0;
while (tok.hasMoreTokens()) {
array[i++] = tok.nextToken();
}
return array;
}
java.lang.String.split
public String[] split(String regex, int limit) {
return Pattern.compile(regex).split(this, limit);
}
So fn:split clearly uses StringTokenizer
...
StringTokenizer tok = new StringTokenizer(input, delimiters);
int count = tok.countTokens();
array = new String[count];
int i = 0;
while (tok.hasMoreTokens()) {
array[i++] = tok.nextToken();
}
...
unlike java.lang.String.split, which uses a regular expression
return Pattern.compile(regex).split(this, limit);
//-----------------------^
The StringTokenizer documentation says:
Constructs a string tokenizer for the specified string. The characters
in the delim argument are the delimiters for separating tokens.
Delimiter characters themselves will not be treated as tokens.
How exactly does `fn:split` work?
It splits on each character in the delimiter string. In your case there are two characters, ~ and $, so the string abc~$pqr$xyz is split like this:
abc~$pqr$xyz
   ^^   ^
1st split :
abc
$pqr$xyz
2nd split :
abc
pqr$xyz
3rd split :
abc
pqr
xyz
Solution
Use String.split in your Servlet instead of JSTL,
for example:
String[] array = "abc~$pqr$xyz".split("~\\$");
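To see the two behaviours side by side, here is a small sketch. The class and method names are hypothetical; tokenizerSplit mirrors the StringTokenizer loop shown above, and literalSplit uses Pattern.quote so that the delimiter does not need manual escaping:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

// Hypothetical comparison helper: names are illustrative only.
final class SplitComparison {
    // Like fn:split: every character of 'delimiters' acts as its own delimiter.
    static List<String> tokenizerSplit(String input, String delimiters) {
        List<String> out = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(input, delimiters);
        while (tok.hasMoreTokens()) {
            out.add(tok.nextToken());
        }
        return out;
    }

    // Treats the whole delimiter string as one literal separator.
    static String[] literalSplit(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }
}
```

For "abc~$pqr$xyz" and the delimiter "~$", tokenizerSplit gives abc, pqr, xyz, while literalSplit gives abc and pqr$xyz.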

Remove pattern from string in Java

I am currently working on a tool, which helps me to analyze a constantly growing String, that can look like this: String s = "AAAAAAABBCCCDDABQ". What I want to do is to find a sequence of A's and B's, do something and then remove that sequence from the original String.
My code looks like this:
while (someBoolean){
if(Pattern.matches("A+B+", s)) {
//Do stuff
//Remove the found pattern
}
if(Pattern.matches("C+D+", s)) {
//Do other stuff
//Remove the found pattern
}
}
return s;
Also, how could I remove the three sequences, so that s contains just "Q" at the end of the calculation, without an endless loop?

You should use a regex replacement loop, i.e. the methods appendReplacement(StringBuffer sb, String replacement) and appendTail(StringBuffer sb).
To find one of many patterns, use the regex alternation operator |, and capture each pattern separately.
You can then use group(int group) to get the matched string for each capture group (first group is group 1), which returns null if that group didn't match. For better performance, to simply check whether the group matched, use start(int group), which returns -1 if that group didn't match.
Example:
String s = "AAAAAAABBCCCDDABQ";
StringBuffer buf = new StringBuffer();
Pattern p = Pattern.compile("(A+B+)|(C+D+)");
Matcher m = p.matcher(s);
while (m.find()) {
if (m.start(1) != -1) { // Group 1 found
System.out.println("Found AB: " + m.group(1));
m.appendReplacement(buf, ""); // Replace matched substring with ""
} else if (m.start(2) != -1) { // Group 2 found
System.out.println("Found CD: " + m.group(2));
m.appendReplacement(buf, ""); // Replace matched substring with ""
}
}
m.appendTail(buf);
String remain = buf.toString();
System.out.println("Remain: " + remain);
Output
Found AB: AAAAAAABB
Found CD: CCCDD
Found AB: AB
Remain: Q

This solution assumes that the string always ends in Q.
String s="AAAAAAABBCCCDDABQ";
Pattern abPattern = Pattern.compile("A+B+");
Pattern cdPattern = Pattern.compile("C+D+");
while (s.length() > 1){
Matcher abMatcher = abPattern.matcher(s);
if (abMatcher.find()) {
s = abMatcher.replaceFirst("");
//Do other stuff
}
Matcher cdMatcher = cdPattern.matcher(s);
if (cdMatcher.find()) {
s = cdMatcher.replaceFirst("");
//Do other stuff
}
}
System.out.println(s);
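The loop above relies on the input ending in a single leftover character; if neither pattern matches while more than one character remains, it never terminates. A sketch of a variant that instead stops as soon as a full pass removes nothing (the class and method names are hypothetical, and the "do stuff" hooks are left as comments):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant: loops until no pattern matches any more.
final class RemoveUntilStable {
    private static final Pattern AB = Pattern.compile("A+B+");
    private static final Pattern CD = Pattern.compile("C+D+");

    static String strip(String s) {
        boolean changed = true;
        while (changed) {
            changed = false;
            Matcher ab = AB.matcher(s);
            if (ab.find()) {
                // Do stuff with ab.group() here
                s = ab.replaceFirst("");
                changed = true;
            }
            Matcher cd = CD.matcher(s);
            if (cd.find()) {
                // Do other stuff with cd.group() here
                s = cd.replaceFirst("");
                changed = true;
            }
        }
        return s;
    }
}
```

Because each successful pass shortens the string, the loop is guaranteed to end regardless of what the input looks like.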

You are probably looking for something like this:
String input = "AAAAAAABBCCCDDABQ";
String result = input;
String[] chars = {"A", "B", "C", "D"}; // chars to replace
for (String ch : chars) {
if (result.contains(ch)) {
String pattern = "[" + ch + "]+";
result = result.replaceAll(pattern, ch);
}
}
System.out.println(input); //"AAAAAAABBCCCDDABQ"
System.out.println(result); //"ABCDABQ"
This basically replaces each run of a character with a single occurrence of it.
If you want to remove the runs completely, just replace ch with "" in the replaceAll arguments inside the if body.

Uppercase all characters but not those in quoted strings

I have a String and I would like to uppercase everything that is not quoted.
Example:
My name is 'Angela'
Result:
MY NAME IS 'Angela'
Currently, I am matching every quoted string then looping and concatenating to get the result.
Is it possible to achieve this with a single regex, maybe using replace?

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\\'(.*?)\\'");
String input = "'s'Hello This is 'Java' Not '.NET'";
Matcher regexMatcher = regex.matcher(input);
StringBuffer sb = new StringBuffer();
int counter = 0;
while (regexMatcher.find())
{// Finds Matching Pattern in String
regexMatcher.appendReplacement(sb, "{"+counter+"}");
matchList.add(regexMatcher.group());// Fetching Group from String
counter++;
}
String format = MessageFormat.format(sb.toString().toUpperCase(), matchList.toArray());
System.out.println(input);
System.out.println("----------------------");
System.out.println(format);
Input: 's'Hello This is 'Java' Not '.NET'
Output: 's'HELLO THIS IS 'Java' NOT '.NET'

You could use a regular expression like this:
([^'"]+)(['"]+[^'"]+['"]+)(.*)
# match and capture everything up to a single or double quote (but not including)
# match and capture a quoted string
# match and capture any rest which might or might not be there.
This will only work with one quoted string, obviously.

OK, this will do it for you. It is not efficient, but it will work for all cases. I actually don't recommend this solution, as it will be too slow.
public static void main(String[] args) {
String s = "'Peter' said, My name is 'Angela' and I will not change my name to 'Pamela'.";
Pattern p = Pattern.compile("('\\w+')");
Matcher m = p.matcher(s);
List<String> quotedStrings = new ArrayList<>();
while(m.find()) {
quotedStrings.add(m.group(1));
}
s=s.toUpperCase();
// System.out.println(s);
for (String str : quotedStrings)
s= s.replaceAll("(?i)"+str, str);
System.out.println(s);
}
Output:
'Peter' SAID, MY NAME IS 'Angela' AND I WILL NOT CHANGE MY NAME TO 'Pamela'.

Adding to the answer by @jan_kiran: we need to call the appendTail() method, otherwise any unquoted text after the last quoted string is dropped. The updated code is:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\\'(.*?)\\'");
String input = "'s'Hello This is 'Java' Not '.NET'";
Matcher regexMatcher = regex.matcher(input);
StringBuffer sb = new StringBuffer();
int counter = 0;
while (regexMatcher.find())
{// Finds Matching Pattern in String
regexMatcher.appendReplacement(sb, "{"+counter+"}");
matchList.add(regexMatcher.group());// Fetching Group from String
counter++;
}
regexMatcher.appendTail(sb);
String formatted_string = MessageFormat.format(sb.toString().toUpperCase(), matchList.toArray());
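An alternative sketch that avoids MessageFormat altogether: walk the matches and uppercase only the text between them. The class and method names are hypothetical, and it assumes single-quoted strings without escaped quotes:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: uppercases everything outside '...' in one pass.
final class UppercaseOutsideQuotes {
    private static final Pattern QUOTED = Pattern.compile("'[^']*'");

    static String convert(String input) {
        StringBuilder out = new StringBuilder(input.length());
        Matcher m = QUOTED.matcher(input);
        int pos = 0;
        while (m.find()) {
            // Unquoted text before this match is uppercased
            out.append(input.substring(pos, m.start()).toUpperCase(Locale.ROOT));
            // The quoted string is copied unchanged
            out.append(m.group());
            pos = m.end();
        }
        // Unquoted text after the last quoted string
        out.append(input.substring(pos).toUpperCase(Locale.ROOT));
        return out.toString();
    }
}
```

Since no placeholders are involved, braces or other characters that are special to MessageFormat in the input need no extra handling.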

I had no luck with these solutions, as they seemed to remove trailing non-quoted text.
This code works for me and handles both ' and " by remembering the type of the last opening quotation mark. Replace toLowerCase as appropriate, of course.
Maybe this is extremely slow; I don't know:
private static String toLowercaseExceptInQuotes(String line) {
StringBuffer sb = new StringBuffer(line);
boolean nowInQuotes = false;
char lastQuoteType = 0;
for (int i = 0; i < sb.length(); ++i) {
char cchar = sb.charAt(i);
if (cchar == '"' || cchar == '\''){
if (!nowInQuotes) {
nowInQuotes = true;
lastQuoteType = cchar;
}
else {
if (lastQuoteType == cchar) {
nowInQuotes = false;
}
}
}
else if (!nowInQuotes) {
sb.setCharAt(i, Character.toLowerCase(sb.charAt(i)));
}
}
return sb.toString();
}

Split a quoted string with a delimiter

I want to split a string on whitespace, but quoted strings should be handled intelligently. E.g. for a string like
"John Smith" Ted Barry
It should return three strings John Smith, Ted and Barry.

After messing around with it, you can use Regex for this. Run the equivalent of "match all" on:
((?<=("))[\w ]*(?=("(\s|$))))|((?<!")\w+(?!"))
A Java Example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Test
{
public static void main(String[] args)
{
String someString = "\"Multiple quote test\" not in quotes \"inside quote\" \"A work in progress\"";
Pattern p = Pattern.compile("((?<=(\"))[\\w ]*(?=(\"(\\s|$))))|((?<!\")\\w+(?!\"))");
Matcher m = p.matcher(someString);
while(m.find()) {
System.out.println("'" + m.group() + "'");
}
}
}
Output:
'Multiple quote test'
'not'
'in'
'quotes'
'inside quote'
'A work in progress'
The regular expression breakdown with the example used above can be viewed here:
http://regex101.com/r/wM6yT9
With all that said, regular expressions should not be the go-to solution for everything; I was just having fun. This example has a lot of edge cases, such as the handling of Unicode characters, symbols, etc. You would be better off using a tried and true library for this sort of task. Take a look at the other answers before using this one.
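If the quotes themselves are not needed in the result, a simpler pattern without lookarounds may be enough. The following is only a sketch under the assumptions that quotes are balanced, start at a token boundary and are never escaped; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits on whitespace but keeps "quoted text" together.
final class QuotedSplitter {
    // Group 1: contents of a double-quoted string; group 2: a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }
}
```

Because [^"]* and \S are not limited to word characters, this variant also accepts non-ASCII text and punctuation inside tokens.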

Try this ugly bit of code.
String str = "hello my dear \"John Smith\" where is Ted Barry";
List<String> list = Arrays.asList(str.split("\\s"));
List<String> resultList = new ArrayList<String>();
StringBuilder builder = new StringBuilder();
for(String s : list){
if(s.startsWith("\"")) {
builder.append(s.substring(1)).append(" ");
} else {
resultList.add((s.endsWith("\"")
? builder.append(s.substring(0, s.length() - 1))
: builder.append(s)).toString());
builder.delete(0, builder.length());
}
}
System.out.println(resultList);

Well, I made a small snippet that does what you want and a bit more. Since you did not specify more conditions, I did not go to the trouble of handling them. I know this is a dirty way, and you can probably get better results with something that already exists, but for the fun of programming, here is the example:
String example = "hello\"John Smith\" Ted Barry lol\"Basi German\"hello";
int wordQuoteStartIndex=0;
int wordQuoteEndIndex=0;
int wordSpaceStartIndex = 0;
int wordSpaceEndIndex = 0;
boolean foundQuote = false;
for(int index=0;index<example.length();index++) {
if(example.charAt(index)=='\"') {
if(foundQuote==true) {
wordQuoteEndIndex=index+1;
//Print the quoted word
System.out.println(example.substring(wordQuoteStartIndex, wordQuoteEndIndex));//here you can remove quotes by changing to (wordQuoteStartIndex+1, wordQuoteEndIndex-1)
foundQuote=false;
if(index+1<example.length()) {
wordSpaceStartIndex = index+1;
}
}else {
wordSpaceEndIndex=index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordQuoteStartIndex=index;
foundQuote = true;
}
}
if(foundQuote==false) {
if(example.charAt(index)==' ') {
wordSpaceEndIndex = index;
if(wordSpaceStartIndex!=wordSpaceEndIndex) {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
}
wordSpaceStartIndex = index+1;
}
if(index==example.length()-1) {
if(example.charAt(index)!='\"') {
//print the word in spaces
System.out.println(example.substring(wordSpaceStartIndex, example.length()));
}
}
}
}
This also checks for words that are not separated by a space before or after the quotes, such as the words "hello" before "John Smith" and after "Basi German".
When the string is changed to "John Smith" Ted Barry, the output is three strings:
1) "John Smith"
2) Ted
3) Barry
The string in the example is hello"John Smith" Ted Barry lol"Basi German"hello and prints
1)hello
2)"John Smith"
3)Ted
4)Barry
5)lol
6)"Basi German"
7)hello
Hope it helps

This is my own version, cleaned up from http://pastebin.com/aZngu65y (posted in the comments).
It can handle Unicode. It collapses all excess whitespace (even inside quotes), which can be good or bad depending on your needs. Escaped quotes are not supported.
private static String[] parse(String param) {
String[] output;
param = param.replaceAll("\"", " \" ").trim();
String[] fragments = param.split("\\s+");
int curr = 0;
boolean matched = fragments[curr].matches("[^\"]*");
if (matched) curr++;
for (int i = 1; i < fragments.length; i++) {
if (!matched)
fragments[curr] = fragments[curr] + " " + fragments[i];
if (!fragments[curr].matches("(\"[^\"]*\"|[^\"]*)"))
matched = false;
else {
matched = true;
if (fragments[curr].matches("\"[^\"]*\""))
fragments[curr] = fragments[curr].substring(1, fragments[curr].length() - 1).trim();
if (fragments[curr].length() != 0)
curr++;
if (i + 1 < fragments.length)
fragments[curr] = fragments[i + 1];
}
}
if (matched) {
return Arrays.copyOf(fragments, curr);
}
return null; // Parameter failure (double-quotes do not match up properly).
}
Sample input for comparison:
"sdfskjf" sdfjkhsd "hfrif ehref" "fksdfj sdkfj fkdsjf" sdf sfssd
asjdhj sdf ffhj "fdsf fsdjh"
日本語　中文 "Tiếng Việt" "English"
dsfsd
sdf " s dfs fsd f " sd f fs df fdssf "日本語　中文"
"" "" ""
" sdfsfds " "f fsdf
(2nd line is empty, 3rd line is spaces, last line is malformed).
Please judge against your own expected output, since it may vary, but the baseline is that the 1st case should return [sdfskjf, sdfjkhsd, hfrif ehref, fksdfj sdkfj fkdsjf, sdf, sfssd].

commons-lang has a StrTokenizer class to do this for you, and there is also the java-csv library.
Example with StrTokenizer:
String params = "\"John Smith\" Ted Barry";
// Initialize tokenizer with input string, delimiter character, quote character
StrTokenizer tokenizer = new StrTokenizer(params, ' ', '"');
for (String token : tokenizer.getTokenArray()) {
System.out.println(token);
}
Output:
John Smith
Ted
Barry
