Split the String by \ which contains following string "abc\u12345. "

Before posting I tried split("\u"), split("\\u") and similar variants, but none of them work. The reason is that \u is treated as the start of a Unicode escape, while in this case it is meant as literal text.

As already mentioned, \u1234 is a Unicode escape, so the compiler turns it into a single character (the trailing 5 stays a separate character).
If the escape is already in your string literal, it is too late. If you get the text from a file or over the network, you could read your input and escape each \ or \u you encounter before storing it in your string variable and working on it.
If you elaborate on the context of your task a little more, perhaps we could find other solutions for you.

Java understands it as a Unicode character, so the right thing to do is to fix the source so that it reads the value properly and avoids passing Unicode escapes to Java when they are not needed. One workaround is to convert the entire string into a character array and check whether each character is 128 or greater; if it is, write it back as a \uXXXX escape and append it, together with the rest of the array, to a separate StringBuilder. See if the code below helps:
public static void tryMee(String input)
{
StringBuilder b1 = new StringBuilder();
StringBuilder b2 = new StringBuilder();
boolean isUni = false;
for (char c : input.toCharArray())
{
if (c >= 128)
{
b2.append("\\u").append(String.format("%04X", (int) c));
isUni = true;
}
else if(isUni) b2.append(c);
else b1.append(c);
}
System.out.println("B1: "+b1);
System.out.println("B2: "+b2);
}

Try this. You did not escape the backslash properly:
split("\\\\u")
or
split(Pattern.quote("\\u"))
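A minimal sketch of both calls, assuming the backslash is really present in the string at run time (for example because the text was read from a file). The class and method names are only for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitOnBackslashU {
    // Splits on the two-character sequence: backslash followed by 'u'.
    static String[] splitOnLiteral(String input) {
        return input.split(Pattern.quote("\\u"));
    }

    public static void main(String[] args) {
        // The doubled backslash in the literal is ONE backslash at run time,
        // so this string holds ten characters, not a Unicode escape.
        String literal = "abc\\u12345";
        System.out.println(literal.length());
        System.out.println(Arrays.toString(splitOnLiteral(literal)));

        // Same split with a hand-escaped regex: four backslashes in source
        // become the regex for one literal backslash.
        System.out.println(Arrays.toString(literal.split("\\\\u")));
    }
}
```

If the literal is written with a single backslash instead, the compiler has already replaced the escape before split() ever runs, so neither pattern finds anything to split on.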

import java.util.Arrays;
public class Example {
public static void main (String[]args){
String str = "abc\u12345";
// first replace \\u with something else, for example with -u
char [] chars = str.toCharArray();
StringBuilder sb = new StringBuilder();
for(char c: chars){
if(c >= 128){
sb.append("-u").append(Integer.toHexString(c | 0x10000).substring(1) );
}else{
sb.append(c);
}
}
String replaced = sb.toString();
// now you can split by -u
String [] splited = sb.toString().split("-u");
System.out.println(replaced);
System.out.println(Arrays.toString(splited));
}
}

Related

deleting special characters from a string

Okay, this is my first post here and I'm kind of new to Java, so my question is simple:
is there any instruction in Java that removes special characters from a string?
My string should contain only letters, so when the user enters a space, a period, or anything else that isn't a letter, it should be removed or ignored.
My idea was to make an array of characters and shift the letters to the left each time there is something that isn't a letter.
So I wrote this code, where x is my string:
char h[]=new char [d];
for (int f=0;f<l;f++)
{
h[f]=x.charAt(f);
}
int ii=0;
while (ii<l)
{
if(h[ii]==' '||h[ii]==','||h[ii]=='-'||h[ii]=='\\'||h[ii]=='('||h[ii]==')'||h[ii]=='_'||h[ii]=='\''||h[ii]=='/'||h[ii]==';'||h[ii]=='!'||h[ii]=='*'||h[ii]=='.')
{
for(int m=ii;m<l-1;m++)
{
h[m]=h[m+1];
}
d=d-1;
ii--;
}
ii++;
}
Well, this works: it removes the special characters, but I can't include every exception in the condition. I wonder if there is something easier :)

As others have said Strings in Java are immutable.
One way to catch all characters you do not want is to only allow the ones you want:
final String input = "some string . ";
final StringBuffer sb = new StringBuffer();
final String permittedCharacters = "1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
for (char c : input.toCharArray()){
if (permittedCharacters.indexOf(c)>=0){
sb.append(c);
}
}
final String endString = sb.toString();

Short answer: no, String is immutable. But you can use StringBuffer instead. This class contains a deleteCharAt(int) method that can be useful.
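Since the question only wants letters to survive, a regular expression is another option. The following is a small sketch, not taken from any of the answers; the class and method names are made up, and it assumes "letter" means any Unicode letter (use [^A-Za-z] instead if only ASCII letters should be kept).

```java
class LettersOnly {
    // Builds a new string without any character that is not a Unicode letter.
    // The original string is left untouched, since String is immutable.
    static String keepLetters(String input) {
        return input.replaceAll("[^\\p{L}]", "");
    }

    public static void main(String[] args) {
        System.out.println(keepLetters("he-llo, wor_ld (42)!"));
    }
}
```

This avoids listing every unwanted character by hand, which was the problem with the long condition in the question.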

Reading new line as two characters

I have written a small program
class Test {
public static void main(String[] args) {
String s = "\n";
System.out.println(s.length());
for (int i = 0; i < s.length(); i++) {
System.out.println(s.charAt(i));
}
}
}
The program gives the length as 1 and treats \n as single new line character.
My requirement is to treat \n as normal string so with 2 characters (First character \ and second character n), what can be done to achieve it?
NOTE: 1) We can't change the string to add additional escape character.
2) We don't want to use any additional 3rd Party library

You can use the StringEscapeUtils utility class from commons-lang.
String s = "\n";
s = StringEscapeUtils.escapeJava(s);
System.out.println(s.length());
for (int i = 0; i < s.length(); i++) {
System.out.println(s.charAt(i));
}
Output:
2
\
n
If you absolutely can't use a library like commons-lang, then you can write your own method to do it. You can browse through the code of the above class to see an example of how you can escape the string to account for different special characters.
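As a rough idea of what such a hand-written method could look like, here is a sketch that only covers newline, carriage return, tab and the backslash itself. It is not the commons-lang implementation, and the class and method names are invented for this example.

```java
class ManualEscape {
    // Replaces a few control characters with their two-character
    // escape sequences. Anything not listed is copied unchanged.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                case '\\': sb.append("\\\\"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = escape("\n");
        System.out.println(s.length());
        for (int i = 0; i < s.length(); i++) {
            System.out.println(s.charAt(i));
        }
    }
}
```

The original string is not changed; escape() returns a new string in which the single newline character has become a backslash followed by n.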

As far as I know, you can't. The issue is that "\n" is one character. The single backslash is an escape.
char ch = '\n'; // <-- not two characters. it's one.

It's as simple as that:
Once you go past the line
String s = "\n";
s will contain a single new line character, and there's nothing you can do about it.
You can obviously create a new String and replace all new line characters by "\n", but I don't think that's what you wanted.

We can't change the string to add additional escape character.
I guess that is not possible because \n has a special meaning when used in String.
Escape the backslash with double backslash like this \\n
This shall give you length as 2
String str = "\\n";
System.out.println(str.length());
Or try using apache commons-lang's
StringEscapeUtils#escapeJava()

You could search through the string and replace every newline character with the two characters \ and n.
String s = convertNewLineChars("\n");
public static String convertNewLineChars(String s)
{
// replace each newline character with a backslash followed by 'n'
return s.replace("\n", "\\n");
}
Edit
Use an enum for all your possible special characters
public enum SpecialCharacter
{
NEWLINE('\n', "\\\\n"), //see note at the bottom of the answer for why
RETURN('\r', "\\\\r"); //there are four backslashes.
private char character;
private String charAsString;
private SpecialCharacter(char character, String charAsString)
{
this.character = character;
this.charAsString = charAsString;
}
public char getCharacter()
{
return this.character;
}
public String getCharAsString()
{
return this.charAsString;
}
public static SpecialCharacter[] getAllCharacters()
{
return new SpecialCharacter[] {NEWLINE, RETURN}; //etc...
}
}
Create a static method for removing these characters
public static String removeSpecialCharacters(String s)
{
String returnString = s;
for (SpecialCharacter character : SpecialCharacter.getAllCharacters())
{
returnString = returnString.replaceAll("["+character.getCharacter()+"]", character.getCharAsString());
}
return returnString;
}
Then you can say something like:
String s = removeSpecialCharacters("\nfdafhoean\noasd\r\rjfoi");
System.out.println(s);
This will work for any SpecialCharacter you add to the enum.
*Note that replaceAll() will consume the extra backslash... if you simply call System.out.println(SpecialCharacter.NEWLINE.getCharAsString()); you will receive the output of \\n

Just put another \ character in front of the \n: "\\n" is the two characters \ and n, not a newline.

Identify valid characters of a character array

I am receiving a character array where it has valid and invalid characters. I need to only retrieve the characters that are valid. How can I do this ?
char[] ch = str.toCharArray();
for (int i = 0; i < ch.length; i++) {
if(!Character.isLetter(ch[i]))
System.out.println("not valid");
else
System.out.println("valid");
}
I don't think the above code works because I get "not valid" for both the valid and the invalid characters in the character array.
By valid characters I mean all alphanumeric and special characters.
Note: I am getting values from the server, therefore the character array contains valid and invalid characters.

try following method:
// assume input is not too large string
public static String extractPrintableChars(final String input) {
String out = input;
if(input != null) {
char[] charArr = input.toCharArray();
StringBuilder sb = new StringBuilder();
for (char ch : charArr) {
if(Character.isDefined(ch) && !Character.isISOControl(ch)) {
sb.append(ch);
}
}
out = sb.toString();
}
return out;
}

Have a look at the Matcher class available in Java. I don't think it is wise to loop over all the characters unless there is a maximum cap on the input length.
/[^0-9]+/g, ''
The regex above replaces every character other than digits with nothing. Modify it as needed.
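The pattern above is written in a JavaScript-like notation. In Java the same idea can be expressed with String.replaceAll, roughly as in this sketch (class and method names are hypothetical):

```java
class DigitsOnly {
    // Removes every run of characters that are not ASCII digits.
    static String keepDigits(String input) {
        return input.replaceAll("[^0-9]+", "");
    }

    public static void main(String[] args) {
        System.out.println(keepDigits("a1b22c333"));
    }
}
```

To keep a different set of characters, change the character class, for example [^0-9A-Za-z]+ to keep ASCII letters and digits.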
regards
Punith

How can I replace non-printable Unicode characters in Java?

The following will replace ASCII control characters (shorthand for [\x00-\x1F\x7F]):
my_string.replaceAll("\\p{Cntrl}", "?");
The following will replace all ASCII non-printable characters (shorthand for [\p{Graph}\x20]), including accented characters:
my_string.replaceAll("[^\\p{Print}]", "?");
However, neither works for Unicode strings. Does anyone have a good way to remove non-printable characters from a Unicode string?

my_string.replaceAll("\\p{C}", "?");
See more about Unicode regex. java.util.regex.Pattern and String.replaceAll support them.

Op De Cirkel is mostly right. His suggestion will work in most cases:
myString.replaceAll("\\p{C}", "?");
But if myString might contain non-BMP codepoints then it's more complicated. \p{C} contains the surrogate codepoints of \p{Cs}. The replacement method above will corrupt non-BMP codepoints by sometimes replacing only half of the surrogate pair. It's possible this is a Java bug rather than intended behavior.
Using the other constituent categories is an option:
myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");
However, solitary surrogate characters not part of a pair (each surrogate character has an assigned codepoint) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:
StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
int codePoint = myString.codePointAt(offset);
offset += Character.charCount(codePoint);
// Replace invisible control characters and unused code points
switch (Character.getType(codePoint))
{
case Character.CONTROL: // \p{Cc}
case Character.FORMAT: // \p{Cf}
case Character.PRIVATE_USE: // \p{Co}
case Character.SURROGATE: // \p{Cs}
case Character.UNASSIGNED: // \p{Cn}
newString.append('?');
break;
default:
newString.append(Character.toChars(codePoint));
break;
}
}

The methods below may serve your goal:
public static String removeNonAscii(String str)
{
return str.replaceAll("[^\\x00-\\x7F]", "");
}
public static String removeNonPrintable(String str) // All Control Char
{
return str.replaceAll("[\\p{C}]", "");
}
public static String removeSomeControlChar(String str) // Some Control Char
{
return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}
public static String removeFullControlChar(String str)
{
return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}

You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).
In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.
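A short sketch of how those two categories could be used with replaceAll. The class and method names are invented, and whether "Other, Format" characters should really be dropped depends on your data, as noted above.

```java
class StripControlAndFormat {
    // Removes characters in the Unicode categories Cc (control)
    // and Cf (format); everything else is kept.
    static String strip(String input) {
        return input.replaceAll("[\\p{Cc}\\p{Cf}]", "");
    }

    public static void main(String[] args) {
        // Contains U+0007 (bell, Cc), U+200B (zero width space, Cf)
        // and a tab (Cc) between the visible letters.
        String dirty = "a\u0007b\u200Bc\td";
        System.out.println(strip(dirty));
    }
}
```

Note that tab, carriage return and line feed are also in \p{Cc}, so they are removed too unless you exclude them from the character class.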

I have used this simple function for this:
private static final Pattern pattern = Pattern.compile("[^ -~]");
private static String cleanTheText(String text) {
// remove every character outside the printable ASCII range (space to tilde)
return pattern.matcher(text).replaceAll("");
}
Hope this is useful.

Based on the answers by Op De Cirkel and noackjr, the following is what I do for general string cleaning: 1. trimming leading or trailing whitespaces, 2. dos2unix, 3. mac2unix, 4. removing all "invisible Unicode characters" except whitespaces:
myString.trim.replaceAll("\r\n", "\n").replaceAll("\r", "\n").replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "")
Tested with Scala REPL.

I suggest handling characters outside the BMP (supplementary code points) as below; note that this code substitutes "?" for each of them rather than deleting it:
private String removeNonBMPCharacters(final String input) {
StringBuilder strBuilder = new StringBuilder();
input.codePoints().forEach((i) -> {
if (Character.isSupplementaryCodePoint(i)) {
strBuilder.append("?");
} else {
strBuilder.append(Character.toChars(i));
}
});
return strBuilder.toString();
}

Supported multilanguage
public static String cleanUnprintableChars(String text, boolean multilanguage)
{
String regex = multilanguage ? "[^\\x00-\\xFF]" : "[^\\x00-\\x7F]";
// strips off all non-ASCII characters
text = text.replaceAll(regex, "");
// erases all the ASCII control characters
text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
// removes non-printable characters from Unicode
text = text.replaceAll("\\p{C}", "");
return text.trim();
}

I have reworked the code from "Extract digits from a string in Java" for phone numbers such as +9 (987) 124124:
public static String stripNonDigitsV2( CharSequence input ) {
if (input == null)
return null;
if ( input.length() == 0 )
return "";
char[] result = new char[input.length()];
int cursor = 0;
CharBuffer buffer = CharBuffer.wrap( input );
int i=0;
while ( i< buffer.length() ) { //buffer.hasRemaining()
char chr = buffer.get(i);
if (chr=='u'){
i=i+5;
chr=buffer.get(i);
}
if ( chr > 39 && chr < 58 )
result[cursor++] = chr;
i=i+1;
}
return new String( result, 0, cursor );
}

Tokenizing a String but ignoring delimiters within quotes

I wish to have the following String
!cmd 45 90 "An argument" Another AndAnother "Another one in quotes"
to become an array of the following
{ "!cmd", "45", "90", "An argument", "Another", "AndAnother", "Another one in quotes" }
I tried
new StringTokenizer(cmd, "\"")
but this would return "Another" and "AndAnother" as "Another AndAnother", which is not the desired effect.
Thanks.
EDIT:
I have changed the example yet again, this time I believe it explains the situation best although it is no different than the second example.

It's much easier to use a java.util.regex.Matcher and do a find() rather than any kind of split in these kinds of scenario.
That is, instead of defining the pattern for the delimiter between the tokens, you define the pattern for the tokens themselves.
Here's an example:
String text = "1 2 \"333 4\" 55 6 \"77\" 8 999";
// 1 2 "333 4" 55 6 "77" 8 999
String regex = "\"([^\"]*)\"|(\\S+)";
Matcher m = Pattern.compile(regex).matcher(text);
while (m.find()) {
if (m.group(1) != null) {
System.out.println("Quoted [" + m.group(1) + "]");
} else {
System.out.println("Plain [" + m.group(2) + "]");
}
}
The above prints (as seen on ideone.com):
Plain [1]
Plain [2]
Quoted [333 4]
Plain [55]
Plain [6]
Quoted [77]
Plain [8]
Plain [999]
The pattern is essentially:
"([^"]*)"|(\S+)
 \_____/  \___/
    1       2
There are 2 alternates:
The first alternate matches the opening double quote, a sequence of anything but double quote (captured in group 1), then the closing double quote
The second alternate matches any sequence of non-whitespace characters, captured in group 2
The order of the alternates matters in this pattern
Note that this does not handle escaped double quotes within quoted segments. If you need to do this, then the pattern becomes more complicated, but the Matcher solution still works.
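One possible extension for escaped quotes is sketched below. It assumes a backslash is the escape character inside quoted segments, which the question does not specify; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenizer {
    // Group 1: body of a quoted token; a backslash escapes the next character.
    // Group 2: a plain run of non-whitespace characters.
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|(\\S+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // Drop each escaping backslash and keep the character after it.
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        System.out.println(tokenize(cmd));
        System.out.println(tokenize("say \"a \\\"b\\\" c\" end"));
    }
}
```

As in the simpler pattern, the quoted alternative has to come first, otherwise \S+ would swallow the opening quote.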
References
regular-expressions.info/Brackets for Grouping and Capturing, Alternation with Vertical Bar, Character Class, Repetition with Star and Plus
See also
regular-expressions.info/Examples - Programmer - Strings - for pattern with escaped quotes
Appendix
Note that StringTokenizer is a legacy class. It's recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for most flexibility.
Related questions
Difference between a Deprecated and Legacy API?
Scanner vs. StringTokenizer vs. String.Split
Validating input using java.util.Scanner - has many examples

Do it the old fashioned way. Make a function that looks at each character in a for loop. If the character is a space, take everything up to that (excluding the space) and add it as an entry to the array. Note the position, and do the same again, adding that next part to the array after a space. When a double quote is encountered, mark a boolean named 'inQuote' as true, and ignore spaces when inQuote is true. When you hit quotes when inQuote is true, flag it as false and go back to breaking things up when a space is encountered. You can then extend this as necessary to support escape chars, etc.
Could this be done with a regex? I don't know; I guess so. But the whole function would take less time to write than this reply did.

Apache Commons to the rescue!
import org.apache.commons.text.StringTokenizer
import org.apache.commons.text.matcher.StringMatcher
import org.apache.commons.text.matcher.StringMatcherFactory
@Grab(group='org.apache.commons', module='commons-text', version='1.3')
def str = /is this 'completely "impossible"' or """slightly"" impossible" to parse?/
StringTokenizer st = new StringTokenizer( str )
StringMatcher sm = StringMatcherFactory.INSTANCE.quoteMatcher()
st.setQuoteMatcher( sm )
println st.tokenList
Output:
[is, this, completely "impossible", or, "slightly" impossible, to, parse?]
A few notes:
this is written in Groovy... it is in fact a Groovy script. The @Grab line gives a clue to the sort of dependency line you need (e.g. in build.gradle) ... or just include the .jar in your classpath of course
StringTokenizer here is NOT java.util.StringTokenizer ... as the import line shows it is org.apache.commons.text.StringTokenizer
the def str = ... line is a way to produce a String in Groovy which contains both single quotes and double quotes without having to go in for escaping
StringMatcherFactory in apache commons-text 1.3 can be found here: as you can see, the INSTANCE can provide you with a bunch of different StringMatchers. You could even roll your own: but you'd need to examine the StringMatcherFactory source code to see how it's done.
YES! You can not only include the "other type of quote" and it is correctly interpreted as not being a token boundary ... but you can even escape the actual quote which is being used to turn off tokenising, by doubling the quote within the tokenisation-protected bit of the String! Try implementing that with a few lines of code ... or rather don't!
PS why is it better to use Apache Commons than any other solution?
Apart from the fact that there is no point re-inventing the wheel, I can think of at least two reasons:
The Apache engineers can be counted on to have anticipated all the gotchas and developed robust, comprehensively tested, reliable code
It means you don't clutter up your beautiful code with stoopid utility methods - you just have a nice, clean bit of code which does exactly what it says on the tin, leaving you to get on with the, um, interesting stuff...
PPS Nothing obliges you to look on the Apache code as mysterious "black boxes". The source is open, and written in usually perfectly "accessible" Java. Consequently you are free to examine how things are done to your heart's content. It's often quite instructive to do so.
later
Sufficiently intrigued by ArtB's question I had a look at the source:
in StringMatcherFactory.java we see:
private static final AbstractStringMatcher.CharSetMatcher QUOTE_MATCHER = new AbstractStringMatcher.CharSetMatcher(
"'\"".toCharArray());
... rather dull ...
so that leads one to look at StringTokenizer.java:
public StringTokenizer setQuoteMatcher(final StringMatcher quote) {
if (quote != null) {
this.quoteMatcher = quote;
}
return this;
}
OK... and then, in the same java file:
private int readWithQuotes(final char[] srcChars ...
which contains the comment:
// If we've found a quote character, see if it's followed by a second quote. If so, then we need to actually put the quote character into the token rather than end the token.
... I can't be bothered to follow the clues any further. You have a choice: either your "hackish" solution, where you systematically pre-process your strings before submitting them for tokenising, turning |\\"|s into |""|s... (i.e. where you replace each |"| with |""|)...
Or... you examine org.apache.commons.text.StringTokenizer.java to figure out how to tweak the code. It's a small file. I don't think it would be that difficult. Then you compile, essentially making a fork of the Apache code.
I don't think it can be configured. But if you found a code-tweak solution which made sense you might submit it to Apache and then it might be accepted for the next iteration of the code, and your name would figure at least in the "features request" part of Apache: this could be a form of kleos through which you achieve programming immortality...

In an old fashioned way:
public static String[] split(String str) {
str += " "; // To detect last token when not quoted...
ArrayList<String> strings = new ArrayList<String>();
boolean inQuote = false;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
if (c == '"' || c == ' ' && !inQuote) {
if (c == '"')
inQuote = !inQuote;
if (!inQuote && sb.length() > 0) {
strings.add(sb.toString());
sb.delete(0, sb.length());
}
} else
sb.append(c);
}
return strings.toArray(new String[strings.size()]);
}
I assume that nested quotes are illegal, and also that empty tokens can be omitted.

This is an old question, however this was my solution as a finite state machine.
Efficient, predictable and no fancy tricks.
100% coverage on tests.
Drag and drop into your code.
/**
* Splits a command on whitespaces. Preserves whitespace in quotes. Trims excess whitespace between chunks. Supports quote
* escape within quotes. Failed escape will preserve escape char.
*
* @return List of split commands
*/
static List<String> splitCommand(String inputString) {
List<String> matchList = new LinkedList<>();
LinkedList<Character> charList = inputString.chars()
.mapToObj(i -> (char) i)
.collect(Collectors.toCollection(LinkedList::new));
// Finite-State Automaton for parsing.
CommandSplitterState state = CommandSplitterState.BeginningChunk;
LinkedList<Character> chunkBuffer = new LinkedList<>();
for (Character currentChar : charList) {
switch (state) {
case BeginningChunk:
switch (currentChar) {
case '"':
state = CommandSplitterState.ParsingQuote;
break;
case ' ':
break;
default:
state = CommandSplitterState.ParsingWord;
chunkBuffer.add(currentChar);
}
break;
case ParsingWord:
switch (currentChar) {
case ' ':
state = CommandSplitterState.BeginningChunk;
String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
matchList.add(newWord);
chunkBuffer = new LinkedList<>();
break;
default:
chunkBuffer.add(currentChar);
}
break;
case ParsingQuote:
switch (currentChar) {
case '"':
state = CommandSplitterState.BeginningChunk;
String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
matchList.add(newWord);
chunkBuffer = new LinkedList<>();
break;
case '\\':
state = CommandSplitterState.EscapeChar;
break;
default:
chunkBuffer.add(currentChar);
}
break;
case EscapeChar:
switch (currentChar) {
case '"': // Intentional fall through
case '\\':
state = CommandSplitterState.ParsingQuote;
chunkBuffer.add(currentChar);
break;
default:
state = CommandSplitterState.ParsingQuote;
chunkBuffer.add('\\');
chunkBuffer.add(currentChar);
}
}
}
if (state != CommandSplitterState.BeginningChunk) {
String newWord = chunkBuffer.stream().map(Object::toString).collect(Collectors.joining());
matchList.add(newWord);
}
return matchList;
}
private enum CommandSplitterState {
BeginningChunk, ParsingWord, ParsingQuote, EscapeChar
}

Recently faced a similar question where command line arguments must be split ignoring quotes.
One possible case:
"/opt/jboss-eap/bin/jboss-cli.sh --connect --controller=localhost:9990 -c command=\"deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force\""
This had to be split to
/opt/jboss-eap/bin/jboss-cli.sh
--connect
--controller=localhost:9990
-c
command="deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force"
Just to add to @polygenelubricants's answer, allowing any non-space characters before and after the quoted part makes this work.
"\\S*\"([^\"]*)\"\\S*|(\\S+)"
Example:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Tokenizer {
public static void main(String[] args){
String a = "/opt/jboss-eap/bin/jboss-cli.sh --connect --controller=localhost:9990 -c command=\"deploy " +
"/app/jboss-eap-7.1/standalone/updates/sample.war --force\"";
String b = "Hello \"Stack Overflow\"";
String c = "cmd=\"abcd efgh ijkl mnop\" \"apple\" banana mango";
String d = "abcd ef=\"ghij klmn\"op qrst";
String e = "1 2 \"333 4\" 55 6 \"77\" 8 999";
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("\\S*\"([^\"]*)\"\\S*|(\\S+)");
Matcher regexMatcher = regex.matcher(a);
while (regexMatcher.find()) {
matchList.add(regexMatcher.group());
}
System.out.println("matchList="+matchList);
}
}
Output:
matchList=[/opt/jboss-eap/bin/jboss-cli.sh, --connect, --controller=localhost:9990, -c, command="deploy /app/jboss-eap-7.1/standalone/updates/sample.war --force"]

This is what I myself use for splitting arguments in command line and things like that.
It's easily adjustable for multiple delimiters and quotes, it can process quotes within the words (like al' 'pha), it supports escaping (quotes as well as spaces) and it's really lenient.
public final class StringUtilities {
private static final List<Character> WORD_DELIMITERS = Arrays.asList(' ', '\t');
private static final List<Character> QUOTE_CHARACTERS = Arrays.asList('"', '\'');
private static final char ESCAPE_CHARACTER = '\\';
private StringUtilities() {
}
public static String[] splitWords(String string) {
StringBuilder wordBuilder = new StringBuilder();
List<String> words = new ArrayList<>();
char quote = 0;
for (int i = 0; i < string.length(); i++) {
char c = string.charAt(i);
if (c == ESCAPE_CHARACTER && i + 1 < string.length()) {
wordBuilder.append(string.charAt(++i));
} else if (WORD_DELIMITERS.contains(c) && quote == 0) {
words.add(wordBuilder.toString());
wordBuilder.setLength(0);
} else if (quote == 0 && QUOTE_CHARACTERS.contains(c)) {
quote = c;
} else if (quote == c) {
quote = 0;
} else {
wordBuilder.append(c);
}
}
if (wordBuilder.length() > 0) {
words.add(wordBuilder.toString());
}
return words.toArray(new String[0]);
}
}

The example you have here would just have to be split by the double quote character.

Another old school way is :
public static void main(String[] args) {
String text = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] splits = text.split(" ");
List<String> list = new ArrayList<>();
String token = null; // non-null while inside an open quoted segment
for(String s : splits) {
if (token == null) {
// a word that opens a quote without also closing it starts a segment
if (s.startsWith("\"") && !(s.length() > 1 && s.endsWith("\""))) {
token = s;
} else {
list.add(s);
}
} else {
token = token + " " + s;
if (s.endsWith("\"")) {
list.add(token);
token = null;
}
}
}
System.out.println(list);
}
Output : - [One, two, "three four", five, "six seven eight", nine, "ten"]

private static void findWords(String str) {
boolean flag = false;
StringBuilder sb = new StringBuilder();
for(int i=0;i<str.length();i++) {
if(str.charAt(i)!=' ' && str.charAt(i)!='"') {
sb.append(str.charAt(i));
}
else {
System.out.println(sb.toString());
sb = new StringBuilder();
if(str.charAt(i)==' ' && !flag)
continue;
else if(str.charAt(i)=='"') {
if(!flag) {
flag=true;
}
i++;
while(i<str.length() && str.charAt(i)!='"') {
sb.append(str.charAt(i));
i++;
}
flag=false;
System.out.println(sb.toString());
sb = new StringBuilder();
}
}
}
}

In my case I had a string that includes key="value" . Check this out:
String perfLogString = "2022-11-10 08:35:00,470 PLV=REQ CIP=902.68.5.11 CMID=canonaustr CMN=\"Yanon Australia Pty Ltd\"";
// and this came to my rescue :
String[] str1= perfLogString.split("\\s(?=(([^\"]*\"){2})*[^\"]*$)\\s*");
System.out.println(Arrays.toString(str1));
This regex matches a space only if it is followed by an even number of double quotes.
On split I get :
[2022-11-10, 08:35:00,470, PLV=REQ, CIP=902.68.5.11, CMID=canonaustr, CMN="Yanon Australia Pty Ltd"]

try this:
String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] strings = str.split("[ ]?\"[ ]?");

I don't know the context of what you're trying to do, but it looks like you're trying to parse command line arguments. In general, this is pretty tricky with all the escaping issues; if this is your goal I'd personally look at something like JCommander.

Try this:
String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String strArr[] = str.split("\"|\\s");
It's kind of tricky because you need to escape the double quotes. This regular expression should tokenize the string using either a whitespace (\s) or a double quote.
You should use String's split method because it accepts regular expressions, whereas the constructor argument for delimiter in StringTokenizer doesn't. At the end of what I provided above, you can just add the following:
String s = "";
for(String k : strArr) {
s += k;
}
StringTokenizer strTok = new StringTokenizer(s);
