split a string at comma but avoid escaped comma and backslash - java

I'd like to split a string at comma ",". The string contains escaped commas "\," and escaped backslashes "\\". Commas at the beginning and end as well as several commas in a row should lead to empty strings.
So ",,\,\\,," should become "", "", "\,\\", "", ""
Note that my example strings show backslash as single "\". Java strings would have them doubled.
I tried with several packages but had no success. My last idea would be to write my own parser.

In this case a custom function sounds better to me. Try this:
public String[] splitEscapedString(String s) {
    // Placeholder that must not appear in the input.
    // If you are reading lines, '\n' works because a line never contains it.
    String c = "\n";
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); ++i) {
        if (s.charAt(i) == '\\' && i + 1 < s.length()) {
            // Escaped character: drop the backslash and keep the next character as-is.
            sb.append(s.charAt(++i));
        } else if (s.charAt(i) == ',') {
            // Unescaped comma: mark the field boundary.
            sb.append(c);
        } else {
            sb.append(s.charAt(i));
        }
    }
    // A limit of -1 keeps trailing empty strings, which split(c) would discard.
    return sb.toString().split(c, -1);
}
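If the escape sequences should stay in the output, as in the example in the question, the same character-by-character scan can collect the fields directly into a list, which also removes the need for a placeholder character. This is only a minimal sketch under that assumption; the class name EscapedSplitter and the method name split are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapedSplitter {
    // Splits at unescaped commas. Escape sequences (backslash + next char)
    // are copied through unchanged, and empty fields are kept.
    public static List<String> split(String s) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '\\' && i + 1 < s.length()) {
                // Keep the backslash and the character it escapes together.
                current.append(ch).append(s.charAt(++i));
            } else if (ch == ',') {
                // Unescaped comma: close the current field, even if it is empty.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        // The text after the last comma is always a field of its own.
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Java source doubles each backslash; the actual input is ,,\,\\,,
        for (String field : split(",,\\,\\\\,,")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

A lone backslash at the very end of the input is treated as an ordinary character here; whether that should be an error instead is a design choice.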

Don't use .split(); instead, find all matches between (unescaped) commas:
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile(
    "(?: # Start of group\n" +
    " \\\\. # Match either an escaped character\n" +
    "| # or\n" +
    " [^\\\\,]++ # Match one or more characters except comma/backslash\n" +
    ")* # Do this any number of times",
    Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
// Restart the search just past the comma that follows each match. A plain
// find() loop would also report an extra empty match after every non-empty field.
int pos = 0;
while (pos <= subjectString.length() && regexMatcher.find(pos)) {
    matchList.add(regexMatcher.group());
    pos = regexMatcher.end() + 1;
}
Result: ["", "", "\\,\\\\", "", ""]
I used a possessive quantifier (++) to avoid excessive backtracking caused by the nested quantifiers.

While a dedicated library is certainly a good idea, the following will work:
public static String[] splitValues(final String input) {
final ArrayList<String> result = new ArrayList<String>();
// (?:\\\\)* matches any number of \-pairs
// (?<!\\) ensures that the \-pairs aren't preceded by a single \
final Pattern pattern = Pattern.compile("(?<!\\\\)(?:\\\\\\\\)*,");
final Matcher matcher = pattern.matcher(input);
int previous = 0;
while (matcher.find()) {
result.add(input.substring(previous, matcher.end() - 1));
previous = matcher.end();
}
result.add(input.substring(previous, input.length()));
return result.toArray(new String[result.size()]);
}
The idea is to find each , preceded by zero or an even number of \ (that is, an unescaped ,). Because the , is the last character of the match, the field is cut at end() - 1, which is just before the ,.
The function is tested against most cases I can think of, except for null input. If you prefer working with a List<String>, you can of course change the return type; I just adopted the pattern implemented in split() to handle escapes.
Example class utilizing this function:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Print {
public static void main(final String[] args) {
String input = ",,\\,\\\\,,";
final String[] strings = splitValues(input);
System.out.print("\""+input+"\" => ");
printQuoted(strings);
}
public static String[] splitValues(final String input) {
final ArrayList<String> result = new ArrayList<String>();
// (?:\\\\)* matches any number of \-pairs
// (?<!\\) ensures that the \-pairs aren't preceded by a single \
final Pattern pattern = Pattern.compile("(?<!\\\\)(?:\\\\\\\\)*,");
final Matcher matcher = pattern.matcher(input);
int previous = 0;
while (matcher.find()) {
result.add(input.substring(previous, matcher.end() - 1));
previous = matcher.end();
}
result.add(input.substring(previous, input.length()));
return result.toArray(new String[result.size()]);
}
public static void printQuoted(final String[] strings) {
if (strings.length > 0) {
System.out.print("[\"");
System.out.print(strings[0]);
for(int i = 1; i < strings.length; i++) {
System.out.print("\", \"");
System.out.print(strings[i]);
}
System.out.println("\"]");
} else {
System.out.println("[]");
}
}
}
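The "zero or an even number of backslashes" rule can also be checked without a regex by counting the backslashes directly in front of each comma. The following is a rough sketch of that idea, not part of the answer above; the class name BackslashCountSplitter is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BackslashCountSplitter {
    // A comma is a delimiter when it is preceded by an even number
    // (including zero) of consecutive backslashes.
    public static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        int fieldStart = 0;
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) != ',') {
                continue;
            }
            // Count the run of backslashes immediately before this comma,
            // without looking past the start of the current field.
            int backslashes = 0;
            for (int j = i - 1; j >= fieldStart && input.charAt(j) == '\\'; j--) {
                backslashes++;
            }
            if (backslashes % 2 == 0) {
                result.add(input.substring(fieldStart, i));
                fieldStart = i + 1;
            }
        }
        // Remainder after the last unescaped comma (may be empty).
        result.add(input.substring(fieldStart));
        return result;
    }

    public static void main(String[] args) {
        // Actual input: ,,\,\\,,
        for (String field : split(",,\\,\\\\,,")) {
            System.out.println("[" + field + "]");
        }
    }
}
```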

I have used the solution below as a generic string splitter that handles quotes (' and ") and the escape character (\).
public static List<String> split(String str, final char splitChar) {
List<String> queries = new ArrayList<>();
int length = str.length();
int start = 0, current = 0;
char ch, quoteChar;
while (current < length) {
ch=str.charAt(current);
// Handle escape char by skipping next char
if(ch == '\\') {
current++;
}else if(ch == '\'' || ch=='"'){ // Handle quoted values
quoteChar = ch;
current++;
while(current < length) {
ch = str.charAt(current);
// Handle escape char by skipping next char
if (ch == '\\') {
current++;
} else if (ch == quoteChar) {
break;
}
current++;
}
}else if(ch == splitChar) { // Split string
queries.add(str.substring(start, current + 1));
start = current + 1;
}
current++;
}
// Add last value
if (start < current) {
queries.add(str.substring(start));
}
return queries;
}
public static void main(String[] args) {
String str = "abc,x\\,yz,'de,f',\"lm,n\"";
List<String> queries = split(str, ',');
System.out.println("Size: "+queries.size());
for (String query : queries) {
System.out.println(query);
}
}
This gives the result:
Size: 4
abc,
x\,yz,
'de,f',
"lm,n"

Related

How to write a replaceAll function java?

I'm trying to write a program that will allow a user to input a phrase (for example: "I like cats") and print each word on a separate line. I have already written the part to allow a new line at every space but I don't want to have blank lines between the words because of excess spaces. I can't use any regular expressions such as String.split(), replaceAll() or trim().
I tried using a few different methods but I don't know how to delete spaces if you don't know the exact number there could be. I tried a bunch of different methods but nothing seems to work.
Is there a way I could implement it into the code I've already written?
for (i=0; i<length-1;) {
j = text.indexOf(" ", i);
if (j==-1) {
j = text.length();
}
System.out.print("\n"+text.substring(i,j));
i = j+1;
}
Or how can I write a new expression for it? Any suggestions would really be appreciated.
"I have already written the part to allow a new line at every space but I don't want to have blank lines between the words because of excess spaces."
If you can't use trim() or replaceAll(), you can use java.util.Scanner to read each word as a token. By default Scanner uses white space pattern as a delimiter for finding tokens. Similarly, you can also use StringTokenizer to print each word on new line.
String str = "I like cats";
Scanner scanner = new Scanner(str);
while (scanner.hasNext()) {
System.out.println(scanner.next());
}
OUTPUT
I
like
cats
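For comparison, here is a minimal sketch of the StringTokenizer variant mentioned above. The class name TokenizerWords and the method name words are invented for this example. With its default delimiters StringTokenizer splits on whitespace and never returns empty tokens, so runs of spaces produce no blank lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerWords {
    // Collects the whitespace-separated words of text, skipping repeated spaces.
    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text); // default delimiters: whitespace
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String word : words("I   like  cats")) {
            System.out.println(word);
        }
    }
}
```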
Here is a simple solution using substring() and indexOf()
public static void main(String[] args) {
List<String> split = split("I like cats");
split.forEach(System.out::println);
}
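public static List<String> split(String s){
    List<String> list = new ArrayList<>();
    while(s.contains(" ")){
        int pos = s.indexOf(' ');
        // Skip the empty piece produced by consecutive (or leading) spaces
        if (pos > 0) {
            list.add(s.substring(0, pos));
        }
        s = s.substring(pos + 1);
    }
    // Add the last word unless the input ended with a space
    if (!s.isEmpty()) {
        list.add(s);
    }
    return list;
}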
Edit:
If you only want to print the text without splitting or making lists, you can use this:
public static void main(String[] args) {
newLine("I like cats");
}
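public static void newLine(String s){
    while(s.contains(" ")){
        int pos = s.indexOf(' ');
        // Print nothing for the empty piece between consecutive spaces
        if (pos > 0) {
            System.out.println(s.substring(0, pos));
        }
        s = s.substring(pos + 1);
    }
    // Print the last word unless the input ended with a space
    if (!s.isEmpty()) {
        System.out.println(s);
    }
}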
I think this will solve your problem.
public static List<String> getWords(String text) {
List<String> words = new ArrayList<>();
BreakIterator breakIterator = BreakIterator.getWordInstance();
breakIterator.setText(text);
int lastIndex = breakIterator.first();
while (BreakIterator.DONE != lastIndex) {
int firstIndex = lastIndex;
lastIndex = breakIterator.next();
if (lastIndex != BreakIterator.DONE && Character.isLetterOrDigit(text.charAt(firstIndex))) {
words.add(text.substring(firstIndex, lastIndex));
}
}
return words;
}
public static void main(String[] args) {
String text = "I like cats";
List<String> words = getWords(text);
for (String word : words) {
System.out.println(word);
}
}
Output :
I
like
cats
What about something like this? It runs in O(N) time:
Just use a StringBuilder to build the result as you iterate through the string, appending a single "\n" for each run of spaces.
String word = "I like cats";
StringBuilder sb = new StringBuilder();
boolean newLine = true;
for(int i = 0; i < word.length(); i++) {
if (word.charAt(i) == ' ') {
if (newLine) {
sb.append("\n");
newLine = false;
}
} else {
newLine = true;
sb.append(word.charAt(i));
}
}
String result = sb.toString();
EDIT: Fixed the problem mentioned in the comments (new line on multiple spaces).
Sorry, I did not notice that you cannot use replaceAll().
This is my other solution:
String s = "I like cats";
Pattern p = Pattern.compile("([\\S])+");
Matcher m = p.matcher(s);
while (m.find( )) {
System.out.println(m.group());
}
Old solution:
String s = "I like cats";
System.out.println(s.replaceAll("( )+","\n"));
You have done almost all of the work. Just make a small addition, and your code will work as you wish:
for (int i = 0; i < length - 1;) {
j = text.indexOf(" ", i);
if (i == j) { //if next space after space, skip it
i = j + 1;
continue;
}
if (j == -1) {
j = text.length();
}
System.out.print("\n" + text.substring(i, j));
i = j + 1;
}

Reverse multiple strings between delimiters

I need to replace substrings within a string that are delimited. For example (abc),(def) should be (cba),(fed) after reversing.
I tried the following code but it gives back the string without reversing.
String s = "(abc),(cdef)";
s = s.replaceAll("\\(\\[.*?\\]\\)",
new StringBuilder("$1").reverse().toString());
An alternative:
String s = "(abc),(cdef),(ghij)", res = "";
Matcher m = Pattern.compile("\\((.*?)\\)").matcher(s);
while(m.find()){
res += "(" + new StringBuilder(m.group(1)).reverse().toString() + "),";
}
if(res.length() > 0)
res = res.substring(0,res.length()-1);
System.out.println(res);
Prints:
(cba),(fedc),(jihg)
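The attempt in the question cannot work because the replacement argument of replaceAll is evaluated once, before any matching happens, so reversing "$1" only produces the literal string "1$". One way to compute a replacement per match is Matcher.appendReplacement together with appendTail. The following is a minimal sketch of that technique; the class and method names are hypothetical, and it assumes the parentheses are not nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReverseGroups {
    // Reverses the text inside each (...) and leaves everything else untouched.
    public static String reverseInsideParens(String s) {
        Matcher m = Pattern.compile("\\(([^)]*)\\)").matcher(s);
        // StringBuffer keeps this compatible with Java 8.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String reversed = new StringBuilder(m.group(1)).reverse().toString();
            // quoteReplacement prevents '$' or '\' in the data from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement("(" + reversed + ")"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseInsideParens("(abc),(cdef),(ghij)"));
    }
}
```

Unlike the loop above, this keeps the delimiters and any text outside the parentheses exactly as they were in the input.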
Another alternative if you are using Java 8:
String s = "(abc),(cdef),(ghijklm)";
Pattern pattern = Pattern.compile("[a-z]+");
Matcher matcher = pattern.matcher(s);
List<String> reversedStrings = new ArrayList<>();
while(matcher.find()){
reversedStrings.add(new StringBuilder(matcher.group()).reverse().toString());
}
reversedStrings.forEach(System.out::print);
Low tech approach using a stack to reverse:
public static String reverse(String s) {
StringBuilder buffer = new StringBuilder();
Stack<Character> stack = new Stack<>();
for(char c : s.toCharArray() ) {
if(Character.isLetter(c)) { stack.push(c); }
else if(c == ')') {
while (!stack.isEmpty()) { buffer.append(stack.pop()); }
buffer.append(',');
}
}
return buffer.deleteCharAt(buffer.length()-1).toString();
}
For a different take, here is an algorithm for performing an in-place trim of the parentheses and internal reversal of each component on a StringBuilder. I have omitted input validation checking in favor of focusing on the core algorithm. You might want to add more input validation if you use this for real. For example, it currently throws an exception on an empty input string or a string that accidentally has a trailing ',' at the end, not followed by another string component.
public class TestReverse {
public static void main(String[] args) {
for (String arg: args) {
StringBuilder input = new StringBuilder(arg);
// Point start at first '(' and end at first ','.
int start = 0, end = input.indexOf(",");
// Keep iterating over string components as long as we find another ','.
while (end > 0) {
// Trim leading '(' and readjust end to keep it pointing at ','.
input.deleteCharAt(start);
end -= 1;
// Trim trailing ')' and readjust end again to keep it pointing at ','.
input.deleteCharAt(end - 1);
end -= 1;
// Reverse the remaining range of the component.
reverseStringBuilderRange(input, start, end - 1);
// Point start at next '(' and end at next ',', or -1 if no ',' remaining.
start = end + 1;
end = input.indexOf(",", start);
}
// Handle the last string component, which won't have a trailing ','.
input.deleteCharAt(start);
input.deleteCharAt(input.length() - 1);
reverseStringBuilderRange(input, start, input.length() - 1);
System.out.println(input);
}
}
private static void reverseStringBuilderRange(StringBuilder sb, int start, int end) {
for (int i = start, j = end; i < j; ++i, --j) {
char temp = sb.charAt(i);
sb.setCharAt(i, sb.charAt(j));
sb.setCharAt(j, temp);
}
}
}
> javac TestReverse.java && java TestReverse '(abc),(def)' '(foo),(bar),(baz)' '(just one)'
cba,fed
oof,rab,zab
eno tsuj

Split string into list of substrings of different character types

I am writing a spell checker that takes a text file as input and outputs the file with spelling corrected.
The program should preserve formatting and punctuation.
I want to split the input text into a list of string tokens such that each token is either 1 or more: word, punctuation, whitespace, or digit characters.
For example:
Input:
words.txt:
asdf don't ]'.'..;'' as12....asdf.
asdf
Input as list:
["asdf" , " " , "don't" , " " , "]'.'..;''" , " " , "as" , "12" ,
"...." , "asdf" , "." , "\n" , "asdf"]
Words like won't and i'll should be treated as a single token.
Having the data in this format would allow me to process the tokens like so:
String output = "";
for(String token : tokens) {
if(isWord(token)) {
if(!inDictionary(token)) {
token = correctSpelling(token);
}
}
output += token;
}
So my main question is how can i split a string of text into a list of substrings as described above? Thank you.
The main difficulty here is finding the regex that matches what you consider to be a "word". In my example I consider ' to be part of a word if it's preceded by a letter or if the following character is a letter:
public static void main(String[] args) {
String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";
//The pattern:
Pattern p = Pattern.compile("[\\p{Alpha}][\\p{Alpha}']*|'[\\p{Alpha}]+");
Matcher m = p.matcher(in);
//If you want to collect the words
List<String> words = new ArrayList<String>();
StringBuilder result = new StringBuilder();
//Now find matches from the start
int pos = 0;
while(m.find(pos)) {
//Add everything from starting position to beginning of word
result.append(in.substring(pos, m.start()));
//Handle dictionary logic
String token = m.group();
words.add(token); //not used actually
if(!inDictionary(token)) {
token = correctSpelling(token);
}
//Add to result
result.append(token);
//Repeat from end position
pos = m.end();
}
//Append remainder of input
result.append(in.substring(pos));
System.out.println("Result: " + result.toString());
}
Because I like solving puzzles, I tried the following and I think it works fine:
public class MyTokenizer {
private final String str;
private int pos = 0;
public MyTokenizer(String str) {
this.str = str;
}
public boolean hasNext() {
return pos < str.length();
}
public String next() {
int type = getType(str.charAt(pos));
StringBuilder sb = new StringBuilder();
while(hasNext() && (str.charAt(pos) == '\'' || type == getType(str.charAt(pos)))) {
sb.append(str.charAt(pos));
pos++;
}
return sb.toString();
}
private int getType(char c) {
String sc = Character.toString(c);
if (sc.matches("\\d")) {
return 0;
}
else if (sc.matches("\\w")) {
return 1;
}
else if (sc.matches("\\s")) {
return 2;
}
else if (sc.matches("\\p{Punct}")) {
return 3;
}
else {
return 4;
}
}
public static void main(String... args) {
MyTokenizer mt = new MyTokenizer("asdf don't ]'.'..;'' as12....asdf.\nasdf");
while(mt.hasNext()) {
System.out.println(mt.next());
}
}
}
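The token list asked for in the question can also be produced with a single alternation regex, where each alternative describes one token type. This is a sketch under some assumptions: words are ASCII letters with optional internal apostrophes (as in don't), anything that is not a letter, digit, or whitespace counts as punctuation, and the class name RegexTokenizer is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    // Alternatives, tried in this order at each position:
    //   word (letters, with apostrophes only between letters) | digits | whitespace | everything else
    private static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z]+(?:'[A-Za-z]+)*|\\d+|\\s+|[^A-Za-z\\d\\s]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";
        for (String token : tokenize(in)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because every character belongs to exactly one of the four classes, concatenating the tokens gives back the original text, which is what preserving formatting and punctuation requires.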

Regex for replacing specific characters before and after specific substring

I am going through the Java CodingBat exercises. Here is the one I have just completed:
Given a string and a non-empty word string, return a string made of each char just before and just after every appearance of the word in the string. Ignore cases where there is no char before or after the word, and a char may be included twice if it is between two words.
My code, which works:
public String wordEnds(String str, String word){
String s = "";
String n = " " + str + " "; //To avoid OOB exceptions
int sL = str.length();
int wL = word.length();
int nL = n.length();
int i = 1;
while (i < nL - 1) {
if (n.substring(i, i + wL).equals(word)) {
s += n.charAt(i - 1);
s += n.charAt(i + wL);
i += wL;
} else {
i++;
}
}
s = s.replaceAll("\\s", "");
return s;
}
My question is about regular expressions. I want to know if the above is doable with a regex statement, and if so, how?
You can use Java regex objects Pattern and Matcher for doing this.
public class CharBeforeAndAfterSubstring {
public static String wordEnds(String str, String word) {
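// Pattern.quote() makes the word match literally, even if it contains regex metacharacters
java.util.regex.Pattern p = java.util.regex.Pattern.compile(java.util.regex.Pattern.quote(word));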
java.util.regex.Matcher m = p.matcher(str);
StringBuilder beforeAfter = new StringBuilder();
for (int startIndex = 0; m.find(startIndex); startIndex = m.start() + 1) {
if (m.start() - 1 > -1)
beforeAfter.append(Character.toChars(str.codePointAt(m.start() - 1)));
if (m.end() < str.length())
beforeAfter.append(Character.toChars(str.codePointAt(m.end())));
}
return beforeAfter.toString();
}
public static void main(String[] args) {
String x = "abcXY1XYijk";
String y = "XY";
System.out.println(wordEnds(x, y));
}
}
(?=(.|^)XY(.|$))
Try this. Just grab the captures and remove the None or empty values. See the demo.
https://regex101.com/r/sJ9gM7/73
To get a string containing the character before and after each occurrence of one string within the other, you could use the regular expression:
"(^|.)" + word + "(.|$)"
and then iterate through the groups and concatenate them.
This expression looks for (^|.), either the start of the string ^ or any character ., followed by the value of word, followed by (.|$), either any character . or the end of the string $.
You could try something like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
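public String wordEnds(String str, String word){
    // (.) captures the char before and the char after the literal word.
    // Only occurrences that have a char on both sides are reported.
    Pattern p = Pattern.compile("(.)" + Pattern.quote(word) + "(.)");
    Matcher m = p.matcher(str);
    StringBuilder result = new StringBuilder();
    // Restart one char after each match start, so the char after one
    // occurrence can also serve as the char before the next one.
    for (int from = 0; m.find(from); from = m.start() + 1) {
        result.append(m.group(1)).append(m.group(2));
    }
    return result.toString();
}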

String split & join

I have a collection of strings. I need to be able to join the items in this collection into one string and afterwards split that string backwards and get original string collection.
Definitely I need to introduce a delimiter character for join/split operation. Given the fact that original strings can contain any characters, I also need to deal with delimiter escaping. My question is very simple - is there a Java class/library that can provide me required functionality out-of-the-box? Something like:
String join(String[] source, String delimiter, String escape);
String[] split(String source, String delimiter, String escape);
or similar, without having to do the work manually?
Without the escaping part, there are:
StringUtils.split(..) and StringUtils.join(..) from commons-lang
Joiner and Splitter from guava.
Splitting: String.split takes a regex pattern (the delimiter) as its argument and returns a String[].
Split and Join escaping separator:
@NonNull
public static String joinEscaping(char separator, String... aa) {
String escape = separator != '\\' ? "\\" : "#";
StringBuilder result = new StringBuilder();
for (int i = 0; i < aa.length; i++) {
String a = aa[i];
a = (a == null) ? "" : a;
a = a.replace(escape, escape + escape);
a = a.replace(separator + "", escape + separator);
result.append(a);
if (i + 1 < aa.length) {
result.append(separator);
}
}
return result.toString();
}
public static List<String> splitUnescaping(char separator, @NonNull String str) {
char escape = separator != '\\' ? '\\' : '#';
List<String> result = new ArrayList<>();
int yourAreHere = 0;
boolean newPart = true;
while (true) {
int sep = str.indexOf(separator, yourAreHere);
int exc = str.indexOf(escape, yourAreHere);
if (sep == -1 && exc == -1) { // last part
add(result, str.substring(yourAreHere), newPart);
break;
}
if (sep == -1 && exc + 1 == str.length()) { // ghost escape
add(result, str.substring(yourAreHere, exc), newPart);
break;
}
if (exc == -1 || (sep != -1 && sep < exc)) {
add(result, str.substring(yourAreHere, sep), newPart);
yourAreHere = sep + 1;
newPart = true;
} else {
char next = str.charAt(exc + 1);
add(result, str.substring(yourAreHere, exc) + next, newPart);
yourAreHere = exc + 2;
newPart = false;
}
}
return result;
}
private static void add(List<String> result, String part, boolean newPart) {
if (newPart) {
result.add(part);
} else {
int last = result.size() - 1;
result.set(last, result.get(last) + part);
}
}
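For illustration, here is a compact sketch of the same round trip with the escape character passed explicitly, closer to the signatures asked for in the question. It is not a library API: the class name EscapedJoin is invented, it works on single characters rather than strings, and it assumes the delimiter and the escape character are different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EscapedJoin {
    // Joins parts with delimiter, prefixing every delimiter or escape
    // character inside a part with the escape character.
    public static String join(List<String> parts, char delimiter, char escape) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            for (char c : parts.get(i).toCharArray()) {
                if (c == delimiter || c == escape) {
                    sb.append(escape);
                }
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Reverses join: splits at unescaped delimiters and removes the escapes.
    public static List<String> split(String joined, char delimiter, char escape) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < joined.length(); i++) {
            char c = joined.charAt(i);
            if (c == escape && i + 1 < joined.length()) {
                current.append(joined.charAt(++i)); // take the escaped char literally
            } else if (c == delimiter) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("a,b", "c\\d", "", "x");
        String joined = join(original, ',', '\\');
        System.out.println(joined);
        System.out.println(split(joined, ',', '\\').equals(original));
    }
}
```

One limitation of this sketch: an empty list and a list containing a single empty string both join to "", so split returns a one-element list for either.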
