splitting a comma delimited string with escaping quotes - java

I see that there are several similar questions, but I have not found any of the answers satisfactory. I have a comma delimited file where each line looks something like this:
4477,52544,,,P,S, ,,SUSAN JONES,9534 Black Bear Dr,,"CITY, NV 89506",9534 BLACK BEAR DR,,CITY,NV,89506,2008,,,, , , , ,,1
The problem arises when a token uses quotes to escape a comma, as in "CITY, NV 89506".
I need a result where the escaped tokens are handled and every token is included, even empty ones.

Consider a proper CSV parser such as opencsv. It will be highly tested (unlike a new, home-grown solution) and handle edge-conditions such as the one you describe (and lots you haven't thought about).
In the download, there is an examples folder which contains "addresses.csv" with this line:
Jim Sample,"3 Sample Street, Sampleville, Australia. 2615",jim@sample.com
In the same directory, the file AddressExample.java parses this file, and is highly relevant to your question.

Here is one way to answer your question using delivered java.lang.String methods. I believe it does what you need.
private final char QUOTE = '"';
private final char COMMA = ',';
private final char SUB = 0x001A; // or whatever character you know will NEVER
// appear in the input String
public void readLine(String line) {
System.out.println("original: " + line);
// Replace commas inside quoted text with substitute character
boolean quote = false;
for (int index = 0; index < line.length(); index++) {
char ch = line.charAt(index);
if (ch == QUOTE) {
quote = !quote;
} else if (ch == COMMA && quote) {
line = replaceChar(line, index, SUB);
System.out.println("replaced: " + line);
}
}
// Strip out all quotation marks
for (int index = 0; index < line.length(); index++) {
if (line.charAt(index) == QUOTE) {
line = removeChar(line, index);
index--; // re-check this position: the next character has shifted into it
System.out.println("stripped: " + line);
}
}
// Parse input into tokens
String[] tokens = line.split(",", -1); // a negative limit keeps trailing empty tokens
// restore commas in place of SUB characters
for (int i = 0; i < tokens.length; i++) {
tokens[i] = tokens[i].replace(SUB, COMMA);
}
// Display final results
System.out.println("Final Parsed Tokens: ");
for (String token : tokens) {
System.out.println("[" + token + "]");
}
}
private String replaceChar(String input, int position, char replacement) {
String begin = input.substring(0, position);
String end = input.substring(position + 1, input.length());
return begin + replacement + end;
}
private String removeChar(String input, int position) {
String begin = input.substring(0, position);
String end = input.substring(position + 1, input.length());
return begin + end;
}
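A note on empty tokens, since the question asks for every token to be included: String.split(String) discards trailing empty strings, while the two-argument form with a negative limit keeps them. Below is a minimal sketch of the difference. The class name SplitLimitDemo and the helper splitKeepingEmpties are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical demo class; only String.split is taken from the JDK.
class SplitLimitDemo {

    // Splits on commas and keeps every empty token, including trailing ones.
    static String[] splitKeepingEmpties(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "a,,b,,";
        // One-argument split: trailing empty strings are removed.
        System.out.println(Arrays.toString(line.split(",")));
        // Negative limit: trailing empty strings are kept.
        System.out.println(Arrays.toString(splitKeepingEmpties(line)));
    }
}
```

Empty tokens in the middle of the line are kept by both forms; only the ones at the end differ.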

Related

CSV Data excluding commas between another character set

For a class assignment, I'm using data from https://www.kaggle.com/shivamb/netflix-shows, which has presented a small problem for me:
It is a CSV; however, the values of the cast variable are also separated by commas, which breaks the .split call I was using. The data has the form [value, value, value," value,value ", value, ...]. The goal is to exclude the values within the " ".
Currently, to run this function I have:
while ( inFile.hasNext() ){
String delims = "[,]"; // delimiters for separation
String[] tokens = inFile.nextLine().split(delims); // split the line into a String array
for (String token : tokens) {
System.out.println(token);
}
}
Because it's a class assignment, I would simply code the logic by hand.
For each character, decide whether to add it to the current word or to start a new word. It is then easy to track whether you are inside the " " and react accordingly.
Something like this:
public List<String> split(String line)
{
List<String> result = new ArrayList<>();
String currentWord = "";
boolean inWord = false;
for (int i = 0; i < line.length(); i++)
{
char c = line.charAt(i);
if (c == ',' && !inWord)
{
result.add(currentWord.trim());
currentWord = "";
continue;
}
if (c == '"')
{
inWord = !inWord;
continue;
}
currentWord += c;
}
result.add(currentWord.trim()); // add the final field, which has no trailing comma
return result;
}
There are some hard-core regular expressions for this, such as the ones in "Splitting on comma outside quotes",
but I would not use them in an assignment.
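As a possible extension of the character-by-character idea above, here is a sketch that also treats a doubled quote ("") inside a quoted field as a literal quote character. That convention is an assumption about the data, since the question does not show such a case. The class name QuotedCsvSplitter is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; a sketch, not a replacement for a tested CSV library.
class QuotedCsvSplitter {

    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field (assumed convention)
                    i++;                 // skip the second quote of the pair
                } else {
                    inQuotes = !inQuotes; // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString()); // field boundary outside quotes
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // the last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        for (String field : split("a,\"b, c\",,d")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Unlike the version above, this sketch does not trim the fields, so leading and trailing spaces are preserved and empty fields come back as empty strings.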
I'm sure there is a simpler way of doing this but this is one solution I came up with.
while ( inFile.hasNext() ) {
int quote = 0;
String delims = "[,]"; // delimiters for separation
String[] tokens = inFile.nextLine().split(delims);
for (String token : tokens) {
if(token.contains("\"")) { //If contains a quote
quote++; //Increment quote counter
}
if (quote != 1) //If not between quotes
{
if(token.indexOf(" ") == -1) //Print if no space at beginning
{
System.out.println(token);
}
else { //Print from first character
System.out.println(token.substring(token.indexOf(" ") + 1));
}
}
}
}
inFile.close();

Pig it method that I am trying to make trouble checking punctuation at the end java

I am trying to answer this question.
Move the first letter of each word to the end of it, then add "ay" to the end of the word. Leave punctuation marks untouched.
This is what I did so far:
public static String pigIt(String str) {
//Populating the String argument into the String Array after splitting them by spaces
String[] strArray = str.split(" ");
System.out.println("\nPrinting strArray: " + Arrays.toString(strArray));
String toReturn = "";
for (int i = 0; i < strArray.length; i++) {
String word = strArray[i];
for (int j = 1; j < word.length(); j++) {
toReturn += Character.toString(word.charAt(j));
}
//Outside of inner for loop
if (!(word.contains("',.!?:;")) && (i != strArray.length - 1)) {
toReturn += Character.toString(word.charAt(0)) + "ay" + " ";
} else if (word.contains("',.!?:;")) {
toReturn += Character.toString(word.charAt(0)) + "ay" + " " + strArray[strArray.length - 1];
}
}
return toReturn;
}
It is supposed to return the punctuation mark without adding "ay" + "". I think I am overthinking but please help. Please see the below debugger.
One of the problems here is that your else if branch is never executed. The .contains method does not treat its argument as a set of characters; it looks for the whole argument as one substring. So your conditions are essentially asking whether the word contains the entire string "',.!?:;". If you keep only the exclamation point in there, the branch will be invoked for words containing "!". With contains alone, you would need a condition for each character, like word.contains("!") || word.contains(",") || word.contains("'"), etc. You can also use regex for this problem.
Alternatively, you can use something like,
char ch = yourString.charAt(i);
if (!Character.isAlphabetic(ch)) {
// ch is not a letter: treat it as a symbol or punctuation
}
to determine if a character is not an alphabetical one, and is a symbol or punctuation.
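The check discussed above (does the word contain any one of several punctuation characters?) can be sketched without regex by testing each character of the word against a string of punctuation marks. The class name PunctuationCheck and the method containsPunctuation are made up for illustration; the punctuation set is copied from the question.

```java
// Hypothetical helper illustrating an "any of these characters" check.
class PunctuationCheck {

    // The punctuation set used in the question's condition.
    private static final String PUNCTUATION = "',.!?:;";

    // Returns true if at least one character of word is in PUNCTUATION.
    static boolean containsPunctuation(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (PUNCTUATION.indexOf(word.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsPunctuation("World!"));
        System.out.println(containsPunctuation("Hello"));
    }
}
```

String.indexOf(int) returns -1 when the character is absent, which is why the comparison is against zero.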
I think the best way is not to rely on str.split("\\s+"), because punctuation can appear in any place. It is better to scan the string and find every character that is not a letter or digit. From those positions you can determine the word boundaries and translate each word.
public static String pigIt(String str) {
StringBuilder buf = new StringBuilder();
for (int i = 0, j = 0; j <= str.length(); j++) {
char ch = j < str.length() ? str.charAt(j) : '\0';
if (Character.isLetterOrDigit(ch))
continue;
if (i < j) {
buf.append(str.substring(i + 1, j));
buf.append(str.charAt(i));
buf.append("ay");
}
if (ch != '\0')
buf.append(ch);
i = j + 1;
}
return buf.toString();
}
Output:
System.out.println(pigIt(",Hello, !World")); // ,elloHay, !orldWay
Regex may be difficult to start with but is very powerful:
public static String pigIt(String str) {
return str.replaceAll("([a-zA-Z])([a-zA-Z]*)", "$2$1ay");
}
The () specify groups. So I have one group with the first alphabet character and a second group with the remaining alphabet characters.
In the replace parameter you can refer to these groups ($1, $2).
String.replaceAll will search all matching string parts and apply the replacement. Non matching characters like the punctuations are left untouched.
public static void main(String[] args) {
System.out.println("Hello, World, ! -->"+ pigIt("Hello, World, !"));
System.out.println("Hello?, Wo$, F, ! -->"+ pigIt("Hello?, Wo$, F, !"));
}
The output of this method is:
Hello, World, ! -->elloHay, orldWay, !
Hello?, Wo$, F, ! -->elloHay?, oWay$, Fay, !
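To see what the two groups capture for each word, here is a small sketch that walks the matches with java.util.regex.Matcher instead of replacing them. The class name GroupDemo and the method describe are made up for illustration; the pattern is the one used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo showing the content of group 1 and group 2 for every match.
class GroupDemo {

    // Returns "first|rest" for each matched word, separated by single spaces.
    static String describe(String input) {
        Matcher m = Pattern.compile("([a-zA-Z])([a-zA-Z]*)").matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group(1)).append('|').append(m.group(2)).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(describe("Hello, Wo$ F"));
    }
}
```

For a one-letter word such as "F", group 2 is the empty string, which is why the replacement still produces "Fay".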

Format the results of JTextArea where it doesn't skip a line?

I wrote a program that, given a list of items, wraps each one in single quotes and puts a comma after it, so that
"Dogs are cool" becomes 'Dogs', 'are', 'cool'
The issue is that the opening single quote of each item ends up at the end of the previous line.
Here are the results:
'190619904419','
190619904469','
190619904569','
190619904669','
190619904759','
190619904859','
190619904869','
'
see how it appends the single quote to the end of the first line
when it should be the following
'190619904419',
'190619904469',
'190619904569',
'190619904669',
'190619904759',
'190619904859',
'190619904869',
The text is inputted in JTextArea, and I do the following
String line = JTextArea.getText().toString()
and I throw it in this method.
private static String SQLFormatter(String list, JFrame frame){
String ret = "";
String currentWord = "";
for(int i = 0; i < list.length(); i++){
char c = list.charAt(i);
if( i == list.length() - 1){
currentWord += c;
currentWord = '\'' + currentWord + '\'';
ret += currentWord;
currentWord = "";
}
else if(c != ' '){
currentWord += c;
}else if(c == ' '){
currentWord = '\'' + currentWord + '\'' + ',';
ret += currentWord;
currentWord = "";
}
}
return ret;
}
Any advice? The bug is in there somewhere, but I'm not sure whether it is the method or some JTextArea feature I am missing.
[JTEXT AREA RESULTS][1]
[1]: https://i.stack.imgur.com/WXBKs.png
So it's a little hard to tell without the input, but there seem to be other white space, like carriage returns, in the input, which throws off your parsing. Also, if the thing has multiple white spaces or ends in white space, you might get more than you want (for example trailing comma, which I see you worked to avoid). Your original routine works on "Dogs are cool", but not as well on "Dogs \rare \rcool \r". Here is a slightly modified version that I think addresses the issues (I also pulled out the unused jframe parameter).
I also tried to think of it as the comma precedes any word but the first. I introduced a boolean for that, though it would have worked to check if ret was empty.
public static String SQLFormatter(String list) {
String ret = "";
String currentWord = "";
boolean firstWord = true;
for (int i = 0; i < list.length(); i++) {
// note modified to prepend comma to words beyond first and treat any white space as separator
// but multiple whitespace is treated as if just one space
char c = list.charAt(i);
if (!Character.isWhitespace(c)) {
currentWord += c;
} else if (!currentWord.equals("")) {
currentWord = '\'' + currentWord + '\'';
if (firstWord) {
ret += currentWord;
firstWord = false;
} else {
ret = ret + ',' + currentWord;
}
currentWord = "";
}
}
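// flush the final word when the input does not end in whitespace
if (!currentWord.equals("")) {
ret += (firstWord ? "" : ",") + '\'' + currentWord + '\'';
}
return ret;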
}
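For comparison, the same result can probably be reached more compactly by splitting on runs of whitespace and joining the quoted words. This is a sketch of a different technique (regex split plus streams), not the approach of the answer above, and the names QuoteJoiner and quoteAll are made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical alternative: split on any whitespace run, quote each word, join with commas.
class QuoteJoiner {

    static String quoteAll(String text) {
        String trimmed = text.trim(); // drop leading/trailing whitespace, including \r and \n
        if (trimmed.isEmpty()) {
            return ""; // avoid producing '' for blank input
        }
        return Arrays.stream(trimmed.split("\\s+"))
                .map(word -> "'" + word + "'")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(quoteAll("Dogs \rare \rcool \r"));
    }
}
```

Because the comma is only placed between words, there is no trailing comma to clean up afterwards.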

split a string when there is a change in character without a regular expression

There is a way to split a string into repeating characters using a regex function but I want to do it without using it.
for example, given a string like: "EE B" my output will be an array of strings e.g
{"EE", " ", "B"}
my approach is:
Given a string, I first find the number of unique characters so that I know the size of the array. Then I convert the string to an array of characters and check whether the next character is the same as the current one. If it is, I append them together; if not, I begin a new string.
my code so far..
String myinput = "EE B";
char[] cinput = new char[myinput.length()];
cinput = myinput.toCharArray(); //turn string to array of characters
int uniquecha = myinput.length();
for (int i = 0; i < cinput.length; i++) {
if (i != myinput.indexOf(cinput[i])) {
uniquecha--;
} //this should give me the number of unique characters
String[] returninput = new String[uniquecha];
Arrays.fill(returninput, "");
for (int i = 0; i < uniquecha; i++) {
returninput[i] = "" + myinput.charAt(i);
for (int j = 0; j < myinput.length - 1; j++) {
if (myinput.charAt(j) == myinput.charAt(j + 1)) {
returninput[j] += myinput.charAt(j + 1);
} else {
break;
}
}
} return returninput;
but there is something wrong with the second part as I cant figure out why it is not beginning a new string when the character changes.
Your question says that you don't want to use regex, but I see no reason for that requirement, other than that this may be homework. If you are open to using regex here, then there is a one-line solution which splits your input string on the following pattern:
(?<=\S)(?=\s)|(?<=\s)(?=\S)
This pattern uses lookarounds to split wherever what precedes is a non-whitespace character and what follows is a whitespace character, or vice versa.
String input = "EE B";
String[] parts = input.split("(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S)");
System.out.println(Arrays.toString(parts));
[EE, , B]
^^ a single space character in the middle
If I understood correctly, you want to split the characters in a string so that similar-consecutive characters stay together. If that's the case, here is how I would do it:
public static ArrayList<String> splitString(String str)
{
ArrayList<String> output = new ArrayList<>();
String combo = "";
//iterates through all the characters in the input
for(char c: str.toCharArray()) {
//check if the current char is equal to the last added char
if(combo.length() > 0 && c != combo.charAt(combo.length() - 1)) {
output.add(combo);
combo = "";
}
combo += c;
}
output.add(combo); // adds the last group of characters
return output;
}
Note that instead of using an array (has a fixed size) to store the output, I used an ArrayList, which has a variable size. Also, instead of checking the next character for equality with the current one, I preferred to use the last character for that. The variable combo is used to temporarily store the characters before they go to output.
Now, here is one way to print the result following your guidelines:
public static void main(String[] args)
{
String input = "EEEE BCD DdA";
ArrayList<String> output = splitString(input);
System.out.print("[");
for(int i = 0; i < output.size(); i++) {
System.out.print("\"" + output.get(i) + "\"");
if(i != output.size()-1)
System.out.print(", ");
}
System.out.println("]");
}
The output when running the above code will be:
["EEEE", " ", "B", "C", "D", " ", "D", "d", "A"]

How do I reverse words in string that has line feed (\n or \r)?

I have a string as follows:
String sentence = "I have bananas\r" +
"He has apples\r" +
"I own 3 cars\n" +
"*!";
I'd like to reverse this string so as to have an output like this:
"*!" +
"\ncars 3 own I" +
"\rapples has He" +
"\rbananas have I"
Here is a program I wrote.
public static String reverseWords(String sentence) {
StringBuilder str = new StringBuilder();
String[] arr = sentence.split(" ");
for (int i = arr.length -1; i>=0; i--){
str.append(arr[i]).append(" ");
}
return str.toString();
}
But I don't get the output as expected. What is wrong?
The problem is you are only splitting on spaces, but that is not the only type of whitespace in your sentence. You can use the pattern \s to match all whitespace. However, then you don't know what to put back in that position after the split. So instead we will split on the zero-width position in front of or behind a whitespace character.
Change your split to this:
String[] arr = sentence.split("(?<=\\s)|(?=\\s)");
Also, now that you are preserving the whitespace characters, you no longer need to append them. So change your append to this:
str.append(arr[i]);
The final problem is that your output will be garbled due to the presence of \r. So, if you want to see the result clearly, you should replace those characters. For example:
System.out.println(reverseWords(sentence).replaceAll("\\r","\\\\r").replaceAll("\\n","\\\\n"));
This modified code now gives the desired output.
Output:
*!\ncars 3 own I\rapples has He\rbananas have I
Note:
Since you are freely mixing \r and \n, I did not add any code to treat \r\n as a special case, which means that it will be reversed to become \n\r. If that is a problem, then you will need to prevent or undo that reversal.
For example, this slightly more complex regex will prevent us from reversing any consecutive whitespace characters:
String[] arr = sentence.split("(?<=\\s)(?!\\s)|(?<!\\s)(?=\\s)");
The above regex will match the zero-width position where there is whitespace behind but not ahead OR where there is whitespace ahead but not behind. So it won't split in the middle of consecutive whitespaces, and the order of sequences such as \r\n will be preserved.
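Putting the pieces of this answer together, a runnable sketch might look like the following. It uses the second split pattern (the one that keeps consecutive whitespace together); the class name ReverseWordsDemo is made up for illustration.

```java
// Hypothetical wrapper class around the reverseWords method discussed above.
class ReverseWordsDemo {

    static String reverseWords(String sentence) {
        // Split at the boundaries between whitespace runs and non-whitespace runs,
        // so the whitespace itself is kept as separate array elements.
        String[] arr = sentence.split("(?<=\\s)(?!\\s)|(?<!\\s)(?=\\s)");
        StringBuilder str = new StringBuilder();
        for (int i = arr.length - 1; i >= 0; i--) {
            str.append(arr[i]); // whitespace is already in the array, nothing extra to append
        }
        return str.toString();
    }

    public static void main(String[] args) {
        String sentence = "I have bananas\r" +
                "He has apples\r" +
                "I own 3 cars\n" +
                "*!";
        // Make \r and \n visible instead of letting the console interpret them.
        System.out.println(reverseWords(sentence)
                .replaceAll("\\r", "\\\\r")
                .replaceAll("\\n", "\\\\n"));
    }
}
```

With this pattern, a sequence such as \r\n stays in one array element and therefore keeps its internal order when the array is reversed.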
The logic behind this question is simple. There are two steps to achieve the OP's target:
reverse the whole string;
reverse each word in the result (words are separated by whitespace);
Instead of using StringBuilder, I prefer a char[] for this, which is easy to understand.
The local test code is:
public class WordReverse {
public static void main(String... args) {
String s = " We have bananas\r" +
"He has apples\r" +
"I own 3 cars\n" +
"*!";
System.out.println(reverseSentenceThenWord(s));
}
/**
* Returns the input itself if s is null or empty.
* @param s the sentence to process
* @return the string with its words (runs of non-whitespace characters) in reverse order
*/
private static String reverseSentenceThenWord(String s) {
if (s == null || s.length() == 0) return s;
char[] arr = s.toCharArray();
int len = arr.length;
reverse(arr, 0, len - 1);
boolean inWord = !isSpace(arr[0]); // used to track the start and end of a word;
int start = inWord ? 0 : -1; // is the start valid?
for (int i = 0; i < len; ++i) {
if (!isSpace(arr[i])) {
if (!inWord) {
inWord = true;
start = i; // just set the start index of the new word;
}
} else {
if (inWord) { // from word to space, we do the reverse for the traversed word;
reverse(arr, start, i - 1);
}
inWord = false;
}
}
if (inWord) reverse(arr, start, len - 1); // reverse the last word if it ends the sentence;
String ret = new String(arr);
// ret = showWhiteSpaces(ret);
// uncomment the line above to show whitespace characters as escape sequences
return ret;
}
private static void reverse(char[] arr, int i, int j) {
while (i < j) {
char c = arr[i];
arr[i] = arr[j];
arr[j] = c;
i++;
j--;
}
}
private static boolean isSpace(char c) {
return String.valueOf(c).matches("\\s");
}
private static String showWhiteSpaces(String s) {
String[] hidden = {"\t", "\n", "\f", "\r"};
String[] show = {"\\\\t", "\\\\n", "\\\\f", "\\\\r"};
for (int i = hidden.length - 1; i >= 0; i--) {
s = s.replaceAll(hidden[i], show[i]);
}
return s;
}
}
On my PC the printed output does not match what the OP provided; it appears as:
*!
bananas have I
However, if you set a breakpoint and inspect the returned string in a debugger, it is the right answer.
UPDATE
Now, if you would like to show the escaped whitespaces, you can just uncomment this line before returning the result:
// ret = showWhiteSpaces(ret);
And the final output will be exactly the same as expected in the OP's question:
*!\ncars 3 own I\rapples has He\rbananas have I
Take a look at the output you're after carefully. You actually need two iteration steps here - you first need to iterate over all the lines backwards, then all the words in each line backwards. At present you're just splitting once by space (not by new line) and iterating over everything returned in that backwards, which won't do what you want!
Take a look at the example below - I've kept closely to your style and just added a second loop. It first iterates over new lines (either by \n or by \r, since split() takes a regex), then by words in each of those lines.
Note however this comes with a caveat - it won't preserve the \r and the \n. For that you'd need to use lookahead / lookbehind in your split to preserve the delimiters (see here for an example.)
public static String reverseWords(String sentence) {
StringBuilder str = new StringBuilder();
String[] lines = sentence.split("[\n\r]");
for (int i = lines.length - 1; i >= 0; i--) {
String[] words = lines[i].split(" ");
for (int j = words.length - 1; j >= 0; j--) {
str.append(words[j]).append(" ");
}
str.append("\n");
}
return str.toString();
}
