I have a CSV file where each field (except the column headings) uses a double quote as a text qualifier, e.g. field: "some value". However, some of the fields contain a double quote within the value: field2: "25" TV", field3: "25" x 14" x 2"", or field4: "A"bcd"ef"g". When a row has data like fields 2-4, my Java file process fails: because I specify the double quote as the text qualifier, the row looks as if it has too many fields. How do I do any or all of the following:
remove the double-quote character from inside the field
replace the double-quote character with another value
have my java process "ignore" or "skip" double-quotes within a field.
What is my level of control over this file? The file comes in as-is, but I just need data from two different columns in the file. I can do whatever I need to do to it to get that data.
First, if it is indeed a CSV file, you should be using the presence of commas to break each line into columns.
Once it's broken into columns, if we know for sure that each value should begin and end with a double quote ("), we can simply remove all of the double quotes and then re-apply the ones at the beginning and end.
String input = "\"hello\",\"goodbye Java \"the best\" language\", \"this is really \"\"\"bad\"";
String[] parsed = input.split(",");
String[] clean = new String[parsed.length];
int index = 0;
for (String value : parsed) {
clean[index] = "\"" + value.replace("\"", "") + "\"";
index++;
}
If a comma could exist inside of the value, the following should be used instead
String input = "\"hello\",\"goodbye,\" Java \"the best\" language\", \"this is really \"\"\"bad\"";
String[] parsed = input.split("\"\\s*,\\s*\"");
String[] clean = new String[parsed.length];
int index = 0;
for (String value : parsed) {
clean[index] = "\"" + value.replace("\"", "") + "\"";
index++;
}
Note that if a sequence matching "\s*,\s*" (quote, comma, quote, with optional whitespace around the comma) appears inside a value, the record is ambiguous. For example, in a two-column file, the input record
"abc","def","ghi" could be either
value 1 = "abc","def" value 2 = "ghi"
or
value 1 = "abc" value 2 = "def","ghi"
Note many CSV implementations will escape a double quote as two consecutive quotes.
So "25"" TV" might (should?) be your input.
Assuming that a comma is the column separator and that every column is surrounded by double quotes:
String[] columns = input.split("\",\"");
if (columns.length > 0) {
columns[0] = columns[0].substring(1);
String lastColumn = columns[columns.length-1];
columns[columns.length-1] = lastColumn.substring(0,lastColumn.length()-1);
}
The columns will still contain the internal double quotes. You can strip them with replace() if you don't want them.
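As a rough sketch of the split-then-strip idea above, the helper below combines both steps. The class and method names are made up for illustration, and it assumes every column is quoted and that a quote-comma-quote sequence never appears inside a value (otherwise the record is ambiguous, as noted earlier).

```java
import java.util.ArrayList;
import java.util.List;

class QuotedLineSplitter {
    // Hypothetical helper: splits a line whose every column is wrapped in double
    // quotes and separated by a comma (optionally padded with whitespace),
    // then removes any quotes left inside each column.
    static List<String> split(String line) {
        String trimmed = line.trim();
        // Drop the opening quote of the first column and the closing quote of the last one.
        if (trimmed.length() >= 2 && trimmed.startsWith("\"") && trimmed.endsWith("\"")) {
            trimmed = trimmed.substring(1, trimmed.length() - 1);
        }
        List<String> columns = new ArrayList<>();
        // The -1 limit keeps trailing empty columns such as the second one in "a","".
        for (String column : trimmed.split("\"\\s*,\\s*\"", -1)) {
            columns.add(column.replace("\"", ""));
        }
        return columns;
    }

    public static void main(String[] args) {
        // The three problem fields from the question, joined into one record.
        System.out.println(split("\"25\" TV\",\"25\" x 14\" x 2\"\",\"A\"bcd\"ef\"g\""));
    }
}
```

For the sample record this prints [25 TV, 25 x 14 x 2, Abcdefg]. If you would rather keep the inner quotes, replace them with another character instead of the empty string.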
I have a lot of delimited files that use a text qualifier (every column starts and ends with a double quote). The delimiter is not consistent: it can be a comma (,), pipe (|), tilde (~), or tab (\t).
I need to read each file as text (a single column) and then count the delimiters while taking the text qualifier into account. Any record with fewer or more columns than defined should be rejected and written to a different path.
Below is test data with 3 columns ID, Name and DESC. DESC column has extra delimiter.
"ID","Name","DESC"
"1" , "ABC", "A,B C"
"2" , "XYZ" , "ABC is bother"
"3" , "YYZ" , ""
4 , "XAA" , "sf,sd
sdfsf"
The last record is split into two records because of the newline character in the DESC field.
Below is the code I tried, but it does not handle this correctly.
val SourceFileDF = spark.read.text(InputFilePath)
.filter("value != ''") // Removing empty records while reading
val aCnt = coalesce(length(regexp_replace($"value","[^,]", "")), lit(0)) //to count no of delimiters
val Delimitercount = SourceFileDF.withColumn("a_cnt", aCnt)
var invalidrecords= Delimitercount
.filter(col("a_cnt")
.!==(NoOfDelimiters)).toDF()
val GoodRecordsDF = Delimitercount
.filter(col("a_cnt")
.equalTo(NoOfDelimiters)).drop("a_cnt")
With the above code I can reject all records that have too few or too many delimiters, but I cannot ignore a delimiter that occurs within the text qualifier.
Thanks in Advance.
You may use a closure with replaceAllIn to remove any chars you want inside a match:
var y = """4 , "XAA" , "sf,sd\nsdfsf""""
val pattern = """"[^"]*(?:""[^"]*)*"""".r
y = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\n]", ""))
print(y) // => 4 , "XAA" , "sfsdnsdfsf"
See the Scala demo.
Details
" - matches a "
[^"]* - any 0+ chars other than "
(?:""[^"]*)* - matches 0 or more sequences of "" and then 0+ chars other than "
" - a ".
The code finds all non-overlapping matches of the above pattern in y and upon finding a match (m) the , and newlines (LF) are removed from the match value (with m.group(0).replaceAll("[,\n]", ""), where m.group(0) is the match value and [,\n] matches either , or a newline).
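For readers working in Java rather than Scala, here is a sketch of the same idea using Matcher.appendReplacement. The class and method names are invented for illustration, and it assumes a multi-line record has already been assembled into a single string before it is cleaned.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedFieldCleaner {
    // A quoted field: opening quote, non-quote chars, optional doubled quotes, closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*(?:\"\"[^\"]*)*\"");

    // Removes commas and line feeds that occur inside quoted fields only.
    static String stripInsideQuotes(String record) {
        Matcher m = QUOTED.matcher(record);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = m.group().replaceAll("[,\n]", "");
            // quoteReplacement keeps '\' and '$' in the field from being read as escapes.
            m.appendReplacement(out, Matcher.quoteReplacement(cleaned));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Counts the commas that remain once the quoted fields have been cleaned.
    static int delimiterCount(String record) {
        return (int) stripInsideQuotes(record).chars().filter(c -> c == ',').count();
    }
}
```

Once the quoted fields are cleaned, the remaining comma count can be compared with the expected number of delimiters to accept or reject the record. For other delimiters (pipe, tilde, tab) the character class and the counted character would need to change accordingly.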
I need to replace parts of a string by looking up the System properties.
For example, consider the string It was {var1} beauty killed {var2}
I need to parse the string and replace every word enclosed in curly braces with its value from System properties. If System.getProperty() returns null, the placeholder should simply be replaced with an empty string. This is straightforward when I know the variables ahead of time, but the string I need to parse is not defined in advance: I don't know how many variables it contains or what their names are. Assuming a simple, well-formed string (no nested braces, every opening brace has a matching close), what is the simplest or most elegant way to parse the string and replace all the character sequences enclosed in braces?
The only solution I could come up with is to traverse the string from the first character, note the start and end positions of each pair of braces, replace the text between them, and continue until reaching the end of the string. Is there a simpler way to do this?
You can use the braces to break the initial string into substrings, and then replace every other substring.
String[] substituteValues = {"the", "str", "other", "another"};
int substituteValuesIndex = 0;
String test = "Here is {var1} string called {var2}";
// split the string up into substrings
test = test.replaceAll("\\}", "\\{");
String[] splitString = test.split("\\{");
// now sub in your values
for (int k=1; k < splitString.length; k = k+2) {
splitString[k] = substituteValues[substituteValuesIndex];
substituteValuesIndex++;
}
String result = "";
for (String s : splitString) {
result = result + s;
}
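The snippet above substitutes values from a fixed array. If the values should come from System properties, as the question asks, a regex-based variant might look like the sketch below. The class and method names are hypothetical, and it assumes placeholders are never nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyTemplate {
    // Matches {name}, where name contains no braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^{}]*)\\}");

    // Replaces each {name} with System.getProperty(name), or "" when it is not set.
    static String expand(String template) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            // System.getProperty rejects an empty key, so treat {} as a missing property.
            String value = name.isEmpty() ? null : System.getProperty(name);
            // quoteReplacement stops '$' or '\' in the value from being read as escapes.
            m.appendReplacement(out, Matcher.quoteReplacement(value == null ? "" : value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the matcher finds the placeholders itself, the number of variables and their names do not need to be known in advance.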
In my JSP page I have a dropdown list with multiple selection, and I store the selected values in an array of Strings using getParameterValues(). I then convert the array to a String in this format: ('x','y','z'), so that it can work with the IN operator of SQL Server.
But the problem is that after the array is converted into a String each element is surrounded with backslashes. Like so: (\'X\',\'Z\',\'Y\').
I have used String.replaceAll("\\\\", ""); which was working fine in another Java application. I am unsure why it doesn't work with my servlet solution (web Application).
here is my code :
String[] Names = request.getParameterValues("Name");
String Name = "(";
for (int i = 0; i < Names.length; i++) {
Name += "'".concat(Names[i]).concat("'") + ',';
}
Name = Name.concat(")");
Name = Name.replace(",)", ")");
Name = Name.replaceAll("\\\\", "");
I know that Name = Name.replaceAll("\\\\", ""); will remove the backslashes but I don't know why it's not working in the servlet ?!
Is there a problem with values from the dropdown list?
Try using something like:
String[] names = request.getParameterValues("Name");
StringBuilder name = new StringBuilder("(");
for(int index = 0; index <names.length; index++){
name.append("'");
name.append(names[index].replace("\\","").replace("/",""));
name.append("'");
name.append(index != names.length -1? "," : ")");
}
String output = name.toString();
The replace() method replaces every occurrence of a literal substring rather than a regex, so "\\" (a single backslash) is enough. "\\\\" would only be needed if you wanted to remove double backslashes and leave single ones in place.
If the problem persists then there are two possible reasons for it.
The debugger expresses ' as \', so there should be no problem when sending the query to the server.
The \ is not actually a backslash but another character that looks like one. You can find out which character it is by using int test = output.charAt(output.length() - 3); and then checking the value of the test variable in the debugger.
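If stepping through with a debugger is inconvenient, the second check can also be done in code. This is only a sketch (the class and method names are made up): it lists the Unicode code point of every character, so a look-alike stands out from a real backslash, which is U+005C.

```java
class CharInspector {
    // Returns each code point of the text as U+XXXX, separated by spaces.
    static String codePoints(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }
}
```

Logging CharInspector.codePoints(output) shows whether the string really contains U+005C or some other character, and whether the backslashes exist in the string at all or only in the debugger's display.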
I am writing a parser for a file containing the following string pattern:
Key : value
Key : value
Key : value
etc...
I am able to retrieve those lines one by one into a list. What I would like to do is to separate the key from the value for each one of those strings. I know there is the split() method that can take a Regex and do this for me, but I am very unfamiliar with them so I don't know what Regex to give as a parameter to the split() function.
Also, while not in the specifications of the file I am parsing, I would like for that Regex to be able to recognize the following patterns as well (if possible):
Key: value
Key :value
Key:value
etc...
So basically, whether there's a space or not after/before/after AND before the : character, I would like for that Regex to be able to detect it. What is the Regex that can achieve this?
In other words split method should look for : and zero or more whitespaces before or after it.
Key: value
^^
Key :value
^^
Key:value
^
Key : value
^^^
In that case split("\\s*:\\s*") should do the trick.
Explanation:
\\s represents any whitespace character
* means zero or more occurrences of the element described before it
\\s* therefore means zero or more whitespace characters.
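A small sketch to check that the pattern handles all four spacing variants from the question. The class name and the limit argument of 2 are my additions; the limit keeps a value that itself contains a colon in one piece.

```java
class ColonSplitDemo {
    // Splits "key : value" into at most two parts, ignoring whitespace around the colon.
    static String[] splitPair(String line) {
        // trim() drops whitespace at the ends; the limit of 2 stops after the first colon.
        return line.trim().split("\\s*:\\s*", 2);
    }

    public static void main(String[] args) {
        for (String line : new String[] {"Key: value", "Key :value", "Key:value", "Key : value"}) {
            String[] parts = splitPair(line);
            System.out.println(parts[0] + " -> " + parts[1]);
        }
    }
}
```

Each of the four lines prints Key -> value.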
On the other hand you may want also to find entire key:value pair and place parts matching key and value in separate groups (you can even name groups as you like using (?<groupName>regex)). In that case you may use
Pattern p = Pattern.compile("(?<key>\\w+)\\s*:\\s*(?<value>\\w+)");
Matcher m = p.matcher(yourData);
while(m.find()){
System.out.println("key = " + m.group("key"));
System.out.println("value = " + m.group("value"));
System.out.println("--------");
}
If you want to use String.split(), you could use this:
String input = "key : value";
String[] s = input.split("\\s*:\\s*");
String key = s[0];
String value = s[1];
This will split the String at the ":" and treat any whitespace around the ":" as part of the delimiter, so you receive trimmed strings.
Explanation:
\\s* matches zero or more whitespace characters; by default \\s is equivalent to [ \\t\\n\\x0B\\f\\r]
The : between the two \\s* means that the : must be present
Note that this solution will cause an ArrayIndexOutOfBoundsException if your input line does not follow the key-value format as you defined it.
If you are not sure whether a line really contains a key-value pair, for example because the file ends with an empty line, you could do it like this:
String input = "key : value";
Matcher m = Pattern.compile("(\\S+)\\s*:\\s*(.+)").matcher(input);
if (m.matches())
{
String key = m.group(1); // note that the count starts by 1 here
String value = m.group(2);
}
Explanation:
\\S+ matches one or more non-whitespace characters, so the key cannot contain whitespace; any whitespace after the key is consumed by the next part of the regex. The () around it form a capturing group, so you can get its value with m.group().
\\s* matches zero or more whitespace characters; by default \\s is equivalent to [ \\t\\n\\x0B\\f\\r]
The : between the two \\s* means that the : must be present
The last group, .+, matches the rest of the line, including any whitespace it contains.
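Applied to a whole list of lines, the matcher approach might look like the sketch below. The class name and the choice of a LinkedHashMap are my assumptions; blank or malformed lines are simply skipped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueParser {
    // Same pattern as above: key without whitespace, a colon, then the rest of the line.
    private static final Pattern PAIR = Pattern.compile("(\\S+)\\s*:\\s*(.+)");

    // Collects the key-value pairs in file order; a repeated key keeps the last value.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : lines) {
            // trim() so that trailing whitespace does not end up in the value
            Matcher m = PAIR.matcher(line.trim());
            if (m.matches()) { // blank or malformed lines do not match and are skipped
                result.put(m.group(1), m.group(2));
            }
        }
        return result;
    }
}
```

One thing to be aware of: because \\S+ is greedy, a line such as a:b:c yields the key a:b and the value c. If keys never contain a colon, [^\\s:]+ is a stricter choice for the first group.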
You can use the split method and pass ":" as the delimiter.
This splits the string when it sees ':', then you can trim the values to get the key and value.
String s = " keys : value ";
String keyValuePairs[] = s.split(":");
String key = keyValuePairs[0].trim();
String value = keyValuePairs[1].trim();
You can also make use of regex to simplify it.
String keyValuePairs[] = s.trim().split("[ ]*:[ ]*");
s.trim() removes the spaces before and after the string (if there are any in your case), so the string becomes "keys : value", and
[ ]*:[ ]*
splits the string with a regular expression that treats spaces (zero or more), then :, then spaces (zero or more) as the delimiter.
For a pure regex solution, you can use the following pattern (note the space at the beginning; the pattern is shown in quotes so that the leading space is visible):
" ?: ?"
See http://regexr.com/39evh
String[] tokensVal = str.split(":");
String key = tokensVal[0].trim();
String value = tokensVal[1].trim();
I am trying to split a string according to a certain set of delimiters.
My delimiters are: ,"():;.!? single spaces or multiple spaces.
This is the code i'm currently using,
String[] arrayOfWords= inputString.split("[\\s{2,}\\,\"\\(\\)\\:\\;\\.\\!\\?-]+");
which works fine in most cases, but I have a problem when the first word is surrounded by quotation marks. For example
String inputString = "\"Word\" some more text.";
gives me this output
arrayOfWords[0] = ""
arrayOfWords[1] = "Word"
arrayOfWords[2] = "some"
arrayOfWords[3] = "more"
arrayOfWords[4] = "text"
I want the output to give me an array with
arrayOfWords[0] = "Word"
arrayOfWords[1] = "some"
arrayOfWords[2] = "more"
arrayOfWords[3] = "text"
This code has been working fine when quotation marks are used in the middle of the sentence, I'm not sure what the trouble is when it's at the beginning.
EDIT: I just realized I have same problem when any of the delimiters are used as the first character of the string
Unfortunately, you won't be able to remove this empty first element using split alone. You should first remove any leading delimiters from your string and then split it. Your regex also seems to be incorrect because
by adding {2,} inside [...] you are making the characters {, 2, the comma and } delimiters,
you don't need to escape the rest of your delimiters (note that - does not have to be escaped only because it is at the end of the character class [], where it cannot act as a range operator).
Try it this way:
String regexDelimiters = "[\\s,\"():;.!?\\-]+";
String inputString = "\"Word\" some more text.";
String[] arrayOfWords = inputString.replaceAll(
"^" + regexDelimiters,"").split(regexDelimiters);
for (String s : arrayOfWords)
System.out.println("'" + s + "'");
output:
'Word'
'some'
'more'
'text'
A delimiter is interpreted as separating the strings on either side of it, thus the empty string on its left is added to the result as well as the string to its right ("Word"). To prevent this, you should first strip any leading delimiters, as described here:
How to prevent java.lang.String.split() from creating a leading empty string?
So in short form you would have:
String delim = "[\\s,\"():;.!?\\-]+";
String[] arrayOfWords = inputString.replaceFirst("^" + delim, "").split(delim);
Edit: Looking at Pshemo's answer, I realize he is correct regarding your regex. Inside the brackets it's unnecessary to specify the number of space characters, as they will be caught by the + operator.
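Wrapped in a small helper, the strip-then-split approach could look like the sketch below. The class and method names are made up, and the extra guard for input that consists only of delimiters is my addition: without it, split returns an array holding a single empty string.

```java
class WordSplitter {
    // One or more of the delimiters from the question, with the hyphen escaped.
    private static final String DELIMITERS = "[\\s,\"():;.!?\\-]+";

    static String[] words(String input) {
        // Strip leading delimiters so split() does not produce a leading empty string.
        String stripped = input.replaceFirst("^" + DELIMITERS, "");
        if (stripped.isEmpty()) {
            return new String[0]; // nothing but delimiters (or empty input)
        }
        // split() already discards trailing empty strings, e.g. after the final '.'.
        return stripped.split(DELIMITERS);
    }
}
```

For the input "Word" some more text. (with the quotation marks) this returns the four elements Word, some, more and text.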