How to process a text-qualifier-delimited file in Scala

I have many delimited files that use a text qualifier (every column starts and ends with a double quote). The delimiter is not consistent: it can be a comma (,), pipe (|), tilde (~), or tab (\t).
I need to read each file as text (a single column) and then count the delimiters while taking the text qualifier into account. Any record with fewer or more columns than defined should be rejected and written to a different path.
Below is test data with 3 columns: ID, Name and DESC. The DESC column can contain an extra delimiter.
"ID","Name","DESC"
"1" , "ABC", "A,B C"
"2" , "XYZ" , "ABC is bother"
"3" , "YYZ" , ""
4 , "XAA" , "sf,sd
sdfsf"
The last record is split into two records because of a newline character in the DESC field.
Below is the code I tried, but it does not handle these cases correctly.
import org.apache.spark.sql.functions.{coalesce, col, length, lit, regexp_replace}
import spark.implicits._

// Read each line as a single "value" column and drop empty records
val SourceFileDF = spark.read.text(InputFilePath).filter("value != ''")
// Count the delimiters by deleting every character that is not a comma
val aCnt = coalesce(length(regexp_replace($"value", "[^,]", "")), lit(0))
val Delimitercount = SourceFileDF.withColumn("a_cnt", aCnt)
// Records whose delimiter count differs from the expected one are rejected
val invalidrecords = Delimitercount.filter(col("a_cnt") =!= NoOfDelimiters)
val GoodRecordsDF = Delimitercount.filter(col("a_cnt") === NoOfDelimiters).drop("a_cnt")
With the code above I can reject all records that have too few or too many delimiters, but I cannot ignore a delimiter that appears inside the text qualifier.
Thanks in advance.

You may use a closure with replaceAllIn to remove any chars you want inside a match:
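// A regular string literal is used so that \n is a real line feed (in a triple-quoted string it would be a literal backslash followed by n)
var y = "4 , \"XAA\" , \"sf,sd\nsdfsf\""
val pattern = """"[^"]*(?:""[^"]*)*"""".r
// quoteReplacement keeps any \ or $ in the match from being treated as replacement syntax
y = pattern.replaceAllIn(y, m => java.util.regex.Matcher.quoteReplacement(m.group(0).replaceAll("[,\n]", "")))
print(y) // => 4 , "XAA" , "sfsdsdfsf"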
Details
" - matches a "
[^"]* - any 0+ chars other than "
(?:""[^"]*)* - matches 0 or more sequences of "" and then 0+ chars other than "
" - a ".
The code finds all non-overlapping matches of this pattern in y. For each match m, m.group(0) is the matched text, and replaceAll("[,\n]", "") removes every comma and line feed (LF) from it, since [,\n] matches either a , or a newline.
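The same quoted-field pattern can also be used to count delimiters outside the text qualifier, which is what the question needs for rejecting records. The following is a minimal Java sketch, not part of the original answer: the class and method names are hypothetical, and it assumes that a record broken across lines has already been joined into one string before counting.

```java
import java.util.regex.Pattern;

class QualifiedDelimiterCount {
    // Same shape as the pattern above: a quote, non-quotes, optional "" escapes, a quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*(?:\"\"[^\"]*)*\"");

    /** Counts occurrences of the delimiter that are not inside a quoted field. */
    static int countOutsideQuotes(String record, char delimiter) {
        // Drop every quoted field so delimiters inside it cannot be counted.
        String unquoted = QUOTED.matcher(record).replaceAll("");
        int count = 0;
        for (int i = 0; i < unquoted.length(); i++) {
            if (unquoted.charAt(i) == delimiter) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Both records have three columns, so both print 2.
        System.out.println(countOutsideQuotes("\"1\" , \"ABC\", \"A,B C\"", ','));
        System.out.println(countOutsideQuotes("4 , \"XAA\" , \"sf,sd\nsdfsf\"", ','));
    }
}
```

A record would then be accepted when the returned count equals the expected number of delimiters, whatever delimiter character the file uses.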

Related

regex - How to match elements while ignoring others between quotation marks?

I can't seem to find the regex that suits my needs.
I have a .txt file of this form:
Abc "test" aBC : "Abc aBC"
Brooking "ABC" sadxzc : "I am sad"
asd : "lorem"
a22 : "tactius"
testsa2 : "bruchia"
test : "Abc aBC"
b2 : "Ast2"
From this .txt file I wish to extract everything matching this regex "([a-zA-Z]\w+)", except the ones between the quotation marks.
I want to rename every word (except the words in quotation marks), so I should have for example the following output:
A "test " B : "Abc aBC"
Z "ABC" X : "I am sad"
Test : "lorem"
F : "tactius"
H : "bruchia"
Game : "Abc aBC"
S: "Ast2"
Is this even achievable using a regex? Are there alternatives without using regex?

If quotes are balanced and there is no escaping in the input like \" then you can use this regex to match words outside double quotes:
(?=(?:(?:[^"]*"){2})*[^"]*$)(\b[a-zA-Z]\w+\b)
In Java it will be:
Pattern p = Pattern.compile("(?=(?:(?:[^\"]*\"){2})*[^\"]*$)(\\b[a-zA-Z]\\w+\\b)");
This regex matches a word only when it is outside double quotes: the lookahead makes sure that an even number of quotes follows each matched word.
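As a rough illustration of how this lookahead behaves, the sketch below replaces every unquoted word with a fixed string. The class name, method name and the placeholder replacement are made up for the example; it assumes balanced quotes and no escaped quotes, as stated above.

```java
import java.util.regex.Pattern;

class OutsideQuotesLookahead {
    // A word matches only if an even number of quotes follows it up to the end of the line.
    private static final Pattern WORD_OUTSIDE_QUOTES = Pattern.compile(
            "(?=(?:(?:[^\"]*\"){2})*[^\"]*$)(\\b[a-zA-Z]\\w+\\b)");

    /** Replaces every word outside double quotes with the given literal text. */
    static String maskUnquotedWords(String line, String replacement) {
        return WORD_OUTSIDE_QUOTES.matcher(line)
                .replaceAll(java.util.regex.Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // prints: X "test" X : "Abc aBC"
        System.out.println(maskUnquotedWords("Abc \"test\" aBC : \"Abc aBC\"", "X"));
    }
}
```

Because the lookahead rescans the rest of the line for every candidate word, this approach is best suited to short lines.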

A simple approach might be to split the string by ", then do the replace using your regex on every odd part (on parts 1, 3, ..., if you start the numbering from 1), and join everything back.
UPD
However, it is also simple to implement manually. Just go along the line and track whether you are inside quotes or not.
insideQuotes = false
result = ""
currentPart = ""
input = input + '"' // so that we do not need to process the last part separately
for ch in input
    if ch == '"'
        if not insideQuotes
            currentPart = replace(currentPart)
        result = result + currentPart + '"'
        currentPart = ""
        insideQuotes = not insideQuotes
    else
        currentPart = currentPart + ch
drop the last symbol of result (it is that quote mark that we have added)
However, also consider whether you will need more advanced syntax, for example quote escaping such as
word "inside quote \" still inside" outside again
If so, you will need a more advanced parser, or you might consider using a dedicated format.
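The manual scan described above could look roughly like this in Java. This is only a sketch under the same assumptions (no escaped quotes); the class and method names are hypothetical, and instead of appending a sentinel quote it flushes the last part after the loop.

```java
import java.util.function.UnaryOperator;

class QuoteAwareRewriter {
    /** Applies replace to every part of input that lies outside double quotes. */
    static String rewriteOutsideQuotes(String input, UnaryOperator<String> replace) {
        StringBuilder result = new StringBuilder();
        StringBuilder currentPart = new StringBuilder();
        boolean insideQuotes = false;
        for (char ch : input.toCharArray()) {
            if (ch == '"') {
                // A quote ends the current part; only parts outside quotes are rewritten.
                String part = currentPart.toString();
                result.append(insideQuotes ? part : replace.apply(part)).append('"');
                currentPart.setLength(0);
                insideQuotes = !insideQuotes;
            } else {
                currentPart.append(ch);
            }
        }
        // Flush the tail; an unterminated quoted part is left untouched.
        String tail = currentPart.toString();
        result.append(insideQuotes ? tail : replace.apply(tail));
        return result.toString();
    }

    public static void main(String[] args) {
        // prints: ASD : "lorem" X
        System.out.println(rewriteOutsideQuotes("asd : \"lorem\" x", String::toUpperCase));
    }
}
```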

You can’t formulate a “within quotes” condition the way you might think. But you can easily search for unquoted words or quoted strings and take action only for the unquoted words:
Pattern p = Pattern.compile("\"[^\"]*\"|([a-zA-Z]\\w+)");
for(String s: lines) {
    Matcher m=p.matcher(s);
    while(m.find()) {
        if(m.group(1)!=null) {
            System.out.println("take action with "+m.group(1));
        }
    }
}
This utilizes the fact that each search for the next match starts at the end of the previous. So if you find a quoted string ("[^"]*") you don’t take any action and continue searching for other matches. Only if there is no match for a quoted string, the pattern looks for a word (([a-zA-Z]\w+)) and if one is found, the group 1 captures the word (will be non null).
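To actually rename the unquoted words rather than just print them, the same pattern can be combined with appendReplacement and appendTail. The sketch below is one possible way to do that; the class name and the renamer function are assumptions made for the example.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedWordRenamer {
    // Either a whole quoted string (no group) or a word captured in group 1.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|([a-zA-Z]\\w+)");

    /** Rewrites each word outside quotes with renamer; quoted strings are copied unchanged. */
    static String rename(String line, UnaryOperator<String> renamer) {
        Matcher m = TOKEN.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) != null ? renamer.apply(m.group(1)) : m.group();
            // quoteReplacement keeps $ and \ in the text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: W "ABC" W : "I am sad"
        System.out.println(rename("Brooking \"ABC\" sadxzc : \"I am sad\"", w -> "W"));
    }
}
```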

Splitting a string with a certain pattern in Java

I am writing a parser for a file containing the following string pattern:
Key : value
Key : value
Key : value
etc...
I am able to retrieve those lines one by one into a list. What I would like to do is to separate the key from the value for each one of those strings. I know there is the split() method that can take a Regex and do this for me, but I am very unfamiliar with them so I don't know what Regex to give as a parameter to the split() function.
Also, while not in the specifications of the file I am parsing, I would like for that Regex to be able to recognize the following patterns as well (if possible):
Key: value
Key :value
Key:value
etc...
So basically, whether there's a space or not after/before/after AND before the : character, I would like for that Regex to be able to detect it. What is the Regex that can achieve this?

In other words split method should look for : and zero or more whitespaces before or after it.
Key: value
   ^^
Key :value
   ^^
Key:value
   ^
Key : value
   ^^^
In that case split("\\s*:\\s*") should do the trick.
Explanation:
\\s represents any whitespace
* means zero or more occurrences of the element described before it
\\s* means zero or more whitespaces.
On the other hand you may want also to find entire key:value pair and place parts matching key and value in separate groups (you can even name groups as you like using (?<groupName>regex)). In that case you may use
Pattern p = Pattern.compile("(?<key>\\w+)\\s*:\\s*(?<value>\\w+)");
Matcher m = p.matcher(yourData);
while(m.find()){
    System.out.println("key = " + m.group("key"));
    System.out.println("value = " + m.group("value"));
    System.out.println("--------");
}
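As a small worked example of the split approach, the sketch below wraps it in a helper that also tolerates lines without a delimiter. The helper name is hypothetical, and the limit argument of 2 is an addition of this sketch (not part of the answer) so that a value containing ':' is not split further.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

class KeyValueLine {
    /** Parses "Key : value" in any spacing variant; empty if the line has no usable key. */
    static Optional<Map.Entry<String, String>> parse(String line) {
        // Limit 2: only the first ':' separates key from value.
        String[] parts = line.trim().split("\\s*:\\s*", 2);
        if (parts.length != 2 || parts[0].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new SimpleEntry<>(parts[0], parts[1]));
    }

    public static void main(String[] args) {
        for (String line : new String[] {"Key : value", "Key: value", "Key :value", "Key:value"}) {
            // Each variant prints: Key -> value
            parse(line).ifPresent(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
        }
    }
}
```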

If you want to use String.split(), you could use this:
String input = "key : value";
String[] s = input.split("\\s*:\\s*");
String key = s[0];
String value = s[1];
This will split the String at the ":" and treat any whitespace directly before or after the ":" as part of the delimiter, so the key and value need no trimming around the ":".
Explanation:
\\s* will match any run of whitespace; by default \\s is equal to [ \\t\\n\\x0B\\f\\r]
The : in between the two \\s* means that your : needs to be there
Note that this solution will cause an ArrayIndexOutOfBoundsException if your input line does not contain the key-value-format as you defined it.
If you are not sure if the line really contain the key-value-String, maybe because you want to have an empty line at the end of your file like there normally is, you could do it like that:
String input = "key : value";
Matcher m = Pattern.compile("(\\S+)\\s*:\\s*(.+)").matcher(input);
if (m.matches())
{
    String key = m.group(1); // note that the count starts by 1 here
    String value = m.group(2);
}
Explanation:
\\S+ matches any run of non-whitespace characters; any whitespace that follows the key is matched by the next part of the regex. Note that the () around it mark a capturing group, so that you can get its value with m.group().
\\s* will match any run of whitespace; by default \\s is equal to [ \\t\\n\\x0B\\f\\r]
The : in between the two \\s* means that your : needs to be there
The last group, .+, will match any string, containing whitespaces and so on.

You can use the split method and pass ":" as the delimiter.
This splits the string wherever it sees ':'; you can then trim the parts to get the key and value.
String s = " keys : value ";
String keyValuePairs[] = s.split(":");
String key = keyValuePairs[0].trim();
String value = keyValuePairs[1].trim();
You can also make use of regex to simplify it.
String keyValuePairs[] = s.trim().split("[ ]*:[ ]*");
s.trim() removes the spaces before and after the string (if there are any in your case), so the string becomes "keys : value", and
[ ]*:[ ]*
splits the string using the regular expression "spaces (zero or more) : spaces (zero or more)" as the delimiter.

For a pure regex solution, you can use the following pattern (note the space at the beginning):
 ?: ?
See http://regexr.com/39evh

String[] tokensVal = str.split(":");
String key = tokensVal[0].trim();
String value = tokensVal[1].trim();

Java 7 Unicode Regex Tabs-only and Spaces-only

I'm currently trying to add support to our application for Japanese and French language encodings. In doing so, I'm trying to create two Pattern matchers to detect tabs-only and spaces-only in a read file, regardless of language encoding.
These will be used to determine what delimiter is used in a file, so they can be processed accordingly.
When I've tried compiling a space pattern
Pattern.compile(" ", Pattern.UNICODE_CHARACTER_CLASS);
I don't see it generating a regex to handle different unicode space values.
eg something like "[\\u00A0\\u2028\\u2029\\u3000\\u00C2\\u009A\\u0041]"
Compilation seems to work properly with the '\s' character set, but that includes tabs and newlines.
How should I be doing this in Java?
UPDATE
So part of the reason this wasn't working was the fact that Japanese web text HAS NO spaces, even though there appear to be spaces. Take the following line from a web import:
実なので説明は不要だろう。その後1987
There are actually no spaces here う。そ. Just three characters.
Fixing this is really the subject of another question, so I have accepted Casimir's answer, as it handled the French case just fine.

You can use a negated character class. Example:
[^\\S \\t]
that means \s without space and tab.
Or you can use a class intersection:
[\\s&&[^ \\t]]
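A quick way to see what these two classes accept is to run them against a few single characters. The sketch below is only an illustration with hypothetical names; it compiles both classes with Pattern.UNICODE_CHARACTER_CLASS so that \s also covers Unicode whitespace such as U+3000 (ideographic space).

```java
import java.util.regex.Pattern;

class WhitespaceExceptSpaceAndTab {
    // "Not (non-whitespace, space or tab)", i.e. whitespace other than space and tab.
    static final Pattern NEGATED =
            Pattern.compile("[^\\S \\t]", Pattern.UNICODE_CHARACTER_CLASS);
    // Intersection of \s with "anything except space and tab".
    static final Pattern INTERSECTION =
            Pattern.compile("[\\s&&[^ \\t]]", Pattern.UNICODE_CHARACTER_CLASS);

    /** True if the whole string is a single character accepted by the pattern. */
    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"\n", " ", "\t", "\u3000"};
        for (String s : samples) {
            System.out.printf("U+%04X negated=%b intersection=%b%n",
                    (int) s.charAt(0), accepts(NEGATED, s), accepts(INTERSECTION, s));
        }
    }
}
```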

If I follow your question, you could use something like this for spaces -
Pattern p = Pattern.compile("^[ ]+$", Pattern.UNICODE_CHARACTER_CLASS);
String[] inputs = {" ", "  ", " \t", "Hello"};
for (String input : inputs) {
    Matcher m = p.matcher(input);
    System.out.printf("For input: '%s' = %s%n", input, m.find());
}
Output is
For input: ' ' = true
For input: '  ' = true
For input: ' ' = false
For input: 'Hello' = false
and for tabs
Pattern p = Pattern.compile("^[\t]+$", Pattern.UNICODE_CHARACTER_CLASS);
String[] inputs = {"\t", "\t\t", " \t", "Hello"};
for (String input : inputs) {
    Matcher m = p.matcher(input);
    System.out.printf("For input: '%s' = %s%n", input, m.find());
}
Output is
For input: ' ' = true
For input: ' ' = true
For input: ' ' = false
For input: 'Hello' = false
Both patterns use +, so at least one matching character is required; use * instead of + to allow zero or more. The ^ and $ anchors make the pattern cover the whole input from start to end.
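Note that the class [ ] only matches the ASCII space U+0020. If Unicode spaces such as the no-break space (U+00A0) or the ideographic space (U+3000) should also count as spaces, one option is the Unicode category \p{Zs} (space separators), which does not include the tab. The sketch below is an assumption about what "spaces-only" should mean for the asker's files, with hypothetical names.

```java
import java.util.regex.Pattern;

class UnicodeSpaceOnly {
    // \p{Zs} is the Unicode "space separator" category; tab (U+0009) is not in it.
    private static final Pattern SPACES_ONLY = Pattern.compile("\\p{Zs}+");

    /** True if the string is non-empty and consists only of space separators. */
    static boolean isSpacesOnly(String s) {
        return SPACES_ONLY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSpacesOnly("\u3000\u00A0 ")); // prints true
        System.out.println(isSpacesOnly(" \t"));           // prints false
    }
}
```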

Split string in Java, retain delimiters including items inside quotes

I have a .txt input file as follows:
Start "String" (100, 100) Test One:
Nextline 10;
Test Second Third(2, 4, 2, 4):
String "7";
String "8";
Test "";
End;
End.
I've intended to read this file in as one String and then split it based on certain delimiters.
I've almost met the desired output with this code:
String tr= entireFile.replaceAll("\\s+", "");
String[] input = tr.split("(?<=[(,):;.])|(?=[(,):;.])|(?=\\p{Upper})");
My current output is:
Start"
String"
(
100
,
100
)
Test
One
:
Nextline10
;
Test
Second
Third
(
2
,
4
,
2
,
4
)
:
String"7"
;
String"8"
;
Test""
;
End
;
End
.
However, I'm having trouble treating items inside quotes or just plain quotes "" as a separate token. So "String" and "7" and "" should all be on separate lines. Is there a way to do this with regex? My expected output is below, thanks for any help.
Start
"String"
(
100
,
100
)
Test
One
:
Nextline
10
;
Test
Second
Third
(
2
,
4
,
2
,
4
)
:
String
"7"
;
String
"8"
;
Test
""
;
End
;
End
.

Here's the regex I came up with:
String[] input = entireFile.split(
    "\\s+|" +                // Splits on whitespace or
    "(?<=\\()|" +            // splits on the positive lookbehind ( or
    "(?=[,).:;])|" +         // splits on any of the positive lookaheads ,).:; or
    "((?<!\\s)(?=\\())");    // splits on the positive lookahead ( with a negative lookbehind whitespace
To understand all that positive/negative lookahead/lookbehind terminology, take a look at this answer.
Note that you should apply this split directly to the input file without removing whitespace, aka take out this line:
String tr= entireFile.replaceAll("\\s+", "");
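An alternative to split, shown here only as a sketch and not as the approach of the answer above, is to describe the tokens themselves and collect them with Matcher.find(). That makes it straightforward to keep a quoted string such as "String" or "" as one token. The class name is hypothetical; like the split above, it relies on the whitespace of the original file, so it does not break words on uppercase letters, and a lone unmatched quote is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareTokenizer {
    // Alternatives are tried left to right: a quoted string, a single delimiter,
    // or a run of characters that are not whitespace, quotes or delimiters.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]*\"|[(),:;.]|[^\\s\"(),:;.]+");

    /** Returns the tokens of text in order, with quoted strings kept whole. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "Start \"String\" (100, 100) Test One:\nNextline 10;\nTest \"\";\nEnd.";
        tokenize(sample).forEach(System.out::println);
    }
}
```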

Parsing comma-separated values containing quoted commas and newlines

I have a string with some special characters.
The aim is to retrieve a String[] of the values in each line (separated by ,).
The special character " can enclose a value that contains \n and ,
For example, the main string:
Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL
Titi",God," timmy, tomy,tony,
tini".
You can see that there are \n characters inside the "".
Can anyone help me parse this?
Thanks
__ More explanation
From the main string I need to separate these values:
Here Alpha
Beta
Gama
23-5-2013,TOM
TOTO
Julie,KameL,Titi
God
timmy, tomy,tony,tini
The problem is: for Julie,KameL,Titi there is a line break (\n) between KameL and Titi.
There is a similar problem for timmy, tomy,tony,tini: there is a line break (\n) between tony and tini.
Now this text is in a file (reading line by line is compulsory):
Alpha,Beta Charli,Delta,Delta Echo ,Frank George,Henry
1234-5,"Ida, John
", 25/11/1964, 15/12/1964,"40,000,000.00",0.0975,2,"King, Lincoln
",Mary / New York,123456
12543-01,"Ocean, Peter
Output (I want to remove this "):
Alpha
Beta Charli
Delta
Delta Echo
Frank George
Henry
1234-5
Ida
John
"
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King
Lincoln
"
Mary / New York
123456
12543-01
Ocean
Peter

Parsing CSV is a whole lot harder than one would imagine at first sight, and that's why your best option is to use a well-designed and tested library to do that work for you. Two such libraries are opencsv and Super CSV, and there are many others. Have a look at both and use the one that best fits your requirements and style.

Description
Consider the following PowerShell example of a universal regex, tested on a Java parser, which requires no extra processing to reassemble the data parts. The first capture group matches an opening quote, if present, and a backreference requires the same quote at the end of the match, so you're assured of capturing the entire value between, but not including, the quotes. The commas are not captured unless they are embedded in a quote-delimited substring.
(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)
Example
$Matches = @()
$String = 'Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"'
$Regex = '(?:^|,\s{0,})(["]?)\s{0,}((?:.|\n|\r)*?)\1(?=[,]\s{0,}|$)'
Write-Host start with
write-host $String
Write-Host
Write-Host found
([regex]"(?i)(?m)$Regex").matches($String) | foreach {
write-host "key at $($_.Groups[1].Index) = '$($_.Groups[1].Value)'`t= value at $($_.Groups[2].Index) = '$($_.Groups[2].Value)'"
} # next match
Yields
start with
Alpha,Beta,Gama,"23-5-2013,TOM",TOTO,"Julie, KameL\n
Titi",God,"timmy, \n
tomy,tony,tini"
found
key at 0 = '' = value at 0 = 'Alpha'
key at 6 = '' = value at 6 = 'Beta'
key at 11 = '' = value at 11 = 'Gama'
key at 16 = '"' = value at 17 = '23-5-2013,TOM'
key at 32 = '' = value at 32 = 'TOTO'
key at 37 = '"' = value at 38 = 'Julie, KameL\n
Titi'
key at 60 = '' = value at 60 = 'God'
key at 64 = '"' = value at 65 = 'timmy, \n
tomy,tony,tini'
Summary
(?: start non capture group
^ require start of string
| or
,\s{0,} a comma followed by any number of white space
) close the non capture group
( start capture group 1
["]? consume a quote if it exists; I like doing it this way in case you want to include characters other than a quote
) close capture group 1
\s{0,} consume any spaces if they exist, this means you don't need to trim the value later
( start capture group 2
(?:.|\n|\r)*? capture all characters including a new line, non greedy
) close capture group 2
\1 if there was a quote it would be stored in group 1, so if there was one then require it here
(?= start zero assertion look ahead
[,]\s{0,} must have a comma followed by optional whitespace
| or
$ end of the string
) close the zero assertion look ahead

Try this:
String source = "Alpha,Beta,Gama,\"23-5-2013,TOM\",TOTO,\"Julie, KameL\n"
+ "Titi\",God,\" timmy, tomy,tony,\n"
+ "tini\".";
Pattern p = Pattern.compile("(([^\"][^,]*)|\"([^\"]*)\"),?");
Matcher m = p.matcher(source);
while(m.find())
{
    if(m.group(2) != null)
        System.out.println( m.group(2).replace("\n", "") );
    else if(m.group(3) != null)
        System.out.println( m.group(3).replace("\n", "") );
}
If it matches a string without quotes, the result is returned in group 2.
Strings with quotes are returned in group 3, hence the distinction in the while block.
You might find a prettier way.
Output:
Alpha
Beta
Gama
23-5-2013,TOM
TOTO
Julie, KameLTiti
God
timmy, tomy,tony,tini
.

See this related answer for a decent Java-compatible regex for parsing CSV.
It recognizes:
Newlines (after values or inside quoted values)
Quoted values containing escaped double-quotes like ""this""
In short, you will use this pattern: (?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))
Then collect each Matcher group(1) in a find() loop.
Note: Although I have posted this answer here about a "decent" regex I discovered, just to save people searching for one, it is by no means robust. I still agree with this answer by user "fgv": a CSV parser is preferable.
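To illustrate what such a parser has to keep track of, here is a minimal character-by-character sketch in Java. It is not a replacement for a tested CSV library and is not taken from any of the answers; the class name is hypothetical, and it assumes comma as the delimiter, double quote as the qualifier and "" as the escaped quote.

```java
import java.util.ArrayList;
import java.util.List;

class MiniCsv {
    /** Splits CSV text into records; quoted fields may hold commas, newlines and "" escapes. */
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < text.length() && text.charAt(i + 1) == '"') {
                    field.append('"'); // "" inside quotes is an escaped quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;  // closing quote
                } else {
                    field.append(c);   // commas and newlines are plain data here
                }
            } else if (c == '"') {
                inQuotes = true;       // opening quote
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        // Flush the last record when the text does not end with a newline.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String source = "Alpha,Beta,\"23-5-2013,TOM\",\"Julie, KameL\nTiti\",God";
        System.out.println(parse(source).get(0).size()); // prints 5
    }
}
```

Because the quote state is carried across newlines, a quoted value that spans two physical lines stays in one field, which is the case that line-by-line regex matching struggles with.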
