Split string in Java, retain delimiters including items inside quotes

I have a .txt input file as follows:
Start "String" (100, 100) Test One:
Nextline 10;
Test Second Third(2, 4, 2, 4):
String "7";
String "8";
Test "";
End;
End.
I intend to read this file in as one String and then split it on certain delimiters.
I've almost met the desired output with this code:
String tr= entireFile.replaceAll("\\s+", "");
String[] input = tr.split("(?<=[(,):;.])|(?=[(,):;.])|(?=\\p{Upper})");
My current output is:
Start"
String"
(
100
,
100
)
Test
One
:
Nextline10
;
Test
Second
Third
(
2
,
4
,
2
,
4
)
:
String"7"
;
String"8"
;
Test""
;
End
;
End
.
However, I'm having trouble treating items inside quotes or just plain quotes "" as a separate token. So "String" and "7" and "" should all be on separate lines. Is there a way to do this with regex? My expected output is below, thanks for any help.
Start
"String"
(
100
,
100
)
Test
One
:
Nextline
10
;
Test
Second
Third
(
2
,
4
,
2
,
4
)
:
String
"7"
;
String
"8"
;
Test
""
;
End
;
End
.

Here's the regex I came up with:
String[] input = entireFile.split(
        "\\s+|" +              // split on (and discard) runs of whitespace, or
        "(?<=\\()|" +          // split right after a ( (positive lookbehind), or
        "(?=[,).:;])|" +       // split right before any of , ) . : ; (positive lookahead), or
        "((?<!\\s)(?=\\())");  // split right before a ( that is not preceded by whitespace
To understand all that positive/negative lookahead/lookbehind terminology, take a look at this answer.
Note that you should apply this split directly to the contents of the input file, without removing whitespace first. In other words, take out this line:
String tr= entireFile.replaceAll("\\s+", "");
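As a quick check, the split above can be run against part of the sample input from the question. This is only a sketch: the class name SplitDemo, the method tokenize, and the inline sample string are assumptions made for illustration, not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Same regex as in the answer above, applied without stripping whitespace first.
    static List<String> tokenize(String entireFile) {
        String[] input = entireFile.split(
                "\\s+|" +
                "(?<=\\()|" +
                "(?=[,).:;])|" +
                "((?<!\\s)(?=\\())");
        return Arrays.asList(input);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the file contents (a few lines of the sample).
        String entireFile = "Start \"String\" (100, 100) Test One:\n"
                + "Nextline 10;\n"
                + "Test Second Third(2, 4, 2, 4):\n"
                + "String \"7\";\n"
                + "Test \"\";\n"
                + "End.";
        // Print one token per line, as in the expected output of the question.
        for (String token : tokenize(entireFile)) {
            System.out.println(token);
        }
    }
}
```

Quoted items such as "String" and "" stay whole here only because they contain no whitespace or delimiter characters; a quoted string with spaces inside it would still be broken apart by this split.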

Related

Using String.split() How can I split a string based on a regular expression excluding a certain string

I have this string:
"round((TOTAL_QTY * 100) / SUM(ORDER_ITEMS->TOTAL_QTY) , 1)"
I tried to split the string using the following code:
String[] tokens = function.split("[ )(*+-/^!##%&]");
Result is the following array:
"round"
""
"TOTAL_QTY"
""
""
"100"
""
""
""
"SUM"
"ORDER_ITEMS"
"->TOTAL_QTY"
""
""
""
"1"
But I need to split the string as follows:
"round",
"TOTAL_QTY",
"100",
"SUM",
"ORDER_ITEMS->TOTAL_QTY",
"1"
To make it clearer: first I need to ignore -> when splitting the string, and then remove the empty strings from the result array.
Solution 1
OK, I think you can do it in two steps: replace all the unnecessary characters with a space, for example, and then split on whitespace. Your regex can look like this:
[)(*+/^!##%&,]|\\b-\\b
Your code :
String[] tokens = function.replaceAll("[)(*+/^!##%&,]|\\b-\\b", " ").split("\\s+");
Note that I used \\b-\\b so that only a - sitting between two word characters is replaced; the - in -> is followed by >, so it is left alone.
Solution 2
Or If you want something clean, you can use Pattern with Matcher like this :
Pattern.compile("\\b\\w+->\\w+\\b|\\b\\w+\\b")
.matcher("round((TOTAL_QTY * 100) / SUM(ORDER_ITEMS->TOTAL_QTY) , 1)")
.results()
.map(MatchResult::group)
.forEach(s -> System.out.println(String.format("\"%s\"", s)));
regex demo
Details
\b\w+->\w+\b to match that special case of ORDER_ITEMS->TOTAL_QTY
| or
\b\w+\b any other word with word boundaries
Note that Matcher.results() requires Java 9 or later; on older versions you can use a plain Pattern and Matcher loop instead.
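For reference, the plain Pattern and Matcher variant could look like the sketch below. The class name TokenExtractor and the method tokens are hypothetical names chosen for this example; the pattern itself is the one from Solution 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenExtractor {
    // Same pattern as Solution 2: try the word->word form first, otherwise a single word.
    private static final Pattern TOKEN = Pattern.compile("\\b\\w+->\\w+\\b|\\b\\w+\\b");

    static List<String> tokens(String function) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(function);
        // Classic find() loop; does not need Matcher.results() from Java 9.
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String function = "round((TOTAL_QTY * 100) / SUM(ORDER_ITEMS->TOTAL_QTY) , 1)";
        for (String s : tokens(function)) {
            System.out.println("\"" + s + "\"");
        }
    }
}
```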
Outputs
"round"
"TOTAL_QTY"
"100"
"SUM"
"ORDER_ITEMS->TOTAL_QTY"
"1"
YCF_L has already provided a couple of very good solutions.
Here is one more solution:
String[] tokens = function.replace(")","").split("\\(+|\\*|/|,");
Explanation:
\\(+ will split on ( and the + ensures that runs of several opening brackets are handled, e.g. round((
|\\*|/|, OR split by * OR split by / OR split by ,
Output (the tokens keep any surrounding spaces, so call trim() on each one if needed):
round
TOTAL_QTY
100
SUM
ORDER_ITEMS->TOTAL_QTY
1

How to process Text Qualifier delimited file in scala

I have a lot of delimited files with a text qualifier (every column starts and ends with a double quote). The delimiter is not consistent, i.e. it can be any of comma (,), pipe (|), ~, or tab (\t).
I need to read each file as text (a single column) and then check the number of delimiters while taking the text qualifier into account. If a record has fewer or more columns than defined, that record should be rejected and loaded to a different path.
Below is test data with 3 columns: ID, Name and DESC. The DESC column has an extra delimiter.
"ID","Name","DESC"
"1" , "ABC", "A,B C"
"2" , "XYZ" , "ABC is bother"
"3" , "YYZ" , ""
4 , "XAA" , "sf,sd
sdfsf"
The last record is split into two records because of the newline character in the DESC field.
Below is the code I tried to handle but not able to handle correctly.
val SourceFileDF = spark.read.text(InputFilePath)
  .filter("value != ''") // Removing empty records while reading
val aCnt = coalesce(length(regexp_replace($"value","[^,]", "")), lit(0)) //to count no of delimiters
val Delimitercount = SourceFileDF.withColumn("a_cnt", aCnt)
var invalidrecords= Delimitercount
.filter(col("a_cnt")
.!==(NoOfDelimiters)).toDF()
val GoodRecordsDF = Delimitercount
.filter(col("a_cnt")
.equalTo(NoOfDelimiters)).drop("a_cnt")
With the above code I am able to reject all the records that have fewer or more delimiters, but I am not able to ignore a delimiter that is within the text qualifier.
Thanks in Advance.
You may use a closure with replaceAllIn to remove any chars you want inside a match:
var y = """4 , "XAA" , "sf,sd\nsdfsf""""
val pattern = """"[^"]*(?:""[^"]*)*"""".r
y = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\n]", ""))
print(y) // => 4 , "XAA" , "sfsdnsdfsf"
See the Scala demo.
Details
" - matches a "
[^"]* - any 0+ chars other than "
(?:""[^"]*)* - matches 0 or more sequences of "" and then 0+ chars other than "
" - a ".
The code finds all non-overlapping matches of the above pattern in y and upon finding a match (m) the , and newlines (LF) are removed from the match value (with m.group(0).replaceAll("[,\n]", ""), where m.group(0) is the match value and [,\n] matches either , or a newline).
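The same quoted-field pattern can also be used to count delimiters outside the text qualifier, which is what the validity check in the question needs. The sketch below is plain Java rather than Scala/Spark code, and it assumes the whole record (including any embedded newline) is already available as one string. DelimiterCounter and countDelimiters are hypothetical names.

```java
import java.util.regex.Pattern;

class DelimiterCounter {
    // A quoted field: opening quote, non-quote chars, optional doubled quotes, closing quote.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*(?:\"\"[^\"]*)*\"");

    // Counts occurrences of the delimiter that lie outside quoted fields.
    static long countDelimiters(String record, char delimiter) {
        // Drop every quoted field entirely, so only the separators between fields remain.
        String unquoted = QUOTED.matcher(record).replaceAll("");
        return unquoted.chars().filter(c -> c == delimiter).count();
    }

    public static void main(String[] args) {
        System.out.println(countDelimiters("\"1\" , \"ABC\", \"A,B C\"", ','));
        System.out.println(countDelimiters("4 , \"XAA\" , \"sf,sd\nsdfsf\"", ','));
    }
}
```

A record would then be accepted when the returned count equals the expected number of delimiters (2 for the three-column sample data).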

Split string to include null values represented by white space

I am trying to split an input string that contains whitespace, but I do not want to cut it off from my split, I want to include it in my split array. Is there a better regex or method to use in this case?
String data = "1     a1 b1  r5";
String[] splitData = data.split("\\s+");
for(String x : splitData){
System.out.print(x + ", ");
}
Expected output: 1, , , , , ,a1, b1, , r5
I'm confused by your methodology here. If this is all you're trying to do, it can be done much more simply:
String input = "1     a1 b1  r5";
String output = input.replace(" ", ", ");
System.out.println(output);
The middle line simply replaces the space character, " ", with a comma followed by a space, ", ". The final output matches your requested output:
1, , , , , a1, b1, , r5
If this is a minimal example and you actually intend to use a more complex regex, please post that regex and we can get to work on it.
If you want to create an array of tokens, just use:
String[] sp = s.split( " " );
However, this will create 4 empty items after the first one, not 5.
There is 1 space between "a1" and "b1", and you expect 0 empty items there.
There are 2 spaces between "b1" and "r5" and you expect 1 empty item there.
There are 5 empty spaces between "1" and "a1". Why do you expect 5 empty items there instead of 4?
And why does your expected output not have a space after the comma in front of "a1" ?
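The counts discussed above can be checked with a small sketch. It assumes the input has the spacing described in this answer (5 spaces, then 1, then 2); SpaceSplitDemo and splitOnSingleSpaces are hypothetical names.

```java
import java.util.Arrays;

class SpaceSplitDemo {
    // Splitting on a single space keeps an empty string for every extra space in a run.
    static String[] splitOnSingleSpaces(String s) {
        return s.split(" ");
    }

    public static void main(String[] args) {
        // Assumed spacing: 5 spaces after "1", 1 space after "a1", 2 spaces after "b1".
        String data = "1     a1 b1  r5";
        String[] parts = splitOnSingleSpaces(data);
        System.out.println(parts.length);
        System.out.println(Arrays.toString(parts));
    }
}
```

A run of n spaces yields n - 1 empty strings, which is why the 5 spaces produce 4 empty items and the 2 spaces produce 1.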

pattern split to get all values in a string representing object

I have Strings that represent rows in a table like this:
{failures=4, successes=6, name=this_is_a_name, p=40.00}
I made an expression that can be used with Pattern.split() to get me back all of the values in a String[]:
[\{\,](.*?)\=
In the online regex tester it works well with the exception of the ending }.
But when I actually run the pattern against my first row I get a String[] where the first element is an empty string. I only want the 4 values (not keys) from each row not the extra empty value.
Pattern getRowValues = Pattern.compile("[\\{\\,](.*?)\\=");
String[] row = getRowValues.split("{failures=4, successes=6, name=this_is_a_name, p=40.00}");
//CURRENT
//row[0]=> ""
//row[1]=>"4"
//row[2]=>"6"
//row[3]=>"this_is_a_name"
//row[4]=>"40.00}"
//WANT
//row[0]=>"4"
//row[1]=>"6"
//row[2]=>"this_is_a_name"
//row[3]=>"40.00"
If you want the key=value pairs:
String[] parts = rowString // rowString (hypothetical name) is the row text as a String, not the Pattern
    // Strip off the leading '{' and trailing '}'
    .replaceAll("^\\{|\\}$", "")
    // then just split on comma-space
    .split(", ");
If you want just the values:
String[] parts = rowString
    // Strip off the leading '{' and everything up to and including the first =,
    // and the trailing '}'
    .replaceAll("^\\{[^=]*=|\\}$", "")
    // then split on comma-space plus the following key, up to and including the =
    .split(", [^=]*=");
Option 1
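Modify your regular expression to [{,](.*?)=|[}], where I removed the unnecessary escapes inside each of the [...] constructs and added |[}] so that the trailing } is also treated as a delimiter. Note that split() will still return an empty first element, because the input begins with a match of the pattern.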
See also Live Demo
Option 2
=([^,]*)[,}]
This regular expression will do the following:
capture all the substrings after the = and before the , or close }
Example
Live Demo
https://regex101.com/r/yF2gG7/1
Sample text
{failures=4, successes=6, name=this_is_a_name, p=40.00}
Capture groups
Each match gets the following capture groups:
Capture group 0 gets the entire substring from = to , or }
Capture group 1 gets just the value not including the =, ,, or } characters
Sample Matches
[0][0] = =4,
[0][1] = 4
[1][0] = =6,
[1][1] = 6
[2][0] = =this_is_a_name,
[2][1] = this_is_a_name
[3][0] = =40.00}
[3][1] = 40.00
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
  =                        '='
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [^,]*                    any character except: ',' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  [,}]                     any character of: ',', '}'
----------------------------------------------------------------------
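In Java, Option 2 is used with a Matcher rather than with split(), reading capture group 1 from each match. A minimal sketch, with RowValues and values as hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RowValues {
    // Option 2: an '=', then the value (captured), then a ',' or the closing '}'.
    private static final Pattern VALUE = Pattern.compile("=([^,]*)[,}]");

    static List<String> values(String row) {
        List<String> values = new ArrayList<>();
        Matcher m = VALUE.matcher(row);
        while (m.find()) {
            values.add(m.group(1)); // group 1 is the value without '=', ',' or '}'
        }
        return values;
    }

    public static void main(String[] args) {
        String row = "{failures=4, successes=6, name=this_is_a_name, p=40.00}";
        System.out.println(values(row));
    }
}
```

Because matches are collected instead of splitting, there is no empty leading element to filter out.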

How do I count repetitive/continuous appearance of a character in String(When I don't know index of start/end)?

So if I have 22332, I want to replace that with BEA, as on a mobile keypad. I want to see how many times in a row a digit appears so that I can map A--2, B--22, C--222, D--3, E--33, F--333, etc. (and a 0 is a pause). I want to write a decoder that takes in a digit string and replaces the digit runs with letters. Example: 44335557075557777 will be decoded as HELP PLS.
This is the key portion of the code:
public void printMessages() throws Exception {
    File msgFile = new File("messages.txt");
    Scanner input = new Scanner(msgFile);
    while (input.hasNext()) {
        String x = input.next();
        String y = input.nextLine();
        System.out.println(x + ":" + y);
    }
}
It takes the input from a file as a digit String. Then the Scanner prints the digits. I tried to split the string into digits, but I don't know how to evaluate the kind of repetition described in the question.
for(String x : b.split(""))
System.out.print(x);
gives: 44335557075557777 (input from the file).
I don't know how I can find each run of repeated digits and see how the runs form the pattern used on a mobile keypad. If I use a for loop then I have to cycle through the whole string and use lots of if statements. There must be some other way.
Another suggestion: use a regex to break up the encoded string.
Combining look-around with a back-reference makes it easy to split the string at positions where the preceding and following characters are different.
e.g.
String line = "44335557075557777";
String[] tokens = line.split("(?<=(.))(?!\\1)");
// tokens will contain ["44", "33", "555", "7", "0", "7", "555", "7777"]
Then it should be trivial for you to map each string to its corresponding character, either with a Map or even naively with a bunch of if-elses.
Edit: Some background on the regex
(?<=(.))(?!\1)
(?<= )  : look-behind group: find something (a zero-length pattern in this example) preceded by this group
( )     : capture group #1
.       : any char
(empty) : the zero-length pattern between the look-behind and look-ahead groups
(?! )   : negative look-ahead group: find a pattern (zero-length in this example) NOT followed by this group
\1      : back-reference to whatever capture group #1 matched
So it means, find any zero-length positions, for which the character before and after such position is different, and use such positions to do splitting.
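Putting the split and the Map together, a decoder could be sketched as follows. The standard phone keypad layout, the treatment of 0 as a space, and the wrap-around for extra presses are assumptions made for this example; KeypadDecoder and decode are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class KeypadDecoder {
    // Assumed standard keypad layout; 0 is treated as a pause and decoded as a space.
    private static final Map<Character, String> KEYS = new HashMap<>();
    static {
        KEYS.put('2', "ABC");
        KEYS.put('3', "DEF");
        KEYS.put('4', "GHI");
        KEYS.put('5', "JKL");
        KEYS.put('6', "MNO");
        KEYS.put('7', "PQRS");
        KEYS.put('8', "TUV");
        KEYS.put('9', "WXYZ");
        KEYS.put('0', " ");
    }

    static String decode(String digits) {
        if (digits.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        // Split between any two adjacent characters that differ, giving runs like "555".
        for (String run : digits.split("(?<=(.))(?!\\1)")) {
            String letters = KEYS.get(run.charAt(0));
            if (letters == null) {
                throw new IllegalArgumentException("Unsupported key: " + run.charAt(0));
            }
            // Assumption: presses beyond the number of letters on a key wrap around.
            out.append(letters.charAt((run.length() - 1) % letters.length()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("44335557075557777"));
    }
}
```

The run length selects the letter on the key, so no chain of if statements is needed.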
