Parsing using Pattern in Java

I want to parse the lines of a CSV file using the parsingMethod shown further down.
test.csv
Frank George,Henry,Mary / New York,123456
,Beta Charli,"Delta,Delta Echo
", 25/11/1964, 15/12/1964,"40,000,000.00",0.0975,2,"King, Lincoln ",Alpha
This is how I read the lines:
public static void main(String[] args) throws Exception {
File file = new File("C:\\Users\\test.csv");
BufferedReader reader = new BufferedReader(new FileReader(file));
String line2;
while ((line2= reader.readLine()) !=null) {
String[] tab = parsingMethod(line2, ",");
for (String i : tab) {
System.out.println( i );
}
}
}
public static String[] parsingMethod(String line,String parser) {
List<String> liste = new LinkedList<String>();
String patternString ="(([^\"][^"+parser+ "]*)|\"([^\"]*)\")" +parser+"?";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher =pattern.matcher(line);
while (matcher.find()) {
if(matcher.group(2) != null){
liste.add(matcher.group(2).replace("\n","").trim());
}else if(matcher.group(3) != null){
liste.add(matcher.group(3).replace("\n","").trim());
}
}
String[] result = new String[liste.size()];
return liste.toArray(result);
}
}
Output :
Frank George
Henry
Mary / New York
123456
Beta Charli
Delta
Delta Echo
"
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King
Lincoln
"
Alpha
Delta
Delta Echo
I want to remove these stray " entries.
Can anyone help me improve my pattern?
Expected output:
Frank George
Henry
Mary / New York
123456
Beta Charli
Delta
Delta Echo
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King
Lincoln
Alpha
Delta
Delta Echo
Output for line 3
25/11/1964
15/12/1964
40
000
000.00
0.0975
2
King
Lincoln

Your code did not compile as posted, but only because some of the " characters were not escaped.
This pattern should do the trick:
String patternString = "(?:^.,|)([^\"]*?|\".*?\")(?:,|$)";
Pattern pattern = Pattern.compile(patternString, Pattern.MULTILINE);
(?:^.,|) is a non-capturing group that matches either a single character followed by a comma at the start of a line, or nothing at all.
([^\"]*?|\".*?\") is a capturing group that matches either a run of characters other than " or anything enclosed in a pair of ".
(?:,|$) is a non-capturing group that matches a comma or the end of a line.
Note: ^ and $ match at the start and end of each line only when the pattern is compiled with the Pattern.MULTILINE flag; by default they match only at the beginning and end of the entire input.
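For comparison, here is a minimal sketch that avoids a regex altogether: it walks the text one character at a time and tracks whether it is currently inside double quotes. The names CsvFieldSplitter and splitFields are hypothetical, and the sketch assumes that quotes are never escaped inside a field (no "" sequences).

```java
import java.util.ArrayList;
import java.util.List;

class CsvFieldSplitter {
    // Splits one logical CSV record into fields. Commas inside double quotes
    // do not split; the quotes themselves are dropped.
    // Assumption: no escaped quotes ("") inside a field.
    static List<String> splitFields(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // toggle state, drop the quote
            } else if (c == ',' && !inQuotes) {
                fields.add(clean(current));     // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(clean(current));             // last field
        return fields;
    }

    // Removes line breaks and surrounding whitespace, as the question's code does.
    private static String clean(StringBuilder sb) {
        return sb.toString().replace("\r", "").replace("\n", "").trim();
    }

    public static void main(String[] args) {
        String record = ",Beta Charli,\"Delta,Delta Echo\n\", 25/11/1964,\"40,000,000.00\",Alpha";
        for (String field : splitFields(record)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

Unlike the regex versions, this keeps empty fields (such as the one before the leading comma), so the caller can decide whether to drop them.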

I can't reproduce your result, but maybe you want to leave the quotes out of the second captured group, like this:
"(([^\"][^"+parser+ "]*)|\"([^\"]*))\"" +parser+"?"
Edit: Sorry, this won't work. Maybe you want to allow any number of characters that are neither a comma nor a " in the first group as well, like this: (([^,\"]*)|\"([^\"]*)\"),?

As far as I can see, the lines are related, so try reading the whole file into one string before parsing:
public static void main(String[] args) throws Exception {
File file = new File("C:\\Users\\test.csv");
BufferedReader reader = new BufferedReader(new FileReader(file));
StringBuilder line = new StringBuilder();
String lineRead;
while ((lineRead = reader.readLine()) != null) {
line.append(lineRead);
}
String[] tab = parsingMethod(line.toString());
for (String i : tab) {
System.out.println(i);
}
}
public static String[] parsingMethod(String line) {
List<String> liste = new LinkedList<String>();
String patternString = "(([^\"][^,]*)|\"([^\"]*)\"),?";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
if (matcher.group(2) != null) {
liste.add(matcher.group(2).replace("\n", "").trim());
} else if (matcher.group(3) != null) {
liste.add(matcher.group(3).replace("\n", "").trim());
}
}
String[] result = new String[liste.size()];
return liste.toArray(result);
}
Output:
Frank George
Henry
Mary / New York
123456
Beta Charli
Delta,Delta Echo
25/11/1964
15/12/1964
40,000,000.00
0.0975
2
King, Lincoln
Alpha
Because Delta,Delta Echo is inside quotation marks, it should appear on a single line, just like King, Lincoln.
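Instead of concatenating the whole file, the same idea can be applied one record at a time. A minimal sketch, assuming quotes are balanced within each record and never escaped: physical lines are joined until the double quotes seen so far are balanced. The names LogicalRecordJoiner and joinLines are hypothetical, and the line break inside a quoted field is dropped, as in the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LogicalRecordJoiner {
    // Joins physical lines until the double quotes seen so far are balanced,
    // so a quoted field containing a line break stays in one record.
    static List<String> joinLines(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean inQuotes = false;
        for (String line : lines) {
            pending.append(line);
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') {
                    inQuotes = !inQuotes;
                }
            }
            if (!inQuotes) {                      // record is complete
                records.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            records.add(pending.toString());      // unterminated quote at end of input
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Frank George,Henry,Mary / New York,123456",
                ",Beta Charli,\"Delta,Delta Echo",
                "\", 25/11/1964, 15/12/1964,\"40,000,000.00\",0.0975,2,\"King, Lincoln \",Alpha");
        for (String record : joinLines(lines)) {
            System.out.println(record);
        }
    }
}
```

With the sample file this yields two records, so the fields of the first line are not mixed with those of the second.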

Related

How can i split on a string

I have a .txt file that I read through a BufferedReader, and I need to extract the last part of this string (the file name). The text is below:
<path
action="m"
text-mod="true"
mods="true"
kind="file">branches/RO/2021Align01/CO/DGSIG-DAO/src/main/java/eu/ca/co/vo/CsoorspeWsVo.java</path>
I have the following code, which takes the entire line and puts it in a list, but I only need CsoorspeWsVo:
while ((line = bufferdReader.readLine()) != null) {
Excel4 excel4 = new Excel4();
if (line.contains("</path>")) {
int index1 = line.indexOf(">");
int index2 = line.lastIndexOf("<");
line = line.substring(index1, index2);
excel4.setName(line);
listExcel4.add(excel4);
}
}
I only want to extract CsoorspeWsVo from here.
Can anyone help me? Thanks.

You can use regex groups to get it, for example:
public static void main(String []args){
String input = "<path\n" +
" action=\"m\"\n" +
" text-mod=\"true\"\n" +
" mods=\"true\"\n" +
" kind=\"file\">branches/RO/2021Align01/CO/DGSIG-DAO/src/main/java/eu/ca/co/vo/CsoorspeWsVo.java</path>\n";
Pattern pattern = Pattern.compile("kind=\"file\">.+/(.+\\..+)</path>");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
String fileName = matcher.group(1);
System.out.println(fileName);
}
}
The output will be: CsoorspeWsVo.java
If you want the full path, change the regex to:
Pattern pattern = Pattern.compile("kind=\"file\">(.+)</path>");
The output will be:
branches/RO/2021Align01/CO/DGSIG-DAO/src/main/java/eu/ca/co/vo/CsoorspeWsVo.java
You can also capture the name and the extension in two separate groups, for example:
Pattern pattern = Pattern.compile("kind=\"file\">.+/(.+)\\.(.+)</path>");
And inside the if block:
String fileName = matcher.group(1);
String fileExtension = matcher.group(2);
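The same result can be had without a regex. A minimal sketch using only indexOf-style calls, assuming the line ends with the path followed by </path> as in the question; the names PathBaseName and baseName are hypothetical. Note the + 1 after the position of '>': the question's substring(index1, index2) starts at the '>' itself and therefore includes it.

```java
class PathBaseName {
    // Returns the file name without directories and extension from a line such as
    // kind="file">some/dir/Name.ext</path>, or null if the line has no </path>.
    static String baseName(String line) {
        int close = line.lastIndexOf("</path>");
        if (close < 0) {
            return null;
        }
        int open = line.lastIndexOf('>', close - 1);      // end of the opening tag
        String path = line.substring(open + 1, close);    // skip the '>' itself
        String file = path.substring(path.lastIndexOf('/') + 1);
        int dot = file.lastIndexOf('.');
        return dot > 0 ? file.substring(0, dot) : file;   // strip the extension if present
    }

    public static void main(String[] args) {
        String line = "kind=\"file\">branches/RO/2021Align01/CO/DGSIG-DAO/src/main/java/eu/ca/co/vo/CsoorspeWsVo.java</path>";
        System.out.println(baseName(line));
    }
}
```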

Finding six consecutive integers in three lines of string

I have written an OCR program in Java that scans documents and finds all the text in them. My primary task is to find the invoice number, which consists of six or more digits.
I used substring, but that is not reliable because the position of the number changes with every document. It is, however, always present in the first three lines of the OCR text.
I want to write Java 8 code that iterates through the first three lines and extracts this run of six consecutive digits.
I am using Tesseract for OCR.
Example:
,——— ————i_
g DAILYW RK SHE 278464
E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha
From this, I need to extract the number 278464.
Please help!

Try the following code, which uses a regex:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Test
{
public static void main(String[] args)
{
Pattern pattern = Pattern.compile("(?<=\\D)\\d{6}(?!\\d)");
String str = "g DAILYW RK SHE 278464";
Matcher matcher = pattern.matcher(str);
if(matcher.find()){
String s = matcher.group();
//278464
System.out.println(s);
}
}
}
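(?<=\\D) is a lookbehind: it requires the preceding character to be a non-digit, without including that character in the match.
\\d{6} matches exactly six digits.
(?!\\d) is a negative lookahead: it requires that the next character is not a digit, again without including it in the match.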
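Two details of the question are not covered by that pattern: the number may have more than six digits, and it may sit at the very start of a line, where (?<=\\D) cannot match because there is no preceding character. A minimal sketch that uses negative lookarounds instead and looks only at the first three lines; the names InvoiceNumberFinder and find are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvoiceNumberFinder {
    // (?<!\d) : not preceded by a digit (also true at the start of the text)
    // \d{6,}  : six or more digits
    // (?!\d)  : not followed by a digit (also true at the end of the text)
    private static final Pattern INVOICE = Pattern.compile("(?<!\\d)\\d{6,}(?!\\d)");

    // Returns the first run of six or more digits found in the first
    // three lines of the OCR text, or null if there is none.
    static String find(String ocrText) {
        String[] lines = ocrText.split("\\R", 4);   // at most three lines plus the remainder
        for (int i = 0; i < Math.min(3, lines.length); i++) {
            Matcher m = INVOICE.matcher(lines[i]);
            if (m.find()) {
                return m.group();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "header\ng DAILYW RK SHE 278464\nE C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha";
        System.out.println(find(text));
    }
}
```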

It can be solved simply with \\d{6,} as shown below:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String args[]) {
// Tests
String[] textArr1 = { ",——— ————i_", "g DAILYW RK SHE 2784647",
"E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha" };
String[] textArr2 = { ",——— ————i_", "g DAILYW RK SHE ——— ————",
"E C 0 mp] on THE 278464 POUJER Hello, Mumbai, Co. Maha" };
String[] textArr3 = { ",——— 278464————i_", "g DAILYW RK SHE POUJER",
"E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha" };
System.out.println(getInvoiceNumber(textArr1));
System.out.println(getInvoiceNumber(textArr2));
System.out.println(getInvoiceNumber(textArr3));
}
static String getInvoiceNumber(String[] textArr) {
String invoiceNumber = "";
Pattern pattern = Pattern.compile("\\d{6,}");
for (String text : textArr) {
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
invoiceNumber = matcher.group();
}
}
return invoiceNumber;
}
}
Output:
2784647
278464
278464

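Check this code:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {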
private static final Pattern p = Pattern.compile("(\\d{6,})");
public static void main(String[] args) {
try {
Scanner scanner = new Scanner(new File("here put your file path"));
System.out.println("done");
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
// create matcher for pattern p and given string
Matcher m = p.matcher(line);
// if an occurrence of the pattern was found in the given string...
if (m.find()) {
System.out.println(m.group(1)); // first capturing group: the run of six or more digits
}
}
scanner.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}

Java extract multiline values from a file

I'm reading a file line by line, and some entries have multiline values as shown below, which breaks my loop and returns unexpected results.
TSNK/Metadata/tk.filename=PZSIIF-anefnsadual-rasdfepdasdort.pdf
TSNK/Metadata/tk_ISIN=LU0291600822,LU0871812862,LU0327774492,LU0291601986,LU0291605201
,LU0291595725,LU0291599800,LU0726995649,LU0726996290,LU0726995995,LU0726995136,LU0726995482,LU0726995219,LU0855227368
TSNK/Metadata/tk_GroupCode=PZSIIF
TSNK/Metadata/tk_GroupCode/PZSIIF=y
TSNK/Metadata/tk_oneTISNumber=16244,17007,16243,11520,19298,18247,20755
TSNK/Metadata/tk_oneTISNumber_TEXT=Neo Emerging Market Corporate Debt
Neo Emerging Market Debt Opportunities II
Neo Emerging Market Investment Grade Debt
Neo Floating Rate II
Neo Upper Tier Floating Rate
Global Balanced Regulation 28
Neo Multi-Sector Credit Income
Here TSNK/Metadata/tk_ISIN and TSNK/Metadata/tk_oneTISNumber_TEXT have multiline values. While reading the file line by line, how do I read these fields as a single line?
I have tried the logic below, but it did not produce the expected result:
try {
fr = new FileReader(FILENAME);
br = new BufferedReader(fr);
String sCurrentLine;
br = new BufferedReader(new FileReader(FILENAME));
int i=1;
CharSequence OneTIS = "TSNK/Metadata/tk_oneTISNumber_TEXT";
StringBuilder builder = new StringBuilder();
while ((sCurrentLine = br.readLine()) != null) {
if(sCurrentLine.contains(OneTIS)==true) {
System.out.println("Line number here -> "+i);
builder.append(sCurrentLine);
builder.append(",");
}
else {
System.out.println("else --->");
}
//System.out.println("Line number"+i+" Value is---->>>> "+sCurrentLine);
i++;
}
System.out.println("Line number"+i+" Value is---->>>> "+builder);

This solution uses a Scanner with a custom delimiter and a regular expression compiled with the DOTALL and MULTILINE flags.
The assumption here is that every key line starts with TSNK/Metadata/
Scanner scanner = new Scanner(new File("file.txt"));
scanner.useDelimiter("TSNK/Metadata/");
Pattern p = Pattern.compile("(.*)=(.*)", Pattern.DOTALL | Pattern.MULTILINE);
// each token is one "key=value" block, possibly spanning several lines
while (scanner.hasNext()) {
String s = scanner.next();
Matcher matcher = p.matcher(s);
if (matcher.find()) {
System.out.println("key = '" + matcher.group(1) + "'");
String[] values = matcher.group(2).split("[,\n]");
int i = 0;
for (String value : values) {
System.out.println(String.format(" val(%d)='%s',", (i++), value));
}
}
}
scanner.close();
The above produces the following output:
key = 'tk.filename'
val(0)='PZSIIF-anefnsadual-rasdfepdasdort.pdf',
key = 'tk_ISIN'
val(0)='LU0291600822',
val(1)='LU0871812862',
val(2)='LU0327774492',
val(3)='LU0291601986',
val(4)='LU0291605201',
val(5)='',
val(6)='LU0291595725',
val(7)='LU0291599800',
val(8)='LU0726995649',
val(9)='LU0726996290',
val(10)='LU0726995995',
val(11)='LU0726995136',
val(12)='LU0726995482',
val(13)='LU0726995219',
val(14)='LU0855227368',
key = 'tk_GroupCode'
val(0)='PZSIIF',
key = 'tk_GroupCode/PZSIIF'
val(0)='y',
key = 'tk_oneTISNumber'
val(0)='16244',
val(1)='17007',
val(2)='16243',
val(3)='11520',
val(4)='19298',
val(5)='18247',
val(6)='20755',
key = 'tk_oneTISNumber_TEXT'
val(0)='Neo Emerging Market Corporate Debt ',
val(1)='Neo Emerging Market Debt Opportunities II ',
val(2)='Neo Emerging Market Investment Grade Debt ',
val(3)='Neo Floating Rate II ',
val(4)='Neo Upper Tier Floating Rate ',
val(5)='Global Balanced Regulation 28 ',
val(6)='Neo Multi-Sector Credit Income',
Note the empty entry (val(5) for key tk_ISIN), caused by a line break followed by a comma in that value. It can be fixed easily, either by rejecting empty strings or by adjusting the splitting pattern.
Hope this helps!
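Rejecting empty strings can be sketched as follows. This is a line-by-line variant rather than the Scanner approach above: continuation lines (those not starting with the prefix) are attached to the most recent key, and blank values are skipped. It assumes, like the answer, that every key starts with TSNK/Metadata/ and that values are separated by commas or line breaks; MetadataParser and parse are hypothetical names, and the input is passed as a String rather than read from a file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataParser {
    private static final String PREFIX = "TSNK/Metadata/";

    // Maps each key to its list of values, in file order.
    static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : text.split("\\R")) {
            String valuePart = line;
            if (line.startsWith(PREFIX)) {
                int eq = line.indexOf('=');
                if (eq < 0) {
                    continue;                               // key line without a value part
                }
                current = new ArrayList<>();
                result.put(line.substring(PREFIX.length(), eq), current);
                valuePart = line.substring(eq + 1);
            }
            if (current == null) {
                continue;                                   // text before the first key
            }
            for (String value : valuePart.split(",")) {
                String trimmed = value.trim();
                if (!trimmed.isEmpty()) {
                    current.add(trimmed);                   // reject empty entries
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "TSNK/Metadata/tk_ISIN=LU0291600822,LU0871812862\n"
                + ",LU0291595725\n"
                + "TSNK/Metadata/tk_GroupCode=PZSIIF\n";
        System.out.println(parse(text));
    }
}
```

A text value that itself contains a comma would still be split in two, which is also true of the split("[,\n]") call above.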

Why is Java placing the string before the word and not after?

From the String value, I want to get the words before and after each <in>:
String ref = "application<in>rid and test<in>efd";
int result = ref.indexOf("<in>");
int result1 = ref.lastIndexOf("<in>");
String firstWord = ref.substring(0, result);
String[] wor = ref.split("<in>");
for (int i = 0; i < wor.length; i++) {
System.out.println(wor[i]);
}
}
My expected output:
String[] output ={application,rid,test,efd}
I tried two options. The first uses indexOf, but if the String contains more than two <in> tags, I don't get my expected output.
The second uses split, which also doesn't give my expected output.
Please suggest the best way to get the words before and after <in>.

You could use an expression like this: \b([^ ]+?)<in>([^ ]+?)\b. It should match the strings before and after the <in> tag and place them in two groups.
Thus, given this:
String ref = "application<in>rid and test<in>efd";
Pattern p = Pattern.compile("\\b([^ ]+?)<in>([^ ]+?)\\b");
Matcher m = p.matcher(ref);
while(m.find())
System.out.println("Prior: " + m.group(1) + " After: " + m.group(2));
Yields:
Prior: application After: rid
Prior: test After: efd
Alternatively using split:
String[] phrases = ref.split("\\s+");
for(String s : phrases)
if(s.contains("<in>"))
{
String[] split = s.split("<in>");
for(String t : split)
System.out.println(t);
}
Yields:
application
rid
test
efd

Regex is your friend :)
public static void main(String args[]) throws Exception {
String ref = "application<in>rid and test<in>efd";
Pattern p = Pattern.compile("\\w+(?=<in>)|(?<=<in>)\\w+");
Matcher m = p.matcher(ref);
while (m.find()) {
System.out.println(m.group());
}
}
O/P :
application
rid
test
efd
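Since the question asks for a String[], the matches can be collected instead of printed. A minimal sketch built on the same lookaround pattern; the names InTagWords and wordsAroundIn are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InTagWords {
    // \w+(?=<in>)  : a word immediately followed by <in>
    // (?<=<in>)\w+ : a word immediately preceded by <in>
    private static final Pattern AROUND_IN = Pattern.compile("\\w+(?=<in>)|(?<=<in>)\\w+");

    // Returns every word that directly touches an <in> tag, in order of appearance.
    static String[] wordsAroundIn(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = AROUND_IN.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String ref = "application<in>rid and test<in>efd";
        for (String word : wordsAroundIn(ref)) {
            System.out.println(word);
        }
    }
}
```

A word that sits between two tags, as in a<in>b<in>c, is reported once, because the matcher continues after the end of each match.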

No doubt matching what you need with the Pattern/Matcher API is simpler for this problem.
However, if you're looking for a short and quick String#split solution, you can consider:
String ref = "application<in>rid and test<in>efd";
String[] toks = ref.split("<in>|\\s+.*?(?=\\b\\w+<in>)");
Output:
application
rid
test
efd
This regex splits on <in>, or on whitespace followed by zero or more characters (matched lazily) up to the next word that is directly followed by <in>.

You can also try the code below; it is quite simple:
class StringReplace1
{
public static void main(String args[])
{
String ref = "application<in>rid and test<in>efd";
System.out.println((ref.replaceAll("<in>", " ")).replaceAll(" and "," "));
}
}

Replace While Pattern is Found

I'm trying to go through a string and replace every substring that matches a regex. When I use if, it works and replaces just the first match. When I change the if to while, it does some weird replacement over itself and makes a mess of the first match while not even touching the others...
pattern = Pattern.compile(regex);
matcher = pattern.matcher(docToProcess);
while (matcher.find()) {
start = matcher.start();
end = matcher.end();
match = docToProcess.substring(start, end);
stringBuilder.replace(start, end, createRef(match));
docToProcess = stringBuilder.toString();
}

Aside from the println calls, I only added the last assignment, which re-creates the matcher on the updated string. See if it helps:
// your snippet:
pattern = Pattern.compile(regex);
matcher = pattern.matcher(docToProcess);
while (matcher.find()) {
start = matcher.start();
end = matcher.end();
match = docToProcess.substring(start, end);
String rep = createRef(match);
stringBuilder.replace(start, end, rep);
docToProcess = stringBuilder.toString();
// my addition:
System.out.println("Found: '" + matcher.group() + "'");
System.out.println("Replacing with: '" + rep + "'");
System.out.println(" --> " + docToProcess);
matcher = pattern.matcher(docToProcess);
}

I'm not sure exactly what problem you ran into, but maybe this example will help a little.
Suppose I want to swap names in a sentence like this:
Jack -> Albert
Albert -> Paul
Paul -> Jack
We can do this with a little help from the appendReplacement and appendTail methods of the Matcher class:
//this method can use Map<String,String>, or maybe even be replaced with Map.get(key)
static String getReplacement(String name) {
if ("Jack".equals(name))
return "Albert";
else if ("Albert".equals(name))
return "Paul";
else
return "Jack";
}
public static void main(String[] args) {
String sentence = "Jack and Albert are goint to see Paul. Jack is tall, " +
"Albert small and Paul is not in home.";
Matcher m = Pattern.compile("Jack|Albert|Paul").matcher(sentence);
StringBuffer sb = new StringBuffer();
while (m.find()) {
m.appendReplacement(sb, getReplacement(m.group()));
}
m.appendTail(sb);
System.out.println(sb);
}
Output:
Albert and Paul are goint to see Jack. Albert is tall, Paul small and Jack is not in home.

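If createRef(match) returns a string whose length differs from (end - start), every later match is misaligned: the matcher still reports offsets into the original string, while stringBuilder and docToProcess have already been changed by the earlier replacements, so docToProcess.substring(start, end) no longer selects the matched text.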
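One way to sidestep the index problem described above is to never index into the changing string at all and let the matcher assemble the result. A minimal sketch using Matcher.appendReplacement and Matcher.appendTail; createRef here is a hypothetical stand-in for the asker's method, and the regex in the usage line is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RefReplacer {
    // Hypothetical stand-in for the asker's createRef: wraps the match in a marker.
    static String createRef(String match) {
        return "[ref:" + match + "]";
    }

    // Replaces every match of regex in doc with createRef(match).
    // The matcher always works on the original doc, so offsets never drift.
    static String replaceAllRefs(String doc, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(doc);
        StringBuffer out = new StringBuffer();   // appendReplacement takes a StringBuffer on Java 8
        while (matcher.find()) {
            // quoteReplacement keeps any '$' or '\' in the replacement literal
            matcher.appendReplacement(out, Matcher.quoteReplacement(createRef(matcher.group())));
        }
        matcher.appendTail(out);                 // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllRefs("see A12 and B7.", "[A-Z]\\d+"));
    }
}
```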
