regex join line infos - java

I have to parse this package:
WGS AUFFUELLUNGEN
ADMIN1 23.03.
17:09 -20- 1500.00
17:10 JD20 560.00
17:11 -2.0- 112.00
ADMIN1 24.03.
14:51 JD50 500.00
ADMIN2 27.03.
08:58 JD50 500.00
----------------------
3172.00
Parsing the user and date is easy:
\r?\n(.*)\s+(\d\d\.\d\d\.)
Parsing the time, denomination and amount is easy too:
\r?\n(\d\d:\d\d)\s+(.*)\s+(\d+\.\d\d)
But I need a single parsing step that captures user, date, time, denomination and amount for every booking at once.
Any ideas?

You will need some kind of intermediate structure you can iterate over. If you can't change your Java code, you can use one regex to first match a whole block of your example string, and then, in a second step, match the details inside each block.
public class RegexTestCase {
private static final String PACKAGE
= "WGS AUFFUELLUNGEN \n" +
"ADMIN1 23.03.\n" +
"17:09 -20- 1500.00\n" +
"17:10 JD20 560.00\n" +
"17:11 -2.0- 112.00\n" +
"ADMIN1 24.03.\n" +
"14:51 JD50 500.00\n" +
"ADMIN2 27.03.\n" +
"08:58 JD50 500.00\n" +
"----------------------\n" +
" 3172.00\n";
private static final String NL = "\\r?\\n";
private static final String USER_DATE_REGEX
= "(.*?)\\s+(\\d\\d\\.\\d\\d\\.)";
private static final String TIME_AMOUNT_REGEX
= "(\\d\\d:\\d\\d)\\s+(.*?)\\s+(\\d+\\.\\d\\d)";
private static final String BLOCK_REGEX
= USER_DATE_REGEX + NL + "((" + TIME_AMOUNT_REGEX + NL + ")+)";
@Test
public void testRegex() throws Exception {
Pattern blockPattern = Pattern.compile( BLOCK_REGEX );
Pattern timeAmountPattern = Pattern.compile( TIME_AMOUNT_REGEX );
int count = 0;
Matcher blockMatcher = blockPattern.matcher( PACKAGE );
while (blockMatcher.find() ) {
String name = blockMatcher.group( 1 );
String date = blockMatcher.group( 2 );
String block = blockMatcher.group( 3 );
Matcher timeAmountMatcher = timeAmountPattern.matcher( block );
while ( timeAmountMatcher.find() ) {
String time = timeAmountMatcher.group( 1 );
String denom = timeAmountMatcher.group( 2 );
String amount = timeAmountMatcher.group( 3 );
assertEquals( "wrong name", RESULTS[count].name, name );
assertEquals( "wrong date", RESULTS[count].date, date );
assertEquals( "wrong time", RESULTS[count].time, time );
assertEquals( "wrong denom", RESULTS[count].denom, denom );
assertEquals( "wrong amount", RESULTS[count].amount, amount );
count++;
}
}
assertEquals( "wrong number of results", 5, count);
}
private static final Result[] RESULTS
= { new Result("ADMIN1", "23.03.", "17:09", "-20-", "1500.00")
, new Result("ADMIN1", "23.03.", "17:10", "JD20", "560.00")
, new Result("ADMIN1", "23.03.", "17:11", "-2.0-", "112.00")
, new Result("ADMIN1", "24.03.", "14:51", "JD50", "500.00")
, new Result("ADMIN2", "27.03.", "08:58", "JD50", "500.00")
};
static final class Result {
private final String name;
private final String date;
private final String time;
private final String denom;
private final String amount;
Result( String name, String date, String time, String denom, String amount ) {
this.name = name;
this.date = date;
this.time = time;
this.denom = denom;
this.amount = amount;
}
}
}

Your second regex is too greedy.
I suggest changing it to \r?\n(\d\d:\d\d)\s+(.*?)\s+(\d+\.\d\d)
The following regex matches the user/date lines and the time/denomination/amount lines with one pattern; it needs the multiline flag:
(^(.*)\s+(\d\d\.\d\d\.)$|^(\d\d:\d\d)\s+(.*)\s+(\d+\.\d\d)$)+
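In Java, a capturing group inside a repetition keeps only its last match, so one way to get all five fields per booking is to scan line by line with find() and remember the most recent user/date line. A minimal sketch of that idea follows. The class name BookingScan and the String[] result layout are made up for illustration, and it assumes user names and denominations contain no whitespace and that lines have no trailing blanks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BookingScan {
    // Alternative 1: "user date" header line; alternative 2: "time denomination amount" line.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+)[ \\t]+(\\d\\d\\.\\d\\d\\.)$"
        + "|^(\\d\\d:\\d\\d)[ \\t]+(\\S+)[ \\t]+(\\d+\\.\\d\\d)$",
        Pattern.MULTILINE);

    // Returns one {user, date, time, denomination, amount} array per booking line.
    static List<String[]> parse(String text) {
        List<String[]> bookings = new ArrayList<>();
        String user = null;
        String date = null;
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // Header line: remember user and date for the bookings that follow.
                user = m.group(1);
                date = m.group(2);
            } else if (user != null) {
                // Booking line: combine it with the remembered header.
                bookings.add(new String[] { user, date, m.group(3), m.group(4), m.group(5) });
            }
        }
        return bookings;
    }

    public static void main(String[] args) {
        String text = "WGS AUFFUELLUNGEN\nADMIN1 23.03.\n17:09 -20- 1500.00\n"
            + "ADMIN2 27.03.\n08:58 JD50 500.00\n";
        for (String[] b : parse(text)) {
            System.out.println(String.join(" | ", b));
        }
    }
}
```

Using [ \t]+ instead of \s+ keeps a match from running across a line break, which \s+ would allow.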

Split the entire string by newline.
Iterate over each line and:
a. look for the user name and date with regex1; if it matches, extract userName and date.
b. if regex1 doesn't match, look for time, denomination and amount with regex2; if it matches,
extract time, denomination and amount from the line.
final String userRegex = "^(\\w+)\\s+(\\d+\\.\\d+\\.)$";
final String timeRegex = "^(\\d+:\\d+)\\s+([\\S]+)\\s+(\\d+\\.?\\d+)$";
Sample Source:
public static void main(String[] args) {
final String userRegex = "^(\\w+)\\s+(\\d+\\.\\d+\\.)$";
final String timeRegex = "^(\\d+:\\d+)\\s+([\\S]+)\\s+(\\d+\\.?\\d+)$";
final String string = "WGS AUFFUELLUNGEN\n"
+ "ADMIN1 23.03.\n"
+ "17:09 -20- 1500.00\n"
+ "17:10 JD20 560.00\n"
+ "17:11 -2.0- 112.00\n"
+ "ADMIN1 24.03.\n"
+ "14:51 JD50 500.00\n"
+ "ADMIN2 27.03.\n"
+ "08:58 JD50 500.00\n"
+ "----------------------\n"
+ " 3172.00\n";
String[] list = string.split("\n");
Matcher m;
int cnt=1;
for (String s : list) {
m=Pattern.compile(userRegex).matcher(s);
if (m.matches()) {
System.out.println("##### List "+cnt+" ######");
System.out.println("User Name:"+m.group(1));
System.out.println("Date :"+m.group(2));
cnt++;
}
else
{
m=Pattern.compile(timeRegex).matcher(s);
if(m.matches())
{
System.out.println("Time :"+m.group(1));
System.out.println("Denomination :"+m.group(2));
System.out.println("Amount :"+m.group(3));
System.out.println("---------------------");
}
}
}
}

Related

Split filename into groups

Input:
"MyPrefix_CH-DE_ProductName.pdf"
Desired output:
["MyPrefix", "CH", "DE", "ProductName"]
CH is a country code, and it should come from a predefined list, eg. ["CH", "IT", "FR", "GB"]
Edit: prefix can contain _ and - as well but not CH or DE.
DE is a language code, and it should come from a predefined list, eg. ["EN", "IT", "FR", "DE"]
How do I do that?
I'm looking for a regex based solution here.
I'll assume that the extension is always pdf
String str = "MyPref_ix__CH-DE_ProductName.pdf";
String regex = "(.*)_(CH|IT|FR|GB)-(EN|IT|FR|DE)_(.*)\\.pdf";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
String[] res = new String[4];
if(matcher.matches()) {
res[0] = matcher.group(1);
res[1] = matcher.group(2);
res[2] = matcher.group(3);
res[3] = matcher.group(4);
}
You can try the following
String input = "MyPrefix_CH-DE_ProductName.pdf";
String[] segments = input.split("_");
String prefix = segments[0];
String countryCode = segments[1].split("-")[0];
String languageCode = segments[1].split("-")[1];
String fileName = segments[2].substring(0, segments[2].length() - 4);
System.out.println("prefix " + prefix);
System.out.println("countryCode " + countryCode);
System.out.println("languageCode " + languageCode);
System.out.println("fileName " + fileName);
This code does the split and creates an object from the result, which is more object-oriented.
package com.local;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
/**
* Hello world!
*
*/
public class App
{
public static void main( String[] args )
{
List<String> countries = Arrays.asList("CH", "IT", "FR", "GB");
List<String> languages = Arrays.asList("EN", "IT", "FR", "DE");
String filename = "MyPrefix_CH-DE_ProductName.pdf";
//Remove the file extension
filename = filename.split("\\.")[0];
List<String> result = Arrays.asList(filename.split("[_\\-]"));
FileNameSplitResult resultOne = new FileNameSplitResult(result.get(0), result.get(1), result.get(2), result.get(3));
System.out.println(resultOne);
}
static class FileNameSplitResult{
String prefix;
String country;
String language;
String productName;
public FileNameSplitResult(String prefix, String country, String language, String productName) {
this.prefix = prefix;
this.country = country;
this.language = language;
this.productName = productName;
}
@Override
public String toString() {
return "FileNameSplitResult{" +
"prefix='" + prefix + '\'' +
", country='" + country + '\'' +
", language='" + language + '\'' +
", productName='" + productName + '\'' +
'}';
}
}
}
Result of execution:
FileNameSplitResult{prefix='MyPrefix', country='CH', language='DE', productName='ProductName'}
You can use String.split two times so you can first split by '_' to get the CH-DE string and then split by '-' to get the CountryCode and LanguageCode.
Updated after your edit, with input containing '_' and '-':
The following code scans through the input String to find a country match. I changed the input to "My-Pre_fix_CH-DE_ProductName.pdf"
Check the following code:
public static void main(String[] args) {
String [] countries = {"CH", "IT", "FR", "GB"};
String input = "My-Pre_fix_CH-DE_ProductName.pdf";
//First scan to find country position
int index = -1;
for (int i=0; i<input.length()-4; i++){
for (String country:countries){
String match = "_" + country + "-";
String toMatch = input.substring(i, match.length()+i);
if (match.equals(toMatch)){
//Found index
index=i;
break;
}
}
}
String prefix = input.substring(0,index);
String remaining = input.substring(index+1);//remaining is CH-DE_ProductName.pdf
String [] countryLanguageProductCode = remaining.split("_");
String country = countryLanguageProductCode[0].split("-")[0];
String language = countryLanguageProductCode[0].split("-")[1];
String productName = countryLanguageProductCode[1].split("\\.")[0];
System.out.println("[\"" + prefix +"\", \"" + country + "\", \"" + language +"\", \"" + productName+"\"]");
}
It outputs:
["My-Pre_fix", "CH", "DE", "ProductName"]
You can use the following regex :
^(.*?)_(CH|IT|FR|GB)-(EN|IT|FR|DE)_(.*)$
Java code :
Pattern p = Pattern.compile("^(.*?)_(CH|IT|FR|GB)-(EN|IT|FR|DE)_(.*)$");
Matcher m = p.matcher(input);
if (m.matches()) {
String[] result = { m.group(1), m.group(2), m.group(3), m.group(4) };
}
Note that it would still fail if the prefix could contain a substring like _CH-EN_, and I don't think there's much that can be done about it besides sanitizing the inputs.
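Since the country and language codes come from predefined lists, the alternations can be built from those lists instead of being typed into the pattern. This is only a sketch: the class name FileNameParts and the method split are made up, and it assumes the extension is always .pdf and the codes contain no regex metacharacters (otherwise each code would need Pattern.quote).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameParts {
    static final List<String> COUNTRIES = Arrays.asList("CH", "IT", "FR", "GB");
    static final List<String> LANGUAGES = Arrays.asList("EN", "IT", "FR", "DE");

    // Builds ^(.*?)_(CH|IT|FR|GB)-(EN|IT|FR|DE)_(.*)\.pdf$ from the two lists.
    static final Pattern FILE = Pattern.compile(
        "^(.*?)_(" + String.join("|", COUNTRIES) + ")-("
        + String.join("|", LANGUAGES) + ")_(.*)\\.pdf$");

    // Returns {prefix, country, language, product}, or null if the name does not match.
    static String[] split(String name) {
        Matcher m = FILE.matcher(name);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // prints [My-Pre_fix, CH, DE, ProductName]
        System.out.println(Arrays.toString(split("My-Pre_fix_CH-DE_ProductName.pdf")));
    }
}
```

The lazy (.*?) makes the prefix stop at the first _COUNTRY-LANGUAGE_ occurrence, so underscores and hyphens inside the prefix are kept.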
One more alternative, which is pretty much the same as @billal GHILAS's and @Aaron's answers but uses named groups. I find them handy because anyone who looks at the code later, myself included, can immediately see what the regex does.
String str = "My_Prefix_CH-DE_ProductName.pdf";
Pattern filePattern = Pattern.compile("(?<prefix>\\w+)_"
+ "(?<country>CH|IT|FR|GB)-"
+ "(?<language>EN|IT|FR|DE)_"
+ "(?<product>\\w+)\\.");
Matcher file = filePattern.matcher(str);
file.find();
System.out.println("Prefix: " + file.group("prefix"));
System.out.println("Country: " + file.group("country"));
System.out.println("Language: " + file.group("language"));
System.out.println("Product: " + file.group("product"));

Strings manipulation in java

I have a multiline String as below. I want to extract 'VC-38NN' whenever a line contains 'Profoma invoice'. My code below still prints the whole string once the search string is found.
Payment date
receipt serial
Profoma invoice VC-38NN
Welcome again
if(multilineString.toLowerCase().contains("Profoma invoice".toLowerCase()))
{
System.out.println(multilineString+"");
}
else
{
System.out.println("Profoma invoice not found");
}
Here are two possible solutions:
String input = "Payment date\n" +
"receipt serial\n" +
"Profoma invoice VC-38NN\n" +
"Welcome again";
// non-regex solution
String uppercased = input.toUpperCase();
// find "profoma invoice"
int profomaInvoiceIndex = uppercased.indexOf("PROFOMA INVOICE ");
if (profomaInvoiceIndex != -1) {
// find the first new line character after "profoma invoice".
int newLineIndex = uppercased.indexOf("\n", profomaInvoiceIndex);
if (newLineIndex == -1) { // if there is no new line after that, use the end of the string
newLineIndex = uppercased.length();
}
int profomaInvoiceLength = "profoma invoice ".length();
// substring from just after "profoma invoice" to the new line, taken from input so the original case is kept
String result = input.substring(profomaInvoiceIndex + profomaInvoiceLength, newLineIndex);
System.out.println(result);
}
// regex solution
Matcher m = Pattern.compile("^profoma invoice (.+)$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE).matcher(input);
if (m.find()) {
System.out.println(m.group(1));
}
Explanation in comments:
public class StackOverflow55313851 {
public final static String TEXT = "Profoma invoice";
public static void main(String[] args) {
String multilineString = "Payment date\n" +
"receipt serial\n" +
"Profoma invoice VC-38NN\n" +
"Welcome again";
// split text by line breaks
String[] lines = multilineString.split("\n");
// iterate over every line
for (String line : lines) {
// if it contains desired text
if (line.toLowerCase().contains(TEXT.toLowerCase())) {
// find position of desired text in this line
int indexOfInvoiceText = line.toLowerCase().indexOf(TEXT.toLowerCase());
// get only part of the line following the desired text
String invoiceNumber = line.substring(indexOfInvoiceText + TEXT.length() + 1);
System.out.println(invoiceNumber);
}
}
}
}

Can't format the output the correct way

I have this code to read each line of a file in this format: "603,The Matrix,1999-03-30,63000000,136,7.9,9079"
I only need the first 3 fields, and the third one, a date, has to be split further: the year, the month and the day should each go into their own variable. But the output I get is this: "[603 | The Matrix | 03-603,The Matrix,1999-1999-03-30"
int i;
Scanner leitorFicheiroFilmes = new Scanner(ficheiroFilmes);
ArrayList<Filmes> filme = new ArrayList<>();
for (i = 0; leitorFicheiroFilmes.hasNextLine(); i++) {
String line = leitorFicheiroFilmes.nextLine();
String dados[] = line.split(",");
if (dados.length == 7) {
int idFilme = Integer.parseInt(dados[0]);
String titulo = dados[1];
String dadosNew[] = line.split("-");
String ano = dados[2];
String mes = dadosNew[0];
String dia = dadosNew[1];
filme.add(new Filmes(idFilme, titulo, ano, mes, dia, parseActoresFile(), parseGenerosFile(idFilme)));
}
}
this is the class with the constructor:
public class Filmes {
int idFilme;
String titulo;
ArrayList<Actores> actores = new ArrayList<Actores>();
ArrayList<GenerosCinematograficos> generos = new ArrayList<GenerosCinematograficos>();
String year, month, day;
public Filmes(int idFilme, String titulo, String year, String month, String day, ArrayList<Actores> actores, ArrayList<GenerosCinematograficos> generos) {
this.idFilme = idFilme;
this.titulo = titulo;
this.year = year;
this.month = month;
this.day = day;
this.actores = actores;
this.generos = generos;
}
public String toString() {
return idFilme + " | " + titulo + " | " + day + "-" + month + "-" + year;
}
}
String dadosNew[] = line.split("-");
must be
String dadosNew[] = dados[2].split("-");
The dadosNew array will then be ["1999", "03", "30"], from which you can get the year, month and day by accessing the correct indices.
You are reading the wrong values into your variables when parsing the date. Change
String dadosNew[] = line.split("-");
String ano = dados[2];
String mes = dadosNew[0];
String dia = dadosNew[1];
to
String dadosNew[] = dados[2].split("-");
String ano = dadosNew[0];
String mes = dadosNew[1];
String dia = dadosNew[2];
The problem is here :
String dadosNew[] = line.split("-");
With the input (line) being "603,The Matrix,1999-03-30,63000000,136,7.9,9079", the result will be:
{"603,The Matrix,1999", "03", "30,63000000,136,7.9,9079"}
You want to split only the date, and this is contained in dados[2], so to correct it you have to do :
String dadosNew[] = dados[2].split("-");
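As an alternative to splitting the date by hand, the third field is in ISO yyyy-MM-dd form, so java.time.LocalDate can parse it directly. A small sketch, assuming that format holds for every line; the class name DateFieldSketch and the method parseDate are made up for illustration.

```java
import java.time.LocalDate;

class DateFieldSketch {
    // Returns {year, month, day} taken from the third comma-separated field of a line.
    static int[] parseDate(String line) {
        String[] dados = line.split(",");
        // LocalDate.parse expects ISO format (yyyy-MM-dd) and throws
        // DateTimeParseException for anything else.
        LocalDate date = LocalDate.parse(dados[2]);
        return new int[] { date.getYear(), date.getMonthValue(), date.getDayOfMonth() };
    }

    public static void main(String[] args) {
        int[] ymd = parseDate("603,The Matrix,1999-03-30,63000000,136,7.9,9079");
        System.out.println(ymd[2] + "-" + ymd[1] + "-" + ymd[0]); // prints 30-3-1999
    }
}
```

This also rejects malformed dates instead of silently storing wrong strings, at the cost of numeric fields that lose the leading zero.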

multiple String location find using a key in a tag

I want to parse an input such as GH123FG12B1A58.
The keys 'GH' / 'FG' / 'B' / 'A' appear in every tag in the same order but at different positions, e.g. GH14555523FG1555552B55551A55558
I need to find the value after each key.
I see this can be done with a Pattern, using the start and end indices. Is there any other way to accomplish this?
import java.util.Scanner;
public class Shipparse {
public static void main(String[] args) {
@SuppressWarnings("resource")
Scanner Iname = new Scanner(System.in);
System.out.println("Enter the invoice :");
String Maintag = Iname.nextLine();
String GH = "GH";
String FG = "FG";
String AB = "AB";
String B = "B";
String A = "A";
int cus = Maintag.indexOf(GH);
int cys = Maintag.indexOf(FG);
int ats = Maintag.indexOf(AB);
int ss = Maintag.indexOf(B);
int se = Maintag.indexOf(A);
int tlength = Maintag.length();
StringBuilder str = new StringBuilder(Maintag);
String cnum;
if ( ats == -1) {
cnum = str.substring((cus +2) , cys);
System.out.println("Customer :" + cnum);
String cyn = str.substring((cys + 2), ss);
System.out.println("Agent :" + cyn);
} else {
cnum = str.substring((cus +2) , ats);
System.out.println("Customer :" + cnum);
String cyn = str.substring((ats + 2), ss);
System.out.println("Company:" + cyn);
}
String spoint = str.substring((ss +1) , se);
System.out.println("TYPE NUM:" + spoint);
String send = str.substring((se +1) , tlength);
System.out.println("FIELD NUM :" + send);
}
}
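One possible regex-based variant, as a sketch only: it assumes every value is a run of digits and that the keys always appear in the order GH, FG, B, A, as in the two example tags. It does not cover the optional AB key handled in the code above. The class name TagSketch and the group names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagSketch {
    // Assumption: digits-only values, keys in the fixed order GH, FG, B, A.
    private static final Pattern TAG = Pattern.compile(
        "GH(?<customer>\\d+)FG(?<agent>\\d+)B(?<type>\\d+)A(?<field>\\d+)");

    // Returns {customer, agent, type, field}, or null if the tag does not fit the layout.
    static String[] parse(String tag) {
        Matcher m = TAG.matcher(tag);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group("customer"), m.group("agent"), m.group("type"), m.group("field")
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", parse("GH123FG12B1A58"))); // prints 123, 12, 1, 58
    }
}
```

Compared with chained indexOf calls, the pattern states the expected layout in one place and returns null for tags that do not fit it.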

java regular expression check date format

private static String REGEX_ANY_MONTH = "January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|"
+ "July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec";
private static String REGEX_ANY_YEAR = "[0-9]{4}";
private static String REGEX_ANY_DATE = "[0-9]{1,2}";
private String getFormat(String date) {
if (date.matches(REGEX_ANY_MONTH + " " + REGEX_ANY_DATE + "," + " " + REGEX_ANY_YEAR)) {
return "{MONTH} {DAY}, {YEAR}";
} else if (date.matches(REGEX_ANY_MONTH + " " + REGEX_ANY_YEAR)){
return "{MONTH} {YEAR}";
}
return null;
}
@Test
public void testGetFormatDateString() throws Exception{
String format = null;
String test = null;
test = "March 2012";
format = Whitebox.<String> invokeMethod(obj, "getFormat", test);
Assert.assertEquals("{MONTH} {YEAR}", format);
test = "March 10, 2012";
format = Whitebox.<String> invokeMethod(obj, "getFormat", test);
Assert.assertEquals("{MONTH} {DATE}, {YEAR}", format);
}
Both of the asserts are failing for me. What am I missing?
You need to wrap your piped list of month names in parentheses in order for it to match.
private static String REGEX_ANY_MONTH = "(January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|"
+ "July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)";
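Otherwise the 'or' condition will be or-ing more than just the month: without the parentheses, the last alternative is read as "Dec" followed by the rest of the pattern, and every other alternative is a bare month name. Note also that the second assertion expects "{MONTH} {DATE}, {YEAR}" while getFormat returns "{MONTH} {DAY}, {YEAR}", so it will keep failing until the two strings agree.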
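The effect of the missing parentheses can be seen with a shortened month list. This is an illustrative sketch, not the original test: the class name AlternationSketch and the two helper methods are made up.

```java
class AlternationSketch {
    static final String MONTHS = "March|Mar"; // shortened month list, for the sketch only
    static final String YEAR = "[0-9]{4}";

    // Ungrouped: the pattern "March|Mar [0-9]{4}" reads as "March" OR "Mar <year>".
    static boolean ungrouped(String date) {
        return date.matches(MONTHS + " " + YEAR);
    }

    // Grouped: "(March|Mar) [0-9]{4}", so the year follows every alternative.
    static boolean grouped(String date) {
        return date.matches("(" + MONTHS + ") " + YEAR);
    }

    public static void main(String[] args) {
        System.out.println(ungrouped("March 2012")); // false
        System.out.println(grouped("March 2012"));   // true
    }
}
```

With the ungrouped pattern only the last alternative carries the year, which is why "Mar 2012" would match while "March 2012" does not.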
