Regular Expression Statement - java

I've never been good with regex and I can't seem to get this...
I am trying to match statements along these lines (these are two lines in a text file I'm reading)
Lname Fname 12.35 1
Jones Bananaman 7.1 3
Currently I am using this for a while statement
reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")
But it doesn't enter the while statement.
The program reads the text file just fine when I remove the while.
The code segment is this:
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath);
while(reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")){
employeeInfo.add(new EmployeeFile(reader.next(), reader.next(), reader.nextDouble(), reader.nextInt(), new employeeRemove()));
}
for(EmployeeFile element: employeeInfo){
output.add(element);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}

Use the \s character class for the spaces between words:
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]"))
Update:
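According to the Javadoc for the Scanner class, it splits its input into tokens on whitespace by default. That means hasNext(pattern) is only ever tested against a single word such as Lname, so a pattern describing the whole line can never match. You can change the delimiter with the useDelimiter(String pattern) method of Scanner.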
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath).useDelimiter("\\n");
...
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]")){
...
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
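A minimal sketch of a line-based alternative, in case changing the delimiter gets awkward (with a "\\n" delimiter, reader.next() returns the whole line, so the reader.next(), reader.nextDouble() and reader.nextInt() calls in the question would no longer pick out the individual fields). The class name EmployeeLineReader and the method readAll are hypothetical, and the sketch reads from a String rather than from openedPath so that it is self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
final class EmployeeLineReader {
    // Same pattern as in the question, with capture groups and \s+ between fields.
    private static final Pattern LINE =
            Pattern.compile("(\\w+)\\s+(\\w+)\\s+(\\d*\\.\\d{1,2})\\s+([0-5])");

    /** Returns one String[4] {last, first, amount, code} per line that matches. */
    static List<String[]> readAll(String text) {
        List<String[]> rows = new ArrayList<>();
        try (Scanner reader = new Scanner(text)) {
            // Read whole lines, then let the regex split each line into fields.
            while (reader.hasNextLine()) {
                Matcher m = LINE.matcher(reader.nextLine().trim());
                if (m.matches()) {
                    rows.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4)});
                }
            }
        }
        return rows;
    }
}
```

The captured strings can then be converted with Double.parseDouble and Integer.parseInt before building each EmployeeFile. Lines that do not match are skipped silently here; whether that is the right behaviour depends on your file.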

From what I can see (And correct me if I'm wrong, because regex always seems to trick my brain :p), you're not handling the spaces correctly. You need to use \s, not just the standard ' ' character
EDIT: Sorry, \s. Someone else beat me to it :p

Actually
\w+
is going to catch [Lname, Fname, 12, 35, 1] for Lname Fname 12.35 1. So you can just store reader.nextLine() and then extract all regex matches from it. From there, you can abstract it a bit, for instance:
class EmployeeFile {
.....
public EmployeeFile(String firstName, String lastName,
Double firstDouble, int firstInt,
EmployeeRemove er){
.....
}
public EmployeeFile(String line) {
//TODO : extract all the required info from the string array
// instead of doing it while reading at the same time.
// Keep input parsing separate from input reading.
// Turn this into a string array using the regex pattern
// mentioned above
}
}

I created my own version, without files and the last loop, that goes like that:
private static void initializeFileData() {
String[] testStrings = {"Lname Fname 12.35 1", "Jones Bananaman 7.1 3"};
Pattern myPattern = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(\\d*\\.\\d{1,2})\\s+([0-5])");
for (String s : testStrings) {
Matcher myMatcher = myPattern.matcher(s);
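// groupCount() only reports how many groups the pattern defines;
// matches() actually runs the match, which group() requires.
if (myMatcher.matches()) {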
String lastName = myMatcher.group(1);
String firstName = myMatcher.group(2);
double firstValue = Double.parseDouble(myMatcher.group(3) );
int secondValue = Integer.parseInt(myMatcher.group(4));
//employeeInfo.add(new EmployeeFile(lastName, firstName, firstValue, secondValue, new employeeRemove()));
}
}
}
Notice that I kept the backslash before the dot (you want a literal dot, not any character), used \s+ between the fields, and added parentheses to create the groups.
I hope it helps.

Related

Java Matcher start in respect to line

I'm trying to write a program that checks whether any of a list of keys is present in a long text.
I read the text in batches of 1000 lines and feed each batch to the matcher as a single String, with the lines joined by the \n newline character.
When there is a match I call match.start() to get the position of the match. But it returns the position within the whole batch string, not the position within the line.
Here's text example:
The Project Gutenberg EBook of The Adventures of Sherlock Holmes
by Sir Arthur Conan Doyle
(#15 in our series by Sir Arthur Conan Doyle)
Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.
I consume it using this method:
public String readLinesBatch(int startLine, int step, String file) {
try (Stream<String> lines = Files.lines(Paths.get(file))) {
return lines.skip(startLine)
.limit(step).collect(Collectors.joining(System.lineSeparator()));
} catch (IOException e) {
log.error("Exception while reading lines: {}", e.getMessage());
}
return "";
}
After that I feed it to the Matcher method:
public List<OffsetResult> matchV1(String source, Integer line) {
List<OffsetResult> result = new ArrayList<>();
Matcher match = Pattern.compile(String.join("|", keys))
.matcher(source);
while (match.find()) {
int offsetStart = match.start();
getLine(source, match.start());
result.add(new OffsetResult(match.group(), line, offsetStart));
}
return result;
}
The result I receive if Arthur is present in my keys is next:
Arthur=[charOffset=72]
But I need it to be 7, because the word Arthur occurs on line 2 at position 7.
I googled and found nothing regarding this issue.
Does anyone have some ideas?
Thanks in advance!
UPD: my OffsetResult class
public class OffsetResult {
private String key;
private Integer lineOffset;
private Integer charOffset;
}
You can either split the string into lines, find the position within each line, and take e.g. the first match:
Arrays.stream(input.split(String.format("%n")))
.map(s -> someMethodReturningPositionOrNull(s))
.filter(s -> s != null)
.findFirst()
.orElseGet(someDefaultValueOrNull)
or extend your regular expression to capture the last newline character just before the searched string (in that case you also need to handle a match that occurs before the first newline character).
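A third option, sketched here under the assumption that you keep matching against the joined batch: convert the absolute offset from match.start() into a line index and a column by walking the newline characters that precede it. LineOffsets and locate are hypothetical names, and both returned values are 0-based:

```java
// Hypothetical helper, not part of the original code.
final class LineOffsets {
    /**
     * Converts an absolute offset in text into {lineIndex, column}, both 0-based.
     * Lines are assumed to end in '\n' (a preceding '\r' does not affect the column).
     */
    static int[] locate(String text, int absoluteOffset) {
        int line = 0;
        int lineStart = 0;
        int newline;
        // Advance past every newline that lies before the offset.
        while ((newline = text.indexOf('\n', lineStart)) >= 0 && newline < absoluteOffset) {
            line++;
            lineStart = newline + 1;
        }
        return new int[] {line, absoluteOffset - lineStart};
    }
}
```

Inside the while (match.find()) loop you could call LineOffsets.locate(source, match.start()) and add the first element to the batch's starting line number. For the sample text, the offset 72 of Arthur maps to line index 1 (the second line) and column 7.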

Regex to remove line break within double quote in CSV

Hi, I have a CSV file with an error in it that I want to correct with a regular expression: some of the fields contain a line break. Example below:
"AHLR150","CDS","-1","MDCPBusinessRelationshipID",,,"Investigating","1600 Amphitheatre Pkwy
California",,"Mountain View",,"United States",,"California",,,"94043-1351","9958"
The above two lines should be on one line:
"AHLR150","CDS","-1","MDCPBusinessRelationshipID",,,"Investigating","1600 Amphitheatre PkwyCalifornia",,"Mountain View",,"United States",,"California",,,"94043-1351","9958"
I tried the regex below, but it didn't help:
%s/\\([^\"]\\)\\n/\\1/
Try this:
public static void main(String[] args) {
String input = "\"AHLR150\",\"CDS\",\"-1\",\"MDCPBusinessRelationshipID\","
+ ",,\"Investigating\",\"1600 Amphitheatre Pkwy\n"
+ "California\",,\"Mountain View\",,\"United\n"
+ "States\",,\"California\",,,\"94043-1351\",\"9958\"\n";
Matcher matcher = Pattern.compile("\"([^\"]*[\n\r].*?)\"").matcher(input);
Pattern patternRemoveLineBreak = Pattern.compile("[\n\r]");
String result = input;
while(matcher.find()) {
String quoteWithLineBreak = matcher.group(1);
String quoteNoLineBreaks = patternRemoveLineBreak.matcher(quoteWithLineBreak).replaceAll(" ");
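// replace() treats both arguments literally; replaceFirst() would interpret
// the quoted text as a regex and break on characters such as ( or $.
result = result.replace(quoteWithLineBreak, quoteNoLineBreaks);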
}
//Output
System.out.println(result);
}
Output:
"AHLR150","CDS","-1","MDCPBusinessRelationshipID",,,"Investigating","1600 Amphitheatre Pkwy California",,"Mountain View",,"United States",,"California",,,"94043-1351","9958"
Create a regex that wraps the text you want to keep in parentheses; each pair of parentheses creates a group of matched characters. Then build the replacement string from the group indexes.
String test = "\"AHLR150\",\"CDS\",\"-1\",\"MDCPBusinessRelationshipID\","
+ ",,\"Investigating\",\"1600 Amphitheatre Pkwy\n"
+ "California\",,\"Mountain View\",,\"United\n"
+ "States\",,\"California\",,,\"94043-1351\",\"9958\"\n";
System.out.println(test.replaceAll("(\"[^\"]*)\n([^\"]*\")", "$1$2"));
So when we replace the matching string ("United\nStates") with $1$2, we remove the line break because it does not belong to any group:
$1 => the first group (\"[^\"]*), which matches "United
$2 => the second group ([^\"]*\"), which matches States"
Based on this you can try with:
/\r?\n|\r/
I checked it here and seems to be fine
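If the regex variants turn out to be fragile, a small character scan is another way to do it. This is only a sketch, assuming fields are quoted with double quotes and that a literal quote inside a field is written as two double quotes; QuotedLineBreakRemover is a hypothetical name:

```java
// Hypothetical helper, not part of the original code.
final class QuotedLineBreakRemover {
    /** Removes \n and \r characters that occur inside double-quoted fields. */
    static String removeLineBreaksInsideQuotes(String csv) {
        StringBuilder out = new StringBuilder(csv.length());
        boolean insideQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // Every quote toggles the state; a doubled quote ("") toggles
                // twice and so leaves the state unchanged.
                insideQuotes = !insideQuotes;
                out.append(c);
            } else if (insideQuotes && (c == '\n' || c == '\r')) {
                // Line break inside a quoted field: drop it.
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Line breaks between records are left alone because they occur outside quotes. To join the two halves with a space instead, append ' ' in the branch that currently drops the character.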

SwiftMessage Regular expression

I have the below message:
{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}
I want it converted as shown below, with whitespace inserted in block 4, which should become
{4: :20:TEST000001 :23B:CRED :32A:141117EUR0,1 :33B:EUR1000,00 :50A:ANZBAU30 :59:ANZBAU30 :71A:SHA -}
so that the whole message reads
{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4: :20:TEST000001 :23B:CRED :32A:141117EUR0,1 :33B:EUR1000,00 :50A:ANZBAU30 :59:ANZBAU30 :71A:SHA -}{5:{CHK:1DBBF1D81EE1}{TNG:}}
I tried to extract the blocks using groups and then apply a regular expression, but I was unsuccessful and cannot find the error I am making.
public static void StringReplace() {
String data = "{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}";
Pattern pat = Pattern.compile("(({1:\\w+})({2:\\w+})({4::\\d+:\\w+:\\d+.:\\w+:\\d+.:\\d+\\w+,\\d:\\d+.:\\w+,\\d+:\\d+.:\\w+:\\d+:\\w+:\\d+.:\\w+-})({5:{\\w+:.\\w+}{\\w+.}}))");
Matcher m = pat.matcher(data);
if(m.matches()) {
System.out.println(m.group(0));
}
}
Thanks in advance
You have only matched the string and printed it; you haven't added any logic for inserting the spaces. That logic needs to be applied to block 4. Note also that literal braces must be escaped as \\{ and \\} in a Java regex.
Looking at the expected output of your block 4, you can first capture block 4 using this regex,
(.*)(\\{4.*?\\})(.*)
and then replace each colon with a space followed by a colon ( :) in the content of group 2, which is your block 4. I see you are not inserting a space before every colon, only before a colon that is followed by 2-3 word characters and another colon. I have implemented that with a lookahead in the replaceAll() call.
Here is the modified Java code,
public static void StringReplace() {
String data = "{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}";
Pattern pat = Pattern.compile("(.*)(\\{4.*?\\})(.*)");
Matcher m = pat.matcher(data);
if (m.find()) {
String g1 = m.group(1);
String g2 = m.group(2).replaceAll(":(?=\\w{2,3}:)", " :");
String g3 = m.group(3);
System.out.println(g1 + g2 + g3);
} else {
System.out.println("Didn't match");
}
}
This prints the following output. It differs from your expected output in one place: no space is inserted before the trailing -} of block 4,
{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4: :20:TEST000001 :23B:CRED :32A:141117EUR0,1 :33B:EUR1000,00 :50A:ANZBAU30 :59:ANZBAU30 :71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}
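If the space before the trailing -} is also required, a second replacement on block 4 can add it. A minimal sketch, assuming block 4 always ends in -} and that the tags are 2-3 word characters between colons; SwiftBlock4Spacer and spaceBlock4 are hypothetical names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
final class SwiftBlock4Spacer {
    // Block 4 is assumed to start with "{4:" and to end at the first "-}".
    private static final Pattern BLOCK4 = Pattern.compile("\\{4:.*?-\\}");

    static String spaceBlock4(String message) {
        Matcher m = BLOCK4.matcher(message);
        if (!m.find()) {
            return message; // no block 4: leave the message unchanged
        }
        String block = m.group()
                // Space before every colon that starts a tag such as :20: or :23B:
                .replaceAll(":(?=\\w{2,3}:)", " :")
                // Space before the closing "-}"
                .replaceAll("-\\}$", " -}");
        return message.substring(0, m.start()) + block + message.substring(m.end());
    }
}
```

Only the text of block 4 is rewritten; blocks 1, 2 and 5 are copied through untouched.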

parse csv, do not split within single OR double quotes

I am trying to parse a CSV file with Java and have the following issue: the second column is a String (which may also contain commas) enclosed in double quotes, except when the string itself contains a double quote, in which case the entire string is enclosed in single quotes.
Lines may look like this:
someStuff,"hello", someStuff
someStuff,"hello, SO", someStuff
someStuff,'say "hello, world"', someStuff
someStuff,'say "hello, world', someStuff
someStuff are placeholders for other elements, which can also include quotes in the same style
I'm looking for a generic way to split the lines at commas UNLESS enclosed in single OR double quotes in order to get the second column as a String. With second column I mean the fields:
hello
hello, SO
say "hello, world"
say "hello, world
I tried OpenCSV but failed, as one can only specify one type of quote:
public class CSVDemo {
public static void main(String[] args) throws IOException {
CSVDemo demo = new CSVDemo();
demo.process("input.csv");
}
public void process(String fileName) throws IOException {
String file = this.getClass().getClassLoader().getResource(fileName)
.getFile();
CSVReader reader = new CSVReader(new FileReader(file));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
+ nextLine[2]);
}
}
}
The solution with opencsv fails on the last line where there is only one double quote enclosed in single quotes:
someStuff | hello | someStuff
someStuff | hello, SO | someStuff
someStuff | 'say "hello, world"' | someStuff
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
If you truly cannot use a real CSV parser you could use a regex. This is generally not a good idea as there are always edge cases that you cannot handle but if the formatting is strictly as you describe then this may work.
public void test() {
String[] tests = {"numeStuff,\"hello\", someStuff, someStuff",
"numeStuff,\"hello, SO\", someStuff, someStuff",
"numeStuff,'say \"hello, world\"', someStuff, someStuff"
};
/* Matches a field followed by a separator (or the end of the input).
*
* ( - Field group
* \" - Start with a double quote
* [^\"]*? - Non-greedy match on anything that is not a double quote
* \" - End with a double quote
* | - Or
* ' - Start with a single quote
* [^']*? - Non-greedy match on anything that is not a single quote
* ' - End with a single quote
* | - Or
* [^\"'] - Not starting with a double or single quote
* [^,\r\n]*? - Non-greedy match on anything that is not a comma or line break
* ) - End field group
* ( - Separator group
* [,\r\n]|$ - Comma, line break, or end of input
* ) - End separator group
*/
Pattern p = Pattern.compile("(\"[^\"]*?\"|'[^\']*?\'|[^\"'][^,\r\n]*?)([,\r\n]|$)");
for (String t : tests) {
System.out.println("Matching: " + t);
Matcher m = p.matcher(t);
while (m.find()) {
System.out.println(m.group(1));
}
}
}
It does not appear that opencsv supports this out of the box. You could extend com.opencsv.CSVParser and implement your own algorithm for handling two types of quotes. This is the source of the method you would be changing and here is a stub to get you started.
class MyCSVParser extends CSVParser{
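@Override
// An override cannot narrow visibility, so this assumes parseLine is protected in your opencsv version.
protected String[] parseLine(String nextLine, boolean multi) throws IOException{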
//Your algorithm here
}
}
Basically you only need to track ," and ,' (trimming what's in the middle).
When you encounter one of those, set the appropriate flag (eg. singleQuoteOpen, doubleQuoteOpen) to true to indicate they're open and you are in ignore-commas mode.
When you meet the appropriate closing quote, reset the flag and keep slicing the elements.
To perform the check, stop at every comma (when not in ignore-commas mode) and look at the next char (if any, and trimming).
Note: the regex solution is good and also shorter, but less customizable for edge cases (at least without big headaches).
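The flag-based approach described above could be sketched roughly as follows. This is not opencsv code; QuoteAwareSplitter is a hypothetical standalone class. It assumes that a quote only opens a quoted section when it is the first non-blank character of a field, and it trims every field:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of opencsv or the original code.
final class QuoteAwareSplitter {
    /** Splits a line on commas that are outside single- or double-quoted sections. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char openQuote = 0; // 0 means "not inside quotes"; otherwise the opening quote character
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (openQuote != 0) {
                if (c == openQuote) {
                    openQuote = 0;      // matching closing quote: leave ignore-commas mode
                } else {
                    current.append(c);  // commas and the other quote type are plain text here
                }
            } else if ((c == '"' || c == '\'') && current.toString().trim().isEmpty()) {
                openQuote = c;          // quote at the start of a field: enter ignore-commas mode
                current.setLength(0);
            } else if (c == ',') {
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim());
        return fields;
    }
}
```

Because only one variable tracks the open quote, a double quote inside a single-quoted field (as in the last sample line) is treated as ordinary text. Escaped quotes and quoted fields spanning several lines are not handled.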
If the use of single and double quotes is consistent per line, one could choose the corresponding type of quote for each line:
public class CSVDemo {
public static void main(String[] args) throws IOException {
CSVDemo demo = new CSVDemo();
demo.process("input.csv");
}
public void process(String fileName) throws IOException {
String file = this.getClass().getClassLoader().getResource(fileName)
.getFile();
CSVParser doubleParser = new CSVParser(',', '"');
CSVParser singleParser = new CSVParser(',', '\'');
String[] nextLine;
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line;
while ((line = br.readLine()) != null) {
if (line.contains(",'") && line.contains("',")) {
nextLine = singleParser.parseLine(line);
} else {
nextLine = doubleParser.parseLine(line);
}
System.out.println(nextLine[0] + " | " + nextLine[1] + " | "
+ nextLine[2]);
}
}
}
}
It doesn't seem that opencsv supports this. However, have a look at this previous question and my answer, as well as the other answers, in case they help you: https://stackoverflow.com/a/15905916/1688441
Below is an example. Please note that notInsideComma actually means "not inside quotes". The following code could be extended to check for both single and double quotes.
public static ArrayList<String> customSplitSpecific(String s)
{
ArrayList<String> words = new ArrayList<String>();
boolean notInsideComma = true;
int start =0, end=0;
// Iterate over every character; stopping at length()-1 would miss a trailing comma.
for(int i=0; i<s.length(); i++)
{
if(s.charAt(i)==',' && notInsideComma)
{
words.add(s.substring(start,i));
start = i+1;
}
else if(s.charAt(i)=='"')
notInsideComma=!notInsideComma;
}
words.add(s.substring(start));
return words;
}

regular expressions escape on character

I have to separate a big list of emails and names, I have to split on commas but some names have commas in them so I have to deal with that first. Luckily the names are between "quotes".
At the moment I get with my regex output like this for example (edit: it doesn't display emails in the forum I see!):
"Talboom, Esther"
"Wolde, Jos van der"
"Debbie Derksen" <deberken#casema.nl>, corine <corine5#xs4all.nl>, "
The last one went wrong because the name had no comma, so the match continues until it finds one, and that was the comma I want to use as the separator. So I want it to look only until it finds '<'.
How can I do that?
import java.util.regex.Pattern;
import java.util.regex.Matcher;
String test = "\"Talboom, Esther\" <E.Talboom#wegener.nl>, \"Wolde, Jos van der\" <J.vdWolde#wegener.nl>, \"Debbie Derksen\" <deberken#casema.nl>, corine <corine5#xs4all.nl>, \"Markies Aart\" <A.Markies#wegenernieuwsmedia.nl>";
Pattern pattern = Pattern.compile("\".*?,.*?\"");
Matcher matcher = pattern.matcher(test);
boolean found = false;
while (matcher.find ()) {
System.out.println(matcher.group());
}
edit:
better line to work with since not all have a name or quotes:
String test = "\"Talboom, Esther\" <E.Talboom#wegener.nl>, DRP - Wouter Haan <wouter#drp.eu>, \"Wolde, Jos van der\" <J.vdWolde#wegener.nl>, \"Debbie Derksen\" <deberken#casema.nl>, corine <corine5#xs4all.nl>, clankilllller#gmail.com, \"Markies Aart\" <A.Markies#wegenernieuwsmedia.nl>";
I would simplify the code by using String.split and String.replaceAll. This avoids the hassle of working with a Pattern and makes the code neat and brief.
Try this:
public static void main(String[] args) {
String test = "\"Talboom, Esther\" <E.Talboom#wegener.nl>, \"Wolde, Jos van der\" <J.vdWolde#wegener.nl>, \"Debbie Derksen\" <deberken#casema.nl>, corine <corine5#xs4all.nl>, \"Markies Aart\" <A.Markies#wegenernieuwsmedia.nl>";
// Split up into each person's details
String[] nameEmailPairs = test.split(",\\s*(?=\")");
for (String nameEmailPair : nameEmailPairs) {
// Extract exactly the parts you need from the person's details
String name = nameEmailPair.replaceAll("\"([^\"]+)\".*", "$1");
String email = nameEmailPair.replaceAll(".*<([^>]+).*", "$1");
System.out.println(name + " = " + email);
}
}
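Output. Note that the unquoted entry (corine) is not split off, so it is merged into the preceding pair and Debbie Derksen is shown with corine's address: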
Talboom, Esther = E.Talboom#wegener.nl
Wolde, Jos van der = J.vdWolde#wegener.nl
Debbie Derksen = corine5#xs4all.nl
Markies Aart = A.Markies#wegenernieuwsmedia.nl
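To keep unquoted names and bare addresses as separate entries (as in the longer test line from the edit), one option is to split on commas that lie outside double quotes and then take each entry apart. A sketch only, assuming quotes are balanced and addresses never contain commas; AddressListSplitter and parse are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code.
final class AddressListSplitter {
    // A comma is outside quotes when an even number of double quotes follows it.
    private static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    /** Returns one {name, email} pair per entry; name is "" for a bare address. */
    static List<String[]> parse(String list) {
        List<String[]> result = new ArrayList<>();
        for (String raw : list.split(COMMA_OUTSIDE_QUOTES)) {
            String entry = raw.trim();
            if (entry.isEmpty()) {
                continue;
            }
            int lt = entry.indexOf('<');
            int gt = entry.indexOf('>', lt + 1);
            if (lt >= 0 && gt > lt) {
                // Name is everything before '<', without the surrounding quotes.
                String name = entry.substring(0, lt).trim().replace("\"", "");
                result.add(new String[] {name, entry.substring(lt + 1, gt).trim()});
            } else {
                // No angle brackets: treat the whole entry as the address.
                result.add(new String[] {"", entry});
            }
        }
        return result;
    }
}
```

With the longer test line this yields seven entries, with Debbie Derksen and corine each paired with their own address. The lookahead rescans the rest of the string for every comma, which is fine for lists of this size but slow for very long input.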
