Java Matcher start in respect to line - java

I'm trying to write a program that checks whether any of a list of keys is present in a long text.
I read the text in batches of 1000 lines and feed each batch to the matcher as a single String, with the lines separated by the \n newline character.
When a key matches, I call match.start() to get the position of the match. But it returns the position within the whole string, not relative to the start of the line the match is on.
Here's text example:
The Project Gutenberg EBook of The Adventures of Sherlock Holmes
by Sir Arthur Conan Doyle
(#15 in our series by Sir Arthur Conan Doyle)
Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.
I consume it using this method:
public String readLinesBatch(int startLine, int step, String file) {
try (Stream<String> lines = Files.lines(Paths.get(file))) {
return lines.skip(startLine)
.limit(step).collect(Collectors.joining(System.lineSeparator()));
} catch (IOException e) {
log.error("Exception while reading lines: {}", e.getMessage());
}
return "";
}
After that I feed it to the Matcher method:
public List<OffsetResult> matchV1(String source, Integer line) {
List<OffsetResult> result = new ArrayList<>();
Matcher match = Pattern.compile(String.join("|", keys))
.matcher(source);
while (match.find()) {
int offsetStart = match.start();
getLine(source, match.start());
result.add(new OffsetResult(match.group(), line, offsetStart));
}
return result;
}
If Arthur is one of my keys, the result I get is:
Arthur=[charOffset=72]
But I need it to be 7, because the word Arthur occurs on line 2 at position 7.
I googled and found nothing regarding this issue.
Does anyone have some ideas?
Thanks in advance!
UPD: my OffsetResult class
public class OffsetResult {
private String key;
private Integer lineOffset;
private Integer charOffset;
}

You can either split the string into lines, find the position in each of them, and take e.g. the first match:
Arrays.stream(input.split(String.format("%n")))
.map(s -> someMethodReturningPositionOrNull(s))
.filter(s -> s != null)
.findFirst()
.orElseGet(someDefaultValueOrNull)
or extend your regular expression to capture the last newline character before the searched string (you then also need to handle a match that occurs before the first newline character).
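A third option, sketched below, keeps the single joined string and converts each match.start() into a line number and a column by looking for the nearest preceding '\n'. The class name LineOffsets, the method locate and the int[] {line, column} result are made up for this illustration and are not part of the question's code; the sketch also assumes the keys are literal words, so it quotes them with Pattern.quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LineOffsets {

    /**
     * Returns one int[] {line, column} per match. "line" counts from firstLine,
     * "column" is the 0-based offset from the start of that line.
     */
    static List<int[]> locate(String source, List<String> keys, int firstLine) {
        // Assumption: keys are plain words, so quote them before joining with '|'
        String alternation = keys.stream().map(Pattern::quote).collect(Collectors.joining("|"));
        Matcher m = Pattern.compile(alternation).matcher(source);

        List<int[]> result = new ArrayList<>();
        int line = firstLine;
        int scanned = 0; // newlines before this index have already been counted
        while (m.find()) {
            // Count the newlines between the previous match and this one
            for (int i = source.indexOf('\n', scanned); i != -1 && i < m.start();
                 i = source.indexOf('\n', i + 1)) {
                line++;
            }
            scanned = m.start();
            // The line starts right after the last '\n' before the match (or at 0)
            int lineStart = source.lastIndexOf('\n', m.start() - 1) + 1;
            result.add(new int[] {line, m.start() - lineStart});
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The Project Gutenberg EBook of The Adventures of Sherlock Holmes\n"
                + "by Sir Arthur Conan Doyle";
        for (int[] hit : locate(text, List.of("Arthur"), 1)) {
            System.out.println("line " + hit[0] + ", column " + hit[1]);
        }
    }
}
```

Because only '\n' is searched for, the column is still correct when the batch was joined with "\r\n", since '\n' is the last character of that separator.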

Related

Java string problem: replacing specific part of a string

A method replacement replaces all names (from a given String a) that appear in [Name] or {Name} brackets: with telephone numbers for [] brackets, or with e-mails for {} brackets. The address book is represented by the array tel, whose elements are either "Tel Name telephoneNumber" or "Mail Name mail". For example, if the input is "You can contact jake via phone number [Jake] or via email {Jake}" and the tel elements are "Tel Jake +12345" and "Mail Jake jake#gmail.com", the output should be "You can contact jake via phone number +12345 or via email jake#gmail.com". If the given name does not exist in the address book, the string is left unchanged. The problem is that I replace substrings with replaceFirst, which replaces the first occurrence of the name rather than the specific occurrence I want to replace.
Maybe the shorter question would be: how do I replace a specific part of a string?
public static String replacement(String a, String[] tel) {
for (int i = 0; i<a.length()-1; i++) {
char c = a.charAt(i);
if (c=='[') {
int ind = a.indexOf(']', i);
String name = a.substring(i+1, ind);
for (int j=0; j<tel.length; j++) {
int ind1 = tel[j].indexOf(' ', 4);
String name1 = tel[j].substring(4, ind1);
String p = tel[j].substring(0,3);
String help = "Tel";
int temp = p.compareTo(help);
if (name.equals(name1) && temp==0) {
String telephone = tel[j].substring(ind1+1, tel[j].length());
a = a.replaceFirst(name, telephone);
}
}
}
if (c=='{') {
int ind = a.indexOf('}', i);
String name = a.substring(i+1, ind);
for (int j=0; j<tel.length; j++) {
int ind1 = tel[j].indexOf(' ', 5);
String name1 = tel[j].substring(5, ind1);
String p = tel[j].substring(0,4);
if (name.equals(name1) && p.compareTo("Mail")==0) {
String mail = tel[j].substring(ind1+1, tel[j].length());
a = a.replaceFirst(name, mail);
}
}
}
}
return a;
}
Main:
String a = "In NY you can contact peter via telephone number [Peter] or e-mail {Peter}. In London you can contact anna via telephone number [Anna] or e-mail {Anna}."
+ "In Chicago you can contact shawn via telephone number [Shawn] or e-mail {Shawn}";
String [] tel = {"Mail Peter peter#gmail.com", "Tel Anna +3456","Tel Shawn +1234", "Mail Shawn shawn#yahoo.com"};
String t = replacement(a,tel);
System.out.println(t);
Console:
In NY you can contact peter via telephone number [peter#gmail.com] or e-mail {peter#gmail.com}.
In London you can contact anna via telephone number [+3456] or e-mail {Anna}.In Chicago you can
contact shawn via telephone number [+1234] or e-mail {shawn#yahoo.com}
Instead of encoding the type of the data (email vs phone number) and the replacement key into strings, I would put the data into separate variables and use data structures like Map:
Map<String, String> tel = Map.of("Anna", "+3456", "Shawn", "+1234");
Map<String, String> mail = Map.of("Peter", "peter#gmail.com", "Shawn", "shawn#yahoo.com");
String t = replacement(a, tel, mail);
The replacement function could use a regular expression to find the substrings that match the key words you want to replace [something] and {something}. It would check which one it found, and add a replacement using the telephone or email it finds in the map data structure.
private static String replacement(String a, Map<String, String> tel, Map<String, String> mail) {
Pattern compile = Pattern.compile("\\{(.*?)\\}|\\[(.*?)\\]");
Matcher matcher = compile.matcher(a);
StringBuilder sb = new StringBuilder();
// Find substrings matching {something} and [something]
while (matcher.find()) {
String matched = matcher.group(0);
// Which was it, { or [ ?
if (matched.charAt(0) == '{') {
// Email. Replace from "mail"
String emailAddress = mail.getOrDefault(matcher.group(1), matched);
// quoteReplacement keeps any '$' or '\' in the value from being read as a group reference
matcher.appendReplacement(sb, Matcher.quoteReplacement(emailAddress));
} else if (matched.charAt(0) == '[') {
// Telephone. Replace from "tel"
String phoneNumber = tel.getOrDefault(matcher.group(2), matched);
matcher.appendReplacement(sb, Matcher.quoteReplacement(phoneNumber));
}
}
matcher.appendTail(sb);
return sb.toString();
}
Strings in a specified format are best handled with regular expressions. You define a pattern, and once you find a part that matches it, you can replace it or analyze it further.
It's best to write your code so that it is easily extensible. For example, if a new contact type is added (home address, fax, business phone number), it should be easy to handle in the code. Your solution makes that harder, because a whole new if branch is required and it's easy to make a mistake; it also makes the code less readable.
When dealing with a kind of dictionary (like your input String array), it's worth using a Map, as it makes lookups faster and the code more readable. When constant values are present, it's worth defining them too, as constants or enum values. Also, Java (JDK 8+) supports more readable, functional-style code in place of nested for-each loops, and it's worth using those features.
Please find the code snippet below, and a whole project with tests comparing your solution to mine on GitHub; you can view it there or clone the repository and verify the code yourself:
// we can simply add new contact types and their matchers using the constant below
private static final Map<Pattern, ContactType> CONTACT_PATTERNS = Map.of(
Pattern.compile("\\[(\\S+)]"), ContactType.TEL,
Pattern.compile("\\{(\\S+)}"), ContactType.MAIL
);
@Override
public String replace(String input, String[] dictionary) {
// we're mapping the dictionary to make it easier to use and more readable (also in debugging)
Map<ContactType, Map<String, String>> contactTypeToNameToValue =
Arrays.stream(dictionary)
.map(entry -> entry.split(" ")) // dictionary entry is split by ' ' character
.collect(groupingBy(entry -> ContactType.fromString(entry[0]), // first split part is the contact type
toMap(entry -> entry[1], // second part is the person's name
entry -> entry[2]))); // third part is the contact value
String output = input;
for (Map.Entry<Pattern, ContactType> entry : CONTACT_PATTERNS.entrySet()) {
Pattern pattern = entry.getKey();
ContactType contactType = entry.getValue();
output = pattern.matcher(output)
.replaceAll(matchResult -> {
String name = matchResult.group(1);
// we search our dictionary and get value from it or get the original value if nothing matches given name
return Optional.ofNullable(contactTypeToNameToValue.get(contactType))
.map(nameToValue -> nameToValue.get(name))
.orElseGet(matchResult::group);
});
}
return output;
}
public enum ContactType {
TEL,
MAIL;
private static ContactType fromString(String value) {
return Arrays.stream(values())
.filter(enumValue -> enumValue.name().equalsIgnoreCase(value))
.findFirst()
.orElseThrow(RuntimeException::new);
}
}

Remove elements from Date Format String using a Regular Expression

I want to remove elements from a supplied date format String, for example converting the format "dd/MM/yyyy" to "MM/yyyy" by removing any non-M/y element.
What I'm trying to do is create a localised month/year format based on the existing day/month/year format provided for the Locale.
I've done this using regular expressions, but the solution seems longer than I'd expect.
An example is below:
public static void main(final String[] args) {
System.out.println(filterDateFormat("dd/MM/yyyy HH:mm:ss", 'M', 'y'));
System.out.println(filterDateFormat("MM/yyyy/dd", 'M', 'y'));
System.out.println(filterDateFormat("yyyy-MMM-dd", 'M', 'y'));
}
/**
* Retains only {@code charsToRetain} in {@code format}, removing all other pattern
* letters along with any redundant separators.
*/
private static String filterDateFormat(final String format, final char...charsToRetain) {
// Match e.g. "ddd-"
final Pattern pattern = Pattern.compile("[" + new String(charsToRetain) + "]+\\p{Punct}?");
final Matcher matcher = pattern.matcher(format);
final StringBuilder builder = new StringBuilder();
while (matcher.find()) {
// Append each match
builder.append(matcher.group());
}
// If the last match is "mmm-", remove the trailing punctuation symbol
return builder.toString().replaceFirst("\\p{Punct}$", "");
}
Let's try a solution for the following date format strings:
String[] formatStrings = { "dd/MM/yyyy HH:mm:ss",
"MM/yyyy/dd",
"yyyy-MMM-dd",
"MM/yy - yy/dd",
"yyabbadabbadooMM" };
The following will analyze strings for a match, then print the first group of the match.
Pattern p = Pattern.compile(REGEX);
for(String formatStr : formatStrings) {
Matcher m = p.matcher(formatStr);
if(m.matches()) {
System.out.println(m.group(1));
}
else {
System.out.println("Didn't match!");
}
}
Now, there are two separate regular expressions I've tried. First:
final String REGEX = "(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)";
With program output:
MM/yyyy
MM/yyyy
yyyy-MMM
Didn't match!
Didn't match!
Second:
final String REGEX = "(?:[^My]*)((?:[My]+[^\\w]*)+[My]+)(?:[^My]*)";
With program output:
MM/yyyy
MM/yyyy
yyyy-MMM
MM/yy - yy
Didn't match!
Now, let's see what the first regex actually matches to:
(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*) First regex =
(?:[^My]*) Any amount of non-Ms and non-ys (non-capturing)
([My]+ followed by one or more Ms and ys
[^\\w]* optionally separated by non-word characters
(implying they are also not Ms or ys)
[My]+) followed by one or more Ms and ys
(?:[^My]*) finished by any number of non-Ms and non-ys
(non-capturing)
What this means is that at least 2 M/ys are required to match the regex, although you should be careful that something like MM-dd or yy-DD will match as well, because they have two M-or-y regions 1 character long. You can avoid getting into trouble here by just keeping a sanity check on your date format string, such as:
if(formatStr.contains("y") && formatStr.contains("M") && m.matches())
{
String yMString = m.group(1);
... // other logic
}
As for the second regex, here's what it means:
(?:[^My]*)((?:[My]+[^\\w]*)+[My]+)(?:[^My]*) Second regex =
(?:[^My]*) Any amount of non-Ms and non-ys
(non-capturing)
( ) followed by
(?:[My]+ )+[My]+ at least two text segments consisting of
one or more Ms or ys, where each segment is
[^\\w]* optionally separated by non-word characters
(?:[^My]*) finished by any number of non-Ms and non-ys
(non-capturing)
This regex will match a slightly broader series of strings, but it still requires that any separations between Ms and ys be non-words ([^a-zA-Z_0-9]). Additionally, keep in mind that this regex will still match "yy", "MM", or similar strings like "yyy", "yyyy"..., so it would be useful to have a sanity check as described for the previous regular expression.
Additionally, here's a quick example of how one might use the above to manipulate a single date format string:
LocalDateTime date = LocalDateTime.now();
String dateFormatString = "dd/MM/yyyy H:m:s";
System.out.println("Old Format: \"" + dateFormatString + "\" = " +
date.format(DateTimeFormatter.ofPattern(dateFormatString)));
Pattern p = Pattern.compile("(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)");
Matcher m = p.matcher(dateFormatString);
if(dateFormatString.contains("y") && dateFormatString.contains("M") && m.matches())
{
dateFormatString = m.group(1);
System.out.println("New Format: \"" + dateFormatString + "\" = " +
date.format(DateTimeFormatter.ofPattern(dateFormatString)));
}
else
{
throw new IllegalArgumentException("Couldn't shorten date format string!");
}
Output:
Old Format: "dd/MM/yyyy H:m:s" = 14/08/2019 16:55:45
New Format: "MM/yyyy" = 08/2019
I'll answer based on my understanding of the question: how do I remove, from a list/table/array of Strings, the elements that do not exactly follow the pattern 'dd/MM'?
So I'm looking for a function that looks like:
public List<String> removeUnWantedDateFormat(List<String> input)
As far as I know date formats, there are only 4 possibilities you would want (I hope I'm not missing any): "MM/yyyy", "MMM/yyyy", "MM/yy" and "MMM/yy". Since we know what we are looking for, we can write a simple function.
public List<String> removeUnWantedDateFormat(List<String> input) {
// The month/year formats we accept
List<String> accepted = Arrays.asList("MM/yyyy", "MMM/yyyy", "MM/yy", "MMM/yy");
// removeIf avoids the ConcurrentModificationException that calling
// input.remove(...) inside a for-each loop would throw (input must be mutable)
input.removeIf(format -> !accepted.contains(format));
return input;
}
It's better to avoid regex when a plain string comparison will do, since regex is comparatively expensive. A great improvement would be to use an enum of the date formats you accept; that way you have better control over them and can even replace them.
Hope this helps, cheers.
Edit: after seeing the comment, I think it would be better to use contains instead of equals and, instead of removing entries, to replace each input with the expected format string.
So it would look more like:
public List<String> removeUnWantedDateFormat(List<String> input) {
// Longer patterns come first so "MMM/yyyy" is not mistaken for "MM/yyyy" or "MM/yy"
List<String> comparaisons = Arrays.asList("MMM/yyyy", "MMM/yy", "MM/yyyy", "MM/yy");
List<String> result = new ArrayList<>();
for (String format : input) {
for (String comparaison : comparaisons) {
if (format.contains(comparaison)) {
// Assigning to the loop variable would not change the list, so collect the results
result.add(comparaison);
break;
}
}
}
return result;
}

SwiftMessage Regular expression

I have the below message:
{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}
I want it to be converted as shown below, with whitespace inserted in block 4, which should become:
{4: :20:TEST000001 :23B:CRED :32A:141117EUR0,1 :33B:EUR1000,00 :50A:ANZBAU30 :59:ANZBAU30 :71A:SHA -}
{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4: :20:TEST000001 :23B:CRED :32A:141117EUR0,1 :33B:EUR1000,00 :50A:ANZBAU30 :59:ANZBAU30 :71A:SHA -}{5:{CHK:1DBBF1D81EE1}{TNG:}}
I tried to extract the blocks using groups and then apply a regular expression, but I was unsuccessful. I am unable to find the error I am making.
public static void StringReplace() {
String data = "{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}";
Pattern pat = Pattern.compile("(({1:\\w+})({2:\\w+})({4::\\d+:\\w+:\\d+.:\\w+:\\d+.:\\d+\\w+,\\d:\\d+.:\\w+,\\d+:\\d+.:\\w+:\\d+:\\w+:\\d+.:\\w+-})({5:{\\w+:.\\w+}{\\w+.}}))");
Matcher m = pat.matcher(data);
if(m.matches()) {
System.out.println(m.group(0));
}
}
Thanks in advance
You have only matched the string and printed it; you haven't added any logic to insert the spaces. You need to add logic that introduces the spaces in block 4.
Looking at the expected output for block 4, you can first capture block 4 using this regex,
(.*)(\\{4.*?\\})(.*)
and then replace each colon with a space followed by a colon ( :) in the content of group 2, which is what you call block 4. I see that you are not inserting a space before every colon, only before a colon that is followed by 2-3 characters and then another colon. I have implemented that logic in my replaceAll() call.
Here is the modified java code,
public static void StringReplace() {
String data = "{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00:50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}";
Pattern pat = Pattern.compile("(.*)(\\{4.*?\\})(.*)");
Matcher m = pat.matcher(data);
if (m.find()) {
String g1 = m.group(1);
String g2 = m.group(2).replaceAll(":(?=\\w{2,3}:)", " :");
String g3 = m.group(3);
System.out.println(g1 + g2 + g3);
} else {
System.out.println("Didn't match");
}
}
This prints the following output, which matches what you expect except that no space is inserted before the trailing - in block 4:
{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}{4: :20:TEST000001 :23B:CRED :32A:141117EUR0,1 :33B:EUR1000,00 :50A:ANZBAU30 :59:ANZBAU30 :71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}
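If the space before the closing - of block 4 is also wanted, as in the expected output in the question, one possible variation is to add a second alternative to the replacement regex. This is only a sketch: the class name Block4Spacer and the method spaceBlock4 are invented for the example, and it assumes the message contains a single block 4 that ends with -}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Block4Spacer {

    // Block 4 runs from "{4:" to the first closing brace
    private static final Pattern BLOCK4 = Pattern.compile("\\{4:.*?\\}");

    static String spaceBlock4(String message) {
        Matcher m = BLOCK4.matcher(message);
        if (!m.find()) {
            return message; // no block 4, nothing to do
        }
        // Insert a space before a colon that starts a 2-3 character tag (":20:", ":23B:"),
        // and before the '-' that sits directly in front of the closing brace.
        String spaced = m.group().replaceAll(":(?=\\w{2,3}:)|-(?=\\}$)", " $0");
        return message.substring(0, m.start()) + spaced + message.substring(m.end());
    }

    public static void main(String[] args) {
        String data = "{1:F01ANZBDEF0AXXX0509036846}{2:I103ANZBDEF0XXXXN}"
                + "{4::20:TEST000001:23B:CRED:32A:141117EUR0,1:33B:EUR1000,00"
                + ":50A:ANZBAU30:59:ANZBAU30:71A:SHA-}{5:{CHK:1DBBF1D81EE1}{TNG:}}";
        System.out.println(spaceBlock4(data));
    }
}
```

Cutting block 4 out with m.start() and m.end() leaves the other blocks untouched, so the colons inside block 5 are never considered.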

How to get full sentence using regex in java

At the moment I'm parsing PDFs using PDFBox; later I will be parsing other documents (.docx/.doc). Using PDFBox, I get the whole file content as one string. Now I want to get the complete sentence wherever a user-defined word matches.
For example:
... some text here..
Raman took more than 12 year to complete his schooling and now he
is pursuing higher study.
Relational Database.
... some text here ..
If the user gives the input year, it should return the whole sentence.
Expected Output:
Raman took more than 12 year to complete his schooling and now he
is pursuing higher study.
I'm trying the code below, but it shows nothing. Can anyone correct it?
Pattern pattern = Pattern.compile("[\\w|\\W]*+[YEAR]+[\\w]*+.");
Also, if I have to match multiple words as an OR condition, what should I change in my regex?
Please note all words are in uppercase.
Do not try to put everything into the single regexp. There's a standard Java class java.text.BreakIterator which can be used to find the sentence boundaries.
public static String getSentence(String input, String word) {
Matcher matcher = Pattern.compile(word, Pattern.LITERAL | Pattern.CASE_INSENSITIVE)
.matcher(input);
if(matcher.find()) {
BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
br.setText(input);
int start = br.preceding(matcher.start());
int end = br.following(matcher.end());
return input.substring(start, end);
}
return null;
}
Usage:
public static void main(String[] args) {
String input = "... some text...\n Raman took more than 12 year to complete his schooling and now he\nis pursuing higher study. Relational Database. \n... some text...";
System.out.println(getSentence(input, "YEAR"));
}
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$) [^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(result);
while (reMatcher.find()) {
System.out.println(reMatcher.group());
}
A small fix to @Tagir Valeev's answer to prevent index-out-of-bounds exceptions.
private String getSentence(String input, String word) {
Matcher matcher = Pattern.compile(word , Pattern.LITERAL | Pattern.CASE_INSENSITIVE)
.matcher(input);
if(matcher.find()) {
BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
br.setText(input);
int start = br.preceding(matcher.start());
int end = br.following(matcher.end());
if(start == BreakIterator.DONE) {
start = 0;
}
if(end == BreakIterator.DONE) {
end = input.length();
}
return input.substring(start, end);
}
return null;
}
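For the second part of the question (several words combined as an OR condition), the same BreakIterator idea can be extended by joining the words with |. The sketch below is one way to do it, not the approach of either answer above: SentenceFinder and sentencesContaining are names invented here, and the sketch assumes the words are literal text, so each one is quoted with Pattern.quote. It asks for the boundary following the match and then steps back with previous(), so both boundaries belong to the sentence that contains the match.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SentenceFinder {

    /** Returns every sentence that contains at least one of the words (case-insensitive). */
    static List<String> sentencesContaining(String input, List<String> words) {
        List<String> sentences = new ArrayList<>();
        if (words.isEmpty()) {
            return sentences;
        }
        // word1|word2|..., each quoted so regex metacharacters are taken literally
        String alternation = words.stream().map(Pattern::quote).collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(alternation, Pattern.CASE_INSENSITIVE).matcher(input);

        BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        br.setText(input);

        int lastEnd = 0; // end of the last sentence already collected
        while (matcher.find()) {
            if (matcher.start() < lastEnd) {
                continue; // this hit lies in a sentence we already returned
            }
            int end = br.following(matcher.start()); // first boundary after the hit
            int start = br.previous();               // boundary just before that one
            sentences.add(input.substring(start, end).trim());
            lastEnd = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        String input = "First one here. Raman took 12 year to finish. Relational Database.";
        sentencesContaining(input, List.of("YEAR", "DATABASE")).forEach(System.out::println);
    }
}
```

Where exactly the sentence boundaries fall on real PDF text (line breaks, abbreviations, headings without a full stop) is decided by BreakIterator, so the output is worth checking on actual documents.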

Regular Expression Statement

I've never been good with regex and I can't seem to get this...
I am trying to match statements along these lines (these are two lines in a text file I'm reading)
Lname Fname 12.35 1
Jones Bananaman 7.1 3
Currently I am using this for a while statement
reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")
But it doesn't enter the while statement.
The program reads the text file just fine when I remove the while.
The code segment is this:
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath);
while(reader.hasNext("\\w+ \\w+ \\d*\\.\\d{1,2} [0-5]")){
employeeInfo.add(new EmployeeFile(reader.next(), reader.next(), reader.nextDouble(), reader.nextInt(), new employeeRemove()));
}
for(EmployeeFile element: employeeInfo){
output.add(element);
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
Use the \s character class for the spaces between words:
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]"))
Update:
According to the javadoc for the Scanner class, by default it splits its input into tokens using whitespace as the delimiter. You can change the delimiter it uses with the useDelimiter(String pattern) method of Scanner.
private void initializeFileData(){
try {
Scanner reader = new Scanner(openedPath).useDelimiter("\\n");
...
while(reader.hasNext("\\w+\\s\\w+\\s\\d*\\.\\d{1,2}\\s[0-5]")){
...
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
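Since hasNext(String pattern) only tests the next token, and with the default delimiter that token is a single word such as Lname, a pattern that contains spaces can never match it. An alternative to changing the delimiter is to read whole lines and match each one against the pattern. The following is a minimal sketch of that idea; EmployeeLines and parse are hypothetical names, and the String[] rows stand in for the question's EmployeeFile objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmployeeLines {

    // last name, first name, decimal number, digit 0-5, separated by whitespace
    private static final Pattern LINE =
            Pattern.compile("(\\w+)\\s+(\\w+)\\s+(\\d*\\.\\d{1,2})\\s+([0-5])");

    /** Returns the four captured fields of every line that matches; other lines are skipped. */
    static List<String[]> parse(Scanner reader) {
        List<String[]> rows = new ArrayList<>();
        while (reader.hasNextLine()) {
            Matcher m = LINE.matcher(reader.nextLine().trim());
            if (m.matches()) { // the groups are only available after a successful match
                rows.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4)});
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Scanner reader = new Scanner("Lname Fname 12.35 1\nJones Bananaman 7.1 3\n");
        for (String[] row : parse(reader)) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

Lines that do not match are silently skipped here; whether to skip, log or stop on a malformed line is a choice the real program would have to make.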
From what I can see (And correct me if I'm wrong, because regex always seems to trick my brain :p), you're not handling the spaces correctly. You need to use \s, not just the standard ' ' character
EDIT: Sorry, \s. Someone else beat me to it :p
Actually
\w+
is going to catch [Lname, Fname, 12, 35, 1] for Lname Fname 12.35 1. So you can just store reader.nextLine() and then extract all regex matches from it. From there, you can abstract it a bit, for instance:
class EmployeeFile {
.....
public EmployeeFile(String firstName, String lastName,
Double firstDouble, int firstInt,
EmployeeRemove er){
.....
}
public EmployeeFile(String line) {
//TODO : extract all the required info from the string array
// instead of doing it while reading at the same time.
// Keep input parsing separate from input reading.
// Turn this into a string array using the regex pattern
// mentioned above
}
}
I created my own version, without the file handling and the last loop, which goes like this:
private static void initializeFileData() {
String[] testStrings = {"Lname Fname 12.35 1", "Jones Bananaman 7.1 3"};
Pattern myPattern = Pattern.compile("(\\w+)\\s+(\\w+)\\s+(\\d*\\.\\d{1,2})\\s+([0-5])");
for (String s : testStrings) {
Matcher myMatcher = myPattern.matcher(s);
if (myMatcher.matches()) { // groupCount() is always 4 here; group() needs a successful match first
String lastName = myMatcher.group(1);
String firstName = myMatcher.group(2);
double firstValue = Double.parseDouble(myMatcher.group(3) );
int secondValue = Integer.parseInt(myMatcher.group(4));
//employeeInfo.add(new EmployeeFile(lastName, firstName, firstValue, secondValue, new employeeRemove()));
}
}
}
Notice that I kept the backslash before the dot (you want a literal dot, not any character), matched the whitespace between fields with \s+, and added parentheses in order to create the groups.
I hope it helps.
