How to get a full sentence using regex in Java

For now I'm parsing PDFs using PDFBox; later I will be parsing other documents (.docx/.doc). Using PDFBox, I'm getting all of the file content into one string. Now I want to get the complete sentence wherever a user-defined word matches.
For example:
... some text here..
Raman took more than 12 year to complete his schooling and now he
is pursuing higher study.
Relational Database.
... some text here ..
If the user gives the input year, then it should return the whole sentence.
Expected Output:
Raman took more than 12 year to complete his schooling and now he
is pursuing higher study.
I'm trying the code below, but it returns nothing. Can anyone correct it?
Pattern pattern = Pattern.compile("[\\w|\\W]*+[YEAR]+[\\w]*+.");
Also, if I have to match any one of several words (an OR condition), what should I change in my regex?
Please note that all the words are in uppercase.

Do not try to put everything into a single regexp. There's a standard Java class, java.text.BreakIterator, which can be used to find the sentence boundaries.
public static String getSentence(String input, String word) {
Matcher matcher = Pattern.compile(word, Pattern.LITERAL | Pattern.CASE_INSENSITIVE)
.matcher(input);
if(matcher.find()) {
BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
br.setText(input);
int start = br.preceding(matcher.start());
int end = br.following(matcher.end());
return input.substring(start, end);
}
return null;
}
Usage:
public static void main(String[] args) {
String input = "... some text...\n Raman took more than 12 year to complete his schooling and now he\nis pursuing higher study. Relational Database. \n... some text...";
System.out.println(getSentence(input, "YEAR"));
}
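The question also asks how to match any one of several words. One possible sketch, not part of the answer above, is to quote each word, join them with | into a single alternation, and reuse BreakIterator for every hit. The class name SentenceFinder and the method findSentences are made up for this illustration, and the words are assumed to be non-empty strings.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SentenceFinder {

    // Returns every sentence of input that contains at least one of the words (case-insensitive).
    // Assumption: each word is a non-empty string.
    public static List<String> findSentences(String input, List<String> words) {
        List<String> sentences = new ArrayList<>();
        if (words.isEmpty()) {
            return sentences;
        }
        // Quote each word so it is matched literally, then OR the words together.
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher matcher = Pattern.compile(alternation, Pattern.CASE_INSENSITIVE).matcher(input);

        BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        br.setText(input);

        int searchFrom = 0;
        while (searchFrom < input.length() && matcher.find(searchFrom)) {
            // Last boundary at or before the match start, first boundary at or after its end.
            int start = br.preceding(matcher.start() + 1);
            int end = br.following(matcher.end() - 1);
            if (start == BreakIterator.DONE) {
                start = 0;
            }
            if (end == BreakIterator.DONE) {
                end = input.length();
            }
            sentences.add(input.substring(start, end).trim());
            searchFrom = end; // continue after this sentence so it is reported only once
        }
        return sentences;
    }

    public static void main(String[] args) {
        String input = "Some text here. Raman took more than 12 year to complete his schooling. "
                + "Relational Database. More text.";
        for (String sentence : findSentences(input, Arrays.asList("YEAR", "DATABASE"))) {
            System.out.println(sentence);
        }
    }
}
```

Since the matching is case-insensitive, the uppercase search words still find lowercase text. Each matching sentence is returned once, even if it contains several of the words.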

Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$) [^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher reMatcher = re.matcher(result);
while (reMatcher.find()) {
System.out.println(reMatcher.group());
}

A small fix to @Tagir Valeev's answer to prevent index-out-of-bounds exceptions.
private String getSentence(String input, String word) {
Matcher matcher = Pattern.compile(word , Pattern.LITERAL | Pattern.CASE_INSENSITIVE)
.matcher(input);
if(matcher.find()) {
BreakIterator br = BreakIterator.getSentenceInstance(Locale.ENGLISH);
br.setText(input);
int start = br.preceding(matcher.start());
int end = br.following(matcher.end());
if(start == BreakIterator.DONE) {
start = 0;
}
if(end == BreakIterator.DONE) {
end = input.length();
}
return input.substring(start, end);
}
return null;
}

Related

Remove elements from Date Format String using a Regular Expression

I want to remove elements from a supplied date format string, for example convert the format "dd/MM/yyyy" to "MM/yyyy" by removing any non-M/y element.
What I'm trying to do is create a localised month/year format based on the existing day/month/year format provided for the Locale.
I've done this using regular expressions, but the solution seems longer than I'd expect.
An example is below:
public static void main(final String[] args) {
System.out.println(filterDateFormat("dd/MM/yyyy HH:mm:ss", 'M', 'y'));
System.out.println(filterDateFormat("MM/yyyy/dd", 'M', 'y'));
System.out.println(filterDateFormat("yyyy-MMM-dd", 'M', 'y'));
}
/**
* Retains only {@code charsToRetain} in {@code format}, removing everything else,
* including any redundant separators.
*/
private static String filterDateFormat(final String format, final char...charsToRetain) {
// Match e.g. "ddd-"
final Pattern pattern = Pattern.compile("[" + new String(charsToRetain) + "]+\\p{Punct}?");
final Matcher matcher = pattern.matcher(format);
final StringBuilder builder = new StringBuilder();
while (matcher.find()) {
// Append each match
builder.append(matcher.group());
}
// If the last match is "mmm-", remove the trailing punctuation symbol
return builder.toString().replaceFirst("\\p{Punct}$", "");
}
Let's try a solution for the following date format strings:
String[] formatStrings = { "dd/MM/yyyy HH:mm:ss",
"MM/yyyy/dd",
"yyyy-MMM-dd",
"MM/yy - yy/dd",
"yyabbadabbadooMM" };
The following will analyze strings for a match, then print the first group of the match.
Pattern p = Pattern.compile(REGEX);
for(String formatStr : formatStrings) {
Matcher m = p.matcher(formatStr);
if(m.matches()) {
System.out.println(m.group(1));
}
else {
System.out.println("Didn't match!");
}
}
Now, there are two separate regular expressions I've tried. First:
final String REGEX = "(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)";
With program output:
MM/yyyy
MM/yyyy
yyyy-MMM
Didn't match!
Didn't match!
Second:
final String REGEX = "(?:[^My]*)((?:[My]+[^\\w]*)+[My]+)(?:[^My]*)";
With program output:
MM/yyyy
MM/yyyy
yyyy-MMM
MM/yy - yy
Didn't match!
Now, let's see what the first regex actually matches to:
(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*) First regex =
(?:[^My]*) Any amount of non-Ms and non-ys (non-capturing)
([My]+ followed by one or more Ms and ys
[^\\w]* optionally separated by non-word characters
(implying they are also not Ms or ys)
[My]+) followed by one or more Ms and ys
(?:[^My]*) finished by any number of non-Ms and non-ys
(non-capturing)
What this means is that at least 2 M/ys are required to match the regex, although you should be careful that something like MM-dd or yy-DD will match as well, because they have two M-or-y regions 1 character long. You can avoid getting into trouble here by just keeping a sanity check on your date format string, such as:
if(formatStr.contains("y") && formatStr.contains("M") && m.matches())
{
String yMString = m.group(1);
... // other logic
}
As for the second regex, here's what it means:
(?:[^My]*)((?:[My]+[^\\w]*)+[My]+)(?:[^My]*) Second regex =
(?:[^My]*) Any amount of non-Ms and non-ys
(non-capturing)
( ) followed by
(?:[My]+ )+[My]+ at least two text segments consisting of
one or more Ms or ys, where each segment is
[^\\w]* optionally separated by non-word characters
(?:[^My]*) finished by any number of non-Ms and non-ys
(non-capturing)
This regex will match a slightly broader series of strings, but it still requires that any separations between Ms and ys be non-words ([^a-zA-Z_0-9]). Additionally, keep in mind that this regex will still match "yy", "MM", or similar strings like "yyy", "yyyy"..., so it would be useful to have a sanity check as described for the previous regular expression.
Additionally, here's a quick example of how one might use the above to manipulate a single date format string:
LocalDateTime date = LocalDateTime.now();
String dateFormatString = "dd/MM/yyyy H:m:s";
System.out.println("Old Format: \"" + dateFormatString + "\" = " +
date.format(DateTimeFormatter.ofPattern(dateFormatString)));
Pattern p = Pattern.compile("(?:[^My]*)([My]+[^\\w]*[My]+)(?:[^My]*)");
Matcher m = p.matcher(dateFormatString);
if(dateFormatString.contains("y") && dateFormatString.contains("M") && m.matches())
{
dateFormatString = m.group(1);
System.out.println("New Format: \"" + dateFormatString + "\" = " +
date.format(DateTimeFormatter.ofPattern(dateFormatString)));
}
else
{
throw new IllegalArgumentException("Couldn't shorten date format string!");
}
Output:
Old Format: "dd/MM/yyyy H:m:s" = 14/08/2019 16:55:45
New Format: "MM/yyyy" = 08/2019
I'll try to answer based on my understanding of the question: how do I remove, from a list/table/array of Strings, the elements that do not exactly follow the pattern 'dd/MM'?
So I'm looking for a function that looks like this:
public List<String> removeUnWantedDateFormat(List<String> input)
As far as I know date formats, we can expect only 4 possibilities that you would want (I hope I haven't missed any): "MM/yyyy", "MMM/yyyy", "MM/yy" and "MMM/yy". Since we know what we are looking for, we can write a simple function.
public List<String> removeUnWantedDateFormat(List<String> input) {
    String s1 = "MM/yyyy";
    String s2 = "MMM/yyyy";
    String s3 = "MM/yy";
    String s4 = "MMM/yy";
    // removeIf avoids the ConcurrentModificationException that calling
    // input.remove() inside a for-each loop would throw.
    // The list passed in must be mutable.
    input.removeIf(format -> !s1.equals(format) && !s2.equals(format)
            && !s3.equals(format) && !s4.equals(format));
    return input;
}
It is better to avoid regex when a plain string comparison will do, since compiling and running patterns has a cost. A great improvement would be to use an enum of the date formats you accept; that way you have better control over them and can even replace them.
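The enum suggestion could be sketched roughly as follows. The names MonthYearFormats, MonthYearFormat and findIn are made up for this illustration, and the four accepted patterns are the ones listed in this answer.

```java
import java.util.Optional;

public class MonthYearFormats {

    // Hypothetical enum of the accepted month/year formats. Longer patterns are
    // declared first because "MMM/yyyy" also contains "MM/yy" as a substring.
    public enum MonthYearFormat {
        ABBREVIATED_MONTH_FULL_YEAR("MMM/yyyy"),
        ABBREVIATED_MONTH_SHORT_YEAR("MMM/yy"),
        NUMERIC_MONTH_FULL_YEAR("MM/yyyy"),
        NUMERIC_MONTH_SHORT_YEAR("MM/yy");

        public final String pattern;

        MonthYearFormat(String pattern) {
            this.pattern = pattern;
        }

        // Returns the first accepted format contained in dateFormat, if any.
        public static Optional<MonthYearFormat> findIn(String dateFormat) {
            for (MonthYearFormat candidate : values()) {
                if (dateFormat.contains(candidate.pattern)) {
                    return Optional.of(candidate);
                }
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(MonthYearFormat.findIn("dd/MM/yyyy HH:mm:ss").map(f -> f.pattern));
        System.out.println(MonthYearFormat.findIn("yyyy-MM-dd").map(f -> f.pattern));
    }
}
```

This only recognises formats where the month comes directly before the year with a slash in between; anything else yields an empty Optional.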
Hope this will help, cheers
Edit: after seeing the comment, I think it would be better to use contains instead of equals and, instead of removing the entry, to replace it with the expected string.
So it would look more like this:
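public List<String> removeUnWantedDateFormat(List<String> input) {
    // Longer patterns come first: "MMM/yyyy" also contains "MM/yy"
    List<String> comparaisons = new ArrayList<>();
    comparaisons.add("MMM/yyyy");
    comparaisons.add("MMM/yy");
    comparaisons.add("MM/yyyy");
    comparaisons.add("MM/yy");
    // Assigning to the loop variable would not change the list,
    // so the shortened formats are collected in a new list instead.
    List<String> result = new ArrayList<>();
    for (String format : input) {
        for (String comparaison : comparaisons) {
            if (format.contains(comparaison)) {
                result.add(comparaison);
                break;
            }
        }
    }
    return result;
}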

How to create a regex that accepts specific characters?

I have this regex:
^[a-zA-Z0-9_#.#$%&'*+-/=?^`{|}~!(),:;<>[-\]]{8,}$
I need a regex that accepts words with a minimum length of 8, made of letters (uppercase and lowercase), numbers and these characters:
!#$%&'*+-/=?^_`{|}~"(),:;<>#[]
It works when I tested it here.
This is how I used it in Java Android.
public static final String regex = "^[a-zA-Z0-9_#.#$%&'*+-/=?^`{|}~!(),:;<>[-\\]]{8,}$";
This is the error that I received.
java.util.regex.PatternSyntaxException: Missing closing bracket in character class near index 49
^[a-zA-Z0-9_#.#$%&'*+-/=?^`{|}~!(),:;<>[-\]]{8,}$
If you just want to test if a given input string matches your pattern, you may use String#matches directly, e.g.
String regex = "[a-zA-Z0-9_#.#$%&'*+-/=?^`{|}~!(),:;<>\\[\\]-]{8,}";
String input = "Jon#Skeet#123";
if (input.matches(regex)) {
System.out.println("Found a match");
}
else {
System.out.println("No match");
}
If you wanted to parse a larger input text and identify such matching words, then you would want to use a formal Pattern and Matcher. But, I don't see the need for this just based on your question.
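The reason the original pattern is rejected is that, in Java's regex syntax, an unescaped [ inside a character class opens a nested class, so the outer class is never closed. Escaping both brackets, as in the pattern above, avoids this. A minimal sketch of the difference follows; the class name CharClassDemo and the method isValid are made up for the example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CharClassDemo {

    // Same character set as in the answer, with [ and ] escaped and the literal - at the end.
    private static final Pattern VALID = Pattern.compile(
            "[a-zA-Z0-9_#.#$%&'*+-/=?^`{|}~!(),:;<>\\[\\]-]{8,}");

    public static boolean isValid(String input) {
        // matches() already requires the whole input to match, so ^ and $ are not needed.
        return VALID.matcher(input).matches();
    }

    public static void main(String[] args) {
        try {
            // The unescaped '[' starts a nested character class here.
            Pattern.compile("^[a-zA-Z0-9_#.#$%&'*+-/=?^`{|}~!(),:;<>[-\\]]{8,}$");
        } catch (PatternSyntaxException e) {
            System.out.println("Rejected: " + e.getDescription());
        }
        System.out.println(isValid("Jon#Skeet#123"));
        System.out.println(isValid("short"));
    }
}
```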
You can use the Pattern and Matcher classes; this may help you.
Follow this tutorial: https://www.mkyong.com/regular-expressions/how-to-validate-password-with-regular-expression/
Here is one example.
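try {
    // Requires a digit, a lowercase letter, an uppercase letter and one of @#$%, 6 to 20 characters in total
    final String PASSWORD_PATTERN = "((?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%]).{6,20})";
    Pattern pattern = Pattern.compile(PASSWORD_PATTERN);
    Matcher matcher = pattern.matcher(password_string);
    if (matcher.matches()) {
        Log.e("TAG", "TRUE");
    } else {
        Log.e("TAG", "FALSE");
    }
} catch (RuntimeException e) {
    // e.g. a PatternSyntaxException if the pattern is malformed
    Log.e("TAG", "Pattern check failed", e);
}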

Regex lazy solution for java?

I have a string "hooRayNexTcapItaLnextcapitall"
I want to capture the first instance of "next" (NexT - in this case)
My solution:
(.*)([nN][eE][xX][tT])([cC][aA][pP][iI][tT][aA][lL])(.*)
With my solution, the group returns next instead of NexT.
How can I correct my regex to capture the first next instead of capturing the last next?
Edit 1:
Let me put my question properly,
If the string contains any combination of upper and lower case letters that spell "NextCapital", reverse the characters of the word "Next". Case should be preserved. If "NextCapital" occurs multiple times, only update the first occurrence.
So, I am using group to capture. But my group is capturing the last occurrence of "nextCapital" instead of first occurrence.
Ex:
Input: hooRayNexTcapItaLnextcapitall
output: hooRayTxeNcapItaLnextcapitall
Edit 2:
Please correct my code.
My java code:
Pattern ptn = Pattern.compile("(.*)([nN][eE][xX][tT])([cC][aA][pP][iI][tT][aA][lL])(.*)");
//sb = hooRayNexTcapItaLnextcapitall
Matcher mtc = ptn.matcher(sb);
StringBuilder c = new StringBuilder();
if(mtc.find()){
StringBuilder d = new StringBuilder();
StringBuilder e = new StringBuilder();
d.append(mtc.group(1));
e.append(mtc.group(2));
e.reverse();
d.append(e);
d.append(mtc.group(3));
d.append(mtc.group(4));
sb = d;
}
Your regex actually works if you get group 2. Test it here! Your regex does not need to be that complicated.
Your regex can just be this:
next
If you use Matcher.find and turn on CASE_INSENSITIVE option, you can find the first substring of the string that matches the pattern. Then, use group() to get the actual string:
Matcher matcher = Pattern.compile("next", Pattern.CASE_INSENSITIVE).matcher("hooRayNexTcapItaLnextcapitall");
if (matcher.find()) {
System.out.println(matcher.group());
}
EDIT:
After seeing your requirements, I wrote this code:
String input = "hooRayNexTcapItaLnextcapitall";
Matcher m = Pattern.compile("next(?=capital)", Pattern.CASE_INSENSITIVE).matcher(input);
if (m.find()) {
StringBuilder outputBuilder = new StringBuilder(input);
StringBuilder reverseBuilder = new StringBuilder(input.substring(m.start(), m.end()));
outputBuilder.replace(m.start(), m.end(), reverseBuilder.reverse().toString());
System.out.println(outputBuilder);
}
I used a lookahead to match next only if there is capital after it. After a match is found, I created a string builder with the input, and another string builder with the matched portion of the input. Then, I replaced the matched range with the reverse of the second string builder.
String target = "next";
int index = line.toLowerCase().indexOf(target);
if (index != -1) {
line = line.substring(index, index + target.length());
System.out.println(line);
} else {
System.out.println("Not Found");
}
This would be my first attempt which allows room for adjusting the desired String to locate.
Otherwise, you may use this regex solution to achieve the same effect:
Pattern pattern = Pattern.compile("(?i)next");
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
System.out.println(matcher.group());
}
The pattern "(?i)next" finds the substring matching "next" ignoring case.
Edit : This would reverse the order of the first occurrence of next.
String input = "hooRayNexTcapItaLnextcapitall";
String target = "nextcapital";
int index = input.toLowerCase().indexOf(target);
if (index != -1) {
String first = input.substring(index, index + target.length());
first = new StringBuilder(first.substring(0, 4)).reverse().toString() + first.substring(4, first.length());
input = input.substring(0, index) + first + input.substring(index + target.length(), input.length());
}
Edit Again : Here is a "fixed" form of your code.
String input = "hooRayNexTcapItaLnextcapitall";
Pattern ptn = Pattern.compile("([nN][eE][xX][tT])([cC][aA][pP][iI][tT][aA][lL])");
Matcher mtc = ptn.matcher(input);
if(mtc.find()){
StringBuilder d = new StringBuilder(mtc.group(1));
StringBuilder e = new StringBuilder(mtc.group(2));
input = input.replaceFirst(d.toString() + e.toString(), d.reverse().toString() + e.toString());
System.out.println(input);
}
Your regex is grabbing the second potential match for your group due to the default greedy nature of regex. Effectively, the first (.*) is grabbing as much as it can while still satisfying the rest of your regex.
To get what you intend, you can add a question mark to the first group, making it (.*?). This will make it non-greedy, grabbing the smallest string possible while still satisfying the rest of your regex.
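A small sketch of that change, applied to the task from the question (reverse the first "next" that is followed by "capital"). The class name LazyMatchDemo and the method reverseFirstNext are made up here, and CASE_INSENSITIVE is used in place of the [nN][eE]... character classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyMatchDemo {

    // (.*?) is lazy, so group 1 stops at the FIRST place where the rest can match.
    private static final Pattern NEXT_CAPITAL = Pattern.compile(
            "(.*?)(next)(capital)(.*)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String reverseFirstNext(String input) {
        Matcher m = NEXT_CAPITAL.matcher(input);
        if (!m.matches()) {
            return input; // no "nextcapital" in any casing: leave the input unchanged
        }
        String reversed = new StringBuilder(m.group(2)).reverse().toString();
        return m.group(1) + reversed + m.group(3) + m.group(4);
    }

    public static void main(String[] args) {
        System.out.println(reverseFirstNext("hooRayNexTcapItaLnextcapitall"));
        // prints hooRayTxeNcapItaLnextcapitall
    }
}
```

With the greedy (.*) the same code would reverse the second occurrence instead, because group 1 would swallow everything up to the last possible match.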

Java: Find a specific pattern using Pattern and Matcher

This is the string that I have:
KLAS 282356Z 32010KT 10SM FEW090 10/M13 A2997 RMK AO2 SLP145 T01001128 10100 20072 51007
This is a weather report. I need to extract the following numbers from the report: 10/M13. It is temperature and dewpoint, where M means minus. So, the place in the String may differ and the temperature may be presented as M10/M13 or 10/13 or M10/13.
I have done the following code:
public String getTemperature (String metarIn){
Pattern regex = Pattern.compile(".*(\\d+)\\D+(\\d+)");
Matcher matcher = regex.matcher(metarIn);
if (matcher.matches() && matcher.groupCount() == 1) {
temperature = matcher.group(1);
System.out.println(temperature);
}
return temperature;
}
Obviously, the regex is wrong, since the method always returns null. I have tried tens of variations but to no avail. Thanks a lot if someone can help!
This will extract the String you seek, and it's only one line of code:
String tempAndDP = input.replaceAll(".*(?<![M\\d])(M?\\d+/M?\\d+).*", "$1");
Here's some test code:
public static void main(String[] args) throws Exception {
String input = "KLAS 282356Z 32010KT 10SM FEW090 M01/M13 A2997 RMK AO2 SLP145 T01001128 10100 20072 51007";
String tempAndDP = input.replaceAll(".*(?<![M\\d])(M?\\d+/M?\\d+).*", "$1");
System.out.println(tempAndDP);
}
Output:
M01/M13
The regex should look like:
M?\d+/M?\d+
For Java this will look like:
"M?\\d+/M?\\d+"
You might want to add a check for white space on the front and end:
"\\sM?\\d+/M?\\d+\\s"
But this depends on where you expect to find the pattern: it will not match if the pattern sits at the very start or end of the string. To cover that case, use:
"(^|\\s)M?\\d+/M?\\d+($|\\s)"
This specifies that where there is no whitespace before or after the pattern, the start or the end of the string is accepted instead.
Example code used to test:
Pattern p = Pattern.compile("(^|\\s)M?\\d+/M?\\d+($|\\s)");
String test = "gibberish M130/13 here";
Matcher m = p.matcher(test);
if (m.find())
System.out.println(m.group().trim());
This returns: M130/13
Try:
Pattern regex = Pattern.compile(".*\\sM?(\\d+)/M?(\\d+)\\s.*");
Matcher matcher = regex.matcher(metarIn);
if (matcher.matches() && matcher.groupCount() == 2) {
temperature = matcher.group(1);
System.out.println(temperature);
}
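Since M stands for minus, the two values can also be turned into signed numbers once the group is found. The following is only a sketch under that reading of the question; the class name MetarTemperature and the method parse are made up, and the lookarounds assume the temperature group is a whole whitespace-delimited token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetarTemperature {

    // A whole whitespace-delimited token of the form [M]digits/[M]digits.
    private static final Pattern TEMP_DEW =
            Pattern.compile("(?<!\\S)(M?)(\\d+)/(M?)(\\d+)(?!\\S)");

    // Returns {temperature, dewpoint}, or null when the report has no such token.
    public static int[] parse(String metar) {
        Matcher m = TEMP_DEW.matcher(metar);
        if (!m.find()) {
            return null;
        }
        int temperature = Integer.parseInt(m.group(2));
        int dewpoint = Integer.parseInt(m.group(4));
        if (!m.group(1).isEmpty()) {
            temperature = -temperature; // leading M means minus
        }
        if (!m.group(3).isEmpty()) {
            dewpoint = -dewpoint;
        }
        return new int[] { temperature, dewpoint };
    }

    public static void main(String[] args) {
        int[] values = parse("KLAS 282356Z 32010KT 10SM FEW090 10/M13 A2997 RMK AO2 SLP145");
        System.out.println(values[0] + " / " + values[1]); // prints 10 / -13
    }
}
```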
An alternative to regex.
Sometimes a regex is not the only solution. It seems that in your case you must get the 6th block of text, where the blocks are separated by a space character. So what you need to do is count the blocks.
Assuming the blocks of text do NOT have a fixed length
Example:
String s = "KLAS 282356Z 32010KT 10SM FEW090 10/M13 A2997 RMK AO2 SLP145 T01001128 10100 20072 51007";
int spaces = 5;
int begin = 0;
while(spaces-- > 0){
begin = s.indexOf(' ', begin)+1;
}
int end = s.indexOf(' ', begin+1);
String result = s.substring(begin, end);
System.out.println(result);
Assuming the blocks of text DO have a fixed length
String s = "KLAS 282356Z 32010KT 10SM FEW090 10/M13 A2997 RMK AO2 SLP145 T01001128 10100 20072 51007";
String result = s.substring(33, s.indexOf(' ', 33));
System.out.println(result);
A prettier alternative, as pointed out by Adrian:
String result = rawString.split(" ")[5];
Note that split actually receives a regex pattern as its parameter.

Java Pattern match

I have a long template from which I need to extract certain strings based on certain patterns. Going through some examples, I found that quantifiers are useful in such situations. For example, the following is my template, from which I need to extract the while and doWhile blocks.
This is a sample document.
$while($variable)This text can be repeated many times until do while is called.$endWhile.
Some sample text follows this.
$while($variable2)This text can be repeated many times until do while is called.$endWhile.
Some sample text.
I need to extract the whole text, starting from $while($variable) till $endWhile. I then need to process the value of $variable. After that I need to insert the text between $while and $endWhile to the original text.
I've the logic of extracting the variable. But I'm not sure how to use quantifiers or pattern match here.
Can someone please provide me a sample code for this? Any help will be greatly appreciated
You can use a rather simple regex-based solution here with a Matcher:
Pattern pattern = Pattern.compile("\\$while\\((.*?)\\)(.*?)\\$endWhile", Pattern.DOTALL);
Matcher matcher = pattern.matcher(yourString);
while(matcher.find()){
String variable = matcher.group(1); // this will include the $
String value = matcher.group(2);
// now do something with variable and value
}
If you want to replace the variables in the original text, you should use the Matcher.appendReplacement() / Matcher.appendTail() solution:
Pattern pattern = Pattern.compile("\\$while\\((.*?)\\)(.*?)\\$endWhile", Pattern.DOTALL);
Matcher matcher = pattern.matcher(yourString);
StringBuffer sb = new StringBuffer();
while(matcher.find()){
String variable = matcher.group(1); // this will include the $
String value = matcher.group(2);
// now do something with variable and value
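// quoteReplacement keeps any $ or \ in value from being read as a group reference
matcher.appendReplacement(sb, Matcher.quoteReplacement(value));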
}
matcher.appendTail(sb);
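Put together as a runnable sketch, with each block simply replaced by its own body (the actual processing of the variable is left out, since the question does not describe it). The class name WhileBlockDemo and the method unwrap are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhileBlockDemo {

    private static final Pattern WHILE_BLOCK =
            Pattern.compile("\\$while\\((.*?)\\)(.*?)\\$endWhile", Pattern.DOTALL);

    // Replaces each $while(...)...$endWhile block with its body and
    // records the variables that were seen, in order, in variablesOut.
    public static String unwrap(String template, List<String> variablesOut) {
        Matcher matcher = WHILE_BLOCK.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            variablesOut.add(matcher.group(1)); // includes the leading $
            // quoteReplacement keeps a literal $ or \ in the body from being
            // interpreted as a group reference by appendReplacement.
            matcher.appendReplacement(sb, Matcher.quoteReplacement(matcher.group(2)));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> variables = new ArrayList<>();
        String template = "This is a sample document.\n"
                + "$while($variable)This text can be repeated.$endWhile.\n"
                + "Some sample text.";
        System.out.println(unwrap(template, variables));
        System.out.println(variables);
    }
}
```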
public class PatternInString {
static String testcase1 = "what i meant here";
static String testcase2 = "here";
public static void main(String args[])throws StringIndexOutOfBoundsException{
PatternInString testInstance= new PatternInString();
boolean result = testInstance.occurs(testcase1,testcase2);
System.out.println(result);
}
//write your code here
public boolean occurs(String str1, String str2)throws StringIndexOutOfBoundsException
{ int i;
boolean result=false;
int num7=str1.indexOf(" ");
int num8=str1.lastIndexOf(" ");
String str6=str1.substring(num8+1);
String str5=str1.substring(0,num7);
if(str5.equals(str2))
{
result=true;
}
else if(str6.equals(str2))
{
result=true;
}
int num=-1;
try
{
for(i=0;i<str1.length()-1;i++)
{ num=num+1;
num=str1.indexOf(" ",num);
int num1=str1.indexOf(" ",num+1);
String str=str1.substring(num+1,num1);
if(str.equals(str2))
{
result=true;
break;
}
}
}
catch(Exception e)
{
}
return result;
}
}
