Regex for start and end of sentence - java

Is there a way to match the start and end of a sentence in Java? The easiest case is a sentence ending with a simple dot (.). In some other cases it could end with a colon (:) or with an abbreviation followed by a colon (.:).
For example some random news text:
Cliffs have collapsed in New Zealand during an earthquake in the city
of Christchurch on the South Island. No serious damage or fatalities
were reported in the Valentine's Day quake that struck at 13:13 local
time. Based on the med. report everybody were ok.
My goal is to get a word (possibly an abbreviation) plus its context, if possible only the sentence to which the word belongs.
So a successful output for me would be something like this:
selected word -> collapsed
context -> Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island.
selected word -> med.
context -> Based on the med. report everybody were ok.
Thanks

A sentence is easy to spot: it starts with a capital letter and ends with one of the characters .:!? followed either by a space and another capital letter or by the end of the whole string.
Compare "time. Based" (a sentence boundary) with "med. report" (not a boundary).
So the regex capturing the whole sentence should look like this:
([A-Z][a-z].*?[.:!?](?=$| [A-Z]))
Take a look! Regex101
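As a rough illustration, the pattern above could be applied in Java along these lines. The class name SentenceFinder and the helper sentenceContaining are made up for this sketch. It assumes the text is a single line and that every sentence starts with a capital letter followed by a lowercase one, as the pattern requires.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class SentenceFinder {
    // Same pattern as above, without the outer capturing group.
    private static final Pattern SENTENCE =
            Pattern.compile("[A-Z][a-z].*?[.:!?](?=$| [A-Z])");

    // Returns the first sentence that contains the given word, if any.
    static Optional<String> sentenceContaining(String text, String word) {
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            String sentence = m.group();
            if (sentence.contains(word)) {
                return Optional.of(sentence);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String text =
                "Cliffs have collapsed in New Zealand during an earthquake in the "
                + "city of Christchurch on the South Island. No serious damage or "
                + "fatalities were reported in the Valentine's Day quake that struck "
                + "at 13:13 local time. Based on the med. report everybody were ok.";
        System.out.println(sentenceContaining(text, "collapsed").orElse("no match"));
        System.out.println(sentenceContaining(text, "med.").orElse("no match"));
    }
}
```

Note that a sentence such as "A dog barked." would be skipped, because the pattern expects a lowercase letter right after the leading capital.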

What you are looking for is a natural language processing (NLP) toolkit. For Java you can use CoreNLP,
and there are already some example cases on its tutorials page.
You can certainly write a regex that matches all characters up to one of the terminating characters (. : ? etc.), and it would look something like this:
.*?(?=[\.\:])
Then you would have to loop through the matches and find the sentences that contain your words. But I recommend using an NLP toolkit for this.

The code:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main( String[] args ) {
final Map<String, String> dict = new HashMap<>();
dict.put( "med", "medical" );
final String text =
"Cliffs have collapsed in New Zealand during an earthquake in the "
+ "city of Christchurch on the South Island. No serious damage or "
+ "fatalities were reported in the Valentine's Day quake that struck "
+ "at 13:13 local time. Based on the med. report everybody were ok.";
final Pattern p = Pattern.compile( "[^\\.]+\\W+(\\w+)\\." );
final Matcher m = p.matcher( text );
int pos = 0;
while(( pos < text.length()) && m.find( pos )) {
pos = m.end() + 1;
final String word = m.group( 1 );
if( dict.containsKey( word )) {
final String repl = dict.get( word );
final String beginOfSentence = text.substring( m.start(), m.end());
final String endOfSentence;
if( m.find( pos )) {
endOfSentence = text.substring( m.start() - 1, m.end());
}
else {
endOfSentence = text.substring( m.start() - 1);
}
System.err.printf( "Replace '%s.' in '%s%s' with '%s'\n",
word, beginOfSentence, endOfSentence, repl );
final String sentence =
// literal replace: with replaceAll the '.' would act as a regex wildcard
( beginOfSentence + endOfSentence ).replace( word + ".", repl );
System.err.println( sentence );
}
}
}
}
The execution:
Replace 'med.' in 'Based on the med. report everybody were ok.' with 'medical'
Based on the medical report everybody were ok.

Related

How can I get non-matching groups using a Matcher in Java?

I'm trying to write a java regex to catch some groups of words from a String using a Matcher.
Say i got this string: "Hello, we are #happy# to see you today".
I would like to get 2 group of matches, one having
Hello, we are
to see you today
and the other
happy
So far, I was only able to match the word between the #s using this Pattern:
Pattern p = Pattern.compile("#(.+?)#");
I've read about negative lookahead and lookaround, played a bit with it but without success.
I assume I should do some sort of negation of the regex so far, but I couldn't come up with anything.
Any help would be really appreciated, thank you.
From comment:
I may incur in a string where I got more than one instances of words wrapped by #, such as "#Hello# kind #stranger#"
From comment:
I need to apply some different style format to both the text inside and outside.
Since you need to apply different styling, the code needs to process each block of text separately, and needs to know whether the text is inside or outside a #..# section.
Note, in the following code, it will silently skip the last #, if there is an odd number of them.
String input = ...
for (Matcher m = Pattern.compile("([^#]+)|#([^#]+)#").matcher(input); m.find(); ) {
if (m.start(1) != -1) {
String outsideText = m.group(1);
System.out.println("Outside: \"" + outsideText + "\"");
} else {
String insideText = m.group(2);
System.out.println("Inside: \"" + insideText + "\"");
}
}
Output for input = "Hello, we are #happy# to see you today"
Outside: "Hello, we are "
Inside: "happy"
Outside: " to see you today"
Output for input = "#Hello# kind #stranger#"
Inside: "Hello"
Outside: " kind "
Inside: "stranger"
Output for input = "This #text# has unpaired # characters"
Outside: "This "
Inside: "text"
Outside: " has unpaired "
Outside: " characters"
The best I could do is splitting in 3 groups, then merging the group 1 and 4 :
(^.*)(\#(.+?)\#)(.*)
Test it here
EDIT: Taking remarks from the comments :
(^[^\#]*)(?:\#(.+?)\#)([^\#]*)
Thanks to @Lino, we no longer capture the useless group containing the # characters, and the outer groups now capture anything except # instead of any non-whitespace character.
Test it here
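A minimal Java sketch of how the edited pattern might be used, merging the two outer groups as described. The class name HashSplit and the method split are invented for this example. Because the pattern is anchored with ^, it only handles the first #...# section of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class HashSplit {
    // '#' needs no escaping in a Java regex (unless the COMMENTS flag is set).
    private static final Pattern P =
            Pattern.compile("(^[^#]*)(?:#(.+?)#)([^#]*)");

    // Returns {outsideText, insideText}, or null if there is no #...# section.
    static String[] split(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        // Groups 1 and 3 are the text before and after the section.
        return new String[] { m.group(1) + m.group(3), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = split("Hello, we are #happy# to see you today");
        System.out.println("Outside: \"" + parts[0] + "\"");
        System.out.println("Inside: \"" + parts[1] + "\"");
    }
}
```

For an input with several sections, such as "#Hello# kind #stranger#", this sketch only reports the first one, so a loop over an alternation pattern is likely the better fit there.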
Is this solution fine?
Pattern pattern =
Pattern.compile("([^#]+)|#([^#]*)#");
Matcher matcher =
pattern.matcher("Hello, we are #happy# to see you today");
List<String> notBetween = new ArrayList<>(); // not surrounded by #
List<String> between = new ArrayList<>(); // surrounded by #
while (matcher.find()) {
if (Objects.nonNull(matcher.group(1))) notBetween.add(matcher.group(1));
if (Objects.nonNull(matcher.group(2))) between.add(matcher.group(2));
}
System.out.println("Printing group 1");
for (String string :
notBetween) {
System.out.println(string);
}
System.out.println("Printing group 2");
for (String string :
between) {
System.out.println(string);
}

regex to remove data between "some text (some text)"?

I have a string in java which contains text as
Hello user your choice is in (1,2,3,4) as selected by you.
Now I want to replace choice is in (1,2,3,4) in this string with "".
I cannot do it directly with replace() in Java because the data inside the () is dynamic and changes every time.
Output required
Hello user your as selected by you.
I tried using regex but it failed and did not work, my regex
(?s)(\\choice is in .*?\\\\(\\\\)
You may use
.replaceAll("\\s+choice\\s+is\\s+in\\s+\\([^()]*\\)", "")
See the regex demo.
\s+ - 1+ whitespaces
choice\s+is\s+in - choice is in with any 1+ whitespaces in between words
\s+ - 1+ whitespaces
\([^()]*\) - a (, then any 0+ chars other than ( and ) and then a )
See Java demo:
String s = "Hello user your choice is in (1,2,3,4) as selected by you.";
System.out.println(s.replaceAll("\\s+choice\\s+is\\s+in\\s+\\([^()]*\\)", ""));
// => Hello user your as selected by you.
Given below is a non-regex solution:
public class Main {
    public static void main(String[] args) {
        String s = "Hello user your choice is in (1,2,3,4) as selected by you.";
        int start = s.indexOf(" choice is in (");
        // Index of the first `)` at or after `start`; -1 if the phrase is absent
        int end = start == -1 ? -1 : s.indexOf(")", start);
        if (end != -1) { // leave the string unchanged when nothing was found
            s = s.substring(0, start) + s.substring(end + 1);
        }
        System.out.println(s);
    }
}
Output:
Hello user your as selected by you.
Please refer to the code below.
String pattern = "choice is in (.*?) ";
String userString = "Hello user your choice is in (1,2,3,4) as selected by you";
userString = userString.replaceAll(pattern, "");
System.out.println(userString);
Output will be :
Hello user your as selected by you
Try This:
String pattern = "choice is in (.*) as";
String userString = "Hello user your choice is in (1,2,3,4) as selected by you";
userString = userString.replaceAll(pattern, "as");
System.out.println(userString);
And the output would be:
Hello user your as selected by you

Regex to capture groups and ignore last two characters where one is optional

I need to capture two groups from an input string. The values differ in structure as they come in.
The following are examples of the incoming strings:
Comment = "This is a comment";
NumericValue = 123456;
What I am trying to accomplish is to capture the string value from the left of the equals sign as one group and the value after the equals sign as a second group. The semicolon should never be included.
The caveat is that if the second group is a string, the quotes from each end must not be included in that capture group.
The expected results would be:
Comment = "This is a comment";
key group => Comment
value group => This is a comment
NumericValue = 123456;
key group => NumericValue
value group => 123456
The following is what I have so far. This works fine for capturing the numeric value, but leaves the end double quote when capturing the string value.
(?<key>\w+)\s*=\s*(?:[\"]?)(?<group>.+(?:(?=[\"]?;)))
EDIT
When applying the regex against a string value, it must allow capture of semicolons and double quotes within the string and ignore only the closing ones.
So, if we have an input of:
Comment = "This is a "comment"; This is still a comment";
The second capture group should be:
This is a "comment"; This is still a comment
An option is to use an alternation where you would have to check for group 2 or group 3:
(?<key>\w+)\h*=\h*(?:"(.*?)"|([^"\r\n]+));$
(?<key>\w+) Group key match 1+ word chars
\h*=\h* Match an = between optional horizontal whitespace chars
(?: Non capturing group
"(.*?)" Capture in group 2 0+ times any char between "
| Or
([^"\r\n]+) Capture group 3, match 1+ times any char except " or a newline
); Close non capturing group and match ;
$ End of string
Regex demo
In Java
String regex = "(?<key>\\w+)\\h*=\\h*(?:\"(.*?)\"|([^\"\\r\\n]+));$";
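A short sketch of how that Java pattern might be used, picking the value from whichever of group 2 or group 3 took part in the match. The class name KeyValueParser and the method parse are assumptions made for this example, and it assumes each input line has no trailing whitespace after the semicolon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern above.
class KeyValueParser {
    private static final Pattern LINE = Pattern.compile(
            "(?<key>\\w+)\\h*=\\h*(?:\"(.*?)\"|([^\"\\r\\n]+));$");

    // Returns {key, value}, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Group 2 is set for quoted values, group 3 for unquoted ones.
        String value = m.group(2) != null ? m.group(2) : m.group(3);
        return new String[] { m.group("key"), value };
    }

    public static void main(String[] args) {
        String[] lines = {
            "Comment = \"This is a comment\";",
            "NumericValue = 123456;",
            "Comment = \"This is a \"comment\"; This is still a comment\";"
        };
        for (String line : lines) {
            String[] kv = parse(line);
            System.out.println(kv[0] + " => " + kv[1]);
        }
    }
}
```

The lazy quantifier keeps extending until a closing quote is directly followed by the final semicolon, which is why embedded quotes and semicolons end up inside the value.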
Edited based on comment to include ; and " in the comments as per the examples given:
(?<key>\w+)\s*=\s*(?:[\"]?)(?<value>((")(?!;?$)|;(?!$)|[^;"])+)"?;?$
The following one additionally doesn't allow ; or " to appear in the numeric text. However, to include this, I had to rename the capturing groups because the name cannot be used for more than one group.
(?<key>\w+)\s*=\s*((?:")(?<valueT>((")(?!;?$)|;(?!$)|[^;"])+)";?$|(?<valueN>[^;"]+);?$)
Here is a class that tests it.
For readability, I have separated the key and value regexes in the class. I have added the test cases in a method within the class. However, this still doesn't handle the case of a numeric text containing ; or ". Also, the line needs to be trimmed before being subjected to the pattern test (which I think is feasible).
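import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class NameValuePairRegex{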
public static void main( String[] args ){
String SPACE = "\\s*";
String EQ = "=";
String OR = "|";
/* The original regex tried by you (for comparison). */
String orig = "(?<key>\\w+)\\s*=\\s*(?:[\\\"]?)(?<value>.+(?:(?=;)))";
String key = "(?<key>\\w+)";
String valuePatternForText = "(?:\")(?<valueT>((\")(?!;?$)|;(?!$)|[^;\"])+)\";?$";
String valuePatternForNumbers = "(?<valueN>[^;\"]+);?$";
String p = key + SPACE + EQ + SPACE + "(" + valuePatternForText + OR + valuePatternForNumbers + ")";
Pattern nvp = Pattern.compile( p );
System.out.println( nvp.pattern() );
print( input(), nvp );
}
private static void print( List<String> input, Pattern ep ) {
for( String e : input ) {
System.out.println( e );
Matcher m = ep.matcher( e );
boolean found = m.find();
if( !found ) {
System.out.println( "\t\tNo match" );
continue;
}
String valueT = m.group( "valueT" );
String valueN = m.group( "valueN" );
System.out.print( "\t\t" + m.group( "key" ) + " -> " + ( valueT == null ? "" : valueT ) + " " + ( valueN == null ? "" : valueN ) );
System.out.println( );
}
}
private static List<String> input(){
List<String> neg = new ArrayList<>();
Collections.addAll( neg,
"Comment = \"This is a comment\";",
"Comment = \"This is a comment with semicolon ;\";",
"Comment = \"This is a comment with semicolon ; and quote\"\";",
"Comment = \"This is a comment\"",
"Comment = \"This is a \"comment\"; This is still a comment\";",
"NumericValue = 123456;",
"NumericValue = 123;456;",
"NumericValue = 123\"456;",
"NumericValue = 123456" );
return neg;
}
}
Original answer:
The following changed regex is fulfilling the requirements you mentioned. I added the exclusion of ; and " from the value part.
Original that you tried:
(?<key>\w+)\s*=\s*(?:[\"]?)(?<group>.+(?:(?=[\"]?;)))
The changed one:
(?<key>\w+)\s*=\s*(?:[\"]?)(?<value>[^;"]+)
Regular expressions are fun, but look how clean and easy to read this would be without using a regular expression:
int equals = s.indexOf('=');
String key = s.substring(0, equals).trim();
String value = s.substring(equals + 1).trim();
if (value.endsWith(";")) {
value = value.substring(0, value.length() - 1).trim();
}
if (value.startsWith("\"") && value.endsWith("\"")) {
value = value.substring(1, value.length() - 1);
}
Don’t assume that because this uses more lines of code than a regular expression that it’s slower. The lines of code executed internally by a regex engine will far exceed the above code.

Java Regex : How to search a text or a phrase in a large text

I have a large text file and I need to search a word or a phrase in the file line by line and output the line with the text found in it.
For example, the sample text is
And the earth was without form,
Where [art] thou?
If the user searches for the word thou, the only line displayed should be
Where [art] thou?
and if the user searches for the earth, the first line should be displayed.
I tried using the contains function, but when searching for thou it also displays the line containing without.
This is my sample code :
String[] verseList = TextIO.readFile("pentateuch.txt");
Scanner kbd = new Scanner(System.in);
int counter = 0;
for (int i = 0; i < verseList.length; i++) {
String[] data = verseList[i].split("\t");
String[] info3 = data[3].split(" ");
System.out.print("Search for: ");
String txtSearch = kbd.nextLine();
LinkedList<String> searchedList = new LinkedList<String>();
for (String bible : verseList){
if (bible.contains(txtSearch)){
searchedList.add(bible);
counter++;
}
}
if (searchedList.size() > 0){
for (String s : searchedList){
String[] searchedData = s.split("\t");
System.out.printf("%s - %s - %s - %s \n",searchedData[0], searchedData[1], searchedData[2], searchedData[3]);
}
}
System.out.print("Total: " + counter);
So I am thinking of using regex but I don't know how.
Can anyone help? Thank you.
Since search terms sometimes have non-word characters at their boundaries, you cannot rely on the \b word boundary.
In such cases, it is safer to use the look-arounds (?<!\w) and (?!\w), i.e. in Java, something like:
"(?<!\\w)" + searchedData[n] + "(?!\\w)"
To match a String that contains a word, use this code:
String txtSearch; // eg "thou"
if (str.matches(".*?\\b" + txtSearch + "\\b.*"))
// it matches
This code builds a regex that only matches if both ends of txtSearch fall at the start/end of a word in the string, by using \b, which means "word boundary".
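The look-around approach from the earlier answer could be sketched as follows. The search term is wrapped in Pattern.quote here, which is an addition not present in either answer, so that terms such as [art] are treated literally rather than as a character class. The class name WholeWordSearch and the method linesContaining are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class WholeWordSearch {
    // Returns the lines in which the search term occurs as a whole word or phrase.
    static List<String> linesContaining(List<String> lines, String search) {
        // Pattern.quote makes regex metacharacters in the term literal.
        Pattern p = Pattern.compile(
                "(?<!\\w)" + Pattern.quote(search) + "(?!\\w)");
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> verses = Arrays.asList(
                "And the earth was without form,",
                "Where [art] thou?");
        System.out.println(linesContaining(verses, "thou"));
        System.out.println(linesContaining(verses, "the earth"));
        System.out.println(linesContaining(verses, "[art]"));
    }
}
```

With \b instead of the look-arounds, the search for [art] would not behave the same way, because there is no word boundary between a space and the opening bracket.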

REGEX : How to escape []?

I'm working on strings like "[ro.multiboot]: [1]". How do I just select 1(it can also be 0) out of this string?
I am looking for a regex in Java.
Usually, you would do something like (assuming 0 and 1 were the only options):
^.*\[([01])\].*$
If you only wanted the value for ro.multiboot, you could change it to something like:
^.*\[ro.multiboot\].*\[([01])\].*$
(depending on how complex any of the non-bracketed stuff is allowed to be).
These would both basically only extract the value between square brackets if it were zero or one, and capture it into a capture variable so you could use it.
Of course, regex is not a world-wide standard, nor are the environments in which you use it. That means it depends a lot on your actual environment how you will actually code this up.
For Java, the following sample program may help:
import java.util.regex.*;
class Test {
public static void main(String args[]) {
Pattern p = Pattern.compile("^.*\\[ro.multiboot\\].*\\[([01])\\].*$");
String str;
Matcher m;
str = "[ro.multiboot]: [0]";
m = p.matcher (str);
if (m.find()) {
System.out.println ("str0 has " + m.group(1));
}
str = "[ro.multiboot]: [1]";
m = p.matcher (str);
if (m.find()) {
System.out.println ("str1 has " + m.group(1));
}
str = "[ro.multiboot]: [2]";
m = p.matcher (str);
if (m.find()) {
System.out.println ("str2 has " + m.group(1));
}
}
}
This results in (as expected):
str0 has 0
str1 has 1
@paxdiablo's regexps are correct, but a complete answer to "How do I just select 1(it can also be 0) out of this string?" is:
1. very simple solution
String input = "[ro.multiboot]: [1]";
String matched = input.replaceFirst( "^.*\\[ro.multiboot\\].*\\[([01])\\].*$", "$1" );
2. same functionality, more complicated but with better performance
String input = "[ro.multiboot]: [1]";
Pattern p = Pattern.compile( "^.*\\[ro.multiboot\\].*\\[([01])\\].*$" );
Matcher m = p.matcher( input );
String matched = null;
if ( m.matches() ) matched = m.group( 1 );
Performance is better because the pattern is compiled just once (for example, when you are matching an array of such Strings).
Notes:
in both examples the group is the part of the regexp between ( and ) (if not escaped)
in Java you have to write \\[ in a string literal, because \[ is a compile error: it is not a valid escape sequence for a String
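As an alternative to escaping each bracket by hand, the literal part can be wrapped in Pattern.quote, which also treats the dot in ro.multiboot literally. This is only a sketch: the class name MultibootValue and the method extract are made up, and it assumes the key and the value are separated by a colon and optional whitespace, as in the sample string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class MultibootValue {
    // Pattern.quote escapes '[', ']' and '.' in the key for us.
    private static final Pattern P = Pattern.compile(
            Pattern.quote("[ro.multiboot]") + ":\\s*\\[([01])\\]");

    // Returns "0" or "1", or null if the input has a different shape or value.
    static String extract(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("[ro.multiboot]: [1]"));
        System.out.println(extract("[ro.multiboot]: [0]"));
        System.out.println(extract("[ro.multiboot]: [2]"));
    }
}
```

The pattern is still compiled only once, so the performance note above applies here as well.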
