extracting sentences which contain a particular word

I want to get the sentences in a text file which contain a particular keyword. I have tried a lot but am not able to get the proper sentences that contain the keyword. I have more than one set of keywords; if any of them matches within the paragraph, then that sentence should be taken.
For example, if my text file contains words like robbery, robbed etc., then that sentence should be extracted. Below is the code which I tried. Is there any way to solve this using regex? Any help will be appreciated.
BufferedReader br1 = new BufferedReader(new FileReader("/home/pgrms/Documents/test/one.txt"));
String str="";
while(br1 .ready())
{
str+=br1 .readLine() +"\n";
}
Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher match = re.matcher(str);
String sentenceString="";
while (match .find())
{
sentenceString=match.group(0);
System.out.println(sentenceString);
}

Here is an example for when you have a list of predefined keywords:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.*;

public class Tester {
    public static void main(String[] args) {
        String[] words = {"robbery", "robbed", "robbers"};
        // Join the keywords into one alternation: robbery|robbed|robbers
        String word_re = words[0];
        for (int i = 1; i < words.length; i++)
            word_re += "|" + words[i];
        word_re = "[^.]*\\b(" + word_re + ")\\b[^.]*[.]";

        StringBuilder str = new StringBuilder();
        try (BufferedReader br1 = new BufferedReader(new FileReader("input"))) {
            String line;
            while ((line = br1.readLine()) != null) {
                // Keep a space between lines so words at line ends do not run together.
                str.append(line).append(' ');
            }
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it.
            e.printStackTrace();
            return;
        }

        Pattern re = Pattern.compile(word_re, Pattern.CASE_INSENSITIVE);
        Matcher match = re.matcher(str);
        while (match.find()) {
            System.out.println(match.group(0));
        }
    }
}
This creates a regex of the form:
[^.]*\b(robbery|robbed|robbers)\b[^.]*[.]
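The same regex can also be built without the manual loop. The following is only a sketch: the class and method names (KeywordSentences, buildPattern, extract) are made up for illustration, and the keywords are assumed to start and end with a letter or digit, since \b is placed around them. Pattern.quote is used so that a keyword containing regex metacharacters is matched literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordSentences {
    // Builds [^.]*\b(?:w1|w2|...)\b[^.]*[.] from the keywords, quoting each
    // keyword so that regex metacharacters inside it are taken literally.
    static Pattern buildPattern(List<String> keywords) {
        List<String> quoted = new ArrayList<>();
        for (String keyword : keywords) {
            quoted.add(Pattern.quote(keyword));
        }
        String alternation = String.join("|", quoted);
        return Pattern.compile("[^.]*\\b(?:" + alternation + ")\\b[^.]*[.]",
                Pattern.CASE_INSENSITIVE);
    }

    // Returns every '.'-terminated sentence that contains one of the keywords.
    static List<String> extract(String text, List<String> keywords) {
        List<String> result = new ArrayList<>();
        Matcher m = buildPattern(keywords).matcher(text);
        while (m.find()) {
            result.add(m.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The bank was robbed on Monday. Nobody was hurt. "
                + "Police are looking for the robbers.";
        for (String sentence : extract(text, List.of("robbery", "robbed", "robbers"))) {
            System.out.println(sentence);
        }
    }
}
```

With the sample text in main, the first and third sentences are printed and the middle one is skipped. Like the original regex, this only treats '.' as a sentence terminator.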

In general, to check whether a sentence contains rob, robbery or robbed, you can add a lookahead after the beginning-of-string anchor, before the rest of your regex pattern:
(?=.*(?:rob|robbery|robbed))
In this case, it is more efficient to group the rob and then check for potential suffixes:
(?=.*(?:rob(?:bery|bed)?))
In your Java code, we can (for instance) modify your loop as shown below. String.matches() has to match the whole string, so a bare lookahead would only succeed on an empty sentence; the check is therefore written with .* on both sides, and (?s) lets . match the line breaks inside a sentence. Add \b before rob if words such as problem should not count.
while (match.find())
{
    sentenceString=match.group(0);
    if (sentenceString.matches("(?s).*rob(?:bery|bed)?.*")) {
        System.out.println(sentenceString);
    }
}
Explain Regex (lookahead form)
(?= # look ahead to see if there is:
.* # any character except \n (0 or more times
# (matching the most amount possible))
(?: # group, but do not capture:
rob # 'rob'
(?: # group, but do not capture (optional
# (matching the most amount possible)):
bery # 'bery'
| # OR
bed # 'bed'
)? # end of grouping
) # end of grouping
) # end of look-ahead

Take a look at the ICU Project and icu4j. It does boundary analysis, so it splits sentences and words for you, and will do it for different languages.
For the rest, you can either match the words against a Pattern (as others have suggested), or check it against a Set of the words you're interested in.
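As a rough sketch of the boundary-analysis idea combined with a Set lookup, the example below uses java.text.BreakIterator from the JDK rather than ICU4J itself (ICU4J ships its own BreakIterator with a similar API). The class and method names are hypothetical, and the locale is assumed to be English.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class BoundarySentences {
    // Splits the text into sentences and keeps those containing a keyword.
    static List<String> sentencesContaining(String text, Set<String> keywords) {
        List<String> hits = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            String sentence = text.substring(start, end).trim();
            if (containsKeyword(sentence, keywords)) {
                hits.add(sentence);
            }
        }
        return hits;
    }

    // Splits a sentence at word boundaries and checks each lower-cased piece
    // against the keyword set (keywords are expected in lower case).
    static boolean containsKeyword(String sentence, Set<String> keywords) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(sentence);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String word = sentence.substring(start, end).toLowerCase(Locale.ENGLISH);
            if (keywords.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The bank was robbed. Nobody was hurt.";
        System.out.println(sentencesContaining(text, Set.of("robbery", "robbed")));
    }
}
```

The Set lookup matches whole words only, so every inflected form of interest (robbery, robbed, robbers) has to be listed explicitly.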

Related

Regex Pattern required in java for matching string starts with '{{' and ends with "}}"

Hi,
I need a regex pattern that picks out the substrings that start with '{{' and end with '}}' from a given string.
The pattern I have created also matches strings that start with '{{{' and end with '}}}', not only those delimited by '{{' and '}}'.
Output of the code below:
matches = {{phone2}}
matches = {{phone3}}
matches = {{phone5}}
**Expected Output**:
matches = {{phone5}}
I need only the strings that are wrapped in exactly two consecutive '{' and '}' characters, not three.
The code is below:
package com.test;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegexTest {
public static void main(String[] args) {
String text = "<test>{{#phone1}}{{{phone3}}}{{/phone4}} {{phone5}}></test>";
//String pattern = "\\{\\{\\s*?(\\w*?)\\s*?(?!.*\\}\\}\\}$)";
String pattern = "\\{\\{\\s*?(\\w*?)\\s*?}}";
Pattern placeholderPattern = Pattern.compile(pattern);
Matcher placeholderMatcher = placeholderPattern.matcher(text);
while (placeholderMatcher.find()) {
System.out.println("matches = " + placeholderMatcher.group());
}
}
}

You may use
String pattern = "(?<!\\{)\\{{2}\\s*(\\w*)\\s*\\}{2}(?!\\})";
Or, if empty or blank {{...}} are not expected, use
String pattern = "(?<!\\{)\\{{2}\\s*(\\w+)\\s*\\}{2}(?!\\})";
See the regex demo.
Details
(?<!\{) - a negative lookbehind failing the match if there is a { char immediately to the left of the current location
\{{2} - {{ substring
\s* - 0+ whitespaces
(\w*) - Group 1: zero or more word chars (1 or more if the + quantifier is used)
\s* - 0+ whitespaces
\}{2} - }} string
(?!\}) - a negative lookahead that fails the match if there is a } char immediately to the right of the current location.
See the Java demo:
String text = "<test>{{#phone1}}{{{phone3}}}{{/phone4}} {{phone5}}></test>";
String pattern = "(?<!\\{)\\{{2}\\s*(\\w*)\\s*\\}{2}(?!\\})";
Pattern placeholderPattern = Pattern.compile(pattern);
Matcher placeholderMatcher = placeholderPattern.matcher(text);
while (placeholderMatcher.find()) {
System.out.println("Match: " + placeholderMatcher.group());
System.out.println("Group 1: " + placeholderMatcher.group(1));
}
Output:
Match: {{phone5}}
Group 1: phone5

How to splitting records based white spaces when different lines have spaces at different positions

I have a file with records as below and I am trying to split each record on white space and join the fields with commas.
file:
a 3w 12 98 header P6124
e 4t 2 100 header I803
c 12L 11 437 M12
BufferedReader reader = new BufferedReader(new FileReader("/myfile.txt"));
String line = reader.readLine();
while (line != null) {
    System.out.println(line);
    // Split the current line before reading the next one.
    String[] splitLine = line.split("\\s+");
    line = reader.readLine();
}
If the data is separated by multiple white spaces, I usually go for a regex split: split("\\s+") or split(" +").
But in the above case, record c doesn't have the value header. The regex "\s+" or " +" treats the whole gap as a single delimiter and skips the missing field, so I get c,12L,11,437,M12 instead of c,12L,11,437,,M12
How do I properly split the lines based on any delimiter in this case so that I get data in the below format:
a,3w,12,98,header,P6124
e,4t,2,100,header,I803
c,12L,11,437,,M12
Could anyone let me know how I can achieve this ?

Maybe you can try a more elaborate approach: use a regex that matches exactly six fields on each line and explicitly handles the case of a missing value for the fifth one.
I rewrote your example adding some console log in order to clarify my suggestion:
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
    private static final String Input = "a 3w 12 98 header P6124\n" +
            "e 4t 2 100 header I803\n" +
            "c 12L 11 437 M12";

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new StringReader(Input));
        // Six fields separated by runs of spaces; the fifth field is optional.
        // A missing fifth field is only detected when at least two spaces
        // separate the fourth and the last value.
        Pattern pattern = Pattern.compile("^([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+) +([^ ]+)? +([^ ]+)$");
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println("Line: " + line);
            System.out.println(line.split("\\s+").length);
            Matcher matcher = pattern.matcher(line);
            boolean matches = matcher.matches();
            System.out.println("matches: " + matches);
            System.out.println("groups: " + matcher.groupCount());
            // group(i) throws IllegalStateException unless the match succeeded.
            if (matches) {
                for (int i = 1; i <= matcher.groupCount(); i++) {
                    System.out.printf(" Group %d has value '%s'\n", i, matcher.group(i));
                }
            }
        }
    }
}
The key is that the pattern used to match each line requires a sequence of six fields:
for each field, the value is described as [^ ]+
separators between fields are described as " +" (one or more spaces)
the fifth (nullable) field is made optional as ([^ ]+)?
each value is captured as a group using parentheses: ( ... )
start (^) and end ($) of each line are marked explicitly
Then, each line is matched against the given pattern, obtaining six groups: you can access each group using matcher.group(index), where index is 1-based because group(0) returns the full match. When the fifth field is missing, group(5) returns null.
This is a more complex approach but I think it can help you to solve your problem.

Put a limit on the number of whitespace chars that may be used to split the input.
In the case of your example data, a maximum of 5 works:
String[] splitLine = line.split("\\s{1,5}");
See live demo (of this code working as desired).
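A small sketch of how the bounded quantifier behaves. The spacing is an assumption: present fields are taken to be separated by at most five spaces, and the gap left by the missing value is taken to be eight spaces wide. The class and method names are hypothetical, and String.repeat needs Java 11 or later.

```java
class BoundedSplit {
    // A gap of 1-5 whitespace characters counts as one delimiter, so a wider
    // gap is split into several delimiters with empty fields in between.
    static String toCsv(String line) {
        return String.join(",", line.split("\\s{1,5}"));
    }

    public static void main(String[] args) {
        String full = "a 3w 12 98 header P6124";
        // Assumed layout: eight spaces where the fifth value is missing.
        String sparse = "c 12L 11 437" + " ".repeat(8) + "M12";
        System.out.println(toCsv(full));
        System.out.println(toCsv(sparse));
    }
}
```

The eight-space gap is consumed as five spaces plus three spaces, which leaves one empty field between 437 and M12. A gap of eleven or more spaces would produce two empty fields, so the limit has to be tuned to the real column widths.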

Are you just trying to switch your delimiters from spaces to commas?
In that case:
cat myFile.txt | sed 's/   */  /g' | sed 's/ /,/g'
*edit: added a stage that collapses runs of more than two spaces into the two spaces needed to retain the double comma.

How to use regex with String.split()

I have the following String:
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
I want to convert it to an array of String which will look like this.
String[] Title = {"Title1 Title2","Title3 Title4","Title5 Title6","Title7"};
I am trying the following code.
String[] Title=fullPDFContex.split("\r\n\r\n|\r\n \r\n|\r\n");
But not getting the desired output.

You need to split with a pattern that matches any amount of whitespace that contains a line break:
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
String separator = "\\p{javaWhitespace}*\\R\\p{javaWhitespace}*";
String results[] = fullPDFContex.split(separator);
System.out.println(Arrays.toString(results));
// => [Title1 Title2, Title3 Title4, Title5 Title6, Title7]
See the Java demo.
The \\p{javaWhitespace}*\\R\\p{javaWhitespace}* matches
\\p{javaWhitespace}* - 0+ whitespaces
\\R - a line break (you may replace it with [\r\n] for Java 7 and older)
\\p{javaWhitespace}* - 0+ whitespaces.
Alternatively, you may use a bit more efficient
String separator = "[\\s&&[^\r\n]]*\\R\\s*";
See another demo
Unfortunately, the \R construct cannot be used in the character classes. The pattern will match:
[\\s&&[^\r\n]]* - zero or more whitespace chars other than CR and LF (character class subtraction is used here)
\\R - a line break
\\s* - any 0+ whitespace chars.

Here is a solution using StringTokenizer. The split values are collected in a List, which works for any number of values.
package com.sujit;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
public class UserInput {
public static void main(String[] args) {
String fullPDFContex = "Title1 Title2\r\nTitle3 Title4\r\n\r\nTitle5 Title6\r\n \r\n Title7 \r\n\r\n\r\n\r\n\r\n";
StringTokenizer token = new StringTokenizer(fullPDFContex, "\r\n");
List<String> list = new ArrayList<>();
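while (token.hasMoreTokens()) {
// Trim each token and skip whitespace-only ones so the result matches the expected array.
String title = token.nextToken().trim();
if (!title.isEmpty()) {
list.add(title);
}
}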
for (String string : list) {
System.out.println(string);
}
}
}

With this code you get the output you want:
String[] Title = fullPDFContex.split(" *(\r\n ?)+ *");

Java pattern matching using regex

I am new to Java and to pattern matching. I read the string below from a file; written directly in source code without the escapes it would give a compilation error. I have a string as follows:
String str = "find(\"128.210.16.48\",\"Hello Everyone\")" ; // no compile error
I want to extract the values "128.210.16.48" and "Hello Everyone" from the above string. These values are not constant.
Can you please give me some suggestions?
Thanks

I suggest using the String#split() method, but if you are still looking for a regex pattern, try this one and take the matched group at index 1.
("[^"][\d\.]+"|"[^)]*+)
Online demo
Sample code:
String str = "find(\"128.210.16.48\",\"Hello Everyone\")";
String regex = "(\"[^\"][\\d\\.]+\"|\"[^)]*+)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
output:
"128.210.16.48"
"Hello Everyone"
Pattern explanation:
(            group and capture to \1:
  "          '"'
  [^"]       any character except: '"'
  [\d\.]+    any character of: digits (0-9), '\.' (1 or more times (matching the most amount possible))
  "          '"'
 |           OR
  "          '"'
  [^)]*+     any character except: ')' (0 or more times (matching the most amount possible), possessively)
)            end of \1

Try with String.split()
String str = "find(\"128.210.16.48\",\"Hello Everyone\")" ;
System.out.println(str.split(",")[0].split("\"")[1]);
System.out.println(str.split(",")[1].split("\"")[1]);
Output:
128.210.16.48
Hello Everyone
Edit:
Explanation:
To get the first value, split the string on the comma (,) and take the first element, str.split(",")[0]; then split that on the double quote (") and take the second element, split("\"")[1]. The second value is obtained the same way from str.split(",")[1].

The accepted answer is fine, but if you (or anyone who finds this question) still want to use a regex instead of String.split, here's something:
String str = "find(\"128.210.16.48\",\"Hello Everyone\")" ; // no compile error
String regex1 = "\".+?\"";
Pattern pattern1 = Pattern.compile(regex1);
Matcher matcher1 = pattern1.matcher(str);
while (matcher1.find()){
System.out.println("Matcher 1 found (trimmed): " + matcher1.group().replace("\"",""));
}
Output:
Matcher 1 found (trimmed): 128.210.16.48
Matcher 1 found (trimmed): Hello Everyone
Note: this will only work if " is only used as a separator character. See Braj's demo as an example from the comments here.
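If the input always has the shape name("first","second"), both values can also be pulled out in one pass with capture groups. This is a sketch under that assumption; the class and method names are hypothetical, and like the code above it does not handle escaped quotes inside the values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedArgs {
    // name("first","second"): group 1 and group 2 capture the text between
    // the quotes; [^"]* stops at the first closing quote.
    private static final Pattern CALL =
            Pattern.compile("\\w+\\(\"([^\"]*)\",\\s*\"([^\"]*)\"\\)");

    // Returns the two quoted arguments, or throws if the input has another shape.
    static String[] parse(String input) {
        Matcher m = CALL.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected input: " + input);
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] values = parse("find(\"128.210.16.48\",\"Hello Everyone\")");
        System.out.println(values[0]);
        System.out.println(values[1]);
    }
}
```

Using matches() instead of find() makes the method reject anything that is not exactly one call with two quoted arguments.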

Regex for String with possible escape characters

I asked this question some time ago here (Regular expression that does not contain quote but can contain escaped quote) and got a response, but somehow I am not able to make it work in Java.
Basically I need to write a regular expression that matches a valid string beginning and ending with quotes, which can have quotes in between provided they are escaped.
In the code below, I essentially want to match all three strings and print true, but cannot.
What should be the correct regex?
Thanks
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \" ABC\"",
"\"tuco \" ABC \" DEF\""
};
Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}

The problem is not so much your regex, but rather your test strings. The single backslash before the internal quotes in your second and third example strings is consumed when the string literal is parsed. The string being passed to the regex engine has no backslash before the quote. (Try printing it out.) Here is a tested version of your function which works as expected:
import java.util.regex.*;
public class TEST
{
public static void main(String[] args) {
String[] arr = new String[]
{
"\"tuco\"",
"\"tuco \\\" ABC\"",
"\"tuco \\\" ABC \\\" DEF\""
};
//old: Pattern pattern = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");
Pattern pattern = Pattern.compile(
"# Match double quoted substring allowing escaped chars. \n" +
"\" # Match opening quote. \n" +
"( # $1: Quoted substring contents. \n" +
" [^\"\\\\]* # {normal} Zero or more non-quote, non-\\. \n" +
" (?: # Begin {(special normal*)*} construct. \n" +
" \\\\. # {special} Escaped anything. \n" +
" [^\"\\\\]* # more {normal} non-quote, non-\\. \n" +
" )* # End {(special normal*)*} construct. \n" +
") # End $1: Quoted substring contents. \n" +
"\" # Match closing quote. ",
Pattern.DOTALL | Pattern.COMMENTS);
for (String str : arr) {
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());
}
}
}
I've replaced your regex with an improved version (taken from MRE3, Mastering Regular Expressions, 3rd edition). Note that this question gets asked a lot. Please see this answer where I compare several functionally equivalent expressions.
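To see the effect of literal parsing directly, the short sketch below prints the asker's second test string in both spellings and runs the asker's original pattern against each. The class and constant names are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The asker's original pattern: a quote, then runs of ordinary characters
    // or backslash-escaped characters, then a closing quote.
    static final Pattern QUOTED = Pattern.compile("\"(?:[^\"\\\\]+|\\\\.)*\"");

    // \" in Java source is just a quote character; no backslash reaches the regex.
    static final String WITHOUT_BACKSLASH = "\"tuco \" ABC\"";
    // \\ yields one real backslash, followed by \" for the quote itself.
    static final String WITH_BACKSLASH = "\"tuco \\\" ABC\"";

    public static void main(String[] args) {
        System.out.println(WITHOUT_BACKSLASH + " -> "
                + QUOTED.matcher(WITHOUT_BACKSLASH).matches());
        System.out.println(WITH_BACKSLASH + " -> "
                + QUOTED.matcher(WITH_BACKSLASH).matches());
    }
}
```

The first line prints "tuco " ABC" -> false and the second prints "tuco \" ABC" -> true, which shows that the original regex already accepts the string once the backslash actually reaches it.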
