I'm actually developping a parser and I'm stuck on a method.
I need to clean specifics words in some sentences, meaning replacing those by a whitespace or a nullcharacter.
For now, I came up with this code:
private void clean(String sentence)
{
try {
FileInputStream fis = new FileInputStream(
ConfigHandler.getDefault(DictionaryType.CLEANING).getDictionaryFile());
BufferedReader bis = new BufferedReader(new InputStreamReader(fis));
String read;
List<String> wordList = new ArrayList<String>();
while ((read = bis.readLine()) != null) {
wordList.add(read);
}
}
catch (IOException e) {
e.printStackTrace();
}
for (String s : wordList) {
if (StringUtils.containsIgnoreCase(sentence, s)) { // this comes from Apache Lang
sentence = sentence.replaceAll("(?i)" + s + "\\b", " ");
}
}
cleanedList.add(sentence);
}
But when I look at the output, I got all of the occurences of the word to be replaced in my sentence replaced by a whitespace.
Does anybody can help me out on replacing only the exact words to be replaced on my sentence?
Thanks in advance !
There are two problems in your code:
You are missing the \b before the string
You will run into issues if any of the words from the file has special characters
To fix this problem construct your regex as follows:
sentence = sentence.replaceAll("(?i)\\b\\Q" + s + "\\E\\b", " ");
or
sentence = sentence.replaceAll("(?i)\\b" + Pattern.quote(s) + "\\b", " ");
Related
This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 4 years ago.
I need to verify hebrew text from the letter
the letter's body like:
שלום,
תואם ייעוץ וידאו עם המטופל John Salivan. מועד הייעוץ נקבע לתאריך
23/02/2019 בשעה 20:45.
לביצוע הייעוץ יש להכנס
but my regex doesn't match text
public static void findBadLines(String fileName) {
Pattern regexp = Pattern.compile(".*שלום,.*תואם ייעוץ וידאו עם המטופל John Salivan. .*מועד הייעוץ נקבע לתאריך .* בשעה.*..*לביצוע הייעוץ יש להכנס .*");
Matcher matcher = regexp.matcher("");
Path path = Paths.get(fileName);
//another way of getting all the lines:
//Files.readAllLines(path, ENCODING);
try (
BufferedReader reader = Files.newBufferedReader(path, ENCODING);
LineNumberReader lineReader = new LineNumberReader(reader);
){
String line = null;
while ((line = lineReader.readLine()) != null) {
matcher.reset(line); //reset the input
if (!matcher.find()) {
String msg = "Line " + lineReader.getLineNumber() + " is bad: " + line;
throw new IllegalStateException(msg);
}
}
}
catch (IOException ex){
ex.printStackTrace();
}
}
final static Charset ENCODING = StandardCharsets.UTF_8;
}
Do I get that right, you wan't to check if there is any hebrew text in a given input?
If so use that regex .*[\u0590-\u05ff]+.*
[\u0590-\u05ff]+ matches one or more hebrew characters, the .* before and after you need to match the rest of your input.
Respectively
Pattern regexp = Pattern.compile(".*[\u0590-\u05ff]+.*");
//...
matcher.reset(line); //reset the input
if (!matcher.matches()) {
String msg = "Line " + lineReader.getLineNumber() + " is bad: " + line;
throw new IllegalStateException(msg);
}
My input:
1. end
2. end of the day or end of the week
3. endline
4. something
5. "something" end
Based on the above discussions, If I try to replace a single string using this snippet, it removes the appropriate words from the line successfully
public class DeleteTest {
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
String delete="end";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+delete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
}
My output If I use the above snippet:(Also my expected output)
1.
2. of the day or of the week
3. endline
4. something
5. "something"
But when I include more words to delete, and for that purpose when I use Set, I use the below code snippet:
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
Set<String> toDelete = new HashSet<>();
toDelete.add("end");
toDelete.add("something");
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+toDelete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
I get my output as: (It just removes the space)
1. end
2. endofthedayorendoftheweek
3. endline
4. something
5. "something" end
Can u guys help me on this?
Click here to follow the thread
You need to create an alternation group out of the set with
String.join("|", toDelete)
and use as
line = line.replaceAll("\\b(?:"+String.join("|", toDelete)+")\\b", "");
The pattern will look like
\b(?:end|something)\b
See the regex demo. Here, (?:...) is a non-capturing group that is used to group several alternatives without creating a memory buffer for the capture (you do not need it since you remove the matches).
Or, better, compile the regex before entering the loop:
Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
line = pat.matcher(line).replaceAll("");
UPDATE:
To allow matching whole "words" that may contain special chars, you need to Pattern.quote those words to escape those special chars, and then you need to use unambiguous word boundaries, (?<!\w) instead of the initial \b to make sure there is no word char before and (?!\w) negative lookahead instead of the final \b to make sure there is no word char after the match.
In Java 8, you may use this code:
Set<String> nToDel = new HashSet<>();
nToDel = toDelete.stream()
.map(Pattern::quote)
.collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";
The regex will look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w). Note that the symbols between \Q and \E is parsed as literal symbols.
The problem is that you're not creating the correct regex for replacing the words in the set.
"\\b"+toDelete+"\\b" will produce this String \b[end, something]\b which is not what you need.
To fix that you can do something like this:
for(String del : toDelete){
line = line.replaceAll("\\b"+del+"\\b", "");
}
What this does is to go through the set, produce a regex from each word and remove that word from the line String.
Another approach will be to produce a single regex from all the words in the set.
Eg:
String regex = "";
for(String word : toDelete){
regex+=(regex.isEmpty() ? "" : "|") + "(\\b"+word+"\\b)";
}
....
line = line.replace(regex, "");
This should produce a regex that looks something like this: (\bend\b)|(\bsomething\b)
I've a huge text file, I'd like to search for specific words and print three or more then this number OF THE WORDS AFTER IT so far I have done this
public static void main(String[] args) {
String fileName = "C:\\Users\\Mishari\\Desktop\\Mesh.txt";
String line = null;
try {
FileReader fileReader =
new FileReader(fileName);
BufferedReader bufferedReader =
new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
System.out.println(line);
}
bufferedReader.close();
} catch(FileNotFoundException ex) {
System.out.println(
"Unable to open file '" +
fileName + "'");
} catch(IOException ex) {
System.out.println(
"Error reading file '"
+ fileName + "'");
}
}
It's only for printing the file can you advise me what's the best way of doing it.
You can look for the index of word in line using this method.
int index = line.indexOf(word);
If the index is -1 then that word does not exist.
If it exist than takes the substring of line starting from that index till the end of line.
String nextWords = line.substring(index);
Now use String[] temp = nextWords.split(" ") to get all the words in that substring.
while((line = bufferedReader.readLine()) != null) {
System.out.println(line);
if (line.contains("YOUR_SPECIFIC_WORDS")) { //do what you need here }
}
By the sounds of it what you appear to be looking for is a basic Find & Replace All mechanism for each file line that is read in from file. In other words, if the current file line that is read happens to contain the Word or phrase you would like to add words after then replace that found word with the very same word plus the other words you want to add. In a sense it would be something like this:
String line = "This is a file line.";
String find = "file"; // word to find in line
String replaceWith = "file (plus this stuff)"; // the phrase to change the found word to.
line = line.replace(find, replaceWith); // Replace any found words
System.out.println(line);
The console output would be:
This is a file (plus this stuff) line.
The main thing here though is that you only want to deal with actual words and not the same phrase within another word, for example the word "and" and the word "sand". You can clearly see that the characters that make up the word 'and' is also located in the word 'sand' and therefore it too would be changed with the above example code. The String.contains() method also locates strings this way. In most cases this is undesirable if you want to specifically deal with whole words only so a simple solution would be to use a Regular Expression (RegEx) with the String.replaceAll() method. Using your own code it would look something like this:
String fileName = "C:\\Users\\Mishari\\Desktop\\Mesh.txt";
String findPhrase = "and"; //Word or phrase to find and replace
String replaceWith = findPhrase + " (adding this)"; // The text used for the replacement.
boolean ignoreLetterCase = false; // Change to true to ignore letter case
String line = "";
try {
FileReader fileReader = new FileReader(fileName);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while ((line = bufferedReader.readLine()) != null) {
if (ignoreLetterCase) {
line = line.toLowerCase();
findPhrase = findPhrase.toLowerCase();
}
if (line.contains(findPhrase)) {
line = line.replaceAll("\\b(" + findPhrase + ")\\b", replaceWith);
}
System.out.println(line);
}
bufferedReader.close();
} catch (FileNotFoundException ex) {
System.out.println("Unable to open file: '" + fileName + "'");
} catch (IOException ex) {
System.out.println("Error reading file: '" + fileName + "'");
}
You will of course notice the escaped \b word boundary Meta Characters within the regular expression used in the String.replaceAll() method specifically in the line:
line = line.replaceAll("\\b(" + findPhrase + ")\\b", replaceWith);
This allows us to deal with whole words only.
I was trying to tokenize an input file from sentences into tokens(words).
For example,
"This is a test file." into five words "this" "is" "a" "test" "file", omitting the punctuations and the white spaces. And store them into an arraylist.
I tried to write some codes like this:
public static ArrayList<String> tokenizeFile(File in) throws IOException {
String strLine;
String[] tokens;
//create a new ArrayList to store tokens
ArrayList<String> tokenList = new ArrayList<String>();
if (null == in) {
return tokenList;
} else {
FileInputStream fStream = new FileInputStream(in);
DataInputStream dataIn = new DataInputStream(fStream);
BufferedReader br = new BufferedReader(new InputStreamReader(dataIn));
while (null != (strLine = br.readLine())) {
if (strLine.trim().length() != 0) {
//make sure strings are independent of capitalization and then tokenize them
strLine = strLine.toLowerCase();
//create regular expression pattern to split
//first letter to be alphabetic and the remaining characters to be alphanumeric or '
String pattern = "^[A-Za-z][A-Za-z0-9'-]*$";
tokens = strLine.split(pattern);
int tokenLen = tokens.length;
for (int i = 1; i <= tokenLen; i++) {
tokenList.add(tokens[i - 1]);
}
}
}
br.close();
dataIn.close();
}
return tokenList;
}
This code works fine except I found out that instead of make a whole file into several words(tokens), it made a whole line into a token. "area area" becomes a token, instead of "area" appeared twice. I don't see the error in my codes. I believe maybe it's something wrong with my trim().
Any valuable advices is appreciated. Thank you so much.
Maybe I should use scanner instead?? I'm confused.
I think Scanner is more approprate for this task. As to this code, you should fix regex, try "\\s+";
Try pattern as String pattern = "[^\\w]"; in the same code
I am a newbie to programming in Java. I want to split the paragraphs in one file into sentences and write them in a different file. Also there should be mechanism to identify which sentence comes from which paragraph.The code I have used so far is mentioned below. But this code breaks:
Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.
into
Former Secretary of Finance Dr.
P.B.
Jayasundera is being questioned by the police Financial Crime Investigation Division.
How can I correct it? Thanks in advance.
import java.io.*;
class trial4{
public static void main(String args[]) throws IOException
{
FileReader fr = new FileReader("input.txt");
BufferedReader br = new BufferedReader(fr);
String s;
OutputStream out = new FileOutputStream("output10.txt");
String token[];
while((s = br.readLine()) != null)
{
token = s.split("(?<=[.!?])\\s* ");
for(int i=0;i<token.length;i++)
{
byte buf[]=token[i].getBytes();
for(int j=0;j<buf.length;j=j+1)
{
out.write(buf[j]);
if(j==buf.length-1)
out.write('\n');
}
}
}
fr.close();
}
}
I referenced all the similar questions posted on StackOverFlow. But those answers couldn't help me solve this.
How about using a negative-lookbehind in conjunction with a replace. Simply said: Replace all line endings that don't have "something special" before them with the line end followed by newline.
A list of "known abbreviations" will be needed. There's no guarantee as to how long those can be or how short a word at the end of a line might be. (See? 'be' if quite short already!)
class trial4{
public static void main(String args[]) throws IOException {
FileReader fr = new FileReader("input.txt");
BufferedReader br = new BufferedReader(fr);
PrintStream out = new PrintStream(new FileOutputStream("output10.txt"));
String s = br.readLine();
while(s != null) {
out.print( //Prints newline after each line in any case
s.replaceAll("(?i)" //Make the match case insensitive
+ "(?<!" //Negative lookbehind
+ "(\\W\\w)|" //Single non-word followed by word character (P.B.)
+ "(\\W\\d{1,2})|" //one or two digits (dates!)
+ "(\\W(dr|mr|mrs|ms))" //List of known abbreviations
+ ")" //End of lookbehind
+"([!?\\.])" //Match end-ofsentence
, "$5" //Replace with end-of-sentence found
+System.lineSeparator())); //Add newline if found
s = br.readLine();
}
}
}
As mentioned in the comment "it will be reasonable hard" to break text into paragraphs without formalizing the requirements. Take a look at BreakIterator - especially SentenceInstance. You might roll out your own BreakIterator since it breaks the same as you get with regexp, except that it is more abstract. Or try to find a 3rd party solution like http://deeplearning4j.org/sentenceiterator.html which can be trained to tokenize your input.
Example with BreakIterator:
String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.";
BreakIterator bilus = BreakIterator.getSentenceInstance(Locale.US);
bilus.setText(str);
int last = bilus.first();
int count = 0;
while (BreakIterator.DONE != last)
{
int first = last;
last = bilus.next();
if (BreakIterator.DONE != last)
{
String sentence = str.substring(first, last);
System.out.println("Sentence:" + sentence);
count++;
}
}
System.out.println("" + count + " sentences found.");