Java: how to verify Hebrew text in a letter [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 4 years ago.
I need to verify the Hebrew text of a letter.
The letter's body looks like this:
שלום,
תואם ייעוץ וידאו עם המטופל John Salivan. מועד הייעוץ נקבע לתאריך
23/02/2019 בשעה 20:45.
לביצוע הייעוץ יש להכנס
But my regex doesn't match the text:
public static void findBadLines(String fileName) {
Pattern regexp = Pattern.compile(".*שלום,.*תואם ייעוץ וידאו עם המטופל John Salivan. .*מועד הייעוץ נקבע לתאריך .* בשעה.*..*לביצוע הייעוץ יש להכנס .*");
Matcher matcher = regexp.matcher("");
Path path = Paths.get(fileName);
//another way of getting all the lines:
//Files.readAllLines(path, ENCODING);
try (
BufferedReader reader = Files.newBufferedReader(path, ENCODING);
LineNumberReader lineReader = new LineNumberReader(reader);
){
String line = null;
while ((line = lineReader.readLine()) != null) {
matcher.reset(line); //reset the input
if (!matcher.find()) {
String msg = "Line " + lineReader.getLineNumber() + " is bad: " + line;
throw new IllegalStateException(msg);
}
}
}
catch (IOException ex){
ex.printStackTrace();
}
}
final static Charset ENCODING = StandardCharsets.UTF_8;
}

Do I understand correctly that you want to check whether a given input contains any Hebrew text?
If so, use this regex: .*[\u0590-\u05ff]+.*
[\u0590-\u05ff]+ matches one or more Hebrew characters; the .* before and after it matches the rest of the input.
Applied to your code:
Pattern regexp = Pattern.compile(".*[\u0590-\u05ff]+.*");
//...
matcher.reset(line); //reset the input
if (!matcher.matches()) {
String msg = "Line " + lineReader.getLineNumber() + " is bad: " + line;
throw new IllegalStateException(msg);
}
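If the goal is to verify the whole letter rather than to detect any Hebrew characters, note that the loop in the question tests one line at a time while its pattern spans several lines. The following is a minimal sketch, assuming the letter is available as a single string; the class name, the method names and the shortened letter pattern are illustrative and not taken from the question.

```java
import java.util.regex.Pattern;

class HebrewLetterCheck {
    // Hebrew Unicode block, the same range used in the answer above.
    private static final Pattern HEBREW = Pattern.compile("[\\u0590-\\u05FF]+");

    // DOTALL lets '.' match line terminators, so one pattern can span the
    // whole letter. \s+ accepts either a space or a line break between parts.
    private static final Pattern LETTER = Pattern.compile(
            "שלום,.*"
            + "מועד הייעוץ נקבע לתאריך"
            + "\\s+\\d{2}/\\d{2}/\\d{4}\\s+"
            + "בשעה"
            + "\\s+\\d{2}:\\d{2}\\..*",
            Pattern.DOTALL);

    // True if the text contains at least one Hebrew character.
    static boolean containsHebrew(String text) {
        return HEBREW.matcher(text).find();
    }

    // True if the complete text has the expected greeting, date and time.
    static boolean looksLikeLetter(String wholeText) {
        return LETTER.matcher(wholeText).matches();
    }

    public static void main(String[] args) {
        String body = "שלום,\n"
                + "תואם ייעוץ וידאו עם המטופל John Salivan. מועד הייעוץ נקבע לתאריך\n"
                + "23/02/2019 בשעה 20:45.\n"
                + "לביצוע הייעוץ יש להכנס";
        System.out.println(containsHebrew(body));           // true
        System.out.println(looksLikeLetter(body));          // true
        System.out.println(containsHebrew("John Salivan")); // false
    }
}
```

To use this with a file, the content would have to be read in one piece, for example with new String(Files.readAllBytes(path), StandardCharsets.UTF_8), instead of line by line.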

Related

Regex for replacing Exact String match [duplicate]

My input:
1. end
2. end of the day or end of the week
3. endline
4. something
5. "something" end
Based on the above discussions, if I try to replace a single string using this snippet, it successfully removes the matching words from each line:
public class DeleteTest {
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
String delete="end";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+delete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
}
My output with the above snippet (which is also my expected output):
1.
2. of the day or of the week
3. endline
4. something
5. "something"
But when I want to delete more words and use a Set for that purpose, as in the snippet below:
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
Set<String> toDelete = new HashSet<>();
toDelete.add("end");
toDelete.add("something");
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+toDelete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
I get this output instead (it only removes spaces):
1. end
2. endofthedayorendoftheweek
3. endline
4. something
5. "something" end
Can you help me with this?
You need to create an alternation group out of the set with
String.join("|", toDelete)
and use as
line = line.replaceAll("\\b(?:"+String.join("|", toDelete)+")\\b", "");
The pattern will look like
\b(?:end|something)\b
Here, (?:...) is a non-capturing group: it groups several alternatives without storing the matched text (you do not need the capture since you remove the matches).
Or, better, compile the regex before entering the loop:
Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
line = pat.matcher(line).replaceAll("");
UPDATE:
To match whole "words" that may contain special characters, apply Pattern.quote to each word to escape those characters, and replace the \b boundaries with unambiguous ones: the negative lookbehind (?<!\w) instead of the initial \b, to make sure there is no word character before the match, and the negative lookahead (?!\w) instead of the final \b, to make sure there is none after it.
In Java 8, you may use this code:
Set<String> nToDel = new HashSet<>();
nToDel = toDelete.stream()
.map(Pattern::quote)
.collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";
For a set containing +end and something-, for example, the regex will look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w). Note that the symbols between \Q and \E are parsed as literal symbols.
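Putting the pieces of this answer together, a self-contained sketch might look like the following. The class name WordRemover and the method buildPattern are hypothetical names chosen for illustration; the sample lines come from the question.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WordRemover {
    // Builds (?<!\w)(?:\Qword1\E|\Qword2\E|...)(?!\w) from the given words.
    static Pattern buildPattern(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)              // escape regex metacharacters
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<!\\w)(?:" + alternation + ")(?!\\w)");
    }

    public static void main(String[] args) {
        Set<String> toDelete = new LinkedHashSet<>(Arrays.asList("end", "something"));
        Pattern pat = buildPattern(toDelete); // compiled once, reused for every line

        String[] lines = {
            "1. end",
            "2. end of the day or end of the week",
            "3. endline",
            "4. something",
            "5. \"something\" end"
        };
        for (String line : lines) {
            // "endline" is left alone because 'l' follows "end".
            System.out.println(pat.matcher(line).replaceAll(""));
        }
    }
}
```

Compiling the pattern outside the loop avoids rebuilding it for every line, which is the point of the "compile the regex before entering the loop" advice above.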
The problem is that you're not creating the correct regex for replacing the words in the set.
"\\b"+toDelete+"\\b" will produce this String \b[end, something]\b which is not what you need.
To fix that you can do something like this:
for(String del : toDelete){
line = line.replaceAll("\\b"+del+"\\b", "");
}
What this does is to go through the set, produce a regex from each word and remove that word from the line String.
Another approach will be to produce a single regex from all the words in the set.
Eg:
String regex = "";
for(String word : toDelete){
regex+=(regex.isEmpty() ? "" : "|") + "(\\b"+word+"\\b)";
}
....
line = line.replaceAll(regex, ""); // replaceAll, because replace() treats its argument as literal text, not as a regex
This should produce a regex that looks something like this: (\bend\b)|(\bsomething\b)
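To see why the original attempt removed only the spaces, it helps to print the string that the concatenation actually produces. The sketch below is illustrative (the class and method names are made up); it shows that the set's toString() output turns into a regex character class, which matches single characters rather than whole words.

```java
import java.util.HashSet;
import java.util.Set;

class SetConcatDemo {
    // Reproduces the faulty replacement from the question.
    static String brokenReplace(String line, Set<String> toDelete) {
        // Concatenation calls toDelete.toString(), giving something like
        // \b[end, something]\b (HashSet iteration order is not guaranteed).
        String regex = "\\b" + toDelete + "\\b";
        return line.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        Set<String> toDelete = new HashSet<>();
        toDelete.add("end");
        toDelete.add("something");

        System.out.println("\\b" + toDelete + "\\b");

        // [ ... ] is a character class: one of the letters, a comma or a space.
        // Only a space between two words has a word boundary on both sides,
        // so only those spaces are removed.
        System.out.println(brokenReplace("2. end of the day or end of the week", toDelete));
        // prints: 2. endofthedayorendoftheweek
    }
}
```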

searching in text file specific words using java

I have a huge text file. I'd like to search it for specific words and print the three (or more) words that follow each one. So far I have this:
public static void main(String[] args) {
String fileName = "C:\\Users\\Mishari\\Desktop\\Mesh.txt";
String line = null;
try {
FileReader fileReader =
new FileReader(fileName);
BufferedReader bufferedReader =
new BufferedReader(fileReader);
while((line = bufferedReader.readLine()) != null) {
System.out.println(line);
}
bufferedReader.close();
} catch(FileNotFoundException ex) {
System.out.println(
"Unable to open file '" +
fileName + "'");
} catch(IOException ex) {
System.out.println(
"Error reading file '"
+ fileName + "'");
}
}
It only prints the file. Can you advise me on the best way to do this?
You can look for the index of word in line using this method.
int index = line.indexOf(word);
If the index is -1, the word does not occur in the line.
If it does occur, take the substring of the line from that index to the end of the line.
String nextWords = line.substring(index);
Now use String[] temp = nextWords.split(" ") to get all the words in that substring; temp[0] begins with the searched word itself, so the words after it start at temp[1].
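The steps above can be combined into one helper. This is a sketch under a few assumptions: the words of interest follow the match on the same line, words are separated by whitespace, and only whole-word matches count (so a regex with \b is used instead of indexOf). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordsAfter {
    // For each whole-word occurrence of 'word' in 'line', collects up to n
    // of the whitespace-separated words that follow it on the same line.
    static List<String> wordsAfter(String line, String word, int n) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(line);
        while (m.find()) {
            String rest = line.substring(m.end()).trim();
            if (rest.isEmpty()) {
                continue; // the word was the last thing on the line
            }
            String[] tokens = rest.split("\\s+");
            int count = Math.min(n, tokens.length);
            results.add(String.join(" ", Arrays.copyOfRange(tokens, 0, count)));
        }
        return results;
    }

    public static void main(String[] args) {
        // "sand" contains "and" but is not a whole-word match.
        System.out.println(wordsAfter("the sand and the sea are near", "and", 3));
        // prints: [the sea are]
    }
}
```

Inside the file-reading loop, this helper would be called once per line that has been read.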
while((line = bufferedReader.readLine()) != null) {
    System.out.println(line);
    if (line.contains("YOUR_SPECIFIC_WORDS")) {
        // do what you need here
    }
}
By the sounds of it what you appear to be looking for is a basic Find & Replace All mechanism for each file line that is read in from file. In other words, if the current file line that is read happens to contain the Word or phrase you would like to add words after then replace that found word with the very same word plus the other words you want to add. In a sense it would be something like this:
String line = "This is a file line.";
String find = "file"; // word to find in line
String replaceWith = "file (plus this stuff)"; // the phrase to change the found word to.
line = line.replace(find, replaceWith); // Replace any found words
System.out.println(line);
The console output would be:
This is a file (plus this stuff) line.
The main thing here, though, is that you only want to deal with actual words and not the same character sequence inside another word, for example the word "and" inside the word "sand". The characters that make up the word 'and' are also present in the word 'sand', and therefore it too would be changed by the example code above. The String.contains() method also locates strings this way. In most cases this is undesirable if you want to deal with whole words only, so a simple solution is to use a Regular Expression (RegEx) with the String.replaceAll() method. Using your own code, it would look something like this:
String fileName = "C:\\Users\\Mishari\\Desktop\\Mesh.txt";
String findPhrase = "and"; //Word or phrase to find and replace
String replaceWith = findPhrase + " (adding this)"; // The text used for the replacement.
boolean ignoreLetterCase = false; // Change to true to ignore letter case
String line = "";
try {
FileReader fileReader = new FileReader(fileName);
BufferedReader bufferedReader = new BufferedReader(fileReader);
while ((line = bufferedReader.readLine()) != null) {
if (ignoreLetterCase) {
line = line.toLowerCase();
findPhrase = findPhrase.toLowerCase();
}
if (line.contains(findPhrase)) {
line = line.replaceAll("\\b(" + findPhrase + ")\\b", replaceWith);
}
System.out.println(line);
}
bufferedReader.close();
} catch (FileNotFoundException ex) {
System.out.println("Unable to open file: '" + fileName + "'");
} catch (IOException ex) {
System.out.println("Error reading file: '" + fileName + "'");
}
You will of course notice the escaped \b word boundary Meta Characters within the regular expression used in the String.replaceAll() method specifically in the line:
line = line.replaceAll("\\b(" + findPhrase + ")\\b", replaceWith);
This allows us to deal with whole words only.

Replace specific string by another - String#replaceAll()

I'm currently developing a parser and I'm stuck on one method.
I need to clean specific words out of some sentences, meaning replacing them with a whitespace or a null character.
For now, I have come up with this code:
private void clean(String sentence)
{
List<String> wordList = new ArrayList<String>(); // declared before the try block so the loop below can see it
try {
FileInputStream fis = new FileInputStream(
ConfigHandler.getDefault(DictionaryType.CLEANING).getDictionaryFile());
BufferedReader bis = new BufferedReader(new InputStreamReader(fis));
String read;
while ((read = bis.readLine()) != null) {
wordList.add(read);
}
}
catch (IOException e) {
e.printStackTrace();
}
for (String s : wordList) {
if (StringUtils.containsIgnoreCase(sentence, s)) { // this comes from Apache Lang
sentence = sentence.replaceAll("(?i)" + s + "\\b", " ");
}
}
cleanedList.add(sentence);
}
But when I look at the output, every occurrence of the word in my sentence is replaced by a whitespace, not only the exact-word matches.
Can anybody help me replace only the exact words in my sentence?
Thanks in advance!
There are two problems in your code:
The \b before the word is missing, so the pattern also matches the end of a longer word.
You will run into issues if any of the words from the file contain regex special characters.
To fix both problems, construct your regex as follows:
sentence = sentence.replaceAll("(?i)\\b\\Q" + s + "\\E\\b", " ");
or
sentence = sentence.replaceAll("(?i)\\b" + Pattern.quote(s) + "\\b", " ");
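As a rough illustration of the second form, the sketch below wraps it in a small helper. The class name, the method signature and the sample words are assumptions made for the example; Pattern.CASE_INSENSITIVE is used in place of the inline (?i) flag and has the same effect here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SentenceCleaner {
    // Replaces every whole-word, case-insensitive occurrence of each word
    // with a single space. Pattern.quote makes the word match literally.
    static String clean(String sentence, List<String> words) {
        for (String w : words) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(w) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            sentence = p.matcher(sentence).replaceAll(" ");
        }
        return sentence;
    }

    public static void main(String[] args) {
        // "endline" is kept because "end" is not a whole word there.
        System.out.println(clean("The END of the endline", Arrays.asList("end")));
        // Without quoting, the '.' in "a.b" would also match the 'x' in "axb".
        System.out.println(clean("a.b axb", Arrays.asList("a.b")));
    }
}
```

Note that \b only helps when the word starts and ends with a word character; for entries that begin or end with punctuation, lookarounds such as (?<!\w) and (?!\w) are the usual alternative.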

Reading an array from file. (java)

Hello, this is my code for reading from a file:
case 11: {
String line;
String temp[];
System.out.println("Enter the name of the file to read the playlist from.");
nazwa11 = odczyt.next();
try {
FileReader fileReader = new FileReader(nazwa11);
BufferedReader bufferedReader = new BufferedReader(fileReader);
playlists.add(new Playlist(bufferedReader.readLine()));
x++;
while((line = bufferedReader.readLine())!=null){
String delimiter = "|";
temp = line.split(delimiter);
int rok;
rok = Integer.parseInt(temp[2]);
playlists.get(x).dodajUtwor(temp[0], temp[1], rok);
}
bufferedReader.close();
} catch (FileNotFoundException ex) {
System.out.println("File not found: '" + nazwa11 + "'");
} catch (IOException ex) {
System.out.println("Error reading file '" + nazwa11 + "'");
}
break;
}
Example file looks like this:
Pop
Test|Test|2010
Test1|Test1|2001
It gives me this error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "s"
Why doesn't line.split split when it finds "|"? I guess it splits into t-e-s. Any tips?
String.split() treats its argument as a regular expression, and the pipe character "|" is one of the metacharacters that carry a special meaning there: it denotes alternation.
An unescaped "|" on its own matches the empty string, so the line is split between every character, which is why temp[2] is "s".
So, in your program, modify the following line,
String delimiter = "|";
to
String delimiter = "\\|";
This will give you the result that you want.
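The difference can be checked with a few lines of code. This is only a sketch; the class name PipeSplit and the helper parseYear are made up for the illustration, and the sample line is taken from the question's file.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PipeSplit {
    // Splits "title|artist|year" on a literal pipe and returns the year.
    static int parseYear(String line) {
        String[] temp = line.split("\\|");
        return Integer.parseInt(temp[2]);
    }

    public static void main(String[] args) {
        String line = "Test|Test|2010";

        // Unescaped: "|" is an empty alternation, so the line is split
        // between every character.
        System.out.println(Arrays.toString(line.split("|")));

        // Escaped: splits on the literal pipe character.
        System.out.println(Arrays.toString(line.split("\\|")));
        // prints: [Test, Test, 2010]

        // Pattern.quote is an alternative to escaping by hand.
        System.out.println(Arrays.toString(line.split(Pattern.quote("|"))));
        // prints: [Test, Test, 2010]

        System.out.println(parseYear(line)); // prints: 2010
    }
}
```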

Parsing txt file

I have to write a program that parses baseball player info (hits, outs, walks, etc.) from a txt file. For example, the txt file may look something like this:
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
Jill Jenks,o,o,s,h,h,o,o
Will Jones,o,o,w,h,o,o,o,o,w,o,o
I know how to parse the file and have that code running perfectly. The only problem is that we should print only the name of each player and 3 of their plays. For example:
Sam Slugger hit,hit,out
Jill Jenks out, out, sacrifice fly
Will Jones out, out, walk
I am not sure how to limit this. Every time I try to cut it off at 3, the first player works fine, but then the loop breaks and nothing is printed for the other players.
This is what I have so far:
import java.util.Scanner;
import java.io.*;
public class ReadBaseBall{
public static void main(String args[]) throws IOException{
int count=0;
String playerData;
Scanner fileScan, urlScan;
String fileName = "C:\\Users\\Crust\\Documents\\java\\TeamStats.txt";
fileScan = new Scanner(new File(fileName));
while(fileScan.hasNext()){
playerData = fileScan.nextLine();
fileScan.useDelimiter(",");
//System.out.println("Name: " + playerData);
urlScan = new Scanner(playerData);
urlScan.useDelimiter(",");
for(urlScan.hasNext(); count<4; count++)
System.out.print(" " + urlScan.next() + ",");
System.out.println();
}
}
}
This prints out:
Sam Slugger, h, h, o,
but then the other players are left blank. I need help getting the others to print as well.
Here, try this approach using FileReader.
Assuming your file content is formatted like this,
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
Jill Johns,h,h,o,s,w,w,h,w,o,o,o,h,s
with each player on his or her own line, this can work for you:
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(new File("file.txt")));
String line = "";
while ((line = reader.readLine()) != null) {
String[] values_per_line = line.split(",");
System.out.println("Name:" + values_per_line[0] + " "
+ values_per_line[1] + " " + values_per_line[2] + " "
+ values_per_line[3]);
// no extra readLine() here: the while condition already advances to the next line
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
Otherwise, if all the players are on a single line separated by | (which would not make much sense), modify the sample like this:
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s| John Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(new File("file.txt")));
String line = "";
while ((line = reader.readLine()) != null) {
// players are separated by '|', which must be escaped because split() takes a regex
String[] data = line.trim().split("\\|");
for (int i = 0; i < data.length; i++) {
String[] fields = data[i].trim().split(",");
System.out.println("Name:" + fields[0] + " "
+ fields[1] + " " + fields[2] + " "
+ fields[3]);
}
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
You need to reset your count variable inside the while loop:
while(fileScan.hasNext()){
count = 0;
...
}
First Problem
Change while(fileScan.hasNext()) to while(fileScan.hasNextLine()). This is not a breaking problem, but with a Scanner you should pair each hasX() check with the matching nextX() call.
Second Problem
Remove the line fileScan.useDelimiter(","). It has no effect on Scanner.nextLine(), but it replaces the default delimiter so the scanner no longer splits on whitespace, which can have some nasty side effects later on.
Third Problem
Change the line for(urlScan.hasNext(); count<4; count++) to while(urlScan.hasNext()). The original line does compile, but hasNext() is called only once, as the loop initializer, and its result is discarded. The loop therefore just calls next() until count reaches 4 and, because count is never reset, prints nothing for the later lines.
If you want to limit the amount processed for each line you can replace it with
for( int count = 0; count < limit && urlScan.hasNext( ); count++ )
This will limit the amount read to limit while still handling lines that have less data than the limit.
Make sure that each of your data sets is separated by a line otherwise the output might not make much sense.
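A compact sketch of the per-line logic with such a limit is shown below. The class name, the summarize method and the code-to-name map are assumptions for illustration; the meanings of h, o, s and w are inferred from the expected output in the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

class FirstPlays {
    // Play codes as they appear to be used in the question's sample output.
    private static final Map<String, String> PLAY_NAMES = new HashMap<>();
    static {
        PLAY_NAMES.put("h", "hit");
        PLAY_NAMES.put("o", "out");
        PLAY_NAMES.put("s", "sacrifice fly");
        PLAY_NAMES.put("w", "walk");
    }

    // Returns the player's name followed by at most 'limit' plays.
    static String summarize(String playerData, int limit) {
        try (Scanner fields = new Scanner(playerData)) {
            fields.useDelimiter(",");
            if (!fields.hasNext()) {
                return ""; // empty line
            }
            String name = fields.next();
            List<String> plays = new ArrayList<>();
            // count is local to this call, so it starts at 0 for every player
            for (int count = 0; count < limit && fields.hasNext(); count++) {
                String code = fields.next().trim();
                plays.add(PLAY_NAMES.getOrDefault(code, code));
            }
            return name + " " + String.join(", ", plays);
        }
    }

    public static void main(String[] args) {
        System.out.println(summarize("Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s", 3));
        // prints: Sam Slugger hit, hit, out
        System.out.println(summarize("Jill Jenks,o,o,s,h,h,o,o", 3));
        // prints: Jill Jenks out, out, sacrifice fly
    }
}
```

In the original program, summarize would be called once per line returned by fileScan.nextLine().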
You shouldn't have multiple scanners on this - assuming the format you posted in your question you can use regular expressions to do this.
This demonstrates a regular expression to match a player and to use as a delimiter for the scanner. I fed the scanner in my example a string, but the technique is the same regardless of source.
int count = 0;
Pattern playerPattern = Pattern.compile("\\w+\\s\\w+(?:,\\w){1,3}");
Scanner fileScan = new Scanner("Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s Jill Jenks,o,o,s,h,h,o,o Will Jones,o,o,w,h,o,o,o,o,w,o,o");
fileScan.useDelimiter("(?<=,\\w)\\s");
while (fileScan.hasNext()){
String player = fileScan.next();
Matcher m = playerPattern.matcher(player);
if (m.find()) {
player = m.group(0);
} else {
throw new InputMismatchException("Players data not in expected format on string: " + player);
}
System.out.println(player);
count++;
}
System.out.printf("%d players found.", count);
Output:
Sam Slugger,h,h,o
Jill Jenks,o,o,s
Will Jones,o,o,w
The call to Scanner.useDelimiter() sets the delimiter used for retrieving tokens. The regex (?<=,\\w)\\s breaks down as:
(?<= // positive lookbehind
,\w // literal comma, word character
)
\s // whitespace character
This splits the players on the space between their entries and consumes only that space. The space inside a name is not preceded by a comma and a word character, so it is not treated as a delimiter.
The regular expression used to extract up to 3 plays per player is \\w+\\s\\w+(?:,\\w){1,3}:
\w+ // one or more word characters (first name)
\s // a single whitespace character
\w+ // one or more word characters (last name)
(?: // begin non-capturing group
,\w // literal comma, word character
){1,3} // match the non-capturing group 1 to 3 times
