Tokenize Arabic text files java

Tokenize Arabic text files java - java

I am trying to tokenize some text files into words and I write this code, It works perfect in English and when I try it in Arabic it did not work.
I added the UTF-8 to read Arabic files. did I miss something
public void parseFiles(String filePath) throws FileNotFoundException, IOException {
File[] allfiles = new File(filePath).listFiles();
BufferedReader in = null;
for (File f : allfiles) {
if (f.getName().endsWith(".txt")) {
fileNameList.add(f.getName());
Reader fstream = new InputStreamReader(new FileInputStream(f),"UTF-8");
// BufferedReader br = new BufferedReader(fstream);
in = new BufferedReader(fstream);
StringBuilder sb = new StringBuilder();
String s=null;
String word = null;
while ((s = in.readLine()) != null) {
Scanner input = new Scanner(s);
while(input.hasNext()) {
word = input.next();
if(stopword.isStopword(word)==true)
{
word= word.replace(word, "");
}
//String stemmed=stem.stem (word);
sb.append(word+"\t");
}
//System.out.print(sb); ///here the arabic text is outputed without stopwords
}
String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+"); //to get individual terms
for (String term : tokenizedTerms) {
if (!allTerms.contains(term)) { //avoid duplicate entry
allTerms.add(term);
System.out.print(term+"\t"); //here the problem.
}
}
termsDocsArray.add(tokenizedTerms);
}
}
}
Please any ideas to help me proceed.
Thanks

The problem lies with your regex which will work well for English but not for Arabic because by definition
[\\W&&[^\\s]
means
// returns true if the string contains a arbitrary number of non-characters except whitespace.
\W A non-word character other than [a-zA-Z_0-9]. (Arabic chars all satisfy this condition.)
\s A whitespace character, short for [ \t\n\x0b\r\f]
So, by this logic, all chars of Arabic will be selected by this regex. So, when you give
sb.toString().replaceAll("[\\W&&[^\\s]]", "")
it will mean, replace all non word character which is not a space with "". Which in case of Arabic, is all characters. Thus you will get a problem that all Arabic chars are replaced by "". Hence no output will come. You will have to tweak this regex to work for Arabic text or just split the string with space like
sb.toString().split("\\s+")
which will give you the Arabic words array separated by space.

In addition to worrying about character encoding as in bgth's response, tolkenizing Arabic has an added complication that words are not nessisarily white space separated:
http://www1.cs.columbia.edu/~rambow/papers/habash-rambow-2005a.pdf
If you're not familiar with the Arabic, you'll need to read up on some of the methods regarding tolkenization:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.9748

Related

Regex for replacing Exact String match [duplicate]

My input:
1. end
2. end of the day or end of the week
3. endline
4. something
5. "something" end
Based on the above discussions, If I try to replace a single string using this snippet, it removes the appropriate words from the line successfully
public class DeleteTest {
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
String delete="end";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+delete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
}
My output If I use the above snippet:(Also my expected output)
1.
2. of the day or of the week
3. endline
4. something
5. "something"
But when I include more words to delete, and for that purpose when I use Set, I use the below code snippet:
public static void main(String[] args) {
// TODO Auto-generated method stub
try {
File file = new File("C:/Java samples/myfile.txt");
File temp = File.createTempFile("myfile1", ".txt", file.getParentFile());
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(temp)));
Set<String> toDelete = new HashSet<>();
toDelete.add("end");
toDelete.add("something");
for (String line; (line = reader.readLine()) != null;) {
line = line.replaceAll("\\b"+toDelete+"\\b", "");
writer.println(line);
}
reader.close();
writer.close();
}
catch (Exception e) {
System.out.println("Something went Wrong");
}
}
I get my output as: (It just removes the space)
1. end
2. endofthedayorendoftheweek
3. endline
4. something
5. "something" end
Can u guys help me on this?
Click here to follow the thread

You need to create an alternation group out of the set with
String.join("|", toDelete)
and use as
line = line.replaceAll("\\b(?:"+String.join("|", toDelete)+")\\b", "");
The pattern will look like
\b(?:end|something)\b
See the regex demo. Here, (?:...) is a non-capturing group that is used to group several alternatives without creating a memory buffer for the capture (you do not need it since you remove the matches).
Or, better, compile the regex before entering the loop:
Pattern pat = Pattern.compile("\\b(?:" + String.join("|", toDelete) + ")\\b");
...
line = pat.matcher(line).replaceAll("");
UPDATE:
To allow matching whole "words" that may contain special chars, you need to Pattern.quote those words to escape those special chars, and then you need to use unambiguous word boundaries, (?<!\w) instead of the initial \b to make sure there is no word char before and (?!\w) negative lookahead instead of the final \b to make sure there is no word char after the match.
In Java 8, you may use this code:
Set<String> nToDel = new HashSet<>();
nToDel = toDelete.stream()
.map(Pattern::quote)
.collect(Collectors.toCollection(HashSet::new));
String pattern = "(?<!\\w)(?:" + String.join("|", nToDel) + ")(?!\\w)";
The regex will look like (?<!\w)(?:\Q+end\E|\Qsomething-\E)(?!\w). Note that the symbols between \Q and \E is parsed as literal symbols.

The problem is that you're not creating the correct regex for replacing the words in the set.
"\\b"+toDelete+"\\b" will produce this String \b[end, something]\b which is not what you need.
To fix that you can do something like this:
for(String del : toDelete){
line = line.replaceAll("\\b"+del+"\\b", "");
}
What this does is to go through the set, produce a regex from each word and remove that word from the line String.
Another approach will be to produce a single regex from all the words in the set.
Eg:
String regex = "";
for(String word : toDelete){
regex+=(regex.isEmpty() ? "" : "|") + "(\\b"+word+"\\b)";
}
....
line = line.replace(regex, "");
This should produce a regex that looks something like this: (\bend\b)|(\bsomething\b)

How to split a file into several tokens

I was trying to tokenize an input file from sentences into tokens(words).
For example,
"This is a test file." into five words "this" "is" "a" "test" "file", omitting the punctuations and the white spaces. And store them into an arraylist.
I tried to write some codes like this:
public static ArrayList<String> tokenizeFile(File in) throws IOException {
String strLine;
String[] tokens;
//create a new ArrayList to store tokens
ArrayList<String> tokenList = new ArrayList<String>();
if (null == in) {
return tokenList;
} else {
FileInputStream fStream = new FileInputStream(in);
DataInputStream dataIn = new DataInputStream(fStream);
BufferedReader br = new BufferedReader(new InputStreamReader(dataIn));
while (null != (strLine = br.readLine())) {
if (strLine.trim().length() != 0) {
//make sure strings are independent of capitalization and then tokenize them
strLine = strLine.toLowerCase();
//create regular expression pattern to split
//first letter to be alphabetic and the remaining characters to be alphanumeric or '
String pattern = "^[A-Za-z][A-Za-z0-9'-]*$";
tokens = strLine.split(pattern);
int tokenLen = tokens.length;
for (int i = 1; i <= tokenLen; i++) {
tokenList.add(tokens[i - 1]);
}
}
}
br.close();
dataIn.close();
}
return tokenList;
}
This code works fine except I found out that instead of make a whole file into several words(tokens), it made a whole line into a token. "area area" becomes a token, instead of "area" appeared twice. I don't see the error in my codes. I believe maybe it's something wrong with my trim().
Any valuable advices is appreciated. Thank you so much.
Maybe I should use scanner instead?? I'm confused.

I think Scanner is more approprate for this task. As to this code, you should fix regex, try "\\s+";

Try pattern as String pattern = "[^\\w]"; in the same code

Java String - See if a string contains only numbers and characters not words?

I have an array of string that I load throughout my application, and it contains different words. I have a simple if statement to see if it contains letters or numbers but not words .
I mean i only want those words which is like AB2CD5X .. and i want to remove all other words like Hello 3 , 3 word , any other words which is a word in English. Is it possible to filter only alphaNumeric words except those words which contain real grammar word.
i know how to check whether string contains alphanumeric words
Pattern p = Pattern.compile("[\\p{Alnum},.']*");
also know
if(string.contains("[a-zA-Z]+") || string.contains([0-9]+])

What you need is a dictionary of English words. Then you basically scan your input and check if each token exists in your dictionary.
You can find text files of dictionary entries online, such as in Jazzy spellchecker. You might also check Dictionary text file.
Here is a sample code that assumes your dictionary is a simple text file in UTF-8 encoding with exactly one (lower case) word per line:
public static void main(String[] args) throws IOException {
final Set<String> dictionary = loadDictionary();
final String text = loadInput();
final List<String> output = new ArrayList<>();
// by default splits on whitespace
final Scanner scanner = new Scanner(text);
while(scanner.hasNext()) {
final String token = scanner.next().toLowerCase();
if (!dictionary.contains(token)) output.add(token);
}
System.out.println(output);
}
private static String loadInput() {
return "This is a 5gse5qs sample f5qzd fbswx test";
}
private static Set<String> loadDictionary() throws IOException {
final File dicFile = new File("path_to_your_flat_dic_file");
final Set<String> dictionaryWords = new HashSet<>();
String line;
final LineNumberReader reader = new LineNumberReader(new BufferedReader(new InputStreamReader(new FileInputStream(dicFile), "UTF-8")));
try {
while ((line = reader.readLine()) != null) dictionaryWords.add(line);
return dictionaryWords;
}
finally {
reader.close();
}
}
If you need more accurate results, you need to extract stems of your words. See Apache's Lucene and EnglishStemmer

You can use Cambridge Dictionaries to verify human words. In this case, if you find a "human valid" word you can skip it.
As the documentation says, to use the library, you need to initialize a request handler and an API object:
DefaultHttpClient httpClient = new DefaultHttpClient(new ThreadSafeClientConnManager());
SkPublishAPI api = new SkPublishAPI(baseUrl + "/api/v1", accessKey, httpClient);
api.setRequestHandler(new SkPublishAPI.RequestHandler() {
public void prepareGetRequest(HttpGet request) {
System.out.println(request.getURI());
request.setHeader("Accept", "application/json");
}
});
To use the "api" object:
try {
System.out.println("*** Dictionaries");
JSONArray dictionaries = new JSONArray(api.getDictionaries());
System.out.println(dictionaries);
JSONObject dict = dictionaries.getJSONObject(0);
System.out.println(dict);
String dictCode = dict.getString("dictionaryCode");
System.out.println("*** Search");
System.out.println("*** Result list");
JSONObject results = new JSONObject(api.search(dictCode, "ca", 1, 1));
System.out.println(results);
System.out.println("*** Spell checking");
JSONObject spellResults = new JSONObject(api.didYouMean(dictCode, "dorg", 3));
System.out.println(spellResults);
System.out.println("*** Best matching");
JSONObject bestMatch = new JSONObject(api.searchFirst(dictCode, "ca", "html"));
System.out.println(bestMatch);
System.out.println("*** Nearby Entries");
JSONObject nearbyEntries = new JSONObject(api.getNearbyEntries(dictCode,
bestMatch.getString("entryId"), 3));
System.out.println(nearbyEntries);
} catch (Exception e) {
e.printStackTrace();
}

Antlr might help you.
Antlr stands for ANother Tool for Language Recognition
Hibernate uses ANTLR to parse its query language HQL(like SELECT,FROM).

if(string.contains("[a-zA-Z]+") || string.contains([0-9]+])
I think this is a good starting point, but since you're looking for strings that contain both letters and numbers you might want:
if(string.contains("[a-zA-Z]+") && string.contains([0-9]+])
I guess you might also want to check if there are spaces? Right? Because you that could indicate that there are separate words or some sequence like 3 word. So maybe in the end you could use:
if(string.contains("[a-zA-Z]+") && string.contains([0-9]+] && !string.contains(" "))
Hope this helps

You may try this,
First tokenize the string using StringTokenizer with default delimiter, for each token if it contains only digits or only characters, discard it, remaining will be the words which contains combination of both digits and characters. For identifying only digits only characters you can have regular expressions used.

reading character like ö and ü from file in eclipse

I have a input file which contains some words like bört and übuk.When I read this line based on the following code I got these strange results. How can I solve it?
String line = bufferedReader.readLine();
if (line == null) { break; }
String[] words = line.split("\\W+");
for (String word : words) {
System.out.println(word);
output is
b
rt
and
buk

Try to create a BufferedReader handling UTF8 characters encoding :
FileInputStream fis = new FileInputStream(new File("someFile.txt"));
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
BufferedReader bufferedReader = new BufferedReader(isr);

It seems that your problem is that standard character class \\W is negation of \\w which represents only [a-zA-Z0-9_] characters, so split("\\W+") will split on every character which is not in this character class like in your case ö, ü.
To solve this problem and include also Unicode characters you can compile your regex with Pattern.UNICODE_CHARACTER_CLASS flag which enables the Unicode version of Predefined character classes and POSIX character classes. To use this flag you can add (?U)at start of used regex
String[] words = line.split("(?U)\\W+");
Demo:
String line = "bört and übuk";
String[] words = line.split("(?U)\\W+");
for (String word : words)
System.out.println(word);
Output:
bört
and
übuk

You need something like this :-
BufferedReader bufferReader = new BufferedReader(
new InputStreamReader(new FileInputStream(fileDir), "UTF-8"));
Here instead of UTF-8 , you can put the encoding you need to support while reading the file

Split text file into Strings on empty line

I want to read a local txt file and read the text in this file. After that i want to split this whole text into Strings like in the example below .
Example :
Lets say file contains-
abcdef
ghijkl
aededd
ededed
ededfe
efefeef
efefeff
......
......
I want to split this text in to Strings
s1 = abcdef+"\n"+ghijkl;
s2 = aededd+"\n"+ededed;
s3 = ededfe+"\n"+efefeef+"\n"+efefeff;
........................
I mean I want to split text on empty line.
I do know how to read a file. I want help in splitting the text in to strings

you can split a string to an array by
String.split();
if you want it by new lines it will be
String.split("\\n\\n");
UPDATE*
If I understand what you are saying then john.
then your code will essentially be
BufferedReader in
= new BufferedReader(new FileReader("foo.txt"));
List<String> allStrings = new ArrayList<String>();
String str ="";
while(true)
{
String tmp = in.readLine();
if(tmp.isEmpty())
{
if(!str.isEmpty())
{
allStrings.add(str);
}
str= "";
}
else if(tmp==null)
{
break;
}
else
{
if(str.isEmpty())
{
str = tmp;
}
else
{
str += "\\n" + tmp;
}
}
}
Might be what you are trying to parse.
Where allStrings is a list of all of your strings.

The below code would work even if there are more than 2 empty lines between useful data.
import java.util.regex.*;
// read your file and store it in a string named str_file_data
Pattern p = Pattern.compile("\\n[\\n]+"); /*if your text file has \r\n as the newline character then use Pattern p = Pattern.compile("\\r\\n[\\r\\n]+");*/
String[] result = p.split(str_file_data);
(I did not test the code so there could be typos.)

I would suggest more general regexp:
text.split("(?m)^\\s*$");
In this case it would work correctly on any end-of-line convention, and also would treat the same empty and blank-space-only lines.

It may depend on how the file is encoded, so I would likely do the following:
String.split("(\\n\\r|\\n|\\r){2}");
Some text files encode newlines as "\n\r" while others may be simply "\n". Two new lines in a row means you have an empty line.

Godwin was on the right track, but I think we can make this work a bit better. Using the '[ ]' in regx is an or, so in his example if you had a \r\n that would just be a new line not an empty line. The regular expression would split it on both the \r and the \n, and I believe in the example we were looking for an empty line which would require a either a \n\r\n\r, a \r\n\r\n, a \n\r\r\n, a \r\n\n\r, or a \n\n or a \r\r
So first we want to look for either \n\r or \r\n twice, with any combination of the two being possible.
String.split(((\\n\\r)|(\\r\\n)){2}));
next we need to look for \r without a \n after it
String.split(\\r{2});
lastly, lets do the same for \n
String.split(\\n{2});
And all together that should be
String.split("((\\n\\r)|(\\r\\n)){2}|(\\r){2}|(\\n){2}");
Note, this works only on the very specific example of using new lines and character returns. I in ruby you can do the following which would encompass more cases. I don't know if there is an equivalent in Java.
.match($^$)

#Kevin code works fine and as he mentioned that the code was not tested, here are the 3 changes required:
1.The if check for (tmp==null) should come first, otherwise there will be a null pointer exception.
2.This code leaves out the last set of lines being added to the ArrayList. To make sure the last one gets added, we have to include this code after the while loop: if(!str.isEmpty()) { allStrings.add(str); }
3.The line str += "\n" + tmp; should be changed to use \n instead if \\n. Please see the end of this thread, I have added the entire code so that it can help
BufferedReader in
= new BufferedReader(new FileReader("foo.txt"));
List<String> allStrings = new ArrayList<String>();
String str ="";
List<String> allStrings = new ArrayList<String>();
String str ="";
while(true)
{
String tmp = in.readLine();
if(tmp==null)
{
break;
}else if(tmp.isEmpty())
{
if(!str.isEmpty())
{
allStrings.add(str);
}
str= "";
}else
{
if(str.isEmpty())
{
str = tmp;
}
else
{
str += "\n" + tmp;
}
}
}
if(!str.isEmpty())
{
allStrings.add(str);
}

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Tokenize Arabic text files java - java

Related

Regex for replacing Exact String match [duplicate]

How to split a file into several tokens

Java String - See if a string contains only numbers and characters not words?

reading character like ö and ü from file in eclipse

Split text file into Strings on empty line

Categories

Resources