Remove accents from string: java Normalizer

I have a CSV file containing some French words (with accents). I want to read this file using Java and convert the accented letters to non-accented letters. For example, é should be read as e. I have tried the following:
CSVReader reader = new CSVReader(new FileReader(file));
String[] line;
while ((line = reader.readNext()) != null) {
    line[0] = Normalizer.normalize(line[0], Normalizer.Form.NFD)
            .replaceAll("[^\\p{ASCII}]", "").replaceAll("[^a-zA-Z0-9:_']", "_");
    System.out.println("LINE[0]: " + line[0]);
}
Suppose the file contains the line "Arts_et_Métiers": the output is "Arts_et_MAtiers", where the accented letter is replaced by 'A' and not 'e'. Am I doing something wrong? Any help will be appreciated.
Thanks.
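One likely explanation, which is an assumption since the thread does not state the file's encoding: the file is UTF-8, but FileReader decodes it with the platform default (for example ISO-8859-1 or windows-1252). The two bytes of é are then read as "Ã©"; NFD splits Ã into A plus a combining tilde, and the ASCII filter drops the tilde and the ©, which leaves "A". The sketch below reproduces that effect and shows the result when the bytes are decoded as UTF-8. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class AccentStripper {

    // Decompose accented letters (NFD), then drop the combining marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String original = "Arts_et_M\u00e9tiers"; // \u00e9 is é
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Simulates reading a UTF-8 file with the wrong charset.
        String misdecoded = new String(utf8, StandardCharsets.ISO_8859_1);
        // Decoding with the charset the bytes were written in.
        String decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(stripAccents(misdecoded).replaceAll("[^\\p{ASCII}]", ""));
        System.out.println(stripAccents(decoded));
    }
}
```

If this is the cause, passing new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8) to CSVReader instead of a FileReader should be enough; the Normalizer call itself can stay as it is.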

Related

Strange char after reading from txt file

I have a txt file with three rows of integers. After adding them to a List, I find a strange char at the beginning of the first element. I used an InputStream, BufferedReader and StringBuilder to read from the file. I tried to debug using println() statements in several places, but I still can't figure out where that char came from.
File selectedFile = fileChooser.getSelectedFile();
inputStream = new FileInputStream(selectedFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
StringBuilder out = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    out.append(line);
    items.add(line);
}
When I try to copy the output from printing the List items into this post, the char I'm talking about does not show, so I'll post screenshots instead:
http://imgur.com/gjaF3no
http://imgur.com/JHAH6mV
The first shows the entire list, and the second should show the char I'm talking about more clearly; it looks like a dot before "3". Any help would be appreciated. Thank you.
You can try removing all control characters (strange characters) by doing the following:
strangeString = strangeString.replaceAll("\\p{Cntrl}", ""); // Strings are immutable, so keep the result
Reference: Java - removing strange characters from a String
Thank you all for the help. The problem was actually in the original txt file like #coder
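A common source of an invisible character at the very start of a file is a byte order mark (BOM, U+FEFF) written by some editors. Whether that was the case here is an assumption; the thread only says the problem was in the original file. Note that \p{Cntrl} does not match U+FEFF, because it is a format character rather than a control character. A minimal sketch of removing it from the first line, assuming the reader decodes the file as UTF-8 (the class and method names are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class BomStripper {

    private static final char BOM = '\uFEFF';

    // Returns the line without a leading byte order mark, if there is one.
    static String stripBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the file; the first line starts with a BOM.
        BufferedReader reader = new BufferedReader(new StringReader("\uFEFF3 4 5\n6 7 8\n"));
        List<String> items = new ArrayList<>();
        String line;
        boolean first = true;
        while ((line = reader.readLine()) != null) {
            // Only the first line of a file can carry the BOM.
            items.add(first ? stripBom(line) : line);
            first = false;
        }
        System.out.println(items);
    }
}
```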

I am working with the Marathi language in my NLP-based project

I am using Marathi Wordnet. This wordnet contains text documents with Marathi words.
I want to read these Marathi documents in my Java code. I have tried using BufferedReader and FileReader, but it failed.
This is the code I have tried:
FileReader fr = new FileReader("onto_txt");
BufferedReader br = new BufferedReader(fr);
String line = br.readLine();
while (line != null) {
    System.out.println(line);
    line = br.readLine();
}
fr.close();
br.close();
FileReader is an old utility class that uses the default encoding of the platform.
Assuming that the file is in UTF-8, it is better to specify the encoding explicitly.
try (BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream("C:/xyz/onto_txt"), StandardCharsets.UTF_8))) {
    String line = br.readLine();
    while (line != null) {
        System.out.println(line);
        System.out.println(Arrays.toString(line.getBytes(StandardCharsets.UTF_8)));
        line = br.readLine();
    }
} // Closes br
Using System.out converts the line back to the encoding of the platform, which might not be able to display the string; hence the dump of every single byte. It is not very informative, but it can confirm that where ? is displayed in the preceding line, there really are Unicode characters.
Internally, a Java String holds Unicode and can contain any text, so you can process line as desired inside the while loop.
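If the console itself can show UTF-8, one option is to wrap System.out in a PrintStream with an explicit encoding instead of relying on the platform default. This is only a sketch: whether the text actually renders depends on the terminal and its font, which the code cannot control. The class and helper names are made up for illustration.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {

    // Wraps a stream so that printed text is encoded as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // The word "Marathi" in Devanagari, written with escapes so the
        // source file's own encoding does not matter.
        out.println("\u092e\u0930\u093e\u0920\u0940");
    }
}
```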

reading character like ö and ü from file in eclipse

I have an input file which contains some words like bört and übuk. When I read a line with the following code, I get these strange results. How can I solve it?
String line = bufferedReader.readLine();
if (line == null) { break; }
String[] words = line.split("\\W+");
for (String word : words) {
    System.out.println(word);
The output is:
b
rt
and
buk
Try creating a BufferedReader that decodes the file as UTF-8:
FileInputStream fis = new FileInputStream(new File("someFile.txt"));
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
BufferedReader bufferedReader = new BufferedReader(isr);
It seems that your problem is that the predefined character class \\W is the negation of \\w, which by default represents only the characters [a-zA-Z0-9_]. So split("\\W+") splits on every character that is not in this class, which in your case includes ö and ü.
To solve this and treat Unicode letters as word characters, you can compile your regex with the Pattern.UNICODE_CHARACTER_CLASS flag, which enables the Unicode version of predefined character classes and POSIX character classes. To use this flag inline, add (?U) at the start of the regex:
String[] words = line.split("(?U)\\W+");
Demo:
String line = "bört and übuk";
String[] words = line.split("(?U)\\W+");
for (String word : words)
    System.out.println(word);
Output:
bört
and
übuk
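The same flag can also be passed to Pattern.compile instead of being embedded as (?U), which is convenient when the pattern is reused for many lines. A small sketch (the class and method names are made up for illustration):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class UnicodeSplit {

    // Compiled once; the flag makes \W treat letters such as ö and ü as word characters.
    private static final Pattern NON_WORD =
            Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    static String[] words(String line) {
        return NON_WORD.split(line);
    }

    public static void main(String[] args) {
        // \u00f6 is ö and \u00fc is ü
        System.out.println(Arrays.toString(words("b\u00f6rt and \u00fcbuk")));
    }
}
```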
You need something like this:
BufferedReader bufferReader = new BufferedReader(
new InputStreamReader(new FileInputStream(fileDir), "UTF-8"));
Instead of UTF-8, you can put whichever encoding you need to support when reading the file.

Tokenize Arabic text files java

I am trying to tokenize some text files into words, so I wrote this code. It works perfectly for English, but when I try it on Arabic it does not work.
I specified UTF-8 to read the Arabic files. Did I miss something?
public void parseFiles(String filePath) throws FileNotFoundException, IOException {
    File[] allfiles = new File(filePath).listFiles();
    BufferedReader in = null;
    for (File f : allfiles) {
        if (f.getName().endsWith(".txt")) {
            fileNameList.add(f.getName());
            Reader fstream = new InputStreamReader(new FileInputStream(f), "UTF-8");
            // BufferedReader br = new BufferedReader(fstream);
            in = new BufferedReader(fstream);
            StringBuilder sb = new StringBuilder();
            String s = null;
            String word = null;
            while ((s = in.readLine()) != null) {
                Scanner input = new Scanner(s);
                while (input.hasNext()) {
                    word = input.next();
                    if (stopword.isStopword(word) == true) {
                        word = word.replace(word, "");
                    }
                    //String stemmed=stem.stem (word);
                    sb.append(word + "\t");
                }
                //System.out.print(sb); ///here the arabic text is outputed without stopwords
            }
            String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+"); //to get individual terms
            for (String term : tokenizedTerms) {
                if (!allTerms.contains(term)) { //avoid duplicate entry
                    allTerms.add(term);
                    System.out.print(term + "\t"); //here the problem.
                }
            }
            termsDocsArray.add(tokenizedTerms);
        }
    }
}
Any ideas to help me proceed would be appreciated.
Thanks
The problem lies with your regex, which works well for English but not for Arabic, because by definition
[\\W&&[^\\s]]
means
// matches any single non-word character except whitespace
\W A non-word character: anything other than [a-zA-Z_0-9]. (All Arabic characters satisfy this condition.)
\s A whitespace character, short for [ \t\n\x0b\r\f]
So, by this logic, every Arabic character is matched by this regex. When you call
sb.toString().replaceAll("[\\W&&[^\\s]]", "")
it means: replace every non-word character that is not whitespace with "". For Arabic text, that is every character, so all Arabic characters are replaced by "" and no output appears. You will have to tweak this regex to work for Arabic text, or just split the string on whitespace:
sb.toString().split("\\s+")
which will give you the Arabic words array separated by space.
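One possible way to tweak the regex, shown here only as a sketch: use Unicode property classes so that letters (\p{L}), combining marks such as Arabic diacritics (\p{M}) and decimal digits (\p{Nd}) are kept, and only punctuation and symbols are removed before splitting on whitespace. The class and method names are made up for illustration, and this does not address the language-specific segmentation that Arabic may need.

```java
import java.util.ArrayList;
import java.util.List;

class UnicodeTokenizer {

    // Removes everything that is not a letter, combining mark, digit or
    // whitespace, then splits the remainder on runs of whitespace.
    static List<String> tokenize(String text) {
        String cleaned = text.replaceAll("[^\\p{L}\\p{M}\\p{Nd}\\s]", "");
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Arabic words with an Arabic comma (\u060c) and "!" as punctuation,
        // written with escapes so the source file's encoding does not matter.
        String text = "\u0645\u0631\u062d\u0628\u0627\u060c \u0628\u0627\u0644\u0639\u0627\u0644\u0645!";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```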
In addition to worrying about character encoding as in bgth's response, tokenizing Arabic has the added complication that words are not necessarily separated by white space:
http://www1.cs.columbia.edu/~rambow/papers/habash-rambow-2005a.pdf
If you're not familiar with Arabic, you'll need to read up on some of the tokenization methods:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.9748

Trimming off non numerical stuff and store them in an array using replaceAll()

Suppose I have an input line "MOVE R1, R2". I split it on white space and store the words individually in an array token[] as follows:
String[] token = new String[0]; // array initialization
FileInputStream fstream = new FileInputStream("a.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
// Read the file line by line, storing the data in the form of tokens
while ((strLine = br.readLine()) != null) {
    token = strLine.split(" "); // split w.r.t. spaces
}
So the elements at each index are as follows:
token[0]=MOVE
token[1]=R1,
token[2]=R2
But what I want is as follows:
token[0]=MOVE
token[1]=1
token[2]=2
I want to store only the numerical values in token[i] where i>0, trimming off the R and the comma (,).
I am unable to figure out how to use replaceAll() with arrays. How can I do that? Thanks in advance.
Try this code:
str = str.replaceAll("[^\\d.]", "");
This removes everything except digits and dots.
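Replace before tokenizing:
while ((strLine = br.readLine()) != null) {
    // note: this also removes the spaces and the MOVE mnemonic, so split(" ") returns a single token
    strLine = strLine.replaceAll("[^\\d]", "");
    token = strLine.split(" "); // split w.r.t. spaces
}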
Alternatively, apply
s = s.replaceAll("\\D", "");
to the individual split string elements inside the while loop.
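Putting that last suggestion together for the "MOVE R1, R2" example: split first, leave token[0] alone, and strip the non-digits from every later element. This is a sketch with made-up class and method names; it assumes each operand contains exactly the digits that should be kept.

```java
import java.util.Arrays;

class OperandParser {

    // Splits on whitespace, keeps token[0] (the mnemonic) unchanged and
    // removes every non-digit character from the remaining tokens.
    static String[] parse(String line) {
        String[] token = line.trim().split("\\s+");
        for (int i = 1; i < token.length; i++) {
            token[i] = token[i].replaceAll("\\D", "");
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("MOVE R1, R2"))); // prints [MOVE, 1, 2]
    }
}
```

Splitting before replacing matters here: running replaceAll on the whole line would also delete the spaces and the mnemonic, so nothing would be left to split on.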
