I have an input file that contains words like bört and übuk. When I read a line with the following code, I get these strange results. How can I solve it?
String line = bufferedReader.readLine();
if (line == null) { break; }
String[] words = line.split("\\W+");
for (String word : words) {
    System.out.println(word);
}
The output is:
b
rt
and
buk
Try creating a BufferedReader that handles the UTF-8 character encoding:
FileInputStream fis = new FileInputStream(new File("someFile.txt"));
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
BufferedReader bufferedReader = new BufferedReader(isr);
It seems that your problem is that the standard character class \\W is the negation of \\w, which matches only [a-zA-Z0-9_], so split("\\W+") splits on every character outside that class, including ö and ü in your case.
To solve this and cover Unicode characters as well, compile your regex with the Pattern.UNICODE_CHARACTER_CLASS flag, which enables the Unicode versions of the predefined and POSIX character classes. To set this flag inline, add (?U) at the start of the regex:
String[] words = line.split("(?U)\\W+");
Demo:
String line = "bört and übuk";
String[] words = line.split("(?U)\\W+");
for (String word : words)
System.out.println(word);
Output:
bört
and
übuk
You need something like this:
BufferedReader bufferReader = new BufferedReader(
new InputStreamReader(new FileInputStream(fileDir), "UTF-8"));
Here, instead of UTF-8, you can put whatever encoding you need to support while reading the file.
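As a side note, a sketch assuming Java 7 or later: java.nio.file.Files can build the same charset-aware reader in a single call, which avoids the platform-default pitfall entirely. The temp file here is only for making the example self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadUtf8 {
    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file so the example is self-contained.
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, "bört and übuk".getBytes(StandardCharsets.UTF_8));

        // Files.newBufferedReader decodes with the charset you pass,
        // unlike new FileReader(...), which uses the platform default.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line = reader.readLine();
            System.out.println(line); // bört and übuk
        }
        Files.delete(file);
    }
}
```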
Hi, I have a file stored on a Linux system that contains the special character ^C.
Something like this:
ABCDEF^CIJKLMN
Now I need to read this file in Java and detect the ^C in order to split on it.
The problem is that when viewing the file in UNIX I must use cat -v fileName to see the special character ^C; otherwise I can't see it.
This is my sample code:
InputStreamReader inputStreamReader = new InputStreamReader(new FileInputStream(this),
        Charset.forName("UTF-8"));
BufferedReader br = new BufferedReader(inputStreamReader);
String line;
while ((line = br.readLine()) != null) {
    if (line.contains("^C")) {
        String[] split = line.split("\\" + sepRecord);
        System.out.println(split);
    }
}
You are checking if the line contains the String "^C", not the character '^C' (which corresponds to 0x03, or \u0003). You should search for the character 0x03 instead. Here's a code example that would work in your case:
byte[] fileContent = new byte[] {'A', 0x03, 'B'};
String fileContentStr = new String (fileContent);
System.out.println (fileContentStr.contains ("^C")); // false
System.out.println (fileContentStr.contains (String.valueOf ((char) 0x03))); // true
System.out.println (fileContentStr.contains ("\u0003")); // true, thanks to @Thomas Fritsch for the precision
String[] split = fileContentStr.split ("\u0003");
System.out.println (split.length); // 2
System.out.println (split[0]); // A
System.out.println (split[1]); // B
The ^C character is displayed in Caret Notation, and must be interpreted as a single character.
I'm importing a file into my code and trying to print it. The file contains:
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
The second "don't" has a right single quotation mark (’), and when I print it the output is
don�t
That question mark is printed as a blank square. Is there a way to convert it to a regular apostrophe?
EDIT:
public class Somethingsomething {
    public static void main(String[] args) throws FileNotFoundException,
            IOException {
        ArrayList<String> list = new ArrayList<String>();
        File file = new File("D:\\project1Test.txt");
        if (file.exists()) { // checks if file exists
            FileInputStream fileStream = new FileInputStream(file);
            InputStreamReader input = new InputStreamReader(fileStream);
            BufferedReader reader = new BufferedReader(input);
            String line;
            while ((line = reader.readLine()) != null) {
                list.add(line);
            }
            for (int i = 0; i < list.size(); i++) {
                System.out.println(list.get(i));
            }
        }
    }
}
It should print normally, but the second "don't" shows a white block where the apostrophe should be.
This is the file I'm using: https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
Edit: if it helps, the full document where the character occurs is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It's all about character encoding. The way characters are represented isn't always the same, and they tend to get misinterpreted.
Characters are usually stored as numbers that depend on the encoding standard (and there are many of them). For example, in ASCII "a" is the number 97 (0x61 in hex), and UTF-8 encodes it as that same single byte.
Now when you see funny characters such as this question mark (called the replacement character, U+FFFD), it usually means one encoding is being misinterpreted as another, and the replacement character is substituted for any byte sequence that could not be decoded.
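A small sketch of this effect (the sample string and charsets are just illustrative): decoding bytes with a charset that cannot represent them yields U+FFFD for each offending byte.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        // "don’t" uses a right single quotation mark (U+2019),
        // which UTF-8 encodes as the three bytes 0xE2 0x80 0x99.
        byte[] utf8Bytes = "don’t".getBytes(StandardCharsets.UTF_8);

        // US-ASCII cannot represent those bytes, so each one
        // becomes U+FFFD, the replacement character.
        String misread = new String(utf8Bytes, StandardCharsets.US_ASCII);
        System.out.println(misread); // don���t

        // Decoding with the right charset round-trips cleanly.
        String correct = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(correct.equals("don’t")); // true
    }
}
```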
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of charsets is available in the Charset class documentation. Unfortunately, you might have to try them one by one; start with the most common ones, such as UTF-8, ISO-8859-1 and windows-1252.
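To see which charsets your own JVM actually supports, a quick sketch (nothing file-specific assumed):

```java
import java.nio.charset.Charset;

public class ListCharsets {
    public static void main(String[] args) {
        // Charset.availableCharsets() returns a sorted map of
        // canonical charset names supported by this JVM.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
        // The platform default is what FileReader/InputStreamReader
        // fall back to when you don't pass a charset explicitly.
        System.out.println("default: " + Charset.defaultCharset());
    }
}
```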
Your problem is almost certainly the encoding scheme you are using. You can read a file in almost any encoding scheme you want; just tell Java how your input was encoded. UTF-8 is common on Linux, while Windows commonly uses a legacy code page such as windows-1252 in Western locales.
This is the sort of problem you have all the time if you are processing files created on a different OS.
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path)
{
    // Note: FileReader uses the platform default charset; wrap a
    // FileInputStream in an InputStreamReader with an explicit Charset
    // if the file's encoding may differ from the default.
    try (BufferedReader br = new BufferedReader(new FileReader(path)))
    {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        return sb.toString();
    } catch (IOException fex) { return null; }
}
I have a CSV file containing some French words (with accents). I want to read this file using Java and convert the accented letters to non-accented letters. For example, é should be read as e. I have tried the following:
CSVReader reader = new CSVReader(new FileReader(file));
String[] line;
while ((line = reader.readNext()) != null) {
line[0] = Normalizer.normalize(line[0], Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "").replaceAll("[^a-zA-Z0-9:_']", "_");
System.out.println("LINE[0]: "+line[0]);
}
Suppose the file contains the line "Arts_et_Métiers"; the output is "Arts_et_MAtiers", where the accented letter is replaced by 'A' rather than 'e'. Is there something I am doing wrong? Any help will be appreciated.
Thanks.
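One likely explanation (an assumption, since the question doesn't show how the file was produced): the file is UTF-8, but FileReader decodes it with a single-byte default charset, so é arrives as the two characters Ã©. NFD then decomposes Ã into A plus a combining tilde, and the ASCII filter strips everything but the A. A sketch reproducing both behaviors:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class AccentStripDemo {
    static String strip(String s) {
        // NFD splits accented letters into base letter + combining mark,
        // then the filter removes every non-ASCII code point.
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // Correctly decoded input: é becomes e, as intended.
        System.out.println(strip("Arts_et_Métiers")); // Arts_et_Metiers

        // Mojibake: the UTF-8 bytes of "é" re-read as ISO-8859-1 become "Ã©";
        // Ã decomposes to A + U+0303, and © is simply dropped.
        String mojibake = new String("Arts_et_Métiers".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(strip(mojibake)); // Arts_et_MAtiers
    }
}
```

So the fix is to decode the file with the correct charset before normalizing, not to change the Normalizer call.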
I am trying to read a file containing Greek words in UTF-8, with the following code:
reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));
while((line = reader.readLine()) != null){
tokenizer = new StringTokenizer(line, delimiter);
while(tokenizer.hasMoreTokens()){
currentToken = tokenizer.nextToken();
map.put(currentToken, 1);
}
}
On every forum I looked at, I saw this new FileInputStream(file), "UTF8") pattern,
but the printed result still looks like ����.
P.S.: when I print a variable containing a Greek word hard-coded in the source, it prints successfully, which suggests the problem is in the file reading.
Any ideas?
Strictly speaking, Java accepts "UTF8" only as a legacy alias; the canonical charset name is "UTF-8":
new InputStreamReader(new FileInputStream(file), "UTF-8"))
Or use StandardCharsets.UTF_8 instead to avoid any ambiguity:
new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))
That being said, make sure the file actually is UTF-8 encoded. If it starts with a UTF-8 BOM, you will have to either strip the BOM from the file itself or skip it manually before reading the lines; Java readers do not recognize or skip BOMs automatically.
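A minimal sketch of skipping a possible BOM manually (the helper name is ours; PushbackReader lets us put a non-BOM first character back):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.nio.charset.StandardCharsets;

public class BomSkipDemo {
    // Opens a UTF-8 reader and consumes a leading U+FEFF (BOM) if present.
    static BufferedReader openUtf8SkippingBom(String path) throws IOException {
        PushbackReader pr = new PushbackReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8));
        int first = pr.read();
        if (first != -1 && first != '\uFEFF') {
            pr.unread(first); // not a BOM: put the character back
        }
        return new BufferedReader(pr);
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader br = openUtf8SkippingBom(args[0])) {
            System.out.println(br.readLine());
        }
    }
}
```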
Use this for a proper conversion; this one repairs a string whose UTF-8 bytes were mis-decoded as ISO-8859-1. Note that Java Strings carry no encoding themselves, so this round-trip only helps when the original decoding used the wrong charset:
public String to_utf8(String fieldvalue) throws UnsupportedEncodingException {
    // Undo the wrong ISO-8859-1 decode, then decode the raw bytes as UTF-8.
    String fieldvalue_utf8 = new String(fieldvalue.getBytes("ISO-8859-1"), "UTF-8");
    return fieldvalue_utf8;
}
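A sketch of when this round-trip helps (the sample string and charsets are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepairDemo {
    public static void main(String[] args) {
        // Simulate the damage: UTF-8 bytes of "é" decoded as ISO-8859-1.
        String damaged = new String("café".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(damaged); // cafÃ©

        // The repair: undo the wrong decode, then decode as UTF-8.
        String repaired = new String(damaged.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(repaired.equals("café")); // true
    }
}
```

This works because ISO-8859-1 maps every byte to a character one-to-one, so no information is lost in the wrong decode; the same trick fails with charsets that drop or replace bytes.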
I am trying to tokenize some text files into words, and I wrote this code. It works perfectly in English, but when I try it with Arabic it does not work.
I added UTF-8 to read the Arabic files. Did I miss something?
public void parseFiles(String filePath) throws FileNotFoundException, IOException {
    File[] allfiles = new File(filePath).listFiles();
    BufferedReader in = null;
    for (File f : allfiles) {
        if (f.getName().endsWith(".txt")) {
            fileNameList.add(f.getName());
            Reader fstream = new InputStreamReader(new FileInputStream(f), "UTF-8");
            in = new BufferedReader(fstream);
            StringBuilder sb = new StringBuilder();
            String s = null;
            String word = null;
            while ((s = in.readLine()) != null) {
                Scanner input = new Scanner(s);
                while (input.hasNext()) {
                    word = input.next();
                    if (stopword.isStopword(word) == true) {
                        word = word.replace(word, "");
                    }
                    //String stemmed = stem.stem(word);
                    sb.append(word + "\t");
                }
                //System.out.print(sb); // here the Arabic text is output without stopwords
            }
            String[] tokenizedTerms = sb.toString().replaceAll("[\\W&&[^\\s]]", "").split("\\W+"); // to get individual terms
            for (String term : tokenizedTerms) {
                if (!allTerms.contains(term)) { // avoid duplicate entry
                    allTerms.add(term);
                    System.out.print(term + "\t"); // here the problem.
                }
            }
            termsDocsArray.add(tokenizedTerms);
        }
    }
}
Please share any ideas to help me proceed.
Thanks
The problem lies with your regex, which works for English but not for Arabic, because by definition
[\\W&&[^\\s]]
means
// match any non-word character except whitespace
\W matches a non-word character, i.e. anything outside [a-zA-Z_0-9]. (All Arabic letters satisfy this condition.)
\s matches a whitespace character, short for [ \t\n\x0b\f\r].
So by this logic, every Arabic character is matched by this regex. When you write
sb.toString().replaceAll("[\\W&&[^\\s]]", "")
it means: replace every non-word character that is not whitespace with "", which in Arabic text is every character. That is why all the Arabic characters are replaced by "" and no output appears. You will have to tweak this regex to work for Arabic text, or just split the string on whitespace:
sb.toString().split("\\s+")
which will give you the Arabic words as an array, separated by whitespace.
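A quick sketch with an illustrative Arabic sentence, showing both the destructive regex and the whitespace split:

```java
public class ArabicSplitDemo {
    public static void main(String[] args) {
        String line = "مرحبا بالعالم الجميل"; // three Arabic words (illustrative)

        // The original regex removes every Arabic letter, leaving only spaces.
        String stripped = line.replaceAll("[\\W&&[^\\s]]", "");
        System.out.println(stripped.trim().isEmpty()); // true

        // Splitting on whitespace keeps the Arabic words intact.
        String[] words = line.split("\\s+");
        System.out.println(words.length); // 3
    }
}
```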
In addition to worrying about character encoding as in bgth's response, tokenizing Arabic has an added complication: words are not necessarily whitespace-separated:
http://www1.cs.columbia.edu/~rambow/papers/habash-rambow-2005a.pdf
If you're not familiar with Arabic, you'll need to read up on some of the methods for tokenization:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.120.9748