Error reading UTF-8 file in Java - java

I am trying to read some sentences from a file that contains Unicode characters. It does print out a string, but for some reason it garbles the Unicode characters.
This is the code I have:
public static String readSentence(String resourceName) {
String sentence = null;
try {
InputStream refStream = ClassLoader
.getSystemResourceAsStream(resourceName);
BufferedReader br = new BufferedReader(new InputStreamReader(
refStream, Charset.forName("UTF-8")));
sentence = br.readLine();
} catch (IOException e) {
throw new RuntimeException("Cannot read sentence: " + resourceName);
}
return sentence.trim();
}

The problem is probably in the way that the string is being output.
I suggest that you confirm that you are correctly reading the Unicode characters by doing something like this:
for (char ch : sentence.toCharArray()) {
System.err.println("char '" + ch + "' is unicode codepoint " + ((int) ch));
}
and see if the Unicode codepoints are correct for the characters that are being messed up. If they are correct, then the problem is output side: if not, then input side.

First, you could create the InputStreamReader as
new InputStreamReader(refStream, "UTF-8")
Also, you should verify if the resource really contains UTF-8 content.
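One way to check whether the bytes really are UTF-8 is to decode them with a strict decoder that reports errors instead of silently replacing them. This is a minimal sketch, assuming the resource has already been read into a byte array; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8CheckSketch {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    public static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is not valid UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

A file that passes this check may still be in another encoding by coincidence (pure ASCII is valid in many encodings), so treat a positive result as a hint rather than proof.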

One of the most annoying reasons could be... your IDE settings.
If your IDE's default console encoding is something like Latin-1, you will struggle for a long time with different variations of the Java code, and nothing will help until you set the IDE options correctly.
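On the Java side, the output encoding can be made explicit by wrapping System.out in a PrintStream with a named charset. This is a rough sketch with hypothetical class and method names; whether the characters then display correctly still depends on the console being configured for UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8ConsoleSketch {
    // Wraps a stream so that printed text is encoded as UTF-8 regardless of the platform default.
    public static PrintStream utf8PrintStream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Should not happen: every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8PrintStream(System.out);
        out.println("don\u2019t");
    }
}
```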

Related

characters not appearing when I print when I import a file?

I'm importing a file into my code and trying to print it. the file contains
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
The second "don’t" contains a right single quotation mark, and when I print it the output is
don�t
The question mark is printed as a blank square. Is there a way to convert it to a regular apostrophe?
EDIT:
public class Somethingsomething {
public static void main(String[] args) throws FileNotFoundException,
IOException {
ArrayList<String> list = new ArrayList<String>();
File file = new File("D:\\project1Test.txt");//D:\\project1Test.txt
if(file.exists()){//checks if file exist
FileInputStream fileStream = new FileInputStream(file);
InputStreamReader input = new InputStreamReader(fileStream);
BufferedReader reader = new BufferedReader(input);
String line;
while( (line = reader.readLine()) != null) {
list.add(line);
}
for(int i = 0; i < list.size(); i ++){
System.out.println(list.get(i));
}
}
}}
it should print as normal but the second "don't" has a white block on the apostrophe
this is the file I'm using https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
Edit: if it helps, the full document where the character appears is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It’s all about character encoding. The way characters are represented isn't always the same and they tend to get misinterpreted.
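Characters are usually stored as numbers that depend on the encoding standard (and there are many of them). For example, "a" is 97 (0x61 in hexadecimal) in both ASCII and UTF-8, but characters outside ASCII, such as ’, are stored as different bytes in different encodings.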
Now when you see funny characters such as the question mark (called replacement character) in this case, it's usually that an encoding standard is being misinterpreted as another standard, and the replacement character is used to replace the unknown or misinterpreted character.
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of charsets is available here. Unfortunately, you might have to go through them one by one. A short list of the most common ones can be found here.
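To see how the choice of charset changes the result, the sketch below decodes the same bytes with two charsets. It assumes, purely for illustration, that the file stores the right single quotation mark as the single byte 0x92 (as windows-1252 does); the actual encoding of your file may differ. The class and method names are hypothetical, and windows-1252 is not among the charsets every Java platform is required to support, although common JDKs ship it.

```java
import java.nio.charset.Charset;

public class CharsetCompareSketch {
    // Decodes raw bytes with the named charset; bytes that are invalid in that
    // charset are replaced by its replacement character (U+FFFD for UTF-8).
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "don?t" where the apostrophe is stored as the single byte 0x92 (assumed).
        byte[] raw = { 'd', 'o', 'n', (byte) 0x92, 't' };
        for (String name : new String[] { "UTF-8", "windows-1252" }) {
            String text = decode(raw, name);
            // Print the code point that ended up at the apostrophe position.
            System.out.println(name + " -> U+"
                    + Integer.toHexString(text.charAt(3)).toUpperCase());
        }
    }
}
```

If one candidate yields U+2019 at that position while another yields U+FFFD, the first is the more plausible encoding for the file.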
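Your problem is almost certainly the encoding you are using. You can read a file in almost any encoding you want; just tell Java how your input was encoded. UTF-8 is common on Linux. On Windows the native encoding is a legacy code page that depends on the locale, for example windows-1252 (Western European) or windows-1250 (Central European).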
This is the sort of problem you have all the time if you are processing files created on a different OS.
See here and Here
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path)
{
try(BufferedReader br = new BufferedReader(new FileReader(path)))
{
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
sb.append(System.lineSeparator());
line = br.readLine();
}
return sb.toString();
}catch(IOException fex){ return null; }
}

Removing space in Java error

Basically, my program will convert data into .CSV format. But, I am faced with an error such that when I open my file in excel, it displays my data normally but when in notepad, it becomes some characters ㈬㥙〳㈬㥙ㄳ㌬かㄹ㌬か㌹㌬ㅋㄳ㌬ㅋ㈳㌬ㅋ㌳㌬ㅋ㐳
Here's my line of code
String resultString = stringWriter.toString();
for ( String cheese: pie.keySet() ) {
resultString += System.getProperty("line.separator") + cheese + "," +
pie.get(cheese).toString();
resultString = resultString.replaceAll(",$" , "").replaceAll(" ", "");
}
this.WriteToFile(resultString);
I have multiple file with this method to remove the space but only this file has the error. I've tried multiple methods such as removing it before the first resultString and at the back of pie.get(cheese).toString().
Also tried with .replace(" ", ""); and replaceAll("\\s","")
The data contains special characters, and these are not rendered properly with the default encoding. So when you write or create a text/CSV file, also set the character encoding to UTF-8. You can do this in the Java program itself.
String utf8String=getFromSource();
File fileDir = new File("c:\\temp\\test.txt");
Writer out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(fileDir), "UTF8"));
out.append(utf8String).append("\r\n");
out.flush();
out.close();

StringEscapeUtils.unescapeHtml doesn't work on strings read from files

I'm trying to read in a file that contains unicode characters, convert those characters to their corresponding symbols and then print the resulting text to a new file. I'm trying to use StringEscapeUtils.unescapeHtml to do this but the lines are just being printed as is, with the unicode points still intact. I did a practice run by copying a single line from the file, making a string from that and then calling StringEscapeUtils.unescapeHtml on that, which works perfectly. My code is below:
class FileWrite
{
public static void main(String args[])
{
try{
String testString = " \"text\":\"Dude With Knit Hat At Party Calls Beer \u2018Libations\u2019 http://t.co/rop8NSnRFu\" ";
FileReader instream = new FileReader("Home Timeline.txt");
BufferedReader b = new BufferedReader(instream);
FileWriter fstream = new FileWriter("out.txt");
BufferedWriter out = new BufferedWriter(fstream);
out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");//This gives the desired output,
//with unicode points converted
String line = b.readLine().toString();
while(line != null){
out.write(StringEscapeUtils.unescapeHtml3(line) + "\n");
line = b.readLine();
}
//Close the output streams
b.close();
out.close();
}
catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
}
//This gives the desired output,
//with unicode points converted
out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");
You are mistaken. Java unescapes String literals of this form at compile time when it builds them into the class file:
"\u2018Libations\u2019"
There are no HTML 3 escapes in this code. The method you have chosen is designed to unescape escape sequences of the form &lsquo;.
You probably want the unescapeJava method.
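If the file contains the escapes as literal text (a backslash, the letter u, and four hex digits), they have to be decoded at run time. The sketch below does this with the standard library only. It is not the Commons Lang implementation, it handles only this one escape form, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescapeSketch {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces each textual escape with the character it names.
    public static String unescapeUnicode(String input) {
        Matcher m = UNICODE_ESCAPE.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or '\' in the decoded text.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the escapes as literal text, as they
        // would appear in a line read from a file.
        String fromFile = "Calls Beer \\u2018Libations\\u2019";
        System.out.println(unescapeUnicode(fromFile));
    }
}
```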
Your strings are being both read and written using your platform's default encoding. You want to explicitly specify the character set to use as 'UTF-8':
Input stream:
BufferedReader b = new BufferedReader(new InputStreamReader(
new FileInputStream("Home Timeline.txt"),
Charset.forName("UTF-8")));
Output stream:
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream("out.txt"),
Charset.forName("UTF-8")));

Java String.split() - NumberFormatException

So, here's the thing, I've got this code:
public static void main(String[] args) {
try {
FileInputStream fstream = new FileInputStream("test.txt");
// Use DataInputStream to read binary NOT text.
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine = br.readLine();
String[] split = strLine.split(" ");
System.out.println(Integer.parseInt(split[0]));
br.close();
}
catch (Exception e) {//Catch exception if any
System.err.println("Error: " + e);
}
}
let's say I have a file named test.txt, with just "6 6". So, it reads first line and splits that line into two strings. The problem is that I can use Integer.parseInt for the split[1], but I can't use that method for split[0]
(System.out.println(split[0]) prints "6"), which outputs me an error of:
Error: java.lang.NumberFormatException: For input string: "6"
UPDATE:
It might be problem of eclipse, because if I compile my .java files in terminal with javac, I don't get any exceptions!:))
UPDATE2:
solved. something went wrong while saving with Kate. Don't know what, but gedit works better:D
Thank you all.
Just try hexdump -C test.txt if you are on Linux; it shows any non-printable characters in the file.
The trim() answer is also fine.
I'd try the following to rule out spurious/unexpected characters:
.. setup/read code in main method...
String[] split = strLine.split(" ");
for (String s : split) {
System.out.println(String.format("[%s] => integer? %b", s, isInteger(s)));
}
... the rest of the main method....
private static boolean isInteger(String n) {
try {
Integer.parseInt(n);
} catch(NumberFormatException e) {
return false;
}
return true;
}
If you see anything between the square brackets that isn't a number, or a line where "integer?" is false, that is likely the problem.
This problem at the start of the file is usually due to a BOM, which some software (mainly Notepad in fact) like to put at the start of Unicode files.
Open the file in a good text editor and configure it to save the files without the BOM.
If you can't change the file, skip the first char when reading it.
If you are able to determine the actual encoding of the file you are reading, you can set it explicitly, so that any extra bytes are converted correctly into characters.
new InputStreamReader(new FileInputStream(...), <encoding>)
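Skipping the BOM can be sketched as follows, assuming the file is UTF-8 and relying on the fact that Java's UTF-8 decoder passes a leading BOM through as the character U+FEFF. The class and method names are hypothetical, and a byte array stands in for the file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BomSkipSketch {
    // Drops a leading U+FEFF (the decoded BOM) if the line starts with one.
    public static String stripBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    // Reads the first line as UTF-8 and removes the BOM character if present.
    public static String readFirstLine(byte[] fileBytes) {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8))) {
            return stripBom(br.readLine());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 BOM, followed by the text "6 6".
        byte[] fileBytes = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '6', ' ', '6' };
        String[] split = readFirstLine(fileBytes).split(" ");
        System.out.println(Integer.parseInt(split[0]));
    }
}
```

Without the stripBom call, split[0] would be the BOM character followed by "6", which prints like "6" but cannot be parsed as an integer.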

Java Loses International Characters in Stream

I am having trouble reading international characters in Java.
The default character set being used is UTF-8 and my Eclipse workspace is also set to this.
I am reading the title of a video from the Internet (Gangnam Style, in fact ;) ), which contains Korean characters. I am doing this as follows:
BufferedReader stdInput = new BufferedReader(new InputStreamReader(shellCommand.getInputStream()));
String fileName = null, output = null;
while ((output = stdInput.readLine()) != null) {
if (output.indexOf("Destination") > 0) {
System.out.println(output);
I know that the title it will read is: "PSY - GANGNAM STYLE (강남스타일) M/V", but the console displays the following instead: "PSY - GANGNAM STYLE () M V" which causes errors further along in my program.
It seems like the InputStream Reader isn't reading these characters correctly.
Does anyone have any ideas? I've spent the last hour scouring the Internet and haven't found any answers. Thanks in advance everyone.
The default character set being used is UTF-8
The default where? In Java itself, or in the video? It would be much clearer if you specified this explicitly. You should check that's correct for the video data too.
It seems like the InputStream Reader isn't reading these characters correctly.
Well, all we know is that the text isn't showing properly on the console. Either it isn't being read correctly, or it's not being displayed correctly. You should print out each character's Unicode value so you can check the exact content of the string. For example:
static void logCharacters(String text) {
for (int i = 0; i < text.length(); i++) {
char c = text.charAt(i);
System.out.println(c + " " + Integer.toHexString(c));
}
}
You need to check the default charset using Charset.defaultCharset().name(); if it is not UTF-8, specify the charset explicitly:
InputStreamReader in = new InputStreamReader(shellCommand.getInputStream(), "UTF-8");
I tried a sample program and it prints correctly in Eclipse. It might be a problem with the Windows console, as AlexR has pointed out.
byte[] bytes = "PSY - GANGNAM STYLE (강남스타일) M/V".getBytes();
InputStreamReader reader = new InputStreamReader(new ByteArrayInputStream(bytes));
BufferedReader bufferedReader = new BufferedReader(reader);
String str = bufferedReader.readLine();
System.out.println(str);
Output:
PSY - GANGNAM STYLE (강남스타일) M/V
