java utf8 encoding outputstream not working - java

I need to write a program that writes UTF-8 data to a file.
I found examples on the internet; however, I cannot get the desired result.
Code:
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class UTF8WriterDemo {
    public static void main(String[] args) {
        Writer out = null;
        try {
            out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream("c://java//temp.txt"), "UTF-8"));
            String text = "This texáát will be added to File !!";
            out.write(text);
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Everything runs successfully, but at the end I see special characters not displayed properly:
This texÃ¡Ã¡t will be added to File !!
I tried several examples from the internet with the same result.
I use Visual Studio Code.
Where could the problem be?
Thank you

Your code is correct. You probably already have a file named temp.txt, so Java writes the text into the existing file (replacing the previous content). What can be a problem is the encoding that is already set for that file.
In other words, you can't (or at least shouldn't) write UTF-8 text into a file that uses, for example, WINDOWS-1250 encoding, or you get exactly the result you have described.
If you didn't have this file, Java would automatically create one with UTF-8 encoding.
Possible solutions:
Change the encoding of your current file (usually you can open it in any text editor, use Save As, and then specify UTF-8 as the encoding).
Remove the file and Java will create it automatically with the proper encoding.
By the way, you should use the StandardCharsets class instead of a String charset name, in order to avoid UnsupportedEncodingException:
new OutputStreamWriter(new FileOutputStream("temp.txt"), StandardCharsets.UTF_8)
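For what it's worth, on Java 7+ the whole construction can also be shortened with java.nio; a minimal sketch, using the same path and text as in the question:
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class UTF8WriterDemo {
    public static void main(String[] args) throws Exception {
        // Files.newBufferedWriter takes a Charset directly, so there is no
        // checked UnsupportedEncodingException and no manual stream wrapping.
        try (Writer out = Files.newBufferedWriter(
                Paths.get("c:/java/temp.txt"), StandardCharsets.UTF_8)) {
            out.write("This texáát will be added to File !!");
        }
    }
}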

When you say "I see special characters not showing properly", where are you seeing them?
What you show next looks like the string, UTF-8 encoded (i.e. the accented a's are each represented by 2 characters, in what appears to be the appropriate encoding).
What I would expect the issue to be is that the Java code is not writing a BOM at the beginning of the file, leaving the interpretation of the UTF-8 sequences up to the discretion of the reading program.
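If a BOM really is what the consuming program needs, one experiment (treat it as a sketch rather than a recommendation, since many tools neither expect nor want a UTF-8 BOM) is to write U+FEFF as the very first character; the UTF-8 encoder turns it into the three bytes EF BB BF:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8WithBom {
    public static void main(String[] args) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("c://java//temp.txt"), StandardCharsets.UTF_8)) {
            out.write('\uFEFF'); // U+FEFF is encoded as EF BB BF, the UTF-8 BOM
            out.write("This texáát will be added to File !!");
        }
    }
}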

Related

Getting text from PDF using Apache PDFBox

How can I get information about the structure of a PDF, I mean whether it contains text or pictures? I need my program to move PDFs without text into another folder, but right now I just get an empty txt file.
try (FileWriter writer = new FileWriter(outputFile)) {
    PDDocument document = PDDocument.load(file); // load(...) is a static factory method
    PDFTextStripper pdfTextStripper = new PDFTextStripper();
    String text = pdfTextStripper.getText(document);
    writer.write(text);
    document.close();
} catch (IOException e) {
    e.printStackTrace();
}
Also, I have a problem getting text from web pages that were saved as PDF; the extracted text comes out garbled.
I think there is something wrong with the encoding, but I don't know what to do.
Your code works all right; your text viewer just assumes the wrong encoding.
Using your code and the same PDFBox version as you, I get properly extracted text.
But when I force my viewer to assume UTF-16 encoding, I get something very similar to what you get.
The file itself does not indicate any specific encoding by a BOM or anything of the sort.
Thus, your text viewer either incorrectly guesses UTF-16 encoding or is configured to use it.
So either switch your text viewer to use UTF-8, or explicitly tell your FileWriter to use UTF-16.
Depending on your specific installation, the file encoding might actually be different. As my UTF-16 view looks so very much like yours, though, the encoding very likely is at least similar to UTF-8, probably some ISO 8859-x...
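To illustrate the second option, writing with an explicit charset instead of relying on FileWriter's platform default, here is a rough sketch; it assumes PDFBox 2.x, and the input/output paths are placeholders:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractTextWithCharset {
    public static void main(String[] args) throws IOException {
        File file = new File("input.pdf");        // hypothetical input path
        File outputFile = new File("output.txt"); // hypothetical output path
        try (PDDocument document = PDDocument.load(file);
             Writer writer = new OutputStreamWriter(
                     new FileOutputStream(outputFile), StandardCharsets.UTF_8)) {
            // Extract the text and write it with an explicit, known encoding,
            // then open the result in a viewer set to the same encoding.
            String text = new PDFTextStripper().getText(document);
            writer.write(text);
        }
    }
}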

Java IO with UTF characters

I have a weird problem with files.
I intend to modify the timing of an .srt file, but writing the new file seems to be a weird task.
Here's a sample code I wrote:
import java.io.*;
import java.nio.charset.Charset;

public class ReaderWriter {
    public static void main(String[] args) throws IOException {
        InputStream inputStream = new FileInputStream("D:\\E\\Movies\\English\\1960's\\TheApartment1960.srt");
        Reader reader = new InputStreamReader(inputStream,
                Charset.forName("UTF-8"));

        OutputStream outputStream = new FileOutputStream("output.srt");
        Writer writer = new OutputStreamWriter(outputStream,
                Charset.forName("UTF-8"));

        int data = reader.read();
        while (data != -1) {
            char theChar = (char) data;
            writer.write(theChar);
            data = reader.read();
        }
        reader.close();
        writer.close();
    }
}
The original file displays its text correctly, but the resulting file comes out full of black-diamond replacement characters.
I searched a lot for a solution, but in vain. Any help, please?
First a few points:
There is nothing wrong with your Java code. If I use it to read an input file containing Arabic text encoded in UTF-8 it creates the output file encoded in UTF-8 with no problems.
I don't think there is a font issue. Since you can successfully display the content of the input file there is no reason you cannot also successfully display the content of a valid output file.
Those black diamonds with question marks in the output file are replacement characters which are "used to replace an incoming character whose value is unknown or unrepresentable in Unicode". This indicates that the input file you are reading is not UTF-8 encoded, even though the code explicitly states that it is. I can reproduce similar results to yours if the input file is UTF-16 encoded, but specified as UTF-8 in the code.
The same thing happens in reverse: if the input file truly is UTF-8 encoded but the code specifies UTF-16, you get the same kind of corruption. For example, here is a valid UTF-8 input file with some Arabic text, processed by code that (incorrectly) stated Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-16"));:
يونكود في النظم القائمة وفيما يخص التطبيقات الحاسوبية، الخطوط، تصميم النصوص والحوسبة متعددة اللغات.
And here is the output file, containing the replacement characters because the input stream of the UTF-8 file was incorrectly processed as UTF-16:
���⃙臙訠���ꟙ蓙苘Ꟙꛙ藘ꤠ���諘께딠�����ꟙ蓘귘Ꟙ동裘꣙諘꧘谠����꫘뗙藙諙蔠���⃙裘ꟙ蓘귙裘돘꣘ꤠ���⃘ꟙ蓙蓘뫘Ꟙꨮ�
Given all that, simply ensuring that the encoding of the input file is specified correctly in the InputStreamReader() constructor should solve your problem. To verify this, just create another input file, save it with UTF-8 character encoding, and then run your code. If it works, you know that the problem was that the encoding of the original input file was not UTF-8.
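One quick way to check what the input file actually is: dump its first few raw bytes (a small diagnostic sketch, using the path from the question). A UTF-16 file typically starts with FE FF or FF FE, while a UTF-8 file with a BOM starts with EF BB BF.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class EncodingProbe {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(
                "D:\\E\\Movies\\English\\1960's\\TheApartment1960.srt")) {
            // Print the first three raw bytes; -1 here would just mean the file is shorter.
            System.out.printf("first bytes: %02X %02X %02X%n",
                    in.read(), in.read(), in.read());
        }
    }
}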

How can I change the Standard Out to "UTF-8" in Java

I download a file from a website using a Java program, and the header looks like this:
Content-Disposition: attachment;filename="Textkürzung.asc";
There is no encoding specified.
What I do after downloading is pass the name of the file to another application for further processing. I use
System.out.println(filename);
On standard out, the string is printed as Textk³rzung.asc
How can I change the Standard Out to "UTF-8" in Java?
I tried encoding to "UTF-8" and the content is still the same.
Update:
I was able to fix this without any code change. In the place where the other application invokes my jar file, I did the following:
java -Dfile.encoding=UTF-8 -jar ....
This seems to have fixed the issue.
Thank you all for your support.
The default encoding of System.out is the operating system default. On international versions of Windows this is usually the windows-1252 codepage. If you're running your code on the command line, that is also the encoding the terminal expects, so special characters are displayed correctly. But if you are running the code some other way, or sending the output to a file or another program, it might be expecting a different encoding. In your case, apparently, UTF-8.
You can actually change the encoding of System.out by replacing it:
try {
    System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    throw new InternalError("VM does not support mandatory encoding UTF-8");
}
This works for cases where using a new PrintStream is not an option, for instance because the output is coming from library code which you cannot change, and where you have no control over system properties, or where changing the default encoding of all files is not appropriate.
The result you're seeing suggests your console expects text to be in Windows "code page 850" encoding - the character ü has Unicode code point U+00FC. The byte value 0xFC renders in Windows code page 850 as ³. So if you want the name to appear correctly on the console then you need to print it using the encoding "Cp850":
PrintWriter consoleOut = new PrintWriter(new OutputStreamWriter(System.out, "Cp850"));
consoleOut.println(filename);
Whether this is what your "other application" expects is a different question - the other app will only see the correct name if it is reading its standard input as Cp850 too.
Try to use:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(test);

Writing strings with chars like "ñ" to a txt file

I'm having a strange issue trying to write strings that contain characters like "ñ", "á", and so on into text files. Let me first show you my little piece of code:
import java.io.*;

public class test {
    public static void main(String[] args) throws Exception {
        String content = "whatever";
        int c;
        c = System.in.read();
        content = content + (char) c;
        FileWriter fw = new FileWriter("filename.txt");
        BufferedWriter bw = new BufferedWriter(fw);
        bw.write(content);
        bw.close();
    }
}
In this example, I'm just reading a char from the keyboard input and appending it to a given string, then writing the final string into a txt file. The problem is that if I type an "ñ", for example (I have a Spanish layout keyboard), when I check the txt file it shows a strange char "¤" where there should be an "ñ"; that is, the content of the file is "whatever¤". The same happens with "ç", "ú", etc. However, it writes fine ("whateverñ") if I just forget about the keyboard input and write:
...
String content = "whateverñ";
...
or
...
content = content + "ñ";
...
It makes me think that there might be something wrong with the read() method. Or maybe I'm using it wrongly? Or should I use a different method to get the keyboard input? I'm a bit lost here.
(I'm using JDK 7u45 on Windows 7 Pro x64.)
So ...
It works (i.e. you can read the accented characters on the output file) if you write them as literal strings.
It doesn't work when you read them from System.in and then write them.
This suggests that the problem is on the input side. Specifically, I think your console / keyboard must be using a character encoding for the input stream that does not match the encoding that Java thinks should be used.
You should be able to confirm this tentative diagnosis by outputting the characters you are reading in hexadecimal, and then checking the codes against the unicode tables (which you can find at unicode.org for example).
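A rough sketch of such a diagnostic: print the numeric value of every byte System.in delivers. On a Cp850 console, typing "ñ" should show the single byte 0xA4, which cast straight to char is U+00A4 (¤), matching the symptom; on a UTF-8 console you would instead see the two bytes 0xC3 0xB1.
import java.io.IOException;

public class InputDiagnostic {
    public static void main(String[] args) throws IOException {
        // System.in.read() returns raw bytes (0..255), not decoded characters.
        int c;
        while ((c = System.in.read()) != -1) {
            System.out.printf("read byte: 0x%02X%n", c);
        }
    }
}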
It strikes me as "odd" that the "platform default encoding" appears to be working on the output side, but not the input side. Maybe someone else can explain ... and offer a concrete suggestion for fixing it. My gut feeling is that the problem is in the way your keyboard is configured, not in Java or your application.
Files do not remember their encoding format; when you look at a .txt file, the text editor makes a "best guess" at the encoding used.
If you try to read the file back into your program, the text should be back to normal.
Also, try printing the "strange" character directly.
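A small sketch of that read-back check, reusing filename.txt from the question; it reads the file with the same platform default encoding that FileWriter used and prints each character together with its code point:
import java.io.FileReader;
import java.io.Reader;

public class ReadBack {
    public static void main(String[] args) throws Exception {
        try (Reader in = new FileReader("filename.txt")) {
            int c;
            while ((c = in.read()) != -1) {
                // Same default encoding on both sides, so stored chars round-trip unchanged.
                System.out.printf("%c (U+%04X)%n", (char) c, c);
            }
        }
    }
}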

How can I append a UTF-8 string to a properties file

How can I append a UTF-8 string to a properties file? I have given the code below.
public static void addNewAppIdToRootFiles() {
    Properties properties = new Properties();
    try {
        // Backslashes in the path must be escaped in Java source
        FileInputStream fin = new FileInputStream("C:\\Users\\sarika.sukumaran\\Desktop\\root\\root.properties");
        properties.load(new InputStreamReader(fin, Charset.forName("UTF-8")));
        String propertyStr = new String(("قسيمات").getBytes("iso-8859-1"), "UTF-8");
        BufferedWriter bw = new BufferedWriter(new FileWriter(directoryPath + rootFiles, true));
        bw.write(propertyStr);
        bw.newLine();
        bw.flush();
        bw.close();
        fin.close();
    } catch (Exception e) {
        System.out.println("Exception : " + e);
    }
}
But when I open the file, the string "قسيمات" that I have written shows as "??????". Please help me.
OK, your first mistake is getBytes("iso-8859-1"). You should not do these manipulations at all. If you want to write Unicode text to a file, you should just open the file and write the text. The internal representation of strings in Java is Unicode, so everything will be written correctly.
You do have to care about the charset when you are reading the file; by the way, you do that correctly.
But you do not have to use file manipulation tools to append something to a properties file. You can just call prop.setProperty("yourkey", "yourvalue") and then call prop.store(new FileOutputStream(yourfilename)).
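A sketch of that approach, with the extra twist that load(Reader) and store(Writer) let the whole round trip stay in UTF-8; the key name is hypothetical and the file name is shortened from the question's path:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class AppendProperty {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Load the existing file, assuming it really is UTF-8 encoded.
        try (Reader in = new InputStreamReader(
                new FileInputStream("root.properties"), StandardCharsets.UTF_8)) {
            props.load(in);
        }
        props.setProperty("somekey", "قسيمات"); // "somekey" is a hypothetical key
        // store(Writer, ...) writes characters as-is in the Writer's encoding,
        // instead of the \uXXXX escapes that store(OutputStream, ...) produces.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("root.properties"), StandardCharsets.UTF_8)) {
            props.store(out, null);
        }
    }
}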
OK, I have checked the specification for the Properties class. If you use the load() method with an input stream or the store() method with an output stream, the stream is assumed to use ISO-8859-1 encoding by default. Therefore, you have to be cautious about a few things:
Some characters in French, German and Portuguese are ISO-8859-1 (Latin-1) compatible, so they normally work fine in ISO-8859-1 and you don't have to worry that much. But others, like Arabic and Hebrew characters, are not Latin-1 compatible, so you need to be careful with the choice of encoding for these characters. If you have a mix of French and Arabic characters, you have no choice but to use Unicode.
What is your current input file's encoding, if it already exists and is to be used with the load() method of Properties? If it is not the default ISO-8859-1, then you need to figure out what it is before opening the file. If the input file's encoding is UTF-8, then use properties.load(new InputStreamReader(new FileInputStream("infile"), "UTF8")); Then stick to this encoding till the end, and match the file encoding with the character encoding as well.
If it is a new input file to be used with the load() method, choose a file encoding that works with your characters' encoding. Then stick to this encoding till the end.
Your expected output file's encoding should be the same as what was used with the load() method before you use the store() method. If it is not the default ISO-8859-1, then you need to figure out what it is before saving the file. Stick to this encoding till the end, and match the file encoding with the character encoding as well. If the output file's encoding is UTF-8, then specifically use UTF-8 encoding when saving the file. But if the store() method still ends up with an output file in ISO-8859-1 encoding, then you need to do what is suggested next...
If you stick to the default ISO-8859-1, it works fine for characters like those in French. But if the characters are not ISO-8859-1 (Latin-1) compatible, you need to use Unicode escape sequences instead as an alternative, for example \uFE94 for the Arabic ﺔ character. For me this escaping is too tedious, and normally we use the native2ascii utility provided in the JRE or JDK to convert a properties file from one encoding to another. Of course, there are other ways... just check the references below. For me it is better to use a properties file in XML format, since by default it is UTF-8 (see the sketch after the references).
References:
Java properties UTF-8 encoding in Eclipse
Setting the default Java character encoding?
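A sketch of the XML alternative mentioned above: Properties.storeToXML writes the requested encoding (UTF-8 here), so Arabic values need no escaping. The file name and key are hypothetical:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

public class XmlPropertiesDemo {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("somekey", "قسيمات");
        try (OutputStream os = new FileOutputStream("root.xml")) {
            // The third argument selects the XML file's encoding.
            props.storeToXML(os, "example comment", "UTF-8");
        }
    }
}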
