UTF-8 encoding problem - Java

I am working with an HTML file. I used an HTML cleaner on it, and afterwards the encoding is mangled (all 'e's are replaced by +®). How can I correct that in Java?

Post some code showing what you are doing. Here is an answer I got to a similar question:
FileInputStream fis = new FileInputStream("filename");
BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-16"));
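For the HTML question above, the same idea applies; a minimal sketch, assuming the cleaned file is actually UTF-8 (the file name is a placeholder, and verifying the real charset is the crucial step):
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("cleaned.html"), "UTF-8"));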

Related

Java: Open and read a file using InputStreamReader

I'm trying to read a binary file (pdf, doc, zip) using InputStreamReader. I achieved that using FileInputStream and saving the contents of the file into a byte array, but I've been asked to do it using InputStreamReader. So when I try to open and read a pdf file, for example, using
File file = new File(inputFileName);
Reader in = new InputStreamReader(new FileInputStream(file));
char[] fileContent = new char[(int) file.length()];
in.read(fileContent);
in.close();
and then save this content to another pdf file using
File outfile = new File(outputFile);
Writer out = new OutputStreamWriter(new FileOutputStream(outfile));
out.write(fileContent);
out.close();
Everything goes fine (no exceptions or errors), but when I try to open the new file, it either says it's corrupted or has the wrong encoding.
Any suggestions?
PS1: I specifically need to do this using InputStreamReader.
PS2: It works fine when reading/writing .txt files.
String, char, Reader, and Writer are for text in Java. This text is Unicode, and hence all scripts may be combined.
byte[], InputStream, and OutputStream are for binary data. If the bytes represent text, they must be associated with some encoding.
The bridge between text and binary data always involves a conversion.
In your case:
Reader in = new InputStreamReader(new FileInputStream(file), encoding);
Reader in = new InputStreamReader(new FileInputStream(file)); // Platform's encoding
The second version is non-portable, as other computers may use different default encodings.
In your case, do not use an InputStreamReader for binary data. The conversion can only corrupt things.
Maybe they intended to say: do not read everything into a single byte array. In that case, use a BufferedInputStream to read small byte arrays (a buffer) repeatedly, as in the sketch below.
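A minimal sketch of that buffered, purely binary copy (java.io imports assumed; the file names are placeholders):
InputStream in = new BufferedInputStream(new FileInputStream("in.pdf"));
OutputStream out = new BufferedOutputStream(new FileOutputStream("out.pdf"));
byte[] buffer = new byte[8192];
int n;
while ((n = in.read(buffer)) != -1) {
    out.write(buffer, 0, n); // copy exactly the bytes read; no charset conversion anywhere
}
out.close();
in.close();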
Do not use the reader/writer API. Use binary streams instead:
File inFile = new File("...");
File outFile = new File("...");
try (FileChannel in = new FileInputStream(inFile).getChannel();
     FileChannel out = new FileOutputStream(outFile).getChannel()) {
    in.transferTo(0, inFile.length(), out);
}

BufferedReader: reading chars into an EditText gives strange chars

OK, I am reading a .docx file via a BufferedReader and want to store the text in an EditText. The .docx is not in English but in a different language (Greek). I use:
File file = new File(file_Path);
try {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    StringBuilder text = new StringBuilder();
    while ((line = br.readLine()) != null) {
        text.append(line);
    }
    et1.setText(text);
} catch (IOException e) {
    e.printStackTrace();
}
And the result I get is a string of unreadable characters.
If the characters are English, it works fine. But in my case they aren't. How can I fix this? Thanks a lot.
Ok, I am reading a .docx file via a BufferedReader
Well, that's the first problem. BufferedReader is for plain text files. docx files are binary files in a specific format (assuming you mean the kind of file that Microsoft Word saves). You can't just read them like text files. Open the file up in Notepad (not WordPad) and you'll see what I mean.
You might want to look at Apache POI.
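For a .docx specifically, a minimal sketch using POI's XWPF extractor (this assumes the poi-ooxml dependency is on the classpath; file_Path and et1 are the names from the question):
try (FileInputStream fis = new FileInputStream(file_Path);
     XWPFDocument docx = new XWPFDocument(fis);
     XWPFWordExtractor extractor = new XWPFWordExtractor(docx)) {
    et1.setText(extractor.getText());
}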
From comments:
Testing to read a .txt file with the same text gave the same results too
That's probably due to using the wrong encoding. FileReader always uses the platform default encoding, which is annoying. Assuming you're using Java 7 or higher, you'd be better off with Files.newBufferedReader:
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
...
}
Adjust the charset to match the one you used when saving your text file, of course - if you have the option of using UTF-8, that's a pretty good choice. (Aside from anything else, pretty much everything can handle UTF-8.)
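Putting that together with the question's loop, a sketch (java.nio.file imports assumed; file_Path and et1 are the question's names):
StringBuilder text = new StringBuilder();
try (BufferedReader reader = Files.newBufferedReader(Paths.get(file_Path), StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        text.append(line).append('\n'); // readLine() strips line terminators; re-add them
    }
}
et1.setText(text);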

How can I change the text encoding of my Java program?

I have a Java program, which I develop with NetBeans.
I changed the settings in NetBeans so that it uses UTF-8.
But if I clean and build my program and run it on my Windows system, the text encoding changes and letters like "ü", "ä", and "ö" aren't displayed or handled properly anymore.
How can I tell my OS to use UTF-8?
Or is there any good workaround?
EDIT: Sorry for being so unspecific.
Well, first of all: I use Docx4j and Apache POI with their getText() methods to get some text from doc, docx, and pdf files and save it in a String.
Then I try to match keywords within those texts, which I read out of a .txt file.
Those keywords are displayed in a combobox in the runnable Java file.
I can see the encoding problems there: it won't match any of the keywords containing the characters described above.
In my IDE it's working fine.
I'll try to post some code here after I redesign it.
The TXT file is in UTF-8. If I convert it to ANSI, I see the same problems as in the jar.
Reading out of it:
if (inputfile.exists() && inputfile.canRead()) {
    try {
        FileReader reader = new FileReader(inputfilepath);
        BufferedReader in = new BufferedReader(reader);
        String zeile = null;
        while ((zeile = in.readLine()) != null) {
            while (zeile.startsWith("#")) {
                if (zeile.startsWith(KUERZELTITEL)) {
                    int cut = zeile.indexOf('=');
                    zeile = zeile.substring(cut, zeile.length());
                    eingeleseneTagzeilen.put(KUERZELTITEL, zeile.substring(1));
                    kuerzel = zeile.substring(1);
                }
                ...
This did it for me:
File readfile = new File(inputfilepath);
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(readfile), "UTF-8"));
Thanks!
Congratulations, I also use UTF-8 for my projects, which seems best.
Simply make sure that editor and compiler use the same encoding. This ensures that string literals in java are correctly encoded in the jar, .class files.
In NetBeans 7.3 there is now a single setting for this (I am using Maven builds).
Properties files are historically in ISO-8859-1, or use \uXXXX escapes, so there you have to take care; one workaround is sketched below.
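A sketch of that workaround, assuming Java 7+ (Properties.load(Reader) has existed since Java 6 and respects the reader's charset, unlike load(InputStream), which assumes ISO-8859-1; the file name is a placeholder):
Properties props = new Properties();
try (Reader r = new InputStreamReader(new FileInputStream("app.properties"), "UTF-8")) {
    props.load(r);
}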
Internally Java uses Unicode, so there might be no other problems.
FileReader reader = new FileReader(inputfilepath);
should be
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(inputfilepath), "UTF-8"));
The same procedure (an explicit extra encoding parameter) applies to FileWriter (use OutputStreamWriter + encoding), String.getBytes(encoding), and new String(bytes, encoding).
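A minimal sketch of the writing side, with a hypothetical outputfilepath and reusing the question's kuerzel string:
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(outputfilepath), "UTF-8"));
out.write(kuerzel);
out.close();
// and the byte-level equivalents:
byte[] bytes = kuerzel.getBytes("UTF-8");    // text -> bytes
String decoded = new String(bytes, "UTF-8"); // bytes -> text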
Try passing -Dfile.encoding=UTF-8 as a JVM argument.

BufferedReader automatic encoding type

I am using a BufferedReader to get data from a URL.
URL url = new URL("http://");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "windows-1251"));
On some URLs the encoding is windows-1251 (Cyrillic), so I specified that in the reader. But on some others the encoding is different, e.g. KOI8-R. Is there any way to get the data from both sources without using another reader? I really can use only one here.
No, the BufferedReader cannot examine the HTTP headers; the charset is usually declared in the Content-Type header, and you have to supply it to the reader yourself. Alternatively, use a library for encoding recognition/detection. One way to read the declared charset is sketched below.
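A minimal sketch of doing that by hand, assuming the server declares a charset in Content-Type (the windows-1251 fallback is an assumption for servers that don't):
URLConnection conn = url.openConnection();
String contentType = conn.getContentType(); // e.g. "text/html; charset=windows-1251"
String charset = "windows-1251"; // assumed fallback
if (contentType != null) {
    for (String param : contentType.split(";")) {
        param = param.trim().toLowerCase();
        if (param.startsWith("charset=")) {
            charset = param.substring("charset=".length());
        }
    }
}
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), charset));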

Reading EUC-encoded HTML using Java on Windows

I am trying to read an HTML file, encoded in EUC-KR, from a URL. When I run the code inside the IDE I get the desired output, but when I build a jar and run it, the data I read is shown as question marks ("????" instead of the Korean characters). I am assuming it is due to loss of encoding.
The meta of the site says the following:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Here is my code:
String line;
URL u = new URL("link to the site");
InputStream in = u.openConnection().getInputStream();
BufferedReader r = new BufferedReader(new InputStreamReader(in, "EUC-KR"));
while ((line = r.readLine()) != null) {
    // send the string to a text area --> this works fine now
}
// take the string and pass it through a ByteArrayInputStream
// --> this is where I believe the encoding is lost
InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
Reader reader = new InputStreamReader(xin);
EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
kit.read(reader, doc, 0);
HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);
while (it.isValid()) {
    chaps.add(doc.getText(it.getStartOffset(), it.getEndOffset() - it.getStartOffset()).trim());
    // chaps is an ArrayList<String>
    it.next();
}
I would appreciate it if someone could help me figure out how to grab the characters without losing the encoding, so the application runs on any platform independent of the system's default encoding.
Thanks
PS: When run as a jar, the program reports the system encoding as Cp1252; inside the IDE it is UTF-8.
InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
Reader reader = new InputStreamReader(xin);
This is a transcoding error. You encode a string as "EUC-KR" and decode it using the system encoding (resulting in junk). To avoid this, you would have to pass the encoding to the InputStreamReader.
However, it would be better to avoid all that encoding and decoding and just use a StringReader.
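A minimal sketch of that, reusing the names from the question's code:
Reader reader = new StringReader(thestring);
EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
kit.read(reader, doc, 0); // no byte round-trip, so no charset to get wrong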
