I am trying to read an HTML file encoded in EUC-KR from a URL. When I run the code inside the IDE I get the desired output, but when I build a jar and run it, the data I read is shown as question marks ("????" instead of the Korean characters). I am assuming the encoding is being lost somewhere.
The meta of the site says the following:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Here is my code:
String line;
URL u = new URL("link to the site");
InputStream in = u.openConnection().getInputStream();
BufferedReader r = new BufferedReader(new InputStreamReader(in, "EUC-KR"));
while ((line = r.readLine()) != null) {
    // send the string to a text area --> this works fine now
    // take the string and pass it through a ByteArrayInputStream --> this is where I believe the encoding is lost
    InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
    Reader reader = new InputStreamReader(xin);
    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    kit.read(reader, doc, 0);
    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);
    while (it.isValid()) {
        chaps.add(doc.getText(it.getStartOffset(), it.getEndOffset() - it.getStartOffset()).trim());
        // chaps is an ArrayList<String>
        it.next();
    }
}
I would appreciate it if someone could help me figure out how to grab the characters without losing the encoding, so that the application works on any platform independent of the system's default encoding.
Thanks
PS: When run as a jar the program reports the system encoding as Cp1252, and as UTF-8 when run inside the IDE.
InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
Reader reader = new InputStreamReader(xin);
This is a transcoding error. You encode a string as "EUC-KR" and decode it using the system encoding (resulting in junk). To avoid this, you would have to pass the encoding to the InputStreamReader.
However, it would be better to avoid all that encoding and decoding and just use a StringReader.
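For example (a sketch based on the question's own variables; thestring is the page text that was already read with the EUC-KR reader):
Reader reader = new StringReader(thestring); // no bytes involved, so nothing to transcode
EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
kit.read(reader, doc, 0);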
I have an input file that comes in ANSI Unix file format. I convert that file to UTF-8.
Before converting to UTF-8, there is a special character like this in the input file:
»
After converting to UTF-8, it becomes this:
û
When I process my file as-is, without converting to UTF-8, all the special characters disappear and there is data loss as well.
But when I process my file after converting it to UTF-8, all the data appears in the output file, with the special characters looking the same as they do after the UTF-8 conversion.
ANSI to UTF-8 conversion (could be wrong, please correct me if I'm wrong somewhere):
FileInputStream fis = new FileInputStream("inputtextfile.txt");
InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");
Reader in = new BufferedReader(isr);
FileOutputStream fos = new FileOutputStream("outputfile.txt");
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
Writer out = new BufferedWriter(osw);
int ch;
out.write("\uFEFF"); // write a UTF-8 BOM
while ((ch = in.read()) > -1) {
    out.write(ch);
}
out.close();
in.close();
After this I process my file further for the final output.
I'm using the Talend ETL tool (a Java-based ETL tool) for creating the final output from the generated UTF-8 file.
What I want is to process my file so that I get the same special characters in the output as I have in the input file.
I'm using Java 1.8 for this whole processing. I'm stuck in this situation and have never dealt with special characters like this before.
Any suggestion would be helpful.
Right now, I have some code that reads a page and saves everything to an HTML file. However, there are some problems... some punctuation and special characters show up as question marks.
Of course, if I do this manually, I'd save the .txt file with Unicode encoding rather than the default ANSI. I looked around, and all I found were complaints that it's impossible in Java, or half-explanations that I don't understand...
In any case, can anyone help me correct the question marks? Here is the part of my code that downloads the page. (The lister creates an array of URLs to download, to be used with sites that have multiple pages. You can ignore that; it works fine.)
public void URLDownloader(String site, int startPage, int endPage) throws Exception {
    String[] pages = URLLister(site, startPage, endPage);
    String webPage = pages[0];
    int fileNumber = startPage;
    if (startPage == 0)
        fileNumber++;
    //change pages
    for (int i = 0; i < pages.length; i++) {
        webPage = pages[i];
        URL url = new URL(webPage);
        BufferedReader in = new BufferedReader(
            new InputStreamReader(url.openStream()));
        PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html");
        String inputLine;
        //while stuff to read on current page
        while ((inputLine = in.readLine()) != null) {
            out.println(inputLine); //write line of text
        }
        out.close(); //end writing text
        if (startPage == 0)
            startPage++;
        console.append("Finished page " + startPage + "\n");
        startPage++;
    }
}
if I do this manually, I'd save the .txt file with Unicode encoding rather than the default ANSI
Windows is giving you misleading terminology here. There is no such encoding as ‘Unicode’; Unicode is the character set which is encoded in different ways into bytes. The encoding that Windows calls ‘Unicode’ is actually UTF-16LE. This is a two-byte-per-code-unit encoding that is not ASCII compatible and is generally inconvenient; Web pages tend not to work well with it.
(For what it's worth the ‘ANSI’ code page isn't anything to do with ANSI either. Plus ça change...)
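A tiny sketch to illustrate the difference (the byte values in the comments are what the JDK charsets produce for the character 'A'):
byte[] utf16le = "A".getBytes(StandardCharsets.UTF_16LE); // 0x41 0x00 - two bytes per code unit, not ASCII-compatible
byte[] utf8    = "A".getBytes(StandardCharsets.UTF_8);    // 0x41      - identical to ASCII here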
PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html");
This creates a file using the Java default encoding, which is likely the ANSI code page in your case. To specify a different encoding, use the optional second argument to PrintWriter:
PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html", "utf-8");
UTF-8 is usually a good choice: being a UTF it can store all Unicode characters, and it's ASCII-compatible too.
However! You are also reading in the string using the default encoding:
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
which probably isn't the encoding of the page. Again, you can specify the encoding using an optional parameter:
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "utf-8"));
and this will work fine if the web page was actually served as UTF-8.
But what if it wasn't? There are actually multiple ways the encoding of an HTML page can be determined:
1. From the Content-Type: text/html;charset=... header parameter, if present.
2. From the <?xml declaration, if it's served as application/xhtml+xml.
3. From the <meta> equivalent tag in the page, if (1) and (2) were not present.
4. From browser-specific guessing heuristics, which may depend on user settings.
You can get (1) by calling openConnection() on the URL and reading getContentType() on the resulting URLConnection, then parsing out the charset parameter. To get (2) or (3) you have to actually parse the file, which is kind of bad news. (4) is out of reach.
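For example, a minimal sketch of (1), with url being the java.net.URL from the code above (the charset parsing here is deliberately simplistic and assumes a reasonably well-formed header):
URLConnection conn = url.openConnection();
String contentType = conn.getContentType();  // e.g. "text/html; charset=euc-kr", or null
String charset = "utf-8";                    // fallback when no charset parameter is present
if (contentType != null) {
    for (String param : contentType.split(";")) {
        param = param.trim();
        if (param.toLowerCase().startsWith("charset=")) {
            charset = param.substring("charset=".length());
        }
    }
}
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), charset));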
Probably the most consistent thing you can do is just what web browsers (except IE) do when they save a standalone web page to disc: take the exact original bytes that were served and put them straight into a file without any attempt to decode them. Then you don't have to worry about encodings or line ending changes. It does mean any charset metadata in the HTTP headers gets lost, but there's not really much you can do about that short of parsing the HTML and inserting a <meta> tag yourself (probably far too much faff).
InputStream in = url.openStream();
OutputStream out = new FileOutputStream(name + (fileNumber+i) + ".html");
byte[] buffer = new byte[1024*1024];
int len;
while ((len = in.read(buffer)) != -1) {
    out.write(buffer, 0, len);
}
out.close();
in.close();
(NB: the buffer copy loop is adapted from another question, which also offers alternatives such as IOUtils.)
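If Java 7+ is available, a small sketch of an alternative, reusing the variables from the snippet above (and assuming it is acceptable to overwrite an existing target file), is to let java.nio.file.Files do the byte copy:
try (InputStream in = url.openStream()) {
    Files.copy(in, Paths.get(name + (fileNumber+i) + ".html"),
               StandardCopyOption.REPLACE_EXISTING);
}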
I have a Java program which I develop with NetBeans.
I changed the settings in NetBeans so that it uses UTF-8.
But when I clean and build my program and use it on my Windows system, the text encoding changes and letters like "ü", "ä", and "ö" aren't displayed or handled properly anymore.
How can I tell my OS to use UTF-8?
Or is there any good workaround?
EDIT: Sorry for being so unspecific.
Well, first of all: I use docx4j and Apache POI with their getText() methods to get text from doc, docx, and PDF files and save it in a String.
Then I try to match keywords within those texts, which I read out of a .txt file.
Those keywords are displayed in a combobox in the runnable jar.
I can see the encoding problems there: it won't match any of the keywords containing the characters described above.
In my IDE it works fine.
I'll try to post some code here after I redesign it.
The .txt file is in UTF-8. If I convert it to ANSI I see the same problems as in the jar.
Reading from it:
if (inputfile.exists() && inputfile.canRead())
{
    try {
        FileReader reader = new FileReader(inputfilepath);
        BufferedReader in = new BufferedReader(reader);
        String zeile = null;
        while ((zeile = in.readLine()) != null) {
            while (zeile.startsWith("#"))
            {
                if (zeile.startsWith(KUERZELTITEL)) {
                    int cut = zeile.indexOf('=');
                    zeile = zeile.substring(cut, zeile.length());
                    eingeleseneTagzeilen.put(KUERZELTITEL, zeile.substring(1));
                    kuerzel = zeile.substring(1);
                }
                ...
this did it for me:
File readfile = new File(inputfilepath);
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream(readfile), "UTF8"));
Thx!
Congratulations, I also use UTF-8 for my projects, which seems best.
Simply make sure that the editor and the compiler use the same encoding. This ensures that string literals in Java are correctly encoded in the .class files inside the jar.
In NetBeans 7.3 there is now a single setting for this (I am using Maven builds).
Properties files are historically in ISO-8859-1 or encoded as \uXXXX. So there you have to take care.
Internally Java uses Unicode, so there might be no other problems.
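For example, a minimal Java 7+ sketch for loading a UTF-8 properties file explicitly (the file name is hypothetical); Properties.load(Reader) honours the reader's encoding, whereas load(InputStream) assumes ISO-8859-1:
Properties props = new Properties();
try (Reader r = new InputStreamReader(new FileInputStream("keywords.properties"), "UTF-8")) {
    props.load(r); // the reader decides the encoding here
}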
FileReader reader = new FileReader(inputfilepath);
should be
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(inputfilepath), "UTF-8"));
The same procedure (explicit extra encoding parameter) for FileWriter (OutputStreamWriter + encoding), String.getBytes(encoding), new String(bytes, encoding).
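For instance, a quick sketch (assuming text is the String being written and out.txt is a placeholder name):
Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("out.txt"), "UTF-8")); // instead of new FileWriter("out.txt")
byte[] bytes = text.getBytes("UTF-8");              // instead of text.getBytes()
String decoded = new String(bytes, "UTF-8");        // instead of new String(bytes)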
Try passing -Dfile.encoding=utf-8 as JVM argument.
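For example, when launching the jar (the jar name here is just a placeholder):
java -Dfile.encoding=UTF-8 -jar myapp.jar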
I have finished a project in which I read from a text file written with Notepad.
The characters in my text file are in Arabic, and the file encoding is UTF-8.
When launching my project inside NetBeans (7.0.1) everything seems to be OK, but when I build the project as a .jar file the characters are displayed like this: ÇáãæÇÞÚááÊØæíÑ.
How can I solve this problem, please?
Most likely you are using the JVM default character encoding somewhere. If you are 100% sure your file is encoded using UTF-8, make sure you explicitly specify UTF-8 when reading as well. For example, this piece of code is broken:
new FileReader("file.txt")
because it uses the JVM default character encoding, which you might not have control over; apparently NetBeans uses UTF-8 while your operating system defines something different. Note that this makes the FileReader class completely useless if you want your code to be portable.
Instead use the following code snippet:
new InputStreamReader(new FileInputStream("file.txt"), "UTF-8");
You are not providing your code, but this should give you a general impression of how it should be implemented.
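A minimal sketch of reading the whole file this way (file.txt as in the snippet above; on Java 7+ you can pass StandardCharsets.UTF_8 instead of the string literal):
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("file.txt"), "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line); // the console you print to must also be able to display Arabic
}
reader.close();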
Maybe this example will help a little. I will try to print the content of a UTF-8 file both to the IDE console and to a system console that is encoded in Cp852.
My d:\data.txt contains ąźżćąś adsfasdf
Let's check this code:
// I will read chars using UTF-8 encoding
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("d:\\data.txt"), "utf-8"));
// and write to the console using Cp852 encoding (works for my Windows 7 console)
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out,
        "Cp852"), true); // "Cp852" is the encoding used by my console on Win7
// ok, let's read data from the file
String line;
while ((line = in.readLine()) != null) {
    // here I use the IDE encoding
    System.out.println(line);
    // here I print the data using Cp852 encoding
    out.println(line);
}
When I run it in Eclipse the output is
ąźżćąś adsfasdf
Ą«ľ†Ą? adsfasdf
but in the system console it is the other way around: the line written through the Cp852 writer displays correctly there, while the System.out line is garbled.
I have a PHP SOAP server (using nuSOAP with WSDL) that receives the content of an HTML page. Of course, the HTML can be encoded with different encodings, but this parameter is of base64Binary type in the XML, and I receive the HTML in its "native encoding" without problems.
To test it, I have coded three SOAP clients in PHP, C#, and Java 6, and with the first two I have no problem. The Java client was made using wsimport 2.1, and an example of the code looks like this:
FileInputStream file = new FileInputStream(new File("/tmp/chinese.htm"));
BufferedReader buffer = new BufferedReader(new InputStreamReader(file, "BIG5"));
String line;
String content = "";
while ((line = buffer.readLine()) != null)
    content += line + "\n";
FileManagerAPI upload = new FileManagerAPI();
FileManagerAPIPortType servUpload = upload.getFileManagerAPIPort();
BigInteger result = servUpload.apiControllerServiceUploadHTML(
        "http://www.test.tmp/因此鳥哥建議您務.html", content.getBytes());
The problem is that, before sending the HTML in base64, only the Java client encodes the HTML content to UTF-8, so when PHP receives this file the server treats it as a "UTF-8 file", not as a "BIG5 file".
The question is: how can I avoid that first UTF-8 encoding, or at least do the UTF-8 encoding after the base64 step, not before?
Thanks in advance.
It looks like you need to convert the file from UTF-8 (I think that's the encoding of /tmp/chinese.htm) to BIG5 first.
To convert a file's content, read the file and re-encode it, for example with iconv:
$path = '/tmp/chinese.htm';
$buffer = file_get_contents($path);
$buffer = iconv('UTF-8', 'BIG5', $buffer);
The buffer $buffer is now re-encoded from UTF-8 into BIG5.
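Alternatively, on the Java side, a minimal sketch of avoiding the transcoding altogether (assuming Java 7+ for Files.readAllBytes, and that the generated apiControllerServiceUploadHTML really accepts a byte[], as in the question) would be to pass the raw file bytes straight through instead of decoding them into a String first:
byte[] raw = Files.readAllBytes(Paths.get("/tmp/chinese.htm")); // bytes stay BIG5, untouched
FileManagerAPI upload = new FileManagerAPI();
FileManagerAPIPortType servUpload = upload.getFileManagerAPIPort();
BigInteger result = servUpload.apiControllerServiceUploadHTML(
        "http://www.test.tmp/因此鳥哥建議您務.html", raw);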