Right now, I have some code that reads a page and saves everything to an HTML file. However, there are some problems: some punctuation and special characters show up as question marks.
Of course, if I did this manually, I'd save the .txt file with Unicode encoding rather than the default ANSI. I looked around, and all I can find is either complaints that it's impossible in Java or half-explanations that I don't understand...
In any case, can anyone help me fix the question marks? Here is the part of my code that downloads the page. (The lister creates an array of URLs to download, for use with paginated sites. You can ignore that; it works fine.)
public void URLDownloader(String site, int startPage, int endPage) throws Exception {
    String[] pages = URLLister(site, startPage, endPage);
    String webPage = pages[0];
    int fileNumber = startPage;
    if (startPage == 0)
        fileNumber++;

    //change pages
    for (int i = 0; i < pages.length; i++) {
        webPage = pages[i];
        URL url = new URL(webPage);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()));
        PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html");
        String inputLine;

        //while stuff to read on current page
        while ((inputLine = in.readLine()) != null) {
            out.println(inputLine); //write line of text
        }
        out.close(); //end writing text
        if (startPage == 0)
            startPage++;
        console.append("Finished page " + startPage + "\n");
        startPage++;
    }
}
if I do this manually, I'd save the .txt file with Unicode encoding rather than the default ANSI
Windows is giving you misleading terminology here. There is no such encoding as ‘Unicode’; Unicode is the character set which is encoded in different ways into bytes. The encoding that Windows calls ‘Unicode’ is actually UTF-16LE. This is a two-byte-per-code-unit encoding that is not ASCII compatible and is generally inconvenient; Web pages tend not to work well with it.
(For what it's worth the ‘ANSI’ code page isn't anything to do with ANSI either. Plus ça change...)
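The difference is easy to see from Java: encode a single ASCII character in each scheme and count the bytes (a minimal sketch; note that `getBytes` with `UTF_16LE` does not prepend the BOM that Windows Notepad's 'Unicode' option writes):

```java
import java.nio.charset.StandardCharsets;

public class EncodingWidth {
    public static void main(String[] args) {
        String s = "A";
        // UTF-16LE: two bytes per code unit, not ASCII-compatible
        System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 2
        // UTF-8: ASCII characters stay one byte
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);   // 1
    }
}
```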
PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html");
This creates a file using the Java default encoding, which is likely the ANSI code page in your case. To specify a different encoding, use the optional second argument to PrintWriter:
PrintWriter out = new PrintWriter(name + (fileNumber+i) + ".html", "utf-8");
UTF-8 is usually a good choice: being a UTF it can store all Unicode characters, and it's ASCII-compatible too.
However! You are also reading in the string using the default encoding:
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
which probably isn't the encoding of the page. Again, you can specify the encoding using an optional parameter:
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "utf-8"));
and this will work fine if the web page was actually served as UTF-8.
But what if it wasn't? There are actually multiple ways the encoding of an HTML page can be determined:
1. From the Content-Type: text/html;charset=... header parameter, if present.
2. From the <?xml declaration, if it's served as application/xhtml+xml.
3. From the <meta> equivalent tag in the page, if (1) and (2) were not present.
4. From browser-specific guessing heuristics, which may depend on user settings.
You can get (1) by reading URL.openConnection().getContentType() and parsing out the parameter. To get (2) or (3) you have to actually parse the file, which is kind of bad news. (4) is out of reach.
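A sketch of option (1), pulling the charset parameter out of a Content-Type value. The helper class name and the UTF-8 fallback are my own assumptions; in the downloader above you would feed it url.openConnection().getContentType():

```java
public class CharsetSniffer {
    // Hypothetical helper: extract "charset=..." from a Content-Type header value.
    public static String charsetFromContentType(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    return param.substring("charset=".length());
                }
            }
        }
        return "UTF-8"; // fallback assumption when no charset is declared
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-7"));
    }
}
```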
Probably the most consistent thing you can do is just what web browsers (except IE) do when they save a standalone web page to disc: take the exact original bytes that were served and put them straight into a file without any attempt to decode them. Then you don't have to worry about encodings or line ending changes. It does mean any charset metadata in the HTTP headers gets lost, but there's not really much you can do about that short of parsing the HTML and inserting a <meta> tag yourself (probably far too much faff).
InputStream in = url.openStream();
OutputStream out = new FileOutputStream(name + (fileNumber+i) + ".html");
byte[] buffer = new byte[1024*1024];
int len;
while ((len = in.read(buffer)) != -1) {
    out.write(buffer, 0, len);
}
out.close();
in.close();
(nb buffer copy loop from this question which offers alternatives such as IOUtils.)
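On Java 9 and later, the hand-rolled buffer loop can also be replaced by InputStream.transferTo. A minimal sketch (the copy is demonstrated with in-memory streams here; in the downloader the source would be url.openStream() and the sink a FileOutputStream):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class RawCopy {
    // Copies all bytes from in to out with no decoding step, so the
    // original encoding of the page is preserved exactly (Java 9+).
    static void copy(InputStream in, OutputStream out) throws IOException {
        try (in; out) {
            in.transferTo(out);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream("raw bytes".getBytes()), sink);
        System.out.println(sink); // raw bytes
    }
}
```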
Related
I have a CSV file which I saved from Excel as "CSV UTF-8" encoded.
My Java code reads the file as a byte array,
then:
String result = new String(b, 0, b.length, "UTF-8");
But somehow the content "Montréal" becomes "Montr?al" when saved to the DB. What might be the problem?
The environment is unix with:
-bash-4.1$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
BTW it works on my Windows machine: when I run my code there, I see the correct "Montréal" in the DB. So my guess is that the environment has some default locale setting that forces the use of the default encoding.
Thanks
I don't have your complete code, but I tried the following code and it works for me:
String x = "c:/Book2.csv";
BufferedReader br = null;
try {
    br = new BufferedReader(new InputStreamReader(new FileInputStream(x), "UTF8"));
    String b;
    while ((b = br.readLine()) != null) {
        System.out.println(b);
    }
} finally {
    if (br != null) {
        br.close();
    }
}
If you see "Montr?al" printed on your console, don't worry; it does not mean the program is not working. Your console may simply not support printing UTF-8 characters. Set a breakpoint and inspect the variable to check whether it has what you want.
If you see the correct value in the debugger but a "?" in your output, you can rest assured that the String variable holds the right value, and you can write it back to any file or DB as needed.
If you see "?" when you query your DB, the tool you are using may not be printing the output correctly. Try reading the DB value in Java code and check it with a breakpoint. I usually use PuTTY to query the DB so that I can see double-byte characters correctly. That's all I have, hope that helps.
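The point about not trusting the console can be sketched like this: inspect the String numerically instead of printing it (the word is the one from the question, written with a Unicode escape so the source file's own encoding can't interfere):

```java
public class CheckString {
    public static void main(String[] args) throws Exception {
        // "Montréal" round-tripped through UTF-8 bytes, as in the question
        byte[] b = "Montr\u00E9al".getBytes("UTF-8");
        String result = new String(b, 0, b.length, "UTF-8");
        // Inspecting code points does not depend on the console's encoding
        // the way println does
        System.out.println(result.length());        // 8
        System.out.println((int) result.charAt(5)); // 233 (U+00E9, é)
    }
}
```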
You have to use ISO/IEC 8859, not UTF-8; if you look at the list of character encodings on the Wikipedia page you'll understand the difference.
Basically, UTF-8 is the common encoding used by Western countries...
Also, check your terminal encoding; maybe the problem is there.
I am facing an issue in displaying greek characters. The characters should appear as σ μυστικός αυτό? but they are appearing as ó ìõóôéêüò áõôü?
There are some other greek characters which appear fine but the above text appears garbled.
The content is read from an HTML file by a servlet using the following code:
public String getResponse() {
    StringBuffer sb = new StringBuffer();
    try {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(fn), "8859_1"));
        String line = null;
        while ((line = in.readLine()) != null) {
            sb.append(line);
        }
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return sb.toString();
}
I am setting encoding as UTF-8 while sending back response:
PrintWriter out;
if ((encodings != null) && (encodings.indexOf("gzip") != -1)) {
    OutputStream out1 = response.getOutputStream();
    out = new PrintWriter(new GZIPOutputStream(out1), false);
    response.setHeader("Content-Encoding", "gzip");
} else {
    out = response.getWriter();
}
response.setCharacterEncoding("UTF-8");
response.setContentType("text/html;charset=UTF-8");
out.println(getResponse());
The characters appear fine on my local development machine (which is Windows), but appear garbled when deployed on a CentOS Server. Both machines have JDK7 and Tomcat 7 installed.
I'm 99% sure the problem is your input encoding (when you read the data). You're decoding it as ISO-8859-1 when it's probably ISO-8859-7 instead. This would cause the symptoms you see.
The simplest way to check would be to open the HTML in a hex editor and examine the character encodings directly. If the Greek characters take up one byte each then it's almost definitely ISO-8859-7 (not -1). If they take up 2 bytes each then it's UTF-8.
From what you posted it looks like ISO-8859-7. In that character set, the lower-case sigma σ is 0xF3, while in ISO-8859-1 that same code maps to ó, which matches the data you showed. I'm sure if you mapped all the remaining characters you'd see a 1-to-1 match in the codes. Maybe your Windows system's default codepage is ISO-8859-7?
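The mis-mapping described above can be reproduced directly: decode the same byte under both charsets and σ comes out as ó (the byte value and characters are the ones discussed in the answer; both charset names ship with the JDK):

```java
import java.nio.charset.Charset;

public class SigmaDemo {
    public static void main(String[] args) {
        // 0xF3 is σ (lower-case sigma) in ISO-8859-7 but ó in ISO-8859-1
        byte[] bytes = { (byte) 0xF3 };
        System.out.println(new String(bytes, Charset.forName("ISO-8859-1"))); // ó
        System.out.println(new String(bytes, Charset.forName("ISO-8859-7"))); // σ
    }
}
```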
I am trying to submit a form with fields containing special characters, such as €ŠšŽžŒœŸ. As far as I can see from the ISO-8859-15 Wikipedia page, these characters are included in the standard. Even though the encoding for both request and response is set to ISO-8859-15, when I try to display the values (using FreeMarker 2.3.18 in a Java EE environment), they come out as ???????. I have set the form's accepted charset to ISO-8859-15 and checked (using Firebug) that the form is submitted with content type text/html;charset=ISO-8859-15, but I can't figure out how to display the correct characters. If I run the following code, the correct hex value is displayed (e.g. Ÿ = be):
System.out.println(Integer.toHexString(myString.charAt(i)));
What am I missing? Thank you in advance!
EDIT:
I am having the following code as I process the request:
PrintStream ps = new PrintStream(System.out, true, "ISO-8859-15");
String firstName = request.getParameter("firstName");
// check for null before
for (int i = 0; i < firstName.length(); i++) {
    ps.println(firstName.charAt(i)); // prints "?"
}

BufferedWriter file = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(path), "ISO-8859-15"));
file.write(firstName); // writes "?" to file (checked with notepad++, correct encoding set)
file.close();
According to the hex value, the form data is submitted correctly.
The problem seems to be related to the output. Java replaces a character with ? if it cannot be represented with the charset in use.
You have to use a correct charset when constructing the output stream. What commands do you use for that? I do not know FreeMarker but there will probably be something like
Writer out = new OutputStreamWriter(System.out);
This should be replaced with something resembling this:
Writer out = new OutputStreamWriter(System.out, "iso-8859-15");
By the way, UTF-8 is usually a much better choice for the output encoding.
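The "?" substitution described above is easy to reproduce: encoding a character that the charset cannot represent yields the replacement byte 0x3F ('?'). The € sign is a good test case, since it exists in ISO-8859-15 but not in ISO-8859-1:

```java
public class ReplacementDemo {
    public static void main(String[] args) throws Exception {
        String euro = "\u20AC"; // €
        // ISO-8859-1 cannot represent €, so getBytes substitutes '?' (0x3F)
        byte[] bad = euro.getBytes("ISO-8859-1");
        System.out.println(new String(bad, "ISO-8859-1"));   // ?
        // ISO-8859-15 placed € at 0xA4, so it survives the round trip
        byte[] good = euro.getBytes("ISO-8859-15");
        System.out.println(new String(good, "ISO-8859-15")); // €
    }
}
```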
I have a Java program, which I develop with NetBeans.
I changed the settings in NetBeans so that it uses UTF-8.
But if I clean and build my program and run it on my Windows system, the text encoding changes, and letters like "ü", "ä", and "ö" aren't displayed or matched properly anymore.
How can I tell my OS to use UTF-8?
Or is there a good workaround?
EDIT: Sorry for being so unspecific.
Well, first of all: I use docx4j and Apache POI with their getText() methods to get some text from .doc, .docx, and .pdf files and save it in a String.
Then I try to match keywords within those texts, which I read out of a .txt file.
Those keywords are displayed in a combobox in the runnable Java file.
I can see the encoding problems there: it won't match any of the keywords containing the letters described above.
In my IDE it's working fine.
I'll try to post some code here after I clean it up.
The .txt file is in UTF-8. If I convert it to ANSI, I see the same problems as in the JAR.
reading out of it:
if (inputfile.exists() && inputfile.canRead()) {
    try {
        FileReader reader = new FileReader(inputfilepath);
        BufferedReader in = new BufferedReader(reader);
        String zeile = null;
        while ((zeile = in.readLine()) != null) {
            while (zeile.startsWith("#")) {
                if (zeile.startsWith(KUERZELTITEL)) {
                    int cut = zeile.indexOf('=');
                    zeile = zeile.substring(cut, zeile.length());
                    eingeleseneTagzeilen.put(KUERZELTITEL, zeile.substring(1));
                    kuerzel = zeile.substring(1);
                }
                ...
this did it for me:
File readfile = new File(inputfilepath);
BufferedReader in = new BufferedReader(
new InputStreamReader(
new FileInputStream(readfile), "UTF8"));
Thx!
Congratulations, I also use UTF-8 for my projects, which seems best.
Simply make sure that the editor and the compiler use the same encoding. This ensures that string literals in Java are correctly encoded in the jar and .class files.
In NetBeans 7.3 there is now one setting (I am using maven builds).
Properties files are historically in ISO-8859-1 or encoded as \uXXXX. So there you have to take care.
Internally Java uses Unicode, so there might be no other problems.
FileReader reader = new FileReader(inputfilepath);
should be
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(inputfilepath), "UTF-8"));
The same procedure (an explicit encoding parameter) applies to FileWriter (OutputStreamWriter + encoding), String.getBytes(encoding), and new String(bytes, encoding).
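A sketch of that principle applied at each conversion point (an in-memory stream stands in for the file here, so the example is self-contained; the text is escaped so source-file encoding can't interfere):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class ExplicitEverywhere {
    public static void main(String[] args) throws IOException {
        String text = "\u00E4\u00F6\u00FC"; // "äöü"
        // Writing: OutputStreamWriter with an explicit charset, instead of
        // FileWriter which silently uses the platform default
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, "UTF-8")) {
            w.write(text);
        }
        // String <-> bytes: name the charset on both sides
        String back = new String(buf.toByteArray(), "UTF-8");
        System.out.println(back.equals(text)); // true
    }
}
```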
Try passing -Dfile.encoding=utf-8 as JVM argument.
I am building an app where users have to guess a secret word, and I have *.txt files in the assets folder. The problem is that the words are in the Albanian language. Our language uses letters like "ë" and "ç", so whenever I try to read a word containing any of those characters from the file, I get some wicked symbol, and I cannot use String.compare() for these characters. I have tried many options with UTF-8 and changed Eclipse settings, but I still get the same error.
I would really appreciate it if someone has any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
    aBuffer += aDataRow + "\n";
    stringList.add(aDataRow);
}
Otherwise the code works fine, except for the mentioned characters.
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is ISO/IEC 8859-1 or ISO/IEC 8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
You should know the encoding of the file.
The InputStream class reads the file as binary. Although you can interpret the input as characters, that would be implicit guessing, which may be wrong.
The InputStreamReader class converts binary to chars, but it needs to know the character set.
You should use the InputStreamReader constructor that takes an explicit character set.
UPDATE
Don't assume you have a UTF-8 encoded file; that may be wrong. Here in Russia we have such encodings as CP866, WIN1251, and KOI8, which all differ from UTF-8. Probably you have some popular Albanian encoding of text files. Check your OS settings to guess.
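The guessing problem can be demonstrated concretely: one and the same byte decodes to three different characters under the three Russian encodings mentioned (using the canonical JDK names windows-1251, KOI8-R, and IBM866 for WIN1251, KOI8, and CP866):

```java
public class SameByteThreeReadings {
    public static void main(String[] args) throws Exception {
        // A single byte, three different interpretations
        byte[] b = { (byte) 0xC0 };
        System.out.println(new String(b, "windows-1251")); // А (Cyrillic capital A)
        System.out.println(new String(b, "KOI8-R"));       // ю
        System.out.println(new String(b, "IBM866"));       // └ (box-drawing)
    }
}
```

This is why the only reliable fix is to find out the file's actual encoding rather than trying candidates until the output looks right.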