Reading the content of web page - java

Hi,
I want to read the content of a web page that contains German characters using Java. Unfortunately, the German characters appear as strange characters.
Any help, please?
Here is my code:
String link = "some german link";
URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}

You need to specify the character set for your InputStreamReader, like
InputStreamReader(url.openStream(), "UTF-8")
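To see why the charset argument matters, here is a minimal sketch that runs without a network connection. It assumes the server sends UTF-8; the class name ExplicitCharsetDemo, the helper firstLine and the sample word are made up for illustration, and a ByteArrayInputStream stands in for url.openStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

    // Reads the first line of the stream, decoding its bytes with the given charset.
    static String firstLine(InputStream raw, Charset charset) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(raw, charset))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A German word with u-umlaut and sharp s, written with Unicode escapes
        // so the example does not depend on the encoding of the source file.
        String text = "Gr\u00fc\u00dfe";
        // Stand-in for url.openStream(): the bytes a UTF-8 server would send.
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);

        String right = firstLine(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        String wrong = firstLine(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(text)); // true
        System.out.println(wrong.equals(text)); // false: each umlaut became two characters
        System.out.println(wrong.length());     // 7 instead of 5
    }
}
```

Decoding the same bytes with the wrong charset is what produces the "strange characters": every two-byte UTF-8 sequence turns into two separate Latin-1 characters.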

You have to set the correct encoding. You can find the encoding in the HTTP header:
Content-Type: text/html; charset=ISO-8859-1
This may be overridden in the (X)HTML document; see HTML Character encodings.
I can imagine that you have to consider many additional issues to parse a web page error-free. There are, however, several HTTP client libraries available for Java, e.g. org.apache.httpcomponents. The code will look like this:
DefaultHttpClient httpclient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://www.spiegel.de");
try {
    HttpResponse response = httpclient.execute(httpGet);
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        System.out.println(EntityUtils.toString(entity));
    }
} catch (ClientProtocolException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
This is the maven artifact:
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.1.1</version>
<type>jar</type>
<scope>compile</scope>
</dependency>

Try setting a Charset:
new BufferedReader(new InputStreamReader(url.openStream(), Charset.forName("UTF-8") ));

First, verify that the font you are using supports the particular German characters you are trying to display. Many fonts don't carry all characters, and it is a big pain to hunt for other causes when the problem is a simple "missing character" issue.
If that's not the issue, then either your input or your output is in the wrong character set. A character set determines how the numbers representing characters are mapped to glyphs (the pictures representing the characters). Java strings are UTF-16 internally and can hold any German character, so the corruption happens where bytes are decoded on input or encoded on output. Check the input stream first.

Related

Code is not translating german characters from Google Books API correctly

I have produced a little app that searches Google Books and displays the data I retrieve in a neat but simple fashion. Everything works so far, but there is an issue directly at the source: although Google correctly returns German search results, all special German characters (Ä, Ö, Ü and probably ß) show up as the "�" dummy or sometimes just "?".
I was able to confirm that the JSONObject built from the InputStream already contains those mistakes, so it seems the original input stream from Google is not being read correctly. The odd thing is that I have passed "UTF-8" (which should cover German characters) to my InputStreamReader, but apparently to no avail.
Here is the http-request procedure I am using:
public class HttpRequest {
    public static String request(String urlString) throws IOException {
        URL url = new URL(urlString);
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(5000);
        connection.setReadTimeout(10000);
        BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
        StringBuilder builder = new StringBuilder();
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            builder.append(inputLine);
        in.close();
        return builder.toString();
    }
}
What else could be going wrong? I checked the StringBuilder already, but the mistakes are already in the inputLine(s) that get read out of the BufferedReader.
Also, I was unable to find any language or encoding specific settings in the official google books api guide, so I guess they should come with universal encoding, but then the "UTF-8" flag should detect them, or not?

The easiest approach is to check the raw data another way, for example in a browser. Looking at a Google Books API response in the browser is quite simple: just open the URL and the response comes back as JSON. You can optionally install a JSON viewer plugin, but it is not needed for this.
For example use this url:
https://www.googleapis.com/books/v1/volumes?q=Latein+key=NO
Checking the HTTP headers (in the browser developer tools, for example), you can see that the header lists the content as having the expected encoding:
content-type: application/json; charset=UTF-8
Looking at the content of some German results, we can see that the German special characters are correct for some books but not for all; it depends on the book in question.
Conclusion: UTF-8 is indeed correct, and the source data itself has missing or wrong German characters for some texts.

Default character encoding in java for inputStream of HTTPUrlConnection

I am using the InputStream of Java's HttpURLConnection to get the body of a URL and write it to a file.
Things work fine on my laptop (Ubuntu/CentOS desktop version), but on the server (CentOS 6.5 server edition), special characters in the incoming body get garbled into question marks.
I compared Java's Charset.defaultCharset() and System.getProperty("file.encoding"); both are the same on the laptop and the server.
Can anyone please help me find out what differs between the laptop and server OS with regard to character encoding?
StringBuilder response = new StringBuilder();
URL obj = new URL("http://www.Some URL That Has spl Char (eg. EN Dash)");
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    response.append(inputLine);
}

The encoding is often given in the headers, as the charset parameter of Content-Type (con.getContentType(), which may be null or lack a charset). Note that con.getContentEncoding() returns the Content-Encoding header (for example gzip), not the character set. The charset is what you need to convert an InputStream to a Reader (InputStreamReader).
If you are using InputStream/OutputStream, you are working with binary data as is, hence no corruption will occur. But you'll lose the header info that might have said something about the encoding. You might want to store any text as UTF-8 for consistency. However, in HTML the encoding may also be given in the content.
On the given code
The input is decoded with the platform default encoding, which varies by platform and even by user settings.
Better to use an explicit encoding.
// The charset, when declared, is a parameter of the Content-Type header,
// e.g. "text/html; charset=UTF-8" (needs java.util.regex.Matcher and Pattern).
String contentType = con.getContentType();
// Otherwise ISO-8859-1 is the HTTP standard, and
// browsers extend ISO-8859-1 to Windows-1252.
String encoding = "Windows-1252";
if (contentType != null) {
    Matcher m = Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE).matcher(contentType);
    if (m.find()) {
        encoding = m.group(1);
    }
}
Charset charset = Charset.forName(encoding);
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream(), charset));
And of course, write the String of the StringBuilder to the file with the right encoding as well.
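For the last point, writing the text out with an explicit encoding, a minimal sketch might look like the following. The class name WriteWithCharset, the helper writeUtf8 and the temp file are illustrative assumptions, not part of the original code; the sample text contains an en dash (U+2013), the character mentioned in the question.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class WriteWithCharset {

    // Writes the text as UTF-8 regardless of the platform default encoding.
    static void writeUtf8(Path target, String text) {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("response", ".txt");
        // "2010", an en dash written as a Unicode escape, then "2020"
        writeUtf8(target, "2010\u20132020");
        // The en dash takes three bytes in UTF-8, so the file holds 4 + 3 + 4 bytes.
        System.out.println(Files.size(target)); // 11
        Files.delete(target);
    }
}
```

A FileWriter created without a charset uses the platform default instead, which is one way the same program can behave differently on a laptop and a server.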

Android HttpPut and weird character apparition

I am trying to send an XML document using HttpPut and I always get error 400. I tried several REST clients for Firefox, and both my URI and my XML are OK.
So I checked the packets with Wireshark and found something weird: there seems to be a '?' character at the beginning of the XML. Of course this '?' is not in my XML file, and I can't find where it comes from. When I put my XML in a variable in the code everything works fine, but if I read the XML from the file in the Eclipse assets directory, the '?' appears...
Here's a sample of my code. I tried everything: with added headers, without added headers, reading bytes instead of lines... and got error 400 every time. The problem comes from this part of the code, because if I add the first line of the XML file "manually" (as I did in the code shown here), the '?' comes after the first line instead of at the beginning (where it appears if I read all the XML from the file).
BufferedReader reader = new BufferedReader(new InputStreamReader(c.getAssets().open("data.xml")));
String line;
String f = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>";
while ((line = reader.readLine()) != null) {
    f += line;
}
reader.close();
StringEntity se = new StringEntity(new String(f));
se.setContentEncoding(new BasicHeader(HTTP.CONTENT_TYPE, "text/xml;charset=UTF-8"));
reqPut.setEntity(se);
httpResp = (BasicHttpResponse) httpCli.execute(reqPut);
So if anyone has a clue about this....

Problems parsing Spanish characters (á, é, í, ó, ú) from XML response

I'm developing a Java app that calls a PHP script on the internet, which returns an XML response.
The response contains the word "Próximo", but when I parse the nodes of the XML and store the response in a String variable, I receive the word like this: "Próximo".
How can I solve this?

Try StringEscapeUtils.unescapeHtml() from Apache Commons Lang (unescapeHtml4() in Commons Lang 3).

You are probably using a different encoding in your Java app than the one used by the PHP script. Try setting the encoding of your stream, for example like this:
URL oracle = new URL("http://www.yourpage.com/");
URLConnection yc = oracle.openConnection();
// Set the encoding here to the same one your PHP script uses
BufferedReader in = new BufferedReader(new InputStreamReader(
        yc.getInputStream(), "utf-8"));
String inputLine;
while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);

I found a solution to this problem: while parsing, use the "ISO-8859-1" encoding, and use the Html.fromHtml(string) method when storing your values into the bean, where "string" is the value inside each tag of the XML response.

Reading UTF-8 encoded XML from URL in java

I'm trying to read XML data from the Google weather web service. The response contains some Spanish characters, and the problem is that these characters are not displayed properly. I've tried to convert everything to UTF-8, but that does not seem to help. The code is given below:
public static void main(String[] args) {
    try {
        URL url = new URL("http://www.google.com/ig/api?weather=Noja&hl=es");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                con.getInputStream(), "UTF-8"));
        String str = in.readLine();
        // this does not work either
        //String str = new String(in.readLine().getBytes("UTF-8"),"UTF-8");
        System.out.println(str);
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The output is given below (trimmed to keep the post within the character limit). Notice "mi�" and "s�b":
<day_of_week data="mi�"/><day_of_week data="s�b"/><low data="11"/><high data="16"/><icon data="/ig/images/weather/chance_of_rain.gif"/><condition data="Posibilidad de lluvia"/></forecast_conditions></weather></xml_api_reply>

If that page is XML, you should usually pass the InputStream directly to the XML parser and let it detect the encoding automatically. Otherwise, you should look at the charset parameter of the Content-Type response header to determine the correct encoding and create the appropriate InputStreamReader.
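As a rough illustration of handing the raw stream to the parser, the sketch below builds a small ISO-8859-1 document in memory and lets the JDK's built-in DOM parser pick the encoding from the XML declaration. The class name XmlEncodingDemo, the helper rootData and the sample document are assumptions made up for this example; a real program would pass con.getInputStream() instead of the ByteArrayInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XmlEncodingDemo {

    // Parses the raw bytes and returns the "data" attribute of the root element.
    // No Reader is involved, so the parser itself decides how to decode the bytes.
    static String rootData(InputStream raw) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(raw);
            return doc.getDocumentElement().getAttribute("data");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The declaration names ISO-8859-1; e-acute is written as a Unicode escape.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<day_of_week data=\"mi\u00e9\"/>";
        byte[] wire = xml.getBytes(StandardCharsets.ISO_8859_1);

        String data = rootData(new ByteArrayInputStream(wire));
        System.out.println(data.equals("mi\u00e9")); // true
    }
}
```

This relies on the document declaring its encoding correctly; when a response carries no declaration or a wrong one, the Content-Type header is the better source.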
Edit: That server is indeed responding with different encodings to the browser and the Java client, probably depending on the Accept-Charset request header. For Firefox this header has the value
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
This means both charsets are accepted; ISO-8859-1 carries the implicit default weight q=1, so it is nominally preferred over utf-8 (q=0.7). The server nevertheless responds to Firefox with a Content-Type header of text/xml; charset=UTF-8. The Java client does not send this header, and the server responds with text/xml; charset=ISO-8859-1.
To use the charset supplied by the server you can use code like the following:
String contentType = con.getContentType(); // e.g. "text/xml; charset=ISO-8859-1"
System.out.println(contentType);
String charset = "utf-8"; // default
if (contentType != null) {
    Matcher matcher = Pattern.compile("charset\\s*=\\s*([^ ;]+)").matcher(contentType);
    if (matcher.find()) {
        charset = matcher.group(1);
    }
}
BufferedReader in = new BufferedReader(new InputStreamReader(
        con.getInputStream(), charset));
Edit 2: Turns out the server decides the charset to use based on the user-agent header. If you add the following line, it responds with a charset of utf-8.
con.setRequestProperty("User-Agent", "Mozilla/5.0");
Anyway, the Content-Type response header contains the correct charset to use.
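The header parsing in this answer can also be pulled into a small helper that is easy to try without a live connection. This is only a sketch: the class name ContentTypeCharset and the method fromContentType are made-up names, and the extra handling (quoted values, case-insensitive matching, unknown charset names) goes beyond what the answer itself does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {

    // Matches charset=value or charset="value", ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type value, or the fallback when
    // the header is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        System.out.println(fromContentType("text/xml; charset=ISO-8859-1", utf8)); // ISO-8859-1
        System.out.println(fromContentType("text/xml", utf8));                     // UTF-8
        System.out.println(fromContentType(null, utf8));                           // UTF-8
    }
}
```

With a connection at hand, the result would be passed straight to the reader, for example new InputStreamReader(con.getInputStream(), fromContentType(con.getContentType(), StandardCharsets.UTF_8)).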

Your input may be correct, although I would use an XML parser to read the XML rather than try to interpret it as a line-by-line feed. Your output, however, may be incorrect.
What's the default character encoding of your JVM? Check (and set) the confusingly named property -Dfile.encoding=UTF-8
Do the requisite fonts etc. exist on your system? Can you check the actual character codes you're outputting rather than rely on your terminal settings? I suspect this is the case, since the encoding/decoding appears to work and you're just missing those individual characters.
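One way to check the actual character codes without trusting the terminal is to print each code point in hexadecimal. The helper below is a small sketch; the class name CodePointDump and the sample strings are made up for illustration.

```java
final class CodePointDump {

    // Renders every code point of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Correctly decoded text: a-acute is the single code point U+00E1.
        System.out.println(dump("s\u00e1b")); // U+0073 U+00E1 U+0062
        // Wrongly decoded text: the replacement character U+FFFD took its place.
        System.out.println(dump("s\uFFFDb")); // U+0073 U+FFFD U+0062
    }
}
```

If the dump shows U+00E1, the string is fine and the problem lies in the console or font; if it shows U+FFFD, the damage already happened while decoding the input.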
