How to correctly change the encoding of a POST query? - java

When I send a POST to my page without calling setCharacterEncoding on the server side, I get фыв. With setCharacterEncoding("UTF-8"), I get ыва. How do I correctly change the character encoding of a POST query?
P.S.: I read the data from ServletInputStream.
Code below.
doPost
req.setCharacterEncoding("UTF-8");
BufferedReader r = new BufferedReader(new InputStreamReader(req.getInputStream()));
String line;
while ((line = r.readLine()) != null) {
System.out.println(line);
}

BufferedReader r = new BufferedReader(
new InputStreamReader(req.getInputStream(), StandardCharsets.UTF_8));
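With getInputStream you get raw binary data with no encoding attached. setCharacterEncoding does not change that: it only affects getReader() and the parsing of request parameters. Hence the binary-to-text bridging class InputStreamReader needs to be given the correct encoding. Otherwise it uses the system default, System.getProperty("file.encoding").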
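To illustrate why the charset argument matters, here is a minimal sketch that decodes the same bytes twice, once as UTF-8 and once as ISO-8859-1. The class name CharsetDemo and the helper readAll are hypothetical, and an in-memory ByteArrayInputStream stands in for req.getInputStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: decodes the whole stream with the given charset.
    static String readAll(InputStream in, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(in, cs))) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the raw POST body: "фыв" encoded as UTF-8 (2 bytes per letter).
        byte[] body = "\u0444\u044b\u0432".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 3: the three original letters
        System.out.println(wrong.length()); // 6: one garbage character per byte
    }
}
```

The lengths are printed rather than the strings themselves because the console has its own encoding, which can garble even correctly decoded text.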

Related

Reading HTML from URL in Java vs. Python

I'm trying to read the HTML from a particular URL and store it into a String for parsing. I referred to a previous post to help me out. When I print out what was read, all I get are special characters.
Here is my Java code (with try/catches left out) that reads from a URL and prints:
String path = "https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp";
URL url = new URL(path);
InputStream in = url.openStream();
BufferedReader bw = new BufferedReader(new InputStreamReader(in, "UTF-8"));
String line;
while ((line = bw.readLine()) != null) {
System.out.println(line);
}
Program output:
�ĘY106-6b1bd15200.jsonpmP�r� �Ƨ�!�%m�vD"��Ra*��w�%����ݳ�sβ��MK�d�9+%�m��l^��މ����:���� ���8B�Vce�.A*��x$FCo���a�b�<����Xy��m�c�>t����� �Z������Gx�o� �J���oKe�0�5�kGYpb�*l����+|�U���-�N3��jBp�R�z5Cۥjh��o�;�~)����~��)~ɮhy��<c,=;tHW���'�c�=~�w���
Expected output:
window.page106_callback(["<div class=\"newpage\" id=\"page106\" style=\"width: 902px; height:1273px\">\n<div class=image_layer style=\"z-index: 1\">\n<div class=ie_fix>\n<img class=\"absimg\" style=\"left:18px;top:27px;width:860px;height:1077px;clip:rect(1px 859px 1076px 1px)\" orig=\"http://html.scribd.com/913q5pjrsw60h9i4/images/106-6b1bd15200.jpg\"/>\n</div>\n</div>\n</div>\n\n"]);
At first, I thought it was an issue with permissions or something that somehow encrypted the stream, but my friend wrote a small Python script to do the same thing and it worked, thereby ruling this out. This is what he wrote:
import requests
link = 'https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp'
f = requests.get(link)
text = (f.text)
print(text)
So the question is, why is the Java version unable to correctly read and print from this particular URL? Note that I tried testing some other URLs from various websites and those worked fine. Maybe I should learn Python.
The response is gzip-encoded. You can do:
InputStream in = new GZIPInputStream(con.getInputStream());
@Maurice Perry is right. I tried it with the code below:
String url = "https://html1-f.scribdassets.com/913q5pjrsw60h9i4/pages/106-6b1bd15200.jsonp";
URL obj = new URL(url);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
BufferedReader in = new BufferedReader(
new InputStreamReader(new GZIPInputStream(con.getInputStream()), "UTF-8"));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
in.close();
System.out.println(response.toString());
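Wrapping the stream in GZIPInputStream unconditionally fails if the server ever sends an uncompressed body. The following is a sketch of one way to guard against that, by checking for the gzip magic bytes 0x1f 0x8b before unwrapping. The class GzipDemo and its helpers are hypothetical, and the data is compressed in memory so the example runs without a network. With a real connection, checking con.getContentEncoding() for "gzip" is another option.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {
    // Hypothetical helper: unwraps gzip only if the stream starts with 0x1f 0x8b.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];
        int n = 0;
        while (n < 2) {
            int r = in.read(head, n, 2 - n);
            if (r < 0) break;
            n += r;
        }
        if (n > 0) in.unread(head, 0, n); // put the peeked bytes back
        boolean gzip = n == 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b;
        return gzip ? new GZIPInputStream(in) : in;
    }

    // Hypothetical helper: decodes the (possibly gzipped) stream as UTF-8 text.
    static String readText(InputStream raw) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(maybeGunzip(raw), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) sb.append((char) c);
        }
        return sb.toString();
    }

    // Compresses text in memory to simulate a gzip-encoded response body.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String payload = "window.page106_callback([]);";
        System.out.println(readText(new ByteArrayInputStream(gzip(payload))));
        System.out.println(readText(new ByteArrayInputStream(payload.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Both calls in main print the same text, whether or not the input was compressed.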

java encoding issue while reading stream

I am trying to download the contents of an FTP folder. There is one XML file which starts with the standard XML declaration:
< ?xml version="1.0" encoding="utf-8"?>
When I read these files (using java.net.Socket), get the input stream, and then try to convert it to a String, I somehow get some new characters, and the whole XML document starts with '?', e.g. "?< ?xml version="1.0" encoding="utf-8"?>....."
BufferedInputStream reader = new BufferedInputStream(sock.getInputStream());
Then I get a string from this reader using the following code:
StringBuilder sb = new StringBuilder();
String line;
BufferedReader br = new BufferedReader(new InputStreamReader(reader));
while ((line = br.readLine()) != null) {
sb.append(line);
}
System.out.println(sb.toString());
I'm not sure what is happening here. Why are some special characters being introduced? Any suggestions would be appreciated.
Then I used the following code to read the downloaded file, and in the console I see some special characters:
BufferedReader reader = new BufferedReader(new FileReader("c:/Users/appd922/DocumentMeta06122014.xml"));
StringBuffer sb = new StringBuffer();
String line = null;
while ((line = reader.readLine()) != null) {
sb.append(line);
}
String output = sb.toString();
System.out.println("reading from file"+output);
I got output starting with:
"reading from file< ?xml version.....
Where are these special characters coming from?
Note: ignore the space in the XML declaration line given above. I could not write proper XML here without that space.
Specify the encoding when creating the InputStreamReader that reads the file from the FTP server, for example:
BufferedReader br = new BufferedReader(new InputStreamReader(reader, "utf-8"));
Otherwise, InputStreamReader uses the default encoding. Also specify the encoding when reading the downloaded file: FileReader uses the default platform encoding, so use InputStreamReader with an explicit encoding instead, for example:
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "utf-8"));
Those characters are the BOM (Byte Order Mark). If you set the encoding of the InputStreamReader to "UTF-8", you will see that they are interpreted as a single character, U+FEFF, which is the BOM character.
Unfortunately, you have to handle this character yourself, because Java won't do it for you: java utf-8 and bom. Usually you just strip it from the start of your stream. Good luck.
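Stripping the BOM could look roughly like the sketch below, which decodes as UTF-8, peeks at the first character, and discards it only if it is U+FEFF. The class BomDemo and the method openSkippingBom are hypothetical names, and a byte array stands in for the socket or file stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomDemo {
    // Hypothetical helper: returns a UTF-8 reader positioned after a leading BOM, if any.
    static BufferedReader openSkippingBom(InputStream in) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        r.mark(1);                     // remember the start of the stream
        if (r.read() != '\uFEFF') {
            r.reset();                 // first character is real content (or the stream is empty)
        }
        return r;
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>".getBytes(StandardCharsets.UTF_8);
        // Prepend the UTF-8 encoding of the BOM: EF BB BF.
        byte[] withBom = new byte[xml.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, withBom, 3, xml.length);

        try (BufferedReader r = openSkippingBom(new ByteArrayInputStream(withBom))) {
            System.out.println(r.readLine().startsWith("<?xml")); // true
        }
    }
}
```

The same helper is harmless on input without a BOM, since reset() rewinds to the first character.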

Can someone explain to me how this code works?

I got it from a page
Android AsyncTask method that I dont know how to solve
but I am not sure how it works completely. Can someone explain what the while loop is for, and what the "iso-8859-1" part means?
I understood that the 8 is the number of characters, but I could be wrong.
static InputStream is = null;
static String json = "";
is = httpEntity.getContent();
BufferedReader reader = new BufferedReader(new InputStreamReader(
is, "iso-8859-1"), 8);
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
sb.append(line + "\n");
}
is.close();
json = sb.toString();
Your code reads from an InputStream obtained from the HttpEntity, appends each line to a StringBuilder, and finally converts the result into a String named json.
For understanding API code, the Javadoc is your friend.
Here is what I found in the BufferedReader Javadoc:
public BufferedReader(Reader in, int sz)
Creates a buffering character-input stream that uses an input buffer of the specified size.
Parameters:
in - A Reader
sz - Input-buffer size
Throws:
IllegalArgumentException - If sz is <= 0
http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html
The Reader passed in your code is an InputStreamReader. Here is the relevant Javadoc for InputStreamReader:
public InputStreamReader(InputStream in, Charset cs)
Creates an InputStreamReader that uses the given charset.
Parameters:
in - An InputStream
cs - A charset
http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream, java.nio.charset.Charset)
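So "iso-8859-1" is the charset used to decode the bytes into characters (your code passes it by name, through the InputStreamReader(InputStream, String) overload), and 8 is the size of the BufferedReader's input buffer in characters, not a limit on how much is read. The while loop calls readLine() until it returns null, which signals the end of the stream, and appends each line to the StringBuilder.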
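As a rough check that the buffer size only affects how the data is chunked internally and not what is read, the sketch below runs the same loop with a buffer of 8 and a buffer of 8192 over identical in-memory data. The class ReadLoopDemo, the method readLines, and the sample input are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadLoopDemo {
    // Hypothetical helper mirroring the loop from the question.
    static String readLines(InputStream is, String charsetName, int bufferSize) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, charsetName), bufferSize);
        StringBuilder sb = new StringBuilder();
        String line;
        // readLine() returns null once the end of the stream is reached.
        while ((line = reader.readLine()) != null) {
            sb.append(line).append("\n");
        }
        reader.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data, encoded as ISO-8859-1 to match the charset used when decoding.
        byte[] data = "{\"city\":\"K\u00f6ln\"}\n{\"ok\":true}".getBytes(StandardCharsets.ISO_8859_1);

        String small = readLines(new ByteArrayInputStream(data), "iso-8859-1", 8);
        String large = readLines(new ByteArrayInputStream(data), "iso-8859-1", 8192);

        System.out.println(small.equals(large)); // true: buffer size does not change the result
    }
}
```

A very small buffer such as 8 just means more frequent reads from the underlying stream; the default size is usually a better choice.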

open.mapquestapi.com: http-response decoding in Java

I want to use open.mapquestapi.com from Java. It works fine until I have to deal with German umlauts; take the German city "Köln" as an example.
In Java, I can't get the mapquestapi response decoded correctly; I always end up with "Köln".
// String query.. e.g. "Hohenstaufenring 25, Köln"
URI uri = new URI("http", "open.mapquestapi.com", "/nominatim/v1/search", "format=json&addressdetails=1&email=[...]&countrycodes=DE&q=" + query, null);
URL mapqOsm = new URL(uri.toASCIIString());
BufferedReader reader = new BufferedReader(new InputStreamReader(mapqOsm.openStream(), "UTF-8"));
String response = "";
String line;
while ((line = reader.readLine()) != null) {
response += line;
}
reader.close();
I probably have to decode "response" some other way, but I have run out of ideas for how to decode it correctly. The source file encoding is UTF-8.
How do I decode open.mapquestapi.com-response in Java correctly?

Get raw text from html

I'm at quite a basic level of Android development.
I would like to get the text from a page such as "http://www.google.com". (The page I will be using will only have text, so no pictures or anything like that.)
So, to be clear: I want to get the text written on a page into, e.g., a String in my application.
I tried this code, but I'm not even sure if it does what I want.
URL url = new URL("http://www.google.com");
URLConnection connection = url.openConnection();
// Get the response
BufferedReader rd = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line = "";
I can't get any text from it, though. How should I do this?
In the sample code you gave, you are not even reading the response from the request. I would get the HTML with the following code:
URL u = new URL("http://www.google.com");
URLConnection conn = u.openConnection();
BufferedReader in = new BufferedReader(
new InputStreamReader(
conn.getInputStream()));
StringBuffer buffer = new StringBuffer();
String inputLine;
while ((inputLine = in.readLine()) != null)
buffer.append(inputLine);
in.close();
System.out.println(buffer.toString());
From there you would need to pass the string into some kind of HTML parser if you want only the text. From what I've heard, JTidy is a good library for this; however, I have never used any Java HTML parsing libraries.
Do you want to extract text from an HTML file? You can use a specialized tool such as the Jericho HTML parser library. I'm not sure if it can be used directly in an Android app, as it is quite big, but it is open source, so you can take only the parts of its code that you need for your task.
Here is one way:
public String scrape(String urlString) throws Exception {
URL url = new URL(urlString);
URLConnection connection = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(
connection.getInputStream()));
String line = null, data = "";
while ((line = reader.readLine()) != null) {
data += line + "\n";
}
return data;
}
Here is another.
