Why is encoding needed in an HTTP request? - Java

I am trying to learn how to send a request to a server and retrieve data over HTTP in Java. This is the code I found in the Oracle networking tutorial (the code is pasted at the bottom of the question).
Question 1: in out.write("string=" + stringToReverse); why isn't "string=" encoded like the stringToReverse variable?
String stringToReverse = URLEncoder.encode(args[1], "UTF-8");
Question 2:
There are two snippets below, one from the Oracle tutorial and the other from an Android Studio tutorial.
Oracle tutorial code:
BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
Android tutorial code:
inputStream = urlConnection.getInputStream();
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, Charset.forName("UTF-8"));
BufferedReader reader = new BufferedReader(inputStreamReader);
Why is Charset.forName("UTF-8") missing in the Oracle code?
Note: an explanation from the basics would be very useful :)
import java.io.*;
import java.net.*;
public class Reverse {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: java Reverse "
+ "http://<location of your servlet/script>"
+ " string_to_reverse");
System.exit(1);
}
String stringToReverse = URLEncoder.encode(args[1], "UTF-8");
URL url = new URL(args[0]);
URLConnection connection = url.openConnection();
connection.setDoOutput(true);
OutputStreamWriter out = new OutputStreamWriter(
connection.getOutputStream());
out.write("string=" + stringToReverse);
out.close();
BufferedReader in = new BufferedReader(
new InputStreamReader(
connection.getInputStream()));
String decodedString;
while ((decodedString = in.readLine()) != null) {
System.out.println(decodedString);
}
in.close();
}
}

Question 1:
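There is no need to encode "string=". The key string contains only letters, which URLEncoder leaves unchanged, and the = must stay literal because it is the separator between the key and the value (see https://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html). Only the value, which comes from user input and may contain special characters, has to be encoded.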
Question 2:
The charset in the following example is not explicitly defined:
BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
Therefore the default charset is used (which may not be UTF-8):
Every instance of the Java virtual machine has a default charset,
which may or may not be one of the standard charsets. The default
charset is determined during virtual-machine startup and typically
depends upon the locale and charset being used by the underlying
operating system. (https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html)
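To make the effect visible, here is a minimal sketch (the class name CharsetDemo and the sample text are made up for illustration). It decodes the same UTF-8 bytes once with the matching charset and once with a mismatched one, which is roughly what happens when the platform default is not UTF-8 and no charset is passed to InputStreamReader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Decodes the whole stream using the given charset.
    static String readAll(InputStream in, Charset charset) {
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the e; in UTF-8 the last letter takes two bytes.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println("Default charset: " + Charset.defaultCharset());
        // Matching charset: the text comes back intact.
        System.out.println(readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8));
        // Mismatched charset: each of the two bytes becomes its own character.
        System.out.println(readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1));
    }
}
```

Passing the charset explicitly, as the Android snippet does, removes the dependency on whatever the JVM picked at startup.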

In a URL, the string after the ? is called the query string.
example.com/users/profile?key1=value1&key2=value2
So for the above URL the query string is "key1=value1&key2=value2".
A query string consists of key/value pairs which a server script can access. These pairs are called request parameters and are separated by an &. Characters such as ?, &, = and space are called special characters in a URL because they have a special meaning there.
So what happens if a value itself contains an & character? The server will inadvertently end the value at the &. In the following example the value of name was meant to contain the &, but the server ends it at user1:
name=user1&23=hello&place=hyd
As you can see, the above example will not work as expected.
That is why you use URL encoding to convert special characters like &, ? and space into a non-special form (for example, & becomes %26) when they are used inside a query string. The server converts them back to their original form once the request is received.
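As a small illustration of the point above, the following sketch encodes each key and value separately and joins them with literal = and & characters. The class and method names (QueryStringDemo, pair, decode) are invented for this example; only URLEncoder and URLDecoder come from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class QueryStringDemo {
    // Encodes one key/value pair; the '=' separator itself stays literal.
    static String pair(String key, String value) {
        try {
            return URLEncoder.encode(key, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // What a server does with a value after splitting the query on '&' and '='.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String query = pair("name", "user1&23") + "&" + pair("place", "hyd");
        System.out.println(query);                // name=user1%2623&place=hyd
        System.out.println(decode("user1%2623")); // user1&23
    }
}
```

Because the & inside the value is sent as %26, the server only sees the two separators that were added on purpose.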
Now coming to your question 1): as Jesper pointed out, this is not URL encoding in the strict sense, because you are not sending string_to_reverse as a request parameter in the query string. You are sending it in the request body through the output stream.
Now question 2): the documentation of the http://docs.oracle.com/javase/7/docs/api/java/net/URLEncoder.html class states:
Utility class for HTML form encoding. This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format.
So HTML form data is posted as application/x-www-form-urlencoded, and in your case URLEncoder takes care of that. If no charset is specified, the default character set is used (see "How to Find the Default Charset/Encoding in Java?").
The name URLEncoder is a little misleading, as the class is not really used for encoding a URL here but for encoding the request body (string_to_reverse) as application/x-www-form-urlencoded.
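One way to see that URLEncoder produces form encoding rather than general URL encoding is to compare it with java.net.URI on the same input. This is only a sketch; the class and helper names (FormVsUrlEncoding, formEncode, pathEncode) are made up for the comparison.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

public class FormVsUrlEncoding {
    // application/x-www-form-urlencoded, as used for form bodies and query values.
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Percent-encoding of a URL path, done by the multi-argument URI constructor.
    static String pathEncode(String path) {
        try {
            return new URI(null, null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("hello world"));  // hello+world
        System.out.println(pathEncode("/hello world")); // /hello%20world
        // '=' is escaped too, which is why the "string=" prefix must not be passed through URLEncoder.
        System.out.println(formEncode("string="));      // string%3D
    }
}
```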

Related

Encoding Error while writing HTML to txt file

I am downloading the source code of an HTML web page and writing it to a txt file. The output on the terminal looks correct, but when I write it to a file and open the file in gedit, the contents look something like this:
<^#!^#D^#O^#C^#T^#Y^#P^#E^# ^#h^#t^#m^#l^# ^#P^#U^#B^#L^#I^#C^# ^#"^#-^#/^#/^#W^#3^#C^#/^#/^#D^#T^#D^# ^#X^#H^#T^#M^#L^# ^#1^#.^#0^# ^#T^#r^#a^#n^#s^#i^#t^#i^#o^#n^#a^#l^
I am reading the file line by line by using BufferedReader something like this :
URL oracle = new URL("http://example.com");
BufferedReader in = new BufferedReader(
new InputStreamReader(oracle.openStream()));
while ((inputLine = in.readLine()) != null)
{
// appending to get the complete html string
}
Then I am writing the contents using PrintWriter.
PrintWriter pout = new PrintWriter("output.txt");
pout.write(html); // here html is the appended html string
pout.close();
Can someone help me with this.
While reading the URL, you need to set the encoding to UTF-8, and while writing back, you should again state that your encoding is UTF-8. The default encoding could be your system's encoding and might not handle Unicode characters well. Both InputStreamReader and OutputStreamWriter accept the encoding as a constructor argument, so you might want to replace your PrintWriter with an OutputStreamWriter.
I would suggest using IOUtils from Apache Commons IO:
org.apache.commons.io.IOUtils.copy(connection.getInputStream(), new FileOutputStream(file));
URL url = new URL("http://example.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
String contentType = connection.getContentType();
System.out.println("content-type: " + contentType);
IOUtils.copy(connection.getInputStream(), new FileOutputStream("/folder/fileName.html"));
^# is a byte 0, so you are reading with UTF-16, that seems to be your system default encoding.
Specify the encoding. The encoding from the header lines is decisive. If not specified, use the default Latin-1.
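URL oracle = new URL("http://example.com");
URLConnection con = oracle.openConnection();
con.connect();
// The charset is a parameter of Content-Type, e.g. "text/html; charset=UTF-8".
// getContentEncoding() would return Content-Encoding (such as gzip) instead.
String contentType = con.getContentType();
String encoding = null;
if (contentType != null && contentType.toLowerCase().contains("charset=")) {
encoding = contentType.substring(contentType.toLowerCase().indexOf("charset=") + 8).split(";")[0].replace("\"", "").trim();
}
if (encoding == null || encoding.equalsIgnoreCase("ISO-8859-1")) {
encoding = "Windows-1252"; // Default is Latin-1, as Windows Latin-1
}
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream(), encoding));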
However, you might also need to consider a charset declared in a <meta> tag of the HTML itself.
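To see why UTF-16 is the suspect, here is a rough sketch that prints the bytes of a short string in UTF-16 and in UTF-8. It assumes, as the answer above does, that ^# in the question stands for a 0 byte; the class name NulBytesDemo and the dump helper are invented for the illustration.

```java
import java.nio.charset.StandardCharsets;

public class NulBytesDemo {
    // Renders each byte as an ASCII character, with a 0 byte shown as "^#".
    static String dump(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(b == 0 ? "^#" : String.valueOf((char) (b & 0xFF)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE html";
        // UTF-16 uses two bytes per ASCII character, one of which is 0.
        System.out.println(dump(html.getBytes(StandardCharsets.UTF_16LE)));
        // UTF-8 uses a single byte per ASCII character, so no 0 bytes appear.
        System.out.println(dump(html.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The first line of output has the same shape as the file contents shown in the question, which suggests that a two-byte encoding is used somewhere between the download and the editor.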

inputStream and utf 8 sometimes shows "?" characters

So I've been dealing with this problem for over a month now. I have checked almost every related solution here and on Google, but I couldn't find anything that really solved my case.
My problem is that I'm trying to download the HTML source of a website, but in most cases some of the text shows "?" characters, most likely because the site is in Hebrew.
Here's my code,
public static InputStream openHttpGetConnection(String url)
throws Exception {
InputStream inputStream = null;
HttpClient httpClient = new DefaultHttpClient();
HttpResponse httpResponse = httpClient.execute(new HttpGet(url));
inputStream = httpResponse.getEntity().getContent();
return inputStream;
}
public static String downloadSource(String url) {
int BUFFER_SIZE = 1024;
InputStream inputStream = null;
try {
inputStream = openHttpGetConnection(url);
} catch (Exception e) {
// TODO: handle exception
}
int bytesRead;
String str = "";
byte[] inpputBuffer = new byte[BUFFER_SIZE];
try {
while ((bytesRead = inputStream.read(inpputBuffer)) > 0) {
String read = new String(inpputBuffer, 0, bytesRead,"UTF-8");
str +=read;
}
} catch (Exception e) {
// TODO: handle exception
}
return str;
}
Thanks.
To read characters from a byte stream with a given encoding, use a Reader. In your case it would be something like:
InputStreamReader isr = new InputStreamReader(inputStream, "UTF-8");
char[] inputBuffer = new char[BUFFER_SIZE];
int charsRead;
while ((charsRead = isr.read(inputBuffer, 0, BUFFER_SIZE)) > 0) {
String read = new String(inputBuffer, 0, charsRead);
str += read;
}
You can see that the bytes will be read in directly as characters --- it's the reader's problem to know if it needs to read one or two bytes, e.g., to create the character in the buffer. It's basically your approach but decoding as the bytes are being read in, instead of after.
Converting an InputStream to a String entails specifying an encoding, just as you do at new String(inpputBuffer, 0, bytesRead,"UTF-8");.
But your approach has several drawbacks.
How do you know you have to use UTF8 ?
When retrieving HTTP content, generally speaking, you cannot know in advance what encoding will be used in the HTTP response. But HTTP provides a mechanism for specifying that, using the Content-Type header.
More specifically, your response object should have a Content-Type header that has a parameter called charset. In the response, it should look something like:
Content-Type: text/html; charset=UTF-8
You should use whatever comes after charset= to transform your bytes to chars.
Since you seem to use Apache HttpClient, their documentation states:
You can set the content type header for a request with the addRequestHeader method in each method and retrieve the encoding for the response body with the getResponseCharSet method.
If the response is known to be a String, you can use the getResponseBodyAsString method which will automatically use the encoding specified in the Content-Type header or ISO-8859-1 if no charset is specified.
Alternate way
If there is no Content-Type header, and if you know your content is HTML, then you can try to convert it to a String using some encoding (UTF-8 or ISO Latin-1 preferably), look for something matching <meta charset="UTF-8">, and use that as the charset. This should only be a fallback.
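The two steps described above (header first, meta tag as a fallback) could be sketched roughly as follows. The class name CharsetSniffer, the regular expressions, the 1024-byte sniffing window and the ISO-8859-1 default are all assumptions made for this example, not part of HttpClient or any standard API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    private static final Pattern HEADER_CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)", Pattern.CASE_INSENSITIVE);

    // Picks a charset from the Content-Type header, else from a <meta> tag, else ISO-8859-1.
    static Charset detect(String contentType, byte[] body) {
        if (contentType != null) {
            Matcher m = HEADER_CHARSET.matcher(contentType);
            if (m.find()) {
                return lookup(m.group(1));
            }
        }
        // Fallback: decode the first bytes as Latin-1 only to search the markup,
        // since the tag itself is plain ASCII in the common single-byte and UTF-8 encodings.
        String head = new String(body, 0, Math.min(body.length, 1024), StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? lookup(m.group(1)) : StandardCharsets.ISO_8859_1;
    }

    // Unknown or malformed charset names fall back to ISO-8859-1.
    private static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"UTF-8\"></head>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("text/html; charset=windows-1255", new byte[0])); // from the header
        System.out.println(detect(null, page));                                     // from the meta tag
    }
}
```

Once the charset is known, the whole body can be decoded with it in one go.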
Not every byte sequence is convertible to a String
Drawback number two is that you read an arbitrary number of bytes from your stream and try to convert them to a String, which may not be possible.
In practice, UTF-8 encodes some characters across several bytes. For example, "é" is encoded as the two bytes 0xC3 0xA9. So say, for example, that the response consists of two "é" characters. If your first call to read returns:
[c3, a9, c3]
then your conversion to a String using new String(bytes, offset, length, charsetName) cannot decode the last byte, because on its own it is not a complete UTF-8 sequence; in practice it typically ends up as a replacement character.
Your following read will get what's left to read:
[a9]
which, on its own, is not an "é" character either.
Bottom line: you cannot reliably convert even a valid UTF-8 byte stream to a String using your pattern.
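The split-sequence problem can be reproduced without any network access. The sketch below feeds the four bytes of two "é" characters through both approaches with a buffer of three, so that the second character is cut in half; the class and method names are made up for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class SplitSequenceDemo {
    // The pattern from the question: every chunk of bytes is decoded on its own.
    static String decodePerChunk(byte[] data, int chunkSize) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            StringBuilder sb = new StringBuilder();
            byte[] buf = new byte[chunkSize];
            int n;
            while ((n = in.read(buf)) > 0) {
                sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The Reader keeps incomplete sequences until the remaining bytes arrive.
    static String decodeWithReader(byte[] data, int chunkSize) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[chunkSize];
            int n;
            while ((n = reader.read(buf)) > 0) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] twoEAcute = {(byte) 0xC3, (byte) 0xA9, (byte) 0xC3, (byte) 0xA9};
        String expected = "\u00e9\u00e9";
        System.out.println(decodePerChunk(twoEAcute, 3).equals(expected));   // false
        System.out.println(decodeWithReader(twoEAcute, 3).equals(expected)); // true
    }
}
```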
Going forward: since you use HttpClient, use its method for converting an HTTP response to a String.
If you wish to do it yourself, the easy way is to copy your input to a byte array, and then convert the byte array. Something along these lines (responseInputStream and encoding are assumed to be defined already):
ByteArrayOutputStream responseContent = new ByteArrayOutputStream();
byte[] buffer = new byte[8192];
int n;
while ((n = responseInputStream.read(buffer)) != -1) {
responseContent.write(buffer, 0, n); // copy raw bytes, no decoding yet
}
byte[] rawResponse = responseContent.toByteArray();
String stringResponse = new String(rawResponse, encoding); // decode once, over the complete content
But you could also use a CharsetDecoder if you want a fully streamed implementation (one that does not buffer the response fully in memory), or, as @jas answers, wrap your InputStream in a Reader and concatenate the output (preferably into a StringBuilder, which should be faster if a large number of concatenations is to occur).

Posting minutiae byte array from applet to server

In a Grails web application, I am trying to post minutiae (fingerprint) byte arrays from an applet to the server using a REST API.
This is what I have tried so far:
private String post(String purl,String customerId, byte[] regMin1,byte[] regMin2) throws Exception {
StringBuilder parameters = new StringBuilder();
parameters.append("customerId=");
parameters.append(customerId);
parameters.append("&regMin1=");
parameters.append(URLEncoder.encode(new String(regMin1),"UTF-8"));
parameters.append("&regMin2=");
parameters.append(URLEncoder.encode(new String(regMin2),"UTF-8"));
URL url = new URL(purl);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setDoOutput(true);
connection.setDoInput(true);
connection.setRequestMethod("POST");
connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
connection.setRequestProperty("Content-Length",Integer.toString(parameters.toString().getBytes().length));
DataOutputStream wr = new DataOutputStream(connection.getOutputStream ());
wr.writeBytes(parameters.toString());
wr.flush();
wr.close();
BufferedReader in = new BufferedReader(new InputStreamReader(
connection.getInputStream()));
StringBuilder builder = new StringBuilder();
String aux = "";
while ((aux = in.readLine()) != null) {
builder.append(aux);
}
in.close();
connection.disconnect();
return builder.toString();
}
I can post regMin1 and regMin2 successfully, but fingerprint verification always fails. I doubt whether I am posting them correctly.
This looks like a very bad idea to me:
parameters.append(URLEncoder.encode(new String(regMin1),"UTF-8"));
...
parameters.append(URLEncoder.encode(new String(regMin2),"UTF-8"));
If regMin1 and regMin2 aren't actually UTF-8 text (and my guess is that they're not) you'll almost certainly be losing data here.
Don't treat arbitrary binary data as if it's encoded text.
Instead, convert regMin1 and regMin2 to base64 - that way you'll end up with ASCII characters which you can then decode on the server to definitely get the original binary data. You can use a URL-safe version of base64 to avoid having to worry about further encoding the result.
There's a good public domain base64 library you can use for this if you don't have anything else to hand. So for example:
parameters.append("&regMin1=")
.append(Base64.encodeBytes(regMin1, Base64.URL_SAFE))
.append("&regMin2=")
.append(Base64.encodeBytes(regMin2, Base64.URL_SAFE));
Note that you'd want to decode with the URL_SAFE option as well - don't just try to decode it as "normal" base64 data.
(You might still want to convert this to a POST request, and you'd definitely have an easier time if you could use a better HTTP library, but they're slightly separate concerns.)
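For comparison, here is a sketch of the same idea using java.util.Base64, which ships with Java 8 and later. Note that this swaps in the JDK class for the public domain library mentioned above; the class name BinaryParamDemo, the helper names and the sample bytes are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class BinaryParamDemo {
    // URL-safe Base64 without padding uses only letters, digits, '-' and '_'.
    static String encodeParam(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    // The server side must use the URL-safe decoder as well.
    static byte[] decodeParam(String param) {
        return Base64.getUrlDecoder().decode(param);
    }

    public static void main(String[] args) {
        byte[] minutiae = {(byte) 0xFF, 0x00, (byte) 0x80, 0x41}; // arbitrary binary, not valid UTF-8

        // Treating the bytes as text and converting back does not return the original data.
        byte[] viaString = new String(minutiae, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(minutiae, viaString)); // false

        // Base64 round-trips every byte.
        String param = encodeParam(minutiae);
        System.out.println(Arrays.equals(minutiae, decodeParam(param))); // true
    }
}
```

The encoded value can be appended to the parameters directly, since it contains no characters that need further escaping in a form body.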

Why html entities displayed wrong when I retrieved data from a web page in Java

Why are HTML entities displayed incorrectly when I retrieve data from a web page in Java?
URL url = new URL("http://www.eslcafe.com/joblist/index.cgi?read=27334");
URLConnection connection = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("ISO-8859-1")));
String line = null;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
The title of this page should be retrieved as " A LITTLE Different in Hsin-Chu, Taiwan!", but the " " is never displayed correctly. My default charset is also "ISO-8859-1".
I have downloaded your Web page with curl and opened it with a hex editor. It shows that the " " before "A LITTLE Different in Hsin-Chu" is actually 0xA0 instead of 0x20, i.e. it's not the whitespace character people generally use, and perhaps that's why it's not displayed correctly. Hope it helps.
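A quick way to check this in Java is to decode the byte and inspect the resulting character. In the sketch below the class name NbspDemo and the normalizeSpaces helper are invented; replacing the character with a plain space is just one possible way to handle it.

```java
import java.nio.charset.StandardCharsets;

public class NbspDemo {
    // Replaces non-breaking spaces (U+00A0) with ordinary spaces (U+0020).
    static String normalizeSpaces(String s) {
        return s.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xA0, 'A'}; // the byte seen in the hex editor, followed by "A"
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println((int) decoded.charAt(0));                 // 160, i.e. U+00A0
        System.out.println(Character.isWhitespace(decoded.charAt(0))); // false: Java excludes non-breaking spaces
        System.out.println(Character.isSpaceChar(decoded.charAt(0)));  // true: it is a Unicode space separator
        System.out.println(normalizeSpaces(decoded).equals(" A"));     // true
    }
}
```

Whether the character then shows up correctly depends on the encoding and font of the console it is printed to.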

How to read and write UTF-8 to disk on the Android?

I cannot read and write extended characters (French accented characters, for example) to a text file using the standard InputStreamReader methods shown in the Android API examples. When I read back the file using:
InputStreamReader tmp = new InputStreamReader(in);
BufferedReader reader = new BufferedReader(tmp);
String str;
while ((str = reader.readLine()) != null) {
...
the string read is truncated at the extended characters instead of at the end-of-line. The second half of the string then comes on the next line. I'm assuming that I need to persist my data as UTF-8 but I cannot find any examples of that, and I'm new to Java.
Can anyone provide me with an example or a link to relevant documentation?
Very simple and straightforward. :)
String filePath = "/sdcard/utf8_file.txt";
String UTF8 = "utf8";
int BUFFER_SIZE = 8192;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), UTF8),BUFFER_SIZE);
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(filePath), UTF8),BUFFER_SIZE);
When you instantiate the InputStreamReader, use the constructor that takes a character set.
InputStreamReader tmp = new InputStreamReader(in, "UTF-8");
And do a similar thing with OutputStreamWriter
I like to have a
public static final Charset UTF8 = Charset.forName("UTF-8");
in some utility class in my code, so that I can call (see more in the Doc)
InputStreamReader tmp = new InputStreamReader(in, MyUtils.UTF8);
and not have to handle UnsupportedEncodingException every single time.
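Putting the pieces together, a round trip through a file might look like the sketch below. It uses a temporary file instead of a real Android path, and the class name Utf8FileDemo and its roundTrip helper are assumptions made for the example; StandardCharsets.UTF_8 is the JDK's built-in equivalent of the constant above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8FileDemo {
    public static final Charset UTF8 = StandardCharsets.UTF_8;

    // Writes one line to a temporary file as UTF-8 and reads it back as UTF-8.
    static String roundTrip(String line) {
        try {
            File file = File.createTempFile("utf8_demo", ".txt");
            try {
                try (BufferedWriter writer = new BufferedWriter(
                        new OutputStreamWriter(new FileOutputStream(file), UTF8))) {
                    writer.write(line);
                    writer.newLine();
                }
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream(file), UTF8))) {
                    return reader.readLine();
                }
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "fran\u00e7ais, \u00e9t\u00e9"; // French words with accented characters
        System.out.println(text.equals(roundTrip(text))); // true
    }
}
```

The important part is that the writer and the reader name the same charset; which one is the platform default then no longer matters.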
This should just work on Android, even without explicitly specifying UTF-8, because the default charset is UTF-8. If you can reproduce this problem, please raise a bug with a reproducible test case here:
http://code.google.com/p/android/issues/entry
If you face this kind of problem, try encoding and decoding your data as Base64. This worked for me. I can share the code if you need it.
Check the encoding of your file by right clicking it in the Project Explorer and selecting properties. If it's not the right encoding you'll need to re-enter your special characters after you change it, or at least that was my experience.
