Set response encoding with HttpClient 3.1 - java

I'm using org.apache.commons.httpclient.HttpClient and need to set the response encoding myself (for some reason the server returns an incorrect encoding in Content-Type). My current approach is to get the response as raw bytes and convert it to a String with the desired encoding. I'm wondering if there is a better way to do this (e.g. by configuring HttpClient). Thanks for any suggestions.

I don't think there's a better answer using HttpClient 3.x APIs.
The HTTP 1.1 spec says clearly that a client "must" respect the character set specified in the response header, and use ISO-8859-1 if no character set is specified. The HttpClient APIs are designed on the assumption that the programmer wants to conform to the HTTP specs. Obviously, you need to break the rules of the spec so that you can talk to the non-compliant server. Notwithstanding that, this is not a use case that the API designers saw a need to support explicitly.
If you were using HttpClient 4.x, you could write your own ResponseHandler to decode the entity body yourself, ignoring the character set declared in the response.
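For illustration, here's a minimal HttpClient 4.x sketch of that approach. This is just my sketch, assuming the correct charset is actually UTF-8; the URL is hypothetical.

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.http.HttpResponse;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ForcedCharsetClient {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();
        try {
            HttpGet get = new HttpGet("http://example.com/resource"); // hypothetical URL

            // Read the raw entity bytes and decode them with the charset we know
            // to be correct, ignoring the charset declared in Content-Type.
            ResponseHandler<String> handler = new ResponseHandler<String>() {
                @Override
                public String handleResponse(HttpResponse response) throws IOException {
                    byte[] raw = EntityUtils.toByteArray(response.getEntity());
                    return new String(raw, StandardCharsets.UTF_8);
                }
            };

            String body = client.execute(get, handler);
            System.out.println(body);
        } finally {
            client.close();
        }
    }
}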

A few notes:
The server serves the data, so it is up to the server to deliver it in an appropriate format; the response encoding is chosen by the server, not the client. However, the client can suggest to the server which format it would like via the Accept and Accept-Charset headers (a sketch follows these notes):
Accept: text/plain
Accept-Charset: utf-8
However, HTTP servers usually do not convert between formats.
If point 1 does not work, then you should look at the configuration of the server.
When a String is sent as raw bytes (and it always is, because that is what networks transmit), there is always an encoding involved. Since the server produces those raw bytes, it defines the encoding. So you cannot take the raw bytes and use an encoding of your choice to create a String; you must use the encoding that was used when the String was converted to bytes.
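To illustrate point 1 with the commons-httpclient 3.x API, here's a rough sketch (the URL is hypothetical, and there's no guarantee the server will honour these headers):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class AcceptCharsetExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        GetMethod method = new GetMethod("http://example.com/resource"); // hypothetical URL
        try {
            // Suggest a format and charset to the server; it may or may not comply.
            method.setRequestHeader("Accept", "text/plain");
            method.setRequestHeader("Accept-Charset", "utf-8");

            client.executeMethod(method);

            // getResponseBodyAsString() decodes using the charset the server
            // declares in Content-Type (falling back to ISO-8859-1).
            System.out.println(method.getResponseBodyAsString());
        } finally {
            method.releaseConnection();
        }
    }
}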

Disclaimer: I don't really know HttpClient; I'm only reading the API docs.
I would use the execute method that returns an HttpResponse, then .getEntity().getContent(). This is a pure byte stream, so if you want to ignore the encoding reported by the server, you can simply wrap your own InputStreamReader around it.
Okay, it looks like I had the wrong version (obviously, there are too many HttpClient classes out there).
But the idea is the same, just located on different classes: HttpMethod has a getResponseBodyAsStream() method, around which you can wrap your own InputStreamReader. (Or get the whole byte array at once, if it is not too big, and convert it to a String, as you wrote.)
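A minimal sketch of that approach with HttpClient 3.x, assuming (for the sake of the example) that the correct encoding is UTF-8 and using a hypothetical URL:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class ForcedCharsetRead {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        GetMethod method = new GetMethod("http://example.com/resource"); // hypothetical URL
        try {
            client.executeMethod(method);

            // Wrap the raw byte stream in a reader using the charset we know to be
            // correct, ignoring the (wrong) charset declared by the server.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(method.getResponseBodyAsStream(), "UTF-8"));

            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
            System.out.println(body);
        } finally {
            method.releaseConnection();
        }
    }
}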
I think trying to change the response and letting the HttpClient analyze it is not the right way here.
I suggest sending a message to the server administrator/webmaster about the wrong charset, though.

Greetings folks,
Just in case someone finds this post while googling how to get the response written in UTF-8.
This line of code (on the server side, on the HttpServletResponse) should be handy:
response.setContentType("text/html; charset=UTF-8");
Best

Related

Servlet returning response unexpectedly in Base64 format

This is about a weird behaviour that I witnessed while writing a servlet that calls and fetches data from Apache Solr based on some parameters that I supply to the servlet.
The servlet queries Solr and returns data to me in JSON format. I verified this by adding a System.out.println(response); it was indeed JSON.
My issue is that when I received the same response in a client consuming this service, the data came back in Base64 format. In my code there is not a single line that converts the response into Base64. The only line I wrote before sending the response was resp.setContentType("application/json").
I later solved it by calling resp.setCharacterEncoding("UTF-8"); in my servlet before sending the response. After that I received a plain JSON response every time I queried from the client or from a REST client. But I am still wondering why this happened: why would a response that was sent as JSON from the servlet end up as Base64?
Has anyone experienced something like this before?
I am using Apache Tomcat as my server.
This is most likely a property of your Java servlet framework. It is possible that your Solr content includes Unicode characters outside the basic English range, so the framework decides it is actually a binary response and encodes it into Base64 (possibly with a changed MIME-type header).
Setting character encoding tells the framework that the full range of characters is expected and no escaping is required.
Or this could be in Tomcat. But the focus in your documentation search should be on escaping binary content, in my opinion.
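For reference, a minimal servlet sketch that declares both the media type and the character encoding before writing (the class name and the JSON string are just placeholders):

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SolrProxyServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String json = "{\"status\":\"ok\"}"; // placeholder for the JSON fetched from Solr

        // Set the media type and the character encoding before obtaining the writer,
        // so the container does not have to guess how to encode non-ASCII characters.
        resp.setContentType("application/json");
        resp.setCharacterEncoding("UTF-8");
        resp.getWriter().write(json);
    }
}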

Trusting "Content-Type" on File Uploads

If I'm supporting the upload of content (mostly images and video) by my REST API's users, is it safe to trust the Content-Type they declare in (multipart) uploads? Or should I, instead, run some kind of "media type detection" on the content (using, for example, Apache Tika) to ensure that the declared media type corresponds to the detected, actual one? Am I being over-zealous by introducing this media type detection step?
You certainly shouldn't blindly trust the Content-type header, or any other header. These things should be used to inform your decisions about how to process the request. So, Content-type: application/json should allow you to interpret the message body as a json object - that sort of request might then be passed to a JSON deserialiser to bind it to an object.
It would be wrong to ignore the Content-type header just because the request body contains data which looks like something else. If the request is internally inconsistent then it should be rejected. It's one thing not to send a Content-type header but quite another for the header to be wrong.
So, the only situation where you might want to use some sort of automatic detection should be where you have no reasonable information about the content - either Content-Type is very generic (such as "*/*") or not present at all. In that situation it's worth deciding whether some kind of autodetection is possible or valuable.
Never trust input you get from the user. Always run a check in your server-side code, be it the type of the file, the size of the file, etc. Use the REST API or JavaScript only to make the user experience smoother and faster.
You should definitely reject all the requests that are missing Content-Type header (and Content-Length as well) or have it set incorrectly.
It's definitely not about being over-zealous, rather about securing the system. If you have suspicions about the content just check it. But remember to validate the size before checking the content. If you have a proxy server (e.g. nginx) it has appropriate modules to reject requests that are too big.
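If you do decide to verify, here's a hedged sketch with Apache Tika; the class and method names around Tika.detect() are my own placeholders, and the reject-on-mismatch policy is an assumption:

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;

public class UploadTypeCheck {
    private static final Tika TIKA = new Tika();

    // Returns true if the detected media type matches the Content-Type declared by
    // the client. Rejecting on mismatch is a policy choice; you might instead log
    // and accept, or normalise types (e.g. image/jpg vs image/jpeg) before comparing.
    public static boolean matchesDeclaredType(InputStream content, String declaredType)
            throws IOException {
        String detected = TIKA.detect(content);
        return detected.equalsIgnoreCase(declaredType);
    }
}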

How to serialise msgpack over http

tl;dr: Is there an efficient way to serialise a msgpack payload in Java and C# for transport over HTTP?
I've just discovered the msgpack data format. I use JSON for just about everything I send over the wire between client and server (which uses HTTP), so I'm keen to try this format out.
I have the Java library and the C# library, but I want to transport msgpacks over HTTP. Is this the intended use, or is it more for a local format?
I noticed a couple of RPC implementations, but they're whole RPC servers.
Thoughts?
-Shane
Transport and encoding are two very different things and it's entirely up to you to choose which transport to use and what data encoding to use, depending on the needs of your application. Sending msgpack data over HTTP is a perfectly valid use case and it is possible, but keep in mind the following two points:
msgpack is a binary encoding, which means it needs to be serialized into bytes before sending, and deserialized from the received bytes on the other end. It also means that it is not human-readable (or writable, for that matter), so it's really hard to inspect (or generate by hand) the HTTP traffic.
unless you intend to stream msgpack-encoded data over HTTP, you'll incur a fairly high overhead cost since the HTTP header size will most likely greatly overshadow the size of the data you're sending. Note that this also applies to JSON, but to a lesser extent since JSON is not as efficient in its encoding.
As far as implementation goes, the sending side would have to serialize your msgpack object into a byte[] before sending it as the request body in your HTTP request. You'll need to set the HTTP Content-Type to application/x-msgpack as well. On the receiving end, read the request body from the input stream (you can probably get your hands on a ByteArrayInputStream) and deserialize it into your msgpack object.
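Here's a rough sketch of the sending side using the msgpack-java core API and HttpURLConnection; the endpoint URL and the payload fields are made up for the example:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.msgpack.core.MessageBufferPacker;
import org.msgpack.core.MessagePack;

public class MsgpackPost {
    public static void main(String[] args) throws Exception {
        // Serialize a small map payload to msgpack bytes.
        MessageBufferPacker packer = MessagePack.newDefaultBufferPacker();
        packer.packMapHeader(2);
        packer.packString("name");
        packer.packString("shane");
        packer.packString("score");
        packer.packInt(42);
        packer.close();
        byte[] body = packer.toByteArray();

        // POST the raw bytes; the URL is hypothetical.
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://example.com/api").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-msgpack");
        conn.setFixedLengthStreamingMode(body.length);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}

The receiving side would do the reverse: read the request body into a byte[] and feed it to MessagePack.newDefaultUnpacker(bytes).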

What makes a connection reusable

I saw this description in the Oracle website:
"Since TCP by its nature is a stream based protocol, in order to reuse an existing connection, the HTTP protocol has to have a way to indicate the end of the previous response and the beginning of the next one. Thus, it is required that all messages on the connection MUST have a self-defined message length (i.e., one not defined by closure of the connection). Self demarcation is achieved by either setting the Content-Length header, or in the case of chunked transfer encoded entity body, each chunk starts with a size, and the response body ends with a special last chunk."
See Oracle doc
I don't know how to implement this; can someone give me an example Java implementation?
If you are trying to implement "self-demarcation" in the same way as HTTP does it:
the HTTP 1.1 specification defines how it works,
the source code of (say) the Apache HTTP libraries is an example of its implementation.
In fact, it is advisable NOT to try and implement this (HTTP) yourself from scratch. Use an existing implementation.
On the other hand, if you simply want to implement your own ad-hoc self-demarcation scheme, it is really easy to do.
The sender figures out the size of the message, in bytes or characters or some other unit that makes sense.
The sender sends the message size, followed by the message itself.
At the other end:
The receiver reads the message size, and then reads the requisite number of bytes, characters, to form the message body.
An alternative is for the sender to send the message followed by a special end-of-message marker. To make this work, either you need to guarantee that no message will contain the end-of-message marker, or you need to use some sort of escaping mechanism.
Implementing these schemes is simple Java programming.
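For example, a minimal sketch of a length-prefixed scheme over a socket's streams (a 4-byte length followed by the payload; the names here are just illustrative):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LengthPrefixedFraming {

    // Sender: write the message size, then the message bytes.
    public static void sendMessage(DataOutputStream out, byte[] message) throws IOException {
        out.writeInt(message.length);
        out.write(message);
        out.flush();
    }

    // Receiver: read the size, then exactly that many bytes.
    // A real implementation should sanity-check the length before allocating.
    public static byte[] receiveMessage(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] message = new byte[length];
        in.readFully(message);
        return message;
    }
}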
What makes a connection reusable
That is answered by the text that you quoted in your Question.

How to get HTTP response through socket in java?

I have written code to send an HTTP request through a socket in Java. Now I want to get the HTTP response that the server sent back.
It's not totally clear what you're asking for. Assuming you've written the request to the socket, the next thing you'll want to do is:
Call shutdownOutput() on the socket to tell the server that the request is done (not necessary if you've sent the content length)
Read the response from the socket's input stream, parsing according to the HTTP spec.
This is a bunch of work, so I'd suggest that rather than rolling your own HTTP request logic, you use URLConnection, which is built into Java and includes methods for retrieving the content of a response as well as any headers set by the server.
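A minimal sketch of the URLConnection route (the URL is hypothetical; a robust client would take the charset from the Content-Type header instead of assuming UTF-8):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class UrlConnectionExample {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://example.com/").openConnection();

        // Status code and headers are parsed for you.
        System.out.println("Status: " + conn.getResponseCode());
        for (Map.Entry<String, List<String>> header : conn.getHeaderFields().entrySet()) {
            System.out.println(header.getKey() + ": " + header.getValue());
        }

        // Body, decoded here as UTF-8 for simplicity.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}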
As Jon said, read the HTTP spec. However, Internet protocols are generally line-oriented, so you can read the response a line at a time. The first line will be the status line. Following this will be the headers, one per line (unless there's a continuation). If there's a body, one of the headers will be the Content-Type, which tells you what the content is. If you want to read the content you will need to understand the different ways the content can be sent: there may be a Content-Length header (or not), or the content may be chunked (Transfer-Encoding). And of course the content may be binary rather than text.
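If you do want to stay at the socket level, here's a rough sketch that reads just the status line and headers; body handling is deliberately left out because it depends on Content-Length vs. chunked encoding and on whether the content is text or binary. The host is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class RawHttpGet {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80)) {
            Writer out = new OutputStreamWriter(socket.getOutputStream(), "ISO-8859-1");
            out.write("GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), "ISO-8859-1"));

            // First line is the status line, e.g. "HTTP/1.1 200 OK".
            System.out.println("Status line: " + in.readLine());

            // Headers follow, one per line, terminated by an empty line.
            String line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                System.out.println("Header: " + line);
            }
            // The body would follow here.
        }
    }
}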
Yup, that's right: the response should be read from the InputStream in chunks of bytes, which we can then translate into a readable format. But that also takes a longer time. :(
